
Webinar recording

Domain-Driven Design patterns for a distributed system

The patterns of Domain-Driven Design were originally introduced when line-of-business systems were still predominantly monolithic. That is not the case anymore. Modern-day designs are not only distributed but also require a wide array of data stores - each specialized in storing different types of data.

A brave new world

New patterns have emerged to help developers deal with these new headaches. DDD expert Szymon Pobiega tells the story of how a group of unconventional software architects discovered the new patterns and used them to overcome the challenges of scaling an online store, designing robust long-running business processes, and developing their system at scale by multiple teams in multiple locations.

In this webinar Szymon will show:

  • How to take advantage of synchronous and asynchronous message processing
  • How to build extendable, robust and asynchronous distributed business processes
  • How to deal with duplicated messages when using databases with different transaction semantics
  • What to do when your system grows too big to be handled by a single development team

Equipped with this knowledge, you’ll be able to combine these design patterns to architect your next distributed system to be robust or to refactor your existing system towards greater reliability.

Transcription

00:00:00 Tomek Masternak
Hello again, everyone, and thanks for joining us for another Particular live webinar. This is Tomek Masternak, and today I'm joined by my colleague and solution architect, Szymon Pobiega who's going to talk about DDD patterns in distributed systems. Just a quick note before we begin, please use the Q and A feature to ask any questions you may have during today's live webinar, and we will be sure to answer them at the end of the presentation. We will follow up offline to answer all the questions we won't be able to answer during this webinar. We are also going to record the webinar, and everyone will receive a link to the recording via email. Okay, let's talk about DDD. Szymon, the floor is yours.
00:00:55 Szymon Pobiega
Thank you, Tomek. Okay, let's get started. Mandatory slide with my background, but we can skip by it really quickly. I used to be more active in the Domain-Driven Design community. I'm not that active anymore, unfortunately. Hope to get more done in that community. I used to be involved in event sourcing and, last but not least, I am an engineer at Particular Software, helping people build distributed systems. Some of you who may have seen my presentations in the past probably expected this guy. It's a typical architect, an old man with a beard and gray hair. Usually, my presentation starts with this guy designing the architecture. This is not the case today.
00:01:57 Szymon Pobiega
Today, we'll have a much more interesting team of architects that I think better reflects the reality of software engineering, because we usually no longer have a single architect in the ivory tower. Many team members play the architect role, at times all of them playing the architect role at the same time. Sometimes, one of the team members takes the lead. Also, each of those architects, or all those people that play the architect role, have their personalities and their backgrounds that influence how the architecture is designed. Okay, so let's get started. This lovely team of architects went on to design a system for selling potions online. Potions are very important for ponies because if you are preparing for a major holiday in the pony land, you need to stockpile some potions, magical potions.
00:03:01 Szymon Pobiega
This e-commerce system is going to help ponies find and purchase the potions they need for a particular holiday. If you're building an e-commerce system, the first thing that you're probably going to start with is a shopping cart, and this is precisely what the ponies did. For the shopping cart, the lead architect role happened to be assigned to Pinkie Pie. Pinkie Pie is the first pony architect that we are meeting today. She's very enthusiastic about stuff, so she quickly jumped on this task and started designing because that's what architects do, right? That's the first sketch of the architecture of the system. It's pretty straightforward, the most basic type of architecture I would say.
00:03:48 Szymon Pobiega
There is a user interface that runs in the browser, HTTP protocol between the browser and some component on the server side that is the shopping cart component, and the database that stores the data. This is something we all start with, but that was deemed not enough because the ponies expected that before the major holidays, there would be much higher traffic on the website. They wanted to build scalability in, and they heard about message queues, and they heard that messages help in achieving higher scalability. They introduced a message queue behind the web server, so that when there are more requests than the shopping cart component can handle, those web requests won't be just plainly rejected.
00:04:35 Szymon Pobiega
They will be queued in that queue, and the shopping cart will update the database, but what happens on the other side of things? How would users of that website know if they managed to put something into the shopping cart or not? Well, the user interface is going to poll the database through the web server. Is that item added? Not yet. Is that item added? Not yet. That's how the interaction would work and that was supposed to work, at least on paper, but the user experience team was not really happy because the delay between a user clicking add item to the shopping cart and actually seeing that item in the shopping cart was much too long. Our pony architects introduced a small change in the system.
00:05:23 Szymon Pobiega
They added another queue, and the shopping cart component now updates its database and also posts a message to that return queue, and that message gets mapped to a WebSocket packet, and that WebSocket packet gets to the UI. It's much faster now because there is no polling and no polling interval. As soon as the add item to shopping cart message is processed, the user sees a change in the user interface. Okay, that was supposed to be the plan, but here we meet our second pony architect, Fluttershy. Fluttershy doesn't talk too much because, well, she's shy, but when she does talk, she usually has some very insightful thing to share. In this particular moment, what she started questioning is the queues.
00:06:17 Szymon Pobiega
All the ponies gathered together and started thinking about queues. When do queues help in a distributed system? They made a list of things that are required to be there for the queues to be helpful. First, there needs to be higher than capacity load. Yes, that's something that the system is going to experience before major holidays, but that higher than capacity load has to last for a short period of time, like seconds or minutes. Well, that's not what the ponies expected from their e-commerce system. They expected the higher than capacity load to last for a few days before the holidays. Hmm, there is something off here. The second condition for the queues to be helpful is that further action does not depend on the outcome of the previous action. Let me describe it in this way.
00:07:11 Szymon Pobiega
When you're submitting an order, you click submit and then you expect that your order will be processed by the e-commerce site. You click submit and you go out with a dog, or you just go to sleep and expect your order to be processed by next morning, but if you are interacting with a shopping cart, that's not the flow. You're adding items to the shopping cart, reviewing the order, clicking here and there, and each action depends on all the previous actions that you did on the website. Queuing your commands to add an item to a shopping cart, or remove an item from a shopping cart, doesn't really help you because you still need to wait until that item is put into the queue, picked up from the queue and processed.
00:07:58 Szymon Pobiega
Queues are not really helpful in that scenario, right? So the ponies had to go back to the drawing board and start over, but this time, the lead architect role was taken by Fluttershy. What she did is she took that basic design with the shopping cart, and she used a different approach to handling higher than capacity load. She heard about serverless, like Azure Functions and AWS Lambda, and she figured that that would be an ideal solution for their problem because with serverless, you still don't need to pay for the capacity that you're not using, but you can maintain the higher than usual capacity for extended periods of time, like for three days in a month, which would not be the case with the... well, which is the case where queues would not be really helpful.
00:08:54 Szymon Pobiega
They decided to use serverless for scale out, and what serverless did is it pushed them in the direction of questioning how much stuff they do while handling a web request. It probably could have been seen even before they switched to serverless technologies, but serverless makes it really plainly explicit that you should just take the request, do the minimum thing you need to do, and return the response. That means that there are some asynchronous things that need to be done while processing an order for potions that don't really belong to that request-response pattern. Ponies had to introduce some solution for that, and the easiest one that they thought of was a batch job.
00:09:42 Szymon Pobiega
Well, these days if you want to be a trendy architect, you don't call it a batch job. You call it a function that is triggered by a cron trigger, but still the concept is the same. You're running some piece of code that is doing some processing on the back end, and that worked for a while, but at some point, ponies realized that it's not going to be the solution for them. They needed at some point to break down their system into smaller components because just pushing more and more stuff into that batch job was making the code less and less readable. They were worried that they wouldn't be able to build that system on time, so they started introducing components like shipping or billing.
00:10:27 Szymon Pobiega
Now the question is how do you cross the gap between the shopping cart and shipping? The user interacts with the system. They add items to the shopping cart. They remove items and at some point, they click submit to submit the order. When they submit the order, the order information should go to shipping and billing and other components. Also, the shopping cart should be updated to reflect that state. Well, one way to do it is let's just use HTTP. Nobody gets fired for choosing HTTP as a communication layer for their systems, right? The problem with that approach, though, is what happens when you want to submit the order. The user interface sends the HTTP request, and that HTTP request needs to update the state of the shopping cart.
00:11:19 Szymon Pobiega
Let's say it updates the state of the shopping cart to order submitted and then fires another HTTP request that goes to the shipping component, and that other HTTP request conveys the shipping information and all the items that need to be shipped, and that information needs to be stored in another database. Only after those two database updates or inserts are done can we return the HTTP response to the user interface. Well, you may notice now that it involves updating two separate databases, and without distributed transactions, and we don't have them for web requests, there are going to be partial failures. Either this database update succeeds, or maybe the other. Maybe none of them succeeds.
00:12:10 Szymon Pobiega
There are going to be failure scenarios that will result in customers seeing weird things. The shopping cart would be in the state order submitted, so they would think, "Oh, that's fine. I will get my potions," but shipping would have no information about that particular order because the HTTP request failed at some point. They decided that's not going to be a solution in the long term. They tried to redesign it, and they did so by breaking down this transaction into two pieces. First, when the user submits the order, the state of the order is only updated in the shopping cart component. Then there is a batch job that selects the orders in that submitted state and transfers the information about those orders to the shipping component, which stores them in the database.
00:13:03 Szymon Pobiega
If that fails, the batch job can always retry passing that information to shipping as many times as needed. There is failure handling built into that system, and ponies were happy with it. At least Pinkie Pie and Fluttershy were quite proud of that, but this is the time where we meet our third pony today. She is Rainbow Dash, and she is known for being very fast, a very fast flying pony. She sometimes pokes holes in the clouds that cause rain or thunder, and she's also known for a thing that many architects are known for. I speak from experience because I used to be an architect. They just come down from their ivory tower, and they leave a drive-by comment about your system, and then they go back up into their ivory towers.
00:13:56 Szymon Pobiega
The development team is left with a comment that they have to act on and they don't know what to do, so here's her comment. "You decided to build a poor pony's queue, right?" The ponies are puzzled. What did she mean? But she's a very smart pony. All the other ponies started thinking really hard about what's going on here; maybe something is really off. After a while, they realized what Rainbow Dash meant. She meant that this system of a batch job and HTTP requests and storing stuff in a shipping database is basically a message queue in disguise, because we are taking stuff from one database to another database, transforming it along the way and going through some API.
00:14:42 Szymon Pobiega
It could also be seen as a good old replication of the database, but for this purpose, let's treat it as a message queue. At least that was the first thought that Rainbow Dash had, and all the other ponies agreed. They came back to the drawing board, and they replaced that whole system with a nice tidy message queue. They were very, very happy with that design, so let's see what the code looks like when they actually started writing it down. This is a web method that handles the submit order request. It gets the order ID as a parameter. It loads the order from the database and calls the submit method on it, which changes the internal state of that order.
00:15:35 Szymon Pobiega
It begins the transaction and within that transaction, stores that order in the repository, and it publishes the order submitted event that notifies the other components of that system that an order has been submitted. Then it commits the transaction. Sounds simple, but here's a moment where we meet our next pony, Rarity. Rarity is known for paying very, very close attention to detail, and she usually notices things that other ponies don't see. This time, she looks at the code and says, "Oh, how are you going to send a message and update the database in a transaction?" Hmm, what does she mean? Well, take a look at those two lines. In line one, we publish an event and in line two, we commit the transaction that stores the state of our order.
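The handler described above might be sketched roughly as follows. This is only an illustration in Python, not the code from the slides: `FakeDatabase`, `FakeBus`, and the `fail_on_commit` switch are hypothetical stand-ins, and the only point is that the publish and the commit are two separate operations against two separate resources.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    state: str = "Open"

    def submit(self) -> None:
        self.state = "Submitted"


@dataclass
class FakeBus:
    """Hypothetical broker stand-in; it does not take part in the DB transaction."""
    published: list = field(default_factory=list)

    def publish(self, event: dict) -> None:
        self.published.append(event)


class FakeDatabase:
    """Hypothetical database stand-in whose commit can be forced to fail."""

    def __init__(self, fail_on_commit: bool = False) -> None:
        self.rows = {"42": "Open"}   # committed order states
        self._pending = {}           # changes staged in the open transaction
        self.fail_on_commit = fail_on_commit

    def load(self, order_id: str) -> Order:
        return Order(order_id, self.rows[order_id])

    def save(self, order: Order) -> None:
        self._pending[order.order_id] = order.state

    def commit(self) -> None:
        if self.fail_on_commit:
            self._pending.clear()    # the staged changes are lost
            raise ConnectionError("simulated commit failure")
        self.rows.update(self._pending)
        self._pending.clear()


def submit_order(order_id: str, db: FakeDatabase, bus: FakeBus) -> None:
    order = db.load(order_id)
    order.submit()
    db.save(order)
    bus.publish({"type": "OrderSubmitted", "order_id": order_id})  # "line one"
    db.commit()                                                    # "line two"


if __name__ == "__main__":
    db, bus = FakeDatabase(fail_on_commit=True), FakeBus()
    try:
        submit_order("42", db, bus)
    except ConnectionError:
        pass
    print(bus.published)  # the event has already left the process
    print(db.rows)        # but the stored order is still "Open"
```

With the simulated commit failure, the event is published while the stored order stays in its old state. Swapping the last two lines of `submit_order` would produce the mirror-image failure.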
00:16:33 Szymon Pobiega
Now, depending on the order of those two operations, we might end up with two partial failure scenarios, which are really not going to be liked by the user. In the first scenario, with that order of things, the order submitted event will get published, but then if line two fails, it means that shipping and billing will carry on processing the order while the order state is not really updated. For the user, the order appears to be still in the not submitted state, and the user might think, "Oh, well, never mind. I won't buy those potions," and they can go away and not expect to have a delivery the next day.
00:17:17 Szymon Pobiega
The other situation is that if you flip those two lines, trying to fix it, you will end up in the opposite situation, where the user would see the correct state, order submitted, but the other components of the system would not know about that order. There are partial failures in that scenario, and there are two ways of dealing with those partial failures. Way number one is you can have a distributed transaction between your messaging infrastructure, the bus, and your database. Well, you cannot have them any longer, unless you're using old technologies like MSMQ. What's option two? Option two is to use the same transactional resource for messaging and for the database. So one option is: let's use a message queue as a database.
00:18:07 Szymon Pobiega
Well, as you might imagine, message queues are really poor databases. The only other option is to use a database as a message queue. Well, that might sound weird, but at least we can try, or try to imagine it. It can be done with other databases, but let's focus on the relational database scenario. You can imagine a table that has two columns, a sequence number and a payload or a body. Then if you treat that table as a queue, you need a way to receive messages from that queue, and you do it by issuing a so-called destructive select, which is a select that is a sub-query to a delete.
00:18:48 Szymon Pobiega
It works in such a way that you select the oldest row from that table, ordering by sequence number, and with those two table hints, UPDLOCK and READPAST, you ensure that concurrent selects will select and lock different rows. The concurrent threads won't step on each other's toes. Finally, you delete that row and the OUTPUT clause tells it to return the row to the application server. Of course, the row isn't immediately deleted. It will be deleted once the transaction commits. When we use this pattern, we change the code a little bit. Now, we load the order. We begin the transaction, save the data into the database within that transaction, but we also publish a message within the same transaction. Then when we commit that transaction, both save and publish will atomically be sent to the database.
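To make the idea concrete, here is a minimal sketch of a queue table living next to the business data, written in Python against SQLite so that it runs anywhere. It is an assumption-laden stand-in, not the SQL Server code from the talk: SQLite has no `UPDLOCK`/`READPAST` hints or `OUTPUT` clause, so the receive below is a plain select-then-delete inside one transaction, and the table names, `simulate_failure` flag, and message shape are all hypothetical.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    # The queue is just a table: a sequence number and a body.
    conn.execute(
        "CREATE TABLE queue (seq INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    conn.commit()


def submit_order(conn: sqlite3.Connection, order_id: str,
                 simulate_failure: bool = False) -> None:
    # "with conn" commits on success and rolls back if an exception escapes,
    # so the state change and the message are stored (or discarded) together.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, state) VALUES (?, 'Submitted')",
            (order_id,),
        )
        message = {"type": "OrderSubmitted", "order_id": order_id}
        conn.execute("INSERT INTO queue (body) VALUES (?)", (json.dumps(message),))
        if simulate_failure:
            raise RuntimeError("simulated crash before commit")


def receive(conn: sqlite3.Connection):
    # Take the oldest row and delete it in the same transaction.
    with conn:
        row = conn.execute(
            "SELECT seq, body FROM queue ORDER BY seq LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM queue WHERE seq = ?", (row[0],))
        return json.loads(row[1])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    submit_order(conn, "42")
    print(receive(conn))
    print(receive(conn))
```

If `submit_order` fails before the commit, neither the order row nor the message survives, which is the property the ponies were after. With several competing consumers, a real implementation would need the row-locking behavior that the table hints provide on SQL Server.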
00:19:44 Szymon Pobiega
That's a huge win because now we are not vulnerable when it comes to those partial failure scenarios, but this is precisely the time when Rainbow Dash flies in again and says, "Woo, so you did decide to build a poor pony's queue, right?" Well, what does she mean? Well, you can find that on Twitter. Usually if you ask about building a queue in a database, you will get a lot of comments along the lines of don't do that, it's not going to work, and blah, blah, blah. The ponies, after consulting Twitter, decided to actually check for themselves, and they picked an 8 vCore SQL Azure instance because it's a very standard thing that you can replicate yourself. It costs you $2 per hour, not a big amount, not a small amount, whatever.
00:20:36 Szymon Pobiega
That SQL Azure instance with decent messaging code can give you 800 messages sent and received per second through a queue in the table. That means almost three million messages per hour. Why am I quoting the per-hour numbers? Well, you'll see in a second. Almost three million messages per hour means that you pay about $0.0007 per 1000 messages. Okay. To put it into context, Azure Storage Queues, if you use three operations for each message, meaning you put a message, then you get a message and then you delete a message, cost you around $0.0012 per 1000 messages. That means that SQL-based message queuing is roughly twice as cheap, or about 50% cheaper, than Azure Storage Queues, which is the cheap way of sending messages in Azure.
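The arithmetic behind that comparison can be checked in a few lines. The inputs below are only the figures quoted in the talk (800 messages per second, $2 per hour, about $0.0012 per 1000 messages for the storage queue); they are not current Azure prices.

```python
MESSAGES_PER_SECOND = 800            # throughput quoted in the talk
INSTANCE_COST_PER_HOUR = 2.00        # USD per hour, as quoted
STORAGE_QUEUE_COST_PER_1000 = 0.0012 # USD per 1000 messages, as quoted

messages_per_hour = MESSAGES_PER_SECOND * 3600
sql_cost_per_1000 = INSTANCE_COST_PER_HOUR / (messages_per_hour / 1000)
ratio = sql_cost_per_1000 / STORAGE_QUEUE_COST_PER_1000

print(messages_per_hour)             # 2880000
print(round(sql_cost_per_1000, 5))   # 0.00069
print(round(ratio, 2))               # 0.58
```

So the SQL-based queue comes out at roughly $0.0007 per 1000 messages, a bit under 60% of the storage queue figure, which is where the "about half the price" rounding comes from.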
00:21:38 Szymon Pobiega
Well, that's not precisely the case, because if you use bigger messages, SQL Azure won't be as efficient in pushing those messages through the table. What is important here is that it's the same order of magnitude, and that 800 messages per second, or three million messages per hour, is a lot if you're counting your business-related messages. You can earn a lot of money by pushing that amount of messages per second, even if you get a tiny little profit margin for each transaction. If that is not enough for you, the pattern that you might be interested in is polyglot messaging, which means that you are using two different messaging infrastructures in the same system for different purposes.
00:22:25 Szymon Pobiega
Let's say one part of the system, the part that is handling the user interaction here, needs to have strong consistency guarantees, and you're using SQL Server as a message queue. Then there is another part of the system for which the most important factor is scalability, and then you use Azure Storage Queues in that part. In between them, you put something like a transport bridge. Such a transport bridge can be built with NServiceBus, for example, if you just use transports directly. To remind you, NServiceBus transports are a public API of the Particular platform. You can just use a transport outside of the endpoint if you don't need serialization and the pipeline and that kind of stuff, to just move messages from one queue to the other.
00:23:17 Szymon Pobiega
Coming back to our story, this is the architecture that the ponies were working with. I think this is an appropriate time to meet our next pony, the last for today, or second to last actually. She's Twilight Sparkle, and she is a Martin Fowler type of pony, if you know what I mean. She sees patterns where others fail to see any relationship. She looks at multiple things, compares those things and can see what the relations are. She looks at that diagram and says, "Hmm, there is something to it," and this is what she means by it. She noticed that there is a pattern that is repeated in that system over and over again. There is a synchronous part of a business process and an asynchronous part of the business process.
00:24:14 Szymon Pobiega
The synchronous part is when the user interacts with that system, for example, through the shopping cart. The user adds things to the shopping cart, removes things from the shopping cart, and makes many decisions, but at some point, the user arrives at a conclusion: either they are submitting that set of data that they built, or they are abandoning it. Hopefully, they submit it, and when they submit that set of data, it gets transmitted as a message to the async part of the system. The async part of the system consists of components that communicate through message queues. The asynchronous part of the system is where the majority of the business process is happening, but the user is not waiting for that part to complete.
00:25:02 Szymon Pobiega
Well, they would rather have their goods delivered by tomorrow than in three days, but the user doesn't care if the business process completes in five seconds or 10 seconds or maybe 30 seconds or five milliseconds. That pattern is not applicable to every system. There are systems in which the whole business process needs to be carried out within certain time constraints, but in most e-commerce systems, there is a clear distinction between the synchronous part and the asynchronous part. Usually, the async part is much more complex and bigger, but has no hard time constraints, while the sync part is shorter, much more user interactive, and is much better done in sync.
00:25:54 Szymon Pobiega
Okay, so let's try to focus now on that async part, and that would be preparing the shipment and billing the customer. For this part, the ponies decided to actually re-read the Domain-Driven Design book by Eric Evans because they expected a lot of complexity in this part of the system, and they really wanted to design that part of the system well, because they thought that that's their core domain. One of the things that they found in that blue book is this sentence: aggregates and entities encapsulate behavior and data. There is a lot of knowledge packed into such a small sentence, and the lead architect here is Rarity, the pony whose attention to detail is unmatched.
00:26:50 Szymon Pobiega
Rarity took the blue book and tried to implement the concept called application service or application layer. This is a concept in domain-driven design that was meant to reflect a use case in the original book. The use case is confirm payment, and that use case takes an order ID. It loads the order from the repository. It calls a method on the aggregate, order being the aggregate, and then saves to the database the changes that happened while executing the confirm payment method. This is the way we used to build domain-driven design systems 10, 15 years ago, as a collection of use cases that were independent. The user would somehow select the aggregate that they are operating on and invoke a use case by passing arguments.
00:27:48 Szymon Pobiega
Where was the business process there, you may ask. The business process was in the heads of the users. Users would have to know: okay, now I confirm the payment, then I prepare shipment, then I do this or that. That was the way we used to build systems back then, and those systems were usually monolithic. This is not the case anymore. We're living in times where we build distributed business processes, and what is that distributed business process? It's a business process that is automated in software. It's not the user who clicks here and there and processes payments and prepares shipments.
00:28:24 Szymon Pobiega
We expect from the software systems that we build that the process will be carried out automatically, and users or operators of the system will only get notified when there is something that the system cannot handle by itself. That's the system part. The distributed part means that the process is distributed among multiple components that interact. The shopping cart component stores some information in its database and publishes an event, or sends a message to a message queue. That message is picked up by a shipping component or a billing component or name that component, and that other component modifies the state in its own database and sends another message.
00:29:12 Szymon Pobiega
The process flows. It's distributed because it uses multiple databases and multiple processes that communicate. If you zoom in on a single link in the chain of that process, for example, shipping, you'll see a component that runs the code. There is a database, a message queue on the left and a message queue or topic on the right. A message comes in and a message goes out, or multiple messages go out, and then there is some state change in the database. Now, let's try to focus on what those messages really are, because messages come in different shapes and sizes, but you can generally categorize messages into two broad groups. Well, there are certain different archetypes of messages, but we'll focus on two groups: events and commands.
00:30:03 Szymon Pobiega
How do we decide if a message that drives a particular business process should be a command or an event? Well, you may ask Rainbow Dash. She has an opinion about any subject, including events and commands, and she says, "Let's just use events. Why? Because I, Rainbow Dash, say so." What would that business process look like if Rainbow Dash was implementing it? Let's focus on the same part of the process, confirming the payment. Now, our "application layer" is not just a method, it's a message handler. It handles events of type order accepted, and the message is passed to that method and we load the order as previously. Then we confirm the payment, and the next thing is we publish order confirmed, because now the next part of the process will be handled automatically, not by the user selecting something.
00:31:06 Szymon Pobiega
We need to drive that business process by publishing another event, order confirmed. Now what you may notice here is one strange thing. We handle a message of type order accepted, an event, but we call the method confirm payment on the aggregate. Where is that mapping? Why order accepted and confirm payment? There is no place apart from that comment in lines eight and nine where that mapping is. The business process definition is implicit here and only available in the comment, if somebody bothered to comment. When an order is accepted, confirm payment: that's an implicit part of that code. But then we meet our friend Pinkie Pie again, and she also has an opinion about everything. "No! Let's just use commands." Here's Pinkie Pie implementing the same piece of software.
00:32:03 Szymon Pobiega
Now the confirm payment handler handles a command of type confirm payment, not an event. What happens here is, well, as usual, we load the order from the repository. We call confirm payment on it and then we send another command called schedule shipment. That's for another aggregate. The shipment aggregate can start scheduling a shipment of goods, but you notice another thing again here. There is an implicit business process defined here. We call a confirm payment method on the aggregate, but then we send a schedule shipment command. What's the relationship between confirming payment and scheduling shipment? Hmm, only the comment in lines nine and 10 tells us that when a payment is confirmed, schedule shipment, so the business process is implicit in that code.
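The two handler styles described above might look something like the sketch below. The slide code is not part of the transcript, so everything here is a hypothetical reconstruction in Python: `InMemoryRepository`, `FakeBus`, and the message dictionaries are illustrative stand-ins, not a real messaging API. Note how, in both versions, the step that links one message to the next exists only as a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    payment_confirmed: bool = False

    def confirm_payment(self) -> None:
        self.payment_confirmed = True


class InMemoryRepository:
    """Hypothetical repository stand-in."""

    def __init__(self) -> None:
        self._orders = {}

    def load(self, order_id: str) -> Order:
        return self._orders.setdefault(order_id, Order(order_id))

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order


@dataclass
class FakeBus:
    """Hypothetical bus stand-in that records outgoing messages."""
    events: list = field(default_factory=list)
    commands: list = field(default_factory=list)

    def publish(self, event: dict) -> None:
        self.events.append(event)

    def send(self, command: dict) -> None:
        self.commands.append(command)


# Rainbow Dash style: events in, events out.
def handle_order_accepted(event: dict, repo: InMemoryRepository, bus: FakeBus) -> None:
    order = repo.load(event["order_id"])
    # Business process, stated only in a comment:
    # when an order is accepted, confirm payment.
    order.confirm_payment()
    repo.save(order)
    bus.publish({"type": "OrderConfirmed", "order_id": order.order_id})


# Pinkie Pie style: commands in, commands out.
def handle_confirm_payment(command: dict, repo: InMemoryRepository, bus: FakeBus) -> None:
    order = repo.load(command["order_id"])
    order.confirm_payment()
    repo.save(order)
    # Business process, stated only in a comment:
    # when a payment is confirmed, schedule shipment.
    bus.send({"type": "ScheduleShipment", "order_id": order.order_id})
```

Delete either comment and nothing in the code explains why that particular incoming message leads to that particular outgoing one.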
00:33:00 Szymon Pobiega
The question is, should we follow Rainbow Dash and use events? Should we follow Pinkie Pie and use commands? Maybe we should do a mix, maybe we should allow each architect/developer to choose their own way of thinking and have a mixture. Hmm, we don't know, but there is one more problem in that code that only Rarity could notice. That looks great if you don't have any conditional logic in your business processes, but frequently you do, and frequently when you call a method on an aggregate, the aggregate can react in multiple possible ways. The purpose of aggregates, as you recall, is to encapsulate the complexity of the business logic and to guard the domain invariants.
00:33:47 Szymon Pobiega
There might be times where you send a message to an aggregate or call a method on an aggregate, and the aggregate refuses to carry out that command because by doing so, it would violate some of the invariants the aggregate protects. How does your app service know what messages to send? If that confirm payment method might result in success or failure, we need to add code like this afterwards, let's say we're following Pinkie Pie's way. If the order is confirmed, then we send a prepare shipment command. If it has not been confirmed, we send cancel shipment. That looks okay, but not for Rarity. She's really devastated because we just threw away her encapsulation, and her aggregates were supposed to guard their data and not let out any information.
00:34:44 Szymon Pobiega
Thankfully, Pinkie Pie found the solution in a 12-year-old blog post by my boss, Udi Dahan, about domain events. I tried to rephrase the code that Udi posted 12 years ago in a more modern way, and it goes something like this. We load the order from a database, then we create some in-memory lightweight event broker thing in which we can subscribe to events. We subscribe to the payment confirmed event and if that happens, we send the prepare shipment command. If there is a payment refused event, we cancel shipment, but where do those events come from, you might ask. We pass this lightweight in-memory event broker thingy to the confirm payment method, and the confirm payment method, that's the part of the aggregate, says, "Oh, if it's a success, if I successfully manage to confirm the payment, then I'll publish this.
00:35:38 Szymon Pobiega
Otherwise, I publish payment refused. When I publish this from the aggregate's point of view, my callback registered in the layer above gets called, and I send the prepare shipment command." The big advantage of that way of structuring the code is that the aggregate can have no public properties. The application layer or message handler layer doesn't have to query the order aggregate to find out whether the payment has been confirmed successfully. It just passes the event subscriptions, and the aggregate will call one of those callbacks by publishing one of these events. Encapsulation is preserved, and the other benefit is that we have our business process expressed in more than comments, because we see in line three and line six clearly the mapping between the payment confirmed event and the prepare shipment command.
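The slide code is not part of this transcript, so the following is only a minimal Python sketch of the idea as described: an in-memory event broker is passed into the aggregate, and the layer above maps events to commands through callbacks. The names `EventBroker`, `Order`, and `handle_confirm_payment`, and the "amount paid covers amount due" rule, are assumptions made for illustration and are not taken from the talk or from Udi Dahan's post.

```python
from collections import defaultdict
from typing import Callable


class PaymentConfirmed:
    pass


class PaymentRefused:
    pass


class EventBroker:
    """Lightweight in-memory broker: maps event types to callbacks."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable[[object], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        for handler in self._handlers[type(event)]:
            handler(event)


class Order:
    """Aggregate with no public properties; outcomes are reported only via events."""

    def __init__(self, amount_due: int) -> None:
        self._amount_due = amount_due
        self._confirmed = False

    def confirm_payment(self, amount_paid: int, events: EventBroker) -> None:
        # Hypothetical invariant: a payment must cover the amount due.
        if amount_paid >= self._amount_due:
            self._confirmed = True
            events.publish(PaymentConfirmed())
        else:
            events.publish(PaymentRefused())


def handle_confirm_payment(order: Order, amount_paid: int, send: Callable[[str], None]) -> None:
    """Application-layer handler: the event-to-command mapping is explicit here."""
    events = EventBroker()
    events.subscribe(PaymentConfirmed, lambda _: send("PrepareShipment"))
    events.subscribe(PaymentRefused, lambda _: send("CancelShipment"))
    order.confirm_payment(amount_paid, events)


sent = []
handle_confirm_payment(Order(amount_due=100), 100, sent.append)
handle_confirm_payment(Order(amount_due=100), 40, sent.append)
print(sent)  # ['PrepareShipment', 'CancelShipment']
```

The handler never reads the order's state; it only learns the outcome through whichever callback the aggregate triggers.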
00:36:35 Szymon Pobiega
The business process is, well, it's not on its own. It's not coded in a specific, separate artifact in the code, but at least it's more explicit than in a comment. Thanks Udi, that can work, but Twilight Sparkle has a better idea, or she thinks she has a better idea. She looks at an aggregate and she really thinks, and the other ponies agree, that aggregates are well suited for processing commands and publishing events. She really got that idea from the domain events blog post, because an aggregate can change its state, but can also refuse to process a command. An aggregate can look at the command and say no, that command is going to violate my invariants, so I'll just reject it, but the aggregate can publish an event nevertheless.
00:37:29 Szymon Pobiega
It can publish that something happened, or that something was prevented and didn't happen. That's fine, but the problem is how do you make a business process out of a thing that takes a command and generates events? You cannot chain them together because they are incompatible. You cannot put one aggregate after the other because aggregates expect commands, and the previous aggregate just outputs events. The missing link here that Twilight Sparkle figured out is the saga, otherwise called a process manager, but they insist on calling this thing a saga because it's more specific to messaging. A saga is a message-driven process manager, and that process manager reacts to events published by aggregates and generates commands for another aggregate, or the same aggregate, or some different aggregates.
00:38:24 Szymon Pobiega
The other idea that Twilight Sparkle has, connected to this one, is that she recalled reading Pat Helland's blog post or article about messaging, and the line that she remembered from that article by Pat Helland is: messages are addressed to entities. When she recalled that, something clicked, and this is the design that she spiked and showed to the other ponies. In her design, a payment is an aggregate, but it's also a message handler. There is no special application service layer or message handler layer that needs to take care of the aggregates or do the transformation. The aggregates themselves are handling messages.
00:39:08 Szymon Pobiega
This time the payment aggregate gets the confirm payment command and, assuming it is the happy path, it sets the confirmed flag to true, but also publishes a payment confirmed event. That event goes to an order process, which is a saga, the thing that manages the processes, and also a message handler in this example... well, because it's a saga, it's an event handler. It gets the payment confirmed event. It modifies its internal process state. The saga encapsulates not the business entity state, but the business process state. In this example, payment confirmed is set to true.
00:39:50 Szymon Pobiega
It inspects some other piece of its state, like address checked, and if it decides that certain criteria are fulfilled, in this example, payment is confirmed and the address is checked, it can send a command to another aggregate, the shipment aggregate, to prepare the shipment. That aggregate is also a message handler of a command because aggregates handle commands. It does its business logic, and then it publishes shipment prepared, which can be subscribed to by the same saga, the order saga, or yet another saga, depending on the nitty-gritty details of the business process. We can see here, we can clearly spot, that there is some duality and commonality between sagas and aggregates if they are expressed in that way.
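As a rough illustration of the flow just described, here is a small Python sketch in which the payment aggregate, the order saga, and the shipment aggregate are all plain message handlers. The message classes, the `handle` method convention, and the toy `run` dispatcher are assumptions for this sketch; a real system would route messages through a queueing infrastructure rather than an in-process loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfirmPayment:      # command
    order_id: str

@dataclass(frozen=True)
class PaymentConfirmed:    # event
    order_id: str

@dataclass(frozen=True)
class AddressChecked:      # event
    order_id: str

@dataclass(frozen=True)
class PrepareShipment:     # command
    order_id: str

@dataclass(frozen=True)
class ShipmentPrepared:    # event
    order_id: str


@dataclass
class Payment:
    """Aggregate: handles a command, changes entity state, emits events."""
    confirmed: bool = False

    def handle(self, message):
        if isinstance(message, ConfirmPayment):
            self.confirmed = True  # happy path only
            return [PaymentConfirmed(message.order_id)]
        return []


@dataclass
class OrderProcess:
    """Saga: handles events, tracks process state, emits commands."""
    payment_confirmed: bool = False
    address_checked: bool = False

    def handle(self, message):
        if isinstance(message, PaymentConfirmed):
            self.payment_confirmed = True
        elif isinstance(message, AddressChecked):
            self.address_checked = True
        else:
            return []
        if self.payment_confirmed and self.address_checked:
            return [PrepareShipment(message.order_id)]
        return []


@dataclass
class Shipment:
    """Aggregate: handles the command sent by the saga."""
    prepared: bool = False

    def handle(self, message):
        if isinstance(message, PrepareShipment):
            self.prepared = True
            return [ShipmentPrepared(message.order_id)]
        return []


def run(initial, handlers):
    """Toy in-process dispatcher: deliver each message to every handler."""
    queue, log = list(initial), []
    while queue:
        message = queue.pop(0)
        log.append(type(message).__name__)
        for handler in handlers:
            queue.extend(handler.handle(message))
    return log


log = run([AddressChecked("o-1"), ConfirmPayment("o-1")],
          [Payment(), OrderProcess(), Shipment()])
print(log)
```

All three classes have the same shape: private state plus a function from an incoming message to a list of outgoing messages, which is the commonality between sagas and aggregates that the talk points at.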
00:40:41 Szymon Pobiega
Both have some state, they encapsulate state. The aggregates encapsulate business entity state, the saga encapsulates the process state, and both receive and generate messages. We can call them message-driven state machines because that's what they are, and that's a pattern that is very useful in building distributed systems. When you think about this pattern quickly, you might realize that everything that you see looks like a message-driven state machine. That would be too easy for Twilight Sparkle to convince the other ponies. Fortunately, there is Fluttershy with her attention to detail. What she spotted is: well, that can work, that looks good, but what happens if you get a duplicate message?
00:41:35 Szymon Pobiega
Of course, Twilight has an answer to everything: the outbox. What is the outbox? For those of you who haven't used that pattern in your systems, the outbox is something that prevents invoking a piece of business logic multiple times when that piece of business logic is driven by messages. Let me show you quickly how it works. Suppose we are getting a message. What do we do? First, we try to load the outbox record. What that is, you will see in a moment. Because that's the first time we see that message, there is no outbox record for it; it's null.
00:42:11 Szymon Pobiega
We open a transaction and process that message, be it an event or a command or whatever, and the result is that we update the database within the transaction, so the changes to the database are not yet committed, but we also generate an outbox record, and that record contains the messages that are going to be sent out as a result of that processing. Then we store that outbox record as part of the same transaction. If line seven succeeds, both the changes to our business data and the messages that are going to be sent out are committed to the database. If line seven fails, then nothing really happens. Suppose line seven succeeds and we go to line nine, where we check if the record is dispatched.
00:42:55 Szymon Pobiega
We don't know yet what dispatched means, so it's not dispatched. We dispatch the messages from the outbox record. That means that we push those messages to the underlying queues and topics of the messaging infrastructure, and those messages are delivered to the downstream systems or components. Then we want to mark that record as dispatched, meaning that we sent those messages, but let's suppose we fail here. What happens? Well, nothing is lost, because in line seven we committed the transaction and the data is safe in the database, both the messages and our state change. Moreover, the message that we were processing goes back to the queue because we are using durable queues.
00:43:38 Szymon Pobiega
We are going to pick that message up again and retry processing it, but this time, the outbox record is already in the database. It's not null, so we are not invoking our business logic again, and we have prevented duplicate message processing here. Is it marked as dispatched? No, because we failed previously in line 12, so we dispatch the messages again. You may ask here, "Oh well, now we are generating duplicates, because we are sending the same messages again," but because those are the very same messages with the same message ID and the same message payload, the outbox components in all the other downstream services will catch those duplicates and will prevent duplicate processing.
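The slide code and its line numbers are not reproduced in this transcript, so the steps walked through above are sketched here in Python under stated assumptions. `Database` is an in-memory stand-in for a transactional store, `commit` stands for the single transaction that saves both the business data and the outbox record, and `fail_before_mark` exists only to simulate the crash between dispatching and marking as dispatched. None of these names come from a real library.

```python
from dataclasses import dataclass, field


@dataclass
class OutboxRecord:
    messages: list
    dispatched: bool = False


@dataclass
class Database:
    """Stand-in for a transactional store with business data and outbox records."""
    business_data: dict = field(default_factory=dict)
    outbox: dict = field(default_factory=dict)  # incoming message id -> OutboxRecord

    def commit(self, data_changes: dict, message_id: str, record: OutboxRecord) -> None:
        # One atomic step: business changes and the outbox record are stored together.
        self.business_data.update(data_changes)
        self.outbox[message_id] = record


def process(message_id, payload, db, business_logic, dispatch, fail_before_mark=False):
    record = db.outbox.get(message_id)
    if record is None:
        # First time we see this message: run the business logic exactly once.
        data_changes, outgoing = business_logic(payload)
        record = OutboxRecord(messages=outgoing)
        db.commit(data_changes, message_id, record)
    if not record.dispatched:
        for outgoing_message in record.messages:
            dispatch(outgoing_message)
        if fail_before_mark:
            raise RuntimeError("simulated crash before marking as dispatched")
        # In a real store this would be a separate small write.
        record.dispatched = True


calls, sent = [], []

def confirm_order(payload):
    calls.append(payload)
    return {payload: "confirmed"}, [("PrepareShipment", payload)]

db = Database()
try:
    process("msg-1", "order-1", db, confirm_order, sent.append, fail_before_mark=True)
except RuntimeError:
    pass  # the incoming message goes back to the queue and is retried
process("msg-1", "order-1", db, confirm_order, sent.append)  # retry: dispatches again
process("msg-1", "order-1", db, confirm_order, sent.append)  # later duplicate: ignored
print(len(calls), len(sent))  # 1 2
```

The business logic runs once, while the outgoing message is sent twice with identical content, which is why the downstream receivers need their own deduplication.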
00:44:21 Szymon Pobiega
Now we are going to mark those messages as dispatched and this time we succeed here, but before actually committing the message receive transaction, we fail. Well, the message goes back to the queue. We pick it up again. We retry, we check the outbox record, it's there and it's marked as dispatched, so we just continue, and we ignore that message as a duplicate. Works really well, but you can count on Rainbow Dash trying to question your design, right? "Good for you, but my database has no transactions," and you might think, "Well, why would you have a database that doesn't have transactions?" Well, there actually are databases that don't have transactions and have valid use cases, and many of them.
00:45:08 Szymon Pobiega
You can encounter them in modern systems. The major example being: if you want your database to be really scalable, you might trade off transactions. It's not like you should trade off transactions, but you might if you are building a key-value store, for example, that is supposed to handle millions of updates per second. You might not build transactions into it, transactions between multiple keys that is, but you can still use the outbox pattern. There are certain modifications that you need to apply, but you can define the outbox as part of your entity, as, for example, an ordered dictionary from strings to messages. Then when you are storing the outbox, it just means you're updating your entity in a database.
00:45:56 Szymon Pobiega
Well, updating your entity in memory before storing it in the database, and that means you are assigning the list of generated messages to the ID of the incoming message. Then, because you don't want to use all the space in the world, you want to check if this particular outbox is not too long. If it is too long, you just truncate it. Then when you're marking it as dispatched, it's just as simple as assigning null to that ID in our dictionary, so that we don't store the outgoing messages anymore. We don't need them. We just retain the fact that we processed the message and, well, as you might expect, Rainbow Dash comes again and says, "Well, actually, I wasn't right, my database has some transactions."
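A minimal Python sketch of that entity-embedded outbox follows. The `Entity` shape, the helper names, and the limit of three entries are assumptions chosen to keep the example small. Truncation here drops the oldest entries first, which assumes those have already been dispatched and that duplicates older than the retained window no longer arrive.

```python
from collections import OrderedDict
from dataclasses import dataclass, field

MAX_OUTBOX_ENTRIES = 3  # hypothetical limit, tiny for demonstration


@dataclass
class Entity:
    state: dict = field(default_factory=dict)
    # incoming message id -> outgoing messages, or None once dispatched
    outbox: OrderedDict = field(default_factory=OrderedDict)


def store_outbox(entity: Entity, incoming_id: str, outgoing: list) -> None:
    """Record the outgoing messages on the entity; one write saves state and outbox."""
    entity.outbox[incoming_id] = outgoing
    while len(entity.outbox) > MAX_OUTBOX_ENTRIES:
        entity.outbox.popitem(last=False)  # truncate: drop the oldest entry


def mark_dispatched(entity: Entity, incoming_id: str) -> None:
    """Keep only the fact that the message was processed, not its outgoing messages."""
    entity.outbox[incoming_id] = None


def already_processed(entity: Entity, incoming_id: str) -> bool:
    return incoming_id in entity.outbox


order = Entity()
for n in range(1, 5):
    store_outbox(order, f"msg-{n}", [f"event-{n}"])
    mark_dispatched(order, f"msg-{n}")
print(list(order.outbox.items()))
```

Because the outbox lives inside the entity document, a single-key write is enough to persist both the state change and the pending messages, so no multi-key transaction is needed.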
00:46:49 Szymon Pobiega
What does it mean, some transactions? Well, there are popular databases these days which fortunately have multi-entity transactions, but those transactions are limited. A good example is CosmosDB, which now has transactions, but those transactions are limited to a single logical partition. There is a very easy way to adapt the outbox pattern to those databases too. All we need to do is figure out, before processing a message, which partition that message is going to. That might be tricky in the general case, but you know what, if we are treating our entities, our aggregates and sagas, as targets of messages, we can easily use a partitioning strategy that assigns each aggregate or saga a different logical partition.
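To make the partitioning idea concrete, here is a small Python sketch in which the partition key is derived from the entity a message is addressed to. `PartitionedStore` is an in-memory stand-in and does not use any real database SDK; the key format and the item names are likewise assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str
    target_type: str  # e.g. "Payment" or "OrderSaga"
    target_id: str


def partition_key(message: Message) -> str:
    """Messages are addressed to entities, so the target identifies the partition."""
    return f"{message.target_type}:{message.target_id}"


@dataclass
class PartitionedStore:
    """Stand-in for a store whose transactions span one logical partition only."""
    partitions: dict = field(default_factory=dict)

    def commit_batch(self, key: str, items: dict) -> None:
        # All items in the batch land in the same partition, atomically.
        self.partitions.setdefault(key, {}).update(items)


store = PartitionedStore()
msg = Message("msg-7", "Payment", "p-42")
store.commit_batch(partition_key(msg), {
    "entity": {"confirmed": True},
    f"outbox:{msg.message_id}": {"messages": ["PaymentConfirmed"], "dispatched": False},
})
print(sorted(store.partitions["Payment:p-42"]))  # ['entity', 'outbox:msg-7']
```

Since the key is known from the message alone, the handler can pick the partition before loading anything, and the entity and its outbox record are always written in one partition-scoped batch.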
00:47:40 Szymon Pobiega
By doing so, we can use the outbox pattern very successfully, and the result is that each logical partition contains either a saga or an aggregate, plus the outbox associated with that message-driven state machine. With that design, finally agreed upon by all the ponies, they were able to launch the product and achieve great success by selling lots and lots of potions. With success comes, well, a price, and the price is that you need to keep up with the competition, because when the competition sees that you are succeeding, they want to replicate your business model. In order to keep up, you usually need to make your system more complex. Well, not in all cases, but sometimes what enterprises do is start different lines of business.
00:48:42 Szymon Pobiega
You were successful as a taxi company or something like that, and you figure, well, let's try to do the same in the food delivery space, not pointing to any company in particular. Here's the blueprint of the architecture as it was at this point in time. There was a UI running in the browser, the shell, meaning that part of the system that synchronously interacts with the UI and the user, and then the asynchronous core, right? That's our blueprint, but as you can see, you cannot scale the complexity here. You cannot add more components here because there's no space on my slide. What we need to do is flip that by 90 degrees, and now we have space.
00:49:28 Szymon Pobiega
We can make our boxes larger and fit more components into them: more sagas, more aggregates, more components that deal with the user interface. But fortunately, there is Fluttershy to warn us about something. What she sees is a layered architecture, and she has some experience with dealing with layered or multi-tier architectures back 10 years ago. She remembers it's not going to end well. She says, "Well, layered architectures quickly get messy, and we know about it," and the ponies knew about it. They didn't try to make those layers wider and pack more stuff into them. The strategy that they wanted to employ is to build more vertical slices, and each business unit, each product, you name it, had a different slice.
00:50:27 Szymon Pobiega
Those slices, well, they need to be integrated in the user experience layer because you don't want your user to know that you have different business units in your company, and that potions, for example, are handled in a different component than, let's say, magical scrolls. You need some integration there, and web components can help you a lot. Then you need some integration at the bottom, in the core of your system, and that can be handled by publishing an event. An aggregate publishes an event, and then there is a saga in a different slice of the system that subscribes to it. Event-driven integration in the core of the system and some web components integration, JavaScript integration, at the top.
00:51:20 Szymon Pobiega
With that, I'm almost done with my presentation. I would like to invite you to read a blog that Tomek and I have. That blog mainly focuses on making distributed business systems reliable, the outbox part, but we also talk about other stuff related to distributed systems. I would also like to invite you to the workshop that Tomek and I are going to have as part of the NDC workshops initiative. It's going to happen on March 11 and 12, and it will be about building reliable event-driven microservices. More or less the same topic as today's talk, but more focused on the reliability part of that. With that, let me try to recap what we said today because we're running out of time.
00:52:08 Szymon Pobiega
First, queues are not a silver bullet, especially for the interactive part of the system. Queues might actually do more harm than good. Second thing, SQL Server used as a message queue can get you a lot, but if that's not enough for you, polyglot messaging is the way to go. In many systems in the business or e-commerce area, a synchronous/asynchronous boundary can help you structure your system in a better way, and that's one of the patterns. Then the blue book. I highly recommend the blue book. If you don't have time to read it, just this sentence gives you a lot of the knowledge that is in that book: aggregates and entities encapsulate behavior and data.
00:52:58 Szymon Pobiega
The blue book is probably not enough these days because its content is slightly, only slightly, outdated, because today we deal with distributed business processes which require... well, not require, but are quite well handled by sending messages to entities, meaning aggregates or sagas, and that's thanks to Pat Helland's white paper. We call them message-driven state machines, and we usually want to protect them from duplicate messages by introducing the outbox. Last, but not least, if you want to scale that system when it comes to complexity and allow more teams to work on that system, vertical slices are the way to go. With that, I'd like to thank you. Here's a bunch of things, and I'd like to take some time for questions.
00:53:54 Tomek Masternak
Thank you very much, Szymon. We already have a couple of questions. By the way, if we can't get to the question that you asked, don't worry, we will follow up offline with the answers. Let's start with a question about serverless, and the question is: what is your personal opinion on serverless programming, and where would you implement the domain layer, aggregates, business logic, et cetera in a serverless environment where it seems that every function is separate, but with shared state?
00:54:27 Szymon Pobiega
Okay. I think I addressed this question at least partially during the presentation. What I like about serverless is that it really pushes you towards making the code that handles the request as short as possible and as focused as possible, which is good. Here, in the example of this system, that meant that the user-interactive part of the system is really interactive and handles requests really fast. Then when you're dealing with aggregates and sagas on the asynchronous side of things, serverless can also be useful, but this time, I would recommend using serverless based on message queues, which can be done both in AWS and in Azure. With that approach, the function is driven by messages coming into the queue, for example, and NServiceBus supports sagas on top of these serverless technologies.
00:55:33 Szymon Pobiega
If you are used to using NServiceBus for sagas and aggregates, you can do it with serverless technologies and benefit from the billing model that allows you to not pay for the capacity that you are not using.
00:55:51 Tomek Masternak
Thank you, Szymon. The next question was asked by Oliver, and Oliver is asking: how do you keep track of how your application works after you implement a lot of features? How do you make it clear where a message comes from and where it's going in your code base? How do you manage that aspect of your application?
00:56:14 Szymon Pobiega
Okay, that's a really good question because distributed systems are... well, they have some good advantages, but one of the weak points traditionally was monitoring those distributed systems and making sure they work as they are supposed to. What I can recommend is NServiceBus-specific: the parts of the platform, tools like ServiceControl and ServiceInsight, allow you to process all messages from your system, audit them, and visualize them. That's really useful if you want to see what sagas are doing and generally what components are sending to each other. Then there is a second thing that is not NServiceBus-specific. There is this specification called open... it's called OpenTracing if I remember correctly, or I might be mistaken.
00:57:11 Szymon Pobiega
Anyway, the idea is that with every request, you're passing the trace ID. Then if you're passing those trace IDs from message to message, from HTTP request to HTTP request, you can build a graph of what really happened in your system because you have that information embedded in all the calls. It can extend from the very beginning of the interaction, from the user clicking a button, to a database, and for example, Microsoft technologies in Azure support those tracing standards. You can figure out the whole interaction from the user clicking the button to CosmosDB throwing an exception, and I know Jimmy Bogard has an open source project that brings support for those tracing technologies to NServiceBus.
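The trace ID idea can be sketched in a few lines of Python. The header names `trace-id`, `span-id`, and `parent-span-id` are hypothetical and chosen for readability; real tracing standards define their own header formats, and real tracing libraries handle this propagation for you.

```python
import uuid

TRACE_HEADER = "trace-id"          # hypothetical header names
SPAN_HEADER = "span-id"
PARENT_HEADER = "parent-span-id"


def start_trace() -> dict:
    """Headers for the first call in an interaction, e.g. a button click."""
    return {
        TRACE_HEADER: uuid.uuid4().hex,
        SPAN_HEADER: uuid.uuid4().hex[:16],
        PARENT_HEADER: None,
    }


def child_headers(incoming: dict) -> dict:
    """Headers for a call or message caused by `incoming`: same trace, new span."""
    return {
        TRACE_HEADER: incoming[TRACE_HEADER],
        SPAN_HEADER: uuid.uuid4().hex[:16],
        PARENT_HEADER: incoming[SPAN_HEADER],
    }


click = start_trace()
http_request = child_headers(click)
queued_message = child_headers(http_request)
database_call = child_headers(queued_message)

hops = [click, http_request, queued_message, database_call]
same_trace = all(h[TRACE_HEADER] == click[TRACE_HEADER] for h in hops)
print(same_trace)  # True
```

Because every hop carries the same trace ID plus a pointer to its parent, a collector that receives these headers can rebuild the full graph of what caused what.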
00:58:13 Tomek Masternak
Thank you, Szymon. We are almost out of time, or actually over time, but I think that we can answer one more question. That question is about chattiness between the services: do you have any advice on making sure that the amount of data that passes between the endpoints is minimized, Szymon?
00:58:41 Szymon Pobiega
Okay. What we usually advocate is to not send data between services. We have a blog post published that has... yeah, there is a blog post that I can link somewhere. It's called "Putting your events on a diet" and basically, the idea is that the chattiness... well, if the service boundaries are correct, there is very little chattiness between those services because the data is not passed between them. That is built on the idea that the service decomposition of the system is based on the data and how the data is used, not on how the teams are structured. That's, in my opinion, one of the biggest challenges in the current way people build software systems. The services are designed around teams of two pizzas, for example, and not around the data that they use.
00:59:53 Szymon Pobiega
As a result, all the services have the same size because the teams have the same size, and all the services need the same data. They need to send data between each other, whereas I believe the correct approach is to actually structure the services based on what pieces of information are used together and updated transactionally. If you structure the system in such a way and then have your services be vertical, from the user interface all the way down to the database, you'll probably end up with much less data passing. The only data transfer that you will probably end up with in such a system is passing the data into some reporting solution, a data warehouse of some sort, but that's usually handled by data transformation techniques or ETL techniques, and not by messaging.
01:00:50 Szymon Pobiega
My general advice is to look at passing information, or passing data, through messaging as a smell in the code that might suggest the service boundaries are wrong.
01:01:07 Tomek Masternak
Thank you, Szymon, and that is the last question that we have time to answer today. One additional thing that I wanted to mention is that there are a bunch of events happening in February and in March, so make sure to go to particular.net/events. One thing that I wanted to highlight, in addition to what Szymon already mentioned, is the next webinar, which is going to be about Azure Functions in practice, delivered by Adam Jones. If you're into Azure and serverless, make sure to register, and I think that's all that we have for today. On behalf of Szymon, this is Tomek saying goodbye for now, and see you at the next Particular live webinar.

About Szymon Pobiega

Szymon works as an engineer at Particular Software. His main areas of expertise are Domain-Driven Design and asynchronous messaging. He is interested in the intersection of these two topics, i.e., in patterns and tools for ensuring exactly once and in-order message processing. Szymon is a co-author of https://exactly-once.github.io/

Additional resources