
Webinar recording

Distributed domain-driven design, a no-nonsense implementation guide

See how modern patterns can make distributed systems more robust, scalable, and genuinely fun to build.

A brave new world

Countless software engineers have struggled to implement distributed systems inspired by domain-driven design since the Blue Book was published almost 20 years ago. During that time, patterns and whole architectural styles were discovered, tested, and (sometimes) forgotten. Join domain-driven design enthusiast and distributed systems expert Szymon Pobiega on a journey to explore the most recent approaches to distributed domain-driven design.

In this webinar, Szymon will show:

  • The difference between architectural style and architectural pattern.
  • Modern and battle-tested approaches to implementing domain-driven design systems in a distributed manner.
  • The importance of not losing or duplicating signals.
  • A novel way of establishing reliable end-to-end communication channels within a distributed system and beyond.

Transcription

00:00:01 Tim
Hello everyone and thanks for joining us for another Particular live webinar. This is Tim, and today I'm joined by my colleague and solution architect Szymon Pobiega, who's going to talk about Domain-Driven Design and distributed systems. Just a quick note before we begin: please use the Q&A feature to ask any questions you may have during today's live webinar, and we'll try to address them at the end of the presentation. We'll follow up offline to answer any questions we can't get to during the live webinar. This webinar is being recorded, and everyone will receive a link to the recording via email. Let's talk about Distributed Domain-Driven Design. Szymon, welcome.
00:00:41 Szymon Pobiega
Hello everyone. Thank you, Tim, for the introduction, and welcome to the talk. I gave this talk the subtitle of A No-Nonsense Implementation Guide because I think the Domain-Driven Design community has suffered a lot over the years from what I call the "just" syndrome, meaning: let's just use event sourcing, or let's just use document databases, or let's just do X. And by No-Nonsense Implementation Guide, I mean something that at least tries to avoid that "just" keyword, that tries to present the problem as it is and show possible solutions to the problem at hand, not just blanket statements like these. My background is in Domain-Driven Design and, earlier in my career, in event sourcing. I joined Particular almost 10 years ago now, and since then I've been helping people build their own distributed systems, which is fun, of course.
00:01:58 Szymon Pobiega
I like to use a theme for all my talks, and today's theme is potatoes. Well, let's start with a single potato; we can have more potatoes, grow more potatoes here, and at this point you're probably puzzled as to what potatoes have in common with Domain-Driven Design. My past talks used pierogi as a theme; today it's potatoes. But when we put those potatoes on a design surface, they might actually look like something, and to me they look like bounded contexts. One of the DDD practitioners told me years ago, as a kind of running joke in the Domain-Driven Design community, that you can call the context diagram where you show the bounded contexts of your system a potato diagram, because most people just draw those bounded contexts on a whiteboard in oval shapes, so they look like potatoes.
00:02:53 Szymon Pobiega
So the first thing today is we are going to look at that context diagram and explore one potato at a time to see what a hypothetical distributed Domain-Driven Design system can do. In the first potato here we have, let's say, order processing, where this potato interacts with a web browser, with users, and with some API clients, our partners. Both can submit orders, and there is some order processing logic here that is implemented as a series of messaging components. One of these components is on the border of that potato, but there might be other messaging components communicating asynchronously with that one. And finally there's of course some sort of database. In the potato up there, we have components that are responsible for generating documents related to the orders, let's say invoices, and this potato receives information about the orders through events published by the potato below.
00:04:08 Szymon Pobiega
These events are also subscribed to by another bounded context on the right. This one is slightly bigger and contains a process manager, the thing that resembles a flow chart diagram. That process manager has its own data store and orchestrates other components by sending them command messages. These components can communicate with an external API, a partner API, for example a shipping partner, to schedule a shipment, and they may also communicate with some internal APIs that are hosted in yet another bounded context. So we'll discuss this system in this talk. It's a very generic distributed Domain-Driven Design system, and we need a guide to lead us through the meanders of Domain-Driven Design.
00:05:03 Szymon Pobiega
Today, our guide will be none other than Drake with his precious advice on how to build distributed Domain-Driven Design systems. And his first advice is: well, you shouldn't be writing code, really; what you should be doing is modeling the domain. This is something that I heard a number of times when I began my career with Domain-Driven Design, and the understanding was that there's something different about modeling the domain versus writing code. In fact, what I learned over the years, and it's kind of obvious once you learn it, is that all code in business software is a model of the business domain, but that model may be more or less clumsy, more or less cluttered. When we say we're doing domain modeling in the Domain-Driven Design community, what we mean is that we strive to find a model that is very clean, and that cleanness allows us to find one-to-one relationships between the business concepts and the code concepts. So there is a one-to-one mapping between the code and the business. That's what we mean by it.
00:06:19 Szymon Pobiega
We'll also take a look at the blue book. No presentation about Domain-Driven Design can fail to reference that book, the original book by Eric Evans. What we'll take from this book is the subtitle: Tackling Complexity in the Heart of Software. What does that mean? The understanding of that sentence that I heard a number of times is, again, what Drake would say here: we are not focusing on infrastructure code. I even heard, a number of times, that we're ignoring the infrastructure code; someone else, somewhere else, will take care of that code, but what we want to focus on is the domain model code, because this is the code that matters and that creates business value for us. And I disagree here, especially in the context of a distributed Domain-Driven Design implementation, where infrastructure is not trivial. The reason it is important in the distributed context is that the statement that we shouldn't focus on
00:07:22 Szymon Pobiega
the infrastructure code assumes that all of these are true. The single assumption, we should not deal with infrastructure code, can be expanded as: the network is reliable, the bandwidth is infinite, the topology doesn't change, and so on. These are the so-called fallacies of distributed computing, and they are called fallacies because they appear to be true, or people wish they were true. In fact, all of these statements would be better rendered in negated form: the network is not reliable, the bandwidth is not infinite, there is not just one administrator in the system. Today we're going to focus especially on that first one, the network is not reliable, because I think it's the most applicable to the design of a distributed Domain-Driven Design system, because such a system is only as good as its weakest component.
00:08:25 Szymon Pobiega
So we need to focus on finding the weakest components and making sure they are taken care of. Coming back to our design board, the blue board, we have our potatoes here, and now we can accept that these potatoes are in fact bounded contexts. The one on the left with a chest of gold is our core domain, the one that we care about a lot because it generates revenue for our corporation and gives us competitive advantage. The one on the right is a supporting domain, a very boring one, probably shipping, but we need to represent it anyway. We're good Domain-Driven Design practitioners, so we surround our bounded contexts with anti-corruption layers, which help decouple these contexts and protect the ubiquitous language used within each of them.
00:09:17 Szymon Pobiega
And what we want to do as good Domain-Driven Design practitioners is focus on the core domain: focus our resources, money, people, all we have, here, right? The issue with that approach is that in most cases the core domain that gives us the competitive advantage is not really something that interacts with the external world. In order for those benefits to be realized, for that revenue to be realized for our company, the core domain, our competitive advantage, needs to interact with the external world, and it does so more often than not through supporting domains and generic domains; it usually doesn't communicate with the outside directly. So if it has to communicate with the supporting domain, how does it do that? Here we can see the weakest-component idea that I mentioned before. Even if our core domain is great and is going to give our company great benefits, those benefits are only going to be realized if that core domain can communicate with the supporting domain.
00:10:30 Szymon Pobiega
If messages between these two components are lost or duplicated, or the communication otherwise fails, then a great core domain is of no use. So we need to somehow represent the communication as a domain. My proposal is to represent it as a separate domain on that chart, to make explicit that we need that communication to happen. Now the question is: should we then just abandon our core domain and focus all our resources, time, and people here? Well, fortunately not. There is a way around that, because there are off-the-shelf components, like for example NServiceBus or RabbitMQ, that can be used here as a generic sub-domain.
00:11:16 Szymon Pobiega
So the amount of code that we need to write for that communication domain is limited, and we can still focus on the core domain, but representing it on our chart is very important. For the rest of the talk, what I propose is that we focus on tackling the complexity, but not at the heart of the software. That's well covered in a number of books: the blue book, the subsequent red book, and all the other Domain-Driven Design books. What I want us to focus on is tackling the complexity at the borders of the software, because the borders are where the fun is in a distributed system.
00:12:03 Szymon Pobiega
So we are back at our design space with our four potatoes, and at first we're going to focus on that lower-left potato and the things that happen there. These things include message-based communication between the components of that potato, and that communication can be described as in this diagram: a message comes in to a component that is meant to process it, and that component, as part of processing the message, needs to update a database and send out a message. We want to figure out how to structure that communication so that it's reliable and we are neither duplicating nor losing any messages. So first, let's focus on the left-hand side here and see what's going on in detail. What is in fact going on here is that receiving the message consists of two separate operations in most queuing systems.
00:13:09 Szymon Pobiega
First we get the message from the queuing system, and that is a non-destructive operation. It can be done multiple times and has no side effects besides us reading the message. Then there's usually a separate operation that consumes the message; after consuming, the message is no longer available, it's removed and the space is cleaned up. And then there's a save operation that changes the data in a database. The issue is that in most modern queuing systems we have no transactions between the queue and the database. So we have those two operations and no atomic transaction between them, which exposes us to a thing called partial failures: we can fail partway through the process of consuming the message and saving the data, which means one of these operations might be done while the other fails, and then what do we do next?
00:14:07 Szymon Pobiega
So that leaves us with basically two options: we can first consume the message and then save the data to the database, or we can save the data and then consume the message. And here Drake comes with a very helpful answer: we shouldn't first consume the message and then save. Why is that? Well, if we try doing that, we receive the message, we consume it, and then what happens if we fail? At that point the message is gone from the message queue. We may still try to process it, but if we fail processing, if the database says, "Well, there was a timeout or there was an index problem, please retry," we cannot really retry, because the message is already gone. So with this approach we are basically removing all the benefit of having a message queue, which is that the message is reliably stored until it's processed.
00:15:11 Szymon Pobiega
So the only way to do it while maintaining the benefits of messaging is to first save the data and then consume the message, but that exposes us to duplicates. What happens when we receive a message, we update the database, but we fail to consume the message? If we fail to consume, the queue is reliable, so the message goes back to the queue and the message queue is going to deliver the same message again, and then we need to detect that we have already seen that message and ignore it, because it's a duplicate. How do we do that? Well, the most intuitive option is to somehow record that we processed that message. So we process a message, save the data, and record: we've seen that message, we've processed it, let's write it down somewhere, so that when the same message, or a duplicate copy of it, comes in, we can check in that place whether we already processed it.
00:16:13 Szymon Pobiega
But here's the problem. If we have two separate places where we record that we processed the message and where we save the business data, then we are back to square one, exposed to the same dilemma of which of these operations to do first. If we first record the message as processed and then save the data, recording the message as processed is equivalent to consuming the message, because the next time we see that message, we'll just drop it on the floor thinking it's a duplicate. So just by introducing this processed-message store, we didn't achieve anything. The only way to actually solve the problem is to store the information about processed messages in the same data store, atomically, in the same transaction as the modification of the business data. That sounds hard, and usually when I talk to Domain-Driven Design practitioners, their immediate reaction is: well, but I don't want to store infrastructure data in my precious database.
00:17:21 Szymon Pobiega
That's just for the domain model, not some clunky messaging stuff. But my answer is there's no other option. There really is no other option than storing the message processing information in that same database. And what comes in handy for Domain-Driven Design practitioners is the statement about aggregates: the aggregates are the transaction boundaries. So we cannot store that message processing information outside the aggregate; the aggregate is our transaction. Let me try to show it in pseudocode. We load the order aggregate, in this case from the repository, and we check if that aggregate has processed a message with the given ID. If it hasn't processed that message yet, we execute the business method on the aggregate, we store it with the repository, and we are done. But if a duplicate copy of that message comes in, we load the aggregate, we check if the aggregate has processed the message, and it has processed the message.
00:18:21 Szymon Pobiega
So we just skip the processing and we prevent the duplicate, right? What could be even better here is to notice that in Domain-Driven Design, the repositories are the components responsible for the transaction boundaries of an aggregate. The aggregate is just a thing that we take from the repository and store in the repository; what guards the transaction boundaries and implements the correctness of loading and storing aggregates are the repositories. And I think making repositories aware of messaging is a great example of how to make them useful, because many people argue that repositories are useless, that they're just an abstraction layer over NHibernate or Entity Framework. I think they are not. They're very useful when you take messaging into account in a distributed Domain-Driven Design context. So we can restructure that code, a simple refactoring, to make the repository responsible for keeping track of processed messages.
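To make that concrete, here is a minimal sketch of a messaging-aware repository, in Python with SQLite standing in for the business database. The schema, the Order example, and the handler are hypothetical illustrations of the idea, not the talk's exact pseudocode.

```python
# A messaging-aware repository: the processed-message record is committed in
# the SAME transaction as the business data, so duplicates are detectable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT);
    -- processed-message ids live in the same database as the business data
    CREATE TABLE processed_messages (
        order_id TEXT, message_id TEXT,
        PRIMARY KEY (order_id, message_id));
""")

class OrderRepository:
    def has_processed(self, order_id, message_id):
        return conn.execute(
            "SELECT 1 FROM processed_messages WHERE order_id=? AND message_id=?",
            (order_id, message_id)).fetchone() is not None

    def save(self, order_id, new_state, message_id):
        with conn:  # ONE atomic transaction: business change + dedup record
            conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)",
                         (order_id, new_state))
            conn.execute("INSERT INTO processed_messages VALUES (?, ?)",
                         (order_id, message_id))

def handle(message):
    repo = OrderRepository()
    if repo.has_processed(message["order_id"], message["id"]):
        return  # duplicate: the work was already committed, drop it
    repo.save(message["order_id"], "confirmed", message["id"])
```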
00:19:32 Szymon Pobiega
Now our aggregate is clean and represents just the business concepts; the repo has the processed-message check, and the save method has an additional parameter, the message ID, so it can atomically, in one transaction, save the aggregate and mark the message as processed. That's represented here on the diagram with a double arrow, meaning the database serves a double purpose: it's both the data store for the business domain and the place that keeps track of processed messages. Now we have the left part covered, so let's take a look at the right-hand side, which is a queue where we publish outgoing messages. And again, we are in a situation where only Drake can help us, because we have two operations, saving the data (and the processed-message record) and publishing a message, and unfortunately we have no transaction between the two. So what's Drake's advice here? We shouldn't publish and save; we should save and publish.
00:20:39 Szymon Pobiega
Why is that? If we publish and then try to save to the database, we can again, because of partial failures, fail midway through. So we published an event and then we failed saving. What that causes is an effect I call immediate inconsistency, where what we really strive for is eventual consistency. Immediate inconsistency means that any other component that subscribes to the published event now thinks that the state change happened, for example that an order was submitted, while that state change has not happened in the component that published the event; even more, that state change may never happen, because of validation problems or any other problems. That phenomenon, called ghost messages, is caused by doing publish-then-save. So we should avoid that; we should do save-and-publish. And if we follow Drake's advice, this is how the pseudocode would look.
00:21:41 Szymon Pobiega
We load the aggregate from the repository, we check it's not a duplicate; it's not a duplicate, so we save, and then we try to publish the PaymentConfirmed event. If we fail here, the message goes back to the queue; the queue is durable. We load that aggregate again while retrying to process the message, and this time we have processed the message, right? But here's the thing: we ended up at line seven and skipped line six, so we basically lost a message. We didn't lose the incoming message; we lost the outgoing message. We failed to publish a message where we should have published one, which is equally bad, so we need to do something with that line six. Fortunately, the fix is fairly simple: we just do a simple refactoring, move that line out of the if statement, and that should work. So let's give it a try. We load the order, we confirm the payment, we save the aggregate, and then we fail here, which means the message goes back to the queue and is reprocessed.
00:22:47 Szymon Pobiega
We skip the whole duplicate-checked part because we have processing information for that message, but we execute line seven here, the event is published, and we are good. So that's how the correct code looks, and our components in that lower potato can now exchange messages correctly.
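A sketch of that corrected handler, with the publish moved outside the duplicate check as just described. The repo and bus objects are hypothetical stand-ins; only the control flow matters here.

```python
def handle_payment(message, repo, bus):
    if not repo.has_processed(message["order_id"], message["id"]):
        order = repo.load(message["order_id"])
        order.confirm_payment()              # the business method
        repo.save(order, message["id"])      # atomic save + dedup record
    # Publish runs on EVERY attempt, duplicate or not: if a previous attempt
    # crashed after saving but before publishing, this retry is what finally
    # gets the event out. Subscribers deduplicate by message id.
    bus.publish({"type": "PaymentConfirmed", "order_id": message["order_id"]})
    # Only after publish succeeds does the queue consume (ack) the incoming
    # message; if publish fails, the message returns and we run again.
```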
00:23:29 Szymon Pobiega
I forgot to mention that with each potato we're going to explore more complex problems, so in this big potato with the process manager we're dealing with another level of complexity. The challenge here is caused by state. What I mean is that in the previous potato we were mostly thinking about events, and publishing events is usually easier than sending commands, because it's usually based on state that doesn't change. The event describes something that happened; the event itself is immutable, but the state it was generated from is usually also immutable. The reason we have a more severe, more complex problem here is that this process manager, which is sending out commands and getting responses, is constantly mutating its state, and that state is used to generate more messages. So we have constantly mutable state, and that state is used to generate messages.
00:24:31 Szymon Pobiega
Why is that a problem? Here's the mathematical formulation of idempotence, which is in fact a mathematical term, not a software engineering term, and it applies to a function: a function f is idempotent if applying it multiple times is equal to applying it once, that is, f(f(x)) = f(x). In a software engineering context, we often equate that function f with the database change. So we think: oh, my component is idempotent, because the data changes it makes, the saves to the database, are idempotent. And this is only partially true. Why it's only partially true we'll see by simulating concurrent processing of two messages by a single aggregate, this time a customer aggregate. So we have messages ABC and DEF, both destined for the same customer, both carrying information about order data.
00:25:32 Szymon Pobiega
Here we start with the first message. It's not a duplicate, so we go to line three, we place that order, we save the customer aggregate, and here we check if the number of orders is equal to one; if that's true, then it's a new customer and we want to send that customer a gift. So we execute that method, but before actually sending the message, we blow up: the message queue stops responding, or someone just disabled the network connection. Something went wrong, the message did not get through, but fortunately the queue is durable, so the incoming message will get back to the queue and we will have an opportunity to reprocess message ABC. But in the meantime, message DEF was picked up. It's not a duplicate, so we execute the business code for the same aggregate. We store the state of that aggregate, we check the condition here, and the order count is two.
00:26:36 Szymon Pobiega
So we just skip line seven, because we don't need to send that customer a gift; they already have two orders. Now again, we have another attempt at processing message ABC. We have processing information for that message, so we skip that part, we get to here, and the order count is two, so we skip sending that message. This leads us to another situation where we are losing outgoing messages. This time we lost the message because sending it was based on the state of the aggregate; our message-generation procedure was not consistent with the state of the aggregate. Is there a way to deal with that? Fortunately yes, there is a good, well-known way of dealing with it, called the Outbox pattern, a very well documented pattern. The crux of the Outbox pattern is that in that database on the left we store three things in the same atomic transaction.
00:27:35 Szymon Pobiega
We store the business data change, we record the information that we processed an incoming message, and we also record the outgoing messages that we are going to publish to the queue on the right-hand side. This guarantees that the entirety of the processing is idempotent, because either all the results of processing are committed to the database or nothing is committed. So our function is now truly idempotent and truly reflects all the side effects of message processing. But now you may ask: what happens if we commit all that data to the database but fail to send out the messages because something went wrong? What happens is that the incoming message is not consumed. When we fail midway through publishing messages, the next step would have been to consume the incoming message; because we failed publishing, we do not consume the incoming message.
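Here is a minimal Outbox sketch in Python with SQLite, using the customer-gift scenario from above. The schema, the gift rule, and the bus/ack interfaces are hypothetical illustrations of the pattern.

```python
import json, sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id TEXT PRIMARY KEY, order_count INTEGER);
    CREATE TABLE outbox (incoming_id TEXT PRIMARY KEY,  -- dedup record
                         outgoing    TEXT);             -- stored messages, JSON
""")

def handle(message, bus):
    row = conn.execute("SELECT outgoing FROM outbox WHERE incoming_id = ?",
                       (message["id"],)).fetchone()
    if row is None:
        with conn:  # ONE atomic transaction for all three writes
            conn.execute("INSERT OR IGNORE INTO customers VALUES (?, 0)",
                         (message["customer_id"],))
            conn.execute("UPDATE customers SET order_count = order_count + 1 "
                         "WHERE id = ?", (message["customer_id"],))  # 1. business data
            count = conn.execute("SELECT order_count FROM customers WHERE id = ?",
                                 (message["customer_id"],)).fetchone()[0]
            outgoing = [{"type": "SendGift"}] if count == 1 else []
            conn.execute("INSERT INTO outbox VALUES (?, ?)",          # 2. dedup record
                         (message["id"], json.dumps(outgoing)))       # 3. outgoing msgs
    else:
        outgoing = json.loads(row[0])  # duplicate: replay what we stored earlier
    for m in outgoing:
        bus.publish(m)                 # push the messages out, and only then...
    bus.ack(message)                   # ...consume the incoming message
```

Note how the gift message can no longer be lost: if the first attempt crashes before publishing, the retry finds the stored outgoing messages and republishes them, regardless of how the order count has changed in the meantime.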
00:28:31 Szymon Pobiega
So it goes back to the queue, it's given to us again, and we give it another try. This time we check in the database and we have all the information for processing that message: it has already been processed, and we know what outgoing messages should be published. We take them out of the database and republish. That's the Outbox pattern. We're now going to get back to the first potato and focus on its other aspect, which is dealing with user interfaces, because this is going to be interesting and is going to uncover some new ground in terms of distributed messaging systems. So let's see what happens when a customer places an order. Here's a diagram for that: the customer's web browser interacts with a web controller, which has an action; that action saves the data and sends a message, and the HTTP response the customer expects should contain information from that database.
00:29:40 Szymon Pobiega
So the request contains the data describing what the customer wants to do, and the response should contain information about what happened. What we need to achieve here, as you can probably expect, is to ensure that what happens to the database and what happens to the message on the right is atomic: either both succeed or neither does. It's another case where we have two operations and we need to coordinate them. What would be the easiest way to coordinate them? Well, we know the Outbox, so the easiest thing would be to employ the same pattern here: store all the information in the database and then publish the messages from the database, because we already saved them. The problem with using the Outbox pattern here at the edge, at the border of the system, is that the Outbox requires the incoming message to drive the retrying of publishing the outgoing messages.
00:30:39 Szymon Pobiega
If we don't have an incoming message in the queue, with the queue giving us that message again and again to retry processing, we still want to be able to reliably publish the outgoing messages, and, sorry, we can't rely on the user, or even on the user's browser running JavaScript code, to resubmit the same request multiple times to our backend to drive our processes. We need to assume that the JavaScript code running in the user's browser is not under our control and cannot be trusted to make that process work. So the workaround most often used in these cases was a very thin translation layer from the web request to the message: the browser sends a web request to do something, and that very thin translation layer takes the entire content of the web request, puts it into a message, and sends the message through the queue.
00:31:41 Szymon Pobiega
That allows us to use the Outbox in that part, with the incoming message driving the mechanism. The problem with that approach, though, is that we introduce asynchronous communication between the user's browser and the database being updated. The browser can no longer just depend on the HTTP response to contain the information about what happened; it needs to use long polling or other methods such as SignalR to get that information. That asynchrony is difficult to design for from the user experience point of view, and it also creates a lot of friction in the development cycle. So we don't want to use that. Fortunately, we now have something called the Transactional Session pattern, also known as Atomic Save-and-Send, which is a pattern specifically designed to handle communication at the border of the system, where HTTP meets messaging.
00:32:43 Szymon Pobiega
The way the Transactional Session pattern works is like this; let me start by explaining what we want to achieve and then work backwards to how it's done. What we want to achieve is: when the web browser sends a web request, the controller action needs to modify the data in the database and save the contents of the messages it wishes to send to the other components of the system. That's the first step: we need the data to be persisted and we need the messages to be persisted in the same transaction. That's the first requirement to get everything going. And the second thing is we want a mechanism that takes the messages from the database and publishes them reliably to their ultimate destinations. Reliably meaning that if messages are saved to the database, there is a mechanism that guarantees these messages will eventually be delivered.
00:33:44 Szymon Pobiega
The easiest way to do it would be to use a timer job of sorts to just poll the database: "Hey, do you have outstanding messages? Give me the outstanding messages," and so on. The problem is that these timer jobs are cumbersome to implement. They are even more cumbersome to set up in a scalable manner, because either the polling is too frequent and you are killing the database, or the polling is less frequent and then it can't cope with the load of messages, and messages wait too long to be delivered. So it's not really a great mechanism. Plus, if you take into account serverless environments, setting up polling is really painful. So instead of polling, what we would like to use is messaging. Let's imagine a trigger message that doesn't contain any content, but is designed in such a way that when it arrives at that component, it triggers the component to look up, in the database, by some ID:
00:34:45 Szymon Pobiega
are there any messages for that transaction? Oh, there are messages, so let me publish them. So instead of having a poller that polls the database, we have a message for each transaction, for each web request, and that message triggers the sending out of the outgoing messages. But then the question is: how do we get that trigger message into the queue? And the answer is: the controller action is going to send the trigger message. So to summarize, where we get to is that a controller action handling a web request needs to send a trigger message, store the actual messages in the database, and modify the data in the database. And you can probably see who's coming now: Drake is going to help us, because we have a dilemma here.
00:35:38 Szymon Pobiega
Which of these things do we do first: the trigger message or the database update? And of course Drake has a very quick answer: we should do send-and-save; we shouldn't do save-and-send. Why? Let's first look at what Drake is arguing against. If we save the data in the database, save the outgoing messages in the database, and then send a trigger message, we might fail before sending the trigger message. If we fail before sending the trigger message, we end up with the database state modified and messages waiting in the database, but no trigger message to push them out. If the user refreshes their screen, they would see the state of the order as, let's say, submitted. The messages for actually processing the order sit ready in the database, but nothing will push them out. So the business process will be stuck.
00:36:37 Szymon Pobiega
The user won't get their delivery. So the only way to make it work reliably every single time is to first send the trigger message and then save the messages in the database. So we do this first and that second, and then the trigger message arrives at the backend endpoint, and it contains the ID of the transaction of the web request. By this ID, we can look up the messages in the database and push them out. So far so good; the success scenario works really well. But as we know, the network is not reliable. So what may happen is we send the trigger message, but then the transaction for storing the data and the outgoing messages in the database is slow. It's awfully slow, taking a lot of time, and we are here, in this component, facing a dilemma about what to do.
00:37:37 Szymon Pobiega
The trigger message has arrived, we are checking the database, and there's no sign of that transaction yet. What we can do is wait 10 seconds, 15 seconds, maybe 20 seconds, but eventually we need to give up. We cannot reprocess that trigger message forever, because it's consuming resources in our system; we need to dispose of it eventually. So what can we do? Just drop the trigger message on the floor? No, we can't really drop it on the floor, because that transaction may really be stuck and get killed, but it may also succeed. And if we dropped the trigger message and the transaction later succeeded, we would end up in a situation where we modified the data and created messages, but there is no trigger message to push them out.
00:38:30 Szymon Pobiega
So before we drop the trigger message on the floor, we need to create what is called a tombstone: we need to mark that transaction as dead, and only then can we drop the trigger message. If we create a tombstone, then when that slow transaction eventually reaches the point where it would commit, it cannot commit, because the tombstone is already there; the tombstone is usually implemented as a unique index. So that transaction is going to fail, we prevent the change in the database, and the user will figure out that the order is still not processed. So we are safe here, at least from this point of view.
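A sketch of the whole Transactional Session flow, including the tombstone, in Python with SQLite. The table names, the MAX_ATTEMPTS counter, the attempt parameter, and the bus are hypothetical stand-ins; in a real database the slow controller transaction would fail on the unique index when it tries to commit against an existing tombstone.

```python
import sqlite3, uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sessions (id TEXT PRIMARY KEY, status TEXT);  -- the unique index
    CREATE TABLE outbox   (session_id TEXT, payload TEXT);
    CREATE TABLE orders   (id TEXT PRIMARY KEY, state TEXT);
""")
MAX_ATTEMPTS = 5

def controller_action(order_id, bus):
    session_id = str(uuid.uuid4())
    bus.send({"type": "Trigger", "session_id": session_id})  # 1. trigger FIRST
    with db:  # 2. then one atomic transaction: marker + data + messages
        db.execute("INSERT INTO sessions VALUES (?, 'committed')",
                   (session_id,))  # fails (rolls everything back) if a
                                   # tombstone with this id already exists
        db.execute("INSERT INTO orders VALUES (?, 'submitted')", (order_id,))
        db.execute("INSERT INTO outbox VALUES (?, ?)",
                   (session_id, '{"type": "ProcessOrder"}'))

def handle_trigger(trigger, bus, attempt):
    sid = trigger["session_id"]
    row = db.execute("SELECT 1 FROM sessions WHERE id = ? AND status = 'committed'",
                     (sid,)).fetchone()
    if row is not None:  # the transaction committed: push its messages out
        for (payload,) in db.execute(
                "SELECT payload FROM outbox WHERE session_id = ?", (sid,)):
            bus.publish(payload)
        return
    if attempt < MAX_ATTEMPTS:
        raise RuntimeError("no session yet, retry later")  # queue redelivers
    # Giving up: write the tombstone BEFORE dropping the trigger, so the slow
    # controller transaction can never commit behind our back.
    db.execute("INSERT INTO sessions VALUES (?, 'dead')", (sid,))
    db.commit()
```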
00:39:34 Szymon Pobiega
Because, well, the network is not reliable, and a scenario where we can see the network not being reliable again is this: imagine the whole process that we just described succeeds, but the last bit, the ultimate bit of delivering the response to the user's browser, fails, because reasons. And what does the user do? Well, some users will just click submit again, which is called the double-submit problem, one of the oldest problems of web development. I remember it from, I don't know, 18 years ago: the very first JavaScript code I implemented in my first job was to disable the submit button once it was clicked, so that people don't click it twice. Fortunately, there are better solutions to that problem now, and the best one I have found so far is state-based deduplication. The name might sound complex and confusing, but it's fairly simple, because state-based deduplication is usually based on a boolean state, for which the other name is a boolean flag.
00:40:35 Szymon Pobiega
So here's what happens if the user clicks submit again: we just check if the order is already submitted. We don't need a message ID here, or any kind of ID, because we are basing our decision on the state of the aggregate. If it's not submitted, we submit it and publish the OrderSubmitted event, and if it's already submitted, we just don't do anything. This pattern is called state-based deduplication because it's much more general than just using a boolean flag: it can work based on any state machine that doesn't have a cycle, an acyclic state machine. An example of a state machine that doesn't have a cycle is the algorithm for making french fries, or Belgian fries, depending on the country you're from: you need to peel the potatoes, cut the potatoes, and then fry them, right?
00:41:33 Szymon Pobiega
There are no cycles here. You cannot go back from fried potatoes to a peeled potato; there is obviously no cycle here, and a single message can be associated with each state transition. So there is a Peeled message associated with the transition from raw potato to peeled potato, and the Cut and Fry messages can likewise each be associated with a single state transition. You can see that if you have ready-made fries on your plate and you receive a Cut message, it is a duplicate, because the fact that you have fries on your plate means you already received and processed the Cut message; well, you cannot fry the fries without first cutting them. So it's a very simple pattern, but very, very powerful, and any state machine in business software, as long as it doesn't have cycles, can be used for state-based deduplication, where messages are associated with state transitions.
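A sketch of state-based deduplication in both forms: the boolean flag from the order example, and the generalized acyclic state machine with the fries. All names are illustrative.

```python
# Boolean-flag form: no message ids, the aggregate's own state decides.
def handle_submit(order, bus):
    if order["submitted"]:
        return                    # duplicate submit: the state already proves it
    order["submitted"] = True     # the state transition...
    bus.publish({"type": "OrderSubmitted", "order_id": order["id"]})  # ...and its message

# General form: each message maps to one transition of a cycle-free state
# machine, so the current state alone tells us what is a duplicate.
STATES = ["raw", "peeled", "cut", "fried"]        # no way back: acyclic
TRANSITIONS = {"Peel": ("raw", "peeled"),
               "Cut":  ("peeled", "cut"),
               "Fry":  ("cut", "fried")}

def apply(state, message_type):
    source, target = TRANSITIONS[message_type]
    if STATES.index(state) >= STATES.index(target):
        return state              # already past this transition: a duplicate
    if state != source:
        raise RuntimeError("out-of-order message")  # e.g. Fry before Cut
    return target
```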
00:42:39 Szymon Pobiega
Now let's look at the other aspect of this potato, namely our partner API, which allows B2B customers to submit bulk orders. What happens here? Well, the network is of course not reliable, and how do our partners deal with the network being unreliable? These partners that use our APIs really want their orders to get through, so they retry calling our APIs, sending their HTTP requests until they see confirmation that the requests were successfully submitted. So we need to deal with duplicates coming our way. What can we do? Can we do state-based deduplication? Not really. State-based deduplication requires some state that pre-exists the message being sent, and here, when the API request is to create an order, there is no pre-existing state. There is such state in our web interface scenario, because most web interfaces are limited to having a single outstanding order.
00:43:48 Szymon Pobiega
So that order can be associated one-to-one with the customer. Sometimes it's called a shopping cart: there is a single one, it's always there, it always has a state, and that state machine doesn't have cycles, so we can use it. Here we can't, because each API call creates a new thing, a new piece of state. So the only thing we can use is ID-based deduplication, which is fairly well known from the messaging patterns we discussed a bit earlier. We can use ID-based deduplication, but where can we actually get the ID? In a messaging scenario the ID is simple, because every message in every message queue has an ID associated with it. Here we need to find another way, and a good way, the best way, to get these IDs is for the client to generate them.
00:44:42 Szymon Pobiega
So the client that calls our web API has to provide us with an ID that we can use to deduplicate their requests. Now you can ask another question: fair enough, we need an ID, the client provides the ID, but where does the client get the ID? A hint can be found in our diagram, where we call the external partner APIs: we call these APIs while processing a message that comes our way, so we have the ID of the incoming message. We can use that ID, either directly or by hashing or transforming it, to provide the ID for the API request that we are sending. So this is a fairly simple pattern.
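A sketch of deriving a deterministic ID for the outgoing API request from the incoming message's ID, as suggested above. The partner endpoint and its Idempotency-Key header are hypothetical, though many real APIs use a similar header.

```python
import hashlib

def call_partner_api(incoming_message_id, order_payload, http):
    # Retrying the same incoming message yields the same derived ID, so the
    # partner can recognize and drop our duplicate requests.
    request_id = hashlib.sha256(
        f"create-shipment:{incoming_message_id}".encode()).hexdigest()
    http.post("https://partner.example/shipments",
              headers={"Idempotency-Key": request_id},
              json=order_payload)
```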
00:45:55 Szymon Pobiega
And now for our last potato, which introduces yet another level of complexity. Here we'll be dealing with documents that are stored in some external store, like S3 or blob storage, and that represent some information, for example PDF invoices generated in response to events. Why it is another level of complexity I'll explain in a minute. The situation looks more or less like this: a message comes in, that message modifies the data, and then we load this modified data and some other data, transform it, and generate a document. We store that document in some store so that it can later be viewed by the customer. Sounds very simple. And the simplest way to implement it would be to just use whatever information we have in the incoming message to generate the name for that document: if it's an order with ID XYZ, we'll just create a document called XYZ in some folder in our storage, right? A very, very simple approach. Unfortunately, that approach is not correct when we are dealing with concurrent messages, and we are dealing with concurrent messages all the time in our distributed systems.
00:46:55 Szymon Pobiega
It's not correct because there might be multiple messages racing to our processing component, and while the database state will be kept consistent, because we use the Outbox or a similar approach there, the document is created under the same name in the external store. So we don't really know which of the processing attempts will win and update the document last; the state of the document will reflect one of the states seen by the incoming messages, but we have no control over which one, and no way to tie it to the database state. A much better way of dealing with that problem is to generate a random ID, make that ID part of the name of the document, and include the path, or the ID, of the document in the follow-up message we send out, so that it can be associated.
00:48:09 Szymon Pobiega
Now what can happen is that we'll generate multiple documents, but each of these documents will correspond to a state of the database at some point in time. The problem with that approach is that if we take into account reprocessing, and reprocessing can happen because of failures and other hiccups in a distributed system, we may end up generating multiple pseudo-randomly named documents, and only the last successful attempt will actually result in sending out a message that contains a link to one. Documents 78 and 79 will just be garbage, unreachable by any means. So we can optimize that process: before creating these documents, we commit to the database those pseudo-randomly generated names, 78, 79, 80. Now, on the last successful attempt at processing the message, we are able to check the state of the database and see: oh wow,
00:49:11 Szymon Pobiega
we are attempt number 80, but there were two previous attempts at processing the same message, numbers 78 and 79; they failed but created documents. So let me just finish the transaction, and before consuming message XYZ, we delete the documents that are garbage and not reachable. So there is a way of ensuring that the garbage is collected here, and the whole process doesn't introduce any garbage at all, any additional data that is not needed or used by the system.
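A sketch of that consistent document creation scheme with garbage collection of failed attempts. The db, blobs, and bus objects and the render_invoice helper are hypothetical stand-ins for the database, S3-like store, and queue.

```python
import uuid

def handle(message, db, blobs, bus):
    doc_id = str(uuid.uuid4())
    # 1. Commit the attempt's name to the database BEFORE writing the blob,
    #    so documents created by crashed attempts remain discoverable.
    db.execute("INSERT INTO invoice_attempts (message_id, doc_id) VALUES (?, ?)",
               (message["id"], doc_id))
    db.commit()
    blobs.put(doc_id, render_invoice(message))   # 2. write under the random name
    bus.publish({"type": "InvoiceGenerated",     # 3. the link travels in the
                 "doc_id": doc_id})              #    follow-up message
    # 4. Before the incoming message is consumed, delete the documents created
    #    by earlier failed attempts at processing this same message.
    stale = db.execute("SELECT doc_id FROM invoice_attempts "
                       "WHERE message_id = ? AND doc_id <> ?",
                       (message["id"], doc_id)).fetchall()
    for (old_id,) in stale:
        blobs.delete(old_id)                     # unreachable garbage
    bus.ack(message)                             # 5. finally consume
```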
00:49:49 Szymon Pobiega
Let me briefly summarize, because we're approaching our 50-minute mark. What have we learned so far by looking at distributed Domain-Driven Design? The first thing that I would love you to remember is that communication is a separate domain. It deserves to be represented on Domain-Driven Design context diagrams, because it's important to make sure that it's taken care of and that the appropriate sub-domains within it are identified and associated with vendor products, for example.
00:50:23 Szymon Pobiega
The second thing, which I repeated multiple times: the network is not reliable, and we need to take that into account when designing our system. The way of taking it into account is usually the repository pattern; it's a very useful place for dealing with all the things related to the network not being reliable. And I'm not suggesting that you should, or must, have a repository class in your code base. Modern messaging systems can help you implement the repository pattern in such a way that you might not have a class representing the repository, but the functionality of a repository, which is to bind message processing to database access, is there and is correctly represented. The next thing I would like to emphasize is that duplication of messages is not optional.
00:51:24 Szymon Pobiega
In any message-driven system duplicates can happen, and if you think your system doesn't create duplicates, then you are probably dealing not with an at-least-once system but with an at-most-once system, and then you are necessarily losing data. So duplicates are good, because they are proof that you're not losing data; dealing with duplicates is much better than not having them in a system. And the network is not reliable, I'll repeat that again, which is why duplication is not optional. I told you to represent the communication as a domain, but what kind of domain is it really? Is it a generic domain that we can just move aside? Ideally it would be, but I would say that's not the case. It's really a supporting domain. It's likely that in your system the communication is going to be a supporting sub-domain, which means you need to write some code there, but chances are there are generic sub-domains within it that you can identify.
00:52:36 Szymon Pobiega
Another thing that needs to be addressed: might communication be a core domain? It's very tempting for an engineer to focus all the effort there, because, well, it's super important to get these messages across, so surely it's our core domain. It's really rarely a core domain, unless you are working for a messaging software vendor, or you are dealing with high-frequency trading or maybe some other such applications. And now, because we have some time left before we get to questions, I would like to share with you an attempt at predicting the future, looking into the crystal ball based on the patterns we discussed so far. The general problem with the patterns I've shown you is that deduplication always requires keeping track of the messages we processed, and keeping track of the messages we processed means we need space to store their IDs.
00:53:33 Szymon Pobiega
And because we don't have unlimited space, we need to remove old information to make room for new information, which is cumbersome from both the storage and the removal perspective. So it would be ideal if messages contained the processing information in themselves, so that when we consume a message, there is nothing left of it: it's just consumed and doesn't leave any traces or garbage. Can we do that? I would say we could, analyzing the patterns we've seen so far, based on two statements: messages are documents, and documents are state. Because messages are documents, we can use the consistent document creation pattern we've just seen in the last potato to create documents that represent messages. And because messages are state and pre-exist, we can use a state-based deduplication mechanism on these document-represented messages instead of ID-based deduplication.
00:54:36 Szymon Pobiega
Here's the diagram we've seen before for document creation, and if we just substitute a messages folder for some folder, this is a way of creating documents that represent messages. The actual message in this case, the DEF message, just contains a link, a reference to the message that is represented as a document. The other benefit here is that most modern message queues have severe limitations on message size; if we represent a message as a document, we don't have that limitation, because the document can be as big as we need. And then when we receive that message, the received message contains the link, so we look up the actual message content in the document store, we do the processing the regular way, and the result is more message documents that we send out.
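A sketch of the messages-as-documents idea just described; the shape resembles what messaging literature often calls the claim-check pattern. The blobs and bus interfaces are hypothetical.

```python
import json, uuid

def send_as_document(body, blobs, bus):
    doc_id = str(uuid.uuid4())
    blobs.put(doc_id, json.dumps(body))   # the body can be arbitrarily large
    bus.send({"body_ref": doc_id})        # queue carries only a tiny reference

def receive(message, blobs):
    # look up the actual content from the document store by the reference
    return json.loads(blobs.get(message["body_ref"]))
```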
00:55:30 Szymon Pobiega
And as a result, the whole process doesn't generate any garbage that we need to deal with, like the message processing information in the Outbox pattern. There is one last thing that can be done with that pattern: it can be scaled out, well, it can also be used to deduplicate web requests and create consistent web responses. So if I look into the crystal ball of message processing, this is what I see as the next thing coming to that space, something that is going to revolutionize it soon. One last thing: I would like to encourage you to look at the blog that I run with my colleague, Tomasz Masternak. It hasn't been updated very frequently, but there is some useful content about reliable messaging there. And with that, thank you for being here with us, and thank you for watching if you're watching the recorded version.
00:56:28 Tim
Thank you Szymon. I think we have a little bit of time remaining for one or two questions. If we don't answer your question today, we'll follow up with you offline and make sure to get your questions answered. So let's quickly start with the first question from Chris. Chris mentions the difficulties of dealing with scenarios where messages are received in a different order than they were sent. He's wondering whether this is a problem with how domain events are defined or whether this is just a matter of better defensive programming in the message receivers. Do you have any thoughts on this?
00:57:06 Szymon Pobiega
Yes, I think I do. There are two things that we need to distinguish here. Some of the ordering problems can and should be dealt with by defensive programming. For example, in the process manager that I showed in the potato in the top-right corner: when it deals with external parties, the responses come out of order, and that's an essential part of the complexity of that component. Process managers are a way of dealing with messages that come out of order. But then there is the second part, which I think Chris hinted at: messages that should come in order are coming out of order. And I think this is the great, the best usage of event streaming technologies such as EventStoreDB or Kafka: using these technologies to send events that are supposed to arrive in order.
00:58:18 Szymon Pobiega
I see these components as components within a service boundary, if we are doing SOA, or within that potato, the bounded context, if you are using Domain-Driven Design terminology. It's a great tool, but it's usually abused, in my opinion, in many places, by being used at the top level, at the architectural level. My opinion is that you shouldn't need to use it at the top level between those top-level components, like services and bounded contexts, because if there is an ordering requirement on events, that is a very strict coupling between the components. If one component imposes an order of processing on another, this creates very strong coupling between those components, and you should start asking why these two are so closely coupled. But it's a very common thing within one of the services, or one of the bounded contexts, where those components should be tightly coupled, are meant to be tightly coupled, and need to exchange information in that way.
00:59:31 Tim
Thank you for that answer. We'll do one more question. This one is from Reno. He is wondering how to handle integration with third-party providers that handle capabilities of two different domains, but where there's only one possible point of contact. He gives the example of a single data stream that contains data for two domains.
00:59:54 Szymon Pobiega
Yes. That's one of the patterns that Udi Dahan mentions in his Advanced Distributed Systems Design course. I don't remember the exact name of the pattern, but I think it's the provider pattern. The idea is that there is an infrastructure component, we distinguish some infrastructure, and that infrastructure component deals with that third party. It receives the message and defines an interface that is a contract for processing messages of that type. That contract can be implemented by multiple business components, multiple bounded contexts or SOA services, you name it, that provide implementations of that contract, and these implementations are loaded into that infrastructure integration component. So the infrastructure component receives the message from the third party's stream, and for each message, each event in that stream, it calls all the implementations it has, and these implementations run inside that infrastructure component.
01:01:10 Szymon Pobiega
They can either send a message to their home base, let's say to another component within their business service boundary, or they can access their own database right from that infrastructure component. So what needs to be done to handle that situation is to decouple the physical architecture from the logical architecture, meaning that components from different logical, business components run together inside one physical process and deal with the same message there. But the way they deal with it is up to you and how you design it. Some people think it's okay to access the database right from that place; some people think you shouldn't touch the database there, because of transactions and things like that, and that you should instead send messages to another component of your business domain from that infrastructure piece. I hope that answers the question.
01:02:21 Tim
All right, that's all we have time for today. Check out particular.net/events for more upcoming webinars and conference talks. And thanks again for joining us. On behalf of Szymon, this is Tim saying goodbye for now. See you on the next Particular live webinar.

About Szymon Pobiega

Szymon works as an engineer at Particular Software. His main areas of expertise are domain-driven design and asynchronous messaging. He is interested in the intersection of these two topics, i.e., in patterns and tools for ensuring exactly once and in-order message processing. Szymon is a co-author of the Exactly Once blog.
