Exactly-once processing is easy. But wait! My bill doesn't match my order.

00:00:01 Dennis van der Stelt

Hello everyone, and thanks for joining us for another Particular live webinar. I am Dennis van der Stelt, and today I'm joined by my colleague and solution architect, Szymon Pobiega. He's going to talk about processing messages only once, and how easy it is, or maybe it isn't. Szymon is an expert on this topic, and he presents about it, but also created a website called Exactly Once, of which you can see the URL at the bottom of this slide, exactly-once.github.io. It's got all kinds of interesting articles.

Just a quick note before we actually begin. Please use the Q&A feature to ask any questions you may have during today's live webinar, and we will be sure to address them at the end of the presentation. We already got a question before this webinar, so Szymon will also try to answer that one. We will follow up offline to answer all questions that we won't be able to answer during the live webinar. We're also recording the webinar, and everyone will receive a link to the recording via email afterwards. Okay then, let's start and talk about exactly-once processing. Szymon, welcome.

00:01:18 Szymon Pobiega

Thanks, Dennis, for the introduction. Yes, so let me basically switch to the second slide. Talking a bit about myself, I used to be active in the domain-driven design community some time ago. Then as everybody was excited about event sourcing, so was I. And then I joined Particular Software, where I'm specializing in what Dennis said, in processing messages and ensuring that the processing of messages happens as many times as you want, so usually once.

We will be talking today about pierogi, because I want to add something that is, well, dear to my heart. I'm Polish, and pierogi is a Polish, well, you can say national dish. We make all kinds of pierogi. And the other purpose of introducing pierogi is that I can't really talk about real customer stories, because their businesses are very unique and they would probably not be happy if I were talking in detail about their businesses. But I can talk in the abstract. I can gather all the stories that I heard, and I did so, and I can introduce a very abstract and artificial business problem that still shows the actual technical difficulties that people face, but in a slightly different domain.

So today we'll be looking at the story of a team that was supposed to disrupt, and become the leader in, the global market for pierogi. So their goal was to create a website where you can order pierogi wherever you live, no matter which country you live in, and you'd get a set of pierogi delivered to your door. When you start the story of a software project, first you need an architect, right, because you need an architect to create an architecture. And that's the kind of typical architect, taken from The Matrix. The architect knows that the network is actually not reliable; "the network is reliable" is one of the well-known fallacies of distributed software. So fortunately the architect was aware that it's not true, and he designed the architecture in such a way that it was supposed to be resilient to network failures. The idea behind this architecture was to introduce message queuing in that system.

So the client's application, which was the web application, would send a message, submit order, or add item to an order, into the queue. And the backend part of the system consists of multiple services. One of them is orders, another one could be billing, shipping, whatever. And those backend services would also communicate through message queues. A command is a type of message that intends to make a change to a database, so write requests are commands. So when the orders service receives the submit order command, it is supposed to emit item added event. An event is a type of message that denotes a change that has happened in the business state, and that change can be acted upon by another service such as shipping, so whenever orders decides that it's okay to add an item to an order, it publishes item added event, and then shipping can subscribe to that event, receive the item added event, and update the packaging information to include also that new item that has been added to an order.

So that looks like quite a simple architecture. Of course, it's based on services. We won't be going into details about what the size of the services is, if they're microservices or not microservices. Bottom line is that there are multiple of these backend components, and there are queues and message topics everywhere, so the communication is asynchronous. Message queues introduce this asynchrony into the system, and they prevent losing messages. So whenever a message is added to a message queue, it will stay there until it's processed.

And the processing might fail for whatever reason. The delivery of the message might fail, or the receiving process might be experiencing some temporary failures, the database might be down for maintenance. Failures happen, and when we're handling a failure in a queue-based system, we have this three-step basic process. First of all, we redeliver the message. In NServiceBus, it's called recoverability. Whenever a failure occurs, the message goes back to the queue and is redelivered. So whenever the message is delivered to a process that handles that message, we check if we already delivered that message. If not, we process the message, we handle the message, execute all the business logic, and persist the state change that happens. So, for example, add an item to the order, and then we mark the message as delivered so that we don't process it again, because, well, if you talk about adding items to an order, if somebody clicks the add item button once, we want a single item to be added to the order, not multiple items, right? So that's the basics of failure handling.

This is how it looks when you take into account the distributed nature of the system. So a queue, a message queue, or in general a messaging infrastructure, is a different process than our application process. Our messaging infrastructure might be an MSMQ process if you're using a built-in Windows queue, or it might be a cloud-based queue such as Azure Service Bus. It might be a RabbitMQ installation somewhere on a different server. But whatever the particular choice of technology is, it is a separate process from your application.

So, we have a basic distributed system here with two processes. The queue gives us a message to process, and then after we process the message, we acknowledge the message to the queue so that we don't process it again. And how do we ensure that we won't process that message multiple times? Well, if you read documentation on messaging infrastructure that attempts to deduplicate messages, such as AWS FIFO queues or Microsoft Azure Service Bus with deduplication enabled, this is more or less how it looks. The queue, or the messaging infrastructure, hands us a message, or delivers a message to our application, we process that message, we persist the state change in our database, and we somehow, using the API, acknowledge, yes, we processed that message, and then the queue can mark that message as delivered, and it will not attempt to deliver that message a second time.

So it looks quite good as a diagram, but meanwhile there's a problem in that approach. And the problem has roots in the early medieval history, and that's a depiction of the siege of Constantinople. And that city is famous because a number of times in history it was besieged from two sides, from the Asian side and European side. And the idea was that you can only take the city if you attack it from two sides at the same time. And that kind of historical event has been abstracted by mathematicians into the Two Generals' Problem. And the problem, as defined in computer science, is like this. There are two armies trying to attack a city. If any of those two armies attack alone, they would be defeated, but if they decide to attack at exactly the same time, they will win and take the city. The problem is those armies can only communicate via messengers that have to go around the city, and they can be shot down or otherwise incapacitated by the city defenders. So whenever you send a messenger to the general on the other side, you don't really know if the messenger actually gets through. And it's been proven that, given that situation, those two generals cannot establish, cannot agree on a date of attack. So in other words, in computer science words, the consensus in that kind of situation is not possible.

How does it relate to our problem of messages and message queues? Well, in our scenario it's similar, although not exactly the same, because we're not dealing with consensus problem here. But it's kind of similar, so we can use that analogy. If a queue sends us a message to process, our application does something with that message, and the acknowledgement is lost because of network failure, for example. Now what happens? The queue didn't receive the acknowledgement, so two situations might have happened. Either the message didn't get to our application, or the acknowledgement didn't make it back to the queue. And from the queue point of view, those two situations are not distinguishable. So if the queue decides to send a message again to our application, we risk that we'll process that message multiple times. If the queuing infrastructure doesn't send the message, but the actual situation was that the message was lost, not the acknowledgement, the result is we will not process the message, so the message will be lost. Our precious order for pierogi is lost. So that analogy proves that it's not possible to implement an exactly once delivery scenario. And, well, you can read about it on Twitter. Twitter is full of mentions like this.

But what we can actually do is we can have a system that works perfectly fine, and it's based on the assumption that we process the messages exactly once. We cannot control the delivery of messages, but we can control the processing. And we do it by moving our two generals to one side. Both are now sitting, or standing, happily in our application process. So the difference here is that the messaging infrastructure delivers the messages for processing multiple times. It redelivers the message until it actually gets the acknowledgement that the message has been processed. But each time our application receives the message, it itself has to check whether the message has been processed. If it's not yet processed, it will process it, and then mark it as processed. So the whole process of checking, processing, and marking has to happen in our application in a non-distributed way, because otherwise we'll be back in our Two Generals' Problem.
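
The idea of redelivery plus check, process, mark inside the application can be sketched as follows. This is a minimal in-memory illustration, not NServiceBus code: the names RedeliveringQueue, OrderService, and drop_first_ack are made up for the example, and a real system would have to commit the state change and the "processed" marker together in one local transaction.

```python
class RedeliveringQueue:
    """Toy queue: keeps a message until it sees an acknowledgement."""

    def __init__(self, messages):
        self.pending = list(messages)

    def deliver_all(self, handler, drop_first_ack=True):
        deliveries = 0
        drop_next_ack = drop_first_ack
        while self.pending:
            handler(self.pending[0])
            deliveries += 1
            if drop_next_ack:
                # The ack is "lost in the network": the queue cannot tell
                # this apart from a lost message, so it will redeliver.
                drop_next_ack = False
                continue
            self.pending.pop(0)  # ack received, message is done
        return deliveries


class OrderService:
    """Check, process, and mark all happen inside the application."""

    def __init__(self):
        self.processed_ids = set()
        self.items = []

    def handle(self, message):
        if message["id"] in self.processed_ids:  # check
            return
        self.items.append(message["item"])       # process
        self.processed_ids.add(message["id"])    # mark


queue = RedeliveringQueue([{"id": "m1", "item": "meat"}])
service = OrderService()
deliveries = queue.deliver_all(service.handle)
print(deliveries, service.items)
```

In this sketch the message is delivered twice because the first acknowledgement is dropped, yet the order ends up with a single item.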

So that's the theory part. I think we can skip past it and get back to our pierogi problem. So if you try to make a system like this, and you're dealing with message deduplication, you can go to Twitter for advice, and that's what people frequently do, I do it myself. And what you usually... Well, in that case, what you are likely to learn is, well, just make your logic idempotent. And, well, that's helpful, yes. You can google up idempotence. And that's a nice mathematical description of what idempotence is. If a function is idempotent, you can apply that function multiple times, or you can apply the function to the result of applying that function, and the result is still equal to what you get if you apply that function once. Is it helpful for computer science? Well, it's a nice field of study. Is it helpful if you're actually designing a system? Probably not that helpful. You're not writing a function application in such a way in C# when you're dealing with orders and messages.

So it's not... The translation between idempotence as a computer science term and idempotence as understood, or as needed by engineers, is not that easy. But in that case, the development team was actually really really lucky. The design team came up with a very minimal user experience, because that was a startup environment and they wanted to just validate the idea by delivering the MVP. And what's really noticeable in this really simple UI is that you can order pierogi of course, and you can have multiple items in a single order, like pierogi ruskie, or pierogi with meat, or pierogi with mushrooms. But the limitation for this first release was that within one order, you could only have one item of a given type. So if you already added pierogi with meat, like in this example here, you can't add another item of type meat because, well, that was some business decision that somebody up the management ladder made.

And software engineers that were introduced to this design of the user interface were really really happy, because actually this time, idempotence really helped them. So let's see how they managed to map this process of checking if message was processed, processing, and marking as processed in the application. Sorry, I'm switching slides too fast. So what do I mean by lucky? That user interface is actually a representation of a set, a data structure that has this nice property that operations on a set, such as adding an item to a set, or removal of an item from a set, are idempotent by nature. So this user interface is really designed, or is really helpful when designing a system that is based on natural idempotence of operations.
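
A small sketch of why a set-shaped order is "naturally idempotent" while a list-shaped one is not. The function names are made up for the illustration.

```python
def add_to_set(order_items, filling):
    """Adding to a set: applying it twice equals applying it once."""
    return order_items | {filling}


def add_to_list(order_items, filling):
    """Appending to a list: every application changes the result."""
    return order_items + [filling]


set_once = add_to_set(frozenset(), "meat")
set_twice = add_to_set(set_once, "meat")

list_once = add_to_list([], "meat")
list_twice = add_to_list(list_once, "meat")

print(set_once == set_twice)    # the duplicate had no effect
print(list_once == list_twice)  # the duplicate added a second item
```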

I want to introduce you to a convention that I'm using in this webinar, notation of the code snippets that I have. So for those of you who are familiar with NServiceBus, so I expect majority of you, that shouldn't be something new. We have a handler that handles some messages, executes some business logic on messages, so in this case it's confirm order handler. It handles messages of type confirm order, it has a method called handle that gets the message and the context, executes some business logic, and eventually it publishes an event, or it sends a command if the message that is going down is a command.

And here's how our debugging will look like today. I pre-recorded the debugging session using the application code that was used by the startup. And here is... Well, the screen is divided into four pieces. On the right-hand side it's actually Visual Studio debugging, and on the left-hand side we have three console windows.

The top window, the blue one, shows the test harness. The test harness has been created by the testing team in that company in order to be able to exercise the system without going through the user interface. So the testing team wanted to script their tests, and not have to click through them. So we can drive the system by typing in commands, such as submit order, and it will send a message. Then there is the red window, which is the actual system that is being tested, or service being tested. In that case, that's the ordering service. And the green one is some other service, such as billing or shipping.

And so one other thing worth mentioning here is that the testing team was quite experienced in testing message-driven systems, and they knew that duplication of messages is one of the most common failure scenarios in such systems, so what they did in the test harness is they built in a plugin that duplicates all the messages if that thing is running in a testing environment. So whenever a message is sent from the test harness, the test harness mimics the actual user interface, that message is copied, and two copies of the same message are sent. This actually allows us to see those message duplication failure scenarios in testing environment without going to production and seeing it fail in production.

So let's see how that test harness works. I can type in submit ABC, which means I want to submit an order with ID ABC. What happens is, as you can see here, NServiceBus says, "Immediate retry is going to retry message blah blah blah because primary key violation." That's the first hint of how you can implement message deduplication here. If the ID of a business entity, in here the ID of an order, ABC, is provided by the client of our API, we can use it for deduplication. So here, the order ID ABC was provided by the test harness, so two copies of the message that attempts to create an order ABC were sent. Because ABC was used as a primary key, the database did the work for us and detected that a duplicate message was received, and then when the attempt was retried, we detected that it was a duplicate.
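
The primary-key trick can be illustrated with Python's built-in sqlite3 module. This is only a sketch: the table layout and the submit_order helper are assumptions made for the example, and unlike the demo (where the violation triggers a retry that then detects the duplicate) the sketch treats the constraint violation itself as the duplicate signal.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")


def submit_order(order_id):
    """Return True if the order was created, False if it already existed."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO orders (id) VALUES (?)", (order_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # The client-supplied ID is the primary key, so the database
        # rejects the second copy of the same "submit order" message.
        return False


first = submit_order("ABC")
second = submit_order("ABC")  # duplicate copy of the same message
order_count = connection.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first, second, order_count)
```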

But that's not the interesting part. The interesting part is adding items to an existing order. So we tried to add, here I'll go with meat, to that order ABC. And what happens is we execute the message handler of course. That is divided into multiple parts. First, we create a data context using Entity Framework, then we load the order, and then we check if it's a duplicate. Also remember that order was a set. So it means that if I already have, in my collection of line items in that order, an item with the same filling as I'm actually adding right now, that means that, well, that item is already there, so our message is probably a duplicate, so I can ignore it. In that case, it's not a duplicate, so create a new line item, and I save changes to a database, and last but not least I publish an event, item added, so that other services can process it and act upon it.

In the console, we see that those two copies of the same message arrived. Two threads, because I'm running a multi-threaded client, two threads started processing those two copies of the same message. One of them succeeded. The other failed because of the primary key violation here, because I'm using the ID of the order and the type of the item as a key here in the database. And when the second attempt failed, the message went back to the queue, where it was picked up again. Here's what happens. We load the order, we check if we already have an item of this type, and this time we do have one because we processed one of those copies correctly, so we log info and return. So disaster averted, we processed the message correctly.

Because we have our correct answer, let's ship it, right? And the result is that customers are complaining because their orders of pierogi are not delivered. Hmm, what happened? Two things could have happened. And some of you might recognize those dreadful messages. Sometimes the message queue, the messaging infrastructure, fails. And when it fails, well, it can fail when it has run out of disk space, or because of some network problems, whatever; any kind of messaging infrastructure sometimes fails. And, well, we're used to dealing with those failures. We will just retry processing the message until it succeeds. Well, it didn't work like that. Customers were still complaining, and something was off. They were not getting their orders, and their complaints could be traced back to entries in our log system that showed some struggles in the messaging infrastructure.

So what the team decided to do is they decided to create a scenario in their lab environment to be able to reproduce the problem. So what they did is they changed their testing harness and the test infrastructure in such a way that whenever you try to add pierogi ruskie to an order, the first attempt to publish a message will fail. Let's see how it works. So we try to add pierogi, item type ruskie, and we check whether we already added this type of item. Nope, so we create a new item, we save the changes to the database, and everything is fine. We try to publish an event to the other services, and that fails. A broker error. That's a simulated error.

But it should be fine, right? If it's an error, the message will go back to the queue and we'll attempt to process it again. So that happens, we load the order from the database, we check the items, and guess what, we already have that item of type ruskie there in the collection of order lines, because we previously successfully saved the data to the database. So, well, we log that it's a duplicate, and we ignore that message. The result is, in our orders system, that state is correct, the item has been added, but all those services that were expecting an item added event to be published, well, they don't get this item added event, so they don't update their state, and the item is missing. The customer doesn't get the pierogi they expected.
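
The lost-event scenario can be reproduced in a few lines. This is a simplified simulation, not the team's code: FlakyBroker, BrokerError, and handle_add_item are hypothetical names, and a Python set stands in for the order stored in the database.

```python
class BrokerError(Exception):
    pass


class FlakyBroker:
    """Fails the first publish, succeeds afterwards (simulated failure)."""

    def __init__(self):
        self.published = []
        self.fail_next = True

    def publish(self, event):
        if self.fail_next:
            self.fail_next = False
            raise BrokerError("simulated broker failure")
        self.published.append(event)


def handle_add_item(order, filling, broker):
    if filling in order:                    # looks like a duplicate: ignore
        return
    order.add(filling)                      # persist first...
    broker.publish(("ItemAdded", filling))  # ...then publish


order = set()
broker = FlakyBroker()
for attempt in range(2):  # the queue redelivers after the failure
    try:
        handle_add_item(order, "ruskie", broker)
    except BrokerError:
        continue

print(order, broker.published)
```

After the retry the order contains the item, but the list of published events is empty: the second attempt mistook its own earlier, half-finished work for a duplicate.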

Well, so far we have this strategy to persist data, and then publish event, and that failed on us. We've seen how it caused problems. So, well, what the lazy developer does, they just switch it around. And well, if persist and then publish doesn't work, let's try publish and then persist the state, and that has to work, right? So let's try this one. We submit a new order, and add some meat to this order to validate our new approach. This part of the code is the same, here we load the order from a database, check if it's not a duplicate, not a duplicate, and we add an item. And then there's this change that we introduced, we first publish the event that item has been added, and then we save data to the database. So far so good, right? We detected a duplicate, and we can re-process the duplicate, we can catch the duplicate and return. So everything is fine, works great, ship it. And again, customers are complaining about missing items in their deliveries and paying too much for what they have been delivered.

So what's wrong here? Again, our great quality engineers try to reproduce that failure in a more controlled lab environment, by introducing some more changes to the chaos-generating infrastructure. So this time they have been able to trace the failures that the customers were reporting to a validation logic that has been added to that system in the meantime. So they want to have a controlled validation logic here, and they decided to add a new type of pierogi filling, Swiss cheese. Everybody knows that you cannot have a Polish pierogi with Swiss cheese inside because they don't match, so they created a validator, an Entity Framework validator, that refuses to store an order if an item of type Swiss cheese has been added.

So let's see how it works. We add Swiss cheese, we load the order, check if it's not a duplicate, no it's not a duplicate, so we publish an event item added here. We try to save changes to the database, and Entity Framework validation says, "No, you are not going to store that order because Swiss cheese is not allowed." And what's happening here is, oh, the downstream systems, the shipping system and the billing system, they were happy to receive the item added event. Those items were added. Well, the event of this type has been sent, and those two services, they think that everything is fine, and Swiss cheese is okay.

So what the team tried to achieve by using messaging is they wanted to have a system that follows the principle of eventual consistency. What they have introduced here is the system that follows the principle of immediate inconsistency, because here we have a situation where the services are immediately inconsistent. The data in the orders service is not consistent with data in shipping service or billing service, because we published what is known as a ghost message. A ghost message is a message that conveys a state that has not been persisted to a database, and that state might disappear while the message will still carry that state.
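
A ghost message is easy to reproduce in a sketch. The names below (ValidationError, save, handle_add_item) are made up for the example, and a plain list stands in for the broker.

```python
class ValidationError(Exception):
    pass


def save(order, filling):
    """Stand-in for saving to the database, including validation."""
    if filling == "swiss cheese":
        raise ValidationError("Swiss cheese is not allowed")
    order.add(filling)


def handle_add_item(order, filling, published):
    if filling in order:
        return
    published.append(("ItemAdded", filling))  # publish first...
    save(order, filling)                      # ...then persist (may fail)


order = set()
published = []
try:
    handle_add_item(order, "swiss cheese", published)
except ValidationError:
    pass  # the message would go back to the queue here

print(order, published)
```

The order stays empty, but an item added event for Swiss cheese has already left the service and cannot be taken back.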

So, to summarize this part, when we try to publish and then persist, we risk introduction of ghost messages into our system, and those ghost messages make that system immediately inconsistent. If we tried first persist and then publish, we risk that we might not publish some messages, so we will basically lose some orders. Both are not good enough, right?

But, well, they say, "Just make your logic idempotent." Let's go beyond that "just make your logic idempotent," and try to frame the actual problem that the team is struggling with. So first, the team, they didn't want to lose messages. If you want to take over the global delivery market for pierogi, you don't want to lose your precious pierogi orders. Second thing is, you want to apply the state change only once. You don't want to multiply the items in your pierogi order, because customers will complain about being delivered things that they didn't order. So we want to apply the state change once. And then, last but not least, we don't want to leak non-durable state in the form of ghost messages.

In order to create or design a solution that fulfills all those three goals, we need to switch our approach. Instead of checking if thing is not a duplicate, then persisting, then marking as a duplicate, what we need to do is we want to persist a change in our database if the message is not a duplicate, but then we want to publish messages regardless. This approach allows us to not lose those messages when we suspect a message might have been a duplicate, but that problem might have been also result of a broker failure.

So, to rephrase that, from the perspective of your application processing messages, there are two situations that are indistinguishable. One of the situations is you get duplicate message, and the other is we failed to process a message, and the failure happened when we were sending out messages to the downstream systems. Those two situations result in the same behavior, which is you get a message for the second time, the database state is already updated, and you don't know what to do with that message. You don't update the database again, because that would be applying the state change the second time, but because you have no idea if you published those downstream messages or not, you need to publish them anyway. What you risk is, well, you might publish them multiple times, but the assumption is that the downstream systems will be able to cope with the duplicates, so they should be fine.
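
The "persist only if not a duplicate, but publish regardless" rule can be sketched like this. Again the names are hypothetical, and the flaky broker simply fails on its first publish.

```python
class BrokerError(Exception):
    pass


class FlakyBroker:
    """Fails the first publish, succeeds afterwards (simulated failure)."""

    def __init__(self):
        self.published = []
        self.fail_next = True

    def publish(self, event):
        if self.fail_next:
            self.fail_next = False
            raise BrokerError("simulated broker failure")
        self.published.append(event)


def handle_add_item(order, filling, broker):
    if filling not in order:
        order.add(filling)  # apply the state change at most once
    # Publish regardless: a duplicate and a retry after a broker
    # failure look exactly the same from here.
    broker.publish(("ItemAdded", filling))


order = set()
broker = FlakyBroker()
for delivery in range(3):  # failed attempt, retry, then a true duplicate
    try:
        handle_add_item(order, "ruskie", broker)
    except BrokerError:
        continue

print(order, broker.published)
```

The event is no longer lost after the broker failure, at the price of publishing it twice once the true duplicate arrives; downstream services are assumed to cope with that.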

And while the team was debating those solutions and trying to design the code that runs them actually, the sales were growing like this. So the startup experienced exponential growth, and there was a huge backlog of new features to be added, so they really had not that much time for thinking, and they had to act. So the manager comes in and says, "Well, if you could add quantity to the order lines, that would be great." So the team stopped work, stopped refactoring, and went back to the drawing board to see how they can implement the new user interface design that added this ability to change the quantity. So no longer you have that limit of having one item of each type in your order, you can have three, four items, whatever you like.

The problem, the big problem that the engineering team had with this new UI design is that that's no longer a set. So, the previous approach to deduplication would not work here. They need to figure out something new. They eventually figure out a solution based on two key components. First is identity, which is represented by an ID card, the Polish one in this case. And the second one is symbolized with letter C, which is immutability. C is used to represent speed of light in a vacuum, which is immutable. So identity plus immutability allows them to design a new solution for message deduplication.

As it turns out, most messaging infrastructures, and that includes NServiceBus for sure, have the concept of message IDs. Each message is assigned an ID, and the assumption, the very strong assumption here, is that duplicate messages have the same ID. Because messages might get duplicated by the infrastructure along the way, when the messages are copied and resent and rerouted they will always have the same content and the same ID. And with this user interface you can only add items and remove items, you cannot edit them, so they are immutable. If an item is immutable, it can only be associated with a single message, the message that created that item. So if, when we add an item to the order, we can actually store the ID of the message that created that item in the order, we will be able to deduplicate incoming messages, because when we get a message, we just need to load the order and check if we already have an item that has been created by the message with a given ID.
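
Identity plus immutability can be sketched as order lines that remember which message created them. OrderLine and add_item are illustrative names, not taken from the demo code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a line cannot be edited after creation
class OrderLine:
    filling: str
    created_by_message_id: str


def add_item(order_lines, message_id, filling):
    """Add a line unless this exact message already created one."""
    if any(line.created_by_message_id == message_id for line in order_lines):
        return False  # duplicate copy of an already processed message
    order_lines.append(OrderLine(filling, message_id))
    return True


lines = []
add_item(lines, "msg-1", "meat")
add_item(lines, "msg-1", "meat")  # duplicate of msg-1: ignored
add_item(lines, "msg-2", "meat")  # a second, legitimate meat item

print(len(lines))
```

Unlike the set-based check, this lets the customer order the same filling more than once while still filtering out copies of the same message.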

So let's see how this code works. First, we want to try the Swiss cheese scenario, if the team delivered a solution that actually solves that problem. So we load the order, we check if it's a duplicate, this time it's not a duplicate so we create a new order line, we store that in the database, and that fails because of the validation. We haven't published any events so far, so it's good.

Now let's try add some pierogi with meat to the order. And meat is a regular filling type that doesn't have any associated specific behavior in the test harness. So we load the order, we'll check if it's a duplicate, no, it's the first copy of the message that we are processing, so we create an order line. We add it to the order, we save the changes to the database, and we publish item added event. So far so good. Of course, the test harness works in such a way that there was a duplicate message. There, the duplicate failed to be processed and is going back to the queue, and now to reprocess that duplicate message. So we load the order, we check the order lines, if we have any order lines that have an ID that is equal to the ID of the current message being processed. And yes, we have, because we already processed another copy of this message, so we log info. But we don't return from that handler. We need to make sure that we publish the item added event, because we can't distinguish this failure scenario from the one that would be caused by a broker failure, so we publish that one again.

And that would be no problem, because the receiver of the item added event will be able to deduplicate them, right? Well, not really. As you can see here, those are the message IDs, and they are different. So we broke a fundamental assumption that the two copies of the same message will have the same IDs. We broke that assumption because here it is not the infrastructure that introduces duplicates, it is us. By republishing events, we introduced those duplicates. So, well, although the team that created the ordering service and maintains the ordering service was happy, the other teams were not really, because they were not able to deduplicate their messages. Here's the situation on a diagram. The input of the order system is two copies of the add item command. Both have ID, let's say, 123. The output is two copies of the item added event, but in that case those have different IDs, which breaks the deduplication assumption of the downstream systems.

Let's get back to the drawing board and try to reframe the problem that we already tried to frame. Previously we said we don't want to lose messages, we want to apply the state change only once, and we don't want to leak non-durable state. Well, applying state change only once, it turns out that that's not enough. As my colleague once said in the blog that we run together, the Exactly Once blog, what we actually want is we want an endpoint to produce observable side effects equivalent to some execution in which each logical message gets processed exactly once.

Well, that might sound complex. Well, what it means is that we want to... Well, let's see what Twitter says about that. And the Twitter trolls would say, "Just make your logic deterministic." Don't listen to those guys. And in that case, well, they would be partially right. At least it might seem that they would be partially right, because the problem here is that we were assigning those message IDs in a non-deterministic fashion. If we just used some sort of deterministic way to assign message IDs when we are republishing events, the problem would be solved. So, well, what's the solution? You can't just use whatever you like. Some of the operations that we are used to, like DateTime.Now, Guid.NewGuid, or Random.Next, are not deterministic. So if you want to follow that piece of advice, you just throw them out the window.
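
One possible way to assign outgoing message IDs deterministically is to derive them from the incoming message ID, for example with a name-based UUID. This is only an illustration of the "make it deterministic" advice, not necessarily what the team or NServiceBus does; the function names and the choice of namespace are assumptions.

```python
import uuid


def random_event_id(incoming_message_id, event_type):
    """Non-deterministic: every republish gets a brand new ID."""
    return str(uuid.uuid4())


def deterministic_event_id(incoming_message_id, event_type):
    """Same incoming message and event type always give the same ID."""
    name = f"{incoming_message_id}/{event_type}"
    return str(uuid.uuid5(uuid.NAMESPACE_OID, name))


# Processing the same incoming message (ID 123) twice:
random_first = random_event_id("123", "ItemAdded")
random_second = random_event_id("123", "ItemAdded")

stable_first = deterministic_event_id("123", "ItemAdded")
stable_second = deterministic_event_id("123", "ItemAdded")

print(random_first == random_second)  # downstream cannot deduplicate
print(stable_first == stable_second)  # downstream sees the same ID twice
```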

And yes, it's possible, but... And that's something that the team wanted to do. They didn't have more time to actually look at the possible solutions, because the manager came in again and said, "Well, that's great. The sales are growing. The solution works good enough to be deployed to production, but if you could just add another event, the marketing team is interested in a first item added event." The idea is that, whenever someone adds a first item to their order, the marketing team wants to know, because some people are not adding any items to their order, just creating orders and then abandoning the site. And they just want to know and be able to figure out what happens. Nothing could be easier, right?

We just have our handler method, we leave the business logic intact, and at the end of the handler we have code like this. If the number of lines, order lines, order items, is equal to one, after adding that item of course, then we publish the first item added event. Sounds simple, right? There is no non-deterministic code here, only deterministic code. If the item count equals one, publish that event. Ship it. What happens is, well, the marketing team is complaining. They don't receive as many first item added events as they should be receiving, based on some other measurements, and they are quite angry.

So the engineering team is looking at the problem, and they are exercising the system, and they have the scenario where they have two add item commands sent to the same order, with two different IDs. So let's see how the system deals with those commands. So first, we receive add item with ID 123, then we check if it's not a duplicate, and it's not a duplicate, right? So we add order line to the order, and we persist that to the database. So far so good. We publish item added event, that goes out, and shipping and billing processes that event. And then we count order lines. According to our code, there is one order line, so what? We should publish first item added event. So we try to do it, but it fails. Broker fails again. Oh well, that's not a problem. We put the add item 123 back to the queue, and we will reprocess that in a second.

But, in the meantime, there comes a different command, add item 234. Somebody clicked on the UI really quickly, and before that first add item command was really processed, add item 234 was picked up. So we check if it's not a duplicate, no it's not a duplicate, it's a new command, we add an order line, we persist the data, we publish item added, that's fine, we count order lines. How many order lines do we have now? Well, we added one when we processed the add item 123, and then we added another one when we processed 234, so order lines are now two. Publish first item added? No, because we have two order items, so we don't publish that event. That's correct behavior so far.

But then 123 comes back in, the one that we failed to process in the first place. So check if it's a duplicate, yes it's a duplicate, so we don't persist, we don't modify state, that's safe. We are safe in that department. We publish item added, because we want to make sure that our downstream systems receive that item added event. That's correct, we don't lose that message, we count order lines, and guess what, it's two, because we already incremented that. We already added another item. So what happens is we don't publish first item added in that case. So we lost that event. It has not been published, but it should have been.
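
The scenario just described can be replayed in a few lines. This is a hedged Python sketch, not the code from the slides: the `Broker`, `Order`, and `handle_add_item` names are invented for illustration, and the broker is rigged to fail exactly once when publishing the first item added event for command 123.

```python
class Broker:
    """In-memory stand-in for a message broker that can fail once per message."""

    def __init__(self, fail_once_on):
        self.published = []
        self._fail_once_on = set(fail_once_on)

    def publish(self, event, command_id):
        key = (event, command_id)
        if key in self._fail_once_on:
            self._fail_once_on.discard(key)
            raise ConnectionError("broker unavailable")
        self.published.append(key)


class Order:
    def __init__(self):
        self.lines = []
        self.processed_ids = set()


def handle_add_item(order, broker, command_id):
    # The state change is protected by deduplication on the command ID.
    if command_id not in order.processed_ids:
        order.lines.append(command_id)
        order.processed_ids.add(command_id)
    # Events are regenerated on every attempt, from the *current* state.
    broker.publish("ItemAdded", command_id)
    if len(order.lines) == 1:
        broker.publish("FirstItemAdded", command_id)


order = Order()
broker = Broker(fail_once_on=[("FirstItemAdded", "123")])

try:
    handle_add_item(order, broker, "123")  # broker fails on FirstItemAdded
except ConnectionError:
    pass  # the command goes back to the queue

handle_add_item(order, broker, "234")  # processed in the meantime
handle_add_item(order, broker, "123")  # redelivery of the failed command

first_item_events = [e for e in broker.published if e[0] == "FirstItemAdded"]
print(first_item_events)  # prints []
```

The order ends up with the correct two lines, yet no first item added event is ever published, because the retry decides what to publish from a state that already includes the second command.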

What really happened, let me recap, is that the state change logic executed correctly. The state has been protected, not only in the ordering system; the shipping system and the billing system have the same state, so we didn't introduce any inconsistencies. What we didn't do correctly is message publishing. It was based on the incorrect state, because on that second attempt to process message 123 we based the publishing on the state that was the result of processing both messages.

So we need to go back to my colleague Tomasz and re-read that again. We want an endpoint to produce observable side effects equivalent to some execution in which each logical message gets processed exactly once. Obviously, that was not what happened, because we failed to process that message. And we call the fulfillment of that promise that Tomasz mentioned on the previous slide, in that quote, consistent messaging. Consistent messaging means we are processing messages as if we processed each message exactly once. And in order to do that consistent messaging, we have basically two options. One is, when we are republishing messages, like in this case, we need to do it based on the historic state. So the correct thing to do in that scenario was, when we were reprocessing message 123, to load the state as it was when we processed 123 for the first time. Which is hard to do if you are not storing historic state. And then the second alternative is to durably store outgoing messages.

And that's something that the team was not really thinking about, but the manager came in and said, well, a bit angry, "If you could finally get that code to work, that would be great." So, they were a bit desperate, and they used the live chat feature, and so they asked a random engineer at Particular, "Hi there. Are there any established patterns for implementing idempotent message handlers?" And the guy says, "Have you tried the outbox?" Well, they haven't. And that's the diagram of how outbox works. I'll skip that one and show you the example how the outbox code works in NServiceBus especially. It's a simplified version of the one that is there in the NServiceBus codebase.

So, when processing messages, first we load that outbox state by incoming message ID. If that outbox state is null, it means the message has not been processed yet. So we process that message, we execute the business logic, and the result is a state change that is going to be persisted in the database, and a collection of outgoing messages. And what we do is we store both. We store both the state change and the outgoing messages in the same storage, in the same transaction. And only afterwards we try to actually send those messages resulting from the processing to the other downstream services.

And, well, suppose we succeed. The last thing is to mark those outgoing messages as dispatched, so we don't need to come back to them again. But if we fail in the middle, let's say here, the message goes back to the queue, we load the outbox again, and we check. This time it's not null. We stored that outbox data. So we don't execute the processing again, it's safe. We check if the outbox has been dispatched, which would mean that we sent out the messages. No, it has not. Well, in fact, we sent those messages, but we didn't mark them as sent. So we'll just resend them.

The trick here is that we are resending those messages based on the state stored in the database. So that's 100% deterministic, and those messages were generated only once. We are not regenerating messages each time we attempt to process a message, and that solves the problem of not being able to use Random, GUIDs, DateTimes, and also it solves the problem of our first item added event being regenerated in the wrong way. This time we mark as dispatched successfully, so when there is a duplicate of that message, you can always have a duplicate, we load the outbox state. It's not null but it's marked as dispatched, so we just return and skip the processing altogether, which is really the correct way of handling duplicates.
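
The flow described above can be sketched as follows. This is a simplified, in-memory Python illustration and not the NServiceBus implementation: `Database`, `OutboxRecord`, `process`, and `flaky_send` are hypothetical names, and the `commit` method merely stands in for a real database transaction that stores the state change and the outgoing messages together.

```python
from dataclasses import dataclass, field


@dataclass
class OutboxRecord:
    outgoing: list
    dispatched: bool = False


@dataclass
class Database:
    order_lines: list = field(default_factory=list)
    outbox: dict = field(default_factory=dict)

    def commit(self, new_lines, message_id, outgoing):
        # Stand-in for ONE transaction: state change and outgoing
        # messages are stored together or not at all.
        self.order_lines = new_lines
        self.outbox[message_id] = OutboxRecord(outgoing)


def add_item_logic(order_lines, message_id):
    # Business logic runs once per message and returns the new state
    # plus the messages that state change produces.
    new_lines = order_lines + [message_id]
    outgoing = [("ItemAdded", message_id)]
    if len(new_lines) == 1:
        outgoing.append(("FirstItemAdded", message_id))
    return new_lines, outgoing


def process(db, send, message_id):
    record = db.outbox.get(message_id)
    if record is None:  # not processed yet
        new_lines, outgoing = add_item_logic(db.order_lines, message_id)
        db.commit(new_lines, message_id, outgoing)
        record = db.outbox[message_id]
    if record.dispatched:  # true duplicate: nothing left to do
        return
    for message in record.outgoing:  # (re)send the stored messages
        send(message)
    record.dispatched = True  # stand-in for a durable "mark as dispatched"


sent = []
failures = {("FirstItemAdded", "123")}


def flaky_send(message):
    if message in failures:
        failures.discard(message)
        raise ConnectionError("broker unavailable")
    sent.append(message)


db = Database()
try:
    process(db, flaky_send, "123")  # fails while dispatching
except ConnectionError:
    pass
process(db, flaky_send, "234")
process(db, flaky_send, "123")  # redelivery: resends the stored messages
process(db, flaky_send, "123")  # duplicate: skipped entirely
```

In this run the first item added event for 123 is sent exactly once, on the redelivery, because it was generated and stored when the order had one line. The item added event for 123 is sent twice, which is fine as long as the resent copy carries the same ID as the original.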

And then of course the manager is never happy. He comes in and says, "Hey. I heard this thing called eventual consistency. Can you change the code to take advantage of this new thing?" And the engineering team was devastated. But, fortunately, there is a webinar coming in almost a month, on October 20. Our lovely host Dennis will be talking about dealing with eventual consistency, so stay tuned and go to particular.net/events and register for that next webinar, it will be related to that one. Well, I mentioned eventual consistency in my webinar.

So is it the end? Well, not really. That's not the end of the story. You can follow up on our blog, and check out more stuff about processing messages exactly once, because there are things that I haven't covered today. For example, I have not talked about how to deal with deduplication data. If we are marking messages as processed, that mark as processed part needs to be stored somewhere so that we can check later if we processed that message. It's important, it's very important, how long you would store those deduplication data, and how do you deal with it. That's just one subject that we touched in the blog with Tomasz.

And another is, well, I was talking about relational databases. What about document databases, or global scale or universe scale databases? So, especially if you're interested in Cosmos DB and trying out Cosmos DB, and if message duplicates are something that bothers you when using a system based on Cosmos DB, you can contact us, because we are working on something in this area. So this area means Cosmos DB plus message deduplication. And with that, I'd like to thank you, and I want to address some questions that you guys had.

All right. We'll take questions for a few minutes. Again, we'll respond to questions via email or offline if we don't have time for it live. The first question is from Christoff. He says, "You mentioned the Twitter answer about making code idempotent. You present that it has its challenges, but is it correct to say that in the end it basically boils down to making sure that the code is idempotent? Because a broker or anything else can't really help."

Yes, that's correct. What is usually missed when people say "just make your code idempotent" is, what is the scope of that function F in the "just make it idempotent" advice? Because it's tempting and easy to just think about the business logic part as the thing you need to make idempotent. Meaning, how do I store the data in such a way that it's idempotent? But what you really need to make idempotent is the whole act of processing the message, and that includes sending outgoing messages. And what my example with first item added showed is that it's hard to make sending outgoing messages idempotent unless you actually store those messages in the first place, and then try to publish them. Yeah, I hope that answers his question.

Okay. The next question is from someone else. He says, "What about race conditions? You have two competing consumers try to add an item at the same time, and both publish first item added event, because they don't see each other's pending saves due to transaction isolation."

Okay. So what I didn't mention here is that I sort of made it implicit. I should have mentioned it explicitly. I assume that all of database operations are protected by optimistic concurrency controls. So if we try to save an order, it needs to be in the same version as it used to be when we loaded it, otherwise the operation will fail, or should fail, because... I think it's a really good question, because bottom line is concurrency control, or preventing concurrent modifications to data, is the first step you need to do before you even start thinking about deduplication. No deduplication mechanism will work if you don't have concurrency control in place first. So, for example in NServiceBus, we deal with it based on the outbox pattern. When you use the outbox, and you attach that outbox transaction, NServiceBus handles that part for you. But if you're implementing your own deduplication mechanism, then first thing is, yes, to control the concurrency properly, and then build on it proper deduplication.
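
A version check of the kind described in this answer might look like the following Python sketch. The `OrderStore` class and its in-memory version counter are assumptions for illustration; a real system would typically rely on a version or row-version column checked by the database on update.

```python
class ConcurrencyError(Exception):
    pass


class OrderStore:
    """In-memory stand-in for a store with optimistic concurrency control."""

    def __init__(self):
        self._lines = []
        self._version = 0

    def load(self):
        return list(self._lines), self._version

    def save(self, lines, expected_version):
        # The save succeeds only if nobody else saved since we loaded.
        if expected_version != self._version:
            raise ConcurrencyError("order was modified concurrently")
        self._lines = list(lines)
        self._version += 1


store = OrderStore()

# Two competing consumers load the same, still empty, order.
lines_a, version_a = store.load()
lines_b, version_b = store.load()

store.save(lines_a + ["item-1"], version_a)  # first consumer wins

try:
    store.save(lines_b + ["item-2"], version_b)  # stale version
    second_save_succeeded = True
except ConcurrencyError:
    second_save_succeeded = False  # message would be retried from fresh state
```

The second consumer never gets to act on its stale view of an empty order, so it cannot conclude that it also added the first item.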

Okay. Thanks, Szymon. Another question from Brandon. In a SQL Server environment, would you prefer an outbox pattern or MSDTC?

Well, MSDTC is not going to be supported for the cloud as far as I know. At least Microsoft doesn't mention it. The closest equivalent is the elastic transactions, but they are a bit different, so I... And also from the performance perspective, I would prefer the outbox pattern. Generally in most cases, it will result in better performance and is much more flexible in the sense that it's not limited to on-premise Windows SQL plus MSMQ. So, yeah, in any new system I would prefer the outbox to DTC.

Okay. Robert has a question, and he says, "Even though the outbox seems such a crucial pattern, I could hardly find references to it by the same name online. Does it have a widely accepted name in the industry, or other implementations in the wild?"

That's a really good question and remark. Yes, it's hard to find any references to it. I think there are implementations of the outbox pattern in open-source message bus solutions, such as MassTransit or Rebus. I'm not 100% sure about that, but... It's unfortunate that it doesn't have a widely accepted name. It is not listed in Enterprise Integration Patterns, the handbook of messaging, basically. But we hope to get there. I mean, the more popular the pattern becomes, maybe the name will be accepted. I hope it will be accepted. I think it's a very useful pattern.

I couldn't agree more. Maxime has a question, "It's clear the outbox pattern is a good option when there are only local database changes that are happening inside the handler. But what if there are other options, like API calls, that have to be done as part of the event handling?"

Okay. I'll try to answer it as best I can. So first of all, the basic rule of thumb is to not mix database access and API calls in the same message handling. So if, as part of handling an event, you will need to call some APIs and store some data, I would advise to split that by sending a message to yourself, or sending a message to a different endpoint. But basically, doing database modifications while handling one message, and then sending a message to call API, or the other way around, bottom line is you don't want to mix those two because they have different guarantees. And the database modifications can be protected by the outbox, the API calls can't.

As for the API calls, it all depends on the protocol that is used. So, ideally, the API allows you to pass some sort of a reference number, and if it allows it, it should also deduplicate the calls based on that reference number. So imagine there is an API that debits an account, or... Yeah, debiting an account is a good example. It should allow you to provide some sort of operation ID to that debit operation, and if you send two API calls with the same operation ID that you provide, it should ignore the second attempt. And you can easily derive that operation ID, for example, from the ID of the message. And again, assuming message IDs are the same for duplicate messages, and that's a good assumption to base on, you would be able to maintain correctness in your system, because the other side of the API call is going to deduplicate those calls. But without the cooperation of the actual API provider, you can't guarantee anything when it comes to API calls.
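
To make the operation ID idea concrete, here is a small Python sketch. Everything in it is hypothetical: `AccountApi` plays the role of a cooperating remote API that remembers operation IDs, and deriving the operation ID as a name-based UUID from the message ID is just one possible scheme.

```python
import uuid


class AccountApi:
    """Hypothetical remote API that deduplicates calls by operation ID."""

    def __init__(self, balance):
        self.balance = balance
        self._seen_operations = set()

    def debit(self, operation_id, amount):
        if operation_id in self._seen_operations:
            return  # second attempt with the same ID is ignored
        self._seen_operations.add(operation_id)
        self.balance -= amount


def handle_message(api, message_id, amount):
    # Derive the operation ID deterministically from the message ID, so a
    # duplicate of the same message produces the same operation ID.
    operation_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"debit/{message_id}"))
    api.debit(operation_id, amount)


api = AccountApi(balance=100)
handle_message(api, "123", 30)
handle_message(api, "123", 30)  # duplicate delivery of message 123
print(api.balance)  # prints 70
```

The account is debited once even though the message was handled twice, but only because the API side keeps track of the operation IDs it has already seen.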

All right. Thanks for answering the questions, Szymon. We are out of time. There are some more questions outstanding, but as I promised, we'll follow up with those offline. Before I wrap up, I'd like to mention that Adam Ralph will be speaking at online events TechConnect and IT Days. And, like Szymon already mentioned, in a month we'll have another live Particular webinar on dealing with eventual consistency on October 20. You can register for it via our website at particular.net. And that's all the time we have for today. On behalf of Szymon Pobiega, this is Dennis van der Stelt saying goodbye for now, and see you at the next Particular live webinar. Thanks.
