Reliable messaging without distributed transactions

00:00 Udi Dahan

Okay? All right. Now, let's do the DTC, or rather how to work without the DTC. How are we doing for time? If you need to build a system that does not rely on distributed transactions, please write this down: there are lots of moving parts, okay? I'm going to explain it bit by bit just so you understand the nature of the problem and why the traditional solutions don't work. So it starts out very simply. You have a client that is sending a message, drops it into your queue, and now you start your big old processing that you have over here, okay? Now, there is some database here that you're talking to, where we're saying that this database and this queue cannot be in the same transaction, all right? Now, you have a message that arrives here, saying, "Okay, I'm going to pull a message off of the queue and then invoke some business logic that is ultimately doing something against some master data that we have over here."

01:14 Udi Dahan

What we need to do is, we need to get in the middle of this, which is just fine; we can have a sequence of message handlers at play to deal with that. The first answer that people come up with to the question, "Well, when a message comes in, how do you know if you've seen it before?" is this: we will have some infrastructure handler over here that will check a table of data, one that is in this same database (that's important) as the rest of our master data, and that contains all of the message IDs. So when a message arrives, we will check in our message ID table: have we processed this message before? If so, we think, fantastic, we don't need to process it again, and we're done. That's the simplistic solution. In just a minute I'll explain why it's insufficient, all right?

02:09 Udi Dahan

So we have a message that arrives, we check the table, have we processed it before? No. This is done in a single big database transaction together with everything else that we have of our business logic. We say, "Okay, great," write this message ID in here, invoke the business logic, business logic does its work, commit the transaction to the database, everybody is happy. And even if we crash then we think, "No big deal, the transaction has been committed." When we wake up again and the message is still in the queue, we'll know not to process it again. Where's the problem? The problem is when the business logic code in here wants to send out some messages of its own, all right? So when the code in here wants to turn around and talk to that queuing system. When it does a, for example, Bus.Publish as a part of its business logic, what happens then?
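The simplistic handler described up to this point, before any outgoing messages are involved, might be sketched as follows. This is a minimal illustration and not NServiceBus code: SQLite stands in for the database, and the table names `processed_messages` and `orders` and the function `handle` are hypothetical names chosen for the example.

```python
import sqlite3

# Assumption: SQLite stands in for "the same database" that holds both the
# message ID table and the master data. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.commit()


def handle(conn, message_id, business_logic):
    """Return True if the business logic ran, False if the message was a duplicate."""
    with conn:  # one database transaction: commit on success, rollback on error
        seen = conn.execute(
            "SELECT 1 FROM processed_messages WHERE message_id = ?", (message_id,)
        ).fetchone()
        if seen:
            return False  # already processed, nothing to do
        conn.execute(
            "INSERT INTO processed_messages (message_id) VALUES (?)", (message_id,)
        )
        business_logic(conn)  # master data changes share the same transaction
    return True


def place_order(c):
    c.execute("INSERT INTO orders (order_id) VALUES ('order-1')")


first = handle(conn, "msg-1", place_order)
second = handle(conn, "msg-1", place_order)  # the same message delivered again
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the ID row and the master data commit together, a redelivered message is skipped and the order is written only once. That holds only as long as the business logic never talks to the queue.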

03:20 Udi Dahan

So we get a message off of a queue. First time, we check, have we seen it before? No, just fine. Write down that we've processed that message ID. Invoke the business logic, business logic starts doing some work over here, invokes Bus.Publish, Bus.Publish talks to the underlying queuing system, right? This queuing system by definition is not enlisted in the same transaction as the database. Those messages can go out. We crash before we commit the transaction to the database, but messages have escaped, containing potentially the wrong business IDs. We talked about that with the FedEx integration example, right? We're putting out the wrong order IDs. We can no longer trust that the messages that go out and the data that is in here are going to be consistent, because we don't have these distributed transactions. There's also another problem.
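A small simulation can make the escaped-message scenario concrete. It is only a sketch under stated assumptions: a plain Python list plays the queue that is not enlisted in the database transaction, an exception plays the server crash, and every name in it is hypothetical.

```python
import sqlite3


class CrashBeforeCommit(Exception):
    """Stands in for the server dying before the database commit."""


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.commit()

broker = []  # hypothetical queue, NOT part of the database transaction


def handle_with_direct_publish(conn, broker):
    with conn:  # database transaction
        conn.execute("INSERT INTO orders (order_id) VALUES ('order-1')")
        # The publish goes straight to the queue and leaves immediately.
        broker.append({"event": "OrderPlaced", "order_id": "order-1"})
        raise CrashBeforeCommit()  # crash: the database work is rolled back


try:
    handle_with_direct_publish(conn, broker)
except CrashBeforeCommit:
    pass

orders_in_db = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
escaped_messages = len(broker)
```

After the simulated crash the database holds no order, yet one event referring to `order-1` has already left the process.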

04:25 Udi Dahan

It also depends on the order in which you do things, or potentially if you did a Bus.Publish afterwards. Sometimes people say, "Well, that's easy to solve as long as developers remember to always finish all of their database processing work before they call any queue-level APIs." That helps, but it's not enough. If you were to do that, in other words, you say: message comes in, you write it in the message ID table, you invoke the business logic, business logic modifies the master data. You commit the transaction to the database, and now you are about to invoke Bus.Publish and your server crashes. Yes, this is a very pessimistic type of situation, but how do you design good, solid, reliable systems? You have to go scenario by scenario, line by line in your code: the server crashes on this line, now what? Solve that. Next line, server crashes here. That's the only way you can get to reliability. So what would happen if your server crashed after committing this transaction, but before doing a Bus.Publish?
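The scenario in that question can be reproduced in a few lines. As before, this is a sketch with hypothetical names: SQLite is the database, a list is the queue, and an exception raised after the commit plays the crash.

```python
import sqlite3


class Crash(Exception):
    """Stands in for the server dying after the commit."""


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.commit()

broker = []  # hypothetical queue, not enlisted in the database transaction


def handle(conn, broker, message_id, crash_after_commit=False):
    with conn:
        seen = conn.execute(
            "SELECT 1 FROM processed_messages WHERE message_id = ?", (message_id,)
        ).fetchone()
        if seen:
            return "skipped"
        conn.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
        conn.execute("INSERT INTO orders VALUES ('order-1')")
    # The transaction is committed here; the publish happens afterwards.
    if crash_after_commit:
        raise Crash()
    broker.append({"event": "OrderPlaced", "order_id": "order-1"})
    return "processed"


try:
    handle(conn, broker, "msg-1", crash_after_commit=True)
except Crash:
    pass

# The message is still in the queue, so it is delivered again after restart.
retry_result = handle(conn, broker, "msg-1")
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The retry is skipped because the message ID is already recorded, so the order exists in the database but the event is never published.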

05:37 Udi Dahan

Well, you didn't finish processing, so this message is still in the queue. You go to process it again, this handler takes a look at the ID table and says, "Oh, I've already processed this message, do nothing." And then the publish doesn't go out, in which case we should have published something but we didn't. So we have two problematic scenarios; it doesn't really matter where you put the Bus.Publish. If you put it as a part of the business logic before the transaction is complete, then you might end up publishing data with the wrong IDs. If you put it afterwards, you might end up not publishing. So it's a damned if you do, damned if you don't type of scenario. There's really only one way to solve this problem, which is that when somebody goes to invoke Bus.Publish, Bus.Publish should not actually talk directly to the queuing system, but rather it should be going to a table in the same database, so that when a message is being processed, and the business logic is being invoked, and the business logic says, Bus.Publish.

06:57 Udi Dahan

What is actually done is, a record is stored in the same database saying, "We have been asked to publish this event," and we record that in the same transaction as the rest of the work in the database, okay? That's the first step. So always record what was asked, Bus.Send, Bus.Reply, Bus.Publish, and what was the actual payload in here. And then even if you crash before you could commit, then you roll back and nothing would have escaped, which is good, or if you crash after completing this transaction, then you have a complete record of everything that needs to be sent out. So that's actually the difference in this handler over here. It doesn't just look at, when a message comes in, have I processed this message before? It also looks at, are there any activities that need to be done for this message that haven't been done? If so, it then plays them. So let me use the blue here.
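The first step, recording what was asked in the same transaction as the business data, might look like the sketch below. The `OutboxBus` class, the `outbox` table and its columns are hypothetical names for illustration; this is not how NServiceBus exposes the feature.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " incoming_message_id TEXT NOT NULL,"
    " operation TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)
conn.commit()


class OutboxBus:
    """Hypothetical bus facade: records requests instead of talking to the queue."""

    def __init__(self, conn, incoming_message_id):
        self._conn = conn
        self._incoming_message_id = incoming_message_id

    def send(self, payload):
        self._record("send", payload)

    def publish(self, payload):
        self._record("publish", payload)

    def _record(self, operation, payload):
        # Same connection, so the record joins the open business transaction.
        self._conn.execute(
            "INSERT INTO outbox (incoming_message_id, operation, payload)"
            " VALUES (?, ?, ?)",
            (self._incoming_message_id, operation, json.dumps(payload)),
        )


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def handle(message_id, crash_before_commit=False):
    with conn:  # business data and outbox records commit or roll back together
        bus = OutboxBus(conn, message_id)
        conn.execute("INSERT INTO orders (order_id) VALUES (?)", ("order-1",))
        bus.publish({"event": "OrderPlaced", "order_id": "order-1"})
        if crash_before_commit:
            raise RuntimeError("simulated crash before commit")


try:
    handle("msg-1", crash_before_commit=True)
except RuntimeError:
    pass
after_crash = (count("orders"), count("outbox"))

handle("msg-1")
after_commit = (count("orders"), count("outbox"))
```

A crash before the commit leaves neither the order nor the outbox record behind, so nothing has escaped. A successful commit leaves both, giving a complete record of what still has to be sent.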

08:15 Udi Dahan

It'll read this table and then replay that out against the queuing system to actually say, "Oh, okay, I have a message in the queue, I look up all the activities based on that message ID, and I redo them against the queuing infrastructure. So that way, I'll make sure that everything that needs to go out does go out." Make sense? You with me so far? We're not done yet, there's more to it. The problem that we have here is duplicates going out. So imagine a situation: you start doing this work, you go check, this is the first time you're processing the message, you invoke the business logic, the business logic does its master data work, does its Bus.Publish, which gets recorded over here, and then that transaction gets completed. This logic then says, "Okay, great. Now, we can actually start delivering the messages." It starts reading from this table of the things that have been asked to be delivered and it starts delivering them.

09:26 Udi Dahan

So it does a Bus.Send, it does one Bus.Publish, and then the server crashes. So we have already emitted some data, but not yet everything. When we wake up again, the message is still in the queue, we go to read it. We check, "Okay, I've done this, but I haven't completed the full set of stuff so I have to redo the full set," so I redo the Bus.Send, redo the Bus.Publish. Until I can do everything, I can't actually clean this table out. Meaning that I did the Bus.Publish or Bus.Send twice or more. Now the question is, when I Bus.Send or Bus.Publish the same thing twice, can the deduplication of my subscribers work? And the problem is that, by default, if you do a Bus.Publish twice, the event that goes out each time has a different message ID. In which case your subscribers will look at that and say, "Actually, this second event is not the same as the first event."
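The duplicate problem can be shown with an in-memory stand-in. In this sketch a list plays the committed table of pending operations, another list plays the queue, and a fresh `uuid4` per dispatch models the default of a new message ID on every call. All names are hypothetical.

```python
import uuid


class Crash(Exception):
    """Stands in for the server dying partway through delivery."""


# Operations already committed for one incoming message (hypothetical data).
pending = [
    ("send", {"command": "ShipOrder"}),
    ("publish", {"event": "OrderPlaced"}),
    ("publish", {"event": "InvoiceRequested"}),
]
broker = []  # hypothetical queue


def dispatch(pending, broker, crash_after=None):
    for index, (operation, payload) in enumerate(pending, start=1):
        broker.append(
            {
                "message_id": str(uuid.uuid4()),  # new ID on every attempt
                "operation": operation,
                "payload": payload,
            }
        )
        if crash_after == index:
            raise Crash()
    pending.clear()  # only cleaned out once everything has gone out


try:
    dispatch(pending, broker, crash_after=2)  # one send, one publish, then crash
except Crash:
    pass

dispatch(pending, broker)  # after restart the full set is redone

ship_copies = [m for m in broker if m["payload"] == {"command": "ShipOrder"}]
ship_ids = {m["message_id"] for m in ship_copies}
```

The queue ends up with five messages for three recorded operations, and the two copies of `ShipOrder` carry different IDs, so a subscriber that deduplicates on message ID treats them as unrelated.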

10:41 Udi Dahan

So actually, when you're doing a call against the queuing system to do a Bus.Publish, what you need to guarantee is that you have control over the message ID that goes out, because you need to guarantee that every time you read a record off of this, you will set this exact same message ID, so that your subscriber logic will be able to deduplicate on its end using the exact same behavior, right? So we need in here not just the regular redo of the Bus.Publish but, very importantly, to set the message ID. If you do all of this, then you will not need distributed transactions. The problem is actually testing that what you did works well. It's very easy to build something, but how do you know that it's bulletproof?
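One way to get a stable message ID, assumed here purely for illustration, is to generate the ID once when the operation is recorded and reuse it on every dispatch attempt. The lists below stand in for the committed table and the queue, and `Subscriber` is a hypothetical receiver that deduplicates on message ID.

```python
import uuid


class Crash(Exception):
    """Stands in for the server dying partway through delivery."""


pending = []  # stand-in for the committed table of operations
broker = []   # stand-in for the queue


def record(operation, payload):
    # The outgoing ID is chosen once, at record time, and stored with the row.
    pending.append(
        {"message_id": str(uuid.uuid4()), "operation": operation, "payload": payload}
    )


def dispatch(crash_after=None):
    for index, row in enumerate(pending, start=1):
        broker.append(dict(row))  # every attempt reuses the stored message ID
        if crash_after == index:
            raise Crash()
    pending.clear()


class Subscriber:
    """Hypothetical subscriber that skips message IDs it has already handled."""

    def __init__(self):
        self.seen = set()
        self.handled = []

    def receive(self, message):
        if message["message_id"] in self.seen:
            return
        self.seen.add(message["message_id"])
        self.handled.append(message["payload"])


record("send", {"command": "ShipOrder"})
record("publish", {"event": "OrderPlaced"})
record("publish", {"event": "InvoiceRequested"})

try:
    dispatch(crash_after=2)
except Crash:
    pass
dispatch()  # redo the full set after restart

subscriber = Subscriber()
for message in broker:
    subscriber.receive(message)
```

Five messages still reach the queue, but only three distinct IDs exist, so the subscriber handles each operation once.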

11:48 Udi Dahan

So the testing process around infrastructure like this is really quite difficult, and that's why we in NServiceBus have delayed its delivery until NServiceBus five rather than delivering it as a part of NServiceBus four. Why? Because ultimately we need a way to be able to run an automated test that is able to stop our own code deterministically at every single line, and then kill the process, start the process up again, and assert that the state is correct. So if you thought unit testing was difficult, this is kind of the next level of difficulty.
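The idea of crashing at every step and asserting the state after restart can be illustrated on a toy handler. This is only a sketch and nothing like the real NServiceBus test infrastructure: an exception at a named crash point replaces killing the process, calling the handler again on the same SQLite database replaces the restart, and all table names, crash point names and the `msg-1/0` style IDs are hypothetical.

```python
import sqlite3


class Crash(Exception):
    """Stands in for killing the process at a given point."""


CRASH_POINTS = ["before_commit", "after_commit", "mid_dispatch", "before_cleanup"]


def fresh_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (incoming_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE outbox (out_id TEXT PRIMARY KEY, incoming_id TEXT, payload TEXT)"
    )
    conn.commit()
    return conn


def process(conn, broker, incoming_id, crash_at=None):
    def crash_point(name):
        if crash_at == name:
            raise Crash(name)

    with conn:  # dedup row, business data and outbox rows in one transaction
        done = conn.execute(
            "SELECT 1 FROM processed WHERE incoming_id = ?", (incoming_id,)
        ).fetchone()
        if not done:
            conn.execute("INSERT INTO processed VALUES (?)", (incoming_id,))
            conn.execute("INSERT INTO orders VALUES ('order-1')")
            for n, payload in enumerate(["ShipOrder", "OrderPlaced"]):
                # Stable outgoing ID derived from the incoming message ID.
                conn.execute(
                    "INSERT INTO outbox VALUES (?, ?, ?)",
                    (f"{incoming_id}/{n}", incoming_id, payload),
                )
            crash_point("before_commit")
    crash_point("after_commit")
    rows = conn.execute(
        "SELECT out_id, payload FROM outbox WHERE incoming_id = ? ORDER BY out_id",
        (incoming_id,),
    ).fetchall()
    for out_id, payload in rows:
        broker.append((out_id, payload))
        crash_point("mid_dispatch")  # crashes right after the first delivery
    crash_point("before_cleanup")
    with conn:
        conn.execute("DELETE FROM outbox WHERE incoming_id = ?", (incoming_id,))


def run_scenario(crash_at):
    conn, broker = fresh_db(), []
    try:
        process(conn, broker, "msg-1", crash_at=crash_at)
    except Crash:
        pass
    process(conn, broker, "msg-1")  # "restart": the message is redelivered
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    leftover = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    delivered_ids = sorted({out_id for out_id, _ in broker})
    return orders, leftover, delivered_ids


results = {point: run_scenario(point) for point in CRASH_POINTS}
```

For every crash point the end state is the same: one order, an empty outbox, and both outgoing IDs delivered at least once. A real test would have to kill an actual process at every line rather than at four hand-picked points, which is where the difficulty described above comes from.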

12:25 Udi Dahan

So you want to make sure that you've really dealt with every scenario, that if you crash it at any line of code, the system will behave correctly. That's ultimately what makes the DTC powerful and useful, because it solves all those problems for you. It says, "Guys, don't worry about it, I got it." So nobody likes the DTC, because most people don't realize what the alternative is, of actually trying to build it themselves. The assumption is, "Oh, you can just add a message ID table and you're done." It's not. Any questions about this? No? Anybody willing to build this rather than waiting? Because this is going to be so much fun.

13:11 Speaker 3

When is version five coming?

13:13 Udi Dahan

When's version five coming? We assume somewhere towards the end of Q3 of this year. The design is done, a lot of the implementation is done; we've just been investing a lot in our own automated testing infrastructure, to get to the point where we can verify, and not only for ourselves, that it is actually tested at the level of every single line of code crashing and that it's going to work. If I'm going to be telling you, "Hey, look, you should be running your infrastructure on top of this," I need that before I can, hand on heart, tell you, "You don't need the DTC." But ultimately, this is something that is extremely necessary. Anybody running on top of RabbitMQ needs something like this; most people don't do it. Anybody running in the cloud, Azure Service Bus, Amazon SQS, none of them support distributed transactions. You need this, it's not there.

14:08 Udi Dahan

So most of these types of systems are not really reliable. As you can see, you might end up publishing the wrong data, or you might end up just forgetting to publish. And I've said a couple of times, it's hard enough to make these systems work when you have infrastructure that you can rely on, and ultimately all you're dealing with is your own business logic. The last thing you want is all of these types of non-deterministic situations. In messaging they actually have different names for them. There is once and only once message delivery, which is what the DTC gives you. Then there is at least once message delivery, which is ultimately what we're talking about: RabbitMQ and friends, which don't give you this sort of guarantee, and therefore you need to build this yourself. And then there is ZeroMQ, which supports, what do they call it? Best effort messaging. It's the, "We will deliver your messages very quickly or not at all. That's our guarantee to you. Okay, thanks."

15:15 Udi Dahan

Though for some cases it's very useful, be aware that if you require a high level of reliability, and there's almost always some data that requires that, ZeroMQ is probably not going to work for you. You can't compensate for the lack of durability without reimplementing that yourself, and that's a waste of time. Although Ayende Rahien did try: he implemented Rhino Queues, and then he abandoned it, as is his way. So please don't go write your own queuing system, please don't go write your own file system. I was hoping I wouldn't have to implement this myself, that eventually the industry would get around to doing it. It didn't. We're all kind of in the same boat together, so this can be one that... This is why NServiceBus is open source, so that we get more eyes on it, because this sort of thing really requires it to be quite bulletproof. So hopefully, I'd love to see some pull requests from you guys.
