Webinar recording
Implementing an Outbox — model-checking first
The NServiceBus Outbox gives you consistency between database and messaging operations, something that would be nontrivial to do on your own. So how does it work? And how can you prove that it works?
🔗Why attend?
It’s hard to build reliable messaging systems. It’s even harder if you can’t use distributed transactions between your message queues and data storage. Fortunately, NServiceBus fills this gap with the Outbox. The Outbox provides similar guarantees to distributed transactions but avoids the complex and cumbersome Two-Phase Commit protocol. Since NServiceBus introduced the Outbox almost 7 years ago, other frameworks followed and added their own implementations.
In this presentation, we’ll use TLA+ to specify and validate the Outbox. TLA+ is a tool for validating critical parts of distributed and concurrent systems, which makes it a perfect tool for the job. By distilling the Outbox to bare essentials we’ll learn how the algorithm works, which guarantees it gives, and the assumptions it makes about messaging and storage infrastructure.
🔗In this webinar you’ll learn about:
- The main challenges in building reliable systems using modern messaging infrastructure — both on-premises and in the Cloud.
- The Outbox pattern, how it works, and how it guarantees consistent message processing.
- Using TLA+ to model-check the Outbox algorithm. Warning: hard core computer science. Be prepared!
🔗Transcription
- 00:00:00 Szymon Pobiega
- And hello again, everyone. Thanks for joining us for another Particular Live Webinar. This is Szymon Pobiega. Today, I'm joined by my colleague and Solution Architect, Tomasz Masternak, who's going to talk about Outbox in a very specific and interesting way. Just a quick note, before we begin, please use the Q & A feature to ask any questions you may have during today's live webinar, and we'll be sure to address them at the end of the presentation. We'll follow up offline to answer all the questions we won't be able to answer during the live webinar, if there is more questions than we can handle. We're also recording the webinar and everyone will receive a link of the recording via email. Okay, let's talk about the Outbox model-checking first. Tomasz, welcome.
- 00:01:01 Tomasz Masternak
- Thank you, Simone, and hello everyone. As Simone mentioned, today, I will be talking about Outbox in a very specific way, and we will be looking together at implementing the Outbox and what does it take to make a robust implementation. So, let's get going.
- 00:01:23 Tomasz Masternak
- And let us start with a bit of history. Those of you who are in the messaging industry for quite some time, know that more or less 10 years ago, if you would ask any architect or technical person, how is it that they are dealing with making sure that their endpoints are processing messages in a consistent manner? They will very likely respond referring to two-phase commit. And more specifically, if they were working in Microsoft stack, they would mention Microsoft DTC. And that was indeed the case. Whenever anyone wanted to build a system in which there were elements or handlers that were picking up messages from the message queue and were changing database based on the content of those messages, people would make sure that those operations are consistent by using two-phase commit. And that was the factor standard in the industry.
- 00:02:37 Tomasz Masternak
- Now, one interesting thing that happened since then, or even started happening before then, was the fact that the cloud became more and more popular as a deployment environment and runtime environment. And one big shift that came with cloud, was that the messaging infrastructure, which now was available as a service in the cloud and that's the case until today, no longer supported two-phase commit. So basically what happened was that, that bid, that guarantee that previously was provided by the infrastructure is no longer there in the cloud and that was, as I said, a big shift.
- 00:03:20 Tomasz Masternak
- One of the reasons for which two-phase commit was not available in the cloud, was that two-phase commit treats all the elements which participate in the transaction as part of the same administrative and security domain. And cloud providers could not afford to depend their resources on the clients that could potentially connect to some other sources and hold logs and resources in the cloud for a long period of time, that's why, what happened was that, from the client perspective, from the code that is using services in the cloud, two-phase commit and distributed transactions between different types of resources are no longer available. In short, there is no longer a knight on the white horse that will slay the dragon of inconsistency for us. That's no longer the case. That's just history that happened.
- 00:04:17 Tomasz Masternak
- Looking into the future. The very likely thing to happen is that that situation will stay the same. There are islands in the cloud that actually depend on two-phase commit. For instance, Google Spanner is using two-phase commit internally, but this is never exposed to the clients. Also, you can configure SQL instances on Azure, so that they use something which is called elastic transactions, which actually use two-phase commit between those instances. But once again, this is only a single resource and you can outspan messaging infrastructure and database with those transactions.
- 00:05:01 Tomasz Masternak
- So what is the reality that we need to accept, is the fact that, it is the reality in which a lot of anomalies that were not possible because we could depend on two-phase commit, suddenly can happen, and those anomalies can surface to the business logic code that previously was shielded from those anomalies, because we had access two-phase commit. And previously, the guarantee that were given by the infrastructure shielded us from those anomalies, and now, it is us who are designing and implementing message-based distributed systems that need to cope with those anomalies and need to make sure that the code that we deploy to the production, is able to cope with those anomalies as they happen. So, this is a significant shift.
- 00:05:56 Tomasz Masternak
- Previously, this was something which was provided by the infrastructure. Now, those anomalies can bubble up to our code that we write and is part of our process. And if we want to build reliable systems on which our business can depend on, we need to make sure that our code and business logic can cope with those anomalies as they have.
- 00:06:25 Tomasz Masternak
- What are those anomalies and what is the main source of those anomalies? And the short answer is that, the main reason for the anomalies and potential problems in the production with our code, is caused by the fact that the current messaging infrastructure provides a guarantee, delivery guarantee, which is called at least once message delivery. So what does it mean? Well, it means that the guarantee that is given by the infrastructure is that, a single message will be delivered to the receiver at least once. What you might already expect is that at least once, means that it can be one, two, three, four, numerous amount of times. So, that can cause a situation in which a single message is being delivered to the recipient multiple times. And the reason why this is possible, is a direct consequence of the way in which the protocol that is implemented between the message infrastructure and the receiver, and also between the message infrastructure in the sender is designed.
- 00:07:42 Tomasz Masternak
- The design differs between various infrastructures, but the general principle is pretty much the same. And the way in which it works is that, the messaging infrastructure will send a message to the receiver, will make it available for the receiver, but it's not going to delete that message from its internal storage, until that message is being acknowledged by the receiver. Only when the successful acknowledgement message gets back to the infrastructure, it's only then when the message is removed from the persistent storage, that is used by the messaging infrastructure.
- 00:08:24 Tomasz Masternak
- Basically, what can happen is that, for instance, the receiver is operating as usual and everything is fine. It picks up a message from the queue, let's say message number one, it processes part of the business logic, so it executes the handler, but before it can, it is successfully acknowledging the message to the broker, the process fails. What it means is that, that acknowledgement will never get to the messaging infrastructure, and when the receiver gets back online, it will receive that message one more time, because from the perspective of the messaging infrastructure, that message was never acknowledged as it was never removed from the messaging infrastructure.
- 00:09:09 Tomasz Masternak
- Basically, what this can happen is duplication. From the receiver perspective, what can happen is that, a single message can be delivered by the infrastructure multiple times. And I mentioned one failure scenario, which is when the receiver is down and that message gets re-delivered, but there are also other scenarios that can cause that kind of anomaly. Another example is something which is connected with visibility timeout. If you look at SQS, ASQ or Azure Service Bus, the way in which the messages are handed off to the receiver, is that those messages are handed off for a given period of time, which is called least time. And usually it's more or less 30 seconds. And what it means is that, the broker or messaging infrastructure will keep the message, but it won't be visible to any other receiver, except the one that picked up the message. But that is going to be the case only for 30 seconds.
- 00:10:14 Tomasz Masternak
- If, by the time this 30 seconds elapses, the infrastructure doesn't get any acknowledgement, that message will reappear in the input queue. So if for whatever reason the receiver, takes longer than expected and goes past those 30 seconds, it will fail to acknowledge that message, and it will get that message from the messaging infrastructure one more time. So this is the second scenario in which this can happen.
- 00:10:44 Tomasz Masternak
- There are also situations that can happen on the sending guide end, which is for instance, sender failing before it gets acknowledgement from the messaging infrastructure, and generating physical duplicate that is actually a duplicated message in the input queue. So basically, what you see is that, there are a number of scenarios in which a single message can be duplicated for a number of reasons. But all in all, all of those failure situations result in a state that from the processing endpoint perspective or the message receiver, a single message can be seen multiple times. So the way in which it is shown on this diagram is that, even though in the queue, we had message one, two, three, when those messages were processed, but the receiver actually saw was message one, message two, for whatever reason a duplicate of message two and then message number three.
- 00:11:40 Tomasz Masternak
- Okay, great. So, we know that there is this duplication that we need to deal with, which previously, when we were in the two-phase commit was not happening in a lot of cases. Now, another interesting thing, or I don't know, maybe interesting is a good word to describe that situation, but another anomaly that can happen is, reordering of messages.
- 00:12:06 Tomasz Masternak
- One thing that you can sometimes hear when reading or listening about the guarantees that given infrastructure is giving is that it is providing an ordered message delivery. And if you scratch the surface, it turns out that actually there are a lot of asterisks to this statement, and it is possible in a very constrained environment, in a very constrained configuration. And let me tell you when that is the case. We already touched on one of those scenarios. What we were saying was that, there is something which is called visibility timeout, and if a message is not acknowledged before visibility time out elapses, it will reappear in the input queue.
- 00:12:53 Tomasz Masternak
- What can happen is that, a message can be picked up by the receiver, and that receiver can cause an exception. And that exception can be handled inside the process, but the message may never be acknowledged. If that is the case, and if the receiver continues to receive the messages, it will not see the message that it failed processing immediately, but it will start picking up messages that were logically next in the queue. So for instance, if the receiver failed processing message number one, it won't be visible in the queue until the full 30 seconds elapses, but it will be able to process message number two. What we can see here on the diagram is that because that was the case, the logical order of processing was actually message two being processed before message one, and only after that, message number three.
- 00:13:53 Tomasz Masternak
- The other reason for this reordering is that usually the receiver is a multi-threaded or concurrent process. The reason for that is that, we want to increase the throughput whenever we can. And usually the resources that we are using or changing when we are processing the messages like database, allow us for certain amount of concurrency to happen. What we do is that, we might have two or three, 10, maybe 30 concurrent tasks or threats that are picking up messages concurrently. And if that is the case, then at the time when the messages are being picked up, we are effectively having a race condition, because it's no longer up to us to control what is the order of execution of those tasks that is going to actually happen. So we have that race condition on the receiver.
- 00:14:48 Tomasz Masternak
- Reordering can be caused also by the infrastructure. For instance, Azure Service Bus provides ordering guarantees, but only if it's happening in a single partition, if we have a partitioned queue that no longer applies. Also, in SQS, there are partitions underlying the queue, and there are possible failure situations in which some of the partitions are offline, but messages are being served from other partitions. What can happen is that, message one is in partition A, and message two is in partition B, and when the receiver tries to receive a message, the first partition is not available, so it sees the message number two first.
- 00:15:30 Tomasz Masternak
- Once again, there are a number of situations in which reordering can happen, and we just have to acknowledge that's a possibility. We just need to acknowledge that this is the reality in which we are operating. Finally, what we need to realize is that duplication and reordering are happening, sometimes for the same reasons, sometimes for different reasons, but they are happening independently. So, those anomalies, this duplication and reordering can happen at the same time. So, what we need to realize is that even though we might have some ordering at the center end, or some ordering in the queue, potentially, if the messaging infrastructure even defines the order in the queue, the logical order of message processing has to assume that the messages can come in out of order, and that messages can be duplicated. And this is the root cause of all various anomalies that can happen while we are processing messages in the modern execution environments.
- 00:16:37 Tomasz Masternak
- Now, one thing that I want to note here before leaving this slide is the following. I am not making a claim that those situations are happening all the time, or that even they are happening moderately often. In fact, those scenarios are probably a border line of what is happening in any production system. Now, that being said, if we are dealing with business transactional systems, we cannot afford to lose messages to produce inconsistent output, or to have some inconsistency in our system, even in that slim margin of the messages that we are processing, because usually those things refer or represent an important concepts and actions that are happening in our business, and they are very often directly connected to the monetization of the business. That's why, if we want to build reliable systems that our business can depend upon, we need to make sure that, even though those situations are not very often, if they happen, we can still successfully handle those situations in a consistent way.
- 00:17:58 Tomasz Masternak
- Okay. So what is the solution to that problem? You probably already know that Outbox is one of those solutions, which is obvious from the presentation title. But in fact, there are a number of other solutions that are available there, and that you can come across when you're doing research in that area. So, first of all, one of the very often recommended approaches, something which is called item potency. If you never heard about item potency, I recommend Pat Hellens article, which is called "Item potency is not a medical condition". And if you want to dig deeper into that, this is a good starting point. But long story short, what item potency is, is it's basically a property of our code or our business logic, which guarantees that, if a single message is being processed multiple times, the end result of this processing is the same as if the message was processed only once.
- 00:19:05 Tomasz Masternak
- And this actually enables us to cope with the duplicates in a pretty reasonable way, because if duplicates happen, if the duplicate comes one after the other, we can actually produce an output, which is the same for all the duplicates. Now, one thing that people often miss is that, the context for item potency is a context in which, there is one to one communication with some session abstraction being defined between the two. And this is also the context in which the accounts paper was written. If you are doing one-to-one communication with some session concept, for instance, if you're doing HTTP calls between two endpoints, you're good with item potency, you can use it. But if you go into the area of publish subscribe, when there are multiple receivers, multiple senders, then item potency is actually pretty tricky to get right. And from my experience, I can say that, if the system is becoming that complex, it's actually hard to depend on item potency to provide consistent message processing.
- 00:20:21 Tomasz Masternak
- The second approach, which is interesting, is something which is called progress only state machines. And basically what this approach is based on is on the fact that we can treat our logic, business logic as a state machine. And that state machine is transitioning from one state to the other when it processes a message. If we can make sure that that logic is written in such a way, that there are no cycles in the graph that represents the state machine, we can actually pretty easily figure out whether a message that we are receiving has been already successfully processed, and we can discard it and potentially produce a consistent output. Now, once again, it has its context in which that approach works very well, but both the progress on these state machines and the item potency have one characteristic, which is different than the Outbox.
- 00:21:27 Tomasz Masternak
- That characteristic is that, you need to write your business logic in such a way that the property that I mentioned for one and the other holds. So what it actually means is that, in addition to modeling the business domain in your code, you need to model it in such a way that those two properties are properties of the code that you're writing. So it's basically putting even a higher bar on the code that you are writing, and that work has to be done over and over again for each and every handler or endpoint here in your system. What is different about the Outbox, is that Outbox approach can be implemented in a way in which it is independent from the business logic.
- 00:22:18 Tomasz Masternak
- So, this is going to be the hero of our presentation today, which is basically the Outbox. So what I already said is that, the Outbox is meant to solve the problem of, how do we deal with duplicates and out of order message delivery? Which is one. And in addition, what is interesting about the Outbox is that, it gives us an opportunity to write business logic without actually caring much about how do we deal with duplicates and reordering delivery, because we can actually encapsulate that as a part of the infrastructure, or infrastructural code that we are using in our system. What is also interesting about the Outbox is that, Outbox is actually very easy to use once it's available on the platform, or in the library that we are using.
- 00:23:18 Tomasz Masternak
- For instance, if you're using NServiceBus, enabling Outbox is a one liner, that's it. After you enable an Outbox, you are good to go and you can write your handler logic as if two-phase commit was available, as if you operated in a distributed transactions environment. Now, what is not very obvious is that, what is happening underneath is actually far from trivial. And what we will be looking in the next part of our presentation is, what is it actually happening underneath, and how can we make sure that the implementation that we came up with is actually robust enough and can cope with various failure scenarios that can happen in our system?
- 00:24:12 Tomasz Masternak
- Okay, so let's have a look at a logical diagram of how Outbox is actually operating, and why is it that it actually runs. This is the block diagram of what is happening when using Outbox. And in my opinion, a key takeaway from this diagram, for anyone who wants to understand what is the key thing that the Outbox is doing, is the part which is indicated here as, Data store transaction. So what is basically happening is that in step one, we receive the message from the messaging infrastructure. And after that in step two, we open up a data store transaction. In that transaction, we first try to figure out whether we've already seen that message, whether it's a duplicate or not. If it's not a duplicate, we execute the handler, which is step three on this diagram. And what I mean by executing the handler is, executing whatever business logic. And what that business logic does is that, it takes that the data store transaction, and makes whatever changes to the business data it wills.
- 00:25:30 Tomasz Masternak
- After that, what is very important is that, if that business logic produced any outgoing messages that we are supposed to send out, we are not sending them right away. We are actually capturing them, and we are piggybacking those messages as part of the same transaction. And only after that, we commit that transaction to the store. So once again, what is happening is that, we pick up a message, we see whether it's duplicate or not. We run the business logic. We capture the outgoing messages, put them together with the business data changes, and only after that, we commit the data store transaction. And basically what it does for us, is that it makes sure that the changes to the business data and the outgoing messages are committed as a one atomic operation. So that makes sure that our outgoing messages and business data changes are atomic.
- 00:26:30 Tomasz Masternak
- Once we are good with that, we go to step seven, and we take those messages that we persisted in the data store, and we send them out. And after that, if we are successful, we clear out any state that we have in our database. So once again, this is the gist of what the Outbox is doing. It's basically making sure that we automatically commit business data changes and outgoing messages. Then it publishes the messages out. And the last step is, it acknowledges the message that is in the infrastructure. Now, you can think about, what are the failure situations here. So for instance, if we see a duplicate in step two, we actually skip the execution and storing the messages. We go straight away to publishing the messages. And if that's the case, we will publish the same messages potentially multiple times, because if we previously were successful in step seven, but we failed in step eight, we will not know that the messages already went out, but that's actually fine, because we are producing duplicates and whoever is consuming those messages needs to be able to cope with the duplicates. So that's just fine.
- 00:27:46 Tomasz Masternak
- So it seems that the approach makes sense. Now, one thing which is very interesting in writing distributed systems and distributed algorithms, is that, it's very easy, or it's relatively easy to come up with some approach, and then convince yourself, or maybe other people, that there is a happy path in which our approach is going to do whatever we wanted to do. So, as we did with the Outbox on our previous slide, we could show the happy path and say, okay, this is why the Outbox works, because if we go through the step one to nine, we will end up in a consistent output. However, one thing, or actually two things which are characteristic for the distributed systems is that, a, those are inherently concurrent systems. And secondly, a good definition of a distributed system is that, it can partially fail.
- 00:28:55 Tomasz Masternak
- So basically, what it means from our perspective of the Outbox is that, the diagram that I showed you can actually fail in numerous places. And this is actually what makes the whole situation pretty tricky, because happy path is just fine. But though it's the edge cases, that are going to determine whether our system is going to cope well under load and under partial failures, or whether it's going to breakdown, and it's going to cause inconsistencies potentially, lost messages or ghost messages or duplicated processings or whatnot. And the key to making sure that our out Outbox implementation is solid, is somehow being able to figure out, what is the behavior exactly in those edge cases.
- 00:29:57 Tomasz Masternak
- And this is where we come to the concept of model-checking. And to understand what model-checking is, and why it is, or it might be useful in our situation, is to compare it with the usual approach to how we test our distributing systems. And one way in which I can describe numerous approaches to testing distributed systems, which are, let's write unit tests, compliment tests, integration tests, user acceptance tests, let's do small tests, small tests, or UI tests. We could call those tests as auto manual tests. And what does it mean? Well, it means that, it's actually a human being that comes up with the scenario that we want to test. That's a manual piece of work, because when you think about it, if we wanted to write some tests for the Outbox, what would happen is that, we could sit down and we could brainstorm some of the situations in which failures can happen.
- 00:31:06 Tomasz Masternak
- And we could pick up those scenarios that we think are interesting, then we would implement those as tests. And, from that point on, whatever infrastructure and libraries that we are using for testing, will enable us to automatically run those tests. So now you should understand what I mean by manual tests, they are automatically executed, but they are manually created, hence auto manual tests. What is bad about auto manual tests is something that you probably already expect, which is we, as human beings are not able to figure out all the edge cases. It's very hard to figure out what are the various situations in which things can happen and how those failures can compose with each other at the various points in time. What is also tricky about the auto manual tests, is that sometimes, failures are very hard to reproduce in our testing environment.
- 00:32:13 Tomasz Masternak
- So for instance, we might know that the database might be not available, or that the transaction can fail because there is a deadlock and it's going to be evicted by the deadlock resolution module in SQL Server. And it's actually very hard to write a deterministic test that is going to reproduce that situation. So as you can see, that approach to testing is not going to help us much with those edge cases that we discussed when talking about the Outbox. And here comes the model-checking. So what is model-checking? Well, model-checking takes a bit different approach. And basically what it says is that, give me a model of your implementation specified in whatever way. And based on that model, if the model represents the concurrency and failures that can happen in the system, you can give me some properties that you want me to check, and I will generate all the possible scenarios, all the possible executions in your system.
- 00:33:23 Tomasz Masternak
- And I will make sure that I am going to cover all the edge cases, all the situations, no matter how probable or improbable they are, and make sure that in each and every path of execution, in each and every history of your system, those properties will hold. And basically, if we have such a tool at hand, if we are using it, we can actually cope with the situation in which we need to cover the edge case system, to be sure that our implementation is robust enough. And one of the tools that operate in the model-checking space, is something which is called TLA+, which stands for Temporal Logic of Actions. And actually the model in which we express our system using TLA+ is very straightforward. So basically what it says is that, you need to specify the system that you want me to model-check by providing following things.
- 00:34:27 Tomasz Masternak
- First of all, you need to give me what is the initial state of your system, on our diagram here, it's the noted S0. And then for every state in the system, you need to provide me, what are the possible actions that can happen in that state? In our case S0, enables action, A1, A2, and A3, all of those leading to states, S1, S2 and S3. So basically, what is happening is that we are describing our system as something which is in a given state, and then how can it transition between various states? So, what that state looks like for our system, we are going to look at the TLA+ specification in a second, but just so that you have some intuition about that, the state is basically a set of variables that describe the most important bits of our system.
- 00:35:25 Tomasz Masternak
- So in our case, it will be, what is the content of the input queue? Where in the code, or which part of the code is the handler executing currently? How many failures happen for a given message? What is the content of the database? What is the content of the output queue? All of those variables with concrete assignment of values to those variables, is basically a state. And what we can say is that, if the code is in the receive part of the code, an available transition or action is, it can either successfully pick up a message, or it can fail picking up message because of connectivity reasons. So this is one part of the model that we are providing, we are providing, what are the possible states in our system and how those states are connected with the actions which are possible. The other bit that we are providing is a set of invariants.
- 00:36:25 Tomasz Masternak
- And what are those invariants? The invariants are statements about the state of our system, which are to be checked by the model checker. So an example of the invariant could be, a single message, cannot result in two different records being written into the database. So what we can say is, no matter which path in our execution history we choose, we cannot come to the state, such that in the database variable, there will be two records corresponding to the same message in the input queue. That is an example of an invariant. Once again, we will see invariants in a second. And basically, what the model checker is doing is that, it takes that description of our system, of our algorithm, it generates the whole graph, and by the whole graph, I mean, it generates all the possible states, all the transitions between those states.
- 00:37:30 Tomasz Masternak
- You can see here that this a sample diagram for a Canvas algorithm. And basically, you can see that it's pretty big. And then what it does is that, it starts in the initial state and it goes through all possible paths in this graph. And whenever it goes through the state, any state, it takes all the invariants that we gave it to, and checks whether those variants hold in that state. So we can see, it covers all the edge cases, we could never do that. Now, one interesting thing that you might ask is, hold on. But the model probably is going to grow very fast. There is obviously a phenomenon, which is called combinatorial extortion. Once we get to a certain number, in terms of number of messages that we have in our model or a number of concurrent handlers, the possible histories of our system are going to be such a big number, that we won't be able to generate all of them and check all of them.
- 00:38:41 Tomasz Masternak
- And that's true. However, there is something which is called small model hypothesis, which is well known in the model checking side of things. And basically what small model hypothesis tells is that, if your algorithm has a bug, you don't need a big model to find that bug, actually models of trivial sizes are going to surface the problems in your implementation. Now, once again, this is not a theorem that has a mathematical proof, it's just a hypothesis, which however, is based on the number of practical examples and case studies that people share between each other. And from my experience using the TLA+ as a model checker, I can share that, indeed it was never the case that I would start model checking my specification with a small model. And only after I switched to the bigger model, then I surfaced a bug in my implementation, that never happened.
- 00:39:51 Tomasz Masternak
- If something was wrong, I would get the feedback from the model checker pretty much immediately. Okay. So we know what is model checking approach. We basically take our system, we describe it in a bit different way. So we create the model. We provide the invariance, then the model checker picks the model, generates all the histories, goes through all the states and evaluates the invariance. But the question is, what makes a good model? And as you probably know, all models are wrong, but some of them are useful. And the real question here is, what makes a good model from the testing perspective? Because we can judge whether a model is good or not only if we know what is the purpose of modeling in the first place. Because basically what modeling is, is removing irrelevant details from the domain that we are modeling.
- 00:40:51 Tomasz Masternak
- So let's try to think. What is the purpose of modeling in the model checking context? Well, the purpose of the modeling there is to test our distributed system. Okay. So what are the scenarios that we want to test? Well, the scenarios that we want to test, is that our system behaves correctly under concurrency and under partial failures. And as a rule of thumb, what I would say is that, if we want to come up with a meaningful, useful model of our system, from the model checking perspective, we need to make sure that we model correctly, the concurrency. So basically, what are atomic steps in our system and what are not atomic steps? And what are the partial failures that are possible in our system? So those are two things that definitely need to get into the model, if we want to get value out of the modeling exercise in the first place. And as with any modeling approach, we are going to remove way more things than we are going to keep in the model.
- 00:42:07 Tomasz Masternak
- But one thing to keep in mind is that, we want to make sure that the important bits, which is concurrency and the partial failures are the things that are making to the model, because those are the things that are going to potentially bind us in production or in the real system. And this is what we want to actually test. Okay. So let's start looking at the specification and let's try to figure out, how is it to actually use, model checking using TLA+ in practice.
- 00:42:44 Tomasz Masternak
- Let's start with looking at the toolbox that is available with TLA+. So first of all, what we are going to use is an extension to the visual studio code, which in my opinion, is a pretty convenient way to start off with TLA+. And basically what it does, is that it gives some syntax coloring and it enables to run the model checker as a command option from the visual studio code.
- 00:43:14 Tomasz Masternak
- And one additional note here is that, we are not going to use the TLA+ directly here. I think more convenient approach is at least at the beginning, is to use Pascal, which is a language that is looking like C or Pascal, depending on the flavor that then gets translated into TLA+. So what you can see here is basically specification of the Outbox algorithm using Pascal. And you can see that we have a definition of a process which is called a handler, and that has some wild loops, some assignments, we will get into those in a second. But basically what I can do is I can do control P and I can do Parse module. And what it's going to do, is that it's going to update the section which is called, Begin translation. And in that section, we can see what is actually TLA+, which is underlying the code that we are working with in this specification.
- 00:44:25 Tomasz Masternak
- So if you are ever interested in seeing how TLA+ actually works, or what is the syntax, this is where you can see it. However, from our today's presentation perspective, we are not going to delve into that. So basically what I will do is, I will take the model that I have, and I will check that model. And basically what this is going to do is, it is going to take that specification, generate the whole history of my system. And then, for each state is going to check the invariance. And the invariance that it's going to check currently is type invariant. So basically, it is checking that my state actually makes sense, for instance, I'm not putting some random stuff into the database or producing something, which doesn't look like a proper message, et cetera.
- 00:45:22 Tomasz Masternak
- So, the specification here, which is the model plus the invariance, and the invariants are defined here. We will go into them in a second. Plus the configuration that we pass to the model checker, enables us to run the model checker and then, figure out whether all the invariants that we care about, actually hold in our system. Great. So let's look at the first invariant that we are going to check with our model checker, which is, are there any ghost messages possible in our system? And what I mean by ghost message is a message which is sent out by the handler, but it doesn't have any corresponding database change persisted to the database. So let's see if the TLA+ is going to help us. In this section called invariance or the define, I have an invariant define which is called, no ghost messages. And basically what it says is that, for every message in the processed variable and processed is basically a set a collection.
- 00:46:31 Tomasz Masternak
- So for each element of that collection, what needs to be true is that, either there exists a change in the database corresponding to that message, or there are no messages in the output queue corresponding to that message. So basically what that invariant is saying is that, it's not possible to have a situation in which we have a message in the output queue, but no changes in the database. Okay. So let's ask the model checker to check, if there are no ghost messages in the current specification that we have. And let's look at the specification that we have right now, what we are doing in the handler is that, we spin in an infinite loop, while loop. First, we receive our message, and basically, everything which is underneath one label until we get to the other label is an atomic step.
- 00:47:30 Tomasz Masternak
- So this is basically expressing what is concurrent in my system and what is not concurrent in my system. So first, what we do is we receive a message from the input queue. Then we assign a transaction ID, which basically models a business logic execution. And then, we try to send out the messages. And what we can do here is we can either fail to send out, which basically models a partial failure when the messaging infrastructure is not available, or we can be successful, and we are adding a new message with a given ID and transaction ID to output queue. After that, we update the database. So we put some changes in the database. And finally we try to acknowledge, and if we are successful, we are putting messages in the process collection. Okay. So let me now do model checking for that. And surprise, surprise, what the model checker is claiming is that in fact, there is a scenario in my specification, in my model, in which the no ghost messages invariant is violated.
- 00:48:39 Tomasz Masternak
- That's interesting, but what's even more interesting and useful is that it gives me an error trace. And what an error trace is, is basically a path in the graph that I showed you, which represents the execution that led to broken state. So let's have a look at it together and let's try to figure out whether we can see what actually happened here. So we started off in main loop, so here, then we went to the receive label, which is here. Then we did the process. Okay. So far so good. Then we did, send outgoing messages. And then we went to the main loop. Okay, that's unexpected. So we were here and we ended up here, what happened? Well, we actually executed this part, which is, Fail. And what Fail is doing is that, it's actually a piece of logic, which captures the amount of failures for a given message and then goes to the main loop.
- 00:49:40 Tomasz Masternak
- So basically, what happened is that we went up until, Update Database, we failed and we went back to the main loop. Then we did the Receive, one more time, processed the message. We sent outgoing messages one more time. Then, we updated the message. And when we tried to update the message, we basically failed one more time. And the situation in which we ended up, is a situation in which we have a message in the output queue, which is here, but we don't have any corresponding change in the database, which is basically violating our invariant. Okay. So that's interesting. So let's try to actually patch that quickly. So what can we do about it? Well, let's change the order here, that always helps. So let's first start with updating database, and only after that, sending out a message. Okay. I will translate the plus code, the TLA+, which is this command.
- 00:50:47 Tomasz Masternak
- And then I will check the model with TLC. Yeah. And now we are actually good. So we did a change in the order and we ended up in a situation in which no ghost messages are possible in our system. Right. Okay. So let's go to the other anomaly that can happen, or one of the inconsistencies that can happen in our system. We can potentially lose messages. What I mean by losing messages? Well, we can update database, but we might fail to ever send out the outgoing messages that were connected to business logic, operating over the incoming message. So let's go back to the model checker. And let me add in the configuration, one more invariant, which is called no loss messages. And what that invariant is saying, is that one of the two things need to be true, at least one of the two things.
- 00:51:51 Tomasz Masternak
- So either, there are no changes in the database, or there is a change in the database and there exists something in the output which basically connects to that process. So what I'm going to do here is, I'm going to add the invariant. I will check my model with TLC. Okay. And once again, it complains that, in fact, it is possible for us to end up in such a situation. So how is it possible? Well, we go to the main loop, we do a receive. Then we do process. Then we update database. We try to send out outgoing messages, which is here. And then we fail and go back to the main loop. Then we go to the receive, once again, to the process, once again to the update database, and we fail one more time. So basically, because the model is written in such a way that we enable only a single failure for a message, what is going to happen is that, it is possible for changes to the database being applied, but the outgoing messages are not being sent out.
- 00:53:08 Tomasz Masternak
- So basically what I'm going to do now is, I'm going to assume that, in fact, what is possible is that, we have two-phase commit available in our infrastructure. And please bear with me, we are going to loosen that requirement in a second. So basically, right now, those are two independent, non atomic steps. I will model them as atomic steps step, which is basically saying, I have something like two-phase commit at my hand. So we will say update and send, and I will copy this part and I will make it so that it's actually executed here. Okay. I will parse the module one more time, and I will do a model check on it. Perfect. And now we never lose messages because we always make sure that the database is updated, and we are sending messages to the output queue. Okay, great. One more anomaly covered.
- 00:54:19 Tomasz Masternak
- Okay. So another thing that can happen is that, we can have duplicated processing. So what can happen is that even if we have automaticity between updates and sense, a given message can come back in more than once, because we fail to acknowledge message and the broker infrastructure will provide us with that message one more time. So let's see how we could model that. No duplicated processing, and basically, what it means is that for all records in the database, we cannot find another record, which is based on the same message ID, but corresponds to a different handler execution. Okay. So I'm going to add that invariant here. I'm going to check my model with TLC. Okay. And that actually fails. And how does it fail? Let's look at it. We go to the main loop, then we go to the receive, then we go to process.
- 00:55:31 Tomasz Masternak
- Then we do update and send, we want to acknowledge, but we fail. We go back to the main loop. We come back, we pick up the message one more time. We process it, we update and send, and then we fail to acknowledge it one more time. So basically, what happened, is that we went up until here, but we failed acknowledging, and we received the message one more time and nothing in our logic here prevents us for processing the same message more than once. So let me just take a snippet here to make sure that it doesn't take me too long to actually code it. So what I'm going to do here is, I'm going to add an if statement, which says something like, if I didn't process that message already, execute that otherwise do nothing.
- 00:56:34 Tomasz Masternak
- Let me parse the module, and let me check the model with TLC. Okay. Perfect. Now it actually succeeds. That's great. Okay. So that was the third anomaly. What is the fourth anomaly? The fourth anomaly says that, we want to make sure that our output is consistent, but before we go with the consistent output, I'm going to make a small change to the specification. So basically what I'm going to do is, currently what happens is that our model is written in such a way that it says that, the message can fail up to number of times and after that, it goes to the error review. What I'm going to change in the specification is, I'm going to, say the message can fail up to any number of times. And after that it's successfully processed.
- 00:57:28 Tomasz Masternak
- So basically what I'm modeling here is that eventually every message will be processed. This is what I want to have in my model. So what I will do is, I will add a definition of something which is called fails, please bear with me. And I'm going to change the content of the macro fail, which is this one here. And then I'm going to update states, for the steps for the updates, send and acknowledge. So let me copy that over, and basically what I'm saying right now is that, update is a single step. And in that single step, I'm going to process the message if it's not there in the database, that's great. I will change that to send, and I will copy over the code here, but instead of updating the database, I will send out outgoing messages here. I don't need this one anymore and I don't need this one and I will copy this one, yet one more time, to make sure that I model the acknowledgement part correctly as well. So basically, I am going to keep that, but this part goes here. I can remove this bit and I'm done.
- 00:59:02 Tomasz Masternak
- So basically now let's parse the module and let's check it with the model checker using TLC. Okay. So basically now it complains that, I can lose some messages. And the way in which I can lose some messages is that basically currently the send is doing sends only if the message has not been, or the change of the message has not been resistant in the database. So this is the part that you might remember from the Outbox. What we are saying there basically is that we want to push out the messages no matter what. So let me model that. So I'm going to send out the messages no matter what. So I'm going to remove this if statement, I will parse the module one more time and I will do model checker with TLC. Okay. And now this is fine.
- 01:00:00 Tomasz Masternak
- Perfect. Basically what we did is, we no longer require update and send to happen as an atomic steps. And now we are ready to check the consistent output. And what is the consistent output? Well, it's the invariant, which is defined in such a way. And let me add it here so that we all see it. And basically what it says is that, for any message in the output queue, there shouldn't exist any other message which has been generated for the same message, but corresponds to the different business logic execution, which is denoted by the transaction ID. Okay. So that's interesting. So what we are making sure is that, it's not possible to see an outgoing message, which was generated by the same incoming message, but with a different content, basically. So I'm going to add my invariant here. I'm going to parse the module.
- 01:01:03 Tomasz Masternak
- I'm going to check it with the model checker. And once again, that fails, which I think is already expected. And basically, the reason why it fails. I'm not going to go into the details of the trace this time. But the reason why it fails is that, as you can see here, the transaction ID that we are putting in the outgoing message, is that transaction ID that we generated here. And that doesn't actually model what the Outbox is doing, because what we are saying in the Outbox is that, in the database, we are capturing both the business state changes and also the outgoing messages. And the way that I can model here, is that I can assign the transaction ID based on the database change that was caused by the incoming message.
- 01:01:58 Tomasz Masternak
- So what I will do here is, I will first grab the change that was caused by the incoming message, which is basically this in the TLA+ expression. I'm going to copy this, but instead of using transaction here, I'm going to do change.transactionid. So basically, now what I am saying is that, we always push out send messages, but the content of that message is based on the corresponding database change instead of the transaction ID of that given execution of business logic. Okay. So let me parse the module one more time, and let's do the model check.
- 01:02:47 Tomasz Masternak
- Great. And now what happened is that we actually are successful. So looking at the configuration what we can see is that, this approach does not allow for ghost messages, for lost messages, for duplicated processings. It actually ensures consistent output. And if you look at the code of the specification as it is right now, it's actually pretty similar to the Outbox diagram that I showed you. So we receive a message, we execute the handler. Then, we store the changes, only if it's not a duplicate. And we are capturing the transaction ID, and after that, we try to push out the messages using that transaction ID that we persisted. And if that's successful, it's only then that we acknowledge the incoming message as the last thing that we are doing. And basically, what I'm showing here is something which is very close to the Outbox, it's not exactly the Outbox, but I'm hoping that you can see that using the model checking, is a very useful approach to figuring out whether your implementation of the Outbox, is actually the one which is doing what it is supposed to be doing under various failure conditions.
- 01:04:11 Tomasz Masternak
- Because basically what I am saying here is that pretty much on every step, I can either succeed operating on the external resource, or I can fail. And this is what I'm modeling here. So once again, using the model checking technique that I showed you here, enables us to take the model of the implementation and generate all the histories and check invariants that are important from the perspective of our implementation. Okay. And that brings me to the end of my presentation. I hope that you enjoyed it. If you have any questions, I think that now it's a good time for us to go through them. Okay. So let me look into the chat because I think that, if we have any questions...Okay, so we have a question from Genadi, and Genadi and is asking, how do we make sure that the implementation is actually according to the model?
- 01:05:46 Tomasz Masternak
- So that's, I think the number one question in the model checking space. So how do we make sure that the model that we are using for model checking is something which fairly represents the implementation that we are using. So, first of all, I don't know about any production-ready approaches for synchronizing the two, I never heard about that. The only practical approach, which is based on the process that I heard of was approach that was used in CosmosDB, because TLA+ I didn't mention that, probably I should have mentioned that, is used in the production for testing production systems by many companies. So CosmosDB is one, there is also a paper that was written by Amazon employees, engineers at Amazon when they show how the TLA+ is used there, and what was the return on the investment from their end.
- 01:06:52 Tomasz Masternak
- However, the only approach that I know about which is production tested, is the one from CosmosDB which is, whenever there is a bug in the critical part of your system. So let's say that you have a production bug, which points to the Outbox implementation. You need to reproduce that bug with some invariant, ideally, and then first make the change to the model. And only after the model checking passes for the new invariant and the new model, the engineers are allowed to apply the patch onto the code itself. And that basically forces people to make sure that the model is in sync with the code, that whenever the code changes, the model is updated as well. So that's the best answer that I can give. I hope that, that's useful.
- 01:07:55 Tomasz Masternak
- Okay. We have another question, which is, can this be implemented in a C# web API? So I'm not sure if I understand the question, whether the question is about, whether it can be provided as a service, the model checker or whether we can check the C# web APIs. So maybe I will try to answer the second one. And basically the answer that I have is, model checking should not be used to check each and every part of your system. And in fact, what is true of many systems is that, there are well defined places in which the tricky part is encapsulated.
- 01:08:46 Tomasz Masternak
- And those are the places in which the model checking technique makes sense, because basically, as I mentioned, it goes through all the edge cases and checks the invariants. Now, model checking comes with the obvious cost that you need to spend time to actually come up with the model, and then, to keep it in sync, there is additional overhead there. So I would say that for the web API, I would probably not use it because the other approaches to testing, which are valid as well, are probably better in this context, because when we are testing this part of the infrastructure, what we are probably trying to test is whether it works in most of the occasions and cover roughly the most possible scenarios.
- 01:09:38 Tomasz Masternak
- And then the automated tests are just fine. However, in the parts that are at the core of our infrastructure, that basically should be rock solid and our consistency depends on them. Then I would think about using model checking approach. Okay. I think that that was the last question that we had. Okay.
- 01:10:19 Szymon Pobiega
- Yeah. I guess that's all we have time for today. And so if you have any questions, do not hesitate to reach us via email or Twitter. And on behalf of Tomasz, this is Szymon Pobiega saying goodbye for now. And sorry, one last thing that I mentioned in the chats that I need to apologize for, for especially those of you who joined late, we had a technical issue and we used wrong time zone in the Zoom invitation, that's why it came as one hour later than actual time that the presentation started. So those of you who look at the particular website joined at the time when the webinar started, and those of you, I think who look at the invitation joined later, but we'll send you our link to the recording as soon as possible, so that you don't lose the access to the content, of course. Sorry for that and I'll see you in the next particular live webinar.
About Tomek Masternak
Tomek is an engineer at Particular Software. He is passionate about the theory and practice of distributed systems and likes to know why they work or fail and what that means in the first place.