Messages on the Outside, Messages on the Inside
About this video
In the classic paper Data on the Outside versus Data on the Inside, Pat Helland argued that data within a service boundary should be treated differently from data residing outside of it.
Here, I shall argue that the same applies to messages. Inside a service boundary, messages are tightly coupled to the corresponding data manipulations. Sometimes it is even possible to enforce a total order of messages.
The moment the message crosses the service boundary, it enters the no man’s land where bad things happen. Messages get reordered, duplicated or even lost.
Join me in this talk to learn about some patterns you can use to get your messages safely to the other side.
🔗Transcription
- 00:07 Szymon Pobiega
- Thank you. So yes, my name is Szymon Pobiega. Today, I want to talk to you about why messaging is important, especially when practicing domain-driven design, and why messages come in different flavors. They differ based on where they are used: on the inside or on the outside. But let me tell you a few words about myself. I started my journey with domain-driven design around 10 years ago. I just checked; it was by porting the DDD Sample, which was the Java application based on Eric's book, to .NET about 10 years ago. Then I was super excited by event sourcing; it seemed to be the best thing since sliced bread and, well, that didn't happen, I guess. Event sourcing is still the next big thing; hopefully it will get traction soon. Anyway, then I moved to the messaging space, now working for Particular Software, the company behind the NServiceBus messaging framework.
- 01:11 Szymon Pobiega
- And there, I tried to focus on message delivery patterns, meaning how do I structure the code so that it's easier to process messages and so that processing messages is always correct. And this is the topic that I want to talk about today. In my free time, I try to play with Legos, building things like that, and do some off-roading with my kids. The story that I want to tell you today is based on a fictional company, but that fictional company is a mashup of all my customer interactions at the company that I work for over the last couple of years. So there is no one customer that had all the problems that we'll be talking about, but if you take them together, you can build quite a convincing story. And because I'm Polish, I wanted to introduce something that is Polish. Well, pierogi is a Polish dish: you put something into a piece of dough and boil it.
- 02:10 Szymon Pobiega
- And basically you can put any type of filling in pierogi, almost any type, and we'll learn later what you can't put into pierogi. Every story of every startup, of every company and software system, starts with an architect, right? You need to have some architecture, and the definition of an architect is the guy that creates architecture, so that's the guy. And obviously to have a good architecture you need to fill it with buzzwords. So this time we have micro frontends here on top of that architecture diagram. So this is a simple architecture of a system that was supposed to handle these orders for pierogi. Well, we have here three microservices or macroservices: shipping, orders and billing, each with its own user interface and micro-frontend. And for the purpose of this presentation, we'll assume that the green part is done already.
- 03:07 Szymon Pobiega
- And we just need to build the blue part, the shipping part, and the billing part. And for that purpose we'll hire two teams of developers, the blue team and the red team. Of course we have lots of funding in our startup, so we can hire the best people, give them work, and expect magnificent results. What happened here in that particular startup is that they grouped those developers in a very special way. They hired a bunch of people, and then they took the people that had a domain-driven design and database background and locked them in the blue room. Blue is sort of the domain-driven design color. And then they took the rest of those people and locked them in the red room. And this allowed them not to talk, because, well, they need to focus on their tasks and they don't need to communicate; of course, it's a microservices architecture.
- 04:07 Szymon Pobiega
- So they started by trying to make some sense of the architecture diagram that I showed you and create some design for their part of that grand architecture. The blue team started by designing their aggregates. They figured, well, we have some domain-driven design background; aggregates are the building block that we'll use to structure our business logic. So that's an example of a shipping aggregate. Everybody knows what an aggregate is, right? Yeah. So that's a shipping aggregate. It contains all the information related to shipping an order of pierogi. It has its root on top, and it has some items that are going to be shipped, and the destination. That's our nice aggregate.
- 04:55 Szymon Pobiega
- But of course you cannot implement complex business processes using a single aggregate. That's just a thing that contains shipping information and shipping logic. If you want to actually ship an order, that's a complex and long-lasting business process that spans multiple aggregates. First we need to send the ship command to the order aggregate, which does its own logic. Then we need to update some information in the store aggregate that represents a physical store or a warehouse where pierogi are made and stored. And then we need to create a shipment aggregate that contains all the information and logic necessary to ship the order.
- 05:34 Szymon Pobiega
- So as you can see, it's more than one aggregate. And as we know, an aggregate is a unit of transactional logic; the aggregate protects the business invariants. And it's not advised to modify more than one aggregate within one transaction. So what they did, the blue team, is they figured, well, we cannot modify more than one aggregate in a transaction, so we'll use messages passed between aggregates to create those complex business processes. One aggregate does its job, creates a message, sends the message to another aggregate, and the other aggregate does its job. But how do you send messages if you have a database background? Well, you use a database, because that's your background. So the blue team created a simple queue structure in the database. That queue had two columns: a sequence number and a message body.
- 06:27 Szymon Pobiega
- So a message would be a row in that table. The row contains a sequence number, which makes it a queue (otherwise we would have just a set of rows), the sequence number enforcing some ordering here, and the message body, which contains all the information required to continue the business process. And the basic unit of work in that application would be: take a message from that queue table, execute the logic, and then put a message or multiple messages into that table. In SQL, it can be represented as a transaction that looks like this. We begin a transaction with SQL Server or Oracle or, for that matter, any relational database. And we do a destructive select, which basically takes one row from the database and locks it so that no other thread can delete that row at the same time.
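The SQL from the slide is not in the transcript, so here is a minimal sketch of that unit of work, assuming SQL Server and ADO.NET; the table and column names (MessageQueue, Seq, Body) are illustrative, not taken from the talk:

```csharp
// Minimal sketch of the queue-in-a-database unit of work described above.
using System.Data.SqlClient;

static class QueueUnitOfWork
{
    public static void ProcessOneMessage(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            {
                // Destructive select: take the oldest row and lock it so that
                // no other thread can pick up the same message at the same time.
                var receive = new SqlCommand(
                    @"WITH NextMessage AS (
                          SELECT TOP (1) Seq, Body
                          FROM dbo.MessageQueue WITH (ROWLOCK, READPAST)
                          ORDER BY Seq)
                      DELETE FROM NextMessage
                      OUTPUT deleted.Seq, deleted.Body;", connection, tx);

                using (var reader = receive.ExecuteReader())
                {
                    if (!reader.Read()) return;             // queue is empty
                    var body = reader.GetString(1);         // payload for the business logic
                }

                // Load the aggregate via an O/RM, execute the business logic in
                // the domain model, and issue the resulting updates/inserts on
                // the same connection and transaction (omitted here).

                // Send follow-up messages by inserting rows into the same table,
                // still inside the same transaction.
                var send = new SqlCommand(
                    "INSERT INTO dbo.MessageQueue (Body) VALUES (@body);",
                    connection, tx);
                send.Parameters.AddWithValue("@body", "{\"next\":\"step\"}");
                send.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}
```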
- 07:20 Szymon Pobiega
- So this is our destructive select; we now have all the information required to process the message. Then we load the aggregate information into our process (let's say it's a Java process, a .NET process or any other process) using an object-relational mapper, and execute the business logic, which is contained in the domain model here. Then, as a result, we execute some updates or inserts that create new information or update the information in the database. And last but not least, we send messages, because we want to inform other aggregates that they need to do some sort of work. So that's our small Lego piece, the building block of complex business processes. Meanwhile, the red team wanted to start big, and they wanted to have a microservices architecture because that's the next big thing. So within the whole architecture diagram, they decided that they want to structure their part as a set of tiny little processes that each communicate with other processes by sending messages over topics.
- 08:24 Szymon Pobiega
- So they had some background with messaging systems, message brokers, especially in the cloud. So they figured, what can go wrong? They started processing their messages with their message broker, and the processing of a message consisted of two steps: first you take a message and consume it, and then you process it. For some reason, most message brokers support this mode of operation; I have no idea why. But that's the process: you take the message, remove it from the queue, and process it. What happened? The manager came in and said, "Well, we have a problem because in our testing environment we noticed that we haven't billed some customers. Can you figure out what happened?"
- 09:11 Szymon Pobiega
- So the developers went back to their room and started thinking: what could go wrong? Well, what was going wrong here is that in their code they were consuming messages, and at that moment the message is deleted from the queue, and then they were executing the business logic. And if for some reason the business logic failed, because of locks in the database or for any other reason, the message is gone. It's no longer in the queue, because they used the destructive receive. So, well, they learned something; at least they were learning, and they restructured their process for receiving messages. Now they receive without deleting, process the message, and then mark the message as consumed so that the message broker can delete it.
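To make the two receive modes concrete, here is a hedged sketch using a hypothetical broker client interface (not any particular SDK; real brokers expose equivalent operations under names like receive-and-delete and peek-lock):

```csharp
// Hypothetical broker client interface, for illustration only.
record Message(string Id, string Body);

interface IQueueClient
{
    Message ReceiveAndDelete();        // destructive receive: the message is gone immediately
    Message Receive();                 // receive without deleting (the message stays locked)
    void Complete(Message message);    // mark as consumed so the broker can delete it
}

static class Handlers
{
    // Sprint-zero version: the message is deleted before the business logic runs,
    // so a failure in Process() loses it (the unbilled customers).
    public static void HandleDestructively(IQueueClient queue)
    {
        var message = queue.ReceiveAndDelete();
        Process(message);              // if this throws, the message is lost
    }

    // Restructured version: the message is only marked consumed after processing,
    // so a failure puts it back on the queue to be retried later.
    public static void HandleAtLeastOnce(IQueueClient queue)
    {
        var message = queue.Receive();
        Process(message);              // if this throws, the message goes back to the queue
        queue.Complete(message);       // if THIS fails, the message is redelivered: a duplicate
    }

    static void Process(Message message) { /* business logic */ }
}
```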
- 09:56 Szymon Pobiega
- Most message brokers support this mode of operation too. So they were quite happy. If the business logic fails after receiving a message, nothing is wrong, because the message goes back to the queue. It can be processed later. And the manager comes in again and says, "Well, all good. We now don't have customers whom we haven't billed, but we have a list of customers that paid twice. What's wrong, guys? You cannot write correct code, what's wrong?" The blue team is doing quite well. So the red team was like, "Ugh, what happened?" They started analyzing what really happened, looking at the logs, and what happened is they received a message in their process, they executed the business logic, and then they failed. They failed to acknowledge that they had processed the message. So the message went back to the queue and the broker happily redelivered it, because it was not lost.
- 10:52 Szymon Pobiega
- Fortunately, they hadn't lost the order. Unfortunately, they processed it again, thus sending an invoice for the second time. So obviously something is wrong here. And they had a quite long brainstorming session during which they discovered a property called idempotence. Idempotence is quite well-defined in the domain of mathematics as a property of a function: a function is idempotent if you can apply it multiple times and the result of applying it twice is the same as applying it once. It's nice and formal. What it means for our IT domain, or computer science domain, is that an idempotent message processor is a message processor that can process a message multiple times, and the side effects of that processing are the same as if it had processed it once. So they figured out, well, we need to make our processing idempotent, and that was the end of sprint zero.
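For illustration, a made-up billing handler that applies a charge on every delivery is not idempotent, while one that remembers which message IDs it has already applied is; this sketch is an assumption about what such code could look like, not code from the talk:

```csharp
using System.Collections.Generic;

record BillOrder(string MessageId, decimal Amount);

class Account
{
    public decimal Balance;
    public HashSet<string> AppliedInvoices = new HashSet<string>();
}

static class Billing
{
    // Not idempotent: every redelivered copy of the message charges the customer again.
    public static void Handle(BillOrder message, Account account)
    {
        account.Balance -= message.Amount;
    }

    // Idempotent: processing the same message twice has the same side effects as
    // processing it once, because duplicates are detected and skipped.
    public static void HandleIdempotently(BillOrder message, Account account)
    {
        if (!account.AppliedInvoices.Add(message.MessageId))
            return;                      // already billed for this message, ignore the duplicate
        account.Balance -= message.Amount;
    }
}
```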
- 11:52 Szymon Pobiega
- They had to deliver some sort of demo, so they tried. First let's see what the red team did. I'll show you here the testing harness that both teams created together; the testing harness allowed them to test the whole system without going through the UI. They had a special console application. They could type in commands, create orders, submit those orders, bill those orders, do that sort of stuff. And the other thing that they figured out quite early is that, if you want to ensure that you don't fail in production, you must test your system in earlier environments for the kinds of failures that you expect in production. So the thing that the red team feared the most was message duplicates, because they had already seen that duplicates are dangerous and they had had problems with them.
- 12:54 Szymon Pobiega
- So what they did is they built a behavior into their system, or messaging framework, that would duplicate every outgoing message. So every message that is sent by this test harness gets duplicated on the wire. They wanted to make sure that they processed those duplicates correctly, because they correctly figured out that duplicates would probably be quite rare in testing environments but can happen more often in production, where the load is greater and there is more randomness. So they try to submit an order during the demo, the manager is watching, and they submitted an order; a duplicate has been detected, so everything is fine. They need to add pierogi with meat to that order, and let's see what happens. Here's the code on the right-hand side, there is a debugging session in the IDE, and I structured this code...
- 13:45 Szymon Pobiega
- I split this code into multiple blocks. The blue block on top is where the duplicate detection happens: this code detects if a message is a duplicate that should be ignored. Then there's a red piece of code that executes the business logic. Then there is a green block that saves the data to the database. And finally, the purple one that publishes a message. Why publish a message? Because we want other microservices to know that we did some sort of processing, and those other microservices can act according to their own business logic. So the purple block is there to publish an item-added event for each item added to our order, so that other services can do their job. Let's see what happens first. That's the first copy of our message that was duplicated. Because that's the first copy, our duplicate detection code didn't do anything.
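A rough, hypothetical reconstruction of a handler with those four blocks (the names and types are made up; as the rest of the talk shows, this version still has a flaw):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

record AddItem(Guid OrderId, string Filling);
record ItemAdded(Guid OrderId, string Filling);

interface IOrderRepository { Task<Order> Load(Guid id); Task Store(Order order); }
interface IEventPublisher { Task Publish(object @event); }

class Order
{
    readonly HashSet<string> fillings = new HashSet<string>();
    public bool ContainsFilling(string filling) => fillings.Contains(filling);
    public void AddItem(string filling) => fillings.Add(filling);
}

class AddItemHandler
{
    readonly IOrderRepository orders;
    readonly IEventPublisher bus;

    public AddItemHandler(IOrderRepository orders, IEventPublisher bus)
    {
        this.orders = orders;
        this.bus = bus;
    }

    public async Task Handle(AddItem message)
    {
        var order = await orders.Load(message.OrderId);

        // Blue block: duplicate detection, assuming an order cannot contain
        // two items with the same filling type.
        if (order.ContainsFilling(message.Filling))
            return;                                       // duplicate, ignore

        // Red block: the business logic.
        order.AddItem(message.Filling);

        // Green block: save the data to the database.
        await orders.Store(order);

        // Purple block: publish the event so other microservices can do their job.
        await bus.Publish(new ItemAdded(message.OrderId, message.Filling));
    }
}
```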
- 14:39 Szymon Pobiega
- It's not a duplicate. So we execute our business logic, we add a new item to an order. We store the changes to the database, log some information and publish an event. So far so good. What happened here? We see that there is an exception in the logs. Because two copies of the same message were processed concurrently by two threads, one of the saves to the database succeeded while the other failed because of a constraint on the primary key. So what happens if a message fails processing? Well, it goes back to the queue, because we are now aware that we should not destructively receive messages from the queue. The message goes back to the queue and will be picked up again. Now is the second time we process that message. We execute the duplicate detection code here, based on the assumption that an order cannot contain more than one item of the same type.
- 15:32 Szymon Pobiega
- It already contains an item of type meat, so we ignore that message. Everything is good and nice. Everything is happy. Well, the red team is happy. Now the blue team. The blue team has a slightly different test harness application, but, well, similar. They create an order, and they execute some code that creates an order aggregate and stores it. Then they need to add some item to that order. They execute another piece of code. What it does is load the order through the object-relational mapper, then add an item to the order and store that object in the database. So at this stage, we have an order in the database and that order contains one item. Now we want to submit that order so that it will be shipped. First, we use a domain service to figure out which store is closest to the customer's location.
- 16:32 Szymon Pobiega
- Then we load that store, which is an aggregate, as you remember, and execute a business logic method on that aggregate, assign shipment. That's the code that will do some magic work in that aggregate, and well, because it's an aggregate, we cannot modify anything else here. That's our only possibility here in that message handler. What we can do now is publish a message or send a message so that another aggregate can receive it and continue the business process. So we create our create-shipment message and send it, and that's it. And now we got the create-shipment message: another message, another message handler. We execute the code that adds items (in this case we skip over this code) and we have a new aggregate. So what just happened is: we created one aggregate and sent a message to it; as a result of processing that message, we modified some data and sent another message, which resulted in creating yet another aggregate.
- 17:38 Szymon Pobiega
- So we can see, we built our complex business processes by chaining those aggregates together through messaging. And the blue team was very happy and very proud of their design. And in fact, the testers were also very happy. The testers that were trying to assess the quality of the code were very happy with what the blue team delivered. They were not that happy with what the red team delivered. The problem was that when they were going out of the room, the manager approached the blue team and said, "By the way, did you know that we have multiple warehouses and multiple factories around the world?" Well, they didn't know that. They should have known that if they had talked to the business or had an idea what the business goals are. And the business goal was basically to disrupt the pierogi delivery market all around the world. They should have expected that they would have code executed in multiple locations and multiple data centers, but they didn't know.
- 18:37 Szymon Pobiega
- That was basically the business architecture of that company. They had a headquarters somewhere and they wanted to have storage and production facilities located around the world. And all those storage and production facilities would have their own data center in which part of the code the blue team creates (the shipping code) would run. So the blue team was supposed to create some code that runs in the HQ that is responsible for shipping, and also some code that runs in those remote factories and warehouses. Let's see what happened during sprint one. The blue team tried to make sense of that business architecture design and replaced the pictures of buildings with pictures of queues. What they understood is that they need to run their aggregate-changing code in multiple places, and thus they will have databases in multiple places. They will have a single database in their headquarters and a database in each remote facility.
- 19:42 Szymon Pobiega
- The thing is, the process may cross the boundary. If that complex process, built of a chain of aggregates, crosses the boundary between locations, they need to send a message across to a different location, and that looks like this. A single microservice or a single process needs to modify data in one place, in a database, let's say in HQ, and then it needs to send a message that gets to the remote location. So it needs to access two transactional resources at once. And there is a well-known solution to that problem: two-phase commit, a quite old protocol that allows us to implement distributed transactions. It's quite simple. It consists of two stages, a prepare stage and a commit stage. In the prepare stage, if we are talking to two databases, we tell each database: prepare your transaction, prepare your transaction. And then, when they report that they are good to commit, we commit both transactions.
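A schematic sketch of the two-phase commit idea described above (purely illustrative; real coordinators such as MSDTC also handle recovery, timeouts and coordinator failover):

```csharp
using System.Collections.Generic;
using System.Linq;

// A participant could be the HQ database or the remote location's database/queue.
interface IParticipant
{
    bool Prepare();     // do the work, hold the locks, promise to commit
    void Commit();
    void Rollback();
}

static class TwoPhaseCommit
{
    public static void Run(IReadOnlyList<IParticipant> participants)
    {
        // Phase 1: ask every participant to prepare. Each one keeps its locks
        // until the outcome is known, which is what makes the protocol so
        // sensitive to latency and failures between distant locations.
        bool allPrepared = participants.All(p => p.Prepare());

        // Phase 2: commit everywhere, or roll back everywhere.
        foreach (var participant in participants)
        {
            if (allPrepared) participant.Commit();
            else participant.Rollback();
        }
    }
}
```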
- 20:43 Szymon Pobiega
- So the blue team were quite happy with their solution; they didn't need to modify any code. They just needed to use the Distributed Transaction Coordinator built into the Windows operating system. And basically it looks a bit more complex when you draw that Distributed Transaction Coordinator on the slide. It's not as easy as two-phase commit is in theory, because those Distributed Transaction Coordinators there on top of the slide need to talk to each other. So there are a lot of communication paths involved in the simple task of sending a message and updating a database, two different transactional resources. But the blue team was doing quite nicely and making progress on that. Meanwhile, the red team was still struggling with idempotence. They tried to extend the test scenarios that they were executing in their testing environment to also cover more advanced or rarer situations that can happen while processing messages.
- 21:44 Szymon Pobiega
- And one of those situations is: your message broker or message queue goes down for some reason, and you cannot send a message. What happens then? They tried to simulate that situation, and they simulated it in such a way that they added a type of pierogi filling that causes a simulated broker failure; that's ruskie pierogi, sort of meat and cheese. So they added this type of filling to their order and they execute the message handler code to see what happens. Here there are no duplicates; there is a single copy of the message that we are processing. So it's not a duplicate, we skip over the duplicate detection code and execute our business logic. We add an item to the order, we try to store the data in the database. So far so good.
- 22:33 Szymon Pobiega
- We try to publish messages to the other microservices. And now there is the simulated failure, right? The immediate retry is going to retry the message, blah, blah, blah, because of an exception, a broker error; but nothing is wrong here. We failed, but the message is not deleted from the queue. It goes back to the queue and we can process it later. So we retry processing it. And what do we do? We load the order by ID and we check if it's a duplicate, because the duplicate detection here is based on the assumption that the filling types in pierogi are unique. Well, it seems like it's a duplicate, because the data is in the database and already modified. So since it's a duplicate, we don't need to do anything. And what is the result? The result is that we process the message correctly.
- 23:23 Szymon Pobiega
- The state of the database is correct. We have an order and one item. The problem is that we didn't send the outgoing messages. The first time we tried sending the outgoing messages, we failed because of the simulated failure. The second time, we didn't send them because we thought that the message was a duplicate. So it's not a great solution, because our business process will get stuck. Every time the broker fails while sending messages, the business process will be stuck in that place with no messages generated. So, if storing the data and then publishing messages doesn't work, well, the red team thought, let's reverse the order and first send out the messages and then store the data in the database. So they created another test case, this time based on Swiss cheese as a filling for pierogi. Everybody in Poland knows that you cannot put Swiss cheese into pierogi.
- 24:19 Szymon Pobiega
- So there is a validation built into the object-relational mapper that prevents this from happening. Every time you want to store an order item of type Swiss cheese, it will fail. Let's see how it fails. So we tried... that's not a duplicate, so we process the message's business logic correctly. We publish messages to the other microservices in that system. Publishing succeeds, everybody else knows we added an item of type Swiss cheese, and then we try to store it in the database. Well, we try but we fail, because our validation logic prevents us from creating an item of type Swiss cheese. And what happens now? As we can see, that other system, the other microservice, received an event and logged "item of type Swiss cheese added to order with ID DEF". What the team wanted to achieve in the first place by using the microservices architecture, and especially an event-driven microservices architecture, is an eventually consistent solution, because eventual consistency is good and so on.
- 25:26 Szymon Pobiega
- What they actually ended up having is an immediately inconsistent system, because we are creating inconsistent results, very predictably in that case, and there is nothing that will make this state consistent again. What they did here is they published a state that will never happen in that system. The invalid state has been broadcast, and that's really bad. So they went back to the drawing board and tried to make sense of what actually happened. If they use publish and then persist, those operations in that order, they risk publishing ghost messages. Ghost messages are messages that carry state that hasn't been persisted and is invalid, so everybody else knows about a state that will never happen. And if they reverse the order and first persist and then publish, then they will not publish those messages in the case of broker failures. So both situations are really bad.
- 26:31 Szymon Pobiega
- So let's try to frame the problem; at least, the red team tried to frame the problem: what are they actually dealing with when trying to process messages coming from the queues? First, you don't want to lose messages, because, well, losing orders is not going to help you build a startup that is going to disrupt the pierogi delivery market. You don't want to lose those messages, and that's fine. Then, you don't want to invoke the business logic twice if you receive a duplicate message. You want the outcome of processing to be the same no matter how many duplicates or copies of the message have been delivered; that's idempotence. And the third thing is equally important: you don't want to leak non-durable state. If you are publishing messages to other microservices or other components in the architecture, you'd better first make sure the state is persisted and then publish it or broadcast it to everybody else.
- 27:25 Szymon Pobiega
- Otherwise, you will end up with an immediately inconsistent state. The solution that the red team found was these two steps. If you're processing a message, first you need to check if it's a duplicate, based on any kind of logic that applies to your business domain. If you think the message is being delivered for the first time, execute the business logic and persist the state change that resulted from it. And then publish the outgoing messages regardless of what happened in the first step. So even if you think a message is a duplicate, you need to publish the outgoing messages based on the state that you persisted in the database. Why? Because technically you cannot distinguish two states. The first is when the broker failed and the message is not a duplicate; the fact that you are dealing with the same message again is caused by the previous broker failure, and you need to send out the messages.
- 28:29 Szymon Pobiega
- And the second is a regular duplicate message. Those two states are the same when you look at the database, and the database is the only thing that you can look at to decide what action you should take. So that was the outcome of the second sprint, or rather sprint one, and then it was time for the demo. This time the red team went first. So they tried to showcase their solution. They submit an order, add some meat to the order (pierogi with meat), execute the duplicate detection code; it's not a duplicate. So they create a new order line and add it to that order. They store the order in the database, so far so good. And then they publish the event to all the other microservices. That's great. But, well, the duplicate generation logic worked well and we have a duplicate message, a message with the same ID. This time the duplicate detection is based on the message ID, just to show you that there are multiple ways to deduplicate messages.
- 29:30 Szymon Pobiega
- So we look at the ID of the message and we look at the database. The database shows here, see, we already processed a message with this ID, so we don't need to process it anymore. What we do is, we log... because we are good citizens and we want to make sure that we log any relevant information. But then we don't exit immediately from the message handler. What we do is, we skip the business logic code and immediately proceed to the publishing of the outgoing messages, because we are not sure whether the first attempt at publishing succeeded. So we need to make sure that the other microservices get this message too, yeah.
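A sketch of the two-step pattern just demonstrated, continuing the hypothetical handler from earlier: deduplicate and persist only on the first delivery, but always publish the outgoing messages derived from the persisted state. The message-ID bookkeeping on the order is an assumption about how the deduplication could be stored:

```csharp
// Continuing the made-up types from the earlier sketch; the incoming message
// now carries its message ID, and the order records which IDs it has seen.
record AddItem(Guid MessageId, Guid OrderId, string Filling);

public async Task Handle(AddItem message)
{
    var order = await orders.Load(message.OrderId);

    // Step 1: execute the business logic and persist the change only if this
    // message ID has not been recorded yet (the duplicate check).
    if (!order.ProcessedMessageIds.Contains(message.MessageId))
    {
        order.AddItem(message.Filling);
        order.ProcessedMessageIds.Add(message.MessageId);
        await orders.Store(order);
    }

    // Step 2: publish regardless of step 1. A duplicate is indistinguishable
    // from a retry after the broker failed on the previous publish attempt,
    // and the persisted state is the only evidence available.
    await bus.Publish(new ItemAdded(message.OrderId, message.Filling));
}
```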
- 30:10 Szymon Pobiega
- So far it looks good. Whatever they figured out in the plan, they executed correctly. But the problem is that there are other teams in that organization that receive events published by those two teams, and they start raising issues that they see duplicates in their systems, in their part of the system. As you can see here, there is nothing wrong apart from the fact that those messages published by the red system have different IDs. So what just happened is that the red team used a random ID generator for their messages, so each message would be assigned a random ID. And if a message duplicate is introduced somewhere in the transport, so the message queue introduces a duplicate, or it happens somewhere in between the message sender and the message receiver, the message ID is a good way to deduplicate messages, because all the copies of the message will carry the same ID.
- 31:15 Szymon Pobiega
- The problem here is still the republishing code, the code that makes sure that even if we suspect a duplicate, we still republish. That code is generating message IDs using the NewGuid method, so those republished messages will have different IDs. So, well, that's not an ideal solution. To make it work correctly, the red team should have used some sort of deterministic ID generation strategy so that the other downstream teams could also deduplicate based on message IDs. But they didn't get the chance to implement that just yet. Now the blue team. Let me skip over the code debugging and just see what's happening here, because they didn't change their code. The only thing they did is they tried to put a database in a different building and use a distributed transaction to send messages between those databases.
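One possible deterministic ID strategy for the republished events mentioned above, hashing the incoming message ID plus a discriminator so that every publish attempt of the same logical event carries the same ID; this is an assumption about how it could be done, not code from the talk:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class DeterministicGuid
{
    // Derives a stable GUID from the incoming message ID plus a discriminator
    // (for example the handler name and an output index), so republishing the
    // same logical event always produces the same outgoing message ID and
    // downstream consumers can deduplicate on it.
    public static Guid Create(string incomingMessageId, string discriminator)
    {
        using (var md5 = MD5.Create())
        {
            var bytes = Encoding.UTF8.GetBytes(incomingMessageId + "/" + discriminator);
            return new Guid(md5.ComputeHash(bytes));   // an MD5 hash is exactly 16 bytes
        }
    }

    // Usage, instead of Guid.NewGuid() on the republished ItemAdded event:
    // var outgoingId = DeterministicGuid.Create(message.MessageId.ToString(), "ItemAdded/0");
}
```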
- 32:06 Szymon Pobiega
- So they create two orders, they submit those two orders, and they were quite sure that it would work correctly. The problem, though, is that they got an error in the log. Eventually the transaction was processed correctly, but there is an error. The error says "Transaction (Process ID 51) was deadlocked with another process and has been chosen as the deadlock victim." And there is nothing wrong with deadlocks; everybody has seen deadlocks from time to time and they happen sometimes. Well, they are inevitable in a relational database. The problem with the blue team, though, was that their system was producing deadlocks in a quite predictable manner. Every time there was more than one concurrent order being submitted, there was a deadlock in the database. Why was that? Well, they had a database in their HQ that represented stores; each store had a representation in the database.
- 33:07 Szymon Pobiega
- And when they were sending messages to the store, the physical location that that store represented, their transaction was locking resources both in the database in HQ and in the remote database, in that remote location. So every time they had two orders that were supposed to be handled by the same storage facility, those two shipping transactions would deadlock, and one of them would fail and eventually be retried. The problem is that if you want to disrupt the global market of pierogi delivery, you'd better have a slightly better processing solution than one order at a time, because it would take you ages to deliver the number of orders that you are trying to deliver. So that wasn't really ideal, and the blue team was really upset. So that all led to frustration.
- 34:04 Szymon Pobiega
- The manager that was at that demo was really not happy with the performance of those two teams. The red team was still struggling with their idempotence code and still struggling to deduplicate the messages coming into their microservices. And while they were struggling with this deduplication code, they were also struggling with the business logic code. They had the deduplication code all over the place, and it was consuming their resources, so they could not actually focus on implementing the correct solution for billing. Well, the blue team, they were struggling with performance really heavily. So that was the frustration: why can't we have both correctness and performance? Here's one way to try to figure out why that happens.
- 34:54 Szymon Pobiega
- The red team meets the performance objectives, that's great. They process messages really fast, and they create garbage really fast. They struggle to deliver a correct solution. Why the struggle? They struggle because they entangle business logic with message delivery logic. They have very many small microservices, and each of them has its own message deduplication code entangled with business logic code. And that's not going to work correctly. Why did they design it in such a way, with multiple tiny microservices that have this logic entangled? They assumed that components live in a hostile environment. They assumed that each of those tiny microservices will live in a separate data center, yeah, possibly a separate data center, and it needs to deal with all the problems of communicating with other microservices on its own.
- 35:56 Szymon Pobiega
- The blue team, they delivered a correct solution but struggled with performance. Why did they struggle? They struggled with performance because they used protocols, such as the two-phase commit protocol, that don't tolerate failures or latency. And there's a sort of irony here: the protocol for distributed transactions doesn't really handle physical distribution very well. You can have a distributed transaction in the same data center or in the same product. You'd rather not try a distributed transaction between physically distributed locations, because it's not going to work, at least not with the two-phase commit protocol, because of the locking nature of that protocol. Why did they use those protocols? They assumed that everything would be working in the same data center, that all the code would be running at HQ, in the server room in the basement, and not be distributed.
- 36:51 Szymon Pobiega
- So both teams made assumptions; they had different assumptions, and all those assumptions led to really poor design. Why? Well, probably because all they'd seen is the hammer they have. If those teams had been structured differently, for example if the people on those teams had been mixed and had different backgrounds, they would probably have arrived at better solutions. Because they didn't mix those people together, and instead grouped them based on their specialty and background, that's what they ended up having. And now let me for a moment stop telling you the story about the teams and look at the coupling in that system, or at coupling in general. We'll come back to the teams in a second. So let's talk about coupling, because in our industry we have very little scientific evidence, and the only thing that we can resort to is some anecdotal evidence and storytelling.
- 37:58 Szymon Pobiega
- I'm telling you a story and I have some anecdotal evidence here. It's not really related to pierogi or distributed systems. That's a picture of the background radiation in our universe, and here's how it relates to coupling. Well, this shows the tiny little differences in temperature in the early universe, a moment after the Big Bang. What resulted from those tiny little differences in temperature or density is stars, planets, and us here. So those tiny little differences in temperature led to matter being coupled or gravitating... well, led to gravitation that created stars and planets. So that's one analogy. The other analogy is an image of the US at night. As you can see, people are not evenly distributed across the country. There are regions where there are people, big cities, and there are regions where nobody lives. So people also tend to gravitate towards one another.
- 39:09 Szymon Pobiega
- People tend to hang out together and they're happy. And the same happens with microservices or components. I know that the original idea of microservices was that each microservice has its own database. I'm yet to see that kind of implementation; I've never seen that in real life. In real life the only thing I've seen is that there is some number of databases and usually many more processes, many more microservices, than databases. Why? Because there are really more reasons why you would want to run or execute the code in different processes than there are reasons for having separate databases. So usually the situation is that there are multiple processes that use the same set of data, and those processes are usually quite highly coupled together. And in that environment, and that was the environment the red team, sorry, the blue team operated in, you should strive to achieve atomic store-and-publish.
- 40:11 Szymon Pobiega
- What it means is, when you process a message, you would like to have a guarantee that when you store the state change, when you persist the data and publish some events to other services, both publishing and storing the data happen atomically, in one transaction, or neither happens. So it's atomically store data and publish messages. That was the property that allowed the blue team to implement their business logic correctly and not struggle with it. But as you've seen in those analogies with the universe and the map of the United States, those groups of people or of atoms are usually divided by empty space. And the same happens in most distributed systems. There is a group of services that operate together and are coupled, and another group that is coupled together, but those two groups are distant.
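A minimal sketch of what atomic store-and-publish can look like when the coupled services share a relational database, essentially the blue team's queue-in-the-database approach: the state change and the outgoing message are written in the same local transaction. Table names are illustrative:

```csharp
using System;
using System.Data.SqlClient;

static class AtomicStoreAndPublish
{
    public static void Handle(string connectionString, Guid orderId, string filling)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            {
                // Persist the state change...
                var store = new SqlCommand(
                    "INSERT INTO OrderLines (OrderId, Filling) VALUES (@id, @filling);",
                    connection, tx);
                store.Parameters.AddWithValue("@id", orderId);
                store.Parameters.AddWithValue("@filling", filling);
                store.ExecuteNonQuery();

                // ...and enqueue the outgoing event in the same transaction.
                var publish = new SqlCommand(
                    "INSERT INTO OutgoingMessages (Body) VALUES (@body);",
                    connection, tx);
                publish.Parameters.AddWithValue("@body", "{\"event\":\"ItemAdded\"}");
                publish.ExecuteNonQuery();

                // Either both rows become visible, or neither does.
                tx.Commit();
            }
        }
    }
}
```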
- 41:12 Szymon Pobiega
- There is a space between them. Sometimes they are distributed physically, and they really operate on different data, and there is no coupling between them. And especially if they are distributed physically, you can't really have atomic store-and-publish. There are no protocols that would guarantee atomic store-and-publish across physical distances. All you can do here is achieve durability. And you should try to achieve durability, because you don't want to reimplement the TCP protocol in your code, right? So you want to make sure that when you send a message, it will get delivered to the destination sooner or later.
- 41:57 Szymon Pobiega
- Another analogy here could be an island. So imagine those microservices that operate on top of the same database as a happy island. They have their own database, they communicate through that database, and they're happy on their island. They don't care about other microservices on different islands. Now, that's what the island can look like. You have a nice palm, there's a sun behind that island, and below the sand you have multiple microservices and the database that they share: happy microservices. But they sometimes need to send messages across to another island. An example here would be that the shipping process needed to send a message over to the physical storage location to complete the delivery of pierogi. So in order for that island to communicate with another island, we introduce message queues, those red message queues at each end of the island.
- 42:54 Szymon Pobiega
- So most of the interaction and most of the job is done within that island. When we need to transition the process to another island, we send a message over this red queue. But how do we send that message? As we've seen, the red team struggled with code that would make sure the message is properly deduplicated. So the trick here is to have a separate responsibility for it: to not mix the responsibility for processing messages within that island with sending the message outside. You need a special technical microservice that will take a message from our queue in the database and send it over this red queue that allows communication between islands. So the full model of an island looks like this. We have a database that is also used as a message queue, as the blue team designed. Then we have those technical microservices, the green gears here, that move messages from the queue in the database to those external queues and allow communication between islands.
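A hedged sketch of such a technical microservice (the green gear): it takes a message from the island's database queue and hands it to the inter-island queue, committing the delete only after the external send, so the worst case is a duplicate on the wire rather than a lost message. All names, including IExternalQueue, are hypothetical:

```csharp
using System.Data.SqlClient;

// Stand-in for whatever broker connects the islands (the red queues).
interface IExternalQueue { void Send(string messageId, string body); }

static class Bridge
{
    public static void RelayOnce(string connectionString, IExternalQueue externalQueue)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            {
                // Take one outgoing message from the island's database queue.
                var receive = new SqlCommand(
                    @"WITH Next AS (
                          SELECT TOP (1) Seq, MessageId, Body
                          FROM dbo.OutgoingMessages WITH (ROWLOCK, READPAST)
                          ORDER BY Seq)
                      DELETE FROM Next
                      OUTPUT deleted.MessageId, deleted.Body;", connection, tx);

                using (var reader = receive.ExecuteReader())
                {
                    if (!reader.Read()) return;   // nothing to relay

                    // Send over the inter-island queue before committing the delete.
                    // If the send succeeds but the commit fails, the message is
                    // relayed again: a duplicate, which downstream deduplication
                    // (deterministic message IDs) is there to absorb.
                    externalQueue.Send(reader.GetString(0), reader.GetString(1));
                }

                tx.Commit();
            }
        }
    }
}
```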
- 44:02 Szymon Pobiega
- What you end up having is a bridge: one island does its own processing and moves the message over the bridge formed by those technical microservices to another island. If you remove all the microservices and databases from the design, you end up with something like this. You have local queues that allow for atomic store-and-publish; those gears, the technical microservices that move messages around, form the bridges (those gray things are the bridges); and another messaging system that allows communication between distributed locations. So, I'm running out of time. I would like to leave you with this message. In 2003, we had polyglot programming. In 2011, we had polyglot persistence, meaning we acknowledged that we might use multiple data stores in a single system. I think in 2019 we need to acknowledge that we need to use multiple messaging frameworks to make sure that we can deliver our solutions in a way that is both correct and performant; no single messaging infrastructure can solve all the problems for us in the complex systems that we have in 2019. So, thank you.