Welcome to the (State) Machine
About this video
“Stateless all the things!”, they say.
In the last few years we’ve been brainwashed: design stateless systems, or else they cannot scale, they cannot be highly available, and they are hard to maintain and evolve. In a nutshell stateful is bad. However, complex software systems need to do collaborative processing which is stateful by definition. Stateless myth busted! Collaborative domains deal with long business transactions and need to interact with distributed resources. The traditional distributed transactions approach, even if tempting, is a time bomb.
This is where Sagas come into play. Sagas allow modeling of complex collaborative domains without the need for distributed transactions and/or orchestration across multiple resources. Join Mauro on a journey to discover what sagas are, how they can be used to model a complex collaborative domain, and what role they play when it comes to designing systems with failure and eventual consistency in mind.
(It’s all right, I know where you’ve been.)
Transcription
- 00:00 Mauro Servienti
- Okay. Good day, everyone. Welcome to the State Machine. Before we get started, just a couple of logistical notes. If you have any questions, there's the Q&A panel in WebEx, or you can drop them into the Slack channel for the conference room, or feel free to ping me directly on Slack. That's not a problem at all. I'll do my best to get back to you as fast as I can after the talk. When I was young, it was around 1989, I didn't have a driver's license back then. I was 16.
- 00:49 Mauro Servienti
- With friends, we used to go to clubs in Milan to attend concerts. That night, the concert was Red Hot Chili Peppers. They weren't that famous back then. We weren't really interested in the band at all. We were going to these clubs just because back then they allowed stage diving. I still like to go to concerts, and I would love to be able to stage dive again, but I've got this problem with security: at my age, I'm not going to be allowed to do that anymore.
- 01:27 Mauro Servienti
- Anyway, back to the topic here. If we were to go to a concert today, we would need to do two things specifically. The first one is buying tickets. The second one, obviously, is stage diving. Going to the concert. From our perspective, going to the concert is basically one single operation: just buying the tickets and then attending the concert itself.
- 02:02 Mauro Servienti
- If we try to look at the concert from the perspective of the company or the group of people behind the concert organization, what they have to do is basically keep calm and make money. That's their main goal. In order to do that, first of all, they have to be able to display available tickets. Some sort of website, or a phone number I can call, to tell me how many tickets are available and then allow me to reserve tickets.
- 02:37 Mauro Servienti
- Once tickets are reserved, charge my credit card, obviously. Once the credit card has been charged and they've been paid, ship tickets somewhere. There are these operations and many more. Obviously, because if you think of a company selling tickets online, they probably sell gadgets as well. They want to track marketing-related information. What kind of concerts do people look at on the website without ending up buying tickets? They want to track user actions on the website.
- 03:14 Mauro Servienti
- One important thing to notice about these kinds of operations is that sometimes the order doesn't matter. Sometimes it does. Let me tell you a couple of stories. A few years ago, I went online to the e-commerce site of a very large and famous device manufacturer. I bought a device. I put that device in the shopping cart. I did the entire checkout process. Finally, I submitted my credit card and hit the checkout button, the pay button.
- 03:56 Mauro Servienti
- I received a nice email saying, "Your order is on the way. We'll get back to you as soon as it's shipped." Generally speaking, what happens with these large e-commerce web shops, like Amazon for example, is that your credit card is charged only when your order is shipped. Not at the exact moment when you buy it, but when they ship the stuff to you. In that case, what happened is that a few days later, I received another email saying, "Your order is on its way. Enjoy your device."
- 04:29 Mauro Servienti
- Literally, a few seconds later, a text message from the credit card company saying, "Operation denied." I immediately said, "Well, it's eventual consistency." The email went out. They're not really shipping anything. Two days later, the shipping courier showed up at my place saying, "Hey. Here is your device." Meanwhile, for two days, the credit card company was texting me saying, "Operation denied. Operation denied. Operation denied."
- 05:01 Mauro Servienti
- Basically, what happened is that they were trying to charge my credit card but they were constantly failing while the device was traveling to my place. I even called them. I called customer support saying, "Hey. I got this but I haven't paid for it."
- 05:19 Mauro Servienti
- The lady on the phone told me, "Yeah, I know. I can see that from my CRM thing. I can see that you got the device and you haven't paid for it because we failed in charging your credit card. You know what? From the process that we have, if it's shipped, it's paid. There's nothing I can do. Enjoy your device." Okay. Thank you very much. That's an interesting point. We assume in this case at least that paying must happen before shipping. Is that always the case? Not really.
- 05:53 Mauro Servienti
- Let's take for example highways in Italy. I guess the same happens in many countries where you pay for highways. When you enter a highway in Italy there is a gate and you get a ticket. When you exit the highway, you can do that talking to a human or to a machine. You insert the ticket and then you put in your credit card. The gate opens and you just go. Your credit card is not charged synchronously as you exit the gate.
- 06:28 Mauro Servienti
- Basically, what they are doing is that they just capture credit card information at that time and the ticket information that you just put in. They later on try to charge your credit card. They assume it will work so they are basically saying commands won't fail. Other than assuming that it will work, they're basically saying usually the amount to be paid is so small that even if we fail in charging the credit card, it doesn't really matter.
- 07:00 Mauro Servienti
- It happens so infrequently that missing a couple of customers doesn't really matter. It happened to me once in probably 25 years. The funny thing was that I was traveling from Switzerland back home to Italy and my credit card had been cloned. It was invalid. When I exited the gate on the highway, they captured the information and later on they tried to charge the credit card but, obviously, the credit card company denied the transaction because the card had been cloned in the meantime.
- 07:35 Mauro Servienti
- They never came back to me. They are basically the same thing. We are paying in both cases for a service. In the first case, it was a device. Essentially, we're paying for something. In the first case, the order is truly important from their process perspective. In the second case, the order doesn't really matter. They prefer to prevent a possible queue at the gate because if something goes badly, basically you have cars queuing up at the gate because they can't get out.
- 08:04 Mauro Servienti
- They prefer to optimize for the successful case instead of risking cars queuing up. Let's go back for a second to our ticket buying system. Let's imagine what a very naïve implementation could be. The first thing we said is that we need to show tickets. We can call something like display available seats. We get back a list of seats or tickets, whatever they are. We say let's take two. We can now reserve seats. We just grab two tickets.
- 08:39 Mauro Servienti
- We have two selected seats, or two selected tickets, and then we try to reserve them. We call reserve seats and what we get back is a reservation that probably contains a reference to our tickets. Along the process, we try to authorize the credit card. If the authorization succeeds, then what we do is basically charge the card and finally ship the tickets. That's the process that we described.
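The naïve sequence described above can be sketched as follows. This is only an illustration under stated assumptions: `FakeBackend`, its method names, and the returned values are hypothetical stand-ins for the reservation, payment, and shipping calls, not an API shown in the talk.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Reservation:
    reservation_id: str
    seats: list[str]


class FakeBackend:
    """Hypothetical stand-in for the remote services; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def display_available_seats(self) -> list[str]:
        self.calls.append("display")
        return ["A1", "A2", "A3"]

    def reserve_seats(self, seats: list[str]) -> Reservation:
        self.calls.append("reserve")
        return Reservation("r-1", seats)

    def authorize_card(self, card: str, reservation: Reservation) -> Optional[str]:
        self.calls.append("authorize")
        # An empty card number stands for a denied authorization.
        return "auth-1" if card else None

    def charge_card(self, authorization_id: str) -> None:
        self.calls.append("charge")

    def ship_tickets(self, reservation: Reservation) -> None:
        self.calls.append("ship")


def buy_tickets(backend: FakeBackend, card: str) -> bool:
    """Run every step in order, inside one sequential procedure."""
    seats = backend.display_available_seats()
    selected = seats[:2]  # "let's take two"
    reservation = backend.reserve_seats(selected)
    authorization_id = backend.authorize_card(card, reservation)
    if authorization_id is None:
        return False  # authorization failed: abort before charging
    backend.charge_card(authorization_id)
    backend.ship_tickets(reservation)
    return True
```

Every step after the seat selection depends on the previous one succeeding, which is what makes the remote calls discussed next so uncomfortable.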
- 09:11 Mauro Servienti
- Because we want to make sure that before we ship, we actually succeed in charging the credit card. The authorization process is just to make sure that there is money on our credit card. This naïve implementation can be as complex as you want. It's essentially this kind of process that we want to design. One of the problems with this kind of implementation is that these are probably remote requests. We're talking to a payment gateway to authorize the card.
- 09:50 Mauro Servienti
- We're talking again to a payment gateway to charge the credit card, giving it the authorization that we received back before. We're probably talking to the shipping courier website or web API, whatever, in order to request the shipment, the pickup of the tickets that need to be shipped. What's the problem with this kind of remote request? The first one is not really problematic, right? If something goes badly, we just abort the operation. What happens with the second one?
- 10:26 Mauro Servienti
- We basically want to orchestrate, in some way, at least two operations. Now, if we only ever have two operations, what can go badly is that the first one, the charging of the credit card, fails, and we're fine. We just roll back everything in some way and say to the user: something went badly, just retry. Whatever. You haven't bought anything. Your shopping cart is now empty and we're fine. What happens if the second one times out?
- 11:00 Mauro Servienti
- The bad scenario is exactly that the HTTP request succeeds but we don't know, because the response times out. The response never comes back. Basically, we're now in an unmanageable state because we have no idea what the scenario is. When we deal with remote requests in such a sequential process, we're always exposed to this kind of behavior. The fact that we really feel the need to wrap this into a transaction is a sign that something smells.
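A minimal simulation of that scenario, assuming a hypothetical `FakeShippingApi` whose `drop_response` flag stands in for a network timeout: the remote side effect happens, but the caller can only report that it does not know.

```python
from __future__ import annotations


class ResponseLost(Exception):
    """Raised when the caller gives up waiting for a response."""


class FakeShippingApi:
    """Hypothetical remote service; drop_response simulates a lost reply."""

    def __init__(self, drop_response: bool) -> None:
        self.drop_response = drop_response
        self.shipments: list[str] = []

    def request_shipment(self, order_id: str) -> str:
        self.shipments.append(order_id)  # the remote side effect happens
        if self.drop_response:
            raise ResponseLost(order_id)  # but the caller never hears back
        return "accepted"


def ship(api: FakeShippingApi, order_id: str) -> str:
    try:
        return api.request_shipment(order_id)
    except ResponseLost:
        # From here the caller cannot distinguish "the request never arrived"
        # from "the shipment was created and only the reply was lost".
        return "unknown"
```

In the lost-reply case the service has recorded the shipment while `ship` returns `"unknown"`, so neither rolling back nor retrying is obviously safe.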
- 11:42 Mauro Servienti
- Usually, the first approach to solving this problem is to turn this kind of naïve implementation into what is generally called a process manager. I'm using the term process manager not necessarily as it's used in the domain-driven design world. Bear with me and we'll see what I mean by process in this case. There's an overlap with DDD but it's not necessarily the domain-driven design concept. We said that we need to select tickets.
- 12:20 Mauro Servienti
- The ticket selection process talks to reservation in order to select tickets. We have a request-response with reservation saying, "Hey. How many tickets do you have? Please reserve two for me." We have a piece of information that says the user wants to check out. I spelled that like an event, so order checked out. It's not an event yet. We'll get to that later. What the checkout kicks off is the order management process. That's the real process.
- 12:59 Mauro Servienti
- What we do is that we reach out to the credit card gateway, the anti-corruption layer. It can have many names but it's basically a payment gateway that allows us to talk to the credit card provider. One of the important things to notice is that it's still request-response but we're doing that using a message. In this case, it's exactly a message on a queue. We're moving away from remote requests as remote procedure calls to a queue. The reservation one might still be a remote request.
- 13:39 Mauro Servienti
- That is handled with an RPC kind of approach, but now we're moving to remote requests over a queue. We talk again to reservation, again in this kind of RPC style, and then we go back to the credit card saying, "Okay. Charge the card now." We authorized it. We confirm the tickets with reservation and then we charge the card. Finally, we talk to shipping. Basically, what we did is that we isolated remote requests into their own processes, talking to them through a queue.
- 14:16 Mauro Servienti
- We eventually solved the original problem. For what might go wrong, let's use the same example as before. We send out a message to shipping, the shipping gateway. The shipping gateway talks to the shipping courier API, through HTTP again. The response doesn't come back. What happens is that the process manager, the order manager, never receives a response message. Now, the order manager isn't in doubt anymore. Shipping didn't happen. I know what to do now.
- 14:53 Mauro Servienti
- I can roll back the entire process, whatever that means for this kind of process, and send out to the user that we failed in shipping. There is no doubt anymore. There is doubt, but that doubt is confined to shipping. Shipping could have said to the operations team, "Hey. I'm having an issue." The operations team comes in and tries to fix the issue.
- 15:16 Mauro Servienti
- From the process perspective, isolating the doubt into its own separate process, or call it an endpoint, where it can do just one single operation, allows us to solve the problem of dealing with remote requests that can fail. We haven't solved all the problems. Some orchestration is still required. Because if we fail in shipping, we need to roll back the transaction with the credit card. There's this chat happening between credit card and shipping.
- 16:04 Mauro Servienti
- The fact is that orchestration is not the only issue we should get rid of. Let's try to have a look at the process manager from the storage perspective. Let's imagine a very simple orders manager where we have an orders table that contains an order ID, a shipping ID, a shipping status and many more columns. So there will probably be a payment ID and the payment status and the number of tickets and so on and so on.
- 16:36 Mauro Servienti
- Now, let's imagine that someone comes along and says, "You know what? During the weekend, we were at the golf course playing golf. We realized that a very interesting feature could be to allow people to collect the tickets at the venue instead of shipping tickets to their place." We have a new requirement now. That is collect tickets at the venue. The problem is that if we look at this kind of schema, collecting tickets at the venue has an issue. There's no shipping.
- 17:12 Mauro Servienti
- Now it means that shipping ID and shipping status don't really represent the full reality of the system. One option is that in order to support this kind of new thing, we have to make shipping ID and shipping status nullable. That collides with the fact that, from the high-level perspective of the code, we cannot really deal with that. Because what does it mean if there's no shipping ID?
- 17:41 Mauro Servienti
- If it is null, does it mean that there will be no shipping because it's collected at the venue, or that shipping hasn't happened yet? We need to change the schema again, adding a third column that represents the kind of shipping: the shipping type, on top of the shipping status and the shipping ID. This kind of thing is basically telling us that the process manager is not that different from a punch card.
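To make the ambiguity concrete, here is a small sketch of such a row. The field names and the `ShippingType` values are assumptions chosen for illustration; the talk only names the order ID, shipping ID, shipping status, and a shipping type column.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ShippingType(Enum):
    SHIP_AT_HOME = "ship_at_home"
    COLLECT_AT_VENUE = "collect_at_venue"


@dataclass
class OrderRow:
    order_id: str
    shipping_type: ShippingType             # the third column added later
    shipping_id: Optional[str] = None       # had to become nullable
    shipping_status: Optional[str] = None   # had to become nullable


def describe_shipping(row: OrderRow) -> str:
    # Without shipping_type, a null shipping_id could mean either
    # of the first two outcomes below.
    if row.shipping_type is ShippingType.COLLECT_AT_VENUE:
        return "no shipping: collected at the venue"
    if row.shipping_id is None:
        return "not shipped yet"
    return f"shipment {row.shipping_id}: {row.shipping_status}"
```

Every reader of the table now has to repeat this interpretation logic, which is the kind of cost the punch card comparison points at.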
- 18:11 Mauro Servienti
- When the business comes along and says, "Hey. We have a new requirement. That new requirement should go there." Basically, if we picture a process manager like a punch card, what happens is that in order to fit a new requirement in a very specific place, we have to throw away the entire punch card. There isn't much anyone can do. We cannot really put ourselves in there and fix a hole in the punch card and change some numbers and try to fit the new requirement in. It's going to be a mess.
- 18:44 Mauro Servienti
- Not to mention a very complex process. Buying tickets online is a very simple use case. If you picture a real-world process, it's very easy to view a process manager like this. What might happen, obviously, is that the more features you add, the more the process manager becomes the big ball of mud. It's easy to see what the problem is. We have tickets sitting there in the middle. The big ball of mud is now tickets.
- 19:23 Mauro Servienti
- We have all the services or all the actors trying to do something with tickets, like charging the credit card, because tickets represent an order. Now, there's insurance saying, "I need to make sure that these tickets are delivered, paying insurance for these tickets, because they are for La Scala in Milan. They are 400 bucks per ticket." We have marketing trying to do stuff with tickets and then we have shipping trying to do stuff with tickets.
- 19:50 Mauro Servienti
- By the way, there's customer care down there. You can barely see it. We're Italians so customer care doesn't really matter. Anyway, there are all these services trying to do stuff with our tickets and they are fighting with each other. That's why we need an orchestrator. There's this guy sitting there looking at all these people fighting for tickets, trying to do stuff with the tickets. You can see your code and your processes and your tribes doing stuff with the tickets.
- 20:22 Mauro Servienti
- The orchestrator acts like a semaphore, or like the distributed transaction coordinator, in an attempt to bring some order. There's a lot of chaos going on. The orchestrator is sitting there trying to fix the stuff. By the way, there are two people, a lady and a guy, sitting back there. Who are they? A failover, because obviously, if you have an orchestrator and it fails, then you need a replacement for the orchestrator.
- 20:57 Mauro Servienti
- You need to have someone sitting there doing nothing just in case the orchestrator fails. Both the naïve approach and the process manager approach can be seen as fragile, even if the process manager approach, by using messages, removes some of the temporal coupling between actors in the system. First of all, they violate the single responsibility principle. In both cases, we're doing too many things. We're reserving tickets.
- 21:30 Mauro Servienti
- We're charging the user's credit card. We're shipping stuff, all in a single place. They're a single unit of deployment. When someone comes and says, "I need a new feature," we need to deploy the entire stuff. Being monolithic, they cause a problem for the operations team because they have to deploy everything every single time something changes. Even if that something is very tiny.
- 22:02 Mauro Servienti
- If we're dealing with a large code base and we're using this kind of approach, then we start having conflicting changes and merge conflicts. That's a given. If they are a single unit of deployment, they probably all live in the same repository. We have multiple teams touching the same code base, or at least multiple people touching the same code base, who start changing stuff they should not really change and causing problems for other people. There might be contention and performance bottlenecks.
- 22:36 Mauro Servienti
- It's obvious that as soon as we have a performance problem, we have to scale up or scale out everything. We cannot really focus on what we need to scale out. Say that the charging of the credit card is the problem. Tickets or reservation don't really matter. We have to scale out everything just because it's a single unit of deployment. The reality is that there's no spoon. There's no process manager. What can be really tricky in this kind of scenario is talking to the domain experts.
- 23:16 Mauro Servienti
- If we start analyzing this kind of process, blindly trusting a domain expert, they'll start talking about the order management system. That's natural because that's the way they view things. It's super easy for the engineer that is always in the back of our mind: as soon as they say the words "we have an order management system," we picture that as public class order management system. They say, "We can reserve tickets." Public async task reserve tickets and things like that.
- 23:58 Mauro Servienti
- We immediately transition from what they talk about to a design. The thing is that this leads very easily to a monolithic thing. Because that's the way they describe things. They're describing the user's mental model. They're describing the way they picture a user doing their main process, that is selling tickets. When we listen to them, we should be listening carefully to the words and the nouns and the verbs they use and try to chop it up.
- 24:35 Mauro Servienti
- What we really want to do is try to isolate things that should not stay together. Because our final goal, our final destination in some way is that we want to have autonomy in collaborative domains. We started all this journey by saying we have three actions, four actions to do. List tickets, reserve tickets, deal with payments, charge a credit card, authorize a credit card, ship tickets. Finally, stage dive (that's another problem). Think about if there's IKEA in your country or at least there is in mine.
- 25:19 Mauro Servienti
- Whenever you go to IKEA, you enter the shop. You walk around. You pick up stuff. While you're walking through the shop, you make a note of where in the warehouse the stuff is to be picked up. You reach the cashier. You pay and then it's up to you to decide. Should I try to fit everything in my car? Nice. There's a shipment desk over there. Let me go to the shipment desk with my order and tell them, "Can you ship this stuff to my place?" That's it. IKEA doesn't really know anything about that shipping thing.
- 25:58 Mauro Servienti
- They don't care. The shipping thing doesn't really care about what you have done inside the IKEA shop. They just care about the list of things they need to ship. If you have the list of things, they assume you've paid. Otherwise, you cannot have the list of things. It means that they don't care what happened at the cashier. You might have paid with a credit card. You might have paid with cash. You might have paid with a check, whatever. From their perspective, it's not relevant at all.
- 26:37 Mauro Servienti
- The two domains we're talking about, the IKEA furniture shop and the shipping thing that happens to live inside the same shop, are completely autonomous from each other. One is feeding the other. There's a dependency between the two, very similar to the device thing we talked about before, but they are autonomous. There's a very limited amount of information that needs to flow from one domain to the other, if any, in order to allow the second one, in this case the shipment, to ship stuff to my place.
- 27:13 Mauro Servienti
- If we want to transition that concept into architecture, we can't really do it using the process manager. Because if you think about it, when you're walking inside an IKEA shop there isn't anyone watching over you, like a crow that flies over your head and says, "What are you doing? You picked up this. You picked up that. Let me add it to the bill and while I'm here, let me try to charge the credit card and see if there's money." That doesn't really happen. There isn't really anyone orchestrating that.
- 27:47 Mauro Servienti
- You're walking through a shop. You reach the end, you're at the cashier. You pay and, technically speaking, you're even allowed to go away and leave it all there. No one cares. When we want to model that as an architecture concept, we can use sagas. Sagas are designed to achieve autonomy in collaborative domains. Let's start by defining what sagas are.
- 28:18 Mauro Servienti
- The best definition I've found of sagas is basically sagas are multiple workflows, each one providing compensating actions for every step of the workflow where it can fail. Let's try to dissect this definition. The first bit is multiple workflows. You probably already envision that. We talked about reserving tickets. We talked about charging credit card. We talked about shipping tickets. We can transition this to reservation, finance and shipping.
- 29:02 Mauro Servienti
- Those are the three domains or three sub domains or three services or three microservices. Call them as you will. It doesn't really matter in this case that we're dealing with. If we try to dive in and try to picture out what the process could be, here is what it could look like. We have available tickets. In some way, we displayed available tickets on the website. Let's say that there's a website. We can select tickets. We select tickets and we do that using a reservation.
- 29:37 Mauro Servienti
- Selecting tickets means that we're going to put tickets, let's say, into a shopping cart. Fine. When we're happy with the ticket selection, then we start checking out. At this point, reservation is not really interesting anymore. It's as if you were at the cashier. We reach the cashier. We have some stuff in the shopping cart and maybe some stuff on the list of things that we want to pick up later at the warehouse, but we have stuff in the shopping cart in some way.
- 30:04 Mauro Servienti
- What we do is reservation checked out. Because we are at the cashier. At this point, the ball, or the hot potato, goes to the cashier. That is finance. What finance does is do something with our credit card and tell reservation, "You know what? The payment has been authorized. There's money on the credit card." By authorizing the payment, we lock that money. That money is available to us. That's not necessarily true. We'll get to that later.
- 30:37 Mauro Servienti
- At this point, reservation can say, "I'm happy. Here is your order." Here is the checkout list. That is the thing that you can later use to go to shipping. Finance says payment succeeded, because we already locked the money on the credit card. Now that we have these two pieces of information, the pickup list and the payment receipt, we can go to shipping, or the shipment desk in the IKEA shop, and do our stuff with them.
- 31:10 Mauro Servienti
- Ship stuff to our place, or ship tickets in this case, if we are still talking about delivering tickets. What do we have here? We have three workflows, because you can imagine that inside these big boxes there are many things happening. They are workflows in some way. They are state machines. Take shipping for example, or finance, which is easier. Payment is being authorized. That's the state the payment is in.
- 31:42 Mauro Servienti
- When it transitions to another state, it's going to be payment successful or payment failed, because something might go wrong. Shipping as well. The shipment has been prepared. The shipment has been delivered. The shipment has been sent or has been received by the customer. Here is the customer's signature and so on. All these workflows basically participate in the ticket ordering saga. The saga is the relationship between these workflows.
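A payment workflow seen as a state machine could look like the following minimal sketch. The class and state names are assumptions for illustration; only the three states (being authorized, successful, failed) come from the talk.

```python
from __future__ import annotations

from enum import Enum


class PaymentState(Enum):
    AUTHORIZING = "authorizing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Which states each state is allowed to move to.
ALLOWED_TRANSITIONS: dict[PaymentState, set[PaymentState]] = {
    PaymentState.AUTHORIZING: {PaymentState.SUCCEEDED, PaymentState.FAILED},
    PaymentState.SUCCEEDED: set(),  # terminal
    PaymentState.FAILED: set(),     # terminal
}


class PaymentWorkflow:
    """Hypothetical payment workflow that starts in the authorizing state."""

    def __init__(self) -> None:
        self.state = PaymentState.AUTHORIZING

    def transition(self, new_state: PaymentState) -> None:
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        self.state = new_state
```

Keeping the allowed transitions in one table makes illegal moves, such as a failed payment later succeeding, fail loudly instead of silently corrupting the workflow.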
- 32:21 Mauro Servienti
- To be honest, at Particular Software we prefer to call them policies instead of workflows. Why? Because they make sense from a business perspective. Stakeholders don't understand the word workflow. It's much better to talk about the payment policy or the shipping policy or the reservation policy. It's much easier for them to understand what we're talking about. Sharing this common lingo is a nice way to achieve and improve collaboration between different groups.
- 33:05 Mauro Servienti
- If any of you has already been to an EventStorming master class, I did one with Alberto a couple of years ago, you can probably already see what I'm aiming at. These four events, reservation checked out, payment authorized, order created, payment succeeded, are what the EventStorming people call pivotal events. A pivotal event is an event that allows the entire system, in this case the ticket reservation saga, to transition to a different state.
- 33:46 Mauro Servienti
- You can probably imagine that inside finance or shipping itself, there are many things happening. There are many events that are not relevant at all from the outside. Who cares? What's important are these pivotal events, or what in the SOA world we call pub/sub events. These are the cross-boundary or cross-service pieces of information that flow around the system to allow the system to transition from state A to state B and state C and maybe back to state B, whatever the process is.
- 34:22 Mauro Servienti
- Let's now have a look inside the black box. What's inside these sagas? What's inside each workflow? Let's start again from the available tickets. We go to reservation to select tickets. We put tickets in the shopping cart and then we proceed to checkout and then back again to reservation. Reservation publishes the reservation checked out event. That goes to finance, which initiates the payment process. Here is the payment workflow starting.
- 34:53 Mauro Servienti
- The first step of the payment workflow is to send out a message to the payment gateway that talks to the credit card provider saying, "Hey. Can you authorize this credit card with this amount for me?" If that succeeds, there will be a payment authorized event that goes back to reservation, which creates the order and then publishes the order created event, which again is interesting for finance. It might be interesting for other people as well, like marketing for example.
- 35:21 Mauro Servienti
- Marketing is interested in knowing how many carts convert into orders. Finance goes back to the payment gateway saying, "Hey. Charge this credit card for me. Here is the previous authorization ID you gave me." Finally, it publishes the payment succeeded event, which goes to shipping like the order created one, and finally shipping talks to the courier gateway saying, "Hey. Can you send me a shipping code because I need to send tickets to a customer?" Is that more complex than the naïve approach? Sounds like it is, right?
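The chain of pivotal events can be sketched with a tiny in-memory publish/subscribe bus. Everything here is an assumption made for illustration: the `Bus` class, the event names as strings, and the hard-coded IDs. The gateway calls are assumed to succeed and are left out, and shipping reacts to payment succeeded only, as a simplification.

```python
from __future__ import annotations

from collections import defaultdict, deque
from typing import Callable

Handler = Callable[[dict], None]


class Bus:
    """Hypothetical in-memory pub/sub bus that delivers events in FIFO order."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.queue: deque[tuple[str, dict]] = deque()
        self.log: list[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.queue.append((event, payload))

    def run(self) -> None:
        while self.queue:
            event, payload = self.queue.popleft()
            self.log.append(event)
            for handler in self.handlers[event]:
                handler(payload)


def wire_services(bus: Bus) -> list[str]:
    """Subscribe each service to the events it cares about; return shipped orders."""
    shipped: list[str] = []

    def finance_authorizes(msg: dict) -> None:
        bus.publish("PaymentAuthorized", {**msg, "authorization_id": "auth-1"})

    def reservation_creates_order(msg: dict) -> None:
        bus.publish("OrderCreated", {**msg, "order_id": "order-1"})

    def finance_charges(msg: dict) -> None:
        bus.publish("PaymentSucceeded", msg)

    def shipping_ships(msg: dict) -> None:
        shipped.append(msg["order_id"])

    bus.subscribe("ReservationCheckedOut", finance_authorizes)
    bus.subscribe("PaymentAuthorized", reservation_creates_order)
    bus.subscribe("OrderCreated", finance_charges)
    bus.subscribe("PaymentSucceeded", shipping_ships)
    return shipped
```

No function in this sketch knows the whole process. Each service only knows which event it reacts to and which event it publishes, and the overall sequence emerges from the subscriptions.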
- 35:59 Mauro Servienti
- It's not really. What we are doing is simply letting the complexity emerge. We haven't changed the process at all. The process is exactly the same process we're trying to design using the naïve approach and the process manager as well. Now, it's clear what the complexity is. We simply let the complexity emerge and be clear from the process design perspective. Now, let's go back to the requirement that we got previously. Collect the tickets at the venue.
- 36:37 Mauro Servienti
- We have this nice process, where we have multiple workflows collaborating with each other, and we claimed there's autonomy in there. So collecting tickets at the venue is something that should affect shipping only. That's the only thing we should have to change in order to allow collection at the venue. Let's see if it's possible. We said that shipping, in order to be triggered, needs both order created and payment succeeded.
- 37:13 Mauro Servienti
- What we can do is basically change the shipping policy and augment it by asking, "What's the delivery mode?" Is it collect at the venue or ship at home? If it's not collect at the venue, it's ship at home, and we're doing the same process as before. We're sending a message to the delivery courier. They will come, pick up the tickets and deliver them. We're fine. If it is collect at the venue, then we're sending a message internally to shipping.
- 37:43 Mauro Servienti
- There's another branch happening here, that is store for venue delivery. What happens is that there's another policy that starts. That acts as a batching collection policy for every event that there will be. The store for venue delivery message contains the ticket IDs and the event ID. There will be one batch shipping at the venue policy for every future event. That batch shipping policy will collect tickets and, when it's time, deliver them.
- 38:22 Mauro Servienti
- Once one of these two operations has happened, we can consider the original shipping policy as complete. From the outside perspective, shipping is done. If we immediately shipped, going on the right side, so there's no collect at the venue but ship at home, we're fine. We're happy. It's done. Tickets are on their way.
- 38:50 Mauro Servienti
- If we decided to store for venue delivery, someone else, the batch shipping at the venue policy, has picked up those tickets and stored them in the batch that needs to be shipped in the future. We don't need the original shipping policy anymore. It can be considered done. We can even publish an event like shipment completed. Now, we have this long-living policy that, sooner or later, will decide to ship tickets. We call a courier.
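The shipping change described above can be pictured with a minimal Python sketch. It is not the code from the talk's demo and not the NServiceBus API; every class, method and message name here is hypothetical, and the outbox list simply stands in for messages sent to a queue.

```python
from dataclasses import dataclass, field


@dataclass
class ShippingPolicy:
    """One instance per order; done as soon as either branch has run."""
    order_id: str
    event_id: str
    ticket_ids: list
    delivery_mode: str  # assumed values: "ship_at_home" or "collect_at_venue"
    order_created: bool = False
    payment_succeeded: bool = False
    completed: bool = False
    outbox: list = field(default_factory=list)  # stand-in for outgoing messages

    def on_order_created(self):
        self.order_created = True
        self._try_ship()

    def on_payment_succeeded(self):
        self.payment_succeeded = True
        self._try_ship()

    def _try_ship(self):
        # Shipping starts only once both events have arrived, in any order.
        if self.completed or not (self.order_created and self.payment_succeeded):
            return
        tickets = tuple(self.ticket_ids)
        if self.delivery_mode == "collect_at_venue":
            # Internal message picked up by the per-event batch policy.
            self.outbox.append(("StoreForVenueDelivery", self.event_id, tickets))
        else:
            self.outbox.append(("ShipWithCourier", self.order_id, tickets))
        # From the outside, shipping is done in both cases.
        self.outbox.append(("ShipmentCompleted", self.order_id))
        self.completed = True


@dataclass
class BatchShippingAtVenuePolicy:
    """One long-living instance per future event."""
    event_id: str
    ticket_ids: list = field(default_factory=list)

    def on_store_for_venue_delivery(self, ticket_ids):
        self.ticket_ids.extend(ticket_ids)

    def on_time_to_ship(self):
        # When it's time, hand the whole batch to a courier.
        return ("ShipBatchWithCourier", self.event_id, tuple(self.ticket_ids))
```

In this sketch only shipping code changes to support the new delivery mode; reservation and finance never see the StoreForVenueDelivery message.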
- 39:24 Mauro Servienti
- We give them a batch of tickets and that courier will do their job. Once they're done, they reply back saying, "We're done. Tickets are there," and then we can kick off, for example, marketing saying, "We need to send an email to all the customers that decided to collect tickets at the venue in order to let them know how to collect them on the day of the event." At that point, this sub-workflow can be considered completed. From the outside, who cares?
- 40:03 Mauro Servienti
- No one knows that we changed the entire process in shipping to accommodate this new requirement. Reservation doesn't care. Finance doesn't care at all. Other parts of the system might not care at all. The second bit of the definition was each one providing compensating actions for every step of the workflow where it can fail. Let's see what kind of failures we can have. Let's say that when payment authorized is raised by finance, reservation fails. The reservation failure might be a bug.
- 40:44 Mauro Servienti
- In processing that message, there's a bug that makes reservation go nuts, or reservation is down. It's not available. It's not really a failure, but no one processes the message. What happens is that the order created event is never published. Finance never charges the credit card. The payment succeeded event is never published. That means that shipping does nothing.
- 41:13 Mauro Servienti
- The interesting thing about this kind of single responsibility principle application is that now, a very simple failure in one part of the system is fixing stuff for us in some way. Because nothing happens. There isn't a need to have an if somewhere saying, "Has the reservation failed?" Yes. I don't need to ship. Because the simple fact that events are not published means that the shipment never happens. We have a problem. We authorized the card. What's the problem with authorizing the card?
- 41:51 Mauro Servienti
- It's not really a problem. If you've ever been to a hotel, especially in the US, for example, what they do is that they authorize your card when you check in for incidentals. They're basically saying, "You're going to stay here a week. Here is 1000 bucks charged on your card, authorized on your card." That money is locked. You cannot use the card anymore for that amount of money. They do nothing when you check out.
- 42:18 Mauro Servienti
- Why? Because at the end of the month, automatically, the credit card provider will discard authorizations that haven't been converted into a payment. They don't really care. We can basically do the same thing. We could say who cares? We authorized the card. Nothing happened. Sooner or later, the authorization will be released. Or we can decide to be polite with our customers and try to compensate for the problem. Let's see what we can do. The problem obviously lives in reservation and finance.
- 42:54 Mauro Servienti
- Because we published the payment authorized but we never received the order created event from reservation. That means that we have no way to process the second step that is charging the credit card. The first question and that's a very interesting temptation is do we need reservation to publish a failed event? It's very tempting to say yes. I'm very sorry to say that that's coupling.
- 43:28 Mauro Servienti
- Whenever we ask someone else outside of our boundaries to change their processes in order to accommodate one of our needs, we're basically causing coupling. It's not code coupling. It's not binary coupling in this case. It's a process or business coupling but it's still coupling. As we said before, reservation might be down. If reservation is down, it won't process the payment authorized event. It won't even be able to publish a reservation failed event. Still, we won't receive anything from reservation.
- 44:06 Mauro Servienti
- What can we do? There's an interesting thing that we can do. Whenever we authorize the card, if that authorization succeeds, what we can do is that along with publishing the payment authorized event, we can set a timeout. In this case, I'm saying set a 48-hour timeout to release money. What's a timeout? It's a message to ourselves. We're basically asking the system, "Hey. Deliver this message back to me in 48 hours." It can be 48 seconds. It can be 48 weeks, whatever the process dictates.
- 44:47 Mauro Servienti
- 48 hours later, what happens is that the message comes back. It reappears in the queue. What finance can do is ask itself: Did I receive the reservation event? Did the order check out? Did reservation tell me something? If the answer is no, we release money. That's it. Easy peasy. If the answer is yes, we do nothing. We don't need to do anything. Why? Because in the meantime, it means that reservation came back to us with the order created event.
- 45:23 Mauro Servienti
- Finance already charged the credit card, published the payment succeeded event and marked the process specific to this order as completed. The timeout fires and expires for a process that doesn't exist anymore. We're happy again. There is really nothing that we need to do. Before going ahead, let's dive a little bit into what a timeout is. Let's try to picture that in code. We said there's the reservation checked out event that is handled by finance.
- 46:01 Mauro Servienti
- On the left side of the screen, you can see the finance code. What finance does is send out an authorization request that goes to the payment gateway. The payment gateway handles the authorization request. It authorizes the card by talking to the credit card provider and replies with a new authorization response providing the authorization. Assuming that the authorization response means that it succeeded, we're back to finance.
- 46:31 Mauro Servienti
- Finance receives the authorization response and says, "I'm going to publish the payment authorized event. In the same transaction (I'll get to that in a second), I'm going to request a timeout for 48 hours so that someone will get back to me in 48 hours." When the timeout fires, we can simply check the boolean saying, "Was it reserved?" Yes. No. Do whatever we need to do based on the process. There are a couple of things I'd like to highlight here.
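Since the slide code is not part of this transcript, here is a minimal Python sketch of the finance flow just described. It is an illustration under assumptions, not the demo code and not the NServiceBus saga API: the class, the message names and the outbox list are all hypothetical, and the charge step is collapsed into a single handler for brevity.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class FinancePolicy:
    """One instance per order, holding only what finance needs."""
    order_id: str
    authorization_id: str = ""
    order_created: bool = False  # the boolean checked when the timeout fires
    completed: bool = False
    outbox: list = field(default_factory=list)  # stand-in for outgoing messages

    def on_authorization_response(self, authorization_id):
        self.authorization_id = authorization_id
        # Both operations are meant to be dispatched in the same transaction.
        self.outbox.append(("Publish", "PaymentAuthorized", self.order_id))
        self.outbox.append(("RequestTimeout", timedelta(hours=48), self.order_id))

    def on_order_created(self):
        # Reservation came back to us: charge the card and finish.
        self.order_created = True
        self.outbox.append(("Send", "ChargeCard", self.authorization_id))
        self.outbox.append(("Publish", "PaymentSucceeded", self.order_id))
        self.completed = True

    def on_timeout(self):
        # The message to ourselves came back after 48 hours.
        if self.completed or self.order_created:
            return  # already charged, nothing to compensate
        self.outbox.append(("Send", "ReleaseAuthorization", self.authorization_id))
        self.completed = True
```

Note that finance never asks reservation for a failure event; the absence of the order created event at timeout time is enough to decide to release the money.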
- 47:07 Mauro Servienti
- Single responsibility principle at its maximum. We're sending out a message to do one single operation. If that operation is not idempotent, let's imagine that we cannot really authorize the card multiple times because the third party we're talking to is not idempotent, we're going to do this one single operation only once. If it fails, we do nothing. We can reply or publish an event saying the authorization failed, or whatever. That will stop the process.
- 47:43 Mauro Servienti
- We wait for a human to come in and probably pick up the phone and call the credit card provider. "Hey. What happened with this transaction?" Secondly, these two operations, the publish and the request timeout, are transactional. Most of the transports or queuing systems provide a way to have transactional dispatch to the queue. Because both the publish and the request timeout are messages that will be sent to the queuing system. Publish is going to behave like a regular message.
- 48:20 Mauro Servienti
- For example, on RabbitMQ, it goes to an exchange and then it's broadcast to subscribers. The request timeout is a delayed delivery that goes into the native delayed delivery infrastructure on RabbitMQ and is then resurrected into the original queue when the timeout expires. They have to be transactional. Otherwise, we're back to the problem of what happens if the publish succeeds and the request timeout fails? That can be a huge problem. Let's have a look again at what happens to the table schema.
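One way to picture why the publish and the request timeout must go out together is the following Python sketch. It only illustrates the idea of buffering outgoing operations and handing them over as one batch once the handler succeeds; it is not how any specific transport implements it, the in-memory transport is assumed to deliver a batch atomically, and all names are hypothetical.

```python
class InMemoryTransport:
    """Stand-in for a queuing system; send_batch is assumed to be atomic."""

    def __init__(self):
        self.delivered = []

    def send_batch(self, operations):
        self.delivered.extend(operations)


class HandlerContext:
    """Buffers outgoing operations instead of sending them immediately."""

    def __init__(self):
        self.pending = []

    def publish(self, event):
        self.pending.append(("publish", event))

    def request_timeout(self, delay_hours, message):
        self.pending.append(("delay", delay_hours, message))


def process(transport, handler, message):
    context = HandlerContext()
    handler(context, message)  # if this raises, nothing is dispatched
    transport.send_batch(context.pending)  # all or nothing


def on_authorization_response(context, message):
    order_id = message["order_id"]
    context.publish({"type": "PaymentAuthorized", "order_id": order_id})
    context.request_timeout(48, {"type": "ReleaseMoneyTimeout", "order_id": order_id})
```

With this shape, a handler that fails halfway never leaves a published event without its matching timeout.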
- 48:57 Mauro Servienti
- Because we highlighted this as a problem when we were talking about the process manager. The main difference between using a process manager from the storage perspective and using sagas or multiple workflows is that we're basically splitting responsibilities at the storage level as well. We can now let each service boundary, subdomain, microservice, whatever you want to call them, have its own storage. They can even be different technologies.
- 49:34 Mauro Servienti
- I'm picturing stuff here as if they were relational tables. There is nothing preventing finance from using MongoDB, shipping from using SQL and reservation from using whatever, Cosmos DB on Azure. It doesn't really matter. The only thing they care about is that they are sharing the same conceptual primary key. Everyone contains the order ID, the primary key that represents the order that's coming in from the checkout request process.
- 50:07 Mauro Servienti
- It's basically shared across all the services. The important thing is now that everyone can evolve independently. If we try to fit the requirement of shipping tickets at the venue here, we still need a schema change in the shipping table. That's not going to be a problem for anyone else. We reach autonomy at all levels. Instead of cutting the system only into horizontal layers, we're basically cutting it into vertical layers. That's why I like to call them vertical slices.
- 50:44 Mauro Servienti
- Each slice, reservation, shipping and finance, is completely isolated from all the others. They are just talking to each other using those pivotal events or pub/sub events in order to communicate state changes to the other actors in the system. Let's recap a little bit what sagas bring to the table. The business process is distributed now. We have many parties participating in the business process. As we pictured, we have finance, reservation and shipping.
- 51:22 Mauro Servienti
- Each one of them respecting the single responsibility principle. If we want to be even more descriptive, we could say that inside each one of them, we can have many workflows like we talked about when we introduced the new feature in shipping where we had multiple workflows again inside shipping. Each one respecting the single responsibility principle doing one single thing. Evolution is not conflicting anymore. It's simpler. We have independent units of deployment.
- 52:01 Mauro Servienti
- We have independent scale-out units. Those are the important bits of what sagas bring to the table. Obviously, in the interest of time, I have no time to run a demo but there's a demo available that shows all this running with multiple processes starting up. It's built in .NET Core and it's available on GitHub. Here are the links to the demo. By the way, here are the links to the slides. They're already available.
- 52:28 Mauro Servienti
- If you want to hear Udi Dahan, who is my boss, by the way, talking about sagas, you can go to that link. There are a couple of hours of video of Udi diving into what sagas are. Let's do a quick recap. Pitfalls first because obviously, we don't have silver bullets. We've seen that in transitioning from the naïve implementation to the process manager to sagas, we gain something at every step. What's the problem with sagas? The main issue with sagas, I don't really define that as an issue.
- 53:08 Mauro Servienti
- The most important thing we need to think about is that now, the state is distributed. We need to monitor the distributed state. That might be problematic. Try to picture yourself on the phone with a customer saying, "Where's my order?" You have to look into multiple storage units in order to be able to rebuild the order status and understand what the status of the order is.
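To make the distributed state concrete, here is a small Python sketch of answering "Where's my order?" when each service keeps its own storage keyed by the shared order ID. The three dictionaries stand in for whatever databases the services use, and the status values and function name are hypothetical.

```python
# Each service owns its storage; only the order ID is shared across them.
reservation_store = {"order-1": {"status": "checked_out"}}
finance_store = {"order-1": {"status": "charged"}}
shipping_store = {}  # shipping has not seen this order yet


def order_status(order_id, stores):
    """Rebuild an order's overall status by asking every service's store."""
    return {
        service: store.get(order_id, {}).get("status", "unknown")
        for service, store in stores.items()
    }


stores = {
    "reservation": reservation_store,
    "finance": finance_store,
    "shipping": shipping_store,
}
print(order_status("order-1", stores))
```

The lookup itself is trivial; the cost is that someone has to know about, and have access to, every store involved in the process.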
- 53:38 Mauro Servienti
- On the contrary, in a process manager with a single table, and in the naïve approach as well, there's probably a single table where everything is stored, so it's super easy to understand what the problem is. As soon as we start distributing and trying to fit autonomy into the system, we'll obviously get something back. That is all the advantages that we listed plus a pitfall. In this case, the fact that the state is distributed. What are the main takeaways?
- 54:14 Mauro Servienti
- Behaviors define how to design processes, not data. Follow the coupling and not the data. Listen to events and not to nouns. Data are a kind of side effect. We need to store data somewhere but the thing that is important is, which kind of processes are touching which kind of data? It's the process boundary that defines which data should be in those boundaries. Everything else that is not touched by that process is not needed in there.
- 54:54 Mauro Servienti
- Once we have identified processes, we can chop them up and start calling them services or microservices and defining boundaries or bounded contexts, whatever kind of architectural approach or style you're using. The second one is: use delayed messaging to model time. Instead of coming up with batch jobs, for example, "Let me go through the entire database of millions of orders we've received to see if someone hasn't paid." What can possibly go wrong?
- 55:26 Mauro Servienti
- What we can do is basically, for every transaction, like we did for example for the credit card, we can send ourselves a delayed message saying, "Hey. Check this transaction." Was it paid? Yes. Otherwise, release money. We can model every single interaction using delayed messages to make decisions in an asynchronous world in the future. The last bit, and this is probably the most important one: there is no such thing as orchestration. That's really, really important.
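As a rough illustration of modeling time with delayed messages instead of a batch job, the following Python sketch keeps delayed messages in a priority queue driven by a simulated clock. It is a toy in-memory stand-in, not a real transport; the class and method names are hypothetical.

```python
import heapq
import itertools


class DelayedQueue:
    """Toy delayed-delivery queue driven by a simulated clock (in seconds)."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker for equal due times
        self.now = 0.0

    def send_delayed(self, delay_seconds, message):
        due = self.now + delay_seconds
        heapq.heappush(self._heap, (due, next(self._sequence), message))

    def advance(self, seconds):
        """Move the clock forward and return the messages that became due."""
        self.now += seconds
        due_messages = []
        while self._heap and self._heap[0][0] <= self.now:
            due_messages.append(heapq.heappop(self._heap)[2])
        return due_messages


queue = DelayedQueue()
paid = {"tx-2"}  # transactions that were paid in the meantime

for transaction_id in ("tx-1", "tx-2"):
    queue.send_delayed(48 * 3600, {"check": transaction_id})

# 48 hours later each message comes back and is decided on individually.
to_release = [m["check"] for m in queue.advance(48 * 3600) if m["check"] not in paid]
```

Each transaction carries its own reminder, so no job ever has to scan the whole orders table to find the unpaid ones.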
- 55:59 Mauro Servienti
- Most of the time, there's no need for an orchestrator. There's no need for someone or something overarching the entire process. There's no need for a process manager. We can always model this kind of relationship between autonomous components in a system to model what a business process is without any need to build an orchestrator. I usually tell people, whenever you think orchestration, think coupling. Coupling is bad. We can talk about coupling later but definitely the point of the talk.
- 56:39 Mauro Servienti
- Orchestration is not really needed. We can model every single process using autonomous components that can talk to each other using just events or state changes and setting up compensating actions in order to overcome problems that they might face while dealing with the process itself. My name is Mauro Servienti. I'm a Solution Architect at Particular Software. We are the makers of NServiceBus. Here is some contact information you can reach me at. Thank you very much. It was a pleasure.
- 57:17 Mauro Servienti
- If you have any questions, obviously, I'll be available in Slack or in the Q&A window now, I guess for the next five minutes, because then there's the next speaker coming up. Or ping me directly. I'll be around for the next couple of days and you can reach out to me even privately on Slack. No problem. We can even set up a call if you want to discuss face to face, which is much better than text. Thanks again.