Messaging: The fine line between awesome and awful

00:00:03 Laila Bougria

Hello, hello. How are you all doing?

Great.

How was your lunch? Cool. Well, you know what? I have a rule about speaking right after lunch. I'm actually half Spanish and half Moroccan. In both of these cultures, we take siestas after our lunch. I don't get a siesta, you don't get a siesta. Okay? That's the rule. Welcome everyone. So around a decade ago, probably ten-ish years ago, I remember that I was working for a customer and we were basically building a sort of retail system. We started from scratch. They had a bunch of physical stores, really successful, and now they wanted to build an online offering, and it was really fun to work on this from scratch. At the time, we basically built a very simple architecture to get started because we didn't fully understand the requirements yet and were still figuring them out. So a couple of years in, we ended up with something like this, a monolith.

Now, I'm assuming that you are way more familiar with this structure than I am because this is based here in Oslo, Norway, right? And this is called The Monolith. Now, I use this picture on purpose. Why? Well, because over the last years, monolithic architectures have gotten this sort of negative connotation. Oh, monolithic architectures are bad. But the thing is that they also bring quite a lot of benefits, especially when you're building a completely new application. You don't understand your business requirements yet. You're not entirely sure what you're going to build, and you're kind of figuring things out, right? Also, from a developer experience point of view, it's pretty simple. Usually you have one big solution, it's an F5 run debug experience. Also, versioning wise, you deploy this to production, you don't have to care about versions because you're deploying the whole thing to production all at once every time.

So you don't have to care about, oh, is this component compatible with that component? So there are a lot of benefits, and actually this architecture, it worked quite well for us for a couple of years. But then demand started growing and it did start to show some cracks as it really grew beyond the bounds that we designed it for. Now, those cracks basically started to manifest as failed requests, sometimes we would lose requests. That was tricky. It was like, oh, the user said that they did something, but we can't see that anywhere. And we also saw that there was a very high strain on the single database that was supporting this application. Now, in the worst-case scenario, sometimes everything just came crashing down because we couldn't deal with the load that was being thrown at us in the moment.

Now, given all that increased demand, we realized, okay, this architecture was really good for us at the time. Our reality is really different right now, so we need to evolve this thing. And we started to think about how are we going to evolve this architecture to be able to support the load that we are experiencing in production today? Now, at the time, it was like event-driven systems and message-based systems coming up. So we quickly concluded, let's use messaging. It's going to solve all of our problems because it brings actually a bunch of benefits. Now, first of all, it actually will give us better performance, right? Because now we can gradually scale individual components that can deal with all of that load, and we will see our performance rise. We will also see that we have increased resilience because now all of those messages are stored on a broker. If something goes wrong, we can always retry. So that's going to be a lot better.

Also, easier decoupling. One of the main reasons that we love event-driven systems or message-based systems is it allows us a lot more flexibility in decoupling our systems. And like I said, it allows us to just scale the components that need it. We don't have to spend a lot of money scaling up an entire monolithic application, which can become quite costly. So we were all over the moon, super excited, this is going to fix all of our problems, and we immediately jumped in without really considering all of the challenges that now we would have to address as well. And that's when we started to bump in quite a few walls. Now, first of all, one of the things that we saw is that the system became slower. Wait a second. Our whole goal was to make the system faster, right? Well, it was slower and significantly so, so that was not great.

Another problem we ran into is that the UI became incredibly inconsistent. We had customers sometimes raising support cases and saying, "Hey, I just did this, but the system doesn't reflect it," and there was a lot of confusion and it was very painful. One of the things we also observed is that sometimes we would receive duplicate messages, and the thing is we didn't really account for that. So that's when we started to see failures and even side effects sometimes. Another thing that we did not think about upfront is that sometimes messages could arrive out of order, and the problem in that case is that we didn't really see technical failures like exceptions. No, it sort of turned into side effects, like stuff that just... You would look at the data and you would say, this can't happen. How did this even get in here? It was ugly.

And then finally, we also found it incredibly hard to troubleshoot because now we had basically messages flowing around the system and this going from there and there, and it was a lot of cognitive load to be able to understand what the system was actually doing. Now we were basically just fighting these things as they happened and just learning while we were doing with an application in production, with customers putting in orders. Quite stressful. And as we were just hot fixing and just working our way around things, after a while we saw that our application became a little bit like this, a distributed big ball of mud. Now, I mean, you have to agree that the monolith from before looked a lot better than that, right? So we had to admit that we had actually made things a lot worse.

So my name is Laila Bougria. I'm a solution architect and software engineer. I work at a company called Particular Software where we build NServiceBus. That's a messaging middleware technology. So you could argue that messaging occupies my day most of my days, right? But even before I was working at Particular, I was out in the field building these systems trying to learn how to incorporate architectural styles like this, and that's exactly what I want to share with you today, like all of those learnings that I experienced the hard way so that hopefully you don't have to. Okay? So let's zoom in on all of these individual problems one by one, starting with the system is slower. After all of those wonderful over-the-rainbow promises that our system was going to be faster, it was actually slower.

Now, in order to understand why that happened, why did the system actually become slower? We also need to understand how the application used to work before we transitioned to a message-based architecture. And one of the things to consider specifically is the main communication pattern, and that was request-response. Now, it doesn't matter whether you're using gRPC or you're using HTTP calls or you're even still using WCF in some of your applications, right? The underlying pattern is the same. It's request-response. You have synchronous communication happening between two components. Now, the idea is that you basically have a sender that sends or requests some information or requests to do something directly to the receiver, and there's a blocking operation until that receiver processes that request and provides us with an answer, and then we continue. Right?

Now, we had been using this pattern for years, so it was only logical for us to start doing this over messaging, and that's where the slowdown actually starts to show. Synchronous request-response looks a little bit like this: you have a direct connection between the producer and the consumer and you're just waiting for the result. When you start to do this over messaging, it looks very different, because now you have a producer sending a message that goes onto the request queue, and the consumer will process it when they are available and have time to handle that message. The answer goes into another message on another queue, the response queue, which again is routed back to the producer, which will also process it when it's available, when it has the capacity to deal with that.
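To make the extra hops concrete, here is a minimal sketch using Python's standard `queue` module as a stand-in for a broker. The function names and the hop counter are illustrative assumptions, not part of any real messaging library, and a real broker would add network and scheduling delays on every hop.

```python
import queue


def handle_request(order_id):
    """The receiver's work; identical in both communication styles."""
    return f"stored:{order_id}"


def synchronous_call(order_id):
    # Direct connection: the caller blocks on a single round trip.
    return handle_request(order_id), 1


def call_over_messaging(order_id):
    # Two in-memory queues stand in for the request and response queues.
    request_queue = queue.Queue()
    response_queue = queue.Queue()
    hops = 0

    request_queue.put(order_id)        # producer -> request queue
    hops += 1
    message = request_queue.get()      # request queue -> consumer
    hops += 1
    response_queue.put(handle_request(message))  # consumer -> response queue
    hops += 1
    reply = response_queue.get()       # response queue -> producer
    hops += 1
    return reply, hops
```

Both functions return the same answer, but the messaging version passes through four hand-offs instead of one call, which is where blindly replacing every synchronous call starts to cost time.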

So, many more hops, more infrastructure, and if you just go in and replace every synchronous call with an asynchronous call, then you can see how you make things slow. Right? Now, one of the things that I always like to stress, especially to people who are sort of new to building event-driven systems, is that it's important to understand that now you've shifted from using synchronous communication to asynchronous communication. And I always like to use silly analogies. So let me tell you about this time where my mom got her first cell phone.

At the time, I of course already had my own cell phone, right? Those were the times. But yeah, I remember us sort of explaining to my mom how she should start using this, and we taught her, this is how you make a phone call. And boy did she know how to make phone calls because she called me all the time. Really, all the time. And I love her, but it was a little bit overwhelming because the thing is that sometimes I didn't have time to pick up. I wouldn't have heard the phone. Or sometimes I was ignoring the phone. Sorry, mom, love you. The thing is that at some point I was really overwhelmed and picked up less and less.

But the problem is that that also had a counter reaction in her. It would completely hijack her day. She'd be super worried asking herself if I was okay, what was going on? So also she was experiencing the side effects. Now, at some point, I came home and I was like, mom, let me show you how to use text messages. Now, the idea that I want you to take away from this is that even in an event-driven system, even in a message-based system, not all communication is suited to be asynchronous, right? Some communication still is useful to do in a synchronous way. If you have someone waiting for that result immediately, use synchronous communication. You can combine them both, right? It requires a shift in thinking to see where each type of communication is best suited. Now, with that being said, let's consider which communication patterns we have available in a message-based or event-driven system. And there are three different patterns to choose from.

We have one-way communication, we have request-response and publish-subscribe. Now, I always enjoy using real-life analogies to explain these types of things because it makes it a little bit more fun for me and also hopefully for you. So what I'm going to do is use our family dinner time to explain how these different patterns work. So my husband is the one that usually cooks at home. He enjoys doing that, and I enjoy him doing that. So at some point, he will in the evening look at me and he will send me a message and say, "Hey, I'm going to make dinner." Now that message is directed specifically at me. He wants me to be aware. He's not expecting a response. He doesn't need me to get up or do anything. I can just continue with household or helping the kids do their homework, continue on. He just wants me to be aware.

But at some point, dinner is close to ready and he'll pop out his head from the kitchen and say, "Hey, dinner in five." Now, at that point, we're using the request-response pattern because he does expect a response from me. He wants me to basically get up, get with the kids, take them to the kitchen, make sure that everything is on the table, get everyone ready so that when the food is on the table, we can all start. It's still asynchronous communication because it's not that he's standing there like, come on, are you coming over or what? As long as I respond within an acceptable amount of time, it's fine. I can just come over and get things done.

And then finally, if it was a good day, and this applies mostly to my youngest, and she ate well, which is not every day, unfortunately, then my husband and I will look at each other and we'll say, yeah, it's an ice cream type of day. Now, at that point, we will publish an event stating that there's ice cream available. Now, why are we using publish-subscribe? Well, because we're just making the kids aware that they can go get an ice cream if they want. They're big enough to walk over to the freezer, open it up, take their own ice cream. We're not doing this for them anymore. And I don't really care if they heard me. That is their problem. But I can promise you that they're subscribed to the ice cream available event for sure.
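The three patterns from the dinner analogy can be sketched with a tiny in-memory bus. This is only an illustration under assumptions: `InMemoryBus` and its method names are hypothetical, and everything runs synchronously in one process, whereas a real broker would deliver the messages (and the reply) at some later point.

```python
from collections import defaultdict


class InMemoryBus:
    """Hypothetical in-process bus, for illustration only."""

    def __init__(self):
        self._handlers = {}                    # message type -> exactly one handler
        self._subscribers = defaultdict(list)  # event type -> zero or more subscribers

    def handle(self, message_type, handler):
        self._handlers[message_type] = handler

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def send(self, message_type, payload):
        # One-way: directed at a specific receiver, no reply expected.
        self._handlers[message_type](payload)

    def request(self, message_type, payload, on_reply):
        # Request-response: the reply comes back as its own message.
        # Here it is immediate; with a broker it would arrive later.
        on_reply(self._handlers[message_type](payload))

    def publish(self, event_type, payload):
        # Publish-subscribe: the publisher neither knows nor cares who listens.
        for subscriber in self._subscribers.get(event_type, []):
            subscriber(payload)
```

Note that `publish` returns nothing and succeeds even with no subscribers, while `send` and `request` fail loudly if nobody handles the message; that difference mirrors the expectations each pattern carries.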

Now, the thing is that before we can even consider which pattern is actually the best to start to incorporate in our systems, we have to engage in one of the most important activities when actually splitting up a monolith. And that is decoupling. Because the thing is that when you read about message brokers and message-based applications and event-driven applications, you get this idea that decoupling is just going to magically happen. It's not. This is an activity that you need to engage in. You need to go back, look at your code and find the service boundaries that will allow you to split things up into separate components, or you will still end up with the same highly coupled mess, only now over messaging: a distributed ball of mud.

Now, the thing is, again, whether you're using messaging or not, once you start building a distributed application, where you have multiple components interacting with each other, these tips will be really important for you to consider. So let's consider a couple of tips that can be helpful when you want to start to untangle application code. Now, we're going to use our trusted place order process because we all sort of know how this works, right? It's comfortable. Now we want to place an order. We want to get some payment, package, deliver all that, adjust our stock, and then also check if our customer is now a gold customer or something like that. Now, the thing is that when we engage in a conversation with our business stakeholders, we could quickly come up with a story like this, right? First we do the storing of the order, then we're going to charge the credit card and so forth, because when we engage in conversations, people tell us things in a certain order and we immediately accept it. It's one of those underlying biases in people.

So one of the things that I like to actively do is, as a first step, eliminate all the ordering from your head. Just write down all of those things as individual steps that have nothing to do with each other, and rather understand the ordering prerequisites and those sorts of things by asking questions. Right? Now, one of the things that I usually tend to ask is what are the things in this workflow, if you will, that tend to naturally happen at different points in time? And then you might say, well, packaging and shipping, because that requires some manual intervention, someone coming in, putting stuff in a box, printing a label, and so forth. Right? But it could also be ordering more stock because maybe within this organization, we've decided to always have a moment on Friday evening to order additional stock for the upcoming week. That's one part.

Another thing is that you also need to start to think about what is actually the order? So for example, do we want to adjust our stock only after we've billed the order? Well, no. We want to immediately do that when we package the order so we can keep our stock as up to date as humanly possible. Maybe the same thing we could consider with verifying our customer status. In this case, the requirement is that we verify our customer status after the entire order has been fulfilled. But in another retail business, it might be immediately after we charge the credit card. So it's always important to revalidate these assumptions even though we think we know the domain.

Another thing to ask yourself is, am I introducing any technical prerequisites? For example, I've had the habit for years that I would store the data first and then continue so that I was able to link to the data. Store the order, then charge the credit card so I can link the payment to that order. But is that really necessary? Because if you would consider pre-assigning the order ID even before it's been stored in the database, then it doesn't matter whether you store the order first or not. Right? This is where I feel that years and years of using relational databases can sometimes stand in our way and actually force ourselves into situations that provide us less flexibility and more complexity. If we pre-assign the order ID, we can link up things later and enforce that relationship at a later point when the order is then available.
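A minimal sketch of pre-assigning the order ID, assuming a UUID is an acceptable identifier. The message classes `StoreOrder` and `ChargeCreditCard` and the `place_order` function are hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StoreOrder:
    order_id: str
    items: Tuple[str, ...]


@dataclass(frozen=True)
class ChargeCreditCard:
    order_id: str
    amount_cents: int


def place_order(items, amount_cents):
    # The ID is generated up front, before anything touches a database,
    # so neither message has to wait for the other to be processed.
    order_id = str(uuid.uuid4())
    return (
        StoreOrder(order_id, tuple(items)),
        ChargeCreditCard(order_id, amount_cents),
    )
```

Because both messages carry the same `order_id`, the payment can be linked to the order later, regardless of which one is handled first.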

And then finally, we also need to understand all of those individual steps. What would happen if one of the steps fails? So for example, if we're able to store the order, but something fails when charging the credit card, do we then want to go and delete the order from the database? Probably not. No. We just want to send the customer an email and say, hey, your credit card bounced. Can you please update your payment details? If we're able to charge the order, but we don't have sufficient stock, do we immediately want to refund? Most companies don't do that. Once they have your money, they're going to keep you waiting for a couple of days while they back order. But it's important to basically identify the steps that can succeed on their own and also understand if one of the steps failed, what are the compensating actions that I need to take in those other parts that have already been completed?
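The two failure examples above can be written down as an explicit mapping from a failed step to its compensating actions. This is a sketch only: the step names and action names are hypothetical labels, and the right compensations always come from the business, not from code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepFailed:
    order_id: str
    step: str  # hypothetical step names: "charge_credit_card", "reserve_stock"


def compensating_actions(failure: StepFailed) -> List[str]:
    if failure.step == "charge_credit_card":
        # The stored order is kept; the customer is asked to fix the payment.
        return ["email_customer_to_update_payment_details"]
    if failure.step == "reserve_stock":
        # The payment is kept; the missing stock is back-ordered.
        return ["back_order_stock"]
    # Anything not discussed with the business is surfaced, not guessed.
    raise ValueError(f"no compensation defined for step {failure.step!r}")
```

In neither case is the already completed step undone, which is the point: each step can succeed on its own, and a failure elsewhere triggers a follow-up action rather than a rollback.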

And then finally, there's also the data. That also really matters because the thing is that I have been talking about an order for a while and nobody's put up their hand of, wait, wait, what do you mean with an order? I don't understand what that is. Because an order is something that we can all reason about. We have a conceptual idea of what that means. But when we are trying to figure out our service boundaries, we need to actually break that apart and start to think more in terms of the attributes of an order and which things actually belong together. Because in the context of, for example, a sales service, we would care about the order reference. Who ordered it? When was it ordered? But in the context of a payment, we only really care about the payment method and we care what the total order amount was and whether they maybe used a gift card or a voucher or something like that.

In the context of a package, we want to know, okay, how many items are there basically ordered? Are there any fragile items that we have to take care of during packaging? Those are completely different concerns. And during shipment, we want to know how many packages are we actually going to ship so that we can understand that. What are the package dimensions? Did they maybe choose expedited shipping or something like that? And then finally, in the context of an invoice, usually we want to know did they ask for a physical invoice, a printed one, or do they want it delivered to their mailbox? And are they maybe VAT exempt? Because then we may need to adjust our invoice. It's really important to start sort of stepping away from that entity-based thinking, and in essence, start thinking about the attributes and which attributes tend to basically change and evolve together.
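The attribute-based view of an order can be sketched as one small data class per context. All class and field names here are illustrative assumptions derived from the examples above; the only thing the contexts share is the order identifier.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class SalesOrder:
    order_id: str
    customer_id: str        # who ordered it
    ordered_at: datetime    # when it was ordered


@dataclass
class PaymentOrder:
    order_id: str
    payment_method: str
    total_amount_cents: int
    voucher_code: Optional[str] = None


@dataclass
class PackagingOrder:
    order_id: str
    item_count: int
    has_fragile_items: bool


@dataclass
class ShippingOrder:
    order_id: str
    package_count: int
    package_dimensions_cm: List[Tuple[int, int, int]]
    expedited: bool


@dataclass
class InvoiceOrder:
    order_id: str
    printed_invoice: bool   # False means deliver it to their mailbox
    vat_exempt: bool


def shared_attributes(*views):
    """Return the attribute names that every given view has in common."""
    return set.intersection(*({f.name for f in fields(v)} for v in views))
```

Calling `shared_attributes` on all five classes yields only `order_id`, which is a quick way to check that no attribute has leaked into more than one boundary.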

So finding the right service boundaries also means that we need to understand how that data is evolving. So which behavior is likely to change? What is likely to change is something that we want to isolate from stuff that doesn't change a lot. When it comes to what data, we also want to ask what data tends to change together? If you think about a product, for example, who's ever worked in a system that has a product table? Okay, almost everyone, right? Now, a product has a name, it has an image, it has a price. If the price changes, is it likely also going to affect the product name? Not really, right? But if the product name changes, is it likely to affect the product image? That's more likely if you're doing a rebranding or something like that. So it becomes important to start to think about which attributes change together and keep that data together. So if the data doesn't change often, you can isolate that. And the data that changes more often, you can isolate that away as well. Okay?

Also, what data and behavior tends to depend on each other? For example, I've seen a retail business where they actually had completely different pricing for end customers or customers that were basically also retail, reselling what they were buying from us. It's not that it was VAT exempt, it was completely different pricing. So basically the price now depended on the customer type. Well, if there's a dependency, then maybe we should store that data together. That's the point. And finally, also ask yourself what data pertains to the same transactional boundary? Because if you're looking for high consistency between certain data, then you have to put it together because otherwise the best you're going to get is eventual consistency, right? I've seen this problem all the time. Sometimes it feels hard to accept, but if you want high consistency, it should be part of the same service boundary. It should be stored in the same database.

Now, one of the things that can really help with this is practicing anti-requirements. Who's heard of that before? Not many hands. Okay. So the idea is basically that you're going to think about the attributes of a concept like an order or a product, and you're going to start asking silly questions to basically try to find where are you able to cut things apart? Like the example that I said previously, if I change the price, is the image of the product also going to change? No, of course not. Laila, do you even know what we're doing here? If people react that way, great, because you just found a place where you can safely split the data apart and it's going to be okay, right? So that's basically the type of conversations that you want to engage in.

So, now of course, this is not something that we can do alone. We have to engage with our business stakeholders. Now, one of the most successful projects that I have worked on in my life is when we migrated a complete banking system from a mainframe to a modern .NET application. What I think made that successful is the fact that they actually moved people that were in the business working with customers on a daily basis that were looking for a different challenge, put them on the IT department, and I could just turn around and ask them questions. We need high collaboration between the engineers, between our architects, between our software engineers and our business stakeholders continuously asking these questions. Just scheduling a meeting and then figuring it out in an hour, that's probably not going to work. This requires a repetitive investment. So event storming can also be a great technique to try to figure out how things work together. And your goal here is to uncover any false assumptions that you may be making about the business domain.

So keep asking questions even if you feel annoying. You're doing your job right when you feel that way, promise. And finally, keep doing this. That's why I always say, if you have the opportunity to pull some of your business stakeholders into the same physical room where you are working, do it. I promise it's going to be a great exercise. So after doing this for a while, we ended up somewhere here, which looks very different than that initial picture that we had, right? Now, it's only when we have this figured out that we could also start to think about, okay, how are we going to communicate in between all of these services? And like we said earlier, we have one-way communication, request-response and publish-subscribe to choose from.

Now, the thing is that there's one here that is always super popular, publish-subscribe. Why? Anyone? Because it helps decouple, right? But the thing is that with anything good in this industry, it's also sometimes overused. And that brings me to the passive-aggressive publisher. Let me quickly circle back to the family dinner analogy. Like I told you, my husband tends to make dinner. I really appreciate that, so I clean up the kitchen after. It's only fair. We're a team. We collaborate. Now the thing is, there are some of these days that I swear I walk into the kitchen and I'm like, did you use every pot and pan that we had in the house? Really? Now, at that point, I might decide to say, okay, I'm going to publish an event stating that the kitchen is messy. That's valid, right? I mean, it's a fact. It's a state that the system has arrived at. The kitchen is messy, and it's already happened. It's an event.

Now, the biggest problem here is that as a publisher, I have an expectation, first of all, that my husband is listening to me, that he's subscribed, but even worse, that I expect him to do something. I expect him to get up, come to the kitchen and actually help me clean up the mess. Now, I think we can all agree that this type of communication, not really good for your relationships. Well, guess what? Not really good for your systems either. So that's why it's really important to always keep in mind that you should never use publish-subscribe when you expect something specific to be done from within the context of your own service boundary. If you need a response with any data to continue when you publish an event, no. Then again, passive-aggressive communication. And finally, if you need any control over who receives or subscribes to that event, also not a good fit. Because again, as a publisher, you should never even care.

From your perspective, it shouldn't even matter if you have any subscribers, you should not care. Okay? Now the thing is that if you run into situations like these, the best thing is to still use command-driven communication. Send a message and make your intent clear. Make the coupling clear because coupling is still there. You just made it invisible by using publish-subscribe, but it is still there and will come back to bite you.

Okay, so let's recap this first chapter. I know it was the longest one, so bear with me. Now, as we said, we want to decouple as much as we can by finding service boundaries. Actually, I don't really like the word decoupling because we don't want to decouple. We want to manage coupling. Something that is decoupled does just not work together. So it's about balancing and managing the coupling. Find all of that behavior, the data dependency, stuff that tends to change together, keep that together even though it may feel weird. Don't name your services upfront because that's just going to make it harder to make stuff fit in there that doesn't really fit in there. Be driven by these types of dependencies instead of concepts that you know in the real world. And also pinpoint the transactional boundaries of your services so that you can understand where you need the high data consistency. And finally, that will enable you to choose the right communication pattern to use to communicate across those different services.

That brings me to the next problem, when our UI became inconsistent. Now, the thing is that all of these things were previously executed in a synchronous manner, and now we're using asynchronous communication. And that started to cause some glitches in the systems, like a user would do something and they wouldn't immediately see the effect of what they did, and they would immediately think that something went wrong. Right? Now, the thing is that this is something that really we have to accept a little when we're moving to asynchronous communication because now we're getting a promise that something will be done, but we don't really have any certainty of when that will be done. So there are techniques that we can apply to basically adjust our user experience.

Now, the first one is to adjust the language. If previously you had a system where you placed an order and it said, congratulations, your order is now on the way, well, the thing is, what does the user then expect? That their order is on the way, right? But if you tell them, well, thank you very much, we've received your order and we will get to it as soon as we can. Well, then you're setting completely different expectations for your user. They're then not going to keep refreshing; they'll be more likely to step away and trust that you'll get to it. So language is really important here.

But there's also another option that we can use, and that's basically sort of creating the illusion of progress. Because what users really miss when they do something, they want to see the result of that. They want to see that I did something. They don't want that to be invisible. So one of the things that we can do is, once we send over a message and ensure that it has been successfully delivered to the broker, we have that promise that it will get processed at some point. Now, what we can do then is use a little bit of creative freedom to basically show the user progress even though it hasn't actually happened yet. Because if you think about the order example again, they just gave us all of their information. They just said what they ordered, we know what they paid, we know where they want it delivered because they just gave us all of that information.

So we could just show them that information. Like most of the systems I've seen, there's a post request and then immediately a get request to get the same information back from the database to then show it to the user. Is that really needed? Not really. You could just immediately do that and assume that it will get stored at some point. Okay? That gives you a little bit of flexibility there with asynchronous communication. You could even keep that around in some kind of local storage or a cache for a limited amount of time, of course, because of course this data is going to become stale, and at some point you can't rely on it.
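A minimal sketch of that idea, assuming a plain list stands in for the broker and that appending to it represents a confirmed delivery. The function name `submit_order`, the form fields, and the wording of the status text are illustrative assumptions.

```python
def submit_order(form, broker_outbox):
    # Assumption: a successful append means the broker has accepted the
    # message, so we have the promise that it will be processed eventually.
    broker_outbox.append({"type": "PlaceOrder", **form})

    # Build the confirmation view from what the user just gave us,
    # instead of issuing a GET to read it back from the database.
    return {
        "status_text": "Thank you, we've received your order "
                       "and will get to it as soon as we can.",
        "items": form["items"],
        "total_cents": form["total_cents"],
        "delivery_address": form["delivery_address"],
        "confirmed_by_backend": False,  # this view is optimistic and can go stale
    }
```

The `confirmed_by_backend` flag is a reminder that this view is only valid for a limited time; if it is cached, it should expire once the real data is expected to be available.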

And that's why in asynchronous systems, we also need to start thinking about defining SLAs, service level agreements, because with all of those individual steps that we have now, how can we manage something like, oh, well, if it takes longer than 10 seconds for the order to be stored, something is wrong? If we're using synchronous communication, that's something that we could manage with a timeout, for example, but now we can't do that anymore. So it becomes important to start to think about each individual step and ask yourself, what is the longest amount of time for this to be done that would be acceptable to our business stakeholders? And then of course, we need to enforce them. We can't just think about them. We need to enforce them. And that is where delayed messages come in. Who's ever used delayed or future messages? Okay, a couple of hands.

Cool. That's great because I actually find that this is an incredibly powerful and yet very underused feature in most message-based systems. But basically the idea is that you're going to send a message with a delivery date in the future, and this is supported out of the box by many message brokers out there. You can use this in Azure Service Bus, you can use this in Amazon SQS and even RabbitMQ if you use their delayed delivery plugin. And there are middleware frameworks out there, wink wink, that also do this. But the idea is basically that for every step that you execute, you're going to calculate the SLA expiration date immediately when you send the request. So now you're requesting for that order to be paid and you're stating, okay, this has to be paid two days from now, otherwise my SLA is expired. So what you then do is as you send the message to get the order paid, you're immediately going to also send a message two days into the future.

Now, when that message then arrives, you're just going to verify, hey, that order, 1, 2, 3, has that been paid already? Yeah? Okay, we're good to go. Has it not been paid? Okay, now we maybe need to cancel it, it's been two days. Now, doing it this way is great because it allows you to also recover from technical failures, like maybe the payment provider being down. But also it allows you to recover from business type of failures. Right? For example, the customer's credit card bounced. Everything is technically working, but they don't have any money on their account left. And given two days, you can give them some flexibility to update their payment information. Right? So the idea is that you can then take any appropriate action at the point in time where the SLA expires.
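
The SLA check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InMemoryBroker` is a hypothetical stand-in for a broker with delayed delivery, and the message names (`ChargeOrder`, `PaymentSlaExpired`) and order statuses are invented for the example.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(order=True)
class Scheduled:
    deliver_at: datetime
    seq: int
    message: dict = field(compare=False)


class InMemoryBroker:
    """Hypothetical stand-in for a broker that supports delayed delivery."""

    def __init__(self):
        self._queue = []
        self._seq = 0

    def send(self, message, deliver_at):
        self._seq += 1
        heapq.heappush(self._queue, Scheduled(deliver_at, self._seq, message))

    def due(self, now):
        """Pop every message whose delivery time has passed."""
        out = []
        while self._queue and self._queue[0].deliver_at <= now:
            out.append(heapq.heappop(self._queue).message)
        return out


PAYMENT_SLA = timedelta(days=2)


def request_payment(broker, orders, order_id, now):
    orders[order_id] = "awaiting_payment"
    broker.send({"type": "ChargeOrder", "order_id": order_id}, deliver_at=now)
    # Sent at the same moment as the request: the SLA check, two days out.
    broker.send({"type": "PaymentSlaExpired", "order_id": order_id},
                deliver_at=now + PAYMENT_SLA)


def handle(orders, message):
    if message["type"] == "PaymentSlaExpired":
        # Has the order been paid in the meantime? If not, take action.
        if orders[message["order_id"]] != "paid":
            orders[message["order_id"]] = "cancelled"


start = datetime(2024, 1, 1, 12, 0)
broker, orders = InMemoryBroker(), {}
request_payment(broker, orders, "123", start)
request_payment(broker, orders, "456", start)
orders["123"] = "paid"  # only order 123 gets paid within the SLA
for msg in broker.due(start + PAYMENT_SLA):
    handle(orders, msg)
```

After the loop, order 123 is still `paid` and order 456 is `cancelled`. Cancelling is just one possible reaction; the handler is the place for whatever action the business considers appropriate.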

All right, so to recap this part, it's really important when we transition to an asynchronous based communication type of system that we adjust the language in our user interface. The way we communicate with our users is also going to significantly change, even the way the user interface interacts is also going to be impacted. We can also here and there create the illusion of progress, we have to be really careful then to also think about those SLAs and to enforce them as well. And we can use also delayed messages to enforce them and make that a little bit more easy for us as well.

Okay. The next problem, things occur out of order. I feel like I'll never hear the end of this problem. Now, doesn't that look like an amazing sock drawer? I would love for my sock drawer to look like this, but the question is, do I need it to look like this? Does it stop being functional if I just intermingle all the colors all over the place? Not really, right? Let's reconsider this order workflow. Now, this is a sort of flow chart type of thing, so there's already an order sort of in there, but there's also a requirement hidden there because we only want to package the order once the order has been both charged and stored in the database. That's our business prerequisite. Okay? Now, the challenge with doing something like this in a more decoupled system is that those concerns are being handled by different services. So how would we know when both of those things are true?

It's problems like these that I usually see at the basis of people seeking out ordered messaging. These are the type of things. It's not really that you need an order, it's that you have difficult prerequisites to manage. Something to keep in mind whenever you think like, oh, it needs to be ordered. Now, the thing is that we could try to make this ordered by sending out the messages in the order that we expect them to be processed. And this will work actually probably 97% of the time or so. But then there's the 3% and then the system is on its head because there are many reasons that could cause your messages to arrive out of order. First of all, latency, one of the fallacies of distributed computing. If you've never heard of that before, we have books on this at the stand, no marketing, just a really good read. So if you've never heard of those, come by at the stand and I'll give you one very gladly.

But it could also be processing time. You're sending out two messages. First A, then B, but it takes a lot longer to process message A than it takes to process message B. So subsequent messages will appear out of order. It could also be concurrency, right? Now, you've scaled out one of your components, you have multiple consumers that are consuming from the same queue, and you're pre-fetching a lot of messages. Bye-bye in queue ordering, it doesn't exist anymore. It could also be due to an increase in load where you have some of the components in your system perfectly keeping up, and then other ones that are suffering and your message is somewhere there in the queue while the rest of the system is keeping up. Again, this could cause also other messages to appear out of order.

And finally, it could be due to retries. One of the great things of building these types of systems is that you have your message safely stored on the broker. You can continue retrying when failures occur. But that's also going to slow down things and cause that out of ordering issue sometimes. And finally, if one of your services becomes unavailable, that's fine. All of those messages will pile up. But again, this can also cause out of order messages to appear. So one common solution I've seen is to just force order back in. Now, one of the ways to do this is to basically say, we're just going to execute a step, wait until that has been successfully, completely done, and then we're going to kick off the next step.

If you do this, you've basically taken all of the benefits of messaging, comfortably walked over to the trash bin and thrown it all away. This is how you make a system slow because now you're just introducing a lot of latency for no reason. And the thing is also think about the coupling because if you pull in the underlying services that are basically taking care of all of those individual steps, they are now coupled together because now sales knows about payments because it kicks off the next message, and payments knows about shipments, and shipments knows about invoicing and so forth. And by the way, this is just one of the 15,000 workflows in your massive application. This is how you build that distributed ball of mud one step at a time.

But the thing is, I don't want the coupling, I just want the ordering. So how would we go about that? And that's where orchestration can offer some help. Now, orchestration is basically a coordination mechanism that you can use in event-driven microservice type of architectures or applications that basically is going to introduce a central component that drives the business process, that is going to drive that workflow, that order workflow that we saw earlier. This component knows and also takes the responsibility of storing the state of where that workflow is. So it will know, has this already been paid? Has this already been charged, invoiced, or something like that? It will also, based on that information, decide what is the next step and also when that should be executed. And this is basically a way that we can recreate the order, or rather, another way I'd like you to think about it is a way to handle more complex prerequisites. Forget about the idea of an order.

Now, if we try to visualize that a little bit, now we would have the user coming in sending that initial message to the message broker, that kicks off the order orchestrator. The orchestrator will then tell sales to basically store the order and payments to charge the order. At that point, sales is back and says, hey, I'm done, I have it stored. And then the orchestrator will say, oh, cool, but you know what? I can't package this just yet because I need it to be paid for first as well. Can you keep track of that? But it can already continue with adjusting the customer status because we want to do that immediately after payment. And then when payments is back and says, yeah, I'm done now, then the orchestrator can say, okay, now I can continue with the rest of my order fulfillment process.
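
To make that concrete, here is a minimal Python sketch of such an orchestrator. It is an illustration only: the class, the command names (`StoreOrder`, `ChargeOrder`, `PackageOrder`) and the `outbox` list standing in for the message broker are assumptions, not the API of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class OrderOrchestrator:
    """Tracks the state of one order workflow; all names are hypothetical."""
    order_id: str
    stored: bool = False
    charged: bool = False
    packaged: bool = False
    outbox: list = field(default_factory=list)  # commands we would send

    def start(self):
        # Both commands go out together; neither step waits for the other.
        self.outbox += [("sales", "StoreOrder"), ("payments", "ChargeOrder")]

    def on_order_stored(self):
        self.stored = True
        self._continue()

    def on_order_charged(self):
        self.charged = True
        self._continue()

    def _continue(self):
        # The prerequisite, not the arrival order, decides the next step.
        if self.stored and self.charged and not self.packaged:
            self.packaged = True
            self.outbox.append(("shipments", "PackageOrder"))


saga = OrderOrchestrator("123")
saga.start()
saga.on_order_charged()  # payments happens to reply first
saga.on_order_stored()   # sales replies second
```

Whichever reply arrives first, `PackageOrder` is only sent once both flags are set, and the `packaged` flag keeps a repeated reply from sending it twice.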

Now, this is a lot better because now all of those individual services at the bottom, they're not coupled to each other anymore. They're unaware of each other, but there's still a lot of coupling because now we have an individual component, that orchestrator, that is aware of all of those underlying services. And that's also still something that we want to manage and balance across our system, but at least it's already a lot more scoped than having all of these arrows going in every direction. So one of the things that we can basically do is to basically manage the amount of services that an orchestrator knows about. So if you start to ask yourself, what doesn't fit in this list? You could say, well, adjusting the customer status, that shouldn't even be part of a workflow. There's no complexity involved. We are just going to subscribe to the payment event, and then customer status service can do whatever it needs to do. Okay?
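
Taking the customer status out of the workflow and letting it subscribe to the payment event could look something like this in Python. The `EventBus` class is a hypothetical in-process stand-in for a broker's publish/subscribe, and the event name and the `active` status are made up for the example.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for broker publish/subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who, if anyone, is listening.
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()
customer_status = {}


def on_order_charged(event):
    # The customer status service reacts on its own, outside any workflow.
    customer_status[event["customer_id"]] = "active"


bus.subscribe("OrderCharged", on_order_charged)
bus.publish("OrderCharged", {"order_id": "123", "customer_id": "c-1"})
```

The orchestrator never needs to hear about the customer status service, which is one dependency less for it to know about.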

But let's argue that, for example, adjusting the stock, that is complex enough for us to introduce a dedicated workflow, a dedicated orchestrator to take care of all of those involved steps. And this basically helps narrow down the dependencies of each individual orchestrator. Now, the main reason I tend to look at orchestration is to handle a lot more complex prerequisites, to create visibility in very, very difficult workflows. But if you're still around tomorrow, I have a talk in the morning where I go in-depth into orchestration. It will be a very intense one hour where we look at all of the trade-offs and what you can do, when you should choose one pattern over the other as well. So definitely check that out.

Now, one of the other things to consider is also to consider what would happen if you need to compensate. We earlier talked about what would happen if I can charge the order, but I don't have enough stock. At that point, if you can't backorder the order for any reason, then you might have to end up canceling the order and actually refunding your customer. Now, one of the things that you will find is that in an orchestrated approach, you already have access to all of those services that you will need to undo the operations that you did earlier. So that can also become a lot more easy in this approach.

All right, to recap this problem, you have to expect out-of-order messages. That's like the first thing that I would like you to take away because the thing is they're going to happen. Don't put your head in the sand. You'd rather think about this upfront and understand how your system would react if this message comes in after that message than for it to happen on a Friday at 5:00 PM when you're bound to leave to the Caribbean or something on vacation. So test those out-of-order cases so that you can understand how your system would react and what you need to do in order to mitigate that. Also, you can consider the orchestration pattern for more complex workflows.
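
One cheap way to test those out-of-order cases is to replay the same events in every possible arrival order and check that the end result is the same. The Python sketch below assumes a hypothetical `apply` handler and invented event names; in a real system, the handler under test would take its place.

```python
from itertools import permutations


def apply(state, event):
    """Hypothetical handler: fold one event into the order state."""
    state = dict(state)
    if event == "OrderStored":
        state["stored"] = True
    elif event == "OrderCharged":
        state["charged"] = True
    elif event == "StockReserved":
        state["reserved"] = True
    state["ready_to_package"] = all(
        state.get(key) for key in ("stored", "charged", "reserved"))
    return state


def final_states(events):
    """Run every arrival order and collect the distinct end states."""
    results = set()
    for ordering in permutations(events):
        state = {}
        for event in ordering:
            state = apply(state, event)
        results.add(tuple(sorted(state.items())))
    return results


outcomes = final_states(["OrderStored", "OrderCharged", "StockReserved"])
```

If `outcomes` contains more than one end state, some arrival order leaves the system somewhere unexpected, and it is better to find that in a test than in production.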

Now, it's also still important to guard the amount of coupling that you introduce in a single orchestrator because this can tend to lend itself as a sort of easy component to just add steps to. And then you're building your next little monolith, which you don't want. And finally, one of the alternatives is actually choreography, which is exactly what my talk is about tomorrow. So I'll leave it at that and hopefully I've made you a little bit curious. All right. That brings me to the next problem. It's impossible to troubleshoot failures. This is another problem that affects any type of distributed system, but even more so if you use asynchronous communication, it's just magnitudes more painful.

So the thing is that in our monolithic approach, when you think about the code execution, it's quite simple, right? You can literally debug through the code, go from one method to the other, see what data is being passed on. You can investigate your call stack, you can understand where you came from and how things change to see where you're at. But the thing is that when you move to a distributed approach to a message-based approach and you think about your code execution, that looks a lot more like this. There's events just flowing all around and it's like, oh my God, this message and that message and what resulted in what? That can become quite overwhelming when you are trying to solve problems in a production environment. So one of the things that we can do... No, one of the things that we have to do is test.

Now, I keep saying this and... Okay, who consistently tests their application code? Okay, I'm seeing a few hands going like this. So I will say it again, testing is really essential to building resilient and trustworthy applications. But if you are building distributed applications, if you're using messaging, if you have events flowing from here to there and things just happening, you cannot afford to not test. Really, you're just making your own life a lot more difficult. So invest in a testing strategy. It's really important. But there's one big flaw in testing because the thing is that we only ever really test the test cases that we even can think about, and there's always something that we forgot that we never even considered could happen, and that's where the fun bugs are. So how do we handle those? Well, that's where we also need to invest in observability, right? Who's heard of observability before? Okay, lots of hands. Cool, really nice.

Now, observability is basically a technique that can help us recreate that visibility that we lose when we sort of shift to distributed applications, especially using asynchronous communication because it's not easy to understand where our errors originate because when you see a failure, okay, it fails in this component, but where is the cause? That could basically be five services upstream. It's really important to be able to have a tool that gives you that visibility. So the whole idea about observability is collecting telemetry from inside of your applications and storing that outside of your applications so that you have some data to look at. That can be in the form of logs, traces, and even metrics. And nowadays, we use the OpenTelemetry project to do exactly that so that we can collect and generate that telemetry in a way that is standardized.
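
To give a feel for how a failure can be traced back to a cause several services upstream, here is a small Python sketch that stamps every message with the ID of the message that caused it. This is not the OpenTelemetry API; the header names `correlation_id` and `causation_id` and the helper functions are assumptions for illustration only.

```python
import uuid


def new_message(kind, caused_by=None):
    """Create a message that remembers which message caused it."""
    message_id = str(uuid.uuid4())
    return {
        "kind": kind,
        "message_id": message_id,
        # Shared by the whole conversation, so telemetry can be grouped.
        "correlation_id": caused_by["correlation_id"] if caused_by else message_id,
        # The direct parent, so the chain can be walked back upstream.
        "causation_id": caused_by["message_id"] if caused_by else None,
    }


def trace_back(message, log):
    """Walk from a failing message back to the one that started it all."""
    by_id = {m["message_id"]: m for m in log}
    chain = [message["kind"]]
    while message["causation_id"] is not None:
        message = by_id[message["causation_id"]]
        chain.append(message["kind"])
    return chain


placed = new_message("OrderPlaced")
charged = new_message("OrderCharged", caused_by=placed)
failed = new_message("PackageOrderFailed", caused_by=charged)
chain = trace_back(failed, [placed, charged, failed])
```

Here `chain` reads from the failure back to the original `OrderPlaced` message. Standardized tooling does this kind of context propagation for you, which is a good reason not to hand-roll it beyond a sketch like this.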

Now, I'm not going to go too deeply into this, not because I don't care. I was talking and I forgot about my slides, but fine. It's available cross-platform, cross-runtime, I think 11 languages. So it doesn't even matter which stack that you're using, you'll be able to get that telemetry end to end. So I'm not going to continue talking about this, not because I don't care about it, but because I cared about it enough to have a dedicated talk about it. Who was in my talk last year here at Oslo? Okay, welcome back. Well, for those of you who weren't there, you can scan the QR code. It will take you to one of my GitHub repos where you will find a link to the recording, but also a bunch of additional resources around OpenTelemetry, around observability and even samples for you to play around with as well. And if you have any questions, just come find me and I'm happy to chat.

And that brings me to the last problem of the day. I'm still in time. I'm hearing so much noise. So the duplicate problem. Now, when we build a message-based system, we're using a message broker. By definition, every send or publish operation that we execute in a system is an at least once operation. And that is true unless you're willing to sometimes lose data, because otherwise it could be that a message is never received, which is not usually the case in business applications, so I'm going to assume that you want at least once message delivery. The downside of at least once message delivery is that by definition, at some point you will get a duplicate message, right? And the thing is that now all of your message handlers in your applications need a way to deal with that, need to understand, have I seen this message before? Because then I shouldn't process it again. That or we need idempotency.

Now, I find that idempotency is one of the main paradoxes in the IT industry in distributed applications because it's incredibly easy to explain, incredibly hard to implement. So for those of you who haven't heard about it, you can basically think about or explain idempotency with the example of a light switch because we have a flip switch, but we also have a push button as a light switch. Right? Now the flip switch is idempotent because it doesn't matter how many times I flip it down, the state of the light is not going to change. The push button on the other hand is not idempotent, because if I keep pushing it multiple times, the state of the light is going to change. Like I said, really easy to explain. In a system, a bit harder. And what do we do with difficult problems? We avoid them, and that's why we have message deduplication.
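
The light switch example translates almost literally into code. In this Python sketch, the `Light` class and the `deliver` helper are made up for illustration; `deliver` simply simulates the same command arriving one or more times.

```python
class Light:
    def __init__(self):
        self.on = False

    def flip_up(self):
        """Idempotent: states the desired outcome."""
        self.on = True

    def flip_down(self):
        """Idempotent: flipping down again changes nothing."""
        self.on = False

    def push(self):
        """Not idempotent: the outcome depends on the current state."""
        self.on = not self.on


def deliver(command, times):
    """Apply the same command to a fresh light, as a duplicate would."""
    light = Light()
    for _ in range(times):
        getattr(light, command)()
    return light.on


flip_once, flip_twice = deliver("flip_up", 1), deliver("flip_up", 2)
push_once, push_twice = deliver("push", 1), deliver("push", 2)
```

A duplicated `flip_up` leaves the light in the same state as a single one, while a duplicated `push` does not. Message handlers that set a value behave like the flip switch; handlers that toggle or increment behave like the push button.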

So message deduplication is a mechanism that will allow you to detect whether the incoming message has already been processed before. So the idea is basically that then if we see, oh, we've already taken care of this, then we'll just throw it out. We've already done this. The way that this works is based on a deterministic message ID. So that means that when we send out a message, we will have to always regenerate the same ID. We can, for example, do that based on an entity ID. We want to charge an order 5673, and then create an ID based on that so that even if we retry that operation, we will send out a message with the same message ID so that we are able to throw it away because we recognize it. Okay? Now the thing is that if you have very high throughput, then this can also become costly because the way that this works is that every processed message, you're going to keep track of that.
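
A deterministic message ID can be derived from the operation and the entity ID, for example with a name-based UUID. The Python sketch below is an assumption-laden illustration: the namespace name `orders.example.com`, the `ChargeOrder` operation name, and the `DeduplicatingHandler` class are all invented for this example.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "orders.example.com")


def message_id(operation, entity_id):
    """The same operation on the same entity always yields the same ID."""
    return str(uuid.uuid5(NAMESPACE, f"{operation}:{entity_id}"))


class DeduplicatingHandler:
    def __init__(self):
        self.seen = set()
        self.charges = []

    def handle(self, message):
        if message["id"] in self.seen:
            return False  # already taken care of, throw it out
        self.charges.append(message["order_id"])
        self.seen.add(message["id"])
        return True


def charge_order_message(order_id):
    return {"id": message_id("ChargeOrder", order_id), "order_id": order_id}


handler = DeduplicatingHandler()
first = handler.handle(charge_order_message(5673))
retry = handler.handle(charge_order_message(5673))  # the sender retried
```

Because the retry carries the same ID as the first attempt, the handler recognizes it and the order is charged only once. A random ID generated per send would defeat the whole mechanism.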

So you can't do this forever, and that's why this type of mechanism usually works based on a deduplication window. So we will keep all of the processed messages around for a limited amount of time. Now, duplicates tend to come very close together because they're usually a result of retries. So usually that's enough, but it's also not impossible that a message comes in beyond the deduplication window, and then we will treat it as we've never seen it before. So still something to be aware of. But we're not entirely done then because even with message deduplication, we still have the atomicity problem. Now, the thing is that as we're building modern distributed systems, we also started to use different types of infrastructure so that we're using the right tools for the right job. And in essence, that's great, right? Because we're using the right infrastructure for the specific concern that we are handling, and we can do that in an autonomous way across our individual services.
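
Going back to the deduplication window for a moment, a minimal Python sketch of it could look like this. The `DeduplicationWindow` class and the ten-minute window are assumptions for illustration; real brokers implement this on their side with their own settings.

```python
from datetime import datetime, timedelta


class DeduplicationWindow:
    def __init__(self, window):
        self.window = window
        self._seen = {}  # message ID -> time it was first processed

    def is_duplicate(self, message_id, now):
        # Forget everything older than the window to bound the storage.
        cutoff = now - self.window
        self._seen = {m: t for m, t in self._seen.items() if t > cutoff}
        if message_id in self._seen:
            return True
        self._seen[message_id] = now
        return False


t0 = datetime(2024, 1, 1, 9, 0)
dedup = DeduplicationWindow(timedelta(minutes=10))
first = dedup.is_duplicate("charge-5673", t0)
quick_retry = dedup.is_duplicate("charge-5673", t0 + timedelta(minutes=1))
late_retry = dedup.is_duplicate("charge-5673", t0 + timedelta(minutes=11))
```

The quick retry is caught, but the late one falls outside the window and is treated as a brand-new message, which is exactly the gap to be aware of.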

But the problem is no distributed transactions. So we can't create consistency across those different types of infrastructure. Now, all of those different types of infrastructure usually do provide some kind of transaction management, right? SQL Server has database transactions, Cosmos DB offers partition-based transactions, and even Azure Service Bus has cross-entity transactions. The issue is we can't combine those transaction guarantees into one. So they wouldn't work. So if you want, for example, what you do in a database to be consistent to the messages that you send out to a broker, a transaction scope around that is never going to help. Now, one of the ways that we handle that problem in the messaging space is with the outbox feature. That's basically a pattern that's going to give you that consistency across your message operations and your database operations. It still doesn't cover everything, okay?
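
The core idea behind an outbox can be sketched with nothing more than SQLite: the outgoing message is written to a table in the same local transaction as the business data, and a separate step forwards it to the broker. This Python sketch is a generic illustration under assumed table and function names; it does not show how any specific framework implements its outbox.

```python
import json
import sqlite3


def create_schema(db):
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, body TEXT, "
               "dispatched INTEGER DEFAULT 0)")


def place_order(db, order_id, fail=False):
    # One local transaction covers the business data AND the outgoing message.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (body) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))
        if fail:
            raise RuntimeError("simulated crash before commit")


def dispatch(db, publish):
    """Forward pending outbox rows to the broker, then mark them as sent."""
    rows = db.execute(
        "SELECT id, body FROM outbox WHERE dispatched = 0 ORDER BY id").fetchall()
    for row_id, body in rows:
        publish(json.loads(body))  # may repeat after a crash: at least once
        with db:
            db.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))


db = sqlite3.connect(":memory:")
create_schema(db)
place_order(db, "123")
try:
    place_order(db, "456", fail=True)  # rolls back order AND message together
except RuntimeError:
    pass
sent = []
dispatch(db, sent.append)
```

Order 456 leaves no trace in either table, so there is never a message without its data or data without its message. Note that the dispatch step can still publish a message twice if it crashes between publishing and marking the row, so duplicates on the receiving side remain something to handle.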

Let's say that you're in a scenario where you want to generate an invoice PDF, store that on Blob, store the reference to that Blob in your data store in SQL Server, and then publish a message. The storage of that blob, you can't make that a part of the outbox. There's no way to do that, important to be aware of that. But let's say that you do have something like that where you need consistency across more types of infrastructure, a multiphase commit type of situation. That's where you can start to use the saga distributed transactions pattern. Now, any of you have used NServiceBus or something like that? Okay. Okay. So I'm not talking about an NServiceBus saga, just to make that clear. This is a different pattern. It's more of a concept, right? But you can use an NServiceBus saga to implement this.

The idea here is that a saga distributed transactions pattern is basically a failure management pattern. Okay? What we're going to do is create that data consistency step-by-step and implement that multiphase commit if you want by using a sequence of local transactions. So we're going to do a transaction against one piece of infrastructure, and then the next, and if something fails in that sequence, we'll also roll back all of the previous pieces of infrastructure that we touched. I'll show you a quick example so that you can visualize it. So our saga wants to store some data in the SQL database of service A. We'll do that. Okay, everything's fine. Then it wants to store some data in the Cosmos DB in service A. We're good.

Next step is storing some data in the SQL database of service B. Now, something goes wrong there. Now, at that point, the saga understands this, but also knows what has already happened. So it will go back to service A and say, undo what you did in Cosmos, and please also remove what you stored in the SQL database earlier. And by the way, this is a horrible example, and I did it on purpose because if you find yourself using this pattern to create a multiphase commit across multiple services, then you got your service boundaries wrong. This should always be condensed to a single service boundary, but this way you get the conceptual idea of how that would work.
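
Conceptually, that sequence of local transactions with rollback can be sketched like this in Python. Plain dictionaries stand in for the SQL and Cosmos DB stores, and `run_saga`, `store`, and `SagaFailed` are hypothetical names for this illustration; a real implementation would also have to persist the saga's own progress.

```python
class SagaFailed(Exception):
    pass


def run_saga(steps):
    """Run (action, compensation) pairs; undo completed ones on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception as error:
            # Roll back in reverse order, newest change first.
            for undo in reversed(completed):
                undo()
            raise SagaFailed(str(error)) from error
        completed.append(compensate)


sql_a, cosmos, sql_b = {}, {}, {}
undone = []


def store(db, name):
    """Build an (action, compensation) pair for a hypothetical data store."""
    def action():
        db["order"] = "123"

    def compensate():
        db.pop("order", None)
        undone.append(name)

    return action, compensate


def fail_sql_b():
    raise RuntimeError("SQL database of service B is unavailable")


steps = [store(sql_a, "sql_a"), store(cosmos, "cosmos"),
         (fail_sql_b, lambda: sql_b.pop("order", None))]
try:
    run_saga(steps)
    outcome = "committed"
except SagaFailed as failure:
    outcome = f"rolled back: {failure}"
```

When the third step fails, the Cosmos write is undone first and the SQL write second, leaving all three stores empty. The compensation of the failed step itself never runs, because that step never completed.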

Okay. So when it comes to idempotency I don't have a silver bullet, sorry, but I can give you strategies. The first one if you can, if you have the possibility to do this, make your code idempotent, but also know that this is going to be a continuous effort, that everyone on the team is going to have to understand idempotency because every time something changes, you have to ask yourself, is the change still idempotent now? So it's a continuous effort, but if you can, I would still recommend it. It can save you a lot of trouble. You can also use the saga to mimic those distributed transactions to basically create eventual consistency by using a sequence of local transactions across different pieces of infrastructure. If you have any communication happening within your system boundaries, please do it over queues and don't do it over HTTP APIs because then you have message deduplication out of the box. And just by using a deterministic ID, you won't have duplicates.

If you have any REST APIs that you need to integrate with that are outside of the boundaries of your system, then ask if they're idempotent, because usually that will require you, again, to use some kind of a deterministic ID in case they didn't make all of their handling code idempotent, something to verify. And finally investigate compensating requests if you don't have any guarantee that those REST APIs are going to be idempotent, because if you do something twice, you will also have to understand how do I undo that? And that brings me to the end. I know this was a really packed session. I wanted to give you as much information as I could. We don't have time for questions, but I'm going to be around for the rest of the conference if you have questions.

So a quick recap. Messaging gives you options when it comes to implementing scalable, reliable systems, but it's not going to be your silver bullet. You have to really be conscious of all of the problem spaces that you're also entering when introducing that. So you have to really commit to decoupling your logic and your data, and finally think of how out-of-order processing is going to affect your more complex workflows. Invest in observability. Okay? I'm happy to talk to you about this, really passionate about the subject as well. And finally, make idempotency a key pattern in your designs as well. Thanks a lot for listening. If you scan the QR code, more resources for you, and any questions, please come chat to me.
