Beyond Pub/Sub - Advanced Messaging Patterns
About this video
This session was presented at NDC London.
Publish/subscribe is often the first pattern teams adopt when moving toward asynchronous, message-based architectures. It’s a powerful way to decouple services and enable extensibility—but on its own, pub/sub doesn’t solve many of the problems that show up in real distributed systems, such as data loss, broken workflows, or systems that struggle under load.
In this session, we’ll go beyond basic pub/sub to explore a set of proven messaging patterns used in production. We’ll look at how the Outbox pattern prevents message loss during failures, how sagas coordinate long-running workflows without distributed transactions, how scatter-gather accelerates work through parallelism, and how competing consumers enable elastic throughput when demand spikes. Along the way, we’ll also examine essential operational safeguards such as retries, dead-letter queues, back pressure, and idempotency.
By the end of the talk, you’ll understand when and how to apply these patterns to build systems that are not just decoupled, but resilient, scalable, and designed to handle failure gracefully from day one.
🔗 Transcription
- 00:00:10 Poornima Nayar
- Hello. Good morning. So, are the coffee levels topped up? Yeah? Good, because I've got a full-on one-hour session for you. Did you enjoy the keynote? Yeah? Okay. I plan to watch it later, async, because I was busy preparing for my talk, which you all have turned up to, so thank you. I am Poornima Nayar. I am a software engineer and solutions architect at... Okay, we have some problem here. That's fine. We can go ahead, I think, yes. So I'm a software engineer and solutions architect at Particular Software, where we build NServiceBus and the platform tooling around NServiceBus. Heard of NServiceBus? Yeah? Good.
- 00:00:57 Poornima Nayar
- I am local. I have just traveled in from Berkshire in the UK, so this is home turf for me. And I will try to answer some of your questions at the end of the session because, as I said, I have a full-on one-hour session to give you today. Or you can catch me at the booth. We have a booth near the coffee on level three, so you can find us and ask us any of your questions. I'll be at the booth after this, and tomorrow, and the day after. And should you wish to reach out to me on any of the social media platforms, connect with me on LinkedIn. That's my social media handle, poornimanayar.
- 00:01:33 Poornima Nayar
- So moving on. When teams start their journey with messaging, Publish/Subscribe, or Pub/Sub, is often what they look into first. Probably they are coming from breaking up a distributed big ball of mud, carving parts out of that, and moving towards a microservices way of architecture. And that's often how people or teams start their journey with messaging. Not a big-bang rewrite, but carving out parts. But that is not just a lot of technical code to move around; it's also a shift in mindset, because you are going from immediate consistency to eventual consistency. But teams do that.
- 00:02:16 Poornima Nayar
- But what is Publish/Subscribe? We have a publisher which publishes messages into a broker, and those messages are events. And subscribers might be interested in those events, so they get those messages from the broker and do their processing. So each subscriber gets a copy of the same event, and they react to it independently. That is very important: independent reaction to events, and then they move on. So the publisher publishes the events, then moves on to do its own work. Subscribers react independently and continue processing messages. They don't communicate with each other. The publisher doesn't know how many subscribers exist. The publisher doesn't even know if a subscriber exists; that is also an important aspect of Publish/Subscribe. Now, what is an event? If it is a message that is talking about something that has happened in the past, that is an event. It is sending out a notification into the world that something has happened; therefore it is immutable, a fact that cannot be changed.
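The Publish/Subscribe mechanics described above can be sketched in a few lines. This is a deliberately minimal, in-process illustration, not the talk's actual tooling (the speaker works with NServiceBus/.NET); the `Broker` class, the event names, and the handlers are all invented for the example. Each subscriber registers a handler, and on publish every subscriber receives its own copy of the event and reacts independently:

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory broker: fans each event out to every subscriber."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # event type -> list of handlers

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Each subscriber gets its own copy and reacts independently;
        # the publisher never knows who (or whether anyone) is listening.
        for handler in self.subscribers[event_type]:
            handler(dict(payload))  # a fresh copy per subscriber

broker = Broker()
seen = []
broker.subscribe("OrderPlaced", lambda e: seen.append(("inventory", e["order_id"])))
broker.subscribe("OrderPlaced", lambda e: seen.append(("payment", e["order_id"])))
broker.publish("OrderPlaced", {"order_id": 42})  # both subscribers react
```

Note that publishing to an event type with no subscribers is perfectly legal here, which mirrors the point above: the publisher has no idea whether anyone is listening.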
- 00:03:20 Poornima Nayar
- So the intent of this specific message is letting the world know that something has happened. Now, you might come across events that are disguised instructions: it's actually an event, but it has an instruction hidden in it. That is what people call passive-aggressive events, and that's an anti-pattern. Please don't use it. That's just something to watch out for. But as for the promises or benefits of Publish/Subscribe, what do we get? Decoupled services: the publisher does its own work and moves on. Subscribers do their own job. So there is no temporal coupling. There's no prerequisite that all your services should be available at the same point in time for your system to keep working. If a part of the system is slow or there's a performance issue happening in one part of the system, it doesn't impact the system as a whole.
- 00:04:15 Poornima Nayar
- The next benefit you can get out of the decoupled services is that you can scale services, not just vertically with resources, but also horizontally. You can also reuse events. How? Suppose you have three subscribers which are interested in an event, and you want to bring in a fourth service. Fine, maybe I can make use of the same event, subscribe to that event with the new subscriber, and then carry out some business logic. So this gives rise to a modern, evolving architecture where integrations become faster and easier compared to coordinating with multiple teams across the organization. Publish/Subscribe is the foundation of event-driven architecture. So when you have such event-driven services, the contract centers on the event schema. Which means that as long as the teams agree to that event schema, they can add more functionality, update some functionality, and go for independent deployments. They can swap out entire subscribers or even route the same event to a completely different subscriber without impacting other parts of the system, which means that teams achieve autonomy, they can have separate release cycles, and they have clearly well-defined ownership boundaries.
- 00:05:36 Poornima Nayar
- And all of this looks very elegant as a solution, and it is architectural freedom that we get with Publish/Subscribe. At this point, we all might be thinking, or teams might be thinking, Publish/Subscribe checks all my boxes; that's all I need for my system. And that's where you have a comfort zone, and that is exactly the problem. You have a great foundation, but then you fail to build on top of it. That's when you start to notice the cracks that appear in your system. And what is that comfort zone? You bring in a broker, you bring in your topics or exchanges with publishers and subscribers, events floating around, the entire system like tennis balls bouncing off this entire room. That is what you might have. Let us look at our trusted order workflow. This is something that is familiar to everyone, so as boring as it is, I need to keep speaking about it.
- 00:06:33 Poornima Nayar
- So we have our order service that might publish an order placed event into the broker, which is then independently subscribed to by all these four different services. Inventory might start reserving the inventory, payment might start its workflow, fraud might start the fraud check, and shipping might be doing some pre-stage checks. And these will go back and publish more events to say, "Hey, I have reserved the inventory, or I did not, I could not, because we don't have enough inventory." Payment might go back and say, "Yeah, I have published an event saying payment has been authorized or payment has failed." Similarly, events get published from the fraud service back to the broker. And all of this might then be watched by shipping to see what is going on. At the same time, the order service needs to know what the state of things is with inventory, and payment, and fraud. So it might be equally watching out for these events.
- 00:07:33 Poornima Nayar
- At some point, it'll gather the information that yep, now I can say the shipment is ready to start. So it might further say, yep, the order can be completed. Now we can start the shipment. That will then be picked up by shipping because it's subscribed to that event. And then it might say, yep, I have started the shipment, everything is good to go. But if you look at this, can you actually see what the entire lifecycle of an order is? It is all about events flowing through the system. Every service is listening to just about enough events and then deciding what to do next. It's kind of a delicate balance. And when you have so many events floating around, there might be that 51st event, but mind you, if you have 51 events being subscribed to by a single service, that is a different problem in itself. But if you have that level of events floating through your system, things start falling through the cracks.
- 00:08:33 Poornima Nayar
- There might be one situation where things lead to an invalid state of the order, which you don't want, because then you might have angry customers, and the business might be suffering as well. And at the end of it all, suppose you bring in an email service and suppose it starts to publish more events. Every other service will now have to subscribe to that and change. And that will have to happen every time something new is published by any of these services that might be part of your business workflow. Which means that now we have taken all that team autonomy and put it in the bin. Why? Because even the smallest changes need coordination from multiple teams. So some of the questions that you might ask, or cracks you might see in your system, start when you have multiple consumers that react to the same event independently.
- 00:09:25 Poornima Nayar
- What do we do when there is... How can I ensure that data is safe even during crashes? Because I don't want to lose my data. Now, who owns the workflow? Can I get an understanding of what exactly the state of the order is at any point? What happens if a consumer fails? What happens if events arrive out of order? The fraud check can get cleared before the payment has been authorized; what happens then? Do I just go ahead and ship, or do I wait, or do I just refund completely and cancel the order? We don't know. What are the consistency guarantees? What about a slow or overloaded consumer downstream? What are the compensating actions? Say the inventory could be reserved for only part of the order; what are the compensating actions? Do I just refund the customer, cancel the order, or backorder? These are all business decisions that you will have to bake into your system as compensating actions.
- 00:10:24 Poornima Nayar
- And if you don't have meaningful answers for all of these, then you might have hit a Pub/Sub plateau, and you have decisions to make. The underlying problem here is that Pub/Sub, as much as it looks simple, offers an illusion of simplicity at the surface level, but underneath it hides a set of hidden problems. Let us talk about why you might have jumped on that messaging train: loosely coupled systems, availability, recoverability, fault isolation, horizontal scaling. But Pub/Sub is not about any of this. Pub/Sub is all about decoupling. That is what Pub/Sub offers. Decoupling is good, but it's not the answer to everything. It's not that silver bullet. And those questions that I put to you before, Pub/Sub doesn't try to answer them. That is not its responsibility. Pub/Sub is a communication style. It's not a system design.
- 00:11:17 Poornima Nayar
- So let us now look at how we can conquer some real-world problems, where we might actually even bring in Pub/Sub to complement things, and answer some of the questions that I spoke to you about. We will start with the dual-write problem. This is a very common challenge that distributed systems face. It happens when something tries to write to two different systems, and that leads to potential inconsistencies. Let us talk about the payment service. The payment service is subscribed to the order placed event, and it tries to write some data into the database and then publish another event saying payment has been authorized or payment has failed. What if it writes to the database and then it fails or crashes? Then we have a database record that has been updated, but none of the other services knows that the payment has been authorized. So everything is at a standstill.
- 00:12:16 Poornima Nayar
- So the business is losing out because the customer is unhappy. What we technically have in the system is a zombie record. Now let us say that I publish the message first and then I do the database update. Again, a crash can happen somewhere in between. Then we have what is known as a ghost message: a message that is floating around the system, but there is absolutely no proof that we have captured that payment on our end. Both of these are not ideal situations, and we want to avoid both zombies and ghosts in our system. But what I'm trying to say here is that when you have a distributed system, the data loss happens between your database and your broker. It is not just in the broker that it happens; it is between the database and your message broker. So when I as a developer write the code to save the data into the database and then publish a message, my intent is: I want to capture the payment once and send a message once. But that's not exactly what happens.
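The two crash windows described here, zombie records and ghost messages, can be demonstrated with a small sketch. This is illustrative only, with made-up `Db` and `Broker` stubs and a simulated crash; the point it shows is that the save and the publish are two separate, non-atomic operations across two durability boundaries:

```python
class Db:
    """Stand-in for the service's database."""
    def __init__(self):
        self.rows = []
    def save(self, row):
        self.rows.append(row)

class Broker:
    """Stand-in for the message broker."""
    def __init__(self):
        self.messages = []
    def publish(self, msg):
        self.messages.append(msg)

def authorize_payment(db, broker, payment, crash_after_save=False):
    db.save(payment)  # durability boundary #1: the database commit
    if crash_after_save:
        # Simulated process crash between the two writes.
        raise RuntimeError("process crashed")
    broker.publish({"type": "PaymentAuthorized", "id": payment["id"]})  # boundary #2

db, broker = Db(), Broker()
try:
    authorize_payment(db, broker, {"id": 1}, crash_after_save=True)
except RuntimeError:
    pass
# The database has the row but the broker never saw the event: a zombie record.
# Flipping the order of the two writes produces the ghost message instead.
assert len(db.rows) == 1 and broker.messages == []
```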
- 00:13:18 Poornima Nayar
- Teams trust brokers for durability, but still there are events that go wandering and then get lost, and payments go missing. The problem is that we have two non-atomic operations across two independent durability boundaries: the database and the message broker. And what we can use to solve this problem is the transactional outbox pattern. It is basically infrastructure that guarantees consistency between your business data and your messages. So what is the approach here? We write the business data and the event to publish, or the message to send, in a single database transaction to one database. And most often, the database we write to is the database where that service stores its business data. So we are focusing on atomicity of the operations rather than on the broker itself and its independent durability boundaries. The message to be sent is sent at a later time, but it is stored in a separate table in the database; let us call it the outbox storage for simplicity.
- 00:14:26 Poornima Nayar
- Now, looking at the payment service where this is happening, how does it pan out? When there's an incoming order placed message, the infrastructure begins a transaction. That happens under the hood, because it's the infrastructure doing that. And then my code, my developer code, kicks in and it handles the incoming message, processes it, probably saves the data into the database. And then I say that I want to send this message. That is why I have highlighted it in pink; that is my code. And then the infrastructure kicks in again and says, I'm not going to send it, but I'm going to store those outgoing messages into the outbox table in the database. It commits the transaction. And then it moves on to the dispatch phase, where it sends the outgoing messages in that outbox table and marks those outbox rows as sent.
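The two phases just described can be sketched as follows. This is a minimal, assumption-laden illustration (SQLite purely for convenience; the table names, columns, and message shapes are invented, and a real implementation such as the NServiceBus outbox does all of this inside the infrastructure): phase one commits the business row and the outgoing message in one local transaction, and phase two dispatches pending rows and marks them as sent.

```python
import json
import sqlite3

def handle_order_placed(conn, incoming_id, order):
    """Phase 1: business write + outgoing messages in ONE local transaction."""
    with conn:  # a single atomic database transaction
        conn.execute("INSERT INTO payments(order_id, amount) VALUES (?, ?)",
                     (order["order_id"], order["amount"]))
        # The outgoing event is NOT sent yet; it is stored against the
        # incoming message ID in the outbox table, inside the same transaction.
        conn.execute("INSERT INTO outbox(incoming_id, message, sent) VALUES (?, ?, 0)",
                     (incoming_id, json.dumps({"type": "PaymentAuthorized",
                                               "order_id": order["order_id"]})))

def dispatch(conn, broker_send):
    """Phase 2: send pending outbox rows, then mark them as sent."""
    rows = conn.execute("SELECT id, message FROM outbox WHERE sent = 0").fetchall()
    for row_id, message in rows:
        broker_send(json.loads(message))
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments(order_id, amount)")
conn.execute("CREATE TABLE outbox(id INTEGER PRIMARY KEY, incoming_id, message, sent)")
sent = []
handle_order_placed(conn, "msg-1", {"order_id": 42, "amount": 10})
dispatch(conn, sent.append)  # the broker only sees the event after the commit
```

If the process crashes after the commit but before dispatch, the outbox row survives, so the message is sent on the next dispatch run: nothing is lost, though it may be sent more than once, which is exactly why idempotency comes next.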
- 00:15:24 Poornima Nayar
- So when it stores that data, the outgoing message operations are stored against the incoming message ID. So there is proof that against this message, these are the outgoing operations. And finally, when the outbox rows are marked as sent, dispatched, it acknowledges the incoming order placed message so that it can go back to the queue and get removed from the queue. So phase one is where it commits the transaction. Phase two is the dispatch phase. Now, what happens if there's a handler crash at the point of acknowledging the incoming order placed message? No problem. The message gets retried, it starts a transaction again, but then we have a problem. The payment could be taken twice because of my business logic, and there could be more outgoing operations. That is a different problem in itself that we have to solve. But outbox alone cannot solve it, because outbox is all about protecting the publisher, making it more reliable. To make sure that we have exactly-once processing semantics, what we need is idempotency.
- 00:16:34 Poornima Nayar
- And idempotency is a way of ensuring that no matter how many times you process a message, it has the same effect as processing it only once. In mathematical terms, think of multiplying by one. That is idempotency for you. Many operations are naturally idempotent, and some are not. So if natural idempotency can be obtained, that is recommended. But if you have specific messages that have specific outcomes according to the business logic, you can ensure idempotency that way as well. That needs to happen during message processing, though, because it is in the handler code that you can check for it. But a more generic way to do it is deduplication, where we check whether the incoming message has already been processed. Remember our outbox table: we always store the incoming message ID and the outgoing message operations in the outbox table. So that is a good place for us to check whether the incoming message has been processed.
- 00:17:43 Poornima Nayar
- So going back to our payment service, we have the incoming order placed message. We have deduplication kicking in and checking whether the incoming message has been processed. So it goes and checks that outbox storage to see whether there's any proof that it has been handled. If it cannot find any record in the outbox table, it goes to phase one, where it begins the transaction and then finally commits it, and then moves on to phase two, which is the dispatch phase. Now, if there is a record found, it goes directly into the dispatch phase. And in the dispatch phase, it checks whether the outgoing messages have been sent. If they have not been sent, it sends the outgoing messages, marks the outbox rows as sent, and acknowledges the incoming order placed message. And if they have been sent, it acknowledges the order placed message and then moves on.
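The deduplication branch can be sketched like this, assuming the outbox table records each processed incoming message ID (again an illustrative SQLite sketch with invented names, not the NServiceBus implementation): a redelivered message finds its outbox row and skips the business logic, going straight to the dispatch phase.

```python
import sqlite3

def process(conn, incoming_id, handler):
    # Has this incoming message already been handled? The outbox row is the proof.
    seen = conn.execute("SELECT 1 FROM outbox WHERE incoming_id = ?",
                        (incoming_id,)).fetchone()
    if seen is None:
        with conn:  # phase 1: business logic + outbox record, one transaction
            handler(conn)
            conn.execute("INSERT INTO outbox(incoming_id) VALUES (?)", (incoming_id,))
    # Phase 2 (dispatching any unsent rows, then acknowledging the incoming
    # message) would run in BOTH branches; it is omitted here for brevity.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox(incoming_id)")
calls = []
process(conn, "msg-1", lambda c: calls.append("charged"))
process(conn, "msg-1", lambda c: calls.append("charged"))  # broker redelivery
assert calls == ["charged"]  # the payment is taken exactly once
```

The second call is the at-least-once redelivery that brokers guarantee; the dedup check turns it into exactly-once processing from the business logic's point of view.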
- 00:18:37 Poornima Nayar
- So no matter where you have a crash in the handler, even if you get a retry of the order placed event, nothing bad is going to happen. We are still going to get the exactly-once processing semantics. So if you have come across the claim that outbox gives you exactly-once processing, it is the transactional outbox plus idempotency that gives you that. As a best practice, every service that saves to the database and then tries to send a message out should use the transactional outbox pattern. I have shown it only in the payment service, but in reality, every other service that does some database operations and sends messages can use it. So think of those independent durability boundaries. If you have them, you can use the transactional outbox pattern. But of course, use it with idempotency to get exactly-once message processing. In NServiceBus, along with the outbox, idempotency is baked in, so you always get exactly-once processing.
- 00:19:40 Poornima Nayar
- Now you might think, why not a transactional broker? Because there are transactional brokers out there, you might say, why do we need the outbox? Why not a transactional broker? Because transactional brokers like Azure Service Bus can only ensure atomicity between their own operations; that is, completing a message and sending a message to another queue in the same Azure Service Bus namespace can be atomic. They cannot account for atomicity around your business data. So even if you have a transactional broker, you need this pattern to make sure that there's consistency between the durability boundaries, that is, saving your data and sending other messages. But there are some things that outbox cannot do. It is not about exactly-once delivery, so you need idempotency. Exactly-once delivery should never happen from a broker's perspective, because then you might end up losing your messages, which you don't want. Brokers are always about at-least-once delivery. So you need to have the outbox pattern plus idempotency.
- 00:20:48 Poornima Nayar
- It does not make your events any better. Actually, if you have a chatty event or a passive-aggressive event, the outbox is just faithful and sends that out anyway. It does not eliminate the need for idempotency; in fact, I would say idempotency is your insurance, so always have it. And it cannot orchestrate events. It is not about orchestration; it is about reliability. Orchestration of events comes in with sagas, which we will talk about next. So look at this process again. Look at this order workflow again. As a business, if you need to be able to say what the status of the order is, can you gather that from this workflow? You might have to go to each and every team. Have you done this? Have you done that? Have you done that? And then try to put together on a piece of paper what's going on. And trust me, when you're on a call with a customer, you don't have the freedom, or liberty, or time for that.
- 00:21:45 Poornima Nayar
- Can you say what has been done and what is left to be done? No, that domain logic is scattered across all the services. Each service might be handling just enough success and failure conditions from every other service. But as I said, there could be a 51st event floating around, sending your entire workflow into an invalid state. Again, not something you want to handle. And in this system, again, every service is subscribed to every other event from every other service. And when you work against such raw signals, maintaining such a system becomes very difficult. What you have is a pinball architecture in place. Now, this is what Pub/Sub turns a workflow into: a bouncing ball of events where each service reacts, then emits another event that triggers another service to do something, and the process, the workflow, doesn't exist anywhere. It is just lost in the entire bouncing ball of events.
- 00:22:55 Poornima Nayar
- And when you have pinball architecture, over time what you have is unintended side effects. You have performance issues, and then debugging becomes the work of detectives. We don't want to be detectives. Well, we like being detectives, I know, it's problem solving, but our life is not all about that, right? What we need to be thinking is that this is a long-running workflow. In reality, yeah, Amazon can deliver an order to me on the same day, but it still takes a few hours, right? If it is a digital delivery, yes, I get it instantly, but if I have a package, it takes a few hours. Let's just be realistic about it. So why are we then making everything so synchronous? Why are we not embracing the fact that this is a long-running workflow? Let us look at it from a very different angle.
- 00:23:51 Poornima Nayar
- There are multiple messages involved, multiple services, multiple databases involved, and it's not just happy paths that we are looking at. There are alternate paths as well, right? If you have inventory, the reservation might fail for a part of the inventory. Do you backorder? The payment might have failed. You might want human intervention at that point to see why it failed. Do I reach out to the customer for another set of card details? So these are alternate paths that need course correction. And that is where sagas can help coordinate all these long-running business processes. Now you might think, when you have multiple databases, why not distributed transactions? Distributed transactions try to give you atomicity across all services, all-or-nothing outcomes, and strong consistency. But the assumption is that you have tight coupling. Everything is always available, up and running all the time. Things are fast, everything is homogeneous, and everything is under the control of a single team.
- 00:24:56 Poornima Nayar
- But hey, what we are describing there is a monolith, not a distributed system. When you apply the same assumptions to distributed systems, these are five out of the eight fallacies of distributed computing; they cannot be assumed. So moving on, why not distributed transactions again? Because business processes are long-running, you cannot have coordinators that stay stable and locks that are held for that period of time. Distributed transactions can fail catastrophically. If the coordinator crashes, the participants are left in a state of doubt. There could be technical scaling issues, because if one participant is slow, then there's latency that impacts the entire system, and that might lead to cascading failures. You have organizational scaling issues, because distributed transactions need distributed agreement: every team has to agree to those transaction semantics, and that is tough. They cannot survive partial failure.
- 00:26:04 Poornima Nayar
- It assumes that the network is always up and running and that participants will never fail. That is not going to be the case. It does not map to single... Sorry, I moved too fast there. It does not map to business reality. Distributed transactions are always about rollback, but businesses don't think like that. Businesses think about compensating. If I have a reservation of inventory that failed, I backorder. I cannot un-reserve an inventory or unship a package. It is all about compensating. Distributed transactions do not think like that. And we have single points of failure: the coordinator, which can crash, and then it is a system-wide failure. And then you have time involved. We are assuming that locks can be held, participants always come back to the coordinator with their replies at the right point in time, and networks respond in time. All of these are, again, fallacies of distributed computing that we are assuming.
- 00:27:02 Poornima Nayar
- And how are sagas different here? If you think about the entire big workflow as a distributed transaction, sagas make each part of it a local transaction. So that entire workflow moves ahead one step at a time, where each service commits to its own database. The locks are held and released very fast, and there are no cross-service locks. And it does not communicate using synchronous calls; it communicates using asynchronous messaging. So again, those messages are durably held. Now, business processes are always eventually consistent. Businesses always have delays. What we care about is a reasonable outcome, right? So why are we fighting for immediate consistency when the reality is eventual consistency? Sagas are all about embracing that.
- 00:27:59 Poornima Nayar
- Sagas deal with explicit compensation. Again, with messaging, what we are modeling is our business domain. So when we have sagas, we can think about compensation. I have not been able to reserve the inventory, so what can I do to compensate for it? There's a business decision involved; that will be your compensating action. And sagas have a great memory, because what is the state of a workflow? What has been done, what is pending, and what should happen next? Based upon the saga state that is persisted into storage, sagas can make that call very efficiently. So what makes a saga? The first part is the saga instance. That is your entire workflow record. One saga instance is one workflow record, which is persisted as the saga state into the database. So it is durable storage of what has already happened, what is pending, and what important decisions have been made so far. What should I do next? That is stored safely in storage. So sagas need persistence.
- 00:29:07 Poornima Nayar
- Then you have the business logic, which makes sure that there are no impossible transitions, no invalid states of the workflow. When a message arrives, the saga looks at the current state, evaluates the business rules, and then decides what to do next. It might be sending a new command, moving into a new state, triggering a compensation, any of that. Then you have your correlation logic, which helps map the incoming message to an instance of the saga: which workflow or which order does it belong to? In our case it can be the order ID, which is the basis of the correlation logic.
- 00:29:46 Poornima Nayar
- Okay, there we go. And then we have our completion logic. What marks the workflow as complete and cleans it up from storage? What are the clear criteria for successful completion, or a failed completion, or cancellation? The saga knows all of it. So putting it all together, what it looks like is: with an incoming message, the infrastructure first checks whether a saga instance exists. If a saga instance exists in the storage, it is located and rehydrated using the correlation logic. And if it does not locate a saga instance, it starts a new one. So messages can not only be used to locate a saga instance, they can also create a new saga instance. And then you go into processing the message. The color is not very visible here, but this is where your business logic happens: processing the message. Everything around it is infrastructure. Then you check whether you can complete the saga. Is it at a point where I can complete it? Is the workflow complete? Is the business outcome met?
- 00:31:00 Poornima Nayar
- And that can be checked against the completion logic you have in place. If it is at a state where it can be completed, it marks it as complete, saves, and then moves ahead. If not, it saves the saga and then watches out for the next message. So a saga can be described as a stateful message handler which can handle multiple messages. And in between handling those multiple messages, it saves its state durably into the database. Which means that it can evaluate business rules and decide what to do next. So a saga has a great memory.
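Putting those saga building blocks together, here is a minimal sketch of a stateful message handler. Everything in it is invented for illustration (the `OrderSaga` class, its fields, the in-memory `store` standing in for saga persistence, and the `ShipOrder` command): messages are correlated on the order ID, state is saved between messages, and the completion logic cleans the instance up once the business outcome is met.

```python
class OrderSaga:
    """A stateful handler: state is loaded, updated, and saved per message."""
    def __init__(self, order_id):
        self.order_id = order_id          # the correlation id
        self.inventory_reserved = False
        self.payment_authorized = False

    def handle(self, message):
        # Business logic: update state based on the incoming event.
        if message["type"] == "InventoryReserved":
            self.inventory_reserved = True
        elif message["type"] == "PaymentAuthorized":
            self.payment_authorized = True
        # Completion logic: the business outcome that ends the workflow.
        return self.inventory_reserved and self.payment_authorized

store = {}  # stand-in for saga persistence, keyed by correlation id

def on_message(message):
    oid = message["order_id"]
    saga = store.get(oid) or OrderSaga(oid)   # locate or start a new instance
    done = saga.handle(message)
    if done:
        store.pop(oid, None)                  # completed: clean up the state
        return "ShipOrder"                    # e.g. send the next command
    store[oid] = saga                         # not done: save and wait
    return None

first = on_message({"type": "InventoryReserved", "order_id": 1})   # waits
second = on_message({"type": "PaymentAuthorized", "order_id": 1})  # completes
```

Note that the two messages can arrive in either order: the saga's memory is in the persisted state, not in the arrival sequence, which is exactly what makes out-of-order events manageable.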
- 00:31:34 Poornima Nayar
- Now, there are two ways in which sagas can be approached, orchestration and choreography, which is a deep-dive topic in itself. But to give you my own 2 cents about it: orchestration is all about conducting an orchestra. Just like you have a central conductor leading a symphony, you have a central orchestrator that sends specific messages instructing services to perform actions. So the work is carried out by the services themselves, but the orchestrator knows what to do next and instructs services accordingly. It does that by sending commands, which are instructions. A command is a different type of message; it is all about instructing, asking to do something. And the saga orchestrator usually subscribes to the events that are published by the services which do the work. Which means that if you look at this central orchestrator component, you can see what the state of the workflow is, because the workflow is centralized. So these are the benefits; this is what orchestration is all about. Because the workflow is centralized, it knows what action to take next. It's that central component which drives the entire thing.
- 00:32:48 Poornima Nayar
- But be aware: this can be a monolith, or become a monolith in itself, and if you're not careful about it, it can attract a lot of code, and business logic might sneak into the orchestrator component, which is a strict no. So if you have teams that can manage the orchestrator, this is a very good option to go with. But with an orchestrator in place, what would the workflow be? The order service publishes an event into the broker, which is subscribed to by the order saga orchestrator, and then it instructs the various different services to take actions. So it might say: inventory service, reserve the inventory; payment, authorize the payment; fraud, start the fraud check; and shipping, you can do the pre-stage of shipping. So no service is actually looking at the events now. They are all looking at the commands that are issued by the orchestrator. And each of these services then goes back, maybe at a different point in time, to the orchestrator saying that I have done my job, and that is using events.
- 00:33:56 Poornima Nayar
- Then what happens is, maybe the fraud check might come back to the order orchestrator and say, you know what? I have failed the fraud check. So there might be business decisions at play to say, "Hey, inventory, you need to hold, you just need to temporarily release that lock on the inventory." So that can be a compensating action which goes back into the inventory service. Payment might get an instruction saying, "Just hold on." And shipping might be told, "Don't start your work yet, just hold on for a minute." Eventually things might pass and then the order saga might say, "Yeah, now I have a successful business outcome. I have enough data points to make a decision about whether the order can go into a state where it can be shipped, or I need to refund the payment to the customer, or maybe remove some of the items from the order and then ship the rest in the next shipment, or something like that."
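The orchestrator side of this flow can be sketched in a few lines. This is a toy, in-memory illustration only, not NServiceBus code: the command names (`ReserveInventory`, `ReleaseInventoryHold`, and so on) are hypothetical, and `send` stands in for dispatching a command through a real broker.

```python
from dataclasses import dataclass, field

@dataclass
class OrderSagaOrchestrator:
    """Central coordinator: subscribes to events, sends commands."""
    order_id: str
    commands_sent: list = field(default_factory=list)

    def send(self, command):
        # Stand-in for dispatching a command through the broker.
        self.commands_sent.append(command)

    def on_order_placed(self):
        # Instruct each service; the services do the actual work.
        self.send("ReserveInventory")
        self.send("AuthorizePayment")
        self.send("StartFraudCheck")

    def on_fraud_check_failed(self):
        # Compensating actions: pause or undo work already in flight.
        self.send("ReleaseInventoryHold")
        self.send("HoldPayment")
        self.send("HoldShipping")

saga = OrderSagaOrchestrator(order_id="order-42")
saga.on_order_placed()
saga.on_fraud_check_failed()
```

The point of the sketch is the shape, not the detail: all workflow knowledge lives in one component, so you can always ask it where an order is, and compensating logic has one home.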
- 00:35:00 Poornima Nayar
- So at that point, shipping might get the instructions from the orchestrator, it can do its work, and publish an event saying that yeah, it has done its work. And then you have the order workflow complete. So moving on to choreography, we don't have a central orchestrator here. We have services that react to each other's events in a sequence of steps. It's completely event-driven with no central coordinator, and the workflow and that logic live within each service. Nothing wrong with it, but the problem is that each service is reacting to raw signals here. Which means there's a chance that you could again run into pinball architecture, where there are just events bouncing off everywhere. So how does that look with choreography? There's no central orchestrator. You have events published and floating around all over the place. So you have the order placed event going into the broker, each of the services might pick it up and then come back at some point in time saying I have done my job.
- 00:36:09 Poornima Nayar
- The order service might pick that up and then publish saying that yeah, everything is happy to go ahead: shipping, you can continue. Shipping does its work, and the order workflow is complete. But the workflow is scattered around various different services. Again, if the fraud check fails, every other service has to then subscribe to that same event and make decisions. So this is where decision-making comes into play. Should I go with orchestration or should I go with choreography? That is where we need to speak about compensating actions. They can help you decide whether to go with orchestration or choreography. Every workflow has not just the happy paths, but the alternate paths. Think about the inventory. I could only reserve a part of it, payment failed, fraud check failed. But what if the fraud check actually cleared at a later stage? What if the payment could not be authorized, but while waiting for the payment, the fraud check actually cleared? What are the decisions to be made then?
- 00:37:12 Poornima Nayar
- These are all alternate paths that need course correction, and those course-correcting actions are your compensating actions. As a rule of thumb, we usually say if it is between service boundaries, go with choreography, and within a service boundary, have orchestration. But we can go beyond that and say: look at compensating actions, and if you have compensating actions that depend on business policies, then go with orchestration. Look at the number of arrows coming in and out of each service. That is an indicator that your system is getting to a point of overwhelm and you probably need a central coordinator. And if I'm to really give you a single sentence about it: if you can explain compensating actions in a single sentence, you can go with choreography. If it needs anything more than a single sentence on who compensates, that is orchestration. That's what I can tell you.
- 00:38:10 Poornima Nayar
- Now what happens with out-of-order delivered messages with sagas? For example, the fraud check came in first, before payment processing. No problem. With sagas, we are in a place where we know that the inventory, the fraud check, and the payment have to complete successfully before an order can be shipped. Each of these needs to be completed, but not in a specific order. So every time a message arrives from any of these services, the saga marks that the action has been taken: inventory has been completed, the payment has been completed, fraud has been completed. So the orchestrator knows, yeah, I have seen all these actions. Now I can instruct the shipment service to start. So the ordering becomes less of a problem, because the saga has gone through each of those states individually, and in a durable way as well.
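The out-of-order tolerance described above boils down to the saga tracking completion flags instead of expecting a sequence. A minimal sketch, with hypothetical event names and plain booleans standing in for durable saga state:

```python
class OrderSagaState:
    """Tracks which steps have completed, regardless of arrival order."""

    def __init__(self):
        self.inventory_reserved = False
        self.payment_authorized = False
        self.fraud_cleared = False
        self.shipping_started = False

    def handle(self, event):
        # Each message flips its own flag; the order of arrival doesn't matter.
        if event == "InventoryReserved":
            self.inventory_reserved = True
        elif event == "PaymentAuthorized":
            self.payment_authorized = True
        elif event == "FraudCheckCleared":
            self.fraud_cleared = True
        self._maybe_start_shipping()

    def _maybe_start_shipping(self):
        # Only when *all* prerequisites are done does the workflow move forward.
        if (self.inventory_reserved and self.payment_authorized
                and self.fraud_cleared and not self.shipping_started):
            self.shipping_started = True

# Fraud check arrives before payment: no problem.
state = OrderSagaState()
for evt in ["FraudCheckCleared", "InventoryReserved", "PaymentAuthorized"]:
    state.handle(evt)
```

In a real saga the flags would be persisted durably with the saga data, so a crash between messages doesn't lose progress.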
- 00:39:05 Poornima Nayar
- Now one of the other superpowers of sagas is the concept of timeouts. The timeout is the upper limit of time for which a saga can wait for a message to be received. If it doesn't receive that message within the timeout period, it can request a timeout, which is basically sending out another message which can again be handled. So what I'm trying to arrive at is that with timeouts, you have the ability to send messages into the future, which means that you can get rid of those nasty batch jobs in your system. You might be thinking loyalty points, loyalty service. It doesn't need to be a nightly batch job where an engineer has to be up and awake at 2:00 AM to fix it. You can leave it to sagas. There's an entire blog post around that, or come find us at the booth where we can talk to you about it.
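"Sending a message into the future" can be illustrated with a toy scheduler. This is an assumption-laden sketch, not how any particular broker implements it: the message names are invented, and a simple min-heap of due times stands in for the broker's delayed-delivery machinery.

```python
import heapq

class TimeoutScheduler:
    """Toy scheduler: a saga 'sends a message into the future'."""

    def __init__(self):
        self._pending = []  # min-heap of (due_time, message)

    def request_timeout(self, due_time, message):
        # The saga asks to be woken up later with this message.
        heapq.heappush(self._pending, (due_time, message))

    def advance_to(self, now):
        # Deliver every timeout message whose due time has passed.
        delivered = []
        while self._pending and self._pending[0][0] <= now:
            delivered.append(heapq.heappop(self._pending)[1])
        return delivered

sched = TimeoutScheduler()
sched.request_timeout(5, "PaymentTimedOut")       # escalate if payment is slow
sched.request_timeout(10, "GrantLoyaltyPoints")   # instead of a nightly batch job
delivered = sched.advance_to(7)                   # only the payment timeout is due
```

The loyalty-points case shows the batch-job replacement: instead of a 2:00 AM cron job scanning everything, each saga schedules its own future message when the order completes.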
- 00:39:57 Poornima Nayar
- Now let us talk about another aspect, which is shipping with carriers. We want to ship the customer order. We want to decide who might ship it the fastest and cheapest, or we might want to build a shipping plan. What are the options available, and why can't Pub/Sub help here? With Pub/Sub, events do not control or know who reacts. With Pub/Sub, we do not know the number of subscribers, and we cannot rely on replies either. And in this scenario that we spoke about, we require correlation: for every order ID, I need to be able to correlate who I sent the message to and who responded to me. We need to be able to rely on the responses, and we need to have the knowledge of who got the request as well. So all of that is important. So here comes the next pattern, which is scatter-gather, which is all about reducing latency and increasing throughput by doing work in parallel and not sequentially.
- 00:40:56 Poornima Nayar
- Imagine if I have to reach out to DHL, FedEx, and UPS one by one, sequentially. That increases latency, because all I want to know is: what is the cost? Hey, you all do your work in parallel and let me know that information when you have it. So you are spreading the work out, parallelizing it, rather than reaching out sequentially. That is the scatter-gather pattern. As the name suggests, there are two phases: the scatter phase, where you send that message out to all the recipients, and the gather phase, where each of the recipients responds by sending their messages to an aggregator through a response queue. And that aggregator puts together the result based on some business logic. And for scattering itself, there are two different methods, scatter by auction and scatter by distribution. Both are fan-out and then fan-in, but there are subtle differences. Auction is broadcasting a message to multiple consumers with no control over them. So that is Pub/Sub.
- 00:41:58 Poornima Nayar
- Scatter by distribution is where the scatterer knows what that list of recipients is. I know that there are some rules, and the list of recipients is put together based on those predefined rules. Maybe for this package it is just DHL and FedEx that can do it for me. So I only reach out to those. With scatter by auction, work can be duplicated, but it is all about picking the best option. With scatter by distribution, work can be split and carried out, but everyone must contribute to the result. So the responsibilities are quite different as well. With scatter by auction, you are saying: hey, try to win or lose quietly. What is that? You have to put together a bid and then come back to me, the aggregator, within a specific amount of time, but accept the fact that your response could be ignored. That is up to me to decide.
- 00:42:52 Poornima Nayar
- But with scatter by distribution, it is: do your part and contribute to the whole. That is the responsibility. So with scatter by auction, again, the intent is speed, quality, and resilience, while scatter by distribution is about completeness and correctness. So what does this look like? With scatter by auction, let us say we want to select the best delivery option. Given that we have the destination, the weight, and the delivery date, which are all fixed, you might want to know who is the fastest under the budget, or the cheapest that meets the delivery date. So shipping might publish to a topic, which then gets subscribed to by a number of subscribers. I have marked them 1, 2, 3 because at this point we do not know who those recipients are. They all put their responses into a response queue, and the aggregator, based on its own business rules, does some kind of scoring and then goes back to shipping with its response. Now all the responses that come back to the aggregator must arrive within a timeout so that we do not have increased latency again.
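A scatter-by-auction round can be sketched with threads standing in for carriers. The carriers, delays, and prices here are made up, and a thread pool plus a bounded wait stands in for the topic, the response queue, and the gather timeout:

```python
import concurrent.futures
import time

def get_quote(carrier, delay, price):
    # Stand-in for a carrier working out a bid on its own schedule.
    time.sleep(delay)
    return (carrier, price)

# Hypothetical bids: UPS is too slow and will miss the auction window.
carriers = [("DHL", 0.01, 12.0), ("FedEx", 0.01, 9.5), ("UPS", 1.0, 8.0)]

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(get_quote, *c) for c in carriers]
    # Gather phase: only bids that arrive within the timeout count.
    done, _ = concurrent.futures.wait(futures, timeout=0.3)
    bids = [f.result() for f in done]

# Aggregator's scoring rule: cheapest bid wins; late responders lose quietly.
best = min(bids, key=lambda bid: bid[1])
```

Note the auction semantics: UPS's bid would actually have been cheapest, but "win or lose quietly" means a response that misses the window is simply ignored, which is how the timeout keeps latency bounded.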
- 00:44:06 Poornima Nayar
- With scatter by distribution, it is all about: let me see what the options are. So here, shipping knows that these three are the options because it has put together a recipient list, which is the recipient list pattern, and these are the carriers that I intend to reach out to. So it sends those instructions to the three carriers. They put together their responses individually, which go to the aggregator, which then validates and checks them, and then goes back to shipping with the response. So this might be a plan that goes back to it. This might look awfully like a saga, but there are differences. Scatter-gather is all about parallelizing work to get an answer, whereas a saga is all about coordinating a long-running workflow. Scatter-gather is fan out, collect replies, and complete the result, while a saga is moving a workflow step-by-step to a business outcome. The criteria for success differ too. With scatter-gather, the success criterion is: have I got enough answers to aggregate a result?
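The distribution variant differs from the auction in one key check: every recipient on the predefined list must answer, or the aggregate is invalid. A hedged sketch, with an invented recipient list and response shapes:

```python
def scatter_by_distribution(recipients, responses):
    """Aggregate responses where *every* recipient on the predefined
    list must contribute -- completeness over speed."""
    missing = [r for r in recipients if r not in responses]
    if missing:
        # Unlike an auction, a missing reply invalidates the whole result.
        raise ValueError(f"missing responses from: {missing}")
    # Combine each carrier's part into one shipping plan.
    return {carrier: responses[carrier] for carrier in recipients}

recipient_list = ["DHL", "FedEx"]   # chosen up front by predefined business rules
plan = scatter_by_distribution(
    recipient_list, {"DHL": "legs 1-2", "FedEx": "leg 3"})
```

Compare this with the auction, where a missing bid is silently dropped; here it is an error the aggregator has to surface.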
- 00:45:11 Poornima Nayar
- So scatter-gather is all about answering: what is the result? Whereas with a saga, it is: how can I make a process happen successfully over time and see it through to completion? And the use cases are different as well. Scatter-gather you use for read and compute flows, like searching, or getting a number of quotes and deciding which is the best. Sagas are for business processes, long-running workflows, the batch jobs. But there are things that you need to watch out for. You do not get ordering guarantees. You need a way to correlate between requests sent out and responses received, and you need to watch out for slow recipients, so you need some timeouts in place. Make sure that you have an idempotent scatterer, otherwise you might get six responses instead of three with the distribution way of scattering, or you might have a bid storm with multiple answers because all those recipients have received more messages than they should. And use it only for read and compute operations, never for parallelizing writes. That's not the intent of scatter-gather.
- 00:46:18 Poornima Nayar
- Now we are going into more non-functional space, because as much as functionality is important, we need our non-functional requirements as well. So imagine our website is very successful, we have a Black Friday sale, and the orders are coming in, flooding the system. So there's unprecedented load on the website. How can you survive that spike? Let us look at what is happening. On a normal day there could be a sender publishing the orders, a sender or order service that is just publishing the events into the broker or sending messages. And there's a single instance of the receiver that is churning through the order placed events or the messages. So the throughput is maintained and constant on a normal day. But when you have unprecedented load on the system, the receiver cannot keep up with the sender. That is possible.
- 00:47:14 Poornima Nayar
- And we only have one receiver, one physical instance of the receiver. If it crashes, throughput is impacted. If it cannot keep up, throughput is impacted. So what do we do? We know that with asynchronous messaging we get decoupling out of the box, so we can actually physically scale out multiple instances of that receiver so that it can then keep up with the incoming load. That is the competing consumers pattern. It is a scaling pattern where multiple concurrent consumers read from the same queue, and each message is processed by only one consumer. There's also a point about inversion of control here. Rather than pushing messages into the consumer, the consumers come and take control, saying: I am free to do the work. So let me take that next message off the queue and process it.
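The pattern can be demonstrated in miniature with threads pulling from one shared queue; the worker names and order counts are invented, and a `queue.Queue` stands in for the broker's queue.

```python
import queue
import threading

work = queue.Queue()
for order_id in range(20):          # a spike of incoming order messages
    work.put(order_id)

processed = {}                      # consumer name -> orders it handled
lock = threading.Lock()

def consumer(name):
    # Every consumer pulls from the *same* queue; each message
    # is taken off the queue by exactly one of them.
    while True:
        try:
            order = work.get_nowait()
        except queue.Empty:
            return                  # queue drained, stop competing
        with lock:
            processed.setdefault(name, []).append(order)

workers = [threading.Thread(target=consumer, args=(f"worker-{i}",))
           for i in range(3)]       # scale out: three competing instances
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every order was handled exactly once, spread across the consumers.
all_orders = sorted(o for handled in processed.values() for o in handled)
```

This also shows the inversion of control mentioned above: nothing pushes work at a consumer; each one pulls the next message when it is free, which is what makes the load balancing "natural".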
- 00:48:09 Poornima Nayar
- And most often, teams start with the competing consumers pattern because it's the simplest and most effective way to scale throughput. And it is a scaling pattern, not a messaging pattern. So how can it look in our order workflow? If you look at our order workflow, once the orders start flooding in, it is the inventory service that might find it difficult to keep up. And if you think about it from a business perspective, when the orders come in, we need to make sure that with the incoming flood we are not massively overselling to the point where we do not have any inventory. Because then even compensating actions like backordering cannot work. Because what if you have 10,000 orders and you cannot fulfill 9,000 of them? That's not a situation that you can turn around in a small amount of time, right?
- 00:48:58 Poornima Nayar
- And technically speaking, it is one of the spikiest services as well, because it has multiple database writes. Look at stock deduction, concurrency checks, reserving stock, calls to warehouses. There might be high contention on popular items. So it's exactly the kind of service where you want to keep up with the rate of incoming messages. So one way would be to scale out that specific service. And what are the benefits of competing consumers? You have a load-leveled system that can handle variations and spikes. You have improved reliability: messages are not lost, they are retried. So make sure you have your idempotency. There's no complex coordination needed between the producer and the consumer, or between the consumers themselves. You can scale your services out because they're already decoupled. There are no routing changes required. It improves resiliency, because if you have four instances and one fails, you still have the other three to keep up with your incoming orders. You achieve natural load balancing, and with the auto-scaling that is available on the various cloud providers, you can scale out on demand and then scale back when things are quiet, or maybe during the night.
- 00:50:14 Poornima Nayar
- And competing consumers are great for "do a specific job once" scenarios: background processing and sending emails are examples where competing consumers can really help. And how is it different from Pub/Sub? Pub/Sub is a communication style. Every subscriber gets a copy of the message and reacts to the event independently. It optimizes for extensibility and broadcast-like scenarios, but competing consumers is all about scaling. It's a scaling pattern where we optimize for throughput and elasticity. Pub/Sub is all about answering: who is interested in knowing something that has happened? And competing consumers is all about: who should do this work? But you can bring the two together. Mind you, competing consumers always need a queue to work with, because they are all competing to get messages off a queue. So if you have to make it work with Pub/Sub, you need to bring in AMQP-based brokers like RabbitMQ or Azure Service Bus, where the topics can then forward their events into queues, and then the multiple instances can fight for the messages from the queue.
- 00:51:28 Poornima Nayar
- So this is what it would look like. The topic forwards the events into separate queues, and then physical instances can compete for messages from the queues. But watch out: there are no ordering guarantees, and do not use it when there is dependency between tasks, because that is the job of sagas. And if you are not careful about how many instances you have, you could be moving the bottleneck further downstream, to services like the database. And at that point, this becomes a less viable path. So you need to think about back pressure mechanisms as well. And of course, always have a retry policy in place and prevent failed messages from impacting your throughput.
- 00:52:13 Poornima Nayar
- I want to remind all of us about this ancient proverb: an unchecked flood will sweep away even the strongest dam. When you have a distributed system, that unchecked flood is the incoming messages, the data that is coming in. And our dams are our services. So imagine a very busy system, our system on that Black Friday sale day, where requests are coming in at a thousand messages a minute, consumers are falling behind, queues are growing without bound, latency is through the roof, and retries are amplifying the workload on the system. So even the most well-designed system you might think you have in place is going to work at 200% capacity, and we might be reaching a meltdown. That is why we need back pressure and flow control. Without back pressure, what we have is unbounded throughput welcoming unbounded failure. It is like driving a car at 200 miles an hour, if that is possible.
- 00:53:16 Poornima Nayar
- Speed thrills, but it kills. So back pressure comes in and says: slow down, system, I just cannot keep up. It is controlled resistance to incoming work. It works like the smart motorways, where the traffic is adjusted to keep things moving without entering a gridlock or a standstill traffic situation. So it is all about slowing down and succeeding at staying alive, rather than working at 200% capacity and then failing massively and catastrophically. So what are some of the common back pressure mechanisms that you can start with? With competing consumers, of course, make sure you have bounded prefetch, concurrency limits, and pull-based consumption of the queue so that you are not overwhelming downstream services. Queue-based load leveling: use the queue as a buffer for storing messages. Rate limiting and throttling to protect yourself from your own producers. And if you are in a position where you can prioritize the load and shed any unimportant load, you can go with load shedding.
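Two of those mechanisms, queue-based load leveling and load shedding, fit in a tiny sketch. The message mix and queue size are invented; a bounded `queue.Queue` plays the role of the broker's buffer, and for simplicity anything that arrives once the buffer is full is shed.

```python
import queue

inbox = queue.Queue(maxsize=5)      # bounded buffer: the queue absorbs spikes
shed = []                           # work we dropped under pressure

# A burst of payments followed by lower-priority loyalty-point messages.
burst = [("payment", i) for i in range(5)] + [("loyalty", i) for i in range(3)]
for message in burst:
    try:
        inbox.put_nowait(message)   # normal path: buffer the work
    except queue.Full:
        shed.append(message)        # back pressure: shed the overflow
```

A production system would be smarter about *what* it sheds (drop loyalty points, never payments), but the core idea is the same: a bound on the buffer turns an unchecked flood into an explicit decision.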
- 00:54:22 Poornima Nayar
- So you might want to process the payments, but look at loyalty points later. That is less important than taking the payment from the customer. So that can be shedding unimportant work and doing more important work. That brings me to the last topic of the day, which is retries. In distributed systems, you can have network failures, database locks, all of which are momentary. Given time, it'll heal itself. But if you fail immediately at that point, something that could have been fixed given time, a blip, becomes an incident. Retries are all about that. A retry is letting time fix the problem. You are giving your system a second chance at processing a message or overcoming that failure. So what you're doing is designing your system for self-healing. So how would you retry a message? Do you keep retrying infinitely without any plan? That is gearing up for a retry storm.
- 00:55:19 Poornima Nayar
- What is a retry storm? When you have the same message being retried many, many times, you are not giving other messages in the queue a chance to be processed. That is one. The queue might be growing beyond its limit at this point, which means that if the queue stops taking messages altogether, the system comes crashing down; you're going to have a meltdown. Yeah. So that is why you need clear retry strategies, and retry counts, and upper limits on the retry counts in place as well. So what are the retry strategies that you could have in place? Immediate retry: when I have a failure, I retry immediately. It is cheap and fast, and it works well for transient exceptions. There's a database lock. There's a database which is not available at this point in time. Brief network hiccups. These are all transient exceptions, which given a second or a minute can fix themselves.
- 00:56:22 Poornima Nayar
- So immediate retries work well in these scenarios, but make sure that when you have immediate retries in place, you have an upper bound: you try immediately again and again, three times or five times, and then stop. And then you move on to the delayed retry. This is retrying with a delay. When you see that things don't fix themselves, that is a semi-transient exception. Maybe you are waiting for a database cluster to fail over. Maybe a service is restarting. So you need to give it a little bit more time to prevent it from turning into an incident. So you retry at either fixed intervals, that is, every 10 seconds you keep retrying, or you use exponential backoff: retry after 10 seconds, and when you see that it doesn't succeed, back off to 20, then 40, and so on, doubling each time. Again, make sure you have an upper limit for the delayed retries as well.
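The full retry policy described above, a bounded number of immediate retries followed by bounded exponential backoff, can be written down as a schedule of delays. The counts and base delay here are example values, not a recommendation:

```python
def retry_delays(immediate=3, delayed=4, base_seconds=10):
    """Build the full retry schedule: immediate retries first,
    then exponential backoff, then give up (error queue territory)."""
    delays = [0] * immediate        # retry right away, a bounded number of times
    wait = base_seconds
    for _ in range(delayed):        # then back off: 10, 20, 40, 80 seconds
        delays.append(wait)
        wait *= 2                   # doubling = exponential backoff
    return delays

schedule = retry_delays()
```

Reading the schedule out loud gives you the policy: three instant attempts for transient blips, then four increasingly patient attempts for semi-transient failures, and after that the message goes to the error queue.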
- 00:57:19 Poornima Nayar
- Now when you have exhausted both immediate and delayed retries, then you know that you have some kind of major problem going on. It could be the message that is corrupt, or the handler code itself that is buggy. Then what you want to do is put those messages aside, because at this point they have become poison messages, messages that are corrupt, and you want to set them to one side and then try to fix the issue without the immediate or the delayed retries intervening. That is why you have error queues in place. When you have all your retries, immediate as well as delayed, exhausted, move the messages into the error queue. As a best practice, always, always make sure that you have error queues, because it is that which gives you visibility that something is wrong. It contains those errors and stops that failure from spreading.
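The "retries exhausted, park it" flow looks roughly like this. A deliberately simplified sketch: the error queue is a plain list, the retry loop has no delays, and the handler names are made up.

```python
def process_with_error_queue(message, handler, max_attempts=3):
    """Try a handler a bounded number of times; after exhaustion,
    park the message in an error queue instead of retrying forever."""
    error_queue = []
    for _ in range(max_attempts):
        try:
            return handler(message), error_queue
        except Exception:
            continue                # real systems add delays between attempts
    # Poison message: set it aside so it stops blocking the queue,
    # and so an operator can inspect, edit, and replay it later.
    error_queue.append(message)
    return None, error_queue

def always_fails(msg):
    raise ValueError("corrupt message")

result, errors = process_with_error_queue({"id": 7}, always_fails)
```

The important property is that the failing message leaves the main processing path: healthy messages keep flowing, and the poison one sits somewhere visible waiting for a human or a fix.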
- 00:58:11 Poornima Nayar
- So you put aside a set of messages to come back to later. Remember, with asynchronous systems, one of the key things is that they fail silently. So if you don't have the error queues, you will not find out that something has failed, because it's silent. So best practice: always have error queues. But that's not all. You also need to have error queue management and monitoring in place, because if you don't have this, your system is incomplete by design. You have a set of error messages, but there is no way to monitor the error queue, no way of knowing what is going on inside it. So you need some way of monitoring that as well.
- 00:58:54 Poornima Nayar
- There are multiple ways to do queue monitoring and message replay. You have the Azure Service Bus dead-letter queue and its replay mechanism. You can use that. You have the RabbitMQ Shovel plugin, which can help you monitor the queue and then move messages between queues if needed. Or you have the Particular platform itself, where you can see all the error messages in your system on a dashboard, maybe edit them and retry them. If you want to know more, come to the booth, we'll talk about it later.
- 00:59:26 Poornima Nayar
- So wrapping up, what are the patterns to use and when? I would say do not start with the patterns immediately, start with the pain points. Each of the patterns that we discussed today addresses specific problems. So what are those failure modes? Ask yourself these questions. What can be lost? What can be duplicated? What can be eventually completed? What happens if these messages arrive out of order? What if this service fails at 2:00 AM, what is going to happen? These are your failure points. And based on those failure points, you can make these judgments. So if you want to take a picture of this, you can, very quickly.
- 01:00:08 Poornima Nayar
- Done? Okay. And remember, as a best practice, apply patterns incrementally. Don't just go all over the place. Add one pattern for one pain point, solve that, then observe and then decide again. And if you cannot explain why a pattern exists in your system, remove it. All patterns everywhere, that is a strict no-go. That'll be a system where no one can understand what is going on and no one wants to touch it. Often, a combination of patterns might also be needed: outbox plus saga, outbox and idempotency together, outbox plus competing consumers, all of that could be needed. But remember, Pub/Sub is what got you started. Patterns help you finish strong. Pub/Sub is your foundation, it's not the finish line. You don't need all the patterns, but you need the right pattern to solve the problem you have in your system. So that is it from me today. Thank you for being here.