So you want to build a service bus

00:00 William Brander

Hi, everyone. My name is William. I'm a South African developer working for Particular Software. Before we get into the presentation, just two things about South Africa. The first is, I know my accent can be a little difficult to understand, so I will try to speak a little clearer. [laughter]

If there is anything that I get too excited about and go through too quickly, grab me afterwards, we can chat. I'm always excited to talk about messaging systems. The same goes if there are any questions that I don't answer for you guys properly. Second thing about being South African, and this is on a more serious note, as a South African citizen, I feel it's my moral duty and obligation to apologize to the rest of the world for what we did when we introduced you to Die Antwoord. Don't get me wrong, I liked the music, but I'm just sorry that the rest of you have to experience it from time to time as well.

What are we talking about today? We're going to talk about how to build a service bus. Is there anyone in the room currently using a service bus? Excellent. That's actually quite good. Anyone here not know what a service bus is at all? Pretty sure you just raised your hand for both questions, which is actually kind of cool. I'm not going to go and explain what a service bus is directly through this talk, but hopefully as we go through the patterns and the principles, it'll become a bit clearer for you, and you'll get an idea.

What I am going to do though, is cover why you would want to use a service bus. So some of the reasons behind it, once you move to a messaging based architecture, a lot of the problems that you typically deal with in development become infrastructural problems. And that makes it incredibly powerful to do software development. A good service bus would ensure that you never ever lose a message. It's easier to make systems that are reliable, that can recover from failures. If you've got a table lock and you're trying to update a record and something goes wrong, fixing that just becomes infrastructure, you don't have to write code to prevent that from happening.

Scalability becomes easier with a good service bus, because a good service bus will encourage you to deal with a logical deployment, a logical endpoint, instead of physical addresses and physical nodes. Which means that it's easier for you at runtime, after you've written your system, to go and add additional nodes, and the rest of your system doesn't even have to know about it. It becomes easier for you to scale and develop solutions that can work with more and more demand.

A good service bus also encourages you to decouple your business concerns. So it becomes blatantly obvious when you've got an aggregate root that's dealing with information that it really shouldn't. If you've got one endpoint that's handling a whole lot of customer order messages, suddenly it becomes incredibly easy to see when you're dealing with account information in the same endpoint.

A good service bus, with the API that it provides you with, will encourage decoupling. So there are a lot of benefits to using a service bus. It's a tool that you can use. What I like to compare this to is when you're starting a software project. Maybe it's a greenfields project, maybe you're approaching this big ball of mud and you're trying to work out how you can decompose it into something a bit easier. You know where you are at the start, and you kind of have an idea of where you want to end up, but the path that you take to get to the end is an adventure. It's like a game.

At the start of the game, for those of you who haven't played, this is Legend of Zelda. Link, the guy at the bottom, is the hero and is about to go on this quest. He's about to go on this journey. He knows he has to save the kingdom, but he doesn't quite know how he's going to get there. The old man at the top has given him a sword, and the sword is a tool, something he can use to make his journey a bit easier. Sure, he didn't actually need it to go and save the kingdom, but it's a lot easier to fight monsters with a sword than with a feather duster.

What we're going to be looking at today is how to build that tool, how to build in the analogy, the sword that we're going to give to the hero, who's going to use that to save the kingdom. And what happens to create that sword and to craft it. Our sword today is going to be based off of messaging, because that's the best way to do software development. And we obviously can include the ability to send messages. Once we've got that in place, we're going to take a look at how to route those messages to different endpoints.

From there, we're going to build pub/sub on top of it. And once we've got pub/sub working, we can look at how to implement long running workflows. Some of you may know these as sagas. Once we've got that in place, we then are going to take a look at how to make our service bus resilient, how to recover from failures. What happens when something goes wrong? I'm going to jump straight into the code, because that's how I roll.

The demo today, or the demos today, are going to be .NET, and they're all going to be using MSMQ, but please rather focus on the underlying patterns, don't let the technology be an issue here. I've given the same talk in JavaScript with a RabbitMQ endpoint behind a REST interface. The technology is completely irrelevant, the principles are what's important here. Sending messages: let's jump straight to Visual Studio. You can't see that at all up there, but hopefully the font is big enough. So with MSMQ the easiest way to send a message is to simply create a new instance of a message queue object and call queue.send. The message queue object has an address, a physical address where that message must go.

And when you send a message, you can pass any object that is serializable, and that message will get sent to that endpoint. I've got some helper stuff around here at the bottom to check if a queue exists and, if it doesn't, create it, but that's kind of fluff at this stage. Sending a message is trivially easy, and if I run this code, it's going to open and close. And if I look at Queue Explorer, not that one, that one, if I look at Queue Explorer, I've got my queue that's been created for me. And it's got a message waiting to be picked up by the recipient. Super simple.

On the recipient side, it's just as simple. Open that. On the recipient side, again, you create a message queue object, you subscribe to the receive completed event, and then once your message has been received, the first step is to deserialize that message. And then you process the message. You handle it however you want. So in this case, all I'm doing is outputting to the console: this is the message that was sent. If I run my application, there we go, that's our test message that we sent earlier that's been processed.

Sending the message is super, super simple. What if something goes wrong though? So I'm going to simulate an application bug. And this is how I do all my coding. I just throw exceptions everywhere because that's really cool. And my colleagues love me for that. When I process the message, something's going to go wrong. Maybe there's going to be a database lock. Maybe I've got a bug in my code. There's going to be an exception thrown. If I send a message to this endpoint again, for one.

I've sent the message, we can see it's sitting there in the queue over there. If I receive the message, but I throw an exception, as you expect my code breaks, which is great. I can't officially write or throw exceptions in code. But the problem is, if I refresh here, that message has disappeared. If this was a process payment message, that money's gone, I'm never going to get that money ever again. And that's how my salary works. I kind of like to make sure that those messages at least get processed correctly.

We need to make this a bit more resilient, and with MSMQ, because it's a transactional queuing technology, it's super simple. This is exactly the same code, except that I've added a transaction scope around the send. So my send is now transactional. If anything fails before transaction.commit, everything will get rolled back. I'm going to run that, and then we go to the transactional receive, get rid of that. It's exactly the same as the previous receive, except again, you do it inside the scope of a transaction. So that's all you have to do to get transactionality around your queuing operations with MSMQ.

If I throw an exception here, just to get there. So I have an exception in my receive, run that. The exception gets thrown as expected, but the benefit here is that on the transactional side, that message is still waiting in the queue. So we can fix our code, we can deploy it again, and then when we run the application, it'll try and process that message a second time. We've got reliability in the sense that if things do break, everything breaks, but at least we don't lose information. And that's important for a good service bus to have.
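The behaviour described above can be sketched without MSMQ. The following is a minimal, in-memory illustration in Python rather than the .NET code shown in the demo; `InMemoryQueue` and `transactional_receive` are hypothetical names invented for this sketch, not part of any real queuing API. It only shows the principle: if the handler throws, the receive is rolled back and the message stays in the queue.

```python
from collections import deque
from contextlib import contextmanager


class InMemoryQueue:
    """Hypothetical stand-in for a transactional queue such as MSMQ."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def __len__(self):
        return len(self._messages)

    @contextmanager
    def transactional_receive(self):
        """Hand out the next message; put it back if the handler raises."""
        message = self._messages.popleft()
        try:
            yield message
        except Exception:
            # Roll back: the message returns to the front of the queue.
            self._messages.appendleft(message)
            raise


def process(queue, handler):
    with queue.transactional_receive() as message:
        handler(message)


def buggy_handler(message):
    raise RuntimeError("simulated application bug")


queue = InMemoryQueue()
queue.send("test message")

try:
    process(queue, buggy_handler)
except RuntimeError:
    pass  # the application is still broken, but nothing was lost

remaining_after_failure = len(queue)

handled = []
process(queue, handled.append)  # the "fixed and redeployed" handler
remaining_after_success = len(queue)
```

After the failed attempt the message is still waiting, and the second attempt with a working handler consumes it.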

Oops, I have to go to that one. What we've done is we've shown how to send a message. And in the terms of our adventure game, this isn't even stepping out the door. This is as simple as the guy finding pants and putting them on in the morning. We've got the absolute basics, but things are going to ramp up as we kind of get through the rest of the session. What we have done though, is that we've shown that it's stupid simple to send a message to a queue. The question is, is a queue good enough? Or do we need a higher level of abstraction? And I'm going to come back to this question shortly.

We've also shown that you can use transactions to save yourself from failures. If something goes wrong, at least you won't lose messages. Your program's still completely broken, but hey, you've got the information there, you can always recover from it. How you recover from that is something we're going to address a little bit later as well. Some errors are transient, in that if you just retry that message again, the error will disappear. And I'll show you how you can implement this as we go through the session.

We haven't even addressed concurrency yet though. So at the moment we take the message off the queue, we deserialize it, and then we process it in a single thread. I've got eight or sixteen, I don't know how many, cores on my machine, surely I can do a little more than one thread of operations. Maybe it's a good idea to spin up a thread for every message that comes through. Sounds great until you actually get some real load in your system, and you've got thousands of messages, and you've got thousands of threads that are trying to execute at once.

There's a whole lot of work that you have to do around concurrency and asynchronicity when you're implementing a service bus. I'm not going to touch on that today, there's an excellent talk by Daniel Marbach. There's a link to it in my last slide. So you can go watch that, it's a three-part talk where he explains how to implement these things properly. But coming back to the question, is a queue enough? And the answer is no, it's not, because a queue is a physical destination where you're sending a message, and you have to know, when I send this message, I'm sending it to this location, and that's not good enough.

If we look at routing messages, when you send a message, it should go to a logical endpoint. A logical endpoint is important because you can move that endpoint around, you can scale that endpoint out, you're not constrained to an address. When you talk about routing messages, it's also important to know that when you send a message, you should probably select the destination based off of some content in the message, whether it's the type of message you send in, or some type of discriminator that can be used to route a message to different places.

You don't want to have to define, every time you send a message, where that message must go. That's going to make your deployment structure very fragile. Sorry. Let's take a look at how we can route those messages. Close, close, close, close. Right. What I've got here is a similar program to the ones before, except I send two messages. The first message I send is a put pants on command. The second one is a have breakfast command, because that's what I do every morning.

I've created one level of abstraction higher than just a queuing object though. I've got a class called Sender, so right over here, I've got a Sender class. Once I create a new instance of the Sender class, I then do some mapping. I'm still explicitly saying where messages must go, but I've kind of separated that from the send operation. I've said that the type put pants on command must go to Link's address, because he has to put pants on.

The have breakfast command must go somewhere else. When I send the message down here, what you'll notice is that there's no address or location where I'm saying that that message must go to. That is all done by this configuration over here. And the code for this is very simple. The sender just keeps a list, or a dictionary, of types and maps those to different destinations. That's it. We haven't really done much here either, but what we've done is created flexibility.
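A sender that maps message types to destinations can be sketched in a few lines. This is a hedged Python illustration, not the .NET code from the demo; the class and method names (`Sender`, `map`, `send`), the message classes, and the addresses are assumptions made for the example, and the transport is just a callable that records what was sent.

```python
from dataclasses import dataclass


@dataclass
class PutPantsOnCommand:
    colour: str


@dataclass
class HaveBreakfastCommand:
    meal: str


class Sender:
    """Keeps a dictionary of message type -> destination address."""

    def __init__(self, transport):
        self._transport = transport  # callable(address, message); hypothetical
        self._routes = {}

    def map(self, message_type, address):
        self._routes[message_type] = address

    def send(self, message):
        # The caller never names an address; routing is pure configuration.
        address = self._routes[type(message)]
        self._transport(address, message)


delivered = []
sender = Sender(lambda address, message: delivered.append((address, message)))

# Configuration: where each type of message must go (addresses are made up).
sender.map(PutPantsOnCommand, "link@localhost")
sender.map(HaveBreakfastCommand, "kitchen@localhost")

# Sending: no address in sight.
sender.send(PutPantsOnCommand(colour="green"))
sender.send(HaveBreakfastCommand(meal="scrambled eggs"))
```

The mapping is still explicit, but it now lives in one place and can change without touching any code that sends messages.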

So I'm not ever going to run that because again, it's so simple. And I do want to get to the interesting bits later. We've added a small bird on top of an already small bird, so we're still settling for something really small and kind of useless, but it's looking nice, it's looking encouraging, it's cute. And it's easy to route the messages if everyone knows where those messages must go upfront, which is great. It sounds feasible until you actually try to use this in production, and you end up with 100 nodes. And as soon as you try and deploy or move a node somewhere else, you have to redeploy your configuration everywhere else as well.

That's a lot of shared knowledge that everyone has to have, and it makes your deployment environment very brittle. We can improve this though. So let's try and reduce the number of required node elements, and we're going to do that by making three assumptions. The first assumption that we're going to make is that there's a difference, a semantic difference, between a command and an event. And this is actually an important assumption even in the DDD space. So a command goes to one location and is owned by the destination where it goes, and an event comes from one location.

It's to do with single responsibility and ownership of that data. Once we've made the assumption that there's a difference between the two, we can make two further assumptions. So the next assumption is that commands have a single destination, a logical destination, not a physical destination. And this is important because when you say update credit card details, you need to know where that command goes, and who owns the credit card details. You don't want that to potentially go to one or two places depending on some other discriminator, because then who's actually got the most up-to-date credit card information?

The next assumption is that events have a single origin, but they can go to multiple places. And this is important because an event is something that has happened. This is a fact, this is the truth. So the credit card details have been updated, and then whoever wants to know about that can know about it. But the reason it comes from a single origin is because the origin is the only one who should be able to change the credit card details. So we've got ownership there, and the distinction between events and commands is an important one.

Once we have the assumption that events come from one place and commands go to one place, we can implement pub/sub, and then we can reduce the number of node elements that we have. So pub/sub should actually be called sub/pub, but I think that doesn't sound as cool. And the way pub/sub works is the publisher doesn't know yet where its messages must go. So if I am interested in knowing when credit card information has been updated, I will then send a message, a control message, to the publisher to say, "Hey, let me know when the credit card details have been updated."

The publisher will then say, "Okay, I'm going to remember this, this guy over here at this address wants to know when credit card details have been updated. So when that occurs, I can send a message to that address." And multiple subscribers can do the same thing, they can all send control messages to the publisher to say, "When this happens, let me know about it as well." So when an event happens, when the publisher wants to send message one, it goes and looks up, "Okay, who is interested in this one? Oh, it's subscriber one and subscriber two, here are their addresses, let me send a message to both of them."

That's pub/sub, it's actually really simple, but it's incredibly powerful when you use it correctly. We could extend our publisher to persist the subscriptions. At the moment, if my publisher went down and restarted, if I was just keeping all those subscriptions in memory, when a message one occurred, I wouldn't know where to send that message again. We might want to include some kind of persistence there to store where those messages must go. How would we implement this?
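The subscribe-then-publish flow described above can be sketched as follows. This is an in-memory Python illustration under stated assumptions: `Network`, `Publisher`, `Subscribe`, and the address strings are hypothetical names made up for the example, and the subscriptions are held in memory only, which is exactly the weakness just mentioned (they would be lost on restart).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subscribe:
    """Control message: 'send event_type to reply_address when it occurs'."""
    event_type: type
    reply_address: str


@dataclass
class CreditCardDetailsUpdated:
    customer_id: int


class Network:
    """Hypothetical in-memory transport: one inbox (a list) per address."""

    def __init__(self):
        self.inboxes = defaultdict(list)

    def send(self, address, message):
        self.inboxes[address].append(message)


class Publisher:
    def __init__(self, address, network):
        self.address = address
        self._network = network
        # In memory only; a real publisher would persist this.
        self._subscriptions = defaultdict(list)

    def pump(self):
        """Process pending control messages from the publisher's own inbox."""
        inbox = self._network.inboxes[self.address]
        while inbox:
            message = inbox.pop(0)
            if isinstance(message, Subscribe):
                addresses = self._subscriptions[message.event_type]
                if message.reply_address not in addresses:
                    addresses.append(message.reply_address)

    def publish(self, event):
        # Look up who is interested and send each of them a copy.
        for address in self._subscriptions[type(event)]:
            self._network.send(address, event)


network = Network()
publisher = Publisher("billing", network)

# Subscribers announce their interest by sending control messages.
network.send("billing", Subscribe(CreditCardDetailsUpdated, "subscriber1"))
network.send("billing", Subscribe(CreditCardDetailsUpdated, "subscriber2"))
publisher.pump()

publisher.publish(CreditCardDetailsUpdated(customer_id=42))
```

Note that the publisher was never configured with subscriber addresses; it learned them from the control messages.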

Good. So I've got my sender over here. And the first thing that you will see is that I have introduced our first notion of a bus. This is the first time in the talk that I've mentioned a bus. There are reasons why you might not want to have a bus in your solution. For NServiceBus, for example, we've taken the bus out because it just adds a lot of complexity later on. But for now, let's deal with the bus. The bus accepts an incoming address as one of its parameters. This is something that we could calculate at runtime. I don't need to pass that in, but for demonstration purposes, it's nicer to be explicit about it and show that the incoming address is needed. The bus needs to have somewhere where it's going to receive messages from.

Now, there's no configuration, and you'll see all I'm doing is sending two events. I probably should differentiate between sends and publishes at this point, but that's not important for this. But I haven't configured where those messages must go. That happens in the actual subscribers. So this is our equivalent of the publisher. The subscribers... I will do the pants one. The subscribers do almost exactly the same: they create a new instance of the bus, and then they call a method, subscribe to messages from.

Our subscriber over here wants to listen to pants have been put on events. And remember, we know where events come from, so we can send a control message to that endpoint to subscribe. So when we actually call subscribe to messages from, underneath the covers, all we're doing is sending a message to that endpoint saying, "When one of these messages occurs, let me know about it. And here's my incoming address." And then the sender side just keeps a record of the types and the destinations where those events must go.

The concepts are very, very simple here. Again, in the scope of a transaction, because that's what cool kids do. The next thing you might've noticed is, we've got this concept of a handler, which isn't important for pub/sub at all, but this is a nice clean interface that a good service bus will provide for you. All of the decent ones, MassTransit, NServiceBus, Rebus, they provide something similar to this. Once a message comes into the bus, the bus goes and looks: who is responsible for handling this incoming message? Oh, it's that class, let me create an instance of that class, and then we can invoke that method there.

And that's exactly what happens here. So when the pants have been put on event occurs, our service bus goes and looks for anyone that implements this, IHandle of pants have been put on event. Once it finds one or many, it creates an instance of this and then invokes the handle method. The code for this looks horrible, now we kind of get into reflection stuff everywhere. You can do this using dynamic, whatever you want. But it's a nice way to make your coding experience a little neater as well.
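The handler lookup described above can be approximated without .NET reflection. The sketch below is a Python analogue and not the speaker's implementation: it assumes a hypothetical `Handler` base class whose subclasses declare the message type they handle, and a `dispatch` function that finds every matching handler class, creates an instance, and invokes its `handle` method.

```python
from dataclasses import dataclass


@dataclass
class PantsHaveBeenPutOn:
    colour: str


class Handler:
    """Hypothetical base class; subclasses declare which message they handle."""
    message_type = None

    def handle(self, message):
        raise NotImplementedError


log = []


class AnnouncePants(Handler):
    message_type = PantsHaveBeenPutOn

    def handle(self, message):
        # Only business logic lives here; no queues, no addresses.
        log.append(f"Link is wearing {message.colour} pants")


class StartQuestPlanning(Handler):
    message_type = PantsHaveBeenPutOn

    def handle(self, message):
        log.append("quest planning started")


def dispatch(message):
    """Find every handler for this message type, instantiate it, invoke it."""
    for handler_class in Handler.__subclasses__():
        if handler_class.message_type is type(message):
            handler_class().handle(message)


dispatch(PantsHaveBeenPutOn(colour="green"))
```

One incoming event reaches both handlers, and neither handler contains any infrastructure code.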

All of the code by the way is available on my GitHub, so you can go through it if you want. What's important though, is in the handle method, all we're dealing with is this has occurred, there's business logic. So there's a lot of noise that's been taken away from us, which is nice. And we'll see a bit later how this all comes together. We've taken a look at routing and by using a little bit of knowledge, making some assumptions, we've made it that we don't need to configure as much. We don't need to know as much about our deployment environment as we did before.

We only need to know where commands go, or where events come from, at this stage. And that's pretty powerful. We can make this even better though. We can have some type of central registrar where, as nodes come online, they register with the registrar and say, "Hey, these are the commands I consume, these are the events I publish, and this is my address." And then when someone sends a message, they go to the registrar and say, "Hey, where does this message go?" This should sound a little bit like a broker. So we could do something like that. We could also do automatic registration, where we do scanning of assemblies as the node starts up and see what commands and what events we've got in the assemblies.
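A central registrar of this kind could look roughly like the following. This is a hedged Python sketch, not something shown in the talk's demos: the `Registrar` class, its method names, and the queue addresses are all hypothetical. It also enforces the two assumptions made earlier, that a command has a single destination and an event has a single origin.

```python
class Registrar:
    """Hypothetical central registry of who consumes and who publishes what."""

    def __init__(self):
        self._command_destinations = {}  # command name -> address
        self._event_origins = {}         # event name -> address

    def register(self, address, consumes=(), publishes=()):
        for command in consumes:
            self._claim(self._command_destinations, command, address)
        for event in publishes:
            self._claim(self._event_origins, event, address)

    @staticmethod
    def _claim(table, name, address):
        # A command has one destination and an event has one origin.
        owner = table.setdefault(name, address)
        if owner != address:
            raise ValueError(f"{name} is already owned by {owner}")

    def destination_of(self, command):
        return self._command_destinations[command]

    def origin_of(self, event):
        return self._event_origins[event]


registrar = Registrar()

# Nodes announce themselves as they come online (addresses are made up).
registrar.register("old-man-queue",
                   consumes=["PrepareSword"], publishes=["SwordPrepared"])
registrar.register("link-queue",
                   consumes=["SaveThePrincess"], publishes=["PantsPutOn"])

# A second node trying to own the same command is rejected.
try:
    registrar.register("impostor-queue", consumes=["PrepareSword"])
    conflict = None
except ValueError as error:
    conflict = str(error)
```

A sender would call `destination_of` before sending a command, and a subscriber would call `origin_of` to find out where to send its subscription.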

Or we could do gossip, implement a peer-to-peer protocol. As a node comes online, it chats to the other nodes, and builds up understanding of the network, the logical network, where things might go. We can do a lot of stuff to reduce the nodes, to reduce the required number of node elements down to pretty much zero. I'm not going to do that here, but hopefully the ideas are fairly simple. If you are using something like AMQP with RabbitMQ as your service bus, or something with a broker, you typically don't have to implement pub/sub yourself.

And the reason is, as I described with the registry, the subscribers and the publishers don't deal directly with each other anymore, they deal with the broker. The way the broker would work is that the two subscribers would send a message to the broker to say, "Hey, broker, when message one occurs, let me know about it." When the publisher publishes the message, it sends the message to the broker who is then responsible for routing that message to where it needs to go, or distributing it.

There's a subtle difference there. The difference between a brokerless and a broker implementation is that with a broker, you've got a central point that you then have to worry about scaling out, and what happens when there's a network partition, and so on. Whereas brokerless models would typically involve some type of store and forward. So they're more resilient to network failures, and you've got more distributed tolerance with brokerless models, but they're both valuable and both important.

Can we do a little bit better than this though? We've got pub/sub and hopefully that was fairly easy to see how you would implement pub/sub. And also the pattern behind pub/sub is super simple. Can we do routing paths? And when I talk about routing paths, I'm talking about long running workflows, a business transaction that could span a couple of hours, a couple days. Maybe the order fulfillment process would take weeks depending on your stock availability. Can we use messaging to build something that makes that type of business case scenario easier to deal with?

And the answer is yes. When we discuss routing paths, especially in the concept of messaging and service bus, there's two patterns that most people talk about. The first is known as routing slip, and that's where the workflow follows a predetermined process. A good example of this is, if you go to Starbucks and you order a half Venti Spice Latte child... I don't even know... Oh, we've got Starbucks in South Africa for two months now. They write stuff on your mug and then they say, "This must go to Harold at the end."

The cashier does that, takes your money, gives the cup to the barista, and the barista then looks at the cup and says, "Okay, it's a spice pumpkin something, and at the end, I must give this to Harold." So the actual message, the payload, contains the workflow. It's part of the information that we're sending. And then at the end of the whole thing, the cup gets given to Harold, because that's what it says at the bottom. Everything is known upfront there.
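The routing slip idea, where the message itself carries the remaining steps, can be sketched like this. It is a small Python illustration of the coffee example; the `Order` class, the station functions, and the step names are assumptions made for the sketch, and the stations run in-process instead of on separate endpoints.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    drink: str
    slip: list                      # remaining steps, decided upfront
    history: list = field(default_factory=list)


def cashier(order):
    order.history.append("paid")


def barista(order):
    order.history.append(f"made {order.drink}")


def hand_over(order):
    order.history.append("given to Harold")


# Hypothetical stations; in a real system each would be its own endpoint.
STATIONS = {"cashier": cashier, "barista": barista, "hand_over": hand_over}


def forward(order):
    """Each station does its work, then the order moves to the next step."""
    while order.slip:
        step = order.slip.pop(0)    # the slip travels with the message
        STATIONS[step](order)
    return order


order = forward(Order(drink="spiced latte",
                      slip=["cashier", "barista", "hand_over"]))
```

No station needs to know the overall workflow; each one only reads the next entry on the slip.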

The second pattern that people typically talk about is known as a process manager, and a process manager allows for the workflow to change depending on things: depending on the current state, depending on the time that events occur. It's easier to do compensating actions, because you've got something that centrally controls the workflow. A good example of this, I think, is my wife, who is a pre-school teacher.

Anytime she asks the kids to do anything, once they've done the first small step, they're like, yes, I know, I know, and then they run off and they get it completely wrong anyway, but they always come back to her. So there's always a backwards and forwards between the person in charge and the minions that are trying to do something. There's some important terminology here though. If you hear someone talking about a saga, the implementations of sagas within the service bus space typically deal with the process manager pattern.

But if you're reading about a saga in, for instance, Gregor Hohpe's Enterprise Integration Patterns, what he refers to as a saga is actually the routing slip pattern. So if you do read about this stuff after the talk, be aware that there may be some differences in what people mean when they talk about a saga. Let's have an example of a saga, so we'll go back to our video game. We've got Link, we've got the old man who gives Link a sword, and we've got the bad guy. And then we've got the quest. This is the saga, that's trying to get the adventure to happen.

First thing in the morning, Link wakes up, and he doesn't know that he has to go save the kingdom yet. Nothing's happened. What's going to happen today though, is that the Quest or the Saga is going to coordinate things and structure it so that Link has to go on this adventure, and we've got a video game. Otherwise, it's a really boring game to play. Link wakes up in the morning, and he puts his pants on, and he's really proud of this, so he posts it on Facebook so everyone can know that he's put his pants on and you can like it.

He's got his pants on, and the Quest likes his post because they're friends on Facebook. And then the Quest thinks, "I know that in 20 minutes' time Link is going to finish having breakfast, he's going to finish washing his dishes, and then he's going to walk out the door. If he walks out the door and there isn't something for him to do, there isn't a quest, that's a waste of a day. Let's see if we can get things going." As soon as the Quest or the Saga hears that Link has put his pants on, it immediately tells the old man, get the sword ready, go get the sword ready for Link, please, we need that.

It also says to the bad guy, go and kidnap the princess, or go and hold the kingdom to ransom, or whatever the bad guy does in your version of the video game. Then the old man says, "Okay, cool. I've got the sword, it's there, nice and sharp." And he tells the Quest or the Saga, I've got the sword and it's ready. Now the saga knows, "Okay, Link's got his pants on, the sword is ready. Ah, the princess hasn't been kidnapped yet. So let's just hold on for a minute, maybe we're still not going to get this right."

Link then finishes posting that he's had breakfast, because it's on Instagram or Snapchat. I don't know what the cool kids are doing these days. But you could see a picture of the scrambled eggs that he had. Now things are kind of getting close, there's only five minutes left, where Link is going to wash his dishes and then step out the door. If the princess isn't kidnapped within those next five minutes, the whole thing's off, the Saga's canceled, and we may as well try again tomorrow, because at this stage we can do compensating actions, because we've got a process manager, we've got something to coordinate the events.

If the princess isn't kidnapped in time, well, then we just send a message to the old man: put the sword back in the cupboard and go about your business. However, if the princess is kidnapped, now everything's ready. Now we can send a message to Link saying, "Hey, you need to save the kingdom, do your thing." And then to make it interesting, we'll also tell the bad guy that Link is coming, because you don't want the game to be too easy.

What I've described here, the saga over here, is a state machine. It's keeping track of what is the current state of my business operation. What is the context here? I wish my business involved doing quests and saving princesses, that'd be really cool, but it doesn't. It typically involves order statuses and processing and billing. But the saga maintains the current state and triggers things based off of timelines and events that have happened in other distributed systems. I could have five bad guys instead of just the one, I could scale the bad guy out: bad guys as a service, that actually sounds pretty cool. Micro bad guys, yes.

But let's have a look at what this looks like in code. Right, we've got Link, so here's our saga version of Link. Link wakes up in the morning, he creates his new instance of the bus and he subscribes to save the princess commands. That's the only thing that affects Link's daily life. Does a princess need saving? Yes, no. Once you push enter, he publishes an event, saying, "I've put my pants on, the color was green in case anyone was interested."

This is him posting to Facebook and saying, "Look at my cool outfit." When the user pushes enter again, we send a breakfast has been finished event out, just to simulate some different timelines. So what happens when we publish the pants have been put on event? Well, for that we've got pub/sub running, and we've got a saga. And when the saga starts, the saga subscribes to all the events that are happening within our environment. We're subscribing to the pants have been put on event, we want to know when that has happened, we want to know when breakfast has been finished, we want to know when the sword has been prepared, and when the princess has been kidnapped.

On top of that, we've got some code here: when Link's pants have been put on, what do we have to do? Well, we send two commands at that stage. The first command is to kidnap the princess, and the second command is to prepare the sword. Those are the two commands that we mentioned earlier. We also then set a timeout, and I'll cover what that means in a couple of seconds. And we say, in 20 minutes, or 20 seconds in this case, trigger something so that I can maybe check if we're going to have enough time to go on a quest today, but I'll get to that just now.

The prepare the sword command, where does that go? Well, that goes to the old man, because he has to prepare the sword. And the old man is here, so he subscribes to the prepare sword command. And all he does is, once he receives a prepare sword command, he goes to the cupboard, because that's where he keeps his sword, gets it out, dusts it off, waits a few seconds and says, "Sword is ready." You can also see, I can scale this up by taking the thread sleep out. That's how you do professional software development, people.

And once he's waited the three seconds, because he's lazy, he then publishes a sword has been prepared event. So this can then go back to the saga. This code is super simple, this is focusing entirely on what has to happen to prepare the sword. There's no other noise around here. And that's important. Once the sword has been prepared, all we do is set a flag to say, "Hey, the sword is ready, true, tick." My to-do list is one down. We then go and check, has the princess been kidnapped yet? If she has, then we're ready for the quest, and we can send the save the princess command.
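The flow just described can be sketched roughly as follows. This is a Python stand-in, not the speaker's .NET demo code: the `Bus`, `QuestData` and `QuestSaga` classes and the message names are hypothetical, the bus is in-memory and synchronous, and commands and events are both modeled as plain published names for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


class Bus:
    """Tiny in-memory pub/sub bus (hypothetical); handlers run synchronously."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.sent = []  # every message name, in the order it was published

    def subscribe(self, name, handler):
        self._handlers[name].append(handler)

    def publish(self, name):
        self.sent.append(name)
        for handler in list(self._handlers[name]):
            handler()


@dataclass
class QuestData:
    """The saga's state: one flag per thing we are waiting for."""
    sword_ready: bool = False
    princess_kidnapped: bool = False


class QuestSaga:
    def __init__(self, bus):
        self.bus = bus
        self.data = QuestData()
        bus.subscribe("PantsPutOn", self.on_pants_put_on)
        bus.subscribe("SwordPrepared", self.on_sword_prepared)
        bus.subscribe("PrincessKidnapped", self.on_princess_kidnapped)

    def on_pants_put_on(self):
        # Kick off the two pieces of work the quest depends on.
        self.bus.publish("KidnapPrincess")
        self.bus.publish("PrepareSword")

    def on_sword_prepared(self):
        self.data.sword_ready = True
        self._check_ready()

    def on_princess_kidnapped(self):
        self.data.princess_kidnapped = True
        self._check_ready()

    def _check_ready(self):
        # Only when both flags are set is Link asked to save the princess.
        if self.data.sword_ready and self.data.princess_kidnapped:
            self.bus.publish("SaveThePrincess")


def run_morning():
    bus = Bus()
    QuestSaga(bus)
    # The old man and the bad guy each handle one command and report back.
    bus.subscribe("PrepareSword", lambda: bus.publish("SwordPrepared"))
    bus.subscribe("KidnapPrincess", lambda: bus.publish("PrincessKidnapped"))
    bus.publish("PantsPutOn")
    return bus.sent
```

Calling `run_morning()` returns the ordered list of message names, ending with the save the princess command. Note that neither the old man nor the bad guy knows about the saga; each only handles its own command.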

But I've just kind of glibly glossed over this data property. Data and set timeout are two very powerful concepts in a saga. In our implementation of a saga here, where do they come from? Set timeout comes from this abstract class, saga of T. Set timeout says, "In a month's time, in a week's time, trigger a timeout so that I can check what is the current state? What has happened since the last time I had a piece of code run, has anything changed?"

If the princess hasn't been kidnapped within 20 minutes, or the sword hasn't been prepared within 20 minutes of the initial start, I know that we're not going to be ready for the quest, so we need to cancel the whole quest, and we'll try again tomorrow. Timeouts are very important from a business functionality perspective, to tie temporal things together without introducing temporal coupling. Do you guys know what temporal coupling is? It's where you've got time-dependent events within your system that you have to worry about.
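As a rough illustration of the timeout idea, here is a Python sketch with a fake clock so it runs instantly. The `TimeoutScheduler` class, the `status` field and the 20-unit delay are assumptions made for the example; a real service bus would persist timeouts outside the process.

```python
import heapq
from itertools import count


class TimeoutScheduler:
    """Hypothetical scheduler with a manual clock, so tests need no real waiting."""

    def __init__(self):
        self.now = 0.0
        self._due = []       # heap of (due_time, tie_breaker, callback)
        self._ids = count()  # tie breaker so callbacks are never compared

    def set_timeout(self, delay, callback):
        heapq.heappush(self._due, (self.now + delay, next(self._ids), callback))

    def advance(self, seconds):
        """Move the clock forward and fire every timeout that is now due."""
        self.now += seconds
        while self._due and self._due[0][0] <= self.now:
            _, _, callback = heapq.heappop(self._due)
            callback()


class QuestSaga:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.sword_ready = False
        self.princess_kidnapped = False
        self.status = "idle"

    def on_pants_put_on(self):
        self.status = "preparing"
        # Ask to be woken up later to look at the current state.
        self.scheduler.set_timeout(20, self.on_timeout)

    def on_sword_prepared(self):
        self.sword_ready = True
        self._check_ready()

    def on_princess_kidnapped(self):
        self.princess_kidnapped = True
        self._check_ready()

    def _check_ready(self):
        if self.status == "preparing" and self.sword_ready and self.princess_kidnapped:
            self.status = "questing"

    def on_timeout(self):
        # If we are still preparing when the timeout fires, cancel for today.
        if self.status == "preparing":
            self.status = "cancelled"
```

The timeout handler only inspects the saga's state; nothing else in the system has to know how long the preparation was allowed to take.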

We've got the timer over there, we've also got this data property, and the data property represents the current state of our saga. And the infrastructure, the code behind the scenes, is responsible for making sure that when a message comes in, and it goes to saga A, that the current state of saga A is available for the saga in a consistent format so that we can just go and check, is my princess ready? Is my sword ready? Has Link left the house already? So that state is very powerful to be encapsulated in something that the service bus or the infrastructure deals with for us.

The code that I wrote to do that is embarrassing, because I'm a bad programmer, and I should feel bad. It's all of the nested ifs and vars and reflection, and don't ever do this, please. This is horrible. I'm pretty sure anyone could do it better than I did, but the general idea is, you've got something that handles that for you. We've got the sword has been prepared; next is the bad guy kidnapping the princess. On his side, the code looks very similar. When he receives a message to go and kidnap the princess, he also just waits three seconds, and I assume goes to the cupboard, because that's where he keeps his princesses, and takes her out, dusts her off, then says, "I've kidnapped the princess."

And then he sends back another event saying, "The princess has been kidnapped." At the time that that happens, we then set the princess kidnapped flag to true. So remember, the data property is consistent here. Two messages that come into the same saga at the same time will be handled sequentially, not concurrently. So you're not going to ever get two commands, or two events, triggering the same method, or triggering different methods on the same saga at the same time. We know that we can, here, go and check, is the sword ready? And we know that it won't be affected by another process on another thread, somewhere else.
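One simple way to get this one-message-at-a-time guarantee inside a single process is a lock per saga instance. The Python sketch below assumes that approach; `SagaDispatcher` and its methods are hypothetical names, and the talk does not say how the demo implements it.

```python
import threading
import time
from collections import defaultdict


class SagaDispatcher:
    """Hypothetical dispatcher: messages for the same saga never overlap."""

    def __init__(self):
        self._guard = threading.Lock()              # protects the two dicts below
        self._locks = defaultdict(threading.Lock)   # one lock per saga id
        self._data = defaultdict(dict)              # saga id -> saga state

    def dispatch(self, saga_id, handler):
        with self._guard:
            lock = self._locks[saga_id]
            data = self._data[saga_id]
        # Handlers for one saga run one at a time; other sagas are not blocked.
        with lock:
            handler(data)

    def data_for(self, saga_id):
        with self._guard:
            return dict(self._data[saga_id])


def count_message(data):
    """A deliberately racy read-modify-write; safe only if calls are serialized."""
    seen = data.get("messages_seen", 0)
    time.sleep(0.001)  # widen the window in which a race could occur
    data["messages_seen"] = seen + 1


def deliver_concurrently(saga_id, message_count):
    dispatcher = SagaDispatcher()
    threads = [
        threading.Thread(target=dispatcher.dispatch, args=(saga_id, count_message))
        for _ in range(message_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return dispatcher.data_for(saga_id)["messages_seen"]
```

Without the per-saga lock, `count_message` would lose updates when several threads read the same value before any of them writes. Across multiple machines an in-process lock is not enough, which is the scale-out concern raised in the questions at the end.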

That's sort of the implementation of the saga pattern within .NET. The code is horrible, I'm sorry about that. But you can deal with it. We're now ready for our quest, we've got long running workflows in place. And hopefully what you've seen is that once you have the infrastructure to deal with sagas, the code that you have to write has got a lot less complexity. You don't have to deal with loading the state somewhere and checking, "Well, what is the current thing?"

Maybe someone else has changed it on another thread at the same time. There's a lot of difficulty that's just kind of being abstracted away from you. Having things like the state managed for you, and timeouts and triggering events based off of dates and times, is incredibly powerful. It takes a lot of the noise away from software development, and you can focus entirely on the business logic that you're implementing. The old man just had to know to go to the cupboard and then tell people, "Hey, I've got the sword." He didn't have to worry about, "Well, I need to tell these people about it, or once I've got the sword, what has to happen somewhere else?"

We've got decoupling that's been encouraged by the framework, as well as simple infrastructure code. I'm going to take a step back now because we're kind of getting into a really horrible implementation. I'm going to go to something a little bit easier to discuss. And we're going to look at recoverability. What happens when things go wrong? When messages fail and when things go wrong, how can we recover from it? How can we make our service bus resilient? And I've got a receive class here, receive service bus, sorry. And it's got a handler, and this handler just throws an exception. So I've got a bug in my code, I'm dividing by zero, whatever it is I'm doing, how do I make sure that my service bus can survive these errors?

There's a couple of techniques. The first technique is what's known as immediate retries or first-level retries. As soon as something goes wrong, once you get an error or something's broken in your application, don't try and put try-catches around your application business logic, let the infrastructure handle that for you. Let the exceptions throw, and then have something known as immediate retries, which go and just retry the message straight away. Maybe the problem was a database table lock. Well, chances are, if we just retry that message, that lock is going to be gone and the message will succeed.

If I just run that, I've got my exception handler over there. If I send a message to it, which is this guy. Run that guy, I'm going to send a message. We get our exception. As our bus stands now, everything just breaks. I've got it in a try-catch, but that's because I've got some helper methods here. If I enable first-level retry, and I should probably have had this by default, but it doesn't make for a good demo to say, "Look, everything works, and now I can break it." If I enable first-level retry, and I've retried less than three times, just retry the message again. Process the message, then if it fails, try process it again, try process it again, however many times you feel is reasonable.
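A minimal Python sketch of immediate retries, assuming the infrastructure wraps the handler call and the handler itself contains no try-catch. The function name, the return shape and the `make_flaky_handler` helper are invented for the example.

```python
def process_with_immediate_retries(message, handler, max_attempts=3):
    """Call the handler up to max_attempts times, retrying straight away.

    Returns (succeeded, attempts_used, last_error).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True, attempt, None
        except Exception as error:  # the handler just throws; we retry here
            last_error = error
    return False, max_attempts, last_error


def make_flaky_handler(failures_before_success):
    """Simulates a transient problem, such as a table lock that soon clears."""
    state = {"calls": 0}

    def handler(message):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("table is locked")

    return handler
```

With a handler that fails twice and then succeeds, the call returns `(True, 3, None)`. A handler that always throws comes back as a failure after the third attempt, which is where the next technique takes over.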

If that still doesn't work, we've got another concept known as second level retries, or delayed retries that we can implement. And delayed retries is very similar again, except you put a small time delay between when you retry the message. Maybe you put 10 seconds, you try the message. If it still breaks, you put another 20 seconds maybe, or another 10. And if it still doesn't work, just try the message next week. Delayed retries is where you add a small time delay there.

And in the demo code, all I'm doing is saying, "Well, I'm just going to wait a second, then wait two seconds, and wait three seconds to try the message again." Usually you would want this out of process, because if your process crashes, and it comes back online, you want to kind of start from where you were. The better service buses out there have this in a separate pipeline. So it's not tied to the actual processing logic directly.

Sometimes it could still break, though. If I run the demo code, we've got first level retry, and second level retry. Our message has failed now. Now we go into first level, or immediate, retry. So it's going to try the message three times: one, two, three. So it's still failed. Maybe there's a bigger database lock that we have to worry about. Now we go into second level retry, with delays. It waits one second. Now retries, now it's going to wait two seconds. Still fails. And it's going to wait three seconds now, retry the message, and exceptions are cool, so it's going to fail.
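The two retry levels in this demo run could look something like the following Python sketch. The three immediate attempts and the 1, 2, 3 second delays follow the talk; the function name and the injectable `sleep` parameter are assumptions that keep the example testable without real waiting. As noted above, a production bus would keep delayed retries out of process.

```python
import time


def process_with_retries(message, handler, immediate_attempts=3,
                         delays=(1, 2, 3), sleep=time.sleep):
    """First-level (immediate) retries, then second-level (delayed) retries.

    Returns (succeeded, attempts_used).
    """
    attempts = 0

    def try_once():
        nonlocal attempts
        attempts += 1
        try:
            handler(message)
            return True
        except Exception:
            return False

    # First level: retry straight away.
    for _ in range(immediate_attempts):
        if try_once():
            return True, attempts

    # Second level: wait a little longer before each further attempt.
    for delay in delays:
        sleep(delay)
        if try_once():
            return True, attempts

    return False, attempts
```

With the defaults, a message that always fails is attempted six times in total, three immediately and three after delays, before the caller has to decide what to do with it.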

We still haven't successfully processed the message, obviously something is wrong. We've tried six times to handle this one message and we're just not getting anywhere. And it's currently at the front of the queue, meaning we can't process any messages behind it. Well, with the way we've implemented our bus at the moment. So we need to do something about this problematic message. And if you're still at the case where you can't handle the message, no matter how many times you retry it, it's better just to take that message off of your production queue and move it somewhere else.

Don't lose the message, but just move it so that you can let the rest of the messages come through. And for that, you can implement an error queue. An error queue is super, super easy to do. If you're in the state where you have to move it to an error queue, just do it inside a transaction, because you don't want to lose messages. Take the message and send it to your error queue. In the code here, I just create one with the same name as my queue, except I added .error on, and it goes there. If I enable that, we run the code, and now we have to wait through all those exceptions there. Let's do that.

There you can see it's waiting for three seconds, still fails. And then we send that message off to the error queue. If I go to my Queue Explorer, let's refresh there. There's my error queue, and there's a message that's now sitting in my error queue. So I've taken the message from production where it's causing a problem, I've got a backlog of messages that's just increasing, because I can't get to them. And I've moved this problematic message to an error queue. Get out of that, there we go.
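Here is a small Python sketch of forwarding a poison message to an error queue so the messages behind it can flow. The `.error` suffix follows the talk; the in-memory `Broker` class and the `pump` function are hypothetical, and the forward-then-remove ordering only approximates what a transactional queue such as MSMQ would do atomically.

```python
from collections import deque


class Broker:
    """Hypothetical in-memory broker: named FIFO queues created on demand."""

    def __init__(self):
        self.queues = {}

    def queue(self, name):
        return self.queues.setdefault(name, deque())


def pump(broker, queue_name, handler, max_attempts=3):
    """Process every message; move ones that keep failing to '<queue>.error'."""
    main = broker.queue(queue_name)
    errors = broker.queue(queue_name + ".error")
    handled = []
    while main:
        message = main[0]  # peek: the message stays queued until we decide
        for _ in range(max_attempts):
            try:
                handler(message)
                handled.append(message)
                break
            except Exception:
                continue
        else:
            # Every attempt failed: forward the message before removing it,
            # so it is never lost.
            errors.append(message)
        main.popleft()
    return handled
```

For example, pumping the messages `ok-1`, `poison`, `ok-2` through a handler that throws on `poison` handles the two good messages and leaves `poison` in the `.error` queue, waiting to be returned to production once the bug is fixed.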

When things go wrong, when something breaks, you need a way to recover from that. Because we've got transactional queues, it's pretty easy to know when something has broken and not to lose messages. What we can do is, if something is a transient error, so maybe it's something that just is going to disappear, we can retry that message a couple of times. Time delays make that a little bit more resilient as well, maybe something would move on. We've got a table lock that lasts 10 seconds. Cool, well, if you waited 30 seconds before you retry the message, it'll be gone.

In the event that that still doesn't help, forward those messages to the error queue. You do, however, then need a way to monitor that error queue, and take those messages from the error queue and get them back into production. When you release version two of your software that fixes the bug, then you can put that message back in production, and everything will work as expected. Where are we? Well, we've implemented pub/sub, which is fairly simple to implement but very powerful. We've got long running workflows or sagas, which are a little harder to implement, but extremely powerful.

And our error handling means that our service bus is resilient. There's a whole lot missing still, and there's so many bugs in the sample code, so don't even try and start with this. But what we have done is, we've shown that the concepts are simple, so you can go and implement the service bus that you need for your system, if you want. There's a whole lot of features still missing, I'm not even going to bother reading through those. But a good service bus will be like starting your adventure with a sword, and a horse, and a piece of armor, and a dragon, and a boat, and a train, and it'll give you everything you need to implement your system in an easy and reliable manner.

If you want to know more about this topic, Udi Dahan, the guy who pretty much pioneered service buses in the .NET space, has got two free days' access to his advanced distributed systems design course: go.particular.net/devday, access code William. And it expires September 25th. Really, really valuable information there. Are there any questions? Again, you can chat to me afterwards if there aren't.

45:17 Speaker 2

What if Link had put his pants on twice?

Sorry. Can you repeat that?

45:22 Speaker 2

What if Link had put his pants on twice?

If Link had put his pants on twice? There we're getting into a concept of messaging, or how many times things are messaged or delivered. We've got at-least-once delivery, where Link could say that he's put his pants on twice; you've got exactly-once delivery, where it will only ever be once; and you've got at-most-once delivery, where maybe Link doesn't let you know that he's put his pants on.

Because with MSMQ we can wrap it in a transaction, MSMQ simulates exactly-once processing. If I was using a different broker technology, maybe like RabbitMQ, especially when there's a network partition, there's a high likelihood of having at-least-once message delivery. And once you start getting into those spaces, you really should make your code and your handlers idempotent, so you can handle processing the same message multiple times. I don't know if you guys ever heard of ACID 2. ACID 2 is supposedly a better take on ACID. Instead of atomic, consistent, isolated, and durable, it's associative, commutative, idempotent, and distributed.

Associative means the way you group the messages doesn't matter. Commutative means the order you apply the messages in doesn't matter, you still get to the same final state. Idempotent, meaning you can apply the same messages multiple times, and distributed because I think they needed the D to make it sound cool. Yes.
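One common way to make a handler idempotent under at-least-once delivery is to remember which message ids have already been processed. The Python sketch below assumes every message carries a unique id; the `IdempotentHandler` wrapper is a hypothetical name, and a real implementation would store the seen ids durably, ideally in the same transaction as the handler's own state.

```python
class IdempotentHandler:
    """Wraps a handler so that a redelivered message is processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # ids of messages already handled (in memory only)

    def handle(self, message_id, body):
        """Returns True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the id only after the handler succeeded, so a failed
        # attempt can still be retried.
        self._seen.add(message_id)
        return True


class Wardrobe:
    """Toy state that would be wrong if the same event were applied twice."""

    def __init__(self):
        self.pants_put_on = 0

    def on_pants_put_on(self, body):
        self.pants_put_on += 1
```

If the broker delivers the same pants have been put on event twice with the same id, the wrapper applies it once and ignores the redelivery, so the counter ends at one rather than two.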

The question is, if we're sending commands to a logical endpoint and you want to scale that out, do you have to send the message to each of the individual endpoints?

Because the handlers will be in a different state depending on which machine the message gets routed to. The short answer is yes, the long answer is it depends on the transport technology you're using. If you are using RabbitMQ, you can quite easily do competing consumers. And as long as your infrastructure to load the saga state is designed around being allowed to be distributed, that'll be fine.
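The competing consumers idea can be illustrated in a single process with Python's standard `queue` module: several workers pull from one shared queue, and each message goes to exactly one of them. This is only an in-memory analogy, it does not use RabbitMQ, and the function and worker names are invented for the example.

```python
import queue
import threading


def run_competing_consumers(messages, worker_count=3):
    """Several workers take messages from one shared queue.

    Returns a list of (worker_name, message) pairs, one per message.
    """
    work = queue.Queue()
    for message in messages:
        work.put(message)

    processed = []
    processed_lock = threading.Lock()

    def worker(name):
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with processed_lock:
                processed.append((name, message))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

Which worker gets which message is not deterministic, but every message is handled once. As soon as the workers run on different machines, any saga state they touch has to live in shared, consistent storage rather than in each worker's memory.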

At the transport level with MSMQ, you can't do scale-out on multiple machines easily. So you would have to implement some type of distributor, or round-robin logic. But the part where the saga is kept consistent, that has to work between different instances of machines. You would never store that directly in memory. You may want to try to use, I don't know, Redis, or some type of persistence that you can share across your different node instances, that is still consistent, and doesn't introduce a single point of failure.

Maybe even implement some type of distributed number of reads and writes model before you can say, "Cool, this is done." Those are all considerations for scale-out at implementation time, depending on your transport, and your underlying persistence technology. I think we're completely out of time now. If there are any further questions, please feel free to grab me. I'm always very excited to talk about this stuff. Thanks.
