Loosely-coupled orchestration with messaging

00:00:05 Udi Dahan

Okay. Apologies for the technical difficulties. Thanks for coming out. Things are a little bit finicky and tested it before you all came in and everything was working perfectly. That's Murphy's law for you, right?

00:00:24 Udi Dahan

My name is Udi Dahan. I'm guessing some of you are familiar with my work, and I'm going to be talking to you about loosely-coupled orchestration and messaging and patterns, and all sorts of things around this space of distributed systems. I didn't accidentally use this image, as you can imagine, or just show nudity in front of a group of 100 people just like that, but Sistine Chapel, I guess you can get away with it. One of the reasons why I chose this image to kind of start this talk is to use this as sort of an indication of where we want to go with the design of our distributed systems, that just as human society in the natural world, there is no God from up high as far as we know that's kind of dictating who is doing what, when, where, why and how, and really instructing how everything works. Instead, there's this certain kind of little spark of life that's been given to this societal system.

00:01:36 Udi Dahan

From this point on, really, we've got this collection of loosely-coupled actors, if that's the name that you'd like to use, that each of them is kind of focused on solving its own little set of problems and interacting with other actors in a similar sort of loosely-coupled manner, that there's this element of, let's call it spontaneous coordination that happens between groups of people, and that if we look historically at human society or the natural world as a whole, it looks like this model is fairly scalable. I mean, it's scaled to seven billion people. Yeah, there's war and famine and pestilence, but along the way, we've done some pretty good things, and the most important thing is that the system appears to be robust. I mean, it's been a couple of billion years on the planet and we haven't destroyed it yet. Contrast that with most software systems, where if they live past the age of five, six or seven, without somebody rewriting them, that's kind of a, "Wow, good job" type of deal, so just to kind of give us the spectrum of things that we're looking at.

00:02:50 Udi Dahan

What we're going to be discussing for the next hour and a bit are some of the principles of these types of loosely-coupled environments and how we can model our systems around the same sort of principles to hopefully achieve a lot of the same sorts of properties of robustness and resilience and scalability that the natural world appears to exhibit. Before jumping directly into the good stuff, I'm going to start with a little bit of programmer history to bring up some of the reactions or the feedback that I've heard from various customers and developers over the years as I've been trying to promote these ideas of loosely-coupled orchestration and messaging. In the early days of programming, it was pretty much chaos, kind of like the day zero in the Bible, if you will, where people were using goto's, so I thought goto's were absolutely great when I wrote my first GW-Basic Program, and I could kind of create a, that the best practice was you had line numbers, 10, 20, 30, 40, so if ever you needed to insert something in the middle, you can always do a 35, and goto 35, and voila, the system works. Now, along the way, luckily, we progressed from that into structured programming, meaning no more goto's, and we created procedures that called other procedures that called other procedures, and life was more or less good, until somebody came up with the bright idea of objects. Now, when the idea of object orientation came around, and there's a difference between the idea of object orientation to language support that worked pretty well to broad scale industry adoption, there were a lot of naysayers from the structured programming world that looked at all of this object-oriented stuff and said, "You know, I don't see how you could possibly write and understand the code base with this new fangled object-oriented technology."

00:05:03 Udi Dahan

I mean, imagine you're calling a method on an object, but because of this polymorphism craziness, you don't actually know which section of code is going to run, right? There, you've got a base class and three subclasses, and maybe the method's implemented on the base class, maybe it's overridden in the subclass. How could you possibly maintain such a hideous code base? It is so much better if you just have a single method that you're calling, and that's it. The idea of control is a very important one for us as developers. We want to feel in control of our code bases, and I'm guessing a lot of us were not programming at the time that object orientation came around, but there was a lot of this pushback, people said, "This is a terrible idea."

00:05:53 Udi Dahan

"It's going to create an unmaintainable mess." Yet, here we are, several years and decades later, where object orientation has kind of become the de facto standard for most programming. From the world of object orientation came callback, and events, where again, bunch of naysayers started popping up saying, "How can you possibly manage a code base with callbacks?," what you're just going to raise an event out into the void, and some code that you don't know about will get invoked and do something, and not only one piece of code, you could have N callbacks that you don't know about, so how could you possibly maintain a code base when you have no knowledge? When you're looking at a piece of code, what other code will be invoked as a result of this? Of course, none of you believe that, but there are a bunch of people out there that this style of programming, of moving to looser and looser coupling, of saying, "What I'm doing over here is more and more strongly isolated from bits of code that are running over there."

00:07:12 Udi Dahan

They're looking at that, they're saying, "That can't possibly work," and of course, if you layer on top of this things like asynchronous invocation or things that are running across the network, it becomes even more stressful, where this is, again, this has been sort of a common theme in software development for a while, this tension between, "We want to make things more loosely-coupled because it allows us to make changes in one area that won't break another. However, at the same time, we like to be able to very quickly and easily get an overview of the system and who is doing what, and how are the flows happening, and these two things are constantly pushing us in one direction or another." What I'm going to be talking about today is trying to come up with a model and a style of development, a set of tools and visualizations, et cetera, that make it easier to program in this way, especially once we introduce, like I said, the asynchronous message-based network communication on top of all of these sorts of things, which have historically always been done in process. Starting with the good old days of development, the structured programming days, the thing about most software projects and the architecture of those software projects is that the decisions about architecture are made early on, usually when we don't have lots of code, so we say, "All right. Let's create a certain database with a bunch of stuff, and we'll have a little bit of code, a little screen over here that puts something in the database and takes it out again."

00:09:01 Udi Dahan

Then, in iteration two, we create another screen and we pull out a common type of utility method that talks to the database, and we're thinking, "Hey, things are pretty good." Then, in iteration three, we add another screen and a little bit of domain object interaction behind the scenes. Again, we're kind of feeling like we've got control of the system, and along the way, iteration four, iteration five and somewhere along the way, the system that we thought we understood so well became this hideous mess of A calls B calls C calls D, and everything's talking to the database and reading and writing and reusing, and somehow, somebody snuck in between iteration four and five of our project and replaced our beautiful, clean code base with a complex big ball of mud. We're not yet sure who it is, but we're sure ... Something bad happened in there, and we can't really put our finger on when exactly the code base became unmaintainable, but this style of saying, "This is one system, this is one code base," and fundamentally, you've got layers, so it's not like we're doing a terrible thing, but just this act of building a larger code base tends to create more and more coupling with it. That's my concern in working with clients and software projects, that most of the architectural choices about what is appropriate for the project are done when people are envisioning this level of complexity. They're saying, "Oh, no, this is great. I've got some MVC over here, some Entity Framework over there, a SQL Server." Fundamentally, it's a simple system. How many times have you heard that statement?

00:11:02 Udi Dahan

Fundamentally, it's a simple system. We've got some data access, some business logic, a UI. Fundamentally, it's simple, but somewhere along the line, it became a mess, and that's one of the things that we want to watch out for, is how can we prevent our systems from turning into this mess, and a big part of that is keeping the coupling low, but in order to keep the coupling low, sometimes we need to introduce patterns earlier on into a project, that when this is all that you're seeing in front of you, you're like, "Wait a minute, you want me to do events and pub/sub and messaging, and I'll need all this infrastructure?" I'm sorry, but agile says you aren't going to need it, but then, three months later, when you've got this crazy mess on your hands, then it's much more difficult to pull that apart and refactor that into a more loosely-coupled pub/sub event type system. That's part of the challenge that I'd submit to you, that when you're working on projects with other team members, and in the early days, you're making architectural choices, don't try to look at things from the perspective of the code base is fundamentally simple.

00:12:24 Udi Dahan

This is basically a simple system. Imagine that last project that you worked on, how it was at the end, the big, tightly coupled ball of mud that it was and how painful that was. That's what we need to design for and we need to start introducing these ideas early on, otherwise, this is what we end up with, right? The death march projects, where we're deep down in the mud. It isn't even clear if everybody's pulling in the same direction.

00:12:53 Udi Dahan

One iteration goes by another, and to a certain point in time, that we stopped counting iterations anymore. The project is going to be released next week for the fifth time, right? If you ever had that, when is the project going to be released? I've been telling you for the past two months, next week, right? The project that's 99% done week after week after week after week, because nobody's ever willing to say, "Actually, I think we're maybe closer to 60%," but sometimes that's the case that we're in, so we need to watch out for this stuff. Sometimes when I come to developers and say, "Hey, you need to make it async, you need to make it pub/sub," they have this reaction of like, "What, are you crazy?" We can barely keep all four wheels on the road with everything being synchronous, sequential in process. It seems totally counterintuitive that having things run more in parallel will actually help. That's part of the problem with a lot of good design decisions, is that they're counterintuitive. Now, along the way, I mean, I've been doing this messaging thing for a while, the tooling often is extremely lacking, so there are justifications when people say, "Hey, introducing this stuff, while it might help, might help in some regard, in other areas, we're not so sure."

00:14:26 Udi Dahan

You can try this argument, the trust me argument. "Trust me, it's going to be better." If you're new into the company, you're the new senior software architect, maybe you have some credibility that you can lean on and you can actually force it through the organization and say, "No. Seriously, this is the way we're going to do it. Everything's going to be better, trust me," or you can try a variant of that argument, "Udi said so." Right?

00:14:59 Udi Dahan

"I was at this presentation and he was really convincing. It sounded like a good idea, so if Udi says it's a good idea, it must be a good idea, and then let's go do it." Be very careful of going to any presentation, getting wowed by the very slick demo that hopefully I will be able to show you a slick demo that will crash, and you'll be like, "Huh, wow," but take everything that I or anybody else say with a grain of salt to a certain level of, "Oh, how can we not throw away everything that we've done and start from scratch, rather, transition to this model gradually?" I'll talk about that later on in the presentation, but in general, these ideas, while they have merit, I usually don't recommend a company trying to go from zero to 60 right away, okay? Let's start small.

00:15:56 Udi Dahan

Let's say we want to introduce some publish/subscribe into our system. Now, the traditional way of say, implementing some sort of business process in our system is a user clicks a button and that arrives at, let's say a sales system, and in there, we start a very long process of, "Wait a minute, we need to validate this order. Then, we need to check that all of the products are still in stock. Then, we need to recalculate the order total and check tax and if the customer has any loyalty points that they can use," and on and on and on and on and on to the point of, "How do we actually charge the customer? After we charge the customer, then we need to ship the product," and we can create a very long, long chain of execution where it may appear at first glance like there is no opportunity for pub/sub.

00:16:55 Udi Dahan

There is no opportunity for anything to be asynchronous. Now, there are a bunch of different techniques that can be applied to the design of systems like composite user interfaces, which make publish/subscribe work even better, okay? I'm not going to spend any time at all talking tonight about concepts of user interface composition, but let me tell you this, it vastly improves the effectiveness of messaging techniques like publish/subscribe. It reduces coupling even more, so if you want, look up other presentations that I've given online about user interface composition and that kind of stuff. Service-oriented architecture, UI composition, pub/sub, loosely-coupled orchestration, all of these patterns fit together very well, complement each other and simplify the system even more.

00:17:50 Udi Dahan

Right now, we're just going to be focusing on the element of pub/sub. Now, when looking at a complex business process, one of the ways that you can break things down is to say, "Well, does it have to be an all or nothing success immediately?" What do I mean by immediately? For example, could we take two, three, four steps out of this process, pause there, and then publish an event saying, "Okay, this looks like it's going to be a good order"? In which case, we could have the billing process itself, the act of charging the customer for their order, happen asynchronously with respect to the original order being placed.
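
To make that split concrete, here is a minimal Python sketch, not code from the talk. It assumes a toy in-memory bus (`InMemoryBus`), a hypothetical `OrderAccepted` event, and made-up `place_order` and `bill_order` functions. Publishing only enqueues the event, so the sales code returns before billing has run.

```python
from collections import defaultdict, deque


class InMemoryBus:
    """Toy stand-in for a message bus: publish() only enqueues, handlers run later."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._pending = deque()

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self._pending.append((event_type, payload))

    def drain(self):
        # Deliver everything that was queued. In a real system this would
        # happen in another process or on another machine.
        while self._pending:
            event_type, payload = self._pending.popleft()
            for handler in self._subscribers[event_type]:
                handler(payload)


bus = InMemoryBus()
charged = []  # records what billing has done so far


def place_order(order):
    # Sales: run the first few steps, then stop and publish an event.
    if not order["items"]:
        raise ValueError("order has no items")
    order["total"] = sum(price for _, price in order["items"])
    bus.publish("OrderAccepted", {"order_id": order["id"], "total": order["total"]})
    return "accepted"


def bill_order(event):
    # Billing: a separate subscriber, invoked only when the bus delivers.
    charged.append((event["order_id"], event["total"]))


bus.subscribe("OrderAccepted", bill_order)

status = place_order({"id": 1, "items": [("book", 12.5), ("pen", 2.5)]})
charged_before_drain = list(charged)  # billing has not run yet
bus.drain()                           # now the subscriber is invoked
```

The caller gets `"accepted"` while `charged_before_drain` is still empty, which is the eventual consistency that has to be agreed with the domain expert.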

00:18:38 Udi Dahan

Now, introducing this element of pub/sub, I don't want to say this is a purely technical concern. Making things asynchronous to some extent makes them eventually consistent. You will want to have a conversation with your domain expert, other business stakeholders, to verify that you're doing this in the right place. Now, for example, if I'm making a purchase on Amazon.com, Amazon knows me. I've been a customer for years.

00:19:08 Udi Dahan

They have my credit card on file. They don't necessarily need to revalidate the credit card on every single purchase, so when I say, "I would like to place an order," they could have a service that says, "Oh, we know this customer. They've been purchasing stuff from us for forever." We can tell them, "Hey, your order's been accepted even before we've charged their card." Why?

00:19:34 Udi Dahan

Because we know that they're good for it. It allows us to give a much faster user experience, rather than synchronously blocking everything. "And, you know, if there's a problem with charging the card, then we can send them an email," also asynchronously and say, "Hey, it turns out there was a problem with your card. Please give us a different credit card." We can resolve this, again, asynchronously with respect to the original business process.

00:20:03 Udi Dahan

A lot of the business processes that I've seen in systems, they don't necessarily have to wait until you get to the very, very, very end before you give your user feedback and tell them it's done. Some of the things can be handled asynchronously. Shipping, well, clearly is going to be asynchronous with everything else, but again, instead of invoking shipping logic directly from the original method, we could have shipping as a separate set of code subscribed to an order billed event. Again, asynchronous with respect to the billing process, asynchronous with respect to the sales process, where similarly, if there ended up being some sort of problem with shipping, we could send an email to the customer saying, "Hey, look, we've actually had a problem here. We couldn't find the street, or when we shipped, you weren't home," or whatever it is, and then have a separate set of logic happening here, separate from the core logic up there.
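
The chain described here (order accepted, then billed, then shipped, with an email when a step fails) could look roughly like the following Python sketch. All names are hypothetical: the `OrderAccepted` and `OrderBilled` event names, the `charge_card` stub, and the `emails` list standing in for a mail service.

```python
from collections import defaultdict, deque

subscribers = defaultdict(list)
pending = deque()
emails = []     # hypothetical stand-in for an email service
shipments = []  # order ids handed to shipping


def publish(event_type, payload):
    pending.append((event_type, payload))


def run_until_idle():
    # Deliver queued events, including ones published by handlers.
    while pending:
        event_type, payload = pending.popleft()
        for handler in subscribers[event_type]:
            handler(payload)


def charge_card(customer):
    # Hypothetical payment call: cards marked "expired" are declined.
    return customer["card"] != "expired"


def on_order_accepted(order):
    # Billing subscriber.
    if charge_card(order["customer"]):
        publish("OrderBilled", order)
    else:
        # The order itself is kept; the customer is asked for another card.
        emails.append((order["customer"]["email"], "Problem with your card"))


def on_order_billed(order):
    # Shipping subscriber: knows nothing about sales or billing code.
    shipments.append(order["order_id"])


subscribers["OrderAccepted"].append(on_order_accepted)
subscribers["OrderBilled"].append(on_order_billed)

publish("OrderAccepted", {"order_id": 1, "customer": {"email": "a@example.com", "card": "valid"}})
publish("OrderAccepted", {"order_id": 2, "customer": {"email": "b@example.com", "card": "expired"}})
run_until_idle()
```

Order 1 reaches shipping; order 2 stops at billing and results in an email rather than a rollback of the whole process.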

00:21:11 Udi Dahan

Now, whether shipping is ultimately subscribed also to the order accepted event, that's a separate decision, but when looking at introducing publish/subscribe communication, don't always expect that it's going to be at the places where you have your regular request/response web service type logic. In a lot of cases, the best place to put it is kind of, wedge it right into the middle of your core business logic, which again, may seem a little bit counterintuitive to you, "Well, this is our core domain logic. Why would we want to introduce, excuse me, any publish/subscribe in there?," and well, exactly because it is your most complex logic, because it will be volatile. It will be changing over time, instead of having one monolithic pile of business logic. If you can divide the sales-centric logic and have that separate from your billing domain model and have that separate from your shipping domain model, where things are being run asynchronously from each other, that's the highest level of loose coupling that you can achieve between these types of code bases.

00:22:25 Udi Dahan

Now, again, this is not the kind of thing that I'd suggest you take lightly. "Oh, I was at this presentation yesterday. Hey, boss, you know that really core, important business logic that our entire company depends on for its financial future. I want to rewrite that. I want to do some publish/subscribe over there." Why?

00:22:44 Udi Dahan

"Because Udi said so." Okay? Don't go running ahead and doing that indefinitely. Don't go blaming me when it goes wrong, okay? More often than not, I see people get overexcited about these patterns, and they end up misapplying or overapplying them to cases where it doesn't make sense, okay?

00:23:04 Udi Dahan

There's a lot of power in these tools. A lot of benefits can be realized, but if you use them in the wrong place, you may end up not with eventual consistency, but eventual inconsistency, and that's not so great, okay? The problem with eventual inconsistency, it's very difficult to write a unit test or an acceptance test to actually check to see if your system is behaving correctly, so all of this parallelism does have a certain cost with regards to testing, and we're going to talk about that, but that's the double-edged sword of loose coupling. The more you disconnect things, the more loosely-coupled they are, the more there are opportunities for independent failures and for independent bugs. The good news being that if you made a change to some logic over here, the likelihood that you're going to break logic over there or over there is next to nothing, so when you decouple things not only from, let's say an object-oriented perspective, but from an event-based perspective and from an asynchronous invocation perspective, the likelihood of breaking something really drops low, and that's ultimately what helps projects get done faster, because the vast majority of time that we end up wasting on software projects is around regressions, right?

00:24:32 Udi Dahan

We fixed a bug over here, then we changed something over there, and we rebroke something that we fixed earlier. All of that time wasted on stabilizing and restabilizing the system, what's known as debugging and rebugging the system. There's a great old UNIX thread on the topic of, "If what you do after you code is debugging, when exactly did you put the bugs in?" It must have been when you were coding, so a better name for coding would be enbugging, okay? The software development life cycle is enbugging, debugging, rebugging, and then again, okay?

00:25:15 Udi Dahan

We want to try to break that cycle or at least partition it into smaller areas so that if we accidentally rebug a piece of code over here, that doesn't end up rebugging the code over there and over there, okay? Those are the benefits of loose coupling, but like I said, double-edged sword. Now, this whole element of pub/sub, yes, there can be more than one subscriber and I'll be illustrating these sorts of patterns and code in a little bit, but you want to be careful that you're not turning everything into an event. Occasionally, I see people getting so excited about publish/subscribe, is that they take a fundamentally request/response interaction, so for example, validation request, validation response, and they kind of twist that around. It's, "Oh, pub/sub is awesome."

00:26:09 Udi Dahan

"Udi said so," so instead of just doing a simple request/response interaction, they publish a validation requested event, which is subscribed to by something over here, and the thing over here publishes a validation responsed event, which is then subscribed back over there, and then they try to do some sort of weird correlation thing to make sure that the validation responsed is connected to the original context of the validation requested code, and they just tie themselves up into all sorts of weird scenarios. If it doesn't look like pub/sub is fitting, don't force it, okay? If something is fundamentally request/response, use request/response, but if there is an opportunity for pub/ sub.

00:27:03 Udi Dahan

Oftentimes it'll be in a place where you're not doing regular request/response. It can be in a place where the fundamental question that I suggest you ask is, could this be asynchronous with respect to the first bit? And part of the way of addressing that is, if the first bit succeeds and the second bit fails, should we roll back the first bit? Going back to the Amazon.com case, when I make a purchase on Amazon and the credit card billing process fails, it's not that Amazon throws away my entire order and says, "That's it, forget about you. Your credit card failed. Your entire order is garbage, complete garbage," and deletes all of that data, because that's the traditional approach to coding things, right? You do a synchronous, one big transaction. Insert this, calculate that, validate this, insert this, call that. If anything fails along your big transaction, it all rolls back.

00:28:11 Udi Dahan

That's the business question that you should be asking your domain expert: if this bit of the process fails, should we really roll back everything? Is all of the data then considered useless, and should it be purged? More often than not, your domain expert will tell you, "No, no, no. That's a terrible idea. Just because we couldn't do this does not mean that the rest of the data is garbage." So look for those opportunities as a part of your business analysis, for things that could be partially successful. Then that's an indication. The thing that can be partially successful is its own service, which publishes an event when it reaches that point of stability, and then the rest of it, which can succeed or fail independently of that, can happen in a separate subscriber. You will need this interaction with your domain expert in order to find the right places to introduce it.
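
One way to picture "partially successful" is as two separate transactions instead of one big one. The Python sketch below uses the standard `sqlite3` module with an in-memory database; the `orders` table, its `status` values, and the two functions are assumptions made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")


def accept_order(order_id):
    # Transaction 1: sales reaches its point of stability and commits.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'accepted')", (order_id,))


def bill_order(order_id, card_ok):
    # Transaction 2: billing succeeds or fails on its own.
    try:
        with db:  # commits on success, rolls back if an exception escapes
            if not card_ok:
                raise RuntimeError("card declined")
            db.execute("UPDATE orders SET status = 'billed' WHERE id = ?", (order_id,))
    except RuntimeError:
        # The accepted order is not thrown away, only marked for follow-up.
        with db:
            db.execute("UPDATE orders SET status = 'payment_failed' WHERE id = ?", (order_id,))


accept_order(1)
bill_order(1, card_ok=False)
accept_order(2)
bill_order(2, card_ok=True)
rows = db.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
```

Order 1 still exists after its payment fails, which is the behavior the domain expert in the example asks for.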

00:29:12 Udi Dahan

That's one of the things, and that's one of the reasons why I say that so many of these message-driven patterns are not only compatible with domain-driven design philosophies. They actually promote it even more strongly in the organization. Because in order to introduce pub/sub, you have to have a conversation with your domain expert. You have to work at creating a ubiquitous language. You have to figure out what are the correct transaction boundaries? What can succeed independently of something else? Transaction boundaries? Who's read the domain-driven design book? A bunch of you. There's this great concept in DDD called aggregate roots. Aggregate roots are transaction boundaries. Being able to do this analysis from an asynchronous perspective often helps you uncover better aggregate roots than maybe you found before. This is not just pub/sub from the perspective of, it makes your code base more loosely coupled, or it allows processes to run in parallel, which gives you better scalability. Along the way, it actually improves the quality of your domain. It improves the quality of the aggregate roots that you identify. But it takes a good, strong interaction with the domain experts. That's on the logical side of things. I want to pivot for a second to the physical, transactional reliability side of things.

00:30:50 Udi Dahan

When you are leaning heavily on publish/subscribe, saying if this bit succeeds, then I want to make sure it succeeds independent of that bit, and if that bit fails, I want to make sure that no data gets lost, you're going to want to make sure that the infrastructure that you're leaning on has all of the necessary moving bits to give you the highest level of reliability. Let's say an order is coming in, whether this order is coming in as the place order command over here, or whether we're looking at the order accepted event arriving in billing. One of the things that service bus technology does, or should do, is to make sure that the business logic that you're writing around processing that message, in this case assuming you're talking to a database, is really as transactional as you think it is.

00:31:48 Udi Dahan

So when building on traditional HTTP communication: HTTP unfortunately is not very reliable or fault tolerant. By introducing queues, you get a very high level of reliability, where the bus technology ultimately integrates with whichever queue you're using (MSMQ, RabbitMQ, ActiveMQ, et cetera) and invokes your code in the context of the same transaction in which it pulled the message off of the queue. So any work that you're doing against the database ultimately enlists into this larger ambient transaction, in case anything bad happens midway through your processing. It could be that the database server crashes. It could be that your application server crashes. It could be that when you try to connect to the database, it throws a "connection refused by the remote host" exception at you. It could be that you get a deadlock exception. Any number of bad things can happen in a production environment.

00:32:56 Udi Dahan

The good news is that under these conditions, because your database work is being run in the same transaction as your message queuing work, not only will the database state roll back, but the message will also roll back to the queue, thus guaranteeing you that you will not have lost the important data that's in there. Like I said, this is very important when you're saying, "I'm in the middle of charging the customer's card, and then my server crashed." That's a terrible thing to happen, but it's not really preventable. Servers crash. It's physics. You can't really fight with the laws of physics and win, at least not for very long anyway.
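
The guarantee described here can be imitated in a few lines of Python. This sketch makes a simplifying assumption that is not how the talk's infrastructure works: the queue is just a table in the same SQLite database as the business data, so a single local transaction covers both. Table names, the message body, and the handlers are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE charges (order_id TEXT)")
db.execute("INSERT INTO queue (body) VALUES ('order-42')")
db.commit()


def process_next(handler):
    """Receive one message and run the handler in the same transaction."""
    try:
        with db:  # commits on success, rolls back on any exception
            row = db.execute("SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
            if row is None:
                return False
            msg_id, body = row
            db.execute("DELETE FROM queue WHERE id = ?", (msg_id,))  # the "receive"
            handler(body)                                            # business logic
        return True
    except ConnectionError:
        return False  # both the receive and the business work were undone


def counts():
    queued = db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
    charged = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
    return queued, charged


def crashing_handler(body):
    db.execute("INSERT INTO charges VALUES (?)", (body,))
    raise ConnectionError("simulated failure midway through processing")


def working_handler(body):
    db.execute("INSERT INTO charges VALUES (?)", (body,))


first = process_next(crashing_handler)
after_failure = counts()   # message is back in the queue, no charge recorded
second = process_next(working_handler)
after_success = counts()   # message consumed, charge recorded
```

After the simulated failure the message is still in the queue and no charge row exists, so the work can be attempted again without losing the order.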

00:33:42 Udi Dahan

When the message rolls back, it can then be tried again. Now depending on the nature of the failure, maybe trying again immediately makes sense. For example, if we had a deadlock exception, trying again immediately, beautiful. For other types of situation, database server is down, we might want to wait a little bit longer. So having the order back in the queue, very useful. It makes sure that we as developers don't need to rebuild all sorts of crazy HTTP retry logic; the infrastructure can solve that for us. But ultimately there needs to be some sort of retry thing happening. This is one of the things that various bus technologies provide, whether it's NServiceBus or MassTransit; I think Rhino Service Bus also does it as well.
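
The two retry styles mentioned here, immediate for something like a deadlock and delayed for something like a database outage, might be sketched in Python as follows. The exception classes, the `process_with_retries` function, and the delay values are all invented for illustration and are not the API of any particular bus.

```python
import time


class DeadlockError(Exception):
    """Hypothetical transient failure: retrying right away usually works."""


class DatabaseDownError(Exception):
    """Hypothetical longer outage: wait before trying again."""


def process_with_retries(handler, message, immediate_retries=2,
                         delays=(1, 5, 30), sleep=time.sleep):
    immediate_left = immediate_retries
    delays_left = list(delays)
    while True:
        try:
            return handler(message)
        except DeadlockError:
            if immediate_left == 0:
                raise                  # out of immediate retries, surface the error
            immediate_left -= 1        # try again straight away
        except DatabaseDownError:
            if not delays_left:
                raise                  # out of delayed retries, surface the error
            sleep(delays_left.pop(0))  # back off, then try again


# Usage: a handler that fails three times before succeeding.
failures = [DeadlockError(), DatabaseDownError(), DatabaseDownError()]
waits = []  # records the requested delays instead of really sleeping


def flaky_handler(message):
    if failures:
        raise failures.pop(0)
    return "processed " + message


result = process_with_retries(flaky_handler, "order-42", sleep=waits.append)
```

Passing `sleep=waits.append` keeps the example instant; with the default `time.sleep` the same call would actually pause between the delayed attempts.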

00:34:37 Udi Dahan

Who's using a service bus of some kind? Quite a lot of you. For those of you who aren't using a service bus, who's using just a straight queue? When you're using a queue directly without any intervening service bus. For those of you using a queue, what type of queue are you using? Let's try MSMQ for those of you? Got one hand. RabbitMQ directly? Another one. ActiveMQ? Another one. Let me see which other ones. The bigger guys, WebSphere? SonicMQ? TIBCO? Stop me when I say something that sounds like a crystal.

00:35:24 Udi Dahan

So if you're using queues directly, one of the things that you ultimately have to do is to manage the transactions yourselves correctly. And if there's any retry logic to be done, you also need to manage that correctly. And you need to be extremely careful with certain queuing technologies, because not all of them support multi-queue transactions. So if you have, as a part of some of your code over here, logic that publishes an event or sends out another message, then you want to guarantee that any messages that you ask to be sent as a part of your business logic while processing that message do not get sent if this rolls back.

00:36:17 Udi Dahan

Why is that dangerous? I've seen this happen and again, it only ever bites people in production, which is what makes it scary. You go to the database, you insert an entry into a table, you get back an identity for that entity, like: "Oh, okay, great, I've just created an order object. Let's publish an event, telling everybody about the new order and its ID." So you go to publish that event. Now, if your queuing technology or bus technology does not fully encapsulate the outgoing messages as well as the incoming messages, it could be that that message escapes your transaction. And then what you'll have is some subscriber that receives that event, but your database has rolled back because of some other problem later on in your processing logic.

00:37:12 Udi Dahan

Sure, you'll reprocess that order at some later time, but at that later time, it may receive a different ID. And now you have a problem that you have a subscriber that has heard about an order with an ID that doesn't actually exist over here. I can tell you these sorts of things, they're very difficult to debug. You don't want to put yourself in that position.
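
A common way to keep an event from escaping a rolled-back transaction is to buffer outgoing messages and release them only after the commit. The Python sketch below illustrates that ordering with `sqlite3`; `MessageContext`, `handle`, `create_order`, and the `OrderCreated` event are hypothetical names, and the `delivered` list stands in for whatever subscribers would actually receive.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
delivered = []  # what subscribers actually get to see


class MessageContext:
    """Collects outgoing messages; they are released only after the commit."""

    def __init__(self):
        self.outgoing = []

    def publish(self, event_type, payload):
        self.outgoing.append((event_type, payload))


def handle(message, handler):
    ctx = MessageContext()
    try:
        with db:  # commits on success, rolls back on any exception
            handler(message, ctx)
    except Exception:
        return False  # rolled back: the buffered events are simply dropped
    delivered.extend(ctx.outgoing)  # committed: now it is safe to send
    return True


def create_order(message, ctx):
    cursor = db.execute("INSERT INTO orders (item) VALUES (?)", (message["item"],))
    ctx.publish("OrderCreated", {"order_id": cursor.lastrowid})
    if message.get("fail_later"):
        raise RuntimeError("some later step failed")


failed = handle({"item": "book", "fail_later": True}, create_order)
succeeded = handle({"item": "book"}, create_order)
order_ids = [row[0] for row in db.execute("SELECT id FROM orders")]
```

Only the committed order produces an event, and the ID in that event matches a row that exists. This in-memory buffer would still lose events if the process died right after the commit, so it only illustrates the ordering, not a full reliability guarantee.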

00:37:40 Udi Dahan

And occasionally what happens when I talk to people about, "You should think about messaging and pub/sub," they're like, "We tried that once, it created a total mess. It wasn't eventual consistency, it was eventual inconsistency. We had these events flying around, but when we looked in the database, the data wasn't in there." This is scary stuff. The worst case is not when you can't find the entity in the database, but when you find the wrong entity. Charging a customer for somebody else's order. They don't like that. It kind of depends, right? If you bought your really big, expensive piece of electronics and the other customer bought a book, "Yeah, ship me that and I'll pay for that." Of course, if you're the other customer, yeah, you're not particularly thrilled about the wires getting crossed the other way.

00:38:36 Udi Dahan

When you're going to introduce publish/subscribe, you want infrastructure that gives you these very strong once-and-only-once message processing guarantees, not only for the message processing on the way in, but also for any events that are going out, so that you've got full and total physical encapsulation of those transactions. Now, some people say, "Well, that makes it slow, right? Why use distributed transactions? That makes the system slow." Be very careful. You've probably heard the statement before: "Premature optimization is the root of all evil." Who's heard that statement before? A bunch of you. That's that element of, "We want the system to be fast, so we're not using distributed transactions."

00:39:23 Udi Dahan

Now, if you ask somebody, "Then why don't you just stop using transactions entirely?" they're like, "That's crazy talk." If the foremost thing in your mind is performance, then you just make everything really fast. But if you understand that there is some need for transactions, if you understand there is some need for consistency, then let's have a more specific discussion as to the level of distributed transactions that we need. What's important here is that it's not a distributed transaction that spans more machines. It's the same number of machines as before: two machines, your app server talking to the database server, just as if this were an HTTP call coming directly into your code that was managing this transaction. So it is a higher level of transaction, but it is no more distributed than your regular transactions were to begin with.

00:40:26 Udi Dahan

Unfortunately, distributed transactions got a really bad name. So people shy away from them constantly, without necessarily stopping to think that it's not actually any more distributed than what you're already doing. The only thing that is really distributed here is that instead of having one resource manager as a part of this transaction, we have the queue resource manager as well, which can be local.

00:40:51 Udi Dahan

Yes, it might be a little bit slower because we're enlisting another resource manager, but it's not fundamentally much slower than your regular code. And believe me, I've worked with clients on this stuff. Every once in a while I get a client and they're like: "NServiceBus is slow." "Really? How so?" "It's only doing 50 messages per second for me." "Okay, that's interesting. I've run tests where I have NServiceBus doing 2,500 messages per second. What are you doing?" "NServiceBus is slow." "No, what are you doing?" "The distributed transactions are killing me." Lots of these types of beliefs that they just kind of restate. "What's your business logic doing?" "It's not my business logic." "Tell me, what is your business logic doing as a part of this?" "Well, you see, I'm talking to my database. I'm opening up a transaction. I'm doing a calculation that sometimes can take up to 30 seconds, and NServiceBus is slow." "No, seriously, you have business logic that's taking 30 seconds for a single message. There's your problem." "No, no, I've tested my business logic. It's fine. NServiceBus is slow. It's using distributed transactions."

00:42:11 Udi Dahan

Sometimes people are like a dog with a bone. If you hear "distributed transactions are slow" enough times, you'll start to believe that they're really, really slow. Ergo, they're the cause of all performance problems in your system. Most of the time they aren't. Again, this matters only for those people that need super high performance, beyond 2,500 messages per second per node in your system. I've seen this kind of thing run on a hundred-node cluster. With all of these things running in parallel, it scales up to hundreds of thousands of messages per second. That's for the whole system. For those of you that need more than that, if you need more than several thousand transactions per second for a given machine, there's an F# presentation upstairs. They'd love to talk to you about how to do that. That's what they do.

00:43:07 Udi Dahan

Here's the crazy, super-duper high-performance type of stuff. And there are domains where that's really important. I'm not making light of that. But for a lot of business scenarios, messaging with the database with distributed transactions is absolutely fast enough. More often than not, the reason your system is slow is because your code isn't optimized as well as it could be, or your database is the bottleneck. I've had this one time, separate client, similar type of story: "NServiceBus is slow." We went through the process. What's it talking to? They're like, "It's operating on a single record in the database. There's no reason this should be slow." "Okay, yeah, that's weird. How big is that record?" "It's got 256 columns on it, but we're needing to make it bigger." I'm like, "Okay, what else is talking to that record? I'm curious." They go, "Oh, everything in the system opens up transactions against that record as a part of its logic." "Oh, okay, so you have a contention problem. Other parts of the system are locking out your part of the system. Distributed transactions aren't slow. Your design is slow."

00:44:23 Udi Dahan

So please, don't worry so much about the transactional nature of the infrastructure that you're working on. If you have performance problems, nine times out of 10, if not 99 times out of a hundred, it's either here or here. Nothing related to the service bus or the queuing technology; that stuff is pretty fast. And, going back to where we were on the previous slide, the benefit of a system built on publish/subscribe is that you have one little node over there, potentially doing 2,500 transactions per second, and another node over here, running asynchronously with respect to the first one. So that's another 2,500 messages per second. And every one of these things doing pub/sub messaging is not blocked waiting for anybody else. The only thing to be careful of is if you have a big central database in the middle that everything is talking to. The ideal path to move to when you want to build a truly loosely coupled pub/sub system is, instead of having one big master database, to look at how you can partition your database as well, so that you have a sales database, which is separate from the billing database, which is separate from the shipping database. So that ultimately there is no traditional master database which holds everything in it. Of course the DBA will hear this and say: "Heresy! Burn them at the stake."

00:46:11 Udi Dahan

Now you can tell them, look, I don't necessarily mean that each of these will be a separate database server. You can host all of these three, four, 10 independent database schemas on the same database server. Then you can get centralized backup and management, all that kind of stuff. I'm not going to get involved. But by having separate database schemas, again, you take the autonomy and the loose coupling to the next level, knowing that if I change some business logic over here, and in doing so I need to change part of a table schema, I can know that I'm only changing my table schema and no other code except mine is going to be affected by that. That's sort of the next level of where you can evolve to: in terms of loose coupling via publish/subscribe, it's taking the data and making it also behave the same sort of way. And of course that improves your performance as well, often by more than a factor equal to the number of services that you have.

00:47:31 Udi Dahan

So if you had a single master database and you divide that up into three, arguably your performance is going to be more than three times better, because not only are these things running in parallel with each other, the size of the data that each transaction is operating on is much smaller. If you want, you can talk through the scenarios with your DBA, saying: "Look, instead of having tables that are this wide, we now have three tables that are much smaller." If you want to get on your DBA's good side, say, "This is just the vertical partitioning that you DBAs have been doing for years. We developers, we're a little bit slow. It takes us some time to get with the program. So we've decided to adopt that. And in addition to the vertical table partitioning that you guys have done, we just put a little bit of pub/sub in there to kind of keep it all in sync in a loosely coupled fashion."

00:48:28 Udi Dahan

They're like, "Flattery will get you nowhere, but I like vertical partitioning." So there are ways. Sometimes it surprises people. DBAs get on board surprisingly quickly with a lot of these patterns because, put yourself in their shoes: you are tasked with managing the company's monstrous database, where developers left, right and center are running complex, horrendous code that opens up transactions for very long periods of time. Anytime there's a problem with your database, the entire company is affected. So ultimately, if anything goes wrong, it's your fault. But of course you're never in any position to do anything about it, because developers are writing big, gigantic transactions.

00:49:17 Udi Dahan

As developers start writing smaller, more focused transactions on smaller sets of tables, you regain more control. Things run faster. Ultimately you can get a DBA on board with this, but do not be afraid of the transactions or distributed transactions that service buses are running. For those of you that are using a queuing infrastructure without any service bus technology on top of it, I implore you: pick something, anything, start using it, write your own. I don't really care. This is not the kind of code that developers who are focused on business logic should be thinking about. "Wait a minute, how do I set up the multi-queue transactions so that events won't escape before my transaction is done?" That's scary code. You don't want to go anywhere near that. You want to have that set up as infrastructure that automatically takes care of everything.

00:50:24 Udi Dahan

In some cases you have more permanent problems. It could be that you weren't able to successfully deserialize the data that came in off of the queue. It could be that you had a bug in your code. Not your code, somebody else on your team had a bug in their code, which caused problems in your system. In a synchronous type of architecture, I'm talking about HTTP web services, REST, whatever you want to call it, A calls B calls C calls D. If there's a deserialization problem, the thing fails. An error ultimately gets propagated back to the client that says: "Hey, something bad happened." Maybe it will collect some log files along the way.

00:51:10 Udi Dahan

The problem fundamentally is that we've lost the data, right? Think about your average web service call. If you can't deserialize, that's it. Poof. The data's vanished. The problem is that a lot of times deserialization exceptions are not because the data is actually bad, but rather because we built a version N plus one of our system that didn't happen to be backwards compatible with version N. So we deployed a new server and we've got existing clients that are trying to send data, and our service goes: "I don't understand. I don't understand. I don't understand. Error. Error. Error." Data's just being dropped left, right, and center until somebody notices that the log files are filling up with errors. Then, "Oh crap," quickly revert back to the previous version and then try to fix the problem and redeploy.

00:52:06 Udi Dahan

If this sounds like any one of your last deployments, I feel very sorry for you, but it is reality. The thing is, nobody ever stops to think beyond "there was a deserialization exception, there was a bug in our code, that's a problem." What they don't stop to think about is all of the production data that tried to flow through the system at that upgrade point in time. That's now quietly fizzled into the mists and nobody knows about it. It's actually good news. The good news is nobody knows that the data fizzled into the mist, because if the CEO knew that we actually lost a whole bunch of production data, they'd probably be furious.

00:52:51 Udi Dahan

Good thing, no CEOs here. Be careful when you send this link out, remove it from the CEO distribution list. In a message-based architecture, if there's a problem where we can't deserialize a message, there's something we can do beyond retrying, because we are asynchronous with respect to the caller. It doesn't make sense to retry something that you can't deserialize. The benefit of having a queue is that you can take this message and move it to a different queue. Maybe there's a problem with this message. Maybe it's garbage, but maybe it's actually some valuable business data. Let's put this in a separate queue so that an administrator or somebody else can manually take a look at it and figure out what we should do. Queuing systems call this a poison letter queue, sometimes, or an error queue. The advantage is that your system does not lose data, even when you did a poor upgrade of the system and there were bugs in there, or you couldn't deserialize, or anything like that.
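As a rough illustration of moving an unreadable message to an error queue instead of dropping it, here is a minimal Python sketch. It uses in-memory `deque` objects as stand-ins for real queues and JSON as the assumed wire format; `process_all` and the queue names are hypothetical and not taken from any particular queuing product.

```python
import json
from collections import deque


def process_all(input_queue, error_queue, handle):
    """Drain input_queue; park anything that cannot be deserialized."""
    while input_queue:
        raw = input_queue.popleft()
        try:
            message = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Retrying will not help here, so keep the raw payload for a
            # human to inspect rather than losing it.
            error_queue.append({"raw": raw, "reason": str(exc)})
            continue
        handle(message)


input_queue = deque(['{"order_id": 1}', '<order id="2"/>', '{"order_id": 3}'])
error_queue = deque()
handled = []
process_all(input_queue, error_queue, handled.append)
```

The second message is not valid JSON, so it ends up in `error_queue` with its original content intact while the other two are handled normally.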

00:54:03 Udi Dahan

This makes upgrading your system a whole lot safer. Okay. Now, if there's a problem, like I said, messages can be moved to an error queue. And this is where most native queueing systems and service buses stop. And that matters for several reasons. Administrators tend not to like queues. I mean, they'll live with them if the system is going to be built with messaging anyway. Well, okay, then. But if this is the first time you're introducing messaging to an organization, they're going to be like, "What? Another thing to monitor? It's not enough that I have to monitor your web servers, and I have to monitor the database server. Now I have to monitor this additional queueing thing. And it's not like one queue, you're giving me lots and lots of queues. And what if something goes wrong?" They don't like it. Usually the "I don't like queues" is not an indication that there's anything fundamentally unmaintainable or unmonitorable about queues.

00:55:03 Udi Dahan

It's a lack of tooling. So who is running on MSMQ? Anybody? Okay, got some hands over there. MSMQ has a very bad reputation. A lot of it is based on, "Well, I remember using MSMQ back in 2002 and it was a piece of crap." Do you know that it's been 12 years since then? Sometimes, how you know that you're getting old is that you kind of say, "Well, in 2002," or '98. For me, one of the biggest moments was when I was talking to some college students and I happened to mention the dot-com bust. And they kind of looked at me with this deer-in-the-headlights look, like, "The what?" Really? It wasn't that long ago, was it? I said, right, 12 years now, 13 years. All of a sudden it's been 13 years. It seems so fresh in my memory. Same sort of thing with technology.

00:56:23 Udi Dahan

It's like, "I remember using MSMQ and it was a pile of crap." Well, yeah, it was a version one or version two product from Microsoft at the time. And yeah, most version one or two products from Microsoft are pieces of crap. Not to say that it's only Microsoft that's afflicted. You know, IBM is the same thing, Oracle, it's most technology vendors: version one is kind of a piece of crap. But along comes version two and version three and version four, and eventually it gets there. So MSMQ has kind of been painted with that same brush.

00:56:54 Udi Dahan

It used to be a piece of crap, but now it's pretty solid. And the big thing that MSMQ has going for it, and that a lot of other queues don't have, is that it supports distributed transactions and it supports multi-queue transactions. And it does all of this stuff out of the box in a cluster-friendly way without a whole lot of setup. RabbitMQ, for example, does not support distributed transactions. ZeroMQ does not support anything. Anybody using ZeroMQ? Okay, yes, ZeroMQ. There's a reason that they put "zero" in there. Okay. They're like, "It's an MQ, what does it give me?" Nothing! "Does it give me reliability?" No, it gives you nothing. "Does it give me transactions?" No, it gives you nothing. "Does it give me clustering?" It gives you nothing. You can build all that yourself. I don't want to say no. ZeroMQ has marvelous documentation that describes all of the great ways that you can solve the problems that it doesn't solve for you. Okay. So it's one of the things you need to be careful with, though.

00:57:59 Udi Dahan

ZeroMQ has a marvelous reputation. People rave about it constantly, I'd say to no small extent because of their awesome documentation. And here I'm kind of forced to look at myself in the mirror: our NServiceBus documentation is pretty crappy. Okay, sorry about that. By the way, anybody who's using NServiceBus, incidentally? Okay, got some hands. Okay. So yeah, we're working on that. We're trying to make it better. But you know, with a lot of the things NServiceBus does together with MSMQ, you're not going to find an equivalent level of reliability around RabbitMQ or ZeroMQ or ActiveMQ. I remember some of you said ActiveMQ, right? You've got to be really careful with ActiveMQ. We thought that we had good support for ActiveMQ until we actually started deploying it at really high scale at a specific client. It says that it supports distributed transactions, but its .NET client does not support distributed transactions well. Okay. It supports XA transactions on Java perfectly, not on the Microsoft platform. So you want to watch out for that. Okay, so you're good, just don't integrate with any Microsoft crap. Okay.

00:59:06 Udi Dahan

All right. Now, the thing about messaging-based systems is that they can allow you to write code which is ignorant of a lot of things. Okay? And it's true, ignorance is bliss. I've got to tell you, after building these systems on top of this sort of reliable service-bus-type infrastructure, it really allows you to focus on "what's the business problem I'm solving?" rather than "wait a minute, has this message escaped, and how do I de-duplicate in my business logic if I've received this message before?" Those are all sorts of problems that occur in queue-based environments and that, again, give some of that a poor reputation. You need to be aware that not all queues are created equal, while service buses have kind of tried to create a certain baseline around that, not always successfully. In terms of our code, ultimately, what's needed is a higher level of monitoring than, say, your traditional "Oh, just make sure that the servers are up, check the log files that no exceptions were logged. Is the database server online?"

01:00:22 Udi Dahan

In terms of monitoring, when you have a queue-based environment, I mentioned before what happens if there are errors in the system. In a traditional web services type of environment, if you have an exception, that exception is logged. Depending on how you implement logging, it could be logged to a local file, which then ultimately needs to be rolled up to some central location, or you have some logging code that is talking to a database directly. Anybody doing logging that talks directly to a database? Yes. Okay, some hands going up. Of course, then you have the issue that when it can't connect to the database, the system is throwing exceptions, and it's trying to log them. But when it tries to log them, it throws another exception that it can't connect to the database to log the exception.

01:01:11 Udi Dahan

So things are going bad, but nobody's hearing about it because the system is kind of stuck. Okay? Occasionally, people try having the logger do a queue-based store-and-forward type of operation to bring the log to a central location, where it can then talk to a database. Anybody doing queued logging? It's actually not such a bad idea, because for the most part, it disconnects the storing of the log to a durable location, like a database, from the business logic processing of your system. Okay? So you do want very much to centralize your logs, but more importantly, you want to do that when there are errors, and usually not very much before them. A queueing-based environment is good at this, because when there are problems, it can catch them after it does all of the retries and store and forward them to a central location. That is absolutely brilliant.
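The "retry first, then store and forward the failure to a central place" behavior could look roughly like the following Python sketch. The retry budget, `process_with_retries`, and the in-memory `central_errors` queue are assumptions made for illustration; real service buses configure and implement this differently.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget


def process_with_retries(message, handler, error_queue, max_attempts=MAX_ATTEMPTS):
    """Try the handler a few times; forward the message to error_queue if all fail."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:  # broad on purpose: any handler failure counts
            last_error = exc
    # All retries are used up: forward the message together with the error.
    error_queue.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return False


calls = {"n": 0}


def flaky(message):
    """Fails twice (a transient problem), then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database briefly unavailable")


def broken(message):
    """Always fails (a permanent problem such as a bug)."""
    raise ValueError("bug in handler")


central_errors = deque()
first = process_with_retries({"order_id": 1}, flaky, central_errors)
second = process_with_retries({"order_id": 2}, broken, central_errors)
```

Transient failures are absorbed by the retries and never reach the central location; only the message that keeps failing is forwarded, along with the last error seen.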

01:02:10 Udi Dahan

I can tell you, for every one of our clients that is doing this model, not only do the developers swear by it, the administrators swear by it. And the reason why the developers are happy is because the administrators no longer phone them up at 3 a.m. saying, "There's a problem with the system. I need help." Okay, most of you, I'm guessing you've been promoted out of that problem. Right. But that's an issue that we have in IT: the only problems that get solved are the problems that are felt by the senior developers. Okay. The issue is that we take all sorts of problems that we don't like dealing with and we make the junior devs deal with them; we make them other people's problems. And then we don't go and automate that away.

01:02:54 Udi Dahan

So one of the best things that happened with regards to testing in the IT world is that Martin Fowler was able to convince lots of senior developers, "Hey, you guys should be writing unit tests." All of a sudden, testing became a certain badge of honor for senior developers. "I write unit tests. I have a suite of unit tests. I have this percentage of code coverage." And then you saw unit testing frameworks getting created and continuous integration environments getting built. So anything that you can convince senior developers to do is a good thing. Its quality magically improves. Anything that senior developers don't do kind of stays stuck where it's been for the past 5-10 years without really going anywhere. And logging and monitoring and that kind of stuff has kind of not really advanced very much. There are some interesting startups doing some work in that area.

01:03:49 Udi Dahan

Monitoring-wise, next generation. One of the things that the queues give us, as I said, is error queues. And this is one of the new tools that we've created around NServiceBus: it sits on top of the error queue in your system, feeds that into a database of sorts, and has this nice little web application over here so that anytime a message fails, meaning it arrives at the error queue, it pops up and says, "Hey, look, a message has failed. Eight messages have failed. Click here to see the details." And then it can actually tell you the full history of what went on over there. So this message arrived at the error queue from that server on that machine. It was being processed by, you know, whatever class and this line of code, and here are the full message contents.

01:04:39 Udi Dahan

That is something that is absolutely brilliant, I've got to tell you, when going to debug a system, and all you've got is a log file saying something failed but you don't actually have the raw business data, all of the parameters that were passed into your code, which caused it to blow up, it's really difficult to reproduce the problem. The advantage with the message based system is it can actually carry with it the full payload that caused the problem. So that you can take that error message off to your developer environment say, "Oh right yeah, this message actually does cause my code to blow up", without trying to think, okay, it blew up on this line of code. What exactly were the parameters that caused my code to blow up? So for those of you who, whether you're using messaging or not using messaging, when you have an exception that's logged, who makes sure that they have all of the parameters logged as well, that came along with the exception? Yes. Okay.
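For code that does not use messaging, one way to keep the invocation parameters alongside the exception is a small wrapper around the functions you care about. The following Python sketch is only an illustration: `capture_failures`, `failed_invocations`, and `apply_discount` are hypothetical names, and a real system would store the record somewhere durable rather than in a list.

```python
import functools
import json
import logging
import traceback

logger = logging.getLogger("invocations")
failed_invocations = []  # stand-in for wherever failures are really stored


def capture_failures(func):
    """Record the arguments and stack trace of any call that raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            record = {
                "function": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "stack_trace": traceback.format_exc(),
            }
            failed_invocations.append(record)
            logger.error("call failed: %s", json.dumps(record, default=repr))
            raise  # the caller still sees the original exception

    return wrapper


@capture_failures
def apply_discount(order_total, percent):
    if not 0 <= percent <= 100:
        raise ValueError(f"invalid percent: {percent}")
    return order_total * (100 - percent) / 100


try:
    apply_discount(200, 150)
except ValueError:
    pass  # failed_invocations now holds the exact inputs that blew up
```

With the inputs recorded, reproducing the failure in a development environment is a matter of calling the function again with the stored `args` and `kwargs`.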

01:05:42 Udi Dahan

I see some hands going up, maybe 30%? For those of you who raised your hands, is it useful? Yes. Okay. I've seen all the hands go up. If you could have done it earlier on the project, would you have done it? Yes. Okay. The rest of you, please do this. This is for your benefit. Now, you may be implementing this type of logging-intercepting infrastructure yourself, and you can do it with regular web services, by the way, I don't want to say that you can't. Messaging just kind of has it almost built in by default. Okay. So you get the same sort of behavior: something goes wrong, you have the payload, you can reproduce it very easily, in addition to the stack trace. And, as I mentioned earlier, when you fix the bug, you still have the original message data. So that billion-dollar order that came in from Warren Buffett exactly when you were upgrading your system, you can then replay that message and have it be processed successfully.
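Replaying a failed message after the bug is fixed can be sketched as follows, again in Python with in-memory `deque` objects standing in for real queues. The handlers, `drain`, and `retry_failed` are hypothetical; the point is only that the original payload survives the failure and can be fed back in.

```python
from collections import deque


def drain(input_queue, error_queue, handler):
    """Process every message; park the ones whose handler raises."""
    processed = []
    while input_queue:
        msg = input_queue.popleft()
        try:
            processed.append(handler(msg))
        except Exception:
            error_queue.append(msg)  # keep the full original payload
    return processed


def retry_failed(error_queue, input_queue):
    """Move every parked message back to the input queue; return how many moved."""
    moved = 0
    while error_queue:
        input_queue.append(error_queue.popleft())
        moved += 1
    return moved


def buggy_handler(msg):
    return msg["amount"] * msg["quantity"]  # KeyError when quantity is missing


def fixed_handler(msg):
    return msg["amount"] * msg.get("quantity", 1)  # bug fixed: default quantity


input_queue = deque([{"amount": 10, "quantity": 2}, {"amount": 1000}])
error_queue = deque()

first_pass = drain(input_queue, error_queue, buggy_handler)   # one message fails
moved = retry_failed(error_queue, input_queue)                # after deploying the fix
second_pass = drain(input_queue, error_queue, fixed_handler)  # it now succeeds
```

Nothing was lost between the failed deployment and the fix: the second message sat in the error queue and was processed once the corrected handler was in place.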

01:06:46 Udi Dahan

So for all those of you that raised your hands with regards to, "Yes, we love the invocation data as a part of the exception when things fail," do you have a kind of tool which allows you or your administrators to replay that invocation against your system? Okay, I'm seeing a whole lot fewer hands than there were before. Okay. It's really important. I mean, if you've captured the data, this is valuable business data. Somebody wanted this to go into the system; you'll want to be able to play it back against the system. Again, this is the kind of thing where it's just a matter of tooling. Around NServiceBus, we've been building this sort of tooling to make it easy for people to have all this happen out of the box. And then you can really quickly click through to one of these messages and say, "Retry it." And then off it goes, sent back, reprocessed. Hopefully now you've fixed the bug successfully and everything's great. With me so far? Yeah? Okay, good. Auditing. Auditing is another absolutely amazing, brilliant feature of message-based systems. Sometimes called journaling, by the way.

01:08:02 Udi Dahan

So ultimately, when you have a system that is very publish/subscribe, messaging-centric, loosely coupled, and all that kind of stuff, as the messages flow through your system, and maybe I'll go back a couple of slides over here, when a message processes successfully, we can take a copy of that message, store it, and forward it to a separate audit queue, to be able to say, "I did this." In a distributed system where you have things running in parallel with each other, being able to have a record that this happened at this point in time is very useful. It's even more useful going on, when you can associate messages across multiple places. So, those of you using queueing or service buses, who does auditing? Who has auditing turned on in your queue-based system? Not a lot of hands going up. Please, please, if you're using service buses, if you're using queues, turn on auditing, turn on journaling. You will thank me later. I will explain to you why it's important.

01:09:10 Udi Dahan

Now, one of the reasons why people leave journaling turned off is because they leave the messages in the queuing system, and after turning it on once and then leaving it running, the queuing system kind of builds up. So, you know, it's been two days and I've got a million messages in the audit queue. And it's been a week and I've got 10 million messages in the audit queue. Eventually I run out of storage, or, depending on the implementation of your queueing system, often what it does is hold in-memory indexes for every message in the queue. So it can run out of memory because the indexes have used it up. And then the queueing system starts throwing exceptions. And then the whole system grinds to a halt and, of course, the administrators just say, "Piece of crap queueing system, can't do anything right." No, it's not a problem with the queuing system. The queuing system is doing what it's supposed to do. Okay.

01:10:10 Udi Dahan

Auditing and journaling give you very useful information, but you don't want to leave that stuck in the queuing system. You want to pull it out and stick it in some database for longer-term storage. Also, the reason you want it in a database is because there's gold in there. It really can help you visualize and query what happened in your production system. Where did things go wrong? Now, for this element of "where did things go wrong," after it's in longer-term storage, you require all sorts of information about the message, not only to say, "Okay, we had this event, and this is what happened with it." You'll want to have the ability to trace that this event was caused by this other message, which was caused by this other message that happened over here. So if you are building service-bus-y, loosely coupled, message-based systems, make sure that you add a type of conversation ID header to everything that's going through your system. Now, this is really, really simple to do.

01:11:24 Udi Dahan

So you know, whether it's the first user click or a file drop that arrived, you have some code that's kicking off a process. New up a GUID, stick that as a header on that first message, and for every bit of processing downstream, just copy that header onto it. When you do this, you in essence create an end-to-end system trace of your loosely coupled environment, so that it's easier to query afterwards to see, "Okay, why did things behave the way that they did?" So the disadvantage of queueing systems, as I said before, is that you can't look at the code and see one big gigantic method top to bottom and see what it does. That's because it's loosely coupled and separate and, you know, asynchronous and all that kind of stuff. But being able to query what happened at runtime and see "this is how things behaved" can make your life so, so much easier, and not only in production, by the way, okay? It's valuable in production, but also for development.
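The conversation ID idea described above fits in a handful of lines. This Python sketch assumes messages are plain dictionaries with a `headers` entry; `new_message`, the `conversation_id` header name, and the message type names are illustrative choices, not an actual service bus API.

```python
import uuid


def new_message(msg_type, body, caused_by=None):
    """Build a message, inheriting the conversation ID of the message that caused it."""
    if caused_by is None:
        # First message in a flow (user click, file drop): new up a GUID.
        conversation_id = str(uuid.uuid4())
    else:
        # Downstream processing: just copy the header along.
        conversation_id = caused_by["headers"]["conversation_id"]
    return {
        "type": msg_type,
        "headers": {"conversation_id": conversation_id},
        "body": body,
    }


place_order = new_message("PlaceOrder", {"order_id": 42})
order_accepted = new_message("OrderAccepted", {"order_id": 42}, caused_by=place_order)
order_billed = new_message("OrderBilled", {"order_id": 42}, caused_by=order_accepted)
unrelated = new_message("PlaceOrder", {"order_id": 43})
```

Every message caused, directly or indirectly, by the first `PlaceOrder` carries the same ID, so the whole flow can later be pulled out of an audit store with a single filter.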

01:12:35 Udi Dahan

Imagine being able to build a message-based system, debug something on your machine, and have something show you the visualization of, "Okay, it started with PlaceOrder, and then after that an event came out called OrderAccepted, which then led to OrderBilled, which then led to OrderShipped." Being able to see the thread of the flow across the different endpoints of your system, to be able to look and say, "Wait a minute, there should have been another event over there," or, "Wait a minute, why didn't we send an email?" When you have all of these moving parts that are all running in parallel with each other, sometimes it's difficult to debug this and see, did everything that should have happened actually happen? Being able to have an audit log of everything flowing through your system end up in some sort of centralized database, all correlated nicely and neatly with a little bit of visualization on top of it, will make your lives as developers building the system infinitely better. Okay.

01:13:41 Udi Dahan

Now, this is another tool that we've created around NServiceBus because, hey, you know, we're living and breathing these problems every single day. We need it. And I've got to tell you, it changes so many things. In addition to this, we can show which endpoint a message is coming from, where it is going to, all sorts of great things. Now, again, I'm not saying use NServiceBus, but take a look at the kinds of things that we're doing to get ideas for what you could do in your environment. So for those of you running on Java, I'm sorry, there is no "JServiceBus" that does everything that NServiceBus does. Now, most of the stuff that we're building is not rocket science. Okay, it's just some basic tools and techniques from people that have been in the trenches saying, "Look, we built this for ourselves, you can build this too." The ideas are simple. Auditing and journaling is something supported by every single queueing system. Turn it on, have a little bit of code that's taking the data out of your journal, dump it into a database. It's not that hard.
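The "little bit of code that takes the data out of your journal and dumps it into a database" might look something like this Python sketch. It uses an in-memory `deque` as the audit queue and `sqlite3` as the database; the message shape, the `audit_log` table, and `drain_audit_queue` are assumptions for illustration only.

```python
import json
import sqlite3
from collections import deque


def drain_audit_queue(audit_queue, db):
    """Copy every audited message into the database, then empty the queue."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "conversation_id TEXT, message_type TEXT, processed_at TEXT, body TEXT)"
    )
    batch = list(audit_queue)
    with db:  # one transaction for the whole batch
        db.executemany(
            "INSERT INTO audit_log (conversation_id, message_type, processed_at, body) "
            "VALUES (?, ?, ?, ?)",
            [
                (
                    msg["headers"]["conversation_id"],
                    msg["type"],
                    msg["processed_at"],
                    json.dumps(msg["body"]),
                )
                for msg in batch
            ],
        )
    audit_queue.clear()  # only after the database commit succeeded
    return len(batch)


audit_queue = deque(
    [
        {"type": "PlaceOrder", "headers": {"conversation_id": "c1"},
         "processed_at": "2014-06-10T12:00:00", "body": {"order_id": 42}},
        {"type": "OrderAccepted", "headers": {"conversation_id": "c1"},
         "processed_at": "2014-06-10T12:00:01", "body": {"order_id": 42}},
    ]
)
db = sqlite3.connect(":memory:")
moved = drain_audit_queue(audit_queue, db)
```

Running something like this on a schedule keeps the audit queue short, which avoids the storage and memory build-up described earlier, while the history stays queryable in the database.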

01:14:50 Udi Dahan

Make sure that every message that's flowing through your system contains this type of conversation ID. Why? So that you can draw these sorts of pictures, making it easy for developers to figure out why the system is behaving a little bit funny. This is also extremely valuable in a production setting. In working with clients... well, maybe let me preface this. A lot of times when I talk to developers, their primary concern about production systems is, if there's a bug and an exception is thrown, then how do we handle that? The thing that I'm more concerned about is if there's a bug in production and an exception is not thrown. It's the systems that quietly fail, or work wrong, without really anybody noticing. Okay? Those are the kinds of really scary things. So being able to have a tool that actually shows you this is what happened in production.
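
To make the conversation ID idea concrete, here is a small Python sketch, under the assumption that each message carries two headers: a `conversation_id` shared by everything that flows from one original command, and a `related_to` pointing at the message that caused it. Both header names and the helper functions are hypothetical, chosen for this example only.

```python
import uuid
from collections import defaultdict


def new_message(msg_type, endpoint, caused_by=None):
    """Create a message, copying the conversation ID from the message that caused it."""
    return {
        "id": str(uuid.uuid4()),
        "type": msg_type,
        "endpoint": endpoint,
        # A brand-new conversation starts only when nothing caused this message.
        "conversation_id": caused_by["conversation_id"] if caused_by else str(uuid.uuid4()),
        "related_to": caused_by["id"] if caused_by else None,
    }


def render_flow(messages, conversation_id):
    """Return an indented outline of one conversation, parents before children."""
    children = defaultdict(list)
    for m in messages:
        if m["conversation_id"] == conversation_id:
            children[m["related_to"]].append(m)

    lines = []

    def walk(parent_id, depth):
        for m in children[parent_id]:
            lines.append("  " * depth + f'{m["type"]} (from {m["endpoint"]})')
            walk(m["id"], depth + 1)

    walk(None, 0)
    return lines


place = new_message("PlaceOrder", "FrontEnd")
accepted = new_message("OrderAccepted", "Sales", caused_by=place)
billed = new_message("OrderBilled", "Billing", caused_by=accepted)
unrelated = new_message("PlaceOrder", "FrontEnd")  # a different conversation

flow = render_flow([place, accepted, billed, unrelated], place["conversation_id"])
print("\n".join(flow))
```

The important part is that the infrastructure, not each handler author, copies the headers forward. Otherwise one forgotten header breaks the picture.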

01:15:52 Udi Dahan

When your business user is saying, "You know, something's weird with that account. I can't really put my finger on it, but in the last week, you know, just things are going wrong. The numbers, I mean, they're adding up mathematically, but they're wrong." That's a great case where it would be useful not to just say, okay, let's look in the database and see what the state of that account is right now. That doesn't necessarily help you. I mean, databases are very good at capturing the current state, but are not very good at telling you, well, what happened yesterday and the day before? And so, well, we have logging for that. Most of the time, the problem with logging is... I don't remember who told me this originally, but they said, logging is creating haystacks around needles. There's some gold in there, but I'll be damned if I can find it.

01:16:53 Udi Dahan

There's just so much going on, and it's really hard to say, how much should I be logging at any one point in time? And you only find out after the fact, when somebody has said, "Something weird's been going on." You're like, okay, I'll turn up the logging, and they're like, no, it started a while back, you know, trace back everything that happened in the last week. Again, having a database, that audit log, that you can go into and say, find me all of the messages that pertain to a specific account ID in this past week and visualize the flow for me, can make it much easier for you, together with a domain expert, to kind of trace through the business flows that relate to that, and for them to say, you know, that bit over there? That seems a little weird.
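
The "find me all of the messages that pertain to a specific account ID in this past week" query might look roughly like the following Python sketch. The `audit_log` table, its columns, the account IDs and the message types are all invented for illustration. A real audit store would more likely pull the account ID out of the message body or a header.

```python
import sqlite3
from datetime import datetime, timedelta

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit_log (sent_at TEXT, account_id TEXT, message_type TEXT, body TEXT)")

now = datetime(2014, 1, 15, 12, 0, 0)  # fixed "now" so the example is repeatable
rows = [
    ((now - timedelta(days=10)).isoformat(), "ACC-1", "AccountOpened", "{}"),
    ((now - timedelta(days=4)).isoformat(), "ACC-1", "FeeCharged", '{"Amount": 5}'),
    ((now - timedelta(days=3)).isoformat(), "ACC-2", "FeeCharged", '{"Amount": 7}'),
    ((now - timedelta(days=1)).isoformat(), "ACC-1", "BalanceAdjusted", '{"Amount": -5}'),
]
db.executemany("INSERT INTO audit_log VALUES (?, ?, ?, ?)", rows)


def messages_for_account(db, account_id, since):
    """All audited messages for one account since a point in time, oldest first."""
    return db.execute(
        "SELECT sent_at, message_type, body FROM audit_log "
        "WHERE account_id = ? AND sent_at >= ? ORDER BY sent_at",
        (account_id, since.isoformat()),
    ).fetchall()


last_week = messages_for_account(db, "ACC-1", now - timedelta(days=7))
types = [message_type for _, message_type, _ in last_week]
print(types)
```

Because each row holds the message body, the result is a sequence of state transitions to walk through with the domain expert, not just the account's current state.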

01:17:41 Udi Dahan

Open up the message data. The advantage of having an audit log is that it contains all of the business data in each of the messages. It says not only this was the state, but this was the state transition that occurred at that time. And then your domain expert can say, "You know, it was there, four days ago, in this part of the system, that things started working wrong." And then you can go back and say, "Yeah, I remember, we deployed a new version of that system into production roughly four days ago." So it turns out there's a bug over there, and now we're in a better position to find it because we have this type of audit history. Okay? So, turn on the journaling, dump it out to a database, get the necessary headers flowing through all of your messages. It will make life so much better for you, both in development and in production. Making sense? Yes? Not going too fast for you? All right.

01:18:44 Udi Dahan

Thank you. Right. Now, this is where I get to do this fancy, slick demo with all of the things, all of the screenshots that I put in front of you. But before I jump directly into, you know, what using these tools can be like, and giving you a sort of sense for the experience of this balance of building message-driven systems, from sort of the lowest level of queueing to high-level tool support, are there any questions? I've kind of been talking at you for quite some time. Lots of questions? Okay. Let's say yes. The idea that we came up with in and around NServiceBus was, if people draw these sorts of diagrams for building their message-based systems, why can't they code in such a way that actually connects to these sorts of diagrams? Why? Because having good documentation of a message-based system is important. We saw some of these things earlier on, for example, the visualization of the flow, et cetera.

01:19:46 Udi Dahan

So for those of you that aren't interested in NServiceBus, or you're not on the .NET platform, I'm not going to be offended if you want to go, okay? For those of you that aren't on the .NET platform, or aren't using NServiceBus: unlike the other tools that I've shown you, for example the monitoring and the audit visualization, which are not very difficult to build, the thing that I'm going to show you now actually is pretty difficult to build. So it's not the kind of thing where, okay, I'll just sort of look at it, get the general idea, and then build it myself over the weekend. Okay?

01:20:23 Udi Dahan

We've been spending roughly the last two years building something like this, to get it to the point that it's actually working well. So all those caveats aside, the idea that we wanted to do in and around NServiceBus was to make it simpler and easier for developers to communicate with each other, what is the design of the system that we've built, and also to be able to connect what they built to what's actually running? Okay? So let me just get this out of the way over here so we have a little bit more space. Okay.

01:21:03 Udi Dahan

We've created an extension to the Visual Studio environment that allows us to build NServiceBus centric systems. And let's say I'm building an Amazon2, because I know everybody likes to have a bookstore. And what this environment does is it gives me, and I was never really thrilled about drag and drop tools. I've always hated them with a vengeance. So it's kind of a fitting irony that I ended up creating one myself to kind of say, well, for years you've been saying you don't need draggy, droppy tools they're a waste of time to create crap code and all that kind of stuff. Now you've actually got to stand up in front of all the people that you said that to and say, but this one's really different, I promise.

01:21:56 Udi Dahan

What this environment does is it allows us to design endpoint centric systems. For example, let's say I have a FrontEnd web type MVC environment, and along the way, so I'm using this to build not only the messaging components of my system, but to integrate the messaging parts of my system with the rest of the system. And so it's generating properties for me, and eventually you're going to start seeing services up here. But this thing that you have over here, let me zoom in a little bit. That's my MVC FrontEnd endpoint. And now I want to start using this to send some commands or publish events. I can do that directly from here and I'll have a Sales service, which is what we saw before, with a PlaceOrder command. And as I'm doing this, it generates the message types for me, it generates a message handler component that says, okay, that's going to be the thing that it's going to be routing to.

01:23:00 Udi Dahan

And I can pop in here and say, show me the code of the PlaceOrder message, which is just a POCO. And I can put whatever I want in here. Let's say what I have is an integer OrderNumber. What we try to do is try to avoid taking over the actual code that you'd be writing either for defining your messages or defining your logic. But ultimately give you a little bit of a facility for navigating your code base a little bit better. Over here, I now have a PlaceOrderProcessor component. And I could do all sorts of things there, as you can see over here, including deploying that, just so that I have a physical place to do things in a BackEnd. Now this NServiceBus Host is a very simple console application, which deploys as a Windows service. It makes it easy to debug as well as easy to deploy. And as I'm doing this, ultimately it generates the code, all of the configuration, all of the routing config, which is the part that I always hate doing and it's very error prone when I'm coding things on my own machine. And now I can actually send a message from a front-end to a backend, but this is a little bit boring. Let's actually build a slightly larger system, which is what we were seeing before. So we've got this Sales backend with a PlaceOrderProcessor. Let's have it publish an event and we'll call this OrderAccepted, which is what we've been looking at for the past couple of minutes. And I'll open up the OrderAccepted. And over here also pass in that OrderNumber that we saw. Now I've published an event and just to make sure that the OrderNumber propagates correctly from the original command to the event that's published, I'll pop into the code over here. And you know, like I said before, this is a plain old object, this is your code. You put whatever business logic you need in here talking to your database, talking to Mongo, publishing events.

01:25:14 Udi Dahan

And what we're going to do over here is ultimately we're going to implement the... after I build it... implement the code that ultimately publishes out the event and sets and propagates the OrderNumber along. So we'll do that really quickly over here. And here it's saying, configure the OrderAccepted message that you're going to publish. And I'm just going to quickly set the OrderNumber based on the incomingMessage.OrderNumber. Hopefully I'm not going too fast and you can see that okay in the back. Now I've published an event that has the original order number. Let's add a subscriber. So we've got Billing and we've got Shipping, right? Now it creates a service for each one of those as a container for these components. And we also have an event handler for each one of them, that is, an OrderAcceptedProcessor. We'll deploy this really quickly to a separate backend. Let's call this BillingBE. And of course, all of this type of events, messages, pub/sub, is turning into regular code projects in terms of Visual Studio. Nothing extremely fancy over there. And we'll deploy this one to a Shipping backend, ShippingBE.
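
For readers following along without the tool, the shape of what has been built so far can be sketched in a few lines of Python. This is an in-process stand-in, not NServiceBus: `InMemoryBus` and its methods are hypothetical, there are no real queues, and delivery is synchronous. The message and endpoint names come from the demo.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PlaceOrder:  # command sent by the front-end
    order_number: int


@dataclass
class OrderAccepted:  # event published by Sales
    order_number: int


class InMemoryBus:
    """Hypothetical stand-in for the messaging infrastructure (no real queues)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []

    def subscribe(self, message_type, endpoint, handler):
        self.handlers[message_type].append((endpoint, handler))

    def dispatch(self, message):
        # Every endpoint registered for this message type gets its own delivery.
        for endpoint, handler in self.handlers[type(message)]:
            self.log.append(f"{endpoint} received {type(message).__name__} {message.order_number}")
            handler(message)


bus = InMemoryBus()


def place_order_processor(incoming):
    # Propagate the order number from the command to the published event.
    bus.dispatch(OrderAccepted(order_number=incoming.order_number))


bus.subscribe(PlaceOrder, "Sales", place_order_processor)
bus.subscribe(OrderAccepted, "Billing", lambda event: None)   # placeholder handlers
bus.subscribe(OrderAccepted, "Shipping", lambda event: None)

bus.dispatch(PlaceOrder(order_number=4567))
print("\n".join(bus.log))
```

Sales does not know who subscribes to OrderAccepted. Adding Billing and Shipping required no change to `place_order_processor`, which is the loose coupling the talk is after.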

01:26:59 Udi Dahan

And with that, we have now created a system that has a web front-end and three backends doing some one-way messaging, doing some pub/sub. Now, I'm not going to be demoing everything that we've got here, but I just want to give you an idea of how simple it can be to start from scratch and, in a couple of minutes, say, what I did on the whiteboard, now I have code that actually runs, that is using an underlying queuing system. And if you're interested, there is the ability to set, at the level of the properties of the entire system, things like the Error Queue. Unfortunately my magnifier is not... A little bit small because of the... Come on. It's not that for me, Views. Okay. Forget that.

01:28:05 Udi Dahan

I can show what the Error Queue is for the entire system, what the Audit Queue is for the entire system, as well as this thing over here, the Transport. I can say, you know what, I don't actually want to run this entirely on MSMQ, now I want the system to run on RabbitMQ. Now I want the system to run using SQL Server tables as my queues, or I want this to run in the cloud on Azure Queues. Why? Because ultimately it's the same fundamental logical flow. I just want to swap out the underlying transports, because ultimately the infrastructure says, well, I'll do the adapting for you for each environment. All sorts of nice little things to allow you to focus on building your system and what it's for, and abstracting all of that away: how do I make this run on MSMQ, RabbitMQ? How could I use this persistence, that persistence?
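
The reason swapping transports is possible is that the logical flow only ever talks to an abstraction. A minimal Python sketch of that idea follows. The `Transport` interface and both implementations are assumptions for this example: one keeps queues in memory, the other uses a SQLite table as a queue, loosely mirroring the "database tables as my queues" option.

```python
import sqlite3
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class Transport(ABC):
    """Hypothetical transport abstraction: the business flow depends only on this."""

    @abstractmethod
    def send(self, queue, body): ...

    @abstractmethod
    def receive(self, queue): ...


class InMemoryTransport(Transport):
    def __init__(self):
        self.queues = defaultdict(deque)

    def send(self, queue, body):
        self.queues[queue].append(body)

    def receive(self, queue):
        q = self.queues[queue]
        return q.popleft() if q else None


class SqlTableTransport(Transport):
    """Uses one database table as a set of FIFO queues."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE messages (seq INTEGER PRIMARY KEY AUTOINCREMENT, queue TEXT, body TEXT)"
        )

    def send(self, queue, body):
        with self.db:
            self.db.execute("INSERT INTO messages (queue, body) VALUES (?, ?)", (queue, body))

    def receive(self, queue):
        with self.db:  # read and delete in one transaction
            row = self.db.execute(
                "SELECT seq, body FROM messages WHERE queue = ? ORDER BY seq LIMIT 1", (queue,)
            ).fetchone()
            if row is None:
                return None
            self.db.execute("DELETE FROM messages WHERE seq = ?", (row[0],))
            return row[1]


def run_flow(transport):
    """The same logical flow, regardless of which transport carries it."""
    transport.send("Sales", "PlaceOrder 4567")
    order = transport.receive("Sales")
    transport.send("Billing", order.replace("PlaceOrder", "OrderAccepted"))
    return transport.receive("Billing")


results = [run_flow(InMemoryTransport()), run_flow(SqlTableTransport())]
print(results)
```

Real transports differ in transactions, ordering and delivery guarantees, so the adapting the infrastructure does is considerably more than this sketch suggests.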

01:29:02 Udi Dahan

And of course you can choose this for each endpoint as well. What we're going to do now is we're going to run it. As you can see, we have this new front-end that pops up to life. And while MVC gets going, we have all of these backend processes that are sitting there. This is the Shipping backend. Incidentally, the yellow warnings that you see are when the infrastructure is creating the queues for you based on a convention, based on the name of the endpoint that you've given it. So here it says, okay, I'm calling this Shipping dot backend, or Amazon2.ShippingBE, and then it creates a bunch of queues, because it's always annoying to run a system and then to get to the point where all of a sudden you realize you forgot to create one of the queues, and you've got to stop debugging, create the queue and start debugging again in order to actually get your stuff done.
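
Convention-based queue creation can be sketched like this in Python. The suffixes `.retries` and `.timeouts` are hypothetical placeholders, not the actual queue names NServiceBus creates, and the `existing` set stands in for asking the queueing system which queues already exist.

```python
def queue_names(system, endpoint, suffixes=("", ".retries", ".timeouts")):
    """Derive queue names from the endpoint name. The suffixes are made up for this sketch."""
    base = f"{system}.{endpoint}"
    return [base + suffix for suffix in suffixes]


def ensure_queues(existing, system, endpoint):
    """Create whatever is missing at startup and report what was created."""
    created = []
    for name in queue_names(system, endpoint):
        if name not in existing:
            existing.add(name)  # stand-in for a real "create queue" call
            created.append(name)
    return created


existing = {"Amazon2.ShippingBE"}  # pretend the main queue is already there
created = ensure_queues(existing, "Amazon2", "ShippingBE")
for name in created:
    print(f"WARN: queue {name} did not exist, created it")
```

Calling `ensure_queues` a second time creates nothing, so running it on every startup is harmless.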

01:30:01 Udi Dahan

All right. This is the MVC front-end that we've created. One of the things that we do is we generate this little Test Messages widget, because, well, you always want to be able to kind of poke at your system and see how it behaves. We create a Test UI for every single message type that is accessible from your front-end, so that you can really easily type something in over here, 4567, and send that message out. Again, all sorts of little tools to make life easier on the developer. When I send this message out, then I can go and take a look at, here's my Sales backend, and as you can see, the white line over there says Sales has received PlaceOrder, and Billing has now received OrderAccepted, and Shipping over here has received OrderAccepted as well.

01:31:02 Udi Dahan

Nice, simple, easy. All of 10 minutes since I started, and I have a new system running, I'm debugging it, all of the messages are flowing everywhere. And we have this picture over here, which you guys might remember from before. When you're looking at this, I want you to not look at it where it is. I want you to imagine that you're an actual developer with a proper development environment, with two or three monitors around you, where on monitor number one you have this picture in front of you, on monitor number two you have the code of your business logic that you're debugging, and on monitor number three you have this picture that's actually showing you, fully synchronized with your current debug session, how things are actually behaving in front of you. And again, this is synchronizing in real time. As I mentioned before, building this thing is not very difficult. But the whole Visual Studio extensibility side of things, and getting everything in sync, that we've been working on for some time.

01:32:13 Udi Dahan

This type of information shows us basic messaging flow. I can also take a look at the body of a given message. Over here it shows me this is XML serialization of what's going on over there, as well as we have a whole bunch of headers on the message properties that allows me to know that this was processed by version 4.4.2 of NServiceBus. Its content type is XML. Here are all of the headers that came along with that. I also have Performance information about that message that tells me when it was sent, when was processing started, when did it end, what was the processing time, what was the delivery time across the queuing system. All sorts of nice metadata to be able to keep track of what on earth just happened in my environment. And if there were any errors while processing the message, then they would appear over here in this little Errors tab telling me what the ExceptionInfo was, et cetera.
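
The performance figures mentioned here fall out of three timestamps carried with each audited message. A Python sketch, assuming hypothetical header names `time_sent`, `processing_started` and `processing_ended` holding ISO 8601 timestamps:

```python
from datetime import datetime


def performance_stats(headers):
    """Derive delivery and processing times from timestamps recorded on a message."""
    sent = datetime.fromisoformat(headers["time_sent"])
    started = datetime.fromisoformat(headers["processing_started"])
    ended = datetime.fromisoformat(headers["processing_ended"])
    return {
        # Time spent in the queueing system before a handler picked the message up.
        "delivery_seconds": (started - sent).total_seconds(),
        # Time the handler itself took.
        "processing_seconds": (ended - started).total_seconds(),
    }


headers = {  # made-up values for illustration
    "time_sent": "2014-01-15T12:00:00.000",
    "processing_started": "2014-01-15T12:00:00.250",
    "processing_ended": "2014-01-15T12:00:00.400",
}
stats = performance_stats(headers)
print(stats)
```

When sender and receiver run on different machines, the delivery figure is only as trustworthy as their clock synchronization.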

01:33:16 Udi Dahan

Now let's do one more type of thing because what we're missing in this diagram is what we actually had in the original picture over here that Shipping was handling both Order Accepted and Order Billed and arguably was doing some more business centric correlation between them. So, let's do that. We've got the OrderAcceptedProcessor over here from Billing. We're going to have that publish an event saying OrderBilled and we'll propagate again, just to keep things simple, the OrderNumber in there. And making sure that that goes across the right way.

01:34:18 Udi Dahan

Then ultimately we will, after we propagated the OrderNumber, we're going to use the OrderNumber from the OrderAccepted event and the OrderBilled event to make sure that ultimately everything correlates in the exact same Saga instance. I apologize that my machine is a little bit slow, but you know, you're probably not going to be coding on a laptop when you're building a very large system or, well, you might, if you're working on a strong enough machine, but this one isn't as strong as I'd like it to be. Here we'll set the OrderNumber based on the incomingMessage.OrderNumber as well. And the final piece of this puzzle is to have Shipping, this component over here, handle not only the OrderAccepted event, but also the OrderBilled event.

01:35:07 Udi Dahan

What we will do is we'll say "Subscribe to Event", and here it gives me a suggestion OrderBilled and I say, yes, that's the one. And when I do this, it says, Oh okay, I see you have a component that's going to handle more than one event. You're probably going to want to do some sort of correlation between these messages. Would you like me to change this component into ... in NServiceBus terms we call it a saga, which is an object that does this sort of event based or any type of message based correlation. I say yes, please. All right, which message type should start your long running correlation process.

01:35:48 Udi Dahan

And this is where back to your comment before I said, well, either one might actually arrive first. I don't want to count on the fact that OrderAccepted will always arrive before the other one. So I'll say, do both for me. Done. And then it generates that code. And it also generates a little bit of extra code that intentionally causes the build to fail because currently this tool is not intelligent enough to say, well, I actually see that you are coordinating, that you're correlating an OrderAccepted and an OrderBilled and both of them have an OrderNumber property that's an integer. So let me just figure that out by myself and set up the correlation. We're going to be doing that, you know, that will be in the next major release.

01:36:32 Udi Dahan

You can't expect us to write all of your code for you, right? What we do in here is we generate a bunch of code saying, okay, what you need to do is to set up some mapping between OrderAccepted and OrderBilled, saying how the messages' properties map to the long-running process's properties. We do this so that we make it very easy for the compiler to tell you, this is the code that you should be writing. Like I said, in the future we'll figure this out for you, but it's a nice sort of thing. We can specify the message.OrderNumber for each of these events.

01:37:14 Udi Dahan

And then we're going to be mapping that. If you're familiar with the Saga term, great. If not, just think of it as an object that handles messages and that has state. It's really that simple. We'll have the Saga also have an OrderNumber property on it. And then we'll ask to generate that, make that an integer, turn that into an auto property because that's the way to do things. And in the end, we now have our correlations set up. The thing that is important to do when you are correlating between events is, if you have any state, you need to think about what would happen if these two messages were processed in parallel, like on different threads or on different machines. You really need to make sure that the correlation that you set up over here, with an OrderNumber, is unique.

01:38:22 Udi Dahan

You want only one instance of a Shipping process for any OrderNumber, even though the Shipping process can be triggered by more than one event. This is the kind of thing that will guarantee that even if things run out of order, when they go to persist, if they're persisting at the same time, one of them will fail because of a unique constraint violation, roll back and try again, and then it will join into the existing process. There is just one more small thing that we need to do over here to get this finally done, done, and that's to go into our Shipping processor's code and make sure that we're setting the OrderNumber.
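
The "one of them will fail because of a unique constraint violation, roll back and try again" behavior can be illustrated with a small Python and SQLite sketch. The table, column names and the `stale_lookup` flag are inventions for this example. The flag simulates a handler that looked for an existing process just before a concurrent handler committed its own insert.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE shipping_saga ("
    " order_number INTEGER NOT NULL UNIQUE,"  # at most one process per order
    " accepted INTEGER NOT NULL DEFAULT 0,"
    " billed INTEGER NOT NULL DEFAULT 0)"
)


def handle(order_number, column, stale_lookup=False):
    """Record that an event arrived; returns how many attempts it took."""
    if column not in ("accepted", "billed"):
        raise ValueError(column)
    attempts = 0
    while True:
        attempts += 1
        found = db.execute(
            "SELECT 1 FROM shipping_saga WHERE order_number = ?", (order_number,)
        ).fetchone()
        # stale_lookup pretends the first lookup ran before the other handler committed.
        exists = found is not None and not (stale_lookup and attempts == 1)
        try:
            with db:  # commits on success, rolls back on error
                if exists:
                    db.execute(
                        f"UPDATE shipping_saga SET {column} = 1 WHERE order_number = ?",
                        (order_number,),
                    )
                else:
                    db.execute(
                        f"INSERT INTO shipping_saga (order_number, {column}) VALUES (?, 1)",
                        (order_number,),
                    )
            return attempts
        except sqlite3.IntegrityError:
            continue  # lost the race: retry, and this time join the existing row


first = handle(9999, "accepted")
second = handle(9999, "billed", stale_lookup=True)
rows = db.execute("SELECT order_number, accepted, billed FROM shipping_saga").fetchall()
print(first, second, rows)
```

Without the unique constraint, the second insert would succeed and there would be two half-finished Shipping processes for the same order, neither of which ever sees both events.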

01:39:12 Udi Dahan

Let me make that a little bit bigger for you. And then let's also handle the OrderBilled event as well. So again, we'll just do a very simple copy-paste operation here. And if I had more time, I would actually write a utility type method. And the final bit of this puzzle is this method that it also generates for you, called all messages received. If you're correlating between multiple messages or events, ultimately what you're interested in is, has everything arrived? Ultimately the code behind the scenes is tracking the messages in persistent, durable state, saying, when everything has arrived, well, this is the method that I'm going to call, and in here, do whatever you want to do. In this case, all I'm going to do is I'm going to write to the console saying "All done", and then we're going to mark this process as complete.
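
Stripped of the tooling, the saga being built here is "an object that handles messages and has state". The following Python sketch shows the three ingredients from the demo: either event may start the process, both events correlate on the order number, and a callback fires once everything has arrived. Class and method names are hypothetical, and a plain dictionary stands in for the durable saga storage.

```python
from dataclasses import dataclass


@dataclass
class OrderAccepted:
    order_number: int


@dataclass
class OrderBilled:
    order_number: int


@dataclass
class ShippingSagaData:
    order_number: int  # the correlation property
    accepted: bool = False
    billed: bool = False
    completed: bool = False


class ShippingSagas:
    """Hypothetical in-memory stand-in for saga storage plus dispatch."""

    def __init__(self):
        self.by_order_number = {}
        self.output = []

    def handle(self, message):
        # Either event may arrive first, so either one can start the process.
        data = self.by_order_number.get(message.order_number)
        if data is None:
            data = ShippingSagaData(order_number=message.order_number)
            self.by_order_number[message.order_number] = data

        if isinstance(message, OrderAccepted):
            data.accepted = True
        elif isinstance(message, OrderBilled):
            data.billed = True

        if data.accepted and data.billed and not data.completed:
            self.all_messages_received(data)

    def all_messages_received(self, data):
        self.output.append(f"All done: order {data.order_number}")
        data.completed = True  # mark the process as complete


sagas = ShippingSagas()
sagas.handle(OrderBilled(order_number=9999))    # arrives "out of order"
sagas.handle(OrderAccepted(order_number=9999))
sagas.handle(OrderAccepted(order_number=1))     # still waiting for its OrderBilled
print(sagas.output)
```

In a real system the state would be persisted with a uniqueness guarantee on the correlation property, since the two events may be handled on different threads or machines.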

01:40:17 Udi Dahan

Why? Because, well, that's all that there is to do in Shipping right now. Let's run this, and fingers crossed, hopefully it will all work like magic. Right, where's my UI front-end? There it is. Test a message, order number 9999, send. And we see that Sales has received PlaceOrder as before, Billing has received OrderAccepted as before, and Shipping has received both OrderAccepted and OrderBilled and has invoked all done, which means that all messages have been received and correlated successfully. But because I never trust console WriteLines in a distributed system, for that I have this tool over here, which is showing me, okay, I've got this front-end, which is sending a PlaceOrder to a backend, which is publishing an OrderAccepted event. And that goes to the Shipping backend over here and goes to the Billing backend over there, and Billing publishes OrderBilled, which arrives at the Shipping backend.

01:41:28 Udi Dahan

And you can see that there's this little box over here that is different on them. That is that long-running process, the Saga object that's doing the correlation. And I can click through to that. And it's going to show me the visualization of this long-running process, showing me, here came in OrderNumber 9999, which caused my Saga to be initiated and set the OrderNumber property of that object. Then later on an OrderBilled event arrived, this is the data from the OrderBilled event. It was correlated with the exact same instance. It didn't really update the OrderNumber at all, and then the Saga was complete.

01:42:13 Udi Dahan

This allows me to look at what I think the system is supposed to do. The console debug saying, this is visually what the output was. And this gives me insight into what actually happened under the hood in terms of each message. Where it went from, where it went to, how it was all correlated.

01:42:34 Udi Dahan

And again, this process didn't take very long, but it makes my life easy, or I'd say easier, when it comes time and somebody says, you know, explain to me how the system works. Then I can come back to this diagram and say, well, this is the documentation. It starts over here, it flows over there, these are the events. And another developer can really quickly get up to speed. Even though everything's asynchronous, in parallel and loosely-coupled, they get this picture that says, this is the system at an abstract level, this is how it works and this is what it does. And if you want to work on any part of it, you just kind of go in and, you know, let's look at that code over there. Let's look at that code over there.

01:43:22 Udi Dahan

And when you run it, you get that full visualization. So again, imagine the two or three screen model where you have this on one screen and you have this on the other screen, and say, is this really what I expected to happen? It's through this sort of tooling, and like I said, we've been working on this for quite some time, that it's my hope that messaging patterns, technologies, approaches, architectures will become much more accessible to developers. Because, you know, I kind of look back to what it was like to build these systems without it, and it's really kind of, you're pulling your hair out. I'm actually in a pretty good situation. That's how you know that I'm a good developer, right? You know when things are going wrong, and this makes zeroing in on them much quicker, easier. And again, there's ultimately all of the reliability and all the great things at an infrastructure level.

01:44:21 Udi Dahan

If you're interested in all this stuff, you can come to our website, which is Particular.net, go download it. If you don't want to download it, we have this cool little trial, Online Now thing. We actually have virtual machines hosted in the cloud that you can log into, where all of these tools are already installed. You can start playing with it, because I know, I absolutely know, how you hate installing things on your machine. Come to this environment. I don't know if every single hands-on lab that we have over here has all of the new tools, because these are really brand new bits, but this is a good way of playing with it, getting a sense for what it could be like building systems this way.

01:45:14 Udi Dahan

And also, even for those of you not on .NET, or not interested in using NServiceBus, giving your own team ideas about what sort of tools should we create for ourselves, or what ideas can we take away from this so that we can advance the level of our tooling, to make it easier for more developers in the company to use the tools, adopt ideas of messaging. Because ultimately, sort of rounding back to the picture that I had over here towards the beginning, we still have a long way to go.

01:45:50 Udi Dahan

Event-oriented architecture, these types of things, this is sort of just one stepping stone along the way of building more loosely-coupled, more autonomous systems, more reliable systems. And software is eating the world, right? Almost everything that we interact with in our lives is becoming driven by software. If we don't make our code more reliable, more robust, and we don't make it easier for other developers to make their code more reliable and more robust, ultimately the world that we live in will start to reflect the poor quality of code that we've created.

01:46:28 Udi Dahan

Please do go out there, apply these principles, try it out, play with the technology. And I hope I've given you at least something to think about and maybe something to play with over the weekend. Thank you very much.
