Q&A on Advanced Distributed Systems with Udi Dahan

00:00:03 Michael

I'll kick things off now for everyone that's here, and as we go, I think people will obviously get the gist. But first off, I'd just like to welcome everyone. I think it's going to be a great session with a lot of value. And I guess myself, I feel very honored and privileged and lucky to have Udi join us. I've been in this dev community for a while now, and a lot of people speak of going to all of these courses and doing this training. So it's great to have the man here to give us some advice. So I'll just start with some intros. I guess Udi probably doesn't need much of an intro, but we'll give him one. For those of you that don't know, he's obviously a guru in the distributed systems space, the messaging and event-driven area. He's also the creator of NServiceBus, which I know a lot of companies in Australia and New Zealand that I deal with use.

00:01:02 Michael

And he also does a lot of good courses in the Advanced Distributed Systems Design space, which I know a lot of people on this call have gone on. And last year as well, he gave it for free during COVID, which was really nice. But yeah, as I said, very privileged and honored to have Udi here. So welcome, Udi, thank you. And we've got Amir, Amir Mansell Ali, who's a tech lead at Critic Clear, who's also got a wealth of knowledge in this space and has been in this distributed event-driven space for a few years now. And he's working with Critic Clear, which is a company that I know really well, and I feel very privileged to have them partnering with us to put this meetup on.

00:01:53 Michael

They are very interesting; they're actually a FinTech here in Australia, and they IPO'd at the back end of last year. And they've got a really interesting roadmap for the next 12 to 24 months in that distributed event-driven space. So it's great to also have someone with such good knowledge to be able to ask Udi some questions, because that's, I guess, not my forte. So welcome. And lastly, we've also got Neil. You can't see him down here, but he's going to be looking at the questions and curating them to make sure the questions that get asked bring the most value to everyone in the group. So thanks to Neil for joining us.

00:02:35 Michael

And just a little bit about me. My name is Michael, I am one of the co-founders of Endeavor Recruitment. We specialize in helping startups, scale-ups and tech-focused businesses scale in the technology and product space. And I guess why I wanted to put this together with Critic Clear is that, over the last 12 months, there's been a lot of conversations around teams wanting to move from the monolith to distributed systems, or going through distributed system journeys, and not quite knowing what to do. And I thought, who better to help people avoid those pitfalls and get really good answers than Udi himself. So it's really cool this has finally come together.

00:03:18 Michael

And I guess the structure of this is, we've got a few questions to kick things off from people that have sent stuff through over the last day or two. So thanks very much for that. But we want to make this as interactive as possible. So I have three questions or so, but throughout this conversation, please put any questions you have in the Q&A section down there. I don't think there's a better opportunity to get the information from the horse's mouth, so please use this. The more you put into this, I think the more you'll get back. And also, as the questions go in there, please upvote them if you think they are of interest. Obviously the questions that are of more interest to the whole group will be the ones we answer. But I guess let's kick things off. Really excited to do this. So is there anything you'd like to say, Amir or Udi, before we dive into it?

00:04:12 Udi Dahan

Thanks very much for having me take part in this. It's been a while since I've gotten to Australia and that region, and I'm looking forward to being able to travel there again, hopefully in the near future.

00:04:31 Michael

Yes, hopefully very soon. Well, most of us in Melbourne, we're just in another little lockdown, so it might be a little while longer, unfortunately. But I'll hand it over to Amir to kick things off.

00:04:45 Amir Mnasell Ali

Cool. Thanks for that, Mike. Thanks for organizing this. So we've got quite a few questions to kick off. I think the first one I wanted to cover is pertinent, because it talks more broadly about SOA, Service Oriented Architecture, and domain driven design. It's a question by Yong, and the question goes like this: the pain point of software design is how to deal with changes, and no matter how well we design the system, there will always be new, weird, unexpected requirements that break assumptions and make things difficult. This is an implementation or a manifestation of Murphy's Law in the software industry. On the other hand, SOA, Service Oriented Architecture, and DDD claim that they are suitable only for stable businesses, which rarely have unexpected requirements. This seems not aligned with reality: instead of resolving the pain point, SOA dodges it and can always claim this architecture is not suitable for your business. How do you respond to this criticism?

00:05:54 Udi Dahan

I'm not sure of the source of the claim that SOA is only suitable for stable businesses or environments that don't change. I can say that, first of all, characterizing things in a binary way, as those that change and those that don't, is something of a false dichotomy; every environment changes in various ways over different timescales. But I do think that we as software designers and architects need to be aware of the tools that are at our disposal, and to use the right ones for the right case.

00:06:41 Udi Dahan

Now, briefly, take the, let's call it the early stage startup space, where there's an idea but it hasn't really achieved what's known as product market fit. So we're building something, we have some early users, we think that the problem that those users want solved is x, and we're sort of heading down that path. But it's possible, or even to the point of likely, that somewhere along that path we'll discover that maybe that problem is not shared by enough users, or maybe that problem is not valuable enough to the users that we're targeting for us to make a viable business out of it.

00:07:31 Udi Dahan

So on the business side of things, there can be a significant pivot towards an adjacent or slightly different set of problems. Now, there really is no software technique that is going to allow you to pivot significantly from one thing to another. Beyond I'd say, not trying to design upfront for something that will be hugely scalable. I think that as designers and architects, we've viewed the need to rewrite a system as a kind of failure on our parts, where I'd say that when we're talking about sort of the early software startup type of space, it's not, that's actually something that should be accepted and embraced to say, hey, you actually built software good enough that allowed you to live long enough to attract enough users interest, venture capital, what have you, to get to a size where the software as you designed it is no longer able to meet with the demand of the users that are coming in.

00:08:44 Udi Dahan

Congratulations, that's a business success and a software success, you now get the chance to do it again, better, knowing more. So embrace that idea that you're going to rewrite these things in various ways over time, in that early startup, scale-up type of mode. And the stories abound. This happened to eBay, they rewrote their software; I think they documented six different generations of architecture of their software over time. Twitter also did it, Facebook did it. I think a lot of the most successful stories that you hear are those that did a kind of significant rewrite, or two or more, over time. So that's something where I think we can say, hey, that's okay, don't try to prevent that too much.

00:09:42 Udi Dahan

Now, pivoting back to the elements of service oriented architecture and domain driven design, they're not monolithic practices, where either you do the whole thing or you do none of the thing. There are bits and pieces of them that can be applied in various ways. One of the common mistakes that I think are made, regardless of whether you're using SOA or DDD or CQRS or layered architecture or any of those techniques, is that we people working in the software space tend to treat requirements that we get from business stakeholders and users as authoritative, meaning we assume that they know what they want, and that they're able to articulate that in a good way. Yet, I think we've all had an experience where we've delivered exactly what the users asked for, but when they see what we've delivered, their response is, yeah, but that's not what I meant. Or, I know that that's what I asked for, but can you make it do this other thing.

00:11:01 Udi Dahan

And the greater the number of stakeholders and the user populations that you're targeting, the higher the likelihood that there's miscommunication and lack of alignment between them. So you might deliver something that is exactly what one stakeholder group wanted, but it's not compatible with the desires of others. And I don't think we spend enough time on sort of the requirements side of things. You don't just gather requirements in that they're sort of lying around and you pick them up, you document them, there's deep work to figure out based on the various user and stakeholder input, what should we be doing, and it's a very interactive process to come to a coherent set of things that you can then start talking about how should we architect this thing.

00:11:57 Udi Dahan

And I think that that's not something that's appreciated enough. So some of the times that people run into issues, again, whether it's SOA, DDD, CQRS, event sourcing, all of those types of techniques, it's because those upstream activities of requirements analysis, and let's call it iterative evaluation, and generation and pushback, and all of those things, are not done nearly enough. So in my experience, after you go through those exercises on the requirements side, just like on the implementation side where we have patterns, there are certain requirements patterns that you can start to see emerging. And when you see something that doesn't really fit into any pattern, you could have what might be called a requirement smell. We're familiar with that in terms of code smells: when you see this kind of code, you get that whiff that something's off here, I can tell there are going to be problems around this area of code.

00:13:11 Udi Dahan

There are the equivalent things on the requirements side, where you can kind of go through and say, something feels off here. If we do exactly what this says, it's not going to go well. But it's a whole other set of knowledge that broadly speaking in the industry is not appreciated enough and not enough time is spent on those things. And then essentially, we're kind of looking for these other architecture design techniques to try to compensate for them, and they just can't. So in my experience when doing the appropriate requirements techniques and surfacing appropriate requirements patterns through those elements, a lot of times SOA and DDD end up fitting those very well, but it's not a simple or easy thing to do, it takes quite a bit of knowledge, skill and time to do it well. And some organizations because they don't appreciate that, they don't invest the knowledge skill or the time in those activities, and thus end up having poor results later on. So that would be my response to the criticisms about saying, this hammer constitutes a terrible screwdriver.

That's fair. I think we can all relate to that product and requirements experience. Moving on, there's another question, and I'll just read that out. This is from Justin, Justin Golberg: we have a microservice platform built with multiple services that you can buy an insurance policy through, and a customer portal that serves mostly consolidated read data to a customer. Instead of the customer portal calling multiple services, we have an aggregate of the data that builds a read data model in MongoDB; when events are published, the portal subscribes to certain ones and adds to the aggregate. We're contemplating having a read model based on the entity, for example policy aggregates, and a sort of aggregate for the UI, in this case the customer portal. Keen to get your thoughts on having an aggregate versus calling the services directly. If we use an aggregate, would it be recommended to do it for the entity or the domain or the UI?

00:15:55 Udi Dahan

Okay, this is a somewhat specific question, and I'll try to extract certain general principles out of it. One of the things that's described here is a common architecture approach, where you have system A, let's call that the front end, that is interacting with users and doing a bunch of things, and you have a number of, let's call them back end systems, for simplicity. And there's data in the back end systems that needs to be surfaced to the front end systems. And then we get questions of, well, what's the right way of doing this? Should we do this in synchronous HTTP request response fashion? Should we do some sort of publish subscribe model where the back ends publish data, and then the front end caches that in some kind of way, whether that's caching in memory or in a durable cache, like, for example, as the question mentioned, in MongoDB? And when we do that, we feel certain pain around the logical coupling: when changes are made in the data model of one of those back end systems, we end up having to make changes to the front end systems or other systems that depend on them, regardless of whether we do the synchronous request response model or the event driven publish subscribe model, because essentially we have large, it's not just amounts of data, but large sets of attributes that are coupled on both sides of the wire.

00:17:45 Udi Dahan

So part of the framing of the question is that that description of the architecture is both a logical architecture as well as a physical architecture, meaning the front end system is a certain unit of deployment; you can point to a certain set of processes in production and say, that thing there, that's the front end system, and that's back end number one. You can see those things in production, they're runtime processes, and when we design the system, we also treat them as logical boundaries. That's one set of code for the front end stuff, and that's another set of code for the back end stuff. Now, that approach of having the logical architecture and the physical architecture essentially be one to one creates a bunch of the pain that people subsequently feel. And I talk quite a lot about this in the course; I don't have time to cram five days of training into five minutes of response.

00:18:59 Udi Dahan

But essentially, that's the mistake. You're trying to solve too big and too rich of a problem with too few degrees of freedom. Now, again, this is a very common way that people go about doing architecture. But it's not just a software architecture technique, it's also a management technique, meaning that the people that are managing the systems and the teams, they say this is not only a front end system and a front end code base, we have a front end team. And we now have a project that the front end team is doing to achieve certain things for the users of that system. And then in the back end, we have a back end code base and we have back end teams.

00:19:50 Udi Dahan

So essentially we're connecting both of the team structure and the project management structure, and the code structure, and the deployment structure. And essentially, if you can think of them as sort of pairs of shoes, we're tying all of their shoe laces together, right? So it's like four different people whose shoelaces are tied together. So if you're familiar with kind of those, what they call three legged races that the kids do at birthday parties, where they kind of take two kids, and then they tie their legs together and say, okay, now race. That's really hard to do, because each person has lost a degree of freedom. And now there needs to be much higher level of coordination.

00:20:36 Udi Dahan

Now, imagine that it's not just two people, now it's four people, right? And the level of coordination, the, okay, everybody, we're now moving left foot, no, no, not your left, my left. So everything kind of slows down, and that's the coupling that we're feeling. So the answer to, how do I solve this problem, is, don't tie those shoelaces together, don't create that kind of coupling in the first place. But the thing is, that's not always our decision as software designers and architects. We're sort of part of this larger environment where the team structure has already been set, along with the responsibility that this team is responsible for that code base, and that code base turns into that deployed system. A lot of these things are sort of thrust upon them, and then it's, great, now race. And then the complaint is you're not delivering software fast enough, you're not running fast enough with all of these shoelaces tied together. I know what the problem is, you're not being agile enough, we need to have daily stand up meetings, and we throw in a whole bunch of other ceremonies without realizing what the root causes are, why these things are problematic.

00:21:52 Udi Dahan

So the core solution is, don't couple your logical and physical architecture. The way that the code is structured versus how it's deployed doesn't have to be, and most of the time shouldn't be, one to one. So you can have components that are developed in one code base that are hosted, that are actually running, in another system. So this is a very old practice, that's called SDKs, right? Nowadays, NuGet on the .NET platform, the JARs and WARs on the Java platform, Node and what it's done with all of its NPM packages, they've essentially created units of packaged code that you can then take into another runtime process, and host it and run it there.

00:22:46 Udi Dahan

There's no HTTP call between your code and an NPM package that you're using, right? You just call it directly, same thing with third party NuGet packages, etc. So we don't think about those things as architecture per se, but they are tools in our architectural tool belt of saying, well, should this team really be creating a separate system that we call over HTTP, or pub/sub event driven architecture, and then caching and all those things? Or would it be better for them to take some of their logic and package it up as an SDK, which we can then host and call directly?
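As a rough illustration of the packaging option described above (not code from the session), the sketch below shows logic owned by one team being called in-process by another team's host. All names here (`PricingComponent`, `Quote`, `render_checkout_total`, the 10% tax rate) are hypothetical, and in practice the first half would ship as a separate package rather than live in the same file.

```python
from dataclasses import dataclass
from decimal import Decimal


# --- Code that a hypothetical "pricing" team would ship as a package ---

@dataclass(frozen=True)
class Quote:
    subtotal: Decimal
    tax: Decimal

    @property
    def total(self) -> Decimal:
        return self.subtotal + self.tax


class PricingComponent:
    """Logic owned by one team, hosted inside another team's process."""

    def __init__(self, tax_rate: Decimal) -> None:
        self._tax_rate = tax_rate

    def quote(self, subtotal: Decimal) -> Quote:
        # Round the tax to whole cents.
        tax = (subtotal * self._tax_rate).quantize(Decimal("0.01"))
        return Quote(subtotal=subtotal, tax=tax)


# --- Code in the hosting process (for example, the front end) ---

def render_checkout_total(pricing: PricingComponent, subtotal: Decimal) -> str:
    # A plain in-process call: no HTTP request, no cache, no event subscription.
    quote = pricing.quote(subtotal)
    return f"Total: {quote.total}"


if __name__ == "__main__":
    pricing = PricingComponent(tax_rate=Decimal("0.10"))
    print(render_checkout_total(pricing, Decimal("20.00")))
```

The host depends only on the package's public surface, while the owning team keeps control of the logic and releases new versions of the package instead of running a separate endpoint.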

00:23:26 Udi Dahan

So there's a lot more options available than the ones that we usually look at. And different cases call for different tools. Now, the thing is that sometimes that makes people uncomfortable, because there's a certain simplicity in saying, well, we're using HTTP calls between all of our microservices. But the price of that architectural simplicity is so much pain and difficulty in so many other places that, in my experience, it's not worth it. You need to go to a richer set of tools and practices to arrive at an overall lower level of complexity for the broad set of problems that you have. And that's as much time as I can take on this question right now.

And I think that makes a lot of sense. In the React world, I've seen that with front end components from payment gateways, actually: they're publishing front end components for you to use, directly integrated into your app, which is the same concept, it's composing UI from various services. So that's good. There's one particular question that's coming up quite a few times, I think three or four around the same subject, on event payload and sharing data using events. I'll read out one question on that. It's by Nie Raj, but then there's a couple more people asking questions around the same topic. While publishing message events from one microservice to another service, what should we include in the message payload, only the domain model ID or the whole domain object? And he's given a particular example: should an order completed event pass an order ID, or should it pass the whole order object so that the consumer doesn't need to fetch the order again? And there's a couple more people, Justin and Jim, who've asked the same, or on the same topic.

00:25:38 Udi Dahan

Right. So that's essentially a follow on question to the last one, in so much as it comes back to, we've created a certain division in our architecture, let's call them components A, B, C, D, and there's a certain set of data, attributes one, two, three, four, all of these things. And the question is, how do we get this set of attributes from point A to point B, and to C, and D, etc, when there's a large amount of processing. So if we use the example that you mentioned, the order entity, there's the front end where the person is buying stuff, then you've got payment related things, then you've got shipping and fulfillment related things, you've got any number of downstream systems, that may be called microservices, that all kind of need the same payload of data.

00:26:42 Udi Dahan

So if you think of that, you've kind of sliced things up, and you've got this data flow between things. And that's a common architectural approach, that you take things and divide them up, things closest to the user, things that are closest to the warehouse, and create that sort of decomposition. And then it's back to that question of saying, well, how do I get this data between them? Should I have a shared database, and then just essentially pass an ID between them, and then everybody looks up the data that they need from the shared database? One of the downsides people point to is, well, that's hidden coupling, right? Because I'm not making explicit what the data is that's actually being shared between these units of code.

00:27:31 Udi Dahan

So the contract between them is challenging. And if one team wants to change the data structure, let's say for compliance reasons, we now need to collect date of birth. So they go into that shared database structure and say, "Okay, great, we've added another field into that table, it's date of birth, and now that's not nullable," then that can blow up a whole lot of other systems, right? Systems that haven't been changed. So it's, shared database is bad, let's not do that. And then it's, "Oh, okay, great, so we're going to now make it explicit and pass the data around." Again, it could be a DTO over HTTP, or it could be a DTO transmitted over some sort of messaging infrastructure, whether that's full duplex request response, or whether it's pub sub.
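The date-of-birth scenario above can be reproduced in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the `orders` table, its columns, and the function names are assumptions made up for the illustration, not anything from the systems being discussed.

```python
import sqlite3


def create_orders_table(conn: sqlite3.Connection, require_date_of_birth: bool) -> None:
    """Create the shared table, optionally with the new compliance column."""
    columns = "order_id TEXT PRIMARY KEY, customer_name TEXT NOT NULL"
    if require_date_of_birth:
        # The change one team makes for compliance reasons.
        columns += ", date_of_birth TEXT NOT NULL"
    conn.execute(f"CREATE TABLE orders ({columns})")


def shipping_writes_order(conn: sqlite3.Connection, order_id: str, name: str) -> bool:
    """Unchanged code owned by another team that writes to the shared table."""
    try:
        conn.execute(
            "INSERT INTO orders (order_id, customer_name) VALUES (?, ?)",
            (order_id, name),
        )
        return True
    except sqlite3.IntegrityError:
        # NOT NULL constraint on a column this code has never heard of.
        return False


if __name__ == "__main__":
    before = sqlite3.connect(":memory:")
    create_orders_table(before, require_date_of_birth=False)
    print("before schema change:", shipping_writes_order(before, "o-1", "Ada"))

    after = sqlite3.connect(":memory:")
    create_orders_table(after, require_date_of_birth=True)
    print("after schema change:", shipping_writes_order(after, "o-1", "Ada"))
```

The second insert fails even though the writing code was never touched, which is the hidden coupling being described: the contract between the systems lives implicitly in the table definition.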

00:28:22 Udi Dahan

But the better way to look at that is to say, the pain that you're feeling is, again, an incorrect logical subdivision, that instead of dividing things up a certain way, and having sort of a common set of data flow all the way through all of them, is to try subdividing it a different way. So for example, to use this order example, say things like the payment information, we could say, there's a certain set of data in the larger payload that deals with payment data. Sometimes I'll see developers, they'll represent that as a value object in DDD type terms, it's not truly an entity, because in many cases we're actually, we don't want to persist it, because you've got the PCI DSS types of things that say, well, we don't actually want to touch that data, because that creates a technical legal liability for us.

00:29:30 Udi Dahan

So treating it as a value object, something that sort of passes through but doesn't ever get persisted with an ID, seems like a good fit. But then later on, when the business says, oh, actually, we want to allow people to store the credit card information, then we need to put an identifier around it and start persisting it, and it gets more complex. But if you look at that set of data and say, well, there's a front end piece that needs to collect that information, there's a back end piece that needs to go and talk to a payment gateway to actually charge the money and deal with all of those things, there's customer service, our support people, who have some kind of system that they use to talk to customers that are complaining about things, like, there's an issue with my credit card, switch to another credit card, so they're other users that also need to touch this information. You have a number of systems that are all touching this information.

00:30:27 Udi Dahan

A better way, instead of duplicating this and pub subbing that, is essentially saying, all of those things are one logical service, right? So instead of saying we've got all of these things that are distinct, we take a subset of that and carve out a chunk of each one and say, they're all part of the same thing. For lack of a better term, let's call that a payment service. And that payment service has a front end UI piece that the end customers are using for putting in their payment information. And it has a back end piece that is dealing with integrating with the payment gateway. And it has another UI piece for customer support people to go in and see that information and update those things.

00:31:14 Udi Dahan

And essentially, all of those bits of code are part of the same logical service, and can have a single database where they're talking to. And then we can say the same thing about shipping information. So the customer's full address type of information, where does this thing need to get shipped to. And again, a lot of times in DDD terms, people would view that as a kind of value object that then gets passed between all of these systems, better way to think about it is you have a cross cutting, for lack of a better term, shipping service, that it has a front end piece here and a back end piece there and a warehouse piece there and a customer sort of support personnel piece over there. And all of those pieces are part of the same logical service. And because they're part of the same logical service, there's no real problem with putting that data in a shared database, because they're logically coupled to each other because they're part of the same logical service.

00:32:14 Udi Dahan

And when you take that step, you ask, okay, so why exactly are we organizing our teams around the physical deployments rather than these logical cross cutting services? And then once we do that, we say, well, we don't need to share the payment information, that value object; that doesn't need to be known by any of the code in the shipping service. And the data structure of the shipping address, and all those things, none of that needs to be known in the payment service. So essentially, those bits don't need to share any data. And then we can simplify that down to identifiers that get shared and composed between them. And that leads us farther down a composite type of architectural approach, where the front end is a composite. Essentially, what that does is it starts changing our thinking about when a user is going to start filling in some information: they're not just filling in one pile of information, but essentially they're filling in information that's going into multiple logically independent services, where that data needs to be correlated in some way.

00:33:33 Udi Dahan

So the technique that makes that easiest is to allocate some sort of main correlation identifier, in this example we may call that an order ID, at the beginning of the checkout process rather than at the end. The way that most developers think is, well, I collect up a whole bunch of information, I might put that in session state or somewhere, and then when the user has finished the process, I submit that to the back end, then something goes and talks to a database, and only then do we start creating identifiers for things. Now, when you do that, you're essentially giving up a bunch of degrees of freedom. Instead, you say, you know what? What if we created the order ID at the beginning of the process, and that was a globally unique identifier, or universally unique identifier, GUID, UUID, whatever? We're essentially saying, "Hey, the user clicked Checkout, they started the checkout process, let's generate that globally unique identifier," and say, the order ID is going to be this for everything that happens next.

00:34:46 Udi Dahan

And then when we're collecting up the payment information, we're saying, "Hey, this is going to be the order ID of whatever you do," and then that payment service can store and correlate the payment information with that known upfront order ID. And then when it comes to the shipping screens, we pass that known upfront order ID and say, "Hey, this is the order ID that's going to be coming down, you can store your information indexed by that order ID," and then on to the next service and the next service, so that, essentially, we're only sharing the order ID. And each of those services is holding on to the order ID as a mechanism of correlating things, so that at the end of the process, each service knows how all of those things get stitched together.
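To make the upfront order ID idea concrete, here is a minimal sketch under stated assumptions: the two service classes, their in-memory dictionaries, and the function names are all hypothetical stand-ins for logically independent services with their own storage. The only thing the checkout flow hands to each service is the order ID generated at the start.

```python
import uuid
from typing import Dict, Optional


class PaymentService:
    """Hypothetical payment service: owns payment data, keyed by order ID."""

    def __init__(self) -> None:
        self._card_token_by_order: Dict[uuid.UUID, str] = {}

    def store_payment_details(self, order_id: uuid.UUID, card_token: str) -> None:
        self._card_token_by_order[order_id] = card_token

    def card_token_for(self, order_id: uuid.UUID) -> Optional[str]:
        return self._card_token_by_order.get(order_id)


class ShippingService:
    """Hypothetical shipping service: owns address data, keyed by order ID."""

    def __init__(self) -> None:
        self._address_by_order: Dict[uuid.UUID, str] = {}

    def store_address(self, order_id: uuid.UUID, address: str) -> None:
        self._address_by_order[order_id] = address

    def address_for(self, order_id: uuid.UUID) -> Optional[str]:
        return self._address_by_order.get(order_id)


def start_checkout() -> uuid.UUID:
    # The order ID is allocated when the user clicks Checkout,
    # not when the finished order is finally written to a database.
    return uuid.uuid4()


if __name__ == "__main__":
    payments = PaymentService()
    shipping = ShippingService()

    order_id = start_checkout()

    # Each screen passes only the order ID plus that service's own data.
    payments.store_payment_details(order_id, card_token="tok_example")
    shipping.store_address(order_id, address="1 Example Street")

    # At the end, each service can look up its part by the shared ID;
    # neither has seen the other's data structure.
    print(payments.card_token_for(order_id), "|", shipping.address_for(order_id))
```

Because neither class references the other's fields, a change to the address structure touches only the shipping code, and an event announcing that the order was placed would need to carry nothing beyond the order ID.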

00:35:41 Udi Dahan

So I know that this brings up a whole bunch of other questions, and as I mentioned before, I can't cram five days of training into just an hour. But that's the different architectural approach to sidestep the problem of, should I do a thin event or a fat event? Because in both of those cases, I'm sharing a whole lot of attributes between these different code bases, whether it's via a shared database structure, or explicitly via some kind of DTO structure. And the answer is neither: architect your system differently so that you only end up having to share IDs, and that is really the only thing that's shared. So that's the mechanism to get to real logical decoupling. So, hope that helps.

Okay, I think you've covered all of those questions, and service boundaries are always a bit of a challenging thing to identify and get right. Okay, there's one particular question that's being upvoted in the channel, which I'll ask. It's from Rahul: microservices are turning out to be distributed monoliths, do you see this happening more and more? If so, where do you think we have fundamentally gone wrong?

00:37:13 Udi Dahan

Do I see this more and more? So I'd say, not proportionally speaking, but absolutely speaking, yes. Meaning that the industry as a whole has continued to grow, all right? So the stats are something in the form of, we've sort of doubled the number of developers in IT roughly every five years or so. So yes, it is to be expected that, having more developers working on more projects, more systems and more companies out there, you will get just more of what was there before in absolute terms. In relative terms, I'd say probably also yes, because, let's call it the proportion of people that know how to design systems well, and the rate at which people learn how to design systems better, is quite a bit slower than the rate of growth of the industry as a whole.

00:38:18 Udi Dahan

So maybe that's, I'd say, one to 2% per year, hopefully, in terms of sort of the level of development and growth around, so yes, relatively speaking, we're also seeing more distributed monoliths being created. But I'd say more generally speaking, this is not a new phenomenon. So, before micro services came onto the scene, web services was kind of the previously used moniker for these things. And they were largely speaking the same thing, it's a set of code deployed to a separate process that is talking to other bits of code deployed in their own processes. And essentially, you end up with coupling, because again, it's the physical decomposition rather than appropriate logical decomposition. So essentially, we've been doing distributed monoliths pretty much since the first wire was laid between two computers.

00:39:33 Udi Dahan

But I'd say that, probably over time, it's continued to get worse and worse, primarily because the tools for writing code in that type of fashion have gotten better and better. So if we go back to the early days of distributed computing, and again, that's a relative term, the old timers will say, no, it happened before that, but just to use terminology that might still be heard of by some of the younger folks. In the early days, in the Microsoft world, we had this thing called COM, which was the Component Object Model, and its distributed variant called DCOM, right? Now, DCOM was just a pain in the ass technology to get working, it was not a trivial matter to have two processes talk to each other over DCOM. In the non-Microsoft world, we had this other thing called CORBA, the Common Object Request Broker Architecture, which was another technology to allow different pieces of code on different machines to talk to each other.

00:40:46 Udi Dahan

And because those things were relatively early technologies, they were not very developer friendly. So people who were building systems at the time using those technologies, because it was so painful to have pieces of code talk to each other over the wire, developers would say, do we have to make that a remote call? Can't we organize the code in some other way, so we don't have to use these god awful remote technology tools? And that was a good segue, because it essentially caused people to think about their problem in more ways, and to not do remote calls left, right and center. Over the years, the technology enabling remote calls got gradually better and better and better and better and got to the point that it's pretty much as easy to do a remote call as it is to do a local call.

00:41:46 Udi Dahan

So because of that ease and simplicity, people are not thinking twice about doing a remote call, let's say, so we're going to have that thing be a micro service over there, sure, no problem. So we do another remote call, and we do another remote call, and we do another remote call. And then we chain those remote calls to other remote calls. And so in that sense, things have gotten worse, because the tooling has gotten better. So the tooling has made it simpler and easier and faster to do the architecturally wrong thing. And that's just created more and more architecturally wrong things that are happening. Again, both on an absolute basis, as well as on a relative basis.
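To put rough numbers on why chaining remote calls hurts (a back-of-the-envelope sketch, not from the session): the 99.9% per-call availability and 20 ms per-call latency are assumed figures, and failures are assumed to be independent, which real systems only approximate.

```python
def chain_availability(per_call_availability: float, hops: int) -> float:
    """Chance that every call in a synchronous chain succeeds,
    assuming each call fails independently of the others."""
    return per_call_availability ** hops


def chain_latency_ms(per_call_latency_ms: float, hops: int) -> float:
    """Total latency when each call waits for the next one to return."""
    return per_call_latency_ms * hops


# Assumed figures: 99.9% availability and 20 ms per remote call.
for hops in (1, 5, 10):
    availability = chain_availability(0.999, hops)
    latency = chain_latency_ms(20, hops)
    print(f"{hops} hops: availability={availability:.4f}, latency={latency} ms")
```

Every extra hop multiplies in another chance of failure and adds another wait, even though each individual call was just as easy to write as a local one.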

00:42:42 Udi Dahan

And I for my part, try to both sort of on the architectural teaching side of things, to say, here's the better way of doing things, here's why it's better, here's the pain that you're going to have from doing that, what appears initially the easy path. However, I don't have a great sales pitch. In other words, my sales pitch is, I'm going to need you to slow down and figure out what the requirements really are supposed to be because they're not what you think they are. And that's not a coding exercise that you like and enjoy doing and are really good at, there's this other technique that you're going to have to learn and get good at. So that's going to be slow and unpleasant for you at first. And that after you do that this architectural technique of dividing things up, it's going to be more complex, you're going to have more moving parts and more packages than what you're used to.

00:43:39 Udi Dahan

So kind of the initial complexity that you're going to deal with when going down this path is higher than what you're currently familiar with. So it's a terrible sales pitch. In many ways, it's kind of like the, I want to lose weight, well, you're going to have to start eating better and you're going to have to start exercising, and that's going to be a long path, and it's going to require discipline, like, well, I don't want any of that, I want the pill that you can give me that will tomorrow make me lose weight, feel great and look good.

00:44:16 Udi Dahan

And that's essentially the sort of the equivalent, it's the, I don't have anything to sell you, but sort of the long but sustainable path that is going to require sort of greater degree of discipline. But I can say that, as an industry, the other path has been done by pretty much every single generation of the industry. And it's proven itself to fail at kind of even the medium scales of complexity and load. So for the types of systems that we're building today, there almost isn't an alternative, because the simpler techniques, the distributed monolith techniques, that they just fall over that much faster in the face of the world that we're trying to address with those systems.

Makes perfect sense. I think this is another good follow up on that from David Cameron, distributed systems have become a lot more common over the past few years with some of the platforms becoming a bit more sophisticated. What problems do you see now that aren't being solved well, also, what common mistakes do you often see teams making as they start to build distributed applications?

00:45:49 Udi Dahan

So in terms of platforms, I assume that probably the best characterization of these richer platforms are what we're seeing from the cloud vendors, whether it's AWS or Azure, you're getting a much richer set of services that are available across all of them. So in the early days, pre cloud, you have, let's say, in terms of messaging technologies, there were a certain number of them, but you could say, the top, maybe a dozen across all of the platforms. And now you're at a point where each cloud vendor has roughly a dozen plus themselves. So AWS has SQS and SNS. And what else is there? There's another event, not event bridge, that's Azure is one. In any case there are just dozens across all of them. Yes, thank you, AWS Kinesis, and Azure Event Hubs and Azure Event Bridge and Event Grid. So, in terms of the richness, there's a lot more out there to choose from.

00:47:12 Udi Dahan

Now, that's kind of created a certain paradox of choice, where people kind of look and say, well, too many choices, before if I was in the Microsoft environment and I had to use a messaging technology, there was MSMQ, and that was it. If I was in the Java environment, then over there, they had their JMS, the Java Messaging Specification. And, well, there were a number of technologies that fit the Java Messaging Specification, it was, this is the API you use, and here are another number of vendors that fulfill that API, but essentially, your code is written one way.

00:47:55 Udi Dahan

So if you're in Java, this is how you do messaging, distributed pub/sub type things. If you're in Microsoft, this is how you do that. There's one way and that was it, so we got a whole lot more choice, and a whole lot more, well, if your events are like this and you have to handle that sort of load, and you don't want to process them one at a time, and instead you're processing them in batch, then this is a better technology for you to use, unless other characteristics and then you've got this other thing.

00:48:26 Udi Dahan

So there's a lot more that is out there, and that has, through sort of the mechanisms of developers just kind of saying, "oh, new shiny toy. I wonder what I could do with this," causes more of those things to be used in more systems. At the same time, yes, there is a greater necessity for building more complex distributed systems just based on sort of the business challenges that we're facing. But the sort of the downside of that beyond sort of the complexity is that, sort of the surface area that developers have to deal with has grown dramatically, because each of those services has a different SDK. So the way that you do a thing in service A is different from how you do the same thing in service B.

00:49:20 Udi Dahan

Even if you're on Microsoft Azure, and using C#, I'm not talking about the, well, I'm using the PHP library for this thing versus the Ruby library for that there. But even in the same programming language with these differences, that all I want to do is publish an event, it's like, "Well, there are different ways of doing that." So it's gotten a lot more complex. And I think that has not ultimately served us well as an industry. Because essentially, it's pulled developers down into a level of detail that is not conducive to them actually solving their business problems well.
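One common way to keep that SDK surface area out of business code is a narrow publishing interface with one adapter per transport. The sketch below is only an illustration under that assumption: `EventPublisher` and both in-memory adapters are hypothetical stand-ins, and no real cloud SDK calls are shown.

```python
from typing import Protocol


class EventPublisher(Protocol):
    """The only publishing surface the business code is allowed to see."""

    def publish(self, event_type: str, payload: dict) -> None: ...


class InMemoryQueuePublisher:
    """Hypothetical stand-in for a queue-style transport."""

    def __init__(self) -> None:
        self.sent: list = []

    def publish(self, event_type: str, payload: dict) -> None:
        self.sent.append((event_type, payload))


class InMemoryTopicPublisher:
    """Hypothetical stand-in for a transport with one topic per event type."""

    def __init__(self) -> None:
        self.topics: dict = {}

    def publish(self, event_type: str, payload: dict) -> None:
        self.topics.setdefault(event_type, []).append(payload)


def place_order(publisher: EventPublisher, order_id: str) -> None:
    # Business code publishes an event the same way on every transport.
    publisher.publish("OrderPlaced", {"order_id": order_id})


queue_publisher = InMemoryQueuePublisher()
topic_publisher = InMemoryTopicPublisher()
place_order(queue_publisher, "order-1")
place_order(topic_publisher, "order-1")
```

The trade-off raised later in the talk still applies: someone has to own, document, and maintain such a layer for as long as the system lives.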

00:50:06 Udi Dahan

I think that, well, historically, what I've tried to do within NServiceBus is to say, there's a certain number of technologies out there, that certain subsets of them can be combined in certain ways to create an architecturally cohesive and viable platform to use the initial word from the question, but if you try to combine this thing with that thing, that's going to be quite difficult for you to do and achieve all of the things that you want. So some of the problems that have occurred as a result of that is, for example, the element of transactions, where transactions used to be a much more widely implemented core building block of almost all infrastructure technology, that you could essentially push a button, say transactions on, and that would just sort of magically flow and make sure that your system is consistent.

00:51:18 Udi Dahan

And all of the different pieces of things would enlist automatically into these distributed transactions, let's say transactions depending on your platform. Where we've moved to as an industry is that we've said, well, transactions make things slow, or essentially, consistency has a cost, but we want our systems to be fast. So let's turn transactions off, because that makes things so much faster. And then we took another step further, and this initiated with I think, probably MongoDB is probably one of the first ones, that said, first of all, transactions are not just one thing, right? You've got the ACID acronym, Atomic, Consistent, Isolated, Durable. So, MongoDB says, well, we're going to give you durability mostly, but we're not going to give you the other things, you just figure that stuff out. And we're going to call that eventual consistency. And now it's really fast. It's web scale.

00:52:16 Udi Dahan

So if you want to be web scale, you should be using this technology. And developers, we have a certain affinity for benchmarks and things that go fast. That's, well, SQL Server, Postgres, whatever, that can only do X 1000 transactions per second. MongoDB can do 50,000 transactions per second, even though they're not the same transactions, I want my system to be fast, I will choose the system that has the larger number of transactions, messages, requests per second thing, because that will make my system fast. And then I'll be able to scale and everybody will be happy.

00:53:00 Udi Dahan

Unfortunately, as an industry, we've discovered that, yes, when you take the car and you remove the bumpers, and the airbags, and all of the features that provide safety, remove the brakes from the car, you can make a car that goes really fast, it's not a very safe car to drive in, you're going to crash a lot, and you might lose an arm and a leg along the way, but it'll be fast.

00:53:30 Udi Dahan

And essentially, that's what happened initially with MongoDB as people kind of started running their systems, and they started losing data and getting inconsistent data. And then kind of realizing, oh, that thing that was called eventual consistency, that, that's the thing that I have to do, that the infrastructure doesn't guarantee that I will get eventual consistency, what it's saying is we've removed all of the things that guarantee you consistency, now it's your job to figure out this really complex thing to get it done correctly, both between your business code and all the other pieces of infrastructure that you're using around MongoDB, or any other type of technology.

00:54:15 Udi Dahan

So now, not to say that MongoDB is bad, and Postgres is good or whatever. But to be aware that with these extra options that we've gotten, we're now in a position, say, if you want to use those things, that close to the metal, you now need to know a hell of a lot more about the technology than the previous generation did. And you're going to need to know a hell of a lot more about the business to find out which bits can live with which levels of consistency and isolation and atomicity. And a lot of those things are connected with dividing up your service boundaries better of saying, is all of this thing, this order entity, should that be one entity, or can that be subdivided to N separate entities? Is that one big transaction, or is that N smaller transactions.

00:55:12 Udi Dahan

But eventually, you're going to get to the point that for anything you're going to sort of come to the smallest unit of data that the business says, that thing needs to be consistent, right? You're never going to get away with no consistency, no isolation, no atomicity, there's always going to be some unit. And that's when you need, whether it's your messaging system, your database technology, your integration methods, all of those things need to be coordinated in a certain manner in order to provide the necessary level of atomicity, consistency, isolation and durability for those micro units to use that same term of microservices.
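One well-known way to coordinate a database write with an outgoing message for such a unit is the outbox pattern. It is not named in the talk, so treat the following as an assumed illustration: the table names and the simulated crash are hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory store with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)")
    conn.commit()
    return conn


def place_order(conn: sqlite3.Connection, order_id: str, fail_before_commit: bool = False) -> None:
    """Write the order and its outgoing message in one local transaction."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "placed"))
        message = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        conn.execute("INSERT INTO outbox (body) VALUES (?)", (message,))
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")


def counts(conn: sqlite3.Connection) -> tuple:
    """Return (number of orders, number of pending outbox messages)."""
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    outbox = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return orders, outbox


conn = open_store()
place_order(conn, "order-1")
try:
    place_order(conn, "order-2", fail_before_commit=True)
except RuntimeError:
    pass  # the failed attempt leaves neither an order row nor a message behind
```

A separate dispatcher, not shown here, would read the outbox and hand messages to the broker. The point is only that the order row and its message either both exist or both do not.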

00:56:02 Udi Dahan

Now, with all of these services, what has ended up happening is, people have essentially been sort of growing their own infrastructure by cobbling these things together, running into problems, creating layers of stuff amongst those different technologies, then telling the other developers on their team, no, no, don't use that thing directly, use our layer on top of that. Downside being that the developer that created that thing, two years later leaves, and the other developers are left having to maintain it without really knowing very much about it, it's usually not very well documented, and there might be some bugs in it that that person didn't get around to fixing. And then you're kind of in trouble.

00:56:57 Udi Dahan

So I think part of it is appreciating that connecting all of these technologies together in ways that will provide the needed guarantees for your software is a non trivial task, and that those things will change, because each of these services, they continue to evolve. So, the fact that if a feature in a given service worked a certain way as of a given version, doesn't mean it'll work the same way in the next version, right?

00:57:31 Udi Dahan

So that's one of the things that's kind of part of my sales pitch, to say, that's essentially what I've been doing for the past decade plus with NServiceBus, is saying, I know you all have been building these kinds of layers and abstractions and tools, and whatever, around all of these technologies, and you found out that in the beginning everything looks fine, but then two to three years down the road, you realize you've kind of created a little bit of a Frankenstein, and then you're like, "Well, okay, I'll move on to another company and I'll do it differently over there," and the poor developers that are left behind, they're kind of stuck.

00:58:19 Udi Dahan

So hopefully that, via whether it's NServiceBus or just these kind of talks, this is a small contribution that I can make, say, those are harder problems than you think. I and my team have been spending a decade plus figuring out and solving those problems across all of these tech stacks in ways that are tested and documented and backwards compatible and scalable and all of the things that you want. Yes, full disclosure, there is a cost to that, there's a cost for us to build it, and to continue maintaining this. The companies that have been using it so far seem to have been getting value from it, say, at least give it a try, because you don't want to be, again, in sort of that MongoDB situation, finding out that you've lost a leg along the way, and then figuring out how to get that back in the middle of an e-commerce Black Friday type of event. Those types of mistakes in production systems are hugely expensive compared to whatever cost of infrastructure you'd be putting on the other side.

Cool, thanks for that. I'm glad we can all have Ferraris with the guardrails, of course with the safety. Moving on, another specific question, I think this is about a talk that you might have given a few years ago. This is by Neil, specifically with regards to building a rules engine that runs the rules that are injected into the rules engine from other services and assemblies: packaging these assemblies to be available to the rules engine is usually a deployment concern, however versioning of these assemblies can become a challenge, and current DevOps tools such as Azure DevOps or Octopus Deploy don't exactly make this easier. Is there an easier way to address the deployment concern?

01:00:28 Udi Dahan

So for those of you that this topic of rules engine, this is the first time you're hearing about it, I have a presentation on YouTube about micro services and rules engines that I've given, look that up and find out more information about that idea. Broadly speaking, not only about rules engine, but the architectural pattern known as engines was actually quite widely practiced for some period of time, before micro services came onto the scene. The idea of an engine is that you have some sort of technical infrastructure thing that allows you to put, to host inside it a whole bunch of other units of things that all sort of get run together. So before we started calling them search services, they were known as search engines. So that's an instance of the engine pattern. Before we started using the term service left, right, and center, rules engines were actually a very well known and broadly used technology category for building a complex business software, with the idea of essentially saying, we want to have rules being written by different people that all kind of get pulled together and run in one place.

01:02:00 Udi Dahan

So that's sort of broadly speaking, the idea of rules engines for those of you that haven't heard of it before. Now, the deployment problem that you're mentioning is better characterized, as was mentioned in the question, as a versioning problem, right? So deployment problems, there are certain categories of deployment problems that say, sort of the tooling is somewhat immature, so it can be made smoother, put those to one side, most of the problems around the deployment side are the versioning problems of the code that's going through them. And the majority of the versioning problems are due to coupling between pieces of code. So if I have two totally independent components, two totally independent packages that don't call each other directly, then I update the version of one, that has no impact on the other. I can deploy version A of component one and version B of component two, and there's no problem. And I can switch to version C of this. So it's only when there's coupling between things, only when things talk to each other that we essentially create versioning problems, because the data that gets shared and the way they communicate with each other breaks between versions.

01:03:37 Udi Dahan

So that comes back to everything that we've been mentioning earlier on this call, is to say, try to minimize the amount of communication between things, try to minimize the amount of data that gets shared between those things, and use all of the architectural techniques to get to that kind of place. Now where are we left with in terms of let's call it the necessary logical coupling in a type of rules engine. So what you'll have is the coupling essentially between the rules engine itself and the components that are using it. So the rules engine will have some kind of API SDK type of thing, that those rules will be essentially making use of. So the interaction between how do I go about versioning the rules engine itself, the infrastructure of the rules engine, and how does that affect the components that have been written against it?

01:04:37 Udi Dahan

So for that reason and whether we're talking about rules engines, or service buses, or database APIs, when you're getting into infrastructure technology, you'll see that there's a high degree of conservatism around how those APIs evolve over time. So things don't just disappear from one version to the next. That's, well, we're going to be deprecating that, it's still there, but you should start migrating your code to use the new APIs, it'll be a warning in this version, it'll turn into an error in that version, but it's going to be an error and not an actual exception that blows your code up. And very gradually, this infrastructure changes and moves over time. So that the code that is written against those APIs and SDKs does not run into the versioning problems, which then does not create the deployment problems that are referred to in the question.
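The "warning in this version" stage of that gradual deprecation can be sketched roughly as follows. `RulesEngineApi`, `add_rule`, and `register_rule` are hypothetical names, and since Python has no compile step the warning only surfaces at run time, unlike the compiler diagnostics a .NET API would produce.

```python
import warnings


class RulesEngineApi:
    """Hypothetical infrastructure API, used only for illustration."""

    def register_rule(self, name: str, rule) -> str:
        """The new entry point that callers should migrate to."""
        return f"registered {name}"

    def add_rule(self, name: str, rule) -> str:
        """The old entry point: still works in this version, but warns."""
        warnings.warn(
            "add_rule is deprecated; use register_rule instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.register_rule(name, rule)


# Existing callers keep working while the warning tells them to migrate.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = RulesEngineApi().add_rule("discount", lambda: None)
```

A team that wants to find its remaining usages early can promote the warning to a failure in its own test runs with `warnings.simplefilter("error", DeprecationWarning)`, well before the old method is actually removed.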

01:05:43 Udi Dahan

Now, the more specific the domain that you're doing for your rules engine, the wiser it is to actually have some of those domain concepts percolate through the API of the rules engine itself. So for example, instead of saying, I'm going to use a generic rules engine, and then I'm going to try to build something on top of that. So for example, somebody mentioned insurance, I need to price an insurance policy or something like that, a whole bunch of rules that influence how an insurance policy gets priced. So I could use a generic rules engine, and essentially have all of the domain complexity remain in the components themselves, but when I do that, I may end up in a place where I kind of have to start sharing data and creating coupling between the rules themselves. So you start having data sort of passed between them, and that creates a certain amount of coupling.

01:06:57 Udi Dahan

And architecturally speaking, I don't want to encourage that, because then it's very easy to move from the necessary data, like the order ID that I mentioned before, as being passed along, then you start adding, well, we're going to have a running total alongside the order ID, and everything will just do a plus equal or multiply by, add 10% on top of that, whatever. So I want to be cautious about allowing data passed between the different rules that are running in my rules engine. And the way to do that is essentially to create a more domain specific, let's call it a pricing engine, right? Say, I'm going to take maybe the core rules engine, and I'm going to wrap that with a slightly more domain specific pricing engine that is going to pull in the necessary domain concepts of what constitutes pricing and is truly at the base, and does not depend on the rules that are running in it, versus leaving the things that are rule specific in the rules themselves.
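A rough sketch of such a domain-specific pricing engine, with assumed details throughout: the `Adjustment` type, the percentage model, and the two sample rules are hypothetical. The engine owns the pricing concept and the running total, and each rule receives only the policy ID and never sees another rule's output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Adjustment:
    """Domain concept owned by the engine: a rule's effect on the price."""
    rule: str
    percent: float  # e.g. 10.0 means +10% of the base premium


# A rule receives only the policy ID; it looks up its own data elsewhere.
Rule = Callable[[str], Adjustment]


class PricingEngine:
    def __init__(self, base_premium: float) -> None:
        self._base = base_premium
        self._rules: list[Rule] = []

    def register(self, rule: Rule) -> None:
        self._rules.append(rule)

    def price(self, policy_id: str) -> float:
        # The engine, not the rules, combines the results, so no running
        # total is ever passed from one rule to the next.
        adjustments = [rule(policy_id) for rule in self._rules]
        total_percent = sum(a.percent for a in adjustments)
        return self._base * (1 + total_percent / 100)


def young_driver_rule(policy_id: str) -> Adjustment:
    return Adjustment(rule="young_driver", percent=10.0)


def no_claims_rule(policy_id: str) -> Adjustment:
    return Adjustment(rule="no_claims", percent=-5.0)


engine = PricingEngine(base_premium=1000.0)
engine.register(young_driver_rule)
engine.register(no_claims_rule)
price = engine.price("policy-42")
```

Because every rule returns an independent adjustment, the order in which rules are registered does not change the result, which is part of what keeps the rules decoupled from each other.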

01:08:16 Udi Dahan

So whether I'm doing a pricing engine, or customer credit rating, type of thing, or a risk engine, or a fraud scoring engine or any type of thing that deals with a certain domain, I'll usually pull in some domain concepts into the engine. So it's not just a purely generic rules engine, there's some part in there that says, oh, okay, there's this thing over here, but that will often mean that the rules need to be structured now in a more limited fashion. So that sort of comes back to me in my interaction with my business stakeholders and say, in order to write the rule, the way that you want to write it will create a great deal of complexity in the software that's going to cause a great deal of pain for us down the road. Is there another way that you can achieve your business objectives by structuring your rules using only these primitives?

01:09:26 Udi Dahan

That's back to that initial conversation that I mentioned at the beginning of the requirements analysis. It's not just accepting every requirement as is and saying, "How do I handle that thing," there are certain requirements that should not be handled or should be structured differently. And a lot of times in my interactions with domain stakeholders, they say, "Well, I don't know, but this is the easiest way for me to do it." I understand that's the easiest way for you to do it, but what you've done is you've created, let's call it a certain level of complexity on your end, but that creates a massively higher complexity on the other end, which creates an overall poor solution. If we take your complexity and ramp that up by 50%, that can decrease my complexity down by 50%.

01:10:23 Udi Dahan

But the thing is, your complexity was, let's call that a 10 before, my complexity was 1000. So your complexity going from a 10 to a 15, allows my complexity to go down from 1000 to 500, the overall quality of the solution is quite a bit better just by trading 50% here for 50% there, because it's not the same overall mass of stuff that we're doing. And that's part of our role as architects, is to find ways to say, Where should we be moving? Which complexity? And what kind of leverage does that give us? And sometimes that means coming back to the stakeholders and say, we're not going to implement the thing exactly as you want it, and here are the reasons, let's find another way of doing that.

01:11:17 Udi Dahan

So again, for more information about sort of the technical, how to go and build these types of different examples, rules engines, see that YouTube video, it'll give you more examples like that. But it's worthwhile learning more about rules engines in general for people that are doing architecture. And we do one last question then we wrap up?

Yeah, sure. I think there's one question in the channel, again, being upvoted: do you use event storming or domain storytelling? What aspects do you like or dislike of it?

01:11:55 Udi Dahan

So I think that one of the things that I liked very much that I believe event storming made quite a bit more popular is the collaborative part of it. It's the, let's take domain folks, and architecture type folks, and put them in the same room and have them talk and analyze the problem together, using various types of techniques, I think is a huge step forward from where we were as an industry before, where to somewhat grossly characterize it using sort of the scrum of a, well, we get user stories, but there isn't very much conversation as to how do those user stories get written?

01:12:54 Udi Dahan

And that essentially, once they're written, they're largely kind of, I wouldn't say set in stone, but there's very little flexibility perceived around them. And I think that what event storming did as a practice is it said, "That thing there, that's the problem." We need more flexibility around these user stories, the way that we achieve that is by moving upstream of the user story itself, and having our folks and the people that are using the software and defining what they want and the business objectives to collaborate much more closely together on what problems are we solving? Why, what kind of outcomes do we expect to achieve from that?

01:13:40 Udi Dahan

So I think that bit is the hugely important piece that I hope the people that are learning and applying those techniques don't lose sight of, because kind of like micro services, people got hung up on the word micro, right? And said, we need everything to be small because it's micro services. And I think that is a potential risk that I've seen manifest a number of times when people are doing event storming, is they get hung up on the term event, and start saying everything needs to be events, that's the important thing. No, it's not. The important thing is the collaboration, the flexibility, the understanding, and communicating about different ways of achieving these business objectives.

01:14:33 Udi Dahan

So I think those things are hugely important, and unfortunately, like any type of methodology, as more and more people get into it, it sort of gets watered down. And some of the initial spirit sometimes gets lost in that exercise. But to kind of bring that around, say, if you're doing event storming, domain storytelling, whatever, and you don't have any business stakeholders in the room, you don't have any users in the room, and it's just essentially software people that are doing it, then you've missed the plot, right? You don't have the right people in the room in order to make that technique effective.

01:15:21 Udi Dahan

Now, I hear from people, again, whether, I was talking about requirements analysis before, whether we're talking about event storming now, but it's really hard to get those business stakeholders into the room, I know, where did you get the impression that doing things well was going to be easy. So it sort of falls back to that same trap. Again, we all want to be thin and buff and highly energetic and feel good and all those things. But the sort of the daily slog of doing all of the necessary practices to achieve that, but that's really hard to stay disciplined over the long term. It's like, it's kind of true in all walks of life, as well as in software. So spend maybe just a couple minutes and say, what things can you try to get those people into the room with you.

01:16:33 Udi Dahan

So, depending on your role, simplest thing to do is knock on the door of whoever that person is, user, stakeholder, whatever, to say, "We're about to build the feature that you were talking about, I just want to tell you, five minutes, how I understood it, and how that's going to be beneficial to you, and tell me if I understood you correctly," my guess is nine times out of 10, they will say, "No, that's not it." And you'll get some feedback from that. So you might not get them into the room at the time that you want, but go into their rooms, steal five to 10 whatever minutes of their time, however much you can get. The technique is, just explain the thing as you understand it. And they will usually correct you, and you say, "Wow, really good that you told me that, we now have a problem, because we've made all sorts of architectural decisions based on the assumptions that I just said, that we're now going to have to rework. So we've got other features that we're not yet starting to work on, could we set a meeting to go through them because, otherwise, we're going to just end up doing tons and tons of really expensive rework."

01:18:04 Udi Dahan

And hopefully, that will get them into the room with you that first time after you've explained why not having them in the room before is creating delays and things being more expensive. So that's sort of your first tip that you can, next thing tomorrow, immediately start doing to start those conversations happening.

01:18:34 Udi Dahan

Another thing is, don't just talk to one of them, right? Do that with one, and then do that with somebody else. And say, "It's interesting because so and so that I talked to earlier said something that does not quite align with what you're telling me now." So they're like, "Really? I thought we were totally aligned in what we're doing." "So well, at least not the way that I understood, right? You're telling me x, he told me this other thing. And that's why we interpreted it in this way. And that's why we made these architectural decisions. With what you're telling me now, as a result of our conversation, and what I heard from him, we should have the meeting with the three of us before we start building these features, because otherwise, we're going to build something that neither of you are happy with."

01:19:30 Udi Dahan

But have that conversation with more than one stakeholder, and that will make it visible that there is a lack of alignment and a misunderstanding between them. Make that visible and known to them and say, "We should have a proper meeting. It shouldn't just be that I barged into your room for five minutes, and now we've been talking for 15 minutes. I want to be respectful of your time, but we've surfaced a problem here that really should be addressed. So let's schedule a meeting with the three of us, or the two of you and the rest of the team, to clear this thing up. Because we were planning to start that work in the next sprint, and now we realize that it's not well enough understood for us to get it into the next sprint. Otherwise, we would have built the wrong thing. So we've got some time, let's have that conversation, and then we'll put that into the sprint after."

01:20:26 Udi Dahan

Sometimes people hear "requirements analysis" and that sounds so heavy, but it's just people talking to each other, right? Part of it is getting that communication and collaboration: here's what I heard you say, did I understand that correctly? And a lot of the time, as software people, we don't appreciate enough the importance of those communication skills and collaboration techniques. And I'd say that is something that we, individually and as an industry, need to spend more time talking about, getting better at, and incorporating into our processes in order to get to much better places.

01:21:16 Udi Dahan

And after doing those things for a while, my guess is, those business stakeholders and users that you're talking to will start to realize, "We want to be in that room with you when you're doing your requirements analysis and architecture discussions. We see why it's valuable for us to be there, or the expense that would have been incurred if we weren't there." So if you do that for a while, then they'll start to show up. And then it will be easier to do your event storming and domain storytelling processes, and have all of the necessary people in the room together, right? But it's really about starting small, with these kinds of guerrilla techniques, doing some things under the radar to get things moving in the right way.

To really get the culture in the company moving toward collaborating. I couldn't agree more. I mean, having the developers or the architects closer to the customer, closer to the problem, and the kind of conversations that brings about, it's really, really positive and uncovers a lot of things early on. Cool. I think that was it. That was the last question for tonight. Thank you. Thanks a lot Udi. Over to you, Mike.

01:22:47 Michael

May I say, thanks a lot Udi. I know I definitely got a lot of information, and I can only imagine how much the people listening from a technical point of view got out of it. So thanks very much for giving your time. I really appreciate that. And I also know you mentioned to me that, if there were a few questions people wanted to dig deeper on or find out more about, they're more than welcome to message Udi directly on LinkedIn, and he will get back to you or point you in the right direction. And also, thanks to Amir and Neil from Critic Clear, who did a great job of asking the questions and curating the questions, and also to everyone that listened in. Obviously, I appreciate your time, and I hope you all got value from it. And I guess from my point of view, thanks for coming, and I hope to see you around soon.

01:23:39 Udi Dahan

And one last thing I'm going to tell you is, we're going to be opening up some of the course for free again in the next, I think day or so. So if that's something that you want, please do message me, I'll also pass the invite over to Mike and Amir so that they can transmit that to everybody who signed up for this. So you can get more depth on the topics that we've discussed. And please do reach out to me, whether it's over LinkedIn or Twitter, all these other places, whether it's to me or the folks at Particular Software. This is what we do. We spend all day thinking about and working with clients on these things. So whatever we can do to help you all out, please do feel free to get in touch. So thank you all very much for your time, and for the great questions and conversation. I really appreciate it.

01:24:39 Michael

Awesome. Thanks Udi. Thanks, guys.
