Top 5 techniques for building the worst microservice system ever
About this video
This session was presented at NDC London 2024
Microservices come with promises of scalability, reliability, and autonomy. But if everything is so rosy, how come the only success stories we hear about are at places like Netflix or Uber? I’ve spent countless hours working on all kinds of microservice systems to come up with the definitive top 5 tips to ensure your microservices become complete disasters. Join me on a tour of insanity through some of the worst ways to make distributed mistakes.
🔗 Transcription
- 00:01 William Brander
- We're going to be talking about the top five techniques to build the worst microservice system ever. I really like this image as a representation of microservices. Because you've got this gentleman on a scooter, it's a bit washed out on the projector, but he's going really fast and he's not hampered by any bloat like protective headgear or anything. And the center of gravity is really high and the pivot point is really low, and all it takes is one bump in the ground and permanent brain damage, just like microservices. So who am I and why am I qualified to talk to you about microservices? My name is William Brander. I'm from South Africa. I know the South African accent can be a little bit difficult to understand at times, so I will try and enunciate a little bit clearer than I usually do, and I will try to speak a little bit slower than I usually do. Hopefully that helps.
- 00:49 William Brander
- As you can see from the mug, I'm the world's okayest programmer. I'm also incredibly passionate about building bad systems. I hope some of that passion is going to come through and energize the room so that we can end the day on a high note. There are three things that I dislike about this mug. The first is that the A in programmer is tilted ever so slightly to the side, just enough to bother my OCD. The second is that the font is dangerously close to Comic Sans without actually being Comic Sans, because that would be sort of cool in an ironic way. But this isn't, it's just like a weird knock-off. And then the third is that the mug is unfortunately unverified. It may surprise you to learn that there was not, in fact, a worldwide census done on developer ability with me landing square in the middle.
- 01:37 William Brander
- So I went about looking to see if there's other ways that I could prove that I was an okay programmer, and I spent some time looking through my company's Slack. And my co-workers definitely don't think I'm the best programmer either, but at least one of them doesn't think I'm the worst. So I've got that going for me. But what does being an okay programmer have to do with writing bad systems? Being an okay programmer means that I've probably written some systems that I'm proud of and some that I'm a little bit ashamed of, but I'm sure everyone has written some bad systems. Has anyone not written a bad system before? If anyone's hiring, look around for hands now. We've all written some bad systems. So what do I bring to the table that can elevate me and can really make it that I'm the guy to teach you how to write bad systems?
- 02:21 William Brander
- Well, that's got to do with who I work for. I work for a company called Particular Software. We make NServiceBus. Has anybody heard of Particular? Well, we've got a booth outside, so if you haven't heard of us, we've done a really bad job. Anybody heard of NServiceBus? I hope everyone... Anybody using NServiceBus? Thanks, that's more salary for me. So for those of you who don't know, NServiceBus is a library and framework for building distributed systems. So whether you want to build microservices or SOA, or just partition your system, NServiceBus can help you do that. But that's not what this talk is about.
- 02:56 William Brander
- So what does working for Particular have to do with making bad systems? Well, at Particular, if you work on the engineering team, you also do support. There's no distinction between engineering and support, which means that if you work on NServiceBus, you support NServiceBus and you support the customers that use NServiceBus. And that's honestly one of the more fun things that I get to do is support because I get to see a lot of different systems applied in a lot of different ways at a lot of different scales, and I get to see how people are using NServiceBus to build these distributed systems.
- 03:29 William Brander
- But here's the thing about support. If one were to assume that system quality follows a normal distribution where you've got some really poor quality systems on the one side and a few really high quality systems on the other side, with most systems having reasonable quality in the middle, that doesn't really help with support because the people that have got the high quality systems aren't coming to support for help. And the people that have got the medium quality systems aren't coming either because they can probably work out their problems themselves. So the types of systems that I see when I do support are the ones down here, and I see a lot of these. In fact, as far as I'm concerned, based on my history with support, it means that the distribution of system quality does not follow normal distribution like this. As far as I'm concerned, all systems are bad and as an industry, we should be ashamed of ourselves for what we're doing to our poor customers.
- 04:17 William Brander
- So microservices, right? Why do we do microservices? The consulting answer for why we do microservices is we want to decrease coupling, so we can have independent deployability, we can scale things up and down without bothering the rest of the system and blah, blah, blah, blah. Nobody does microservices for that. We do microservices so that we can increase complexity and decrease performance. We do microservices so that we can come up with the most convoluted way to render angle brackets for a browser. And most importantly, we do microservices because we don't want to be seen to be doing a monolith. Monoliths are bad. If you start a project and say, "Well, I'm going to start with a monolith architecture", you get looks of despair from your colleagues. Every new project is "It's going to be a microservice architecture, right?" Monoliths aren't bad though. They're good.
- 05:09 William Brander
- This picture on the left-hand side is of the Kailash temple in India. It's carved from a single rock making this the largest man-made monolith in the world. The ERP system you've been working on for the last seven years doesn't come close. Here is a different picture of it with a slightly better angle. You can see there's people on the... Probably in the way there, there's people on the side on the left to give you an idea of the scale of this. This is huge. This is a massive undertaking. They carved this from hand. Well, I mean not from hand, they didn't claw the rock away, they had tools. But there weren't electric or pneumatic tools. This must have taken an immense amount of planning, dedication, work, effort. This has been a labor of love for someone and a whole bunch of slaves probably, but it's beautiful.
- 06:01 William Brander
- Monoliths can be beautiful. Don't think you have to do microservices because the alternative is monoliths. There's a spectrum and a whole lot of options in between. As a side note, could you imagine what this would've looked like if they had built this using Scrum? So for the rest of the talk, I'm going to assume that we have a system and we are now trying to transition the system into microservices and make it as bad as possible. And we're going to see a few techniques that we can do with that.
- 06:28 William Brander
- So we've got a monolith. It's a monolith, so it must be bad. So it's a tire fire of a monolith. And we're going to transition this to microservices. And a typical technique, and it's come up a few times at the conference already, is to apply something known as the strangler fig pattern, where you take a bit of functionality in your system, say, the charge credit card method, you isolate that from the rest of your code and then you move that off to a service, and then your main monolith can call the new service instead of having to have that call in process. Yay, we've microserviced.
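As an aside, that strangler fig step fits in a few lines. The talk's systems are .NET, but this sketch uses Python's standard library for brevity; the `charge_credit_card` function, the route, and the payload shape are all invented for illustration. The point is only that an in-process call becomes a network hop.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request as urlrequest

# Before: the monolith calls the function in-process.
def charge_credit_card(amount):
    return {"charged": amount}  # hypothetical business logic

# After "strangling": the same function lives behind an HTTP endpoint
# in a separate service, and the monolith calls it over the network.
class ChargeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps(charge_credit_card(body["amount"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

def start_service():
    server = HTTPServer(("127.0.0.1", 0), ChargeHandler)  # ephemeral port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def monolith_charge_via_service(port, amount):
    # The in-process call is now a network hop: same coupling, plus latency.
    req = urlrequest.Request(
        f"http://127.0.0.1:{port}/charge",
        data=json.dumps({"amount": amount}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())

server = start_service()
print(monolith_charge_via_service(server.server_address[1], 50))  # {'charged': 50}
server.shutdown()
```

Nothing about the coupling changed; the monolith still cannot do its job without this call succeeding.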
- 06:59 William Brander
- What have we actually done though to our system? So if we apply this everywhere, what is the result? Well, let's take a look at the throughput of the system and how the throughput changes as the load on the system changes. We'll start with the monolith. So as the load on the monolith increases, the throughput will increase as well. At the start, it'll increase fairly linearly. And by that, I don't mean one-to-one, but there'll be some K-factor where, as the load increases by one, the throughput will increase by some factor somewhere between zero and one, hopefully not zero. And as the load increases even more, that throughput should increase a little more as well. But the rate of increase will decrease a little bit, because as the system is doing more things concurrently, there's more handing over between threads, there's more context switching, you're waiting for resources, so it'll start to taper off a bit.
- 07:51 William Brander
- And eventually, it's going to get to a point where the throughput is going to drop off completely. And you might ask yourself, "Well, why? What is the behavior that's causing this to happen?" And this has got to do with the way that .NET does memory management. So in .NET, there's three types of garbage collection events: generation 0, generation 1 and generation 2. Technically, there's also generation 3, the large object heap, but for the purposes of this discussion, we can fold that into generation 2. And the way that garbage collection works is that when you try and allocate memory in .NET, if certain criteria are met, .NET goes and says, "Okay, I need to free up some memory. Let me go and do a gen 0 garbage collection event." And a gen 0 garbage collection event is intended to clear up resources that have been allocated and freed fairly quickly.
- 08:34 William Brander
- So var x in a function, var x equals "abc". And then the function finishes, and the garbage collector can reclaim that memory and return it back to the pool. But if the garbage collector tries to free up a bit of memory in gen 0 and it can't, it promotes that to generation 1. So generation 1 is intended to be for longer-lived memory. Perhaps it's a variable on a class, and the class sits around for a bit longer and does a few other things. If a generation 1 garbage collection event runs and it can't clear up memory because something's still referencing that object, it promotes it to generation 2, and that's where the problem happens. So generation 2 in .NET is a stop-the-world event: as a generation 2 garbage collection event starts, all user threads in your app domain are paused, except the thread performing the garbage collection.
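You can poke at this promotion behaviour directly. We can't run .NET from a transcript, but as it happens CPython's cycle collector is also generational with generations 0, 1 and 2, so the following sketch shows the same promotion idea. It is an analogy only: a CPython gen-2 pass is not the stop-the-world event described for .NET.

```python
import gc

class Node:
    def __init__(self):
        self.ref = self  # reference cycle: only the cycle collector can reclaim it

gc.collect()  # start from a clean slate

survivors = [Node() for _ in range(1000)]  # long-lived objects

gc.collect(0)  # gen-0 pass: survivors are still referenced, so they get promoted
gc.collect(1)  # gen-1 pass: promoted again, now resident in the oldest generation

del survivors          # drop the external references; only the cycles remain
young = gc.collect(1)  # a young-generation pass can no longer see them
old = gc.collect(2)    # only a full (gen-2) collection reclaims them
print(young < 1000, old >= 1000)  # True True
```

The longer an object stays referenced across collections, the older the generation it ends up in, and the more expensive the collection that finally frees it.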
- 09:28 William Brander
- So all of your other threads will pause. And many years ago, Stack Overflow experienced this, and Marc Gravell wrote a blog post explaining why this happened and what they were trying to do to work around it. These three big spikes over here are increases in response time, because the gen 2 garbage collection event was pausing all user threads in the application. So the performance at these points was very, very poor. It's gotten a lot better since then. After this came out, we got the concurrent garbage collection configuration, and now that's been replaced with the background garbage collection configuration, which interleaves some of the garbage collection for gen 2 and more proactively gets gen 1 and gen 0 garbage collection done. So gen 2 garbage collection now pauses and then resumes a little bit, and then the garbage collection carries on on a different thread a little bit later. So the pausing is sort of split out. It's still a horrible experience when you're dealing with performance.
- 10:25 William Brander
- But that's our monolith. Our microservice is better now. So what happens when our microservice experiences load? As the load increases, the throughput will increase. The throughput will also increase at a fairly linear rate, but at a lower rate than the monolith, because you've introduced a network hop. Network operations are expensive, which means that these threads that are doing the processing are probably going to hang around a little bit longer. So there's going to be more contention for resources, which means that flattening off is going to come sooner. And because they're hanging around even longer, there's more chance that they're going to be promoted to generation 2, which means that drop-off is going to come even sooner. So by simply distributing our system, we could have potentially made the overall throughput of the system worse.
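The curve being described, roughly linear growth, then a taper, then a retrograde drop-off, matches the shape of the Universal Scalability Law, which the talk doesn't mention but which models it well. All parameter values below are invented purely to reproduce the two shapes: the "microservice" gets a lower per-request rate (the network hop), more contention, and more crosstalk.

```python
def throughput(load, lam, sigma, kappa):
    """Universal Scalability Law: lam is the linear rate, sigma is contention
    (causes the taper), kappa is crosstalk (causes the retrograde drop-off)."""
    return lam * load / (1 + sigma * (load - 1) + kappa * load * (load - 1))

loads = range(1, 201)
monolith = [throughput(n, lam=1.0, sigma=0.02, kappa=0.0005) for n in loads]
microservice = [throughput(n, lam=0.6, sigma=0.05, kappa=0.002) for n in loads]

peak_mono = monolith.index(max(monolith)) + 1
peak_micro = microservice.index(max(microservice)) + 1

# The distributed version flattens off sooner and drops off sooner.
print(peak_micro < peak_mono, max(microservice) < max(monolith))  # True True
```

With these made-up parameters the monolith peaks somewhere in the mid-40s of load and the microservice in the low 20s; the exact numbers don't matter, only that every term the speaker names (lower slope, earlier contention, earlier promotion pain) pulls the peak down and to the left.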
- 11:09 William Brander
- The first rule of distributed computing is don't distribute your computing. But we've gone ahead and done this, and we've potentially made the throughput of our system worse. This isn't always true. For instance, if the charge credit card method was on the hot path and we isolated this and moved this off, we could then scale that hardware up independently. But we're not going to do that. We're just going to microservice the way we learned at the conference. So we're going to do everything on one machine anyway. So the first technique we can use to mess your systems up is to put an HTTP call in front of everything. Who here remembers when .NET 3 came out with WCF? Right, yeah? There were those people going around saying, "Everything should be a WCF service. Put an OperationContract on everything. It's a great idea." That was a very weird time. There was, what, WCF, WPF, Workflow Foundation and CardSpace. And CardSpace just sort of disappeared very early on in the evolution of .NET there.
- 12:08 William Brander
- Okay, so we've taken our monolith and we've strangled and figged it until it's no longer a monolith and now it's distributed. We still have the same coupling between the components. We still have all the same problems we had before. We've now just made the performance lower. So it's still a monolith, it's just distributed. And that means, on top of the problems we had working with our monolith, we've now also introduced network topology coupling. So that's another layer of coupling that we've put in there. So it's not just the tire fire, it's a burning field of tires. This is a good first step, but this is difficult for us to work on. This isn't a pleasant experience.
- 12:49 William Brander
- And we don't want to make our lives worse, we just want to make the system worse. We want to enjoy our job. So we go to business and we say, "I bet you I can rewrite the system in six months." Anybody find themselves thinking that? I've done it many times at Particular. I've even tried a few times, failed miserably. The thing with big-bang rewrites is that they don't work. Little asterisk in the air: they work in two specific scenarios, which I'll get to later. But why don't they work? Why do big-bang rewrites of large complex systems fail? And that's got to do with how software works for these large systems. So as you're developing the system, we'll have another graph here. On the Y axis, we've got the functionality that's available in the system, the amount of features that we've got. And on the X axis, we've got the time that we've spent developing the system.
- 13:38 William Brander
- And at the start of the system, as you start developing, it's fairly easy to add new functionality because you don't have components competing with each other. You're still getting to know all the business rules that you have to work through. It's a nice time to... This is the Greenfields area. We all love Greenfields projects, right? As we carry on working though, the rate at which we can add new features decreases, because we have to add features where we didn't really design the architecture to match that feature set at the beginning. So it's a little bit slower to add the features. We have to change a few things and propagate them, and there's competing business rules that we have to worry about, but that's fine. And then eventually, it gets to a point where adding any bit of functionality takes a lot of time and effort. Potentially even just changing the name of a field on a form could be a full sprint of work sometimes.
- 14:27 William Brander
- So let's assume that at this point here, this is where we went to business after we've done our distributed monolith and we said, "Please, can we have six months to do a rewrite?" After we've done the rewrite, everything will be wonderful. The UI will be modern, the performance will be better, and we'll be able to add features before you even know what to ask for. It'll be so good. And business says, "Well, okay, six months. I'll give you six months to do this." And you are ecstatic, because now you get to do new work again and it's great. So now instead of working on the old system for those six months, we work on the new system. So we are not adding new functionality over this period. So this little section here where we would've added a little bit of value to our users no longer exists; that disappears. And instead, we start working on the rewrite of our system.
- 15:12 William Brander
- And at the start of our system in the rewrite, it's really quick to add features, probably even quicker than when the first system started, because now we've got better ideas. We've been thinking about how we would improve it for a few months before we did the rewrite. We know more about the business domain. Things are looking great here. But it's not going to take six months. How do I know it's not going to take six months? Well, that's because I'm a developer, and developers are really, really bad at estimating. In fact, instead of doing coding assessments when you try to hire someone, they should do estimation assessments. And anyone who's really bad at estimating must be an amazing developer. Anyone who's really good at estimating is probably a project manager and shouldn't be developing. I, myself, am the world's okayest estimator.
- 16:01 William Brander
- But even if you could estimate correctly, you're still not going to take six months. Even if your estimate that the system is going to take six months is 100% accurate, it's still not going to be correct. Because developers are also a bit like that dog from Up where you get distracted by a squirrel and you run off. The systems that we work on... it's very, very unlikely that any large system that we work on will be 100% documented. And even if it is, there's even less chance that we'll have all of that knowledge sitting in our head while we're trying to thumb suck an estimate.
- 16:35 William Brander
- It's going to be very difficult to get those guesses right. So we carry on working, we do our rewrite, six months comes and goes, and it's now nine months, and business comes to us and says, "Listen, we can't go on like this. We need new features. We've got a compliance deadline we have to meet, GDPR 2.0 or something. The return." So you start panicking and your team starts working weekends, working nights. You order pizzas every evening for the team. You've got a masseuse coming in to massage the team to relieve stress and tension, and you're just driving morale into the ground. And eventually, you're still not going to be done, and business is going to come to you at some point and say, "We can't do this; you've got to cut this rewrite."
- 17:20 William Brander
- And you panic a bit. You panic because now you've put a lot of effort into the new system. You love it. You're emotionally attached to this new design that you've got. So you compromise and you say, "Okay, business, I'll tell you what. Got an idea. We've still got the old system that's running in production. So we'll have that and then we'll also deploy the parts of the functionality that we've currently written or that we've rewritten, and then the users can use the new system for the stuff we've rewritten. And then for the ones that we haven't done yet, they can use the old system. But now we need to keep them up to date. So okay, no problem. We'll put an integration layer between the two and then everything will be fine, except we know that the integration layer is going to suck because the original system sucked. And the longer that this integration layer sits in production that we have to maintain and fix, the more chance there is of some of the bad elements from the old system bleeding over into the new system, because there's going to be time pressure.
- 18:13 William Brander
- You're going to say, "Oh, I need to get this field into a report. I'm just going to copy it in the same way that it is in the old system." It happens. But anyway, we're going to do this. So we're going to have three systems that we have to maintain, but at least now we can add new functionality for our users so we can meet those compliance deadlines or whatever. Once we've added that functionality, we can go back and carry on with the migration, right? I'm sure business will give us time then. They will, right? Please, someone? They won't. Or they might give you a little bit, but you'll have to scramble for it, and that original system is going to live much longer than you ever thought possible. It's a bit like trying to build an airplane while you're flying it. I asked Stable Diffusion to generate an image of an airplane flying while you build it. And I've got to say, this is also a really good description of microservices. It's terrifying.
- 19:09 William Brander
- After we've done our distribution of everything, the second thing that we can do to make our system worse is to attempt to do a big-bang rewrite. I said there were two cases where a big-bang rewrite works. The first is if the system is actually small enough to be rewritten in a reasonable timeframe. This is tricky though, because developers suck at estimating, so how do you know that it's small enough? Maybe it is, maybe it isn't. I don't know. It's unlikely. You also kind of have to take into consideration that working on an old system that you want to rewrite is unpleasant. It's emotional. So it's hard for you to be objective about how long it will take to actually do a rewrite.
- 19:51 William Brander
- So the first time it works is if the system is small. The second time that it will work is for your CV. So you start the rewrite and you do the rewrite in the most amazing technology possible, and then a month after you've started working on it, you go for interviews and leave. It works for you, not for the project, but that's fine. So this is what we've got now. We've got three systems we're maintaining, and the longer that these systems are in production, the more of the suckiness we've got from the first system bleeding over into our new system as well. Let's make it worse. We don't want to be malicious about it though. So remember this situation where we had the drop-off in performance because of long-lived calls waiting for responses? We don't want to do that in the new system. We want to accidentally make bad systems, not purposefully make them.
- 20:41 William Brander
- So instead of service A calling the credit card service, because it's going to wait around for a response and the same problem's going to happen, what we'll do is we'll introduce a queue. And then service A can put a message on the queue to, say, charge this person's credit card. And once that message is on the queue, service A can end its call and terminate, and then we're done. That means that service A is less likely to live long enough to get promoted to gen 2, so we've solved that problem and everything's fine. We then go to our credit card service, which takes a message off the queue, calls SWIFT or whatever payment provider we want to use, and everything's fine. Okay, cool, cool, cool. Except now our message gets to the front of the queue and this message says, "Charge William's credit card 50 pounds." Cool.
- 21:27 William Brander
- As we do that call though, something goes wrong. The network fails somewhere along the line, SWIFT fails for whatever reason. Something happens. But we haven't actually charged William's credit card yet, and the message is gone. Well, that's no good. We've lost money for our customers. That's a financial operation that we don't want to lose for them. So we think, okay, what we'll do is when we take the message off the queue, we'll process it in a loop. And in that loop, we'll put a try catch. So if something goes wrong, we'll just keep processing it and that'll be fine. And then we call SWIFT, and then something goes wrong, and maybe something's wrong on SWIFT's side and they have to reboot some service internally, whatever it is. And they come to us and say, "You are DDoSing us, guys. Whenever something goes wrong, you're just calling us nonstop. You need to back off."
- 22:17 William Brander
- So you go back and you introduce some sort of timer mechanism and you say, "Well, okay, this message, each message can only be processed five times every minute. Yes, perfect." And then that goes into production and SWIFT is at least happy with you there. But then you get someone like William who puts his credit card details in incorrectly on purpose, because he doesn't want to pay 50 pounds. And then the message will never actually get processed because that's an invalid credit card number. Or maybe there's a bug in the code upstream that generated that message and that failed. But now that message is sitting at the front of the queue and everything behind it isn't getting processed. And you're like, "Well, okay, I know what we'll do. We'll introduce a dead letter queue, and then we'll take that message, move it to the dead letter queue. Don't worry. We'll also make software to manage the dead letter queue."
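The whole pipeline this story builds up, take a message off the queue, retry on failure, cap the retries, then park poison messages on a dead letter queue, fits in a few lines. This is a hypothetical Python sketch (the talk's systems are .NET), with an invented `charge` handler and retry cap; libraries like Polly or NServiceBus implement these recoverability patterns properly, with backoff timers and tooling around the dead letter queue.

```python
import queue

MAX_ATTEMPTS = 5  # the "stop DDoSing us" cap on retries (invented value)

def process(work, dead_letter, handler):
    """Drain the queue; retry each message up to MAX_ATTEMPTS times,
    then park it on the dead letter queue so it stops blocking the rest."""
    while not work.empty():
        msg = work.get()
        try:
            handler(msg)
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letter.put(msg)  # poison message: park it, keep moving
            else:
                work.put(msg)         # transient failure: try again later

def charge(msg):
    # Hypothetical stand-in for the SWIFT call.
    if msg["card"] == "invalid":
        raise ValueError("invalid card number")

work, dead_letter = queue.Queue(), queue.Queue()
work.put({"card": "4111-1111-1111-1111", "amount": 50})
work.put({"card": "invalid", "amount": 50})  # the deliberately bad card
process(work, dead_letter, charge)
print(work.qsize(), dead_letter.qsize())  # 0 1
```

The good message is processed, the poison message ends up dead-lettered after five attempts, and everything queued behind it keeps flowing.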
- 23:01 William Brander
- And this is a really cool bit of architectural kit. This is fun to work with. And you are living for this framework that you're making, but it is a framework that you're making. And in this case, this type of problem is solved by Polly, NServiceBus, MassTransit, Rebus, BrighterCommand. There's many options in the .NET space that will handle this problem for you. But instead of doing that, you are getting a lot of joy out of working on this framework. You might even open source it one day. That moment when you get your first star on your open source GitHub project... it's, ah, mwah.
- 23:33 William Brander
- So the next technique we're going to use is we're going to make sure that we suffer heavily from not invented here syndrome. And every framework that we come across, we say, no, we don't want to use that because it doesn't a hundred percent meet our requirements. We are going to make our own one. And that's how we ended up with so many JavaScript front-end frameworks a few years ago. But we're going to keep doing this. Because honestly, when you've got a system like this, some of the only joy you can get is working on cool engineering problems, and you're not going to get that from business problems usually.
- 24:06 William Brander
- So we've got our monolith, we've got our new system that we're busy maintaining. We've got our integration layer between the two, and then we've got this really cool framework that we're working on. At least one, try for many, I highly recommend it. And at this point, we're basically not delivering any business functionality at all anyway. We're having a lot of fun on our custom stuff, but not really doing much for the business. So they're going to come to us and say, "Listen, we need the search screen that you've been promising us for months. Please, can you give us a search screen?" And you think, "Fine, we will do it." We've got a product service. And maybe they're an e-commerce site. They're Amazon... No. They're rainforest.com.
- 24:49 William Brander
- So they're rainforest.com and you can search for products or you can purchase products off of rainforest.com, and you've designed your product service to match these requirements. So you've got, for instance, the name of a product. You've got the price of the product, there's images for it, whether it's in stock somewhere, a description, the rating of the product, fairly standard producty things. We've all seen these types of things before. And you come up with a class and a data structure that sort of represents this, which looks a bit like this. All those fields, except it's got an ID. Yay, computer science. And based off of this, you can go and implement the customer's search screen that they wanted. So you can, in your product service, have a little search API: pass in a text string, and we'll search the name and the description and spit back all of the products that match.
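A minimal sketch of that product class and its search API, in Python rather than C# for brevity. The field names follow the slide as described; the IDs and catalog contents are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    # The fields named on the slide, plus the obligatory ID.
    id: int
    name: str
    price: float
    description: str
    rating: float = 0.0
    in_stock: bool = True
    images: list = field(default_factory=list)

def search(products, text):
    """The product service's search API: match on name or description."""
    text = text.lower()
    return [p for p in products
            if text in p.name.lower() or text in p.description.lower()]

catalog = [
    Product(1, "Coffee beans", 8.99, "Single-origin arabica"),
    Product(2, "Okayest Programmer mug", 12.50, "Holds coffee, questionable font"),
]
print([p.id for p in search(catalog, "coffee")])  # [1, 2]
```

So far this is tidy: the product service only touches product data. The trouble starts with the next request.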
- 25:38 William Brander
- And you pat yourself on the back and you go back to working on your framework, now that the business is satisfied. But then they come to you and they say, "I've got an idea." The correct response, when business comes to you and says, "I've got an idea", is for you to excuse yourself from the room, say you want to get a coffee or something quickly, leave the building, get in your car and go home and pretend it never happened. But you don't do that. So business says, "I've got an idea. We have a theory that, especially on consumable items, if someone has ordered a product before, there's a higher likelihood that they'll order the same product again. So can we maybe have a little field that shows, 'Hey, you've ordered this before' on the search screen?" And you pause what you're doing on your cool framework and you swing over in your chair, and you kind of just look incredulously at this person who has asked you the most obscene question ever, because the order status is not in the product service. That's in the order service. That's a very different thing.
- 26:33 William Brander
- But you want to get back to working on the cool stuff as quickly as possible. So you make them promise that this is the only time they're going to do this; you even pinky-promise with them. And they agree, pinky promise, and promise to name their firstborn child after you as well. So you say, "Okay, fine." So in the product service, once we get the request, we'll also then call off into the order service, just this once, and we'll go back to doing the cool stuff again. I mean, this isn't pretty. You've made your search screen slow, because now your product service has to first query its database, then query another service, then put all this stuff together. But it's just once, it's fine, right? Then business comes to you and says, "Hey, I've got an idea."
- 27:16 William Brander
- Remember the correct response? Leave the room, leave the building, get in the car, go home. Denial, denial, denial. "We want to reward our gold customers and give them a 10% discount. So special pricing per customer, please." And I mean, at this point, you may as well give up, because they're going to come and ask for every single service in your system anyway. So the customer status doesn't sit in the order service, it doesn't sit in the product service, it sits in the customer service. So the system starts chaining all of these calls together. And eventually, we're going to end up with dedicated teams that manage our API gateways. Has anybody worked in an environment where there's a dedicated DataPower team and only they can make changes to your API gateway? It happens. These API gateways can solve these problems for you, but they can become another... Not can. They are another layer of coupling that you introduce into your system, a new set of technology you have to worry about.
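The chain the search screen ends up making, product service queries its own data, then the order service, then the customer service, looks like this in miniature. Every name, business rule and data value here is hypothetical, and each stand-in function would be a network hop in the real system.

```python
PRODUCTS = [{"id": 1, "name": "Coffee beans", "price": 10.0}]

def order_service_has_ordered(customer_id, product_id):
    # Hypothetical order history, living in a different service.
    return (customer_id, product_id) == (7, 1)

def customer_service_tier(customer_id):
    # Hypothetical customer data, living in yet another service.
    return "gold" if customer_id == 7 else "standard"

def search_screen(text, customer_id):
    """One 'simple' search now fans out: the product database, then the
    order service, then the customer service, one hop per bolted-on feature."""
    results = []
    for p in PRODUCTS:
        if text.lower() in p["name"].lower():
            row = dict(p)
            row["ordered_before"] = order_service_has_ordered(customer_id, p["id"])
            if customer_service_tier(customer_id) == "gold":
                row["price"] = round(row["price"] * 0.9, 2)  # the 10% gold discount
            results.append(row)
    return results

print(search_screen("coffee", 7))
# [{'id': 1, 'name': 'Coffee beans', 'price': 9.0, 'ordered_before': True}]
```

Each new "idea" from business added a synchronous dependency, so the product service is now coupled to every other service's availability and latency just to render one screen.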
- 28:14 William Brander
- Has anyone seen this video? If you haven't, watch it. It's called Microservices, on YouTube, by a channel called KRAZAM. It is the best description of this type of thing ever. The man's face is like a mirror image of me in business meetings sometimes. Why did we end up like this though? I mean, we've got a product service. It's nice, it's concrete, it's very neatly encapsulated. We've got an order service, a customer service. Look at these classes. These are beautiful. Why is this not working for us? And the reason is that people don't interact with products on your service. They don't productize things. They don't producty. They search for things. They perform operations in your system. They don't "product" things. Even if your user is busy updating your product catalog, they're still not producting, they are updating a product in your catalog.
- 29:10 William Brander
- And the idea that a product service, a thing service, is going to neatly encapsulate everything and not bleed over into other boundaries is a bit weird. When you do a search on a screen, it's like a puzzle that you build: you take pieces from different places and put them together to make the whole picture. Although I've actually just realized the puzzle is a really bad metaphor for this, because a puzzle always has the same pieces in the same places. But think of something like that. You're taking information from different places and putting it together to build one experience for someone. The thing is, people interact with your system by doing things, not by objecting.
- 29:49 William Brander
- So the next technique that we can use is, when we define our service boundaries, we can make sure that they are designed based off of nouns and not verbs. Anybody working in insurance? This is a big one that I see very often in the insurance space. You don't have to put your hand up if you are doing this, but if you've got a claim service or a policy service, usually that's a bit of a red flag and something to look at. I suppose claims might work sometimes, because you do submit a claim, but people don't policy. They do a quotation, they go through the order process. Underwriting is a process; they don't policy.
- 30:30 William Brander
- So we've got our three systems: our monolith, our integration layer, our new system. We've got a cool framework that we're working on as well. And we've made sure that when we defined our service boundaries, they're all based on nouns, so that we can maximize the chance of having cross-service communication between everything. Building another great monolith on that side. Let's see if we can make this even worse by trying to make it better. So we'll go back to that search screen. We don't want this, right? This is bad. Let's see if we can improve it. What we'll do is we'll introduce a new service, a search service. To search is a verb. We're doing something. Yay, we're learning. The search service will be a dedicated service only responsible for searching. So it won't have a product class, it'll have a search product class, which has a couple of new fields on it.
- 31:18 William Brander
- And if we need to change the business rules in the search service, we can update that. Sounds reasonable. How do we get the data into the search service? Well, that's easy. We'll pub/sub it, right? So what we'll do is we'll tell the product service, "Hey, whenever you update your product list, publish an event for us please. And then I'll maintain a search catalog and I'll update that whenever the product list changes." And whenever the customer status changes, also publish an event and then the search service can update its list. And we can do the same for the order service.
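The read model described above can be sketched like this. Again the talk's implementation is .NET with a real message broker; this Python analogue invents the event names, payload fields, and the gold-means-10% rule purely for illustration:

```python
# Sketch: the search service builds its own denormalized catalog
# from events published by the product, customer, and order services.
class SearchReadModel:
    def __init__(self):
        self.products = {}   # product_id -> name
        self.statuses = {}   # customer_id -> status
        self.history = set() # (customer_id, product_id) pairs

    def handle(self, kind, body):
        # Called by the pub/sub subscription for each incoming event.
        if kind == "ProductUpdated":
            self.products[body["id"]] = body["name"]
        elif kind == "CustomerStatusChanged":
            self.statuses[body["customer"]] = body["status"]
        elif kind == "OrderPlaced":
            self.history.add((body["customer"], body["product"]))

    def search(self, term, customer_id):
        # Note: the "gold customers get 10%" business rule is duplicated
        # here -- it also lives in the customer service. That duplication
        # is exactly the hidden coupling discussed next.
        discount = 10 if self.statuses.get(customer_id) == "gold" else 0
        return [{"id": pid, "name": name,
                 "ordered_before": (customer_id, pid) in self.history,
                 "discount": discount}
                for pid, name in self.products.items()
                if term.lower() in name.lower()]
```

The queries are now local and fast, but every business rule the search screen needs has been copied into the projection.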
- 31:55 William Brander
- This is an awful lot of data duplication. We've got things in different places, but we're smart, we're engineers. We can explain this away. The source of truth, if you use that phrase, it's clever; the source of truth of the products is in the product service. What we've got in the search service is a different model. What can we call it? A read model. That's right. We're doing a projection. This is CQRS. Yay. And we're using pub/sub, so there's events. So it's CQRS and event sourcing and services, domains. We're doing DDD, CQRS, event sourcing. This is a good system design. Go technology.
- 32:38 William Brander
- We still have the same coupling problems here. If the customer service changes the way that it rewards customers, that has to flow through into the search service as well. For instance, say the customer service decides that instead of rewarding discounts based on status, we want to reward the top hundred customers. Our top hundred customers will get 50% off, the next 200 get 10%, and then the rest just get an extra 10% added on, because we want to encourage them to buy their way up into the scheme. We make that change in the customer service. We then have to make that same change in the search service, because now it's rankings, not statuses, that the search service has to worry about. That coupling still exists. We haven't hidden it away.
- 33:22 William Brander
- And the thing to remember here is that even though it looks like it's separate, and it looks like it's decoupled, it isn't. The logical boundaries and the deployment boundaries, or the physical boundaries, are different things, and we've treated them as if they're the same. If we have a search service and it's deployed separately, it must be separate from everything else, it must be decoupled. But it's not. I've always really liked this poster for Episode I, with Anakin standing there and his shadow projected as Darth Vader. I've always thought this was incredibly clever, because we know Anakin is Darth Vader, or is going to become Darth Vader, but in this particular instance, in this moment, he's not. The logical Darth Vader is different from the physical Anakin at this point in time. So what I'm going to do now is I'm going to show you a technique that can make this previous situation a little bit better. I don't want you to go and apply this, because you don't want to make things better.
- 34:21 William Brander
- Also because it's not a golden hammer; it doesn't work in every scenario. So don't just go and say, "William said apply this." It sometimes makes things better; sometimes it's not applicable. But it is a technique you can use to get rid of the distributed replication of data that you're going to build. So what we'll do is we'll replace our search service with a search engine. That sounds more natural, doesn't it? Search service, search engine. And what we're going to do is we're going to implement what's known as the engine pattern. So the engine pattern has some coupling in it, but not logical coupling, not business coupling. What do I mean by that? The search engine will expose a contract that the other services must comply with. So it'll expose, for instance, three interfaces maybe. An IFindProducts interface, which, given a text string, will filter through a list of products and return them.
- 35:15 William Brander
- We'll also have an interface, IPriceProducts: for a particular user and a particular product, what discount percentage should they get? And we'll have a third interface, ITrackOrders, which just returns true or false, whether someone has ordered a product previously or not. The search engine takes these three interfaces, compiles them into a DLL, and hosts it on an internal NuGet feed or something like that. The product service, the customer service and the order service can then download that NuGet package and implement whatever interfaces they want. So what I mean by that is that the product service can implement a product finder class. And the product finder class knows how to query the product service's database. I hope it's clear that the product finder class sits squarely within the logical boundary of the product service. The customer service can do the same with a status pricer, and then the order service can do the same with an order history class.
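The contracts and implementations described above could look something like this. The talk's version is C# interfaces shipped as a NuGet package; this Python analogue uses abstract base classes, and the method signatures, class names, and the gold-means-10% rule are invented for illustration:

```python
from abc import ABC, abstractmethod

# --- The contract the search engine publishes (the "NuGet package"). ---
class IFindProducts(ABC):
    @abstractmethod
    def find(self, text):  # -> list of product dicts
        ...

class IPriceProducts(ABC):
    @abstractmethod
    def discount_for(self, customer_id, product_id):  # -> percent
        ...

class ITrackOrders(ABC):
    @abstractmethod
    def has_ordered(self, customer_id, product_id):  # -> bool
        ...

# --- Implementations, each inside its own service's logical boundary. ---
class ProductFinder(IFindProducts):
    """Lives in the product service: knows its own database."""
    def __init__(self, db):
        self.db = db
    def find(self, text):
        return [p for p in self.db if text.lower() in p["name"].lower()]

class StatusPricer(IPriceProducts):
    """Lives in the customer service: gold customers get 10% off."""
    def __init__(self, statuses):
        self.statuses = statuses
    def discount_for(self, customer_id, product_id):
        return 10 if self.statuses.get(customer_id) == "gold" else 0

class OrderHistory(ITrackOrders):
    """Lives in the order service: knows who ordered what."""
    def __init__(self, orders):
        self.orders = orders
    def has_ordered(self, customer_id, product_id):
        return (customer_id, product_id) in self.orders
```

The point is that the engine only owns the interfaces; every piece of business logic stays with the service that owns it.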
- 36:08 William Brander
- So all of these classes implement the interfaces that were exposed by the search engine, and they sit within the service boundaries that they're implemented in. And this is great. How do we wire this up together though? What we can do is take the product finder class, compile it into its own assembly, and then take that assembly and stick it in the app folder of the search engine. And then when the search engine runs and it needs to do a search, it can ask .NET, "Hey .NET, give me all of the classes that implement IFindProducts." And .NET will go, "Oh, here's one. It's the product finder class. There you go." The search engine can invoke that product finder class, which will then go off and query the product database, because it knows how to talk to its own database.
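The discovery step in the talk is .NET reflection over assemblies dropped into the engine's app folder. A rough Python analogue of that idea is to scan for loaded subclasses of the contract; the names here (and the gold-means-10% rule) are invented, and the contract is redeclared so the sketch is self-contained:

```python
from abc import ABC, abstractmethod

class IPriceProducts(ABC):
    @abstractmethod
    def discount_for(self, customer_id, product_id):
        ...

def discover(contract):
    """Find every implementation of a contract loaded into the engine's
    process -- the stand-in for 'a DLL copied into the app folder'."""
    return [cls() for cls in contract.__subclasses__()]

class SearchEngine:
    def price(self, customer_id, product_id):
        # Ask every deployed pricer and apply the best discount found.
        # If the customer service replaces StatusPricer with a
        # RankingPricer, the engine's code doesn't change at all.
        pricers = discover(IPriceProducts)
        return max((p.discount_for(customer_id, product_id)
                    for p in pricers), default=0)

# A pricer "deployed" into the engine (in .NET, a separate assembly).
class StatusPricer(IPriceProducts):
    GOLD_CUSTOMERS = {"c1"}
    def discount_for(self, customer_id, product_id):
        return 10 if customer_id in self.GOLD_CUSTOMERS else 0
```

Swapping business rules then means deploying a different implementation class, not editing the engine.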
- 36:53 William Brander
- And then we can do the same from the customer service. You can take the customer status pricer, compile that into a DLL, copy-paste that DLL into the search engine folder, and then the search engine can apply that to the products that it's got from the search result. And we can do the same with the order history one; we can track what the person has ordered before. How is this different from the previous one? We're still doing three sets of queries, right? Well, if the customer service changes the way that it does discounts, and it goes away from status and changes over to ranking, the customer service deletes that status pricer and creates a new class, a ranking pricer. It compiles that ranking pricer into a DLL and pastes the DLL into the search engine folder. Now the search engine, when it does a query and it tries to find all of the IPriceProducts instances, doesn't have the status pricer anymore; it gets the ranking pricer.
- 37:50 William Brander
- The changes were constrained logically within the customer service. Physically, we deployed a different assembly into a different runtime, but that's not really a big problem. The interesting thing with this type of pattern is that you can deploy things that seem like they come from outside of a domain alongside the others. So for instance, the product service could implement a daily pricer. Maybe every Monday these products get a 50% discount; that can be put directly alongside the existing pricer. The same with the order service. It can create its own bulk history pricer, where if a customer, over the lifetime of the customer's account, has ordered 99 of these items, on the hundredth item they get a 99% discount. That can also just be deployed alongside.
- 38:40 William Brander
- And all of that sits and resides logically within those individual service boundaries. Physically, we've deployed it somewhere else, but logically, it's separate. I always find it interesting that when people consider this type of idea, it feels weird to them. We're putting these things together, we're coupling them. But this type of distinction between logical and physical is fine when you're talking about scaling out instances. We'll take the same code and run it in two places. That's also a physical difference, but logically, it's still the same thing. So don't do this. In fact, what you need to do is make sure that you conflate your logical and your physical boundaries for everything that you do in your system.
- 39:26 William Brander
- So a quick recap. What we've done is we've put an HTTP call in front of everything. We made sure to attempt a big-bang rewrite. I said attempt, not do. We didn't use any off-the-shelf frameworks; we created a whole bunch of cool frameworks that we get joy out of. We then also made sure that we used nouns instead of verbs when defining our service boundaries, so that we get the maximum chance of having cross-service communication. And in the places where we have cross-service communication, we made sure that we kept the logical and the deployment boundaries the same throughout the entire system.
- 40:00 William Brander
- And if you follow these five techniques, you'll have a great time looking for a new job somewhere, I'm sure. I did say I would get you out of here a little bit early, and I'm pleased to say that I did that. If you have any questions, you can ask them now. Otherwise, you can also find me around NDC. I'll be here tomorrow and maybe a little bit later. I'm always excited to talk about writing bad systems, so if that's a common interest, we can write bad code together. Any questions? No? Okay, great. Thank you, everyone.