Empires fall: Decentralize your code to avoid total collapse
Ruling the world is hard.
Alexander the Great may arguably have come closest to being “Emperor of the World.” In 334 BC, his armies left his home in Macedon (modern-day Greece) and conquered a swath of territory stretching to Egypt and halfway across Asia to northwest India, most of the known world at that point. Still, he ultimately failed to rule all of humanity, dying at the age of 32 under somewhat suspicious circumstances.
Of course, the British Empire circa 1922 was the largest in the history of the world by land area. As it was said, “The sun never sets on the British Empire.” But pesky revolutions proved to be trouble for the British as well.
No one person or government has ever managed to control the entire world all at once, let alone direct the actions of all the individual people that live in it. It just doesn’t scale.
This distributed system we call human society has scaled pretty well to over seven billion people, without any centralized point of control. Each person acts as a loosely coupled actor, focused on solving their own set of problems and coordinating only lightly with other actors.
No complex system is without its problems, and human society is no different. We’ve had the occasional war or famine, but on the whole, our accomplishments as a society have been pretty impressive.
This works for human society, but we could never build a software system this way. Could we?
Control
In our software projects, we crave control. We want to know what our code is doing at all times. We want dominion over our empire. So we create highly structured applications, dividing capabilities into top-down layers so that we can manage the complexity. We make these architectural decisions early in the project, when we don’t have lots of code, and at first, it works quite nicely.
As the system evolves, it gets more and more complex. Larger codebases tend to be more coupled, and after several iterations, what was once maintainable on a small scale becomes more and more unwieldy the larger it gets.
Unfortunately, in large systems we tend to see a kind of “butterfly effect” where any small code change has the capacity to break some other part of the system. Pretty soon, we begin expending more time testing these changes than delivering new features. Eventually a new feature is needed and an architect is forced to make the economics-driven decision that starting from scratch would be cheaper in the long run than adding the feature to the existing system.
So our empire falls, and so begins the dreaded complete system rewrite.
A different approach
How can we avoid our systems turning into this mess? A big part of this is keeping coupling low by organizing our system into separate actors that, like humans in our society, do their own thing and only lightly interact with each other. This allows us to make changes in one area that won’t break another.
This means introducing architectural patterns, like asynchronous messaging and publish/subscribe, much earlier in our project, when things are still simple. Components can work independently, and communicate with each other asynchronously through these messages.
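To make the shape of the pattern concrete, here is a minimal in-process publish/subscribe sketch. The names (`MessageBus`, `"OrderPlaced"`) are illustrative, not a real messaging API; a production system would use a durable message broker rather than in-memory dispatch.

```python
from collections import defaultdict
from typing import Callable

class MessageBus:
    """A toy pub/sub bus: publishers and subscribers only share event names."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher has no knowledge of who is listening, or how many
        # subscribers there are; each handler reacts independently.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = MessageBus()
received = []
bus.subscribe("OrderPlaced", lambda event: received.append(event["order_id"]))
bus.publish("OrderPlaced", {"order_id": 42})
```

The key property is that the publisher's code never changes when a new subscriber is added, which is what keeps the components loosely coupled.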
Let’s consider the simple example of ordering a product from a website. A user clicks a button and a flurry of activity has to occur. We need to check that all products are in stock, recalculate the order total, calculate tax, apply loyalty points, and then finally charge the customer and ship the products. Even for a simple example, this sounds like a fairly complex business process with no opportunity for anything to be asynchronous.
But does it have to be an all or nothing success immediately?
This ends up being a good question to ask when trying to break down a complex process. Could we complete the first few steps of this process, then publish an event saying that the order looks good? Then, the act of charging the customer for their order could happen asynchronously with respect to the original order being placed. Shipping is already naturally asynchronous.
Instead of one monolithic pile of business logic, you can separate your domain models. The sales-centric logic can be separate from the billing logic, which in turn is separate from the shipping logic. Each area has its own domain model and business logic, loosely coupled from the others by published events.
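A sketch of that separation, assuming a simple in-process bus and hypothetical event and handler names: billing and shipping know nothing about sales or about each other, and react only to the published `OrderAccepted` event.

```python
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    for handler in subscribers[event_type]:
        handler(payload)

charges, shipments = [], []

# The billing domain reacts to the event by charging the customer...
subscribe("OrderAccepted", lambda order: charges.append(order["total"]))
# ...while the shipping domain independently prepares the shipment.
subscribe("OrderAccepted", lambda order: shipments.append(order["id"]))

# Sales validates the order, publishes the event, and is done.
publish("OrderAccepted", {"id": 1001, "total": 59.90})
```

Changing how shipping prepares a shipment touches only the shipping handler; sales and billing code is never opened, which is the coupling reduction the text describes.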
Now, if you make a change to some logic in shipping, the likelihood that you’re going to break logic in billing is next to nothing. That’s ultimately what helps projects get done faster, because the vast majority of time wasted on software projects is around regressions. Reduce the amount of code that any given change can break, and you increase the overall maintainability and adaptability of the entire system.
Unanticipated advantages
If you work closely with your domain experts, you will find that there are many instances where an event-driven architecture can improve your business processes.
Start with the question, “If X succeeds, and then Y fails, should we roll back the first bit?”
Let’s revisit the product ordering example. If I order some products but the credit card process fails, does that mean we should throw away the entire order and delete all the associated data? Most of the time, a domain expert will tell you that’s a terrible idea! We probably have a pre-existing relationship with this customer, and they’ve ordered from us before, so it’s likely that their credit card’s expiration date changed. Instead, we can send them an email to update their payment info, and eventually process the order successfully.
Look for those opportunities as part of your business analysis for things that could be partially successful. That’s an indication that Publish/Subscribe could be useful. The thing that was partially successful can be its own service, which publishes an event once it reaches that point of stability. Then, the rest of the process can succeed or fail independently in a separate subscriber.
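A minimal sketch of that partial-success handling, with hypothetical stand-ins (`charge_card`, `send_email`): because the order was already accepted as its own stable step, a failed charge triggers a follow-up email rather than a rollback that deletes the order.

```python
emails_sent = []

def charge_card(order):
    # Stand-in for a real payment gateway call; here an expired card
    # simply makes the charge fail.
    return order["card_expired"] is False

def send_email(customer, message):
    emails_sent.append((customer, message))

def handle_order_accepted(order):
    """Billing subscriber: charge the customer, or recover gracefully."""
    if charge_card(order):
        return "charged"
    # Partial success: keep the order, ask the customer to update their
    # payment info, and retry the charge later.
    send_email(order["customer"], "Please update your payment details")
    return "awaiting-payment"

status = handle_order_accepted(
    {"customer": "alice@example.com", "card_expired": True}
)
```

The order data survives the failure, and the business process continues once the customer responds, which is usually what the domain expert wants.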
This is not just using pub/sub to make your codebase more loosely coupled, or allowing you to run things in parallel for greater scalability. Along the way, these processes actually improve the quality of your domain. But doing so effectively takes good, strong interaction with the domain expert.
Conclusion
Empires rise and fall, but this basic ordering of humans working together as a society has remained unchanged since the beginning of time, and we can apply these principles to our software. These principles show how a distributed system can not only function, but surpass the possibilities of classical monolithic systems, with multiple actors working on independent tasks, loosely coupled from each other.
A distributed system should communicate by messages, and offer publish and subscribe mechanisms to decouple processes from each other. It should enable new ways of meeting business expectations, by not throwing out all of the results of a large process just because a small part of it fails.
(Good distributed systems should also offer reliability, resiliency, fault tolerance, message replay, and auditing, but that’s a post for another day.)
Some systems are so simple that you don’t need all of this, but simple systems do tend to evolve into more complex ones. With careful application of loosely coupled architecture using messages and publish/subscribe, you can keep both the complexity of individual components in your system low and the interaction between those components stable, thus avoiding the total collapse of your empire.
About the author: David Boike is a solution architect at Particular Software, author of Learning NServiceBus, and is passionate about distributed systems, elegant software, and brewing craft beer.