Managing success and growing pains
Every software system evolves through different stages of complexity. Systems start simple, attempting to solve a problem that might not yet be well defined. As they grow, the problems become better defined, and then the systems grow some more. Just like with lanky teenagers, this growth can sometimes cause growing pains. A skilled architect knows how to watch for the signs of these growing pains and how to apply more robust architectural patterns to ensure the system can continue to grow and flourish.
This post is the story of the growing pains experienced by our friends at VECOZO, a system integrator that ensures safe communication between numerous healthcare-related companies. They knew it would be irresponsible to design every piece of software to handle massive scale, even if it only had a few users. So, when the architects started to see telltale signs, they knew it was time to deploy more robust architectural patterns.
As you read on, maybe you’ll find that some of the challenges they faced sound familiar…
Unblocking
Initially, an application could call another application using synchronous remote calls. During each request, the calling application had to wait for the remote system to complete. That meant a problem in one application also affected the applications that called it. VECOZO looked at different solutions that suited their needs and settled on asynchronous messaging instead of synchronous calls. But instead of building a solution by hand, they decided to use a ready-made framework called NServiceBus.
NServiceBus worked so well that they introduced messaging in multiple other applications and added more NServiceBus features to the system. Other departments caught on and introduced NServiceBus in their relationship management and healthcare purchasing systems.
Let’s look deeper at the problems VECOZO experienced with synchronous remote calls and what we need to remember when using messaging.
When it’s synchronous
Synchronous remote calls, whether over HTTP or any other protocol, assume all involved parties are available, ready, and quick to respond upon request. A glitch in the request chain will ripple back to the original caller, growing like a snowball rolling down a slope. If the caller is a few hops away from the failure, it cannot recover from the error because it lacks context about the original cause of the exception.
Systems evolve, and decisions made early can have long-lasting and unforeseen consequences. Usually, it’s not a problem until suddenly it’s a big problem. An experienced team doesn’t necessarily prevent every single issue like this, but when one happens, they diagnose it quickly and take proper steps to mitigate it, as was the case here.
When VECOZO started suffering from those effects, they laid out a plan to address the limitations of the current design, namely, how to reduce the temporal coupling introduced by the synchronous remote procedure calls.
What about retries
To solve the issue, the team could have introduced an automatic retry mechanism as a short-term fix. When a remote procedure call failed, the calling code would retry it.
However, determining an appropriate retry strategy can be more art than science. How many times do you retry? Do you use an exponential backoff strategy? What happens if your retries inadvertently cause a denial of service attack? After investigating this option, the team realized such a solution might not be ideal.
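To make those questions concrete, here is a minimal sketch of a caller-side retry with capped exponential backoff and jitter. It is plain Python, not VECOZO's code and not an NServiceBus API. The function name `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped backoff.

    Hypothetical helper for illustration; every default here is a guess,
    which is exactly the problem the questions above point at.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: the caller still has to handle the failure.
                raise
            # The ceiling doubles per attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many callers from retrying in lockstep,
            # which is how retries can turn into an accidental flood.
            sleep(random.uniform(0, ceiling))
```

Even in this small form, every default (attempt count, base delay, cap) is a judgment call the caller has to make without knowing why the remote side failed.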
Messaging, a better approach
A message-based solution replaces synchronous procedure calls with sending a message asynchronously. Essentially, instead of saying, “Hey claims service, can you perform layout checks on this claim? I’ll wait,” the calling process would say: “Hey claims service, can you perform layout checks on this claim? Take your time; I’ll continue when you’re done.”
Now it no longer mattered whether a component was available. Asynchronous messages replaced direct calls, and messages simply accumulated in a queue until the component became available again. So, while the part of the team still working on a home-grown solution was looking into rate limiting, the developers switching to NServiceBus worried much less about overloading their components. They experimented to find the optimal number of threads for processing messages in parallel. As a result, the components could go as fast as possible without ever overloading resources like a database.
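The idea of a queue that buffers work while a fixed number of threads drains it can be sketched in a few lines. This is a simplified, in-memory illustration using Python's standard `queue` and `threading` modules. It is not how NServiceBus is implemented, and `process_all` and `max_concurrency` are hypothetical names.

```python
import queue
import threading


def process_all(messages, handler, max_concurrency=2):
    """Drain a queue of messages with a fixed number of worker threads."""
    inbox = queue.Queue()
    for message in messages:
        # Senders only enqueue; they never wait for the handler to finish.
        inbox.put(message)

    def worker():
        while True:
            try:
                message = inbox.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            handler(message)

    # The number of workers caps how many messages are handled at once,
    # so a shared resource such as a database sees bounded load.
    workers = [threading.Thread(target=worker) for _ in range(max_concurrency)]
    for thread in workers:
        thread.start()
    for thread in workers:
        thread.join()
```

Tuning `max_concurrency` is the kind of experiment described above: raise it until throughput stops improving or the downstream resource starts to struggle.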
That’s not to say that the team ignored a retry strategy. Since NServiceBus has this functionality built in, it was easy to enable it. But as a bonus, when a process did fail, the offending message could be delivered to an error queue along with the context of the failure in the form of the exception details. The operations and development teams could work together to investigate the issue, define a fix, deploy it, and finally put the message back in the queue to retry. The business processes could resume where they left off, with only a short delay.
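The error-queue workflow can be illustrated with a small in-memory sketch. This is an assumption-laden toy, not the NServiceBus implementation. The functions `process` and `replay` and the dictionary keys used for the failure details are invented for this example.

```python
import traceback
from collections import deque


def process(inbox, error_queue, handler, max_attempts=3):
    """Handle each message; after max_attempts failures, park it with context."""
    while inbox:
        message = inbox.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                break  # handled successfully, move on to the next message
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the message together with the exception details
                    # so people can investigate it later.
                    error_queue.append({
                        "message": message,
                        "exception_type": type(exc).__name__,
                        "exception_message": str(exc),
                        "stack_trace": traceback.format_exc(),
                    })


def replay(error_queue, inbox):
    """After a fix is deployed, put the failed messages back for processing."""
    while error_queue:
        inbox.append(error_queue.popleft()["message"])
```

A failing message does not block the ones behind it, and nothing is lost: once the fix is in place, `replay` returns the parked messages to the inbox and processing continues.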
Messaging considerations
Things are never as straightforward as they appear on the surface, and moving from synchronous calls to asynchronous ones is no different. Due to their nature, synchronous calls often rely on ordering. Things happen as they are declared in code, one after the other. Changing processes to be asynchronous introduces entropy. Processes can no longer rely on the execution order of the steps, as messages may be processed out of order, and thus certain components require some redesign.
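One common way to redesign a component so it tolerates out-of-order messages is to have it track what it has seen and act only once everything it needs has arrived. The sketch below assumes a claim workflow with two hypothetical message types, `ClaimReceived` and `LayoutChecked`. The names and the class are illustrative only and do not come from VECOZO's system.

```python
class ClaimProcess:
    """Tracks one claim and completes once both messages have arrived."""

    def __init__(self):
        self.received = False
        self.layout_checked = False
        self.completed = False

    def handle(self, message_type):
        # Record the fact carried by the message, whatever order it arrives in.
        if message_type == "ClaimReceived":
            self.received = True
        elif message_type == "LayoutChecked":
            self.layout_checked = True
        # Decide from accumulated state, not from which message came first.
        if self.received and self.layout_checked:
            self.completed = True
```

Because the decision depends only on accumulated state, handling `LayoutChecked` before `ClaimReceived` gives the same result as the expected order, and a duplicate message does no harm.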
Another key difference between retrying a synchronous remote call and retrying a message is who’s responsible for what. When the invoker sees its call fail and needs to retry, it has little to no knowledge about the context of the failure, and as such, its options are limited to a backoff retry policy.
When using messages, the failing message fails at the receiver end. The receiver has all the context and knowledge to make more educated decisions about how to retry and whether it’s worth it.
By forcing the sender/invoker to retry, we’re violating an ownership boundary by making the receiver’s problems a sender/invoker concern. We should never offload issues to someone with little to no knowledge of how to address them.
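A receiver that owns the retry decision can tell failures worth retrying from failures that never will succeed. The sketch below is a hypothetical illustration of that idea. The exception classes, the `receive` function, and the returned strings are all assumptions, not part of any real framework.

```python
class TransientError(Exception):
    """Hypothetical: a failure that may clear on retry, e.g. a database deadlock."""


class PermanentError(Exception):
    """Hypothetical: a failure retrying cannot fix, e.g. a malformed claim."""


def receive(message, handler, max_attempts=3):
    """Run handler(message) and return what the receiver decided to do."""
    for _ in range(max_attempts):
        try:
            handler(message)
            return "handled"
        except PermanentError:
            # The receiver knows retrying is pointless, so it stops right away.
            return "moved to error queue"
        except TransientError:
            continue  # worth another attempt
    # A transient failure that never cleared within the attempt budget.
    return "moved to error queue"
```

A remote caller sees only that the call failed. The receiver can see why, which is what lets it skip useless retries on a malformed message while still retrying a deadlock.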
Retrospective
Even though NServiceBus provides developers with the flexibility to do what they want in code, the team appreciated the NServiceBus “pit of success” philosophy, which makes it harder to do things the wrong way. Best practices are everywhere, embedded inside NServiceBus and its API and in great documentation. It provides a standard way of working, percolating throughout the system. The team especially appreciated the Particular Software support team, consisting solely of NServiceBus developers with experience building complex, distributed systems. Having one solution removed the need to create, maintain, and document a homemade framework.
The focus on code was one of the main reasons the team chose NServiceBus. The developers already felt most comfortable in a code editor, with the ability to safely commit changes to source control.
Automatic retries, error queues, and the design of loosely coupled event-driven applications paved the way for adding new functionality through new subscribers to existing events, as well as for no-downtime releases during working hours. Developers and operations personnel had the safety measures they needed to release confidently.
Summary
Any software system used for a considerable amount of time will go through growing pains as it graduates from prototype to minimum viable product to critical business system. The patterns used to accelerate its development in the early stages can’t always provide the stability and scalability required later in life.
The key for software professionals is to recognize the signs that a system is beginning to grow beyond its architectural capabilities and to know how to replace earlier architectural patterns like synchronous calls over HTTP with more robust patterns like asynchronous messaging.