Fallacy #6: There is one administrator
This post is part of the Fallacies of Distributed Computing series.
In small networks, it is sometimes possible to have one administrator. This is usually the developer who creates and deploys a small project. As a result, this developer has all of the information about this project readily available in their head and, if anything goes wrong, will know precisely what to do.
I know quite a few developers and managers who talk about “bus theory” as a way to promote communication of critical knowledge. The central point is this: having only one person holding critical knowledge is dangerous because of what would happen if that person got run over by a bus. The term bus factor was coined to represent the number of people on your team who have to be hit by a bus before the project is in serious trouble.
Nothing so dramatic has to occur to cause serious problems. It’s far more likely that, rather than be hit by a bus, these people will simply be promoted or move to a different company. Your organization might be lucky enough to have a person that seems to know everything about everything. You might even be that person. But when that person gets promoted, their replacement probably won’t have a clue.
As the size of a project gets bigger and bigger, more and more people will begin touching it. And as the number of people increases, the number of communication pathways between those people grows quadratically: n people need n(n-1)/2 connections. On a team of two, the people only need to talk to each other. On a team of eight, there are 28 connections! The chances of all this communication happening effectively (or at all) are small enough that it's virtually impossible that everyone on such a team understands everything going on in a large system.
Config soup
Consider, for a moment, all the different possible sources of configuration information for an application:
- Configuration files
- Databases
- Command line switches
- Registry settings
- Environment variables
- Centralized configuration services
With a sufficiently large, sufficiently complex, or sufficiently old system, it becomes nearly impossible to know what all the sources of configuration for a given system are, let alone what all the potential configuration values might do.
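To make the soup concrete, here is a minimal Python sketch (not from the original post; the file name, the APP_ prefix, and the precedence order are all illustrative assumptions) that layers just three of the sources listed above. Even in this toy version, answering "where did this setting come from?" already requires reading the code.

```python
import json
import os
import sys

def load_config(path="app.json"):
    """Layer configuration from several sources; the last writer wins.

    Illustrative only: a real system might also read a registry, a database,
    or a centralized configuration service, each with its own precedence rules.
    """
    config = {}

    # 1. Configuration file (lowest precedence)
    if os.path.exists(path):
        with open(path) as f:
            config.update(json.load(f))

    # 2. Environment variables prefixed with APP_ override the file
    for key, value in os.environ.items():
        if key.startswith("APP_"):
            config[key[len("APP_"):].lower()] = value

    # 3. Command line switches (--key=value) override everything else
    for arg in sys.argv[1:]:
        if arg.startswith("--") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            config[key] = value

    return config

if __name__ == "__main__":
    print(load_config())
```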
A trend that complicates matters is the drive to create “mass customizable software” in which every possible option can be tweaked. This drastically increases the number of possible permutations, which makes testing prohibitively difficult. The push for customization also has the interesting effect of creating “campfire stories” around configuration, such as, “Oh sure, we can do that. You just have to enable these two features, disable that other one, make sure these four settings are entered and are integers…”
The larger (or, more likely, older) a system is, the less likely you are to really understand what it does. When that happens, you will be much less likely to remove old code because you won't have any idea whether it's still necessary or not. This leads to a vicious cycle, in which an already complex system becomes more complex as time marches on.
High availability
The end result is that deployments become increasingly difficult to accomplish without significant downtime. Business expectations regarding uptime are fairly straightforward: clients want the system available 100% of the time! We must then explain to them the concepts of planned and unplanned downtime and how we plan to take the system down in order to upgrade it to the next version.
Of course, we want to be Agile. We want to deploy as often as possible so that we can get valuable feedback from our customers as early in the process as possible. But "five nines" of uptime (99.999%) allows only about 5.26 minutes of downtime in an entire year, so a single deployment that takes five minutes of downtime nearly exhausts that budget on its own.
The longer our deployments take, the more pushback we are going to receive from business stakeholders. They will want us to deploy less often, which means that each release will now contain more changes. The more changes there are in a release, the higher the risks and thus the longer the planned downtime. Yet another vicious cycle.
In order to achieve five nines, every release must complete in significantly less than five minutes. More than that, the combined downtime of all the releases you perform throughout the year must also add up to well under five minutes.
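For the skeptical, the arithmetic behind that budget is quick to check:

```python
# Downtime budget implied by "five nines" (99.999%) availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60            # ~525,960 minutes
allowed_downtime = MINUTES_PER_YEAR * (1 - 0.99999)

print(f"{allowed_downtime:.2f} minutes per year")  # ~5.26 minutes
```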
That’s hard.
Solutions
So it's impossible to have only one administrator. Because of all the different ways to configure our applications, and their sheer size and scope, we'll eventually reach the point where it's impossible to deploy them quickly enough to avoid running afoul of our uptime guarantees. It sounds like we're in pretty bad shape here.
Always on
Essentially, five nines of availability means you can never turn the system off. This means you need to be able to upgrade it while it’s still running, which in turn means that, to perform updates, you must be able to run multiple versions of the system side by side. New code must be backwards compatible with the previous version, which isn’t anything that the administrator can do. That’s on developers.
It’s hard enough to test one version of software in isolation. How do we test two versions running side by side? When running side-by-side versions, managing configuration — both where the configuration information comes from and how it changes from one version to the next — becomes a big deal.
Even if some magical technology were available that would guarantee our servers never go down, our upgrades would still cause the system to experience downtime. The only way to deal with this is to close the loop between developers and administrators. Developers must be as responsible for writing backwards-compatible code as administrators are for deploying it on highly available infrastructure, in such a way that deployments can run side by side.
Decoupling is key
With the addition of asynchronous messaging using queues, always-on availability and side-by-side deployments become much easier to implement. A queue decouples the sender (generally a website) from the processing of its requests. Therefore, even when the back-end service is unavailable, the web server can continue to dump requests into the queue, and the overall system remains available.
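As an illustration, here is a minimal Python sketch of that decoupling. The in-process queue.Queue stands in for a durable message broker, and the PlaceOrder message shape and field names are hypothetical:

```python
import json
import queue
import uuid

# An in-process queue.Queue stands in for a durable message broker
# (RabbitMQ, Azure Service Bus, etc.) to keep the sketch self-contained.
order_queue = queue.Queue()

def place_order(customer_id: str, items: list) -> str:
    """Web-tier handler: capture the request and return immediately.

    The back-end service that actually processes the order can be down,
    slow, or mid-upgrade; the website stays responsive either way.
    """
    message = {
        "message_id": str(uuid.uuid4()),     # hypothetical message shape
        "type": "PlaceOrder",
        "customer_id": customer_id,
        "items": items,
    }
    order_queue.put(json.dumps(message))     # a durable send in a real system
    return message["message_id"]             # e.g. shown to the user as a reference

if __name__ == "__main__":
    ref = place_order("customer-123", ["sku-1", "sku-2"])
    print(f"Accepted order {ref}; {order_queue.qsize()} message(s) waiting")
```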
This gives our administrators a lot of power. Using queues, we can enable the administrator to take down parts of the system for maintenance without adversely affecting the response time.
Additionally, the use of queues and messages forces work into discrete units, and there’s no reason we can’t deploy multiple versions of a message handler at the same time, each gathering work from the same queue using the competing consumer pattern. Assuming the vNext version of the handler does not misbehave, we can promote it and decommission the old handler. Or if an error is found, the vNext handler can be removed from service, and the messages it failed to process can be rerouted to the old handler.
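Here is a sketch of that competing-consumer arrangement, again with an in-process queue standing in for a real broker (the handler names and message fields are made up for illustration):

```python
import json
import queue
import threading

work_queue = queue.Queue()   # the same kind of stand-in queue as above

def handle_order_v1(message: dict) -> None:
    print(f"v1 processed order {message['order_id']}")   # current production behavior

def handle_order_v2(message: dict) -> None:
    print(f"v2 processed order {message['order_id']}")   # candidate vNext behavior

def consume(handler) -> None:
    """Competing consumer: each message is taken by exactly one worker."""
    while True:
        raw = work_queue.get()        # blocks until work is available
        handler(json.loads(raw))
        work_queue.task_done()

# Both versions compete for messages from the same queue. Once vNext proves
# itself, the v1 worker is simply shut down; if vNext misbehaves, it is the
# one taken out of service instead.
for handler in (handle_order_v1, handle_order_v2):
    threading.Thread(target=consume, args=(handler,), daemon=True).start()

work_queue.put(json.dumps({"order_id": "42"}))
work_queue.join()   # wait until one of the workers has handled the message
```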
This decoupling between sender and receiver also forces us to codify what the exchanged messages look like, which guides us toward backward compatibility and facilitates continuous deployment. No longer do we have to bring the entire system down, upgrade, hope for the best, and then restore from backup when things go wrong. Now we can bring down small parts of the system, operating on well-defined message contracts, and we only need to guarantee that a small subset of functionality continues to work as expected after the upgrade is complete.
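One common way to keep those contracts backward compatible is to make changes purely additive and optional, so that old and new handlers can process the same stream of messages during a side-by-side deployment. The PlaceOrder contract below is hypothetical, sketched in Python:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical PlaceOrder contract.
# Version 1 payloads look like:  {"order_id": "...", "customer_id": "..."}
# Version 2 adds an optional field; old handlers ignore keys they don't know,
# and new handlers fall back to a default when the field is missing.
@dataclass
class PlaceOrder:
    order_id: str
    customer_id: str
    gift_message: Optional[str] = None   # new in v2, deliberately optional

def parse_place_order(payload: dict) -> PlaceOrder:
    """Accept both v1 and v2 payloads of the message."""
    return PlaceOrder(
        order_id=payload["order_id"],
        customer_id=payload["customer_id"],
        gift_message=payload.get("gift_message"),   # absent in v1 messages
    )

if __name__ == "__main__":
    print(parse_place_order({"order_id": "42", "customer_id": "customer-123"}))
```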
Everything is logged?
It’s also important to consider how to pinpoint problems in deployment scenarios. It’s common to take the worldview of “if there’s a problem, it will be logged.” But how often is an entry in a log file insufficient to determine the true cause of a problem?
Generally, hunting down bugs falls on the shoulders of junior developers. That way, as long as everything is logged, somebody will take care of it…as long as that somebody isn’t us. Therefore, we don’t feel the pain, and we don’t invest our thinking into trying to make it better.
With the addition of messaging, we gain more options for tracking down problems. Rather than dig through log files, we know a failure has occurred because the message that failed to be processed is moved to an error queue for evaluation. Bound up within that message is not only the stack trace but also all of the data that caused the error to occur.
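A minimal sketch of that failure path, again using in-process queues as stand-ins for a broker's work and error queues (the message shape is illustrative):

```python
import json
import queue
import traceback

work_queue = queue.Queue()
error_queue = queue.Queue()   # failed messages land here for evaluation

def process(message: dict) -> None:
    # Deliberately fragile, to demonstrate the failure path.
    if "customer_id" not in message:
        raise KeyError("customer_id")

def consume_once() -> None:
    raw = work_queue.get()
    try:
        process(json.loads(raw))
    except Exception:
        # The error queue receives the original payload *and* the stack trace,
        # so whoever investigates has the exact data that triggered the bug.
        error_queue.put(json.dumps({
            "original_message": raw,
            "stack_trace": traceback.format_exc(),
        }))
    finally:
        work_queue.task_done()

work_queue.put(json.dumps({"order_id": "42"}))   # missing customer_id on purpose
consume_once()
print(error_queue.get())
```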
Summary
The ramifications of the 6th fallacy of distributed computing run quite a bit deeper than what might at first be assumed. There is never one administrator. If there were, we would be in trouble when they got promoted. Everyone working on the project is an administrator in some respects. All possess some of the required knowledge, but seldom do they possess all of it.
With all of these different players and the general tendency of software to become more complex over time, we need to take action to ensure we will be able to deploy our software successfully without downtime.
By introducing messaging and carving our monolithic system up into several smaller, independently deployed pieces, we gain two critical abilities. First, we can take down only a small portion of the system for an upgrade, without noticeable effect on the system as a whole. Second, we can deploy different versions of these components side by side in order to verify that they are production-ready without reaching an “upgrade point of no return,” beyond which we would be forced to recover from backups.
As an added bonus, developing with this architectural style guides us toward making our upgrades backwards compatible in the first place.
There are many administrators. For all of us, high availability is the primary goal, but it’s not something that can be added in afterwards. In order to achieve the goal, we must create an architecture that enables it in the first place.