Fallacy #5: Topology doesn't change
This post is part of the Fallacies of Distributed Computing series.
It’s easy for something that starts out simple to become much more complicated as time wears on. I once had a client who started out with a very simple server infrastructure. The hosting provider had given them ownership of an internal IP subnet, so they started with two load-balanced public web servers: X.X.X.100 and X.X.X.101 (Public100 and Public101 for short). They also had a third public web server, Public102, which hosted an FTP server and a couple of random utility applications.
And then, despite the best-laid plans, the slow creep of chaos eventually took over.
In an era before virtualized infrastructure made allocating additional server resources much easier, Public102 became somewhat of a “junk drawer” server. Like that drawer in your kitchen that contains the can opener, pot holders and an apple corer, the Public102 server continued to accumulate small, random tasks until it reached a breaking point. Public102 was nearly overloaded, and at the same time, it needed an OS upgrade because Windows 2000 was nearing the end of extended support.
It’s easy enough to upgrade load-balanced web servers. You create a new server with the same software and add it to the load balancer. Once it’s proven, you can remove the old one and decommission it. It’s not so easy with a junk drawer server.
Decommissioning Public102 was an exercise in the mundane, gradually transitioning tiny service after tiny service to new homes over the course of weeks, as the development schedule allowed. It was made even more difficult by the discovery that, because of the public FTP server, several random jobs held configuration values (both hardcoded and in configuration files) that referred to Public102 by a UNC path containing the server’s internal IP address.
When we finally had all the processes migrated, we celebrated as we decommissioned Public102. Unfortunately, the network operations team had a cruel surprise for us: for a reason I can no longer recall, they needed to change the subnet that all of our servers occupied.
And so we started it all again.
Embrace change
The only constant in the universe is change. This maxim applies just as well to servers and networks as it does to the entirety of existence.
In any interesting business environment, we try to think in terms of server and network diagrams that do not change. But eventually, a server will go down and need to be replaced, or a server will move to a different subnet.
Even if server infrastructures are relatively static or changes are planned in advance, we can still get into trouble with changing network topology. Some protocols can run afoul of this changing topology — even something simple, like a wireless client disconnect.
For example, in WCF duplex communication (which, thanks to not being included in .NET Core, appears to be on its way out), multiple clients connect to a server. The server creates a client proxy for each connected client and holds it in a list. Whenever the server needs to communicate something to the clients, it runs through the list of client proxies and sends the information to each one.
If one of these clients is on a wireless connection that gets interrupted, the client proxy continues to exist. As activity continues to happen on the server, multiple threads can all become blocked attempting to contact the same client, waiting up to 30 seconds for it to time out.
This creates a miniature denial-of-service attack. One client can connect to the system, do a little bit of work, shut down unexpectedly, and then cause a 30-second disruption for all the other clients.
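One common mitigation, shown here as a minimal sketch rather than the original system's code (the names IClientCallback and Broadcaster are hypothetical), is to stop treating the proxy list as permanent: skip proxies whose channels are no longer open, and evict a proxy the moment a send faults or times out, so a dead client costs at most one timed-out call instead of blocking every broadcast.

```csharp
using System;
using System.Collections.Generic;
using System.ServiceModel;

// The duplex callback contract (referenced elsewhere via
// [ServiceContract(CallbackContract = typeof(IClientCallback))]).
public interface IClientCallback
{
    [OperationContract(IsOneWay = true)]
    void Notify(string message);
}

public class Broadcaster
{
    private readonly List<IClientCallback> clients = new List<IClientCallback>();

    // Called from within a service operation, where an OperationContext exists.
    public void Register()
    {
        var callback = OperationContext.Current.GetCallbackChannel<IClientCallback>();
        lock (clients)
        {
            clients.Add(callback);
        }
    }

    public void Broadcast(string message)
    {
        List<IClientCallback> snapshot;
        lock (clients)
        {
            snapshot = new List<IClientCallback>(clients);
        }

        foreach (var client in snapshot)
        {
            var channel = (ICommunicationObject)client;

            // Skip proxies whose channels have already faulted or closed,
            // rather than letting every broadcast block on them.
            if (channel.State != CommunicationState.Opened)
            {
                Remove(client);
                continue;
            }

            try
            {
                client.Notify(message);
            }
            catch (CommunicationException)
            {
                // The client disappeared mid-send; drop it so it only costs us once.
                channel.Abort();
                Remove(client);
            }
            catch (TimeoutException)
            {
                // The send blocked until the binding's SendTimeout; evict the client.
                channel.Abort();
                Remove(client);
            }
        }
    }

    private void Remove(IClientCallback client)
    {
        lock (clients)
        {
            clients.Remove(client);
        }
    }
}
```

Shortening the binding's SendTimeout, or making the sends asynchronous, further limits how long any single dead client can hold things up.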
Cloudy with a chance of containers
With the advent of cloud computing, deployment topologies have become even more subject to constant change. Cloud providers allow us to change server topology on a whim and to provision and deprovision servers as total system load changes.
Containers such as Docker and orchestrators such as Kubernetes, along with managed hosting options like Azure Kubernetes Service, Amazon’s Elastic Container Service and Elastic Kubernetes Service, and Google Kubernetes Engine, allow an application to live in an isolated process that isn’t tied to any specific infrastructure, so it can run on any computer, on any infrastructure, in any cloud. This level of freedom lets deployment topology change with almost reckless abandon.
Taking these technologies into account, it’s clear that we must not only accept that network topology might change but actively plan for it. Unless our software infrastructure can adapt to a constantly changing network layout, we won’t be able to take advantage of these technologies and all the benefits they promise.
Solutions
The solution to hardcoded IP addresses is easy: don’t do it! And for the sake of this discussion, let’s assume that configuration files alone don’t solve the problem. An IP address in a config file is still hardcoded; it’s just hardcoded in XML instead of code.
Additionally, we need to think about how changes to topology, even minor changes like the disconnection of a wireless client, can affect the systems we’re creating. When topology does change, will your system be able to maintain its response-time requirements?
Ultimately, this comes down to testing. Beyond just checking how your system behaves when everything is working, try turning things off. Make sure the system maintains its response times. If something moves to a different place on the network, how quickly will that be discovered?
Some companies, like Netflix, take this to the extreme. Netflix uses Chaos Monkey, a service that randomly terminates production instances, to ensure its systems are built to withstand disruptions.
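On a much smaller scale, even a crude check like the following can catch problems early. This is a minimal sketch, assuming a hypothetical HTTP dependency at orders.internal.example; the point is to verify that a call to a dependency that has been switched off fails within a known budget instead of hanging for the platform’s default timeout.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public static class FailFastCheck
{
    public static async Task Main()
    {
        // Give the call a deliberately short timeout budget.
        using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };
        var stopwatch = Stopwatch.StartNew();

        try
        {
            // The dependency has been turned off for this test.
            await client.GetAsync("http://orders.internal.example/health");
        }
        catch (HttpRequestException) { /* expected: host unreachable */ }
        catch (TaskCanceledException) { /* expected: request timed out */ }

        stopwatch.Stop();
        Console.WriteLine(stopwatch.Elapsed < TimeSpan.FromSeconds(3)
            ? $"OK: failed fast in {stopwatch.Elapsed.TotalSeconds:F1}s"
            : $"PROBLEM: blocked for {stopwatch.Elapsed.TotalSeconds:F1}s");
    }
}
```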
DevOps
Along with the rise of the cloud and containers, we have also seen the rise of DevOps, which is becoming a common part of the software development process at many organizations.
DevOps practices include infrastructure as code and the idea of throwaway infrastructure. With this level of deployment automation, you can recreate your entire infrastructure within minutes, then throw it away and recreate it on a whim.
These approaches, using tools like Octopus Deploy, Chef, Puppet, and Ansible, mean that addresses and ports for locating services become a variable in code that is determined whenever the infrastructure is deployed.
This alone makes a system much more resilient to change.
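As a minimal sketch of that idea (the variable name ORDERS_SERVICE_URL and the deployment wiring are assumptions for illustration, not anything prescribed by these tools), the application reads the address that the deployment process injected, and nothing in the codebase ever names an IP:

```csharp
using System;

public static class ServiceEndpoints
{
    // The deployment pipeline (Octopus, Chef, Puppet, Ansible, ...) sets this
    // environment variable when the infrastructure is provisioned, so the
    // address can change on every deployment without touching the code.
    public static Uri Orders =>
        new Uri(Environment.GetEnvironmentVariable("ORDERS_SERVICE_URL")
                ?? throw new InvalidOperationException(
                    "ORDERS_SERVICE_URL was not set by the deployment process."));
}
```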
Service discovery
At their core, service discovery tools map a named service to an IP address and port in real time, something DNS is unfortunately ill-equipped to handle. You can use SRV records to resolve a port, but a DNS record’s TTL makes it a poor fit for real-time changes to name resolution. In dynamic environments, especially in the cloud or when using software containers like Docker, a service discovery mechanism allows a client to reliably connect to a service even as that service is redeployed to a new physical location.
Zookeeper is an open-source lock service, based on Zab, which is similar to the Paxos consensus algorithm (but not the same thing). Zookeeper acts as a consistent and highly available key/value store, suitable for storing centralized system configuration and service directory information.
The Paxos consensus algorithm is decidedly non-trivial to implement, so more recently Raft was published as a simpler alternative. CoreOS’s etcd is an example of a Raft implementation, and Consul builds on these ideas as well.
It’s important to acknowledge how hard it is to correctly implement the distributed consensus needed to keep a service discovery mechanism highly available. The Paxos algorithm is insanely complex, and even though Raft is simpler, it’s no walk in the park.
It would be foolish to try implementing a service discovery scheme on your own when good options already exist, unless service discovery is specifically your business domain.
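To make the idea concrete, here is a minimal sketch of resolving a service at runtime through Consul’s HTTP health endpoint (/v1/health/service/{name}?passing) instead of reading an address from a config file. The service name "orders" and the local agent address are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class ConsulLookup
{
    public static async Task<Uri> ResolveAsync(string serviceName)
    {
        // Ask the local Consul agent for healthy instances of the service.
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8500") };
        var json = await http.GetStringAsync($"/v1/health/service/{serviceName}?passing");

        using var doc = JsonDocument.Parse(json);
        var first = doc.RootElement.EnumerateArray().First();

        var service = first.GetProperty("Service");
        var address = service.GetProperty("Address").GetString();
        if (string.IsNullOrEmpty(address))
        {
            // Consul leaves Service.Address empty when the service uses the node's address.
            address = first.GetProperty("Node").GetProperty("Address").GetString();
        }

        var port = service.GetProperty("Port").GetInt32();
        return new Uri($"http://{address}:{port}/");
    }
}

// Usage: var ordersUri = await ConsulLookup.ResolveAsync("orders");
```

In practice you would also handle the case where no passing instances exist and cache the result briefly, but the essential point stands: the address is looked up at call time, not baked in at build time.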
Summary
If you know a technology has a potential problem, test for it in advance and see how it behaves. Even if you don’t know of a problem, test anyway, early in your development cycle. Don’t wait for your system to blow up in production.
Even for smaller deployments, it’s worth investigating DevOps practices of continuous delivery and infrastructure as code. This level of deployment automation gives you a competitive advantage and increases your overall agility as a software producer. Especially if you want to take advantage of cloud computing or software container technology, deployment automation is a must.
For the most complex and highly dynamic scenarios, service discovery tools can allow a system to keep running even when topology is changing every minute.
A network diagram on paper is a lie waiting to happen. The topology will change, and we need to be prepared for it.