The challenges of monitoring a distributed system

I remember the first time I deployed a system into production. We built a custom content management website backed by a single SQL Server database. It was a typical two-tier application with a web application and a database. Once the system was deployed, I wanted to see if everything was working properly, so I ran through a simple checklist:

  • Is my database up? (Yes/No)
  • Is my web server up? (Yes/No)
  • Can my web server talk to my database? (Yes/No)

A simple monitoring workflow
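The checklist above boils down to a couple of connectivity probes. Here's a minimal sketch of what automating it might look like; the hostnames are hypothetical placeholders, and a real check from the web server to the database would typically go through a health endpoint rather than a raw TCP probe:

```python
import socket

def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checklist() -> dict:
    # Hypothetical addresses; port 1433 is the SQL Server default.
    return {
        "database up": is_up("db.example.local", 1433),
        "web server up": is_up("web.example.local", 80),
        # "Can my web server talk to my database?" would normally be
        # answered from the web server itself, e.g. via a health endpoint.
    }
```

If every value comes back True, the system is healthy by this definition; any False means it's time to take action.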

If the answers to these questions were all yes, then the system was working correctly. If the answer to any of those questions was no, then the system wasn't working correctly and I needed to take action to correct it.

Beyond the checklist

For a simple little website, that was all the monitoring I needed. It was easy to maintain and to automate, so I was often able to fix problems before the users even noticed them. But over time, the complexity of the systems that I've deployed and monitored has increased.

These days, I work mostly with distributed systems, which are typically made up of a larger number of processes running on many different machines. Processes can run on-premises or in the cloud, and some solutions may include a mix. Some of your processes may run on physical machines while others run on virtual machines. For some processes (in platform-as-a-service environments), you may not even care about the machine they run on. Monitoring a distributed system requires keeping track of all of these processes to ensure that they're running and running well.

Distributing a system over more processes increases the number of potential communication pathways quadratically. My original website was a typical two-tier application with two major components (web server and database) and only the one communication pathway between them. If you distribute a system over just five processes, any two of them might need to communicate. That's 10 different point-to-point communication pathways that might fail. When the system is first deployed, only a small subset of those 10 pathways might matter. But as the system grows organically over time, the set of pathways being used can change.
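The count comes from the number of ways to pick a pair of processes, n(n − 1)/2, which is easy to confirm:

```python
def pathways(n: int) -> int:
    """Number of potential point-to-point communication pathways
    among n processes: n choose 2, i.e. n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Two processes (web server and database) share a single pathway,
# five processes already have ten, and twenty processes have 190.
```

The growth isn't literally exponential, but quadratic growth is bad enough: doubling the number of processes roughly quadruples the pathways you might have to monitor.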

Increased complexity isn't the only challenge you face when monitoring a distributed system. There are some common patterns and techniques seen in distributed systems that need to be taken into account when designing a monitoring solution.

When a failure isn’t really a failure

Distributed systems need to be designed to tolerate failure. It's common practice to introduce a persistent queue between distributed processes, which allows them to continue to communicate without both having to be online and available at the same time. That's great for the stability of your system, but it can hide real problems from you.
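The decoupling effect can be sketched with an in-memory queue standing in for a durable broker (a real system would use something persistent like RabbitMQ or Azure Service Bus; this is only an illustration of the pattern):

```python
import queue

# In-memory stand-in for a persistent queue between two processes.
orders = queue.Queue()

def web_tier_accept(order: dict) -> None:
    # The web tier only needs the queue to be available,
    # not the back-end process itself.
    orders.put(order)

def back_end_drain() -> list:
    # The back end can come online later and work through its backlog.
    processed = []
    while not orders.empty():
        processed.append(orders.get())
    return processed
```

Notice that `web_tier_accept` succeeds even if `back_end_drain` never runs, which is exactly the property that keeps the system stable and, at the same time, hides a dead back end from you.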

If a server restarts to apply OS updates, the rest of the system isn't impacted. The server will come online again in a few minutes and start processing its backlog of messages. If a process isn't running right now, that's not necessarily a sign of an inherent failure. It could just be routine maintenance. But it could also be a sign that something has gone wrong.

If you aren't watching carefully, a process can be offline for quite a while before anyone notices. For example, if your website is online and accepting orders but your back end isn't charging customer credit cards, it could be some time before someone realizes there’s no money coming in. When that happens, they usually won't have a smile on their face.

Failures don't just happen at the process level. Individual messages can fail as well. Sometimes these problems are caused by a programming error or by a malformed message. In these cases, either the process or the message needs to be manually modified, and then the message can be retried. More frequently, these problems are temporary (database deadlock, network glitch, remote server restarting), and a common approach to dealing with them is to wait a short while and then retry them automatically. Implementing message retries keeps messages moving through the system, but it can also mask growing problems.
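A typical automatic retry loop looks something like the sketch below (the names and the error-queue hand-off are illustrative, not any particular framework's API). Note that a message which succeeds on a retry never reaches the error queue, so unless the failed attempts are counted somewhere, they leave no trace:

```python
import time

class TransientError(Exception):
    """A failure expected to resolve on its own (deadlock, network glitch)."""

def process_with_retries(handler, message, max_attempts=3, delay_seconds=0.0):
    """Retry a message handler on transient failures.

    After max_attempts the exception propagates, at which point a real
    system would move the message to an error queue for manual review.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; hand off to the error queue
            time.sleep(delay_seconds)  # back off before retrying
```

The `attempt` counter is exactly the signal worth surfacing in monitoring: a rising retry rate can reveal a growing problem long before any message actually lands in the error queue.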

I once worked on a system where nearly a third of the messages being processed by an endpoint failed at least once due to deadlocks occurring in a database. As these messages were retried, they never ended up in an error queue, and I just assumed that the endpoint was a little slow. Eventually, I tried to scale the endpoint out to get more performance out of it. All that did was put more pressure on the database and cause more deadlocks. I needed to address the deadlocks directly in order to solve the problem, but I hadn't seen them.

Monitoring at scale

Modern distributed systems are designed to scale out. The recent surge of cloud infrastructure and virtualization makes it much easier to spin up new processes than to spend money provisioning huge servers for individual systems. Knowing which parts of your system need to be scaled, when, and by how much requires collecting and analyzing large piles of data.

Once you do scale out, your data collection also scales. Not only do you have to collect data from more servers, but now you have to aggregate it so you can analyze your system as a whole rather than as individual parts.

Gathering the data for a single server isn't difficult. Modern operating systems and platforms make it very easy to collect any sort of information you might need. How much CPU is this process consuming? How many messages per second are being processed? Which queues are filling up? All of these questions are easy to ask and get answers for.
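A per-server snapshot answering those questions might look like this sketch. The queue depths and message counters are hypothetical inputs that a real endpoint would expose, and `os.getloadavg` is POSIX-only:

```python
import os

def collect_metrics(queue_depths: dict, messages_processed: int,
                    interval_seconds: float, backlog_threshold: int = 100) -> dict:
    """Snapshot one server's health: CPU load, throughput, and queue backlogs.

    queue_depths maps queue name -> current depth; messages_processed is the
    count handled over the last interval_seconds.
    """
    return {
        "load_avg_1m": os.getloadavg()[0],  # 1-minute CPU load average (POSIX)
        "messages_per_second": messages_processed / interval_seconds,
        "queues_filling_up": [name for name, depth in queue_depths.items()
                              if depth > backlog_threshold],
    }
```

Collecting this for one server is the easy part; the hard part, as the next paragraph suggests, is aggregating these snapshots across every server so the system can be judged as a whole.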

But the key to successfully monitoring a distributed system isn't just about gathering data; it’s about knowing which questions to ask, and knowing how to interpret the data to answer those questions at the right time.

Summary

System monitoring is important, but that isn't news to anyone. The problem is that, without a standard methodology, monitoring is a challenge, especially when dealing with distributed systems. Gathering data can flood you with information that doesn’t necessarily help you identify what needs to be fixed at the right time.

Coming up, we're going to discuss our philosophy on the subject; what we’re doing at Particular to help you ask the right questions; and, more importantly, how to answer those questions. In the meantime, check out the recording of our What to consider when monitoring microservices webinar for more details.


About the author: Mike Minutillo is a developer at Particular Software. He has three children, so he has experience watching a collection of unreliable processes attempt to collaborate.