Fallacy #1: The network is reliable
This post is part of the Fallacies of Distributed Computing series.
Anyone with a cable or DSL modem knows how temperamental network connections can be. The Internet just stops working, and the only way to get it going again is to unplug it for 15 seconds. (Or, put another way, “Have you tried turning it off and on again?”)
Thankfully, better solutions exist for professional data centers than consumer-grade modems, but problems can persist.
As a company that does reliable messaging, we’ve really heard it all. Don’t worry, the names will be changed to protect the innocent.
A registered ISP had two routers: a primary and a backup. One day, the primary router malfunctioned. They switched to the backup, only to find that its routing tables had not been updated in a very long time. For many customers, that was the day the Internet died.
In another case, a project used an Oracle database. Everything worked great in the development environment, but in production there was an additional load balancer and firewall. Every once in a while, the load balancer would silently drop TCP connections to the database. These faulty connections continued to sit in a connection pool, so the next time somebody needed a connection, they would get an exception.
Just recently, on June 12, a single ISP in Asia broke the Internet for a big section of the world, creating Internet problems in Europe.
In general, you can’t trust any network, no matter how local or global. Hardware, software, and security can all cause issues. This is codified in the 1st fallacy of distributed computing: the network is reliable.
This is especially problematic for HTTP communication or any request/response or remote procedure call (RPC) style of communication.
Consider the following simple web service call:
var svc = new MyService();
var result = svc.Process(data);
How do you handle an
HttpTimeoutException? This is an exception that’s generated on the client side when there’s a problem, but you can’t know what went wrong because you haven’t gotten a response.
Data can get lost when sent over the wire. It’s possible that the web service call actually succeeded but the response got lost somewhere on the Internet. If the web service represents an idempotent operation (a process that can be repeated without any adverse side effects), then it can simply be retried. But what if that process charges a credit card?
To provide a truly reliable system, you must accept that cross-network communication will not always be possible. Because we can’t guarantee that an attempt at communication will be successful, we need to provide a facility to automatically retry after failures. To protect against failure while in the midst of a retry, we can use a pattern called store and forward. Instead of directly sending data to a remote server, we can store it in local storage. This way, when we boot up again, we are ready to continue right where we left off. We can use transactions to ensure that we keep retrying until processing succeeds.
This rises above the level of a simple retry loop around a web service invocation, which would fail if the server it was running on crashed. We need additional infrastructure to make these guarantees.
There are many different technologies out there, called reliable messaging or message queuing systems, that solve these types of problems. On the Microsoft platform, the best known one is Microsoft Message Queuing (MSMQ), and on Azure, there is Azure Queue Storage and Azure Service Bus. Outside the Microsoft ecosystem, there is RabbitMQ, ActiveMQ, and ZeroMQ. Basically anything with “MQ” at the end is an indication that the product fits within this family of technologies.
These queuing technologies wrap up something like a web service call into an isolated, discrete unit of work called a message. The message queue employs store and forward to ensure that the message gets to where it needs to go. It can facilitate automatic retry, as message processing can be attempted over and over, even after a system crash. Some even support transactions, so that the message is only fully consumed if the business transaction is successful. This way, a guarantee can be made that each message is successfully processed exactly once.
Techniques exist to enable “exactly once” processing in in environments like the cloud, where distributed transactions are not feasible, but that is outside the scope of this text.
Queuing technologies hold an additional advantage over a simple retry loop. For example, if we were attempting to create a customer and received an HTTP timeout, we would have no way of knowing if the server received the data and was simply unable to respond or if, rather, the data never arrived at all.
Retrying brings with it the possibility for server-side duplication. If we retry the attempt to create the customer, we may accidentally create the customer twice.
Message queues bring with them the concept of a message ID so that the server can decide whether an attempt is a retry or not. In essence, messaging allows deduplication on the server side.
Asynchronous messaging requires a slight change in thinking because it does not provide the ability to do the traditional request/response seen in typical web service calls.
Sending a message to a message queue is a fire-and-forget operation. You drop a message in a queue, and eventually it makes its way to the server and the server will process it. You do not get an immediate return value on the next line of code.
Ultimately, by solving certain infrastructure issues, asynchronous messaging forces you to redesign the logical flow of your system.
This is the difficult leap of queueing technologies: not that they have an API that is difficult to use (they don’t), but that they require letting go of more traditional request/response programming models.
Unfortunately, you can’t just take a system using HTTP, plug in a queue, and ship it. It requires a significant redesign, and sometimes rewrite, of your system.
That can be scary, but the results are worth it.
Compared with a few decades ago, networks are fairly reliable — except for when they’re not. As we continue to build larger and more globally distributed systems, we make ourselves susceptible to all the bad things that can happen.
In order to deal with this, we’re going to have to move away from synchronous request/response-type programming. The object-oriented model of invoking a method (known as remote procedure call, or RPC) tends to break down to conditions when the network is unreliable, putting our system into a non-deterministic state that is very difficult to get out of.
In the last several decades since the creation of the first computer networks, we have been unable to completely solve the problem of network reliability. It stands to reason that this will not change in the next 5–10 years. We need to learn to build systems that will work in this environment today.