This article is an excerpt from the book Dr. Harvey and the 8 Fallacies of Distributed Computing, which chronicles the misadventures of Dr. Harvey Fallacious, a 19th-century explorer and researcher, and also my great-great-great-great-great-grandfather. In it, he details his travels and explorations in remote, then-uncharted regions of South America, unknowingly dealing with the repercussions of the 8 Fallacies long before the birth of the computer age, let alone before Peter Deutsch wrote them down.

On the use of tin can phones in primitive culture

In the spring of that year, my travels brought me upon a previously undiscovered civilization. The people called themselves Ossians, and they lived in an isolated collection of villages in a remote part of South America.

Being remote as they were, their level of technology was understandably primitive. But I was surprised by the locals' recent obsession with new forms of communication. It all started, they told me, when one of them discovered that by attaching a rope between two clay pots and stretching the rope taut, a voice uttered into one side could be heard on the other. (I neglected to tell them that even as a boy I had done this very thing with tin cans.)

At first it was children doing it, but quickly the village elders recognized the advantages that communication over distances could bring, and so they strung ropes all throughout the village. "A clay pot in every home!" became a rallying cry for modernization.

While the clay pot network worked well enough locally, it didn't take long for problems to begin cropping up as they attempted to distribute the network to surrounding villages.

Proper rope tension could not be maintained over long distances, necessitating the placement of repeater stations (simple one-room huts) across the landscape, manned by a human operator who would listen to messages from one pot and relay them down the line toward the receiving end.

This all worked quite well — until a relay operator would miss an incoming message while outside the hut answering a call of nature or until a predator chewed through a transmission rope. Clearly not a perfect system, but improvements are ongoing.

Dr. Harvey Fallacious
November 25, 1834

The network is not reliable

Anyone with a cable or DSL modem knows how temperamental network connections can be. The Internet just stops working, and the only way to get it going again is to unplug it for 15 seconds. (Or, put another way, "Have you tried turning it off and on again?")

Thankfully, professional data centers have better equipment than consumer-grade modems, but problems still occur.

As a company that does reliable messaging, we've really heard it all. Don't worry, the names will be changed to protect the innocent.

A registered ISP had two routers: a primary and a backup. One day, the primary router malfunctioned. They switched to the backup, only to find that its routing tables had not been updated in a very long time. For many customers, that was the day the Internet died.

In another case, a project used an Oracle database. Everything worked great in the development environment, but in production there was an additional load balancer and firewall. Every once in a while, the load balancer would silently drop TCP connections to the database. These faulty connections continued to sit in a connection pool, so the next time somebody needed a connection, they would get an exception.

Just recently, on June 12, a single ISP in Asia broke the Internet for a big section of the world, creating Internet problems in Europe.

In general, you can't trust any network, no matter how local or global. Hardware, software, and security can all cause issues. This is codified in the 1st fallacy of distributed computing: the network is reliable.

This is especially problematic for HTTP communication or any request/response or remote procedure call (RPC) style of communication.

Consider the following simple web service call:

// Reads like a local method call, but Process() crosses the network
var svc = new MyService();
var result = svc.Process(data);

How do you handle an HttpTimeoutException? This exception is raised on the client side when no response arrives in time. It tells you that something went wrong but not what: without a response, you cannot know whether the server ever received the request.

Data can get lost when sent over the wire. It's possible that the web service call actually succeeded but the response got lost somewhere on the Internet. If the web service represents an idempotent operation (a process that can be repeated without any adverse side effects), then it can simply be retried. But what if that process charges a credit card?
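The lost-response case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: PaymentService and charge_with_naive_retry are hypothetical names, the "remote" service runs in-process, and a lost reply is simulated by raising TimeoutError after the work has already been done.

```python
class PaymentService:
    """Hypothetical stand-in for a remote web service (runs in-process here)."""

    def __init__(self, drop_responses=0):
        self.charges = []                     # what the "server" actually did
        self.drop_responses = drop_responses  # how many replies to lose

    def process(self, card, amount):
        self.charges.append((card, amount))   # the work happens first...
        if self.drop_responses > 0:
            self.drop_responses -= 1
            # ...then the reply is lost, so the client only sees a timeout.
            raise TimeoutError("no response received")
        return "ok"


def charge_with_naive_retry(svc, card, amount, attempts=3):
    """Retry on timeout without knowing whether an earlier attempt succeeded."""
    for _ in range(attempts):
        try:
            return svc.process(card, amount)
        except TimeoutError:
            continue  # cannot tell "never arrived" from "reply lost"
    raise TimeoutError("gave up after %d attempts" % attempts)


svc = PaymentService(drop_responses=1)
result = charge_with_naive_retry(svc, "card-123", 50)
print(result, len(svc.charges))  # prints: ok 2
```

The client sees a single successful result, yet the simulated server recorded two charges. A blind retry is therefore only safe when the operation is idempotent.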

Solutions

To provide a truly reliable system, you must accept that cross-network communication will not always be possible. Because we can't guarantee that an attempt at communication will be successful, we need to provide a facility to automatically retry after failures. To protect against a failure in the midst of a retry, we can use a pattern called store and forward. Instead of sending data directly to a remote server, we first store it in local storage and forward it from there. This way, if our own process crashes, we are ready to continue right where we left off when we boot up again. We can use transactions to ensure that we keep retrying until processing succeeds.
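As a rough sketch of store and forward, the following Python uses the standard sqlite3 module as the local storage. The table name outbox, the function names, and the send callback are all assumptions made for this illustration. A real messaging product does considerably more (ordering guarantees, backoff, poison-message handling).

```python
import sqlite3


def open_outbox(path):
    """Open (or create) the local store that survives a process restart."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    db.commit()
    return db


def store(db, body):
    """Step 1: record the message locally before touching the network."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))


def forward_pending(db, send):
    """Step 2: try to deliver each stored message, oldest first.

    A message is deleted only after send() returns without raising, so a
    failure or crash leaves it in the outbox for the next attempt.
    Returns the number of messages delivered.
    """
    delivered = 0
    rows = db.execute("SELECT id, body FROM outbox ORDER BY id").fetchall()
    for msg_id, body in rows:
        try:
            send(body)
        except ConnectionError:
            break  # network is down; keep the rest for later
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        delivered += 1
    return delivered
```

Calling forward_pending again after a restart picks up whatever is still in the outbox. Note that this sketch delivers at least once: a crash between send() and the DELETE means the same message is sent again on the next attempt.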

This rises above the level of a simple retry loop around a web service invocation, which would fail if the server it was running on crashed. We need additional infrastructure to make these guarantees.

Asynchronous messaging

There are many different technologies out there, called reliable messaging or message queuing systems, that solve these types of problems. On the Microsoft platform, the best known one is Microsoft Message Queuing (MSMQ), and on Azure, there are Azure Queue Storage and Azure Service Bus. Outside the Microsoft ecosystem, there are RabbitMQ, ActiveMQ, and ZeroMQ. As a rule of thumb, "MQ" at the end of a name is an indication that the product fits within this family of technologies.

These queuing technologies wrap up something like a web service call into an isolated, discrete unit of work called a message. The message queue employs store and forward to ensure that the message gets to where it needs to go. It can facilitate automatic retry, as message processing can be attempted over and over, even after a system crash. Some even support transactions, so that the message is only fully consumed if the business transaction is successful. This way, a guarantee can be made that each message is successfully processed exactly once.

Techniques exist to enable "exactly once" processing in environments like the cloud, where distributed transactions are not feasible, but that is outside the scope of this text.

Queuing technologies hold an additional advantage over a simple retry loop. For example, if we were attempting to create a customer and received an HTTP timeout, we would have no way of knowing if the server received the data and was simply unable to respond or if, rather, the data never arrived at all.

Retrying brings with it the possibility for server-side duplication. If we retry the attempt to create the customer, we may accidentally create the customer twice.

Message queues bring with them the concept of a message ID so that the server can decide whether an attempt is a retry or not. In essence, messaging allows deduplication on the server side.
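Server-side deduplication by message ID might look roughly like the Python below. CustomerServer and its in-memory set of seen IDs are assumptions for illustration only. A production system would need to persist the seen IDs together with the business data.

```python
import uuid


class CustomerServer:
    """Hypothetical receiver that remembers which message IDs it has handled."""

    def __init__(self):
        self.customers = []
        self.seen_ids = set()  # a real system would persist this with the data

    def handle(self, message_id, name):
        """Create the customer unless this message was already processed."""
        if message_id in self.seen_ids:
            return "duplicate ignored"
        self.customers.append(name)
        self.seen_ids.add(message_id)
        return "created"


server = CustomerServer()
message_id = str(uuid.uuid4())            # assigned once, when the message is created
first = server.handle(message_id, "Ada")
retry = server.handle(message_id, "Ada")  # same ID: recognized as a retry
print(first, "/", retry, "/", len(server.customers))
# prints: created / duplicate ignored / 1
```

The key point is that the ID is assigned when the message is created, not on each send attempt, so every retry carries the same ID.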

Abandon request/response

Asynchronous messaging requires a slight change in thinking because it does not provide the ability to do the traditional request/response seen in typical web service calls.

Sending a message to a message queue is a fire-and-forget operation. You drop a message in a queue, and eventually it makes its way to the server and the server will process it. You do not get an immediate return value on the next line of code.
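The fire-and-forget shape of a send can be shown with Python's standard queue and threading modules. This is an in-process stand-in only: a real message queue is durable and sits between separate machines, and the names send, worker, and outgoing are invented for this sketch.

```python
import queue
import threading

outgoing = queue.Queue()
processed = []


def worker():
    """Stand-in for the remote server: handles messages as they arrive."""
    while True:
        message = outgoing.get()
        if message is None:      # sentinel used here to stop the demo
            break
        processed.append(message.upper())


def send(message):
    """Fire and forget: enqueue and return immediately, with no result."""
    outgoing.put(message)


t = threading.Thread(target=worker)
t.start()
returned = send("create customer")   # returns None; nothing to inspect yet
outgoing.put(None)
t.join()                             # only for the demo: wait for the worker
print(returned, processed)           # prints: None ['CREATE CUSTOMER']
```

The caller gets nothing back from send. Any outcome it cares about has to arrive later by some other route, which is why the logical flow of the system has to change.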

Ultimately, by solving certain infrastructure issues, asynchronous messaging forces you to redesign the logical flow of your system.

This is the difficult leap of queuing technologies: not that they have an API that is difficult to use (they don't), but that they require letting go of more traditional request/response programming models.

Unfortunately, you can't just take a system using HTTP, plug in a queue, and ship it. It requires a significant redesign, and sometimes rewrite, of your system.

That can be scary, but the results are worth it.

Summary

Compared with a few decades ago, networks are fairly reliable — except for when they're not. As we continue to build larger and more globally distributed systems, we make ourselves susceptible to all the bad things that can happen.

In order to deal with this, we're going to have to move away from synchronous request/response-type programming. The object-oriented model of invoking a method on a remote object (known as remote procedure call, or RPC) tends to break down when the network is unreliable, putting our system into a non-deterministic state that is very difficult to get out of.

In the several decades since the creation of the first computer networks, we have been unable to completely solve the problem of network reliability. It stands to reason that this will not change in the next 5–10 years. We need to learn to build systems that will work in this environment today.


About the author: David Boike is a developer at Particular Software who first got into computers because he couldn't find a long enough string for his tin can phone.
