
What does idempotent mean?

When you start to accumulate home theater components to go with your TV, you end up with a bunch of remote controls. One remote for the TV, one for the receiver, and one for the DVD/Blu-Ray player. One for a cable box, satellite, or set-top streaming box. Maybe you've got a multiple-disc CD player, or even a turntable. (Vinyl is making a comeback!) Maybe you've even still got a VCR? No, I'm kidding, nobody has those anymore.

The solution, rather than dealing with a stack of remotes, is to get one universal remote to rule them all. But once you do, you find that not all components are created equal, at least when it comes to the infrared codes (the actual signals sent by the remote control) that they accept.

A cheap A/V receiver might accept codes like PowerToggle or InputNext. These make it nearly impossible to reliably turn the unit on and set the right input, especially if someone has messed with the receiver manually.

A good receiver will have additional codes for PowerOn and PowerOff, even though those buttons don’t exist on the remote, and will also have an independent code for each input, like InputCable, InputSatellite, and InputBluRay.

This is the essence of idempotence. No matter how many times your universal remote sends the PowerOn and InputBluRay commands, the receiver will still be turned on and ready to watch Forrest Gump. PowerToggle and InputNext, however, are not idempotent. If you repeat those commands multiple times, the component will be left in a different (and unknown) state each time.
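The distinction can be sketched in a few lines of Python. (The `Receiver` class and its method names are invented for illustration; think of each method call as one infrared code arriving.)

```python
class Receiver:
    """A toy A/V receiver exposing both styles of infrared command."""

    INPUTS = ["TV", "Cable", "Satellite", "BluRay"]

    def __init__(self):
        self.powered_on = False
        self.current_input = 0

    # Idempotent commands: repeating them leaves the receiver in the same state.
    def power_on(self):
        self.powered_on = True

    def input_blu_ray(self):
        self.current_input = self.INPUTS.index("BluRay")

    # Non-idempotent commands: every repeat leaves a different state.
    def power_toggle(self):
        self.powered_on = not self.powered_on

    def input_next(self):
        self.current_input = (self.current_input + 1) % len(self.INPUTS)


receiver = Receiver()

# A flaky universal remote repeats every command three times.
for _ in range(3):
    receiver.power_on()
    receiver.input_blu_ray()
# State is predictable: powered on, input set to BluRay.
state_after_idempotent = (receiver.powered_on, Receiver.INPUTS[receiver.current_input])

for _ in range(3):
    receiver.power_toggle()
    receiver.input_next()
# State now depends entirely on how many repeats happened.
state_after_toggles = (receiver.powered_on, Receiver.INPUTS[receiver.current_input])
```

No matter how many times the first loop runs, the result is the same; the second loop lands somewhere different for every repeat count.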

So what?

The concept of idempotence is extremely important in distributed systems because it’s hard to get really strong guarantees over how many times a command will be invoked or processed.

Because networks are fundamentally unreliable, most distributed systems cannot guarantee exactly-once message delivery or processing, even when using a message broker like RabbitMQ or SQS. Most brokers offer at-least-once delivery instead, relying on the consuming endpoint to retry processing as many times as necessary until it acknowledges that message processing is complete.

That means if a message fails to process for any reason, it’s going to be retried. Let's say we have a message handler like the A/V receiver above. If the message is unambiguous like InputBluRay then it's fairly easy to write code that will handle it as the user intended, no matter how many times the message is reprocessed. On the other hand, if the message is InputNext, it can be very difficult to write logic to fulfill the user intent under conditions of unknown numbers of retries.

In short, if every single message handler in our system is idempotent, we can retry any message as many times as we want, and it won’t affect the overall correctness of the system.

That sounds great, so why don’t we do that?

Idempotence is hard

Imagine you need to do a fairly simple operation: create a new user in the database, and then publish a UserCreated event to let other parts of the system know what happened.

Seems simple enough, let’s try some pseudocode:

Handle(CreateUser message)
{
  DB.Store(new User());
  Bus.Publish(new UserCreated());
}

This looks good in theory, but what if your message broker doesn’t support any form of transaction? (Spoiler alert: most don’t!) If a failure occurs between these two lines of code, then the database record will be created, but the UserCreated message won't be published. When the message is retried, a new database record will be written, and then the message will be published.

These extra zombie records are created in the database, most of the time duplicating valid records, without any message ever going out to the rest of the system. It can be difficult to even notice this happening, and even more difficult to clean up the mess later on.
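The zombie-record failure is easy to reproduce in miniature. In this sketch, sqlite3 stands in for the real database and a plain list stands in for the message broker; the table, the handler, and the simulated crash are all invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Users (Id integer primary key autoincrement, Name text)")
published = []  # stands in for the message broker


def handle_create_user(name, crash_before_publish=False):
    db.execute("insert into Users (Name) values (?)", (name,))
    db.commit()
    if crash_before_publish:
        raise RuntimeError("power cable pulled")  # simulated failure
    published.append({"event": "UserCreated", "name": name})


# The first attempt fails after the database write but before the publish...
try:
    handle_create_user("alice", crash_before_publish=True)
except RuntimeError:
    pass

# ...so the broker redelivers the message and the handler runs again.
handle_create_user("alice")

user_count = db.execute("select count(*) from Users").fetchone()[0]
# Two rows for one logical user: the first is a zombie record,
# yet only one UserCreated event was ever published.
```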

So this should be easy to fix, right? Let's just flip the order of the statements to eliminate the zombie records:

Handle(CreateUser message)
{
    Bus.Publish(new UserCreated());
    DB.Store(new User());
}

Now we've got the inverse problem. If something bad happens after the publish but before the database call, we produce a ghost message, an announcement to the rest of our system about an event that never really happened. If anyone tries to look up that User, they won't find it, because it was never created. But the rest of the downstream processes continue to run on that message, perhaps even billing credit cards but failing to actually ship orders!

If you believe transactions will save you, think again. Wrapping the entire message handler in a database transaction only reorders all of the database operations to the end of the process, where the Commit() occurs. Effectively, a database transaction will turn the first code snippet (with the database first) into the second snippet (with the database last) when it executes.
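The reordering is easy to demonstrate: even though the code stores the record "before" it publishes, the row only becomes durable at commit, so a crash between the publish and the commit still produces a ghost message. (As before, sqlite3 stands in for the database and a list for the broker; all names are invented.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Users (Id integer primary key autoincrement, Name text)")
published = []  # stands in for the message broker


def handle_create_user(name, crash_before_commit=False):
    # The insert is part of a database transaction, so it is only
    # durable once commit() runs at the end of the handler.
    db.execute("insert into Users (Name) values (?)", (name,))
    # The broker call is NOT part of that transaction.
    published.append({"event": "UserCreated", "name": name})
    if crash_before_commit:
        db.rollback()  # simulate the process dying before the commit
        raise RuntimeError("power cable pulled")
    db.commit()


try:
    handle_create_user("alice", crash_before_commit=True)
except RuntimeError:
    pass

user_count = db.execute("select count(*) from Users").fetchone()[0]
# The event went out, but the row it announces was never committed.
```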

When designing a reliable system you want to think in terms of what would happen if, on any given line of code, someone pulled out the server's power cable. There are three operations at play here (receiving the incoming message, the database operation, and sending the outgoing message), and due to the lack of a common transaction it's very difficult to code this without the possibility for zombie records, ghost messages, and other gotchas.

If only there were a better way…

The Outbox pattern

What we need is database-like consistency between our messaging operations (both consuming the incoming message, and sending outgoing messages) and our changes to business data in our database. Using the Outbox pattern, we can piggyback on a local database transaction to do just that, and turn the message broker's at-least-once delivery guarantee into an exactly-once processing guarantee.

To implement the outbox pattern, the message handling logic is divided into two phases: the message handler phase, and the dispatch phase.

During the message handler phase, we don't immediately dispatch outgoing messages to the message broker; instead, we store them in memory until the end of the message handler. At that point, we store all accumulated outgoing messages to a database table, in the same transaction as our business data, using the MessageId as the primary key.

insert into OutboxData (MessageId, TransportOperations)
values (@MessageId, @OutgoingMessagesJson)

This data is committed to the database in the very same transaction as the business data. This concludes the message handler phase.

Next, in the dispatch phase, all the outgoing messages stored in the Outbox data are physically sent to the message broker, and the incoming message is consumed. However, it's still possible for a problem to occur at this point, and for only some of the messages to be dispatched, forcing us to try again.

This means the dispatch phase can generate duplicate outgoing messages, but that's by design.

The Outbox pattern is paired with an Inbox: whenever a message arrives (including duplicates, and retries of messages that failed during the dispatch phase), the Outbox data for its MessageId is retrieved from the database first. If it exists, the message has already been successfully processed, so we skip the message handler phase and proceed directly to the dispatch phase. If the message is a duplicate whose outgoing messages have already been dispatched, the dispatch phase is skipped as well.

Expressed in pseudocode, the entire Outbox+Inbox process looks like this:

var message = PeekMessage();

// Check for deduplication data
var outbox = DB.GetOutboxData(message.Id);

// Message Handler Phase
if(outbox == null)
{
  using(var transaction = DB.StartTransaction())
  {
    var transportOperations = ExecuteMessageHandler(message);

    outbox = new OutboxData(message.Id, transportOperations);
    DB.StoreOutboxData(outbox);
    transaction.Commit();
  }
}

// Dispatch Phase
if(!outbox.IsDispatched)
{
  Bus.DispatchMessages(outbox.TransportOperations);

  DB.SetOutboxAsDispatched(message.Id);
}

AckMessage(message);
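The pseudocode above translates into runnable Python fairly directly. In this sketch, sqlite3 stands in for both the business database and the outbox table, a list stands in for the broker, and every name is invented for illustration; a production implementation would also need to evict old outbox records eventually.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Users (Id text primary key, Name text)")
db.execute("create table OutboxData (MessageId text primary key, "
           "TransportOperations text, IsDispatched integer default 0)")
dispatched = []  # stands in for the message broker


def process(message, crash_before_dispatch=False):
    """One attempt at handling an incoming message, outbox-style."""
    row = db.execute(
        "select TransportOperations, IsDispatched from OutboxData "
        "where MessageId = ?", (message["id"],)).fetchone()

    # Message handler phase: runs only if this MessageId has never been seen.
    if row is None:
        outgoing = [{"event": "UserCreated", "name": message["name"]}]
        with db:  # ONE local transaction for business data AND outbox data
            db.execute("insert into Users (Id, Name) values (?, ?)",
                       (message["id"], message["name"]))
            db.execute("insert into OutboxData (MessageId, TransportOperations) "
                       "values (?, ?)", (message["id"], json.dumps(outgoing)))
        row = (json.dumps(outgoing), 0)

    if crash_before_dispatch:
        raise RuntimeError("power cable pulled")  # simulated failure

    # Dispatch phase: skipped entirely for fully processed duplicates.
    if not row[1]:
        dispatched.extend(json.loads(row[0]))
        db.execute("update OutboxData set IsDispatched = 1 "
                   "where MessageId = ?", (message["id"],))
        db.commit()
    # Acknowledge the incoming message with the broker here (elided).


message = {"id": "msg-1", "name": "alice"}

# The first attempt crashes after the handler phase but before dispatch...
try:
    process(message, crash_before_dispatch=True)
except RuntimeError:
    pass

# ...and however many times the broker redelivers, the outcome is the same:
for _ in range(3):
    process(message)

user_count = db.execute("select count(*) from Users").fetchone()[0]
```

Despite one crash and three redeliveries, exactly one user row is stored and exactly one UserCreated event goes out.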

Using this pattern, we get idempotence on the handling side of the equation, provided you can tell a duplicate just by looking at the MessageId. After all, if you deliberately pressed PowerToggle multiple times, each press would be a distinct message with its own MessageId, not a duplicate at all. In truth, an operation like PowerToggle is inherently not idempotent, and there's nothing the infrastructure can do to help with that, but that's a topic for another post.

Summary

Idempotence is an important attribute of distributed systems but can be tricky to implement reliably. The bugs that crop up as a result of doing it wrong are often easy to overlook, and then difficult to diagnose, appearing to be race conditions that are impossible to reproduce under any sort of controlled conditions.

It's much easier to use infrastructure like the Outbox, which piggybacks on the local database transaction already in use for storing business data, and uses that transaction to build consistency between incoming/outgoing messaging operations and the business data being stored in the database.

If you're interested in taking a look at this yourself, check out our Using Outbox with RabbitMQ sample, which shows how to get exactly-once message processing using RabbitMQ for the message queuing infrastructure and SQL Server for the business data storage. Don't worry if you don't have RabbitMQ or SQL Server installed: the sample includes a docker-compose file with instructions so that you can run all the dependencies in Docker containers.


About the author: David Boike is a developer at Particular Software who refuses to juggle multiple remote controls.
