When distributed systems get frustrated

Written by Michał Wójcik, Szymon Pobiega, and William Brander on January 19, 2022

One of the greatest ever contributions to video games was the invention of the pause button. There are some sequences, the ones that require absolutely perfect timing, where your repeated failure can make you so frustrated you just need to pause, walk away, and try again later until you get it right.

[Image: Screenshot of modified Super Mario Bros. level 1-1 with fire rods everywhere. Caption: I'm not sure that the pause button is going to help on this one.]

Distributed systems can “get frustrated” too. One tiny thing goes wrong, and suddenly every message starts to fail. If you’re lucky, it just throws a bunch of errors. If you’re unlucky, it goes into a tight loop of failure that results in a hefty bill from your cloud provider.

Failure on repeat

Distributed software systems built with NServiceBus are pretty great about dealing with all kinds of failure. From transient to systemic exceptions and everything in-between, systems built on reliably exchanging messages are resistant to failure, but that doesn’t prevent issues from occurring.

If your message handler relies on a 3rd-party service, and that service is not available, NServiceBus will keep retrying that message and won’t lose the data contained in the message. Eventually, once its retries are exhausted, that message gets sent to an error queue. And if you have high message throughput, you might have a lot of messages headed for an error queue, all of which are going to need to be replayed later.

In cloud environments, this can be especially problematic. (Read: expensive!) Every single attempt to process a message costs money, which is silly when we already know trying to process messages is going to be futile until the 3rd-party service is available again. You literally have to pay money to accomplish nothing.
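For context, the retry-then-error-queue behavior described above is driven by the endpoint's recoverability settings. The sketch below shows one way it might be configured, assuming NServiceBus 7.x. The retry counts, the delay, the queue name, and the `RecoverabilityConfig` helper class are illustrative assumptions, not recommendations from this post.

```csharp
using System;
using NServiceBus;

// Hypothetical helper class, named here only for illustration.
public static class RecoverabilityConfig
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        // Retry a failing message right away a few times (illustrative value).
        recoverability.Immediate(immediate => immediate.NumberOfRetries(3));

        // Then retry with an increasing delay between rounds (illustrative values).
        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(2);
            delayed.TimeIncrease(TimeSpan.FromSeconds(10));
        });

        // Once all retries are used up, the message is moved to this queue.
        endpointConfiguration.SendFailedMessagesTo("error");
    }
}
```

With settings like these, each failing message is attempted several times before it reaches the error queue, and in the cloud every one of those attempts is billed.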

What to do with consecutive failures

NServiceBus 7.6 now tracks the number of consecutive failures and lets you take action to minimize their effect on the system.

For example, if your message handlers are all calling an unavailable 3rd-party service, all messages will likely fail. After, say, 10 consecutive failures, it should be clear that something more serious is going on.¹

Now you can change how messages are processed to prevent flooding the error queue. After enough consecutive failures, the endpoint can enter a throttled mode in which NServiceBus attempts to process only one message at a time, at a rate you specify.

It’s just like pushing pause in a video game and walking away. Instead of trying over and over as fast as possible, the endpoint walks away for a while. Instead of sending every message to the error queue, the system attempts a single message at the interval you configure (once per second, for example) to see if the situation has improved.

It doesn’t have to be a long wait—trying one message every few seconds usually works pretty well. The critical point is that when the system becomes fully operational again, there are only a handful of messages in the error queue instead of hundreds or thousands.

And if you’re in the cloud, the system didn’t just spend a small fortune chasing its tail.
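The switch between normal and throttled processing can be pictured as a small state machine. The following is a conceptual sketch only, not NServiceBus's actual implementation. The `ConsecutiveFailureTracker` class and its members are hypothetical names invented for this illustration.

```csharp
using System;

// Conceptual model only: NOT the NServiceBus implementation.
public sealed class ConsecutiveFailureTracker
{
    private readonly int threshold;
    private int consecutiveFailures;

    public ConsecutiveFailureTracker(int threshold)
    {
        if (threshold < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(threshold));
        }

        this.threshold = threshold;
    }

    // True while the endpoint should process one message at a time, with a delay.
    public bool IsThrottled { get; private set; }

    // Call once after every processing attempt.
    public void Record(bool succeeded)
    {
        if (succeeded)
        {
            // A single success resets the counter and ends throttling.
            consecutiveFailures = 0;
            IsThrottled = false;
            return;
        }

        consecutiveFailures++;
        if (consecutiveFailures >= threshold)
        {
            IsThrottled = true;
        }
    }
}
```

In this model, a tracker created with a threshold of 3 reports `IsThrottled` as true after three failures in a row, and the next success switches it back to false.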

Rate limiting on consecutive failures

To enable the throttled one-message-at-a-time processing mode, we have introduced an API on the RecoverabilitySettings class:

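var recoverability = endpointConfiguration.Recoverability();

// Switch to rate-limited mode after 10 consecutive failures
recoverability.OnConsecutiveFailures(10,
  new RateLimitSettings(
    // Delay between processing attempts while rate limiting is active
    timeToWaitBetweenThrottledAttempts: TimeSpan.FromSeconds(1),
    // Callbacks invoked when rate limiting starts and stops
    onRateLimitStarted: () => Console.Out.WriteLineAsync("Rate limiting started"),
    onRateLimitEnded: () => Console.Out.WriteLineAsync("Rate limiting stopped")));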

With this setting, the endpoint will switch to a rate-limited mode after it experiences 10 consecutive failures. By default, this mode will change the endpoint concurrency to 1 and wait 1 second after each attempt. However, as soon as a single message is processed successfully, the endpoint will revert to the regular processing mode with the previous concurrency setting and no delay after attempts.

The RateLimitSettings class allows you to configure the delay between processing attempts and take action when rate-limiting starts and stops.

The exact settings you use depend on the circumstances for each endpoint. How many consecutive failures should determine a persistent failure state? And how often do you want to check to see if things have improved? That’s up to you.
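As one possible answer to those questions, here is a sketch of a more conservative setup that also writes to the NServiceBus log instead of the console. It assumes NServiceBus 7.6, where the callbacks are parameterless and return a `Task`; other major versions may use different callback signatures. The threshold of 20, the 5-second wait, the concurrency of 8, the logger name, and the `RateLimitConfig` helper class are all illustrative assumptions.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Logging;

// Hypothetical helper class, named here only for illustration.
public static class RateLimitConfig
{
    static readonly ILog Log = LogManager.GetLogger("RateLimiting");

    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        // Normal-mode concurrency (illustrative value). The endpoint returns
        // to this setting once a message is processed successfully.
        endpointConfiguration.LimitMessageProcessingConcurrencyTo(8);

        var recoverability = endpointConfiguration.Recoverability();

        // Tolerate more failures before throttling, then probe less often.
        recoverability.OnConsecutiveFailures(20,
            new RateLimitSettings(
                timeToWaitBetweenThrottledAttempts: TimeSpan.FromSeconds(5),
                onRateLimitStarted: () =>
                {
                    Log.Warn("Rate limiting started after 20 consecutive failures");
                    return Task.CompletedTask;
                },
                onRateLimitEnded: () =>
                {
                    Log.Info("Rate limiting ended: a message was processed successfully");
                    return Task.CompletedTask;
                }));
    }
}
```

A higher threshold avoids throttling on short bursts of bad messages, while a longer wait reduces the number of paid-for attempts during a long outage. Which trade-off fits depends on the endpoint.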

Summary

When you get frustrated in a video game, sometimes the best thing is to pause the game and walk away. Then, a bit later, you come back more relaxed, pick up the controller, and nail it on the first try.

NServiceBus 7.6 lets you do the same thing with your distributed system. Instead of going into “failure on repeat” mode and generating a big cloud resource bill, NServiceBus can now notice the repeated failures and push pause, patiently waiting until conditions improve, and then it’s back to normal.

NServiceBus 7.6 is available now. You can download NServiceBus 7.6 from NuGet, read the release notes, or check out the automatic rate-limiting documentation.

Just remember it’s healthy to step away once in a while.


About the authors

Michał Wójcik used to play multiplayer online battle arena games until he discovered that his enemies were sending deliveries to his door during the most important fights. With his gaming career over, he focuses on important engineering problems like tabs vs. spaces.

Szymon Pobiega always preferred turn-based games. He hopes one day physicists prove that time is in fact discrete and settle the turn-based vs. real-time strategy debate once and for all.

William Brander grew up with parents who never understood that online games can't be paused. This may or may not have directly led to him wanting to allow his production systems to pause.

¹ It would also be good to be notified that the web service the endpoint depends on is having issues. Check out our sample on how to monitor 3rd-party systems with custom checks to see how this is done.
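As a rough idea of what such a check can look like, here is a minimal sketch. It assumes the NServiceBus.CustomChecks package in a version compatible with NServiceBus 7; newer versions of the package may declare `PerformCheck` with a different signature. The `ThirdPartyServiceCheck` class name, the check id and category, the 10-second interval, the 3-second timeout, and the health URL are all hypothetical placeholders.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using NServiceBus.CustomChecks;

// Hypothetical check; names, URL, and intervals are placeholders.
public class ThirdPartyServiceCheck : CustomCheck
{
    // Placeholder address of the service the endpoint depends on.
    const string Url = "https://third-party.example/health";

    static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(3)
    };

    public ThirdPartyServiceCheck()
        : base("ThirdPartyServiceCheck", "3rd-party services", TimeSpan.FromSeconds(10))
    {
    }

    public override async Task<CheckResult> PerformCheck()
    {
        try
        {
            using (var response = await Client.GetAsync(Url).ConfigureAwait(false))
            {
                // Any non-success status code is reported as a failed check.
                return response.IsSuccessStatusCode
                    ? CheckResult.Pass
                    : CheckResult.Failed($"{Url} returned {(int)response.StatusCode}");
            }
        }
        catch (Exception exception)
        {
            // Timeouts and connection errors also count as a failed check.
            return CheckResult.Failed($"{Url} is unreachable: {exception.Message}");
        }
    }
}
```

Reporting of custom check results has to be enabled on the endpoint separately; the sample mentioned above covers that part.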