When distributed systems get frustrated
One of the greatest ever contributions to video games was the invention of the pause button. There are some sequences, those that require absolutely perfect timing, where repeated failure can make you so frustrated that you just need to pause, walk away, and come back later to try again until you get it right.
Distributed systems can “get frustrated” too. One tiny thing goes wrong, and suddenly every message starts to fail. If you’re lucky, it just throws a bunch of errors. If you’re unlucky, it goes into a tight loop of failure that results in a hefty bill from your cloud provider.
Failure on repeat
Distributed software systems built with NServiceBus are pretty great about dealing with all kinds of failure. From transient to systemic exceptions and everything in-between, systems built on reliably exchanging messages are resistant to failure, but that doesn’t prevent issues from occurring.
If your message handler relies on a 3rd-party service, and that service is not available, NServiceBus will keep retrying that message and won’t lose the data contained in the message. Eventually, that message gets sent to an error queue. And if you have high message throughput, you might have a lot of messages headed for an error queue, all of which are going to need to be replayed later.
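As a rough sketch of the retry behavior described above, the retry counts and the error queue can be set through the recoverability configuration. The `RetrySetup` class, the retry numbers, the ten-second increase, and the `error` queue name are illustrative assumptions, not values taken from this article.

```csharp
using System;
using NServiceBus;

public static class RetrySetup
{
    // Hypothetical helper: applies example retry settings to an endpoint.
    public static void Configure(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        // Retry a failed message immediately a few times (example value).
        recoverability.Immediate(immediate => immediate.NumberOfRetries(3));

        // Then retry with an increasing delay between rounds (example values).
        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(2);
            delayed.TimeIncrease(TimeSpan.FromSeconds(10));
        });

        // Once all retries are exhausted, the message moves to this queue.
        endpointConfiguration.SendFailedMessagesTo("error");
    }
}
```

With settings like these, every message that arrives during an outage runs through all of its retries before it lands in the error queue, which is where the wasted attempts come from.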
In cloud environments, this can be especially problematic. (Read: expensive!) Every single attempt to process a message costs money, which is silly when we already know trying to process messages is going to be futile until the 3rd-party service is available again. You literally have to pay money to accomplish nothing.
What to do with consecutive failures
NServiceBus 7.6 now tracks the number of consecutive failures and lets you take action to minimize their effect on the system.
For example, if your message handlers are all calling an unavailable 3rd-party service, all messages will likely fail. After, say, 10 consecutive failures, it should be clear that something more serious is going on. [1]
Now you can change how messages are processed to prevent flooding the error queue. After enough consecutive failures, you can now enter a throttled mode where NServiceBus will only attempt to process one message at a time at a rate you specify.
It’s just like pushing pause in a video game and walking away. Instead of trying over and over as fast as possible, the endpoint walks away for a while. Instead of sending every message to the error queue, the system attempts one message per second to see if the situation has improved.
It doesn’t have to be a long wait—trying one message every few seconds usually works pretty well. The critical point is that when the system becomes fully operational again, there are only a handful of messages in the error queue instead of hundreds or thousands.
And if you’re in the cloud, the system didn’t just spend a small fortune chasing its tail.
Rate limiting on consecutive failures
To enable the throttled one-message-at-a-time processing mode, we have introduced an API on the recoverability settings:
var recoverability = endpointConfiguration.Recoverability();
recoverability.OnConsecutiveFailures(10,
    new RateLimitSettings(
        timeToWaitBetweenThrottledAttempts: TimeSpan.FromSeconds(1),
        onRateLimitStarted: () => Console.Out.WriteLineAsync("Rate limiting started"),
        onRateLimitEnded: () => Console.Out.WriteLineAsync("Rate limiting stopped")));
With this setting, the endpoint will switch to a rate-limited mode after it experiences 10 consecutive failures. By default, this mode will change the endpoint concurrency to 1 and wait 1 second after each attempt. However, as soon as a single message is processed successfully, the endpoint will revert to the regular processing mode with the previous concurrency setting and no delay after attempts.
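To make the mode switch concrete, here is a minimal model of the rule just described: enter rate-limited mode after a threshold of consecutive failures, and leave it on the first success. This is only an illustration of the idea, not how NServiceBus implements it. The `ConsecutiveFailureTracker` class and its members are hypothetical names.

```csharp
using System;

// Illustrative model only; not NServiceBus's actual implementation.
// Not thread-safe: a real endpoint would need synchronization.
public sealed class ConsecutiveFailureTracker
{
    private readonly int threshold;
    private int consecutiveFailures;

    public ConsecutiveFailureTracker(int threshold)
    {
        if (threshold < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(threshold));
        }
        this.threshold = threshold;
    }

    // True while the model is in the throttled, one-at-a-time mode.
    public bool IsRateLimited { get; private set; }

    // Records a failed attempt. Returns true only for the failure
    // that switches the tracker into rate-limited mode.
    public bool RecordFailure()
    {
        consecutiveFailures++;
        if (!IsRateLimited && consecutiveFailures >= threshold)
        {
            IsRateLimited = true;
            return true;
        }
        return false;
    }

    // Records a successful attempt and resets the failure count.
    // Returns true only if this success ended rate-limited mode.
    public bool RecordSuccess()
    {
        consecutiveFailures = 0;
        if (IsRateLimited)
        {
            IsRateLimited = false;
            return true;
        }
        return false;
    }
}
```

In this model the two `true` return values correspond to the moments when the rate-limit started and ended callbacks would fire. A success that arrives before the threshold is reached simply resets the count, so occasional failures never trigger throttling.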
The RateLimitSettings class allows you to configure the delay between processing attempts and take action when rate limiting starts and stops.
The exact settings you use depend on the circumstances for each endpoint. How many consecutive failures should determine a persistent failure state? And how often do you want to check to see if things have improved? That’s up to you.
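One way to reason about the delay is to estimate how many processing attempts an outage would cost with and without rate limiting. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the `ThrottleEstimate` class is hypothetical, the unthrottled attempt rate is a number you would have to measure for your own endpoint, and the estimate ignores both the failures that occur before the threshold is reached and the time each attempt takes.

```csharp
using System;

// Hypothetical helper for rough estimates; not part of NServiceBus.
public static class ThrottleEstimate
{
    // Upper bound on attempts while rate limited:
    // at most one attempt per delay interval.
    public static long MaxThrottledAttempts(TimeSpan outage, TimeSpan delayBetweenAttempts)
    {
        if (delayBetweenAttempts <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(delayBetweenAttempts));
        }
        return outage.Ticks / delayBetweenAttempts.Ticks;
    }

    // Rough attempt count without rate limiting, given an assumed
    // steady rate of processing attempts per second.
    public static long UnthrottledAttempts(TimeSpan outage, double attemptsPerSecond)
    {
        return (long)(outage.TotalSeconds * attemptsPerSecond);
    }
}
```

For a ten-minute outage, a one-second delay allows at most 600 attempts and a five-second delay at most 120. An endpoint that would otherwise make an assumed 200 attempts per second would burn through 120,000 attempts in the same period.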
When you get frustrated in a video game, sometimes the best thing is to pause the game and walk away. Then, a bit later, you come back more relaxed, pick up the controller, and nail it on the first try.
NServiceBus 7.6 lets you do the same thing with your distributed system. Instead of going into “failure on repeat” mode and generating a big cloud resource bill, NServiceBus can now notice the repeated failures and push pause, patiently waiting until conditions improve, and then it’s back to normal.
Just remember it’s healthy to step away once in a while.
[1] It would also be good to be notified that the web service the endpoint depends on is having issues. Check out our sample on how to monitor 3rd-party systems with custom checks to see how this is done.