Optimizations to scatter-gather sagas

Written by Jan Ove Skogheim on October 7, 2020

Over the last several months we’ve been noticing that some of our customers were having trouble with the performance of some of their sagas. Invariably, the sagas that were at fault were examples of the scatter-gather pattern, and the ultimate culprit was contention at the storage level due to the optimistic concurrency strategy that was being used.

This is a fairly common pattern for sagas to implement, and we didn’t want our customers to have to change how they model their business process to get around a performance problem, so we decided to fix it.

In this article, I’ll explain what the scatter-gather pattern is, how a pure optimistic concurrency strategy created a problem, and what we’ve done to fix it for you.

🔗What is scatter-gather?

For a simplified example, let’s say you’re running an email campaign. You need a business process to send 1,000 emails, and at the end, report back how many were sent successfully. At a fundamental level, you want two things: do a bunch of operations and aggregate the results of each operation into a total result.

Scattering is easy. A message-based system like NServiceBus naturally excels at the scatter part:

public async Task Handle(SendEmailBatch message, IMessageHandlerContext context)
{
    foreach(var emailAddress in message.Addresses)
    {
        this.Data.MessagesSent = 0;
        this.Data.TotalMessages = message.Addresses.Count;

        await context.Send(new SendOneEmail
        {
            EmailAddress = emailAddress,
            Contents = message.Contents
        });
    }
}

Instead of calling out directly to the email server for each email in sequence, which runs the risk of a failure in the middle, we send a message for each email and let those messages be handled one-by-one elsewhere.

The problem comes when we want to aggregate the responses as they come in over time.

The gather portion looks like this:

public async Task Handle(SendOneEmailReply message, IMessageHandlerContext context)
{
    this.Data.MessagesSent++;

    if(this.Data.MessagesSent == messages.TotalMessages)
    {
        await context.Publish(new EmailBatchCompleted());
        this.MarkAsComplete();
    }
}

This works great until you have hundreds of SendOneEmailReply messages all competing to update MessagesSent at the same time.

If you have to update the saga state after each message, you need some way to make sure that multiple messages don’t update the state simultaneously. Our sagas use optimistic concurrency control for the saga instance to handle this. If you get two responses simultaneously, they will both try to update the saga, but as they try to persist the new state, only one will be allowed to do so. The other will detect a change and roll back its changes.

In the example of two messages, this is trivial. The first message completes, and the second rolls back. The second message immediately retries and completes successfully. All done.

An issue presents itself as the number of concurrent responses grows, though. Let us say you got 500 responses back simultaneously instead. In this high-load scenario, only one of them would succeed in updating the saga state—the other 499 would have to retry. Then another completes, forcing the additional 498 to retry again. And so on, until all of the responses got a chance to update the saga state with an incremented count.

In theory, the last response could be forced to retry many times in the most unlucky scenario. In practice, your retry policy details would dictate the exact outcome in a real-world scenario. However, any reasonable retry policy would result in lots of retries scheduled, and a massive number of messages moved to the error queue as their allowed number of retries was exhausted.

As you increase the number of scatter messages, you get more responses. More responses lead to more gather collisions. More collisions lead to more rollbacks and retries. More retries and rollbacks put pressure on the messaging and saga persistence infrastructure. The result is increased load, decreased message throughput, and lots of messages in the error queue. Not a great situation overall.

🔗When being pessimistic is good

Where optimistic concurrency is letting everyone through the door and hoping a fight won’t break out inside, pessimistic concurrency is like a bouncer making sure only one person is allowed access until their business is concluded.

With pessimistic locking, you start by locking an entity for exclusive use until you finish. At that point, business data updates and the lock releases at the same time. Since the lock is done upfront and is exclusive, no one else can lock it and start their work until they can claim it themselves.

In our scatter-gather saga, this means that once one message locks the saga, all the other messages have to wait—they can’t even get in the front door. No more collisions, no more rollbacks, no storm of retries, and much less stress on the messaging and persistence infrastructure.

The drawback of pessimistic locking is the locks carry a cost in the database. In some database architecures, the database can decide to optimize too many locks by combining them. For example, in SQL Server, a row lock can be escalated to a page lock or full table lock, effectively locking many more things than needed in the name of resource optimization.

Optimistic locking is preferable when you expect very few write collisions since the average case, a non-collision, costs very little with no up-front locking necessary. But in optimistic locking, the cost of a collision is high. This is why sagas were originally designed to use optimistic locking.

So the more contention you have, the more you should favor pessimistic locking, as the fixed cost of the upfront lock is small compared to dealing with floods of rollbacks and retries.

Since the gather part of the scatter-gather pattern is high contention by its nature, implementing it using a pessimistic locking scheme makes sense.

🔗What we’ve done

To better support scatter-gather saga implementations, we’ve changed most of our persistence options to use pessimistic locks for saga updates.

SQL Persistence uses a combination of both optimistic concurrency and pessimistic locking in version 4.1.1 and higher. Because of slight differences in lock implementations of the supported database engines, we can not trust a pessimistic lock alone. We use optimistic concurrency for correctness, but also add a pessimistic lock to avoid the retry storms and improve the performance in high-contention scenarios. SQL Persistence does not support switching to optimistic concurrency only.
NHibernate Persistence has been using the same “optimistic for correctness, pessimistic for performance” combination as mentioned above since version 4.1.0. We think this is a sensible default for most scenarios, but it is possible to switch to pure optimistic concurrency by adjusting the lock mode.
MongoDB Persistence uses pessimistic locking by default in versions 2.2.0 and above. It does not support switching to optimistic concurrency only.
Service Fabric Persistence uses pessimistic locking (called an exclusive lock) by default in versions 2.2 and above. It does not support switching to optimistic concurrency only.

For all of these packages, you only need to update to the most recent version to get the benefits of pessimistic locking in your scatter-gather sagas.

RavenDB Persistence also offers pessimistic locking in versions 6.4 and above. However, because RavenDB does not provide a pessimistic lock option natively, you must currently configure the persister to use pessimistic locking instead of the default optimistic concurrency control for sagas. We plan to default to pessimistic locking for RavenDB in a future version, but for now, you must enable it explicitly:

var persistence = endpointConfiguration.UsePersistence<RavenDBPersistence>();
var sagasConfig = persistence.Sagas();
sagasConfig.UsePessimisticLocking();

You also have the option of tweaking locking options for RavenDB like the lease time and acquisition timeout. See our docs section on sagas pessimistic locking for details.

🔗Keep in mind

Whether or not a saga employs the scatter-gather pattern, and no matter what locking strategy it uses, it’s still important to design the saga to minimize locking.

A saga is a message-driven state machine, and so any saga handler should focus on using the data in the incoming message together with the stored saga state to make decisions (the business logic) and then emit messages: either by sending commands, publishing events, or requesting timeouts.

A well-designed saga should not access another database, call a web service, or do anything else external to the saga other than emitting messages. Doing so causes the lock on the saga data to remain open for a longer period of time, raising the likelihood that another message needing to access the same saga will cause contention.

If these types of things need to occur, send a message to an external message handler, where there is no lock held open for the saga data. That handler can do the work and then return the results to the saga using another message.

🔗Summary

While it is possible to address the performance of scatter-gather sagas while still using optimistic locking (such as dividing up a large batch into smaller batches first) we wanted to provide an experience to our customers that would Just Work® without causing performance bottlenecks.

To get started with the new pessimistic locking, you probably only need to update your persister. See the links for each persistence in the section above.

If you want to get more of a general overview, the saga concurrency article in our documentation is a good starting place.

And if you would like to learn more about how to model long-running business processes using sagas, check out our saga tutorials.

Share on Twitter

About the author

Jan Ove Skogheim is a developer at Particular Software with scattered thoughts about distributed systems design that are sometimes gathered.