Death to the batch job

There's something dangerous lurking in your software. Not just the general lurking, murky, ickiness you might expect. Oh no, it's much worse than that. It's something specific. Something big; something ugly. There might even be more than one. It can't decide if it's angry or hungry or both. All it knows is it's having a very bad day.

And it's going to eat you.

Maybe it won't literally eat you, but it will come after your family time, your sleep, and your sanity. Only the things you really care about.

Because it's a special kind of ogre called a batch job, and it just failed.

Lipstick on a pig

Of course, we don't call them batch jobs anymore. We call them scheduled tasks. But this is just window dressing; they're still the same thing. Call it what you will, a batch job is a great big ogre with a great big club. It comes around in the middle of the night and takes its great big club and bashes it against your great big database.

And tonight, while you were sleeping, that batch job failed. Someone had insomnia and was actually using the system in the middle of the night, so there was a deadlock, and the batch job failed somewhere in the middle of updating 74 million records.

You're on pager duty this week, so you got woken up at 3am to deal with it. You get to figure out which row it failed on (how?), why it failed, correct the issue, then start the job again from where it left off, because if you have to start from the beginning, it won't get done before peak hours in the morning.

If you're lucky, this might be the only failure, and you might actually get some sleep tonight.

A typical job

We need to update our customers' Preferred Customer status every night if they've purchased $5,000 worth of products from us over the past year. If they have, then we give them special discounts and free shipping and stuff. It's our little way of saying "thank you."

Since batch jobs run in the middle of the night when there's virtually no traffic, we can probably get away with a SELECT N+1 query or two, right? There's no way the DBA will notice.

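// Only orders placed within the last 365 days count toward the total.
DateTime cutoff = DateTime.Today.AddDays(-365);

foreach(var customer in customers)
{
  // Loading customer.Orders issues one extra query per customer: the SELECT N+1.
  var orderTotal = customer.Orders
   .Where(o => o.OrderDate > cutoff)
   .Sum(order => order.OrderValue);

  // Preferred once purchases reach $5,000 over the past year.
  customer.Preferred = orderTotal >= 5000;
}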

And so we (or another developer on our team; we would never do this) create a batch job to update every customer's Preferred status by looping through every customer in the database, querying for their total orders for the last year, and updating the Preferred flag if their total orders exceed $5,000.

Seems fairly straightforward, and while it could be made more efficient, it's not an unreasonable thing to do. So this batch job is created and set to run every night. Problem solved, right? After all, the simplest possible solution is often the correct one.

What if there is no night?

One of the biggest problems that businesses can face, ironically, is success.

The more customers and orders that are added to this system, the longer it will take the batch job to run. Your company's success can be your batch job's undoing.

Eventually, this batch job will take so long to run that it no longer fits within off-peak hours. Or it will fit, but only if it runs straight through with no errors; a single failure portends an entire night of babysitting.

Worse yet, what if there is no night? In the city of Tromsø, Norway, this is quite literally true during the summer, since it is situated above the Arctic Circle. But even for locations at less extreme latitudes, business all over the world is getting more and more real-time. There is no slow time when batch jobs like this can be safely run. For any company doing business internationally, there is certainly no such thing as off-peak hours. Even for the much smaller, local players, modern business demands more 9s of availability, and the luxury of 8 uninterrupted hours of off-peak time seems antiquated.

Customer demands

When I was a teenager, my mother would ask me to do things for her, like clean my room, empty the dishwasher, or mow the lawn. I did not tell her I would get to it later tonight. My mother raised me smarter than that.

Our customers are the same way. They want instant feedback. If I'm a frequent flyer, I want my app to show me I've reached Gold status now, not tomorrow. If I get off my flight, check my app, and see that I still haven't reached gold status, I'm going to call you up and ask, "Why haven't you made me a gold customer yet?"

As a customer, I'm going to assume that you have a problem with your system. I don't really care that the nightly batch job hasn't run yet. And that's going to turn into a support case for you. Someone is going to have to answer that phone and explain things to me. How much money is something like that costing your company right now?

What if I'm a customer that buys something today, pushing me past the limit into preferred status, and then I decide to buy something else? If the nightly batch job hasn't run yet, I'm not going to receive the discount I expect. You're actually penalizing your customers by doing this.

If I wait until the next day, then I'll get the discount. But as the customer, I might not know that. So again, I'll assume there's an error. Maybe I'll call and complain; another support case. Maybe I will wait and place the order tomorrow. Maybe I won't place the order at all, and the revenue is lost. Maybe I'll take my business elsewhere.

The brutal fact is that if your business isn't real time, customers will eventually leave and find a competitor that is.

Just following orders

Sometimes it's easy to justify the creation of a batch job. It had to get done and it had to get done quick! We know it's not good, but after this next deadline we'll come back around and do it better. I'm sure we've all said something like this at one time or another, and the latest batch job becomes another victim of the organization's technical debt that never gets addressed.

But sometimes batch jobs are not created out of excuses like this; they're created because our business stakeholders actually ask us to do it!

This occurs because we've accidentally trained our business stakeholders to think in terms of batch jobs. So when they bring us business requirements, they're already couched in those terms. They've unknowingly given us a solution in the form of a requirement.

As responsible developers, we need to be sure we fully understand their intent. We need to dig deeper, ask questions, and not accept things at face value. Questions are far more useful than any programming language or framework; they are the most powerful tool that we, as developers, have in our toolbox.

Imagine this exchange:

BA: Hey, we're adding a preferred customer concept. Kind of a loyalty rewards thing. Going to give them free shipping and special deals if they've purchased $5,000 worth of stuff over the last year. Think you can code up a thing to update their status each night?

Dev: Sure, we can do that. But will the user need to be able to see their current status?

BA: Well sure, obviously. We want them to see a gold badge on their profile if they're a preferred customer.

Dev: Is it OK if they don't see that until the next day? Or would it be better if it were immediate?

BA: Well, immediately, obviously.

Dev: And if they order twice in one day, and the first order puts them over the threshold, should their second order get free shipping?

BA: Yes, for sure! If we didn't it'd be a huge snafu!

The situation might seem contrived, but it illustrates a point. The BA seems to believe that the answers to all these questions are bewilderingly obvious, and yet the original request to implement it on a once-daily basis would not accomplish the real goal.

It's not their fault; we've trained them to think like this. We need to retrain them a little bit, and always ask questions. Your business stakeholders are your domain experts and have valuable knowledge if you can ask the right questions. Just as you know your own job, start with the assumption that your business stakeholders actually do know something about theirs.

Back to the Future

The problem is that we traditionally focus on the past. We have our sights stuck on our rear-view mirror. To calculate our customer's preferred status, we look back to the past, totaling their purchases over the last 365 days.

Instead, what if we could predict the future?

It turns out, we can do just that. We don't need a crystal ball, or a DeLorean. Our business stakeholders have the power to predict the future, because they know the business domain. We can ask them!

Dev: Let's say Steve orders something for $100. At what point does that amount no longer count toward Steve's preferred status?

BA: That's easy, after 365 days!

Now that we know something about the future, how can we use that information in code?

Note to self

In the fantastic movie Memento, Guy Pearce plays a man with anterograde amnesia who is unable to form new short-term memories, forgetting everything new every time he falls asleep.

In order to complete his mission, he must rely on notes he leaves for himself, in the form of Polaroid photos or tattoos on his own body. When he wakes up, he knows nothing, and starts reading his tattoos to figure out what to do.

In perhaps the funniest scene of the movie, he wakes up chasing a guy in an alley. When he starts reading his tattoos, he has to turn around because he realizes that the guy is actually chasing him!

It turns out he has the same problem we have as developers. Unless we use a database or something, our programs forget everything when they shut down. If we want to predict something in the future, we can't use a .NET Timer because that won't survive a server restart. So what do we do?

It turns out that most modern service bus technologies out there, including NServiceBus, have a way for you to have delayed or scheduled delivery of messages, called a durable timeout.

Scheduled message delivery is natively supported by some queuing systems, like ActiveMQ and Azure Service Bus, but not by others, like RabbitMQ and MSMQ. A good service bus technology will level the playing field and support timeouts the same way on multiple infrastructures so you don't have to worry about the details.

Durable timeouts operate pretty much like an alarm clock, and give us a way to say to the infrastructure, "Would you mind waking me up 365 days from now, because I need to deduct $100 from the running total of this client?" No tattoos necessary.

Every time we process a placed order, we request a corresponding timeout for the opposite amount, and schedule that for 365 days in the future. On that date, our software will receive the timeout message, and deduct the amount from the running total. In the following diagram, we can see the purchases in green, and the corresponding debits in red a year later.

[Figure: Timeline showing purchases together with scheduled debits from the customer's running total]
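To make the bookkeeping concrete, here is a minimal sketch in plain C#, assuming an in-memory list stands in for the durable timeouts a service bus would provide. The RunningTotalLedger class and its RecordPurchase and AdvanceTo methods are hypothetical names invented for this illustration; they are not part of NServiceBus, and a list like this would not survive a restart the way a durable timeout does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model of "credit now, debit in 365 days".
public class RunningTotalLedger
{
    // Each entry is a debit waiting for its due date (stand-in for a durable timeout).
    private readonly List<(DateTime DueOn, decimal Amount)> scheduledDebits =
        new List<(DateTime DueOn, decimal Amount)>();

    public decimal RunningTotal { get; private set; }

    // A purchase raises the total immediately and schedules the matching debit.
    public void RecordPurchase(DateTime placedOn, decimal amount)
    {
        RunningTotal += amount;
        scheduledDebits.Add((placedOn.AddDays(365), amount));
    }

    // Applies every debit that has come due on or before the given date.
    public void AdvanceTo(DateTime now)
    {
        var due = scheduledDebits.Where(d => d.DueOn <= now).ToList();
        RunningTotal -= due.Sum(d => d.Amount);
        scheduledDebits.RemoveAll(d => d.DueOn <= now);
    }
}
```

With a $3,000 purchase on January 10, 2015 and a $2,500 purchase on June 1, 2015, the running total sits at $5,500 until the first debit comes due on January 10, 2016, at which point it drops to $2,500. No query over a year of order history is ever needed.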

When the running total goes above the $5,000 threshold, we can publish a CustomerHasBecomePreferred event, letting the rest of our system know that they can now get free shipping and all the other goodies. If a timeout eventually lowers the running total back under the threshold, we can publish CustomerHasBecomeNonPreferred. We can see the basics of this in the following code sample:

public void Handle(OrderPlaced message)
{
    // First message for this customer: remember who we are tracking.
    if(this.Data.CustomerId == 0)
        this.Data.CustomerId = message.CustomerId;

    this.Data.RunningTotal += message.Amount;

    // Ask to be reminded in 365 days to remove this order's amount.
    this.RequestTimeout<OrderExpired>(TimeSpan.FromDays(365),
        timeout => timeout.Amount = message.Amount);

    CheckForPreferredStatusChange();
}

public void Handle(OrderExpired message)
{
    // The order has aged out of the 365-day window.
    this.Data.RunningTotal -= message.Amount;

    CheckForPreferredStatusChange();
}

private void CheckForPreferredStatusChange()
{
    if(this.Data.PreferredStatus == false && this.Data.RunningTotal >= 5000)
    {
        // Record the new status so the event is published once per change.
        this.Data.PreferredStatus = true;
        this.Bus.Publish<CustomerHasBecomePreferred>(
            evt => evt.CustomerId = this.Data.CustomerId);
    }
    else if(this.Data.PreferredStatus == true && this.Data.RunningTotal < 5000)
    {
        this.Data.PreferredStatus = false;
        this.Bus.Publish<CustomerHasBecomeNonPreferred>(
            evt => evt.CustomerId = this.Data.CustomerId);
    }
}

This is extremely proactive. It allows us to define more complex business processes, and evolve those processes more easily as the underlying requirements change. The threshold for making a customer preferred can be changed, affecting all future transactions but grandfathering in old ones. Or the purchase of specially promoted products can accrue double points. A batch job, with its big ugly club, would apply the new rules across the board.
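One way to picture the double-points idea is to decide the credited amount once, when the order is placed. The sketch below is only an illustration under assumptions: PreferredCreditRules, CreditFor, and the isPromoted flag are hypothetical names, and the doubling rule is invented to match the example in the text.

```csharp
using System;

// Hypothetical rule set; the names and the doubling rule are illustrative only.
public static class PreferredCreditRules
{
    // Decides, at the moment the order is placed, how much the order
    // counts toward preferred status. Promoted products count double.
    public static decimal CreditFor(decimal orderAmount, bool isPromoted)
    {
        return isPromoted ? orderAmount * 2m : orderAmount;
    }
}
```

If the credited value, rather than the raw order amount, is what gets added to the running total and carried in the timeout, then the debit a year later undoes exactly what was credited. Changing the rule tomorrow affects only orders placed from tomorrow on, which is the grandfathering described above.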

It's also much more efficient. When the DBAs show up at your doorstep armed with their pitchforks, a simple select by primary key is totally defensible, whereas a SELECT N+1 will (deservedly) get you run right out of town.

We can also hook a lot of functionality into responding to those events. When a customer becomes non-preferred, we should probably call them up and ask why they're not buying from us anymore. The possibilities are endless. Airlines are a classic example; they have tons of business processes around whether you're a preferred customer or not.

Summary

With the power of durable timeouts on our side, we can become masters of time, utilizing information from the past, present, and future to make our systems more responsive and real-time.

Not only will this please our customers, but it will also create new business opportunities to react to changes in data live, as they happen.

Most importantly, we will kill the ugly batch job, and banish it from our lives forever.


About the author: David Boike is a solution architect at Particular Software, author of Learning NServiceBus, and is passionate about distributed systems, elegant software, and brewing craft beer.
