
Webinar recording

Stop looking in the past; Start telling the future

Say hello to scheduled message delivery and goodbye to middle-of-the-night alerts when batch jobs fail.

Why attend?

Ever write a batch job? Batch jobs are bits of code that periodically look at your database’s current or historic state and then act on it. They often run off-hours or in the middle of the night to avoid performance impact, and if they fail while processing, get ready for a pager alert. You can prevent this nightmare by telling your system what to do in the future! Look to the real world to model a long-running business process from start to finish.

Attend the webinar and learn:

  • The challenges with recurring batch jobs.
  • How to move a business process away from batch processing.
  • Real-world use cases for modeling future events.

Transcription

00:02 Adam Ralph
Hello again, everyone, and thanks for joining us for another Particular Software Live Webinar. My name's Adam Ralph. Today, I'm joined by Derek Comartin, who's going to talk about topics such as the challenges with recurring batch jobs, how to move a business process away from batch processing, and real-world use cases for modeling future events.
00:28 Adam Ralph
Just a quick note before we begin, please use the Q&A feature to ask any questions you may have during today's live webinar. We'll be sure to address them at the end of the presentation. We'll follow up offline to answer any questions we won't be able to answer during this live webinar. We're also recording the webinar. Everyone will receive a link to the recording by email. Okay. Let's stop looking in the past and start telling the future. Over to you, Derek.
00:58 Derek Comartin
All right, thanks. Thanks, Adam. The talk, in and of itself, as Adam just mentioned, relates to batch processing. But the takeaway that I hope I can convey a little bit here is just to be thinking about workflows and your business processes that really drive your system. It's just a little switch of how you may think of them that can really change how you write your systems and your apps.
01:24 Derek Comartin
Everything that I'm talking about today, generally, for the most part, is in line with everything that I have on my blog at codeopinion.com as well as my YouTube channel. Pretty much, they go hand in hand. Every blog I pretty much have a video for. Vice versa. They all revolve around a lot of what I'm talking about today: messaging, just general software, architecture, design boundaries, those types of things. If you're interested, at the end of this, in discovering more of what I'm talking about, you can check out the blog as well as my YouTube channel.
01:55 Derek Comartin
All right. I think most people, especially when we're developing systems or thinking about what our processes are, what they look like, what it means in our system... How I always used to think about it or how I still think about it until I flip that switch a little bit, which I'm hoping that you'll get out of this, is thinking about your system for the most part as the way it resides right now, how you want things to execute at this very moment in time. That can be a user... you know what I mean, performing some action where you're sending some type of command to your system, where you want to mutate state, have some different side effect, or some action that the user wants to do or the system needs to invoke.
02:41 Derek Comartin
Then, from there, in our event-driven world, I mean, publishing events, having other parts of our system or parts of that workflow pick up those events, and either having some event choreography or orchestration to basically move these processes along. But in doing so, we're always thinking about... "Okay, I want to execute this command." Maybe then, from there, like I said, we derive some events that we publish, and then we consume those and react to those. But we're always in this mode, especially when we're writing this code that is going to be doing this workflow, where we're generally thinking about it as, right now or historically, the things that have happened. The shift is that there are other ways to think about this.
03:30 Derek Comartin
I don't really blame us for this. Because even if we understand the full process, we're working with, generally, our database. Which is what? It's storing current state usually. Even if we're storing historical or more transactional information, we can see history of what happened. We're still looking at present time, the way things are in our system, and we're not really thinking so much about the future. But in a lot of situations, we can be thinking about the future because we understand what the business process is. We know that certain things are going to occur.
04:04 Derek Comartin
But again, when we're thinking in batch jobs, which I'll get into here, is we're generally looking at the state of the system, what it looks like right now when something executes. It's not just as it's executing. It's as we're actually writing it, as we're writing our application code, is we're doing so with the thought that we're looking at a current database, that we're looking at current state. We're not necessarily looking so much about what needs to happen, just what is happening right now. What's the state of the system?
04:36 Derek Comartin
Like I said, there's a shift. I'm planting a seed in your mind with this talk that there are certain workflows in your system that are occurring, especially if you're using batch jobs to perform some types of actions that need to occur, that you could really think about executing in the future and programming it that way, programming it into your system to get out of the batch job scenario, which, as I'll talk about, has its issues.
05:07 Derek Comartin
I'm going to be using some real-world examples that all happened within... One of the main ones I'm going to be using here, I'll talk about in a minute, was something that occurred recently. But I love examples, especially real-world ones that are not made up, and none of these are. I hope they translate, that you can understand them, as simple as they are, because usually some of these workflows are not simple. But you can translate them into your own system and just see... "Okay, yes, I get the gist of this. We have this process in ours that is complex/troublesome. We do some batch processing for this." Maybe we can get away from that using some of the things I'm talking about. But I hope my examples... You can just relate to them/use them as a way to think about your own system and where this might be applicable.
06:01 Derek Comartin
The key to this really is I had this... I don't know; realization somewhat earlier this year when I was doing some online ordering. Typically, I'm doing online ordering. I'm just immediately... Buy now. Ship to my house. Everything gets delivered so quickly now that it's not really much of a thought. As I was going through this use case, and I'll go through it, it dawned on me then of... "Oh, yeah. This is this business process. This makes sense." While it's intuitive, I guess it's not immediately easy to see how you would implement online ordering.
06:43 Derek Comartin
Specifically, what I was doing was sitting at my desk just like I am now. I needed to get something from a big box hardware-type store, which I give away, I think, in one of these slides what it is. This was back in February. I was just ordering one item. Now, I guess, luckily for me, I got a little bit more of the gist of this is because I have worked in warehousing and distribution. I have a little bit of a background in that, which just made me, I think, immediately think, "Oh, I wonder how this process is actually written in terms of their system." I was just on the website. I was looking for one particular item. The store's not actually too far for me. I needed it that day. Instead of waiting to get it delivered to me, I figured, "Okay, I'll just look online." It shows that my particular store... I actually had a few different around me. It shows me the quantity on hand of the item I was looking for.
07:43 Derek Comartin
Now, it's a little deceiving because the quantity on hand that you see on the website is what the store or what the system believes that they actually have on hand sitting on the actual shelf. The idea would be that it said that they had one of this item. The reason there was only one is because it was on clearance. I was a little bit concerned that maybe the item wouldn't be really there or it would be damaged. That's important to note because even though it said there's one on hand, I mean, the point of truth really isn't what the system says. It really is what's actually on the shelf. That's because in any type of warehouse or anytime where you have actual inventory, unfortunately, stuff does break. Packages could be open. I didn't really want an open package per se. I definitely didn't want a broken product that I was looking for, or maybe, unfortunately, that item really wasn't there, or it wasn't in the location it should be. It may be hard to find. That's also happened a lot.
08:46 Derek Comartin
I was taking a gamble here, saying, "Okay. Well, I'm not going to go to the store directly because maybe what if any of those scenarios are the case? I'm going to do an online order for pick up." What happens here is I get this immediate confirmation email, like you would expect. The interesting thing that they say here is that when my item's ready for pick up or on the way, that I'll get another email. This makes sense because what needs to happen is when somebody at the store realizes I placed this online order, they need to then go get that item for me. They need to go to wherever it is located in the store. Hopefully, it's not damaged or anything, and it's actually there. They basically take that item off the shelf and put it into a designated area, at which point, they clearly do something in their system to say, "We've reserved this item for you. Essentially, now, you can come pick up your order because we have your item available to you."
09:38 Derek Comartin
Really, what they're doing is just creating a reservation. They're basically reserving that item for me. What's interesting with this... It's really only for a period of time. They're only reserving this item. It's not like they're just going to take it, and I don't come in and pick it up. In the bottom box, they say that... "We recommend picking your items within seven days of the date that was ready for pickup," which I believe... Yeah, it was on the third. "If it's not possible, contact your local Lowe's store." That's clearly where I ordered it from. Then, the bold part here is that... "If you do not pick that item up, you'll be refunded, and then the item will be placed back onto the shelf."
10:20 Derek Comartin
This makes sense because it's like any reservation. If you think about, I mean, going to a restaurant and making a reservation, what are they reserving for you? A table. If you don't show up, say, at whatever their threshold is, say 30 minutes after your reservation, they're going to free up that table, so maybe somebody else walking in can have that open table. Really, all this is just a reservation.
10:44 Derek Comartin
I thought that was interesting in the sense of... "Okay, now I understand how this works. I understand how the quantity on hand/the motions that they're going through." How would you actually implement this? Specifically, how would you implement the expiry of the reservation? At some point, you're going to have to look at all the orders that have not been picked up. I think this is a use case where, immediately, you would jump to thinking of a batch job.
11:13 Derek Comartin
Whether some people call it a batch job, a cron job, some scheduled task, it's going to be something that's recurring, that you want to do on some periodic basis, where you need to execute some work. You need to do some things. In my particular case, it needs to figure out which orders have not been completed, essentially, where somebody hasn't actually come in and picked up their order, so that they can refund them and then mark the item in some way so that the store knows that they can put that item back on the shelf.
11:45 Derek Comartin
There's a lot of different scenarios like this. But, again, it goes back to where I was talking at the beginning, which is we want to look at the state of the system on some periodic basis. We're looking, historically, to figure out what we need to do.
11:59 Derek Comartin
Now, that's one way of dealing with it if you go down this batch job scenario. But in most batch jobs that I've ever worked on, in any system... I'm going to talk about some of the issues here. Again, I hope you can relate to these, especially if you've been working on batch jobs. How would this work? Well, in any type of system, we're making some type of state changes. Say, it's like creating our order. We make some order in our database, whatever type of database you're using. There's some record of that. Then, going through the process still, we did that reservation. We flagged our orders. Now, it's maybe some state of pending or waiting for pickup.
12:39 Derek Comartin
Then, we have some type of other... It could be a separate process. It could be, again, something that's running periodically, whatever that basis is. My example is just daily. But we have some worker process that is going to go reach out to our database, get the relevant data that it needs, those orders that are sitting there expired. Maybe it gets all that data. Then, it iterates over them, one by one, and then doing whatever it needs to do. Canceling the order. Then, there's probably some other action that it needs to do related to the refund of the payment, which gets more complicated. While it gets more complicated because there's many things that have to happen, the concern here is you're doing this work. There's many different orders that you need to process, and then there's a failure. How do you handle the failure?
13:34 Derek Comartin
This is probably the biggest headache in batch processing: if you're doing things en masse, like in an actual batch, how do you handle it if there's a failure right in the middle, especially, in this case, if we're doing refunds? Well, if those were occurring, and then we're trying to make state changes, we don't want to be in this inconsistent state of "we were able to do the refund, but we didn't update the database." There's a lot of concern about actual failures and how we're going to handle them.
14:06 Derek Comartin
You could think of this just like really simple pseudo-code here: simple just to fit on a slide. But as I illustrate it, you could think, "Okay, we're just going to fetch out those orders where they're waiting for pickup. The expiry is earlier than right now. Then, we just iterate over those and then we cancel the order. Then, we save changes." We think this is good because, well, if there is some type of failure... Maybe "canceling" throws for some reason or "save changes" throws, or we lose connection to the database, whatever the case may be. If there's a failure right in the middle of this, we could essentially rerun the job, and we should be able to pick up where we left off and just be getting the orders that we still need to cancel.
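To make that pseudo-code a bit more concrete, here is a minimal sketch of such a nightly job, assuming an EF Core DbContext with an Orders set. The class, status, and property names are illustrative, not taken from the webinar slides.

```csharp
// A minimal sketch of the nightly batch job, assuming an EF Core DbContext
// with an Orders DbSet. OrdersDbContext, OrderStatus, and ExpiresAtUtc are
// illustrative names, not taken from the webinar.
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class CancelExpiredOrdersJob
{
    readonly OrdersDbContext db;

    public CancelExpiredOrdersJob(OrdersDbContext db) => this.db = db;

    public async Task Run(CancellationToken cancellationToken)
    {
        // Fetch every order still waiting for pickup whose expiry has passed.
        var expiredOrders = await db.Orders
            .Where(o => o.Status == OrderStatus.WaitingForPickup
                     && o.ExpiresAtUtc < DateTime.UtcNow)
            .ToListAsync(cancellationToken);

        foreach (var order in expiredOrders)
        {
            order.Status = OrderStatus.Cancelled;
            // The refund against the payment provider would also happen here,
            // which is exactly where a failure part-way through gets painful.
        }

        await db.SaveChangesAsync(cancellationToken);
    }
}
```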
14:50 Derek Comartin
But again, this is simplified because I'm not doing anything with the payments, which complicates this whole thing. But then, you also have the issue of... "Okay. Well, if this fails, where are you defining the retry?" If you have this in some simplistic way, okay, it failed. You got to define that. You want to retry it. How many times do you want to retry it? Do you want to do some exponential backoff? There's a lot of thought that has to go into when these batch jobs fail and, really, how you want to handle them.
15:21 Derek Comartin
You may think, "Okay. Well, the next logical step here is, yeah, it's a batch job. But I really want to execute this more in isolation, where I have each individual order being canceled on its own. That way, if there's a failure to say one order doesn't affect the entire batch." Really, what we've turned our batch into is just like this kick-starter that, again, we're running once a day. It's just going to get the order ID in the same type of query. Just get the specific order IDs, and then we're going to iterate over those. Then, we're going to send a "cancel order" message to our queue. That way, we can deal with each message as just a specific cancel order command that we are going to invoke. We can deal with that in isolation.
16:07 Derek Comartin
If there's a failure with one cancel or one order, it doesn't necessarily... Depending on what the failure is, it will potentially not affect all the other ones. If we had a hundred orders that we're canceling, maybe only one has a failure. Then, we can deal with failures in terms of those individual orders independently with retries, backoffs, dead-letter queues, et cetera, the typical way you'd think about how you want to handle failures with messages. This sounds great. But even if you do this, there's issues.
16:40 Derek Comartin
Two of the primary ones that I'm going to describe here again, seemingly, for me, have been in every case of doing any type of batch work/any type of batch jobs. The first is concurrency. You are executing batch jobs. For simplicity's sake here, let's say that they're more on a periodic basis. Maybe you're doing something every hour. While you execute your batch job, it's automatically getting triggered every hour on the hour. That batch only takes 15 minutes to run. Only! But it takes 15 minutes to run.
17:17 Derek Comartin
As time goes on, if you don't have proper visibility and metrics and alarming to how these are actually running... That first job only took 15 minutes. It ran again, say... We're talking every day. It's doing this hourly. But over time, your system grows. There's more volume. There's more actual work to be done. That batch job, all of a sudden, instead of taking 15 minutes, is creeping up. It's taking longer and longer.
17:46 Derek Comartin
Let's now say it's taking 30 minutes. You keep going to this point where, all of a sudden, your batch job is taking longer than what the actual interval is. All of a sudden, you are executing a batch job, say, processing these orders to cancel these orders. You haven't even completed it. Then, another job has already kicked off. In both of my use cases of those simplistic code examples, they're both susceptible to this. Whether I was iterating over them, whether I was creating jobs that are now still sitting in the queue, and they're not completely processed yet, you have this concern now of just concurrency in how you want to deal with it.
18:25 Derek Comartin
More specifically, the way that this usually happens is you don't realize this is going to happen. It's not like you're initially writing this code to cancel an order, to think about concurrency, that you would have two messages trying to be processed at the same time. It's not necessarily something that you're thinking about because it's... "Oh, this has been working for months. Everything's fine." Then, all of a sudden, you have concurrency issues to deal with.
18:54 Derek Comartin
That one's the sleepy one, but... the sneaky one, I should say. The other one really is just handling load. Most of the times that you're doing batch processing, seemingly when you decide to schedule when it's going to run, it's always at night, especially if you're in a system that has more activity during the day hours or work hours, depending on what your system is, and it's not really, say, 24/7 or globally. "Oh. Well, where are you going to put it?" "Well, I'm going to run it at 3:00 in the morning because that's when we really have the lowest point of any load that we're dealing with."
19:28 Derek Comartin
But what ends up happening there is you're going to be spiking. You're going to see, in your system... Obviously, you're doing a ton of work. The same scenario of concurrency is because if you have more and more volume... which, hopefully, you do. That means your system's growing. There's more activity. Unfortunately, if you're doing more and more batch processing off-hours, you're spiking not just from that one job but many different jobs to all different resources. It might not just be CPU compute. But it's also any other resources, like your database, queues, whatever you're using, that you're also flooding with work off-hours because you don't want to interfere with the user-facing stuff.
20:11 Derek Comartin
But what happens, again, with volume... This, over time, starts bleeding into operational hours, where you're expecting the load to be at a certain level. You start bleeding into that. If volume increases here, our peak just doesn't stay as a peak. It keeps going out, potentially, into working hours.
20:29 Derek Comartin
Now, what do you end up doing? "Well, let's just move it back a little bit farther. Let's start it at 11:00 PM." You're not really solving the problem necessarily. You're just shifting it. At some point, you really have to address it.
20:42 Derek Comartin
I just want to take a quick break here to explain some of the issues that I think are the primary ones with batch jobs. I don't know, Adam, if you've run into any or if you have any thoughts about batch jobs yourself.
20:55 Adam Ralph
Well, I was wondering, Derek, you mentioned in the description of the webinar about getting pager alerts in the middle of the night when a batch job is having problems, or if it's failing, or something. I'm really curious. Has that actually happened to you before?
21:12 Derek Comartin
I think that's the number one reason why I've experienced pagers go off. I was using that first example of concurrency. To me, it's the sneaky one because it's a scenario where... in the happy path, your system's running. It's got a certain level of volume. This works fine. Then, all of a sudden, you run into this. There's either duplicate records. There's failures because of duplicates; however your system's set up. These types of pagers... Again, because they are running in the middle of the night, who really wants to be on, I mean, pager or support in the middle of the night when you're going to get one of these? It's not a good place to be in, especially when it creeps up on you. It's the same thing with load. It can creep up on you.
22:06 Derek Comartin
Again, if you have monitoring, you got to deal with it. Maybe you've got different metrics for CPU load and database, et cetera. What you'd rather be doing, which I'm going to talk about, is figuring out how we can reduce some of this, how we can remove this concurrency that's causing these issues, and how we can reduce the load and level it out.
22:30 Derek Comartin
Really, what I was talking about with the reservation pattern, it's exactly as I was describing it with that online order. Another way of thinking about it, instead of in your own system, is anytime you want to create some lock, essentially on a resource... You want to create some time-bound lock, giving some guarantee that... "Hey, there's some type of resource," in my case, it was a product on a shelf, "that I want to guarantee that I'm going to lock. I'm going to hang onto that lock. At some point, it's going to be time-bound that it's automatically released if I don't complete whatever it is."
23:11 Derek Comartin
I had to go into the store to pick up or actually complete my order. But there was that timeline guarantee that I had of seven days. If you really just think about it like that, you can look at this process a little bit differently. The timeline that I had here was the very first thing... When I placed that online order, let's say there's some event that's published, that's my order placed. That kicked off probably a bunch of different things. One of them was just that first email that I received that said, "Hey, great. Thanks for your order. We'll send you another email when your actual order is ready for pickup."
23:49 Derek Comartin
Likely, the store also had some notification that they actually had to go get that and process that order for me. How do they do that? I have absolutely no clue. But there's probably some indication to them that they need to go do it. Maybe that's just some employee periodically going to look at the system. Maybe there's something that gets notified to them. I'm not really sure. But clearly, when they physically go and get the item off the shelf and reserve it for me, there's some indication that they're making in their system that that order is ready. All the product's been reserved, and I can come pick it up. That's, obviously, what kicked off that event. That order reserved event is what kicked off the email that I could actually go pick up the order.
24:32 Derek Comartin
Now, when I physically went to the store... I don't know. It was later that day, I think. I go and get my order. They're clearly checking me out. I've already been charged for it, so I don't have to give them my credit card or any type of payment. That's essentially completing that reservation. I mean, there may be some order completion there that's happening. But really, if you want to think of it as a reservation, that's when I'm completing that reservation. It's when I physically go to the store and pick it up. We know we're done because that's the indication that they don't need to expire it.
25:03 Derek Comartin
These are the three things on the happy path that really work. We don't need to think about batch jobs or anything here. It's how you expect if everybody goes and picks it up. But that's not the case because we also need to expire.
25:18 Derek Comartin
When we're thinking specifically about a reservation, the idea that you were reserving it, that's when the person was going to pick out the item. Like I said, if you're thinking about restaurants, you're calling up and reserving a table. There was that confirmation or the validation of that reservation. That's me going to the store and actually picking up the item. But we also have to deal with the expiry. How do we expire that order and do the refund, and put the item back on the shelf? That expire portion is the interesting part: how do we actually do this without a batch job? That's really what I'm getting into. You know you need to do this. When you're physically writing the code for this, when you do that reservation, you know you need to, potentially, do an expiry in the future.
26:07 Derek Comartin
Knowing that, you can program and model this process that way of deciding, at a given time, that you're going to do something in the future. You don't need to do some interval batch job looking at the state of your system to figure out if an order needs to be canceled. You can decide in the future that you're already going to do it, potentially, if it hasn't been picked up.
26:27 Derek Comartin
What that looks like is... "Okay, our order was placed." But the moment that employee goes to the shelf and actually picks up the item, they're creating that reservation. What has to happen with a reservation? You need to expire it. When we're creating that order reserved event, and we're publishing that, we also, at the exact same time, can be creating a message that we want to consume in the future. That can be our expire reservation. We're defining that along our timeline here. The moment an order reserved happens, we are immediately going to basically send ourselves a message in the future.
27:08 Derek Comartin
What happens is, as I actually come into the store and complete my reservation, all that really means is that when our expire reservation message still arrives later, we'll just look at the state of the system. We'll see, "Oh, okay. That order is completed. We don't actually need to do anything." We just exit early when we consume that message. No problem. If the order hasn't been completed, and we haven't completed that reservation, then that's the work that we need to do related to the expiry, in the case of doing the refund and returning the product back to the shelf or flagging it that way so an employee does it.
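As a sketch of that consume side, here is what the handler for the delayed expiry message could look like: check current state, exit early if there is nothing left to do. This assumes NServiceBus (which comes up next) and illustrative message, repository, and event names.

```csharp
// Handler for the delayed ExpireReservation message. The message, repository,
// and event names here are illustrative assumptions.
using System.Threading.Tasks;
using NServiceBus;

public class ExpireReservationHandler : IHandleMessages<ExpireReservation>
{
    readonly IOrderRepository orders; // assumed data-access abstraction

    public ExpireReservationHandler(IOrderRepository orders) => this.orders = orders;

    public async Task Handle(ExpireReservation message, IMessageHandlerContext context)
    {
        var order = await orders.Get(message.OrderId);

        if (order.Status == OrderStatus.Completed)
        {
            // The customer already picked the order up; nothing to expire.
            return;
        }

        // Otherwise do the actual expiry work: refund and put the item back.
        await context.Send(new RefundPayment { OrderId = order.Id });
        await context.Publish(new ReservationExpired { OrderId = order.Id });
    }
}
```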
27:41 Derek Comartin
This is what I mean by thinking a little bit about "you know when you're writing this process." You're actually in code writing this, and you know what you need to do in the future. I think the trouble there is thinking, "Well, maybe not. Maybe it's conditional." But you can still decide that you want to do it and then make that the conditional portion of whether you actually do anything meaningful when it executes.
28:07 Derek Comartin
A way you can accomplish this is with delayed delivery. I'm going to explain how it works. Ultimately, NServiceBus provides an abstraction around this, which is wonderful. But I would recommend checking out the documentation. Say you're using Azure Service Bus or AWS SQS, or RabbitMQ, understand how it works under the hood because NServiceBus abstracts this from you, which is awesome. I think I'll show an example here. But it's simplistic on how you're leveraging it. But behind the scenes, NServiceBus is doing a lot of work for you.
28:50 Derek Comartin
The way this works is you have a sender where you're going to be sending a message to the queue. What you're doing is you're basically providing, with that message, how long you want to delay the delivery of it. Hence, delayed delivery.
29:05 Derek Comartin
Let's say in the case of when I was going to send that expiry. I can tell it, "Okay, I'm going to send this message to expire this reservation/this order in seven days." That's what I'm including with my message when I send it. What happens here is that the consumer will never get that message until we've reached that timeline, that seven days. Seven days have to pass. Our consumer can be trying to get messages, but they won't be delivered to it because we haven't elapsed that time. Once we finally do pass that seven days, at that point, our consumer will get that message delivered, and then we'll be able to process it.
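On the sending side, with NServiceBus this is roughly what that looks like; a minimal sketch assuming the same illustrative ExpireReservation message, with the delay carried on the SendOptions.

```csharp
// Illustrative handler: when the order is reserved, schedule the expiry check
// seven days out using delayed delivery. OrderReserved and ExpireReservation
// are assumed message types.
using System;
using System.Threading.Tasks;
using NServiceBus;

public class OrderReservedHandler : IHandleMessages<OrderReserved>
{
    public async Task Handle(OrderReserved message, IMessageHandlerContext context)
    {
        var options = new SendOptions();
        options.DelayDeliveryWith(TimeSpan.FromDays(7));
        // Alternatively, an absolute time can be used via options.DoNotDeliverBefore(...)

        // The transport (or NServiceBus on its behalf) holds the message back;
        // the consumer only sees it once the seven days have elapsed.
        await context.Send(new ExpireReservation { OrderId = message.OrderId }, options);
    }
}
```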
29:46 Derek Comartin
Again, read the documentation and understand how this works underneath the hood because there are some implications there on how it actually works. But it's really simplistic in terms of the API on how you're doing this. Now, this sounds incredibly straightforward, and, for the most part, it really is. It just really is a matter of thinking about... "I know I need to do something in the future," and basically defining that you're going to do it. The key part about it, as I'll explain a little bit more, is just understanding when you actually do get that message delivered, you realize, "Okay, do I actually still need to do anything? I know I needed to back when I originally sent this, but I may not need to now based on the state of the system."
30:34 Derek Comartin
This is the original load peak, that spike of doing that batch processing of, say, however many orders that we needed to cancel or whatever the work that you're doing. If you think of that now, we've eliminated the need for it because now that we're scheduling and delaying delivery of this message, it's just going to be executed at the time that... 10:00 in the morning, that order was reserved. Seven days later, at 10:00, is when we're going to process that message. We're not doing them all at 4:00 in the morning. We're doing them, essentially, seven days from the exact time, roughly, depending on how quickly you process them. It's going to be available, at least, to be delivered to you to consume seven days from that moment, or whatever you set your expiry as, to delay that delivery.
31:24 Derek Comartin
Essentially, what we do is we end up going from this batch processing, processing all this work, to really leveling things out. Where this peak was, now, all these messages that we need to expire, especially if they need to exit early, they're just sprinkled through everywhere, likely when they're occurring during working hours there. You're going to increase load over working hours a little bit. But again, it's spread out throughout your working day rather than needing to be batch processed that night. You can imagine how, if you take a lot of the batch processing you were previously doing and instead define what you want to do in the future, you can really just level that load and not have to worry about... To me, why you end up getting pagers is usually because of high database CPU or different locking and connections and transactions taking too long. Those are usually the main reasons why you end up having these database spikes and what they cause. Now, you're just spreading it out and leveling it off through the normal work and normal processing and messages that you're normally doing.
32:35 Derek Comartin
That's the biggest advantage: load-leveling. You're still doing everything in isolation, just like you would normally with a message, so you get all those benefits of how you normally do things in your system. At that point, before I jump into a couple more real-world examples: Adam, I don't know if you had any thoughts about this so far.
32:59 Adam Ralph
Well, it's certainly great to know we can avoid these pager alerts. Based on what you said, how does this look in code? If I'm actually going to try and do this in code, is this a complex thing?
33:13 Derek Comartin
It's not. That's the thing: while this sounds incredibly simple, I think the biggest difficulty in this is really just thinking about your use cases/your processes and realizing where you can actually do this. Like I mentioned, a big portion of this is just having... Oftentimes, things are time-bound. The way you can do that with delayed delivery is that you can do a request timeout, specifying your message. Then, from there, you're specifying, really, how long you're going to be delaying.
33:45 Derek Comartin
In my example, this is just some sample code that I had from a video that I've done on this. This is just in a saga. As you'd expect, after 10 seconds, I'm going to be executing this timeout and going to be able to do any additional actions I need to perform. I'm just handling this timeout here for this actual particular type, this message.
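The sample on the slide uses a 10-second timeout; here is a sketch along the same lines for the order-reservation example, assuming NServiceBus sagas and illustrative message types (simple classes carrying an OrderId), not the actual webinar code.

```csharp
// A sketch of the reservation expiry as an NServiceBus saga with a request
// timeout. Message and property names are illustrative assumptions.
using System;
using System.Threading.Tasks;
using NServiceBus;

public class OrderReservationPolicy :
    Saga<OrderReservationPolicy.ReservationData>,
    IAmStartedByMessages<OrderReserved>,
    IHandleMessages<OrderPickedUp>,
    IHandleTimeouts<ReservationExpiry>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<ReservationData> mapper)
    {
        mapper.ConfigureMapping<OrderReserved>(msg => msg.OrderId).ToSaga(saga => saga.OrderId);
        mapper.ConfigureMapping<OrderPickedUp>(msg => msg.OrderId).ToSaga(saga => saga.OrderId);
    }

    public Task Handle(OrderReserved message, IMessageHandlerContext context)
    {
        Data.OrderId = message.OrderId;
        // Tell our future self to check this reservation in seven days.
        return RequestTimeout<ReservationExpiry>(context, TimeSpan.FromDays(7));
    }

    public Task Handle(OrderPickedUp message, IMessageHandlerContext context)
    {
        Data.PickedUp = true;
        return Task.CompletedTask;
    }

    public async Task Timeout(ReservationExpiry state, IMessageHandlerContext context)
    {
        if (!Data.PickedUp)
        {
            // Nobody collected the order: refund and put the item back on the shelf.
            await context.Send(new RefundPayment { OrderId = Data.OrderId });
            await context.Publish(new ReservationExpired { OrderId = Data.OrderId });
        }

        MarkAsComplete();
    }

    public class ReservationData : ContainSagaData
    {
        public Guid OrderId { get; set; }
        public bool PickedUp { get; set; }
    }
}
```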
34:09 Derek Comartin
I also wanted to point out, in the comments here, there are other ways of doing this, which is just literally, when you're sending a message, providing the delayed delivery with a timeout as well. You can have many different timeouts that you're requesting in a saga. I just want to point out this way because there are different ways to do it and because of another example that I'm going to show here... Let me just go back to some real-world examples.
34:38 Derek Comartin
I'm going to give two more. They're different in nature. But, again, really, the point of this is I hope they give some context. Obviously, they're probably not the same scenarios you're in; but that you can relate a little bit to them, that maybe it sparks some ideas about your own system. Then, these are actually two that I've been in.
35:01 Derek Comartin
The first one is something I experienced. Again, this was relatively recently. When we go back to that example of that time-bound guarantee of something... And I think most people can probably relate to this in some way if you have a system that has some type of user registration or some type of account set up. What I was doing was I was signing up to a service for, I guess, money transfer, to do bank transfers, that type of thing. I've never done it before. I had no idea what I actually had to go through, a part of this registration.
35:34 Derek Comartin
The first thing I actually did was just fill out your typical... you know what I mean, demographic data: my name, address, that type of stuff. Provide, I think, either my email address or username that I wanted to create associated to my account. It's all pretty typical. Then, from there, the next thing I had to do was I had to enter some more information. But you'll see, as I was entering more information, what really also had to happen was that there had to be some expiry that had to be defined on how long this actual registration process could take.
36:08 Derek Comartin
Just because I started it does not necessarily mean I was going to finish it based on all the things that needed to happen. This is the same way where there's this process that has to happen. It has some type of life cycle where it can complete. But if it doesn't complete, there's likely some release or expiration of that registration so that maybe in a year from now or months from now, if I go through the registration, I don't start off where I was, nor would I probably want to because I need to probably enter all new information based on how the process goes. There probably likely is, say, "Okay, you got a week to complete this. Then, if you don't, if you try to sign up again, you can go through a new registration."
36:48 Derek Comartin
The first thing I had to do was some type of ID verification. I remember having to scan, on my phone, my driver's license or something like that. Then, after I had that completed, I had to wait. Then, I got a push notification in the app on my phone that said, "Okay, you're good. You need to get to the next step." Then, I had to do some type of bank verification, which, again, took a little bit of time to complete.
37:11 Derek Comartin
Once that was all done, then everything was confirmed. Everything was completed. I could log in, and I could start using, basically, their service. But if, at some point, I didn't finish this, it could be because I realized, "Oh, my god. I don't really want to go through this whole rigmarole," or maybe there was an issue with my ID verification or bank verification. You could see why this process wouldn't complete. You're not just going to necessarily hang on and prevent somebody else from registering with, say, the same email address.
37:45 Derek Comartin
Like I said, later on, maybe a week later or a month later, whatever the threshold that they have, I want to be able to start this thing fresh. I don't necessarily want to continue on. You could see if you had some type of record with my email address, you'd be like, "Oh. Well, you can't sign up again. You've already started a registration." This allows you to have some endpoint and always define a start and end to some type of process. Again, it's just thinking about these types of processes, that they do have a start and an end, generally. It's never usually infinite. There is some finite start and end to some of these things.
38:24 Derek Comartin
Another one I want to describe is another real-world use case that is a little bit different because it's not necessarily a start and an end. But you could see how you would do this in batch processing, or you could do this with basically delayed delivery.
38:45 Derek Comartin
Think about a delivery of goods, or whatever. Say you're buying something online. In certain parts of logistics, freight delivery, there's some freight that's a little bit more valuable potentially that the customer/the shipper actually wants to keep track of or wants to be informed of where their goods or freight is, say, at an interval. Let's say it's every 15 minutes. They want to know where it is. By where it is, this could be the location of the actual vehicle, the truck that actually has the freight. It could be in a plane. Either way, there's usually devices that are sending GPS coordinates back to a system to say, "Okay, I'm actually here, driving down, say, this highway," or, "I'm in the air," et cetera. If you were to think about this, how this process works, we want to inform the customer/the shipper, "This is where your freight is," maybe with ETA information, so they know when it's actually going to get delivered.
39:45 Derek Comartin
As our point in time here, the first thing that's going to kick everything off is when that freight is actually out for delivery. It's going from the shipper to the final destination. We have this event that starts, but we know we immediately, right now, want to schedule a delayed delivery for a message that we're going to consume that's going to basically send... whatever, an email, an SMS, a push notification, whatever the case may be, to the shipper/to the customer.
40:11 Derek Comartin
If you were thinking about batch processing, you just have something interval every 15 minutes, or you can immediately schedule a message for delayed delivery. Let's call it a status update. As time goes on. Let's say it's 15 minutes out. At some point, we get a position update from the plane, the truck, whatever the case may be, that's giving us the GPS coordinates of where they are. We persist that data to our database. Then, when we actually get to the time of that 15 minutes, we get that message delivered, and we can perform that status update. But what we immediately want to do is do the exact same thing. Because we're not using any type of batch processing, what are we going to do? We're just going to do the same thing. We're going to basically have another message for five minutes out that is going to execute in five minutes. We're just going to keep doing the same thing.
40:59 Derek Comartin
We get another position later on: gives us some GPS coordinates. We execute that message, consume that message status update, send another SMS or email whatever to the customer letting them know the ETAs, et cetera, and we do the same thing. We just keep going with this. We don't necessarily know when the end is. There is an end, which is when we actually deliver it. But we don't know when that's going to be. We just keep having these status updates. We just keep sending these with a delayed delivery. We're executing them.
41:27 Derek Comartin
At some point, we're going to have the final event that the actual item is delivered. We're finished. This has actually completed everything. At this point, once we've delivered, we're still going to run that last event or that message of a status update. We're going to consume it. But like I said, these messages aren't necessarily that you absolutely have to do something. You're going to be, in writing code here, looking at the state of, say, the delivery of that shipment, see, "Oh, it's delivered. I don't need to do anything else. I don't need to send an email out. I don't need to send another delayed message out. We're done at that point." This one, we'll really just execute. That will be the end of it. While there is a beginning and an end, sometimes you know what the end is going to be, potentially. Sometimes you don't, but you're just going to keep doing things along the way and just keep scheduling and delaying messages.
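A sketch of that recurring flavor as an NServiceBus saga: each timeout sends the notification and requests the next one, and delivery completes the saga. Again, the message and property names are illustrative assumptions, not the system Derek describes.

```csharp
// Recurring status updates via self-rescheduling saga timeouts.
// FreightOutForDelivery, FreightDelivered, SendStatusUpdate, and
// NotifyCustomerOfShipmentStatus are illustrative message types.
using System;
using System.Threading.Tasks;
using NServiceBus;

public class ShipmentStatusPolicy :
    Saga<ShipmentStatusPolicy.StatusData>,
    IAmStartedByMessages<FreightOutForDelivery>,
    IHandleMessages<FreightDelivered>,
    IHandleTimeouts<SendStatusUpdate>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<StatusData> mapper)
    {
        mapper.ConfigureMapping<FreightOutForDelivery>(msg => msg.ShipmentId).ToSaga(saga => saga.ShipmentId);
        mapper.ConfigureMapping<FreightDelivered>(msg => msg.ShipmentId).ToSaga(saga => saga.ShipmentId);
    }

    public Task Handle(FreightOutForDelivery message, IMessageHandlerContext context)
    {
        Data.ShipmentId = message.ShipmentId;
        // Schedule the first status update 15 minutes out.
        return RequestTimeout<SendStatusUpdate>(context, TimeSpan.FromMinutes(15));
    }

    public Task Handle(FreightDelivered message, IMessageHandlerContext context)
    {
        // Delivery ends the process; a timeout that fires after completion
        // finds no saga to act on and is simply discarded.
        MarkAsComplete();
        return Task.CompletedTask;
    }

    public async Task Timeout(SendStatusUpdate state, IMessageHandlerContext context)
    {
        // Notify the customer based on the latest recorded position/ETA,
        // then schedule the next update: no recurring batch job needed.
        await context.Send(new NotifyCustomerOfShipmentStatus { ShipmentId = Data.ShipmentId });
        await RequestTimeout<SendStatusUpdate>(context, TimeSpan.FromMinutes(15));
    }

    public class StatusData : ContainSagaData
    {
        public Guid ShipmentId { get; set; }
    }
}
```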
42:21 Derek Comartin
The biggest takeaway here, for me, if you get anything from this, I hope, is to think about these business processes, that they are finite, generally. There is a start and an end. You may know the end or what the end could be depending on the scenario. It could be something actually completing what you would think is the happy path. But there's also the other scenarios where there's this time-bound end to something. It could be time. It could be, as in my example, where something else has to occur, and you're just going to keep delay-delivering messages out until something happens, something else happens, some other event stops that from occurring.
43:01 Derek Comartin
Just think about these processes, because I find most times, you actually... Not all the time, but most times, you actually can really think about something that you want to do in the future rather than having some recurring batch job, looking at the state of the data, and then trying to figure out from it whether it should do something. When you're writing the code for this, you generally know something actually does need to happen. A way to do that is telling yourself in the future that you want to do something. With that, I hope you enjoyed this. I'm not sure how many questions we got here, Adam. But more than happy to answer some.
43:42 Adam Ralph
Yeah, we've got a couple of interesting questions coming in, actually. There are two which are related, so I'll try and combine them a little bit. We've got some people asking, "How do you visualize a state of the system that's hidden either in saga state or in messages that are sitting in a queue, delayed, to be delivered at a later time? How do you get some visibility out of that?"
44:14 Derek Comartin
Yeah. That's a good question, and I think that pertains to almost anything. The key word there is visibility. But I don't really think of it that way because if you're looking at the state of the system, for example, with these orders, and you're saying, "Okay, this order is pending for two days. It was placed two days ago, and it hasn't been picked up. But we reserved it," and that's visible in your system, I wouldn't be so much concerned about what delayed messages are sitting there because you know they're there. Really, the state of your system is the state of your system. If you have the appropriate testing and you know that messages are getting sent, there's really no reason to be concerned with, "Oh, there's all these messages sitting there for delayed delivery." Because, again, even if they're there, when they are consumed, the state of your system may not need them to do anything. That's what you're writing in those consumers of those messages: to say, "Okay, I don't really actually need to do anything anymore," and just exit early.
45:25 Adam Ralph
Yeah, there is actually a related question to that about what happens when the message arrives, because I've got a question here asking, "What if the batch process is based on multiple pretty complex conditions and checks before doing some action?" For example, you're sending an email to a customer if they've failed their regular payment three days ago, but you don't send it to the ones who have already paid, or if the customer is in certain other states in the system. At which point would you recommend checking those additional conditions?
45:57 Derek Comartin
Well, I guess the way I would think of this is... related to this question, or you can translate this to other ones, is, I mean, you can do all of these with a saga to begin with. But you're thinking of it as down to the individual, say, customer. The customer has this payment schedule or whatever the case may be. There are parts of your system that are dictating whether the payment failed or whether the email needed to go out. It's really about going down to that individual process of the individual customer. I guess that's how I would think about it. All these different conditions aren't necessarily happening all at the same time. There are different things occurring that you could be holding in saga state that dictate whether you actually need to take action on something when you've delayed something out to actually execute it.
46:58 Adam Ralph
Okay, great. Let's see what else we have here. What if moving from a batch job with batch processing to event-based processing results in billions of events... Wow! That's a lot of events. ...because of the type of domain, thus resulting in a huge cost? How would you handle that?
47:21 Derek Comartin
Yeah, that's interesting. Well, I guess it depends on what costs we're talking about. If we're talking about cloud costs and usage, there probably likely would be some trade-off here in terms of... When you're doing that batch processing and really spiking and hitting resources really hard, say, whatever interval you're doing it, in my experience you are... Again, it depends on the infrastructure where this is deployed. Because if you're not really set up in a scalable, elastic way, you're generally paying the cost of that peak for the batch processing.
48:15 Derek Comartin
Say, it's CPU-bound stuff, for example. Let's just say you're just using strictly VMs. You're not doing it serverless. If you're not auto-scaling it, you're going to be paying for an instance or containers that have the capacity, essentially, to process the batch. Oftentimes, it's not necessarily the execution itself. It's all the downstream services that are affected by it. When you mention the pain, it's usually a cascading series of failures that caused it. You know what I mean? You have, say, some batch, the execution of understanding what you need to do, but it's the database. It's the cache. It's all these other things that you're flooding with requests. If all of a sudden you're creating billions of messages-
49:08 Adam Ralph
I mean, that's a lot. That's a lot.
49:09 Derek Comartin
Yeah. If it's a cost factor, then I guess you'd have to be looking at what you're paying now versus what you think it'd be costing you.
49:18 Adam Ralph
I guess if I envisage something with potentially billions of events, it leads me to say, "Well, this isn't the solution for every single problem out there. It's not a silver bullet." If you genuinely have billions of events, maybe another solution is better.
49:36 Derek Comartin
Yeah.
49:39 Adam Ralph
Yeah. I've got another question here, which is interesting. Isn't delayed messaging the same as polling, from the message infrastructure standpoint? As messages need to be polled anyway in order to verify... if it is time to process it.
49:55 Derek Comartin
Say that again, sir.
49:57 Adam Ralph
I think what's being asked here is... Okay, you're delaying messages, which go to the broker or some broker. Isn't that then the same as polling anyway because you're having to poll the broker?
50:11 Derek Comartin
Yeah. I got a question before similar to this. Like, "Is it not just shifting the responsibility, I guess, a little bit? Thinking of internally how the broker is working to decide to delay that message to you." I guess, in some sense, yes. But, in some sense, it's a concern you're removing from yourself versus doing it in batch. There's just a lot more involved if you're actually trying to figure out how to do it. If you're doing it in batch, you're doing it in batch, versus if it's spread out and the broker's dealing with when it's delivering that message to you. I'm fine with removing that concern from my own... I mean, my app, and letting that infrastructure deal with it.
51:00 Adam Ralph
I guess a broker's doing that internally. It can perform...
51:03 Derek Comartin
Probably.
51:04 Adam Ralph
...optimizations and the broker's all about message delivery.
51:09 Derek Comartin
Which is fine. I would rather leave those capabilities to what it's useful for, if that's its purpose, rather than me trying to work it out, like I said, with all the different ways for failures, et cetera.
51:24 Adam Ralph
Yeah. Okay. We're coming up to the top of the hour now. I guess it's probably time to wrap up. Any questions which we didn't answer now, we will follow up with you later. There were a couple of anonymous questions. But I'll see how we can follow up with those as well.
51:49 Adam Ralph
Thank you very much for attending this Particular Software Live Webinar. Thanks to Derek for an excellent presentation. Our colleagues will be speaking later this month at NDC in Oslo, Norway, at Techorama in Utrecht, the Netherlands, and online at .NET Fwdays, which is hosted in Ukraine. You'll also see more of the Particular Software team at our booth at the in-person events, and you can go to particular.net/events to find us at a conference near you.
52:26 Adam Ralph
That's all we have time for today. On behalf of Derek Comartin, this is Adam Ralph saying goodbye for now. See you at the next Particular Software Live Webinar.

About Derek Comartin

Derek Comartin is a software developer with two decades of professional experience that spans enterprise, startups, professional services, and product development. He’s written software for a variety of business domains such as distribution, transportation, manufacturing, and accounting.