
Webinar recording

An exception occurred... Please try again.

Unavoidable failures don’t have to be a problem. With self-healing systems, they can be business as usual.

Why attend?

If there’s one certainty in software, it’s that things fail. It’s not a matter of if, but when. All too often, we throw an error at our users, who have no way of solving the problem except trying again. Sometimes we build custom code to address edge cases that can’t be easily fixed, and we do so with a dangerous lack of insight into the underlying problems.

Join me and embrace your system’s failures.

Attend my webinar and learn:

  • The importance of system resilience and types of failures.
  • How to make your system self-healing.
  • Various types of retries and the challenges they present.
  • How existing solutions can handle retries for you.

Transcription

00:00:02 Adam Ralph
Hello again everyone, and thanks for joining us for another Particular Software live webinar. My name's Adam Ralph. Today, I'm joined by Laila Bougria, who's going to talk about topics such as the importance of system resilience and how to make your system self-healing. Just a quick note before we begin, please use the Q&A feature to ask any questions you may have during today's live webinar and we'll be sure to address them at the end of the presentation. We'll follow up offline to answer any questions we aren't able to answer during the live webinar. We're also recording the webinar and everyone will receive a link to the recording by email. Okay. So I just ran into an exception. Should I try again? Over to you, Laila.
00:00:52 Laila Bougria
Thank you Adam. That was a good one to get today's session started. Thank you everyone for being here. I think that if we think back to the previous years and what we've been through in the world, this is the time to say Murphy was definitely right, because anything that can go wrong will go wrong. And that is also true for our systems, because exceptions happen all the time. In fact, we're probably more surprised to open up a log file that's entirely clean than one that's full of exceptions.
00:01:33 Laila Bougria
So if we shift our thinking and say we're just going to accept that things will go wrong from time to time, then we can also shift how we reason about the faults that occur in our system. Instead of trying to anticipate everything that might go wrong, we'll start to think about how we can make our system more resilient to anything that might happen, right? And before we do that, I want to distinguish different types of exceptions, because not all of them are the same.
00:02:02 Laila Bougria
There are systemic exceptions, intentional exceptions, and then we have transient exceptions. And a systemic exception is pretty well known to us. It's your unfriendly neighborhood friend called the null reference exception. It could be an argument exception or even a not implemented exception that fell through the cracks. And although they can cause us a lot of headaches, usually once we understand the context and the scenario in which the fault or error is occurring, we can easily reproduce it with tests. So these are the types of exceptions that we just want to completely get out of our system. We want to find them, fix them, and test them so that they can never occur again.
00:02:47 Laila Bougria
And there are a whole lot of different methods for testing available, which are completely out of scope for this session. What I usually still try to get in there is: do it consistently, right? Because if you're testing a little bit here and a little bit there, a lot of things still go unnoticed. But if you make this part of how you write the system, part of your processes and your team's behavior, and you do it consistently, then it can give you that safety net that we are looking for. Intentional exceptions are the next category, and these are the ones that we use on purpose. Right?
00:03:25 Laila Bougria
An example to explain this is the validation exception. So we basically have, whatever method that we create, we pass it some input and we want to validate, is this information that we can operate with? Is there anything invalid here? Are we missing any information? Is everything formatted the way that we expected so that we can move forward with that? If that's not the case, then sometimes we'll say, "Okay, let's just throw an exception," a validation exception for example.
00:03:53 Laila Bougria
Once we do that, we are basically saying that the exception occurring is an acceptable outcome of that specific method. And that's important because it means that we're assuming that the calling API will also handle this, right? The last category of exceptions are transient exceptions, and they're going to be most of the focus of this session. They happen when you least expect it, and therefore, they're very hard to reproduce. They could be caused by external systems that are accessed, or could also be due to infrastructure issues, for example. And what's also special about these is that some persist longer than others. It could be ticks, it could be milliseconds, it could be minutes. Depending on the underlying cause, some take longer to clear than others. So they're hard to measure.
00:04:47 Laila Bougria
A well known example of a transient exception in a multi-user system is a concurrency exception, which happens when two users, or two processes for that matter, update exactly the same data at exactly the same time. In a well designed system, that will cause a concurrency exception to occur. But transient faults could also happen, for example, when there's a failover of a database cluster, because when the primary goes down, the backup will become available, but that can take some time. And in that intermediate period, we could be running into some transient failures.
00:05:23 Laila Bougria
Another possibility is system overload. Think of the Black Friday that's coming up shortly or Cyber Monday or whatever it is these days. If you have a web shop at that point and there's a product that's extremely popular and there's a lot of sales coming in, then some systems could become overloaded and that could also cause transient failures. It could also be caused by external services that we rely on in our systems. So in many cases, we could be calling external APIs that are not even under our control. So if those are struggling, we are exposed to those failures as well.
00:06:02 Laila Bougria
And as we transition our applications to the cloud more often now, transient faults occur more than ever. Because when we think of cloud computing, we like to say we're running our software on someone else's computer, right? But it's not just someone else's computer, it's someone else's millions of servers in a data center. And those millions of servers require a multitude of network infrastructure to handle them. Think of load balancers and routers and whatnot. And all that additional infrastructure introduces latency, which could cause exceptions to happen here and there.
00:06:38 Laila Bougria
In the cloud, the load is also dynamically distributed across multiple sets of hardware. But it could be that a piece of hardware fails, right? And at that point, it could be as simple as just rebooting the machine, or it could be that the machine has died entirely and needs to be replaced by another one. And it might take time before the service you're hosting there becomes available again, even if you're not hosting it yourself. Most commonly, unless you're paying a lot of money, the resources that we use in the cloud are also shared. And therefore, cloud providers usually protect these resources using a mechanism called throttling.
00:07:19 Laila Bougria
And throttling basically means that access to that specific resource is going to be regulated by some kind of predefined rate. That could be throughput, it could be the number of calls within a specified timeframe, or something like that. And once you're crossing the network for anything, you are exposed to transient faults by default. I mean, one of the fallacies of distributed computing is that the network is reliable. And although it has become increasingly reliable over the years, we still can't assume that it will just work. We're used to things like fiber and always having an internet connection, but that's not a given. There could be many things that still go wrong along the way.
00:08:02 Laila Bougria
And taking all these different types of possible transient errors into consideration, it's way better for us to expect that transient exceptions will happen at some point. So if we think about it from the code perspective, transient exceptions are very hard to identify, because many of the underlying reasons for those faults are out of our control in many cases. Therefore, reproducing them is very hard or even impossible, and that also makes them hard to test.
00:08:35 Laila Bougria
Trying to reproduce these or trying to treat these transient exceptions the same way that we do systemic exceptions doesn't make a lot of sense for these reasons. And it's also just sometimes impossible, right? Because they are transient by nature, which means that by the time we as software developers are notified to look at a certain problem, it might actually be that the problem has already been resolved. And that usually leads to a statement that all of us have probably used hundreds of times in our careers: it works on my machine.
00:09:15 Laila Bougria
So let's take a look at a simple example. We have a scenario in which a user wants to place an order on a website, since we're close to Black Friday anyway. Right? If that fails, the user can basically just call support or move to a competitor. And in the worst possible scenario, the user is faced with something like this, the yellow screen of death. Right? This is a death sentence for any modern business in this day and age, because literally anything is better than this. This would basically take away any confidence anyone has in the system.
00:09:51 Laila Bougria
But luckily, we've gotten a bit better at telling our users that something has gone wrong. And a good example is how Disney uses our collective memory of how Ralph broke the internet, or how Taco Bell uses an optical illusion here to make a smile in the face of adversity, right? That's still funny. Something failed, but it's still funny. Or we have Airbnb that uses relatable graphics to portray that sometimes bad things happen. This is life. It is what it is. And then we have GitHub with its sturdy, strong pink rainbow unicorn that comes to tell us that they're having a bad day.
00:10:32 Laila Bougria
If we can come up with funny graphics that make the user smile even though they're running into a problem, we'll increase the odds that they'll be forgiving and that they will just try again instead of immediately saying, "Okay, this is not a website where I want to spend my money." But we need to take it a step further and ask ourselves, as software developers, as software engineers, what more can we do beyond the cute graphics? And for that, I want to put on some different hats. If we place ourselves in the shoes of the user and we run into a failure, we'll probably just try again to see if the problem persists, unless we're paying for something, right? If that user calls a support engineer, the first thing that they'll do is check: okay, is this actually an issue? And in order to validate that, they will try again.
00:11:27 Laila Bougria
If the support engineer ends up talking to one of the software developers or engineers on the team, then the first thing that they will do is probably also try again, not because the software developer does not believe the support engineer or the user, but also because that's what we do in order to understand what's going on. We need to be able to reproduce something so that we can actually start taking steps to resolve that problem. So there's a very clear pattern here. In the face of failure, everyone just wants to try again. That's like the first thing we want to do.
00:12:01 Laila Bougria
But we don't want to let these types of failures bubble up to the user, especially if they're transient, especially if it's just a millisecond or even a few seconds. Because that user has absolutely no means to solve that problem except for trying again or jumping through hoops by finding a contact number and contacting someone. Instead, we want to focus on making our systems better equipped to deal with failures in the first place, especially the transient ones. That way, we can move from a recovery model to a resilience model. And we want to be building a system that is able to cope with failures without ever losing any data, and hopefully, without ever affecting users.
00:12:50 Laila Bougria
So if we start to think about strategies in order to achieve that, there are multiple strategies that can help make a system more robust. And I'm going to focus on the ones that we can apply from a code perspective, not so much from the infrastructure point of view, because that's a whole other beast. So if we think about code, how can we make code more resilient? And we have strategies like retries, or retries with an exponential back-off. There's also something called the circuit breaker strategy, which is basically a fail fast mechanism.
00:13:20 Laila Bougria
So whenever a system or a resource struggles, let's say it's a database, it's struggling, then the circuit breaker is invoked and what happens is that all incoming requests will just be rejected immediately. We're not even going to try to process this because we're probably just hurting ourselves. So one of the things I like to compare it to is putting on your oxygen mask first. Right? We're not even going to try to help you. We know that we're struggling and we need to focus on just recovering before we can continue to accept requests.
00:13:54 Laila Bougria
Another possible resilience strategy is a fallback strategy. And I think this is the trickiest one of all, and usually not the one that I would recommend. Because in a fallback scenario, the tricky thing is that you cannot make any assumptions about which parts of the system are down or struggling. Making decisions in a fallback scenario where you don't fully understand, from a code perspective, what the current state of the system is could actually bring it down to its knees entirely. Right? So that should be the last one in the list that you think about.
00:14:31 Laila Bougria
And you can even combine multiple of these strategies together, which is what companies like Amazon, Google, and Microsoft do. I mean, think about your shopping experience at Amazon. I don't think I've ever seen an error message there. I can't, for the life of me, remember a moment where I ran into one of those beautiful images like I showed previously. And that's probably because they're very proficient at all of these strategies. Right? But if this is a new topic to you, then I would start with retries.
00:15:11 Laila Bougria
So a retry is exactly what the word tries to convey, right? We're going to execute an operation that fails in the hope that if we do it again, it will just succeed. For some, that's the definition of insanity. In distributed systems and especially when running in the cloud, it's common sense. So if we think about it from a code perspective, if we want to retry something, then we have to wrap the code that we want to retry if something fails in a try-catch block. If we run into an exception, then we'll just re-execute that code block. And we're going to do that for a configured amount of times, which is actually perfect to catch the very super transient exceptions like a concurrency exception or maybe even a flaky connection. Let's take a look at a code example.
00:16:06 Laila Bougria
So here, I have a method called WithRetries, and I'm accepting an action which basically encapsulates the piece of code that I want to re-execute in case it doesn't succeed. I'm also accepting the number of retries that I should do. And then, I'm starting a for loop based on those number of retries, and I'm invoking the action that was passed through. If an exception occurs, I'm just going to swallow that exception and continue through the loop and basically try to execute that action again.
00:16:39 Laila Bougria
When I deplete the configured number of retries, I'm going to invoke that action one more time without the try-catch block. Because if a failure still occurs, I want that to bubble up to the user or to the calling API so that at least they're aware that something is going wrong. You don't want to swallow all exceptions everywhere. Right? The downside of this strategy is that, as you do multiple retries and you're doing them immediately, you're using additional resources, you're doing additional IO, you're basically causing additional overhead in the system. And if you have a system that's already struggling for whatever reason, you could bring that system completely down to its knees by doing this extensively.
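As a rough illustration, a minimal sketch of the helper being described might look like this (the names WithRetries and numberOfRetries are illustrative, not the exact code from the slides):

```csharp
using System;

static void WithRetries(Action action, int numberOfRetries)
{
    for (var attempt = 0; attempt < numberOfRetries; attempt++)
    {
        try
        {
            action();
            return; // success, stop retrying
        }
        catch (Exception)
        {
            // swallow the exception and loop around for another immediate attempt
        }
    }

    // one final attempt outside the try-catch: if it still fails,
    // the exception bubbles up to the caller
    action();
}
```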
00:17:30 Laila Bougria
And that's why if you start reading and you run into, for example, Microsoft guidance, it will actually say that if you're doing these immediate retries, you should limit that to just one single immediate retry. So if you think about it that way, then the only thing you might be solving are the super, super transient things like concurrency exceptions, where a matter of ticks is enough. So we're not winning a lot. What more can we do then? Well, we can introduce retries with an exponential back-off. And this method is actually very similar to the previous one, so I'm not going to go over every parameter in the flow again.
00:18:14 Laila Bougria
But what's mostly different here is that whenever an exception occurs, we're going to wait for some time. And the amount of time that we wait, we're going to base that off the number of retries that we have been executing, which means that the first time I'll wait half a second, the next time I'll wait a second, then a second and a half, and so forth. So I'm going to grow the time we wait with each attempt. The downside of doing something like this is that the call takes longer to complete, because we are waiting. So whatever API is waiting on the other side for whatever result is basically waiting until we go through all of the retries.
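A sketch of that delayed variant, using the delays described above (a production back-off would typically multiply the delay rather than grow it linearly):

```csharp
using System;
using System.Threading;

static void WithDelayedRetries(Action action, int numberOfRetries)
{
    for (var attempt = 1; attempt <= numberOfRetries; attempt++)
    {
        try
        {
            action();
            return; // success, stop retrying
        }
        catch (Exception)
        {
            // wait longer with each attempt: 0.5s, 1s, 1.5s, ...
            Thread.Sleep(TimeSpan.FromSeconds(0.5 * attempt));
        }
    }

    action(); // final attempt still bubbles up on failure
}
```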
00:18:59 Laila Bougria
Another problem to consider is that if you find yourself in a very high throughput scenario, we were still talking about Black Friday. So imagine that you're Amazon and you have a lot of incoming orders coming in at the same time. So very, very high throughput. We are now giving the system a little bit of breathing time. So an exception occurred, we're waiting. So that system is now allowed sometime to recover and basically catch its breath, so to speak. But if we are then retrying in a very high throughput scenario, we're still sending a bunch of subsequent requests to that service when that time depletes.
00:19:40 Laila Bougria
So again, we're putting additional strain on the system. And that is something that we could solve by introducing something called jitter. And jitter is basically a mechanism or strategy to randomize the amount of time that we're waiting in between retries. We're still going to grow that time that we wait in between retries exponentially, but we're going to randomize it just a little bit, which again, in that high throughput scenario, will mean that the requests are retried at a little tiny difference in time, basically still spreading a little bit all of that load.
00:20:18 Laila Bougria
Again, this is mostly relevant if you have very high load, very high throughput scenarios. You could even say, "I want to combine these immediate and delayed type of retries." And if you take a look at this method, I'm accepting the number of immediate tries and the number of delayed tries because those could be different if you want to manage those separately. And what I'm doing within the loop is instead of just executing the action, I'm calling that immediate retry method. So for every delayed or exponential back-off retry, we are also going to do one immediate retry in this case.
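Putting those pieces together, a sketch of the combined approach could look like the following, with a naive jitter component added to the delay (it reuses the WithRetries helper sketched earlier; all names and values are illustrative):

```csharp
using System;
using System.Threading;

static class RetryHelper
{
    static readonly Random random = new Random();

    public static void WithCombinedRetries(Action action, int immediateTries, int delayedTries)
    {
        for (var attempt = 1; attempt <= delayedTries; attempt++)
        {
            try
            {
                // one round of immediate retries per delayed attempt
                // (WithRetries is the immediate-retry sketch from earlier)
                WithRetries(action, immediateTries);
                return;
            }
            catch (Exception)
            {
                // growing delay plus a small random offset (naive jitter)
                var delay = TimeSpan.FromSeconds(0.5 * attempt + random.NextDouble() * 0.1);
                Thread.Sleep(delay);
            }
        }

        action(); // final attempt, failures bubble up
    }
}
```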
00:21:02 Laila Bougria
And you can continue doing that until we deplete all the number of tries. And again, outside of the try-catch, we're going to do that one more time to make sure that if something keeps occurring, it will bubble up to the user or to the calling API. But this is all pseudo code. If you're thinking to yourself, "Oh, this is useful, I could write this," please do not. This is a solved problem. We don't want to reinvent the wheel. I'm using some pseudo code to explain to you how that works and to help us reason about it. But this is a solved problem. And that's where Polly enters the picture.
00:21:38 Laila Bougria
Polly is a .NET library that handles transient exceptions for you. It allows you to express policies for fault recovery, and it actually supports a very wide range of resilience patterns. Retries are supported through the Retry and WaitAndRetry methods, and there are additional strategies like circuit breakers, timeouts, even fallback strategies, which, again, be careful with those. I think just reading through their documentation is a very good learning experience, because they explain a lot of things and it helps you think through all of these scenarios and the benefits and drawbacks each one has.
00:22:23 Laila Bougria
So let's take a look at how you can do retries with Polly. So here I have declared a policy in a fluent manner. And what I'm doing is handling the Exception base class, so basically catching everything, and then using that policy to execute whatever code I want to make more resilient. One of the tricky things here is that because we're catching the Exception base class, we are catching everything that might occur and we're always going to retry. Which is not the best thing to do, because as we said in the beginning, not all exceptions are the same, right? For some, you want to fail fast and not do retries to begin with.
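As a rough sketch in Polly's pre-v8 fluent syntax (the retry count and the PlaceOrder call are illustrative placeholders):

```csharp
using System;
using Polly;

var policy = Policy
    .Handle<Exception>()   // the Exception base class: catches everything
    .Retry(3);

// execute whatever code you want to make more resilient
policy.Execute(() => PlaceOrder());
```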
00:23:04 Laila Bougria
So what you could do is catch more specific ones and say, "Okay, I want to catch DBConcurrencyException and then I want to do the retries." Or you could even say, "I want one strategy for all of the database type errors that could occur in my system." And then you could amplify the policy by saying, "I'll catch DBConcurrencyExceptions, but also DBExceptions." Another thing that you can do is also intercept and basically plug in whatever you want to do on every retry.
00:23:34 Laila Bougria
As you can see there, I'm logging the exception that's occurring so that I can keep track and have some insight on what is happening in the system, like when are exceptions occurring, how many times are they occurring and things like that. When we think about the delayed retries or the retries with the exponential back-off, then that's supported through the WaitAndRetry method. And to this method, you could pass in fixed times to wait and you could just say wait a second first, and then two seconds and whatever. Or you could pass in a method like I'm doing here to grow that exponentially.
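A sketch combining those ideas: targeting database exceptions specifically, logging on each retry, and growing the delay through WaitAndRetry (the Console.WriteLine is a stand-in for whatever logger your application uses):

```csharp
using System;
using System.Data;
using System.Data.Common;
using Polly;

var policy = Policy
    .Handle<DBConcurrencyException>()
    .Or<DbException>()   // broaden the policy to other database errors
    .WaitAndRetry(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
        onRetry: (exception, delay) =>
        {
            // plug in whatever you want to happen on every retry,
            // such as logging the exception for insight into the system
            Console.WriteLine($"Retry in {delay} after: {exception.Message}");
        });
```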
00:24:12 Laila Bougria
If you want to use jitter like I discussed earlier, then that is also supported, although it's not part of the Polly core package, it is supported through the contribution package called Polly.Contrib.WaitAndRetry. We have some resources at the end if you want to remember that. No need to take any notes. And then you can also use jitter. Of course, the example that I showed you is extremely naive. There's a lot more to jitter than just adding a random. So this package actually solves that for you. And it's also recommended by the Polly project as well.
00:24:52 Laila Bougria
If you want to combine these immediate and delayed types of retries, you can use that back-off method and then pass true to the FastFirst parameter at the end. And that will make sure that the first retry happens immediately, without any delay. So as you can see, with just a few lines of code, we've made the code a lot more robust. And by the way, these policies are also thread safe, so you don't have to worry about that either.
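A sketch of what that looks like with the contribution package (the delay and retry count are illustrative):

```csharp
using System;
using Polly;
using Polly.Contrib.WaitAndRetry;

// decorrelated jitter back-off; fastFirst: true makes the first retry immediate
var delays = Backoff.DecorrelatedJitterBackoffV2(
    medianFirstRetryDelay: TimeSpan.FromSeconds(1),
    retryCount: 5,
    fastFirst: true);

var policy = Policy
    .Handle<Exception>()
    .WaitAndRetry(delays);
```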
00:25:21 Laila Bougria
So let's recap this first part. Polly allows you to declare different policies that suit different needs, because you don't necessarily want to handle all exceptions the same way. You need to use a policy wherever you want to have retries in your system. The downside of this is, let's say you're building an API, right? And you think, "Okay, I want to make all of my API methods more resilient and I want to have retries everywhere." Then you have to remember to use a policy in every method. And that can get very noisy, very ugly, and it's also very error prone, because someone could forget, and then that's a gap in your API where you don't have retries.
00:26:05 Laila Bougria
Again, this is a solved problem, because you can plug Polly into the ASP.NET Core middleware, if that's what you're using, through the Microsoft.Extensions.Http.Polly package. What that does is allow you to declare those policies at application startup, and then you can handle recoverability as a cross-cutting concern across your application. Okay. So that's simple enough, but not entirely, because I've left a few things out. There are a few design considerations that we have to keep in the back of our mind once we want to introduce retries into our system.
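For example, with that package a policy can be attached to an HttpClient once at startup, roughly like this (the client name and retry settings are illustrative, and `services` is assumed to be your IServiceCollection):

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

services.AddHttpClient("orders")
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()   // 5xx, 408, and HttpRequestException
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
```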
00:26:45 Laila Bougria
Nested retries are a concern. We're going to have to isolate the changes we make. And then idempotency, or better said: is your code repeatable? Retryable? So to go into nested retries, remember them as a thing you don't want to do. No nested layers of retries. Consider an example. A lot of what I'm reading these days is about microservices. Let's say that you have a system with microservices, right? And you have one service calling three other services. And each of those three other services is calling two other services to get some data. If you are using Polly on all of these levels, before you know it, you have an exponential number of requests happening, and you can think of them as cascading retries at that point. Because imagine that they're all talking to a shared resource somewhere under the hood, and that resource is struggling for whatever reason. Then you're making things a lot worse.
00:27:50 Laila Bougria
Another thing to consider is that the initiating requests can also time out because you're doing retries on so many different levels. And if you're doing those with an exponential back-off, then you're waiting everywhere. Right? So it's important to consider how the APIs that you're using in your system are already implementing retries and adjust to that so that you can avoid these cascading retries occurring.
00:28:20 Laila Bougria
Another thing to consider is that we need to isolate the changes. Whatever action we want to make retryable, more resilient, needs to be a single unit of work. Every retry attempt should basically be independent of any other retry attempt. Which means that a retry should not rely on any shared state, and it should be self-contained. So whatever it needs, like getting the data, modifying that data, and storing that data, should be self-contained and not depend on anything else. And it should definitely not leave behind any state changes. Because especially if it fails, well, I don't have to tell you how harmful that could be for the rest of your system, which would then be seeing state that was never really persisted.
00:29:11 Laila Bougria
And that brings me to the next design concern, which is idempotency. For an action to be idempotent, you should be able to invoke it repeatedly and it should always produce exactly the same result. So if you think about it, one successful execution or 100 successful executions should still lead to exactly the same outcome. If you look at that picture in the background of the light switch, I can flip that switch downward 100 times, and the light will stay off. Obviously, software design in real life is a lot more complicated and a lot messier than that light switch, but it's a very simple way to help you reason about idempotency when you are looking at your code.
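The light-switch idea translates to code roughly like this (the Order type is hypothetical, purely for illustration):

```csharp
// assigning a state is idempotent; incrementing a counter is not
var order = new Order();

order.Status = "Paid";       // run once or 100 times: same outcome
order.PaymentAttempts += 1;  // not idempotent: every retry changes the result

class Order
{
    public string Status { get; set; } = "New";
    public int PaymentAttempts { get; set; }
}
```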
00:30:04 Laila Bougria
If you want retries, then you have to also assume that duplicate requests are going to occur. Let's even consider a scenario in which you make a request to another API, and that succeeds. It worked. But in the acknowledgement phase, while you're waiting for that okay status, that 200 status code, something fails. Network issues, there we go. At that point, from the client's perspective, that request failed, because we never got an okay. So we're not sure if it succeeded. So the client is probably going to retry at that point. And that's why idempotency is so important.
00:30:47 Laila Bougria
But that's not the only reason. Another consideration is how we now use multiple types of infrastructure in our systems. Nowadays, when we design a system, it's like, "Okay, this data makes sense in SQL Server. But this information, let's store that in blob storage, and that information makes a lot of sense in a document database." And it's super cool that we have the ability to do that. Right? But it also means no more transactions, right? We can't rely on things happening as an atomic unit anymore, because these are completely independent systems.
00:31:25 Laila Bougria
So we need a way to ensure that our data is stored consistently across all of these separate types of infrastructure that we are using in our system. Therefore, we need idempotency, because if one of them fails, we are going to retry to make sure that everything is consistent across all of them. I can't hammer this home enough. It's going to be one of the key design principles you have to keep in the back of your mind if you want to have a resilient system. So remember this example? Let's take a look at what that looks like now that we have retries with Polly.
00:32:05 Laila Bougria
So as you can see now, we're still placing the order. If that fails, we'll get the immediate retries kicking in. If that continues to fail, then we have the exponential back-off retries, the delayed retries. And if that still fails at that point, we're back to square one. Right? The user can either call support or move to a competitor. So when all the retries deplete, I think at that point, we can all agree that we're having a really, really bad day. But the question is, what challenges are we facing if we want to solve that problem?
00:32:40 Laila Bougria
And the main issue lies in the fact that the request is lost. It's gone. The user just spent a whole amount of time building up, for example, their shopping cart. I mean, I know it takes me ages to do that. Sometimes, it'll be... Okay, let's not go into that. It takes me ages to do that. Okay? So if at some point I finally decided to place my order and something goes wrong, basically means that I need to start over, right? And then I might say, "Nevermind, I'll just save myself some money." Or maybe I'll just lose confidence in the system and move somewhere else, right?
00:33:17 Adam Ralph
Hey Laila.
00:33:19 Laila Bougria
Yes.
00:33:20 Adam Ralph
So just to make sure I understand, I mean, this seems like an important point. What we're talking about is lost business here, right? Ultimately.
00:33:27 Laila Bougria
Yes, exactly. That's actually a good point. So let's drive that home. Right? In many systems where the data is not super crucial, you could say, "The pink unicorn is more than enough. I did everything I could, I had retries, and that's as far as I want to go given the business requirements that I have." But if you are a web shop, if failures mean that you are losing business and losing money, then that's not acceptable. First rule of business: ensure you take the money first. So you don't want to lose that money.
00:34:07 Laila Bougria
And if we're in that scenario, let's see what we can do. Because to solve this, we need to consider a completely different angle. We need to move away from this request-response type of mindset. Because if you think about it, up until now, we've been building that shopping cart, right? And the user has submitted that request, clicked place order, and then the system immediately started to process that request. Instead, what we want to do is capture that request and say, "Thank you user, I got your order. I stored it safely. I'll get to it as soon as I can." And that allows us to defer processing that information to a dedicated process.
00:34:55 Laila Bougria
And what that means is that if something fails, we have a lot more leeway to solve it later. So what does that look like visually? Up until now, we had a calling API and a handling API, and those are coupled together. The calling API calls the handling API. If that handling API is unavailable or struggling, then because we're directly coupled to it, the calling API will start to struggle too. So we're going to introduce a queue in the middle.
00:35:24 Laila Bougria
Once we do that, I think it's also important to change our naming. Naming is hard. Right? So we're going to talk about producers and consumers and not about requesting handling APIs and that type of language. So what happens is that the producer will send a message into the queue. And if you think about that queue for a second, its sole responsibility is to make sure that your data, that message, is safely stored there until there's a consumer available to take care of that, to process that message. So whenever that occurs and a consumer becomes available, they can start asking the queue for messages and basically start processing those. And if there's a huge buildup, like next week, Black Friday, then you could say let's scale that out so we can get through those messages more quickly.
00:36:17 Laila Bougria
So let's consider what successful processing looks like when we have a queue. We have a consumer that asks for a message and tries to process it. If that's successful and we've done whatever we have to do with that message, we're going to acknowledge that message to the broker. We also call that acking the message: we ack the message, we acknowledge it to the queue. And what happens at that point is the queue says, "Okay, you've handled it, I'm going to remove it now. It's not my problem anymore." Right?
00:36:51 Laila Bougria
In the case that something fails, the consumer again asks for a message, fails to process that message, and at that point, we'll say, "We weren't able to successfully process this message, so we're going to nack the message or not acknowledge it." And what happens is it becomes available in the queue again. It could be picked up by the same consumer or by competing consumers for that matter.
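As one concrete illustration of that ack/nack flow, here's a sketch using the Azure.Messaging.ServiceBus client (the queue name and handling logic are placeholders; other brokers expose similar complete/abandon operations):

```csharp
using System;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");
var receiver = client.CreateReceiver("orders");

var message = await receiver.ReceiveMessageAsync();
try
{
    ProcessOrder(message); // your own handling logic

    // ack: tell the queue the message is handled, so it can remove it
    await receiver.CompleteMessageAsync(message);
}
catch (Exception)
{
    // nack: the message becomes available on the queue again,
    // for this consumer or a competing consumer to pick up
    await receiver.AbandonMessageAsync(message);
}
```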
00:37:17 Laila Bougria
So if we think about the benefits of having a message based architecture, there are plenty, but I'm going to focus on the benefits that we get from the recoverability, the resilience type of perspective. And one of them is that retries don't affect our synchronous path anymore, because nobody is waiting for that response. Because what happened when that user placed that order? We told them, "Thank you, we got your request, you can move on with your day." So they have moved on and they're doing other business, which means that nobody is waiting. That's due to the fire and forget nature of messaging.
00:37:59 Laila Bougria
But what that opens up is that you could basically be retrying forever. And I want to nuance "forever", right? Because forever means forever up until the point of your SLA, all right? So we're going to be retrying for as long as is acceptable to our business. It also opens up the opportunity to handle business retries. Let's say that you are working in an environment where you have continuous integration and continuous deployment, and you actually run into a scenario that wasn't considered, or the business made an error, or we even found a bug, right? At that point, if we have a short feedback loop when it comes to deployments, we could fix that bug, deploy it to production, and at that point just retry those messages, and the user won't even have noticed anything, because maybe that email with whatever shipping notification came in an hour later, and they're probably not even aware of that.
00:39:01 Laila Bougria
So let's take a look at how the example has changed now that we have messaging in the system. It looks a little bit different. So we're placing the order. What that really means is we're just sending a message to place that order. That message is safely stored into a queue. And then when a consumer is available, they will pick up that message, process it. If it succeeds, we'll acknowledge the message and it will be removed from the queue. If it fails, we're going to nack the message, which makes it available again on the queue. And at that point, we can basically immediately try again. So if we look at this picture, we did lose a few things here compared to the Polly scenario.
00:39:44 Laila Bougria
If you look at that, we basically only have immediate retries. Because if you think about it, I said, we're going to not acknowledge the message. We'll nack the message, which means it becomes immediately available on the queue again. So we're immediately trying to process that again, which means we only have immediate retries. And as we know, those can increase the load on the system if things are failing behind the scenes for whatever reason. So we need a way to reintroduce those retries with an exponential back-off.
00:40:16 Laila Bougria
And although there are some native clients for queuing technologies available that do have some support for this, not all of them do, and some don't allow a lot of flexible configuration when it comes to these types of retries either. Another consideration or something to ask ourselves is how are we going to handle persistent failures? It could be that there's actually one of those systemic exceptions in there, actually a bug, right? Or it could be that there's a part of the system that is down for a longer amount of time and we need to intervene some way.
00:40:51 Laila Bougria
So it's not that this isn't handled by queuing technologies at all, because what they usually do have is a property called max delivery attempts. And what that means is that we will try to process that message as many times as the max delivery attempts allows. Let's say that's 10. If we can't get it done in 10 attempts, then that message will be moved to a dead-letter queue. But that dead-letter queue needs manual monitoring, and there's no easy way to manage those messages. So we need a way to centralize all of the errors that are happening, and then an easy way to retry them again. And that is where NServiceBus can make a difference.
00:41:36 Laila Bougria
So for those of you who don't know NServiceBus, it's a messaging middleware technology that you can use on top of any queue broker of your choice, at least the ones we support. It allows you to build highly scalable and flexible systems. And we have a wide range of support for queuing technologies like Azure Service Bus, RabbitMQ, Azure Storage Queues and more. And we also support multiple data stores such as SQL Server, MongoDB, Azure Table and more. We also support Outbox and Saga patterns. We'll get a little bit more into the Outbox pattern later. And we also have monitoring and debugging tools available that can help you better understand the behavior of your system.
00:42:21 Laila Bougria
So again, I want to focus on recoverability and not the whole range of advantages that NServiceBus gives you, but rather on this system resilience type of focus. And if we start to think about NServiceBus that way, it offers immediate retries but also retries with an exponential back-off. And it even allows you to plug in your own custom policy so that you have completely fine-grained control over how the retries are going to behave. When it comes to persistent errors, we're going to move all of those to a centralized error queue, and we have tools available that build on top of that that allow you to retry those in groups, so it's a lot easier to manage the systemic failures or to handle the transient failures that just took a lot longer to solve, like an outage of some kind.
00:43:20 Laila Bougria
Another thing that we also support is error notification, which you can think of as a callback that you can plug in every time a retry occurs so that you can do something additional there. Another thing that we support is that you can specify unrecoverable exceptions. These are the exceptions where you say, "If this occurs, don't even bother retrying. It doesn't matter. Just move it immediately to the error queue." Right?
00:43:49 Laila Bougria
A recent feature that we introduced is automatic rate limiting, which will basically slow down the ingestion of messages if a lot of failures are happening in a very small amount of time, so that we can say, "Okay, something is struggling, let's let it breathe." And we also have circuit breaker support in many of our transports, especially to manage connectivity issues, for Azure Service Bus, SQL Server, RabbitMQ. So NServiceBus comes with very, very sensible defaults, I would say. And if you want retries, you don't even have to configure anything. You get it out of the box when you run an endpoint.
00:44:27 Laila Bougria
But let's say you want to tweak it. Then you can access the recoverability settings through the endpoint configuration, manage how many immediate and how many delayed retries you get, plug in your own custom policy for fine-grained control, and you can even fall back to the configuration of the immediate and delayed retries if you want to. I'm not showing the error notifications here, because a very common use case for error notifications is to log whatever's happening. You don't have to care about that either, because NServiceBus will do that out of the box, and it will also do it in a sensible way when it comes to the log level.
00:45:04 Laila Bougria
So if it's an immediate retry, it will log it as info. If it's a delayed retry, it will be a warning. And if a message is actually moved to the error queue, which is when you need to intervene, it will be logged as an error, which is also important. You can also add any unrecoverable exceptions. A not implemented exception is not something that will be transient, I suppose. And then there's the rate limiting feature. So in this case, we're saying, "If you encounter 15 consecutive failures, I want you to start waiting and allow the system some time to breathe so that we hopefully don't bring it completely down." Right?
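Pulling those settings together, a sketch of the endpoint configuration might look like this (the endpoint name, retry counts, and durations are illustrative; check the NServiceBus documentation for the exact API in your version):

```csharp
using System;
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");
var recoverability = endpointConfiguration.Recoverability();

recoverability.Immediate(immediate => immediate.NumberOfRetries(2));
recoverability.Delayed(delayed => delayed
    .NumberOfRetries(3)
    .TimeIncrease(TimeSpan.FromSeconds(10)));

// never retry exceptions that can't be transient
recoverability.AddUnrecoverableException<NotImplementedException>();

// automatic rate limiting: after 15 consecutive failures,
// slow down ingestion to let the system breathe
recoverability.OnConsecutiveFailures(
    numberOfConsecutiveFailures: 15,
    new RateLimitSettings(timeToWaitBetweenThrottledAttempts: TimeSpan.FromSeconds(5)));
```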
00:45:44 Laila Bougria
So let's see, now that we have NServiceBus, how that has changed the example that we've been using throughout the session. So now, we have that same flow with messaging as before. We place an order, which means we send a message to a queue and then process it. If that fails, we get our immediate retries. If that continues to fail, we get our delayed retries. And if that continues to fail, we'll make sure that we send that message to an error queue, so you never lose a request, you never lose any business, you never lose any money. Okay? So good. We have retries everywhere. Hurrah, right?
00:46:22 Laila Bougria
Sorry, but not really. Because as I said, NServiceBus offers retries out of the box. You don't have to do anything for it. But that's only true whenever you're handling a message, which means that there is a message stored safely in a queue. For example, that's true whenever you implement that IHandleMessages interface. It's a way to detect whether you're safe on the retries side. But if you're in a controller context, you don't have retries there. And why is that? Because NServiceBus is basically relying on that acknowledgement mechanism to handle failure scenarios.
00:47:02 Laila Bougria
So if it succeeds, we acknowledge the message. If it doesn't, we nack the message and we know it's still there. It's never lost, right? For the exponential back-off, we use a mechanism called delayed delivery, which just applies a waiting time before the message appears in the queue for consumption. The point is, there's always a message in the queue, and we are relying on that. Outside of that scope, there's no message in the queue, and because of that, there are no retries. But if you've been listening, you may have been thinking about how you could use Polly to solve this problem, and that's exactly what my point was going to be. So you could use a Polly retry policy to take care of retries when you're in that controller scenario.
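A sketch of that controller scenario (the PlaceOrder message type and controller are illustrative, and IMessageSession is injected by the NServiceBus hosting integration):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using NServiceBus;
using Polly;

public class OrdersController : ControllerBase
{
    readonly IMessageSession messageSession;

    public OrdersController(IMessageSession messageSession) =>
        this.messageSession = messageSession;

    [HttpPost]
    public async Task<IActionResult> Post(PlaceOrder command)
    {
        // no incoming queue message here, so no NServiceBus retries;
        // a Polly policy fills that gap
        var policy = Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        await policy.ExecuteAsync(() => messageSession.Send(command));
        return Accepted();
    }
}

public class PlaceOrder : ICommand { } // illustrative message type
```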
00:47:51 Laila Bougria
And here I'm showing the full code, but you could lift this up into the middleware as we already discussed, and then you're good to go. Right? Have we covered everything? Nope. We haven't. Of course we haven't. That would be too easy. Because remember idempotency? I already said that this was going to be an essential part of the design. But there's more to the story, because if you start looking at documentation around queuing technologies and brokers and things like that, you will read that they guarantee at-least-once delivery. Which they frame in a very convenient way, but it means the same message might arrive more than once, and you'll have to deal with that. Right?
00:48:35 Laila Bougria
So as a way to tackle that problem, we support message deduplication with the Outbox, so that we can make sure that you are only ever processing the same message once. That doesn't mean that the messages will only go out once. There still might be duplicates. So the only thing that we are guaranteeing is that we'll only process it once. And it's important to understand that distinction. And actually, the guarantees of the Outbox go a lot further than that. They're twofold. Not only do you get message deduplication, but you also get atomicity between your storage and your message operations.
00:49:18 Laila Bougria
Let's see what that means and why that's so important. So if we look at the snippet, we're storing some data, we're adding an order to the Entity Framework data context, and then we're publishing a message. If we think about this line by line, what could go wrong? If we can't save the data, that's fine, because we haven't done anything else yet. But what if we're able to save that information, store that in the database, and then we're unable to publish that message to the rest of the system? At that point, we have some information in our system that's essentially dead. The rest of the system doesn't know about it. So we call this a zombie record. It's there, but it's half alive, and we don't really know what to do with it.
00:50:08 Laila Bougria
So you might look at this code and say, "Well, let's just turn it around then." Right? Simple. Let's think through that. So let's publish the message first and tell the rest of the system, "Hey, there's an order available." But what if we're then unable to reach the database? At that point, we have told the rest of the system that there's an order available, and maybe a billing endpoint has started to send the bill to the customer, and we have shipping starting to prepare things, but it's not there in the database. That's what we call a ghost message. Right? That could lead to a ripple of failures, because all of those other subsystems that are part of your system are making assumptions about that order being available. Right?
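A sketch of the snippet being discussed, with comments marking the two failure modes (the data context and message types are illustrative EF Core and NServiceBus stand-ins):

```csharp
using System.Threading.Tasks;
using NServiceBus;

public async Task PlaceOrder(Order order, OrdersDataContext dataContext, IMessageSession messageSession)
{
    dataContext.Orders.Add(order);
    await dataContext.SaveChangesAsync();
    // if the publish below fails, the order is in the database but the rest
    // of the system never hears about it: a zombie record

    await messageSession.Publish(new OrderPlaced { OrderId = order.Id });
    // reversing the two operations doesn't help: if the save then fails,
    // billing and shipping react to an order that isn't stored anywhere,
    // which is a ghost message
}
```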
00:50:55 Laila Bougria
The Outbox solves this problem for you by mimicking an atomic transaction, so you don't have to stay awake thinking about these problems. That means that operations are atomic and eventually consistent. It's not a DTC transaction. We don't have those. But we guarantee through a complex mechanism that it will be there and it will be eventually consistent. The message might still be sent or published more than once, which is why it's very important to enable the Outbox on all of the endpoints to make that work.
00:51:31 Adam Ralph
So Laila, you're saying that the Outbox needs to be enabled at all the endpoints, and I guess that makes sense because you are saying a message handler here, this is a message handler at an endpoint.
00:51:43 Laila Bougria
Right.
00:51:44 Adam Ralph
But what I've seen a lot is that people might have a web API controller, for instance. I might get a web request in to submit an order, and then they want to do the same kind of thing. Right? They want to send a message and they may need to do a database operation at the same time. So what do we do in that kind of scenario?
00:52:02 Laila Bougria
Yeah, that was going to be my next point. So very good question. Because this entire thing that I just told you only holds up whenever you're handling a message, so in the context of an incoming message. So this is a scenario I think, Adam, that you are trying to get across. Right? We're in a controller type of situation and we are storing some data and sending some messages. Nope. No Outbox guarantees here. Right?
00:52:29 Laila Bougria
So we usually recommend to not store data and send messages within a controller, but rather just send a message and defer any storage operations that you want to execute to when you are handling that message because then you get retries and you get Outbox and all that jazz. Right? But sometimes, life gets in the way. And as I said, systems are complex. We could have very, very specific requirements that require us to do this. Or even a legacy scenario. Right?
00:53:01 Laila Bougria
Let's say that we have a legacy application and we want to start introducing messaging there. We can't go in from day one and change all of the controller methods. That's not going to work, right? And we've been hearing from a lot of customers and users running into this problem. And that's why we introduced a new feature called the transactional session, which offers the same atomicity and eventual consistency guarantees that you get with the Outbox, but outside of a message handler. This is available through a dedicated package.
00:53:36 Laila Bougria
So the Outbox can be enabled without pulling in anything extra. You do need a persistence layer, but you don't need any additional package to get the Outbox to work. That's not true for the transactional session. This is a dedicated NuGet package that you need to include in your projects. And another thing to consider is that it builds on top of the Outbox functionality. So if you want true atomicity, you need both the Outbox and the transactional session in order for this to work with those consistency and atomicity guarantees.
00:54:12 Laila Bougria
But if we just look at the code, it's pretty simple. We're going to open up a message session, do whatever data modifications that we need, whatever storage operations that we need, and then we're going to send whatever messages that we require to send. And when we're ready, we're going to commit that message session. I'm showing the entire code here, but you could also lift this up, again, into your middleware so it's seamless and you don't have to think, "Oh, I have to open a session and I have to commit the session." So you could lift that up, again, cross-cutting concern and handle it that way. Still using the retry policy so that you get retries here as well. Also important.
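A sketch of that flow (the open-options type shown is for SQL persistence; the data context and message types are illustrative):

```csharp
using System.Threading.Tasks;
using NServiceBus.TransactionalSession;

public async Task PlaceOrder(ITransactionalSession session, OrdersDataContext dataContext, Order order)
{
    await session.Open(new SqlPersistenceOpenSessionOptions());

    dataContext.Orders.Add(order);         // storage operations...
    await dataContext.SaveChangesAsync();

    await session.Send(new PlaceOrder());  // ...and message operations

    // committing makes both the storage and message operations take effect,
    // atomically and eventually consistently
    await session.Commit();
}
```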
00:54:56 Laila Bougria
So at this point, we have built layer upon layer upon layer of additional resilience into our system, taking a lot of things into account. So I think, at this point, we're quite safe to say that we are well equipped to deal with transient exceptions especially. So one thing that I like to say, because people are creatures of habit, and I've definitely been there, is to rewire our brain and say, "Generally speaking, you don't have to catch exceptions ever again." There could be some very specific scenarios in which you say, "Okay, I'm talking to an API somewhere. And I know that that API can throw intentional exceptions when a resource is not found, and I want to catch that so that I can handle it accordingly." That's a very specific scenario.
00:55:50 Laila Bougria
But you don't have to be thinking about what could possibly go wrong and think of all of the possible exceptions that might occur, because we've introduced retries that support the resilience of our system and see if trying a few more times solves the problem. In doing so, we've also taken into account many design considerations that we have to keep in the back of our mind. Idempotency is the main one, right? Don't forget that one. And the cool thing is that these retries don't require any manual intervention anymore. It probably won't even affect users, or at least not as much, because all of those small transient exceptions will just succeed on a later retry.
00:56:39 Laila Bougria
And this doesn't only benefit the customers who are not looking at error screens anymore, or the business that's happy because they have additional money, because the system is more resilient. It's also very beneficial for us as software developers. I mean, no more PagerDuty calls, right? Or no people standing at your desk saying, "We ran into an exception. Can you please look at this?" Only for you to go look and see, "Oh, it's not there. I can't reproduce it." Right? It's already been solved.
00:57:10 Laila Bougria
We are basically throwing away so much of our own time looking at these types of things, and it's wasted effort because we can't really solve them. Right? So that allows us to refocus all of that time on what really matters, which is the systemic exceptions, the things that actually need fixing, the things that actually need our attention.
00:57:34 Adam Ralph
Laila, this is getting me excited now. You're basically saying that I don't have to catch exceptions anymore.
00:57:41 Laila Bougria
No, you don't. But I want to make something clear. The slide says, "Don't ever catch exceptions." I never said you don't have to care about exceptions anymore. There's a very important monitoring angle to consider. Right? You do want to be aware of how many exceptions are occurring. Why are they occurring? How frequent are they? Are they happening at a certain time of day? Are they specific to a place in the world? It could be that there's a database somewhere that needs a little bit more power and you need to scale that up. Or it could be that there's a service that's getting overloaded and it would be useful to scale that out.
00:58:21 Laila Bougria
So it's definitely something that you need to monitor, but it's not something that you need to solve anymore, at least not the ones that are transient. So to summarize this, there are multiple strategies available to increase the resilience of your system. And obviously what you need will be dependent on whatever your requirements are. But I think it's very important to remember to not force your users to do retries. Do it for them. We have all of the tools available for our usage to just not have to make users face that anymore.
00:59:01 Laila Bougria
Also, don't write it yourself. It's already available. It's just a package away: a few lines of code, simple and easy. You can use Polly for that. And if you're running into scenarios where you can't afford to lose any data, because that would mean losing money, then embrace asynchronous messaging.
00:59:25 Laila Bougria
Good. That was the session. I want to share some additional resources. So if you scan the QR code, that will lead you to one of my repositories where I have a bunch of links to resources around system resilience that I read through, so you can dive in deeper if you want to. And there's also a course available by Udi Dahan about the fallacies of distributed computing, which is free for now. So definitely take a look at that as well.
00:59:54 Laila Bougria
And Adam, I'm curious to see if there are any questions. Again, like Adam said, if we don't get to all of the questions, we will also get back to you at a later point. Or you can contact me @noctovis on Twitter or through my LinkedIn.
01:00:09 Adam Ralph
Yeah, thanks Laila. So we do actually have some interesting questions. I'll pick out a few of them so we can get through now. Someone's asking, "If we're calling an external API, can a queue help us there?"
01:00:24 Laila Bougria
Okay. That's a very good question. So I would say yes. Because if you defer that and you say, "Okay, I'm going to send a message, and as part of handling that message, I'm going to be talking to that external service," then if that service is down, we're getting the retries there and we're not affecting that synchronous path. It gets a little bit tricky if you need real-time information. Because most of the time, it will still be almost real time, but due to messaging, you're introducing that asynchronous flow. So it's definitely something that you have to keep in the back of your mind and rethink.
01:01:04 Laila Bougria
But if you send that message to a queue, handling that can be done by a dedicated message handler, which means that you get all of the benefits that we just talked about, and you're also not bringing down the service that needs the data from that other service or needs to access that other service.
01:01:25 Adam Ralph
That sounds great. Yeah. So it really comes down to isolating that into its own operation, right?
01:01:31 Laila Bougria
Yes.
01:01:33 Adam Ralph
Okay. So we've got another interesting question here. Why is the Outbox not a default? So moreover, why is it even configurable? What reason would I have not to have it enabled?
01:01:47 Laila Bougria
That's also a good question. Well, I guess it depends, right? I think if you think about how NServiceBus was originally built, it was built on top of MSMQ. We had distributed transactions there. And as we extended into having other options when it comes to queuing technologies and persistence, we ran into this problem of, "Oh, we don't have the DTC available anymore, so this is a problem that we need to solve."
01:02:17 Laila Bougria
And it's only in the last few years that that has become the default. So I would definitely suggest to always turn it on. But a thing to consider is that once you require the Outbox, you're also going to require a persistence mechanism. So you're going to need some storage, and that might not always be feasible depending on what you are doing. And there might be scenarios where you just don't care. Let's say that you're ingesting metrics data. If you have a duplicate now and then, it might just matter less. So it still depends, like always.
01:02:53 Adam Ralph
Yeah. So I guess, like everything, it's not a golden hammer, right? It's got its uses, and it sounds like it's very, very often useful, but perhaps not always. Okay. Let's see. Maybe we've got time for one more question. Let's have a look. Are there any plans to support the Outbox in a controller context for atomic update-and-publish scenarios? So I think that actually comes back down to the transactional session, right?
01:03:21 Laila Bougria
Yes. Exactly. So yes, that's already supported. Again, if you want the true atomicity guarantees, you need both the Outbox and the transactional session on top of that. I will also highly suggest that you look at our documentation, both regarding the Outbox and the transactional session because we explain exactly how that mechanism works behind the scenes. And it's also just interesting reading, right?
01:03:42 Adam Ralph
And that's actually quite a recent package, isn't it?
01:03:44 Laila Bougria
Yes.
01:03:44 Adam Ralph
So I guess a lot of people who are watching might not be aware that that's actually happening now. So it's a pretty exciting thing.
01:03:49 Laila Bougria
Yeah, probably weeks even. So yes, definitely something new.
01:03:54 Adam Ralph
And before that, there was no real support for this, so it's quite exciting to close that gap as well with this new package.
01:03:59 Laila Bougria
Yes. Definitely.
01:04:02 Adam Ralph
Excellent. So we do have one or two other questions, but as we said, we'll follow up with those offline with you later. So I guess, we can wrap up now. So thank you very much for attending this Particular Software live webinar, and thanks to Laila for an excellent presentation.
01:04:19 Adam Ralph
Our colleagues will be speaking in January at NDC in London, and you'll also see more of the Particular Software team at our booth there. And you can always go to particular.net/events to find us at a conference near you. So that's all we have time for today. On behalf of Laila Bougria, this is Adam Ralph, saying goodbye for now, and see you at the next Particular Software live webinar.

About Laila Bougria

Laila Bougria is a software engineer at Particular Software, makers of NServiceBus. She loves to knit, and she’s always ready to correct failures, both in code and in yarn. In the rest of her free time she plays games with her kids, encouraging them to retry until they succeed.

Additional resources