
Webinar recording

Beyond simple benchmarks—a practical guide to optimizing code

Are your optimization techniques really boosting system performance? It’s time to make sure with benchmarking!

Why attend?

We know it’s vital that code executed at scale performs well. But how do we know if our performance optimizations actually make it faster? Fortunately, we have powerful tools which help—BenchmarkDotNet is a .NET library for benchmarking optimizations, with many simple examples to help get started.

In most systems, the code we need to optimize is rarely straightforward. It contains assumptions we need to discover before we even know what to improve. The code is hard to isolate. It has dependencies, which may or may not be relevant to optimization. And even when we’ve decided what to optimize, it’s hard to reliably benchmark the before and after. Only measurements can tell us if our changes actually make things faster. Without them, we could even make things slower, without realizing.

In this webinar you'll learn how to:

  • Identify areas of improvement which optimize the effort-to-value ratio
  • Isolate code to make its performance measurable without extensive refactoring
  • Apply the performance loop of measuring, changing and validating to ensure performance actually improves and nothing breaks
  • Gradually become more “performance aware” without it costing an arm and a leg

Transcription

00:00:00 Kylie Rozegnal
All right. So welcome to our webinar on optimizing code performance. I'm Kylie, your host for today, and we're joined by Daniel. Daniel is a Microsoft MVP from Particular Software. Now Daniel is an expert in improving .NET message-based systems. And he'll guide us on using BenchmarkDotNet for performance improvements and share insights on cost-effective optimization. A Q&A will follow. So ask your questions, like I said, in the Q&A section. All right, Daniel, it's all you.
00:00:32 Daniel Marbach
Hey Kylie, thanks for the warm welcome. Hi everyone who joined us today for this webinar, Beyond Simple Benchmarks. Like Kylie said, I'm Daniel Marbach, I'm from Switzerland. And we're going to be presenting to you about this very interesting topic. And I saw the poll already, so we have quite a few people that are new to benchmarking and performance optimization, so that's great. I think then this talk will be even more relevant to you. If you have any questions that I cannot answer in the follow-up, you can reach out to me over social media. You can reach out on X under my name, or you can send me an email at Daniel.marbach@particular.net. Or you can also find me on LinkedIn should you wish to connect there. And I'm happy to follow up with you as well. So, I remember the first time I started benchmarking my code changes to verify whether the things that I thought might accelerate this code really made an impact.
00:01:28 Daniel Marbach
I had already seen quite a few benchmarks similar to this one that I'm presenting on the screen, written with BenchmarkDotNet. I saw them in blog posts, in open source projects. And, by the way, don't bother looking too closely at the code I'm presenting on the screen, because the code is not really that relevant. I didn't either, at the time. I was just conceptualizing BenchmarkDotNet as a unit test, because my conceptual understanding was like, “Oh yeah, it's like a unit test.” I've written quite a few unit tests with xUnit, NUnit, and God forbid, MSTest. I'm sorry for those that have to use it, I had to use it as well. I feel you. So I felt quite certain it wouldn't take me too long to actually get started with BenchmarkDotNet and actually write benchmarks. But I was horribly wrong.
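For reference, a minimal BenchmarkDotNet skeleton of the kind being described might look roughly like this. This is an illustrative sketch, not the code from the slides; the benchmarked string concatenation is just a placeholder.

```csharp
// Minimal BenchmarkDotNet skeleton (illustrative only).
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class StringConcatBenchmarks
{
    // Fields are used so the work isn't folded away at compile time.
    private readonly string first = "Hello";
    private readonly string second = "World";

    [Benchmark]
    public string Concatenate() => first + " " + second;
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<StringConcatBenchmarks>();
}
```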
00:02:19 Daniel Marbach
I mean, writing the skeleton of the benchmark was indeed very simple. The mind-boggling part for me was trying to figure out what should be taken into account in the benchmark. How to isolate the code without spending crazy amounts of time refactoring it just to make it possible to benchmark it. What should be deliberately cut away to make sure the changes that I envisioned are going in the right direction? Because that's important. I want to know where I'm heading. And how to measure, change, and measure again without burning away the allotted budget. Sometimes I've been doing these in my free time, and my allotted budget in my free time is smaller. But even in professional environments, you can't just lock yourself away, ignore all the customers, and do these types of performance optimizations and benchmarking all the time, unless that's your job.
00:03:10 Daniel Marbach
But wherever we are in the profession, we need to deliver code that actually makes a business impact. But then the question is, “Why would I even bother and go through all this hassle?” Because, as we've already talked about, we usually only have a limited budget available. Well, the thing is, for code that is executed at scale, the overall throughput and memory characteristics are really important. Because code that wastes unnecessary CPU and memory cycles ends up eating away resources that could be used to serve requests. And especially with modern cloud native approaches, and even if you're not there yet, if you're planning to migrate to cloud native or you have your own data centers, you want to make sure that these resources are used as efficiently as possible. And for code that runs at scale like that, it's even more important to make sure we are using these resources as efficiently as possible.
00:04:06 Daniel Marbach
And in the cloud we usually get billed by the number of resources consumed. And the more efficient the code is, the smaller the bill, and the more requests we can execute for the same amount of money, essentially. And let's not forget, more efficient code execution also means that we are consuming less energy, which is an important cornerstone of the green IT movement as well. Or, let me give you an even better example, to quote Microsoft. So Microsoft has a blog series where they talk about the teams that are migrating existing infrastructure that is, for example, running on .NET Framework to modern .NET versions. And I have here one quote from the Microsoft Teams infrastructure and Azure Communication Services journey to .NET. And the quote there is, “We're able to see Azure Compute cost reduction up to 50% per month. On average we observe 24% monthly cost reduction after migrating to .NET 6. The reduction in cores reduced Azure spend by 24%.”
00:05:04 Daniel Marbach
And the performance improvements that they got by migrating from .NET Framework to .NET 6, to leverage more modern programming techniques and memory-efficient, and also CPU-efficient, code: they got 30 to 50% improvements in performance, including the P99 CPU utilization and P99 service latency. That's quite amazing, just by leveraging these techniques. And in this talk I have summarized my personal lessons on how to make performance optimization actionable. I will also show you a practical process to identify some of the common bottlenecks, isolate components, and measure, change, and measure again without breaking the current behavior. I have to warn you though, I will not cover all the ups and downs of BenchmarkDotNet. I will also not cover the mechanics of using profilers, although you'll get a fair number of insights into the analysis process of understanding some profiler output. But let's not waste more time.
00:06:02 Daniel Marbach
Let's get into the essence of the talk. So for me, one of the key principles I try to apply to almost everything in software is making explicit trade-offs and decisions as we go, together with the teams that I'm working with. This also applies to performance. And I would say a reasonably mature team should be performance aware. Or as my friend, Maarten Balliauw, from Belgium once famously said, “In some countries you have to be bear aware.” Because for example, when you're hiking in Canada, it's good to be prepared for the likelihood of a bear crossing your hiking path. Not so much though in Switzerland where I live. I guess we have some wolves, but I'm definitely not aware that we have bears, but I'm completely digressing here. That's not relevant for the talk. But when it comes to performance, being performance aware doesn't mean you have to always go all the way in, not at all.
00:06:54 Daniel Marbach
In fact, I personally always start with the simple solution that just works first, and then I get some reasonable test coverage in place. Once I have a working solution with good test coverage, I start asking myself questions like, for example, “How is the code going to be executed at scale? And what would be the memory characteristics of this code?” And that can be just a gut feeling based on my more than 10 years of experience in writing .NET code. Then the next question that I ask myself is, “Are there simple low-hanging fruits I can apply to accelerate this code? Are there things that can be moved away from the hot path by simply restructuring my code a bit?” Because everything that I don't have to do on the hot path is something that doesn't impact my hot path. So that's always a good question to ask.
00:07:42 Daniel Marbach
Then my question is also, “What part is under my control and what isn't really?” So for example, sometimes I only own parts of the stack that is in place in a specific code path. Sometimes another team owns it, or sometimes I have to interact with a third-party library that is not under my control and that I can't contribute to, because it might be closed source. If it's open source, I might be lucky, because then I can go there and maybe contribute, if that turns out to be the culprit of the code that I'm looking at. But I have to be aware of the boundary conditions around the piece of code that I'm looking at. And then, last but not least, is, “What optimizations can I apply and when should I stop?” Because these types of performance optimizations can get very addictive, and I have to always find the right balance to make sure that I'm using the time that I have, and everything else involved, as efficiently as possible.
00:08:42 Daniel Marbach
And I have covered some of these nuances further in my talk Performance Tricks I Learned From Contributing to the Azure .NET SDK. You can look that up on YouTube if you're interested to hear more. But then, let's put that aside. Once I have a better understanding of the context of the code, by asking these questions that I just showed you, depending on the outcome, I start applying the following performance loop. So the performance loop is something that I came up with, based on my experience of doing performance optimizations. And I feel it's a great mechanical approach to apply performance optimizations in a structured way. And what I do is I usually start with profiling first. That's the entry into the performance loop. I write a simple sample or harness that makes it possible to observe the component under inspection with a memory profiler and a performance profiler.
00:09:37 Daniel Marbach
So the profiler snapshots and the flame graphs that I'm seeing there give me an indication of the different subsystems at play, allowing me to make explicit decisions on what to focus on and what to ignore. And then once I have a good understanding of the code that I have in front of me, I can start improving it. I usually select the hot path. For example, the one that is responsible for the majority of the allocations or the majority of the CPU time spent, or the biggest slowdown in other words. Or where I feel I can make a good enough impact without sinking days and weeks into it. If the code path in question is not really well covered with tests, I usually get test coverage in place first before I start doing any improvements, because it doesn't really help if the code is extremely fast, but utterly wrong.
00:10:28 Daniel Marbach
And then, I start experimenting with the changes that I have in mind and check whether they pass the tests that I have in place. Once it functionally works, I put things into a performance harness. And then I can start benchmarking. So, what I usually do to save time is I extract the code, as well as possible, into a dedicated repository. And then I do a series of short runs to see if I'm heading in the right direction. Once I'm reasonably happy with the outcome, I do a full run to verify the before and after. And that is a rinse and repeat, or what I call the inner loop. Because sometimes I am learning more about the code, and I see, “Oh, I can do further tweaks.” I do some improvements, put it back into the benchmark, and then I go back and forth there. And once I'm done, I'm actually not done yet.
00:11:24 Daniel Marbach
Because what I do then is I profile it again with the harness that I wrote earlier and make sure that the code that I optimized and benchmarked in the benchmark scenarios actually leads to the things that I want to see, by using the test harness again. And once I'm then reasonably happy with it, I ship the code and turn my attention to other parts of the stack. So what I can do is I can go further down in the call stack, or I go into other components and focus my attention there. If I still have budget to work on performance optimization, I can then focus on other parts of the system. But enough of this overview of the process. Let's dive into a more practical example. Because I work for Particular, I'm going to cover that with NServiceBus. So, NServiceBus is the heart of distributed systems and is part of the Particular service platform.
00:12:20 Daniel Marbach
It helps create systems that are scalable, reliable and flexible. It's an abstraction on top of message-based systems. It can integrate with Azure Service Bus, storage queues, SNS, SQS, and it allows you to focus on your business code. If you want to learn more about NServiceBus, if you're new to it, you can go to go.particular.net/webinar-2023-quickstart. But at the heart of NServiceBus is this, what we call the NServiceBus pipeline. And that's the most critical infrastructure piece inside NServiceBus, or an NServiceBus endpoint that consumes messages from the queue. The pipeline is the engine that makes sure all the required steps involved, like serialization, deserialization, transactions, database access, sending and receiving messages, are executed as efficiently as possible. And as such, it's very crucial for the pipeline to not get in the way of the customer's code.
00:13:21 Daniel Marbach
And as we can see here on the screen, when the pipeline pumps messages from the queue, it has a set of behaviors that are executed sequentially, one after another. And the behaviors are essentially the things into which we plug the necessary infrastructure pieces, but customers can also plug in their own parts to extend the pipeline in an open-closed-principle way. That's what NServiceBus does. And if you're not familiar with behaviors, you can conceptualize them similar to ASP.NET Core middleware. So ASP.NET Core has a middleware structure, and I have here an example of a request culture middleware where you have a class that has a method called InvokeAsync, and that method gets a context in and has a next delegate where you pass the context on to the next middleware in the ASP.NET Core pipeline.
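The shape being referred to looks roughly like the request culture middleware example from the ASP.NET Core documentation, simplified here as a sketch:

```csharp
using System.Globalization;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class RequestCultureMiddleware
{
    private readonly RequestDelegate next;

    public RequestCultureMiddleware(RequestDelegate next) => this.next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // Pick the culture from the query string, if one was supplied.
        var cultureQuery = context.Request.Query["culture"];
        if (!string.IsNullOrWhiteSpace(cultureQuery))
        {
            var culture = new CultureInfo(cultureQuery);
            CultureInfo.CurrentCulture = culture;
            CultureInfo.CurrentUICulture = culture;
        }

        // Hand the context on to the next middleware in the pipeline.
        await next(context);
    }
}
```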
00:14:12 Daniel Marbach
And in NServiceBus, those behaviors are actually extremely similar. Let me show you that. So we essentially have a class that inherits from behavior and it defines a stage. Here it's called the incoming logical message context. There are multiple stages. The logical one is where the message is already deserialized; there is also the physical stage where the message is not yet deserialized. And essentially the heart of the logic happens there. So what we do there, for example: we find out what message type we're getting, we deserialize things, then we get the customer code from the IoC container and we call it. Then we have integration with Cosmos DB, SQL Server, DynamoDB, Entity Framework and NHibernate, and so all sorts of IO-bound systems. And there are also bits and pieces that create OpenTelemetry traces, logs, and much more.
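A behavior for that logical incoming stage has roughly this shape. This is a sketch based on the description above; the logging it does is made up for illustration:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus.Pipeline;

public class LogIncomingMessageBehavior : Behavior<IIncomingLogicalMessageContext>
{
    public override async Task Invoke(IIncomingLogicalMessageContext context, Func<Task> next)
    {
        // At the logical stage the message has already been deserialized.
        Console.WriteLine($"Processing message of type {context.Message.MessageType}");

        // Pass control to the next behavior in the pipeline.
        await next();
    }
}
```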
00:15:06 Daniel Marbach
So in order to get an understanding of when we have to look at the pipeline and how we can optimize it, we can do that by actually doing profiling. And that's the first thing before we even go into benchmarking. So to get a good overview of the problem domain in front of us, it's vital to create a sample or a harness that allows us to zoom in on the problem space. And again, since my goal is to optimize the pipeline invocation, I can look at the pipeline invocation with a tool like dotTrace from JetBrains to get a good understanding of the performance bottlenecks, and analyze the memory usage with a tool like dotMemory. I always recommend looking at multiple aspects such as memory, CPU and IO involvement to get more insights from multiple angles. And by the way, I'm mentioning the JetBrains tools because I'm a big fan of them, but you can also use your favorite other tools, there are multiple on the market.
00:16:01 Daniel Marbach
You can even use the Visual Studio built-in stuff, if you have access to that, depending on your license. So your mileage may vary there. So, to set up a harness that makes sure that we can look at the pipeline invocation, I created this small application here on the screen. So first of all, I essentially set up NServiceBus and I configured the MSMQ transport. Because on Windows that's the simplest thing. It's already built in, it's super fast because it's local, and I used that at the time to profile the pipeline. Then, the next thing I'm doing is I'm using a serializer that is quite fast. I'm using the System.Text.Json serializer, so it doesn't get in the way and doesn't skew the picture with stuff that I'm not really interested in for this test harness.
00:16:51 Daniel Marbach
And then what I'm also doing is I'm adding an in-memory persistence, because I do not want to see any SQL transaction stuff; it's just noise for the thing that I'm looking at. And then, what I'm doing is, I'm publishing a thousand messages with NServiceBus. And then what I also have in this test harness is a bunch of Console.ReadLine calls that are interaction points where I can attach the profiler and say, “Okay, now NServiceBus has started.” Now I can take a snapshot, and then once I have published, I take another snapshot. So I have multiple points where I think, okay, there is something relevant happening and I want to see what's going on. And then what I also put in place is a handler that receives all those thousand messages that are published.
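Put together, such a harness could look roughly like the sketch below. The endpoint name, event type and handler are made up, and the exact NServiceBus configuration calls depend on the version in use, so treat this as an approximation of the harness described here rather than the actual code from the slides.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

class Program
{
    static async Task Main()
    {
        var configuration = new EndpointConfiguration("PipelineProfilingHarness");
        configuration.UseTransport<MsmqTransport>();            // local and fast on Windows
        configuration.UseSerialization<SystemJsonSerializer>(); // fast serializer that stays out of the way
        configuration.UsePersistence<InMemoryPersistence>();    // no SQL/transaction noise in the profile

        var endpoint = await Endpoint.Start(configuration);

        Console.WriteLine("Endpoint started. Attach the profiler, take a snapshot, then press Enter.");
        Console.ReadLine(); // snapshot point 1

        for (var i = 0; i < 1000; i++)
        {
            await endpoint.Publish(new SomethingHappened());
        }

        Console.WriteLine("Published 1000 messages. Take another snapshot, then press Enter to stop.");
        Console.ReadLine(); // snapshot point 2

        await endpoint.Stop();
    }
}

public class SomethingHappened : IEvent { }

public class SomethingHappenedHandler : IHandleMessages<SomethingHappened>
{
    public Task Handle(SomethingHappened message, IMessageHandlerContext context)
    {
        Console.WriteLine("Handled"); // kept only to prove the harness actually works
        return Task.CompletedTask;
    }
}
```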
00:17:40 Daniel Marbach
In the handler I also have a Console.WriteLine. Normally I would advise you not to do a Console.WriteLine because it also skews the picture slightly. I added it to make sure that the test harness actually works. Because again, if the test harness is already wrong, we are heading down a not-so-optimal path. So we want to make sure that everything works as well. And then, usually a good test harness should adhere to a few things that I have here on the screen. I'm sorry for that. So usually, what you should be doing is: it should be compiled and executed in release mode. It sounds super obvious, but I'm mentioning it here because the number of times I have actually forgotten this is staggering. Then it should run a few seconds and keep the overhead quite minimal.
00:18:28 Daniel Marbach
Then you should disable the tiered JIT (the JIT is the just-in-time compiler) to avoid any warmup effects. So you need to set TieredCompilation to false in your csproj. And then what I usually also advise, because by default in release mode you don't get the full symbols: add the full debug symbols so that you get a good picture of the stack traces and everything involved, should you want to have a closer look at it.
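In csproj terms, those harness settings would look something like the snippet below. These are the standard MSBuild property names; adjust to your SDK version and remember to build with the Release configuration.

```xml
<PropertyGroup>
  <!-- Disable tiered JIT compilation to avoid warmup effects in the harness. -->
  <TieredCompilation>false</TieredCompilation>
  <!-- Keep full debug symbols even in Release builds for readable stack traces. -->
  <DebugType>full</DebugType>
  <DebugSymbols>true</DebugSymbols>
</PropertyGroup>
```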
00:18:57 Daniel Marbach
So that's the general thing that a good test harness should encompass. Good. Then, the first thing that I'm doing is I'm looking at the memory allocations. Because allocations usually have a big impact on the throughput of the system and can usually be optimized up to a certain point with a few quick wins, without unnecessarily increasing the complexity just yet. This often comes from the fact that many systems have not yet been optimized for memory allocations. And to understand the data that is presented in front of us, we usually require domain knowledge of the problem at hand. And that knowledge helps us to navigate through the maze of noise that we might see.
00:19:40 Daniel Marbach
And as we can see here on the screen, there are numerous byte array, memory stream and stream writer allocations that are quite hefty. But we are interested in the pipeline execution, remember. So, when we zoom in, we can actually see that 20 megabytes of the pipeline allocations are coming from this behavior chain invoke-next thing. Now the question is, are we done yet? No, because this is just the publishing part of the pipeline, where I'm actually sending messages. And remember, I send 1,000 messages in the test harness. Now we want to look at the full picture, also on the receiving end. If we look at the receiving end when the messages are coming in, we see, again, lots of byte array, XML text reader node and message extension allocations. But again, because we are looking at the pipeline, we have to blend out this noise and look at what is actually coming from the pipeline. And using my domain knowledge, I know that these Func<IBehaviorContext, Task> allocations are coming from the pipeline.
00:20:38 Daniel Marbach
Good. We have another 27.83 megabytes of allocations that are potentially unnecessary, that we might be able to get rid of. Then what I also usually do is I look at the stack trace. And here, I have an example of the stack trace. It's pretty massive. It contains plenty of steps that clearly hide the actual pipeline operations, like the mutate incoming transport message behavior or the unit of work behavior. And conceptually, if you're not too familiar with NServiceBus, you have to think of it like this: that's the actual business logic that we plug into NServiceBus, and that is not really visible because of the rest. As we can see here on the screenshot, we have this Behavior<TContext>, behavior invoker, behavior chain, behavior chain display class and blah, blah, blah, until we actually see the unit of work behavior. And the in-between stuff is basically just noise.
00:21:30 Daniel Marbach
It's infrastructure stuff that enables things, but we might be able to get rid of it. And then there's also a question about the allocation screenshot I showed you previously, where there were lots and lots of byte array allocations that were heftier than these 20 megabytes and 27 megabytes of allocations that were actually coming from the pipeline. And you might be thinking, “But Daniel, I mean come on. All these gymnastics, why aren't you getting rid of those allocations? Shouldn't you do that?” Because they're super hefty. And yeah, technically, ideally yes, we would actually want to get rid of all the other allocations that are heftier in this picture. But in this specific case there are actually a few things we have to consider. So the context of the piece of code really matters, and that context is super crucial when we look at these profiler screenshots, or profiler outputs, sorry.
00:22:24 Daniel Marbach
So for example, in our case, the allocations that we saw are mostly coming from the MSMQ transport, which has a diminishing user base. Most users eventually transition away from MSMQ to either SQL Server, RabbitMQ, or cloud transports like Azure Service Bus or Amazon SQS. So our efforts there might lead to allocation reductions only for a very limited segment of our user base. Another angle could, for example, be that we might not be transport experts. We already know that by making iterative gains on the hot path, we will end up with great improvements, the 1% improvement philosophy. But since every activity has to be weighed against building features and all the other activities that we are doing in our day job, it might not be justifiable right now to ramp up the knowledge in that specific area. And then finally, our goal is to see what we can do to optimize the pipeline, because the pipeline optimizations have a great benefit for all the users across the board, in this specific example of NServiceBus, independent of the transport.
00:23:26 Daniel Marbach
So then, let's have a look at more allocations that are happening here. What we can do is we can zoom in more into the pipeline, and what we can then see is we have lots of behavior, behavior chain, Func<Task> and Func<IBehaviorContext, Task>, and display class allocations that are totally unnecessary. And with a tool like dotMemory, what we can also do is filter the view to a specific namespace. And when we do that, we can see all the allocations that are coming from everything that is in a specific namespace, and here it's the NServiceBus.Pipeline namespace. And then last but not least, like I said at the beginning of the talk, it's usually not enough to just look at one type of profile. We also have to...
00:24:16 Daniel Marbach
So it's not enough to just look at the memory profile. We also need to look at the CPU profile. Because an algorithm that spends a lot of CPU cycles executing on the hot path can also have a significant impact on the performance. And flame graphs are an extremely helpful tool to get a very quick understanding of what's going on. So let's take a look at the CPU characteristics of the publish operations in a flame graph. And I have it here on the screen. I know this might be a little bit overwhelming, but when you look at the top of the flame graph, we can see there's this message session publish, and we read the flame graph from top to bottom. That's what we are doing. That's the arrow on the left side. And what's great about the flame graph is that, without having a big understanding of what's going on, we can quickly understand where it's spending time and where it's not.
00:25:13 Daniel Marbach
Why is that the case? Well, with a flame graph, each call is shown as a horizontal bar whose length depends on the call's total time, which equals the function's own execution time plus the time of all its child functions being executed underneath. So the longer the call, the longer the bar is. So that means we can have a look and we can see there's all sorts of red stuff happening on the screen and there's some orange stuff on the side. And the red stuff, based on my domain knowledge, I can see is actually the infrastructure part, which might not be that relevant. And the orange part is the actual IoC container stuff, the integration, or the actual business logic code that is being executed. And if we zoom in on the flame graph, we can actually see that.
00:26:08 Daniel Marbach
We see that there is the mutate outgoing message behavior and there's lots of stuff going on in between. And then, only at some point deep down in the flame graph do we actually see the actual business logic being executed. And this is the flame graph that comes from the JetBrains tool. By the way, if you prefer using free tools, and for example you're running on Windows, I can highly recommend a tool called PerfView, which gives you very similar results, and in more recent versions you can also zoom in with PerfView. I have here the same picture visualized with PerfView. This time, unfortunately, it's bottom up. So the actual publish operation starts at the bottom. And, as you can see here, it's potentially a little bit of an overwhelming tool. I think it's quite an advanced tool. There are numerous great articles, blog posts, tutorials and videos available to get a better understanding of PerfView.
00:27:09 Daniel Marbach
I personally have used it, but I struggle sometimes to put all the right commands into the text fields and I usually prefer to go to another tool. But your mileage may vary. Just for those people that are thinking, “Daniel, you're clearly a fanboy of the JetBrains tools.” It's like, “Yeah, that's right. I am a fanboy of those tools.” But use whatever gets the job done and what you prefer. Because your mileage may vary. Okay. And then, again, if we're going back to the tools that I prefer, we can see that when we look at the hotspots overview filtered to the publish operation, we can see that the behavior chain is consuming 20% and the behavior invoker is consuming 12.3% of the CPU, which is a third of the overall time spent executing the mechanics of the pipeline. On the receive end, the picture is slightly less dramatic, but there's still a measurable impact if we look at the CPU characteristics.
00:28:10 Daniel Marbach
What we can see there is, again, if we are zooming in and we are looking at the hotspot view, we can see that the behavior chain is consuming 4.8% and the behavior invoker is consuming 9.2% of the CPU, which is a seventh of the overall time spent executing the mechanics of the pipeline. And now that you have seen the hotspot views, you might be wondering, “But Daniel, what about the flame graphs?” Well, I have them for you. Again, we're reading from top to bottom. And now we are in the receive part of the pipeline. We can actually see on top there is this receive with transaction scope, receive message, and we read top to bottom. And again, the same picture of the infrastructure overhead becomes apparent in the receive pipeline by only glancing at the flame graph. Because we see, again, lots of red stuff and then we have the orange, yellow part on the other side, which is the actual business logic that is being executed, which is way smaller.
00:29:12 Daniel Marbach
And now, well, let's have a look and zoom in. When we look at this, we can actually see again the same picture on the receive end. We see the mutate incoming transport message behavior, lots of infrastructure pieces, and then we see the actual business logic being executed way underneath, and the ratio between those things is way off on the infrastructure side, which hopefully we can get rid of. So now we understand the memory characteristics and the CPU characteristics. And if you're working with an IO-bound system and you want to look at the IO characteristics, you should also look at IO profiling, and some tools can actually do that. In the interest of time, because we have limited time in this webinar, I'm not covering this. I'm only covering the CPU and the memory characteristics. And I also believe that looking at CPU and memory gives you lots and lots of great indications for doing your optimizations. Then we can actually go and improve things.
00:30:09 Daniel Marbach
Because that's what we want to do. We want to change the code to make it faster. But hold your horses. We actually want to get good test coverage in place first. If we don't already have it, then we have to write it, at least at this specific point in time. Because we need to take our context, take the domain knowledge, and put tests in place to make sure that our improvement cycles are not invalidating what's already there and are not breaking things. So that's crucial. So, luckily with the NServiceBus pipeline, we already had lots and lots of acceptance tests in place that were executing the system end-to-end. But there were a few tests missing. Specifically what I call component tests. So basically the idea of a component test I had in mind there was to just take the pipeline and only execute the pipeline with the relevant parts, but not the rest of the system, to make sure the pipeline actually does what it's supposed to do.
00:31:06 Daniel Marbach
And I wrote a few tests that just execute the pipeline. And for example, there were a bunch of state objects that were passed into the pipeline. I added a test to make sure that nothing gets statically cached that shouldn't be statically cached, so that the thing that I now start tweaking actually works as it is intended to work. And then I start making improvements. In the interest of time, I'm not going deeper into the actual improvements that I did.
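A component test of the kind described might look like this very rough xUnit sketch. All type and member names here are illustrative stand-ins, not the actual NServiceBus test code:

```csharp
using System.Threading.Tasks;
using Xunit;

public class PipelineComponentTests
{
    [Fact]
    public async Task Should_not_cache_state_across_invocations()
    {
        // Hypothetical pipeline-under-test with a handful of behaviors.
        var pipeline = PipelineUnderTest.Build(behaviorCount: 10);

        var firstContext = new FakeBehaviorContext { Value = "first" };
        var secondContext = new FakeBehaviorContext { Value = "second" };

        await pipeline.Invoke(firstContext);
        await pipeline.Invoke(secondContext);

        // If something were statically cached, the second invocation
        // would still observe state from the first one.
        Assert.Equal("second", secondContext.ObservedValue);
    }
}
```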
00:31:37 Daniel Marbach
But I do have a bunch of blog posts about those improvements available on the Particular blog. You can go to go.particular.net/webinar-2023-pipeline. There's one blog post called 10 Times Faster Execution with Compiled Expression Trees and another called How We Achieved Five Times Faster Pipeline Execution by Removing Closure Allocations. And basically the gist of it is, I did a bunch of allocation optimization tricks that removed all sorts of allocations that we saw on the previous screenshots, and that gave us a 10x boost in the first iteration and another 5x boost after that. And in the third iteration, even another 20% boost in terms of throughput of the raw execution speed of the pipeline.
00:32:20 Daniel Marbach
I know these are very clickbait-y titles, but I hope you enjoy these blog posts, should you wish to dig deeper into what I did there, because that's not really relevant for the webinar today. Good. Now let's look at benchmarking the pipeline. So usually with code that we have in place, if you're lucky, the thing that you want to benchmark might be a public method on some helper or utility without countless external dependencies. Then everything is golden.
00:32:51 Daniel Marbach
Because then it's mostly simple: you can basically add a benchmark project to your solution, reference the assembly in question and then start calling this method. In the worst case, you might need to add InternalsVisibleTo to give the benchmark project access to this helper or utility. So much for the theory. Because in practice, software is way messier than we like to admit. At least the software that I am usually working with, or even worse, that I have created. Because components sometimes come with numerous dependencies, so we can bite the bullet and just throw them all under a benchmark. But then the gains that we are trying to compare might get lost in the signal-to-noise ratio. Or as Gordon Ramsay would say, “Software is a disgusting festering mess.”
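The InternalsVisibleTo trick mentioned above is a one-liner in the assembly you want to benchmark; "MyProject.Benchmarks" here is a placeholder for your benchmark project's assembly name:

```csharp
using System.Runtime.CompilerServices;

// Grants the benchmark project access to internal types and members.
[assembly: InternalsVisibleTo("MyProject.Benchmarks")]
```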
00:33:46 Daniel Marbach
So when I first started to face this problem, I started looking at various approaches and ended up with a very pragmatic, but potentially slightly controversial, approach. The pragmatic approach that I took is: I copy pasted the existing relevant pipeline components and adjusted the source code to the bare essentials. And this is how it looked. I'm going to present it to you on the screen. So, this basically allowed me to isolate the component under inspection. That's the pipeline stuff in that pipeline folder that you see. And here we have a bunch of classes: behavior chain, behavior instance, behavior invoker, step coordinators, whatnot. Don't worry about these names, they're just fancy names to basically describe the mechanics of what's going on. And as you can see, it's not just a simple class, it's not just one method.
00:34:39 Daniel Marbach
It's actually multiple classes and things that coordinate with each other to execute the pipeline execution logic. And the idea there is that we want to have a controllable environment that only changes when I adjust the copy-pasted sources that I took from the pipeline. What's also cool with this approach is that I can tweak the code with, for example, partial classes to be able to easily overload relevant aspects that I'm tweaking, and then execute different approaches against each other to make sure I select the one with the right trade-offs. And again, I want to highlight, this approach worked very well for me and I think there is great value in it for others. And to give you a more concrete example with the pipeline: what I started doing is trimming down the pipeline to only the behaviors that were relevant for my benchmark.
00:35:50 Daniel Marbach
So I removed everything that I felt is not relevant for the pipeline execution. And then, for example, what I also did is I removed the dependency injection container by creating the relevant classes, basically just newing them up in a hard-coded way for that specific scenario. Because again, I'm not interested in comparing different IoC containers that are out there. That's not my job. I'm looking at the pipeline execution and how I can make that part faster. Then, I already explained to you that in a real NServiceBus system there is lots of IO going on, because we access databases and we do transactions and whatnot. But again, for the raw pipeline execution speed, I can actually remove that noise and just return a completed task, since the IO operations are hundreds of thousands of times slower anyway. And again, our goal is to optimize the raw pipeline execution, so we need to remove them.
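Stubbing out the IO-bound dependencies can be as simple as this sketch; the interface and class names are illustrative, not the actual NServiceBus types:

```csharp
using System.Threading.Tasks;

public interface IPersistBusinessData
{
    Task Save(object data);
}

// Fake used only inside the benchmark: no IO, no yielding, so the raw
// pipeline execution dominates the measurement.
public class FakePersister : IPersistBusinessData
{
    public Task Save(object data) => Task.CompletedTask;
}
```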
00:36:51 Daniel Marbach
By the way, some of you who might be more advanced in the whole async world might be thinking, “Yeah, but Daniel, you're not tricking me. I know Task.CompletedTask, it just returns an already completed task and therefore there's no yielding happening with the asynchronous state machine. Isn't that actually skewing the results?” I'm going to talk a little bit more about BenchmarkDotNet and what it does. But the gist is, because of how BenchmarkDotNet actually executes things, with its statistical analysis and running the benchmark for a period of time, even if you used a task that yields, those effects would blend out against each other. And for this specific scenario, it's actually totally okay to just use a completed task. But again, context is the key and your mileage may vary depending on the scenario that you're looking at. Good. That being said, because I said it's a controversial approach since I'm copy pasting code, I would say: apply the 80/20 rule.
00:37:53 Daniel Marbach
For code that is not changing that often, this approach works very well and gets you started dipping your toes into becoming performance aware quite quickly, without overwhelming the whole organization with building up a performance culture. Because remember, doing cultural changes in an organization takes time, and you rather want to make changes gradually. You might become the first person that is doing these performance optimizations in a structured way; you might become the go-to person. And you also don't want to overwhelm yourself. And with this approach, you are also not confronting yourself or the organization with questions like, “How can I reliably execute those benchmarks? How can I set up the CI/CD pipeline? What do I need to take into account? Do I need to have dedicated hardware? Can I run it on GitHub Actions or on my DevOps runner? Is that good enough? Or do I need specific images to actually do that?”
00:38:54 Daniel Marbach
All of those mechanics, which come further down the road of becoming performance aware, you can essentially blend out and get started building this performance culture without overwhelming yourself and the whole organization. So that's sort of the key message that I want to give away. So, I have here a benchmark on the screen. And again, at the beginning I said my conceptual understanding was that it's like writing a unit test. But the thing is, they're actually not the same. Because when we write a unit test, we ideally want to test all methods and properties of a given type. We also test both the happy and the unhappy path. And the result of every unit test is basically a single value. It's either passed or it's failed. It's red or it's green. But benchmarks are completely different. So first and foremost, the result of a benchmark run is never a single value.
00:39:48 Daniel Marbach
It's always a whole distribution, described with values like mean, standard deviation, min, max and so on. And to get a meaningful distribution, the benchmark has to be executed many, many times. And this takes a lot of time, depending on the problem that you're looking at. So the goal of benchmarking is to test the performance of the methods that are frequently used on the hot path, and only those methods should be benchmarked. So the focus should be on the most common use cases and not on edge cases. But how do we find those hot paths and the most common cases? With the pipeline, one obvious case is the pipeline execution. And what we can see here is that when we do the raw pipeline execution, we first do a global setup with BenchmarkDotNet and we put building the pipeline into the global setup.
00:40:45 Daniel Marbach
Why the global setup? BenchmarkDotNet makes sure that this is executed once before all the iterations; it doesn't count toward the statistical analysis of the runs. And so we can move that away, because we are not interested here in actually measuring the building of the pipeline. And then the next thing that we have to do is think, “Well, what could actually, in real-life scenarios, influence the raw execution throughput of the pipeline?” And in our case I was thinking, well, the pipeline, because it's dynamically composable, has a pipeline depth. And what I looked at is, what are common cases, again, how deep can a pipeline actually be? I came up with numbers like 10, 20 and 40 and added these as parameters to the execution benchmark to make sure that we look at reasonable variations of how deep the pipeline can actually be.
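A sketch of what such an execution benchmark could look like is below. It is hypothetical: PipelineBefore, PipelineAfter and BehaviorContext stand in for the copy-pasted pipeline sources, and the short-run job is only there for the quick directional runs mentioned next.

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob] // quick directional runs while iterating; drop for the full comparison
[MemoryDiagnoser]
public class PipelineExecutionBenchmarks
{
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    private PipelineBefore before;
    private PipelineAfter after;
    private BehaviorContext context;

    [GlobalSetup]
    public void SetUp()
    {
        // Building the pipeline happens once, outside the measured iterations.
        before = PipelineBefore.Build(PipelineDepth);
        after = PipelineAfter.Build(PipelineDepth);
        context = new BehaviorContext();
    }

    [Benchmark(Baseline = true)]
    public Task Before() => before.Invoke(context);

    [Benchmark]
    public Task After() => after.Invoke(context);
}
```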
00:41:53 Daniel Marbach
But we have to be aware of parameter and combinatorial explosion. Because the more parameters we have that can influence the result, the more we blow up the execution time of the benchmark. And then, again, we also might be skewing the results. So we have to be careful. I selected only the pipeline depth as the single input parameter. And then the next thing that I usually do, when I'm still in this inner loop of benchmark and improve, is I add the short run attribute, because the short run, as the name already says, makes sure it only does a quick run. And that is just to give me a direction, to make sure that I'm heading in the right direction. Because at this point I'm not interested in the specific time or a fully statistical result. I just want to see that I'm heading in the right direction.
00:42:51 Daniel Marbach
And then at the end I add the actual pipeline execution. Here we can see the before: it's the pipeline before the optimization, the baseline of the benchmark, and then I compare it to one of the approaches that I execute. And then I run this benchmark. I'm not going to show you the benchmark results because that's also not relevant for this talk. If you're interested, you can run the code yourself from the repository that I will share towards the end of the talk. Then the question is, “What is actually a good benchmark?” So a good benchmark should follow the single responsibility principle, as all other methods should. It means a single benchmark should cover a single thing. Like the execution benchmark I've shown you: it only sets up the pipeline, executes the pipeline, and does the parametrization of how deep the pipeline can be. But that's all it does.
00:43:53 Daniel Marbach
It doesn't do any other scenario coverage. And then, a benchmark shouldn't have side effects. And that's really crucial, because after the global setup has run, the benchmark is going to do a bunch of iterations. And if every iteration mutates some state that then influences other iterations, that's not a good benchmark. For us, we're just setting up the pipeline and executing it over and over again, but there is no mutating state in there, so it has no side effects. So it's a good benchmark. Then another thing that's important: we should make sure that there is no dead code elimination. That means, because sometimes when code is not used, the JIT looks at it and says, “Oh, this code is not used. I'm going to compile it away.” And then you are in trouble because you're actually measuring nothing.
00:44:40 Daniel Marbach
So we need to make sure that, for example, we return stuff from the benchmark methods, or we use the Consumer class that is available in BenchmarkDotNet, to make sure any intermediate results are actually consumed by the benchmark infrastructure so that the JIT doesn't optimize them away. And then, use a good framework like BenchmarkDotNet that does all the heavy lifting for you. For example, BenchmarkDotNet doesn't require you to figure out the number of invocations per iteration. Because what BenchmarkDotNet does is run a pilot experiment stage based on some iteration time setting, then it does some statistical analysis and it runs the benchmark until the results are stable. So all that is abstracted away by the framework for you and you don't have to think about these edge cases.
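For the dead code elimination point, here is a minimal sketch of the two options, returning the result or using the Consumer class; the computation itself is a made-up placeholder:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

public class DeadCodeEliminationExamples
{
    private readonly Consumer consumer = new Consumer();

    [Benchmark]
    public int ReturnTheResult() =>
        Compute(42); // returned values are observed by BenchmarkDotNet, so the call survives

    [Benchmark]
    public void ConsumeIntermediateResults()
    {
        for (var i = 0; i < 10; i++)
        {
            consumer.Consume(Compute(i)); // prevents the JIT from eliminating the call
        }
    }

    private static int Compute(int seed) => seed * seed;
}
```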
00:45:35 Daniel Marbach
And then, what's also crucial for a benchmark: always be explicit. So basically, the code that you see in front of your eyes when you look at the benchmark should be the code that is actually executed. So no implicit casting should be in place, no vars; make sure you declare things explicitly so that you don't have surprising side effects because some implicit casting operator is in place that you're not seeing in the code, which then skews your results. And then this might be obvious, but sometimes when we're running benchmarks on our infrastructure, on our laptops or workstations, we forget to make sure that the benchmark has dedicated resources. So you don't want to run a Zoom webinar or a Teams meeting at the same time you're actually running benchmarks, because that's going to skew the results as well. So avoid doing that. Good.
00:46:29 Daniel Marbach
Now, I highly prefer BenchmarkDotNet because it abstracts a lot of things away for you. For example, it runs the benchmarks in separate processes. So it applies process isolation, so that any side effects are out of the window. It applies smart heuristics based on standard error and runs the benchmarks multiple times until the results are stable. That's also super nice.
00:47:01 Daniel Marbach
So the likelihood of measurement errors is reduced. And really, BenchmarkDotNet was designed to make benchmarks as accurate and results as repeatable as possible. And it's a very good library. Because at the end of the day, what BenchmarkDotNet also does is remove outliers, because sometimes you have measurement errors that would skew results, and removing those gives you a good statistical baseline. And at the end of the day, even for people that have been doing benchmarking and optimizations for a while, benchmarking is really hard. And you want to use a library that protects you from the common pitfalls and that does most of the dirty work for you, so that you can just focus on the actual problem that you're trying to solve.
00:47:49 Daniel Marbach
Then again, so far we have only covered the execution throughput in relation to the pipeline depth. Another scenario that is also relevant for the pipeline execution is that we need to measure the warm-up scenario. Why is that important? Because, as a design choice, the pipeline does some expression tree compilation. So it builds everything together, and that code takes time. And we want to make sure that we are super fast, also in serverless environments, and that we're not in the way. So the warm-up of the pipeline is also a scenario that we need to take into account. So we want to make sure that when we're changing and improving the pipeline, and potentially even the expression tree compilation, we are not making things worse. Or when we are making things worse, we want to know whether it's within thresholds that are acceptable, because we're making trade-offs between warmup and execution speed and all those things.
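A hypothetical sketch of such a warm-up benchmark follows: the pipeline building and expression tree compilation happen inside the measured method, so that cost is what gets compared. The types are stand-ins, not the actual code from the talk.

```csharp
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class PipelineWarmupBenchmarks
{
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    [Benchmark(Baseline = true)]
    public PipelineBefore WarmupBefore() => PipelineBefore.Build(PipelineDepth); // builds and compiles per invocation

    [Benchmark]
    public PipelineAfter WarmupAfter() => PipelineAfter.Build(PipelineDepth);
}
```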
00:48:45 Daniel Marbach
So that's what I've done in the warmup benchmark that I just showed you on the slide. And the next thing is that we have to look at what other scenarios we should take into account. So most of the time, it doesn't really make sense to benchmark code that throws. Because, as I told you, it's not a unit test. So a benchmark shouldn't really care about exception cases, because we would be benchmarking the performance of throwing and catching exceptions. And again, throwing exceptions should be exceptional, and exceptions should usually not be used for control flow. So most of the time you should not be benchmarking exception cases. But here, for us, the picture is a little bit different. Because when the pipeline is executed in a message-based system, it is possible that, for example, user code has an error and throws a null reference exception. And then, because we are handling thousands and thousands of messages concurrently, it could happen that we are actually throwing exceptions.
00:49:51 Daniel Marbach
And in that specific case we are moving lots and lots of messages, potentially after a series of retries, to the error queue for further processing, until the problem that was occurring in the system is resolved by, for example, deploying a patch. So what we can derive from that is that for us, exception cases are actually a regular case, and we need to test for them. And that's what I did as well. So besides the warmup, I created another benchmark that looks at the exception cases. And this is how it looks. So essentially, the input parameter is still the pipeline depth. Because how deep the pipeline is also influences, depending on where the exception is thrown, how the performance behaves. Because essentially the exception bubbles up the whole call stack and then eventually gets handled and the message gets moved to the error queue.
00:50:49 Daniel Marbach
So what we are doing here is we're taking the pipeline depth, and then at the lowest point in the pipeline we're adding a behavior that throws an exception. And that's how we tweak the execution benchmark, or rather add another variation of the benchmark; that's the better description. And then what I'm doing is simply a before, where I just do a try-catch to make sure that the pipeline actually executes, and then I do a try-catch with the optimized version, and I do a before-and-after comparison of these two approaches that I'm looking at in the exception case. Good. Maybe you have noticed the names step one, step two and step three in the previous screenshot? Essentially this comes from the fact that I went the extra mile of making relevant things pluggable, with partial classes and methods in the base infrastructure in that pipeline folder that you see here on the screen.
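Sketched out, the exception-case benchmark could look something like this; again hypothetical, with the throwing behavior and the Build overload as stand-ins for the copy-pasted sources:

```csharp
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class PipelineExceptionBenchmarks
{
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    private PipelineBefore before;
    private PipelineAfter after;
    private BehaviorContext context;

    [GlobalSetup]
    public void SetUp()
    {
        // The last behavior in the chain throws, so the exception bubbles up
        // through the whole pipeline depth before it is handled.
        before = PipelineBefore.Build(PipelineDepth, lastBehaviorThrows: true);
        after = PipelineAfter.Build(PipelineDepth, lastBehaviorThrows: true);
        context = new BehaviorContext();
    }

    [Benchmark(Baseline = true)]
    public async Task Before()
    {
        try { await before.Invoke(context); }
        catch (InvalidOperationException) { }
    }

    [Benchmark]
    public async Task After()
    {
        try { await after.Invoke(context); }
        catch (InvalidOperationException) { }
    }
}
```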
00:52:00 Daniel Marbach
And what this pluggability allowed me to do is: I essentially had a few optimizations in mind and I wanted to compare them against each other. So I tweaked the copy-pasted pipeline sources with partial methods and classes, and then I extended those in the step one, step two and step three folders with the various iterations of the things that I had in mind. That's what I've done. And if you're interested to see how that looks, you can actually see it in code in the repository that I'm sharing towards the end of this talk, if you want to benefit from what I've done there. Good. And now, once we have done all that, we essentially want to make sure that the things that we have improved actually have improved something.
00:52:51 Daniel Marbach
So what I do there, after doing all these benchmarks and the iterations in the inner loop, when I'm reasonably certain that things are working and I have potentially done not just the short run but the long run to actually do the comparison, is I put it back into the test harness that I wrote at the beginning, to get an understanding of the full picture of the context that I'm looking at, with the memory and the CPU profilers. And I've done that: I have a private version of NServiceBus with the optimized pipeline, and then I bring it back into the harness. And here, on the screen, we can see the before and after. So on the message session publish, previously we had 453 megabytes of allocations, including the behavior chain allocations, and after my optimizations we can see in the harness that essentially the behavior chain allocations are gone and other allocations are now visible instead. But we achieved the pipeline optimizations that we were looking for.
00:53:55 Daniel Marbach
Then we can also look at the receiving end, because that's also an important part of the picture. And we can see that from 653 megabytes of allocations, we went down to 596 megabytes of allocations, and the Func<IBehaviorContext, Task> allocations that were on the left side are now gone on the right side, and other allocations are popping up instead. And then when we compare, again, the stack traces side by side, on the left side is the before and on the right side is the after, we can now see that previously we had all this infrastructure garbage in place. We can now clearly see on the right side that we have the mutate incoming transport message behavior, then we just have the async state machine, then the next one is the mutate incoming message behavior, then the unit of work behavior. And all that other stuff is optimized and gone.
00:54:47 Daniel Marbach
So again, we see what we want to see. Then, what we also can do is, again, we can filter into the pipeline namespace, on the before and after. Previously we had a bunch of allocations, and in the after we can see that the pipeline namespace is no longer allocating lots of garbage. So we also achieved something there from an allocation perspective. But again, because we always have to look at both sides, memory and CPU, we can now look at the CPU characteristics. And again, we look at the flame graphs as a before and after. And where the previous flame graph had lots of infrastructure parts in there, when we zoom in on the optimized version, now we can see that the flame graph starts lighting up with the orange parts and we can start seeing how the JSON serialization is now in there.
00:55:40 Daniel Marbach
We can see that the actual business logic is executed in there and all the infrastructure part is gone. So now, by just looking at the flame graph, we already have a good indication of where things are heading. And the same actually applies when we look at the flame graphs on the receiving end, where we have this receive with transaction scope from before. When we zoom in, we can now again see that there is a better ratio between the red parts and the orange and yellow parts. And the flame graphs clearly indicate how all the bloat that previously ate up 32.3% of the publish operations and 40% of the receive operations is all gone. And that's a big win from a performance improvement standpoint. Okay. Then the next part, I'm going to very quickly fly through this in the interest of time. But I just want to get the point across that you can take this loop, the performance loop, and apply it to various parts of the stack.
00:56:55 Daniel Marbach
So for example, I told you that I've been using MSMQ and that it has a diminishing user base. But what we can actually say is that the transport has a significant impact on the throughput of an NServiceBus system. And that's a crucial part. So for example, we can look at the customer distribution and we could say, “Well, Azure Service Bus might be a transport that is more often used.” And then we can, for example, go and set up a test harness for a system that is using NServiceBus with Azure Service Bus. And then we can look at the before and after picture of a system that's using NServiceBus. And don't get shocked, I'm going to quickly fly over it, because really, the code is not that important. It's more about the process that I'm going through, with Azure Service Bus. Because I had a hunch something was not optimal with the SDK; it was actually creating lots and lots of allocations and slowing the system down.
00:57:57 Daniel Marbach
I actually created a test harness, which is here. And I'm using the same scenario that I did with NServiceBus, but I'm using the Azure Service Bus SDK directly. And what I'm doing here is, basically, again, publishing or sending 1,000 messages to Azure Service Bus, and that's this gist here. Again, you don't need to understand this code. I'm just publishing 1,000 messages to Azure Service Bus and then I'm receiving 1,000 messages. And at that time I already had the hunch that potentially the problem with the SDK is that the byte array that I'm getting from the service is allocated every time I access it. And that actually significantly slows down the throughput of the system. And when you then attach the profiler, what you can see, because we are now deeper down the stack, is that things are lining up from the Azure Service Bus SDK, and the Azure Service Bus SDK underneath has an AMQP driver.
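A minimal sketch of what such a harness can look like, assuming the Azure.Messaging.ServiceBus SDK (the queue name, connection string variable, payload size, and the repeated body access are illustrative placeholders, not the actual harness from the talk):

```csharp
// Hypothetical harness sketch: send 1,000 messages, receive them back, and
// access the body repeatedly, which was the suspected allocation hot spot.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class Harness
{
    static async Task Main()
    {
        await using var client = new ServiceBusClient(
            Environment.GetEnvironmentVariable("ASB_CONNECTION_STRING"));

        // Send 1,000 small messages (one by one for simplicity; a real harness
        // would likely batch them).
        await using var sender = client.CreateSender("benchmark-queue");
        for (var i = 0; i < 1000; i++)
        {
            await sender.SendMessageAsync(new ServiceBusMessage(new byte[64]));
        }

        // Receive them back and touch the body more than once per message.
        await using var receiver = client.CreateReceiver("benchmark-queue");
        var received = 0;
        while (received < 1000)
        {
            var batch = await receiver.ReceiveMessagesAsync(maxMessages: 100);
            foreach (var message in batch)
            {
                // Accessing the body multiple times is what a memory profiler
                // can then surface as an allocation source deeper in the SDK.
                _ = message.Body.ToArray();
                _ = message.Body.ToArray();

                await receiver.CompleteMessageAsync(message);
                received++;
            }
        }
    }
}
```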
00:59:02 Daniel Marbach
And now we can see lots and lots of allocations that are happening there. And in my specific case, because I access the body multiple times, we can see we have 5.6 megabytes of allocations from array resizing. And after contributing to the Azure Service Bus SDK, we were then able to cut the number of allocations in half by tweaking the Azure Service Bus SDK library. And by virtue of that contribution, the whole NServiceBus ecosystem got way, way, way faster for all the customers that are using Azure Service Bus. Good. As you can see, you can apply this loop over various iterations of the stack, looking at various pictures. Whether you go breadth first or depth first depends on the application domain that you're looking at. Let me quickly talk about... I showed you this approach of copy pasting source code. And I said it's a little bit controversial.
01:00:09 Daniel Marbach
Because this approach only really works when you have code that doesn't change that often. But sometimes, if you're going down the path of doing more performance optimization and you want to become more performance aware, you want to make sure that subsequent iterations of changes to the code are not regressing your code base. What you can do is, there is actually great guidance written by Microsoft on preventing regressions, in the links that I will share towards the end, and we can then use the ResultsComparer tool to essentially do a before and after. So what you do is, in essence, you take the code, you run the benchmark, and you store the artifacts into a folder on disk for the run against the code that is not yet optimized. Then you run the same benchmark against the optimized code and you store the artifacts into an 'after' folder.
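As an illustration, a minimal sketch of that setup with BenchmarkDotNet might look like the following. The full JSON exporter and the artifacts path are the relevant parts; the benchmark class and folder names are placeholders, not the actual benchmarks from the talk:

```csharp
// Sketch: export full JSON results so a comparison tool can diff the two runs.
// Point the artifacts path at "artifacts/before" for the unoptimized code,
// then at "artifacts/after" for the optimized code.
using System.Text.Json;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Exporters.Json;
using BenchmarkDotNet.Running;

public class SerializationBenchmarks
{
    [Benchmark]
    public string Serialize() => JsonSerializer.Serialize(new { Id = 42, Name = "example" });
}

class Program
{
    static void Main()
    {
        var config = DefaultConfig.Instance
            .AddExporter(JsonExporter.Full)         // full JSON is what the comparison tooling reads
            .WithArtifactsPath("artifacts/before"); // switch to "artifacts/after" for the second run

        BenchmarkRunner.Run<SerializationBenchmarks>(config);
    }
}
```

The two artifact folders are then the input to the comparison step described next.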
01:01:04 Daniel Marbach
And then you use this ResultsComparer tool to essentially measure the before and after and define a threshold. And if the difference is above a certain percentage that you define, or above certain values, then you can, for example, fail the build pipeline that you have in place. But, speaking about the build pipeline, when you use this approach there's actually one thing that you need to take into account before you go down this path. When you're executing this stuff on your build pipeline, you need to have dedicated hardware. Because usually, for example, things like GitHub Actions runners are actually too unreliable to execute regression tests. Andrey Akinshin has written a great blog post about the performance stability of GitHub Actions. And as you can see here on the screen, the gist of it is that two subsequent builds on the same revision can have ranges of 1.5 to two seconds and 12 to 36 seconds.
01:02:09 Daniel Marbach
And CPU-bound benchmarks are much more stable than memory- or disk-bound benchmarks. But the average performance level can still be up to three times different across builds. So as you can see, with this copy pasting approach I have shown you a very pragmatic way to get started and become more performance aware. And once you go down the path of reliable builds, reliable benchmarks, and regression testing on your build infrastructure, there are more things that you have to take into account, but at least now you have heard the things that you have to consider. Now I'm at the wrap up of this talk. So, what I've done here is I have shown you the pipeline performance optimizations. And you might be thinking, "But how is that relevant for my business code?" Well, although I showed this pipeline, I truly believe that these optimization techniques and the performance loop that I've shown you are also relevant for your business code, to optimize it and to get a better understanding of what's going on.
01:03:14 Daniel Marbach
And then you can use this performance loop and the tricks that I've shown you today to isolate components, profile, change, and profile again, without breaking existing behavior. Then rinse and repeat to become more performance aware. And then you can also combine it and bring it back into the test harness as a macro benchmark to see how these small changes start adding up over time. And then what I also advise you to do is start optimizing existing things until you hit the point of diminishing returns. Because I truly believe that exploring the ins and outs of an existing code path until you cannot optimize it further actually gives you tons of great insights that you can use to redesign at a later point in time and create a new design that is optimal, based on the things that you have learned.
01:04:08 Daniel Marbach
So I have here two QR codes. The first one, the green one, leads you to the repository with all of this: github.com/danielmarbach/beyondsimplebenchmarks. I have a second QR code on the screen, should you wish to get my help to get a better understanding of your distributed systems with messaging. I can help you there as well. And before I go to the questions, I want to mention that because I'm in good contact with JetBrains, I can actually do a raffle. So if you reach out to me over one of the above contact details and tell me which is your favorite slide and what you got from it, or provide me structured feedback about this talk, until the end of tomorrow, that's Friday, you'll be included in a raffle for a personal JetBrains Ultimate license, which contains all the JetBrains tools, including dotTrace and dotMemory that I have shown here today. Okay, that's it. Now I'm heading over to the questions. Let me just read them.
01:05:15 Daniel Marbach
The first question is, "Do you have a CI set up for benchmarks, so that when changes are done you get feedback if performance was degraded? If so, can you share some resources on how such things can be set up and best practices around the setup?" I've shown you that in the talk. So there is the ResultsComparer tool from the .NET performance team that you can download; I have the links in the slides that we will share. And then you can use that. Be aware of the blog post that Andrey wrote about the unreliability of things like Azure DevOps runners and GitHub Actions. So you need dedicated hardware for that. I've answered that. And the other one is, "When benchmarking optimizations and the results show a decrease in memory allocations but a degradation of performance, do you have a ratio and/or consideration to make a decision whether improvements are actually improvements?"
01:06:09 Daniel Marbach
That's an excellent question. And it's actually very hard to answer specifically. Because it's a generic question, I can only give a generic answer; we would need to look at the specific examples. But the essence there is that, again, it depends on the context of the code. If you are in a memory-constrained environment, you might be willing to trade off throughput, or performance, there, because you don't want to spend more memory.
01:06:42 Daniel Marbach
So, depending on your needs, you can actually go and say, "You know what? I'm actually happy with that. I can live with less throughput in this specific example." Or, with warm-up scenarios like I showed with the pipeline, you might say, "Well, it's okay to spend a few more milliseconds on the warm-up time because we can do other tricks in a serverless environment to speed things up that compensate for that, but we want to make sure that we are faster at execution time." So again, you have to balance these things depending on the context of the application or the business code you're working on. Good. I think that's more or less it from a questions perspective.
01:07:31 Kylie Rozegnal
Daniel. Hey, it's Kylie. It looks like we've got one more question that popped up.
01:07:34 Daniel Marbach
Okay. Yeah. "So using Benchmark.Net, do we have any limitations to look for a longer run?" I'm not sure I understand that question? So I guess, I'm going to do an interpretation of the question. So again, for me it's really crucial when I'm in this iterative mode of doing performance optimizations, I want to see if I'm heading in the right direction. I do short runs. Once I'm reasonably certain, I'm doing a long run. And then I'll take into account that it takes half an hour, 20 minutes, an hour, depending on scenario to run it, to actually see that I'm getting a statistically relevant result that I can then share. So that's what I'm doing. But again, you always have to sort of balance right, again, how much time you invest? How statistically significant should it really be? Sometimes I just want to prove to my colleagues that it actually improves things, that maybe a short run is good enough. Again, it depends, your mileage may vary. So that's the approach that I'm taking. Hopefully that answers that question. Thanks Kylie for reminding me about that last question. Good.
01:08:43 Kylie Rozegnal
Yeah, no problem.
01:08:46 Daniel Marbach
That's it from my end.
01:08:47 Kylie Rozegnal
All right. Well thank you so much Daniel. And our colleagues will actually be speaking at a number of events this month in Sweden, Poland, Florida, and Belgium. So if you're interested in any of those, go to particular.net/events and find us at a conference near you. That's all we have time for, for today. On behalf of myself and Daniel, goodbye for now and see you on the next Particular live webinar.

About Daniel Marbach

As a distinguished Microsoft MVP and software maestro at Particular Software, Daniel Marbach knows a thing or two about code. By day, he's a devoted .NET crusader, espousing the virtues of message-based systems. By night? He's racing against his own mischievous router hack, committing a bevy of performance improvements before the clock strikes midnight.

Additional resources