
Performance loop—A practical guide to profiling and benchmarking

About this video

This session was presented at NDC London 2025.

Code executed at scale must perform well. But how do we know if our performance optimizations actually make a difference?

Taking a step back, how do we even know what we need to optimize? First, we might need to discover hidden assumptions in the code, or figure out how to isolate the performance-critical bits. Even once we know what must be optimized, it’s challenging to create reliable before-and-after benchmarks. We can only tell if our changes helped by profiling, improving, measuring, and profiling again. Without these steps, we might make things slower without realizing it.

In this talk, you’ll learn how to:

  • Identify the areas whose effort-to-value ratio makes them worth improving
  • Isolate code to make its performance measurable without excessive refactoring
  • Apply the “performance loop” to ensure performance actually improves and nothing breaks
  • Become more “performance-aware” without getting bogged down in performance theater

Transcription

00:00:01 Daniel Marbach
Okay. Hi everyone. I hope you had a good lunch. The food is nice here. I really like that you can just go and, whenever you're hungry, stuff your faces. That's super awesome. So yeah, it's the last day of the conference, so I hope you are energized to get into profiling and benchmarking, because it's not an easy topic I guess, but I'll do my best to get you introduced to this topic. So who is new to benchmarking? Hands up, a few. Who is new to profiling? A few. Who is new to profiling and benchmarking? Hands up. So then we have a good audience here. That's awesome, because I remember the first time I tried to start benchmarking my code, I was looking around for blog posts out there and I saw lots of benchmarks written with BenchmarkDotNet and I was like, hmm, that's interesting. So can I apply this as well?
00:01:03 Daniel Marbach
And there's code like this out there. And I was like, that looks really easy, because I sort of had a conceptual understanding of unit tests, and please don't bother too much to look at this code, it's just an example. So I was just thinking, yeah, it's just a class with a bunch of attributes and then magic happens and then I'm getting a result. So benchmarking is easy. Because I knew xUnit, I knew NUnit, I've used MSTest for a while. I didn't like it at that time, but it has improved a lot. And I was like, hmm, that shouldn't be too difficult for me as well, so I can just do the same. So I felt quite certain it wouldn't take long for me to actually embrace this concept of benchmarking, but I was wrong, because essentially it was very easy to write a skeleton.
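For illustration, a skeleton of that kind might look roughly like the following; a minimal, hedged BenchmarkDotNet sketch where the class, methods and compared operations are made up for the example and are not the code shown on the slide:

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// A class with a bunch of attributes; BenchmarkDotNet does the "magic" of running it.
[MemoryDiagnoser]
public class StringConcatBenchmarks
{
    [Benchmark(Baseline = true)]
    public string Concat() => string.Concat("Hello", " ", "World");

    [Benchmark]
    public string Builder() =>
        new StringBuilder().Append("Hello").Append(' ').Append("World").ToString();
}

public static class Program
{
    // Running this prints a results table with mean times and, thanks to MemoryDiagnoser, allocations.
    public static void Main() => BenchmarkRunner.Run<StringConcatBenchmarks>();
}
```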
00:01:50 Daniel Marbach
It's easy to write a class, it's easy to slap a bunch of attributes on a class and then let the thing do something, right? But then at the end of the day, I start asking myself questions like: what should I even benchmark and why? Or how do I take the code that I wrote, which is usually quite messy (my colleagues usually complain about the code that I write), how do I take this code that is sort of intertwined and entangled and put it into a benchmark? Is this even a good idea? Or what should I deliberately cut away from the code that I wrote so that I actually get results that are meaningful for me and I'm not watering down the results? Or I was also thinking, how can I actually measure, change and measure and go through that loop without burning away the budget that I had for this performance investigation?
00:02:43 Daniel Marbach
Because I mean, I got fed up with job titles at some point in time, so I call myself a principal chocolate lover these days. I think job titles are highly overrated, but I'm not a performance engineer in my day work, because my job is to get stuff out there into the hands of our customers. So that means I need to make sure it's reasonably fast, but I also cannot gold plate stuff. And sometimes I'm doing this sort of tinkering at home in front of the TV when I'm watching something that I'm only half interested in and I'm doing a bit of spelunking, but I have a hack, because I know that I'm terribly bad at going to bed when I'm actually dealing with something that is really interesting. So I basically have this hack that switches off the internet around midnight, because whenever I need to Google or Bing something and I have no internet anymore, I'm like, ah, whatever, and then I go to bed.
00:03:36 Daniel Marbach
So that's essentially the time when I have to stop doing these performance optimizations, because they're highly addictive, at least to me. But then the question is also, well, why would you even bother and why would you even go through that hassle of doing performance investigations, profiling and benchmarking, right? Because at the end of the day, this is time-consuming and we only have limited time available. So one of the things that I usually talk about is, when you have code that is executed at scale, there are two things that really matter: the throughput and memory characteristics of your code at runtime when it's running in your data centers. Because today we talk a lot about this green IT movement. What it means is, if code is running in your data centers, it consumes energy, and at the end of the day we only have scarce resources on planet earth, so the more efficient the code is that we're executing in our data centers, the better.
00:04:33 Daniel Marbach
It's also better basically in terms of sustainability on planet earth. But you might be saying, yeah, but Daniel, come on, your green IT stuff, I've seen it, I don't want to hear it anymore. At the end of the day, when you're running, let's say, in a data center in Azure or AWS, it doesn't really matter, someone puts down a credit card, and then in the cloud we have these virtual numbers like gigabytes per second, throughput units, premium throughput units or whatever. They're making this stuff up so that we don't know what we're getting charged for, and at the end of the day, or at the end of the month, we're getting this huge cloud bill and someone is going to cry a river and say, why did you burn that much money in the cloud? So essentially it's really important to make sure that the code that we ship at the end of the day is fast enough so that it doesn't burn away unnecessary resources, which directly translates into money in the cloud.
00:05:24 Daniel Marbach
Okay, but let me give you another example. So Microsoft has this awesome blog post series where Microsoft teams blog about their journey to, let's say, more modern .NET versions and different things, and there is this blog post from the Microsoft Teams infrastructure and Azure Communication Services team, and they essentially said something like this: we were able to see Azure compute cost reduction of up to 50% per month; on average we observed 24% monthly cost reduction after migrating to .NET 6; the reduction in cores reduced Azure spend by 24%. So just by upgrading to .NET 6, they could benefit from the performance improvements that the .NET runtime team has rolled out over time. And as you can imagine, 24% monthly cost reduction, that's quite significant. I can assure you, when you go back to your work and only squeeze out 5%, 6%, 7% in sort of less cost in the cloud, your bosses are going to be pretty happy.
00:06:33 Daniel Marbach
And then you can turn that, at the end of the year in your salary negotiation, into saying, "Hey boss, remember I actually tweaked out some performance there in the data centers." But in this talk, I have basically tried to come up with a very practical approach that I use in my day-to-day work, so that you can also benefit from my knowledge and apply it in your work, so that you can go through a series of steps that are highly practical. But one of the things you need to start with first, and that's the beginning of the journey, is you need to become performance aware. One of the key principles that I try to apply to almost everything I do in software is that I'm trying to make explicit trade-offs as I go, so this also applies to performance. So I think a reasonably mature team should be performance aware, but what does it mean to be performance aware, or is it sort of all or nothing?
00:07:27 Daniel Marbach
Do I have to go all the way into performance awareness? I think not at all. So what I usually do is I always start with the simplest solution first. Basically, I have my requirements, I hack a piece of code together, I cobble it together so that it works, so that it passes the tests, and then once I have reasonable test coverage in place, and that's really important to me, because without tests I don't even know whether it's working right, and I don't need to have full crazy test coverage, but at least a walking skeleton, so that I know the things that are working, then I start sort of applying a few questions to that stuff. So I'm trying to understand the context of the code. The first question I ask myself, and this is purely based on gut feeling from over 15 years of experience in the .NET space and writing good software, is: how is this code going to be executed at scale and what would the memory characteristics be?
00:08:25 Daniel Marbach
So this is really just, I'm looking at the code and it's like, when, I don't know, a thousand requests per second come in here, what will potentially happen? Are there byte array allocations and stuff like that? How does it impact the CPU and memory? Do I do something compute intensive here? These are questions I'm asking myself. And then once I've done that, I'm sort of asking myself, are there low-hanging fruits in there that I can just apply to accelerate this code? And one of the things that I usually do is, for example, when you think about this code being executed hundreds of times per second, and I'm newing up a byte array, I ask myself, can I move this somewhere else, because maybe I can reuse this byte array, or can I introduce some type of pooling or stuff like that? These are low-hanging fruits, but the first version might be just new byte array, because I don't need to optimize it when the code is not even working.
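As a tiny, hedged illustration of that pooling idea (the method name and buffer handling are invented for the example), the built-in System.Buffers.ArrayPool lets you rent and return a buffer instead of newing up a byte array on every call:

```csharp
using System;
using System.Buffers;

public static class PooledBufferExample
{
    public static void Process(ReadOnlySpan<byte> input)
    {
        // Instead of allocating a fresh array on every call (var buffer = new byte[input.Length];),
        // rent one from the shared pool and give it back when done.
        var buffer = ArrayPool<byte>.Shared.Rent(input.Length);
        try
        {
            input.CopyTo(buffer);
            // ... do the actual work with the rented buffer ...
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```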
00:09:16 Daniel Marbach
And the next thing is, are there things that can move away from the hot path by simply restructuring it a bit? Right? I mentioned the byte array. That's definitely something like that. Again, if I know this is not multi-threaded at all, there is no concurrency, I may be able to sort of optimize it. And then, an important question for me is also always: what part is under my control and what isn't, right? Because sometimes we're dealing with code from other teams, and they're probably also quite smart people, but whenever I discover something that is owned by someone else, whether it's another team or a third-party library, I might not get access to that code, or I might first need to knock at their doors and say, "Hey, friendly Daniel here, I discovered something. Would you maybe prioritize that in your backlog?" So I need to have these types of conversations, or if it's a third-party library, which might be closed source, I first need to open a ticket with them.
00:10:10 Daniel Marbach
They have their own priorities. So I'm trying to sort of sift through that maze of things to become aware of what's going on. So that is essentially what I'm doing, in a nutshell, on one slide, but the key here is to find the right balance. I don't want to deeply investigate all of these things yet, but asking these types of questions at least gets me started on becoming performance aware, and you can use that as well in your projects. And then the last one, that's the most important one for me: what optimizations can I apply and when should I stop, right? Because these types of things, for me at least, are highly addictive, right? Oh, a byte allocation less over here and some SIMD over there and blah, blah, blah. And then you burn a lot of time, but you never really know whether that's actually achieving a lot if you haven't actually shipped it to production.
00:11:00 Daniel Marbach
So we need to have it in production so that we actually know what's going on at the end of the day. Good. So if you want to know more about concrete performance optimizations you can do in .NET, you can watch this talk. Here's a QR code; if you don't manage to take a picture, I will also hand out the slides and a readme in a GitHub repository with all the links and resources of this talk, should you be interested in concrete performance optimizations you can do in .NET. But this talk is not going to be about that. Okay, good. Now here is what I call the performance loop. I use this all the time when I'm doing performance optimization, and I've come up with this practical way of doing performance optimizations so that I can do it in a structured way and so that you can do it as well.
00:11:46 Daniel Marbach
So I'm going to walk you through this performance loop that starts with profiling using a harness, improve a path, benchmark and compare, go back to improving depending on the benchmark results, go back to benchmarking, and then eventually, once you're done with that inner cycle, profile the improvements again and then ship it to production to actually see whether, in the grand scheme of things, it makes an actual impact in your production system. And I'm going to walk you through this sort of profiling loop, or performance loop, with a practical example out of my work. The work that I'm going to show here is in a messaging framework and abstraction library. And you might be thinking, Daniel, but I work on an application, so everything that you're showing me, this is kind of nerdy, science-geeky stuff; I don't know if that's even applicable to my context, to my project.
00:12:36 Daniel Marbach
I can tell you this: what I'm showing you here is fully applicable to applications as well, and one of the things I really like is something Kevlin Henney told me, and I think it's a good conceptual bridge: every application out there that you're writing probably has half of a framework built in, because you always have some type of infrastructure code things are based upon. So even if you don't believe yet that you can apply it in your applications, if you have half a framework, you can definitely apply it there, but trust me, this is also applicable to your applications out there. Good. Let's look into the example. So I work for a company called Particular Software and we have this messaging abstraction library called NServiceBus and the platform around it. I don't want to go too much into detail, but conceptually you can think of it like this: there are queuing systems out there like RabbitMQ, Azure Service Bus, SQS, SNS, and this piece of code tries to efficiently pump messages from the cloud or from your on-premises data center and it tries to call our customers' code.
00:13:37 Daniel Marbach
That's basically what it does conceptually. If you want to know more about it, go to go.particular.net/ndc_London_2025_quickstart. That's where you learn about this. But this is not the topic of the talk; I'm not here to explain to you what messaging is and all that stuff, so this is just the introduction here. But inside NServiceBus there is this piece called the NServiceBus pipeline. NServiceBus is trying to follow the open-closed principle. So what we are doing is, because we don't know what our customers want to achieve, we have sort of an open core inside NServiceBus where we plug in our stuff and customers plug in their stuff. And what we are doing in this sort of pipeline is the deserialization of the messages, the correlation, and, for example, OpenTelemetry trace spans and stuff like that.
00:14:26 Daniel Marbach
We're integrating into custom ORMs, or we are calling Cosmos DB, we are opening transactions against SQL Server and committing transactions and stuff like that. That's all sort of infrastructure pieces in this pipeline that are going to be executed when people are using NServiceBus. And the core pieces that make this thing extensible are these behaviors in here. So that is the stuff that our customers plug into: behaviors. If you're not familiar with behaviors, you can conceptualize them with ASP.NET Core middleware. ASP.NET Core has these middleware pieces where you can declare a method called InvokeAsync, you get state into it, that's the HTTP context, then you have something that you can do before the pipeline executes, then you can do await next, and then you can have something after the pipeline executes. Simple example: if you want to wrap the invocation of the ASP.NET Core controllers or the handlers with a try-catch, you can basically have a try-catch around the await next, and then everything downstream will be safeguarded by that try-catch.
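A minimal sketch of that middleware shape, with the class name and the logging invented for the example:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class ExceptionLoggingMiddleware
{
    private readonly RequestDelegate next;

    public ExceptionLoggingMiddleware(RequestDelegate next) => this.next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // do something before the rest of the pipeline executes
        try
        {
            await next(context);
            // do something after the rest of the pipeline executed
        }
        catch (Exception ex)
        {
            // everything downstream of the await next is safeguarded by this try-catch
            Console.WriteLine($"Request failed: {ex.Message}");
            throw;
        }
    }
}
```

Such a middleware would typically be wired up with app.UseMiddleware&lt;ExceptionLoggingMiddleware&gt;().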
00:15:39 Daniel Marbach
So NServiceBus behaviors are quite similar. You just inherit from a class, you define some sort of state there, and then you have an invoke method and you have an await next. You do something before the await next and you do something after. And in those sorts of classes, where we extend NServiceBus and where our customers extend NServiceBus, that's where all the core logic I just described is being executed. Good. So my goal is, or was at the time, to sort of optimize this NServiceBus pipeline. Why is that important? Because we never want to be in a case where a customer calls us or sends an email and says, "Hey, we have a problem, NServiceBus is slow. What are you doing?" Right? Okay, we are not perfect. We're also making mistakes, but at least we want to make sure that we do our best to not ship bad stuff.
00:16:33 Daniel Marbach
So we want to make sure these things are fast; the pipeline needs to be super fast so that it's not in the way of our customers' code, ideally. And now we are entering the first step of this performance loop, which is profiling using a harness. What I usually do before I get started is write a simple sample or harness that makes it possible to observe the things that I want to profile under a memory profiler and a CPU profiler. I always take two profiles: one is memory and one is CPU. So you might be thinking, well, isn't there also a third profile? And yes, any complicated sort of application probably has IO stuff, right? IO can be a database, can be an HTTP client, and IO is usually hundreds and thousands of times more expensive than CPU and memory.
00:17:27 Daniel Marbach
Of course, you should also be using IO profilers. You should look into your database queries and stuff like that, because you usually get more bang for the buck when you're optimizing your database queries. This talk is not focused on that part, but I want to mention it, so that you don't think, oh, he totally cheated, right? But anyway, I create this harness so that I can look at the different subsystems in play when I have these two profilers attached, and since I'm sort of trying to optimize the NServiceBus pipeline, I create a simple sample that just reproduces the minimalistic part, so that I can actually observe the pipeline being executed and I can see the bottlenecks. I usually use the JetBrains tools like dotMemory or dotTrace, and my samples here are with dotTrace and dotMemory, but I want to be really clear here.
00:18:18 Daniel Marbach
You can use any tool. If you have a Visual Studio subscription and you have Visual Studio Professional, it has built-in memory profiling tools that you can use. If you're saying, well, I cannot afford this for whatever reason, or if you're a student, there is also stuff like PerfView out there that you can use for free to do profiling investigations. I want to warn you, PerfView has a little bit of a steep learning curve, I must say, and every time I use it, I have to Google a cheat sheet in order to remember all the things. And as a matter of fact, I usually use a variety of tools, because sometimes when you're doing performance investigations, it's the kind of I-know-it-when-I-see-it type of principle, and every tool has different views, optimized based on the user experience, that sometimes make you go, ah, here it is, right?
00:19:06 Daniel Marbach
And that's why I sometimes use different tools as well. So your mileage may vary, that's what I want to say here. Good. So what does the harness then look like for the NServiceBus pipeline? This is sort of the basic configuration. What I'm doing here: it's a console application, I'm just setting up NServiceBus, and I'm using MSMQ here. Why MSMQ? Because it is on my machine. I'm using Windows as an example. It's old, it's rusty, it's outdated, but it just works. I don't have to set up any Docker container, nothing. It's local, it's reasonably fast, it gets the job done. Good. Then what I'm also doing is I'm saying, well, let's use a reasonably fast JSON serializer. I'm using System.Text.Json here, because again, I'm not measuring, I'm not comparing serializers, I'm just using one that is out of my way, super fast.
00:19:57 Daniel Marbach
Then what I'm also saying here is I don't want to see any IO. I'm not interested in IO, I don't want to see Cosmos DB or stuff like that, because that might light up in my performance profile as something very huge and sort of distract me from the investigations that I want to do, because I want to measure the pipeline invocation. That's what I'm doing here. And then what I'm doing is I'm publishing messages to MSMQ and I'm doing that in parallel, concurrently. So I'm publishing a thousand messages. That's what I'm doing here, and I know, with my conceptual understanding, that when this publish method gets called, that's where the publish pipeline is going to be executed. So I'm getting into the direction that I want. Then the next thing that I'm doing is I'm adding various Console.WriteLine and Console.ReadLine calls so that I know, for example, the warm-up phase is done and now is the part where I can attach the profiler, and that's where the things are happening that I want to see.
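The overall shape of such a harness might look roughly like this; a hedged, self-contained sketch in which the publisher is a stand-in rather than the actual NServiceBus API or the sample from the slides:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public class MyEvent { }

public interface IPublisher
{
    Task Publish(MyEvent message);
}

// Stand-in for "the pipeline I want to observe"; in the real harness this would be the
// NServiceBus endpoint configured with MSMQ and System.Text.Json.
public class FakePublisher : IPublisher
{
    public Task Publish(MyEvent message) => Task.CompletedTask;
}

public static class Harness
{
    public static async Task Main()
    {
        IPublisher publisher = new FakePublisher();

        // warm-up so JIT and caches don't dominate the profile
        await Task.WhenAll(Enumerable.Range(0, 100).Select(_ => publisher.Publish(new MyEvent())));

        Console.WriteLine("Warm-up done. Attach the memory/CPU profiler, then press ENTER.");
        Console.ReadLine();

        // the part to observe: a thousand publishes, issued concurrently
        await Task.WhenAll(Enumerable.Range(0, 1000).Select(_ => publisher.Publish(new MyEvent())));

        Console.WriteLine("Done. Take the snapshot / stop the profiler, then press ENTER to exit.");
        Console.ReadLine();
    }
}
```

The Console.ReadLine pauses are the points where the profiler gets attached and detached.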
00:20:51 Daniel Marbach
And conceptually, for your applications or for your libraries and frameworks out there, you can do exactly the same, right? You're creating a harness that sort of reproduces the minimalistic part that you're interested in. And then, on the other hand, because I'm publishing messages to MSMQ, I have to receive them again, because with my conceptual understanding of the framework, I know that when I receive them, the receive pipeline comes in, and that's the second part of the pipeline that I'm interested in, because I want to have a holistic picture. But I don't want to show you just NServiceBus or how we did this with NServiceBus; I want to give you sort of practical guidance on what you have to consider when you're writing your own harnesses to actually attach the profiler. So one of the first things that you should always think about is: it should be compiled and executed in release mode.
00:21:40 Daniel Marbach
And you might be thinking, but that's so obvious, come on. But it's actually interesting, because most of the IDEs use debug mode as the default, right? Because they want to optimize for F5 and then a debugging session, but you need to do it in release mode, because debug mode is not suitable for profiling sessions. And then it needs to run a few seconds and keep the overhead minimal, because if it's too short, you don't see anything, so it should run a little bit. But again, all the things that you don't want to see should ideally be removed, because otherwise the noise that you see in those tools is way too much. So if you can work towards minimalistic reproductions, even better. Then, usually in order to sort of avoid the warm-up of the JIT, and this is a little bit of a controversial topic, because some people are saying, well, why do you disable the tiered JIT compilation?
00:22:30 Daniel Marbach
Because the tiered JIT compilation might actually change things at runtime so that you don't actually have to fix them, and that's absolutely true. But as general guidance, at the beginning, when you're doing these investigations, you want to see things before these optimizations and you want to avoid the JIT going through the warm-up so that you directly have results within a few seconds; but your mileage may vary. And then you also want to have full debug symbols emitted, because by default that's not there in release mode, so that you actually have all the stack traces and everything and can trace it back to your code once you go through the profiling sessions. Let's have a look at the memory characteristics of the NServiceBus publish pipeline. I'm showing you here a screenshot of dotMemory and the memory characteristics there. And why memory first?
00:23:24 Daniel Marbach
One of the things that David Fowler also tweeted, and apparently it's still true today: memory allocations are the biggest resource hogs in .NET applications, because of how the garbage collector works and everything like that. That is where you still get the biggest bang for the buck today. So that's why I usually start with memory allocations first, and compared to CPU stuff, it's also less complicated, because when I start tweaking and improving CPU stuff, I'm in algorithmic territory, potentially very complicated. So when I can sort of move away a byte array allocation or introduce some buffering, I'm getting more bang for the buck. That's why I'm starting there. Good. So, to be able to understand what I have here, I'm zooming in on the pipeline. What I'm doing is I'm using, in the tool, the namespace filter or the filter for a specific class.
00:24:19 Daniel Marbach
And here, because I'm looking into the publish pipeline, I'm zooming in on the publish part. And conceptually you can think of it like this: if you're investigating your order processing logic and you have some namespacing logic, you might know, oh, this is my controller over here that starts processing the order. So you filter into the order controller and everything underneath, so that you see the part that is relevant and the allocations that are happening there. And what we see here is a bunch of stream reader and stream writer allocations, memory stream allocations, stuff like that. But if we're zooming in, we see here 20 megabytes of behavior chain invoke allocations. So 20 megabytes in this entire allocation picture are relevant for the investigations that I'm doing. Now let's look at the receive pipeline, because we saw that's also important. Again, I'm looking at the memory profile and we see a lot of noise: I see sort of byte array allocations, XML text reader node allocations, all that stuff.
00:25:20 Daniel Marbach
And again, I'm trying to zoom in and ask what is actually relevant for my investigations. And based on my domain knowledge of the framework, I know that the Func of behavior context, Task delegate allocations, so 27 megabytes of allocations, are actually relevant for me. And you might be thinking, but Daniel, come on, seriously, I saw that you had 200 megabytes of allocations of stuff and you're zooming in on 20 megabytes, it's ridiculous, right? Wouldn't we want to get rid of everything there? Ideally, yes, but again, we only have limited time. We can't just hack away and optimize everything out. So we need to essentially use the knowledge that we have of the application and of the things we control. So really, it depends, and I'm going to give an example here of what I had to consider. For example, I was using MSMQ, and large portions of the allocations I had in these memory screenshots actually came from MSMQ.
00:26:18 Daniel Marbach
MSMQ is a piece of middleware infrastructure that is done; Microsoft is not touching it. It's out of my control, so why would I care about optimizing it? Plus, what's also important for us, it has a diminishing user base. Most people these days have migrated their stuff into the cloud. They're using Azure Service Bus, SQS, SNS, they're using RabbitMQ on premises. So if I would invest my time into MSMQ, I would not be benefiting a huge user base. Then it might also be possible that I simply don't have knowledge in that code part, and that can happen to you as well. And if you don't have knowledge, you can't optimize it. So that means you would first have to build up the knowledge, ramp up and talk to your colleagues. So maybe it's not worth your time. And what I'm also trying to give away here: I usually apply the principle of 1% improvements over time.
00:27:12 Daniel Marbach
And essentially the .NET team does something very similar. If you have ever read the Stephen Toub blog posts, right, the, I don't know, giant thousand-page blog posts, what they're doing is they have hundreds and thousands of little performance tweaks across the .NET runtime, and over time the compounding effect of those little tweaks actually gives you a big impact. So I'm trying to not get too hung up and just make a bunch of optimizations there, because iterative gains on the hot path will overall lead to a bigger impact over time. You have got to be patient, but it will come, I can guarantee you that. Good. And then, last but not least, it's all the pipeline optimizations that I can do that are independent of MSMQ. They benefit all the users regardless of where they're running. So I want to focus on that.
00:28:03 Daniel Marbach
So at the end of the day, what I want to say to you with this example is: context matters, and out there in your application, in your systems, you are the expert. So talk with your team, think about the trade-offs and apply your and your team's best thinking to do those types of performance investigations and optimizations, to sift through that noise, right? That's what you need to do, because the profiling session is just the beginning, essentially. Good. So let me zoom in a little bit more. What I'm doing here is I'm looking more into the memory allocations that are happening there. For example, I'm filtering into the pipeline namespace and then I see a bunch of sort of Func of Task and behavior display class allocations, and if I'm zooming in even further, I see that there are lots and lots of display class allocations, which already gives me a hint that something is going on.
00:28:57 Daniel Marbach
Maybe we have some sort of closure allocations. And again, you can use the same trick: you can filter into your order management namespace, into a shipping management namespace, whatever you call those, where you know you are doing some type of performance investigation. Once we have sort of gotten an overview of the memory characteristics of the code, I told you at the beginning I always do two profiling sessions: I do a memory profile and I do a CPU profile. So now we have to look into CPU, and one of the really cool tools that you can use to get a very quick understanding of what's going on in terms of CPU is the concept of flame graphs. And this is probably a little bit overwhelming and probably way too small, but I'm going to tell you what's on the flame graph. What's written on the flame graph in this black font is actually not really relevant, because one of the cool things about flame graphs is you have a bar up there and you read it from top to bottom.
00:29:58 Daniel Marbach
So you have a bar and it has a length, and then underneath you have other bars that are sort of going down like this. And what it means is, for the topmost bar, the length represents sort of how long it takes for everything to be executed with all the bars underneath. That's basically a flame graph. And without actually having a concrete understanding of milliseconds and stuff like that, or percentages, I can just look at this flame graph and get a conceptual understanding of what's going on. Because what I can do is, I can see there's a bunch of red stuff over here and there is a bunch of orange stuff, and with my knowledge, I know the orange stuff is actually the domain-specific code, and I can see there is a relationship between the red stuff and the orange stuff, and the red stuff is way too much.
00:30:47 Daniel Marbach
So what that means to me, with my spider senses, is: well, there is way too much infrastructure code being executed that takes a lot of CPU away that I might be able to optimize. And I can do that just by looking at the flame graph, and when you zoom in you see it even better. Essentially, the orange part is super, super small. But if you're saying, yeah, flame graphs, okay Daniel, I think I understood this but I'm not sure if I'm going to use it: well, every tool usually has hotspot overviews you can go into. So for example, dotTrace has this hotspot overview that, for a specific code path, shows you the hotspots. And what I can do here is I can just look at the hotspots and I can see percentages of the relationships of the things, and when I zoom in I can see that 20% CPU is spent on this behavior chain invoke and 12.3% is spent on the other one.
00:31:46 Daniel Marbach
So for me that means 32.3% of the CPU is entirely spent on infrastructure stuff that hopefully I can at least make smaller or ideally even fully remove. And on the receiving end, it is slightly less dramatic, but when I zoom in, I can see here that I have 9.2% and 4.8% of infrastructure code being executed there. So basically a seventh of the CPU that I'm using for executing the receive pipeline is spent on stuff that I probably don't want to have. Good, now I understand what's going on and that helps me to sort of navigate that maze, and now I can actually start making some improvements. So let's look at what we can do in terms of improvements. Before I even really improve, I usually put tests in place if there are no tests there, because without tests we might be making improvements and it'll be faster, but it might be absolutely broken, and then we're not helping at all.
00:32:55 Daniel Marbach
And when we are just building the performance culture and we're shipping broken stuff, I can guarantee you, you'll get no budget in your project to do more of these types of investigations. So luckily, in my case, we already had a bunch of tests, but I had a specific improvement in mind that I might be able to make, just from doing these investigations. And I knew I was going to muck around with some state management and things like that. So what I did is I put tests in place that make sure that the state that I accumulate per pipeline execution never leaks out. So I put a few more test cases into place. I'm not going to talk about the improvements here, but you can go read them up: it's "10 times faster execution with compiled expression trees" or "how we achieved five times faster pipeline execution by removing closure allocations".
00:33:49 Daniel Marbach
The blog posts are there if you're interested in reading things up, and I know these are all click-bait-y titles, but we actually improved the throughput of the pipeline with these performance investigations 10 times, and another five times after we had already achieved the 10 times improvement, by doing these types of performance investigations. Good. Once I have done improvements, and again, I call this the inner loop, it's usually not just improve and benchmark and done, if I'm lucky. Sometimes it's improve, benchmark a bit, see what direction things are going to get a good feeling, and that sometimes triggers new ideas, and then improve again, benchmark again, improve again and benchmark again. But let's have a look at what we can do. So what I usually do, I use this hack, I would say, because one of the things that is sometimes tricky is: how do I actually get stuff really under a benchmark?
00:34:49 Daniel Marbach
Because when you read all these blog posts about BenchmarkDotNet, people usually show you string concatenation, StringBuilder versus this and that, or calling a static method, and that's nice, but I can tell you the code that I write is never that clean, because usually software has dependencies, is intertwined, is messy, it sort of grows over time. So how do I take this and just measure it? If you're lucky, you might be able to add InternalsVisibleTo and call some static method in your code, but you might want to come up with an approach that allows you to sort of find the right trade-offs. And yes, software that I write, or software that I've looked at, is usually quite messy, like all the wiring in this picture, or, as Gordon Ramsay would say, software is usually a disgusting, festering mess, but it makes money out there, so we have to deal with this.
00:35:46 Daniel Marbach
So what I usually do is I create a new repository and I extract the code, I copy-paste code. And you might be thinking, oh Daniel, but copy-pasting code, that's the source of all evil, that's how the mess starts. And you are right and wrong at the same time, because what I'm doing is sort of a controlled experiment. I take the code in question and I copy-paste it into a dedicated repository or a folder within the same repository. I start there, and then what I'm doing is I sort of adjust the code down to the bare essentials to create a controllable environment, because my teammates might still be working on it, might sort of move forward. So I want to have a controllable experiment where I cut away the stuff that is not relevant, and here I have a screenshot of what I did.
00:36:37 Daniel Marbach
So conceptually, I took the entire pipeline infrastructure code and copy-pasted it out into a dedicated repository, and then what I did there is I asked myself a few questions: what is actually relevant? So I removed all the behaviors that are not relevant out of that code base; luckily, it was already following the open-closed principle. Then I replaced the dependency injection container with creating the relevant classes directly, because my goal is not to come up with a blog post that compares StructureMap against Autofac against the Microsoft dependency injection container. I want to get this out of my way, because it just blurs the benchmark and is not relevant. Then I also replaced all the IO operations with completed tasks, and some of you who might be a little bit more advanced with async and await and stuff like that might be thinking, ah Daniel, I got you.
00:37:32 Daniel Marbach
You are cheating, right? Because actually, when you are not really yielding the thread, then things are completely different than when you're returning a completed task. You're right and wrong at the same time. Because my goal is not to measure concurrency or how it behaves under concurrent execution; I'm trying to actually measure the pipeline execution speed. So IO-bound stuff or yielding the thread doesn't really matter. That's an additional benefit I get once I start yielding. But anyway, what I want to say here is, again, context does matter when you're doing these types of trimming down to the relevant stuff. Why do I show this to you? I strongly believe that when you're starting on becoming performance aware, this sort of copy-pasting the code, fiddling around with it a bit, making a controlled experiment, gets you on a journey to build this performance awareness culture, because you can't go to a conference, listen to this Daniel guy on stage and then go to work on Monday and say, now we are performance aware because I listened to this talk.
00:38:42 Daniel Marbach
Awesome, right? This is going to take you potentially weeks and months or years to build up, because you are becoming the trusted person to essentially create this culture, and this sort of approach gets you the sort of 80% rule and moves you forward, because it allows you to not think about how you can reliably execute benchmarks on your CI/CD system, because that's a whole other topic, or how you even have to set up your CI/CD system. Do I need dedicated hardware? Can I run it on my shared DevOps runner, and what are the consequences of that? All these types of additional questions you don't have to ask yourself. So that's why I'm showing you this approach of copy-pasting code. Again, the trade-offs are really important to me. At the beginning, I told you that I sort of started on this journey conceptualizing benchmarks as something similar to a unit test, but this is actually not a good analogy, it's not even a good mindset to have, because what is a unit test?
00:39:52 Daniel Marbach
A unit test usually is: you have a class with attributes and the result is either green or red. So it's passed or failed, or maybe ignored, hopefully not too many tests are ignored. But a benchmark is completely different. First and foremost, the result that we are getting out of a benchmark is a distribution of values. We're getting averages, means, standard deviations and stuff like that. So the result is not green or red; we're getting these distributions, and in order to get these distributions, we need to execute the benchmark hundreds and thousands of times. So that means those benchmarks take seconds, minutes, even hours sometimes, depending on how much we're essentially benchmarking. So that's a huge difference there. We need to essentially execute those benchmarks until they are stable, until we're getting stable results, to actually know that we get statistically significant results there.
00:40:55 Daniel Marbach
And again, they're taking minutes or hours to execute. And what's also important is, with unit tests or with integration tests or acceptance tests, whatever you want to call those, we usually focus on all sorts of edge cases, but because benchmarks take minutes or sometimes even hours to execute, we need to focus on the most impactful common cases, for things that are on the frequently used or hot path, with the least amount of permutations, because the more permutations we have in the benchmarks, the more execution time it takes and the longer it takes to get results. So we need to make careful trade-offs there. And yeah, what I usually want to advise you: if you're doing these types of benchmarks, derive, when you can, the permutations and the values you input from actual production cases, because then you know that you are benchmarking the code under the values, under the constraints, under the restrictions of your production system, and that gives you actually meaningful, statistically significant results.
00:42:01 Daniel Marbach
So I did that. What I've done is I've essentially created a pipeline execution benchmark, and don't worry too much about this code, I'm going to quickly walk you through it; it's more about the setting and the trade-offs that I made on this slide and not about the actual code. So I wrote this sort of pipeline execution benchmark, and if you zoom in, what I'm doing here is I'm basically using the GlobalSetup to set up the pipeline. I'm not interested in measuring the warm-up time of the pipeline, if there is any warm-up; I just want to make sure the pipeline is there and then I can execute it. That's what I'm doing here. Then what I've done is I've thought about what permutation values I want to input into the pipeline benchmark, and I actually went through Salesforce customer cases and whatnot, and I looked at how many things we put into the pipeline, how deep the pipeline is.
00:42:53 Daniel Marbach
And I came up with values like 10, 20, and 40. So I'm saying, okay, in the wild, with stuff that we add, we have these three cases that sort of standardize the depth of the pipeline, so that I can actually understand how the depth impacts the execution of the pipeline. And I could have come up with 50 values here, but again, if I do that, then this takes ages to execute. So I need to make reasonable trade-offs there. And the next thing that I'm doing is at the top: I'm adding the ShortRun attribute and I'm adding the MemoryDiagnoser. The MemoryDiagnoser tells me how much memory this consumes. ShortRun is basically just a way to say: just do a quick run so that I get some results, but it's not yet really statistically significant. But because I'm in this sort of benchmark-and-improve type of loop, I don't want to spend 10 minutes every time I tweak my benchmark; I might want to spend one minute to get some results and look: am I going in the right direction?
00:43:54 Daniel Marbach
Yes, no, course correct and go, and only after these iterations, when I know I'm on the right track, do I do an actual long run, which probably takes 10 minutes, 12 minutes, in order to get some statistically significant results. And at the end I basically just execute the before and the after and compare them against each other. So that's pretty much that pipeline execution benchmark. Good. So now if you execute this, you get results, and in this specific case it looked like this. And here, as we can see, we have the before and the after. So after my optimizations, for each pipeline depth, we can see that my improvement actually did lead to an actual improvement. The Gen 0 garbage is gone, we have no more allocations, and it's actually five times faster, more or less, if I can do the math. So we can see that the improvements that we have made are now measured and compared against each other, and we know we are on the right track.
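Pieced together from that description, the benchmark might look roughly like this; a hedged sketch in which the pipeline and its behavior chain are simplified stand-ins, not the actual NServiceBus code:

```csharp
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob]        // quick, not-yet-significant runs for the inner improve-and-benchmark loop
[MemoryDiagnoser]    // report allocations alongside the timings
public class PipelineExecutionBenchmarks
{
    // pipeline depths derived from real customer cases
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    private Pipeline pipeline;

    [GlobalSetup]
    public void Setup()
    {
        // build the pipeline once; constructing and warming it up is not what we measure
        pipeline = new Pipeline(PipelineDepth);
    }

    [Benchmark]
    public Task Invoke() => pipeline.Invoke();
}

// Minimal stand-in: a chain of "behaviors", each calling the next, with IO replaced by a completed task.
public class Pipeline
{
    private readonly Func<Task> chain;

    public Pipeline(int depth)
    {
        Func<Task> next = () => Task.CompletedTask;
        for (var i = 0; i < depth; i++)
        {
            var current = next;
            next = () => current();
        }
        chain = next;
    }

    public Task Invoke() => chain();
}
```

For the final numbers, the ShortRunJob attribute would be swapped for a longer run, as described above.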
00:45:00 Daniel Marbach
Good. But what are the crucial things you should take into account in addition to what I just talked about? What are the best practices for a benchmark? A benchmark should follow the single responsibility principle. What that means is: I quickly showed you that I took the warm-up and moved it into the GlobalSetup. So I've basically separated the concern of warming up from the execution, because I only care about the execution. If I put both things into the same benchmark, I have too many permutations, I'm sort of muddying the water, and therefore it would not be a good benchmark. So I need to focus on a single thing with only the parameters or permutations that are relevant for that single thing.
00:45:47 Daniel Marbach
Then, ideally, it has no side effects, because one of the things that happens in benchmarking is, when you have, let's say, a counter or a byte array or something like that that sort of grows over the execution and influences your further iterations of the benchmark, then you have side effects, and that influences what you measure, and then your results might not be correct. So you need to take this into account.
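As a hedged illustration of that pitfall (the types and numbers are invented for the example), compare a benchmark whose shared state keeps growing across invocations with one that resets its state per iteration:

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

public class SideEffectExamples
{
    private readonly List<int> shared = new List<int>();

    // Problematic: the list keeps growing across invocations, so later iterations measure
    // adds into an ever larger list (different resizes, different GC behavior).
    [Benchmark]
    public void Accumulating() => shared.Add(42);

    private List<int> perIteration;

    // Better: reset the state before every iteration so each measurement sees the same starting point.
    [IterationSetup(Target = nameof(Isolated))]
    public void Reset() => perIteration = new List<int>();

    [Benchmark]
    public void Isolated() => perIteration.Add(42);
}
```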
00:46:15 Daniel Marbach
Then, a good benchmark always makes sure that it prevents dead code elimination, because the JIT today is pretty smart. When you have code whose result is not going to be consumed, what essentially happens is that the code will be removed, and then you measure nothing. So you need to make sure that the relevant parts are there and not eliminated. And what I can advise you: usually take a library like BenchmarkDotNet that allows you to delegate the heavy lifting to a framework, because benchmarking is really hard and you don't want to use a stopwatch or build that yourself. Use something that has best practices already built in to put you into the pit of success. And I prefer doing this, I'm not religious about it, but I want my benchmarks to be as explicit as possible. What that means is, if I can, I will not be using var, as an example.
00:47:15 Daniel Marbach
I will be making sure that I'm not using implicit casting or stuff like that. Why is that relevant? Because when I look at the benchmark and I'm trying to understand what's going on, I want to make sure that the benchmark code is not in my way. I want to make sure that I can focus on the code that I'm actually improving and measuring, so I want to make this explicit, so that the cognitive overhead, the cognitive load, is not too big for me to handle. And this one maybe sounds obvious, but avoid running anything else on that machine while you're benchmarking, because it's so tempting, and I've done this several times, I can tell you: you're doing sort of benchmark improvements and then the next meeting pops up in your calendar, and then you're in one more Teams call and you're like, huh, I could just run this benchmark while we're having this boring conversation with my boss.
00:48:10 Daniel Marbach
And then you start running it. And the thing is, those tools, I don't want to bash on Teams, but those tools are quite heavy, CPU and GPU intensive, and that load on your machine will influence the results that you're getting. So ideally, just let it run, go drink a coffee, and when your boss sees you for the tenth time on the same day at the coffee machine, you say: boss, remember, I told you I'm going to squeeze out 5% throughput out of our system and make you happy, see you again at the salary discussions at the end of the year. And your boss will be happier, I can assure you. Good. I can highly recommend BenchmarkDotNet. It's a great library. It's also used by the .NET performance team and the .NET runtime team to measure all kinds of things and compare all kinds of things, because at the end of the day, benchmarking is really hard, and BenchmarkDotNet prevents you, or protects you, from falling into common pitfalls that you would otherwise be exposed to. One thing that it does is it creates different processes, it creates isolated processes.
00:49:13 Daniel Marbach
The stuff is executed in those isolated processes so that static state does not leak. So by definition you're already more likely to have no side effects, as an example. Or it prevents dead code elimination: when you return something, it already consumes it, so the JIT knows it's going to be consumed and it will not be removed. It has Consumer classes to explicitly mark stuff as to be consumed. It also executes your stuff in several iterations until the results are stable, and then it measures that and gives you statistically relevant results. So it has solved all of the problems you're going to be exposed to anyway. Good. Now that we have that out of the way, I want to sort of zoom in. I told you a benchmark should only ever go for the most common cases and should not really go into edge cases. But the next example that I'm showing you is a deviation from that rule, because I actually also went and created a benchmark to compare exception cases.
00:50:18 Daniel Marbach
Why did I do that? Because one of the things is, usually here at conferences, you get lots and lots of best practices, and I strongly believe that best practices... actually, the name is pretty wrong, because it implies that they're always the best. But sometimes you need to deviate from best practice. You need to use your brain, your knowledge, to understand that in this specific case it might actually be good to deviate from it. So that's what I did here, because in our case, when stuff happens in production, we get thousands and thousands of occurrences of exceptions happening and the messages are getting moved into the error queue. So the execution speed of the framework matters in exception cases too. So what I did is I created basically a pipeline with a certain depth, and at the end of the pipeline, at the lowest part, I added something that throws an exception, and then that exception bubbles up through the entire call stack and sort of crashes out on the other end.
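A hedged sketch of what such an exception-path benchmark could look like, again with simplified stand-in types rather than the real pipeline:

```csharp
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob]
[MemoryDiagnoser]
public class PipelineExceptionBenchmarks
{
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    private Func<Task> chain;

    [GlobalSetup]
    public void Setup()
    {
        // the deepest "behavior" throws, simulating a failing message handler
        Func<Task> next = () => throw new InvalidOperationException("handler failed");
        for (var i = 0; i < PipelineDepth; i++)
        {
            var current = next;
            next = () => current();
        }
        chain = next;
    }

    [Benchmark]
    public async Task<bool> InvokeFailingPipeline()
    {
        try
        {
            await chain();
            return true;
        }
        catch (InvalidOperationException)
        {
            // the exception bubbled up through the whole chain; return something so the result is consumed
            return false;
        }
    }
}
```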
00:51:21 Daniel Marbach
And then I can actually measure, and that's what I've done here, how the code before and after behaves in exception cases. Good, I'm going to skip this. Good. But once we have done that, once we have sort of finished the improve-and-benchmark loop, I always advise you to actually take the improvements that you have, put them back into the test harness and profile it again. Why does this matter? Well, with a benchmark you compare a specific subset of stuff, but you don't know yet how these micro improvements over here are actually going to have a compounding effect across your entire application. So you want to see, because maybe you see in this example, pipeline execution is five times faster, but if you put it into the grand scheme of things, you might only get 5% or 10% throughput improvement for your entire system.
00:52:25 Daniel Marbach
But you want to have that contextual understanding. That's why, once I'm done with improvements, I always put them back into the harness and then compare the before and after. On this side we have the before and here we have the after. And as we can see here, we had 20 megabytes of behavior allocations, and on the right side those are gone, right? So I'm taking the exact same sort of snapshots and I'm comparing them before and after. Then I look into the receive pipeline and I can see here that previously I had 27 megabytes of function delegate allocations; they're no longer there on the right side. I can look into the stack trace, and I can see that we went from having lots and lots of infrastructure code to no more infrastructure code on the other side. So I've been able to make some changes that make things very fast.
00:53:18 Daniel Marbach
Then, again, I can use the technique that I used before: I can zoom into the namespace, I can filter into that, and then I see whether the before and after is actually better, and I can pat myself on the shoulder, because we see that all the 15 megabytes of allocations and stuff like that are gone; they're no longer there. But of course, again, we want to have two views. We just did memory; remember, you also want to look at the CPU stuff. So let's look at the CPU. We can use our beloved flame graphs again. We have the before, and then when we zoom in on the after and compare them side by side, we can see sort of the relationship go from lots of red stuff and a bit of orange stuff towards less red stuff and more orange stuff.
00:54:12 Daniel Marbach
We can actually see how the business code now starts to sort of light up. So we have managed, in this specific example, to reduce the 32.3% CPU overhead on the publish operations, and the 14% on the receive operations; they're all gone. And that's also directly visible here in the flame graphs without actually going into the percentages to compare it. Good. So one thing that I want to give away here as well: I told you about this concept of copy-pasting code, right? And one of the huge drawbacks this copy-pasting of code has is that you freeze the code somewhere and it doesn't evolve, and you have no insight into whether you're getting regressions, right? It could be possible that a team colleague, or yourself three months down the line, because you don't remember anything, you go back and you tweak the code. Never happens to me, by the way.
00:55:09 Daniel Marbach
You tweak the code and then you break stuff. And on the journey to become more performance aware, you probably want to look into some form of regression testing, and you can do that. When you use BenchmarkDotNet and the results comparer tool from the dotnet/performance repository, and the link is in the slides, I give it away towards the end in the handouts, what you can do is: you take a Git SHA, you run your benchmark with dotnet run in Release, of course not Debug, right, for your framework, and you say: store the artifacts in this folder. And of course you can do that on your CI/CD system if you want to. And then you move forward to the next Git SHA, or to your branch, whatever, you execute exactly the same benchmark, and you say: store the results over there.
00:56:11 Daniel Marbach
And then you use the results comparer tool at the bottom and you say: please compare the base against the other one, and my threshold is, let's say, 2%, right? You're saying we are willing to deviate 2%, to regress 2%, more or less, and if it's more than that, it immediately fails and we know we did regress. I want to give you a bit more hints, because one of the things you might be asking yourself is: how many of those benchmarks should we actually keep? I would advise you to throw away most of the benchmarks, because the benchmarks are just an artifact, a tool that you use in specific optimization scenarios to get somewhere, right? From a regression testing perspective, keep the most important, core pieces of your infrastructure around. Why does it matter? Because, like any other code, this stuff also has to be maintained, right?
00:57:11 Daniel Marbach
Because it lives side by side with your code, unless it's a different repository where you pull in dedicated NuGet packages or something like that, where you already are that advanced, I don't know, but it has to be maintained. And the more of those that you have, the more time the execution takes, and therefore it also becomes, over time, a slower and slower tool to give you insights into what's going on. So you want, again, to keep the right balance of those things around. There is also more guidance if you're interested, in that link, "Preventing regressions" from the .NET performance team; I can highly recommend you read it. Last but not least, I want to talk about going down the path of starting to execute those benchmarks or regression tests on a CI/CD system. We have to talk about the elephant in the room, and the elephant in the room is this one: Andrey Akinshin from JetBrains actually did investigations for JetBrains looking at shared runners out there like GitHub Actions.
00:58:16 Daniel Marbach
And for example, if you're using GitHub Actions and you're saying, hey, let's just throw our benchmarks into GitHub Actions, you might be surprised, because his results show that CPU-bound benchmarks are much more stable than memory- and disk-bound benchmarks, but the average performance levels can still be up to three times different across builds, because it's a shared environment, which means you have noisy neighbor effects. So that's going to be potentially something you have to account for. So that means you might not be able to use those shared runners; you might have to deploy your own stuff or use some bare metal hardware under someone's desk if you're running on premises, or whatever, so that you actually get stable results, right? That's why, with all these problems I told you about, you can get started with this copy-pasting of the code, so that you get on this path of doing these types of performance investigations without being exposed to the whole world of complexity out there.
00:59:16 Daniel Marbach
Good. Last but not least, I want to wrap up here. Use the performance loop to improve your code where it matters, right? I showed you: profile, improve, benchmark, improve, benchmark, profile, and then eventually ship it into production, right? Don't spend too much time on it. Enable your monitoring system, look at your Datadog or Application Insights traces and see whether it actually improves stuff, because sometimes you make assumptions that are just not valid, and it turns out in production; sometimes we have to learn the hard way and that cannot be avoided. Use this approach even though I showed it with an example of a framework and library; like I told you, it's totally applicable to your code out there as well. And again, every application has half of a framework somewhere baked in, so maybe get started there, and then use this approach with profiling to observe how the small changes that you are continuously making over time start adding up as a compounding effect across your code base.
01:00:21 Daniel Marbach
And what I would strongly advise you, because I usually hear this: people are looking at some piece of code and they're saying, well, this code is crap, we need to rewrite it, right? How many times have you heard this in your projects? And I can tell you, this is almost never successful, because you're going to make the same mistakes, because you are not informed, right? And if you go and apply this principle of continuous improvements over time, you learn a ton about the code base, you learn a ton about the assumptions and how it runs in production. Write down those bits and nuggets into your architecture decision records or decision log or whatever, into the PRs. And then, once your understanding of the piece of code is good enough, you might not even need to rewrite it. Or if you actually do start to rewrite it, you can use this knowledge, plus all the benchmarks around it, to make sure the new code is extremely fast for the needs that you actually have for the specific use case, right?
01:01:22 Daniel Marbach
I do not believe that these blanket statements out there, where people are saying this code is shit, are actually valid; in the majority of cases it never actually works. Good. That's that. So all the stuff that I showed you today is available also in handout form, with lots of explanations and screenshots and everything and code samples, on github.com/danielmarbach in the beyond simple benchmarks repository. There is a QR code you can scan, and I also have business cards here on the stage. Grab yourself a business card if you want to shoot me an email later. Also, for the people that give me feedback on the talk or tell me what they particularly liked (send me an email or reach out to me over LinkedIn or social media), I will raffle a JetBrains Ultimate license towards the end of the weekend to those that give me that feedback. I will be at the Particular Software booth for maybe half an hour, but then I unfortunately have to go back to Switzerland. So yeah, hopefully you had a good time, and have a great weekend.