The performance loop—A practical guide to profiling and benchmarking
About this video
This session was presented at NDC Oslo 2024.
Code executed at scale must perform well. But how do we know if our performance optimizations actually make a difference?
Taking a step back, how do we even know what we need to optimize? First, we might need to discover hidden assumptions in the code, or figure out how to isolate the performance-critical bits. Even once we know what must be optimized, it’s challenging to create reliable before-and-after benchmarks. We can only tell if our changes helped by profiling, improving, measuring, and profiling again. Without these steps, we might make things slower without realizing it.
In this talk, you’ll learn how to:
- Identify the areas with an effort-to-value ratio that makes them worth improving
- Isolate code to make its performance measurable without excessive refactoring
- Apply the “performance loop” to ensure performance actually improves and nothing breaks
- Become more “performance-aware” without getting bogged down in performance theater
🔗Transcription
- 00:00:05 Daniel Marbach
- So, basically I declare a class, slap on a bunch of attributes, and then some magic happens. But then I was like, okay, is it really a unit test or not? We're going to answer these questions throughout the talk. But I felt quite certain it wouldn't take me long to actually be able to write a class with a bunch of attributes, and then have a good and reliable benchmark, really simple, right? But I was really wrong. Because writing that class, or that skeleton for a benchmark, that was indeed very, very simple because it's just a bunch of code. I know C#, I know how to slap attributes on a class. But the mind-boggling part for me was that I was trying to figure out what should I actually take into account in the benchmark? What should I actually try to measure?
- 00:00:52 Daniel Marbach
- And I write code, and sometimes my code is not the best code, sometimes it's very entangled and has a bunch of dependencies. But usually when you look at these blog posts online, you see very simple code. Someone calls a static method. But your code, or mine at least, I can only talk about my code, my code is not as perfect as all the other examples on the internet. So, I was like, how do I actually isolate my code so that I can measure it without investing tons and tons of time in refactoring it just so that I can benchmark it? And then, I was also thinking, when I go through these steps in order to benchmark, what should I cut away from my code so that I'm actually measuring the things I want to observe, without taking countless sources of noise into account that divert my attention to parts I don't actually want to focus on?
- 00:01:43 Daniel Marbach
- And then, the other thing was also, how do I go through this cycle of measure, change, and measure again, so that I actually know that I'm improving things without burning away the budget that I have been allotted? Because I no longer make a fuss about job titles, so I call myself Principal Chocolate Lover these days, because I think job titles are useless. But as a matter of fact, I'm not a performance engineer; my goal is to write business code that makes an impact for our customers. And you are probably also not a performance engineer, so that means you only have a limited budget available to invest into performance investigations. And I must say, sometimes I do these performance optimizations just to feed my inner geek, and I do that in front of the TV, while watching some TV series, trying to improve the overall situation.
- 00:02:35 Daniel Marbach
- So, that means, by the nature of it, at some point I need to go to bed, right? Or actually, I have a hack, because I know I tend to go past midnight, so I have a hack at home that switches off my internet around midnight so that I actually go to bed and don't spend too much time. So, essentially, what I want to say is we only have limited time available, and we want to use the time that we have in efficient ways. But then the question is, why should we even bother to do these performance optimizations? For code that is executed at scale, the overall throughput and memory characteristics really, really matter. And especially today, we start to realize more and more that we are living on planet Earth, which has scarce resources available. We consume energy, and code that is executed in data centers consumes energy.
- 00:03:26 Daniel Marbach
- So, we need to make sure that the code that we execute is as efficient as possible, and that's the green IT movement. But even if you're saying, well, Daniel, don't bother me with this green IT stuff, I've heard enough about it, I'm kind of tired of it, at the end of the day, when you're going to move towards the cloud, or maybe you are already in the cloud, what you have there is a credit card associated with your cloud account. And then you have arbitrary numbers, and the cloud vendors make it really, really deceiving for us, because we don't really know: we have some arbitrary metrics that somehow get turned into charges on our credit card. Sometimes it's premium messaging units, throughput units, premium messaging throughput units or whatever, or some gigabytes per second, some arbitrary metrics.
- 00:04:12 Daniel Marbach
- But at the end of the day, what it means is we are going to get charged for the code that is executed in the cloud. So, we want to make sure our code is as efficient as possible so that we only ever get charged for what we really want to pay for the services that we are using. But let me give you a more practical example. Microsoft has this blog post series where they talk about teams that are migrating code internally at Microsoft. And this one here is from the Microsoft Teams infrastructure and services team, and they have this quote: we are able to see Azure compute cost reduction of up to 50% per month; on average, we observed 24% monthly cost reduction after migrating to .NET 6.0. The reduction in cores reduced Azure spend by 24%.
- 00:05:03 Daniel Marbach
- So, what they did is essentially they blogged about their journey of moving from .NET Framework to .NET 6.0. By just moving from .NET Framework to .NET 6.0, leveraging some newer programming techniques and the improvements that the .NET runtime team made over time, they were able to reduce their cloud spending by 24%. And I can guarantee you, even if you manage to reduce your cloud cost by five or 10%, your boss is going to be super, super happy. And 24% is actually amazing. So, in this talk I have summarized my personal lessons on how to make performance optimizations actionable, so that you have a structured approach that you can leverage when you are doing performance optimizations. But it all starts, for me, with this: as a reasonable team, or as a reasonable senior software engineer, or really any software engineer, I think you need to become performance-aware.
- 00:06:01 Daniel Marbach
- What does that mean? One of the key principles that I always try to apply for myself is that I want to make explicit trade-offs whenever I make decisions, together with my teams, for my code, for my architecture, and also for my performance. So, this also applies to performance, and you should be performance-aware. But my friend Martin Pavlu from JetBrains once famously said, "When you go on a hike in certain types of countries, you need to become bear aware." For example, when you're hiking in Canada, the likelihood of you crossing the path of a bear is there. I did some research in preparation for this talk, and I saw that bears in Switzerland, where I live, had actually been gone for more than 100 years, and in 2005, there was a single brown bear that crossed from northern Italy over the border into southern Switzerland, and from then on we had bears again in Switzerland.
- 00:07:02 Daniel Marbach
- But I'm totally digressing here. So, what I want to say, basically, is that similar to being bear aware, where you prepare yourself for the likelihood of a bear crossing your path so you know how to deal with that situation, here, in this talk, I want to teach you some key lessons that help you slowly become performance-aware, and that you can start applying when you actually need them. Good. And then, usually I get the question: performance-aware, does that mean I have to go all in? No, not at all. And I'm going to talk about this. In fact, I always start with the simplest solution first. So, what I do is I think about the business requirements, whatever I want to implement, and then I'm going to write a bunch of tests. Whether I do test-driven development, test first or test after, doesn't really matter to me, I'm not religious about that, but I want to have some tests in place so that I actually know that the thing is working.
- 00:07:58 Daniel Marbach
- Then, I write the simplest thing that fulfills those tests and the requirements that I have, and then I usually ship this. Because only when I ship the code do I know that it actually works, or at least works in our CI/CD environment. But what I also do is take a step back and ask myself a bunch of questions. And the first question I usually ask myself is, how is this code going to be executed at scale, and what would the memory characteristics of this code be? And this is purely based on gut feeling, based on my over 15 years of experience in the .NET space. All right. So, I basically look at the code and try to reason about how it would behave, that simplest solution.
- 00:08:40 Daniel Marbach
- And then, I'm going to ask myself, is there some simple low-hanging fruit that I can pick to accelerate the code that I'm looking at? And then, I'm also going to think about whether there are things that I can move away from the hot path by simply restructuring the code a bit. As an example: instead of instantiating a byte array every time I handle a request, I ask myself, well, is this code going to be executed in a loop? Is there some multi-threading involved, yes or no? Can I move the byte array allocation outside of the loop and just reuse the same byte array? Because then, instead of creating a new byte array every time the loop iteration is executed, I only do it once, and I can amortize the cost of that byte array. That's one simple strategy I use.
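A minimal sketch of that buffer-reuse idea (the `Process` method and the request list are made-up placeholders, not code from the talk):

```csharp
using System;
using System.Collections.Generic;

class BufferReuseSketch
{
    // Hypothetical per-request work that fills a caller-provided buffer.
    static void Process(string request, byte[] buffer) =>
        System.Text.Encoding.UTF8.GetBytes(request, 0, Math.Min(request.Length, buffer.Length), buffer, 0);

    static void Main()
    {
        var requests = new List<string> { "first", "second", "third" };

        // Before: a new buffer is allocated on every iteration.
        foreach (var request in requests)
        {
            var buffer = new byte[4096];
            Process(request, buffer);
        }

        // After: the allocation is hoisted out of the loop and amortized across iterations.
        // Only safe when iterations don't run concurrently and each one fully owns the buffer.
        var shared = new byte[4096];
        foreach (var request in requests)
        {
            Process(request, shared);
        }
    }
}
```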
- 00:09:29 Daniel Marbach
- And then, I also ask myself the question, what part is under my control and what isn't really, right? Because sometimes I'm working with a third-party vendor library; maybe, if I'm lucky, that one is open source, and maybe I can contribute to that third-party library and improve the code when I see something. But that's not always possible. Sometimes I work with a closed-source library, or sometimes I have to work with code from another team, and then if I see some problems, I first have to go to them and make sure my stuff gets priority in their backlog and so on. So, I need to understand what is actually in my control and what isn't, because that context really matters for me to make performance optimizations and explicit trade-offs.
- 00:10:14 Daniel Marbach
- And then, because I told you I always have a limited budget available, I have to think about which optimizations I can apply now, which I can defer to a later point in time when I know more about the characteristics of the code and how it's being executed at scale, and when I should stop. Because I must confess, these types of performance optimizations are highly addictive for me. At some point it's like an adrenaline rush: one more optimization, one less allocation, a bit more throughput and whatnot, but at some point you're reaching the point of diminishing returns. So, what I want to say is it's important to find the right balance; the problem is I cannot give you a cheat sheet here in this talk and say this is always going to be the right balance.
- 00:10:58 Daniel Marbach
- So, the context of finding the right balance really matters a lot. I gave a talk last year at NDC that was called Performance Tricks I Learned from Contributing to Open Source .NET Packages; the QR code is here. If you're interested in optimization techniques that give you hints on how to make your C# and .NET code faster, that's the talk to watch. Today, I'm not going to talk about specific C# and .NET things, that's out of scope for this talk. And by the way, if you can't take a picture of the QR code, I also have resources at the end of the talk so that you can get back to it. So, when I started doing these performance optimizations, I started asking myself, what do I actually do in order to find the right approach? And I came up with what I feel is a great mechanical approach to do performance optimizations in a very structured way.
- 00:11:52 Daniel Marbach
- So, what I do first is I always start the performance loop with profiling. I create a sample or test harness, and that sample or test harness basically reproduces the scenario that I want to profile. And then I always take at least a CPU and a memory profile, that's the bare minimum, because I always need to see those two views. And by the way, if you have a reasonably large system, you are always going to have I/O-bound stuff as well, database calls, HTTP calls and whatnot. I/O-bound stuff is usually hundreds or thousands of times slower than CPU and memory stuff, so you should also look into your I/O-bound stuff, and a test harness should also reproduce that. But that's out of scope for today's talk, because today I'm only going to focus on CPU and memory, in the interest of time.
- 00:12:48 Daniel Marbach
- Okay. And with that test harness, I can then attach a profiler, and I get profiling snapshots out of it, flame graphs and whatnot. And then, the next thing I do is, from what I see in these profiling snapshots, I get an overview of what's actually going on in the code. I see the memory allocations, I see the CPU that is spent, and that allows me to navigate and focus on the parts that I want to focus on. And then, I start improving the hot path that I see. And the question is usually, what is the hot path actually? Because your profiler might show you different things that are slow, different things that are allocating lots and lots of stuff, so where should you actually focus?
- 00:13:35 Daniel Marbach
- So, I try not to get hung up too much on, oh, I'm picking the one with the biggest allocations, or I'm picking the one with the most CPU spend, because contextually, what really matters is what I know about the code base in question. Because in order to improve a code base, you need a lot of context, and that context helps you make the right trade-offs to actually improve it. So, what that means is I take the thing that I have the most context about, or where I know I can have the biggest impact with my knowledge, and I start improving that. And I apply the philosophy of 1% improvements: instead of caring about, oh, this is the biggest priority, I start applying hundreds of small improvements, 1% improvements, over time, and the compounding effect of all these improvements across the code base adds up and starts to matter over time.
- 00:14:33 Daniel Marbach
- And as a matter of fact, the .NET team does exactly the same. Every time you read that giant book from Stephen Toub, the .NET performance improvements blog post, it covers hundreds and thousands of PRs that do small tweaks here and there, and over time that's what gives you the big bang for the buck in those .NET improvements blog posts, right? That's exactly the same philosophy. And what's also important: before I really improve things, I always make sure that I have reasonable test coverage in place. Because without test coverage in place, I could actually make things faster, but that doesn't help when it's utterly broken, right? Because then it's just the fastest broken solution, and that's not going to be helpful for anyone. Good. And then, what I usually do is I start benchmarking and comparing my improvements.
- 00:15:31 Daniel Marbach
- And then, I also have an inner loop in this performance loop, because sometimes when I do improvements and then run a benchmark, I learn more about my assumptions about the code. That gives me new ideas for how I can improve the code, and then I go through this inner loop of improve, benchmark, improve, benchmark. In the benchmarking stage, I usually use BenchmarkDotNet, I'm going to talk more about this, with a short run. What the short run is, is basically a less statistically significant way of getting a result, because all I want to see is a north star, a direction: whether I'm going in the right direction or whether my improvements are actually making things worse. That's what I care about at this stage. Okay?
- 00:16:17 Daniel Marbach
- And then, there is a step that most of the time gets forgotten: I go back to the test harness, take the improvements that I made, once I'm reasonably certain that I'm converging on the final solution, and I put them back into that profiling harness, or test harness, or sample, whatever you want to call it, and actually try to find out whether the changes are making an impact from the macro perspective. And then, at the end of the day, I ship it. Because... This sounds so obvious, but the amount of times I have seen code and improvements lying around in the CI/CD system and not being shipped is crazy. Because at the end of the day, all we do is make assumptions; we're making assumptions with our benchmark, with our improvement, about how the code is going to behave.
- 00:17:07 Daniel Marbach
- So, I advise you to ship the code, fire up your monitoring system, your telemetry stuff, and actually find out, in the grand scheme of things, when real requests are coming in from the users, whether the stuff is actually giving you the results that you were hoping for. And then, let me give you an example of how I did this with NServiceBus. And by the way, I'm a framework and library engineer, so I'm going to use an example out of my profession, but you might be thinking, yeah, but Daniel, come on, frameworks and libraries are a completely different context than me as an application developer. What I'm going to show you with the NServiceBus examples is 100% applicable as a structured approach to your application code as well. So, please bear with me a little bit, okay?
- 00:17:58 Daniel Marbach
- And so, NServiceBus is basically a messaging framework and abstraction. It connects to message queueing systems like Amazon SQS, SNS, RabbitMQ, and Azure Service Bus, it pumps messages from queues, and then it invokes arbitrary customer code that the customer plugs into the framework. And before it invokes that code, it does deserialization, it does unit-of-work management, it interacts with persistence, like DynamoDB, NHibernate, Entity Framework, SQL Server, whatever. But I don't want to go into more detail about NServiceBus, because that's not the focus of today's talk. If you want to know more about NServiceBus, you can go to code.particular.net/NDC-Porto-2024-quickstart.
- 00:18:48 Daniel Marbach
- But the most critical infrastructure piece of NServiceBus is the NServiceBus pipeline. What the NServiceBus pipeline does is, whenever a message comes in, it essentially invokes a series of behaviors. Essentially, the core of NServiceBus follows the open-closed principle: it has a very thin core, and then it has a bunch of behaviors that get plugged in, where we implement the features that we provide for our customers, but customers can also plug in to the framework to do their stuff. And these are the behaviors that are going to be executed. But for those who are not familiar with NServiceBus, maybe you have heard of ASP.NET Core middleware.
- 00:19:32 Daniel Marbach
- ASP.NET Core middleware is very, very similar to behaviors: you declare a class, you have a next delegate, InvokeAsync gets called, and then you can do await next. Everything before the await next is work that runs before the rest of the pipeline, and everything after it runs once the rest of the pipeline has been executed. And NServiceBus behaviors are very similar. You inherit from a class, you define some state that comes in, you have an Invoke method with an await next, and you have stuff that comes before the next and stuff that comes after the next.
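As a rough shape of what he's describing, here is an ASP.NET Core middleware and an NServiceBus-style behavior side by side (the behavior's base class and context type names are approximate and vary by NServiceBus version):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using NServiceBus.Pipeline;

// ASP.NET Core middleware: code before `await next(...)` runs before the rest of the
// pipeline; code after it runs once the rest of the pipeline has completed.
public class SampleMiddleware
{
    private readonly RequestDelegate next;

    public SampleMiddleware(RequestDelegate next) => this.next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // work before the rest of the pipeline
        await next(context);
        // work after the rest of the pipeline has completed
    }
}

// NServiceBus behaviors follow the same before/after-next shape.
public class SampleBehavior : Behavior<IIncomingLogicalMessageContext>
{
    public override async Task Invoke(IIncomingLogicalMessageContext context, Func<Task> next)
    {
        // work before the rest of the pipeline
        await next();
        // work after the rest of the pipeline has completed
    }
}
```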
- 00:20:15 Daniel Marbach
- And in here, all the magic of the framework happens. We do deserialization, we do unit-of-work management, we resolve scoped dependencies from the IoC container, and we call into DynamoDB, Cosmos DB; everything happens, essentially, there. So, what that means is that this part, the pipeline part of NServiceBus, needs to be super fast. Because we never want to be in a situation where the customer comes to us and says, we think your framework is not really working. Usually the best is when we can point out, no, it's actually your code, it's not the framework code. That's why we want to make sure things are working as fast as possible. So, with this example, we're now stepping into the first step of the performance loop, and that's the profiling part. I want to get an overview, with a profiler, of the problem domain at hand, and I want to see what's going on in the code that is going to be executed.
- 00:21:18 Daniel Marbach
- And then, what I usually do is take tools, I am a big fan of the JetBrains tools, I usually use dotTrace or dotMemory to get an overview of the memory profile and the CPU profile by attaching to that profiling harness. I want to be very clear: maybe you have a Visual Studio license, maybe you prefer the Visual Studio performance tooling. Any tool that gets you started with doing this, it doesn't really matter. Maybe you don't like JetBrains, or you like Visual Studio more, or maybe you're saying, well, I actually don't have the budget for a paid license, but I still want to do these performance optimizations and profiling sessions. If you're on Windows, there's a free tool called PerfView; you can download it, attach PerfView to your profiling harnesses, and basically achieve the same results that you would achieve with those commercial tools.
- 00:22:17 Daniel Marbach
- I have to warn you, PerfView is a fairly advanced tool. I confess, every time I'm using PerfView, I have to Google "PerfView cheat sheet" to find all the commands, because it's a very cryptic tool. But what I also do is, usually, when I do these types of profiling investigations, I use more than one tool. And the reason is quite simple: sometimes I don't know yet what the problem is, and having multiple views, basically a cockpit of multiple things that show me multiple aspects, is a good thing. So, I might use the Visual Studio performance profiling tools together with the JetBrains tools, or the JetBrains tools together with PerfView. Because sometimes it's really a case of I know it when I see it, and having multiple tools in place gives me exactly that. Good. Let me show you what a test harness looks like for this example, with NServiceBus. I create a simple console application, with some code in there that initializes NServiceBus, and with a few hooking points that I control.
- 00:23:25 Daniel Marbach
- So, first, I'm setting up NServiceBus with MSMQ. Because I did this test on Windows, MSMQ is already there, it's super fast, and I don't need to set up any cloud infrastructure. I know it's very outdated, old, and rusty, and some people are like... MSMQ? But it just works, and it's available. So, I did that because I want to get all the hurdles out of the way so that I can do my job. Then, I configure a reasonably fast serializer, and that's the System.Text.Json serializer, because I'm not interested in measuring the serialization subsystem, I'm just interested in measuring the execution speed of the pipeline. And then, I also configure NServiceBus with the in-memory persistence. And here, again, the reason is simple: I know that I/O stuff, I talked about this at the beginning, is sometimes hundreds or thousands of times slower than CPU and memory, and I want to focus on the optimization of the pipeline.
- 00:24:23 Daniel Marbach
- I basically use a persistence that is not going to involve any I/O-bound stuff, so that I don't have that noise in the profiler snapshots that I'm going to look at. And then, I have a bunch of Console.WriteLines and Console.ReadLines, and these are simple hooking points where the console tells me, hey, the stuff you don't want to measure is over now, and the part you actually want to look at is starting. And then, with my context, I know there are two pipelines. There's the publish pipeline, and that's when a message goes into the queue, and that's what I'm doing here: I'm concurrently sending a thousand messages, publishing a thousand messages to the queue.
- 00:25:07 Daniel Marbach
- And then, on the other hand, I know that when I'm sending messages and receiving them again, the receive pipeline is going to be invoked. So, I attach a simple handler, basically a consumer of that message, that does nothing, because I don't want to measure the execution of the handler, I want to measure everything that comes before the code that is in there gets executed. So, I use my context, my knowledge of the framework, to understand what the bare essentials are that I can put in place so that I get a reasonable overview. And that's the receive pipeline here. Good. Now that I have this in place, you might be asking, Daniel, you showed me all this with NServiceBus, that's nice and fine, but what can I learn from this?
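Roughly, such a harness is a plain console application like the sketch below; the exact NServiceBus transport, serializer, and persistence configuration calls differ between versions, and `SomethingHappened`/`SomethingHappenedHandler` are made-up names:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using NServiceBus;

class Program
{
    static async Task Main()
    {
        var configuration = new EndpointConfiguration("ProfilingHarness");
        configuration.UseTransport<MsmqTransport>();            // local transport, no cloud setup needed
        configuration.UseSerialization<SystemJsonSerializer>(); // fast serializer; not what we measure
        configuration.UsePersistence<NonDurablePersistence>();  // in-memory persistence keeps I/O out of the snapshots

        var endpoint = await Endpoint.Start(configuration);

        Console.WriteLine("Setup done. Attach the profiler, then press <Enter> to start publishing.");
        Console.ReadLine();

        // Publish pipeline: publish a thousand messages concurrently.
        await Task.WhenAll(Enumerable.Range(0, 1000)
            .Select(_ => endpoint.Publish(new SomethingHappened())));

        Console.WriteLine("Publishing done. Press <Enter> once the receive pipeline has drained.");
        Console.ReadLine();

        await endpoint.Stop();
    }
}

public class SomethingHappened : IEvent { }

// Empty handler: the goal is to measure everything before the handler, not the handler itself.
public class SomethingHappenedHandler : IHandleMessages<SomethingHappened>
{
    public Task Handle(SomethingHappened message, IMessageHandlerContext context) => Task.CompletedTask;
}
```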
- 00:25:55 Daniel Marbach
- Well, I have summarized for you the important things you need to take into account when you define a harness. The first one is super obvious, you're like, Daniel, come on, stating the obvious here: you need to compile it and execute it in release mode. I'm including this here because the amount of times I fell into this trap is staggering. By default, everything Visual Studio runs is in debug mode, because as a developer you usually have the debugger attached, and therefore you have to switch to release mode, otherwise the results will not be reliable. Then, the harness needs to run for a few seconds and keep the overhead to a minimum. You need to make contextual decisions to remove all the noise, and it needs to run for more than just a few hundred milliseconds because otherwise you are not going to see anything. So, those are important characteristics of a harness.
- 00:26:48 Daniel Marbach
- Then, one good practice is to disable tiered JIT compilation, because tiered JIT compilation has a few effects that might hide problems, so depending on the scenario that you're looking at, it's a good practice to disable it. And then, what you also want... Whoa. Excuse me. What you also want to do is emit the full symbols, and you do that with DebugType pdbonly and DebugSymbols true, because when you look at the stack traces, the call stacks and everything in the profilers, you want to see what's going on, you want to have back references to the code to do your investigations. So, these are the things that you need to take into account for a harness. But now, let's have a look into the memory characteristics of what's going to happen in this specific example.
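In project-file terms, those harness settings translate to something like the following (a sketch; build and run the harness with `dotnet run -c Release`, and note that property names can vary slightly across SDK versions):

```xml
<PropertyGroup>
  <!-- disable tiered JIT compilation so tiering doesn't hide problems -->
  <TieredCompilation>false</TieredCompilation>
  <!-- emit full symbols so profiler call stacks map back to source -->
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
</PropertyGroup>
```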
- 00:27:36 Daniel Marbach
- I have here a screenshot of dotMemory, and I'm starting with memory allocations first, and the reason is pretty simple. One of the biggest slowdowns, still today, in C# and .NET applications out there is memory allocations. And actually, David Fowler recently tweeted about this as well: they have telemetry about .NET systems, and one of the biggest slowdowns of C# code out there, still today, is MemoryStream.ToArray, or byte array allocations. So you can get a lot of bang for the buck by starting with memory allocations first. Another thing is complexity: based on my experience, it's usually easier for me to optimize code that has memory allocations than to go down the mind-boggling path of trying to understand algorithmic complexity and then doing CPU-type optimizations. But your mileage may vary.
- 00:28:37 Daniel Marbach
- Okay. So, I'm looking here, and what I see when I look into the publish pipeline is lots of byte array, char array, and stream reader allocations, and MessageQueuePermission allocations and whatnot. But using my domain knowledge, and my focus on optimizing the pipeline execution, I actually know that I have to focus on this one. And you're looking at this, and it's 20 megabytes of allocations in the pipeline, and you might be thinking, but Daniel, come on, you have hundreds of megabytes of allocations over there, and now you're focusing on 20 megabytes. I'm going to talk about that a little bit more. But before we jump to conclusions, we should look at the receive part, right? Because we have two pipelines that are going to be invoked. So again, I'm looking at the memory allocations and I see a lot of message extension and XML text reader implementation allocations, and again, using my domain knowledge, I know that this Func of behavior context returning Task is stuff that is coming from the receive pipeline execution.
- 00:29:41 Daniel Marbach
- So, I'm looking at 27 megabytes of allocations here. And again, the question is, okay, we were essentially zooming in from the hundreds of megabytes of allocations that we saw on those screenshots onto 20-something megabytes of allocations, isn't that completely nuts? Shouldn't we get rid of all the other allocations first? Well, the thing is, it really depends. And here, again, I want to tell you, it's the same for your code: you have the knowledge about your domain and how things are going to be executed, and I have that with NServiceBus. And there are actually contextual things that I have to take into account. For example, I used the MSMQ transport, and most of the allocations in those screenshots were coming from the MSMQ transport, and MSMQ has a diminishing user base.
- 00:30:33 Daniel Marbach
- Most of our customers today are moving to Azure Service Bus, RabbitMQ, or whatnot, so MSMQ is not really relevant anymore. So, I can basically ignore those allocations. Or maybe I'm not an MSMQ expert, and ramping up my knowledge would take up too much of the budget that I have for these types of performance investigations, so it doesn't really make sense for me to invest more time there. And then, again, I told you about this 1% improvement philosophy, right? When I do lots and lots of optimizations, 1% improvements all over my code base, these iterative gains on the hot path will essentially have a compounding effect over time. And then, what I also have to take into account is that when I optimize the pipeline allocations, I'm actually improving the code base for every customer out there, independent of the transport.
- 00:31:29 Daniel Marbach
- So, basically, even though I'm focusing on 20 megabytes of allocations, I'm essentially getting the biggest bang for the buck. And again, you have to make those trade-offs, and I advise you, if you do that together with your team, to maybe also keep a decision log, a performance investigation decision log. You write down the trade-offs that you're making when you're looking at these profiler snapshots, so that you can control where you're going. So, the context of the code really matters, and you only have that context because you, together with your team, are the experts in your specific code base. Good. And then, let's have a look at the other memory allocations; we can see here, when we zoom in, that we have another 15 megabytes of allocations there. Then, what I can do, with, for example, the JetBrains tools, is filter by namespace. I can filter for the NServiceBus pipeline, and then I see all the allocations that are happening in the NServiceBus pipeline.
- 00:32:29 Daniel Marbach
- So, again, maybe you have good, clean namespaces in your code, and maybe you're focusing on a specific shipping feature, whatever. What you can do is use this tooling to zoom into the shipping domain, and then you see all the allocations in the shipping domain. So, that allows you to zoom in on the things that actually matter. Now that I have an understanding of the memory characteristics, let's go into the CPU characteristics. And like I mentioned, we always want to have two profiles, memory and CPU, right? And now we're going to look into where the code actually spends unnecessary CPU cycles, and where we can potentially make further improvements.
- 00:33:09 Daniel Marbach
- And one of the great tools for this is called a flame graph. I know this is very blurry on the screen, but you don't have to focus on what's actually written there. One of the great things about flame graphs is that you can see there are bars, and there is a bar over there which is the entry point into the publish pipeline, and the length of the bar at the top, reading from top to bottom, represents the execution time it takes for everything underneath to be executed. Why is that important? Well, it's a super handy visual tool where we can see we have a bunch of red stuff, which, in my context and with my knowledge, is the infrastructure code that is getting executed, and we have a bunch of orange stuff over here that is the actual business logic that is getting executed.
- 00:33:59 Daniel Marbach
- And just by looking at this flame graph, without knowing nanoseconds or milliseconds or whatever, I can already see that the relationship between the infrastructure stuff and the business code is kind of wacky. So, I already know something is off here, just by looking at the flame graph. That's a really handy tool, and when we zoom in, we can see this even better. We can see now that this is super off, because what we have is the mutate outgoing message behavior, and that's your business logic, and then we have gibberish, gibberish, gibberish, gibberish, gibberish, and then we have the apply reply-to-address behavior invoke. And that means we basically have lots of stuff that we are not really interested in, some infrastructure glue that gets executed, and we might be able to get rid of it. And the flame graph tells us that already.
- 00:34:52 Daniel Marbach
- But you might be thinking, yeah, Daniel, it's almost evening, I'm already a little bit tired, all this flame graph stuff is a little bit over my head, aren't there other ways to do this? Yes. Most of the tools actually have a hot path or hotspot view; for example, dotTrace has a hotspot view. There you can zoom into a namespace, and here I'm zooming into the publish pipeline, and then it gives me CPU and wall clock times and whatever, we don't have to go into that part. But it basically gives me the percentage of time the code is actually spending on CPU. So, when we zoom in, what we can see here is that we have 20% in the behavior chain invoke, and 12.3% in the behavior invoker invoke.
- 00:35:39 Daniel Marbach
- So, basically, this tells me 32.3% of the CPU is spent in infrastructure code, that's what this screenshot tells me here. But then again, we need to look at the receive part, right? Because we want to get a holistic overview, and what we can see here is when we zoom in, we have 9.2% and 4.8%. So, in total, 14%, if I can do the math right, that is spent in infrastructure code. But the cool thing is, the flame graph has already shown me that but hasn't given me exact numbers. But now, here, I have exact numbers. Again, sometimes you need multiple tools to actually see what's going on. Good.
- 00:36:19 Daniel Marbach
- Now that I have a good overview, I can start improving things, but I want to say, again, as a reminder, hold your horses, because it's very crucial that you have tests in place. If you have forgotten to write them, or if your colleagues have forgotten to write them, at least put some basic tests in place so that you can make sure that you're not breaking the code. And because I had already done these contextual investigations with the profiling snapshots, I had lots of great ideas for how I might be able to improve the code. So, what I did is I wrote a bunch of unit tests and acceptance tests that make sure the things that I have in mind will not break the existing code, because that gives me the freedom to actually get started.
- 00:37:08 Daniel Marbach
- And if you're interested in the actual improvements I made, there's a bunch of blog posts that I wrote: 10X Faster Execution with Compiled Expression Trees, and How We Achieved 5x Faster Pipeline Execution by Removing Closure Allocations; you can find the blog posts at particular.net/NDC-Porto-2024-pipeline. But I'm not going to go more into that. And I know these are very click-baity titles, but that's how the internet works. As a matter of fact, we actually improved the performance of the pipeline ten times and five times. Good. But now let's look into how we are going to benchmark the pipeline. I told you about all these blog posts that we see, and usually the blog posts show some benchmark that calls some little, I don't know, string manipulation, or some static method, and that is super, super easy. But the code that we have out there, or at least the code that I write, is usually super, super messy.
- 00:38:11 Daniel Marbach
- Because it evolves over time, I have to deal with existing assumptions and whatnot, so it's not that easy. It's not just a static method that I can essentially call, or, as Gordon Ramsay would say, software is a "disgusting, festering mess." But essentially, the code that we usually have has numerous dependencies that are getting called, and I want to find a way to take my existing code, without doing crazy amounts of refactoring, and put it under a benchmark. And I found a way, and it's a little bit of a controversial way, because I'm going to show you how you can copy-paste code. And then developers are like, oh, Daniel, you're copying and pasting code, that's the root of all evil. But bear with me. Okay. What I usually do to get started is I take the code in place and copy it into a dedicated repository or a dedicated folder; I just take everything that is relevant from that specific code path and put it somewhere else.
- 00:39:23 Daniel Marbach
- And then, I take that code, think about the trade-offs that I'm going to make for the benchmarking purposes and the things that are just pure noise that I don't want to see, and then I strip it down to the bare essentials to create a controllable environment. One of the benefits this also has is that while my team is working on the actual code, we are not influencing each other at all. And I also don't have to refactor the actual production code just to narrow things down for my benchmarking, and risk damaging the production code for something where I'm only trying to measure what's going on. I'm showing you here a screenshot of what I have done, and you don't have to understand what's there, but what I want to show you is that the pipeline is not just a single static method; the pipeline is lots and lots of components and classes that interact with each other.
- 00:40:20 Daniel Marbach
- So, I took the entire NServiceBus pipeline, essentially, and put it into a repository. And then, what I did in this specific example is I asked, okay, which behaviors, which pluggable stuff in the framework, is not relevant for my performance investigations? And I removed all of those. Then I said to myself, well, I'm not comparing, I don't know, Autofac against StructureMap against the Microsoft.Extensions dependency injection abstraction, I don't care about that. So, I essentially replaced all the dependency injection stuff with simply newing up classes, because it's not relevant for my benchmark; I want to measure the pipeline execution speed. And then, I also replaced all the I/O-bound operations by just returning Task.CompletedTask.
- 00:41:11 Daniel Marbach
- And some of you who might be more advanced software engineers, who have already listened to talks about async await, might be saying, but Daniel, I know you're cheating here, because when you're returning Task.CompletedTask, you're not actually yielding the thread back to the thread pool, and therefore your benchmark is super artificial and is not going to work out... You're right and wrong at the same time. Because again, what is the context? What I want to measure is the raw execution speed of the pipeline, so yielding here is not relevant, because I'm going to synchronously execute the pipeline over iterations, that's one thing. The other thing is that a tool like BenchmarkDotNet is going to do a statistical analysis of the code in place, so that yielding would blend into the statistical analysis and would not be relevant for this specific scenario. Again, I want to give you the full context here.
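A tiny sketch of that kind of stubbing (the interface and class names here are made up for illustration):

```csharp
using System.Threading.Tasks;

// Hypothetical dependency that would normally hit a database or another service.
public interface IMessageStore
{
    Task SaveAsync(string messageId);
}

// Benchmark stand-in: completes synchronously, so I/O latency and yielding stay out of the measurement.
public class NoOpMessageStore : IMessageStore
{
    public Task SaveAsync(string messageId) => Task.CompletedTask;
}

// In the copied-and-stripped benchmark code, dependencies are simply newed up
// instead of being resolved from a dependency injection container:
// var store = new NoOpMessageStore();
```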
- 00:42:01 Daniel Marbach
- Good. And I said copy-pasting code. Because I told you you need to become performance-aware, and I want to give you this approach so that you can start building up a performance culture. I know that you can't just go to an NDC talk, then say, oh, Daniel told me we should become performance-aware, go back to your job, be the evangelist, and tomorrow you will be performance-aware. That stuff takes time; it takes months and years to build up a performance culture. And usually, when you are the one who heard all these principles and you're going back to your work, you are becoming the expert, the performance expert, and you want an approach that you can teach your colleagues, that gets you started, and that lets you build up this performance culture.
- 00:42:52 Daniel Marbach
- So, what I want to say is, this copy-pasting-the-code approach is the 80/20 rule. It's very good for code that rarely ever changes, but of course, this approach does not discover any regressions; I'm going to quickly talk about regressions a little bit later. But it's a good structured approach to get you started building up this performance culture gradually. Because one of the things that you have to take into account, when you're going down the path of, oh, we want to execute these benchmarks in the CI/CD environment, is that you have to start asking yourself, how can we reliably run those benchmarks in the CI/CD environment? Then you have to ask yourself, how do we set up the CI/CD environment? Is it good enough if we're executing on shared Azure DevOps or GitHub runners, is that going to give us reliable results or not?
- 00:43:41 Daniel Marbach
- Maybe not. And what hardware do we need? So, you need to ask yourself a bunch more questions than you might actually be willing to take on as a team. Again, that's why I showed you this more pragmatic approach to get started with. Good. Now that we have that aligned: at the beginning, I told you I had this conceptual understanding that a benchmark is like a unit test, right? It's like, okay, I'm just going to declare a class and add a bunch of attributes. But as a matter of fact, that was actually a wrong conceptual understanding of a benchmark. Because a unit test essentially has two results: it's either green or red, passed or failed. But when you execute a benchmark, what you get is statistical results, lots of numbers, that are not green or red.
- 00:44:41 Daniel Marbach
- You're getting a distribution of values of what's going on under the scenario that you're benchmarking. What is also important is that a benchmark needs to be executed until the results are stable, and that means it needs to be executed potentially hundreds and thousands of times. So there's already a runtime difference between a benchmark and a unit test: a unit test is going to be executed in three, four, maybe ten milliseconds, while a benchmark will potentially take minutes up to hours, depending on the scenario. And then, what's also important is that while in unit testing or testing scenarios we usually focus on all the permutations of scenarios that we can think of, with benchmarking we have to focus on the most common cases, on the frequently used hot path, with the required amount of permutations, and this is really, really crucial.
- 00:45:38 Daniel Marbach
- I'm saying the required amount of permutations because the more permutations you are willing to take on, the longer it takes to execute all those permutations, and the longer it takes you to get results, to get meaningful insights into what is going on. And then, what's also important here is that you need to derive the cases that you permutate on from production scenarios, so that you actually get reliable results from your benchmark. Otherwise, your benchmark is completely synthetic and doesn't give you what you actually want to see. Good. Here, I have a concrete example from measuring the NServiceBus pipeline. I'm going to zoom in; don't worry too much, the code is not too important. I'm just going to walk you through some of the features of BenchmarkDotNet.
- 00:46:28 Daniel Marbach
- So, the first thing that I'm doing is creating the pipeline. I use this global setup, and what global setup basically means is that I'm going to do a bunch of stuff that BenchmarkDotNet should not be measuring. Because I do not care about warming up the pipeline, that's not something I want to measure; I want to measure the performance, the throughput, the memory characteristics of the pipeline execution, and that's what I do here. So, that's the global setup, setting up the pipeline. And then, I have parameters, and these are the permutations for the benchmark. So, I went back, as an example, to Salesforce, and I was thinking, okay, a pipeline has a depth, and I looked at the cases that we have in Salesforce from our customers, I looked at our documentation and samples, I looked at our internal usage, and I saw that a reasonably deep pipeline is 10, 20, or 40.
- 00:47:23 Daniel Marbach
- So basically, I derived those numbers; I could have picked 15 different numbers, or 20, but I need to find reasonable numbers that give me an impression of how the depth of the pipeline actually relates to the pipeline execution, or whether it doesn't relate to it at all. That's what I'm trying to derive. And then, at the top, I added this short run attribute, and again, the idea of the short run attribute is to get a result quickly. I don't want the statistically relevant result yet, I just want to see in which direction I'm going. And then, I add the memory diagnoser so that I also get an overview not just of the CPU stuff, but also of the memory stuff.
- 00:48:07 Daniel Marbach
- And then, down here, I basically have the benchmark of the pipeline. The baseline is basically the thing that I have before my optimization, and then below, I have the benchmark of the pipeline after my optimization, so that I can compare those two across the permutations of the benchmark that I'm executing. Good. And then, of course, because I'm showing you this practical example with NServiceBus, I also want to give you a few best practices you can take into account for your own benchmarks. A benchmark should always follow the single responsibility principle, like any other class or method that you have in your code bases. The idea is that a benchmark should benchmark a single scenario. To go back to my previous example, I said I removed the warm-up of the pipeline because that's not part of my scenario; I could also have said I want to measure the warm-up of the pipeline. But bundling those two things into the same benchmark is going to muddle the results.
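Stripped down, the shape of such a comparison benchmark looks roughly like this; the BenchmarkDotNet attributes are the real API, while `PipelineBefore`/`PipelineAfter` are placeholders standing in for the copied pipeline variants:

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob]     // quick, less statistically rigorous runs: enough to see a direction
[MemoryDiagnoser] // report allocations, not just time
public class PipelineExecutionBenchmarks
{
    // Permutations derived from realistic pipeline depths, not arbitrary numbers.
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    private PipelineBefore before;
    private PipelineAfter after;

    [GlobalSetup]
    public void Setup()
    {
        // Build and warm up both pipeline variants once; this is not part of the measurement.
        before = new PipelineBefore(PipelineDepth);
        after = new PipelineAfter(PipelineDepth);
    }

    [Benchmark(Baseline = true)]
    public Task Before() => before.Invoke();

    [Benchmark]
    public Task After() => after.Invoke();
}

// Minimal placeholders so the sketch compiles; in reality these are the copied pipeline variants.
public class PipelineBefore
{
    public PipelineBefore(int depth) { }
    public Task Invoke() => Task.CompletedTask;
}

public class PipelineAfter
{
    public PipelineAfter(int depth) { }
    public Task Invoke() => Task.CompletedTask;
}
```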
- 00:49:12 Daniel Marbach
- So, I'm basically saying, I'm going to measure the execution of the pipeline, and that's the single responsibility of the benchmark, under the permutations of that specific scenario. Then, a benchmark should have no side effects. Benchmarks are going to get executed in iterations, and when you have state, like fields in classes and stuff, or even your own internal code, that accumulates state over every iteration of the benchmark, obviously that is going to influence the results of the next iterations. You should make sure that this doesn't happen, because it's going to skew the results. One thing that's also super important is that we have the just-in-time compiler, and any code that looks like it's not going to be used will get optimized away by the just-in-time compiler. So, you need to make sure that the code that you have in place in your benchmark is actually going to get used.
- 00:50:08 Daniel Marbach
- Otherwise, the just-in-time compiler will remove the entire code, and then you're basically measuring nothing. Okay? BenchmarkDotNet has a bunch of ways to deal with that: you can return stuff from the method, or there's a Consumer class that you can use, where you call consumer.Consume, and then your result is going to be consumed. Then you don't have that problem. Then, I advise you to take something like BenchmarkDotNet and delegate all the heavy lifting to a framework or library that does benchmarking the right way. Stop using stopwatches and stuff like that, that's not a reliable way to do benchmarking. And then, I also advise you to always make the benchmarking code as explicit as possible. I'm not religious about this; if you are a fan of var, please use var.
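A small sketch of the dead-code-elimination pitfall and the two escape hatches he mentions (the `Compute` method is a made-up stand-in for the code under test):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

public class DeadCodeEliminationBenchmarks
{
    private readonly Consumer consumer = new Consumer();

    private static int Compute(int value) => value * 31; // hypothetical work under test

    [Benchmark]
    public void Wrong()
    {
        // The result is never observed, so the JIT may eliminate the call entirely
        // and this benchmark can end up measuring nothing.
        Compute(42);
    }

    [Benchmark]
    public int ReturnResult() => Compute(42); // returning the result keeps the work alive

    [Benchmark]
    public void ConsumeResult() => consumer.Consume(Compute(42)); // so does consuming it
}
```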
- 00:50:54 Daniel Marbach
- But I just want to make sure that you understand that when you have stuff like implicit casting and things like that, and you're looking at the benchmark and trying to understand why something is behaving the way it's behaving, and you have lots of magical code that is not visible in the benchmark you're looking at, you're not going to find the problems in the code. That's why it's important to make it as explicit as possible. And then, there's one thing that I want to mention because I've suffered from it as well. You wrote this nice little benchmark, and yoo-hoo, now I can measure things. And then you look at your calendar and see, oh, a meeting is coming up, and then you're sitting in a Zoom or a Teams meeting, and you're like, ah, I'm really bored, shouldn't I just kick off this benchmark?
- 00:51:43 Daniel Marbach
- That's not a good idea. Because benchmarks are quite CPU and memory intensive. And I'm not going to bash Teams, everyone has their own opinions, but tools like those video calling tools are quite resource intensive on CPU, memory, GPU and whatnot. So, this influences the benchmark that you're running. So please, kick off the benchmark, go grab a coffee, and when you're standing at the coffee machine for the tenth time on the same day and your boss comes by, Daniel, what are you doing at the coffee machine again?, you say, hey boss, I'm optimizing the code, I'm squeezing out the 10% for my bonus, and I'm making the code faster.
- 00:52:21 Daniel Marbach
- So, make sure that you're managing these things explicitly. Good. I can highly recommend BenchmarkDotNet; it's a super slick tool, it's written and used by the .NET performance team and the .NET teams, and it's super reliable. Because I can tell you, benchmarking is really, really hard. I would say I already have a good amount of experience with benchmarking, but I still make countless mistakes with these things. So, if you have a framework or a library that guides you so that you don't make all the mistakes the community has already learned from, that's the best way to actually write a benchmark. For example, it isolates the code and runs it in dedicated processes so that static state doesn't influence the results.
- 00:53:13 Daniel Marbach
- It does statistical analysis, it executes the code until the results are stable, it removes outliers and stuff like that. So, all the things that you would otherwise have to reinvent, you don't have to do anymore with BenchmarkDotNet. Good. Now, let's have a look at one other thing. When we talk about benchmarking, I told you you should only ever do the common cases, not special cases, not exception cases, but again, every guideline that you hear in a conference talk should always be put into the context of the specific thing that you're doing, right? Don't blindly follow guidelines and things that speakers tell you at conferences. And I want to give you an example here. I want to measure the execution speed, but because we are a messaging framework, there are cases where a programmer writes, I don't know, bad code that throws, for example, a null reference exception, and it's possible that hundreds and thousands of messages will essentially go through a series of exceptions and then eventually end up in the error queue.
- 00:54:20 Daniel Marbach
- So, that means the exception case that you would normally not benchmark is, in my case, actually something I want to benchmark, because I want to make sure that my code is also fast under exception conditions. So, what I do is I set up, in a global setup, a pipeline with behaviors, and at the bottom, at the deepest end of the pipeline, I add a behavior that throws an exception, so that the exception bubbles through the entire call stack. That's what I did here. That's the scenario that I'm measuring, and then I basically have the optimization before and... Sorry, the code before, with a try-catch, and then after. So, I'm violating my own guidelines that I just gave you today, but I wanted to make this example so that you understand that you have to put these guidelines into the context of the things you are doing as well.
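In sketch form, that exception-path benchmark looks something like this (again, `ThrowingPipeline` is a placeholder for the copied pipeline whose deepest behavior throws):

```csharp
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob]
[MemoryDiagnoser]
public class PipelineExceptionBenchmarks
{
    private ThrowingPipeline pipeline;

    [GlobalSetup]
    public void Setup() =>
        // The deepest behavior throws, so the exception bubbles through the whole chain.
        pipeline = new ThrowingPipeline(depth: 10);

    [Benchmark]
    public async Task InvokeWithThrowingBehavior()
    {
        try
        {
            await pipeline.Invoke();
        }
        catch (InvalidOperationException)
        {
            // The cost being measured is the exception bubbling through the pipeline.
        }
    }
}

// Minimal placeholder so the sketch compiles.
public class ThrowingPipeline
{
    public ThrowingPipeline(int depth) { }
    public Task Invoke() => throw new InvalidOperationException("deepest behavior throws");
}
```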
- 00:55:13 Daniel Marbach
- Good. And then, I'm going to skip this one in the interest of time. So, here is one of the things that usually gets forgotten; most people stop here, right? It's like, yeah, I've done it, perfect, ship it to production. But what I then usually do is I take all these improvements that I measured and put into my code, and I put them back into the test harness or the profiling harness that I used at the beginning. Why is that important? We want to see it under the full scenario again. Because we did a bunch of improvements, and we potentially measured only parts of them with the benchmarking comparison, because that's also never a perfect picture. So I put it back into the harness so that I can actually see whether the changes improved things in the grand scheme of things. And this is what I did there, because all those tools, PerfView as well, store snapshots on disk.
- 00:56:06 Daniel Marbach
- So, what you can do is store the snapshot from your performance investigation before, on disk, and then compare it to the after result. And when we zoom in, we can see that previously we had 20 megabytes of behavior chain allocations; on the right side, these are no longer there, but some other allocations are now showing up. So we see we have already improved the memory allocations in the grand scheme of things. Then, when we look at the receive pipeline, what you can see is that previously we had the Func of behavior context returning Task allocations, 27 megabytes; they're gone on the right side, and the overall allocations in the profiling harness are also lower. So, we know we achieved something.
- 00:56:53 Daniel Marbach
- And then, when we look at the memory characteristics as well, or, for example, at the call stack, what we can see is that on the left side we essentially have the call stack before, and on the right side we already see that everything has shrunk, so that means we're also going to be faster in production. It also gives me a good overview of what's going on. And then, we can also zoom in by selecting the namespace, and we can see that all the previous display class allocations and whatnot are all gone, and we can pat ourselves on the shoulder: we actually achieved something. We improved the memory characteristics. And then, when we look at the CPU characteristics: while previously we had lots of red stuff and a little bit of orange stuff, when we look at the flame graphs, the after picture shows that the relationship between the red infrastructure part and the orange business logic has also drastically improved. So, we know that we have actually achieved something good.
- 00:57:59 Daniel Marbach
- So, overall, what I have achieved there is that I got rid of 32.3% of CPU overhead on the publish pipeline, and 14% of CPU overhead on the receive pipeline. But before I wrap up, let me give you an overview of one more thing: I talked about copy-pasting code, and I told you this approach is great to get started, but it doesn't give you any regression testing, so you don't know whether something is actually breaking. There is a really easy way to do that: there is a tool called ResultsComparer, it's in the .NET performance repository, and they also have guidelines around it. What you do is, you essentially take BenchmarkDotNet, you run your benchmark on the Git SHA before your improvement and tell it with --artifacts to store the artifacts in one folder, and then you check out the Git SHA where you did your improvement, you execute the same benchmark, and you store the artifacts in an after folder.
- 00:59:10 Daniel Marbach
- And then, you take this ResultsComparer tool, which is a .NET executable, a global tool, and you just tell it, hey, here's the baseline folder, here's the after folder, and here is my threshold. And what this tool gives you is an exit code. With that, you could essentially prevent regressions from happening in your CI/CD pipeline for crucial infrastructure code. But does that mean you have to keep all the benchmarks that you're writing around? No, I would not advise you to do that. Again, executing benchmarks is expensive, and keeping them around and maintaining them is also expensive, so I would advise you to keep the most important benchmarks around for the most important, crucial infrastructure, and do this only for that. But you have to be careful. And essentially, when you're executing stuff in the CI/CD environment, there was a great blog post... Oh, excuse me, it's a little bit laggy.
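Put together, that workflow looks roughly like the commands below. The `--filter` and `--artifacts` switches are standard BenchmarkDotNet console arguments, but the ResultsComparer project path and its `--base`/`--diff`/`--threshold` options are written from memory here, so check the dotnet/performance repository's documentation for the exact invocation:

```
# benchmark the commit before the improvement
git checkout <baseline-sha>
dotnet run -c Release --project MyBenchmarks -- --filter '*PipelineExecution*' --artifacts ./before

# benchmark the commit with the improvement
git checkout <improvement-sha>
dotnet run -c Release --project MyBenchmarks -- --filter '*PipelineExecution*' --artifacts ./after

# compare the two result sets; a non-zero exit code can fail the CI/CD build on regressions
dotnet run --project ResultsComparer -- --base ./before --diff ./after --threshold 2%
```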
- 01:00:06 Daniel Marbach
- From Andrey Akinshin from JetBrains: he looked at the performance stability of GitHub Actions, so basically shared CI/CD infrastructure, and the results are not reliable. He found that "the CPU-bound benchmarks are much more stable than memory/disk-bound benchmarks, but the average performance level still can be up to three times different across builds." So, that means once you start going down the path of executing your benchmarks in your CI/CD environment, you need to have a good infrastructure in place that gives you stable results, because otherwise you cannot trust your benchmarks, right? So, go read this blog post and take it into account when you embark on this journey. Good.
- 01:00:45 Daniel Marbach
- I want to wrap up. So, I showed you here, with NServiceBus examples, a very practical approach that helps you do performance improvements in your code as well, by profiling, improving, benchmarking, profiling again, and shipping, with an inner loop of improving and benchmarking. And I truly believe, and I've used it in application code as well, that this approach I showed you with framework and library code also makes you successful with your application code. And then use this approach, together with profiling, to observe how the small changes over time actually have a compounding effect. And I want to hammer this home; I advise you this because I hear it all the time: people look at code and say, this code is crap, let's rewrite the entire thing.
- 01:01:31 Daniel Marbach
- I've never seen that be successful on the first attempt. I advise you to do small incremental improvements with this approach, because you're going to learn a ton about your code, you're going to learn a ton about all the assumptions that you're making, and that knowledge will crucially guide you if you ever decide that you actually do need to rewrite this code, to find a better approach. Because if you just rewrite from the beginning, based only on the gut feeling that this code is crap, you're going to make the same mistakes, and the new code is also going to be crap. So, use this structured approach to make yourself successful.
- 01:02:05 Daniel Marbach
- I have all the resources for this talk, including an extensive handout, at GitHub.com/danielmarbach/beyondsimplebenchmarks. If you have more questions, I'm still available today at the Particular Software booth. Please also put your rating, hopefully the green card, into the box, reach out to me over social media, business card, whatever, and give me some structured feedback, and by the end of tomorrow I will raffle off two JetBrains licenses among everyone who is so kind to reach out to me. Thank you very much.