Performance tricks I learned from contributing to open source .NET packages

00:00:01 Tim Bussmann

Hello again everyone and thanks for joining us for another Particular Live Webinar. This is Tim Bussmann. Today, I'm joined by my colleague and solution architect, Daniel Marbach, who's going to talk about performance tricks for .NET applications. Just a quick note before we begin, please use the Q&A feature to ask any questions you may have during today's live webinar, and we'll try to address them at the end of the presentation. We'll follow up offline to answer all the questions we won't be able to answer during this live webinar, and we're also recording the webinar. Everyone will receive a link to the recording via email. Okay, let's talk about .NET performance tricks. Daniel, welcome.

00:00:39 Daniel Marbach

Hi, Tim. Thanks for introducing me. Hi, everyone, who joined this webinar and maybe is also watching the recording at the later stage. So, today I'm going to talk about the performance tricks that I learned from contributing to open source .NET packages. So one of the things that I am, or I would describe myself is, I mean, I'm the kind of guy... I can read, basically, a book about performance or any type of .NET engineering, software engineering topic. And then I'm usually like, "Okay, I got it figured out." But once I sit in front of the code, I'm like, "Okay, how is that going to work again?" And I'm getting confused, and then what I do is I try to basically apply the things and then I fail by trial and error. And once I've done this several times over several iterations, I kind of get to know the topic better and better, but it really requires me to go through a series of a lot of learning, trying to reread it again until I have really figured out stuff. Sometimes it annoys me, but it is who I am.

So that's why I usually also try to learn things by going into the nitty-gritty details. And one of the things is, I always wanted to learn more about performance optimizations in .NET. So what I did is I was looking around for practical use cases, not just reading a book again, but basically applying it. And one of the things I found was the Azure .NET SDK. The Azure .NET SDK is a pretty big open source project on GitHub. It's used by many, many people around the globe to basically connect to Azure Service Bus, Event Hubs or whatever, right? And I figured, well, since it's a huge code base, there might be things I can apply from my learning. So I started running through some changes and providing pull requests, and luckily the team was super grateful about it.

And we had a few conversations on the pull requests and I started reapplying some of the feedback, started learning more, and over time I got better and better at doing these performance optimizations. And today I want to give you a brief overview of some of the things I've learned in this talk. So hopefully you can also apply some of these principles that I have learned in your day-to-day job, should you care about performance. And I'm also going to talk about why you should care about performance. So it's important to me to actually get the details straight. So today in this talk, I will not really talk about stuff like horizontal and vertical scaling. I will be purely focused on the performance optimizations that you can do in your C# and .NET code, right?

Of course, throughput and other things can sometimes be improved by actually doing things like horizontal and vertical scaling, and oftentimes that's a pretty cheap way to do it. But if you want to squeeze ultimate performance out of your code, your code also needs to be highly optimized. And that's what I'm going to talk about today. I'm also not going to talk about the tooling approaches here. So I'm not going to talk about BenchmarkDotNet or memory and tracing profilers because that in itself would be a whole other talk. Okay. But why even bother about optimizing C# and .NET? Why wouldn't you just go and use C or C++ to actually write high performance code?

Well, it's interesting. The .NET runtime has been optimized over time, and the runtime and the language together actually get better and better at producing highly performant code. So you don't actually need to go towards C and C++ anymore so that you can write high performing code. And as a matter of fact, many of the things in the .NET runtime that were previously written in C++, where there was always this kind of interop between managed and unmanaged code, many of those things are getting rewritten over time, over the iterations of the .NET runtime, to be ported to C# and managed code, and it's still performing very, very well.

So I'm going to show a lot of optimizations that you can do, and sometimes when I do these types of optimizations in the code bases, for example, when I'm working at Particular, I sometimes get called out about these things that... Well, these are pretty esoteric types of changes and sometimes I even hear like, "wow, that's super crazy. Is the complexity of these changes really worth it? Isn't this premature optimization?" Well, what's really near and dear to my heart today in this webinar is that you shouldn't just blindly jump to conclusions and go basically to your job and apply the things that I'm showing you here to every piece of code that you have, right? That would be insane and probably not a good investment of your time, or your employer's or your customers'. What's really important is that while these performance optimizations can be highly addictive, and I've personally suffered from that as well, nobody really likes to optimize code that is fast enough or is only ever executed once a day on a background process, right?

So what we have to understand is, we have to apply these types of optimizations that I'm going to show you to the code that is executed at scale. But what does scale even mean? And how can we find out whether the optimizations that I'm trying to make, or that we are trying to make, in our code bases really have value, so that I'm not getting called out for premature optimizations by my colleagues? So David Fowler... you probably know him from Twitter, or might have heard of him. He's a well known figure in the .NET space, especially in ASP.NET Core. So he has this quote from his talk about At Scale Implementation Details Matter, and I think this quote really summarizes what is near and dear to my heart when it comes to performance optimizations, and code that is executed at scale. He says, "Scale for an application can mean the number of users that will concurrently connect to the application at any given time, the amount of input to process or the number of times data needs to be processed. For us as engineers, it means we have to know what to ignore and knowing what to pay close attention to."

And the last part is the most important part. So in order to know whether we should optimize a piece of code, we actually need to go and look at the piece of code and discover all the assumptions that have accumulated over time. So we need to pay close attention to how that code is instantiated, how things are parsed, how things are processed, for example, per request. And we also need to basically figure out how those assumptions that have accumulated over time in that specific code base, affect the performance characteristics mainly the throughput, the memory at scale, right? So what's important is that the piece of code needs to be executed many, many times a second or for example, if you're saying, well, it's only executed a few times a day, but we know that we are going to scale this even more and right now it already takes quite some time, so what types of optimizations can we do in order to when we start scaling this thing, the time of the execution, the memory, throughput characteristics of this code will get better and better over time?

So this is basically what scale means for code. And so in summary, what's really important here is we need to understand the context of the piece of code that we're looking at, how it's executed at scale, right? But one of the things that we are usually trying to find out when we look for performance optimizations is, what type of rules can we apply to our code bases so that we have highly performing code? And I have summarized the rules that I have learned from contributing to the Azure .NET SDK here on the following slides.

So these are the basic two high-level categories that we are going to focus on in this talk. So the first one is you should avoid excessive allocations to reduce the garbage collection overhead. And the second one is you should avoid copying memory unnecessarily. So these are the main categories that we're going to focus on in this webinar. If you've got these two rules written down and you start applying them in the code bases where the code is executed at scale, you have already optimized the code quite well, so that it's going to perform well at scale. I will then, in these sections, go into more handy rules that you can use to optimize the code even further, but these are the basic categories that you should remember and take away from this webinar. So the first one we're going to dive into is avoid excessive allocations to reduce the GC overhead.

And the rule is, think at least twice before using LINQ or unnecessary enumerations on the hot path. I mean, I know this is a little bit controversial, right? Because we all love LINQ, at least I do. And I think LINQ is great and I wouldn't want to miss it at all, yet on the hot path, it's far too easy to get into trouble with LINQ because it can cause hidden allocations. And the code that is using LINQ is very difficult for the just-in-time compiler to optimize. So we're going to look at a piece of code from the Amqp Receiver that is currently the driver behind the Azure Service Bus and Event Hubs clients in the Azure SDK. So the Amqp Receiver is basically the thing that fetches messages from Azure Service Bus or fetches events from Event Hubs, but it also interacts with the AMQP protocol of Azure Service Bus or Event Hubs.

So that piece of code is essentially executed at scale when you are, for example, using Azure Service Bus and you are fetching or interacting with Azure Service Bus with many, many messages, hundreds or thousands of messages a second. So let's have a quick look at the Amqp Receiver. So this is a highly simplified piece of code from the Amqp Receiver. I'm going to focus on this one. Of course the actual code is far more complex, but for the sake of this presentation, I cut it down to the essentials. So one of the guidelines from the Azure SDK says that whenever possible, we should accept for public APIs basically the broadest possible enumeration type. And that is IEnumerable. So this code here accepts an IEnumerable of string, and it's called lockTokens. LockTokens are essentially Guids.

So these Guids represent a message and whenever you want to complete the message on Azure Service Bus, or here, if you want to complete multiple messages, you would have called this CompleteAsync method, passing a few lockTokens or Guids into it. And then what the underlying code did, it used LINQ to do a Select and turned these strings into Guids, and then it called ToArray, and then there was another piece of code that essentially looked into a concurrent bag of Guids to look up whether it has already seen the lockToken. And if it has already seen the lockToken, it went on one code path and if it hasn't seen the lockToken, then it went to another code path. So fairly straightforward piece of code, right? But what happens under the covers? So if we decompile this code, what we can see is we get this very gibberish kind of code that is compiler generated, right?

The compiler does a lot of heavy lifting for us. And I know this is potentially quite a complex piece of code, but let's focus on the most important aspects of this piece of code. So here we have this Enumerable.ToArray and blah, blah, blah, and we see this C dot weird syntax, nine dot CompleteInternalAsync field. So what this check does, it essentially looks whether a static field is already assigned. If it's not assigned, then it assigns a Func of string and Guid, and that points to another method. So this piece of code is fine. We don't really have to worry about it because essentially what it means is, in summary, the compiler generated code that will make sure that the function delegate that is associated to this piece of code is statically cached. So, that's fine. We can ignore that.

But the next thing we have to pay close attention to, we have here a Func of Guid and bool that points to CompleteInternalAsync. So what that means is every time this piece of code is executed, we are going to create a new function delegate of Guid and bool. And this allocation is unnecessary. We can tweak this piece of code by simply turning the Any into a loop. This is what the code looks like. If we zoom in, instead of having the LINQ Any and the if that we had before, we are going to just foreach through the lockToken Guids that we have converted above on line six, and then we do a quick check whether the Guid is already contained in the concurrent bag. And then we go on one path, we execute that, or otherwise we go to the other path.
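To make the change just described more concrete, here is a minimal sketch. It is not the SDK's code: the `LockTokenTracker` class, its members, and the `ConcurrentDictionary` used as a stand-in for the receiver's collection of seen lock tokens are all assumptions made for illustration. The lambda passed to `Any` reads an instance field, so it captures `this` and needs a fresh delegate per call, while the loop version needs no delegate for the check.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public sealed class LockTokenTracker
{
    // Hypothetical stand-in for the receiver's collection of already-seen lock tokens.
    private readonly ConcurrentDictionary<Guid, bool> seenLockTokens =
        new ConcurrentDictionary<Guid, bool>();

    public void MarkSeen(Guid lockToken) => seenLockTokens.TryAdd(lockToken, true);

    // Before: the lambda passed to Any reads an instance field, so it captures
    // `this` and a new Func<Guid, bool> delegate is created on each call.
    public bool AnySeenWithLinq(IEnumerable<string> lockTokens)
    {
        Guid[] lockTokenGuids = lockTokens.Select(token => new Guid(token)).ToArray();
        return lockTokenGuids.Any(guid => seenLockTokens.ContainsKey(guid));
    }

    // After: a plain foreach performs the same check without a delegate.
    public bool AnySeenWithLoop(IEnumerable<string> lockTokens)
    {
        Guid[] lockTokenGuids = lockTokens.Select(token => new Guid(token)).ToArray();
        foreach (Guid guid in lockTokenGuids)
        {
            if (seenLockTokens.ContainsKey(guid))
            {
                return true;
            }
        }
        return false;
    }
}
```

The `Select` lambda captures nothing, which is why the compiler can cache its delegate in a static field, as described earlier.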

And if we compile these changes that we have just done, if we compile those down with the compiler again, we can now see that the code has drastically changed underneath the covers. And the most important aspect, again, is this C.9__2_0. As we can see now, we have essentially changed this code and the function delegate is now statically cached and we don't have the new Func of Guid and bool anymore. So we have already saved an allocation. So let's have a look whether this change actually gives us anything. And in order to do that, of course, we need to benchmark whether the improvements that we have done to this piece of code actually give us meaningful improvements. And I've done that using BenchmarkDotNet. So I have extracted this piece of code and like I said, I'm not going to talk about BenchmarkDotNet here and all the mechanics around it.

Should you be interested, hit me up and maybe we will do another webinar if enough people are interested about how to benchmark your code. But if you look at this piece of code, then we get these numbers. And by the way, don't get overwhelmed by all the huge numbers here on the screen, because I'm going to summarize these numbers for you, but I just wanted to show it to you briefly so that you trust me that I have actually executed this piece of code in a benchmark against different enumerable types or collection types. So when we summarize this, what we can say is that the first line is the throughput improvement, we got 20 to 40% more throughput on this code. And we got a garbage collection reduction of 20 to 40%. So again, just by getting rid of Any, we were able to squeeze out some very good performance improvements out of this piece of code, which is pretty impressive, right?

But sometimes we can do even more. For example, there are a few general rules that I would like to present to you, should you want to move from LINQ-based code on the hot path to collection-based operations. I have summarized the rules here on the next slides that you can write down, and then you can apply those as well to your code bases. So whenever you have to represent an empty array, you should always use Array.Empty, because then you get the statically cached array, and you're not paying the price of allocating empty arrays anymore. The same applies for enumerables. So if you need to represent an empty enumerable, you should use Enumerable.Empty. Then one of the key things in performance optimizations is you should try to prevent collections from growing.

So whenever a collection is created, it basically has a number of buckets internally for a fixed capacity, and once you start adding elements to the collection and those elements reach the capacity of the collection, for example, a list, then what's going to happen is internally, the list will essentially grow its capacity. It will shuffle around memory and it will realign stuff internally, so that new things can be added. So the capacity grows over time and that growing of collections is quite expensive, both in terms of throughput, but also really in terms of allocations. So if you can, and if you know up front how many items you're going to add to a collection, for example, a list, make sure you always instantiate the list with the count so that it doesn't have to grow. And then the next thing is when you operate with collections, you should always use the concrete collection types.
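As a small illustration of the first rules, a sketch could look like this. The class and method names are made up for the example; only `Array.Empty<T>()` and the `List<T>(int capacity)` constructor come from the framework.

```csharp
using System;
using System.Collections.Generic;

public static class CollectionRules
{
    // Array.Empty<T>() hands out a cached empty array instead of allocating a new one.
    public static Guid[] NoGuids() => Array.Empty<Guid>();

    // Passing the known count as the initial capacity lets the list
    // be filled without growing its internal storage.
    public static List<int> Squares(int count)
    {
        var squares = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            squares.Add(i * i);
        }
        return squares;
    }
}
```

With the capacity supplied up front, the list's internal array is allocated once; without it, the list would start small and reallocate several times while the items are added.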

So instead of using IEnumerable or IReadOnlyCollection of something, you should always use lists or arrays, concrete collection types. Why is that important? Well, whenever you use a concrete collection type, essentially you're getting a struct enumerator. The struct enumerator is a value type, therefore it doesn't allocate on the heap, therefore you're saving basically in terms of GCs. You're not putting pressure on the GC. Whenever you look at a collection through the lens of IReadOnlyCollection or IEnumerable, essentially the struct enumerator gets boxed and then things start to get a bit more expensive. There are a number of optimizations that are currently being applied and done in the .NET runtime and in the JIT as well that make this problem less and less of a problem, but still it's always a good best practice to use the concrete collection types if you can.

Of course, sometimes again, with the example of the Azure .NET SDK, where you accept enumerables from outside, maybe there it's more important to be as broadly applicable as you can from the public API perspective. And there you have to violate this rule, right? Again, it's important to understand the usage of the code, whether it's something that is broadly applicable to many users, or whether it's something that is only applicable for your specific cases. But we are going to also talk about some of the trade-offs that you can make throughout this talk. Then what you can do, in order to see how many things are in a given collection type, is leverage pattern matching, essentially to check whether the collection implements a specific interface. And then for example, you can pattern match towards IReadOnlyCollection if you get an IEnumerable, and then you can get the count of the collection that is passed into your piece of code.

Sometimes if you're, for example, on .NET 6, there is a new helper utility, it's called Enumerable.TryGetNonEnumeratedCount. And this does all the heavy lifting under the covers for you in order to get the count of an IEnumerable that is passed to you, by checking internally whether it implements certain interfaces and then extracting the count. And one of the good things is this is more performant than actually calling Count, the LINQ Count method, right? Because the Count method, what it's going to do, it's going to iterate through the collection in order to find out how many items are in there. And by leveraging pattern matching or TryGetNonEnumeratedCount, that's not going to happen. And the last rule, when you're moving from LINQ to collection-based operations, is you should always wait with instantiating the collections until you really need them.
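The two ways of getting a count without enumerating can be sketched as follows. The `CountHelpers` class and its method names are hypothetical; `Enumerable.TryGetNonEnumeratedCount` is the built-in helper and requires .NET 6 or later.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CountHelpers
{
    // Pattern matching: also available on older targets such as .NET Standard 2.0.
    public static bool TryGetCountByPattern<T>(IEnumerable<T> source, out int count)
    {
        if (source is IReadOnlyCollection<T> readOnlyCollection)
        {
            count = readOnlyCollection.Count;
            return true;
        }

        count = 0;
        return false;
    }

    // .NET 6 and later: the built-in helper reports the count only
    // when it can be determined without enumerating the source.
    public static bool TryGetCountBuiltIn<T>(IEnumerable<T> source, out int count)
        => source.TryGetNonEnumeratedCount(out count);
}
```

Both methods return false for a lazy sequence such as the result of a `Where` call, which tells the caller that the source has to be materialized or enumerated before its size is known.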

So essentially when you know that you are on a path where you have, for example, nothing to do, then don't create a new list early in your code, for example, because if you don't need it, then that's an allocation that is unnecessary. So you can do a number of checks and then only at a later point in time actually create a new list or a new array if that's really needed, otherwise try to avoid instantiating the collection. So let's go back to this piece of code that we had before, right? So I've only gotten rid of the Any call, we still have LINQ in there. If you zoom in, we still see that, hey, there is this lockTokens.Select, right? We're still doing LINQ. So let's have a look what we can do. So if we know what type of parameters are passed as IEnumerable of string to this method, we can do a number of optimizations, and actually in the Azure .NET SDK, the lockTokens enumerable is almost always an already materialized collection type that implements IReadOnlyCollection.

So that's the context of this piece of code that we need to understand, right? So by actually, for example, having telemetry of the code or by interacting with the users, we can start making optimizations for this piece of code that assume we are almost always going to have arrays of string, lists of string, something like that, passed into this method. And if we know that, we can start optimizing it. So let's have a quick look. So here I'm showing the pattern matching approach because at the time the Azure .NET SDK was still on .NET Standard 2.0, and we didn't have access to Enumerable.TryGetNonEnumeratedCount. So we had to apply pattern matching. But again, because we know that most of the time we're getting already materialized collections, we can do a pattern matching check to see whether this collection that is passed in is of type IReadOnlyCollection of string.

If it's not, then we have to call ToArray because we are in the case where we essentially get a lazy enumerable. Okay? And then we can change the signature of this method to be based on IReadOnlyCollection of string. And you might be saying, well, Daniel, you're cheating now, right? Because I've just said to you, in the previous rules, that you should always be using a concrete collection type, and now I'm still using the interface. Yeah, that is totally true and it's a valid point. But again, we have to make trade-offs, even with performance optimizations, right? And because the Azure SDK guidelines essentially make sure that we only get the broadest enumerable type, we cannot change that fact. So the only thing we can do is we basically find a good middle ground, and the good middle ground here is the IReadOnlyCollection of string.

So we might still get enumerator boxing here, because we use the IReadOnlyCollection of string, but that is potentially negligible, at least for this piece of code, and let's have a look what else we can do. So let's apply the rules that we have just learned. So what we then do is we basically have now a count available. We check whether the count is zero, and if the count is zero, we use Array.Empty of Guid and otherwise we create a new array with the specific count. For an array, the count is required anyway, but let's imagine we were using a list, we could then pass the count to that list and the list would already have the necessary internal buckets to store the data in. Once we have that, we then just iterate through the lockTokens. And then what we do is in the foreach, we convert each token to a Guid, add the Guid to that array at the specific index, and that's all we have to do.
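Put together, the steps just described could look roughly like the sketch below. It is a simplified reconstruction for illustration, not the SDK's actual method; the `LockTokenConverter` class and `ToGuids` name are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LockTokenConverter
{
    public static Guid[] ToGuids(IEnumerable<string> lockTokens)
    {
        // Already materialized collections expose their count directly;
        // a lazy enumerable has to be materialized once via ToArray.
        IReadOnlyCollection<string> tokens =
            lockTokens is IReadOnlyCollection<string> collection
                ? collection
                : lockTokens.ToArray();

        // Nothing to convert: reuse the cached empty array.
        if (tokens.Count == 0)
        {
            return Array.Empty<Guid>();
        }

        // Allocate the result once with the exact size, then fill it in a loop.
        var guids = new Guid[tokens.Count];
        int index = 0;
        foreach (string token in tokens)
        {
            guids[index++] = new Guid(token);
        }
        return guids;
    }
}
```

The lazy path pays for an extra `ToArray`, which matches the trade-off discussed in the talk: this shape favors callers that pass lists or arrays.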

Good. Let's have a look how we are doing. Again, we have now gone from a very simple LINQ piece of code, getting rid of the Any, and now we're also getting rid of the Select by applying the rules that I have shown you. And we need to understand how we're doing in terms of performance. So let's benchmark this. So again, I'm comparing here the already optimized version that got rid of the Any with the highly optimized code that we just had a look at, which is also a bit more complex. And if you look at this, we actually got 5 to 64% throughput improvement, depending on the collection type that is passed into that method. And we got another 23 to 61% GC reduction, which is quite amazing, but let's have a closer look.

Well, if we look really closely, we can see that in some cases we are actually doing worse. So in certain types of cases, we are 56% slower. Well, in which cases are we slower? In the cases where we get passed a lazy enumerable, in these cases, we are slower. So is that an indication that we shouldn't be doing that type of refactoring? Well, it really depends, right? If we know what's going to get passed into this method like we just did, right? And in the Azure SDK case, we know that the majority of the time we're actually getting already materialized collections. It might be totally worth doing these types of optimizations because we can save a significant amount of GC allocations and we are getting a significant amount of performance improvement. But in certain cases, if we know that the majority of the time people are going to pass lazy enumerables into it, we're going to do a ToArray and it's going to be significantly slower, right?

So again, what's important here is the context of the piece of code, including the usage of the piece of code, how it's executed at scale. So what I would advise: if you're looking at the before and after, we have seen that the code has significantly grown over time, right? We went from really simple LINQ usage towards a potentially much more bloated version that is much, much faster under certain circumstances. But if we know that in this specific case we're almost always going to get lazy enumerables, then I would say don't do this type of optimization. Then readability should always be the key driver in your code bases, right? And there are potentially other areas in the code that are slowing things down even more. And in order to actually find that out, you should fire up your favorite memory and performance profiler and get a better understanding of how the code is actually behaving at scale.

Like with all the things, again, and I'm driving this home multiple times because it's so important, it's crucial to know how the code path is executed at scale and then make the right trade-offs between potentially less readability, more throughput and less GC, or more readability and broader understanding because your team is already familiar with LINQ, right? So make those trade-offs together with your team. So the next thing is be aware of closure allocations. I have already touched a bit on closure allocations during our LINQ performance investigations, but closure allocations can actually occur anywhere you have lambdas, and lambdas are action delegates or function delegates that are being invoked, that access state from outside of the lambda. That maybe sounds a bit cryptic. Let's have a look at a concrete example from the Azure .NET SDK. So the Azure .NET SDK internally has, for example, for the access to Azure Service Bus and Event Hubs, this RunOperation method. And the RunOperation method, what it does, it's essentially almost like Polly or a type of circuit breaker.

So what it does, it executes a piece of code and whenever the server returns "I'm busy," it then waits for a period of time and then it retries executing the specific piece of code. So it returns a Task, or it did at that time, and then here we have the operation that is executed, that gets a TimeSpan, which is the timeout for how long this method should be executed. And it returns a Task because most of the time it's an IO-bound operation that's going to be executed. And then here inside the while loop, we are essentially just awaiting the operation. We're passing the timeout into that method. And then if it was successful, we return. If it actually threw an exception, we're then going to check whether the server is busy, do an await Task.Delay, and then go through the while loop again.
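The shape of such a retry helper can be sketched as follows. This is a simplified illustration under stated assumptions, not the SDK's implementation: `ServerBusyException`, the `maxAttempts` limit and the fixed `retryDelay` are invented for the example.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception standing in for the service's "I'm busy" response.
public sealed class ServerBusyException : Exception
{
}

public static class RetrySketch
{
    // Runs the operation and retries it after a delay while the server reports busy.
    public static async Task RunOperation(
        Func<TimeSpan, Task> operation,
        TimeSpan timeout,
        int maxAttempts,
        TimeSpan retryDelay)
    {
        int attempt = 0;
        while (true)
        {
            attempt++;
            try
            {
                await operation(timeout).ConfigureAwait(false);
                return; // success, nothing more to do
            }
            catch (ServerBusyException) when (attempt < maxAttempts)
            {
                // The server is busy: wait, then go through the loop again.
                await Task.Delay(retryDelay).ConfigureAwait(false);
            }
        }
    }
}
```

Once `maxAttempts` is reached, the exception filter no longer matches and the last `ServerBusyException` propagates to the caller.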

So that's essentially that retry or RunOperation method, which is almost a Polly or circuit breaker implementation, that was in the Azure SDK. Let's have a look at one of the usages of this RunOperation method. So here we have an example from the Azure Service Bus SDK, where this RunOperation was called, and it was calling this CreateMessageBatchInternalAsync method. Let's zoom in a little bit more. If we look inside that lambda, or inside that function or action delegate, we're getting the timeout in. What we then do is we call this CreateMessageBatchInternalAsync method, we pass in the options, we pass in the timeout. And then the result that comes back, which is the message batch, gets assigned to the message batch local variable that we have outside the scope of this lambda, right?

And the important piece here is outside the scope of the lambda. And as we can also see, options is also outside. So that means we are going to have closure allocations. What does a closure allocation actually look like? Well, we have to decompile the code in order to understand what's going on. So let's decompile this code, and this is more or less what the code looks like. And again, it's a lot of gibberish stuff, right? If you have never seen this, let's have a quick look at the important pieces here, in this piece of code. So as we can see, for every execution, there is a new display class instantiated, and the display class captures the state of the execution, primarily the options and the transport message batch. And then the next thing that's going to happen is we are going to create a new function delegate that accepts a TimeSpan and returns a Task.
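To show where the two allocations come from, here is a sketch that puts a capturing lambda next to a hand-written approximation of what the compiler produces for it. Everything in it is illustrative: `CreateBatchAsync` is a made-up stand-in for the real batch-creation call, and the real generated class has a compiler-chosen name and shape rather than `DisplayClass`.

```csharp
using System;
using System.Threading.Tasks;

public static class ClosureSketch
{
    // Hypothetical stand-in for the real batch-creation call.
    private static Task<string> CreateBatchAsync(string options, TimeSpan timeout)
        => Task.FromResult($"batch({options}, {timeout.TotalSeconds}s)");

    // What we write: the lambda reads `options` and writes `batch`,
    // both of which live outside the lambda.
    public static async Task<string> WithClosure(string options)
    {
        string batch = string.Empty;
        Func<TimeSpan, Task> operation = async timeout =>
        {
            batch = await CreateBatchAsync(options, timeout);
        };
        await operation(TimeSpan.FromSeconds(30));
        return batch;
    }

    // Roughly what the compiler generates: a class that holds the captured state.
    private sealed class DisplayClass
    {
        public string Options = string.Empty;
        public string Batch = string.Empty;

        public async Task Invoke(TimeSpan timeout)
        {
            Batch = await CreateBatchAsync(Options, timeout);
        }
    }

    public static async Task<string> HandWrittenEquivalent(string options)
    {
        var state = new DisplayClass { Options = options };                 // allocation 1: captured state
        Func<TimeSpan, Task> operation = new Func<TimeSpan, Task>(state.Invoke); // allocation 2: delegate
        await operation(TimeSpan.FromSeconds(30));
        return state.Batch;
    }
}
```

Both methods behave the same; the second one only makes the per-call state object and delegate visible in source form.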

And then we are going to call some method pointers and stuff. So what we can say here is we are creating two allocations for every execution of the code. First is the display class allocation. The second is the function delegate allocation. And these allocations are unnecessary. How can we get rid of these allocations? Well, we have to do a little bit of middleware or library writing. So I have here the new version of the RunOperation method. So when the SDK started evolving, many of the methods that were being called underneath the covers started leveraging ValueTask. ValueTask is basically a discriminated union of a result and a task. And ValueTask is super helpful, especially in cases where, for example, you're calling an IO-bound operation, and in some cases it directly returns the result, and in some cases it actually yields and does the IO-bound operation.

So in order to be able to benefit from ValueTask, I changed the result to be ValueTask based. And then what we do is, we essentially need a generic parameter T1 and something that returns. Why do we do something that returns? Well, with this helper utility, we can later model everything that doesn't return anything as well, by essentially doing some kind of functional programming and lifting the action delegates up to function delegates that return nothing. I'm going to show you that a little bit later. And then we need to change the method here. So instead of just accepting a TimeSpan, we essentially change the function delegate to get T1, the TimeSpan, and the cancellationToken, which is also an important piece. And then it returns a ValueTask of TResult. And then on line three, we're also passing in the state that will be passed into this lambda.

The state needs to be passed in here because only then can we actually avoid the closure allocations, right? Because the closure itself, or the function delegate, needs to have access to all the state locally. It cannot reach outside of the curly braces of the function or action delegate. If it reached out, then again, we would have closure allocations. And then we just await the operation by passing in T1, the state, the timeout and the cancellationToken. So that's the basic thing that we require and from there we can then get to the next level. So we can now build an operation that returns nothing based on this little helper method. So this time we return a ValueTask and not a ValueTask of TResult. Again, we need T1, that's the state, and it's important that it's not object, it has to be generic, right? Because if you were passing an object, then again, any time you passed a value type, you would get essentially a boxing operation, and boxing means allocations again, which is going to put pressure on the garbage collector. We want to avoid that.

Cool. And then instead of a ValueTask of TResult, the method just returns a ValueTask, and we pass in the state and everything that's needed. And now we can leverage a few tricks. With the introduction of C# 9.0, we now have the possibility to declare a lambda as static, and that's on line six. So that's basically just syntactic sugar, because whenever we put static on a lambda, the compiler will make sure that we cannot access any state that is outside the lambda. So here it means once we are on lines 8, 9, and 10, we cannot access any locals that are outside, for example, on line 5, 4, or whatever preamble we have in this method, right? That's enforced by the compiler.

And then what we can also do is, with the introduction of value tuples, we can now leverage a value tuple and pass the state that the lambda requires via that value tuple. What we do is a little bit of trickery. We pass in T1, which is the state that we got from outside. And then we also pass in the actual operation, which is the function delegate that we get on line 2, as a value tuple of two values into that method. And then on line 8, we deconstruct the value that we're getting passed in as the state. And then we get the actual state that is passed into that method, as well as the operation, and then we await the operation, passing the state to that operation, including the timeout and the token.

And then we just return a default of object, and with that nice little trick we have already gotten rid of all the allocations that we had before. Let's have a quick look at what this code looks like if we decompile it. Again, it gets a bit more complex, but the important piece is here. I've already shown you that previously: when we started decompiling code, I showed you the pattern of some kind of static field access that gets checked for whether it's null or not, right? And if it's null, then it assigns a function delegate, but it only assigns it once. And then for the further executions of this piece of code it's statically cached, and we have no allocations anymore.
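The shape of the helper described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SDK's actual method: the class name `RetryHelper` is hypothetical, the retry logic is omitted, and the lambda parameter names are chosen to avoid shadowing the outer parameters.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryHelper
{
    // Generic state (T1) avoids boxing; the delegate receives everything
    // it needs as arguments, so callers can pass a static lambda.
    public static async ValueTask<TResult> RunOperation<T1, TResult>(
        Func<T1, TimeSpan, CancellationToken, ValueTask<TResult>> operation,
        T1 state,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        // Retry handling is intentionally left out of this sketch.
        return await operation(state, timeout, cancellationToken).ConfigureAwait(false);
    }

    // The "returns nothing" variant is built on top of the generic one by
    // packing the state and the operation into a value tuple.
    public static async ValueTask RunOperation<T1>(
        Func<T1, TimeSpan, CancellationToken, ValueTask> operation,
        T1 state,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        await RunOperation(
            static async (tuple, innerTimeout, token) =>
            {
                // Deconstruct the tuple; nothing outside the lambda is captured.
                var (innerState, innerOperation) = tuple;
                await innerOperation(innerState, innerTimeout, token).ConfigureAwait(false);
                return default(object);
            },
            (state, operation),
            timeout,
            cancellationToken).ConfigureAwait(false);
    }
}
```

Because the lambda is marked `static`, the compiler rejects any accidental capture of `state` or `operation`, which is what keeps the display class allocation from coming back.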

And as we can see here, this code is now heavily optimized because that static field access and null check happens, and all the state that is required to execute this method is automatically passed into this lambda, including the options and everything else. So, that is great. So by doing this one small change, we actually got rid of all the display class allocations and the function delegate allocations. And we have also been able to leverage newer stuff like ValueTask in order to optimize this code even further. But then you might be asking, "But Daniel, I mean, really, all this complex code just to get rid of a display class allocation and a function delegate allocation? Are you nuts? I mean, is that really necessary?" Well, let me show you a performance optimization that I've done in NServiceBus' internal pipeline execution.

So NServiceBus is the messaging middleware that essentially fetches messages from Azure Service Bus, Azure Storage Queues, and multiple other queuing systems and executes your code inside so-called handlers. And there is a thing that is called the pipeline in the guts of NServiceBus, and the pipeline is the most important piece that executes all the deserialization logic, the IoC container access and all that type of stuff. So, that is a crucial piece of code that we want to make sure is highly optimized so that NServiceBus never gets in the way of the customer's code, right? And I have actually gotten rid of closure allocations in newer iterations of the NServiceBus pipeline and this is the result. And again, I know it's a lot of code, but if we summarize this, what we can say is, by just getting rid of closure allocations inside the NServiceBus pipeline execution engine, we actually got a throughput improvement of 74 to 78% and all the allocations are gone.

Let me go back briefly to let that sink in. As you can see here, focusing on the first line, we actually got down from a whopping 19... I can't even read the number, right? But it's 20 megabytes of allocations if I can do the math right now, probably not, but down to one byte. So almost like a rounding error. And we got so much more throughput by getting rid of closure allocations. So they really do matter. But how could you actually detect these types of allocations? Well, what you can do is use your favorite memory profiler of the day, whatever you're using, whether it's Visual Studio or dotMemory or whatnot. And then you look for allocations of display classes or the various variants of action and function delegate allocations. That's one way of doing it. There are also other ways to do it.

You can actually look out proactively when you're writing the code with tools like the Heap Allocation Viewer in Rider or the Heap Allocation Analyzer in Visual Studio. And that already tells you, while you're writing your code, that something is off here, that you're allocating closures and stuff like that. And then you can already get rid of it while you're writing the code. So you don't have to find it out after the fact. And what's also pretty cool is that many built-in .NET types that use delegates nowadays have generic overloads that allow you to pass state into the delegates. Let me give you a quick example. So when you use the ConcurrentDictionary, the GetOrAdd method has been augmented to accept a state, and then you can apply the same trick as I did before. You can just pass in the state as a value tuple into the GetOrAdd method, mark the delegate as static, extract the state by deconstructing the tuple, and then do whatever you want inside the static lambda.
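A minimal sketch of that GetOrAdd trick, assuming a hypothetical `GreetingCache` with made-up `prefix` and `repeat` values as the state. The `GetOrAdd(key, valueFactory, factoryArgument)` overload is the built-in one; everything else here is illustrative.

```csharp
using System;
using System.Collections.Concurrent;

public static class GreetingCache
{
    private static readonly ConcurrentDictionary<string, string> cache = new();

    public static string GetGreeting(string name, string prefix, int repeat)
    {
        return cache.GetOrAdd(
            name,
            // static: the compiler guarantees nothing is captured here.
            static (key, state) =>
            {
                var (greetingPrefix, times) = state;
                return string.Concat(greetingPrefix, " ", key, new string('!', times));
            },
            // The state travels as a value tuple instead of a closure.
            (prefix, repeat));
    }
}
```

The value factory only runs when the key is missing, so a later call with the same `name` returns the cached string regardless of the state passed in.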

And then you don't have closure allocations, should your GetOrAdd method be executed at scale. That's one nifty way of leveraging that newly added functionality in the base class library. Then the next thing in order to avoid excessive allocations is, you should always pool and reuse buffers and large objects. So we've already talked a little bit about the lockTokens, and as I tried to explain previously, a lockToken is essentially a glorified Guid to acknowledge a message that you got from Azure Service Bus when you're done handling that message. And one piece that was in there is, whenever we got the bytes from the network, what the SDK did was create a 16 byte buffer and then use Buffer.BlockCopy to take the array segment and copy it into that byte array.

And that piece of code is fairly impactful when it is executed at scale, because what it means is it's going to create a 16 byte array for every execution of this code, just to turn a byte array into a Guid, and that is unnecessary. We can do better. Maybe you have already heard of a thing that is called ArrayPool in .NET. So the ArrayPool is essentially a mechanism that allows you to rent arrays. And what it means is, it's almost like renting a car, right? You rent a car, you say, I need a five seater, four seater, whatever, right? And then once you're done using that car, you return it to where you rented it. This is very similar with the ArrayPool. What we can say is, well, instead of allocating a 16 byte array, how about we get it from the ArrayPool?

So we can rent a 16 byte array by calling ArrayPool<byte>.Shared and then Rent, and then we can do our copying, and at the end, when we are done, we can return the buffer that we got from the ArrayPool. That's one way of saving that allocation. Is that going to help us? And I actually stumbled into exactly this type of problem because I was naively assuming that the ArrayPool is a great way to solve this problem. And let's have a look, because we need to benchmark, right? And if you compare the before and after, as we can see here, while we got rid of all the allocations, the code that we just wrote using the ArrayPool is now 226 times slower than the previous piece of code. Is that a problem?
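The rent, copy, return sequence described above might look roughly like this. It is a sketch rather than the SDK's code: `PooledLockTokenParser` is a hypothetical name, the input is assumed to hold at least 16 bytes, and a span copy stands in for the original Buffer.BlockCopy call.

```csharp
using System;
using System.Buffers;

public static class PooledLockTokenParser
{
    private const int GuidSize = 16;

    public static Guid Parse(ArraySegment<byte> lockTokenBytes)
    {
        // Rent may hand back an array that is larger than requested.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(GuidSize);
        try
        {
            lockTokenBytes.AsSpan().Slice(0, GuidSize).CopyTo(buffer);

            // Only look at the first 16 bytes of the (possibly larger) rented array.
            return new Guid(buffer.AsSpan(0, GuidSize));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

This removes the per-call array allocation, but as the benchmark in the talk shows, renting and returning has its own cost for such a tiny buffer.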

Well, what you could say is, if you are in a memory-constrained environment where you really want to save allocations, you might be able to make a trade-off between allocations and throughput, right? So maybe this change of code is actually desired and we can live with it. But we can actually do much better. So for small and local buffers, we can consider using the stack instead of the ArrayPool or the heap. What does that look like? Well, with the introduction of Span and also the stackalloc keyword in C# 7.3, we can allocate on the stack of the currently executing method, and this is how it looks.

We can say stackalloc of byte 16, and what's going to happen is, it's going to use the memory of this method, the current stack, and it's going to allocate a byte array there. Whenever the method returns, this byte array is going to get cleared, so the garbage collector doesn't have to interfere at all with it. That's already pretty cool. And then we can turn it into a Span and copy everything into that stack allocated buffer, and then we can create our Guid as before. So you might be wondering, "But why do you even need to copy this buffer here?" Well, you're absolutely right that, for example, the Guid already takes a Span as an input. So in certain cases, you don't really have to copy the buffer. I'm showing this example here as an illustration: when you need to do a defensive copy of your stuff, you can actually use stackalloc to do that and copy the data that you get as a Span into the stack allocated byte array.
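A minimal sketch of the stackalloc variant with the defensive copy, under the same assumptions as before: `StackLockTokenParser` is a hypothetical name and the input is assumed to hold at least 16 bytes.

```csharp
using System;

public static class StackLockTokenParser
{
    private const int GuidSize = 16;

    public static Guid Parse(ArraySegment<byte> lockTokenBytes)
    {
        // Lives on the stack of this method; no heap allocation, nothing for the GC to track.
        Span<byte> buffer = stackalloc byte[GuidSize];

        // Defensive copy of the incoming bytes into the stack buffer.
        lockTokenBytes.AsSpan().Slice(0, GuidSize).CopyTo(buffer);

        return new Guid(buffer);
    }
}
```

Because `stackalloc` is assigned to a `Span<byte>`, no `unsafe` context is needed.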

And I'm going to talk about how you can avoid this in the remainder of this talk. So if we use the stack allocation, we can now see that we got quite interesting performance benefits. As a summary, we got like a 45% improvement and we have no allocations anymore. So while the first iteration, where we naively started using the ArrayPool for these small byte arrays, was actually 226 times slower than the original version but saved allocations, we now have a version that also saves all the allocations but is 45% faster than the original version. And that's pretty cool. So let me summarize the most important rules that I have shown you here to avoid excessive allocations and reduce the GC overhead. You should think at least twice before using LINQ or unnecessary enumerations on the hot path. You should be aware of closure allocations.

You should pool and reuse buffers, and for smaller and local buffers consider using the stack. And be aware of parameter overloads, that's something that I haven't shown here. Sometimes methods accept, for example, parameters which are object arrays, and they tend to allocate a lot of arrays, and boxing also happens if you pass in value types like integers and bools, so that's also something dangerous to look for. And of course, where possible and feasible, use value types but pay attention to unnecessary boxing. I did not cover these two rules in this talk in the interest of time for this webinar, but you have the rules here, should you wish to apply those as well. So let's go into the last section of this talk. How can you actually make sure that you avoid unnecessary copying of memory? Well, the cool thing is, with the introduction of Span that I have already hinted at, and with C# 7.3, we now have this thing, a Span, that allows us to represent a contiguous region of arbitrary memory.

And what a Span is, I'm not going into details of Span but basically the gist is, it's just a pointer to a memory location and the length that represents the length of the memory represented by the Span. So really it's just a pointer to somewhere in the memory and the memory can be anywhere. It can be native memory, it can be on the method stack. It doesn't really matter, right? And then it's a length that basically says, this is how long this memory is. And the cool thing with Span is it can actually be sliced into various chunks, right? So by just modifying the length, you can actually say, well, from this, let's say one megabyte of memory, I only want to see the first 16 bytes, but we can point to that one megabyte of memory with the pointer, but then we just restrict the length of the Span to be 16 bytes.
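To make the slicing idea concrete, here is a small sketch with hypothetical names (`SliceDemo`, `SumFirst`, `FillFirst`). It only illustrates that a slice is a narrower window over the same memory, not a copy.

```csharp
using System;

public static class SliceDemo
{
    // Sums only the first `count` bytes; Slice narrows the window without copying.
    public static int SumFirst(ReadOnlySpan<byte> data, int count)
    {
        ReadOnlySpan<byte> window = data.Slice(0, count);
        int sum = 0;
        foreach (byte b in window)
        {
            sum += b;
        }
        return sum;
    }

    // Writes through a slice; the change is visible in the underlying array.
    public static void FillFirst(byte[] data, int count, byte value)
    {
        Span<byte> window = data.AsSpan(0, count);
        window.Fill(value);
    }
}
```

For a one megabyte array, `SumFirst(array, 16)` touches only 16 bytes, and `FillFirst` changes the original array because the span points at the same memory.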

And with that, we have sliced the memory without actually copying memory at all. There is also the cousin of Span of T, which is called Memory of T, that's usually used for memory that is on the heap. And for example, it can be used when you use asynchronous methods. When you have await statements, then you're going to be using Memory of T and not Span of T. Let's have a quick look at some general rules that we can apply when we want to avoid unnecessary copying of memory. So in your code bases, look for Stream and byte array usages that are copied or manipulated without using Span or Memory, and replace existing data manipulation methods with the newer Span or Memory-based variants that have been introduced into the .NET runtime over various iterations. Sometimes memory copying is very, very hard to spot. It's not really obvious and it requires a deep understanding of what's happening under the hood of the framework, the library, or the SDK that you're using.

So with the Azure SDK, there was recently a new Event Hubs client introduced that had a new publisher type that internally used a partition key resolver that turned partition key strings into hash codes. If you don't know what a partition key is, you can almost think about it like this: you have multiple boxes that have a label on them, and you want to put away your stuff, your books. You have a box labeled A, you have a box labeled B, and you have a box labeled C, right? So the labels A, B, and C would essentially be partition keys. And then you look at the book and you look at the title and say, well, it starts with A, so we put it into box A, or it starts with the letter B, so we put it into box B. Or you could use the author, it doesn't really matter.

But this is basically, in a short explanation, a partition key, right? So in the Event Hubs client, that publisher type turned partition keys into hash codes. And the piece of code that we're going to look at was executed on 30 to 40% of the hot path and therefore represents a non-trivial amount of CPU and memory cycles when using that publisher type. And the code looked like this. And again, as I said, sometimes copying of memory is very hard to spot. So as you can see here, if we zoom in, how would we actually spot that something is happening? Well, here we have this Encoding.GetBytes and, as the method name already says, it returns bytes, and we can also see on line 11, when we call GetBytes, we pass in a string and then we get a byte array back.

So that method essentially returns a byte array. And because we don't give it any memory to write into, it has to create a new byte array every time we call it, right? And that's an expensive allocation. And what's even worse is, because the partition keys are potentially passed in by the customers, the strings can be of any arbitrary length, right? So if you pass in an A, it's just going to be a string of length 1, but if you pass in, for example, my name, which is already quite long, we are using a lot of letters, so the byte array grows with it and the more we allocate. We can optimize this, and I'm now going to bring everything together in this piece of code with the optimizations that I did there.

So the first thing is we are turning the partition key into a Span that will get passed into that method. Once we have that Span, we can leverage the new overloads that we have available to operate on Span-based methods. Then the next thing is, instead of actually finding out how large the string is, we are going to get an approximation of how long the string is in bytes. Well, why is that important? Well, when we are using, for example, GetByteCount of the encoding, we get the actual length, so the .NET Framework has to go through the string and find out how long it is. But we are going to apply a pattern of over-renting, and for that pattern we don't really require the exact length. We just need an approximate length. So we call GetMaxByteCount, and the benefit is that it basically has O(1) semantics.

And then what we do is a combination of either array pooling or stack allocation. So we define a maximum stack limit of 256 bytes. And by the way, why 256? Well, there is some guidance that is being developed by the .NET teams. Well, we cannot use more than one megabyte because that's the maximum size that we have available for the method stack, but why 256 and not 512? Well, it's not really clear what you should be using, but the important thing is to use something that is safe. Depending on the context, I've chosen 256. So if the approximate length of the string is less than 256, we stack allocate. And otherwise, if it's longer, we go and rent an array from the ArrayPool.

And then what we're going to do is call GetBytes, which accepts this buffer. We are passing the buffer into the .NET Framework method together with the Span, which is the string, which means the GetBytes method will fill all the bytes into that already provided buffer. And now we are in charge of owning the memory. And then, because we know exactly what has been written by GetBytes, since we get back the number of bytes that have been written, we slice: we narrow down the window of the hash buffer to what has been written. And then we pass that slice to the method, and what we also need to do is, in the case where we got the sharedBuffer, we need to return the sharedBuffer to the ArrayPool.

If we stack allocated, it doesn't matter, right? Because it'll be cleaned up once we exit that method. And by the way, we didn't do a try/finally here. We actually saved the additional code size that would be introduced when doing a try/finally, because we couldn't really come up with a failure scenario where GetBytes would throw. So that was an additional trade-off that we made when I contributed this piece of code to the Azure SDK. And then we're passing that ReadOnlySpan, which either represents stack allocated memory or the rented array, to the ComputeHash method. And then the algorithm is exactly the same. One other thing that we do, and notice line one that just came into this slide, is we also add SkipLocalsInit. So SkipLocalsInit is basically compiler magic that tells the compiler to not emit the localsinit flag.

So normally the runtime tries to be safe. So when you get, for example, a byte array of 256, it would try to clear the memory, to zero out the memory that is in there. By adding SkipLocalsInit, what we get is memory that has potentially been used before, but we don't really care, right? Because we know exactly how much data we write in this method, and then we slice the memory to only what has been written. Therefore, we don't need to fear anything because we never actually see garbage, because we are exactly in charge of the window of memory that we are looking at. That's another optimization that we can do. And this is actually crucial, especially when we are doing over-renting, right? Because, like I tried to say, when the string is just one length... one byte large, sorry.

When the string just has one char in there, we will still be renting 256 bytes, right? So we are always over-renting. And by adding SkipLocalsInit, this case also gets faster. So that's an important optimization that we can do as well. Cool. If we measure this, we can see that we have gained quite a bit of throughput improvement: we got 38% to 47% more throughput and all the allocations are gone. Previously we were basically bound in terms of allocations to the size of the input. We are now no longer bound to that, right? We have essentially no allocations anymore, and we are also much, much faster. And for a piece of code that is executed on 30 to 40% of the hot path, that's a very crucial optimization that we did.
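The steps walked through above (Span input, GetMaxByteCount, stack or pool, GetBytes into the buffer, slice, return) can be sketched as follows. This is an illustration under assumptions, not the code contributed to the Azure SDK: `PartitionKeyHasher` is a hypothetical name, UTF-8 is assumed as the encoding, and `ComputeHash` is a placeholder FNV-1a hash rather than the SDK's algorithm. The `[SkipLocalsInit]` attribute is only mentioned in a comment because it requires unsafe code to be enabled in the project.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.Text;

public static class PartitionKeyHasher
{
    private const int MaxStackLimit = 256;

    // [SkipLocalsInit] could be applied here to skip zeroing the stack buffer
    // (needs AllowUnsafeBlocks); it is left out to keep the sketch compilable as-is.
    public static uint Hash(string partitionKey)
    {
        ReadOnlySpan<char> key = partitionKey.AsSpan();

        // O(1) upper bound; we over-rent instead of measuring the exact byte length.
        int maxByteCount = Encoding.UTF8.GetMaxByteCount(key.Length);

        byte[]? sharedBuffer = null;
        Span<byte> hashBuffer = maxByteCount <= MaxStackLimit
            ? stackalloc byte[MaxStackLimit]
            : (sharedBuffer = ArrayPool<byte>.Shared.Rent(maxByteCount));

        // GetBytes writes into the buffer we own and reports how much it wrote.
        int written = Encoding.UTF8.GetBytes(key, hashBuffer);

        // Narrow the window to exactly what was written.
        uint hash = ComputeHash(hashBuffer.Slice(0, written));

        if (sharedBuffer != null)
        {
            ArrayPool<byte>.Shared.Return(sharedBuffer);
        }

        return hash;
    }

    // Placeholder FNV-1a hash; an assumption for illustration only.
    public static uint ComputeHash(ReadOnlySpan<byte> data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
        {
            hash ^= b;
            hash *= 16777619;
        }
        return hash;
    }
}
```

Short keys take the stack path and long keys take the pool path, and in both cases only the written slice is hashed, so stale bytes in the over-rented buffer never influence the result.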

So these are the rules that we looked at. Again, as a quick summary, look for the Stream and byte array usages that are copied or manipulated without using Span or Memory, optimize those and replace the existing data manipulation methods with the newer Span or Memory-based variants. And then you don't need to copy memory anymore. This is the summary of all the rules that I have shown you. I know it's a bit overwhelming. I just put them here so that you can write them down later when you look at the recording. The slides will also be available later. So let me do a quick summary of everything that I have shown you today. So one of the things that I want you to take away, and that's the most important piece again, is that the context of the piece of code really matters. So one of the things that you should always do before you do these types of optimizations is go tweak the expensive I/O operations first, right?

And once you have done the I/O-based operations, because they are hundreds or thousands of times more expensive than memory allocations and stuff like that, then you can start applying the principles and practices that I have shown you today to make your code even faster. And sometimes when you're doing refactorings and redesigns on the hot path, you can actually combine I/O optimizations with the optimizations that I've shown you today. So really only apply the principles that I have shown you where it really matters, and everywhere else, always favor readability. So use your LINQ stuff when you like it, because it's more readable. Where it matters, get rid of LINQ and also go the extra mile if you can, to get rid of all the LINQ on the hot path and refactor your code towards collection-based operations, or as I've shown, use Span and Memory to make things even faster.

And that's all I wanted to show you today. I wish you happy coding. The slides are also available on my GitHub profile. Here is the link. I've also shared with you a QR code that points directly to the repository. If you have any questions, you can reach out to me @danielmarbach on Twitter, and I'm aware that we are also potentially going to get some questions from the audience right now. Tim, do we have any questions?

Yes. Thank you, Daniel. We do have some questions from the audience, indeed. So let me bring forward the first one from Valdes. He asked whether ArrayPools can be expanded on demand or whether we should rent them with a known size upfront?

Whether ArrayPools can... I'm not sure if I entirely understand the question. So if you're using the shared ArrayPool, it has internally a fixed way of how it gives you back the memory, in certain chunks. But you can create a new ArrayPool that you manage yourself, and there you can give it certain configuration parameters for how it should use internal buckets and what types of arrays it should return to you, so that it's more suitable for your needs.

And most of the time, I guess, ArrayPool.Shared is probably a good enough start, and I would only suggest doing those tweaks if you find out that it's really necessary. But if you want to go even further, you can implement your custom memory pool and, for example, use predefined, pre-allocated native memory and manually manage everything yourself. So the base class is available, you can inherit from it and control everything yourself, if you need to.
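As a rough sketch of the "create a pool you manage yourself" option mentioned in the answer: `ArrayPool<T>.Create` accepts a maximum array length and a maximum number of arrays per bucket. The `CustomPool` class name and the chosen limits are assumptions for illustration only.

```csharp
using System.Buffers;

public static class CustomPool
{
    // A dedicated pool with its own limits, separate from ArrayPool<byte>.Shared.
    public static readonly ArrayPool<byte> Instance =
        ArrayPool<byte>.Create(maxArrayLength: 1024 * 1024, maxArraysPerBucket: 10);

    // Rents a buffer, reports its actual length, and returns it to the pool.
    public static int RentedLength(int minimumLength)
    {
        byte[] buffer = Instance.Rent(minimumLength);
        try
        {
            // The pool guarantees at least the requested length, possibly more.
            return buffer.Length;
        }
        finally
        {
            Instance.Return(buffer);
        }
    }
}
```

Either way, the size is requested per `Rent` call, and the returned array can be larger than what was asked for.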

All right. Then a bit more of an in-depth question. Bayard was a bit confused about the part on removing LINQ, where in the first iteration that you tried with the foreach loop, there was also a slowdown in performance, especially for the arrays. Is that something you can maybe shed a little more light on?

Yeah. I think that's probably a question that I can follow up on with Bayard after this webinar and reach out to him, because I guess his email address should be available, and I will answer that offline.

Perfect. And then we just got in another question right now: do you have any view regarding channels versus TPL Dataflow?

Okay. So I guess the question is related to System.Threading.Channels versus TPL Dataflow. Well, that's a pretty complex question to answer, but I guess if you have, for example, single reader, multiple writer, or multiple reader, multiple writer scenarios, and especially if you're doing low-level types of stuff, libraries and frameworks, then System.Threading.Channels is potentially the better way because it's also more async enabled. It has APIs that return ValueTasks. So it can properly support scenarios where you synchronously return values, and it also has a way to achieve back pressure between the writers and the readers, which TPL Dataflow in a sense also has, but I would then go for System.Threading.Channels.

I mean, TPL Dataflow is kind of neat if you have higher-level data crunching stuff where you just want to plug things together. Mileage may vary, but I'm aware that TPL Dataflow is a bit outdated when it comes to, for example, supporting async and ValueTask-based stuff. So if that's a requirement, I would usually go with System.Threading.Channels and plug things together there. But yeah, I hope that answers the question.
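For readers who haven't used System.Threading.Channels, here is a minimal sketch of the back pressure mentioned in the answer. The `ChannelDemo` class and the capacity of 4 are assumptions: a bounded channel makes the writer wait when the buffer is full, and `WriteAsync` returns a `ValueTask`.

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelDemo
{
    public static async Task<int> SumAsync(int itemCount)
    {
        // Bounded capacity: WriteAsync waits once 4 items are buffered (back pressure).
        var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(4)
        {
            SingleReader = true,
            SingleWriter = true,
            FullMode = BoundedChannelFullMode.Wait
        });

        var producer = Task.Run(async () =>
        {
            for (int i = 1; i <= itemCount; i++)
            {
                await channel.Writer.WriteAsync(i).ConfigureAwait(false);
            }
            channel.Writer.Complete();
        });

        int sum = 0;
        await foreach (int item in channel.Reader.ReadAllAsync().ConfigureAwait(false))
        {
            sum += item;
        }

        await producer.ConfigureAwait(false);
        return sum;
    }
}
```

The reader loop ends once the writer calls `Complete` and the remaining items have been drained.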

All right. There are no more questions at the moment. So I think we'll wrap up the webinar at this point. For our next events, our colleague Dennis will be speaking next month at Future Tech in Utrecht in the Netherlands, and you can also meet us at the Developer Week in Nuremberg in July and the .NET Day Switzerland in August. You can go to particular.net/events for more information on those events. And with that, thank you so much for joining us today. And on behalf of Daniel, this is Tim saying goodbye for now and see you on the next Particular Live Webinar.
