Performance tricks I learned from contributing to open source .NET packages
About this video
This session was presented at NDC Oslo 2023.
As a practical learner, I’ve found performance optimizations are my biggest challenge and where I’ve learned the most helpful tricks, mostly by trial and error. It turns out the Azure .NET SDK is a perfect “playground” for learning those tricks—it’s maintained by people who care and give feedback.
Over the past few years, I’ve contributed over seventy pull requests to the Azure .NET SDK. In this session, I’ll walk you through the performance improvements I made, and help you develop your own “superpowers”—spotting and avoiding closure allocations, finding opportunities for memory pooling, and more.
🔗 Transcription
- 00:03 Daniel Marbach
- Hi, everyone. Welcome to my talk about performance tricks that I learned from contributing to open source projects, primarily the Azure Service Bus SDK. So I'm the kind of guy that considers himself a practical learner, and I've invented that term. I do not know if it exists. But basically, when I read a book about something like performance optimizations, architecture, design, whatever, I'm sitting in front of the book and I think I have it figured out.
- 00:28 Daniel Marbach
- And then, I'm sitting in front of my code, I'm sitting in front of my challenge, then I'm like, "How did that work again?" And then, I basically go through a series of learning, trial-and-error attempts in order to really, really embrace the concepts, but then I think like, "Okay." Then, I have it finally figured out. And my dream was to get better in performance optimizations because I was reading a bunch of blog articles and stuff like that.
- 00:53 Daniel Marbach
- I was like, "Hmm. How could I learn that?" So I went out looking for an open source project that this is welcoming contributions and I found the Azure Service Bus .NET SDK and they started basically sending in some pull requests trying out. Some of them have been accepted. Some of them have not been accepted.
- 01:09 Daniel Marbach
- But I started gradually applying the things that I learned from reading books and stuff like that to the Azure Service Bus SDK. And now, I'm around 80 pull requests and most of them have been merged. Some people have said that, "You contributed more than some internal Microsoft employees." I don't know if that's really true. But I mean, it's an honor to hear that. And in this talk, I basically have summarized some of the key learnings I learned from these contributions to the Azure Service Bus SDK. So that if you're interested in performance optimizations, you don't have to go through the same exercise as I did with a lot of tears and sweat and midnight programming sessions until my internet switches off. I basically have this hack. Because I know that when I'm deep into coding and stuff, I cannot stop anymore. So I basically set my internet router to switch off at midnight. Because when I cannot Google or Bing anymore and the internet is gone, I'm like, "Okay. Whatever," and I go to bed. So that's the trick that I do.
- 02:13 Daniel Marbach
- Okay. A quick introduction. So this talk is not going to be about horizontal or vertical scaling. You can achieve a lot of throughput and performance things by horizontally or vertically scaling architectures. This talk is not going to be about that. It's also not going to be about tools like BenchmarkDotNet, profilers, and stuff like that because that would be a whole other talk. So this talk is really about performance optimizations that you can do in your C# .NET code. I will be showing examples in C# .NET because that's the language that I use day-to-day.
- 02:42 Daniel Marbach
- Some of these things can also be applied to F# or Visual Basic .NET if you're using that, but the focus will be C# examples. And sometimes people ask me, "But why .NET? Why C#? Why a managed language? Shouldn't you just be using C or C++? Wouldn't that be way more efficient?" Well, I truly believe that the .NET runtime over the time has become a really, really good platform for writing high-performance code.
- 03:06 Daniel Marbach
- And people like Aaron Stannard with Akka.NET proved that, the Microsoft Orleans team proves it. There are also a lot of online games, like massive multiplayer online games, that are running on .NET. So it is truly a platform that is able to do a lot of great things and at a high scale. And I think C and C++ are less and less needed to achieve code that performs well at scale. One of the good examples that I find is that the .NET runtime team actually ported the ThreadPool that was half managed land, half unmanaged land. They ported it into the managed land. It's now entirely written in C# and it performs really, really well. In some cases, slightly better. But it's, at least, on par. So that's also a good sign that when you have a common code base that uses the same language constructs, that's also a really good thing for maintainability. So I really love C# .NET and how it can be used for high-performing code.
- 04:06 Daniel Marbach
- Sometimes when I do these types of optimizations, what I hear from my colleagues and peers is like, "Wow, Daniel. Isn't that premature optimization? Some of these things are really highly esoteric," and they're asking me questions like, "Is this change really worth it? Is the complexity really worth it?" And these are really important questions to ask and I want to... I don't want you to go home after this talk and say, "Hey, Daniel showed a few things, some optimizations. I'm going to apply them everywhere in my code." That is not my message today, right?
- 04:36 Daniel Marbach
- Because some of these optimizations, they should only be applied in the context where it actually is necessary to apply them. And in some other contexts, and I'm going to talk a lot about this context, you should not be applying this type of optimizations. Because you would be wasting your employer's time, your customer's time, and your own time by doing that. So that's really important to keep in mind. So don't jump to conclusions and apply them everywhere. I fell into this pitfall myself. So some of these performance optimizations, they can be highly addictive.
- 05:08 Daniel Marbach
- Because once you get into this mode of, "Oh, I found something. I saw an allocation," whatever and you start optimizing it, tweaking it. And then, you go into loops and loops until the internet switches off at midnight, right? But it's really important that you ask yourself a few questions like, "Is this code going to be executed at scale?" I'm going to talk a bit more about that, because you don't want to optimize code that is only executed once a day, right? Because it doesn't really matter, right?
- 05:35 Daniel Marbach
- So when code is executed at scale and it becomes more efficient in resource usage, execution time, throughput, and whatnot, then these types of optimizations can make a huge difference. And I have this quote from David Fowler from a talk that is called At Scale Implementation Details Matter, and he says, "Scale for an application can mean the number of users that will concurrently connect to the application at any given time, the amount of input to process, or the number of times the data needs to be processed." And for me, the last sentence is the most important one. For us as engineers, it means we have to know what to ignore and what to pay close attention to.
- 06:14 Daniel Marbach
- So really, usually, when I look at the code base, I assume that the people that have written the code, they were smart. They made trade-offs. They thought about different input parameters and how this code is going to be executed at scale. So we have to essentially discover the assumptions that people made about this piece of code. And hopefully, it's written down somewhere. Sometimes it's not, right? And we have to think about, "Well, what is currently instantiated in this specific piece of code per request?", as an example, "How much memory is going to be used?"
- 06:47 Daniel Marbach
- And sometimes people were also thinking like, "Well, it's only going to be used 10 times a second." And then, we forget that and then the system evolves. And suddenly, it's going to be executed a hundred times, a thousand times a second. And that doesn't mean those people were stupid, it just means that things have changed. And sometimes, we have to go back and re-evaluate our decision-making so that we can actually improve the piece that is currently executed at scale. And what I also try to do in my talks is I want to find rules that you can apply yourself so that you have some things to memorize from this talk. And I have these rules that I'm going to talk about and I'm going to go more into these rules along the way of this talk.
- 07:33 Daniel Marbach
- So the first one is avoid excessive allocations to reduce the GC overhead and GC is the garbage collection overhead. That's the first one. And the second one is avoid unnecessary copying of memory. These are the high-level categories that I'm going to talk about today. And if you're now falling asleep because you're low in sugar and you're waiting for your lunchtime, that's fine. If you keep that from this talk, then you're already good. And if you look into your code bases and try to look for these things, that's already a good key learning that you can take away.
- 08:03 Daniel Marbach
- So let's go into the first subcategory of this. Avoid excessive allocations to reduce the GC overhead. So the first one is think at least twice before using LINQ or unnecessary enumerations on the hot path. So don't get me wrong, LINQ is a great language feature. I also love it, but it should be applied in the right circumstances. And in some specific circumstances when the code is executed at scale, LINQ can actually be a resource hog.
- 08:31 Daniel Marbach
- So for example, LINQ is very difficult to be JIT optimized and I know that the .NET team is actually optimizing LINQ more and more and more. And some of the things that we talked today about LINQ might no longer be true maybe in a few months time or in a year. And some things have already been changed in .NET. Some will be further optimized in .NET 8. Still, you have to pay attention to LINQ that is executed on the hot path. And when I say JIT, I mean the just-in-time compiler.
- 08:59 Daniel Marbach
- So let's look first into a specific piece of code that I have from the Azure Service Bus SDK. We have here the AmqpReceiver class. And AMQP is a protocol that essentially allows, for example, Azure Service Bus or Event Hubs to essentially communicate over TCP/IP with the client application that is running, either in the data center, on-premises, or wherever it's running.
- 09:22 Daniel Marbach
- And essentially, in AMQP, and especially with Azure Service Bus, there's this concept of a lock token, and the AmqpReceiver is the driver behind the connection to Azure Service Bus. And whenever you receive messages with Azure Service Bus, the lock tokens represent, essentially, the lock on a message. When you are done, you complete by essentially giving the service a lock token.
- 09:46 Daniel Marbach
- And the Azure SDK has these guidelines. Whenever they are accepting, essentially, enumerable types from the outside, they have to take the broadest possible enumeration type and that's IEnumerable. Here, the lock tokens are represented as strings and they're actually Guids. So they accept this IEnumerable of string. And then, what they're doing is essentially the code looked like this, right? They did lockTokens.Select, new Guid(token), then ToArray.
- 10:14 Daniel Marbach
- And then, they did the lookup with LINQ Any, looking into a data structure to see whether it has already been seen, and then they went on one code path or on the other code path. And if we want to know what's going on, we have to closely look at the code that is actually essentially lowered under the covers. If we decompile it, what we will see is this piece of code and I know it's a little bit overwhelming. I'm going to zoom in a little bit more.
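Before looking at the lowered code, here is a hedged sketch of the shape just described. The class, field, and method names are made up for illustration; they are not the SDK's actual ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class AmqpReceiverSketch
{
    private readonly HashSet<Guid> requestResponseLockedMessages = new HashSet<Guid>();

    public void Complete(IEnumerable<string> lockTokens)
    {
        // Select + ToArray allocate an enumerator and a brand new array on every call.
        Guid[] lockTokenGuids = lockTokens.Select(token => new Guid(token)).ToArray();

        // The Any lambda touches instance state, so a new Func<Guid, bool> is allocated per call.
        if (lockTokenGuids.Any(token => requestResponseLockedMessages.Contains(token)))
        {
            // settle via the request/response path
        }
        else
        {
            // settle via the receiver link path
        }
    }
}
```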
- 10:41 Daniel Marbach
- So what we see here is we have this Select statement here and we can see, here on this side, is this 9__2_0 and then ??=. So whenever we see this pattern, simplified, we know we are in the safe zone. We don't really have to worry about it, because it's going to be a statically cached delegate so that's going to be fine. But if you look even closer, we can see here this new Func of Guid, bool that essentially points to this CompleteInternalAsync method, and that is an allocation that is not really necessary.
- 11:16 Daniel Marbach
- How can we get rid of this allocation? Well, we can simply turn this any statement into a foreach loop. So instead of doing the any, we essentially foreach and then we look into this data structure that we had there. And then, if we find it, then we go on one code path. And if we don't, then we go onto the other code path.
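A hedged sketch of the same illustrative receiver with the Any call turned into a foreach loop, so no instance-capturing delegate is allocated per call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class AmqpReceiverForeachSketch
{
    private readonly HashSet<Guid> requestResponseLockedMessages = new HashSet<Guid>();

    public void Complete(IEnumerable<string> lockTokens)
    {
        Guid[] lockTokenGuids = lockTokens.Select(token => new Guid(token)).ToArray();

        // Plain foreach instead of Any: no delegate, no extra enumerator.
        foreach (Guid token in lockTokenGuids)
        {
            if (requestResponseLockedMessages.Contains(token))
            {
                // settle via the request/response path
                return;
            }
        }

        // settle via the receiver link path
    }
}
```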
- 11:38 Daniel Marbach
- And if we decompile this code, what we can then see is now we have the pattern that we saw before that I said is kind of safe, right? Because now, we have this C and 9__2??= and now we have a statically cached delegate and we should be fine. Are we fine? That's an important question to ask yourself, right? Because when we're making optimizations to the code, we actually have to measure whether we actually improved something.
- 12:08 Daniel Marbach
- And we can do that with tools like BenchmarkDotNet as an example. There are also other tools available that you can use. I find BenchmarkDotNet really accessible for writing benchmarks. And I did that and I actually did a benchmark and I compared the before and after solution against multiple enumeration types. And I know this is a huge graphic here on the screen. I'm going to summarize it for you so you don't have to read it all. Essentially, when we get rid of the LINQ Any statement, we are able to get some good performance improvements.
- 12:40 Daniel Marbach
- What we see on the first line, we get 20 to 40% more throughput in this piece of code and we get a garbage collection reduction of 20 to 40%. That's already quite amazing, right? By just getting rid of the Any statement, we are actually now able to pump more and more throughput through this piece of code. And this is really relevant for that AMQP receiver, because we might be receiving thousands and thousands of messages from Azure Service Bus. Or in Event Hubs, it's going to be 20 megabytes of streams basically per second that is going to be pumped through the network channels and we need to then acknowledge all those lock tokens. So performance improvements like that are super important.
- 13:22 Daniel Marbach
- But then, the question is can we go even further? Because those who are not asleep yet saw that there is still a LINQ statement in there. Can we also get rid of this LINQ statement? And yes, we can. I call these LINQ to collection-based refactorings that you can actually do. And the first rule that I have here is whenever you have an empty array that you need to represent, use Array.Empty. And the other one is whenever you have enumerables that you need to represent that are empty, use Enumerable.Empty to represent those. The next one is about collections: collections have a pre-assigned capacity. When you add more items to the collection and that capacity is about to be reached, the collection is going to resize itself internally. And that resizing is going to cause allocations and it's going to use CPU time. So if you know how many things you will add to a collection, it's usually good practice to instantiate the collection with the number of things that you want to put into it. Then, you're not going through this growing.
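A few hedged illustrations of those rules; the method names are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CollectionRuleSketches
{
    // Rule: represent empty arrays and empty enumerables without allocating.
    public static string[] NoTokens() => Array.Empty<string>();

    public static IEnumerable<string> NoTokensLazy() => Enumerable.Empty<string>();

    // Rule: when the final size is known, pre-size the collection so it never has to grow.
    public static List<Guid> ToGuids(IReadOnlyCollection<string> tokens)
    {
        var guids = new List<Guid>(tokens.Count);
        foreach (string token in tokens)
        {
            guids.Add(new Guid(token));
        }
        return guids;
    }
}
```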
- 14:30 Daniel Marbach
- And this one is a little bit counterintuitive, because I hear a lot of teams saying, "Well, but I want to use IReadOnlyCollection or IEnumerable everywhere." If it's performance-sensitive code, it's usually better to use the concrete collection types, because when you're using the concrete collection types, there is no boxing of the enumerator happening, you get fewer allocations, and you get far more speed.
- 14:52 Daniel Marbach
- This rule here gets more and more optimized, also with things like PGO (profile-guided optimization), and the JIT gets smarter and smarter over time. But still, I consider it good practice for high-performance code to actually use the concrete collection types. And then, for example, if you're getting input from the outside and you need to copy it into a new collection, sometimes you need to get the count. Instead of using LINQ Count, use, for example, pattern matching to figure out whether it's an IReadOnlyCollection, and then you have access to the Count. Or you can use Enumerable.TryGetNonEnumeratedCount, which basically attempts to find out whether it's a collection type that already has a count available. And then, it doesn't enumerate through the whole collection. That's also a good way to improve performance.
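A small sketch of getting the count cheaply. The helper name is hypothetical, and Enumerable.TryGetNonEnumeratedCount requires .NET 6 or later:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CountSketches
{
    public static int CountCheaply<T>(IEnumerable<T> source)
    {
        // Pattern match for a collection interface that already carries a count.
        if (source is IReadOnlyCollection<T> collection)
        {
            return collection.Count;
        }

        // On .NET 6+, this checks for known collection types without enumerating.
        if (source.TryGetNonEnumeratedCount(out int count))
        {
            return count;
        }

        // Only fall back to a full enumeration when nothing cheaper is available.
        return source.Count();
    }
}
```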
- 15:42 Daniel Marbach
- And this one is whenever you are thinking about essentially creating a new collection type because you need to copy, wait until you really need it. So basically figure out your, essentially, boundary conditions and whatnot. And if your boundary conditions are reached, then exit the method. Otherwise, once you really know that you need a collection, then create the collection.
- 16:02 Daniel Marbach
- And this one, there be dragons. Be careful with this one. So .NET has bounds checks for arrays and different collection types, right? And you can, for example, write the code in a way so those bounds checks are actually not emitted anymore. For really performance-sensitive areas you can also use unsafe code. And one of the things that's also quite nice, you now have the CollectionsMarshal, MemoryMarshal, and Unsafe types available in the .NET runtime that allow you to actually get access to the underlying memory of collections. But this is really dangerous. So be really careful and think at least three times before you use these techniques. But it is possible to squeeze out even more performance.
- 16:51 Daniel Marbach
- So let's apply this to this piece of code, right? Because what we had here is we still had this ToArray call, so let's have a look at how it looks when we're doing this. The Azure .NET team has a lot of telemetry about how people are using the SDK. And we are able to basically use this knowledge and we know that in the majority of the cases, people are actually passing already materialized collections into it and not lazy enumerated collections.
- 17:21 Daniel Marbach
- So what we did is we said, "Okay. We're going to optimize this code." We didn't have Enumerable.TryGetNonEnumeratedCount available because we target .NET Standard there. So we're using a pattern match to essentially ask, "This IEnumerable that we are getting from the outside, is it an IReadOnlyCollection of string?" And if it is, then we just pass it to the method. If it isn't, we are probably in the case of a lazy enumerated enumerable and then we are calling ToArray. And then, we pass an IReadOnlyCollection of string to this method. And you might be thinking, "But Daniel, I'm awake. I paid attention to your rules and you said use the concrete collection types. And now, you cheeky bastard, you're passing an IReadOnlyCollection of string. Are you tricking me?"
- 18:06 Daniel Marbach
- Well, the thing is with all these types of optimizations, you have to make trade-offs, right? And for this one, the trade-off was we couldn't just go and break essentially the public method and we couldn't violate the rules of the SDK, so we had to make trade-offs. And the trade-off was, well, we're going to use the IReadOnlyCollection of string here. And then, what we do is we essentially figure out the boundary conditions. If it's empty, we represent it as Array.Empty. Otherwise, we allocate the array and then we just go, essentially, and fill the Guids into this array.
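A hedged sketch of the collection-based refactoring just described. Again, the names and the settling logic are illustrative, not the SDK's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class AmqpReceiverCollectionSketch
{
    public void Complete(IEnumerable<string> lockTokens)
    {
        // Telemetry showed callers mostly pass materialized collections,
        // so only lazy enumerables pay the ToArray cost.
        IReadOnlyCollection<string> tokens =
            lockTokens as IReadOnlyCollection<string> ?? lockTokens.ToArray();

        CompleteInternal(tokens);
    }

    private void CompleteInternal(IReadOnlyCollection<string> lockTokens)
    {
        // Boundary condition first: nothing to settle, nothing to allocate.
        if (lockTokens.Count == 0)
        {
            return;
        }

        // The exact size is known, so allocate the array once and fill it.
        var lockTokenGuids = new Guid[lockTokens.Count];
        int index = 0;
        foreach (string token in lockTokens)
        {
            lockTokenGuids[index++] = new Guid(token);
        }

        // ... settle the messages using lockTokenGuids ...
    }
}
```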
- 18:38 Daniel Marbach
- Okay. Let's have a look at how we are doing now with this optimized piece of code. Let's do some benchmarking. And this is a before and after comparison of the already optimized version that got rid of the LINQ Any statement against the collection-based refactored version of this piece of code. And if you look, we can actually squeeze out another 5 to 64% throughput improvement over the previous already optimized version and we get another 23 to 61% garbage collection reduction, which is pretty neat. And now we could say, "Okay. Let's switch off the internet and go to bed."
- 19:14 Daniel Marbach
- But if we look really closely then we can actually see that in some scenarios, we're actually doing worse. We're doing 56% worse. So we are actually slower than the previous version that was far less complex. Now, the question is, should we not do such an optimization? And by the way, sorry, I forgot to mention we are worse in the cases when we get lazy enumerated enumerables. Because then, we have to materialize the collection. Then, we're actually worse. And again, the question is should we not do this type of refactoring? And here, the answer is, "It depends," right?
- 19:53 Daniel Marbach
- In the specific code, we knew that the majority of the time in the production cases, because of telemetry and whatnot, we knew that it's going to be materialized collection. So it's a good refactoring to do and only in unit testing cases people might actually pass lazy enumerated enumerables. And then, we can actually go to say, "Okay. We are never in that sort of the danger zone, really, for production scenarios where we are actually worse." And then, we can make this trade-off.
- 20:22 Daniel Marbach
- For you, for your teams, it might be, "Well, we are fairly familiar with LINQ. We're happy to make the trade-off to actually get rid of the LINQ Any in this specific example." But for everything else, we just leave what is currently in place because this code is going to be fine like that. And then, focus your attention on other pieces in your code base where you can probably make even more significant improvements, rather than actually trying to refactor everything. Because as we have seen, there is a complexity explosion. We went from a few lines of code to essentially 20 lines of code. And these 20 lines of code, they come with a maintenance overhead, a cognitive overhead whenever you're looking at this piece of code. So these factors need to also be taken into account. And as with all things in software, the trade-offs are important, and so is thinking about these edge cases. And sometimes it means we have to stop here and use the first simplified optimized version.
- 21:21 Daniel Marbach
- We already touched a little bit on the closure allocations and I'm going to reiterate a little bit on this one. The next rule is be aware of closure allocations. And closure allocations, they can occur whenever you have Action or Func delegates, or any type of delegates, that access state outside of the lambda or outside of the curlies. And it's like, what does that even mean?
- 21:47 Daniel Marbach
- Okay. I'm going to give you an example. So here, we have this RunOperation method out of the Azure Service Bus SDK and this, essentially, is sort of a Polly-like retry mechanism. And what it does is whenever they're calling a service method, they're wrapping this method in this RunOperation method, and the RunOperation method returns a Task because these are going to be I/O-bound methods.
- 22:10 Daniel Marbach
- I/O-bound means we're going to call over AMQP, TCP/IP, or HTTP. We're going to call to the service that is running in the cloud. And then, we have this Func of Task, that's the actual operation that we're going to execute that is going to be passed into this method. And then, what we're doing is we have sort of a while loop and we essentially call this method. And if it wasn't successful and we got a server busy exception, sort of back pressure from the service, then we are going to retry with task delays and whatnot. A really simplified sort of Polly-like mechanism that you might be already familiar with, that is built into the SDK.
- 22:46 Daniel Marbach
- And if we look at the usage, what we can see here, this is such a usage with the retryPolicy.RunOperation. And if we zoom in, what we can see here, there is this messageBatch local variable. And then, we see the curly braces around here and within those curly braces, we access this local variable and then we call CreateMessageBatchInternal. And that is a closure, right? Because we have something that is outside of the curly braces that we try to reach out to, and then we get a closure allocation.
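A simplified sketch of such a capturing call site; the types and method names here are stand-ins, not the SDK's real ones:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class SenderSketch
{
    private readonly RetryPolicySketch retryPolicy = new RetryPolicySketch();

    public async Task<MessageBatch> CreateMessageBatchAsync(BatchOptions options, CancellationToken cancellationToken)
    {
        MessageBatch messageBatch = null;

        // The lambda reaches out to 'messageBatch', 'options', and 'this', so the compiler
        // generates a display class plus a new Func<TimeSpan, Task> on every call.
        await retryPolicy.RunOperation(async timeout =>
        {
            messageBatch = await CreateMessageBatchInternal(options, timeout);
        }, cancellationToken);

        return messageBatch;
    }

    private Task<MessageBatch> CreateMessageBatchInternal(BatchOptions options, TimeSpan timeout)
        => Task.FromResult(new MessageBatch());
}

public class RetryPolicySketch
{
    public async Task RunOperation(Func<TimeSpan, Task> operation, CancellationToken cancellationToken)
    {
        // Heavily simplified: the real method loops and retries on transient failures.
        await operation(TimeSpan.FromSeconds(60));
    }
}

public class MessageBatch { }
public class BatchOptions { }
```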
- 23:18 Daniel Marbach
- What does a closure allocation look like? Again, we have to decompile the code and look at what's actually happening under the covers. And this is the gibberish code that gets generated by the compiler when this code is lowered. And what we can see here, we have this DisplayClass16 allocation and we have this Func of TimeSpan, Task allocation that happens every time we call this method. These are two allocations that are totally unnecessary. How can we get rid of those?
- 23:46 Daniel Marbach
- So we have to do a little bit of mental gymnastics and build a library infrastructure tool or method that we are going to use. So what we are doing is... And at that time, we're essentially moving from Task to ValueTask. ValueTask is basically a discriminated union out of a result that is available or an I/O-bound operation that is going to be executed. Some people might be saying, "Well, we should be using ValueTask everywhere these days." I mostly use ValueTask when I have cases where it's like 8 or 9 out of 10 times I already have the materialized results and only in a few cases I'm actually fetching out to I/O-bound operations. Your mileage may vary. There are different ways to approach this. I use this simple rule here that I just talked about.
- 24:33 Daniel Marbach
- Well, we already had methods that returned ValueTasks. So our infrastructure tool also needs to return a ValueTask. And then, what we do is we accept T1, which is the state that is going to be passed into this method, and we return a result because we have methods that return a result. That's what we are doing. Then, we are changing the function delegate to accept T1, we add CancellationToken and whatnot, and we return a ValueTask.
- 25:01 Daniel Marbach
- On Line 3, we pass the state into this method. Why is it generic? Well, you can't use object, right? Because if you're passing an int, then you're going to be essentially boxing the int to object and then you get unnecessary allocations. That's why we use a generic here. And then, what we're going to do is we just pass the state that we got from the outside into this operation method with the cancellation token, the timeout, and everything. And that's basically the basic infrastructure thing that we have to have in place.
- 25:32 Daniel Marbach
- And once we have that, we can build additional methods on top of this library. We can now represent methods that do not return anything because we have this library that allows us to pass a method that returns something. And what we can do then is when we do not return anything, we return ValueTask instead of ValueTask of TResult. We still need to accept state because we need to make sure that the state is always within the lambda itself. And I'm going to talk a little bit more about this.
- 26:03 Daniel Marbach
- Then, we have the function delegate that returns the ValueTask, accesses the state, and we pass in the state from the outside. And now, we can leverage... C# 9 has this nifty feature in the language that is called static lambda. So we can essentially call the RunOperation method and we pass in the lambda and we attribute it with static async. And what's going to happen is now the compiler ensures that within the curly braces that we have, essentially on Line 7 and Line 11, we can only access state that is already available within that lambda. So it's compiler enforced, right?
- 26:41 Daniel Marbach
- And what we then do is, on Line 12, we pass in the state and the operation. So we basically package the state and the function into a value tuple. And then, we pass that as state essentially into the function. And then, what we can do is, on Line 8, we can essentially deconstruct that state, which represents the state from the outside plus the operation. And then, on Line 9, we then call essentially the operation, passing the state, timeout, CancellationToken, and everything. And now, we have no access to state that is outside of the curly braces and, now, we have no closure allocations anymore in this piece of code.
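A hedged sketch of that state-based infrastructure. The signatures only approximate the idea described above and are not the SDK's exact API; the tuple packaging and the dummy bool result are illustrative choices.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class StatefulRetryPolicySketch
{
    // Core building block: the caller hands in explicit state T1, so the lambda never has to capture.
    public async ValueTask<TResult> RunOperation<T1, TResult>(
        Func<T1, TimeSpan, CancellationToken, ValueTask<TResult>> operation,
        T1 t1,
        CancellationToken cancellationToken)
    {
        // Heavily simplified: the real method loops and retries on transient failures.
        return await operation(t1, TimeSpan.FromSeconds(60), cancellationToken);
    }

    // Overload for operations that return nothing: package the caller's state and operation
    // into a value tuple and pass that tuple through as the state of the core overload.
    public async ValueTask RunOperation<T1>(
        Func<T1, TimeSpan, CancellationToken, ValueTask> operation,
        T1 t1,
        CancellationToken cancellationToken)
    {
        await RunOperation<(T1, Func<T1, TimeSpan, CancellationToken, ValueTask>), bool>(
            // 'static' makes the compiler reject any access to state outside the lambda.
            static async (state, timeout, token) =>
            {
                var (innerState, innerOperation) = state;
                await innerOperation(innerState, timeout, token);
                return true; // dummy result, never observed by the caller
            },
            (t1, operation),
            cancellationToken);
    }
}
```

A call site can then hand everything the lambda needs in as explicit state, using a static lambda, so nothing is captured and the delegate gets cached.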
- 27:21 Daniel Marbach
- Okay. Then, if we decompile it, we can now see that we get the code that we wanted to achieve by having this static caching thing in place, right? We have, again, our already kind of familiar pattern that I talked about at the beginning with 9__16??=. And now, we have a statically cached delegate and we just got rid of two allocations.
- 27:48 Daniel Marbach
- And the thing is, how can you actually discover these types of allocations in your code? Well, there is one way you can do it. You can fire up a profiler and you can look for DisplayClass or various Action or Func delegate allocations. And then, you will see them lighting up and you can go into the piece of code and then refactor them out. Or a more proactive way of doing it is you can use tools like the Heap Allocations Viewer in Rider or the Heap Allocation Analyzer in Visual Studio, and that will proactively tell you when you're writing the code that you have a closure allocation. And then if it matters, if the code is executed at scale, you can avoid it by using the tricks I just showed you. And you might be thinking, "Yeah. But I don't have fancy code like that. I'm not even going to bother."
- 28:39 Daniel Marbach
- But a very good example that you might stumble over as well is ConcurrentDictionary. ConcurrentDictionary has methods like GetOrAdd or AddOrUpdate that accept lambdas, and in the .NET runtime they actually added state-based overloads that pass in a T1, and then you can apply the same trick. Because if you have code like that, you also have closure allocations. And if your ConcurrentDictionary access happens on the hot path where it's executed hundreds of thousands of times per second, you might have closure allocations in your code as well that, with an easy trick, you can get rid of. But now, you may be thinking, "Hmm. But really? All these gymnastics just to get rid of two allocations? Why would I even care?"
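A hedged sketch of that ConcurrentDictionary trick; the connection types are made up, but the GetOrAdd overload with a factory argument exists in the .NET runtime:

```csharp
using System.Collections.Concurrent;

public class ConnectionCacheSketch
{
    private readonly ConcurrentDictionary<string, Connection> connections = new ConcurrentDictionary<string, Connection>();

    public Connection GetConnectionCapturing(string host, int port)
    {
        // Captures 'port', so a closure is allocated on every call,
        // even when the key already exists and the factory never runs.
        return connections.GetOrAdd(host, key => new Connection(key, port));
    }

    public Connection GetConnectionWithState(string host, int port)
    {
        // The TArg overload passes 'port' as explicit state; the static lambda makes the
        // compiler enforce that nothing is captured, so the delegate can be cached.
        return connections.GetOrAdd(host, static (key, statePort) => new Connection(key, statePort), port);
    }
}

public class Connection
{
    public Connection(string host, int port) { }
}
```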
- 29:25 Daniel Marbach
- And I brought something from my project. So I work for a company called Particular Software and we have a queuing abstraction library in place that fetches messages from Azure Service Bus, RabbitMQ, SQS, SNS, whatever. And internally, it has an engine. We call it the pipeline execution engine. The pipeline execution engine is the piece of code that is going to execute our customer's code. We call it the handlers. And this piece of code needs to be highly optimized and fast because we are going to pump thousands and thousands of messages a second in the data centers of the customer through this pipeline engine.
- 30:06 Daniel Marbach
- And we had closure allocations in there and I did some optimizations. And as you can see here, we actually were able to increase the throughput of this pipeline engine by 74 to 78%, just by getting rid of the closure allocations, depending on the pipeline depth. And if you want to know more about the optimizations and tricks that are applied there, you can go to go.particular.net/ndc-oslo-2023-pipeline and there is a blog post where you will also learn more about the pipeline. But as you can see, the allocations are also gone. So this is... By getting rid of all these allocations, we are way, way, way, way faster. So it's five times faster than before which is quite impressive.
- 30:54 Daniel Marbach
- And by the way, we did even more optimizations by also applying some of these unsafe trickeries that I quickly hinted at, where I said there be dragons, on the collection side of things, where we essentially avoid the bounds checks. And we were able to squeeze out another 20% of throughput improvement on top of what you see here on the screen.
- 31:17 Daniel Marbach
- Good. Then, let's go to the next one under avoid excessive allocations: pool and re-use buffers and larger objects. So the Azure Service Bus SDK, I already talked about it, has this concept of lock tokens. They're basically glorified Guids, right? And they're coming from the network over the AMQP protocol. And then, when we get them, there was this piece of code in place and I'm showing it on Line 1 where we have this ArraySegment and there is this Guid byte array. And what it was doing here, it allocated a 16-byte buffer. And then, it used Buffer.BlockCopy to essentially copy the network segment into that byte buffer. And then, it allocated a Guid.
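A simplified sketch of that original shape, with hypothetical names: a fresh 16-byte buffer per message, a Buffer.BlockCopy, and then the Guid constructor:

```csharp
using System;

public static class LockTokenReaderSketch
{
    private const int GuidSizeInBytes = 16;

    public static Guid ReadLockToken(ArraySegment<byte> guidBytes)
    {
        var buffer = new byte[GuidSizeInBytes]; // a 16-byte allocation for every single message
        Buffer.BlockCopy(guidBytes.Array, guidBytes.Offset, buffer, 0, GuidSizeInBytes);
        return new Guid(buffer);
    }
}
```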
- 32:01 Daniel Marbach
- And so, that means whenever we are getting thousands and thousands of messages per second, we are essentially allocating 16 bytes every time for every message. That is a lot of allocations that are going to tremendously slow down, essentially, the processing of those messages and they're unnecessary. And at that time, I was reading about the thing called ArrayPool. Who has heard of the ArrayPool in .NET? A few. So ArrayPool is sort of a way that you can get arrays out of a pool and return it.
- 32:32 Daniel Marbach
- It's sort of like a car rental. When you're going to a car rental, you say, "I need..." You're three people and you say, "I need a car that can fit three people into the car," and the car rental might give you a four-seater car, it might give you a six-seater car depending on the availability. And then, you drive around with your friends, have a good time in the car. And once you're done, you basically clean it or not clean it and return it to the car rental. That's exactly what an ArrayPool is, right? It just does that with arrays: instead of renting cars, it rents you arrays, and that's already available in .NET. And I was like, "Hmm. I can use the ArrayPool to optimize this piece of code and to get rid of this 16-byte allocation." So I did that.
- 33:12 Daniel Marbach
- And I'm not making this up by the way, this really happened that way. So I introduced this and I used ArrayPool.Shared. And then, I rent a 16-byte array. And by the way, like I said, you might get 16 bytes but you might get more. Essentially what you're telling it is, "Give me an array that can fit at least 16 bytes into it." That's the conceptual model that you have to think about. And then, the rest of the code is almost the same. I rent it, and because I rented it, I also need to be a good citizen. Like returning my car to the car rental, I return my array back to the pool once I'm done.
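A hedged sketch of that ArrayPool attempt. The pool may hand back a larger array than requested, so this sketch builds the Guid from a 16-byte slice, which assumes the Span-taking Guid constructor of modern .NET:

```csharp
using System;
using System.Buffers;

public static class PooledLockTokenReaderSketch
{
    private const int GuidSizeInBytes = 16;

    public static Guid ReadLockToken(ArraySegment<byte> guidBytes)
    {
        // Rent an array that can fit at least 16 bytes; it may be larger.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(GuidSizeInBytes);
        try
        {
            Buffer.BlockCopy(guidBytes.Array, guidBytes.Offset, buffer, 0, GuidSizeInBytes);
            return new Guid(buffer.AsSpan(0, GuidSizeInBytes));
        }
        finally
        {
            // Be a good citizen and hand the array back to the pool.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```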
- 33:49 Daniel Marbach
- And then, I was like, "Yes, I got rid of another allocation. The team will be happy." And I was like, "Hold on a second. Before I embarrass myself, I should actually know whether this actually solved something." I wrote the benchmark and then I looked at this and compared the BufferAndBlockCopy version versus the BufferPool version.
- 34:08 Daniel Marbach
- And as you can see, sad trombone, I got rid of all the allocations. So I was like, "Yes." But then, I looked at the other number and was like, "Okay. 226 times slower than the original version." And now the question is, is this a bad optimization? Well, I would say it depends, right? You could say if you are in a memory-constrained environment where memory is really important, this is an optimization you can use to say, "I'm basically trading off throughput for memory," and you could use this technique to actually save allocations even though the code is slower.
- 34:45 Daniel Marbach
- But actually, you can do better and that's going to be the next rule. For smaller local buffers, consider using the stack. What we have here is, with the introduction of C# 7.3, there was also this stackalloc keyword with Spans. What you can do is you can stack allocate 16 bytes on essentially the stack of the current method, and what's pretty cool is the garbage collector is not really involved. Because whenever the method returns, the memory is just going to be freed up, and that is really fast if you use that, and then you don't interfere with the GC.
- 35:25 Daniel Marbach
- I'm going to talk a little bit more about the span and ReadonlySpan and those types of things a little bit later, but we can use the stackalloc and then we stackalloc 16 bytes. And then, what you can do is we can then copy, essentially, the bytes that we got. We can copy it into that Span and then we can create a new Guid. And you might be thinking, "Well, but I've read essentially all the Guid constructors and I know that there is a Guid constructor that allows you to pass in a ReadonlySpan. Why are you even copying the memory and do all that type of stuff?"
- 36:00 Daniel Marbach
- So essentially, because we have .NET Standard in the Azure Service Bus SDK, those constructor overloads that accept a ReadOnlySpan are not available. And at the time, there was also some buffer pooling around and the team figured out we need to copy the memory. Over further iterations, it actually turned out it's not necessary to copy the memory and I'm going to talk about this. Whenever you can, do not copy the memory because that's even more efficient.
- 36:29 Daniel Marbach
- I'm showing this here as a demonstration and what's also really important here is this code is not really safe, because the byte representation of Guids depends on endianness. And depending on the environment, the bytes are going to be in little-endian or big-endian order. And if you're using CopyTo, this does not take endianness into account. Buffer.BlockCopy does actually take endianness into account. So you have to be careful. So the actual code that we wrote into the Azure Service Bus SDK was way more complex than what I'm showing here. I'm showing this example as sort of a demonstration that you can stack allocate and then copy memory into it, because it's conceptually simple. But I want to make sure that you understand that I cheated a little bit here on this slide.
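With that caveat in mind, here is a minimal sketch of the stackalloc variant. It ignores the endianness handling of the real SDK code and assumes the Span-taking Guid constructor is available:

```csharp
using System;

public static class StackLockTokenReaderSketch
{
    private const int GuidSizeInBytes = 16;

    public static Guid ReadLockToken(ReadOnlySpan<byte> guidBytes)
    {
        // Lives on the stack of this method and is freed when the method returns; no GC involved.
        Span<byte> buffer = stackalloc byte[GuidSizeInBytes];
        guidBytes.Slice(0, GuidSizeInBytes).CopyTo(buffer);
        return new Guid(buffer);
    }
}
```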
- 37:18 Daniel Marbach
- Okay. Let's have a look how we are doing. Now, what we can see is this version is 45% faster than the original version and all the allocations are gone. So we actually have managed to optimize the code in really neat ways. Cool. And another thing that I want to quickly hint at, when you stackalloc, you might be going and say, "Oh, stackalloc. Cool. Now, I know this keyword, I'm going to use it everywhere." You have to be very careful because one of the things that's going to happen is if you, for example, accept stuff from the outside that is out of your control, you might get arbitrary stack allocated memory in various sizes. And when you allocate more memory than the method stack has available, guess what's going to happen? Things will explode, right? Okay? So be very careful.
- 38:11 Daniel Marbach
- So you have to make sure that you only stack allocate within safe boundaries and there are a few guidelines around in the community. So for example, a good sort of boundary is 256 or 512 bytes. They are still writing some guidelines. They are not entirely finalized, but I want to make sure that you take this home: this technique is good and can be applied, but it has to be applied in the right context, okay?
- 38:39 Daniel Marbach
- Good. Let's quickly summarize the rules that we had under avoid excessive allocations. There's think at least twice before using LINQ. Be aware of closure allocations. Pool and re-use buffers. For small local buffers, consider using the stack. I have three more that I haven't shown here in the interest of time. Be aware of params overloads: when you have methods that accept parameter arrays, they can cause unnecessary allocations. Where possible and feasible, use value types, but pay attention to unnecessary boxing. And I think Aaron is also going to talk about these types of things. It's today, I think? The-
- 39:12 Aaron
- Tomorrow.
- 39:12 Daniel Marbach
- Tomorrow? In his talk. So also attend his talk. He's even going into more depth than I'm doing today. And another trick to actually save allocations is to move allocations away from the hot path. That's a really neat trick. So if you have a byte array and you don't want to use pooling, you can allocate the byte array essentially away from the hot path. And if you know you only have a single thread that is going to be entering that method, you're going to essentially re-use the same byte array and overwrite it from time to time, of course making sure that you only read back what you actually have written.
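A tiny sketch of moving the allocation away from the hot path: the buffer is allocated once per instance and reused, under the stated assumption that only a single thread enters the method at a time. All names are hypothetical:

```csharp
using System;

public class FrameWriterSketch
{
    // Allocated once, away from the hot path, and reused for every frame.
    private readonly byte[] scratchBuffer = new byte[1024];

    public void WriteFrame(ReadOnlySpan<byte> payload)
    {
        // Assumes the payload fits into the scratch buffer and only one thread calls this method.
        payload.CopyTo(scratchBuffer);

        // Only read back the bytes that were actually written on this call.
        Send(scratchBuffer.AsSpan(0, payload.Length));
    }

    private void Send(ReadOnlySpan<byte> bytes)
    {
        // ... write to the underlying transport ...
    }
}
```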
- 39:46 Daniel Marbach
- So the last category is avoid unnecessary copying of memory. And I already talked a little bit about this Span and Span of T and ReadonlySpan of T. Conceptually, Span is a pointer to a memory location and it can be any arbitrary memory. It can be unmanaged memory. It can be managed memory. You have a pointer and you have a length that determines how long, essentially, the memory is that you want to access. And for me, conceptually, and this is highly simplified it's actually more complex than that, but I consider it more like a curtain, right?
- 40:21 Daniel Marbach
- You basically have a chunk of memory and then you point to a specific location of that memory. And then, I say, "Well, I have length 16," and then you're basically pulling the curtain so that you only see the memory that you want to access. So it gives you safe boundaries around the memory. And there is also the other cousin of Span, called Memory, Memory of T. That's usually for heap-allocated stuff.
- 40:44 Daniel Marbach
- Again, I'm highly simplifying this. There are talks available that you can watch. They go into an hour of the difference between Span and Memory. Because in reality, it's a little bit more complex, but I want to give you a conceptual understanding of Span and Memory. And Span is usable when you have methods that return TResult or void, right?
- 41:09 Daniel Marbach
- Once you actually have, for example, async stuff with Task, Task of TResult, ValueTask, you cannot use Spans. Then, if you want to use Spans, you have to refactor your code into sort of a synchronous path that uses the Span and an asynchronous path. Otherwise, you can only use Memory of T or ReadonlyMemory of T. So these are sort of the conceptual things that we have to take into account.
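A tiny sketch of the curtain idea: slicing gives a bounded view over the same memory without copying, and Memory is the variant you can keep across an await:

```csharp
using System;

public static class SpanCurtainSketch
{
    public static void Demo()
    {
        byte[] memory = new byte[64];

        // A window of 16 bytes starting at offset 16; it points into 'memory', nothing is copied.
        Span<byte> window = memory.AsSpan(16, 16);
        window.Fill(0xFF); // writes go straight through to the underlying array

        // Span cannot live across an await; for async code, hold a Memory/ReadOnlyMemory instead.
        ReadOnlyMemory<byte> forAsyncCode = memory.AsMemory(16, 16);
        Console.WriteLine(forAsyncCode.Length);
    }
}
```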
- 41:34 Daniel Marbach
- So I have two rules here. Look for Stream and byte-array usages that are copied or manipulated without using Span. And recently, David Fowler made a tweet. Apparently, Stream manipulations, ToArray and stuff like that, are still the highest causes of allocations in many .NET projects. So look out for those and replace existing data manipulation methods with newer Span or Memory-based overloads. So over time, the .NET team added more and more methods that, instead of accepting a byte array, now accept a ReadOnlySpan or ReadOnlyMemory, and you should be using those overloads. Because then, you can do a few nifty tricks that I'm going to talk about. So the last example brings all these things a little bit together and it's a little bit more complex. It's going to be about Event Hubs. So Event Hubs is a streaming service where you can, for example, push in your IoT data. As an example, your events go into Event Hubs, and Event Hubs has sort of a partitioned publisher.
- 42:42 Daniel Marbach
- And the partition, conceptually, is sort of when you... For example, you have books in your bookshelf and then you want to basically take them out and sort the books. You want to say, "All the books with authors that start with A go into Box A and all the books with authors that start with B go into Box B." And that is sort of partitioning where you, for example, partition by the first letter of the author. So that's a simplified version of partitioning.
- 43:09 Daniel Marbach
- And that partition key resolver has a partition key hashing function internally, and that hashing function figures out into what partition essentially an event needs to go when using Event Hubs. And that hashing function is on the hot path 30 to 40% of the time when customers that are using this publisher are publishing messages to Event Hubs. And because Event Hubs as a streaming service should be able to ingest tons and tons of data, the CPU that we are going to use as a publisher is going to significantly impact the throughput against Event Hubs. And that makes a non-trivial amount of CPU and memory that is going to be spent for no added value.
- 43:56 Daniel Marbach
- And I have here the original version of the GenerateHash function. And that hash function, by the way, is the hash function that is also running behind the scenes and it has to be kept highly consistent across all the languages of the Azure SDK, and that's how it looked. And sometimes, copying of memory is not really, really obvious. You have to look very closely. And what we can see here, this GenerateHashCode function, it takes a partition key, right? And the partition key is input that comes from the outside. So someone can put in one letter and then the string is pretty short. Someone can put in my name, which, because I have Swiss German roots, is already quite long, right? So, Daniel Marbach. So that's going to mean it's going to allocate probably in relation to that string that I'm inputting. So what we can say is that essentially, depending on the input, this method will probably allocate various amounts of memory.
- 44:50 Daniel Marbach
- How do we know that? Well, if you look closer, we have this GetBytes method, and GetBytes already kind of hints at what is going to happen: bytes will be returned. And because we use that method, we can see on Line 11 that it returns a byte array. Because it returns a byte array, there cannot be any pooling involved, because the .NET runtime doesn't know what you're going to do with this byte array.
- 45:15 Daniel Marbach
- Maybe you assign it to a static field or maybe you're going to call it every time this method is going to be called. Here, it's going to be allocated every time this method is called, and as I told you, it's 30 to 40% on the hot path. Whenever we publish, we are going to essentially allocate a byte array that is in relation to the input that came from the outside, from the partition key, and that is going to be a ton of memory.
- 45:38 Daniel Marbach
- Let's have a look how we can actually optimize this. And here, I have the full version. I'm going to zoom in for your better understanding. So the first thing that we do is we want to use the Span-based overloads that I showed you in the rule, use the Span-based overloads. So we're going to turn this partition key that we get from the outside, we turn it into a Span. And then, we are going to apply a technique that I call over-renting. And what we are going to do is we are going to use GetMaxByteCount. So there is a GetByteCount which gives you an exact length of the string, but we're not going to use that because we are not really interested in the exact length of the string. We just want an approximation. That's already faster than essentially having to go through all the chars in the string to figure out how long it is. We can actually just say, "Hey, tell me approximately how long it is." It's a simple math method that is going to take the length of the string and multiply it by four. And then, we have that.
- 46:39 Daniel Marbach
- And then, what we're going to do is we are going to combine the two rules, the stack allocation. For small and local buffers, we use stackalloc. For bigger ones, we're going to use the ArrayPool. So what we're doing is we define an arbitrary stack limit. Here, I've chosen 256 because it's a safe value. We could also have chosen 512, but definitely something below 1 megabyte.
- 47:05 Daniel Marbach
- And then, we're going to say, well if the length is smaller than our stack limit, we are going to stack allocate a memory on the stack of the maximum stack limit. So I'm over-renting, right? Basically, I'm saying, "Hey, give me more," and I'm going to further reiterate on this why this is actually faster than allocating a specific chunk of memory that has the exact size.
- 47:27 Daniel Marbach
- And then, if we're not within that sort of stack limit, we're going to use the ArrayPool. So we're getting essentially memory from the shared ArrayPool. And then, what we are going to do is we're going to use a method that is also called GetBytes, but this time we are essentially giving it the buffer that we allocated. The hash buffer, we give it from the outside. So we use the Span-based overload that accepts a span of chars. We pass in the partition key. We pass in the buffer. We're telling it, "Hey, I'm owning this memory. Please use this memory." We give it to the method. And then the method, which is pretty cool, will essentially fill it in. Plus, it will also tell us how many bytes it has written. And now, we know we have enough space to actually write into. We over-rented it, and then it writes into it, and we might have 256 bytes, right?
- 48:20 Daniel Marbach
- But it only wrote 16 bytes and it will tell us, "Hey, I wrote 16 bytes." And then, what you can do is we can use the Span-slicing mechanism, which is basically drawing in the curtain. We can tell it, "Hey, slice it down to essentially from 0 to what you have written," that's the specific example, 16 bytes. And then, we pass that to the ComputeHash method. And then, the ComputeHash method is in the safe zone of only being able to access that 16 byte memory in the previous example or only in the sliced memory. So that's the important part.
- 48:56 Daniel Marbach
- And then, if we were in the case where we actually got a shared buffer, what we then have to do is, once we are done, we return the shared buffer. And you might be thinking, "But you should probably be using try-finally here, because you always need to return the buffer when you used it." Well, we looked at the error cases in this piece of code and we concluded we couldn't come up with an error case where we would actually need a try-finally. Because this is performance-sensitive code, we also took into account that a try-finally will actually add additional method overhead and will make the method bigger. So we actually avoided the try-finally here to increase the performance even more.
- 49:35 Daniel Marbach
- And by the way, the documentation on the ArrayPool also says you don't have to necessarily always return the memories that you rented from the pool. So that's an additional trade-off that you can take into account. So what I'm demonstrating here is when you do these types of performance optimizations, you also have to essentially build up a deeper understanding of the tools and libraries that you are using in order to really benefit from these optimizations.
- 50:00 Daniel Marbach
- Good. And then, again, we pass it to the ComputeHash method as a ReadonlySpan. And because it's not returning a Task, we can actually use ReadonlySpan. Good. And the last thing is, I told you I'm over-renting, and what is pretty cool is, because I only get an approximation with GetMaxByteCount, which is faster because it's essentially O(1) instead of O(n), I need to over-rent, but then I can apply this trick and it's on Line 1. If you noticed on the slides, we have the SkipLocalsInit attribute. What we can do with SkipLocalsInit is essentially tell the compiler to skip the locals initialization. Usually, the .NET runtime tries to be safe by default. And what it does is, when you ask it for a chunk of memory, it's going to clear that chunk of memory for you, so that you have no garbage in there.
- 51:00 Daniel Marbach
- But because we... Sorry. That clearing takes a number of CPU cycles, right? And because we know that we are in a very controlled environment where we know exactly how much we are going to write into that byte array, it doesn't really matter when we get 256 bytes and those 256 bytes are basically junk, junk, junk, right? Because when we fill in 16 bytes, we know that we have written 16 bytes, and the 16 bytes that are in there are valid and the rest is junk. Because we are going to slice it down and essentially focus on the specific memory and we only read that, we are in the safe zone. So we can essentially avoid this locals init by giving this attribute to the method and then it's going to be extremely fast.
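A hedged sketch that pulls these pieces together: over-renting via GetMaxByteCount, stackalloc for small inputs, the ArrayPool for large ones, the Span-based GetBytes overload, slicing to what was written, and SkipLocalsInit. The hashing step is a placeholder, not the SDK's actual partition-key hash, and SkipLocalsInit requires unsafe code to be allowed in the project:

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
using System.Text;

public static class PartitionKeyHasherSketch
{
    private const int MaxStackLimit = 256;

    [SkipLocalsInit] // skip zeroing the stack buffer; needs <AllowUnsafeBlocks> in the project
    public static int GenerateHashCode(string partitionKey)
    {
        ReadOnlySpan<char> partitionKeySpan = partitionKey;

        // O(1) upper bound instead of an O(n) exact count; we deliberately over-rent.
        int maxByteCount = Encoding.UTF8.GetMaxByteCount(partitionKeySpan.Length);

        byte[] sharedBuffer = null;
        Span<byte> hashBuffer = maxByteCount <= MaxStackLimit
            ? stackalloc byte[MaxStackLimit]
            : (sharedBuffer = ArrayPool<byte>.Shared.Rent(maxByteCount));

        // The Span-based overload writes into our buffer and reports how many bytes it wrote.
        int written = Encoding.UTF8.GetBytes(partitionKeySpan, hashBuffer);

        // Pull the curtain: only the bytes that were actually written are handed on.
        int hash = ComputeHash(hashBuffer.Slice(0, written));

        if (sharedBuffer != null)
        {
            ArrayPool<byte>.Shared.Return(sharedBuffer);
        }

        return hash;
    }

    private static int ComputeHash(ReadOnlySpan<byte> bytes)
    {
        // Placeholder hash; the real one must stay consistent across all Azure SDK languages.
        var hashCode = new HashCode();
        foreach (byte b in bytes)
        {
            hashCode.Add(b);
        }
        return hashCode.ToHashCode();
    }
}
```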
- 51:46 Daniel Marbach
- So let's have a look at whether we are actually really faster. And of course, we have to think about various input sizes to be able to compare it. And we also want to make sure that we are actually around the 256-byte boundary as well. I chose a few real-life examples of partition keys, fed them into this method, and then we can see that we now have a 38 to 47% throughput improvement and all the allocations are gone in this method.
- 52:18 Daniel Marbach
- Okay. So as a quick recap, these are the rules. Look for Stream and Byte-Array usages that are copied or manipulated without using Span or Memory. Replace existing data manipulation methods with newer Span or Memory-based variants. And I haven't talked about this one, but watch out for immutable/readonly data that is copied. When the data is immutable/readonly, you shouldn't be copying it around, right? Because that's an easy way to essentially gain throughput.
- 52:45 Daniel Marbach
- Cool. If you want to take a picture sort of as a reference point, I have this gigantic slide with all the rules that I have shown you today so that you can take it away. I also have a link to my slides towards the end of the talk if you're interested. And this is the second part of the rules that I just talked about and let's go to wrap this thing up.
- 53:13 Daniel Marbach
- So I want to talk a little bit about the caveats that I mentioned at the beginning. So do not go and try to apply these rules everywhere, like I said at the beginning. So when you have expensive I/O-bound things in your code like database calls, entity framework, lazy loading, and stuff like that, or database queries that take hundreds of milliseconds, because you haven't optimized a query or HTTP client stuff that is slow as hell, right? Then, tweak those expensive I/O-bound paths first before you even go to think about optimizing your LINQ code and your array allocation stuff and whatnot. Because then, you're going to have 10x, 100x improvements by optimizing expensive I/O-bound paths.
- 54:08 Daniel Marbach
- But once you are done optimizing those things and your code is executed at scale, then you can actually combine the rules and the practices that I've shown here today with the already tweaked code path. And then, you can get amazing benefits out of it. And sometimes, what I... I also talked with Aaron recently about this. I'm a believer that when you have a piece of code, first try to optimize the existing piece of code until you reach, essentially, the boundaries of you can no longer optimize this piece of code.
- 54:41 Daniel Marbach
- Because while you're doing this exercise, you learn a ton about the assumptions and all the things that are in place for this piece of code. And that, what you learn, will also significantly influence how you are going to redesign later on this piece of code. When you actually find out, "Well, we have tweaked it, but it's still lighting up. It's still going to be a performance hog. We need to further optimize it." You can take all these key learnings from these optimizations and feed that into your ideas how to redesign this code. So I think this is a hugely valuable exercise.
- 55:14 Daniel Marbach
- And just to give you a perspective of all the performance optimizations that I've contributed to the Azure Service Bus SDK here, with the Event Hubs and partition key optimizations and other optimizations I have done, they're now adding up to 8% throughput improvements in terms of publishing performance, 8%. And up to 3 to 4% in terms of receiving improvements with Event Hubs, right? So who wouldn't want to have an 8% faster car for free? Because once you update, essentially, to the Azure Service Bus SDK versions that have these optimizations, you get an 8% faster car, which is pretty amazing. Except, I'm only allowed to drive a Kia Ceed, so it was a difficult conversation with my wife.
- 56:04 Daniel Marbach
- But anyway. What I want to drive home is, essentially, it's really important for us as engineers, coders, whatever you want to call yourself, that we essentially pay close attention to the assumptions of the piece of code. Figure out whether it's executed at scale, figure out whether it's worth investing more time, essentially, to optimize it. And if the answer is no, then leave it as is. Leave your beloved LINQ code in there and all the allocations, because it really doesn't matter. But where it does, you can apply all the tricks that I have showed you today to the code and make it even faster.
- 56:40 Daniel Marbach
- Cool. That's it. I have here a QR code. You can scan the QR codes. You can go up to the GitHub repository. That's github.com/danielmarbach/PerformanceTricksAzureSDK. I wish you a great rest of your day. I will also be at the Particular booth in the exhibition area for today if you have any more questions. I also have business cards here if you want to send me an email and send me some suggestions how I can make this talk better. Or if you have questions, feel free to reach out.
- 57:22 Daniel Marbach
- And I think I can also take one question right now. I have two more minutes. Any questions? No? One.
- 57:32 Attendee
- How much reusability can you get with these performance optimizations? Or is this very unique to each scenario, or-
- 57:39 Daniel Marbach
- Okay. So the question is how much reusability, I guess about these patterns and stuff like that, you can leverage when you do these types of performance improvements. For me, the reusability is essentially the patterns that I talked about here, right? But the actual application of, "Can I use ArrayPooling? Should I use stackalloc?", and these types of things, that is specific, essentially, to a piece of code. But in terms of the patterns and the practices, there is a lot of reusability. And of course, one form of reusability is if you have a shared library, you can actually put infrastructure code that solves some of these problems into those shared libraries. Or in our case, when you have something like NServiceBus, right? Essentially, those optimizations are in the shared library and then everyone that uses the library gets those optimizations for free. So there is re-use potential. Yeah. Cool. Thank you very much and have a great rest of your day.