The science of queues: Performance monitoring theme parks (and distributed systems)
About this video
This session was presented at DevConf South Africa.
Performance monitoring is an important part of running a successful theme park. Like a distributed system, theme parks have separate components (attractions). Each has a queue in front of it. How can we find out which of them are the least efficient? Which ones are slowing us down? Where should we spend time optimizing?
Join Mike for a roller-coaster ride through distributed system performance monitoring. Find out which measurements tell you the most about your system and how to optimize it. As an added bonus, you’ll learn how to run a successful theme park!
Transcription
- 00:25 Audience
- Hello!
- 00:30 Mike Minutillo
- My name is Mike, I'm from Perth, Western Australia, and I work from home for a company called Particular Software. We build tools that help you to build distributed systems with confidence. But the story I want to tell you about today is from a time before I was working at Particular Software. It was actually the very first software system that I was the primary technical team lead for. Actually I was team lead, solution architect, business analyst, a little bit of project manager. If you wanted to play job title bingo with my business card, you absolutely could have. We were working for an insurance company, and I don't want to date myself too much, but the system was a three tier system, because that was the style at the time. So we had a web tier, an application tier and a database tier, and we'd been working on the initial version of the project for months.
- 01:21 Mike Minutillo
- We'd come up with our first version. We'd come in on a weekend, it was actually a long weekend, just in case something went horribly wrong. We installed everything, we got everything working, and when we were happy, we had a celebratory drink and we all went home. And then when we came back to work on the Monday, the business was actually incredibly pleased, which doesn't happen. But that was good. Because we'd managed to help them do their jobs in a way more closely aligned with their business processes, which is exactly what they'd hired us for in the first place. And so, after spending a couple of hours wandering around the floor, seeing how people were using the software, collecting some more requirements, we went back to our desks and started working on the next version. And everything was great for about two weeks. On the third week, I was working at my desk and the phone rang and I picked it up. A guy said, "Yeah, the website's not working. I can't do my job."
- 02:12 Mike Minutillo
- I said, "Okay. Well, go and make a cup of tea. I'll look into it and if I can come up with a solution, by the time you've finished having a cup of tea, everything should be working again." I hung up the phone, it rang again immediately, "The website's not working, I can't do my job." And then again, and then again, and then again, until I left the phone off the hook, and went to actually go and investigate what the problem was. And what we discovered is that, at the database server, the hard drive was full. Another application had come in over the weekend, they'd installed their own software that happened to be reusing the same database server as us. And obviously they hadn't done sufficient capacity planning. And what happens when a database server can't write to the disk anymore? It stops accepting incoming connections. And in an N-tier system when one tier goes down, the next tier goes down. And when that tier goes down, the next tier goes down.
- 03:01 Mike Minutillo
- We didn't have a terminology for this at the time, or at least I didn't, I was still a young solution architect. But this is an example of a problem called temporal coupling. Temporal coupling happens in distributed systems when you have two things that are physically separate, but they have to be up and running at the same moment in time, and they have to be able to communicate with each other in order to be able to complete some business function. And that's a problem. Because the more distributed your system becomes, the more likely it is that one of those things is not going to be true. Either the network will be down, or one of those things will be down, and the business process can't be fulfilled. So we kicked that other application off of our database server. We said, "Go and get your own database server."
- 03:39 Mike Minutillo
- And we got everything up and running again. It took about 90 minutes by the time we were finished, and then we all went home exhausted. The next day when I came in, I got called into the boss's office, which for a young solution architect was a frightening proposition. But she was very nice. She said, "Look, the business is quite pleased with what we've managed to do. They're not happy that it happened, but they're happy that we managed to get it fixed. But I'd like to be able to preempt this sort of thing from happening in the future." So we set up a small monitoring solution that was... I think this was a custom-built system tray app actually that ran on my machine. And every five seconds or so it would try to do something innocuous on each of the tiers, and tell me when something went wrong. So that the next time that the phone rang, I could actually say with confidence, "Yes, we know what this is, go and make a cup of tea."
- 04:24 Mike Minutillo
- And over the next two years, we kind of trained the business into this type of behavior. When something went wrong, they didn't bother calling us anymore, they just knew that we had their back. They knew that we would be looking into it already, and that we would sort it out, and they'd go have a stretch and feel good. Which is why it took me so long to discover the next time that temporal coupling came up. The reason it came up is because I happened to be having lunch with a guy by the name of Tim. That's not his real name, but it's a good name. And I finished eating, I think we were talking about cricket because he was really into cricket and I was really polite.
- 05:03 Mike Minutillo
- And I finished eating and I put my dishes in the sink and said, "All right, I'm going back to work." And Tim said, "I can't go back to work just yet. The system won't allow me to enter new policy details straight after lunch." And I made it all the way to the door and I'm like, "Wait a minute. I took the requirements for that screen, I designed that screen, I architected it. I actually did the implementation personally myself. There's nothing in there about lunchtime. Tim, humor me, let's go back to your desk and try this." And we did, and it didn't work. And we tried again and it didn't work. And we tried a third time and it did work. And I was confused. And Tim said, "Yeah, we just assumed you knew about it." "All right, I'll go back to my desk and have a look and see what I can discover."
- 05:45 Mike Minutillo
- And what I discovered is that, while a three tier system sounds like a good idea, it's really just three tiers. It's really just three things. Your distributed system ends up being a lot bigger than you would originally draw it on your diagram. See, when you enter new policy details into the system, we send an email, and that email goes via an SMTP server. And when we found this SMTP server, it was in this aging yellow beige box tucked into a corner of the service center. I believe the hardware, being charitable, was younger than me, maybe. I think it was probably running Windows NT Server or 2000, something like that. Anyway, somebody in the IT department had decided in their infinite wisdom that everything should be running a virus scanner. Everything in the service center should be running a virus scanner. Very laudable, I'm totally for this. But this machine couldn't handle it.
- 06:39 Mike Minutillo
- And when they configured it, rather than running at midnight, it was running at midday. And so for about 12 minutes after lunch, the SMTP server couldn't process requests. And we already saw what happens in an N-tier system. When a lower tier goes down, the next tier goes down, the next tier goes down, until the customer picks up the phone and complains at you. This time we did recognize this for what it was, temporal coupling. The SMTP server and our application tier had to be up at the same time. And when we investigated a bunch of different solutions, the one we came up with, the one we ended up using, was to put a rock-solid bit of infrastructure in between our application server and the SMTP server, in this case a queue. In this particular example it was MSMQ, but today you might use RabbitMQ, Azure Service Bus, Amazon SQS, anything with Q somewhere in the initials is probably one of these technologies.
- 07:28 Mike Minutillo
- So now we don't send an email directly to the SMTP server. The application tier, when it decides it wants to send an email, it writes the details of that email down, it puts it on the queue, and then some other process wakes up, reads it from the queue and passes it off to the SMTP server. So now if your SMTP server is down for 10 minutes, 20 minutes, the next 12 years, the application server won't even notice. Which is itself a problem. Because it means the first time you're going to know that you're not sending welcome emails, is when a customer happens to mention they didn't get the thing they weren't actually aware they were going to get in the first place. So, we added this to our monitoring solution as well, to make sure that this thing was up and running at any given moment.
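The store-and-forward pattern described here can be sketched in a few lines. This is a minimal illustration, not the original system: an in-memory `queue.Queue` stands in for MSMQ, a plain list stands in for the SMTP server, and all names are invented.

```python
import queue
import threading

email_queue = queue.Queue()
sent = []  # stands in for the SMTP server's outbox

def application_tier_send(to, subject, body):
    """The application tier writes the email down and enqueues it.
    No SMTP connection is needed, so the SMTP server can be down."""
    email_queue.put({"to": to, "subject": subject, "body": body})

def dispatcher():
    """A separate process (here, a thread) drains the queue and hands
    each message off to SMTP. A None sentinel shuts it down."""
    while True:
        message = email_queue.get()
        if message is None:
            break
        sent.append(message)  # real code would speak SMTP here
        email_queue.task_done()

worker = threading.Thread(target=dispatcher)
worker.start()
application_tier_send("tim@example.com", "Welcome!", "Your policy is active.")
email_queue.put(None)
worker.join()
```

Because the application tier only ever touches the queue, the dispatcher and the SMTP server can be unavailable without the sender noticing, which is exactly the temporal decoupling the talk is describing.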
- 08:06 Mike Minutillo
- But now we had a golden hammer. And suddenly our project had a lot of nails sticking out of it. PDF generation, credit card payment, any kind of web API call, anywhere where we had two things that had to be up and running at the same time, we were like, "Oh, we'll just stick a queue in between those things." Until ultimately we'd kind of got all the things around the edges and then we went, "Does the application server need to be a single thing, or can we break some of these business processes up?" And we ended up with something that looked more like this. Which was starting to look a bit more like a modern microservices architecture. We've got a bunch of different components all collaborating, they're passing messages to each other and doing their work as quickly as they can, and everything was working fine.
- 08:51 Mike Minutillo
- And now we're not coming in on long weekends to install new versions of anything, because we can take down two-thirds of the system, and the end users don't actually realize, don't even know. So we're doing it in the afternoon. We do a daily standup and go, "All the tests passed yesterday, let's just deploy a new version." This took probably close to four or five years to actually get to this point, where everything was in its own separate component. It took a while to get there, but it was well worth the effort. But, then my boss called me into her office again. And at this point we have a pretty healthy working relationship. I'm not as scared as I was that first time.
- 09:32 Mike Minutillo
- She sits me down and says, "The business is extremely happy with what we're doing. We're delivering features on time, on budget, and all the other lies that we tell people we're able to do. But, I'm hearing just rumblings from people that it's kind of slow. The system seems to be slowing down over time. The more policies and things we push through the system, it's not as snappy as it was on that first day when everyone was really happy. And I'd like to preempt this, just like we monitored for errors before things had happened in the system, and we're aware when things go wrong, we're aware of this, can we do something about it? I'd like to spend some time optimizing the system, making it faster. How do we go about doing that?"
- 10:19 Mike Minutillo
- And I closed my eyes, and I imagined in my head the system that we built. See, when this was a three tier system, I could run it all in Visual Studio under a profiler, and I could see which bits of it were slow. I could push buttons in the UI and I could measure everything and see. This, I have no idea. But I let out a long breath, and still with my eyes closed, I said, "In order to answer that question, I need you to send me to Disneyland." To her credit, she immediately arrived at the obvious question, "How is that going to help?" And I grabbed a whiteboard marker and I started scribbling on her whiteboard, because that's how I communicate.
- 10:57 Mike Minutillo
- And I said, "Look, in our system, we have a collection of components, they're all processing work as fast as they can. And in front of each one there's a queue which represents a backlog of work that that component is trying to get through. And when I go to Disneyland, I'm going to see a bunch of theme park attractions. They're all going to be processing customers as fast as they can. And in front of each one, there's going to be a queue representing a backlog of work for that theme park attraction to try and go through. And if anyone has figured out how to optimize distributed systems, it's going to be Disneyland. They have way more money than we do."
- 11:30 Mike Minutillo
- Now I mentioned, if you add up all the times, you'll figure out that I was on this project for about six years at this point. I've been there a while. In that time, I'd finished my education, I got married, I had three kids, I had a dog, I had a mortgage, I had way too many board games. I must have looked absolutely exhausted because it worked. And I got to go to Disneyland. And there's something about Disneyland. I know that that is a guy in a suit, it doesn't matter, because the rational part of my brain was just shoved out of my ear as my inner child leapt up and grabbed the reins. It was like, "It's Mickey Mouse. Come on kids, let's go find Chip and Dale."
- 12:09 Mike Minutillo
- And I got lost in the magic for a while. And I went back through my photos to try and find the exact moment when the horror kind of dawned on me, "Oh no, I have to go back to work and justify this." So I bid goodbye to my family, and I set about trying to understand the performance of a theme park. Now, from the outside this was a huge problem. And as engineers when we're faced with a big problem, we tend to break them down into smaller problems, find a solution, then scale that solution up to the bigger problem. So I thought, "I can't figure out the performance of an entire theme park. But what if I could figure out some way to measure the performance of a single theme park attraction? I'll do that for two of them and then I'll figure out which one I want to optimize based on what I've measured."
- 12:54 Mike Minutillo
- My first thought on how to do this is to measure throughput. How many customers ride an attraction in a given window of time? Measuring throughput sounds like a good thing. Low throughput is bad, high throughput is good. It seems like a reasonable performance metric. And measuring throughput is easy, you need two things. You need a stopwatch in one hand, you need a tally ticker in the other. You start your stopwatch, you stand at the back of the attraction. As people get off, you just click the tally ticker. At the end of whatever period you've defined, let's say an hour, you write down what the tally ticker says, reset both and go again. And if you do that for an entire day, you might see a graph that looks something like this. "Okay, I've got a performance metric for one theme park attraction. Let's do it for another one." Audience participation time. We're going to optimize one of these. Should I optimize the blue one?
- 13:42 Audience
- Yeah.
- 13:45 Mike Minutillo
- Yeah? The orange one?
- 13:46 Audience
- Blue, orange, orange.
- 13:49 Mike Minutillo
- Who thinks this is a trick question?
- 13:50 Audience
- (mixed answers)
- 13:51 Mike Minutillo
- Yeah, good, everyone's awake. It becomes immediately apparent when you start to look at these graphs and think about theme park attractions. The performance of a theme park attraction doesn't change during the day. This blue one didn't start crap in the morning, get better, dip a bit at lunchtime and then peak in the afternoon and then go down. What throughput is helping me to measure is not necessarily performance, it's demand. There's a theme park attraction at Disneyland called Splash Mountain. You're going to get wet if you ride Splash Mountain. The clue is in the name, Splash. On a cold wet day, fewer people are going to want to ride Splash Mountain than on a warm sunny day. And if I measure the throughput on each of those days, I'll see different results. Okay, so that's not a good performance metric.
- 14:37 Mike Minutillo
- Wait, there is one thing that can help to tell me something about the performance of a theme park attraction, and that's when throughput flatlines. Because when it flatlines, that might be an indication that the theme park attraction can't process any more customers than this. We would call that the maximum throughput. Maximum throughput is a performance metric, that's a very useful performance metric. Unfortunately, we can't tell just by measuring throughput whether or not we've hit it. Because the other time that throughput flatlines, is when you have constant demand. Now, other things will happen when you hit your maximum throughput, which would help us to identify this situation. But luckily, we don't need to measure maximum throughput, we can calculate it. To show you how, I'm going to talk about Goofy's Sky School.
- 15:25 Mike Minutillo
- Goofy's Sky School is a roller coaster, but it's a very specific type of roller coaster, it's called a wild mouse roller coaster. A wild mouse roller coaster has very short cars designed to make very tight turns and dips. So for the purposes of this example, I want you to imagine Goofy's Sky School as a roller coaster with a single track. On that track there is a single car, and on that car, there's one seat. So we can serve one customer at a time, we're going to finish serving that customer, and then we have to start the next one. So intuitively, if it takes two minutes for the car to go around Goofy's Sky School, in an hour, we can only do 30 rides. We can only service 30 customers in an hour. And that is a reasonable measurement of maximum throughput.
- 16:12 Mike Minutillo
- Measuring the ride duration is easy. You just take your stopwatch from before, start it when somebody gets on the ride, stop it when somebody gets off. You shouldn't go to Disneyland without a stopwatch is basically what I want you to take away from this thought, it's very important. Can we scale that up? What if we just put more seats on the car? If we have 30 rides an hour, and we can put 20 people on the car, simple math says that we can now service 600 customers an hour, and that gives us a full formula for maximum throughput. Maximum throughput is the period we're measuring divided by the duration of a single ride times the concurrency, how many customers we can service at the same time. Turns out these metrics are pretty stable. The duration of a theme park attraction ride doesn't change much throughout the day. And the concurrency, well, Disney is going to do their darnedest to make sure that as many people are stuffed onto those cars as they possibly can.
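The formula given here, the measurement period divided by the ride duration times the concurrency, can be written out as a small sketch (the function name is mine, not from the talk):

```python
def maximum_throughput(period_minutes, ride_duration_minutes, concurrency):
    """Maximum customers servable in the period:
    (period / ride duration) * concurrency."""
    return (period_minutes / ride_duration_minutes) * concurrency

# Goofy's Sky School: a 2-minute ride with one seat -> 30 customers an hour.
single_seat = maximum_throughput(60, 2, 1)    # 30.0
# The same ride with 20 seats on the car -> 600 customers an hour.
twenty_seats = maximum_throughput(60, 2, 20)  # 600.0
```

As the talk notes, both inputs are stable during the day, so this number only changes when you redesign the attraction.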
- 17:06 Mike Minutillo
- But, if we do want to maximize maximum throughput, which is an awkward sentence, but that's what it says, there's two ways of doing it. We can increase concurrency, we can reduce ride duration. Both of these are problematic for theme park attractions. There's only so many seats we can add to a car before it can't make those tight turns. There's only so many cars we can add to a track before we can't fit anymore. We could in theory add another track, but after a while it becomes logistically difficult to have that many tracks and to get customers to them. We could duplicate the entire Goofy's Sky School and drop Goofy's Sky School 2: Electric Boogaloo next to it. All of these are ways to solve the concurrency problem that have different costs at different levels. But they're not things that we necessarily want to be doing.
- 17:58 Mike Minutillo
- Similarly, we don't really want to reduce the ride duration. Because if it takes two minutes to go around the track of Goofy's Sky School, making it go for one minute makes it a less fun ride. In theory, for most of the theme park attractions, you want the ride to give you a good experience. So the ride duration is actually linked to the business value proposition that we're providing. Same for the concurrency. So, if you're building theme park attractions, which I assume some of you in the room are, that's why you're here, you'd probably want... You would invest an awful lot of time in making sure these numbers are right before you actually start constructing anything. And then you don't really want to alter them once you've built the thing.
- 18:39 Mike Minutillo
- So, we can calculate maximum throughput. Maximum throughput is an important performance metric, because it puts limits on how fast you can process customers. It puts limits on how well you can meet demand. So if you have a theme park attraction that can service 200 customers an hour, and you can improve it so that it runs at 250 customers an hour as a max, that's good, that's a performance improvement. That's something you should be thinking about. But you can't really use maximum throughput to compare two attractions. Remember back to the two graphs I had before, the blue one and the orange one? If one of them is running at 200 customers an hour at max throughput, and another one's running at 250 customers per hour at max throughput, and I ask you the same question of which one should we optimize, it's a trick question again. Because maximum throughput isn't normalized across all of those things.
- 19:29 Mike Minutillo
- They're each trying to provide a different user experience, a different value proposition at the end of the day. But we can normalize them, and we can include some element of demand, because that's the other part of it. It's no good looking at the one that has the lowest maximum throughput and saying we should optimize there, if the demand is nowhere near that maximum throughput. So we can normalize that by taking the throughput that we were measuring before, dividing it by the maximum throughput. That gives you a number called saturation. It's a number between zero and one. If you're at zero, your throughput is nowhere near your maximum throughput. Don't bother optimizing that, no one's using it. If it's close to one or at one, then that's an indication that this theme park attraction is pretty close to maximum throughput.
- 20:14 Mike Minutillo
- And if we spend some time optimizing it, we can drop the saturation down, which will give that theme park attraction more of a window before it starts to back up. Okay, saturation is a decent measure. Saturation is also a point in time measure. So what I did was, I took a bit of a histogram and I said, "Okay, I'm going to set a threshold, and for any hour that a theme park attraction goes above 80% saturation, I'm going to record that as a one. And then I'm going to graph those." And figured I would get a graph something like this, which would be very useful. Because then I could point at Space Mountain and say, "That spends a lot of time very close to saturation. The graph is really high. That's where I should spend the time looking at it."
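The saturation calculation and the 80% threshold histogram can be sketched like this. The hourly figures, the 600/hour maximum, and the names are invented for illustration:

```python
def saturation(throughput, max_throughput):
    """Measured throughput as a fraction of maximum throughput.
    0.0 means idle, 1.0 means running flat out."""
    return throughput / max_throughput

MAX_THROUGHPUT = 600  # hypothetical: customers per hour for this attraction

# One measured throughput per hour of operation.
hourly_throughput = [150, 510, 590, 600, 430]
hourly_saturation = [saturation(t, MAX_THROUGHPUT) for t in hourly_throughput]

# Record a 1 for every hour spent above 80% saturation, then total them up.
hours_near_saturation = sum(1 for s in hourly_saturation if s > 0.8)
```

With these made-up numbers three of the five hours are above the threshold, so this attraction would sit high on the histogram and be a candidate for optimization.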
- 20:59 Mike Minutillo
- I didn't see that at all. Anyone know what I saw? I saw maximum saturation across the park. And I was confused. Who was in Dell's talk last session in here? What do we do as engineers when we're confused? Ask a duck. So I know, there's a lot to be said in the industry about rubber ducking. I highly recommend Donald Ducking. Donald's a fantastic listener. He's not allowed to talk to you, so that helps. But through explaining myself to Donald, I managed to get the point across, and he explained through me what was going on. What happens if a theme park attraction runs at 100% saturation, if it's hitting its maximum throughput? Some of the people that want to ride don't get to. They stand in front of the ride. And then more people stand in front, and more people stand in front, and more people stand in front.
- 21:56 Mike Minutillo
- You will spend more time at Disneyland standing in queues than you will spend on theme park attractions. That's insane, but you will do it because it's fun. Theme park attractions are incredibly expensive pieces of equipment. They're expensive to run, they're expensive to design, they're expensive to maintain. They're maintained pretty much whether or not people are riding them. So anytime when they're not running at full capacity that represents waste, that's money that Disney is just throwing into a fire somewhere. If you're standing around in line for three hours to get on a theme park attraction, you might consider that a waste, but Disney already has your money at that point, so I don't think they are that concerned. But people queue up, and we end up with a queue. Is queue length a good performance metric, how many people are waiting to actually go on this ride? Is that a decent thing to measure?
- 22:49 Mike Minutillo
- It turns out queue length in an instant, counting the number of people in a queue is not that useful. But knowing how it's trending is incredibly useful. If queue length is trending down, that's an indication that you're able to keep up with demand. By the end of the day, this theme park attraction is going to have the fewest number of customers left unsatisfied, which is exactly what you want. If it's trending the other way and going up, that's bad. That's an indication that you're not able to keep up with the demand. By the end of the day in this case, you're going to have the most number of customers left to satisfy. And what do you do with them? Do you tell them all to go home? Do you run the equipment in overtime and pay staff to hang around?
- 23:32 Mike Minutillo
- Ideally, probably what you want to do is, you want to find a point earlier in the day where you can shut the gate and say, "Please stop joining this queue. You're not going to get to ride this attraction today anyway." You want to be really careful about picking this point. If you pick it too early, you'll have a time at the end of the day when you've run out of customers to service, and then if you reopen the queue, then you're back into the same problem that you had before of, how do I figure out what to do with that? If you do it too late, you're back to the initial problem. At the end of the day, you have the most number of customers left to service. So how do we pick that time to know when to shut the gate? To do that, we're going to use a measure called queue wait time.
- 24:16 Mike Minutillo
- That is, "If I was to join this queue, how long will it take me to get to the front?" It's a very useful metric. Measuring queue wait time is easy. Take your stopwatch that I know you've all brought with you to Disneyland, join the back of the queue, start the stopwatch. Spend the next interminable amount of time making it to the front of the queue, stop your stopwatch. That is an accurate measurement of queue wait time. It's also completely useless, because it's a measure of the queue in front of you, which is now gone, and it says nothing about the queue behind you, which could be very, very different. So measuring queue wait time isn't going to help you. And in fact, the longer the queue wait time is, the less useful it is, because the more time that the queue behind you has had to change, we can't easily measure it.
- 25:02 Mike Minutillo
- Luckily, we can estimate it. We can estimate it because we know how many people we can service concurrently. Let's say in this specific example, we can count back four people and say, that's one ride. Count back another four, that's the next ride. Count back another four, that's the next ride, and so on. We also know how long each ride pretty much takes, and so we can walk backwards along the queue and say, "That ride will be finished in two minutes. That one will be finished in four minutes," and so on. You don't even need to count heads. It turns out you can count people by volume, which is a weird thing to do. But if you know how many people fit in a space, you can walk back X number of meters and go, "That's probably about 50 people. That's probably about 100 people."
- 25:48 Mike Minutillo
- And using this kind of math, you can then hang a little sign on the wall that says, "From here it's going to be five minutes. From here it's going to be 15 minutes. From here it's going to be two and a half hours." There was really a sign I saw that said that. It's a weird thing to do, but it works. So we can estimate queue wait time using some pretty stable metrics. It's the queue length divided by our maximum throughput. Why maximum? Because if we weren't operating at maximum throughput, in theory, we shouldn't have a queue, because people would be riding the attraction. So, we want to minimize queue wait time, which is an important thing to do. Why? Queue wait time is a good measure of the responsiveness of a theme park attraction. Queue wait time is the time between deciding to ride a theme park attraction, and getting to.
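The estimate described here, queue length divided by maximum throughput, might look like this in code. The function name and the numbers are mine; expressing maximum throughput per minute keeps the answer in minutes:

```python
def estimated_wait_minutes(queue_length, ride_duration_minutes, concurrency):
    """Estimated queue wait time = queue length / maximum throughput."""
    max_throughput_per_minute = concurrency / ride_duration_minutes
    return queue_length / max_throughput_per_minute

# 100 people in line for a 2-minute ride with 20 seats: five rides of two
# minutes each to drain the queue, so roughly a 10-minute wait.
wait = estimated_wait_minutes(100, 2, 20)  # 10.0
```

This is the math behind the "from here it's five minutes" signs: every input is stable or cheap to eyeball, so the sign can be printed once and hung on the wall.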
- 26:36 Mike Minutillo
- That's what gives you that feeling that the thing is slow. It's that feeling of, "When I decide to do this, there is a long delay between when I actually get to, or when the thing is actually done." So let's try and minimize queue wait time. Looking at the formula, there's two different ways of doing it. We can try and minimize queue length or maximize maximum throughput. We already know how to go about maximizing max throughput. We know that side of the graph is very problematic for theme park attractions. Minimizing the queue length is easy. You just shut the gate and say, "Nope, don't join anymore," which is a thing you might actually do in a theme park. Because you could say, "We don't want any theme park attraction to have a line that's longer than two hours." Because then somebody that's coming to the park for an eight hour day, they're only going to get to ride four things and then they go home.
- 27:23 Mike Minutillo
- So we'll cut them off earlier than that. So then you may be unsatisfied, because you try and go to a theme park attraction and it says you can't go on, but then you just go and pick something else to ride. If you measure the average queue wait time across the park for a day, you might get a graph that looks something like this. That is an incredibly useful graph, because the thing with the largest estimated queue wait time, the largest average queue wait time, that's where you should start your optimization. And as soon as you optimize that thing, it'll be a bit like moving a lump in a carpet. Because people will be coming off of that ride faster, which means they're going to go to the other attractions, so their estimated queue wait time goes up. Then you optimize the next one, that one goes down, then all the rest come up. It's a constant battle to try and improve the situation.
- 28:16 Mike Minutillo
- With that, I felt like I had enough knowledge. I got back into the magic of Disneyland. And then exhausted, I went home. I tried to find this hat so I could recreate this incredibly glamorous look, but I have no idea what happened to it. When I went back to work, my boss asked me all the usual questions. "How was it? What was your favorite ride?" Space Mountain is my favorite ride, which is a roller coaster completely in the dark. It's inside a building so you can't see what's going on, and you're constantly being shifted in different directions, which is the experience that most closely matches being on an IT project.
- 28:53 Mike Minutillo
- Then with a twinkle in her eye she asked, "What did you learn about the performance of distributed systems?" I don't think she expected an answer, but I grabbed that whiteboard marker and started scribbling things down. I said, "Look, for each one of our components, we're going to measure: how long does it take to process each message in the queue? How many messages are we able to process concurrently out of that queue? And how many messages are we processing in a given window of time?" In this case, we picked a minute. Using those measurements, I'm going to calculate the maximum throughput for each component. And once we know the maximum throughput, we're going to get a feel for the saturation: we take the throughput and divide it by the max throughput. We're going to find the components operating at or near maximum saturation and optimize those, because those are going to be the future bottlenecks.
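The measurements described above can be sketched in a few lines. This is an illustrative sketch, not a real monitoring API; the function names and numbers are invented:

```python
# Sketch of the saturation calculation described in the talk.
# All names and numbers here are illustrative.

def max_throughput(concurrency: int, avg_duration_s: float) -> float:
    """Messages per second a component can handle at full concurrency."""
    return concurrency / avg_duration_s

def saturation(observed_throughput: float, maximum: float) -> float:
    """How close the component is to its ceiling (1.0 = saturated)."""
    return observed_throughput / maximum

# Example: 5 concurrent handlers, 0.5 s per message -> 10 msg/s ceiling.
ceiling = max_throughput(concurrency=5, avg_duration_s=0.5)
print(saturation(observed_throughput=8.0, maximum=ceiling))  # 0.8
```

Components printing values at or near 1.0 are the ones to watch.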
- 29:44 Mike Minutillo
- Once we'd found those, and we did find quite a few, some obviously running at 1.0 saturation, we're going to measure queue length. We take queue length and maximum throughput and calculate the queue wait time for each of these. We average those over a period of time, find the ones that are highest, and those are the things we're going to optimize. How are we going to optimize them? For a theme park attraction, the right-hand side of this graph was pretty much a no-go zone, and the left-hand side was the only lever we had to pull. For a software system, the left-hand side is pretty much a no-go. You can't shut the gate and say, "Please stop sending me credit card payment requests, because otherwise they'll take too long to process." The business won't accept that.
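Putting the pieces together: rank components by estimated queue wait time and optimize from the top. The component names and figures below are made up for illustration:

```python
# Estimated queue wait = queue length / max throughput.
# Rank components by it to pick optimization targets.
# Component names and numbers are invented for illustration.

components = {
    "charge-card": {"queue_length": 1200, "max_throughput": 10.0},  # msgs, msg/s
    "send-email":  {"queue_length": 300,  "max_throughput": 50.0},
    "render-pdf":  {"queue_length": 900,  "max_throughput": 3.0},
}

def queue_wait_seconds(c):
    return c["queue_length"] / c["max_throughput"]

for name in sorted(components, key=lambda n: -queue_wait_seconds(components[n])):
    print(f"{name}: ~{queue_wait_seconds(components[name]):.0f} s wait")
# render-pdf tops the list (~300 s), so that's where optimization starts.
```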
- 30:29 Mike Minutillo
- Not only that, there's no end of day for most modern software systems. It's not like you get to 5:00 and go, "Okay, no more credit card payments today. Off you go." There's always something going on. Amazon doesn't have downtime or night time. What about the other side? For a theme park attraction, the other side was inextricably linked to the business value proposition: concurrency was important to the business value, and duration was really important to it too. For software, that's not the case. Charging a credit card, sending an email, producing a PDF, whatever part of the business process you're doing, it's rarely going to be time-bound or concurrency-bound. There are exceptions.
- 31:11 Mike Minutillo
- That means these are the levers we get to pull most often. How do we go about doing that? What are some ways we can reduce the duration of processing an individual message? We could spend some cycles hyper-optimizing the code. That turns out to be an expensive engineering proposition, and it's error-prone; you're going to introduce bugs into the system. It also makes the system less maintainable, because the hyper-optimized version is frequently less readable for the next person who has to come along and add a new feature, and so they completely mess up your optimization. Usually the cheapest and quickest thing to do is just put it on a faster box. Put it on better hardware if you can. This only works if your workload is CPU-bound, if it's primarily using the CPU.
- 32:06 Mike Minutillo
- The other advantage this gives you is that with most modern hardware, rather than faster CPUs, you just get more CPUs. You can do more things concurrently, so you get concurrency out of that for free as well. There's another, easy way to get concurrency: run a separate copy of the component somewhere else. Remember when I talked about that for Goofy's Sky School, and said we could pick up Goofy's Sky School and put another copy next to it? That sounded insane, because the engineering costs were so high. In the software realm, the engineering costs are incredibly low. It's an xcopy operation: pick up a copy of the code, run it somewhere else. If you had a concurrency of five for one instance, then after adding another two instances your concurrency is suddenly 15, and your maximum throughput goes way up. Until it doesn't.
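The arithmetic of scaling out is simple: identical instances add their concurrency, which raises the throughput ceiling proportionally (until a shared resource gets in the way, as described next). A sketch with the numbers from the talk; the 0.5 s duration is an assumption:

```python
# Scaling out: identical copies of a component add their concurrency.
# 5 handlers per instance is from the talk; the duration is assumed.

per_instance_concurrency = 5
instances = 3                 # the original instance plus two more
avg_duration_s = 0.5          # assumed processing time per message

total_concurrency = per_instance_concurrency * instances
ceiling = total_concurrency / avg_duration_s  # max throughput, msg/s
print(total_concurrency, ceiling)  # 15 30.0
```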
- 32:55 Mike Minutillo
- The reason why it doesn't is usually that whatever you're doing inside that component is connected to some other resource which is limited. I'll give you an example. We had a component that was flatlining on throughput, which is how we learned that when throughput flatlines, it doesn't tell you what the maximum throughput is. We thought it was hitting maximum throughput, so we scaled it out, and its throughput dropped. And boy, did I need Donald Duck that day. With a little bit of investigation, it turned out that our original component was running into deadlocks. It was doing things in parallel that ended up touching the same data in the database. Scaling it out made it hit deadlocks faster, which slowed it down.
- 33:42 Mike Minutillo
- In that specific case, the only thing you can do is actually reduce concurrency to reduce the number of deadlocks you get. That's a good short-term solution, but it should be a massive red flag. Because if we can't get the duration any faster, and we can't increase concurrency, and we can't cut off queue length, then this is as fast as this component is ever going to get. It is as performant as it will ever be, and as the system grows, it's going to be a bottleneck. It's good to identify those early and fix them before you get called in at 2:00 a.m. on a Saturday. Anyway, we did a bunch of this stuff to our system. We identified the components with high estimated queue wait times, and we managed to increase concurrency and reduce duration on some of the more expensive operations.
- 34:31 Mike Minutillo
- And over time, because I'm a stats guy, the number of people saying "it's slow" dropped to basically negligible. It's never zero; there's always somebody complaining in the organization. Now, when you go back to work, you're going to get the obvious questions: How was it? Did you enjoy it? What was your favorite session? Did you learn anything about the performance of distributed systems? You might not get that question, but hey, who knows? You might be tempted to talk about responsive roller coasters and Donald Ducking. You'll sound like an insane person. It's amazing. I love coming back from conferences.
- 35:09 Mike Minutillo
- Resist the urge to explain anything. Close your eyes, take a deep breath, and say, "In order to answer that question, I need you to send me to Disneyland." It might work. And if it does, I want to hear about it. Speaking of which, please rate and review this talk. The conference organizers have flown me across two continents to get here, so it's vitally important that if you enjoyed this, if you got something out of it, you let me know. I think that's all I've got. Thank you very much, DevConf, you've been a wonderful audience.