How building a portable escape room made me a better developer
About this video
This session was presented at Copenhagen Developers Festival 2024.
I built a portable IoT escape room, and it changed how I think about development! In this session, we’ll uncover unexpected parallels between escape rooms and architecting resilient, scalable systems. Let me share the lessons learned, and how they translate into invaluable guidance for Domain-Driven Design (DDD), modularity, communication, scalability and fault tolerance. Doors will be unlocked at the end - even if the audience can’t solve the puzzles.
Transcription
- 00:04 William Brander
- Okay, good afternoon everyone. We made it to the end of the second day, Copenhagen Dev Fest. I was going to make a really bad joke about locking the door there and having to escape, but I think given the heat in the room, it's a bit unfair and cruel to do that. My name is William Brander and in this session we are going to be talking about how building a portable IoT escape room improved my skills as a developer. I work for a company called Particular Software. We make NServiceBus. Has anybody heard of NServiceBus? That's actually more than usual. So NServiceBus, if you haven't heard of it, is a library and framework for .NET developers to distribute their systems. I'm not talking about that today. I am instead talking about my absolute crippling addiction to escape rooms. I love escape rooms. Every time I travel somewhere, I try and find a group of people that I can go and do an escape room with.
- 00:57 William Brander
- They are for me, one of the most fun things to do. In fact, I've done an escape room on six out of the seven continents. There's just this pesky Antarctica at the bottom that I don't quite know what to do about. I've got two ideas though. Did anybody go to Richard Campbell's talk on nuclear power yesterday? That gave me some interesting spicy ideas. Maybe we could destroy Antarctica somehow.
- 01:20 William Brander
- The second idea that I've got is I can change careers completely, become a research scientist, get stationed in the Antarctic, and then as soon as I get there, take all of the keys and access control magnets I can find and throw them outside, thereby triggering a forced escape room with my new colleagues. I'm sure they're going to love it. Now, I'm South African, I come from Johannesburg, and South Africa does have a research station in the Antarctic, but I wouldn't do it there.
- 01:43 William Brander
- I would instead do it at the Norwegian research station because they call it the Troll station. And that just sounds like the best place to act like an idiot. Another fun fact about the Troll station: because it's in the Antarctic, time zones don't really make much sense there. They've got their own time standard called Troll Time, which again, just sounds like the most amazing place to have to work. Now you might be thinking to yourself, well, William, escape rooms are fun and all, but why would you build your own one? And that is a very good question because it was a lot more work than I expected. But the reason I built an escape room is because of these two... That's gone off. I pushed the wrong button. Because of these two. These are my daughters and I absolutely adore them. They are very, very cute.
- 02:25 William Brander
- And they know that I love doing escape rooms. So they asked me if they could come and do an escape room with me. And unfortunately, none of the escape rooms in Johannesburg are at a difficulty level that children will enjoy. So I thought to myself, well, maybe I can build something that they can do with their friends. And that's what I set out to do. This picture by the way, is taken in the most punny place ever. It's a maze that's built out of maize and they call it amazing and it is amazing. So it's a maze that the kids are walking through and then they find little clues and they have to solve puzzles along the way as well. So I already knew that they liked that sort of thing. Now, when I want to build an escape room for my kids, I obviously want to build one that I think I'm going to enjoy so that they will potentially enjoy it as well.
- 03:07 William Brander
- So I started thinking about what things I enjoyed out of escape rooms. And the first is the decor. Escape rooms have a very specific decor look. No-one would ever live in a room like this unless maybe you do, and if you do, please come and talk to me afterwards because you're probably one of the more interesting people in the world. But this is very clearly an escape room. Actually my hotel at the Axel Guldsmeden looks a little bit like this. It's weird. You can tell this is an escape room because there's padlocks all over the place and there's weird little pictures like on the right-hand side, this picture that says hell hath no fury like a woman scorned. Love the quote, fantastic quote, but I don't know anyone that's got that hanging up in their house, seriously.
- 03:47 William Brander
- Again, if you do let me know, I desperately want to meet you. So one of the things that I wanted for my escape room is I wanted a fun theme. I wanted something that the children would enjoy and would engage them. The next thing that I wanted is lots of different rooms. And because this was going to be portable, I didn't know that I was going to have access to lots of rooms. So if I couldn't get multiple rooms, I wanted lots of containers for the children to open up. For me, one of the more fun things is when you think you've solved the entire escape room, when you look at the clock and you think, huh, I've got 20 minutes to spare. I'm a genius. And you open a door and there's a whole new room there with a bunch of new puzzles and you realize, you've actually just taken the longest time possible to get through the first section. So I wanted to have lots of different multiple areas for the children to do the escape room.
- 04:31 William Brander
- I also wanted the puzzles to be stimulating, and I don't just mean visually stimulating, like these look nice, but I also wanted different puzzles that use perhaps their sense of smell, their sense of touch, their sense of hearing, things that make them use different senses while they're solving the puzzles. And then of course I wanted it to be finishable, and I don't think finishable is actually a word, but I'm going to claim it. And the thing is, you don't want it just to be finishable. You don't want to finish the escape room in 20 minutes because you haven't engaged properly with the puzzles yet. It hasn't stimulated you enough. So there's like a sweet spot I think of somewhere between 50 and 60 minutes for an escape room that just is really great for me.
- 05:09 William Brander
- If you can finish the last puzzle and you can see the clock ticking and it's 59:20, 59:21, 59:22, and you solve that last clue and everybody sprints to the front door to get out, I love that. So those are the things that sort of stand out to me that I enjoy about escape rooms. And that gave me my list of requirements. First was I wanted simple puzzles that the children would be able to understand. I didn't want to have to spend a lot of time explaining puzzles to children. Now, I don't mean simple and easy, those are different things. For instance, pushing a button is simple and easy, but if you put that button behind laser tripwires, it is still simple to push the button, but much harder. So I wanted the puzzles to be simple. I wanted the escape room to be finishable. Again, within that sort of 50 to 60 minute sweet spot.
- 05:58 William Brander
- And I also wanted different areas that they could open, so I could run it in my house for instance, or run it at one of their friends' houses and have different rooms that they have to open and explore or open lots of different chests and containers. I've got makeup bags in my house that are just purpose-built for this escape room now. And I also wanted all of the puzzles to have a sense of wonder. I wanted every time something happened for the kids to go, "Oh, wow, what is this? How does this work? What do I do with this?" Some optional requirements I had as well is I wanted replayability. I wasn't going to go through all of this effort to do an escape room and run it once, which also means I need to be able to reconfigure some of the puzzles in different ways so there's not the same thing for all the children every time they run through it.
- 06:40 William Brander
- Now, all of these requirements are really easy except for one, which is the finishable one. Because I don't know if you know children, but some of them are really, really smart and some of them, I don't want to say they're not smart. Maybe just as adults, we haven't realized in which ways they are smart yet. So coming up with an escape room that any random group of children of age between five and 10 will be able to finish within 60 minutes was quite challenging. And eventually what I ended up doing was I needed a way to control the difficulty of all the puzzles in the escape room while the players were playing. So if they were taking a really long time on the earlier puzzles, I wanted the difficulty to decrease so that the later puzzles were a bit easier for them and if they were going through it really quickly, make the later puzzles a little bit harder.
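The idea of adjusting difficulty while the players are playing can be sketched roughly as follows. This is only an illustration, not the speaker's implementation: the function name, the 1 to 5 difficulty scale, and the rule of projecting the finishing time from the average pace so far are all assumptions. Only the 50 to 60 minute target window comes from the talk.

```python
def adjust_difficulty(current: int, elapsed_min: float, puzzles_solved: int,
                      total_puzzles: int, lo: int = 1, hi: int = 5) -> int:
    """Return a new difficulty level based on the players' pace so far.

    Hypothetical rule: project the finishing time from the average time per
    solved puzzle and nudge the difficulty toward the 50-60 minute window.
    """
    if puzzles_solved == 0:
        return current  # No pace information yet, so leave it alone.

    projected_min = elapsed_min / puzzles_solved * total_puzzles
    if projected_min > 60:      # Running slow: make later puzzles easier.
        current -= 1
    elif projected_min < 50:    # Running fast: make later puzzles harder.
        current += 1
    return max(lo, min(hi, current))  # Clamp to the allowed range.
```

For example, a group that has solved 2 of 8 puzzles after 30 minutes is projected to need 120 minutes, so the level drops by one.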
- 07:31 William Brander
- Now, we've got a few more slides worth of context and then we're going to get into what this means for software development. So let's take a look at some of the puzzles that I built. There's a whole lot more. So if you want to talk to me about these afterwards, you're welcome to come grab me. I love talking about this stuff, but the ones that we're going to cover here are ones that are relevant to what we're doing about software development. So the first was a balance puzzle. So inside this, it's a little 3D-printed container with some LEDs, addressable LEDs, and there's a little gyroscope unit in there and you can use that gyroscope to work out exactly where this thing is pointing. So the children would either have to make it point perfectly level for a bit to unlock the next puzzle, or maybe they could point it at things around the room.
- 08:11 William Brander
- So you can change the difficulty of this by changing how close it needs to point at the target before it registers, yes, you've successfully pointed at the thing. How long it must stay on target, the number of targets it has to point at. I really like this puzzle because you can use it in lots of different ways. So if there's a combination in the room and the combination is 912, if you put the numbers one through to nine around the room, you can have them point it at them in a specific order. But you can also make it that they have to point it north, south, east, west. So you have a lot of flexibility with this. And then the last way is how they actually manipulate the puzzle. So if you're holding it in your hands, it's a lot easier to get that fine movement.
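Two of the difficulty knobs described here, how close the device must be to the target and how long it must stay there, can be sketched as a small check over orientation samples. This is a minimal sketch under assumptions: the sample format `(time_s, pitch_deg, roll_deg)`, the function name, and the "level" target are illustrative and not taken from the talk.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical sample format: (time in seconds, pitch in degrees, roll in degrees).
Sample = Tuple[float, float, float]


def held_level(samples: Iterable[Sample], tolerance_deg: float, hold_s: float) -> bool:
    """Return True once the device stays within tolerance of level for hold_s seconds."""
    on_target_since: Optional[float] = None
    for t, pitch, roll in samples:
        if abs(pitch) <= tolerance_deg and abs(roll) <= tolerance_deg:
            if on_target_since is None:
                on_target_since = t          # Just came on target.
            if t - on_target_since >= hold_s:
                return True                  # Held long enough.
        else:
            on_target_since = None           # Drifted off target: restart the clock.
    return False
```

A wider tolerance or a shorter hold time makes the puzzle easier; tightening either makes it harder without changing the puzzle itself.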
- 08:50 William Brander
- The children had to do it with strings attached to it. When I run it for older people, they do indirect manipulation. So the strings go through pulleys and there's levers to control some of the other axes. So you can make this quite difficult or substantially easier. The next puzzle of interest to us is something that I call the musical tone. So there's a little speaker in there and it plays a tune and then the keypad, each number on the keypad corresponds to a musical note. So they have to repeat the song that's played in order to unlock the puzzle. You can change the difficulty of this puzzle by changing the length of the song. So if it was Twinkle Twinkle Little Star, that's a lot easier than Beethoven's 5th, for instance. The randomness of the song, so whether it's a song that the people know or not, because if you know the song you can hum it after you've heard it once and you can kind of get a better sense of what the next part's going to be.
- 09:40 William Brander
- When I ran it for the children, we had three different levels of the song. So the one was Clair de lune for if they were going really quickly through all the puzzles because I knew there was no ways they were going to ever have heard Clair de lune before. Then there was a Tetris theme song to make it a little bit easier. And then there was the Super Mario Brothers theme song because I know my kids had at least heard that. So you can change the difficulty by changing the type of song that they have to play.
- 10:05 William Brander
- Another fun puzzle is Simon Says. This one is my absolute most favorite to watch the children do. There are multiple devices, I eventually made five of them. Each of the devices have got two buttons and two LEDs, and the LEDs flash across the devices in a specific pattern and the players have to repeat that pattern. This is amazing because children have no chill. If one person gets the pattern wrong, they all lay into him. It's like karma as a parent, it's great.
- 10:34 William Brander
- So we can change the difficulty of this by changing the number of devices involved. So there's five in total. If you reduce the number of devices, it's easier for the players to remember the order across the devices. You can also change the number of buttons involved. I don't mean physically remove the buttons, but if only the left LED flashes, it's easy for them to remember left on this one, then left on that one, then left on this one rather than also remembering the device and the side. And then again, the length of the password. So these are the puzzles that are kind of most interesting to us today. But once you've solved the puzzles, you also need a way to dispense the next clue.
- 11:09 William Brander
- So let's take a look at some examples of that. Easiest is just a mag lock and a box. Put this up somewhere high. As the mag lock opens, something drops out and makes a bit of a noise and the children can find it. Very basic. Treasure chest that opens with the little servo motor. Same principle. Pro tip with this though: make sure you've got a decoy one sitting somewhere that you can use to explain, "Don't force this open," because as soon as they see it, they will just rip it open and strip all of the servo motor gears. I've had to replace it many times already. Another one that was quite fun, especially for young boys, is a sword. There's a little mag lock on the side of the sword that keeps it stuck in a sheath. You can write whatever you want using a dry erase marker on the sword.
- 11:50 William Brander
- Once the mag lock disengages, they can pull it out and get whatever clue they want there. But by far my absolute favorite way to dispense the next clue is... Well, my wife is very generous. She's super kind. She says, my design aesthetic is somewhat prototypal. I think she's been way too kind to me because this just looks like shit. So this is a dispenser I like to call elastic bands and splinters. It's basically a catapult. There's a servo motor at the back at the bottom that keeps a string under tension. If the servo motor disengages, that string releases the elastic bands pull it forward and it makes the most amazingly loud sound. And this thing comes flying through the room.
- 12:33 William Brander
- If you can hide this where the children aren't expecting it and they don't know that it's there, it is amazing to watch their reactions, especially if you do it after, for instance, the musical tone puzzle because they're all standing around listening very quietly and all of a sudden this loud bang and something comes flying across the room. The screams are great. I like to think that what I'm teaching children in this moment is that making progress towards your goals is scary and you shouldn't do that.
- 13:03 William Brander
- All of these things have something in common and that's how I interact with them. So these are all based on ESP devices. This is an ESP8266, not really relevant, but some of them have WiFi, some have got WiFi and Bluetooth. There's a bunch of options that you can use. But ultimately what these are is these are little portable web servers. So you can power them either through a battery or through USB, depending on whether you need it to be portable or not. And then you can write some code, C++, Python, JavaScript. There's many options. And you can tell these little web servers what to do when they get a request on certain endpoints. So you can interact with them in that manner. So in this case we're saying when I get a request on, what is it, /inline, return a 200 text response that says this works as well.
- 13:50 William Brander
- The way that this actually works is that the library that you use in the C++ version has a server.handleClient method and this loop method gets called constantly. So it just checks, hey, do I have someone that's waiting for an HTTP response? If I do, I can give that response, otherwise do something else. The problem is this is single-threaded and your devices aren't always just sitting around waiting for an HTTP request. So in the happy path, perhaps they are. In the happy path, you've got your device and it gets an HTTP request and it goes and says, oh, okay, here's your response and everything's great. But maybe instead of just sitting around waiting, you've got a device that's calculating where it's pointing in three-dimensional space like we do with the balance puzzle. In that case that uses a chip called an MPU-6050. Some of these have got digital motion processing on the chip, some of them don't. You don't always know which one you've bought until you try it.
- 14:48 William Brander
- If you've bought the one that doesn't have digital motion processing, you have to do all of the smoothing calculations yourself so it's not jittering around thinking that it's pointing at mark three somewhere else. And that can take a bit of time. So if it's busy calculating the angle that it's pointing at and it's taken some time to process that, when an HTTP request comes in, that server.handleClient loop isn't being called. And it could take long enough to calculate the angle that eventually when it's done, that HTTP request has timed out. So in this case, we've got temporal coupling between our devices and what they're doing. Now, there's a better way to do this, and that is if you've got your devices, they're busy calculating and doing their things. They take as long as they want and they work out whatever they need to work out.
- 15:35 William Brander
- But after they're done and when they've got some time to breathe, they go and say, well, what do I need to do next? Is somebody waiting for a response from me? And you can store what am I doing next in a database or something like that or a file that you can access. But because I work for Particular, we do a lot of queuing technology. So for me, a natural fit was to go with MQTT, which stands for Message Queuing Telemetry Transport. It is also a blatant lie. We'll get to why. MQTT is nice for these devices because it's a queuing technology, which it isn't, but it's lightweight. So a lot of these devices work really well with it. It's reliable. You can specify three levels of reliability: at least once, at most once, and exactly once message delivery, which is very nice for a lightweight queuing technology like this.
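The "finish the work, then ask what to do next" pattern can be sketched without any hardware. In this minimal sketch a standard-library `queue.Queue` stands in for the broker, and `run_device`, the log format, and the command strings are hypothetical names chosen for illustration. The point is that the sender never waits on the device, so slow work on the device cannot time out a request.

```python
import queue
from typing import List


def run_device(inbox: queue.Queue, work_steps: int) -> List[str]:
    """Simulate a single-threaded device loop that polls for commands after each unit of work."""
    log: List[str] = []
    for step in range(work_steps):
        # Long-running work (for example smoothing gyroscope readings)
        # runs to completion without being interrupted.
        log.append(f"work {step}")

        # Only now does the device check whether anything arrived meanwhile.
        while True:
            try:
                command = inbox.get_nowait()
            except queue.Empty:
                break
            log.append(f"handled {command}")
    return log
```

Commands published while the device is busy simply wait in the inbox and are handled on the next pass through the loop.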
- 16:24 William Brander
- It's also standards-based. So if you're using the MQTT protocol, you should be able to switch from, for instance, an on-premise instance of it to Azure Event Hubs or something like that, and it should just work in theory. The downsides are it's not actually a queue. It's got queue right in the name, not a queue. Don't let that fool you. The queue length of an MQTT topic can either be zero or one, which means you publish a message to a topic, it sits there, and then you publish a second message to the topic and your first message has disappeared. Someone needs to explain to me how that is exactly-once delivery because that's not exactly-once delivery as far as I'm concerned. So it's not actually a queue, even though it's got queue in its actual name. It's also got a slightly smaller feature set than a more traditional broker. And we'll sort of see how to work around that a little bit later.
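The "zero or one" behavior described here can be illustrated with a toy model in which each topic keeps only the most recently published message. This is a sketch of the behavior as described in the talk, not a model of a full MQTT broker, and the class and method names are made up for the example.

```python
from typing import Dict, Optional


class LastValueTopics:
    """Toy model: each topic holds at most one message, the latest one published."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def publish(self, topic: str, payload: str) -> None:
        # A second publish overwrites the first; nothing accumulates.
        self._last[topic] = payload

    def depth(self, topic: str) -> int:
        # The "queue length" is therefore only ever 0 or 1.
        return 1 if topic in self._last else 0

    def read(self, topic: str) -> Optional[str]:
        return self._last.get(topic)
```

Publishing `difficulty=3` and then `difficulty=5` to the same topic leaves a depth of 1 and only `difficulty=5` readable, which is why this does not behave like a queue.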
- 17:19 William Brander
- So we've got all of these devices and they're communicating via MQTT. The instance of MQTT I used was Mosquitto because I could run it on a Raspberry Pi and I could take the Pi and everything with me and have a network set up with the broker and everything's good to go. That's all the context we need. Now we're going to take a look at the things that I learned from this experience and how we can apply these to software development in general. So we're going to take a look at how I monitored the system. We're going to take a look at what I learned about aggregate roots while developing this. We're going to take a look at how you can manage the escape room flow with these devices and then how you can work around the limitations of specifically MQTT and other broker technology if you're using this.
- 18:02 William Brander
- So when it comes to monitoring these devices, I started off with thinking, okay, I need to know when something fails because these devices, they aren't always reliable. And what happens with these devices is when they start up, you can go and check and ask, well, what is the reason that you are starting up? Are you starting up because you lost power? Are you starting up because of a hardware fault, because of an exception, because of a watchdog timer? Whatever the reason is, you have to go and check that. And then I thought, well, I can also add health check APIs so that I can know whether my devices are online or not. So the health check APIs would just be a little HTTP endpoint, make a request to there and it'll give you a 200 okay response. Yay, my device is working. Looking good so far.
- 18:47 William Brander
- So then I also thought, well, because MQTT has weird queue length things, maybe I can also monitor the queue length of some of these topics because if a queue length stays one for a really long time, it means something isn't taking messages off of that queue. But at this point, it kind of started to sound a little bit weird to me. So I said to myself, self, we've done this before. This is not how you monitor proper systems. This is how you monitor systems from the early 2000s where maybe you had a few components in your system, a web server, database server, and you only had to ask three questions. Is the web server up? Is my database server up? And can the web server talk to the database server? When you've got only a small number of components involved in your system, this is sufficient.
- 19:33 William Brander
- But when you've got a lot of moving parts, this doesn't really help. And working at Particular, I spend a lot of time thinking about how to monitor distributed systems because when you monitor distributed systems, you get a lot of information and not all of it is useful. And there's a tool you can use to decide whether or not what you're monitoring is actually useful. So when you monitor distributed systems, there are three attributes that are important. The first attribute is the area that you're monitoring. So are you monitoring the infrastructure of your system? Are you monitoring the application of your system or are you monitoring the capability of your system? What does this mean? So some examples in the case of my escape room, if I ask the question, are my devices online, that's me monitoring the infrastructure, the physical hardware, are my devices working, and do they have IP addresses that I can access?
- 20:28 William Brander
- Are my devices responding to publishers is looking at the application level because it's not just up, it's now actively taking messages from a queue and doing something based off of those messages. And asking can my players solve the puzzle is asking questions about the capability that my puzzle is exposing. So if the device is up, it's responding to publishers, but they can't solve the puzzle because something else is wrong. Maybe a button's broken. That's looking at a different angle, a different area when I'm monitoring the system. So that's monitoring area.
- 21:01 William Brander
- Next to that, we've also got the monitoring concern. So when you monitor distributed systems, you can also monitor the health of your system. Now there's a many-to-many relationship between the monitoring areas and the monitoring concerns. So you can monitor the health of your infrastructure, you can monitor the health of your application and the health of your capability.
- 21:22 William Brander
- After health, you've got performance. And again, you can monitor the performance of your application, infrastructure and capability. And then you've got capacity. Do I have enough room to add another 10 customers to my system? In the case of the escape room, the mag locks draw a certain amount of current, and the battery packs have a current limit that they put out. So if you've got a current sensor there, you can check, do I have enough capacity to turn another mag lock back on or not? So that's the monitoring area and the monitoring concern.
- 21:55 William Brander
- The third part that you look at when you're building monitoring systems for distributed systems is the interaction type. How do you interact with your monitoring system? Is it a passive interaction? Do you just go and look at a dashboard every now and then when something goes wrong? Or is it reactive? Does your monitoring system pick up potential things and send you a push notification to say, hey, something might be going on here, take a look?
- 22:21 William Brander
- The third option is you can do proactive monitoring where your monitoring system will actively try and repair whatever is going on. So maybe it'll look at the queue length of a specific topic, and if the queue length is constantly growing, growing, growing, growing, your proactive monitoring system might go and try to scale up another instance of your system. You see this quite often in Kubernetes things. So using these three attributes of monitoring, let's take a look at what I was monitoring in my escape room. So looking at the health check APIs, that was telling me about the health of my infrastructure, and it was passive. I was just looking at the screen every now and then when something was wrong. But if I was looking at the MQTT queue length, that's also looking at the health of my infrastructure passively.
- 23:10 William Brander
- So if I'm monitoring the same things, what benefit am I getting from doing both? And the power of this tool is that these types of things, when you've got this duplicate information or these duplicate attributes that you're monitoring, it's a flag that maybe there's something that you're wasting time monitoring so you can discard some of it. And in this case, the health check APIs were false positives because the C++ library I was using for the web server was more reliable than the library I was using for the MQTT protocol. So what was happening was it was responding 200 OK on the health checks, everything's fine, but the MQTT code hadn't connected to the broker. So it was a false positive on the health check. So for me, it was a better thing to monitor the queue length of the topics in this case.
- 23:55 William Brander
- If I was looking at the failures, so what would happen is the device would start up and it would check, well, what is the reason that I started up? Oh, it was a failure. I would then send a message to myself to say, hey, I've restarted because of this. So that was reactive monitoring of the health of my infrastructure. Another thing that I could have had was canary devices. So have a physical separate set of puzzles, and as my players are doing one puzzle, I quickly solve the next puzzle to make sure that the next thing's working. That would be proactive monitoring of the capability at the health level. So this is a tool that you can use to look at what you're monitoring and whether it's worth the effort of you putting the work in to monitor it. Because monitoring doesn't just happen. Especially in distributed systems, it has to be an intentional effort.
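The three attributes (area, concern, interaction type) lend themselves to a small data structure that flags checks landing in the same cell. The sketch below classifies the four escape room checks the way the talk does; the type names and the `find_overlaps` helper are assumptions made for illustration, not part of any monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Area(Enum):
    INFRASTRUCTURE = "infrastructure"
    APPLICATION = "application"
    CAPABILITY = "capability"


class Concern(Enum):
    HEALTH = "health"
    PERFORMANCE = "performance"
    CAPACITY = "capacity"


class Interaction(Enum):
    PASSIVE = "passive"
    REACTIVE = "reactive"
    PROACTIVE = "proactive"


@dataclass(frozen=True)
class Check:
    name: str
    area: Area
    concern: Concern
    interaction: Interaction


def find_overlaps(checks: List[Check]) -> List[List[str]]:
    """Group checks by (area, concern, interaction) and return groups with more than one member."""
    groups: Dict[Tuple[Area, Concern, Interaction], List[str]] = defaultdict(list)
    for check in checks:
        groups[(check.area, check.concern, check.interaction)].append(check.name)
    return [names for names in groups.values() if len(names) > 1]


# The escape room checks, classified as in the talk.
ESCAPE_ROOM_CHECKS = [
    Check("health check API", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.PASSIVE),
    Check("MQTT queue length", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.PASSIVE),
    Check("restart reason message", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.REACTIVE),
    Check("canary devices", Area.CAPABILITY, Concern.HEALTH, Interaction.PROACTIVE),
]
```

Calling `find_overlaps(ESCAPE_ROOM_CHECKS)` returns the health check API and the MQTT queue length as one group. An overlap is a prompt to look deeper, not an automatic reason to delete a check.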
- 24:41 William Brander
- You have to think very hard about what you're monitoring and how you're going to work with that. Let's take a look at some more real world examples because I don't think anybody has built an escape room, right? Anyone? Nope. Didn't think so. Some of us have to have real jobs. So if you look at the queue length of a system, that's often a good metric to look at, right? If the queue length keeps growing, that's a sign that something might be wrong. That's looking at the performance of your infrastructure most of the time, except in MQTT, because queue length is either zero or one.
- 25:13 William Brander
- But in most brokers, that is looking at the performance of your infrastructure. But queue length doesn't necessarily tell you whether anything's wrong, right? Is a queue length of five bad or is a queue length of 1,000 bad? That's not nearly enough information to make a call because if you've got five messages in your queue, but it takes five minutes to process each of those five, that's substantially worse than if you've got 1,000 messages in your queue and it takes you 100 milliseconds to process each of those messages.
- 25:44 William Brander
- Maybe there's another metric we can look at there instead of queue length, and that's something called processing time. How long does it take to process the message? So once the code picks the message up from the queue, how long does it take to finish processing that message and say, I'm done, ack, remove it from the queue? That's the processing time. So that looks at the performance of your application because now you're looking at the performance of your business logic. How long has it taken to update the customer's credit card information?
- 26:12 William Brander
- But again, that's not necessarily enough in a distributed system because now you know how long it takes to process a message, but maybe the queue length is important there. So there's another metric that we can look at called critical time. So critical time is the time that it takes from when you send a message to when it is successfully processed. So that includes all network time, that includes all the time sitting in the queue. That also includes all of the processing time. And that you can use to tell you about the capacity of your application. Because if your critical time for a message is five minutes, you know that as soon as you send this message, it's going to be roughly five minutes before it's successfully processed. And that because it includes both the queue length and the processing time is something meaningful.
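The arithmetic behind the queue length comparison is worth spelling out. The sketch below assumes a single consumer working through the backlog one message at a time and reads the slow case as five minutes per message; the function names are illustrative only.

```python
def drain_time_ms(queue_length: int, processing_time_ms: int) -> int:
    """Time for one consumer to work through the backlog, one message at a time."""
    return queue_length * processing_time_ms


def critical_time_ms(sent_at_ms: int, processed_at_ms: int) -> int:
    """Time from sending a message to it being successfully processed.

    This covers network time, time waiting in the queue, and processing time.
    """
    return processed_at_ms - sent_at_ms


# Five messages at five minutes each versus 1,000 messages at 100 ms each.
short_but_slow = drain_time_ms(5, 5 * 60 * 1000)   # 1,500,000 ms = 25 minutes
long_but_fast = drain_time_ms(1000, 100)           # 100,000 ms = 100 seconds
```

The shorter queue takes 15 times longer to drain, which is why queue length alone is a weak signal and a metric that combines waiting and processing time says more.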
- 26:59 William Brander
- If you use the queue length to trigger scaling, which we just said you shouldn't, you should use critical time to trigger scaling, that would be proactive monitoring of your performance of your infrastructure. So these three attributes are a tool you can use to identify what you're monitoring and what you're going to do with that monitoring information. It's not necessarily going to tell you if there are duplicates, don't do it. But if there are duplicates, that's a flag that you can go and look a bit deeper and see why you're monitoring duplicate things. So that was how I monitored the escape room.
- 27:31 William Brander
- Let's take a look at aggregate roots. So aggregate roots, as we should all know, are the boundaries of consistency within DDD. That's all fun. The musical tone puzzle is a good example of this I think because it's one physical puzzle, everything is contained physically in the thing. I could change the entire logic of how the puzzle works. I can change the entire image of the puzzle. The only thing that it needs to do is publish a message when the puzzle has been solved, and it only needs to subscribe to difficulty change messages.
- 28:01 William Brander
- So this for me is a nice example of a self-contained aggregate root that's easy to reason about. The boundary of consistency is a physical boundary as well. And the service boundary here is very well-defined. Not so for the Simon Says puzzle unfortunately. So what happens is you've got these devices and they're all talking to each other and they're all talking to the broker. And if you've got another process running and this process publishes a message to say, hey, the difficulty must now be five, where does that go? How do the devices, the individual devices know what that means for them?
- 28:37 William Brander
- And the reason that this doesn't work is because the boundary of consistency isn't around a device. It's around all three of the devices, or all five of them. So what I ended up doing was I had a concept of a primary device and secondary devices. So these devices would start up, they would create a peer-to-peer network between themselves that wouldn't use MQTT, they would do a TCP connection between themselves, and only the primary would subscribe to the MQTT topic of difficulty changes, which means that when something changed the difficulty, it knows that it only has to go to the primary device. These were also battery-powered. Guess what goes flat when kids take a really long time to solve these? So what I had to do was also if a device goes offline, say the primary device goes offline, one of the other secondary devices needs to become primary, create a new network between the remaining devices, subscribe and carry on with that, kids are great. But ultimately what this means is that the boundary of consistency is around all three of these devices.
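The failover idea can be sketched roughly as follows. The election rule used here (lowest device id among the devices still online) is an assumption for illustration only; the talk does not say how the real devices chose a new primary, and the device ids are made up.

```python
from typing import Dict, Optional


def elect_primary(devices: Dict[str, bool]) -> Optional[str]:
    """Pick the primary among devices that are still online.

    `devices` maps a device id to whether it is online. Choosing the
    lowest id is an assumed rule for this sketch.
    """
    online = sorted(device for device, is_up in devices.items() if is_up)
    return online[0] if online else None


devices = {"simon-1": True, "simon-2": True, "simon-3": True}
primary = elect_primary(devices)      # simon-1 subscribes to difficulty changes

devices["simon-1"] = False            # the primary's battery goes flat
new_primary = elect_primary(devices)  # simon-2 takes over the subscription
print(primary, new_primary)
```

Whichever device wins, the rest of the system still only ever talks to "the Simon Says puzzle" as one unit.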
- 29:40 William Brander
- And the concept of these devices includes different types of devices, a primary and a secondary, and it could be comprised of multiple things. So in the case of the primary Simon Says, the logic is all contained within the multiple devices, I can change the logic of how those devices control themselves, but they need to work together as a unit. The primary is the only one that has to publish when the puzzle is solved, and the primary is the only one that has to subscribe. So when the primary gets a message that says, hey, the difficulty is now four, it can go and tell the other secondary devices what it needs to do.
- 30:17 William Brander
- And in that case, the service boundary is better defined because it doesn't need to go outside of the boundary of consistency to find information. In the real world, TM, maybe you've got a stock service and a user service or an authorization service. This is something that I see fairly often where someone's trying to delete an entry from a stock system and the code says, well, does the user have permission to delete this record? And the user service says, yes, they do. Go ahead, delete.
- 30:50 William Brander
- This isn't a good aggregate root because the stock service has to go outside of its boundary to get information from something else before it can make a decision. So the logic is not self-contained in this case. The stock service needs input from the user service and the service boundary here is poorly defined. Another example, maybe you're working in the insurance space, one of my favorite sectors, you've got a claim service, you're processing a customer's claim, and the claim service says, well, before I can approve this claim, I need to know whether they've paid all of their premiums. So I'll go ask the premium service. "Hey, has the customer paid all of their premiums?" This is another example where you're crossing service boundaries because you have to go and ask something else before you can make a decision. You're not autonomous. If you're an e-commerce site and someone is ordering something and your order service has to ask the stock service, do we have five of these items in stock?
- 31:47 William Brander
- That is again, an example of where you're going across a service boundary and you're not autonomous on your own. Now, an option would be, well, if we need to cross the service boundary, why don't we just make our service boundary bigger, right? We'd include the stock service in the order service, and then everything's fine. Yay, productivity. We don't have to cross things. That could work. It probably won't, but it could. Another option that you could do, and the times when you can get this to work is absolutely amazing as well, but you can make your service boundaries even smaller. So instead of having a stock service, have a service that only does removal of stock items, that's the only thing it does. In this case, it won't have to go and ask a user service, does the user have permission to remove items? Because by virtue of the fact that the user has permissions to publish to that topic, it means that the user has the rights to be able to do that.
- 32:45 William Brander
- So the times when you can make your service boundaries even smaller and make your service boundaries better are amazing to work with. So in this case, if the service only does stock removal, it knows which users can access it. That becomes an infrastructure concern, not something you have to code. It doesn't need to ask, can the user delete this item? It knows. And the key thing here is that this is a verb instead of a noun. Stock service is a noun. It's describing a thing. You can pick up an item, you can pick up a bit of stock. But when you interact with stock, that more naturally models the way users interact with your system, which gives you a nicer way to think about your aggregate roots within the bounded context.
- 33:33 William Brander
- So if you can make it smaller, then it only needs to know what it needs to. It doesn't have to go and ask other services. Now, what if the business rule was, okay, you can delete stock items, but if it's above $5,000 in value, it needs to get approved. Okay. Same thing. You have a stock removal service and you have a second service that is called approve stock removal service. The only thing this has to do is get a request and say, oh, okay, show it on someone's dashboard. They click, yes, approve. By virtue of the fact that they can read messages from that queue and publish to that queue means that they've got permissions to do this. So again, we've got workflows here, but with smaller service boundaries so the boundaries of consistency can get shrunk down. And the key thing is that your service boundaries are business rules. They reflect how people interact with your system. They do things with your system. It's not the system.
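One way to picture the two small, verb-shaped services is the sketch below. The queue names, the in-memory `stock` dictionary, and the routing function are hypothetical; the only detail taken from the example is the $5,000 approval rule.

```python
APPROVAL_THRESHOLD = 5000  # dollars, from the business rule in the example

stock = {"widget": 10}  # made-up in-memory stock levels


def route_removal_request(item_value: float) -> str:
    """Return the (hypothetical) queue a removal request is sent to."""
    if item_value > APPROVAL_THRESHOLD:
        return "approve-stock-removal"
    return "stock-removal"


def handle_remove_stock(item: str, quantity: int) -> None:
    # No call to a user or permission service here: in this model, being
    # allowed to publish to the queue is the permission, which is handled
    # by the messaging infrastructure rather than by this code.
    stock[item] -= quantity


print(route_removal_request(120))    # stock-removal
print(route_removal_request(7500))   # approve-stock-removal
handle_remove_stock("widget", 2)
print(stock["widget"])               # 8
```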
- 34:29 William Brander
- You might be asking yourself, well, if your service boundaries are so small, how do you get any real logic in there? And we're going to get to that next. So looking at aggregate roots, identifying the boundaries of consistency is a good way to look at whether or not your service boundaries are poorly defined. But if you've got all of these devices and they're communicating via MQTT, but the service boundaries are really small, how do we do anything with difficulty changes? So when I started implementing this, I, of course only had one puzzle and one little clue dispenser, and what I would do is in the puzzle, start keeping track of how long it's been, and if it's been five minutes, make it a bit easier. That worked until I started adding more puzzles because now each of the individual puzzles that I add, they need to track their own time.
- 35:20 William Brander
- But then they need to know, well, has this other one been solved? Because if that one's been solved, I am on time, but maybe it hasn't. Oh, no. So there's a lot of interaction. They need to go and ask each other. And again, once you start asking each other for information, you've crossed service boundaries, which means your aggregate root's poorly defined.
- 35:43 William Brander
- And as you add more and more and more of these puzzles, that eventually gets uncontrollable. So what I eventually did was I added a concept of an escape room run. And the escape room run is a state machine. So the state machine in this case does three states or tracks three states within the system, within the escape room. As the players start, they are on schedule, they're on track to complete within the correct timeframe, they solve some puzzles and they stay on track. Time passes. Eventually enough time passes that they fall behind schedule. So now we're in a new state. We've transitioned from on schedule to behind schedule. Once you're behind schedule, at this point, the state machine can publish a message to say, decrease difficulty or set difficulty to two or something like that. That message goes out to all the devices. They just know, oh, difficulty two, that means I need to be only the left button and the password's only three digits long or whatever.
- 36:39 William Brander
- And they play some more. They solve enough puzzles that they eventually get back on track and they solve too many puzzles so they're ahead of schedule. Now the state has transitioned again, we're ahead of schedule. At this stage we publish more messages to say, make the rest a little bit harder. So you can use the state machine to transition between an easier and a more difficult escape room for the players. And they eventually finish. I didn't actually have to do all of the flows, but I did because I will twitch if I didn't. But eventually they finish the escape room and they can either finish from on schedule, ahead of schedule or behind schedule. Hopefully they finish on schedule. So the state machine or the escape room run is a state machine that does a few things. It keeps track of how long the players are spending across the entire escape room.
- 37:29 William Brander
- It also adjusts the difficulty based on the time. It can also change the entire flow of the escape room. So it can take devices out. Oh, these players are really struggling. That entire puzzle disappears. So as soon as they solve this one, both of these clues get dispensed. So the escape room run can track all of that and control the flow of the individual devices, but the devices themselves don't need to know about other devices. So I can add additional puzzles to my escape room, change the flow of it completely, and the puzzles themselves are still self-contained within their own thing. They only have to subscribe to difficulty change events. Ultimately, what this means is that the escape room run is an orchestral conductor. It's someone standing in front of an orchestra telling the players, I want more from the strings, less from the percussion. It's just controlling how the orchestra sounds.
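The escape room run described above could be sketched as a small state machine like this. The thresholds (being more than one puzzle off the expected pace), the difficulty levels, and the `difficulty/changed` topic are assumptions chosen for the example, not values from the talk.

```python
from enum import Enum


class Schedule(Enum):
    ON_SCHEDULE = "on schedule"
    BEHIND = "behind schedule"
    AHEAD = "ahead of schedule"


# Assumed mapping from state to difficulty level.
LEVELS = {Schedule.BEHIND: 2, Schedule.ON_SCHEDULE: 3, Schedule.AHEAD: 4}


class EscapeRoomRun:
    """Sketch of the run as a state machine; timings and levels are assumed."""

    def __init__(self, publish, total_puzzles: int, target_minutes: float) -> None:
        self._publish = publish  # callback: (topic, payload) -> None
        self._minutes_per_puzzle = target_minutes / total_puzzles
        self.state = Schedule.ON_SCHEDULE
        self.solved = 0

    def on_puzzle_solved(self, elapsed_minutes: float) -> None:
        self.solved += 1
        self._evaluate(elapsed_minutes)

    def on_tick(self, elapsed_minutes: float) -> None:
        self._evaluate(elapsed_minutes)

    def _evaluate(self, elapsed_minutes: float) -> None:
        expected = elapsed_minutes / self._minutes_per_puzzle
        if self.solved < expected - 1:
            new_state = Schedule.BEHIND
        elif self.solved > expected + 1:
            new_state = Schedule.AHEAD
        else:
            new_state = Schedule.ON_SCHEDULE
        if new_state is not self.state:
            # Devices only ever see this one message type.
            self.state = new_state
            self._publish("difficulty/changed", {"level": LEVELS[new_state]})


events = []
run = EscapeRoomRun(lambda topic, payload: events.append((topic, payload)),
                    total_puzzles=6, target_minutes=60)
run.on_tick(25)            # nothing solved after 25 minutes: behind schedule
run.on_puzzle_solved(26)
run.on_puzzle_solved(27)   # caught up: back on schedule
print(events)
```

Because the puzzles only react to the difficulty message, the pacing rules can change here without touching any device code.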
- 38:24 William Brander
- Now, conductors probably could play most of the instruments in the orchestra, but they're not. They're focused on keeping the whole unit cohesive and keeping the flow of this amazing piece of music. The violin player, all the violin player has to do is look at their sheet music, which is an algorithm for music. I quite like that. They're looking at their little algorithm for music, and all they have to do is listen to whether the conductor is telling them faster, slower, louder, softer, keeping pace with what they need to do. So they've got their bit of code that they're following, and the conductor is keeping track of the rest of the room, making sure everything else talks together in the way that it's supposed to. Now in code, if you want to implement this, there are three options that I know of.
- 39:08 William Brander
- There's the process manager pattern, which in .NET we'd typically call sagas. There's also the actor model, which is really good if you've got state transitions, explicit state transitions that you want to track, or you've got the routing slip pattern. So these are all really good ways that you can do this. I used the process manager pattern because I was using NServiceBus, which has sagas built into it. And the key thing here is that your aggregate roots remain simple. I don't think I can stress this enough. The devices don't need to know about anything else other than what they are doing, what's within their boundary of consistency. But you can use the conductor to build complex escape room flows, and you can change flows completely without having to worry about how you build that into the devices themselves. And most importantly, the complexity of that flow is encapsulated within the process manager pattern, which means you can unit test it and you can unit test that independently of your devices.
- 40:06 William Brander
- So you can unit test your device code, you can unit test your escape room flow code separately as well, and then of course you can do integration tests between them. Outside of escape rooms, what this might look like is if we're doing a claim capture service, so again, someone has submitted a claim, the process starts and the claim capture service will publish an event to say, hey, a new claim has been captured. The assessor feedback service knows that there's now a claim and the saga knows that a claim has been started. The assessor feedback service doesn't know about anything else besides, okay, I need to go and assess whether this claim can be paid or not. The saga knows, okay, now I need to go and check do I have the right regulation documents for the customer? Do I have whatever else the saga needs to do? The complexity is encapsulated there.
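A hand-rolled sketch of such a claim saga, to show where the complexity sits. This is plain Python and not the NServiceBus saga API; the class, the event names `ClaimApproved` and `ClaimRejected`, and the rule that both checks must pass are assumptions for illustration.

```python
class ClaimSaga:
    """Sketch of a process manager for one claim (not a real library API)."""

    def __init__(self, claim_id: str, publish) -> None:
        self.claim_id = claim_id
        self._publish = publish        # callback: (event_name, payload) -> None
        self._assessor_ok = None       # None means "not received yet"
        self._regulatory_ok = None
        self.completed = False

    def on_assessor_feedback(self, approved: bool) -> None:
        self._assessor_ok = approved
        self._try_complete()

    def on_regulatory_check(self, passed: bool) -> None:
        self._regulatory_ok = passed
        self._try_complete()

    def _try_complete(self) -> None:
        # Wait until every piece of feedback has arrived, in any order.
        if self.completed:
            return
        if self._assessor_ok is None or self._regulatory_ok is None:
            return
        self.completed = True
        approved = self._assessor_ok and self._regulatory_ok
        event = "ClaimApproved" if approved else "ClaimRejected"
        self._publish(event, {"claim_id": self.claim_id})


outcomes = []
saga = ClaimSaga("claim-42", lambda name, payload: outcomes.append((name, payload)))
saga.on_regulatory_check(True)     # arrives first: nothing published yet
saga.on_assessor_feedback(True)    # last piece arrives: saga decides
print(outcomes)
```

Since the saga is just state plus message handlers, it can be unit tested by feeding it messages in different orders, without any of the surrounding services running.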
- 40:57 William Brander
- Once the assessor feedback service finishes, it says, well, I've captured the feedback, published an event. The saga gets it and says, okay, claim has been started and now I've got the feedback from the assessor and I've got the feedback from the regulatory bodies, and, and, and, great, now the claim can progress. So the saga can then go and publish a claim-approved or rejected message. So all of that complexity sits in one place, and the individual service boundaries can stay smaller. So what we've got is we've got these devices, they're all talking MQTT to each other, and we introduce an escape room run, which coordinates the flow of all of these devices. Sounds great, except there's a few things with MQTT that make this really difficult, specifically the smaller feature sets and not having an actual queue. Some things that I really like from more full-featured brokers that MQTT doesn't have is the ability to send delayed delivery messages.
- 41:56 William Brander
- So that's where you send a message and you say, I only want this message to actually arrive at that topic in five minutes. Really good if you're trying to schedule difficulty changes around a room, but MQTT doesn't support that. Along with delayed delivery, having the time to be received. So discard messages that were received after their expiration date. Really useful if you are sending messages in the future. And you can't queue up messages, even though it's a queue, and surprisingly, MQTT is actually hard to monitor. For something that's really so simple, it is unbearably hard to monitor what's going on with MQTT. So what I actually wanted was I wanted my devices talking MQTT, which they are really good at, and I wanted my process manager talking with a more full-featured broker that I could also take around with me, so RabbitMQ. And I wanted something in the middle that could go and take messages from one and put them in the other.
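To make the two features concrete, here is a small in-memory sketch of delayed delivery combined with an expiry time. It only illustrates the semantics; it is not how RabbitMQ, NServiceBus, or any real broker implements them, and the class and parameter names are made up.

```python
import heapq


class DelayedDeliveryQueue:
    """In-memory illustration of delayed delivery with message expiry."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = 0  # tie-breaker so payloads are never compared

    def send(self, payload, deliver_at: float, expires_at=None) -> None:
        # The message is held back until `deliver_at` (seconds).
        heapq.heappush(self._heap, (deliver_at, self._seq, expires_at, payload))
        self._seq += 1

    def due(self, now: float) -> list:
        """Return messages whose delivery time has arrived, dropping expired ones."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, expires_at, payload = heapq.heappop(self._heap)
            if expires_at is not None and now > expires_at:
                continue  # arrived too late to be useful: discard
            ready.append(payload)
        return ready


queue = DelayedDeliveryQueue()
queue.send("set difficulty to 2", deliver_at=300)         # five minutes from start
queue.send("show hint", deliver_at=60, expires_at=120)    # only useful for a while

print(queue.due(30))    # []
print(queue.due(200))   # [] because the hint expired before anyone asked
print(queue.due(300))   # ['set difficulty to 2']
```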
- 42:55 William Brander
- And thankfully, again, we have a pattern that does that for us, known as the messaging bridge pattern. There are a lot of references to Gregor's blog in this talk. So the messaging bridge pattern takes care of taking messages from one queuing technology, routing them to where they need to go in another queuing technology and vice versa, which means that you can add additional functionality later on. So add Catapult Analytics, and this one just subscribes to whenever the catapult fires. Maybe there's a webcam that it triggers so I could record the reaction of the children and put that up on YouTube somewhere. That will subscribe to RabbitMQ. Even though the catapult fired event comes from MQTT, the bridge takes care of saying, oh, there's this new thing that's come up and it wants to subscribe to this message type that comes from MQTT. The way that this works is that we start replacing these instances of things.
- 43:48 William Brander
- So we take the sword and we replace it with the philosophical concept of a sword. Be the sword. You are the sword, the logical version of a sword. And we do the same with the treasure chest. We do it with all of the devices. So all of the endpoints eventually just become the logical versions of it. The transports that you're using also just become logical representations of these transports. And then the bridge says, oh, okay, well, I know that the Catapult Analytics service is on transport two, and it is interested in messages that come from transport one, the Catapult event. So I'm going to send a subscribe message to transport one so that the message comes to me. Then I can forward it on to the analytics engine. So it takes care of all that for you, which means when you publish an event, it crosses multiple transports and goes to where it needs to go.
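The routing described above can be sketched with two in-memory stand-ins for the brokers. The `Transport` and `Bridge` classes, the endpoint names, and the `CatapultFired` event are hypothetical; a real bridge between MQTT and RabbitMQ would also deal with connections, serialization, and failures.

```python
from collections import defaultdict


class Transport:
    """In-memory stand-in for a broker such as MQTT or RabbitMQ."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers = defaultdict(list)  # event type -> handlers

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload) -> None:
        for handler in list(self._subscribers[event_type]):
            handler(payload)


class Bridge:
    """Knows which transport each logical endpoint lives on."""

    def __init__(self) -> None:
        self._home = {}          # logical endpoint name -> Transport
        self._forwarding = set()  # (source, target, event type) already bridged

    def register(self, endpoint: str, transport: Transport) -> None:
        self._home[endpoint] = transport

    def subscribe(self, subscriber: str, publisher: str, event_type: str, handler) -> None:
        source = self._home[publisher]
        target = self._home[subscriber]
        target.subscribe(event_type, handler)
        key = (source.name, target.name, event_type)
        if source is not target and key not in self._forwarding:
            # Subscribe on the publisher's transport on the subscriber's
            # behalf, and forward each event onto the subscriber's transport.
            self._forwarding.add(key)
            source.subscribe(event_type,
                             lambda payload: target.publish(event_type, payload))


mqtt = Transport("mqtt")
rabbit = Transport("rabbitmq")
bridge = Bridge()
bridge.register("catapult", mqtt)
bridge.register("catapult-analytics", rabbit)

received = []
bridge.subscribe("catapult-analytics", "catapult", "CatapultFired", received.append)
mqtt.publish("CatapultFired", {"angle": 45})
print(received)
```

Moving an endpoint to the other transport then comes down to changing its `register` entry; publishers and subscribers keep using logical names only.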
- 44:39 William Brander
- Importantly though, it means that when you move things around, because you're not using the physical addresses anymore, if you add a logical sword to the RabbitMQ side, this new instance of the sword and the old instance of the sword are going to get the same messages. So when you publish an event, it goes to both of them because the bridge knows, oh, this one is actually on this transport, not the first transport. So for escape rooms, if you build an escape room, really powerful. Even more so if you are modernizing legacy systems. Maybe as an example, you've got a system that's on MSMQ and queuing messages based off of that. First of all, my condolences, that must be very hard on you. Second of all, what it means is that you can introduce the bridge as a component within your system and start modernizing that system.
- 45:28 William Brander
- So you can then have half of your system running on MSMQ, the other half running on SQS, and the new endpoints will be running Lambda instances perhaps. The really cool thing about this is that as you modernize the old endpoints, as you start taking one of those MSMQ endpoints away and you migrate them over to SQS, the old MSMQ endpoints don't need to know about the change because the bridge will route it to where it needs to go on the SQS side, and the new endpoints that use SQS don't need to know that it's now being moved from MSMQ because the bridge takes care of that. This is an amazingly powerful pattern. 10 out of 10 do recommend. It's also really good if you've got cross-region instances in the cloud, for instance, you don't necessarily want to duplicate all of the information across multiple regions.
- 46:15 William Brander
- So you set up the bridge so that only the events that need to cross between the cloud regions can be shoveled across where they need to be. Super powerful pattern. So those are the four things that we looked at. You might ask yourself, well, how did it go, William? And it didn't go great, if I'm honest. Even just the construction of this was painful. So here's a picture of me making the balance puzzle. You can see I'm using a brand-new strip of addressable LEDs here because this was about the fifth iteration, and I was convinced this was the one that was going to be perfect. It was about five minutes afterwards that I noticed that I forgot to include the holes in the STL file to print, so I couldn't get the electronics through, so I had to reprint it again. Iteration six.
- 47:01 William Brander
- A very similar thing happened with the musical tone puzzle. I tried to get it in a really small form factor, but I just for the life of me could not get the print to be strong enough. So that's why it eventually looks like a cell phone from 1996. It's just missing the aerial that they have to pop up on the side. I hurt myself a lot. I'm incredibly clumsy. You'll note in this photo there's little censored blocks because I realized not everybody wants to see blood, but I already have a plaster on in this picture. This was the time to date. I am incredibly clumsy. In fact, right now I have a plaster on my toe. I should not be allowed near sharp things.
- 47:38 William Brander
- When it comes to soldering all this stuff, I'm terrible at soldering and apparently really bad at planning. So some of you may note, these two buttons need to go through these two holes. Here's the fun thing, I've already connected them together. You don't need to be a topology major to know that this isn't going to work. So I had to unsolder all five of these devices or all 10 of the buttons, put them through, resolder them. Not only did I do that, I burnt wires with the soldering iron. I burnt the wire of the soldering iron with the soldering iron. Thankfully I had a second one that I could then use to repair the wires of the original soldering iron beforehand.
- 48:20 William Brander
- Anybody like mechanical keyboards here? Close your ears. I melted through two keys on my keyboard with the soldering iron, the control key and the Windows key just have these giant gaping holes in. I should not be allowed near a soldering iron at all. It is dangerous. But eventually I got this thing set up and I got everything assembled, and I had enough that I wanted to now do this for the kids. And I was really, really excited. This was going to be my moment to shine. They're going to have so much fun and it's going to be great. So we went to one of their friend's houses and we set this up and in my mind, I'm going, this is great. This is going to be like the Coliseum and we're going to have these gladiators come out and they're going to just be amazing. Or if you've ever watched The Goonies, which is a movie, a kids' movie from my childhood, the kids all fight with each other, but they come out stronger at the end.
- 49:11 William Brander
- I didn't get any of that. I got this. They broke absolutely everything, absolutely everything. It took about three or four attempts before I got the escape room to work. In fact, here's a picture my wife took of me after a particularly bad one. My dog is very happy that I'm sitting on the ground looking very despondent. I was not happy at all. But I remember after this photo was taken, one of my daughter's friends, she's five, and she's absolutely tiny. She's such a little sweetheart. Children, they're kind, right? They're sweet. And I think as we grow up, we lose a lot of that sweetness somehow, and we need to get it back because this little child comes up to me and she's like, I'm sitting on the floor, and she's still shorter than me when she's standing. She's so minute, and she puts a little hand on my shoulder and looks me in the eyes and says, "Maybe you should have tried harder."
- 50:02 William Brander
- And the kindness that she was giving me hurt a lot. But eventually I realized she was right. And I think it was about the fifth or the sixth version of trying to get this whole thing to work, I eventually did manage to get the escape room to work. So there's a few other pictures of the kids doing some of the more physical puzzles because we've taken a look at some of the more electronic ones. So they solved this. They had a great time. They got through to the end. It took them about an hour and six minutes. So just over the hour part, which I was quite happy with. I could have maybe been a little more aggressive on reducing the difficulty, but they did them. They all had a great time. And I've since run this twice for the different groups of friends of theirs. I've run it for adults. I've run it for a scout group, which was just embarrassing. They were terrible. It eventually took them almost the full two hours of their scout meeting to get through the thing.
- 50:58 William Brander
- But it's been something that I've gotten a lot of value out of. And the things that I'll remember from this experience is even though I was just monitoring a simple system, it pays to be intentional about it and to think more clearly about what I was monitoring because I got a better monitoring system out of it at the end. Service boundaries are business rules. Make sure you try and include your business rules as part of your service boundaries. If you can distribute state, do something that can encapsulate that state, it means that you can keep simplicity in your devices and you can still have testable business flows that are quite complex. And the polyglot messaging pattern is super useful, super-duper useful. It's one of my favorite patterns because it just unlocks so much potential, especially in the real world. Not necessarily just for escape rooms. But I think the thing that stuck out to me the most is the impact that this had on my children, so on the left is my daughter.
- 51:54 William Brander
- She's built an escape room for me to do now at my parents' house when we were visiting because she just had so much fun doing it. And on the right-hand side, my daughter is looking through a Corona telescope at the sun because part of the manual puzzles involved magnifying glasses, which then became binoculars. And then she wanted to know about telescopes. Now I have to take her telescope camping, and I really hope she turns into an astronomer one day.
- 52:15 William Brander
- The impact that this little thing had on these children is more than the effort that I've put in. So as you go through the rest of the conference tomorrow, don't just look at the cool technology and the cool techniques. Maybe there's something else that'll unlock more value for you later on down the road. I do have plenty of time for questions, but I think it might be more fun if we do it over beer. So if you do want to talk to me, I can show you pictures of some of the other puzzles that I did for the kids, and we can talk about this as much as you want. I'm always interested. Thank you very much.