From layered web services to an event-driven architecture at Rikstoto

00:01 Jan Ove Skogheim

I work for a Norwegian consultancy called Webstep. Five years ago I did the enterprise development course, and I've basically been spreading the gospel about NServiceBus ever since. This is going to be a case study about my current client. It's basically a textbook case of somebody doing web service layers upon layers upon layers and creating a big ball of mud. The client is called Rikstoto, and a literal translation is "the Norwegian national tote". We are the only licensed totalisator that's allowed to do parimutuel betting on equine sports in Norway.

So how many people understood what that meant? Yeah, two. That's great. So the short version is: you bet on horses, you bet into a big pool, and if you pick the correct horses, you split the pool with the other guys who did the same thing as you. Of course, we take a substantial percentage out of the pool. Most of the money goes directly back to the Norwegian state because we're a monopoly, so we have to pay them for it. The rest of it basically goes to fund horse breeding and activities around that in Norway. Last year we had a turnover of 380 million pounds. So we're not Google big, but people care about stuff.
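
The pool splitting described above can be sketched in a few lines. This is only an illustration: the talk says winners split the pool after the operator takes a percentage, and the sketch assumes the split is proportional to each winner's stake, which is the usual parimutuel rule. The bettor names, stakes, and the 30% takeout are made up for the example and are not Rikstoto's figures.

```python
from fractions import Fraction

def parimutuel_payouts(bets, winning_horse, takeout):
    """Split the pool, minus the operator's cut, among winning bets by stake."""
    pool = sum(stake for _, stake in bets.values())
    net_pool = pool * (1 - takeout)
    winning_stakes = {
        bettor: stake
        for bettor, (horse, stake) in bets.items()
        if horse == winning_horse
    }
    total_winning = sum(winning_stakes.values())
    if total_winning == 0:
        return {}  # nobody picked the winner; real rules for this case vary
    return {
        bettor: net_pool * stake / total_winning
        for bettor, stake in winning_stakes.items()
    }

# Hypothetical bets: bettor -> (horse number, stake)
bets = {"anna": (3, 100), "bjorn": (3, 300), "cecilie": (5, 600)}
payouts = parimutuel_payouts(bets, winning_horse=3, takeout=Fraction(3, 10))
```

With a pool of 1000 and a 30% takeout, 700 is left to share, and the two winners split it 1:3 according to their stakes.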

So one of the biggest challenges is illustrated quite nicely by this mountain range up north in Norway: we have a very peaky load profile. Last year we did 55 million transactions. That's accepted transactions. Of course, it's not a steady flow. It's extremely peaky, because once you bet into the pool, the information becomes public. So you want to deliver your bets as late as possible to avoid giving other people information about the kind of edge you think you have on everybody else. Every day we have what we call a race day, which is basically seven races. In the daytime we have maybe a couple of bets a minute, and when we approach the deadlines, we get like 200 to 300 per second. So managing the peaks is extremely valuable for us. Downtime close to race start is just not acceptable.

So this was the architecture when I entered, three years ago. This is a cleaned-up illustration of the architecture. I call it the Matrix, the Blue Pill Matrix Architecture, because basically this is someone who just read a lot of early stuff on SOA, especially the books by Thomas Erl, if you've read them. You have lots and lots of layers and you have services of different types. So you see here, on the top, we have something called Adaptor Services, whatever that is, then Process and Composite Services, Basic Services, Protocol Services. Every rectangular box you see here is a service of some kind. And that made it very hard for me when I joined this client, because I have a specific thing I think about when somebody says service, and this is not it.

So, you have all the basic problems. You have every kind of coupling: logical, physical, spatial, temporal. And it's very hard to protect against peaks in this kind of architecture, because the slowest thing will slow down your entire business operation, since this is RPC between the boxes. So you have clients coming in here on the top, calling this one, this one, this one, that one. So the slowest one will stop you.

And of course, the communication here is WCF. So it's synchronous and you have no basic transactional support. So a lot of manual cleanup was usual. And of course, handling time in this kind of architecture is really hard as well. There are a lot of things I've cut out here, like tasks and manual polling solutions and stuff like that, but they made it even worse. So this is what I call the blue pill architecture. If you remember from The Matrix, that was like staying in the Matrix, just living blissfully and ignoring the consequences. This is our new one. I'm going to tell you how we got there. But when we talk about services now, we're talking about big stuff like this: Customer, Betting, Experts, Prize Payment, Accounting, Raceday, instead of the 40-plus things you saw on the last slide.

One of the things that enabled this was the fact that, if you look at this slide, every single thing here has a WCF interface on top. So the clients do WCF to this one, and this one uses WCF to the ones down there. So as long as we kept the WCF interfaces for the client-facing things, we should be okay, because we didn't get permission to basically say, "Oh, let's not do anything new for the next two years." This was a moving target. So moving along, we had to make sure that things worked. We did this refactoring over a span of two, almost three years, and of course we were adding new functionality as well. These are actually fully autonomous. There's no more RPC between them. There's pub/sub with MSMQ and no more shared databases. Everything is deployable on its own. And we still have the customer-facing WCF interfaces we had when we started. So just to give you some numbers, we have 25 sagas, maybe a couple more now, a hundred-plus events flowing, and 200-plus event handlers.

So this was how we started. The first step we did was to basically remove all the semantics of the previous one, because the process and composite services made no sense to me. So what we did is that we have components. These are components that are available in the architecture. So for half a year, everybody who said service, I went over and slapped them in the face, because that basically meant we couldn't communicate: they meant something totally different from me and from other people on the team. So we basically wiped out the old concepts and reset the semantics completely. But we still kept the WCF interfaces. That was important for us because they had to work. The next step was actually to bring out some crayons. We started looking at what the components were doing, what the functionality inside them was, how they were calling each other, how they were called, and which kind of databases they were using. And we combined that with talking with the business and asking them, "What is this actually for you?"

So we were trying to identify the actual business capabilities behind the boxes. Doing that, we managed to color them in. We used one color per business capability. And we got this, and the next step was to actually cluster them together.

So we have much bigger, more coherent boxes.

And when we did this, we actually combined the source code for these kinds of things. So this was a new source tree. It wasn't one repository for everything, but instead of 40 in different places, or say 12 here, that's one source tree. We removed some serious code duplication between these, and we removed the WCF communication inside them. So what you're seeing here is one process instead of the several processes that used to be.

And we introduced the NServiceBus host.

This worked kind of well, but there was still WCF communication going on between these services. There was no WCF inside them, but between them there was. But we had to do this step by step. So that was the first step. The next step was to kill the WCF between the services. Since we had introduced the host here, we now had the possibility of actually using messaging. So that's great, and that's what we did. We started chipping away at all the WCF between the services, and the easiest ones are basic fire-and-forget. When you have this service calling this service, this one has to do something, but you don't really care over here. That's an event. It's a basic event. It might not be a great event, but for us it was a way to get started.
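
A fire-and-forget call turned into an event can be sketched like this. It is a minimal in-memory illustration in Python, not NServiceBus code: `InMemoryBus`, `BetPlaced`, and the `ledger` subscriber are hypothetical names, and a plain `deque` stands in for the queue.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass(frozen=True)
class BetPlaced:
    """Hypothetical event: something that has already happened."""
    bet_id: str
    amount: int

class InMemoryBus:
    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event):
        # The publisher only enqueues and returns; it never waits for subscribers.
        self._queue.append(event)

    def drain(self):
        # Stand-in for the receiving side pulling messages off its queue later.
        while self._queue:
            event = self._queue.popleft()
            for handler in self._subscribers[type(event)]:
                handler(event)

ledger = []
event_bus = InMemoryBus()
event_bus.subscribe(BetPlaced, lambda e: ledger.append((e.bet_id, e.amount)))

event_bus.publish(BetPlaced("bet-1", 50))
seen_before_drain = list(ledger)  # the subscriber has not run yet
event_bus.drain()
```

The publishing side finishes its work without knowing who listens or when they run, which is what removes the temporal coupling of the original WCF call.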

The next ones, the more difficult ones, are when you have a service calling a service and you need a response. Let's say it's a query. You have a client that calls this, and this calls this, aggregates, and sends the result back up to the client. What we do there is basically let the client call both and do the aggregation on the client instead. So we do the composition on the clients. The most difficult ones are where you have a client calling a service, and that service calling another service, where that's basically a command and you need the result before you can do something more here. The way we handled that was basically to move business logic. We started moving entire chunks over to other services, because the service boundaries were not good from the start.
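
Composition on the client, as described above, might look roughly like this. The two query functions are stand-ins for calls to two autonomous services; their names and the data shapes are assumptions made for the example.

```python
# Stand-ins for two autonomous services; in reality these would be remote calls.
CUSTOMER_DATA = {"c-1": {"name": "Kari"}}
BETTING_DATA = {"c-1": [{"bet_id": "b-7", "amount": 50}]}

def query_customer_service(customer_id):
    return CUSTOMER_DATA[customer_id]

def query_betting_service(customer_id):
    return BETTING_DATA.get(customer_id, [])

def build_account_view(customer_id):
    """The client calls each service itself and aggregates the results,
    so neither service has to call the other."""
    customer = query_customer_service(customer_id)
    bets = query_betting_service(customer_id)
    return {
        "name": customer["name"],
        "open_bets": bets,
        "total_staked": sum(bet["amount"] for bet in bets),
    }

view = build_account_view("c-1")
```

The aggregation logic moves up into the client, and the service-to-service query disappears.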

And the other thing we did was to do more stuff on the client. So we have some operations that are actually now two commands to two different services, instead of one command going to one service that sends a command to the second one. That kind of depends: if you own the client and you have the possibility, you can do stuff like that. And then we use sagas to correlate between them if we need to do something when both of them are done. So we kept refactoring as we learned, because the first version wasn't perfect. It never is. You never get a perfect SOA. That's my experience. So you just keep on learning and adapting to the new stuff you learn. So, okay, this was how we ended up. We actually have a couple more services now, but not that many. These are the services we have, and we have the WCF interfaces, and lately we have introduced a couple of new REST-based interfaces for our newest client.

So I think the single biggest win for us is the fact that what you're seeing inside here is stuff that the business actually knows. They know what accounting is. They know what prize payment is. They know all these things. We're speaking the same language. With the old architecture, they didn't know what the hell all the service things were. They had no idea. So when we communicate with them, it gets much easier.

All right. So the last thing we did was to scale it out. We basically had to scale up because we have sagas that correlate different commands from the client. If you deploy to several machines, you don't want the first command to go to this machine, creating a saga, and the next one to go to that machine, creating a new saga there. It will never complete.
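
The saga problem described above can be shown with a small sketch. It is not NServiceBus code: `Worker`, the command names, and the plain dictionaries used as saga storage are hypothetical, and the sketch only assumes that a saga completes once it has seen both commands for the same correlation id.

```python
REQUIRED = frozenset({"ReserveFunds", "PlaceBet"})  # hypothetical command names

class Worker:
    """A machine handling commands; saga state lives in the store it is given."""
    def __init__(self, saga_store):
        self.saga_store = saga_store

    def handle(self, correlation_id, command):
        seen = self.saga_store.setdefault(correlation_id, set())
        seen.add(command)
        return seen == REQUIRED  # True once the saga has seen both commands

def deliver_to_two_machines(first, second):
    first.handle("bet-42", "ReserveFunds")
    return second.handle("bet-42", "PlaceBet")

# Each machine keeps its own saga state: two half-finished sagas, never complete.
completed_with_local_state = deliver_to_two_machines(Worker({}), Worker({}))

# Both machines see the same saga state: the second command completes the saga.
shared_store = {}
completed_with_shared_state = deliver_to_two_machines(
    Worker(shared_store), Worker(shared_store)
)
```

Whatever mechanism is used, both commands for one correlation id have to end up at the same saga instance.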

So when you do scale out, you have the logical queues on your cluster. And that works perfectly. We also used the load balancer. We used the distributor to make sure we have enough power for some of our services, because our betting service gets hit a lot when we're close to the deadlines, but stuff like prize payment doesn't. So in comparison, we have much more machine power for betting. We need it to run on more machines without having to deploy all the other services on those machines as well.

So this was fun, by the way, because when you introduce a Windows cluster, you get introduced to clustered MSMQ and DTC. And operations will just love you for it. I think the simple stuff has been getting a stable system. Once you get it configured, it just works. We haven't had a single instance of a service just crashing and taking some data with it. Of course, single messages have failed, but you just retry them and you're back on track. At least if you have idempotent messages. That's important. Just do that.
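
Why retries are only safe with idempotent handling can be sketched as follows. The handler, the message id scheme, and the retry loop are illustrative assumptions, not how NServiceBus implements retries.

```python
class TransientError(Exception):
    """Stand-in for a temporary failure such as a deadlock or a timeout."""

class PayoutHandler:
    def __init__(self, failures_before_success=1):
        self.processed_ids = set()
        self.balance = 0
        self._failures_left = failures_before_success

    def handle(self, message_id, amount):
        if message_id in self.processed_ids:
            return  # already applied: handling the same message again is a no-op
        if self._failures_left > 0:
            self._failures_left -= 1
            raise TransientError("temporary failure, try again")
        self.balance += amount
        self.processed_ids.add(message_id)

def deliver_with_retry(handler, message_id, amount, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            handler.handle(message_id, amount)
            return attempt
        except TransientError:
            continue
    raise RuntimeError("retries exhausted; message would go to an error queue")

payout_handler = PayoutHandler()
attempts_needed = deliver_with_retry(payout_handler, "msg-1", 100)
deliver_with_retry(payout_handler, "msg-1", 100)  # duplicate delivery of the same message
```

The first delivery fails once and succeeds on retry; the duplicate delivery changes nothing because the message id has already been recorded.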

The API is really good. Every single developer I talk to on my teams says that it's not really hard. The hard stuff I'll get back to later, but using the API is just simple. And the way a message handler frames a business operation, that's really beautiful in my opinion. I think the API makes it really easy to do stuff correctly and really hard to do it wrong. So that's a good API. I think a good example is that you can't do Send on an event and you can't do Publish on a command. That's really smart if you think about it. So, sagas: we have 25 of those at least. And I think that's the most elegant solution I've seen for long-running transactions ever, basically. We use them extensively to kill off all the task- and batch-based stuff for handling time. And every developer has said this is one of the killer features of NServiceBus.
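
The Send/Publish restriction mentioned above can be illustrated with a tiny guard. This is a Python sketch of the idea only, with hypothetical class names; it does not show NServiceBus's own types or how it performs the check.

```python
class Command:
    """Marker base class: a request for one specific receiver to do something."""

class Event:
    """Marker base class: a fact that has happened, for any number of subscribers."""

class PlaceBet(Command):
    pass

class BetPlaced(Event):
    pass

class GuardedBus:
    def __init__(self):
        self.sent = []
        self.published = []

    def send(self, message):
        if isinstance(message, Event):
            raise TypeError("events are published, not sent")
        self.sent.append(message)

    def publish(self, message):
        if isinstance(message, Command):
            raise TypeError("commands are sent, not published")
        self.published.append(message)

guarded_bus = GuardedBus()
guarded_bus.send(PlaceBet())
guarded_bus.publish(BetPlaced())

try:
    guarded_bus.send(BetPlaced())  # misuse: sending an event
    misuse_rejected = False
except TypeError:
    misuse_rejected = True
```

Rejecting the wrong call at the API boundary keeps the command and event semantics from blurring over time.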

So sagas are great. We use them to ... We have a lot of deadlines. You have to deliver your bet in time for stuff, and if you don't, we send you reminders. We have something called subscriptions, where you basically subscribe to games every week. So we'll send you a new game and get your money. We use sagas for those. We use them to integrate with third parties, which can be quite complicated. And in our new REST API, we actually use sagas to do the delivery of the bets as well. So we love sagas.

The complicated stuff, that's the infrastructure. What really surprised me when I started with this was that a lot of developers hate infrastructure. I come from a quite geeky background. I've always been programming, on my Amiga, Commodore, something like that, and a lot of new developers haven't. So they just want to code. They don't care about the infrastructure.
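
The deadline reminders mentioned above are the kind of thing a saga timeout replaces a polling batch job for. The sketch below is a simplified Python model with a simulated clock; `TimeoutScheduler`, `BetDeadlineSaga`, and the message names are hypothetical and do not reflect the real saga API.

```python
import heapq

class TimeoutScheduler:
    """Simulated clock: callbacks fire when time is advanced past their due time."""
    def __init__(self):
        self._heap = []
        self._sequence = 0  # tie-breaker so callbacks are never compared

    def request(self, due_at, callback):
        heapq.heappush(self._heap, (due_at, self._sequence, callback))
        self._sequence += 1

    def advance_to(self, now):
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()

class BetDeadlineSaga:
    def __init__(self, customer, scheduler, remind_at, outbox):
        self.customer = customer
        self.delivered = False
        self.outbox = outbox
        # The saga asks to be woken up later instead of a job polling for it.
        scheduler.request(remind_at, self.on_timeout)

    def on_bet_delivered(self):
        self.delivered = True

    def on_timeout(self):
        if not self.delivered:
            self.outbox.append(("SendReminder", self.customer))

outbox = []
scheduler = TimeoutScheduler()
saga_anna = BetDeadlineSaga("anna", scheduler, remind_at=10, outbox=outbox)
saga_bjorn = BetDeadlineSaga("bjorn", scheduler, remind_at=10, outbox=outbox)

saga_anna.on_bet_delivered()  # anna delivers in time
scheduler.advance_to(10)      # the timeout fires for both sagas
```

Only the customer who has not delivered gets a reminder, and the time-based logic sits next to the business state it depends on.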

And that's a problem when you're doing NServiceBus, because if you think about the basic NServiceBus stack, as I call it, that's MSMQ, RavenDB, and DTC. You have to know this stuff, and it's not easy to combine. So it's a trade-off when you go to something like this, because you basically need to have people who want to do it. So I think good developers who like to get dirty with infrastructure are important.

And with the infrastructure you get the configuration. There's been a lot of great stuff happening here within NServiceBus, because you have the profiles, you have the PowerShell tools, you have a lot of good commands and some default values that you didn't have at the start. It's getting better now. But there's still a lot of configuration, and operations really don't like a lot of this stuff. Take something like setting up clustered DTC and clustered MSMQ. The likelihood of them having done that before you come along saying you want to use NServiceBus is pretty close to zero. So we had a lot of pushback from them, but it's okay now. My biggest point about this is that you should involve them very early, because there's going to be a lot of new stuff for them. And basically you'll have to train them yourself.

So I'm hoping ServicePulse will actually be good for this, because we're using Microsoft System Center, SCOM, and that's basically not good when you do this. It's too low level. You can manage queues and some stuff like that with it, but you don't have queues in the context of a service bus.

So, you need better tooling for that. So, involve them early. We didn't, and it's been expensive for us. Next, I did have a slide on documentation, but that was covered yesterday, so I'm not going to go into it. It hasn't been great. It's been getting much better. But one thing for us has been living on the cutting edge. I think somebody said yesterday that you should always upgrade early. I'm not sure that's entirely true for us. We did. If you remember back to when RavenDB was introduced, that was too early. We had a lot of problems with Raven and NServiceBus. It actually took several releases from both of them before it worked correctly. So I think that was a premature decision by the NServiceBus team. Ever since, we've always been a couple of versions behind. And that's maybe not that bad.

The good stuff is that the employees are really quick when you have a problem.

I've actually sat on Andreas a couple of times, and the fix is basically out the next day. So that's a positive. So the complex stuff is actually getting your mapping from business to IT right. That's, I think, more of a black art than actual engineering. A lot of developers don't like doing business. It's a strange thing, but they just want to code. If you're doing SOA, you have to have developers who actually want to learn how the business is doing stuff, because the architecture should reflect your business. And if it doesn't, you're going to have a bad time. So this is a continuous process for us. We keep refining as we learn stuff, and basically the tools aren't going to help you here.

This is a pure ... It's an art basically.

So the other thing is, even if you have a perfect understanding of how the business maps to IT, getting the service boundaries right is very hard. There are a lot of moving parts. What are the business capabilities, and how should I map them to business components and autonomous components? How should I run them? It's really hard. And it grows quite big, quite fast as well. So this is the complex stuff. You're never finished.

So, I have a pro tip here, and that's basically that you're going to have fewer services than you might expect. At least we have. We started out with 40-plus services in my earlier diagram, and we're ending up with six services, maybe seven. If you think about it, it's not unreasonable. If you look at your business, they don't have that many business capabilities, and they don't add new ones that often.

So if you manage to home in on the basic capabilities that you actually have in your business, that should map pretty well to how many services you should actually have. We have moved quite a lot of things between the services, but we don't introduce new services very often. I have an example. We recently introduced what I called subscriptions earlier, where you can subscribe to games, so you get a new game every week without thinking about it. The default thinking for most of the developers was, "Okay, that's a new service." But it wasn't. We ended up using two of the old services, splitting different parts of the functionality between them. That's basically how we've done most of the new stuff in the last year. No new services. We just add to those we already have.

All right, did you watch this? This is a man versus a computer. They basically attack a robot so it's a ... We have something called ping-pong messaging in our team. That's an anti-pattern where you have two services sending events to each other, and they basically can't do anything before they get a lot of data back in the event from the other service, and they go back and forth like that. You basically add a lot of stuff to your events. They should be small and beautiful, and when they aren't small and beautiful, you get ping-pong messaging. A can't do anything before it gets the response from B, and B can't do anything before it gets a response back from A again. And you go back and forth. That's not good. You have a service boundary issue. Basically, something should be moved from one service to the other.

And it's actually, I call it, a hidden form of temporal coupling. It's especially easy to fall into this trap when you retrofit something into an architecture you already have. All right. So, I'll just take one more slide. This was a very fun thing we hit during the introduction. How many people know what a QMId is?

Why do you know? Yeah. And that's what I'm talking about here. The QMId is basically a GUID that identifies the MSMQ installation on a single machine. It gets sent with every message you send. It's metadata that gets sent to the other party, the one you send to. And the machine you send to uses this QMId as the key in a dictionary where the value is the actual IP it's going to use to send responses back. So if you have two machines with the same QMId, the first machine will get the entry in the table. So when that machine is going to send something back to the other machines that have the same QMId, all the responses go to the first machine. It was really fun, because basically we had a situation where, on a single endpoint, we had several workers and just one of the workers worked. Everybody else stopped working. Nothing got through, and everything was stuck in the outgoing queues on the machine.
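
The failure mode described above can be modelled in a few lines. This only models the dictionary behaviour as the talk describes it; it says nothing about MSMQ's real internals, and `ResponseRouter`, the ids, and the IP addresses are made up for the example.

```python
import uuid

class ResponseRouter:
    """Models the receiving machine's table of QMId -> address for replies."""
    def __init__(self):
        self._address_by_qmid = {}

    def on_message(self, qmid, sender_ip):
        # The first machine seen with a given QMId claims the entry.
        self._address_by_qmid.setdefault(qmid, sender_ip)

    def reply_address(self, qmid):
        return self._address_by_qmid[qmid]

# Two workers cloned from the same VM image without Sysprep share one QMId.
cloned_qmid = "cloned-image-qmid"
broken_router = ResponseRouter()
broken_router.on_message(cloned_qmid, "10.0.0.1")
broken_router.on_message(cloned_qmid, "10.0.0.2")
replies_when_cloned = [
    broken_router.reply_address(cloned_qmid),  # reply meant for worker 1
    broken_router.reply_address(cloned_qmid),  # reply meant for worker 2
]

# After the fix every machine has its own id, so replies go where they belong.
qmid_1, qmid_2 = str(uuid.uuid4()), str(uuid.uuid4())
fixed_router = ResponseRouter()
fixed_router.on_message(qmid_1, "10.0.0.1")
fixed_router.on_message(qmid_2, "10.0.0.2")
replies_when_fixed = [
    fixed_router.reply_address(qmid_1),
    fixed_router.reply_address(qmid_2),
]
```

With a shared id, every reply is addressed to the first worker, which matches the symptom of one worker working and the rest sitting idle.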

That's this issue. And it's basically because you have operations people who clone virtual machines without running Sysprep. That's really bad. The fix is quite easy: you just reinstall MSMQ, or you run Sysprep on the cloned machines. All right. So I think I'll wrap up. My time is out. So, any questions?

Yeah. The WCF hosts: at least at the start, we hosted them in the NServiceBus host, actually. We self-hosted them. Before that, in the first architecture, we hosted them in IIS, and we just moved them into the exact same host as NServiceBus.

Developers. So it's not that big.

Plus there is fun.

So if they remove the DTC, I'm going to be really happy. All right. Anything more? Yep.
