NServiceBus at Scale

00:04 Charlie Barker

Can you hear me? Michael? Can everyone hear me? Can you hear me now? Okay, slide one. Come and sit down. Awesome. Okay, cool. So, my name is Charlie Barker. As Udi alluded to earlier, I've been working with NServiceBus since version 1.9, the very early days. I have been writing software for about 13 years, primarily in the FinTech sector. I spent six years as a co-founder of a small software business. And then five and a half years ago, I joined Wonga, which you may have heard of if you live in London or Canada or maybe Poland. We're not necessarily the best-loved company in the world, but we're quite proud of our tech. And I'd like to find out a little bit about you. If you write software for a living, could you raise your hand?

Okay, figured that might be the case. And could you keep it up if you've deployed a message-based system into production? And if you used NServiceBus for that system, can you keep your hand up? Okay, cool. So what I'm going to try and do is relay a bit of the journey that we went through at Wonga. When I joined, it was a 20-person business, and we were at the beginning of a journey that ended with a 900-employee business serving, I think, over six countries around the world.

And I'm going to talk about why we adopted NServiceBus and the problems we were facing at the time with our existing platform. In order to give you an understanding of the decisions we made, I'm going to talk a bit about that platform, how it worked, and the shortcomings that we had with it. I'm going to talk about, once we implemented our NServiceBus solution, the relationship between monitoring and tuning, and how important that became when we needed to scale up and meet customer demand. And I'm also going to talk about the growth of that demand and what that meant for us as a development team.

I'm going to show you an NServiceBus integration with a payment service provider. Wonga is in the business of lending money to customers, and that means we have to collect it back when their loans fall due, so we need to integrate with payment service providers to do that. So that's just by way of a concrete example of something we've done and how we did it. And I'm going to talk about some of the lessons we learned along the way. Now, thankfully, Udi's opening presentation alluded to some of the stuff that the NServiceBus team have built since we implemented our new platform, which was on the 2.x version of NServiceBus.

And so thankfully, some of the pain that we suffered, you guys won't have to suffer if you're developing on the later versions, or you can take advantage of some of the new tools, such as ServiceMatrix, that they've developed. So first up, what did our platform look like before we started with NServiceBus? We had a website, because we deal directly with customers, so we needed to be able to collect their applications for a loan. We had a back-office system where we administered all their accounts once they were successful in taking out a loan. And we had a decision engine. This was the piece that was responsible for assessing everyone's application and deciding whether or not we were going to lend to that customer.

And these systems talked to each other primarily over SOAP, so XML firing backwards and forwards over HTTP. And when it was good, it was very good. Everything worked. Customers were happy. Customers could apply for loans, they could get loans, they would receive the money in their bank account, and everything was wonderful. But our product, it turns out, has an interesting feature to it. Throughout the month, customers turn up and borrow money, and typically they want to pay us back at the end of the month, because our products are short-term products. And this created an interesting effect: towards the end of every month, we'd have one day where we might see a transaction load that was five times higher than any other day we'd seen throughout the entire month. We affectionately named these peak days.

There's also an interesting anomaly: if the last Friday of the month happens to be the last working day of the month, you get a kind of added effect. We call those twin peak days. These services, or rather the back-office service and the decision engine service, talk to a lot of third-party services. In order to process applications, you need to go out and make calls to credit bureaus, you need to talk to systems for anti-fraud, you need to talk to systems that will validate payment cards, all sorts of stuff like that. So it was a long synchronous chain, starting with a customer who's just submitted their application on the website, going through to the back office, which was persisting the customer's data, then going through the decision engine, and then out to third-party services, all entirely synchronous, all entirely over web services, SOAP and HTTP.

And what we'd observe is that it only took a failure in one part of that long chain to bring the whole thing down. We affectionately called this the House of Cards architecture. So we thought about this long and hard, and we thought: well, what can we do to mitigate this problem? How can we prevent the House of Cards from tumbling down in the event that one of our third parties starts to run slowly or goes completely offline? It didn't seem to make sense that, just because we couldn't process a loan application for one set of customers, other customers weren't able to log in to their accounts, check their balance, that sort of thing. So the first thing we did was switch over to IIS 7 in integrated mode.

And this opened up the possibility of making async calls to our back-office APIs from the website. And this helped. It bought us threads, effectively; it let us make more calls at a time from the website back through, but it didn't entirely solve the problem. Typically, we'd be targeting API calls that we knew could have high latency, that could be sitting there waiting for long periods of time. Those are the ones we started with, and then we worked our way through. And then we introduced more web servers. Again, we were just buying more threads, and we put them behind a load balancer. But we decided that we still couldn't give the business a guarantee that the website would always remain up and always remain available.

And the reason for that was that we were seeing growth month on month, about 10% growth, so every month we were seeing maybe 10% more customers than we saw the previous month. And on top of that, we don't control our third parties. We don't know how long they're going to be down for. Even if they give us an SLA and say the maximum amount of time they're going to be down is half an hour, they can still make mistakes, they can still go offline. We can have failures internally as well: parts of our system can go offline, and it can take us time to bring those systems back up.

So even if we were incredibly diligent, we could still see failures on the website. And we wanted to be able to give the business a guarantee that said: look, we may suffer failures, we may suffer outages, third parties may go offline, but the website will stay up. Reputational risk, that sort of thing, was at stake here. We're a small tech startup; if your website keeps going down, it doesn't look very good. Okay, sorry, not quite as polished as Udi. So, why NServiceBus? How was it going to solve the problems that we were facing with our current architecture?

Well, obviously, "Udi says" is a good answer to that. And to be fair, Udi's reputation in this space is good, and it gave us a lot of comfort. He was talking knowledgeably about the sorts of problems we were trying to solve. And so early on we were keen to figure out whether NServiceBus might be a solution for us. We needed a robust solution; we wanted to tackle this problem of the website going down in the event of a third-party service failure. So what did NServiceBus give us that our old solution didn't? It gave us the ability to queue commands. So a customer could say, I want to do this. And if we were unable to process that message immediately, it would sit in the queue and wait until we were able to process it.

Now, it's not a panacea; it doesn't solve every single problem. You've still got a customer waiting for an answer, and if you can't get back to them in time with that answer, they may go elsewhere. But it did solve this problem: even if we couldn't process loan applications, that wouldn't affect the website. It wouldn't take the website offline; it would just mean all the commands waiting to be processed for those applications would sit and wait. And other requests coming through, like queries where the customer wants to check his account balance, would succeed, and we wouldn't starve our web servers of threads.
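As a rough illustration of the decoupling described above, here is a minimal Python sketch. It is not NServiceBus code; the names `Website`, `submit_application`, `check_balance` and `drain` are invented for the example, and an in-memory deque stands in for a durable queue. Commands are queued and return immediately, while queries are answered without waiting on loan processing.

```python
from collections import deque


class Website:
    """Toy front end: commands are queued, queries are answered directly."""

    def __init__(self, balances):
        self.balances = balances        # read-side data used for account queries
        self.command_queue = deque()    # stands in for a durable message queue

    def submit_application(self, customer_id, amount):
        # Queue the command and return at once; no web thread waits on processing.
        self.command_queue.append(("ApplyForLoan", customer_id, amount))
        return "accepted"

    def check_balance(self, customer_id):
        # Served from local data, so it still works while processing is stalled.
        return self.balances[customer_id]


def drain(queue, process):
    """Run by the back end once it is able to process messages again."""
    handled = 0
    while queue:
        process(queue.popleft())
        handled += 1
    return handled


site = Website({"alice": 120.0})
site.submit_application("bob", 200.0)    # the back end may be offline right now
site.submit_application("carol", 150.0)
balance = site.check_balance("alice")    # still answered in the meantime

processed = []
count = drain(site.command_queue, processed.append)
```

The queued applications are only handled when `drain` runs, which is the trade-off mentioned in the talk: the website stays up, but the applicant is still waiting for an answer.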

And we knew that we were going to see this 10% growth month on month, but we weren't sure where the end of that journey was. We were a fairly small company with a fairly small number of customers, but we knew every month we were seeing this growth, and no one in the business could say, oh, okay, after X number of months growth will stop, we'll have saturated the market with this product, because no one knew how big the market was. And so we were very interested in building a system that would behave predictably when it was under the really stressful load that we were seeing at the end of each month. When servers were running at full tilt, the commands would queue up, and everything the server could process it would process. It wouldn't slow down, it wouldn't collapse under its own weight, whereas the previous system would fail in unpredictable ways and wouldn't come back up after a failure in a predictable fashion either.

So this was another key reason why we thought NServiceBus would be a good choice. We also wanted the ability to scale the system. By distributing the processing of customer requests across multiple machines, we felt we'd be in a better position to meet the scale and the growth of the system. And obviously linear-type scaling is what we were after, because we didn't know where we were going to end up. We didn't know when we were going to reach saturation point. As it turns out, it took about four years for us to reach a saturation point in the market, and we were only about halfway through that journey when we started building our solution on NServiceBus. The distribution of processing meant that we could keep the hardware relatively simple; we could use commodity hardware. And this was important to us because our infrastructure team was relatively small, and that was the hardware they were familiar with.

Introducing high-performance, expensive, bespoke hardware would have been tricky for the team that we had at the time. Opinionated. Again, we were a relatively small company with a relatively small technical team, and a relatively inexperienced team when it came to developing something like a service-oriented architecture. We didn't have a lot of the knowledge in house that we needed to build that type of system. So we were looking for a tool that had been built for the job, that had opinions baked into it that would help us become productive quickly. We were also looking for training, and this is something that NServiceBus offered. We were able to set up internal training courses where we could train up to 20 developers at a time and introduce them to the concepts that underpin NServiceBus, so we could get those guys up to a basic level of proficiency very, very quickly. We set up two courses on site, one with Andreas Öhlund and one with Jimmy Bogard.

And those guys came in and did a fantastic job of training the developers that we'd hired to come and build this platform for us. And something that Udi mentioned is community. The community that surrounds NServiceBus is an amazing community. It's full of super talented, super helpful and super friendly guys and girls, and whenever you post any kind of question, whether it be on Twitter or Stack Overflow or on the message boards, you usually get a great response to that question. I have certainly never had anything but. So now I'm going to talk a little bit about monitoring and tuning. Now, Udi touched on all the great new tools that have come with the later versions of NServiceBus.

We were building our platform on the 2.x version of NServiceBus, and we didn't have any of these wonderful tools yet. So we kind of had to build them. I still think this is valuable, because it's not necessarily a talk about the tools. It's more about why you need monitoring in order to tune your system and how that becomes helpful. So hopefully this is relevant. We had to add some more granular events. We wanted to separate out the time it took to deserialise a message, the time it took to handle that message, which would be the developer's code, and the time that it took to commit that message, so typically the time the DTC needed.

We were using WMI to send the data, and we were aggregating data every minute. So every minute an endpoint would emit a packet of WMI data, statistics about what that endpoint had been doing. And we were logging our exceptions by message type to help us locate the source of any errors, which is something that NServiceBus now does by default. This allowed us to locate bottlenecks quickly. Now, I kind of have to stress that we were not following the best practice of having a single message type in a single endpoint. We had quite a bit of overhead in our scanning and in the memory footprint of an individual endpoint, and to separate out every message type into its own endpoint from the get-go would have been a little bit too much of a cost in terms of the amount of memory we'd have had to fire up and the number of servers we'd have had to have running. The ratio of memory to CPU use would have been a bit ridiculous.

So we started off with a single endpoint per service when we were launching into a new region and we knew volumes were going to be low. And then we would gradually split out handlers from that main endpoint into their own endpoints as and when the monitoring dictated we needed to. The metric that we built ourselves was called percentage busy time, which allowed us to see the percentage of time an endpoint was busy. And basically what that meant was, as that number rose, we knew that the endpoint was starting to reach capacity. So if, over the course of whatever period you were monitoring, you saw that an endpoint's busy time was coming up to about 75%, you knew that endpoint was starting to reach capacity. And it was a good indicator that either you needed to add more worker threads to the endpoint, or you needed to go and investigate it and see whether it needed some more performance tuning or work done.
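The talk does not say exactly how percentage busy time was calculated, so the following Python sketch is only one plausible reading: the time worker threads spent processing in a one-minute window, divided by the total thread time available in that window. The `MinuteSample` type, its fields and the sample figures are hypothetical; the 75% threshold is the one mentioned above.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """One hypothetical one-minute sample emitted by an endpoint."""
    busy_seconds: float    # total time worker threads spent processing messages
    worker_threads: int    # number of worker threads configured on the endpoint


def percent_busy(sample: MinuteSample, window_seconds: float = 60.0) -> float:
    # Available capacity is every worker thread being busy for the whole window.
    capacity = sample.worker_threads * window_seconds
    return 100.0 * sample.busy_seconds / capacity


def near_capacity(sample: MinuteSample, threshold: float = 75.0) -> bool:
    # Roughly 75% was the point at which the endpoint deserved attention.
    return percent_busy(sample) >= threshold


quiet = MinuteSample(busy_seconds=60.0, worker_threads=4)    # 25% busy
loaded = MinuteSample(busy_seconds=186.0, worker_threads=4)  # 77.5% busy
```

Under this definition, adding worker threads raises the capacity in the denominator, which matches the remedy of adding threads when the number climbs.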

So due to the nature of our business and the way we would always have this really busy day at the end of the month, we could track an endpoint's busy time through the first two weeks of the month. If it was continuing to rise day after day, we could see whether or not we were going to breach SLAs by the end of the month, and we'd still have maybe two weeks to do something about it. Now, obviously, every domain is different, and that may or may not be useful to you depending on what you're doing. But in our case, it was really, really useful. And I wanted to give you just a few screenshots from the dashboards that we built.

This is the high-level view; you can see regions down the left-hand side here, and then the various parts of the system being represented by these colorful globes. Obviously green is good, red is bad. And there are some numbers about the business and what we're doing. Some of these have been blurred out; I'm not going to give away secrets. But this is a very high-level view. We put this up on monitors around the office. It was designed to alert engineers and support people immediately to the fact that a problem was occurring. And then engineers were able to drill down into the messaging overview. I apologize for this, it hasn't really come out very well on the projector. But essentially, what you're looking at here is an overall view of every single message across all endpoints.

It's a very good place to start if you know something's wrong but you just don't know what it is. You can investigate from here: if you see a very high spike in the error count, you can drill down into that just by clicking on the spike. The next screen I'm going to show you is the messaging detail screen. I have to refer to my notes because my memory is absolutely terrible. So this view is most useful when isolating the message type that is causing problems, typically excessively high processing time, a high number of errors or a high number of handler failures. This screen would show that. Now, if you're trying to diagnose a production issue, this is very useful information to have. And I'm sure this is something you'll be able to get from some of the new tools that have come out.

This view is the same screen, but the bottom chart is different. It's now looking at just a single message type. Again, if you were splitting your handlers out into separate endpoints, you'd be interested in this view because it's now down to that level. You know the message type that you're interested in, but you want to know: why is this message slow to process? Or what's the error that's been triggered by this message? And so from here I can go and investigate why this is a problem for me.

And so I'm going to talk a little bit about some of the metrics that we had. %Busy Time I've already talked about. %Idle Time was just the inverse of %Busy Time. And Critical Time is the standard NServiceBus counter. We split our metrics into metrics that counted the number of events in the one-minute time slot that we were recording over, the percentage of time, and the average duration, and those are useful for different types of scenarios when you're investigating a problem with your production system. Message Count is the number of messages that have been successfully handled during the time period. Error Count is the same but for errors: these are messages that have hit the max retry threshold.

So they've been through whatever your retry count has been set to, they've been through that process and been pushed to the error queue. The Retries Count and the Handler Failure Count are related. The Retries Count is the number of times message processing has been retried as a consequence of an unhandled exception. And the Handler Failure Count is the number of times message handlers failed due to an unhandled exception. So they're closely related. Then the percentage-of-time metrics. Here, message deserialisation time, which we'd split out, was measuring how long it would take to deserialise a message. Typically, this would be very, very small. If you ever saw this jump to being a significant percentage of the overall message processing time, that'd be very, very weird.

Message handler time is the time that the developer's code is executing inside the handler. So if you're seeing a long message handling time, it looks like you've got some problem with your code in there. And then finally, the message commit time was how long it was taking the DTC to commit your messages. This was typically 10% of the overall message time, so anytime we saw more than 10%, we started to suspect there was some issue with the DTC that we needed to go and address. Message processing time, as a percentage, is just the sum of the first three, and failed message processing time is just the amount of time spent processing messages that ultimately failed.
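To make the percentage-of-time breakdown concrete, here is a small Python sketch. It assumes each phase is expressed as a share of total processing time (deserialise plus handle plus commit); the function names and the sample timings are made up, and only the rough 10% rule of thumb for commit time comes from the talk.

```python
def time_breakdown(deserialize_ms: float, handler_ms: float, commit_ms: float) -> dict:
    """Return each phase as a percentage of total message processing time."""
    total = deserialize_ms + handler_ms + commit_ms
    if total == 0:
        return {"deserialize": 0.0, "handler": 0.0, "commit": 0.0}
    return {
        "deserialize": 100.0 * deserialize_ms / total,
        "handler": 100.0 * handler_ms / total,
        "commit": 100.0 * commit_ms / total,
    }


def commit_looks_suspicious(breakdown: dict, threshold_percent: float = 10.0) -> bool:
    # Commit time was typically around 10%; noticeably more hinted at a DTC issue.
    return breakdown["commit"] > threshold_percent


normal = time_breakdown(deserialize_ms=2.0, handler_ms=88.0, commit_ms=10.0)
slow_commit = time_breakdown(deserialize_ms=2.0, handler_ms=58.0, commit_ms=40.0)
```

A check like this only flags where the time went; deciding whether the cause really is the DTC still needs investigation.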

So again, these are the same metrics that you've just seen, but instead of percentage time, these are the average duration metrics in milliseconds. These are particularly useful when you've got messages that arrive infrequently at your endpoint but take extended periods of time to process. They wouldn't show up in the percentage stats, but they would show up in the average duration stats. So yeah, it would be unfair of me not to mention the talented engineer who built our monitoring solution and the dashboards that we used around it, just because it has been so useful. When we've launched in new regions and seen good traction, it's allowed us to really dig down into the system and figure out what we needed to change and where the bottlenecks were in a timely fashion. It really saved developers a lot of time struggling through logs and digging into messages manually.

So Ivica Cnrkovic is the engineer who built all this. And if you have any questions about any of it, I'm happy to put you in touch with him. He did ask to be called out on it as well, so he is very proud of it. Next, I'm going to talk a bit about an integration example. This is just by way of showing you something concrete that we've had to build at Wonga. Again, this is part of the process when we need to collect a loan that's fallen due, and this should be a fairly common scenario for anyone who's working in the consumer finance space. It's a very simple system. We have two endpoints, an engine and a bot.

The responsibility of the engine is pretty much to maintain the state of what's going on in the system. It's got to decide when to collect the payment that's due, it's got to trigger that collection attempt, and it's got to record the outcome of that attempt. That's all it has to do. And the bot is merely responsible for talking to your payment service provider of choice, typically over HTTP, given that most things nowadays are done over the internet. With this endpoint, we didn't want to run the risk of inadvertently collecting the same payment twice. It was a business decision: we'd rather send a request to the payment service provider to collect a payment, get an error, and not collect the payment, than inadvertently send that same request twice to the payment service provider by way of a retry and collect that payment twice, simply because from the customer's point of view it's very annoying.

If you collect their payment twice, they're only expecting you to collect it once. And if you don't collect it, well, then they're not upset about that either. So that's why we separated these two endpoints out. And, as it says on the next slide, it gave us the freedom to set the max retries to zero and also to disable second-level retries in the endpoint config. So this endpoint will only send that request once, regardless of whether or not the request results in an exception. And one other thing is we had to process quite a large volume of requests in a relatively short space of time. We were set the task of being able to process 100,000 payment requests in under an hour.

And in order to do that, because we were using the HttpWebRequest object, we had to make an adjustment to this setting, the maxconnection setting in the connectionManagement node in our app.config. If we didn't do this, then by default we'd be throttled to only two concurrent requests to the same endpoint, the same PSP. It took us a while to figure that out, unfortunately, but once we had figured it out, it worked very, very well. And throttling these types of endpoints is very straightforward. You can either throttle by the number of worker threads that you configure, or you can throttle by maximum concurrent connections; they have pretty much the same effect. And what can I say about these sorts of requests? They're usually high latency, 500 millisecond requests. They've got to go out to the PSP, the PSP has got to contact the bank and find the authorization for the payment request, and that has to come back to the PSP and then come back to you.
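A back-of-the-envelope calculation shows why the default limit of two connections mattered. The Python sketch below takes the 100,000 requests per hour target and the roughly 500 ms latency from the talk, and assumes (as a simplification) that every request takes exactly that long and that there is no other overhead. The function names are invented for the example.

```python
import math


def required_concurrency(requests: int, window_seconds: float, latency_seconds: float) -> int:
    """Concurrent in-flight requests needed to hit a throughput target."""
    rate_per_second = requests / window_seconds
    return math.ceil(rate_per_second * latency_seconds)


def max_requests(concurrent: int, window_seconds: float, latency_seconds: float) -> int:
    """Requests that fit in the window at a given concurrency limit."""
    return int(concurrent * window_seconds / latency_seconds)


# Target from the talk: 100,000 payment requests in under an hour at ~500 ms each.
needed = required_concurrency(100_000, 3600, 0.5)   # 14 concurrent requests
# With only two concurrent connections to the PSP:
capped = max_requests(2, 3600, 0.5)                 # 14,400 requests per hour
```

Under these assumptions, two connections deliver well under a fifth of the target, while around 14 in-flight requests would be enough; real latency varies, so some headroom above that figure would be sensible.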

So you're typically just waiting for a response. You're not doing much processing in terms of CPU; you're just sat there waiting. And that's why you need to be able to send out a lot of requests in parallel if you need a lot of throughput. The other thing I'd say is, if you need to do more than 100,000 requests an hour, then you start to pay quite a large overhead for sending one request at a time, in terms of all that HTTP chatter, and you'll probably start talking to your provider and saying, can we batch these requests and send you small batches? The Unique attribute came along in NServiceBus 3.0. The example I'm showing you wasn't built on the new platform; it was actually built on the old platform, and we got the opportunity to try out NServiceBus 3 for the first time. And along came the Unique attribute, a wonderful way to tell NServiceBus that there should only be one instance of a Saga based on a given field in your Saga state. In this case, the agreement reference.

When used properly, this is an amazing feature. Unfortunately, due to a small oversight on my part, when we deployed this code to production for the first time, we omitted that Unique attribute. And everything worked. We had no problems with the system; it performed flawlessly until I spotted that we'd missed the Unique attribute. I put it in and released a patch. And then some strange things happened. If everything's working as it should be, then in this example of what you see in RavenDB, we've got three different Sagas here, actually unrelated to the example I'm showing you, and we've got four instances of each Saga and 12 Saga unique identity documents. And that's a good number, because they match.

And that means that for every Saga instance, there's a unique identity document in RavenDB. Had I looked at this screen immediately after I'd released the patch to production, I'd have seen the number of Saga unique identity documents and twice as many Saga state documents. Basically, we'd released the Unique attribute and told NServiceBus that this should be unique, so it went to look for the Saga unique identity documents and couldn't find them. So it decided there was no Saga and created a new one with blank state. So then we had a problem.

We could easily delete all the blank Saga state documents from Raven; that wasn't a problem. But we had to fix the underlying problem: we had to create the Saga unique identity documents ourselves. They're not very difficult to create. The ID of the document in Raven is simply the name of the Saga, the name of the unique column, and then a GUID computed from the value in the column. So actually, all we needed was a little bit of code that would go through all the Sagas that we knew about, compute that GUID, create a document with the correct ID, store that in Raven, and then just patch the metadata on the Saga to reference it.
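The shape of that repair script can be sketched in Python. This is not the original code and it does not reproduce the real document format: the ID layout, the use of `uuid5` to derive a deterministic GUID, the `EXAMPLE_NAMESPACE` constant and the dictionary standing in for the document store are all assumptions made for illustration. The only points taken from the talk are that the ID combines the Saga name, the unique property name and a GUID computed from the property value, and that one identity document is needed per existing Saga.

```python
import uuid

# Hypothetical namespace; the real hashing scheme is not given in the talk.
EXAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def unique_identity_id(saga_name: str, property_name: str, value) -> str:
    # Deterministic GUID: the same value always yields the same document ID.
    guid = uuid.uuid5(EXAMPLE_NAMESPACE, str(value))
    return f"{saga_name}/{property_name}/{guid}"


def backfill(sagas: list, property_name: str, store: dict) -> int:
    """Create a missing identity document for every known saga instance."""
    created = 0
    for saga in sagas:
        doc_id = unique_identity_id(saga["type"], property_name, saga[property_name])
        if doc_id not in store:
            store[doc_id] = {"saga_id": saga["id"]}   # points back at the saga
            created += 1
    return created


store = {}
sagas = [
    {"id": "sagas/1", "type": "PaymentCollectionSaga", "AgreementReference": "AG-001"},
    {"id": "sagas/2", "type": "PaymentCollectionSaga", "AgreementReference": "AG-002"},
]
first_run = backfill(sagas, "AgreementReference", store)
second_run = backfill(sagas, "AgreementReference", store)   # idempotent re-run
```

Because the ID is derived deterministically from the value, running the script twice creates nothing new, which is a useful property for a one-off production fix.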

And we were golden. There was a bit of head scratching. We were lucky that this was a fairly new process and an offline process, so we were able to take the system down whilst we were figuring out the solution. And as of NServiceBus 4, I think I'm right in saying that if you have custom finding logic on a specific field in your Saga and you don't put the Unique attribute in, you'll get a compiler warning. I believe I'm right in saying that; someone can tell me after the talk if I got that right or not. So I think there's a little bit of a safety net in there now. I know of one other person that's had this problem, because they contacted me about it, but hopefully no one in this room will have it. So by way of finishing off this introduction to this particular implementation, this is just the Saga that we built, the payment collection Saga in the engine endpoint.

It's very straightforward. It's activated by a start collection message, it will stop if it gets a stop collection message from the system that's emitting those, it will schedule a timeout to request the payment, and then it will record the outcome of that payment. Down here you can see we have custom Saga mapping on the agreement reference. So again, we want to find the Saga based on the agreement reference contained in the message that's coming in. The code to handle start collection is simply going to store the reference and the amount, and schedule the timeout for the collection attempt. And for the stop collection message, all it's really doing is checking to see whether we've ever made a collection attempt. If we haven't, then we just mark it as complete.

If we have, we want to keep the state around to record that we've made that collection attempt, and we don't want to make any further ones. And the final piece of the puzzle is just the handler for the timeout: when the request payment timeout comes in, it'll send the request payment message to the bot to actually trigger the payment collection attempt. And then the bot will do a Bus.Reply with the outcome of that payment. It's slightly simplified from the actual implementation, but pretty true to the original. Since we implemented the solution, I can't tell you the exact number of payments we've processed through it, but it's in the order of millions. And it's been very, very low maintenance as well. It's worked really, really well for us; we've not had to touch it since we figured out the unique Saga issue.
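The flow just described can be summarised as a small state machine. The Python sketch below is an illustration only, not the NServiceBus saga from the slides: the class, method and field names are invented, and lists stand in for scheduled timeouts and messages sent to the bot.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentCollectionSaga:
    """Toy model of the engine's saga; names and structure are illustrative."""
    agreement_reference: str = ""
    amount: float = 0.0
    attempted: bool = False
    outcome: str = ""
    completed: bool = False
    timeouts: list = field(default_factory=list)   # stands in for scheduled timeouts
    sent: list = field(default_factory=list)       # stands in for messages to the bot

    def handle_start(self, agreement_reference, amount, due_date):
        # Store the reference and amount, then schedule the collection attempt.
        self.agreement_reference = agreement_reference
        self.amount = amount
        self.timeouts.append(("RequestPayment", due_date))

    def handle_stop(self):
        # Only complete if no attempt was made; otherwise keep the state as a record.
        if not self.attempted:
            self.completed = True

    def handle_timeout(self):
        # Ask the bot to collect, at most once per saga.
        if self.completed or self.attempted:
            return
        self.attempted = True
        self.sent.append(("RequestPayment", self.agreement_reference, self.amount))

    def handle_outcome(self, outcome):
        # The bot's reply is recorded against the saga state.
        self.outcome = outcome


collected = PaymentCollectionSaga()
collected.handle_start("AG-001", 250.0, "2014-01-31")
collected.handle_timeout()
collected.handle_outcome("Collected")
collected.handle_stop()        # ignored: an attempt has already been made

cancelled = PaymentCollectionSaga()
cancelled.handle_start("AG-002", 100.0, "2014-01-31")
cancelled.handle_stop()        # no attempt yet, so the saga completes
cancelled.handle_timeout()     # nothing is sent after completion
```

The two usage examples cover the two branches of the stop handler: state is retained once a collection attempt exists, and the saga simply completes when stopped before any attempt.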

So, lessons learned. Doing presentations is really quite tricky, and you have to prepare. It takes a lot of time and effort. Udi makes it look easy. Code dependencies. We were a small team, we were fairly naive, and we didn't pay enough attention to our code dependencies early on. Fairly late on in the project we managed to split our code out into separate repos. We chose a repo per service, and we had an ops repo, which was supposed to be just for the infrastructure code that we needed to stand up an endpoint and nothing more. And we ended up with far too many dependencies on that code. We actually ended up with business logic in that repo, or with developers having to make changes to that repo to implement business logic.

And as the team grew, we got more developers coming into the team and building more and more stuff on our platform, and we really started to feel the pain. Our CI environment started to buckle under the weight of all these commits coming in and hitting this low-level repo, which should only build occasionally, and causing a cascade of builds. This massively elongated the feedback loop for developers; they had to wait a long time for CI to complete the build and all the tests. And the problem was just exacerbated as we got more and more people in the team. So pay attention to your dependencies early on, pretty much from day one if you're starting out on a brand-new project. I hope I'm preaching to the converted on that. Testing async.

So it was a bit of a shock to us, when we first ran our tests, that they didn't really know when a command would have been processed. We quickly learned that you have to poll for a given state. If you want to know something's happened, you're going to have to poll the database to find out that it's happened, because you don't know when it's going to happen. It's not a big deal. Once we learned it, it was fine; it was just something you'd always have to introduce to a new member of the team, and explain to them why we sat there polling the database waiting for a particular state in our acceptance tests.
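A polling helper of the kind described here might look like the following Python sketch. The function name, the default timeout and the injectable `clock` and `sleep` parameters are assumptions; the talk does not show the actual test code.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_seconds: float = 10.0,
               interval_seconds: float = 0.1,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or the timeout expires.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

In an acceptance test the `condition` would typically be a small function that queries the database for the expected row or status, so the test sends its command and then asserts on `wait_until(...)` instead of assuming the message has already been handled.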

Testing long-running processes. Again, sometimes we'd have to build a Saga that was going to run for weeks or longer. Typically the collection Sagas would start when the loan was taken out, and they would complete once we'd collected the funds that were due. And obviously you can't sit around and wait for a month to pass by in your tests. So there were two things we had to do. The first was to stuff simulated messages into the input queue, and build tools to do that, to simulate the messages arriving, like the timeouts and other messages. That was reasonably straightforward. The other thing was that occasionally we were dealing with a problem where we needed the handler to think time had passed.
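One common way to make a handler think time has passed is to inject a clock that the test controls. The Python sketch below is an illustration under assumptions: `FakeClock`, `ReminderHandler` and the interval used are hypothetical and do not come from the Wonga code base.

```python
from datetime import datetime, timedelta


class FakeClock:
    """Test clock whose notion of 'now' only moves when the test says so."""

    def __init__(self, start: datetime) -> None:
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        # Only ever move forwards, so nothing is scheduled in the past.
        if delta < timedelta(0):
            raise ValueError("the fake clock can only move forwards")
        self._now += delta


class ReminderHandler:
    """Hypothetical handler that acts once enough time has passed."""

    def __init__(self, clock: FakeClock, last_processed: datetime,
                 interval: timedelta = timedelta(days=28)) -> None:
        self._clock = clock
        self._interval = interval
        self.last_processed = last_processed
        self.actions_taken = 0

    def handle_timeout(self) -> None:
        # A simulated timeout message arrives; act only if the interval has elapsed.
        if self._clock.now() - self.last_processed >= self._interval:
            self.actions_taken += 1
            self.last_processed = self._clock.now()
```

A test can then construct the handler with a `FakeClock`, deliver a simulated timeout, call `advance(timedelta(days=28))`, and deliver another one, all without waiting for real time to pass.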

So we needed it to think that 28 days had gone past since it last processed a message. So we had to inject some mocks for the time into our acceptance tests to fool it into thinking that time had passed. Typically, we'd always do this in the future, because if you try and request a timeout in the past, NServiceBus gets very grumpy with you, as it should.

Deploying the DTC. Udi touched on this earlier. It's a horrible thing to have to do, especially in production, especially if you're not working on a flat network, if you've got lots of firewalls between your endpoints and your databases. It's a very unpleasant process. It relies on things like NetBIOS for name resolution, which can be tricky as well, because it causes problems not only in production; it can cause problems on your dev machines and in your trial environments, depending on how diligent your network admins are.

And it hampers performance. On more than one occasion we've seen the DTC really dragging our endpoints back. And there aren't a lot of good solutions when you start seeing the DTC become the performance blocker.

Rewrite from scratch? We chose to rewrite from scratch. I think from a technical point of view it was the right decision; we had a lot of baggage in our code base. But from a people and teams point of view it was probably the wrong decision. The reason I say that is we started off with the best of intentions: we told the business we'd rewrite this system, which took three years to build, in six months with six engineers on the job.

Six months later, we were nowhere near finished. In the meantime, the old platform had kind of gone begging. It hadn't had any love; the most experienced engineers, who had brought it to where it was, had been diverted away from it. And unsurprisingly, the business kept growing, customer numbers kept growing, and we started to get quite serious performance issues. The business could no longer afford for that team of experienced engineers to carry on working on the new platform. They then had to move them back to the old platform. There was functionality to build there, there were performance issues to solve, and there were regulatory issues that we had to take care of as well.

So we didn't abandon the new system. We hired a whole bunch of new engineers to come in and work on that. But they didn't have the benefit of the original team's hard-won knowledge about the domain and how things worked, and a lot of the time they basically had to make the same mistakes again that the first team had made. So, for non-technical reasons more than any other reason, I think in our circumstances it would have been wiser to say to the business: look, we don't want to build from scratch, but we know we have got issues and we know we need to tackle those issues. Let's start with the most important issue, keeping the website up regardless, and then let us work out what we need to do to make that happen with the existing system.

That way, we could have brought all the knowledge to bear on it. We could have expanded the team in a better way, a more cohesive way, and I think we would have got a better result. We still got there in the end, but it was a bit of a bumpy ride. What would I like to see in the future? I'm not the biggest fan of drag and drop, but I'll give it a go. I think, just based on our experience, the thing we found really difficult was deployment. Automating our deployment was really, really hard. I didn't expect it to be hard. I thought it would just be as simple as copying some files onto a server somewhere and that would be deployment done.

Turns out not so. I am not sure what the Particular guys can add to this, but anything that they can put their impressively large brains to in that space would be welcome, and anything they can offer to help the people that are coming to NServiceBus for the first time, even if it was a really simplistic model that wouldn't necessarily see them all the way through to the end of their journey. If it started them off with a way of getting to an automated deployment scenario very quickly, I think that would be valuable.

And I'd say that automating your deployment from day one is very important as well. Trying to retrofit that later, we found very, very hard. Questions? Has anyone got anything, anything I have not made plain or anything I talked about here? I must stress, if you don't want to stand up now and ask in front of everyone, you're more than welcome to come to me later, and I am sure you can get my contact details from wherever on the web. Google me. Yes.

Sorry, I just....

42:29 Speaker 2

Have you embraced any of the new tools?

The question was, have we embraced any of the new ServicePulse tools we have now? I am not entirely sure, given that we are on the 2.x version of NServiceBus, that we could. But Udi, you will correct me. Could we, Udi? We have put our toe in the water with version 2, we have the example I just showed you with version 3, but it was not part of the main new platform. And there has been some investigation work by the team into moving to version 3. Unfortunately, we do not have the benefit of the unobtrusive mode in that. So we are trying to figure out ways that we can run 2.x and 3.x in parallel and gradually move over rather than going Big Bang.

The business has no appetite for Big Bang. Question at the back. So we break our teams down by region. And the biggest by far is the UK team, responsible for the products in the UK, and that is about 50 people, if you include product owners, developers and testers. For the other products and other regions, if you are just talking about people that write NServiceBus-type code, I think we are talking about another hundred people spread across another six or seven teams. I would also say that we grew our development team too quickly.

Again, it's a balancing act: you've got a lot of stuff to do, so more people to do it is helpful, but you also need to keep things cohesive and keep shared values within the development team. Growing a team, especially a distributed one, while keeping good, strong shared values and a good, strong culture in the team is very, very hard, I've learned. I am just about out of time. I hope some of that was useful; I'm sure maybe not all of it is applicable to you guys in the various domains you're working in.

I would love to hear about what you guys are working on, the domains you are working in. It's just as useful for me to come here and talk to you guys and hear what you're doing. And I can learn a huge amount, but I will say goodbye. Thank you.
