Webinar recording
Scaling for Success: Lessons from handling peak loads on Azure with NServiceBus
Learn how to use Azure and NServiceBus to handle unexpected peak loads and avoid embarrassment.
Why attend?
What happens when 200k users unexpectedly decide to use your platform simultaneously? We’re using autoscale on Azure PaaS so surely we can handle that, right? Wrong! Ask me how I found out… After going through a bit of trouble, I want to help you avoid the same mistakes I made.
In this webinar you’ll learn about:
Handling peak loads on Azure with:
- Azure App Services
- Function apps
- SQL databases
- And other technologies
Transcription
- 00:02 Dennis van der Stelt
- Hello everyone and thanks for joining us for this Particular live webinar. My name is Dennis van der Stelt. I'm your host for today and I'm joined by Dibran Mulder. Dibran is a Particular Software recognized professional, so he knows a thing or two about NServiceBus, and today he'll talk about lessons he learned while handling peak loads on Azure with, among other things, NServiceBus. Dibran works at Cloud Republic as a solution architect and he tries to focus on the latest technologies. He has experience in building high-performance cloud-based solutions using serverless and platform as a service technologies. And he mentioned to me he became intrigued with the intelligent applications principle and the possibilities it offers to his customers.
- 00:51 Dennis van der Stelt
- And because of this, he has a growing interest in artificial intelligence and the changing application lifecycle that comes with smarter client software. He likes to explore those technologies and apply them in real business scenarios and with his enthusiasm also share all this knowledge in the presentations and webinars like the one we're hosting now. A last quick note before we begin, please use the Q&A feature to ask any questions you may have during today's live webinar and we'll be sure to try and address them all at the end of the presentation. Any questions we can't answer during this webinar, we'll follow up with offline. As a reminder, the webinar is being recorded and everyone will receive a link to the recording via email. Okay, so let's talk about handling peak loads on Azure. Dibran, welcome and take it away the floor is yours.
- 01:47 Dibran Mulder
- Well, thanks for having me. I'll put on my video so it becomes a little bit more personal, since I'm sitting here in my attic in the Netherlands. I'll be telling you a lot about handling peak loads on Azure. As mentioned, my name is Dibran Mulder, I'm an Azure Solutions Architect working for Cloud Republic. Cloud Republic is a small company based in Utrecht, in the Netherlands, and was part of a larger company, the Caesar Group, which I'm also the CTO of. I'm also the host of a Dutch-speaking podcast called DevTalks. So if there are any Dutchies in the room or in the webinar, make sure to follow it. Since it's quite a diverse audience, I saw some people joining from the US, Canada, maybe also other continents, it might be good to talk a little bit about where I'm working. I'm based in the Netherlands, which is a country in West Europe and it's basically the smallest country of West Europe. In that country there's a company called Cito, which is very well known to everyone in the Netherlands because they are the primary institute for school testing in the Netherlands.
- 03:16 Dibran Mulder
- I did a project for them, it lasted for two years, and today I will be talking about all the lessons learned from running that project, from basically building it up from the ground and running it in production. But first, let me talk about the situation, how it was before we joined Cito. Well, they basically had a very old system, which was a WinForms-based solution that was used to test the students, mainly in primary schools. So as you can see, it's a pretty decent WinForms interface. Hopefully there are still some visitors in the webinar that also did WinForms back in the day. I surely did, I had a blast, but times are changing. With this client software, Cito basically had a distribution nightmare. Every teacher had to install this WinForms application on his desktop or use cloud workstations to enter the results of tests to basically track the progress of a student. Well, Cito was basically a governmental institute, but it became a market organization.
- 04:56 Dibran Mulder
- The government decided to open up the market for testing in primary education so they had to innovate and they came up with a new platform. They said, "We have to reinvent ourselves. We have to make sure we can offer proper online testing. We have to make sure we innovate so that we can better support our teachers and basically better support our students and better test their progress, better help them in developing." They basically launched two products on top of the new platform. The first is called Leerling in Beeld, which is Dutch for Student in View, basically. And they also delivered a second product as of this year, it's called the Doorstroomtoets, which is basically a progression test. Everyone that leaves primary school has to take a progression test. It's stated by law in the Netherlands that everyone has to do such a test. They wanted to offer online testing for that as well.
- 06:00 Dibran Mulder
- We redeveloped, we re-architected an entire new cloud-based platform with authentication, administration, and several modules like reporting, online testing and test product definitions and such, to create new products on top of that, that are widely used within primary schools in the Netherlands as of today, as of last year basically. So to give it a little bit more of a visual representation, this is a screenshot of Leerling in Beeld. It's basically a cloud-based application which teachers use to plan tests, to enter test results, to look into progress, to check how students develop themselves, what their progression is, in what types of areas they need to learn more. It also contains an online testing environment with audio and video types of questions, and there are also multiple choice or open questions. But the gold of Cito, their bread and butter, are the reports that come out of the tests. Cito is well-known for their knowledge of the testing space and they provide very detailed reports on how children are progressing in their development in all kinds of subjects such as reading, math and others.
- 07:43 Dibran Mulder
- They are the institute that defines also the categories of grading. So on the right side of the screen you see that people are scored on certain M levels, M5, M6, M7 levels. In the Netherlands you can also buy books which have that level of reading comprehension, for example. So they are the standards in that case. These are the new reports and we developed that, but let's dive into the technical stuff. Every January 70% or more than 70% of our primary schools in the Netherlands take a test on our platform. You have to know that Cito is being used as an external benchmark often. During the regular year, students take all kinds of tests, but twice or three times a year, the entire country basically takes tests using the Cito platform to benchmark their students on a normal level. Once we developed this platform, it basically launched during COVID or right after COVID. Before COVID paper testing was dominant, but during 2023-2024, lots of schools used the new student tracking platform for the first time.
- 09:23 Dibran Mulder
- Well, basically on the 17th of January, every school had this same schedule, like 8:15 the school opens, 8:30, the teacher opens the day, and then if the students are still fresh, they are not tired yet, the entire country starts taking a test. Then, we sadly have to wait for Azure to scale up and then they can continue their test. They have a break and play outside and take the second test. I want to emphasize some little details. It's from the third grade until the eighth grade, like children from the age of around 6 until 12, all children are taking tests that very day. Some kids cannot even read so they use pictures to log in. In the middle of the screen it says Wachtwoord, which is password in Dutch. So even some children have passwords to log in into their online testing environment, they have picture passwords. It's a pretty funny detail. As you would expect, you get quite a predictable web traffic in Azure. Well, during 7:00 AM, 8:00 AM, all teachers are still dropping into the school. They start up their computer.
- 11:03 Dibran Mulder
- They welcome all kids at 9:00 AM and then at 9:05 or 9:10 AM, all students are taking the tests. Then the test usually takes up like 30 minutes and then they break and play outside. And then at 11:00 AM the second tests take place. We used App Services in Azure for our online testing environment and the CPU percentage correlates well, one-to-one, with the HTTPS traffic. As you can see we hit a hundred percent at 9:00 AM. Well, hasn't Azure scaled? It actually did, but it did it too late, so that was actually our bottleneck. It wasn't scaling up fast enough. We didn't expect every student to basically hit the test at exactly the same moment. So what happens when you make the newspaper? Well, lots of students cannot take tests on your platform. Well, your manager is going to sit right next to you. I was the team lead of the team back in the day. Currently, I'm doing another assignment, but I was sitting there. Your alerts and your monitoring are going off, everything is beeping, service care is flooded with calls.
- 12:45 Dibran Mulder
- And you never see your manager, but when the shit hits the fan, he's going to sit right next to you. You know trust has been violated. They thought the platform was in good hands, it surely was, but we had to fix the problem. Your work is going to be monitored all the time. So what would you do? What would you do in such a situation? You see the app service isn't scaling quickly enough, what would you do? Would you take ownership and act? We're going to scale up now manually. Are we going to set it to 20 and never turn it back down, or would you organize a meeting and discuss the best approach? You know students are not able to take tests on your platform. Well, we hadn't thought about it. I acted, but I could imagine people wanting it to be a group effort. Well, then a more technical question, would you head to the Azure portal and scale up manually despite your Infra as Code policy or would you scale up using Infra as Code and take a longer time to fix the problem?
- 14:01 Dibran Mulder
- Well, would you fix the problem yourself as a team lead or would you let a team member fix it while you act as the shield for that team member? All kinds of questions come up that you haven't thought about. If you have been in such a situation, it's very good to realize that these are the problems that are going to land on your desk and you have to make a decision. Well, let's dive into the app service scaling issue. So we used a rule-based system. We're taking a baseline, for example, two or five instances and we would scale up when CPU hits like 60% and then decrease over time. Well, that's basically not working for us because the CPU has to be over 60% for something like five minutes before the rule fires. So it's going too slow. What you can better do is scale up aggressively, like scale up to 20 instances and decrease over time. Or if it's a predictable pattern, you could have a higher set of base instances.
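To make the timing problem concrete, here is a minimal Python sketch of a rule-based scaler. It is not Azure's autoscale engine: it simply assumes a rule that fires only after CPU has stayed above 60% for a full five-minute window, and compares adding two instances per step with jumping straight to 20. The capacity per instance and the load curve are made-up numbers for illustration.

```python
def simulate(load_per_minute, base_instances, step, max_instances,
             capacity_per_instance=1000, threshold=0.60, window_minutes=5):
    """Return a list of (minute, instances, cpu) for a simple rule-based scaler.

    The rule fires only after CPU has been above `threshold` for
    `window_minutes` consecutive minutes, then adds `step` instances.
    """
    instances = base_instances
    minutes_over = 0
    timeline = []
    for minute, load in enumerate(load_per_minute):
        cpu = min(1.0, load / (instances * capacity_per_instance))
        timeline.append((minute, instances, cpu))
        minutes_over = minutes_over + 1 if cpu > threshold else 0
        if minutes_over >= window_minutes and instances < max_instances:
            instances = min(max_instances, instances + step)
            minutes_over = 0  # the rule must observe a full new window
    return timeline


def minutes_saturated(timeline):
    """Count the minutes in which CPU was pinned at 100%."""
    return sum(1 for _, _, cpu in timeline if cpu >= 1.0)


# Hypothetical spike: quiet start, then 15,000 requests/min for 40 minutes.
spike = [500] * 5 + [15000] * 40

gradual = simulate(spike, base_instances=2, step=2, max_instances=20)
aggressive = simulate(spike, base_instances=2, step=18, max_instances=20)
pre_provisioned = simulate(spike, base_instances=20, step=2, max_instances=20)

print("saturated minutes, +2 per step    :", minutes_saturated(gradual))
print("saturated minutes, jump to 20     :", minutes_saturated(aggressive))
print("saturated minutes, 20 from start  :", minutes_saturated(pre_provisioned))
```

In this toy model the aggressive rule still spends the first evaluation window saturated, which is why a higher baseline is the only option that avoids the outage entirely when the spike is predictable.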
- 15:21 Dibran Mulder
- So if you look at the lessons learned from running app services and expecting huge amounts of visitors within small time windows, a five minute evaluation time is too slow in some specific use cases. Also, it's better to scale aggressively, like not add two instances over time, but take huge leaps and scale down progressively. It's a performance versus cost debate after all, and scaling aggressively isn't hurting cost that bad. Also pre-provisioning is very helpful in some cases, but pre-provisioning has a negative impact on cost. So basically the conclusion is it's very hard to be cost effective and confident at the same time. And also, if 70% of students in the Netherlands are taking tests and it's not working and your family knows you are working on that system, be prepared to get shit from your nephews and nieces. That was also a fun addition. Also, you have to ask yourself the question, haven't we tested right? Because we surely did load test the system. We load tested it with a ramp up test and that was basically exactly the problem.
- 16:59 Dibran Mulder
- A ramp up test is like a ladder, it moves up over time. It adds users over time. It doesn't correspond to the real world user scenario where we hit 150K users within a five minute period. Also, we didn't expect a paradigm shift in adoption of digital testing. It was a major success for the company that lots of schools were willing to take online testing instead of paper-based testing. We took the non-functionals according to the pre-COVID scenarios. Below you see a picture of the pre-provisioning which we are using right now. As you can see, we're just scaling up during work hours and scaling down after work hours. The company is willing to accept those costs. Once something breaks, once you hit 150K users and you fix the problem and the next day 150K users show up again, you always have to ask yourself the question, what is the next weakest link in our architecture? I want to talk about IdentityServer for a bit because we used IdentityServer as our identity and access management solution because it is very dynamic. You can implement lots of scenarios with IdentityServer.
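The difference between a ramp-up test and the real traffic can be sketched in a few lines of Python. The 150K users arriving within five minutes comes from the talk; the 60-minute test duration and the linear shape of the ramp are assumptions chosen only for illustration.

```python
def ramp_profile(total_users, duration_minutes):
    """Users arriving per minute when arrivals grow linearly (a 'ladder')."""
    weights = list(range(1, duration_minutes + 1))
    scale = total_users / sum(weights)
    return [round(w * scale) for w in weights]


def spike_profile(total_users, duration_minutes, burst_minutes):
    """Users arriving per minute when everyone shows up in the first burst."""
    per_minute = total_users // burst_minutes
    return [per_minute] * burst_minutes + [0] * (duration_minutes - burst_minutes)


ramp = ramp_profile(150_000, duration_minutes=60)
spike = spike_profile(150_000, duration_minutes=60, burst_minutes=5)

print("peak arrivals/min, ramp :", max(ramp))
print("peak arrivals/min, spike:", max(spike))
print("arrivals in first 5 min, ramp :", sum(ramp[:5]))
print("arrivals in first 5 min, spike:", sum(spike[:5]))
```

Both profiles deliver the same total number of users, but the ramp gives an autoscaler an hour to react while the spike gives it none, so a load test built on the first shape says little about the second.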
- 18:34 Dibran Mulder
- Starting from the left, we had several client applications communicating with our IdentityServer using OpenID Connect. And then we had several identity providers connected to our IdentityServer, some of which are pretty industry standard, such as Azure Active Directory, or Microsoft Entra as it's called nowadays. We use that to authenticate internal employees. We used Azure B2C for some other identities, but there was also an industry-specific one in the Netherlands. Remember that picture where the student had to log in using a picture? Well, that's called Basispoort, and every teacher and every student in the Netherlands has a Basispoort account, and it uses SAML as a protocol to authenticate users. Basispoort is truly legacy software, it has all kinds of weird SAML flows that you need to deal with. So IdentityServer was giving us the flexibility we needed to implement all those industry-specific flows. In IdentityServer, it's good to know that you have a client configuration and you have a persistent grant storage.
- 20:01 Dibran Mulder
- In the client configuration, it's a database which contains all consuming clients like our Leerling in Beeld application or our online testing environment or our reporting servers. It also has a operational data storage called the Persistent Grant Storage in which persistent data is being stored such as refresh tokens. Well, this is a screenshot. If you go to a sample application of Cito, then you are prompted with which identity provider are you willing to use to log in into the system and you can see it's using Basispoort for primary education and Trey for secondary education, but also internal accounts and stuff like that, Google for test accounts. So just to give you a sense of how it looks like. Well, the IdentityServer Persistent Grants, if you have a 150K of students and 10K teachers and the students use it for two hours a day and the teachers use it for eight hours a day, then students generate about 10 refresh tokens per student.
- 21:18 Dibran Mulder
- That assumes a 900-second lifetime, which was the default, and teachers will generate 40 refresh tokens per teacher. If you calculate the amount of data being used, then you can see the database grows by almost 2.5 gigabytes a day just with refresh tokens. Once we finished the test week, exactly 14 days after the testing week, we experienced DTU issues on our Azure SQL database. As you can see, sometimes it spikes to 100%. And then we restarted the IdentityServer and then it just came back up again. Turns out, users made extensive use of the online testing environment, and our composable front-end architecture tripled the amount of generated refresh tokens. Refresh tokens are kept in the Persistent Grant Storage to make sure the lineage of tokens is correct. You might think that once a refresh token has expired it can be dismissed, but that's not the case. IdentityServer keeps track of the entire lineage for X amount of time, so it makes sure it doesn't have two separate lineages for the same refresh token.
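The back-of-the-envelope calculation from the talk can be reproduced as follows. The user counts, session lengths and the 900-second lifetime are the speaker's numbers; exact division gives 8 and 32 tokens, which the speaker rounds to about 10 and 40. The average row size is an assumption, picked so that the total lands near the quoted 2.5 GB per day.

```python
def tokens_per_session(session_seconds, token_lifetime_seconds):
    """Refresh tokens issued when a token is renewed once per lifetime."""
    return session_seconds // token_lifetime_seconds


LIFETIME = 900  # seconds, the default mentioned in the talk

students, teachers = 150_000, 10_000
student_tokens = tokens_per_session(2 * 3600, LIFETIME)   # 2 hours of use
teacher_tokens = tokens_per_session(8 * 3600, LIFETIME)   # 8 hours of use

rows_per_day = students * student_tokens + teachers * teacher_tokens

# Assumption: average stored size per grant row, back-calculated so the
# total lands near the ~2.5 GB/day quoted in the talk.
ASSUMED_BYTES_PER_ROW = 1_600

gb_per_day = rows_per_day * ASSUMED_BYTES_PER_ROW / 1e9
print(f"{student_tokens} tokens/student, {teacher_tokens} tokens/teacher")
print(f"{rows_per_day:,} rows/day, about {gb_per_day:.1f} GB/day")
```

Multiplying `rows_per_day` by three, as the composable front end effectively did, shows how quickly a table like this outgrows a database tier that was sized for the original estimate.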
- 23:02 Dibran Mulder
- Well, that made the database grow to 100 gigabytes in roughly two weeks. If you scale up a database in Azure, it takes up to one minute per gigabyte. If you're having issues on Azure with database sizing or DTU sizing, then you might reconsider scaling up, because scaling up a database under stress is going to take significantly longer. Well, why are those refresh tokens not being cleaned up? IdentityServer has a feature for that. It's called the cleanup feature, which you can just enable, which we did. We trusted IdentityServer with this functionality. But if we take a look at the screenshot from the open source IdentityServer4 code, here you can see the remove grants method, which is being called, and it has a batch size it uses to clean up the tokens. But as you can see, it uses Entity Framework to load the refresh tokens and then removes them using Entity Framework.
- 24:33 Dibran Mulder
- This is basically loading all refresh tokens from the SQL database, over the wire, into the memory of the app service, and then removing them. This was causing our DTU issues. Keep in mind, sometimes you cannot trust a system with certain responsibilities. Exactly 15 days after the initial burst of users, DTU issues were taking place because IdentityServer was using Entity Framework to clean up those tokens. I was really disappointed with this because I trusted IdentityServer. It's a very mature product and I wasn't expecting this, but because we are in such an awkward use case where 150,000 users are hitting our infrastructure at basically the same time, we had to create a custom SQL stored procedure to delete those persisted grants without loading them from the database. But if you do that, make sure to throttle your stored procedure. Do not delete them all at once, because then your DTUs will also be consumed. So once we fixed that, our database size dropped dramatically and it was fixed over time.
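The throttled cleanup described above can be sketched like this. The talk used a stored procedure on Azure SQL; this Python version uses an in-memory SQLite database as a stand-in, and the table name, column names, batch size and pause are all hypothetical. The point it illustrates is that rows are deleted inside the database in small batches, never loaded into application memory.

```python
import sqlite3
import time


def delete_expired_in_batches(conn, now, batch_size=1000, pause_seconds=0.0):
    """Delete expired grants in small batches without loading them first.

    Returns the total number of rows removed.
    """
    total = 0
    while True:
        cursor = conn.execute(
            "DELETE FROM persisted_grants WHERE rowid IN ("
            "  SELECT rowid FROM persisted_grants"
            "  WHERE expiration < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
        time.sleep(pause_seconds)  # throttle so the database can breathe


# Demo with an in-memory table: 2,500 expired rows and 500 live ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persisted_grants (grant_key TEXT, expiration INTEGER)")
conn.executemany(
    "INSERT INTO persisted_grants VALUES (?, ?)",
    [(f"grant-{i}", 100 if i < 2500 else 10_000) for i in range(3000)],
)
conn.commit()

removed = delete_expired_in_batches(conn, now=1_000, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM persisted_grants").fetchone()[0]
print(removed, "removed,", remaining, "remaining")
```

In a real deployment the pause between batches would be non-zero and tuned against the observed DTU consumption.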
- 26:08 Dibran Mulder
- This is yet another sample of things you learn when you're in an atypical scaling environment. Lessons learned: a composable UI architecture can increase the load on your Identity Access Management solution. I didn't know that refresh token lineage was being stored for security reasons, but right now it makes sense to me. IdentityServer is a very good product, but it clearly lacks database maintenance options. It's very good to keep that in mind. Scaling up a database can take a significant amount of time, especially in Azure. I'm not aware of the latest documentation, but one minute per gigabyte, I think that's pretty bad performance, to be honest. Well, and if you are going to fix a database size or you're going to add a stored procedure, make sure to not ask these questions when shit hits the fan. Make sure you have thought about this when you are running a production environment. So should I update this in a dev or a main branch depending on your branching strategies or should I create a hotfix?
- 27:29 Dibran Mulder
- If I deploy a hotfix to production, let's say you are turning off the token cleanup feature from IdentityServer, do you override your scaling settings? Do you have a separate CI/CD pipeline for your infra or do you have a combined pipeline for your Infra as Code as well as your app service that's running on top of that? You have to think about that in advance because when shit hits the fan, you don't want to create another problem while you're fixing one. Okay, so everyone was able to take a test, we're good, right? Well, we were good, but it felt like this. There was a big tsunami of test messages coming our way and we had to process them, but luckily we thought about that pretty well. We used a microservices architecture in Azure and we basically used Azure Functions as a primary hosting option. It's a serverless solution in Azure. It can be used with lots of languages like C#/.NET, but also JavaScript or Python. We used C#/.NET a lot. We made sure to use Azure Service Bus for communication.
- 28:47 Dibran Mulder
- We didn't want to use REST because we knew it wouldn't scale that well and we wanted to have guaranteed processing of testing messages. Also, we used Azure SQL as our main source of data storage. Well, if we look at messaging, what I see a lot at clients I work for is that REST is the de facto standard for communication. It's widely used. There's a very good ecosystem around it. Lots of libraries and programming languages have support for this technology, so it's truly technology agnostic, but it hardly supports guaranteed delivery. If the receiving party is not online, you're in a bad spot, you have to retry and you cannot retry forever. It's also only suitable for one-to-one communication. You cannot do a pub/sub kind of approach. If we look at messaging, I think it's a better alternative because it's asynchronous in nature. It enables recoverability, so once something fails, you can retry it and therefore it really increases your resilience. I think it also has great patterns for one-to-many kinds of scenarios like pub/sub, but you have to do it well and this is where NServiceBus comes in.
- 30:38 Dibran Mulder
- I also worked with other clients that were using messaging before, just using plain Azure Service Bus. We were using Azure Service Bus as a transport, but you could also use RabbitMQ or even SQL Server or other types of transport for messaging with NServiceBus. But we were using Azure Service Bus because of the scalability and also because of the resilience it has, the SLA it has, et cetera. But we wanted to make sure we had an opinionated way of managing our messaging and we came up with NServiceBus and it was really a great fit. Let me talk about some topics of NServiceBus, which were truly a great fit. One of the most overlooked features of NServiceBus is transactional consistency. In an event-driven architecture you always have to incorporate transactional consistency because handling a message has to either succeed completely or fail completely. If, for instance, a StudentChanged message comes in and it updates the database and then it sends a UserChanged event after that, you have to make sure that entire scope either succeeds or fails.
- 32:17 Dibran Mulder
- If, for instance, a UserChanged event cannot be sent, then you also have to roll back your database update. NServiceBus has great APIs to hook those transactions, your SQL transaction, for instance, and your NServiceBus transactions together so they complete as a whole or they get rolled back as a whole. I think it's a very much overlooked feature, which truly helps to minimize ghost records or partial updates or phantom messages and stuff like that. Another feature we used quite a lot are Sagas. Sagas are basically workflows consisting of several messages, which can be stateful in nature. It is somewhat comparable to Azure Durable Functions or Azure Durable Entities, but I think it is a very powerful feature of NServiceBus to create stateful workflows in a messaging structure and have the programming ease of use of NServiceBus, but also the reliability with its persistence in, for instance, SQL. We use that quite a lot. I'll come back to some examples. Some hard lessons learned on event-driven architecture, even with NServiceBus.
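The all-or-nothing idea can be illustrated without any NServiceBus code. The sketch below is a generic stand-in, not the NServiceBus API: it writes the business update and the outgoing event into the same SQLite transaction (an outbox-style table), so a failure while queuing the event rolls back the update too. Table names, the handler name and the `fail_on_publish` switch are hypothetical.

```python
import sqlite3


def handle_student_changed(conn, student_id, new_name, fail_on_publish=False):
    """Update the record and queue a UserChanged event as one unit of work."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "UPDATE students SET name = ? WHERE id = ?", (new_name, student_id)
            )
            if fail_on_publish:
                raise RuntimeError("could not queue UserChanged")
            conn.execute(
                "INSERT INTO outbox (event_type, student_id) VALUES (?, ?)",
                ("UserChanged", student_id),
            )
        return True
    except RuntimeError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE outbox (event_type TEXT, student_id INTEGER)")
conn.execute("INSERT INTO students VALUES (1, 'Old name')")
conn.commit()

ok_first = handle_student_changed(conn, 1, "Half-applied name", fail_on_publish=True)
name_after_failure = conn.execute("SELECT name FROM students").fetchone()[0]

ok_second = handle_student_changed(conn, 1, "New name")
name_after_success = conn.execute("SELECT name FROM students").fetchone()[0]
queued = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(name_after_failure, "|", name_after_success, "|", queued)
```

After the failed attempt the student record is unchanged and nothing is queued, which is exactly the absence of ghost records and phantom messages the talk refers to.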
- 34:05 Dibran Mulder
- Let me elaborate a little bit more on the test processing part. So when a student finishes their test or starts their test or pauses it or what have you, does something with the test, then it generates a message to the service bus and all kinds of microservices act on those events. For example, a teacher gets notified when students are taking tests, and the student tracking system gets updated when a test is finished. Post-processing is kicked off when a test is finished so that we can do all kinds of lexical analysis on which mistakes students are making. We can also do all kinds of analysis, like in this math test, did the student make categorical mistakes, such as in fractions or multiplications or what have you? Also, we do historical analysis, like when a user finishes their tests, did the whole class or did the whole group improve over time? Lastly, we have to sync data with third party systems such as student administration systems and so on. There's a whole variety of processes taking place that are acting on events.
- 35:28 Dibran Mulder
- One of the issues that we had was the interaction with line of business systems. We had a test products line of business system which people worked in during the day. That's not that big of a deal, people work during working hours. That's basically pretty obvious, but you have to expect that your data is being locked or incomplete. You always have to validate your data and cache it. And 99% of the time, in my experience, line of business systems are not built for scale. You have to take into account that sometimes a line of business system isn't able to answer your request. What does this test look like? Maybe someone is updating that test. Maybe the line of business system doesn't have a versioning strategy for its data, doesn't have all the resilience you would expect. And then if you look at your messaging architecture, you have to distinguish between a transient error and a functional error. Like if you're retrying 10 times, which is the default behavior of NServiceBus, you might even put the line of business system under more load.
- 37:01 Dibran Mulder
- If there's a functional error, the data isn't there, you would expect to back off and stop the process. Make sure to pick it up later or have a human look at it. This was one of the issues which we really came across quite a lot. After the students finished the test, the post-processing was using the test definition to calculate all kinds of categorical analysis, but people in the business were updating the test to fix issues in the test, for example. Also, one issue which happened quite a lot was the report generation. We had an external system, not developed by our teams, that used Puppeteer, which drives a Chromium instance, to generate PDFs from basically web pages. It only exposed a REST API, and the microservice responsible for generating those reports was scaling up in our Azure Functions and it basically DDoS'ed the external system, because after all the students finished their Doorstroomtoets, the final test of primary school, it couldn't handle 100K reports in one afternoon.
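The distinction between transient and functional errors can be sketched as a small retry policy. This is a plain Python illustration, not NServiceBus recoverability configuration; the exception classes, the handler names and the attempt limit are hypothetical.

```python
class TransientError(Exception):
    """The dependency may recover, e.g. a timeout or a lock."""


class FunctionalError(Exception):
    """Retrying cannot help, e.g. the test definition does not exist."""


def process_with_policy(handler, message, max_attempts=3):
    """Retry transient failures; park functional failures immediately.

    Returns (outcome, attempts) where outcome is 'done' or 'parked'.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "done", attempt
        except FunctionalError:
            return "parked", attempt      # no retry: needs a human or a later run
        except TransientError:
            continue                      # real code would wait with a back-off here
    return "parked", max_attempts


def make_flaky_handler(failures_before_success):
    """Hypothetical handler that times out a few times and then succeeds."""
    state = {"calls": 0}

    def handler(message):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("line-of-business system timed out")

    return handler


def missing_data_handler(message):
    """Hypothetical handler whose input data is simply not there."""
    raise FunctionalError("test definition not found")


flaky_result = process_with_policy(make_flaky_handler(2), {"test_id": 1})
missing_result = process_with_policy(missing_data_handler, {"test_id": 2})
print(flaky_result, missing_result)
```

The functional failure is parked after a single attempt, so the struggling line of business system is not hit again for a request that can never succeed.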
- 38:51 Dibran Mulder
- But luckily NServiceBus saved us. We could retry it in batches of manageable amounts, and so we used our guaranteed delivery to fix the problem. Well, if you haven't looked at ServicePulse or you're not in the Particular tier for using this in production, I would greatly encourage usage of ServicePulse. I think it's one of the finest products of Particular, basically, because there are few alternatives. If you're using Azure Service Bus as a transport, you have the Service Bus Explorer or some other community-funded tools to work with your service bus to retry the dead-lettered messages and such. But I think what Particular built on top of that is truly one of the most missed features in a lot of companies that run a production workload. I would say I couldn't live without it anymore. Well, some lessons learned: messaging only works well if you design systems well. The opinionated part of NServiceBus helps to design systems well. You have to think in commands versus events. You have to think about your service bus topology. You have to think about a lot of stuff. One piece of advice I would highly recommend is to distinguish between functional and transient exceptions.
- 40:40 Dibran Mulder
- Don't retry on functional exceptions, or back off for a longer period of time. Make sure to address out of order event processing. If you're running a large scale system, out of order event processing is basically inevitable, so make sure your handling is idempotent, or that you can replay messages without damaging your data. If you're running at high scale with lots of Azure Functions, with lots of microservices on the same Azure Service Bus, you might run into this exception: cannot allocate more handles. The maximum number of handles is 5,000. Well, this is basically an infrastructure problem. Maybe if you use premium Azure Service Bus, you don't have this issue, I don't know, but we ran into this issue. That was a big problem for us. Also, be aware that audit logging enlarges this problem because it sends more messages. NServiceBus audit logging, I'm talking about, might enlarge this problem. Well, some obvious things, like prefer batching over streaming data in SQL Server. Build for resilience, and you'll most likely not lose your data.
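One common way to make a handler tolerant of out-of-order and redelivered events is to compare a version number before applying a change. The sketch below assumes that every event carries a monotonically increasing `version` per student, which is an assumption for this illustration and not something stated in the talk; the event shape and the in-memory store are hypothetical.

```python
def apply_event(store, event):
    """Apply an event only if it is newer than what is already stored.

    `store` maps student_id -> (version, status). Returns True if applied.
    """
    current_version, _ = store.get(event["student_id"], (0, None))
    if event["version"] <= current_version:
        return False  # duplicate or stale event: safe to ignore
    store[event["student_id"]] = (event["version"], event["status"])
    return True


events = [
    {"student_id": 7, "version": 1, "status": "started"},
    {"student_id": 7, "version": 3, "status": "finished"},
    {"student_id": 7, "version": 2, "status": "paused"},    # arrives late
    {"student_id": 7, "version": 3, "status": "finished"},  # redelivered
]

store = {}
applied = [apply_event(store, e) for e in events]
print(applied, store)
```

Replaying the whole list a second time leaves the store untouched, which is the property that makes batch retries safe.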
- 42:09 Dibran Mulder
- Well, I cannot live without the NServiceBus monitoring solution anymore, especially if you run at a large scale with lots of public stress on your solution. You really have to have a solid monitoring solution for your event-driven workloads because it might not behave as expected, or some schools might have bad data of a kind you haven't thought about. And then you can address all those failed scenarios in ServicePulse and fix them one by one or in batches. Transactional consistency helps to avoid zombie records and ghost messages. Lastly, I would very much encourage everyone, if you ever run into production incidents, to regain trust by writing postmortems. It's a very good practice to write them, also to learn as a team and to become more mature in running production workloads. We used it a lot. We really made sure we understood the nature of the problem, so we could explain it to our fellow engineers, architects, and also the director of IT, for instance.
- 43:37 Dibran Mulder
- By precisely describing the problem and the impact to the customer and how you're going to fix it and how you are going to make sure such problems won't happen again, you are regaining the trust of your fellow engineers or, well, the business, basically. Also, in the moment, take ownership of the situation. As a DevOps team, you must solve the situation, but don't act on emotion. As a team lead, this is my personal experience, shield your fellow engineers from your stakeholders. Do not try to fix it all yourself, involve your team members and make sure you trust them with the solution. After the moment, you have to regain the trust, discuss with your team what went wrong, and write those postmortems and be very specific without putting the blame on individuals. Well, hopefully you learned a lot. I surely did, running Cito in production with thousands of students turned out to be a very exciting time.
- 44:56 Dibran Mulder
- We had some hiccups, but all students were able to take tests. More and more schools are using Cito to track the progression of their students. In the last test moments, like in January 2024 and the upcoming one, we experienced minor issues and no major issues at all. We really became mature, fixed those problems, and learned from them. So with that, I'm going to return it to the host and we might have some time for Q&A.
- 45:39 Dennis van der Stelt
- All right, thank you Dibran. Some questions came in. The first one isn't really a question ... apologies ... But more a remark. Someone says, "Thanks for the lessons learned. Some were really valuable for the project I'm currently on." So that's good to hear. A question is, "Are there key elements that you would do differently when starting a new project that might expect peak loads? For example, how would you sell the time and money you need to management for proper testing?"
- 46:18 Dibran Mulder
- Wow. Wow. Wow. Yeah. Coming up with those non-functionals is key. You have to make the business responsible for coming up with how many users we are expecting and, I don't know, in what time range, for example. Come up with specific expectations. Create a safety margin for your team as well, like we add 20% or 25% on top of that, and then you have to test it. Well, yeah, in the architecture you have to think about scalability and basically, everything that's not in control of your team, you have to incorporate it as well. Because yeah, in the end, if it goes down, you are going to fix the problem. I can elaborate a little bit further on that. Last January, we ran another two weeks of testing and we asked ourselves, what is the next weakest link?
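As a small worked example of the safety margin mentioned above, the sketch below turns an expected peak into a load-test target. The function name `load_test_target` is hypothetical, and the 200,000 figure is only an illustrative input borrowed from the webinar introduction; the 20% and 25% margins are the ones mentioned in the answer.

```python
def load_test_target(expected_peak_users: int, margin_percent: int = 25) -> int:
    """Return the number of concurrent users to test for.

    Integer arithmetic is used so the result is rounded up without
    floating-point surprises.
    """
    if expected_peak_users < 0 or margin_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Ceiling of expected_peak_users * (100 + margin_percent) / 100
    return (expected_peak_users * (100 + margin_percent) + 99) // 100


if __name__ == "__main__":
    for margin in (20, 25):
        print(margin, load_test_target(200_000, margin))
```

The point of making this explicit is that the business owns the expected peak, while the team owns the margin and the test that proves the system can handle the resulting number.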
- 47:37 Dibran Mulder
- And then we couldn't think of any system within our architecture to go down. Turns out Basispoort was the next weakest link. Basispoort experienced trouble with logging in all those users, so even our identity provider was having issues. We couldn't have foreseen that. The central identity provider for primary education in the Netherlands was becoming the next weakest link. But you always have to ask yourself the question, "What's the next weakest link?" So hopefully that answers your question. Other questions?
- 48:21 Dennis van der Stelt
- All right. Yeah. Another question is, "You mentioned that NServiceBus allows you to scale your system. Can you elaborate on how this works?"
- 48:32 Dibran Mulder
- Well, I don't know if I mentioned that in those words, but messaging helps to scale, and then it very much depends on the transport you are using. We are using the Azure Service Bus transport. I would say NServiceBus is a product on top of transports to design your system well. Surely it is built for scale, but the main bottleneck for scale is basically your transport mechanism, and Azure Service Bus can take quite a lot if you're using standard tiers. I've not worked with premium tiers in production a lot, but I couldn't imagine stressing it so much that it goes down or something like that. I think some transports are truly built for scale, and NServiceBus helps you to design systems on top of that that are built well, built resilient. Yeah, that's basically it.
- 49:42 Dennis van der Stelt
- All right, another question is, "Did you find that NServiceBus monitoring caused any performance issues in your system and would you discourage it from being used in a production environment?"
- 50:02 Dibran Mulder
- Well, you have several ways of hosting ServicePulse. That's one of the things I think Particular can still improve on: the hosting of ServiceControl and the ServicePulse service on top of that. They are moving to Docker images and maybe they are available right now, I don't know, but they are basically being isolated. But they use the same transport mechanism, like the audit queues from your Azure Service Bus, to basically create their data. That could negatively impact your performance in production, but in my opinion, it's totally worth it, because otherwise you're running blind in production.
- 51:03 Dibran Mulder
- You have to have a mature Service Bus monitoring solution, because if you're running at scale, like 10,000 schools, some schools are going to mess up their data. Like in the Netherlands, we have identifiers for students. We see students having the same identifiers, teachers swapping identifiers, which is not allowed by law. If things happen that you haven't tested, then those individual cases will drop out and you will see them in ServicePulse, and then you can act on it. If you don't have a mature Service Bus monitoring solution, you're basically running blind, and I would rather accept, well, a single-digit performance loss than run blind in production.
- 51:59 Dennis van der Stelt
- All right, one more question. Someone asks, "Did you use deduplication features such as the outbox? And can you elaborate on how it helped and why someone might consider using it or not using it?"
- 52:19 Dibran Mulder
- Well, the outbox feature. I haven't worked with NServiceBus for the last half year, so you have to help me on that one. I know we use the DataBus feature, but the outbox, I'm not really familiar with it right now, so I guess we have not used it. So sorry, I cannot answer that question.
- 52:50 Dennis van der Stelt
- Another question then. "Did you consider other storage solutions like CosmosDB, for example? And if so, why didn't you use, because you mentioned that you used SQL Server, why didn't you use any other storage solutions?"
- 53:05 Dibran Mulder
- Well, storage wasn't really an issue in this project. We had one issue with storage, the IdentityServer issue I talked about. I've worked with Cosmos as well. I think it's a great option. The team I worked in has great proficiency with SQL, and the saga storage, for example, works great with SQL as well. So I think Cosmos can be used as well as SQL. It's a question of whether you are proficient in both technologies, and also of costs, reliability, and performance, of course. But for us, SQL worked just fine. One benefit I would say SQL has over Cosmos is the maturity of the tooling, like Azure Data Studio or the SQL management tooling. I think it's more mature than the Cosmos tooling that's available. So that was a big thing for us.
- 54:24 Dennis van der Stelt
- All right, thank you. Then, with that question, we'll end the webinar. Thanks for sharing your knowledge, Dibran.
- 54:31 Dibran Mulder
- Yeah, you're welcome.
- 54:32 Dennis van der Stelt
- I want to end with the following. Particular Software is sponsoring Techorama in Belgium, NDC Oslo in Norway, and KCDC in the United States. Some of our colleagues will be speaking at those events. Go to Particular.net/events and find us at a conference near you. That's all we have time for today. On behalf of Dibran Mulder, this is Dennis van der Stelt saying thanks for joining. Goodbye for now and see you on the next Particular live webinar.
- 55:06 Dibran Mulder
- See you. Thanks for having me.
About Dibran Mulder
I’m a Solutions Architect with a focus on new technologies. I’m all into building high-performance cloud solutions using serverless and PaaS technology. I’ve become intrigued with the Intelligent Applications principle and the possibilities for our customers. Because of that I have a growing interest in Artificial Intelligence and the changing Application Life Cycle that comes with smarter client software. I like to explore new technologies, apply them in real business scenarios, and share my knowledge.