
Microservices with Service Fabric. Easy... or is it?

Service Fabric is so simple! Just click-click-deploy and we have a stateless microservice! But what if we want to store business data? We can use reliable collections and even transactions to store data inside the cluster, but what happens when our single partition goes down? We could lose everything! Now we need to partition our data… how do we do that? And how do we integrate with other PaaS offerings like Azure Service Bus? Maybe this isn’t so easy after all.

In this talk I’ll walk you through Service Fabric partitioning, a partition affinity model for Azure Service Bus queues, and challenges you’ll face with messaging patterns like request/response, publish/subscribe, process managers and other stateful entities running inside a Service Fabric cluster.

Transcription

00:00:02 Daniel Marbach
Hi everyone. You're doing it wrong if you're building applications today and you're not considering platform as a service, on-premises or in the cloud. I think the days of shipping data from tier to tier, through countless instances to the client, and back and forth are over. In modern computation models such as actors, we bring data closer to the compute layer and thus significantly reduce the latency. Today, I'm going to take you on a journey, and I will show you how you can free yourself from this data shipping paradigm and no longer ship data through all the tiers, towards a new stateful middle-tier architecture.
00:00:53 Daniel Marbach
By leveraging smart routing, we can route requests that are coming from platform as a service offerings like Azure Service Bus or Azure Storage Queues into stateful service components that are running inside Service Fabric and that are responsible for handling business requests. At the end of this talk, you'll know the benefits of stateful services in Service Fabric and how you can leverage them for data-intensive workloads, how message patterns relate to data partitioning, and how you can ensure proper routing into the cluster.
00:01:33 Daniel Marbach
In this talk, I'll assume some basic knowledge around Service Fabric. I will cover some of the concepts on the fly, but basic knowledge is required. And I'll also assume some basic knowledge around standard messaging patterns, such as commands, request-reply and events. Of course, if you have any questions, feel free to ask them either during the talk or preferably at the end of the talk. So sometimes when I'm sitting in my office or I'm sitting at home and I'm working on something, especially when the weather is bad again in Switzerland, I get slightly depressed and I think, how could I cheer myself up?
00:02:15 Daniel Marbach
And what I usually do then is I take a piece of Swiss chocolate. I love Swiss chocolate. You might wonder, what has chocolate to do with this talk, with microservices and Service Fabric? Well, I'm going to give you in this talk a sweet, but totally fictional story about Swiss chocolate manufacturing. Everything I'm going to show here is totally invented by myself. So Switzerland is well-known for its extremely excellent chocolate bars. It's even better than the Belgian chocolate, believe me. And I have a few samples here. So if you give me feedback at the end of the talk, or if you ask a good question, I'll give away a little bit of Swiss chocolate, so that you can try it out yourself. We have a problem: our chocolate is so good that the international demand for these chocolate bars is rising.
00:03:09 Daniel Marbach
Chocolate manufacturers in Switzerland realized they have a need for a highly reliable chocolate order management system. The team responsible for that new order management system wanted to go to platform as a service in the cloud, because that's what everybody does these days, right? That's why we're here; we learn about new things and they wanted to do it as well. But they realized, unfortunately, they have a few things running on-premises that they can't really migrate to the cloud. We also have legacy in Switzerland, especially when we have to integrate with government services.
00:03:45 Daniel Marbach
So in the architecture meeting of the team, Karl, the team architect, starts the discussion with a horror story. Attention, the next slide is not for the faint of heart. He says, many Easter bunnies have died last Easter. Our chocolate is so good that people around the globe are ordering our extremely delicious chocolate. Last Easter, the demand was so high that our internal legacy applications started choking. They became slower, and slower, and slower, and eventually they started to fail. The ripple effects were so severe that some of the orders never came through.
00:04:29 Daniel Marbach
And he also said, well, losing orders is a no-go, because one of our missions as a chocolate manufacturing company is to make people happy by shipping our products to the people around the globe. And because we are not able to ship our delicious chocolates even under spiky demand, we are failing in that mission. If we continue to do so, we will be wiped out by all our competitors and potentially by Belgian chocolate; that would be a disaster. But, he says, recently I stumbled over this awesome technology called Service Fabric. I was watching some Channel 9 videos and I'm really hooked on Service Fabric.
00:05:14 Daniel Marbach
Let me explain to you in this architecture meeting what Service Fabric actually is. And he starts with empathy and enthusiasm. He says, Azure Service Fabric is a distributed systems platform that makes it really easy to package and deploy applications that are reliable and scalable, both on-premises and in the cloud. And he says, Service Fabric also addresses all our needs for developing highly robust applications, and we can avoid managing complex infrastructure on-premises. And we can focus on implementing our mission-critical chocolate ordering system without worrying about all the nitty-gritty details of managing the infrastructure. And he continued to say, well, it's hyperscale, it supports different programming models. We can apply replication and failover, it has upgrade domains, fault domains, and we can do rolling upgrades. And the most important aspect, and our managers love that, it has a built-in load balancer. Yes, check.
00:06:23 Daniel Marbach
And of course Karl went on, and on, and on, and on in this meeting with his sales pitch; after all, that's all architects do the whole day, right? They talk, talk, talk, and get nothing done. And if you want to know more about what Karl explained to the team about Service Fabric, I suggest you just read these posts that I listed here, because after all, all the information that Karl gave the team was copy-pasted from Microsoft material anyway. So this is what Karl showed to the team.
00:07:04 Daniel Marbach
He started Visual Studio, and of course every time you switch to Visual Studio, everyone has to wait in this architecture meeting, so you have to wait as well. And he said, look, I have here Service Fabric running on my local machine. I have a five node cluster; don't believe me, look at this. Here it is, Service Fabric running on my machine, five nodes. It already has two applications deployed. This is the service explorer, it's awesome. I can view the different nodes. I can dive into the application. Here are the five nodes. I already have applications deployed and so on, and so on. And he said, look how easy it is to create an application.
00:07:44 Daniel Marbach
So he created a new Service Fabric application, and then he said, "Okay, I have to remember where I have to put this." Well, let's put it in the project, chocolate microservices. So he did that and put it in here, created the application and said, okay, how about we do an order chocolate stateless service. So he created that stateless service. And of course it took some time until everything was created. The Microsoft machinery is going on. And like me, he also mistyped the service name.
00:08:36 Daniel Marbach
And he said, well, I already have something prepared. I want to show you how robust it actually is. I have it already on my hard disk, so I just need to include it in the project. It's here; I have an application that will connect to the cluster and I have it included here. And what else? Well, let's add something else. We need some contracts because I want to show you that. And well, because he mistyped the name of the service here, of course he had to change something in his application. So that's fixed right now. And he said, "Well, what I want to show you in this architecture meeting is how Service Fabric can automatically fail over."
00:09:27 Daniel Marbach
And in order to do that, I need an application which connects to the cluster, to a service running in the cluster. And I need to expose the service that is running in the cluster. And he had already created the stateless service that is here, the order chocolate. So what he started doing is, well, I already have an interface called IChocolateService, he needed to include this. And now he said, well, I'm now using Service Fabric remoting to implement it. And Service Fabric remoting automatically exposes a service inside the cluster over remote procedure calls. So he implemented this method as a demo. He said, oh, let's return a result, and in order to show that the service is failing over, we need to return... I'm going to return the instance ID of the service that is running. And that is available on this thing called context. And because it's a hello world application, I will also return here, hello world.
00:10:34 Daniel Marbach
And that will then be displayed by the console application that is running outside of the cluster. But as always, we need some NuGet packages in order to make this work. In order to expose the service over remoting, we actually need to add to this service a package reference, and hopefully the demo gods are with him. So he says, okay, let's pull in this Service Fabric remoting package, that's that one. And it also pulls in this internal transport, which exposes, over a thing called a communication listener, a remoting service to the clients.
00:11:23 Daniel Marbach
And of course, because we are doing Microsoft demos, it's important that we do right-click deploy on the solution, as all the Microsoft demos are working. So we go here, I built it already, and let's publish the application to the local five node cluster and see what's going to happen. The machinery is kicking off. It creates a new application. It gets deployed; as you can see, three applications are running, and this is the chocolate microservice application. And it contains an order chocolate, misspelt, as you can see here, and it's version one. And now we need to connect to this thing. Let's start the connector in debug mode and see what's going to happen.
00:12:32 Daniel Marbach
What?
00:12:34 Speaker 2
Do you have the...
00:12:38 Daniel Marbach
Good, good catch. Which one? Here? Yeah, exactly. So here, actually, this is the correct service name here, cool. It's always good when attendees of the talk actually pay attention to what the architect is doing, right? So chocolate bar, right, chocolate bar. Milk chocolate? I cannot throw it up, you just get one.
00:13:11 Speaker 2
Do you have dark chocolate?
00:13:12 Daniel Marbach
Yeah. I have dark chocolate as well. Of course.
00:13:15 Speaker 2
Thank you.
00:13:17 Daniel Marbach
Yeah. And of course he also forgot something else, because it's not enough to just implement an interface, and actually nobody noticed that. I thought you people already knew how Service Fabric works. What actually was forgotten is that here, when you expose a remoting service, you also have to create a communication listener. And this is the remoting communication listener that needs to be added. And this needs to be wrapped in a new service instance listener. And then this remoting thing needs to be added, because without that, nothing will actually work. And of course, because we changed the service definition, we could now increase the version and redeploy it, and then it would automatically do rolling upgrades. But I'll just overwrite it right now.
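For reference, here is a minimal sketch of what the demo's moving parts might look like; the transcript doesn't show the exact code, so the type names (IChocolateService, OrderChocolateService) are assumptions modeled on the talk:

```csharp
using System.Collections.Generic;
using System.Fabric;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Communication.Runtime;
using Microsoft.ServiceFabric.Services.Remoting;
using Microsoft.ServiceFabric.Services.Remoting.FabricTransport.Runtime;
using Microsoft.ServiceFabric.Services.Runtime;

public interface IChocolateService : IService
{
    Task<string> OrderChocolate();
}

internal sealed class OrderChocolateService : StatelessService, IChocolateService
{
    public OrderChocolateService(StatelessServiceContext context) : base(context) { }

    // Return the instance ID so the console client can observe which
    // instance answered, and thus see the failover happen.
    public Task<string> OrderChocolate() =>
        Task.FromResult($"Hello world from instance {Context.InstanceId}");

    // Without this listener the service is never exposed over remoting,
    // which is exactly the mistake being fixed in the demo.
    protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners() =>
        new[]
        {
            new ServiceInstanceListener(ctx =>
                new FabricTransportServiceRemotingListener(ctx, this))
        };
}
```

The console application outside the cluster would then call it via something like `ServiceProxy.Create<IChocolateService>(new Uri("fabric:/ChocolateMicroservices/OrderChocolate"))`.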
00:14:13 Daniel Marbach
So I publish it again; it will now tear down the existing application and redeploy it. We could also do that on the fly, in the rolling upgrade fashion, in production. And now let's start the connector again, and now we see hello world. Nice. Finally, it's working. But that's not all; now we can essentially go back, I'll move it a little bit to the side, and hopefully it will work with the screen size here. But we can essentially check where it's running, right? We have here the chocolate microservice and the order chocolate. We can quickly see on which node it is running. It's on node four, okay. So let's deactivate and basically just remove the whole data package. And of course it tells me, are you sure you want to do this? Yes, I'm really sure I want to do this to my cluster. And now it's removing it, and we should see that the service chokes a little bit, it kills itself, and Service Fabric automatically fails over and redeploys it internally.
00:15:38 Daniel Marbach
Okay, cool. So far so good. That was the demo that Karl, the team architect, gave to the team. And of course the team was absolutely hooked after this demo, although he made a few mistakes during the demo. And they decided they want to benefit from the scalability and robustness of Service Fabric. And the team wanted to immediately jump in front of their computers and start hacking and coding with Service Fabric to build their new ideas. But Karl stopped them and said, hold on people, before we proceed, we first need to understand the fundamentals of scaling. There is a thing called the scale cube. It comes from the microservices design principles.
00:16:25 Daniel Marbach
And the scale cube has a few axes. The Y-axis talks about functional decomposition, also known as Y-axis scaling. In Y-axis scaling, what we apply is we scale by splitting into different things. And he said, we already do that, we follow microservice design principles. So we decompose into multiple bounded contexts or microservices. So we have an order management service, shipping service, manufacturing service and whatever. And he also said, well, we also have X-axis scaling. And what we do there is basically, when we have these microservices, we deploy them multiple times into the cluster. So we have multiple instances of the same service running in the cluster. It's also known as horizontal duplication.
00:17:18 Daniel Marbach
So the TL;DR version of what he gave to the team is, he said, well, we make things smaller and then we spread them out, right? That's what we do. And then he said, "Thereby I, Karl, the great architect of the universe, came up with the following architecture blueprint." So first let's start with the load balancer. Every good architecture has to have one, right? So requests are routed randomly by this load balancer to the stateless web tier. The stateless web tier will contain our latest Angular, React, Knockout, kill-myself UI framework. From there, every request will be routed to the stateless middle-tier in the middle, which is responsible for doing the necessary compute.
00:18:07 Daniel Marbach
Well, Service Fabric has this concept of reliable services, namely stateless services, of which we saw one in the demo. And the orders will be managed by the storage tier. And whenever a request comes in, the request is routed through all these tiers to the storage layer, and the data, namely the orders, will be shipped back to the client and so forth. And he said, well, if all hell breaks loose, if we have a lot of customers ordering chocolates and our storage layer cannot keep up with the demand that we have on our cluster, well, we just throw in some hipster caching technology like Redis, for example, and everything will be good, right? And then he also said, and by the way, if you want to impress your buddies in the local geek bar or whatever, you can tell them this is called the data shipping paradigm, because we're shipping data through all the tiers.
00:19:06 Daniel Marbach
But then Mandy from the team spoke up and said, I'm sorry, Karl, but nowadays nobody builds architectures like this anymore. And she said, well, we've tried that in-house before you joined the company. And we realized that building interactive services that are scalable and reliable is really hard. Because interactivity on the front-ends puts strict constraints on the availability and latency of our services. When the latency is too high, it directly affects the end user experience, because everything ripples through all the layers. And we realized that to support a large number of concurrent user sessions, we have to have a high throughput, that's essential.
00:19:55 Daniel Marbach
And she also said, well, with your traditional three-tier architecture that you just showed us, we are going to ship data through the stateless front-end and stateless middle-tier to the storage layer and back and forth. And therefore the storage layer essentially limits our scalability, because the latency of the storage layer directly impacts our end users, since it has to be consulted for every request. And she said, well, we also tried to add a caching layer. And what we realized is, well, it did improve the performance a little bit, but we basically lost the concurrency and transactional semantics of our storage layer. And furthermore, we realized that at some point in time we have to invalidate the caches. And cache invalidation is a really complex problem.
00:20:50 Daniel Marbach
So they started to think about implementing a concurrency control protocol, something consensus-based like Raft. And then she said, well, at some point we gave up, and we realized that with or without caching, the stateless middle-tier approach does not provide data locality. The data is never there when you need it, because for every request, like I already said, we have to go through all these tiers. And then she said, in the world of IoT, cloud services and actor models, the approach for distributed computing is that we basically take the state and move it closer to the compute layer.
00:21:35 Daniel Marbach
So then she said, well, this is my successor architecture to Karl's approach, and that's my proposal. Well, Service Fabric has the concept of stateful services. Stateful services allow you to consistently and reliably store state, like the orders, right where the services live inside the cluster, by leveraging the power of so-called reliable collections. And she explained, well, reliable collections have a similar C# API to the .NET collections, but they are replicated, transactional and highly available inside the cluster. And she said, well, with that we can achieve that the application's hot state lives in the compute tier. So all the orders are always highly available with low latency reads and writes, namely in the middle tier.
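As a rough illustration of the reliable collections API Mandy is describing, here is a sketch of storing an order transactionally inside a stateful service (the Order type is invented for the example):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

// Inside a stateful service; "Order" is a hypothetical data contract type.
static async Task StoreOrderAsync(IReliableStateManager stateManager, Order order)
{
    var orders = await stateManager
        .GetOrAddAsync<IReliableDictionary<Guid, Order>>("orders");

    using (var tx = stateManager.CreateTransaction())
    {
        await orders.AddAsync(tx, order.Id, order);
        await tx.CommitAsync(); // replicated to a quorum of replicas before it returns
    }
}
```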
00:22:27 Daniel Marbach
And the external store only has to be consulted for exhaust or offline analytics purposes. And of course we can basically keep the orders transactionally in memory. And she also said, well, like every approach, it also has some drawbacks, because the storage tier can potentially have a larger capacity than our cluster, and we can only keep as much state inside the cluster as the cluster has capacity for.
00:23:02 Daniel Marbach
And furthermore, she continued, Karl has potentially been watching too many Channel 9 videos late at night, and he actually forgot that it's a scale cube, right? A cube has three axes. And there is also a thing called Z-axis scaling. And then she walks back to the architecture diagram where Karl drew this cube, and she says, well, there is this Z-axis. And in order to achieve hyperscale, it's not enough to just think about splitting into multiple microservices and then cloning them and having multiple instances. We also need to think about whether we can split these microservices as well, into multiple partitions. So is it, for example, possible to take order management and partition it into multiple partitions? And that's called data partitioning.
00:24:01 Daniel Marbach
And then she... Hold on a second. And then she says, well, let me go back to the other architecture diagram. And she also said, now, let me give you an example, but I'm totally pulling this out of thin air right now. So let's imagine, for the sake of this architectural discussion, that our order management system would only allow one chocolate type per order, and our customers would only be placing one order. Let's say, for example, we have customers ordering dark chocolate, brown chocolate and white chocolate. And she says, I know it's a bit silly, but please bear with me.
00:24:49 Daniel Marbach
Well, Service Fabric makes it super easy to develop scalable services by offering a first-class data partitioning concept. So conceptually you have to think about data partitioning of stateful services. A partition is a scale unit that makes a service highly reliable through multiple replicas inside the cluster. And the state is balanced across all the nodes in the cluster. And a great thing about Service Fabric is that, depending on the demand that a partition has, the internal resource manager of the Service Fabric cluster will automatically make sure that the services get moved around inside the cluster.
00:25:41 Daniel Marbach
So that basically allows the application, in this example partitioned by chocolate type, to grow to the node's resource limit, and use all the memory and all the compute that is available on the node. And if the resource need grows, we just throw in more nodes and the Service Fabric cluster will automatically rebalance it. And she also said, well, Service Fabric has, for example, the concept of ranged partitions, otherwise also known as UniformInt64 partitions, where you can basically divide the data into ranges: for example, from zero to 1,000, from zero to 300, or from Int64.MinValue to Int64.MaxValue.
00:26:25 Daniel Marbach
And there is also a thing called a named partition. A named partition works when you can basically bucket your data into multiple buckets. And in this example, we could, for example, bucket our data into dark chocolate, brown chocolate and white chocolate, right? Other examples would be buckets like regions, postal codes, customer groups, or other self-invented business boundaries that you have in your bounded context.
00:26:57 Daniel Marbach
And of course there's a third partitioning type in Service Fabric called the singleton partition. We're not going to talk much about the singleton partition; it is the partitioning type that is primarily used for stateless services by default. And she says, well, I'm presenting this here a little bit simplified, because of course you would not use chocolate type as a data partitioning schema. Because if we have customers like me that all love dark chocolate, most of the customers would basically order a lot of dark chocolate and our cluster would no longer be balanced at all, right?
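The three partitioning types described here correspond to the three kinds of ServicePartitionKey in the Service Fabric client API; roughly:

```csharp
using Microsoft.ServiceFabric.Services.Client;

// Ranged (UniformInt64) partition: any Int64 that falls into a partition's key range.
var ranged = new ServicePartitionKey(42L);

// Named partition: one of the buckets defined for the service, e.g. a chocolate type.
var named = new ServicePartitionKey("DarkChocolate");

// Singleton partition: the default for stateless services.
var singleton = ServicePartitionKey.Singleton;
```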
00:27:38 Daniel Marbach
We would potentially use another data partitioning scheme, but that's just for the sake of the example that Mandy gave. And she said as well, on the other hand, choosing the number of partitions is quite complex, because we have to think up-front about how much scaling headroom we need in the cluster, since Service Fabric makes it really hard to repartition once you have already stored data. Right now, there is no built-in functionality in Service Fabric; you basically have to shuffle your data out of the cluster, redeploy the service with the new partitioning schema, and then shuffle the data back in with the new partitioning schema. So you have to think about that up-front.
00:28:26 Daniel Marbach
But then Joe, the youngest member of the team in this architectural discussion, said, well, I can't really understand this technical gibberish that Mandy was talking about. And he asked for a concrete example. And she said, "Well, look Joe, imagine a customer comes in and gets randomly assigned a stateless web node on the stateless front-end. Let's say he orders white chocolate. The stateless front-end will use the chocolate type to determine the partition key in the named partition. And then it will use the Service Fabric built-in service partition resolver to resolve the partition that is responsible for hosting that data. And imagine Service Fabric internally has something like a naming service or a DNS that is exposed over the service partition resolver, which allows you to basically take a service name, take a partition key, and transparently locate the service inside the cluster that is responsible for that specific partition.
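A sketch of what that resolution step might look like with the remoting client; the service URI and the IChocolateService interface are assumptions carried over from the earlier demo:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Client;
using Microsoft.ServiceFabric.Services.Remoting.Client;

// The front-end derives the partition key from the chocolate type and lets
// Service Fabric's naming service locate the replica owning that partition.
static async Task<string> RouteOrder(string chocolateType)
{
    var proxy = ServiceProxy.Create<IChocolateService>(
        new Uri("fabric:/ChocolateMicroservices/ChocolateOrderBackend"),
        new ServicePartitionKey(chocolateType)); // e.g. "WhiteChocolate"

    return await proxy.OrderChocolate();
}
```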
00:29:37 Daniel Marbach
And as soon as it retrieves the right instance for a given partition, then via an RPC call it will basically call into that service and hand over the data to that service. And with Service Fabric, the only way to do inter-service communication is by exposing basically either an HTTP endpoint, an RPC service like we did in the demo, a WCF service, or any other kind of what they call communication listener. Only then you can actually talk to a service. So we can say, in this example, she said to Joe, that routing inside the cluster is done by the chocolate type. And she said, well, of course you all know that I especially like dark chocolate. So in my example I would basically enter my order on the front-end, and then it would determine that I ordered dark chocolate and would automatically route it to the dark chocolate partition.
00:30:44 Daniel Marbach
But then Sophia threw in a grenade. And she said, well, Mandy, thanks for explaining this so carefully to Joe, but I'm getting shivers when I hear you talk about all this RPC style communication between services. This reminds me of a project that I was doing at my previous employer. So we built this huge, intertangled, interconnected RPC legacy mess. At the time, the term microservices wasn't coined yet, but I believe we did something similar, just potentially horribly wrong. So we had these WCF order services that had to connect to a slow third-party component, and had to transactionally integrate with a fat database.
00:31:35 Daniel Marbach
The temporal and spatial coupling that we introduced was horrible. Because every time the third-party service was no longer really responsive or took a long time, the customer-facing latency on the front-ends went through the roof and things started failing. And she said, well, we couldn't fulfill our SLAs. We actually lost orders. We couldn't throttle the orders, and what we also saw is that we kept transactions open for too long. And the database started timing out, rolling back the data inserts; people got fired. And then she said, well, in case you wonder, it wasn't me who got fired. Actually, I had the great idea which saved the project.
00:32:24 Daniel Marbach
Well, what we started to do is we first introduced asynchronous programming into the game. And this drastically reduced the memory footprint that our services had, and allowed us to better utilize the resources on our compute nodes. But of course, asynchronous programming did not really solve the problem we had in this project. So we decided to decouple the third-party behind a message queue. At the time we used the rusty MSMQ service that was available on Windows servers, but of course nowadays you would be using something more hipster like RabbitMQ, Azure Storage Queues or Azure Service Bus. And she said, well, this allowed us to throttle the requests to the third-party services and basically get a predictable load on the service.
00:33:20 Daniel Marbach
We applied this pattern successfully in various areas of the system. And she said, well, the fire-and-forget nature of messaging actually made our system scale much better. She said, and here is my proposal, and it's only a slight modification of the architecture that Mandy proposed. We introduce, between the stateless front-end and the stateful middle-tier, some kind of broker middleware, like Azure Service Bus. And what will happen is basically, every time an order is created on the stateless front-end, we will shuffle that order as a message into a queue on Azure Service Bus, or RabbitMQ if you're running on-premises. And then the broker middleware will make sure that the queue listeners that are running on the stateful instances will basically fetch these orders and process them in the right order.
00:34:28 Daniel Marbach
With that, we will no longer lose orders. It gives us the possibility to basically automatically scale up and down whenever we need. We can apply the competing consumer pattern on these queues. We can do throttling, we can do retries. And it gives us the possibility to, for example, only open a transaction when we take a message out of the queue. So we are building a much more reactive architecture by introducing a queue into the game.
00:35:03 Daniel Marbach
But then Peter snatches the whiteboard markers and furiously screams, "But that doesn't solve anything." After taking a moment to breathe, and the team colleagues reminding him that this is only an architectural discussion and nothing has been set in stone, he says, "I'm really sorry about my outburst. But recently my son is not sleeping really well, and him being in our bed and putting his knees and elbows into my face and back the whole time really didn't help with my sleep deprivation. But what I wanted to say is, I completely, or we completely, forgot the data partitioning part that we talked about."
00:35:51 Daniel Marbach
Well, with data partitioning, we essentially have multiple queue consumers running in all these partitions in the cluster. And with just a single queue, what could happen is a dark chocolate order is in the queue, but it gets picked up by the partition that is responsible for storing the white chocolate type. And you can imagine that this would lead to inconsistencies in our business data. And therefore we would again potentially lose orders. And this will not make our business happy, it will not make our customers happy. And I think we would start to see heads rolling in our engineering department.
00:36:39 Daniel Marbach
So basically, we are back to square one with queueing. And we need some kind of smart routing and/or queue partitioning to achieve hyperscale with data partitioning. So he said, well, what I would like to recommend is to have a dedicated queue per chocolate type. Simple, right? So when an order is created on the stateless front-end, the chocolate type is basically used as an input into a partitioning function. And that partitioning function returns, again, the chocolate type. And based on that, by using a queue naming convention or something similar, the routing layer automatically knows that a dark chocolate order needs to go to the queue that is responsible for dark chocolate.
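A back-of-the-napkin version of that partitioning function; the queue naming convention here is invented for the sketch:

```csharp
// Sender side distribution: map an order's chocolate type to its dedicated
// destination queue. The "chocolateorder-" prefix is a made-up convention.
static string DestinationQueueFor(string chocolateType) =>
    $"chocolateorder-{chocolateType.ToLowerInvariant()}";

// DestinationQueueFor("Dark") -> "chocolateorder-dark"
```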
00:37:32 Daniel Marbach
So we essentially have a uniquely addressable queue, in this example, per chocolate type. And only on the specific partition do we have one consumer that is responsible for that queue. So we have a clear separation of concerns from a partitioning perspective, also on the queues. And, he said, what we talked about here were only commands, the command pattern with messaging, right? We are basically entering orders, and we send them to an order receiver. And because we're doing microservices, and we like to be fancy, we also do UI composition. That's what we need to do with microservices.
00:38:20 Daniel Marbach
So the UI composite in our order management microservice that is living on the stateless front-end will basically issue a command, order chocolate, to the order back-end part, which is receiving these orders. It's okay that the sender of a command knows the destination of the receiver, because with the command pattern we essentially have logical coupling. But that's okay, because we belong to the same bounded context, right? We only introduce a message queue here for the scaling aspects, throttling aspects, and retry-ability aspects. So with that, we can basically temporally decouple the sender from the receiver.
00:39:06 Daniel Marbach
And in integration scenarios where, for example, a message sender is not living inside the cluster, it's still okay that the sender knows the partitioning function of the receiver; it can just basically use the dark, brown or white chocolate type to know which queue it needs to route to. And then the right partitioned message receiver will pick it up. And he said, "Well, for simplicity reasons, I would call this sender side distribution, because the sender determines, via the partitioning function, where the message has to be routed to."
00:39:49 Daniel Marbach
But then the PhD dude from the team was triggered by all this talk about command patterns, and he started acting like a smart ass. But the team was already quite accustomed to him acting like a smart ass, so they let him go on. And he said, with messaging we do not only have command patterns, we have much more. For example, we have events, right? And with events, it's not possible to apply sender side distribution. Let me explain why. For example, we have a chocolate ordered event. The publisher of the chocolate ordered event cannot enforce a partitioning schema on its subscribers.
00:40:38 Daniel Marbach
Because, well, that would mean in the end that with publish/subscribe patterns, all subscribers would basically need to have the same partitioning schema as the publisher. But with pub/sub, basically, the subscriber that receives the event defines its own business processes that get triggered by these events. And therefore it also defines its own partitioning requirements or needs on the data. So it's always the subscriber of an event that defines how data needs to be partitioned, because only the subscriber knows the partitioning function of its own data.
00:41:22 Daniel Marbach
For pub/sub semantics, usually what we see is that the subscriber is basically abstracted behind a logical thing: a logical queue, or topic, or exchange, depending on the queueing technology you are using. The fact that the subscriber, like in this example, is scaled out is not visible to the publisher. So for non-data-replication scenarios, usually only a single subscriber of that logical subscription group will get the event. So they basically act as competing consumers from a subscription perspective. So what can happen is, for example, when you publish a chocolate ordered event, and the shipping service, for example, applies partitioning by zip code, it will basically end up on any of these shipping queues, right?
00:42:21 Daniel Marbach
So what the shipping service needs to do when it takes out the message is basically compute the partition. It needs to apply the partitioning function to the payload, determine the partition key, and then basically check whether it's already on the right partition. If it's not on the right partition, the subscriber itself needs to internally reroute the message, with one single hop, to the right destination queue, because only the subscriber knows its internal routing strategies. So the PhD dude said, well, in order to stay true to the lingo that was introduced by Peter, let's call this receiver side distribution.
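In pseudo-real C#, the subscriber-side check might look like this; all helper and type names here are hypothetical, introduced only to illustrate the flow:

```csharp
// Receiver side distribution: on dequeue, verify this consumer owns the
// message's partition; if not, forward it with one hop to the right queue.
async Task HandleChocolateOrdered(ChocolateOrdered message)
{
    // The subscriber applies its OWN partitioning function (here: zip code).
    var key = PartitionKeyForZipCode(message.ZipCode);

    if (key != localPartitionKey)
    {
        await ForwardTo(QueueNameFor(key), message); // single internal hop
        return;
    }

    await ProcessLocally(message);
}
```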
00:43:06 Daniel Marbach
Because the receiver distributes when necessary. And he said, "Well, I know it's not an official term. But anyway, let's stick with this." And then he said, well, but that's not all. We also have request-reply. For example, sometimes we want to asynchronously confirm back to the user that an order was processed. It could be that we have a callback function on the stateless front-end, or it could be that we have some kind of data that is created on the order sender on this specific partition. And what we want to do is, we want to be able to reply back to the sender that is living on a specific partition.
00:43:51 Daniel Marbach
And in messaging, there is a pattern called the return address pattern. So what we need to do here is essentially, when a sender sends a command to the receiver, the sender needs to create a header called, for example, reply address, that contains its specific partition queue. And then when a receiver or processor takes the message out of the queue, it can directly reply without needing to compute anything. But then, in complex business applications like we are going to build, we also have to think of process managers or sagas, right?
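With NServiceBus-style send options, the return address pattern boils down to stamping the sender's partition queue on the outgoing message. A sketch; the header name and the localPartitionQueue variable are hypothetical:

```csharp
using NServiceBus;

var options = new SendOptions();

// "ReplyAddress" is a made-up custom header carrying the sender's
// partition-specific queue, so the receiver can reply without computing anything.
options.SetHeader("ReplyAddress", localPartitionQueue);

await messageSession.Send(new OrderChocolate { /* ... */ }, options);
```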
00:44:34 Daniel Marbach
So we might have multiple aggregates and bounded contexts that need to be integrated and communicate together. Sometimes we do that within a service itself, and sometimes we do that cross-service by applying orchestration with process managers for complex business processes. And well, what I realized is that a process manager is basically nothing more than an application of all these message patterns that we already saw. A process manager has state, and that state lives potentially inside a microservice that has a certain partitioning schema. But it uses the pub/sub pattern, the request-reply pattern, and potentially the command pattern to basically correlate multiple messages together into one business process.
00:45:29 Daniel Marbach
So that's not really much more complex. And then the team decided to do a little PoC, and they decided to do it with NServiceBus on top of Service Fabric. They chose NServiceBus because they didn't want to build all the infrastructure bits and pieces to connect to the queues and everything like that. So they thought it's a nice queueing abstraction, and that's what they came up with. So basically, they had a stateless front-end called chocolate order, which applies sender side distribution by using the chocolate type and basically sending to the dark, brown or white chocolate queue.
00:46:07 Daniel Marbach
And then they have a stateful back-end, called chocolate order, which has a chocolate order process manager or saga that applies sender and receiver side distribution. And what it's doing is basically, the chocolate order process manager is sending, for example, a ship order command to the shipping service when it knows that, for example, the buyer's remorse period is over. And then the shipping service itself, in this example, to show multiple partition schemas, is partitioned by zip code. So they're saying, well, if we have a shipping order zip code between zero and 33,000, then we send it to the 33,000 queue. And if it's something between 33,000 and 66,000, then we basically send it to the 66,000 queue. And this is what the team came up with. Let me show you a brief demo.
00:47:24 Daniel Marbach
So this is the solution they came up with. What we see here is the home controller, which is the controller that will issue orders. Let me briefly show you visually how this looks. This is the awesome UI they came up with in their PoC. It basically has three buttons: an order button for dark chocolate, an order button for brown chocolate, and an order button for white chocolate. Whenever you click one of these buttons, it will automatically issue an order of the specific type and send it to the chocolate order back-end.
00:48:03 Daniel Marbach
And this is how it looks. So they have here an object called message session, and they issue an order chocolate command with a send instruction. What they realized in this PoC is that, since they cannot inflict partitioning schemas on the shipping service, when they get back a reply or an event from the shipping service, the only thing that they can share is basically the order ID, nothing else. So they decided to basically encode the chocolate type in the order ID. And they used a really silly but simple approach: the order ID becomes the chocolate type, a semicolon, and then some kind of GUID. Okay.
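The "silly but simple" encoding amounts to one line:

```csharp
// Encode the chocolate type into the order ID so it survives round trips
// through services that know nothing about our partitioning schema.
var orderId = $"{chocolateType};{Guid.NewGuid()}"; // e.g. "Dark;0f8fad5b-..."
```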
00:48:48 Daniel Marbach
And on the receiving side, they have here a so-called order process manager or saga. So what they need to declare is IAmStartedByMessages&lt;OrderChocolate&gt;, which is the command that will trigger NServiceBus, in this case, to create a new saga per order ID. It will automatically do all the concurrency control and multi-version concurrency when multiple messages are handled concurrently. It will automatically roll back the state if necessary and retry the message. And they were using the NServiceBus Service Fabric persistence package, which allows you, either by using conventions or by using explicit attributes, to basically add an attribute to this process manager and say, we want the state to be reliably and transactionally stored inside a Service Fabric reliable collection that is called orders. That's it.
00:49:54 Daniel Marbach
And then, from the perspective of actually handling that message, it's pretty simple. They just basically implement this handle method, where they get the command, and then they save the chocolate type. And they request a timeout to themselves, which is the buyer's remorse period; here in this example, one second. So after one second, if the customer doesn't cancel the order, it basically comes back into this method. And as you see here, we do not apply any partitioning logic to this code, it's all just business logic.
00:50:30 Daniel Marbach
The partitioning is transparently done behind the scenes; I will show you that later. So here, when the timer comes back, when the buyer's remorse period is over, we just publish a chocolate ordered event. And we do a send-local of the make payment command. And to make payment, we again add the chocolate type here. And now let me show you how you handle this make payment command. So this is basically the message handler that is responsible for making a payment. And because the message handler itself will basically be horizontally scaled, we will have multiple instances of this message handler running per partition type. But transparently, behind the scenes, the send-local call we just saw will automatically make sure that only the handler that is living in a partition type, for example dark chocolate when it's a dark chocolate order, will receive that message.
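Pulling the pieces from the last few paragraphs together, here is a minimal sketch of what such a saga might look like. The message and type names are assumptions modeled on the talk, and the Service Fabric persistence attribute mentioned above is omitted because its exact name isn't shown:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message contracts for this sketch.
public class OrderChocolate : ICommand { public string OrderId { get; set; } public string ChocolateType { get; set; } }
public class ChocolateOrdered : IEvent { public string OrderId { get; set; } }
public class MakePayment : ICommand { public string OrderId { get; set; } public string ChocolateType { get; set; } }
public class BuyersRemorsePeriodElapsed { }

public class ChocolateOrderProcessManager :
    Saga<ChocolateOrderProcessManager.OrderData>,
    IAmStartedByMessages<OrderChocolate>,
    IHandleTimeouts<BuyersRemorsePeriodElapsed>
{
    public class OrderData : ContainSagaData
    {
        public string OrderId { get; set; }
        public string ChocolateType { get; set; }
    }

    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderData> mapper)
    {
        // One saga instance per order ID.
        mapper.ConfigureMapping<OrderChocolate>(m => m.OrderId).ToSaga(s => s.OrderId);
    }

    public Task Handle(OrderChocolate message, IMessageHandlerContext context)
    {
        Data.ChocolateType = message.ChocolateType;
        // Buyer's remorse period: one second in the demo.
        return RequestTimeout<BuyersRemorsePeriodElapsed>(context, TimeSpan.FromSeconds(1));
    }

    public async Task Timeout(BuyersRemorsePeriodElapsed state, IMessageHandlerContext context)
    {
        await context.Publish(new ChocolateOrdered { OrderId = Data.OrderId });

        // Send-local stays on the same logical endpoint; the routing layer
        // makes sure it lands on the handler in the right partition.
        await context.SendLocal(new MakePayment
        {
            OrderId = Data.OrderId,
            ChocolateType = Data.ChocolateType
        });
    }
}
```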
00:51:34 Daniel Marbach
And as we can see here, when we get the payment response, we then do a send to the shipping service with the ship order command. And what we do now here, since this order ID also has the chocolate type encoded in it, we can just send the ship order with the order ID, and here's a zip code; it's a random number between zero and 99,000, and we send it over. And then there is the shipping order handler, which lives in another service, which has its own data partitioning schema, and which just handles that message. And it will then publish an order shipped event. And as you can see, because it's still related to the order vocabulary, it will just reuse the order ID.
00:52:23 Daniel Marbach
And now we have just seen the whole process of chocolate ordering. Now let's dive into how we actually apply the partitioning. Partitioning is not contained in the business logic; it is basically part of the routing infrastructure. And let me show you an example of the front-end, where we apply the pattern of sender side distribution. So what we use here is basically, because we belong to the same bounded context, the service partition resolver to essentially resolve the address of the back-end processing service. And we tell the service partition resolver, please give me all the partitions that this service has.
00:53:09 Daniel Marbach
And then we tell it, well, let's do some routing internally. And first of all, we tell this endpoint, which will be living in a stateless instance, that for this destination endpoint chocolate order, which is a logical queue, we will have multiple partitions. And what NServiceBus in this example will do is automatically create a routing table internally. And whenever we send the command, it will automatically then pick the right partition. Well, not fully automatically; we actually have to tell it how we derive the partitioning function, and that's what we do here.
00:53:49 Daniel Marbach
We basically say, okay, whenever you send this message type, please use the chocolate type property on it to determine the partition key. So when we set the chocolate type property to dark chocolate, it will then take this and basically use this internal routing table that we created here to map dark chocolate to the destination, and automatically route it to the right place. And we apply this not only for sender side distribution, but also for receiver side distribution. For example, internally in the chocolate order back-end, where we say, okay, whenever we have an order shipped event (and we know order shipped is not published by us, we are a subscriber), we know that we actually sent along the order ID, which contains the information that allows us to map back to our partition.
00:54:44 Daniel Marbach
So we basically tell it, and it's a bit brute force because it's a PoC, so basically they say, okay, we just split by the semicolon. We take the order ID apart into the GUID and the partition type that is encoded in this ID. And then we just use this chocolate type to basically internally reroute whenever it is necessary.
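The brute-force extraction described here is essentially a string split; the registration call below is hypothetical, paraphrasing the talk's description rather than the exact sample API:

```csharp
// Receiver side: recover the partition key that was encoded into the order ID,
// e.g. "Dark;0f8fad5b-..." -> "Dark".
static string PartitionKeyFromOrderId(string orderId) =>
    orderId.Split(';')[0];

// Hypothetical registration, modeled on the talk: tell the routing
// infrastructure which property of a message yields the partition key.
routing.MapMessageToPartitionKey<OrderShipped>(m => PartitionKeyFromOrderId(m.OrderId));
```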
00:55:08 Daniel Marbach
And last but not least, we do the same for the shipping. And this is just an example to show how complex this partitioning logic can be, if necessary. Again, it's all an infrastructure concern; we basically turn the ship order zip code string into an int, and then we automatically pick the right partition high key based on the zip code, return it from the partitioning function, and that's it. And when we run this, it's already deployed in the cluster. We haven't actually seen this happening live yet. This is the diagnostics window; I might need to zoom in a little bit. Let's order a dark chocolate.
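A sketch of that shipping partitioning function, using the ranges mentioned earlier; the high keys come from the example, the helper name is invented:

```csharp
// Shipping service: map a zip code onto one of the Int64 range partitions.
// Zip codes below 33,000 map to the first range's high key, below 66,000 to
// the second, and everything else to the last.
static long PartitionHighKeyFor(string zipCode)
{
    var zip = long.Parse(zipCode);
    if (zip < 33_000) return 33_000;
    if (zip < 66_000) return 66_000;
    return 99_000;
}
```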
00:55:58 Daniel Marbach
And at some point in time, we can essentially see... let's zoom in a little bit, that's probably not enough. Can you see it in the back? Okay, what we see is we got the order chocolate, and the partition key was dark chocolate. And then the order process manager picked it up on the dark chocolate partition. It sends a buyer's remorse period timeout command to itself. It received it, RSD is receiver side distribution, it received it on the wrong partition, applied receiver side distribution, and sent it to the right dark chocolate partition. It sends out the make payment on the dark chocolate to the dark chocolate order. It receives the make payment on the right partition, the partition dark. And then the payment process goes on. And then the ship order handler basically receives it on the wrong partition and automatically, as you can see here, forwards it to the right partition, so that it's handled on the partition which is responsible for managing that zip code.
00:57:11 Daniel Marbach
I know this was potentially a bit fast, but I will give you the links to the demo so that you can try it out yourself. So, a brief recap. Well, it's always a bit more complex than Microsoft tells you. The team architect did show that it's easy to just do right-click publish and everything works fine, even with stateless services. But as soon as you start doing stateful services, you basically need to make smart routing decisions based on the partitioning you are going to use. Stateful computation with low latency, like I said, requires smart routing and smart routing decisions.
00:57:54 Daniel Marbach
But I think that with Service Fabric stateless and stateful services, combined with messaging, you basically get the best of both worlds. You can decide, wherever it fits, to just apply the RPC communication style that is available inside Service Fabric and use the service partition resolver. Or when you need, for example, ordering of commands, offloading or throttling, you can apply message patterns. And if it was a bit too fast and you're curious, and if you want to know a little bit more about how NServiceBus works, you can go to this URL. There is a tutorial which explains how NServiceBus works. You can download it and try it out. It has multiple lessons.
00:58:41 Daniel Marbach
And if you want to know more about how this example I just briefly showed works, you can go to this URL, which gives a lot of diagrams and text and explains how the Azure Service Fabric routing works with NServiceBus. Of course, the same patterns that I showed here are universally applicable, also to Azure Service Bus itself, if you're just using the native SDK. If you're using MassTransit or any other service bus, you can apply the same patterns as well. But we provide a sample out of the box. And if you want to try it out yourself, you can download the slides, the links and all the code here in my repository, github.com/danielmarbach/microservices.ServiceFabric. Feel free to star the repo; that always helps to promote the stuff we are doing. And if you have questions, feel free to shoot them now. Any questions? Yep.
00:59:41 Speaker 3
Are you using some fancy patterns that ensure consistency when one of the processing nodes fails...
00:59:45 Daniel Marbach
Am I using some kind of consistency patterns to...
00:59:57 Speaker 3
To assure the data is consistent, if your order was half processed?
01:00:01 Daniel Marbach
Oh, okay. Service Fabric reliable collections, which our offering uses to store the saga state, are a transactional system that has concurrency control built in. So basically, when we have multiple concurrent operations that try to store the same saga, Service Fabric will tell us that someone else changed the state. So basically, one operation loads the state at point one, another one loads it at point three, right? And then the one from point one will basically try to save it and will get a concurrency exception that bubbles up. And what we do is we roll back the transaction, and then it will be retried. And the next attempt will be successful. But it relies on retries to do the concurrency control.
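The retry-on-conflict behaviour described here amounts to a loop like the following sketch; SagaState is a hypothetical type, and conflicts on reliable collections typically surface as exceptions such as TimeoutException:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

// Optimistic concurrency: try to commit; on a conflict the transaction is
// rolled back (on dispose) and the whole operation is retried.
static async Task SaveSagaWithRetryAsync(
    IReliableStateManager stateManager, string sagaId, SagaState newState)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            var sagas = await stateManager
                .GetOrAddAsync<IReliableDictionary<string, SagaState>>("orders");

            using (var tx = stateManager.CreateTransaction())
            {
                await sagas.SetAsync(tx, sagaId, newState);
                await tx.CommitAsync();
            }
            return;
        }
        catch (TimeoutException) when (attempt < 3)
        {
            // A concurrent writer held the lock or changed the state: retry.
        }
    }
}
```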
01:00:57 Speaker 3
It sounds like that might cause some problems if...
01:01:00 Daniel Marbach
Yes, yes. Well, there are different types of process managers; Jimmy Bogard wrote an excellent blog post about them. And of course, if you have a process manager that needs to integrate potentially thousands of messages that will be concurrently fired off and then correlated back to the same process manager, and you apply this pattern with a lot of concurrency, you will basically run into a lot of exceptions and retries. So in that case, you need to apply other patterns; then it's not the right thing to do. But, of course, the solution works for most business cases, just not for highly concurrent updates. Any other question? Yeah, you'll get the chocolate, by the way. Yeah.
01:01:53 Speaker 4
...we have different queues already. I don't see the point of repartitioning again. Okay.
01:02:09 Daniel Marbach
Okay. So you said, wouldn't it be more efficient to have actors listening on those queues?
01:02:15 Speaker 4
The stateless services, yeah. I was just thinking actors because you can fire up as many as you want. But yeah, even the actual services-
01:02:23 Daniel Marbach
Okay, okay. Well, for example, if you're using the Service Fabric actor model, the actors are also partitioned, right? And an actor has a certain lifetime. An actor is activated based on an RPC call or message, then lives for that brief moment of time, and then basically gets decommissioned by the actor runtime. With queueing systems, you need a stateful, sorry, a persistent connection to the queue, so that you can basically pick up messages. So you need some kind of infrastructure service that knows about the partitioning schema of the microservice that is listening on multiple queues, and then basically shuffles off the messages. So I'm not sure if I understand your question.
01:03:13 Speaker 4
... Okay. Forget about the actors, so stateful services...
01:03:15 Daniel Marbach
I have an idea: let's talk together after my talk, okay? Because we also have other people having questions, and then we can have a deeper conversation about this topic. Cool. Any other questions?
01:03:26 Speaker 5
Can we scale partitions... So two instances will host partition dark and five partition white.
01:03:37 Daniel Marbach
You're asking, can you scale partitions?
01:03:44 Speaker 5
...five instances of the service hosted in one partition and two instances of the service...
01:03:52 Daniel Marbach
Okay, so what you define basically in Service Fabric is, you're picking a partitioning schema. Let's say you're using the range partition, or the named partition with dark and white. No, let's say the range partition from zero to 300. And then you're saying, I want 100 partitions, for example. So what that means basically is that zero to three is one partition, four to six is another partition, and so on and so forth. And this defines how many instances of that partitioned service you will be having in your cluster. That's your scale unit.
01:04:30 Daniel Marbach
That's why I said you have to think about the number of partitions upfront, right? Because what happens is, when you start with a five node cluster, you will have 100 instances in that cluster. That means you have more than one instance per node, right? You will basically have 20 instances per node. And then when you scale out, when you're saying, oh, we have more demand, you basically add more machines. Let's say you add five more machines, then you have 10. And what Service Fabric then does is automatically rebalance these 100 instances. So basically it takes 10 instances from each node and spreads them out to the other nodes that just joined the cluster. I hope that answers your question.
01:05:14 Daniel Marbach
Okay. I mean, like I said, you cannot change the partitioning scheme on the fly yet. Alright. Any other questions? Yeah.
01:05:29 Speaker 6
Do you have any experience with upgrades and reliable collections?
01:05:34 Daniel Marbach
Upgrades and reliable collections. Okay. So the question was, do I have any experience with upgrades and reliable collections? Yes, I have experience. Well, we have to deal with this because we provide this saga persistence that stores state in reliable collections. So the reliable collections right now, by default, use the data contract serializer, right?
01:05:57 Daniel Marbach
Exactly. And of course, I think there are even books written on that topic from the WCF world. The data contract serializer has its own benefits and drawbacks. And there is a whole set of articles on MSDN which talk about data contract versioning. So you have to apply the whole ordering of data members, you have to apply a namespace to the data contract, you have to implement this extensible data object interface thing, all that stuff, you have to do it. But you can use a different serializer if you want to.
01:06:31 Speaker 6
Yeah, but a typical problem we have is, we have an existing service, and at some point we introduced a new collection. And we then did an upgrade and then we ... application.
01:06:32 Daniel Marbach
Okay.
01:06:50 Daniel Marbach
And then it retries.
01:06:55 Daniel Marbach
Yeah, yeah. What you can do, based on the application description, is, for example, deploy a pre-hook process that does a partial migration on that partition. And only then does the actual service kick in, for example. And that will be applied while doing the rolling upgrade as well.
01:07:20 Daniel Marbach
Okay, yeah. Take a chocolate. Yeah, thanks.