Change is inevitable: Versioning event-driven systems

00:00:05 Laila Bougria

Hello, everyone. My name is Laila. I'm a software engineer and solutions architect at Particular Software. These are all the ways that you can find me online.

And basically, I've been building software now for two decades. Yeah, it's been a while. And over the last decade, or even a little bit more than that, I've been focusing on distributed systems. And I can tell you one thing, that building distributed systems is incredibly hard, okay? I mean, it's a lot of fun, really, it's a lot of fun, but it's also very, very difficult, because let's face it, we are facing a bunch of non-trivial issues and problems that we basically have to solve, right?

And over the years, when I sort of look back, I can see that we had so many lengthy discussions to understand the domain and get a sense of how we were going to build the systems. Like, how many services do we require? And what would be the right service boundaries? And what would be the right communication style to use? Are we going to use synchronous communication or asynchronous communication? Or in which context are we going to use each individual communication style? And should we use commands or events in this specific scenario? And what about the workflow? What does that look like? Is it complex enough that we need some kind of coordination style? Are we going to use orchestration or choreography?

And if you were here last year, maybe you saw one of those talks that I did around this topic. But these were so many incredibly, sometimes very heated discussions. And it sometimes took weeks and months of going back and forth and understanding the domain better and progressive insights until we basically landed where we felt it was comfortable, where we thought we were doing the best possible things with the knowledge that we had.

But when I look back, I also kind of see that there was always a blind spot, something that was often, if not most of the time, overlooked in those conversations, and that is change. Because as our systems grow and evolve, then change will come our way. And yes, sometimes it can be incredibly frustrating because we put so much effort into getting everything right, but actually I like to think of change as a good thing. It means that the system is evolving. It means that it's being successful. And therefore, new things are being added. Things are being changed. And we are there to implement those. And that's something that we just have to deal with. It means that we have more work to do.

But change is also something that we need to start thinking about very early days so that we can successfully prepare our systems in terms of evolving that event-driven system. And that starts by appropriately designing the events that we emit, the messages that we are sending around as part of our event-driven system.

So, let's consider an example. Let's say that we have a credit card, like, the simplest thing you can think of, okay? Let's not overcomplicate things. You basically have some kind of a Mastercard, something like that. You have a balance of, I don't know, 1,000, 5,000 euros, dollars, whatever it is, that you can spend within a single period. And at the end of that period, whatever you spend is immediately deducted from your linked debit account. So, we're not going over periods and stuff like that. We're just keeping it simple. And we have this credit card transaction service that is basically responsible for handling all of those card transactions or card operations.

So, when you think about what can happen on a card like that, then you could argue, "Well, there are only really two types of operations, right? Either you have a debit operation and money goes off the card, or you have a credit operation and then money comes back. That's it." And also, the attributes are the same. So we basically landed on a single event type, CreditCardTransactionMade, with a bunch of attributes. And what we did is we introduced this operationType so that we can distinguish these debit from credit operations.

And this may appear like a good solution at first, but it starts to break apart very quickly once change comes our way. Because let's say that we got this new requirement saying, "Well, you know what? We also want the ability to reserve funds." So let's say that you check into a hotel and they're like, "Well, we're going to block," I don't know, "300 euros on your credit card, just in case you decide to empty your fridge and then run off to the Bahamas. We'd just like to make sure you don't do that without paying your bill. So we're going to block a certain amount on your credit card, but not charge you." So we're like, "Okay, well, fair enough. We can just introduce this new operationType that is for reserving funds. And we can add the reservedUntil attribute so we can indicate until when that money is blocked."

But the impact on the consumers of this event is massive because the consumers that don't really care about this operationType now need to start filtering this out, because they won't be able to recognize it and you won't want them to fail if they find something there that they don't recognize. But if they do care, it's also not ideal because what you've now created is what is called functional coupling. Because now your consumers need to understand, "Oh, if the operationType is a reserveFunds type of operation, it doesn't actually affect the amount of money you spend. It is only affecting the available balance that you have," right? And that means that they now have some coupling into how your internal domain works. They also need to understand that that reservedUntil field is only really there when the operationType is of type reserveFunds. So, this isn't really great.
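To make that coupling concrete, here is a minimal Python sketch of what a consumer of the single CreditCardTransactionMade event might end up writing. Only operationType and reservedUntil are named in the talk; the other attribute names, the string values, and the apply_to_spent function are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class CreditCardTransactionMade:
    # Hypothetical payload shape for the single, overloaded event type.
    card_id: str
    amount: Decimal
    currency: str
    operation_type: str  # "debit", "credit", or "reserveFunds"
    reserved_until: Optional[str] = None  # only meaningful for "reserveFunds"


def apply_to_spent(spent: Decimal, event: CreditCardTransactionMade) -> Decimal:
    """Consumer-side logic that tracks how much was spent on a card."""
    if event.operation_type == "debit":
        return spent + event.amount
    if event.operation_type == "credit":
        return spent - event.amount
    if event.operation_type == "reserveFunds":
        # Functional coupling: the consumer has to know that a reservation
        # affects the available balance, not the amount spent.
        return spent
    # Operation types the consumer does not recognize have to be filtered
    # out so that the consumer does not fail on them.
    return spent
```

Every branch in that function is knowledge about the producer's internal domain that now lives in the consumer.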

But things could get even worse. Because our bank said, "Well, we don't really like it when people use their credit cards to withdraw money outside, from the wall," like we tend to say in Belgium. I don't know if that translates to English. But you get the point, right? So, how does a bank discourage it? Well, they're going to charge you money for it. That's what they do, right? So, we thought about this and we're like, "Well, you know what? A withdrawal is still a debit transaction, but we do want to differentiate the cost from the actual money that you got," right? I maybe withdrew 500 euros and it cost me 5 euros to do that. I want to know the difference. So, what we did is we added this field called cost. We made it required. And we just always set the value to 0 if there was no cost involved.

But then it's like, "Okay, but what happens if you withdraw money and you're in a foreign country?" I mean, we do have this currency as part of our payload for the money that was spent or withdrawn or whatever. But what about the cost? Is that then expressed in the issuing bank's currency or the currency of the country you were in or the currency of your account? Oh, the easy fix is just to add another field, cost currency, and then we keep on polluting this event. Or we can also do something worse and actually tell our consumers, "You know what? You can just assume that the cost will always be in the currency of the account," which, again, is leading to functional coupling. You are forcing your consumers to understand that this is the way that the cost works. And it's something that doesn't belong to their context. And worse, it's something that could change on our end and then needs to also be reflected in the consuming code.

So, when we looked at this, we were like, "Yeah, we don't really like this." And we re-evaluated the entire payload and we were like, "Why is that cost even there? You know what we should better do? I think we should be emitting two events. We should emit this operationType debit for the amount that was actually incurred, and then another event of the same type with an operationType of CostIncurred that represents that 5 euros. And then we also have that currency attribute that we could use. So, what's the big deal? Let's just do it. Everything is going to be great."

But the question is, how does something like this affect your consumers? And the answer is easy. As you can already see on your screen, this basically turns into what we call a poison message. Because when you emit a message or an event that doesn't contain the attributes that your consumers expect and rely on, they basically can't process it, no matter how many retries they do. That message will end up in an error queue, in a dead-letter queue, with no way to consume it without going back to the code, making changes, and redeploying those services. Only then are we able to consume it. If you have many consumers in a larger system landscape, this is incredibly painful, right?
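A minimal sketch of how a poison message plays out on the consuming side, written in Python. The retry budget, the handler, and the in-memory error queue are all assumptions for illustration; a real broker or messaging framework would provide its own retry and dead-letter mechanics.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # assumed retry budget


def handle_debit(message: Dict) -> str:
    # The consumer relies on "cost"; a payload without it raises KeyError
    # on every single attempt.
    return f"debited {message['amount']} plus cost {message['cost']}"


def consume(message: Dict, handler: Callable[[Dict], str], error_queue: List[Dict]) -> bool:
    """Try to handle a message; move it to the error queue when retries run out."""
    for _ in range(MAX_RETRIES):
        try:
            handler(message)
            return True
        except KeyError:
            # Retrying cannot help: the payload itself is the problem.
            continue
    error_queue.append(message)
    return False
```

The message without the cost attribute fails in exactly the same way on every retry, which is why only a code change and a redeploy on the consumer side can get it out of the error queue.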

And also, from a producer perspective, I've seen this happen all the time. "It's fine. You can just upgrade to this new event type and everything will be fine." And your consumers are like, "What do you mean it's going to be fine? Have you actually thought this through? Because I don't know if you realize, but I have a bunch of in-flight messages, basically messages that are sitting in my queue, and I haven't had the ability to consume those yet. And now you're throwing these new types of events at me, and I don't know how to deal with both."

Or what if we actually scheduled delayed messages? This is something that you can do with plenty of message brokers out there, like Azure Service Bus. And some of those have been scheduled under this old event format. How am I supposed to deal with that? I don't know when those events are going to come through.

Or what about the messages that I might have in an error queue? Sometimes your consumers might have their own bugs that are basically stopping them from consuming some of those messages. And while they work on it and they deploy a fix, they will want to retry those messages to bring their service or their system up to date. But if we're then changing that event type, we're not making it easy on them.

But finally, a very important part is, "Hey, I have my own deadlines. You're just imposing this change on me. And what now? Why are your deadlines more important than mine?" And really, what it comes down to is that when we make these types of changes, we are taking away autonomy. Either we're taking it away on the consumer side, because basically we're imposing the change on them, or we're taking it away on the producing side. And we basically can't move our system forward because we are not allowed to break our consumers.

But there's a better solution for this. And that starts with a better design. Now, one of the things that I always say is that a good way to start making changes to avoid this is to design more granular and more business-meaningful events. I mean, I've been talking about a debit and a credit operation as if they make total sense. But if you think about it, within a system, outside of a specific service boundary or bounded context, what do those terms mean, debit, credit? Might mean nothing. It might actually have a completely different meaning than what we understand within the scope of this credit card transaction service, right?

So, it's important that we look at this, "What has really happened?" from the outside perspective. And when you think about it that way, well, a payment was made with this credit card, or a refund was issued on this credit card, or maybe there were funds blocked. Maybe there were costs incurred, right? And this is something that makes sense, and that is something that you can communicate to the rest of the system landscape, and it is meaningful to everyone, right? It also allows you to carry only the relevant attributes that make sense for that specific event that has happened. And you don't have this event type that can mean different things anymore.

So, it's important that we design events that have a single meaning. You have a single event type for a single business-meaningful event. And it's really important to think about it from the business perspective of what has happened in the system in a way that a product owner can communicate about, in a way that a product owner would understand, not in engineering or development terms. That's usually a pitfall, status change and things like that.
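As a rough sketch, the four business-meaningful events could be modeled in Python like this. The event names come from the talk; the attribute names and the attribute_names helper are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class PaymentMade:
    card_id: str
    amount: Decimal
    currency: str


@dataclass(frozen=True)
class RefundIssued:
    card_id: str
    amount: Decimal
    currency: str


@dataclass(frozen=True)
class FundsBlocked:
    card_id: str
    amount: Decimal
    currency: str
    reserved_until: datetime  # always present here, never conditional


@dataclass(frozen=True)
class CostIncurred:
    card_id: str
    amount: Decimal
    currency: str  # the cost carries its own currency, no assumption needed


def attribute_names(event_type: type) -> List[str]:
    """List the attributes an event type carries."""
    return [f.name for f in fields(event_type)]
```

Each type has one meaning and carries only the attributes that are relevant to it, so there is no operationType to interpret and no field that is only sometimes there.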

The meaning of the event type also must remain stable over time. And when you are talking in terms of business-meaningful events, that tends to be easier to maintain as well.

This also allows our event types to evolve independently of each other, because if something changes in how we incur costs, well, then only that specific event type would be impacted and not all of the other ones.

It also gives you flexibility in the sense that your consumers only have to subscribe to the event types that they actually care about without having to filter out the stuff that they don't really care about. You also get a situation where you're not polluting the contract with conditional attributes, attributes that are sometimes there and sometimes not. And that also means that your consumers can reduce their functional coupling. They don't need to understand when an attribute is around and when it is not. And they can also keep their own code simple and focused.

But of course, when we start to split things apart very granularly, sometimes that can also be hurtful. So there's another side to this coin, right? And it is that sometimes our consumers don't really need this level of granularity. They just want an overview of what happened. So, a common solution to this problem is that consumers will start subscribing to many granular events and aggregate that information to generate this overview.

So, let's look at an example, right? Let's say that for this credit card example, I just want to know at the end of a period, how much money was spent, because I need to go and deduct that from the linked debit account in another service, right? So I just want that total amount that was spent. Well, in order to do that, I need to subscribe to the PaymentMade event, to the RefundIssued event, and to the CostIncurred event. Because remember that FundsBlocked does not affect the amount of money you actually spent. But this is tricky because when new events are added that affect that amount spent, well, you need to be aware of that. You need to become aware of that so that you can add a subscription or maybe remove a subscription, again, causing that functional coupling.
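Sketched in Python, that consumer-side aggregation might look something like this. The event type names come from the talk; the sign table and the total_spent function are assumptions that stand in for the knowledge the consumer is forced to carry.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# The consumer's own (assumed) knowledge of how each event type affects
# the amount spent. This table is the functional coupling.
SIGN_BY_EVENT_TYPE: Dict[str, int] = {
    "PaymentMade": 1,
    "CostIncurred": 1,
    "RefundIssued": -1,
}


def total_spent(events: List[Tuple[str, Decimal]]) -> Decimal:
    """Sum up the amount spent from (event type, amount) pairs."""
    total = Decimal("0")
    for event_type, amount in events:
        # FundsBlocked, and any event type added later, is silently skipped.
        total += SIGN_BY_EVENT_TYPE.get(event_type, 0) * amount
    return total
```

If the card service later adds an event type that affects the amount spent, this consumer produces a wrong total until someone updates the table.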

What is also tricky is that as a consumer, now I need to understand what a period is and how long that period runs, which is a very private concept to that credit card service to begin with. And your services are basically forced to aggregate it. And that's why I think it's really important when we think about events that we also differentiate the public from private events. Now, there are other terms that are used in the industry to talk about this from domain events and integration events in the DDD space, or events on the inside, events on the outside, external events, internal events. I just really prefer the term public and private, first because it's simple, and also because it really conveys the scope in which they can be used, which is what I care about most, from a bounded context perspective or from a service boundary perspective.

So, to solve this problem, what we could do is introduce what is called a summary event. Now, this can then be a public event that is accessible by any part of the system. It provides a much less granular view, but is still business-meaningful. So we could have this CardPeriodClosed event that basically tells you, "This is the amount of money that was spent, and this was the period." Done. And basically, we can construct this event from inside our credit card service while subscribing to the PaymentMade, RefundIssued, and CostIncurred events. And that's fine because inside our own service boundary, we do have the context and an understanding of how we should be aggregating those to get a correct CardPeriodClosed event. It also means that we are reducing functional coupling for all of the consumers that are subscribing to CardPeriodClosed because we've encapsulated the concept of a period. And we've also basically shielded them from the impact of us maybe introducing additional events or removing other events that impact what the amount is at the end of the period.

Imagine, for example, that the bank decides, "You know what? Instead of charging everyone for the costs every time you do a withdrawal, we're going to do this at the end of the year." Well, then now, inside the credit card service, we could basically remove the subscription to the CostIncurred event, and the CardPeriodClosed event will not change. Our consumers are in no way impacted, but we still have the autonomy to make a change within our domain without breaking any of our consumers.
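A minimal Python sketch of building the summary event inside the card service. CardPeriodClosed and the private event names come from the talk; the field names, the close_period function, and the sign table passed into it are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CardPeriodClosed:
    # Public summary event; the period is encapsulated by the card service.
    card_id: str
    period_from: date
    period_to: date
    amount_spent: Decimal
    currency: str


def close_period(
    card_id: str,
    period_from: date,
    period_to: date,
    currency: str,
    private_events: Iterable[Tuple[str, Decimal]],
    signs: Dict[str, int],
) -> CardPeriodClosed:
    """Aggregate private (event type, amount) pairs into the public summary."""
    amount = sum((signs.get(t, 0) * a for t, a in private_events), Decimal("0"))
    return CardPeriodClosed(card_id, period_from, period_to, amount, currency)
```

Dropping CostIncurred from the sign table changes the value the service computes, but the shape of CardPeriodClosed stays exactly the same, so subscribers do not have to change anything.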

Now, in this scenario, you could make the choice to make all of those granular events private, and then make that summary event the only public event that can be subscribed to, right? That's not always going to work. It's something that you need to ask yourself, whether that is desirable within your system. But that's a way that we can reduce the change radius and that we can encapsulate those internal domain concepts. It also means that your consumers will subscribe to fewer events. Therefore, they have fewer events to process. They have less logic to write. They have fewer events that could be susceptible to traffic, to latency, and to code maintenance, and there's less room for error because of that.

So, to recap quickly these private and public events, right? When I think of private events, I think of more granular events. They are also more prone to change, and they can evolve at a higher rate. And that's totally okay. Why? Because they're only consumed within our service boundary. They can require more domain context to consume. Again, that's fine because the consumers will be part of our own service boundary, so they share that context. We do not want to make these accessible outside of the bounded context or outside of our service. And we can also, because of that, share more context because we are in a context where it's okay to share more information.

When it comes to public events, much less granular. They need to remain stable over time. And they need to also eliminate that functional coupling. That's the whole idea. Think about the event payload and look at that and say, "In which ways would the consumer have to understand the internals of our service in order to consume this message?" And you want to get to a point where there's nothing really there. You make those accessible to the rest of the system. And they could summarize multiple private events. They don't always have to. Okay?

When I talk about this, I always remind people, "Just remind yourself how it is to post a picture on the internet. You can post it and then change your mind and delete it. But you can't really guarantee that it hasn't already been copied by half the internet." And we've seen so many examples of this out in the media, right? But that's really how I want you to think about your public events. Once it's out there, it's going to start living a life of its own. So that's something that you need to think about upfront. "Am I okay with making this data public?"

So, let's see what we've achieved, right? Because when we talk about carefully designing our events so that they only have a single meaning, so that they're either public or private, what we basically achieve is a better control over the change radius. Because if we have a private event and that changes, fine, because we control the consumers of those private events. And when it comes to our public events, we are really, really, really careful about what data we expose as part of those public data contracts. And that way, we can basically start facilitating versioning upfront by protecting our service boundaries. It's not just about finding the right service boundaries, as hard as that already is, I know, I know, but it's also about how do we protect those boundaries in the long term when things start to change.

But we haven't even solved the entire problem, not even half of it at this point, because both your private and your public events will at some time have to evolve. And then we still have this problem of, "But I have all of these in-flight messages and delayed messages and messages that are in my error queue. How am I supposed to deal with those? You can tell me to upgrade, but I don't want to lose those old messages. That's really the point." So, how do we deal with that? And to do that, I think it's really important that we start to also rethink change as a general concept. Because many times when I talk to people, they're like, "Well, you make a change, and that's it. It's like, at this point in time, we have changed something." But that's not true. It's not a point-in-time event. A change in an event-driven system requires facilitation, and it requires coordination. And there are two angles that we need to consider to facilitate that transition. And that's both a versioning strategy and also an upgrade strategy.

And to do that, I want to take a step back and consider how it is that events flow over the wire, when you have one service that's publishing an event and another one that's consuming an event. Because when events are emitted, they are actually encoded into a specific format, right? It's also something we call format coupling. Now, this can be in JSON, it could be in Avro, it could be in Protobuf, in XML, whatever it is, right? Now, that encoding needs to be understood by the consumers of your event. That's why we have format coupling, right? There's an understanding, a shared contract there.

Now, most teams leave that open to interpretation. So they're like, "Yeah, it's in JSON," and that's where it ends. And what happens then is that on the consuming side, you basically get teams that start making assumptions on how that data should be interpreted. Like, "If there's an amount in the payload, what precision would this be?" "Well, I guess two numbers after a comma." Or, "What fields are actually required?" "Hmm, I don't know. Well, I guess this one is always around, so I assume it's required." And if there's a field in the payload that's called temperature, what unit is that expressed in? Is it Celsius? Is it Fahrenheit? Or is it actually the temperature of the light and it's expressed in Kelvin, which is completely different? Hmm.

The thing is that when this is missing in the payload, you get consumers that start making assumptions and, again, start creating functional coupling there. And you might say, "Ah, this isn't really important. You could use a schema to provide that information, but is it really necessary? It's also just so much overload and so boring." If you're thinking that, let me tell you a story about the Mars Climate Orbiter.

Now, this was a robotic space probe launched by NASA in December of '98. And yes, I just read that off my notes because there's no way I'm remembering dates, okay? But basically, what happened is that a year later, this spacecraft completely lost communication with NASA on its trajectory. It had started to deviate to a point where it got so close to the planet Mars that it just shattered into a million pieces and was completely destroyed.

So, you might wonder why that happened. Well, guess what? They did an investigation to try to figure out why this happened and what actually failed. And the reason was a mismatch in the unit systems, because you had one system using the metric system, and the other one was using US customary units. Ah, if only they would have used a schema that actually makes it clear in which specific unit they are expressing that value. Wouldn't that have been great? Then maybe all of this could have been avoided.

So, now that we've established the importance of actually using a schema to convey that type of information, let's take a look at an example of a schema. Now, this is just a simple example. It's in Avro. Even if you've never used that before, it should look familiar because it's JSON. And if you're watching this on your phone, then you might want to rewatch this session after because there will be some things that you will want to verify.

This one isn't that important, though. It's just an example to show you we can basically use a schema as a mechanism to structure the data to ensure that it's interpreted accordingly by our consumers. Because what we can do is we can add type information. We can even add complex data structures so that we can understand what would be part of it and what would not be. We can add field-level metadata to indicate that a certain field is required, to indicate what a default value could be for that field, or even to indicate aliases or even descriptions that can help us understand what this specific field is actually representing to begin with. We can also provide rules on how that data is formed.
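For readers who cannot see the slide, here is a small illustrative Avro schema, built as a Python dictionary and printed as JSON. This is not the schema from the talk: the record name is borrowed from the earlier example, and every field name, description, alias, and precision value is an assumption. It only uses Avro schema attributes (type, doc, default, aliases, logicalType) to show the kinds of information mentioned above.

```python
import json

# Illustrative Avro record schema; all names and values are assumptions.
CARD_PERIOD_CLOSED_SCHEMA = {
    "type": "record",
    "name": "CardPeriodClosed",
    "doc": "Summary of what was spent on a card in one period.",
    "fields": [
        {"name": "cardId", "type": "string"},
        {
            # Precision and scale are explicit instead of left to guesswork.
            "name": "amountSpent",
            "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2},
            "doc": "Total amount spent in the period.",
        },
        {"name": "currency", "type": "string", "doc": "ISO 4217 currency code of amountSpent."},
        {"name": "periodTo", "type": {"type": "int", "logicalType": "date"}},
        {
            # A union with null plus a default makes the field optional.
            "name": "customerId",
            "type": ["null", "string"],
            "default": None,
            "aliases": ["clientId"],
        },
    ],
}

schema_json = json.dumps(CARD_PERIOD_CLOSED_SCHEMA, indent=2)
print(schema_json)
```

A field without a default, such as cardId, is required; customerId is optional because it has a default, and the doc entries carry the kind of information consumers would otherwise have to guess.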

But schemas are also a very powerful mechanism to facilitate versioning. And that's the angle where I'm coming from today. Because once you have a schema for your payloads, then you can also define a versioning strategy. Now, a versioning strategy sets clear expectations to all of the consumers of your events on how they can expect that event payload to evolve. And it allows you to also make those changes predictable to consumers in a way that doesn't always break them, that in the majority of the cases will not break them.

So, let's take a look at the possible versioning strategies that are out there. First is forward compatibility. Now, when you use a forward compatibility strategy, we're basically saying, "You know what? You can delete any optional fields and you can add fields. Whether they're required or optional doesn't even matter. You can add any type of field." And this is usually used in traditional brokers like an Azure Service Bus, like a RabbitMQ, like an Amazon SQS, and things like that. And what it means is that consumers can basically use an older version of the schema, previous version of that schema, and still be able to process events that were constructed with a newer version of the schema. In this mode, our producers will start emitting events under a new version of the schema without breaking our consumers.

Now, let's look at this with an example to basically make it a little bit clearer. Now, let's say we have a first version of a schema and that contains a bunch of attributes. And both our producer and our consumer are using that same schema. Everything is good in the world, right? But then a change happens and we introduce this new version of the schema. Our producer starts to use this to emit events from a certain point in time. What happened in this specific example is in that v1, we had a currency that was optional. It was allowed for it to not have a value. In v2, we've removed that altogether. So, if we have a consumer that is using the old version of the schema, it will still be able to process any messages that were constructed with v2 of the schema. Currency will never be there, but that's fine. It could already deal with that because it allowed for it to not be there to begin with. Then, when the consumers are ready, they can start to upgrade.

Again, v3 is introduced. At this point in time, we're doing something else and we are introducing that customerId. So our producer upgrades to that new version of the schema and starts to emit events that contain a customerId. In this case, it's even required, because in Avro, fields are required by default.

Now, you have this subscriber that is still using v2 of the schema in which they don't recognize customerId. Fine, they're not going to break. There's a value there. Just bluntly ignore it. But we're not going to break when consuming that message. And that's what's important, okay?

Quick drink, because there's a lot I want to say.

Now, when you look at this from the infrastructure perspective that can provide a different view on things... So, what would happen is, when we have forward compatibility, your publisher can switch from v1 to v2 and publish those events. And your consumers can still hold on to v1 of the schema and successfully be able to basically consume messages that were constructed with v2 of the schema using exactly the same consuming code. So they won't immediately experience any impact. Nothing needs to immediately change. Of course, if they want to start using some of the fields that were added, well, they will have to opt in to a newer version of the schema. All right? But it doesn't break them. That's the point.
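The forward compatibility walk-through can be sketched in Python with a deliberately simplified model: a schema is a dictionary of field names, and a field is either required or has a default. The currency and customerId fields come from the example; cardId, amount, the REQUIRED marker, and the read function are assumptions. Real Avro resolves the writer's schema against the reader's schema during decoding; this sketch only imitates that on already-decoded dictionaries.

```python
from typing import Any, Dict

REQUIRED = object()  # marker for "no default, the field must be present"

# Simplified schemas: field name -> default value, or REQUIRED.
V1 = {"cardId": REQUIRED, "amount": REQUIRED, "currency": None}  # currency optional
V2 = {"cardId": REQUIRED, "amount": REQUIRED}                    # currency removed
V3 = {"cardId": REQUIRED, "amount": REQUIRED, "customerId": REQUIRED}  # field added


def read(message: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a message onto the reader's schema."""
    record = {}
    for name, default in reader_schema.items():
        if name in message:
            record[name] = message[name]
        elif default is not REQUIRED:
            record[name] = default  # missing optional field: use the default
        else:
            raise ValueError(f"missing required field: {name}")
    return record  # fields the reader does not know are simply ignored


# A consumer on v1 reads an event produced with v2 (no currency).
old_reads_v2 = read({"cardId": "c1", "amount": "25.00"}, V1)

# A consumer on v2 reads an event produced with v3 (extra customerId).
old_reads_v3 = read({"cardId": "c1", "amount": "25.00", "customerId": "cust-7"}, V2)
```

In both cases the older consumer keeps working with unchanged code; it just never sees the new field until it opts in to the newer schema.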

Another option is backward compatibility. And in this mode, we allow for the deletion of fields, both required and optional, and we allow you to add optional fields. Now, in this case, our consumers can use the next version of a schema to process messages that were constructed with an older or the previous version of a schema. Now, this is popular. I see it a lot with people who are using Kafka or event stores, in which they are saving events as they come in so that they can still replay them. If you want to be able to do message replay, then you will have varying versions of that schema and you want to be able to move forward.

So, let's look at an example. In this case, you could argue that consumers upgrade first. Although people use that terminology, I don't think it's useful. So let's use an example instead. We have v1 of a schema. Producer and consumer both use that schema. Everything is good in the world, right? But at that point, we introduce v2, which introduces an optional field called currency. Now, our consumer is using that v2 of the schema, but it still has messages around in the system that were constructed with v1. Okay, if I want to consume a message that doesn't contain currency, I can do that because currency is optional.

Again, we get to a point where everyone is upgraded and we introduce another version. Again, we have the subscriber running ahead, with messages still around that were constructed with an older version of the schema. In this case, what we did is we removed the periodTo attribute. Well, we have some events that still contain that. Fine. We'll just ignore it when we basically consume that message. And that way, we are not making breaking changes. We are making compatible changes that don't immediately break your consumers in a way that they have to do a code fix and deploy to be able to move forward.

From the infrastructure perspective, to visualize it, you would have, let's say, a queue, or whatever it is, or a stream or just an event store that contains a bunch of messages that were constructed with v1 of the schema. You might have a consumer that is using v2 of the schema, a newer version, and is still able to use that schema to deserialize and consume messages that were constructed with a v1. Are you still with me? I know this is complex to sort of understand, and that's why I have a bunch of visuals. So, definitely this is a part to re-watch.
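The backward compatibility rule can also be written down as a check, sketched here in Python on the same kind of simplified model (a schema is a dictionary of field names, each either required or carrying a default). The currency and periodTo fields come from the example; the other field names, the REQUIRED marker, and the can_read function are assumptions, and type changes are ignored entirely.

```python
from typing import Any, Dict

REQUIRED = object()  # marker for "no default, the field must be present"

V1 = {"cardId": REQUIRED, "amount": REQUIRED, "periodTo": REQUIRED}
V2 = {"cardId": REQUIRED, "amount": REQUIRED, "periodTo": REQUIRED, "currency": None}
V3 = {"cardId": REQUIRED, "amount": REQUIRED, "currency": None}  # periodTo removed

# Not allowed under backward compatibility: a new required field.
BROKEN = {"cardId": REQUIRED, "amount": REQUIRED, "periodTo": REQUIRED, "customerId": REQUIRED}


def can_read(reader_schema: Dict[str, Any], writer_schema: Dict[str, Any]) -> bool:
    """True if every field the reader expects was written or has a default."""
    return all(
        name in writer_schema or default is not REQUIRED
        for name, default in reader_schema.items()
    )


# Backward compatibility: the NEWER schema reads data written with the OLDER one.
v2_reads_v1 = can_read(V2, V1)       # added optional field: fine
v3_reads_v2 = can_read(V3, V2)       # removed field is ignored: fine
broken_reads_v1 = can_read(BROKEN, V1)  # old messages have no customerId
```

The last case is why backward compatibility only allows adding optional fields: old messages cannot supply a value for a new required one.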

Another option is to have full compatibility. Now, this is the most restrictive mode in the sense that you can only delete optional fields and only add optional fields. So you can't really play around with anything that is required. And in this case, what happens is that your consumers can use both the old and the new version of the schema while processing messages that were constructed with an older or newer version of the schema. And that sounds so confusing. So, let me put it like this. I'm a consumer and I am using v2 of the schema. Well, I can use this schema to process events that were constructed with v1 of the schema, with v2 of the schema, and with v3 of the schema, because I kind of have both forward and backward compatibility. All right? You get full compatibility.

But finally, there's also this concept of transitive compatibility, and this is not really a compatibility mode unto itself, because any compatibility mode that we talked about, forward, backward, and full, can also be made transitive. Now, we've been talking about next and previous versions of the schema, and that's important, because compatibility applies only to the previous or the next version, unless you make it transitive. When you're saying, "I have transitive full compatibility," then it applies to all of the versions of the schema.

Now, this is extremely restricting, so to speak, right? But this can be very helpful if you have systems in which the events are very, very long-lived. They're stored in some kind of an event store or data lake. And you don't want to modify them. You basically want to be able to consume them, even if they span multiple versions of a schema. Well, then this is what you kind of want. You want that type of transitive compatibility. It could also be that you're using forward compatibility in a broker like Azure Service Bus, but it's such a quickly evolving system that your in-flight messages might have multiple versions of the schema that's being used. Then making that compatibility strategy transitive can also be a way to deal with that.
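The difference between checking only the previous version and checking all versions can be sketched as follows. This is a simplified model under stated assumptions: a schema is a map of field name to a (type, required) pair, the compatibility rule is the full mode described above, and the field names and types are made up.

```python
from typing import Dict, List, Tuple

# A schema is modeled as {field name: (type name, is_required)}.
Schema = Dict[str, Tuple[str, bool]]


def compatible(a: Schema, b: Schema) -> bool:
    """Full compatibility between two versions, in this simplified model."""
    for name in a.keys() | b.keys():
        if name in a and name in b:
            if a[name] != b[name]:   # type or required flag changed
                return False
        else:
            _, required = a[name] if name in a else b[name]
            if required:             # only optional fields may come or go
                return False
    return True


def can_register(history: List[Schema], candidate: Schema, transitive: bool) -> bool:
    """Check the candidate against the latest version, or against all of them."""
    to_check = history if transitive else history[-1:]
    return all(compatible(old, candidate) for old in to_check)


v1 = {"amount": ("number", True), "periodTo": ("string", False)}
v2 = {"amount": ("number", True)}                                   # periodTo removed
v3 = {"amount": ("number", True), "periodTo": ("integer", False)}   # re-added, new type

print(can_register([v1, v2], v3, transitive=False))
print(can_register([v1, v2], v3, transitive=True))
```

Against v2 alone, v3 only adds an optional field, so the non-transitive check passes. Against the whole history, v3 clashes with the `periodTo` that v1 messages still carry, which is exactly the case that matters when old events stay around in an event store.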

And that rounds up sort of the compatibility strategies. But before we move on, I do also want to point out a common misconception. And that's the idea that schema versioning is the same as semantic versioning. It's not at all the same. And I see these two concepts being confused all the time. Now, the thing is that semantic versioning, or SemVer, for the people who know it by that name, is all about APIs. And where it gets really confusing is because you can express data contracts, payloads, and stuff like that in an API, in a POCO, in a class, with properties. And that's where it gets confusing for a lot of people. Because, depending on your compatibility mode, deleting an attribute that's part of your payload is not a breaking change. But if you delete a property in a public class, according to SemVer, that is a breaking change. So, those things are not the same thing. And it's really important that we differentiate it because schema versioning strategies do not adhere to semantic versioning.

Now, I see many, many, many folks out there who basically express their event contracts in POCOs and then have them in NuGet packages and share them in between producers and consumers, right? And this is where this misconception happens. So, when you have a schema versioning strategy that you're applying and you're using NuGet packages to express those data contracts, that NuGet package should not adhere to SemVer because then you are basically conflating two different concepts. Okay?

When you think about schema compatibility, it's really all about the ability to read the data. I want to be able to receive that message and deserialize it and look at the payload and not become blocked there. The whole goal is to avoid poison messages. And adhering to a compatibility mode avoids breaking your consumers. It allows you some flexibility to make changes without causing this type of disruption to your consumers. Okay?

Now, for example, it's also important to understand that remaining compatible doesn't mean that you can enjoy all of the additions without upgrading, right? So, let's say that you say, "I have a field and I want to rename it," right? "I want to rename this field from A to B." What that really means from a schema perspective, assuming that that was an optional field to begin with, is that you can delete A and add B. For your consumers, that might not be a breaking change because A was already optional. However, in order for them to start using B, they do have to upgrade to a new version. But they won't be immediately broken. So, that's a differentiation to make.

Now, we've been talking quite a while about schemas. But that's not really the only thing that can change in a payload, right? I want you, everyone in the chat, start thinking about this. When you put an event out there, you publish it or you send a message, what can change? We've talked about the payload, but what else can change? Because the payload is just part of what is put on the wire. There's something else that travels with that schema, so to speak, on the wire, and that's metadata, right? The message headers. They are also part of what travels on the wire, and making changes to metadata or message headers can also equally break your consumers. Because metadata can be relied on by intermediaries for routing purposes, "Oh, if it's this type, then I'm going to send it over there," or even consumers for processing purposes. Or it could be relied on by both intermediaries and consumers for filtering purposes. So, it's also really important to think about that metadata. And we also want to structure that part of our payload accordingly. And that's exactly where CloudEvents come in.

Now, CloudEvents is a specification that standardizes the envelope of the message, of the event that you are emitting across your system. It's a specification that was developed under the CNCF, which stands for Cloud Native Computing Foundation. I always have to sort of think about, "What was it again? What was it again?" I just always say CNCF because it's so much easier. Cloud Native Computing Foundation, right? What they basically did is that they standardized the expected metadata for every message that is flowing through a system.

And it introduces seven metadata attributes or message headers, if you will. Four of them are required. The id, something that uniquely identifies this specific event that is flowing through the system. The type of event. The specversion. And the specversion refers to the CloudEvents specification version. That should remain mostly single. But also the source, where does this event come from, which service did emit this event, so that you're able to recognize that.

And there are three additional message headers that you can use, but you don't have to. One of them is the dataschema. Remember the schema we've been talking about? The OrderPlaced event looks like this, has these attributes. Well, the dataschema header allows you to point at a specific schema that your consumers can expect when they consume that message. There's also the datacontenttype. Is this XML? Is this JSON, Avro? Whatever it is, right? But also the time at which this event was constructed.
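As an illustration of the attributes just listed, here is a minimal sketch of an OrderPlaced event in the CloudEvents JSON shape, plus a check for the four required attributes. The `source`, `type`, and `dataschema` values and the payload are hypothetical; only the attribute names come from the specification.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def make_order_placed_event(order_id: str) -> dict:
    return {
        # Required attributes
        "specversion": "1.0",                        # version of the CloudEvents spec
        "id": str(uuid.uuid4()),                     # unique per event
        "source": "/sales",                          # hypothetical emitting service
        "type": "com.example.sales.OrderPlaced",     # hypothetical event type name
        # Optional attributes mentioned in the talk
        "datacontenttype": "application/json",
        "dataschema": "https://example.com/schemas/order-placed/v2.json",  # hypothetical
        "time": datetime.now(timezone.utc).isoformat(),
        # The payload itself (hypothetical shape)
        "data": {"orderId": order_id},
    }


def missing_required(event: dict) -> list:
    """Names of required attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


event = make_order_placed_event("order-42")
print(json.dumps(event, indent=2))
print(missing_required(event))
```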

Now, CloudEvents also supports two modes, structured and binary. I won't go into that too much, but I want you to be aware of it so you can look it up further. I have a bunch of resources also at the end.

What I also want to give you and make you understand is that the CloudEvents specification is completely protocol-agnostic, but it does support multiple or does provide multiple protocol bindings, like HTTP and AMQP and MQTT and NATS and you name it. There's a bunch of them. So there's a lot of support out there.
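To give a feel for the two modes, here is a rough sketch of how the same event could be laid out over HTTP. In the HTTP binding, structured mode puts the whole event in the body, while binary mode moves the attributes into `ce-` prefixed headers and leaves only the data in the body. The sample event values are made up, and the sketch skips the header value encoding rules that the binding defines.

```python
import json


def to_structured_http(event: dict):
    """Structured mode: the whole event, attributes and data, is the body."""
    headers = {"content-type": "application/cloudevents+json"}
    return headers, json.dumps(event).encode("utf-8")


def to_binary_http(event: dict):
    """Binary mode: attributes become ce-* headers, the body is only the data."""
    headers = {}
    for name, value in event.items():
        if name == "data":
            continue
        if name == "datacontenttype":
            headers["content-type"] = value   # maps to the regular HTTP header
        else:
            headers[f"ce-{name}"] = str(value)
    return headers, json.dumps(event["data"]).encode("utf-8")


event = {
    "specversion": "1.0",
    "id": "1234",
    "source": "/sales",                       # hypothetical
    "type": "com.example.OrderPlaced",        # hypothetical
    "datacontenttype": "application/json",
    "data": {"orderId": "o-1"},
}

structured_headers, structured_body = to_structured_http(event)
binary_headers, binary_body = to_binary_http(event)
print(structured_headers, structured_body)
print(binary_headers, binary_body)
```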

But what, to me, from a versioning perspective and from a structuring perspective, is really important is that CloudEvents provides an extension model that allows you to define attributes that are relevant within your system boundary. So, we talked about the message headers that CloudEvents defines for you. But maybe in your system, you're saying, "Well, we kind of add a user ID message header to every event that we emit in the system. And that is required, and it should be a GUID." Okay. Then using CloudEvents extension model, you can define that specific message header and also structure that it has to be required and that it has to be a unique identifier, right? And this way, it allows you to structure the entire payload, both the data and also the message headers or the metadata that is flowing throughout your system.
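The user ID example can be sketched as a small check that a system could run on every event. The attribute name `userid` is an assumption (CloudEvents attribute names are lowercase letters and digits), and the function is hypothetical rather than part of any CloudEvents SDK.

```python
import uuid
from typing import Optional


def userid_problem(event: dict) -> Optional[str]:
    """Return what is wrong with the `userid` extension, or None if it is fine."""
    value = event.get("userid")
    if value is None:
        return "missing required extension attribute 'userid'"
    try:
        uuid.UUID(str(value))       # raises ValueError if this is not a GUID
    except ValueError:
        return f"'userid' is not a valid GUID: {value!r}"
    return None


base = {"specversion": "1.0", "id": "1", "source": "/sales", "type": "OrderPlaced"}
good = {**base, "userid": "3f2504e0-4f89-11d3-9a0c-0305e82c3301"}
bad = {**base, "userid": "not-a-guid"}

print(userid_problem(good))
print(userid_problem(bad))
print(userid_problem(base))
```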

But what if the metadata then needs to change? Because we talked a lot about how the schema can change. And in order to understand that, let's just circle back for a moment to how metadata can be used. Because it could be relied on by intermediaries and consumers for routing, for processing, or for filtering, right? And then the question becomes, "Can you really change the metadata?" I mean, maybe you can change a harmless description here or there, but a significant change in the metadata, like changing the attribute or the header name, can cause havoc in your systems, right? Because these intermediaries and consumers are also relying on that information. Even if you think about the addition of a property, the addition of an attribute, of a header, can also be disruptive. Because, okay, it's not going to immediately break someone. But if your expectation as a producer is that that header is taken into account, then they will need to account for that. And therefore, it's best to think of any change in your metadata as a breaking change. And that warrants a new event type.

Now, whether it's a change in your metadata, in your message headers, or in your dataschema, sometimes we do need those breaking changes. Even if you're using a compatibility strategy for your schemas, sometimes you're like, "I'm sorry, the change is so impactful that I can't adhere to this compatibility strategy. I'm going to have to make a breaking change." So, what do we do then?

Well, the first thing we need to do is deprecate the old version and set a deadline on which that version will be completely deleted and non-existent in the system. We need to notify our consumers. And within that deprecation window, from which we say, "We're deprecating it," to where we're removing it, we need to do something called dual publishing. That means that you are going to emit both the older version of that event type and the newer version of that event type, which, to be clear, is a different event type, not just a new version, okay? Now, once the deprecation window is over, you stop publishing the old event type. You just remove it from the system altogether.
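Dual publishing can be sketched as a publisher that always emits the new event type and keeps emitting the old one until the deadline. The class, the event type names, the payload shapes, and the `publish` callback are all hypothetical; a real publisher would hand the messages to its broker client instead.

```python
from datetime import date
from typing import Callable, List, Tuple


class OrderPublisher:
    """Publishes the new event type, plus the old one while the window is open."""

    def __init__(self, publish: Callable[[str, dict], None], retire_old_on: date):
        self._publish = publish
        self._retire_old_on = retire_old_on   # announced deprecation deadline

    def order_placed(self, order_id: str, total_cents: int, today: date) -> None:
        # New event type: always published.
        self._publish("OrderPlacedB", {"orderId": order_id, "totalCents": total_cents})
        # Old event type: only during the deprecation window.
        if today < self._retire_old_on:
            self._publish("OrderPlacedA", {"orderId": order_id, "total": total_cents / 100})


sent: List[Tuple[str, dict]] = []
publisher = OrderPublisher(lambda event_type, payload: sent.append((event_type, payload)),
                           retire_old_on=date(2025, 6, 30))

publisher.order_placed("o-1", 1999, today=date(2025, 1, 15))   # inside the window
publisher.order_placed("o-2", 500, today=date(2025, 7, 1))     # after the deadline
print([event_type for event_type, _ in sent])
```

Keeping the deadline in one place makes the end of the window a configuration change rather than a code hunt, although removing the old publishing code afterwards is still a separate cleanup step.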

Now, in order to implement this, it's kind of impossible to go through all of the possible scenarios because it depends on the broker that you're using, compatibility strategy, and so many other things. So I want to discuss an option with Azure Service Bus, which I know many of you commonly use, in a situation where your messages are short-lived. So you're not storing them in an event store. You just consume, delete, and it disappears into your business data in a different form. So we're using a forward compatibility mode.

Now, what happens at this point is, our publisher is now publishing this vA of the message. This is the event type. And we have multiple subscribers. Now, when you use Azure Service Bus, every subscriber gets its own virtual subqueue. So each of them has in-flight messages there. Everything is good in the world. Now we've decided that we have a completely breaking change and we need this whole new type of event, right? So, at this point, what starts is the dual publishing window, for which you said, "The deadline is three, six, nine months from now," whatever makes sense within your system boundaries, okay? And you say, "Within that period, I'm going to publish vA and vB." Now, in this specific case, we chose to have a topic-per-event-type type of topology, right? So every event type gets its own topic. So that means that the vB messages go to a separate topic. Now, we continue to publish both of them, which gives our consumers the time to upgrade when it fits their schedule, as long as it is within the deprecation window.

Now, let's say that we have our subscriber number one, and they do want to upgrade. But they, of course, don't want to lose all of those in-flight messages that are still in that subqueue. We don't want to lose them, of course not, because then our system would be inconsistent. So, the way that we go about this is, first... And the order is important, so pay attention. First, we want to subscribe to this new topic, to this vB topic, right? And then we get our own subqueue, we start receiving messages there. And as close as possible to that operation, we want to unsubscribe from that topic number A. Now, that means that our in-flight messages are not deleted. It's just a subscription that is gone, which means that new messages will stop flowing into that subqueue, but we still maintain the in-flight messages that we had.

Now, if you change the order of these operations and you're saying, "I'm going to unsubscribe from topic A first and subscribe to topic B second," in a high-throughput system, that can lead to message loss because those operations are not atomic. There's no transaction around these types of operations, which means that there might be split seconds in between where messages are coming in and you're not receiving them, not on the topic A, not on the topic B. That's why it's important to first subscribe to the new topic, then unsubscribe from the old one. But of course, you can see it coming, the reverse is also true. If I'm not losing messages, well, I might be getting duplicate messages now. And this is why it's also important that you use a logical message ID to deduplicate the ones that have the same meaning, right?
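The ordering and the deduplication can be sketched with a small in-memory model. This is not the Azure Service Bus SDK: the `Topic` class, the `logical_id` field, and the message shapes are assumptions that only mimic the behavior described above, including the subqueue keeping its in-flight messages after unsubscribing.

```python
from collections import deque
from typing import Deque, Dict, List, Set


class Topic:
    """In-memory stand-in for a topic that copies messages into subqueues."""

    def __init__(self) -> None:
        self._subqueues: Dict[str, Deque[dict]] = {}

    def subscribe(self, subscriber: str, subqueue: Deque[dict]) -> None:
        self._subqueues[subscriber] = subqueue

    def unsubscribe(self, subscriber: str) -> None:
        # Only the subscription goes away; messages already in the
        # subqueue stay there until the subscriber drains them.
        self._subqueues.pop(subscriber, None)

    def publish(self, message: dict) -> None:
        for subqueue in self._subqueues.values():
            subqueue.append(message)


def dual_publish(topic_a: Topic, topic_b: Topic, logical_id: str) -> None:
    topic_a.publish({"type": "vA", "logical_id": logical_id})
    topic_b.publish({"type": "vB", "logical_id": logical_id})


topic_a, topic_b = Topic(), Topic()
queue_a: Deque[dict] = deque()
queue_b: Deque[dict] = deque()

topic_a.subscribe("subscriber-1", queue_a)
dual_publish(topic_a, topic_b, "order-1")    # in flight on vA only

topic_b.subscribe("subscriber-1", queue_b)   # first: subscribe to the new topic
dual_publish(topic_a, topic_b, "order-2")    # overlap: arrives on both topics
topic_a.unsubscribe("subscriber-1")          # then: unsubscribe from the old one
dual_publish(topic_a, topic_b, "order-3")    # vB only from here on

seen: Set[str] = set()
processed: List[str] = []
for message in list(queue_a) + list(queue_b):
    if message["logical_id"] in seen:
        continue                              # duplicate with the same meaning
    seen.add(message["logical_id"])
    processed.append(f'{message["logical_id"]}:{message["type"]}')

print(processed)
```

Every order is handled exactly once: nothing is lost during the switch, and the one message that arrived on both topics is dropped the second time because it carries the same logical ID.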

Now, okay, once we've basically completely emptied that subqueue, those in-flight messages, they're all gone, we can see there's nothing there, at that point, it is safe to remove the handler code that we had for the event type vA. And we just are fully now upgraded to this new event type. And once the entire system has been able to upgrade it and we reach our deadline, then the publisher can say, "Okay, we're done. We waited long enough. We assume everyone in the system has upgraded. We stop publishing event type vA. It doesn't exist for us anymore." And now we've basically closed the circle on schema versioning and upgrading strategies.

But the thing is that all of this, although it sounds really great, it's nothing but a promise. It's an informal handshake-based contract, cross-my-heart-and-hope-to-die type of thing. But these promises can easily be broken, not always because we have bad intentions or we don't care, but mostly just by accident. But that can cause massive issues. I really like this quote by Clemens Vasters, where he said that "Distributed systems are hard enough while being disciplined about sticking to the promises that we make. And they turn into absolute chaos when breaking those promises becomes easy."

So, it's important that we find ways to force ourselves to keep our promises. And how do we do that? Well, for starters, we could use a schema registry. I would like to hear in the comments who of you has heard of this before, who of you is using one, and which one you're using. Drop me a comment. I'll definitely scroll back.

But basically, schema registries give you a centralized approach to schema evolution. That is, it basically gives you a central place where you can go and see what the schema is. Both producers and consumers are able to access that information. That also means that you're basically putting your money where your mouth is because schema registries have the ability to enforce your compatibility strategy. If you are saying, "Okay, it's forward compatibility," well, then it will basically not let you make changes that are not forward compatible. It means that you don't have to include your schemas in your payloads, duplicate them across consumers, have these shared NuGet packages, because it gives you that central place. And as I mentioned, it takes that compatibility promise and it formalizes it.

Now, what I also really like about this is that you could even take it a step further and say, "You know what? Before I emit a message, before I produce it, I'm going to query the schema and enforce it so that I never really put out a poison message again." Or on the other side, "When I receive a message, I'm going to query the schema registry to ensure that it adheres to the schema, or I'm just going to push it out because I don't know how to consume that," right? So you get runtime validation of your payloads.
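Runtime validation on both sides can be sketched like this. The registry here is a hypothetical in-memory dictionary rather than a real schema registry client, the schema model is deliberately tiny, and treating "push it out" as moving the message to a dead-letter handler is an assumption.

```python
from typing import Callable, List

# Hypothetical registry content: event type -> {field: (python type, required)}
REGISTRY = {
    "OrderPlaced": {"orderId": (str, True), "currency": (str, False)},
}


def violations(event_type: str, payload: dict) -> List[str]:
    """Compare a payload with the registered schema for its event type."""
    schema = REGISTRY.get(event_type)
    if schema is None:
        return [f"unknown event type '{event_type}'"]
    problems = []
    for name, (expected, required) in schema.items():
        if name not in payload:
            if required:
                problems.append(f"missing required field '{name}'")
        elif not isinstance(payload[name], expected):
            problems.append(f"field '{name}' is not {expected.__name__}")
    return problems


def publish(event_type: str, payload: dict, send: Callable[[str, dict], None]) -> None:
    """Egress check: refuse to emit a message that would be poison."""
    problems = violations(event_type, payload)
    if problems:
        raise ValueError("; ".join(problems))
    send(event_type, payload)


def receive(event_type: str, payload: dict,
            handle: Callable[[dict], None],
            dead_letter: Callable[[str, dict, List[str]], None]) -> None:
    """Ingress check: set aside anything that does not match the schema."""
    problems = violations(event_type, payload)
    if problems:
        dead_letter(event_type, payload, problems)
    else:
        handle(payload)


sent, handled, rejected = [], [], []
publish("OrderPlaced", {"orderId": "o-1"}, lambda t, p: sent.append((t, p)))
receive("OrderPlaced", {"orderId": "o-1", "currency": "EUR"},
        handled.append, lambda t, p, why: rejected.append(why))
receive("OrderPlaced", {"currency": 978},
        handled.append, lambda t, p, why: rejected.append(why))
print(sent, handled, rejected)
```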

But there's one big problem with the schema registry options that exist out in the industry today, and that's that they are very tightly coupled to specific brokers, to specific protocols. Now, I'm wondering, who of you is using something like Azure Service Bus or RabbitMQ or SQS? And those of you who are, have you ever used a schema registry? And I'm willing to bet that the answer is no. Because the thing is that schema registries are not a new concept, right? They are actually a well-known and very much utilized concept in the Kafka space or people who are using Azure Event Hubs or something like that, because those things tend to be tightly coupled together. Even in Azure, when you create an Event Hubs namespace, then you get a schema registry feature. But if I'm using Azure Service Bus, I can't really access a schema registry feature. I would need an Event Hubs namespace, which I'm not going to pay for if I'm not using that broker. It doesn't really make sense, right? So it's not really accessible outside of that. But I have high hopes that that is about to change with the introduction of xRegistry.

Now, xRegistry is a set of specifications that provide a lot of guidance on how to define metadata. It's being developed, again, under the CNCF, Cloud Native Computing Foundation. Now I know for sure. And it's actually being developed by the same group that has developed CloudEvents. I've been part of that group for half a year, no, actually, nearing a year at this point, and participating in that work. But at its core, I want to also clarify that xRegistry is not at all coupled to messaging. It's actually applicable for any type of information flow where you think it's important to define any catalog of information that is being shared between parties where you want a central registry. But in the context of messaging, which is what we are talking about here today, it does provide multiple additional specifications that can help you categorize endpoints, message definitions, the envelope, message headers, and schemas.

So, it gives you this foundation for a centralized registry, not only of schemas like the existing schema registries do, but also for your endpoints and your message definitions. So it goes way beyond what the existing schema registries offer today. It is also protocol-, broker-, and vendor-agnostic. Again, multiple protocol bindings. You have the ability to use it with multiple protocols, with multiple brokers, but they are agnostic to those individual implementations. And it's very, very useful for discovery as well. So you can go to a specific registry, look, "Oh, there's a sales endpoint. What does that sales endpoint emit? Oh, it's actually emitting an OrderPlaced event. Oh, and what's the schema of that OrderPlaced event? Well, there you have it." So, you basically get all of that information in a single place.

Now, you can already use xRegistry today because part of the group's work has been to develop a server that you can run in a Docker container and already utilize today. But I'm hoping that existing schema registries out there or even completely new servers will basically implement this specification. But even you can use this today because you can host a registry in a file. And you can host that file next to your code in GitHub or put it on S3 or Blob Storage or whatever it is.

Now, to make this a little bit visually also understandable for you, you basically have the concept of a registry. And that contains multiple groups. And each individual group can have then the resources of a specific type, which can also be versioned, right? If we translate this into our messaging space, a registry contains endpoints, which basically represent a group type in the specification. That endpoint adds channel information, protocol information, or even the envelope that is used for the messages that flow inside of that endpoint. It also has a collection of resources, which in this case are message definitions. "Oh, I have an OrderPlaced event. Oh, I have an OrderPaid event." And this is the metadata that those events carry. Now, a message definition can only have a single version.

On the other hand, we also have schema groups. And schema groups can have multiple related schemas that are defined together. For example, the payload for the OrderPlaced event, right? What is the schema for the content of that type of a message? Now, that can have multiple versions as well. And what's really nice is that specification also allows for cross-linking. So you can have a message definition for the OrderPlaced event that points to the schema that we can expect for that specific event when we are consuming it.
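The registry, group, resource, and version layering, together with the cross-link from a message definition to its schema, can be sketched with a plain dictionary. The key names and paths below are simplified for illustration and are not copied from the xRegistry specification; the schema contents are made up.

```python
from typing import Tuple

registry = {
    "endpoints": {                      # groups
        "sales": {
            "protocol": "AMQP",         # illustrative channel and envelope info
            "envelope": "CloudEvents",
            "messages": {               # resources: message definitions
                "OrderPlaced": {"schema": "/schemagroups/sales/schemas/order-placed"},
                "OrderPaid": {"schema": "/schemagroups/sales/schemas/order-paid"},
            },
        }
    },
    "schemagroups": {                   # groups
        "sales": {
            "schemas": {                # resources: schemas, each with versions
                "order-placed": {"versions": {
                    "1": {"required": ["orderId"], "optional": ["periodTo"]},
                    "2": {"required": ["orderId"], "optional": ["periodTo", "currency"]},
                }},
                "order-paid": {"versions": {
                    "1": {"required": ["orderId"], "optional": []},
                }},
            }
        }
    },
}


def resolve(root: dict, path: str) -> dict:
    """Follow a slash-separated link from the registry root."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def latest_schema_for(root: dict, endpoint: str, message: str) -> Tuple[str, dict]:
    """Discovery: endpoint -> message definition -> linked schema -> newest version."""
    link = root["endpoints"][endpoint]["messages"][message]["schema"]
    versions = resolve(root, link)["versions"]
    latest = max(versions, key=int)
    return latest, versions[latest]


print(sorted(registry["endpoints"]["sales"]["messages"]))
print(latest_schema_for(registry, "sales", "OrderPlaced"))
```

This mirrors the discovery walk described in the talk: find the sales endpoint, see which events it emits, then follow the link to the schema a consumer can expect.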

So, how does this get relevant for versioning specifically? Well, it gives you this single point of truth for endpoints, for messages, and for your schemas. It also defines compatibility in its core specifications, which is one of the parts I was heavily involved in as well. And it can, therefore, be used to query on egress when you produce a message so that you can avoid poison messages. Not only make sure that the schema is correct, but make sure that your message headers are correct, make sure that you're the appropriate service that's going to emit that event as well. So you get basically that capability for both the schema and the metadata. And that way, we can also facilitate the evolution of our events through its deprecation model.

Now, remember earlier I said, "Well, when we make a breaking change, what we need to do is mark the event as deprecated, set a deprecation date." We're going to effectively remove it after that date. That's not going to change. But we kind of also want to notify our consumers and make them aware that "Well, this event type is now replaced by this new event type." And we want to have some documentation around that. And I kind of just danced around that as if that is easy. But that's the hardest part, because think about it. When we are a producer, we don't want to know our consumers. That's the whole decoupling we're looking for. We don't know who our consumers are. They might be inside the system, outside the system. So how are we supposed to notify them when there is a breaking change?

And this is something that xRegistry also helps with. Why? Because it contains CloudEvent definitions that can notify of changes. So, as a consumer, I can say, "I care about these message definitions that are part of this endpoint. And whenever a version is added, I want to know that. Whenever a schema is deprecated, I want to know that. Whenever a message is deprecated, I want to know that." And now you can keep that decoupling between your producers and your consumers. Because any xRegistry-compliant server can emit these events, allowing your consumers to be notified without coupling them together.

And that brings me to the end. Wow. I know it was a lot. You probably want to rewatch this. But I do want to recap to give you the main points that I don't want you to forget. The first one is that facilitating versioning starts with designing your events appropriately. Make them granular, make them business-meaningful, and basically, differentiate public from private events.

Use schemas and define a versioning strategy for your schemas early on. Make it visible to the rest of the system how they can expect these things to evolve over time. Also, define the event metadata with CloudEvents. You have the preset attributes that are there. If you have any additional ones, you can use the extension model.

And you can enforce the compatibility strategy that you selected using a schema registry or an xRegistry that also enforces the schemas for you with its compatibility mechanism.

Have a breaking changes upgrade strategy in place. It should be documented. How are we supposed to upgrade something when we completely change the event type, given our broker, our topology, and our compatibility mechanism as well?

And notify your consumers of breaking changes early on. The earlier that they know, the more flexibility you can give them in upgrading and making this a friction-free type of experience.

And that was it for me. I hope you enjoyed it. I have a bunch of resources, as always, behind this GitHub QR code. And I'm looking forward to hearing any of your questions.

01:03:00 Matt Ellis

Wow, what a fantastic session to kick things off with. There was a ton of great content in that, Laila. Thank you very much.

Thank you.

01:03:09 Matt Ellis

Lots of really interesting ways of thinking about things, which has really sparked a bunch of ideas, I've just got to say. There's a whole number of things, well, just simple things as well, like good design being important with your event types, sort of encapsulation, almost sort of normalization as well of what the data is going to be. And I love the Mars lander story as well. That's such a good-

I had to include that.

01:03:39 Matt Ellis

Yeah, yeah. But it's such a good lesson, isn't it, in making a mess of things.

Absolutely.

01:03:47 Mehul Harry

You put together so much great content in there. So, when going through it, what were some of the bits that stand out for you, where you're like, either, "That was something new to me," or, "That was something very interesting"? I know there's a lot of new stuff you're working on with the CNCF on those specs and all that kind of stuff. But what... Is it the CloudEvents that looks very promising to you? What kind of stands out to you in sort of this new frontier horizon?

That's a good question. I think, really, it's the combination of things, right? Also, it's not a coincidence that the xRegistry specifications have been developed by the same group who have been working on CloudEvents, because they also saw that CloudEvents is a foundation, the first step that we need in order to be able to formalize just the message envelope more. And then we can take it a step further and say, "Okay, but the schemas are also important, and there are solutions out there, but they only solve part of the problem. So we actually want to solve the bigger picture. And we want to have both discovery and validation techniques that make it available to solve all of the problems that we face," because the whole idea of event-driven systems is to have subsystems that are decoupled.

But that doesn't come without a cost. It's, first of all, very hard to achieve. And second of all, it does, yeah, create these sort of situations in which you're like, "Okay, but I have a bunch of consumers, and I don't know who they are." That's the point of having a decoupled system, right? But that also has repercussions when you try to evolve that system. And I think these things combined are bringing an answer to those types of problems that people have been running into the wall year after year, me included. So, this is bringing together a lot of learnings of not just the last period with the CNCF but from many, many, many years building these types of systems.

Yeah.

01:05:51 Matt Ellis

Speaking of the CloudEvents as well, we ran a little poll while it was running there, and it was pretty evenly split, really, with people who are either not using or have never heard of CloudEvents. There's only about 10% of the respondents using them. Is this one of the things where you'd strongly advise, it's like, "You folks should look at this"?

Well, yes, especially if you have a large system in which you may have multiple participants, you have a very large organization, and you can't just rely on communication between teams to solve these types of problems, which you could arguably say that those are the best environments to build distributed systems in, then definitely this is something you should be looking into at least to see how this can help facilitate basically maintaining your distributed systems in the long term, for sure. Because it helps create that structure and it helps to create expectable things. That's all that really matters is that we know what we can expect when we are subscribing to a specific event. We want that information to be stable. And if we want something to be stable, then we need to structure it to begin with. So, that's definitely the first step, for sure.

01:07:08 Matt Ellis

Yeah, yeah, yeah, cool. With the other poll, by the way, only about a third of viewers are differentiating between public and private events. And that whole thing sparked a really interesting set of conversations and questions about public-private events.

01:07:22 Matt Ellis

Mehul, have you got any other questions from the chat there, by any chance?

There are some. Somebody, they're just mentioning some technologies between Kafka and MassTransit and all that good stuff. I'd just recommend, Laila, if you have a moment, hang out in the chat for a little bit. Folks have been actually... What's great about these chats is they're kind of talking also amongst themselves.

One thing that I was kind of thinking is, as I'm watching all the stuff, there is an overall meta message of just change, right? I like what you said, change is not a point in time, which is right, because change is a process. It takes time. It's more about the decision when we make the change that's always in our mind, right?

Right.

But you even said change is inevitable. Is it just that in this event-driven space that folks are not thinking enough about these problems and how to address them, which is why you're kind of saying, "Look," which is why new specs are coming out and all that kind of stuff? Because I would think this space, it's still getting new specs, right? That's also surprised me.

Oh, okay. Yeah, okay. So, distributed systems aren't new. That's for sure. Messaging is also not a new concept, at all. MSMQ has been around for how long? I don't even know. And even before that, right? It's like, messaging is really the concept of sending someone a letter. So it does even exist even outside the software industry as well. It's all about this asynchronicity of communication.

But I think as systems have evolved and have started using more and more these types of capabilities, we've also been running into the pain points of that: how it affects your organization, but also how do we really build decoupled systems? Because the thing is that I always have a bit of an allergic reaction because when you look at how brokers are marketed, then you basically see them being marketed as an infrastructure that allows you to build... that basically gives you decoupling, right? But that's not true. You can use messaging and have a tightly coupled mess that is then distributed. That's absolutely possible. The only real decoupling you get out of the box when you use a broker is temporal decoupling, basically because it introduces that asynchronicity. But there are all other forms of coupling that we are still basically susceptible to.

And by the way, for those of you who are listening, I have a whole series on my LinkedIn where I talk through all the different types of coupling, how we can identify that type of coupling, and how we are supposed to deal with it. But we are definitely still susceptible to all of that coupling and not safe from ending up with a tightly coupled mess.

So, I think many of those things, we are figuring it out. The whole idea of getting the right service boundary is an incredibly difficult exercise. It requires you to have access to business experts to be able to understand how the domain works. That in itself is a massive challenge. That's why I always say, as engineers, we are not just technical people. We need to be able to understand the business or we won't be able to find the right service boundaries because those things are tightly coupled together, if that makes sense.

01:10:57 Matt Ellis

Yeah, absolutely. And I wish we had a bit more time to dive into it. There was a great question on the chat. So if you get the chance to pop back in. But it was-

Yes.

01:11:04 Matt Ellis

... asking essentially, the service boundaries is really useful. It was about talking about service boundaries and essentially what makes something public or private. If you've got no external consumers, how do you have a public event? Everything is private-

If you have no external consumers?

01:11:23 Matt Ellis

Yeah. So if it's only your own stuff. But all of these concerns are still going to be there, aren't they? With versioning and problems and change, but it's just internal.

Absolutely. Yes, yes. Yes, it's just internal. And then you get less friction because you control the consumers, right? So then you would be able to basically say, "Hey, you have to take care of that and that and that." And then you don't have that sort of type of concern. You still have the upgrade concern, like we talked about, "Okay, now this goes to a new topic and we have in-flight messages." That problem still exists. But then maybe basically those schema versioning strategies are a little bit less important. I'm not going to say irrelevant because I don't believe that to be true. But yeah, of course, depending on the organization, the scope of the system, these decisions are going to be different. But yeah, I'm talking about large distributed systems and how those problems tend to be solved in those scenarios.

01:12:21 Matt Ellis

Yeah. I'm just thinking of some teams I've worked with and how trying to get schemas changed across multiple teams like that can still have a lot of friction.

Absolutely, yes.

01:12:32 Matt Ellis

Yes. I could probably talk a whole lot more about this, but I think we've run out of time. And I guess it's time to thank you very much for joining us. It was a brilliant session.

Thank you for having me.

01:12:42 Matt Ellis

As I say, it sparked a lot of interesting conversation in the chat.

I'd definitely go check it out.

01:12:47 Matt Ellis

Yeah, please do. Thank you very much. And-

And please reach out to me online. So everyone who still has questions, send me a message on LinkedIn. That's probably where you can get a hold of me most quickly.

01:12:57 Matt Ellis

Yeah, absolutely. Thank you very much, then, Laila. And we'll see you again sometime.

Bye-bye.

Additional resources