Let's talk about Kafka
We get a lot of questions about Kafka. Is it good? Does it live up to the hype? And, most frequently, when are we going to support Kafka in NServiceBus?
But to fully answer these questions, it’s essential to understand what Kafka is and, more importantly, what it isn’t, and then think about the kinds of problems that Kafka solves. So, let’s dive into the (heavily footnoted) details…
🔗What is Kafka?
Apache Kafka is a partitioned log, similar in architecture to Azure Event Hubs or Amazon Kinesis. A partitioned log works precisely as its name suggests: it’s like a bunch of separate log files (partitions) constantly writing whatever data you throw at them. Those bits of data are usually called events, 1 and one of these append-only log files is called an event stream.
Events are written to an event stream in the order in which they are received, and read in the same order as well. 2 3 The events then stay in that log until a defined retention period is reached; nothing other than the passage of time can remove an event from the log once it is written.
When you want to receive an event from Kafka, you need to decide how you will read from the event stream. You are responsible (or, more accurately, the client library you are using is responsible) for determining what partition you are reading from and keeping track of your position within that partition. So you don’t just get the next message waiting to be processed. Instead, you have to decide where you want to start reading from (called a “cursor”) and then keep that cursor up-to-date yourself.
If another reader wants to receive an event, they can’t just pick up where you left off. Instead, they have to manage their own cursor because Kafka doesn’t manage cursors for you. It just provides you with repeatable access to the stream of events.
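To make that concrete, here’s a minimal sketch of what cursor management looks like with the Confluent.Kafka .NET client. The broker address, topic name, partition, and starting offset are all placeholder values, and auto-commit is disabled so the example has to record its own position explicitly.

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "example-reader",
    EnableAutoCommit = false // we decide when (and whether) to record our position
};

using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();

// Start reading partition 0 of the topic from a position we choose ourselves.
var startingPosition = new TopicPartitionOffset("example-topic", new Partition(0), new Offset(0));
consumer.Assign(startingPosition);

while (true)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result == null)
    {
        continue; // nothing new in the partition yet
    }

    Console.WriteLine($"Offset {result.Offset}: {result.Message.Value}");

    // Our "cursor" is just the next offset. Here we store it in Kafka's
    // consumer-group offsets, but we could persist it anywhere we like.
    consumer.Commit(result);
}
```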
While Kafka can deal with high traffic volumes, the right partition strategy is a key factor and will have a major impact on the scalability and elasticity of the whole system. Choosing the right number of partitions for event streams is a bit of an art and can be difficult to change, especially on the fly.
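For illustration, this is roughly what pinning down that decision looks like with the Confluent.Kafka admin client; the topic name, partition count, and replication factor are arbitrary examples.

```csharp
using Confluent.Kafka;
using Confluent.Kafka.Admin;

using var admin = new AdminClientBuilder(
    new AdminClientConfig { BootstrapServers = "localhost:9092" }).Build();

await admin.CreateTopicsAsync(new[]
{
    new TopicSpecification
    {
        Name = "stock-prices",
        NumPartitions = 12,     // also the upper bound on parallel consumers in a group
        ReplicationFactor = 3   // copies of each partition for fault tolerance
    }
});
```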
But what happens if something fails while processing an event in Kafka? Since the log is still all there, 4 you can just re-read it from a specific offset or decide to skip over the offset of the troublesome event altogether.
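Continuing the earlier consumer sketch, moving the cursor is just a seek; the offsets here are hypothetical.

```csharp
// After a processing failure, move our own cursor backwards to reprocess,
// or forwards to skip the bad event. Offset values are illustrative.
var troublesomeOffset = new TopicPartitionOffset("example-topic", new Partition(0), new Offset(42));

// Re-read: jump back to the failed event and try again.
consumer.Seek(troublesomeOffset);

// Or skip it: resume from the next offset instead.
consumer.Seek(new TopicPartitionOffset("example-topic", new Partition(0), new Offset(43)));
```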
So now we know more about Kafka, but the big question remains: is it a message queue?
🔗What are message queues?
A message queue or message broker is a highly optimized database for processing messages.
Like Kafka, a message queue stores events (but here, let’s call them messages) in the same order in which they are received. But that’s where the similarities end.
The goal of a message queue is to be empty. Therefore, it does not have a retention period; it only hangs on to a message until a client confirms it has been successfully processed, and then that message is deleted from the broker.
Message queue clients are not responsible for managing a cursor; that is managed centrally by the broker. Therefore, the only message you can receive is the one that is next in the queue. While you process that message, the broker prevents any other consumer from getting access to it unless you report that you were unable to process it successfully 5 or a timeout expires. When that happens, the broker assigns the message to someone else.
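For comparison, here’s a rough sketch of that receive/lock/acknowledge cycle using the Azure.Messaging.ServiceBus SDK; the connection string and queue name are placeholders.

```csharp
using System;
using Azure.Messaging.ServiceBus;

// There is no cursor to manage: the broker hands us the next message
// and locks it while we work on it.
await using var client = new ServiceBusClient("<connection-string>");
var receiver = client.CreateReceiver("orders");

var message = await receiver.ReceiveMessageAsync();
if (message != null)
{
    try
    {
        // ... process the message ...

        // Tell the broker we succeeded; the message is deleted from the queue.
        await receiver.CompleteMessageAsync(message);
    }
    catch (Exception)
    {
        // Tell the broker we failed; the lock is released and another
        // (or the same) consumer can try again.
        await receiver.AbandonMessageAsync(message);
    }
}
```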
This behavior leads to the competing consumers pattern, where multiple clients can cooperate to process messages more quickly, something that Kafka and other event streams are not designed to do. Unlike Kafka, consumers can be added and removed at any point in time without any impact on the topology of the queuing system.
The error handling pattern on message queues is also different. With a queue, you only get access to the next message in the queue, but sometimes, that message can’t be successfully processed. For example, it may not be possible to deserialize its contents. When that happens, that message effectively blocks the processing of subsequent messages. Messages which cannot be successfully processed are called poison messages. Systems that rely on a message queue have a process for retrying messages and eventually will forward poison messages to an error queue or dead-letter queue for investigation. 6 Kafka can be made to emulate some of these error-handling patterns, but this will require more work than message queues, where it’s the default behavior. 7
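As a rough illustration of how little ceremony this takes on the queue side, this is approximately how retries and the error queue are configured in an NServiceBus endpoint (the endpoint and queue names are illustrative):

```csharp
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");

// Poison messages end up here for later investigation.
endpointConfiguration.SendFailedMessagesTo("error");

// Retry a few times immediately, then a few more times with increasing delays,
// before giving up and moving the message to the error queue.
var recoverability = endpointConfiguration.Recoverability();
recoverability.Immediate(immediate => immediate.NumberOfRetries(3));
recoverability.Delayed(delayed => delayed.NumberOfRetries(2));
```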
For an even deeper dive into message queues, including how you’re using multiple queues right now without even realizing it, check out this video by Clemens Vasters, Principal Architect for the Microsoft Azure Messaging platform:
So as we can see, the only thing that partitioned logs (like Kafka) and message queues (like RabbitMQ, Azure Service Bus, and Amazon SQS) have in common is that they help you process messages. But, after that, the differences couldn’t be more stark.
🔗So which is better?
Now that we’ve established that Kafka is not a message queue, 8 the main question that remains is which is better?
The answer, of course, is neither. There are good reasons to use both, even at the same time.
🔗When to use Kafka
Kafka is great for situations where you need to ingest or process large amounts of data, and you want to be able to read that data repeatedly, typically from different logical consumers. This is very common in systems for telemetry and data distribution.
Consider using Kafka when “events” represent the state of something at a specific time, but any individual event doesn’t have much business meaning by itself. In these cases, you need a stream of these events to analyze changes and transitions in state to derive any business meaning.
🔗When to use a message queue
Message queues excel when messages must be successfully processed once and only once, and losing data is not an option. These systems depend on reliable state transition management for business processes.
When using a message queue, messages represent information about a specific business process step. Therefore, every message in a queue has business value by itself; it doesn’t necessarily need to be analyzed in relation to other messages.
Because every single message has significance, messages have to be processed independently, and processing can be scaled out using the competing consumers pattern. After a given message has been successfully processed, it’s no longer available to any consumers.
🔗When to use Kafka AND message queues
Imagine a system that monitors changes to stock prices and alerts users when specific changes occur.
Consuming the firehose of real-time stock price data is a perfect job for Kafka, which excels at handling that amount of data and storing it for further processing. Each point-in-time stock price value isn’t that useful to us; it’s just the changes we’re interested in.
There may be many different consumers of this data, and different consumers may be interested in different stocks. Depending on the system requirements, the streams may be set up so that each stock’s data lives on its own partition, and any reader only looks at the values for one stock symbol at a time. Or, multiple stocks may share the same partition, and readers simply ignore data they don’t care about.
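As a sketch of the first option, publishing each price with the stock symbol as the message key keeps all of a symbol’s events on the same partition (the broker address, topic name, and payload format are made up):

```csharp
using Confluent.Kafka;

var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };
using var producer = new ProducerBuilder<string, string>(producerConfig).Build();

await producer.ProduceAsync("stock-prices", new Message<string, string>
{
    Key = "MSFT", // the partition is chosen by hashing the key
    Value = "{\"symbol\":\"MSFT\",\"price\":123.45,\"timestamp\":\"2024-01-01T12:00:00Z\"}"
});
```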
However, even for one stock symbol, we may have many readers looking at the same event stream but interested in different things. For example, one reader may be interested only in sudden spikes or drops in a stock price, while another may be monitoring for trends over an extended period.
It ultimately doesn’t matter because each reader of a partitioned log maintains its own cursor and can read the same stream as often as it wants.
Once one of these Kafka stream readers detects an important business event, such as a sudden increase in stock price, that’s the time to use a message queue. Now we publish a StockPriceSpikeDetected event using a message queue, so that message handlers can execute business processes for that event. These message handlers might be making stock trades, updating data in a database, emailing fund managers, sending push notifications to mobile applications, or whatever else needs to be done.
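As a sketch, one of those handlers might look like this in NServiceBus; the event’s properties and the handler’s name are invented for illustration:

```csharp
using System.Threading.Tasks;
using NServiceBus;

public class StockPriceSpikeDetected : IEvent
{
    public string Symbol { get; set; }
    public decimal PercentageChange { get; set; }
}

public class AlertFundManagersHandler : IHandleMessages<StockPriceSpikeDetected>
{
    public Task Handle(StockPriceSpikeDetected message, IMessageHandlerContext context)
    {
        // Email fund managers, send push notifications, place trades...
        // If this throws, recoverability (retries + the error queue) takes over.
        return Task.CompletedTask;
    }
}
```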
It can be helpful to think of Kafka events more as raw data. Only when trends in the data become relevant to the business does something become a business event, and that’s the point at which you should be using a message queue. In fact, we’ve seen customers use Kafka in this way. They host code in Azure Functions using a KafkaTrigger to monitor event streams, then raise business events in Azure Service Bus (using either a send-only NServiceBus endpoint or native sends) which are processed by NServiceBus endpoints. That’s a more appropriate use case for Kafka, and it works great.
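Here’s a rough sketch of the send-only endpoint half of that setup, reusing the invented event from the earlier handler sketch; the endpoint name is arbitrary and transport configuration is omitted:

```csharp
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("StockSpikeDetector");
endpointConfiguration.SendOnly(); // this endpoint publishes but never processes messages

// Configure your transport of choice here (for example, Azure Service Bus).

var endpoint = await Endpoint.Start(endpointConfiguration);

// The Kafka stream reader calls this once it detects a spike in the raw data.
await endpoint.Publish(new StockPriceSpikeDetected
{
    Symbol = "MSFT",
    PercentageChange = 12.5m
});
```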
🔗And what about NServiceBus?
Here at Particular Software, we believe in using the right tool for the job. So while it is possible to employ complex workarounds to make Kafka kinda-sorta behave somewhat like a queue, we think that if you have a situation where you should use a message queue, you should use a message queue.
Since NServiceBus is a communication framework over message queues, we don’t currently have any plans for a Kafka-based message transport similar to our message transports for RabbitMQ, Azure Service Bus, and Amazon SQS. Unfortunately, when implementing a message transport, the devil is in the details: many of the features our customers love most about NServiceBus are much harder to achieve with Kafka due to seemingly minor differences between message queues and event stream processing infrastructure.
However, we are investigating other communication abstractions that might suit Kafka (and other partitioned logs) better. And that’s where you come in.
If you’re interested in using Kafka (or another partitioned log) we want to hear from you. Drop us a line and tell us what you want to do. We’re far more interested in building the right thing than getting something out there to check a Kafka box for the folks in the marketing department.
So let us know what you’re up to—we’d love to chat.
1. An overloaded term if ever there was one, but that's a subject for another post. Suffice it to say, this does not mean the same thing as an NServiceBus event.
2. This is a bit of an oversimplification. Ordering in Kafka is only guaranteed per partition, and the number of partitions heavily influences how your topic can scale with the load.
3. While this may sound like a small detail, it's one of the most important aspects that make technologies like Kafka so popular for data distribution. Guaranteed ordering is one of the fundamental requirements to ensure correct data replication.
4. That is, assuming that the retention period or maximum size has not been reached, at which point Kafka will start overwriting data. On the other hand, queues will not apply retention settings by default, and if your queue gets full, it will reject further writes until you consume messages from it to free up space. That may sound bad, but it ensures you don't overwrite critical business data.
5. A NACK, or negative acknowledgment.
6. See how NServiceBus retries message processing to determine if a message is a poison message before forwarding it to an error queue in I caught an exception. Now what?
7. See Error Handling Patterns for Apache Kafka Applications and Kafka Connect Deep Dive – Error Handling and Dead Letter Queues for more information.
8. …and the folks at Kafka will back us up on this one.