Pros and Cons: Single Message Stream

The Single Message Stream pattern is one I used on an event-sourcing-based system I recently worked on. The core of the pattern is that all messages of a given class (commands or events) are placed into a single stream with a fixed ordering.
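
To make the shape concrete, here is a rough sketch in TypeScript (the names are illustrative, not taken from the original system): every message, whatever its type or aggregate, carries a position in one global sequence.

```typescript
// Illustrative only: all messages share one monotonically increasing
// global sequence number, regardless of type or aggregate.
interface MessageEnvelope {
  sequenceNumber: number; // position in the single global stream
  messageType: string;    // e.g. "CustomerCreated", "OrderCreated"
  aggregateId: string;    // the entity the message relates to
  payload: unknown;       // the message body itself
}

// The stream itself is conceptually just an ordered, append-only log.
type MessageStream = MessageEnvelope[];
```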

The model for consuming messages with this pattern is conceptually quite simple. Each client maintains a bookmark, which is the sequence number of the last message it has processed. As it processes a message it updates this bookmark to the sequence number of that message, then proceeds to the next message. Generally the client will request a batch of messages so that it can be downloading later messages while processing earlier ones, but this is purely an optimisation; the bookmark is still updated on every message processed. Use of a persistent bookmark means that the client can resume processing at the point it left off if it crashes or is restarted. Filtering by entity type or aggregate ID is also possible; this is an optimisation that skips loading messages the client would not have used anyway.
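
A minimal sketch of such a consumer, reusing the MessageEnvelope shape above and assuming hypothetical stream and bookmark-store interfaces (these are not the original system's APIs):

```typescript
// Process messages one at a time, persisting the bookmark after each so a
// crash or restart resumes exactly where processing left off.
async function consume(
  stream: { readBatch(after: number, size: number): Promise<MessageEnvelope[]> },
  bookmarkStore: { load(): Promise<number>; save(seq: number): Promise<void> },
  handle: (msg: MessageEnvelope) => Promise<void>,
): Promise<void> {
  let bookmark = await bookmarkStore.load();
  while (true) {
    // Batching is purely an optimisation; ordering comes from the stream.
    const batch = await stream.readBatch(bookmark, 100);
    for (const msg of batch) {
      await handle(msg);
      bookmark = msg.sequenceNumber;
      await bookmarkStore.save(bookmark);
    }
    if (batch.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, 100)); // idle wait
    }
  }
}
```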

Because there is only a single stream of events the need to worry about out-of-order delivery is greatly reduced. There is no circumstance in which an event caused (directly or indirectly) by a previous event will be visible to any client before that previous event. For instance, if we know that an order cannot be created unless the customer has been created, then nothing that consumes customer and order events (to build read models or trigger business processes) ever needs to handle the case where an order creation arrives without a preceding related customer creation.

This has some additional consequences. It means that you cannot use any information owned by other events until you have actually seen an event containing it. In standard operation this is not an issue, but it is an important consideration when writing bulk import tools or integration points that convert external data into events. Your code must ensure this temporal coupling remains intact. This is particularly problematic where you are generating the events asynchronously, such as by sending commands. You may need to wait until you observe that the relevant events have been created before your process can continue.
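
One way such a wait can look, sketched with the same hypothetical stream interface as above (the polling helper is an assumption, not the original tooling):

```typescript
// Poll the stream until a message matching a predicate appears, so an
// import step does not proceed before the event it depends on exists.
async function waitForEvent(
  stream: { readBatch(after: number, size: number): Promise<MessageEnvelope[]> },
  fromSequence: number,
  matches: (msg: MessageEnvelope) => boolean,
  pollMs = 100,
): Promise<MessageEnvelope> {
  let cursor = fromSequence;
  while (true) {
    const batch = await stream.readBatch(cursor, 100);
    for (const msg of batch) {
      if (matches(msg)) return msg;
      cursor = msg.sequenceNumber;
    }
    if (batch.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
  }
}

// e.g. after sending a CreateCustomer command, wait for its event before
// importing that customer's orders:
// await waitForEvent(stream, 0, (m) =>
//   m.messageType === "CustomerCreated" && m.aggregateId === customerId);
```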

Because there is only a single event stream it is relatively easy to build a disaster recovery environment that is kept very closely in sync with the primary environment. A replicator that copies the event stream from the primary into the secondary is a simple thing to construct. If the system is implemented to only modify aggregates via events it is therefore relatively easy to have all the services reconstitute themselves continuously from the replicated event stream. In a failover scenario the services can be switched back to standard processing to continue from the last replicated state. With a single event stream there is no question of the system being in an inconsistent state where some message types are more complete than others.
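
Such a replicator is essentially just another consumer whose handler appends to the secondary stream; a sketch under the same assumed interfaces:

```typescript
// Copy the primary stream into the secondary, preserving the global
// ordering. The secondary's own high-water mark doubles as the bookmark.
async function replicate(
  primary: { readBatch(after: number, size: number): Promise<MessageEnvelope[]> },
  secondary: {
    append(msg: MessageEnvelope): Promise<void>;
    lastSequence(): Promise<number>;
  },
): Promise<void> {
  let bookmark = await secondary.lastSequence();
  while (true) {
    const batch = await primary.readBatch(bookmark, 100);
    for (const msg of batch) {
      await secondary.append(msg); // same order, same sequence numbers
      bookmark = msg.sequenceNumber;
    }
    if (batch.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, 100)); // idle wait
    }
  }
}
```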

Having a single message stream comes with a number of significant tradeoffs. Enforcing a single global message ordering is something that by definition cannot be distributed. Implementing read replicas is relatively easy, but writes must all go to a single instance. Additionally, if you wish to have write redundancy you must ensure that the new write master has all the messages that have been seen by any client, in the correct order. This is potentially non-trivial, particularly without significantly impacting performance. It may be easier (if less desirable) to fail over to a DR environment for these failure cases.

Although each type of business logic can run off a message stream independently, it is significantly more complex to have multiple threads of execution within a type. Command processors may in general be sharded, because commands tend to relate to specific aggregates that can be handled independently. Processes that react to events may be significantly more complex. Determining how to shard an event stream to be suitable for a process is potentially complex and risky.
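
The command case works because the aggregate ID gives a natural partition key; a sketch of the routing (the hashing scheme is illustrative):

```typescript
// Route each command by a hash of its aggregate ID so that all commands
// for a given aggregate land on one shard and stay ordered there.
function shardFor(aggregateId: string, shardCount: number): number {
  let hash = 0;
  for (const ch of aggregateId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return hash % shardCount;
}

// Event-driven processes rarely have such a clean partition key, which is
// why the same trick is much harder to apply to them.
```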

Scaling read models is also potentially problematic. Although you can scale horizontally to the extent of giving each read model its own population process, parallelising the work within any one of those processes is difficult due to interrelationships between events. Read models that are complex or updated by many events may become a significant performance bottleneck, with an upper limit on the effectiveness of throwing more hardware at the problem.

The other major tradeoff of Single Message Stream is that there is no capability for out-of-order message processing. You cannot delay processing a message until other messages are received; that is the very nature of the pattern. In addition to enforcing strict conditions on the order in which things are written to the stream, this has a major effect on error handling. When there is an error processing a message you have only two options: you may discard the message forever, or you can explode into millions of tiny pieces (possibly after retrying that same message). This means any error will stop all message processing for a particular business process.
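
Here is how those two options might look in a handler wrapper (a sketch; the policy names and retry count are assumptions). Note there is no "set aside and continue" branch, because the bookmark cannot advance past an unprocessed message.

```typescript
// Retry a failing message a few times, then either drop it or halt the
// whole process. Nothing downstream can be processed while halted.
async function handleWithPolicy(
  msg: MessageEnvelope,
  handle: (msg: MessageEnvelope) => Promise<void>,
  policy: "discard" | "halt",
  maxRetries = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await handle(msg);
      return;
    } catch (err) {
      if (attempt === maxRetries) {
        if (policy === "discard") return; // the message is gone forever
        throw err; // halt: stops all processing for this business process
      }
    }
  }
}
```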

In general commands, being requests of the system, may be discarded on certain classes of unrecoverable error. Events, however, are an immutable record of what has happened, and consumers are not allowed to argue with them. As such, any process that encounters an error processing an event generally has no option other than to stop processing until that event is correctly processed. This means there is an inherent tradeoff of robustness for correctness. When I discuss the pattern with people this is generally one of their biggest objections.

Having used this pattern, it is a valid question to ask whether I would use it again. There is no fixed answer to this. There are some things I would definitely not repeat. In particular, I would not use this pattern with commands again. Although it is convenient in that it allows you to use common infrastructure, the tradeoffs are too costly, particularly in terms of parallel processing. Commands are generally transient and there is relatively little benefit in giving them a fixed order and persisting them for long periods. A more traditional queueing system, or synchronous calls, is going to be a better choice here.

For use with events I would only consider this if some criteria are met:

  • A highly current copy of the system is required that must at all times be completely consistent.
  • It is acceptable to stop work rather than work past errors. This is not true of many domains and puts a burden on the development team to immediately react to errors. It's unlikely to be viable for systems that must be available at all times but may be acceptable for systems that must only operate during business hours, provided the tradeoff is adequately justified.
  • The system must not need to run across multiple data centres or scale by orders of magnitude. For an internal line of business system this may be a reasonable assumption, but this is less true of public facing websites.

The tradeoffs for this pattern suit problems that are, in my experience, somewhat atypical. For the problem domain I've used it with, applying it to events fitted a number of the business requirements well and the tradeoffs were acceptable. That the pattern may not have more general applicability is not relevant to whether it fit the specific problem in question. This is OK; we don't live in a "one architecture fits all problems" world.

Colin Scott
