Caching entity state vs rebuilding with event sourcing

When using event sourcing, your event stream is your source of truth, but a stream of events is not a particularly convenient representation to execute business logic against. What is generally required is a projection into some kind of object that represents the current state of the entity.
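At its core such a projection is just a fold over the events. As a minimal sketch in Scala, with a hypothetical `AccountEvent`/`Account` domain invented purely for illustration (the later sketches in this post reuse these types):

```scala
// Hypothetical domain for illustration: a bank account entity.
sealed trait AccountEvent
case class Deposited(amount: BigDecimal) extends AccountEvent
case class Withdrawn(amount: BigDecimal) extends AccountEvent

// The projection: an object representing the current entity state.
case class Account(balance: BigDecimal) {
  // Applying one event yields the next state; business logic is
  // executed against Account, not against the raw event stream.
  def applyEvent(event: AccountEvent): Account = event match {
    case Deposited(amount) => copy(balance = balance + amount)
    case Withdrawn(amount) => copy(balance = balance - amount)
  }
}
```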

There are several implementation choices when managing the projection from event stream to entity instance, and the tradeoffs these options exhibit will have a non-trivial impact on the characteristics you require from the rest of your system.

At one extreme, a system can rebuild the entity instance from the events every time it is required. This requires very little memory and storage because the only thing kept for any length of time is the events (which with event sourcing will be kept anyway). Further, if you want to change the shape of the projection to better serve the business logic, this is inexpensive as there is nothing persistent to be altered: as soon as the updated projection code is deployed it will be utilised by all entities.
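Rebuilding on every use amounts to replaying that fold from an empty state each time. A sketch, assuming a hypothetical `EventStore` interface:

```scala
trait EventStore {
  // All events for one entity, oldest first.
  def eventsFor(entityId: String): Seq[AccountEvent]
}

// Replay every event from scratch each time the entity is needed:
// minimal storage, trivially cheap to reshape, costly for long streams.
def rebuild(store: EventStore, entityId: String): Account =
  store.eventsFor(entityId).foldLeft(Account(BigDecimal(0)))(_.applyEvent(_))
```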

The downside is that every use of an entity requires fetching its events and projecting them into an instance. If the entity has a large number of events associated with it this is potentially a very significant overhead, and in some cases prohibitively expensive: the system would grind to a halt if tens of thousands of events must be projected before any business logic can be run.

It is also necessary to manage concurrency such that multiple operations against a single entity do not produce conflicting state. There are various ways of achieving this, a discussion of which is outside the scope of this post. It is however something you must deal with when choosing this approach.

At the other extreme there is storing the projection in some kind of datastore. In many respects this reverts to a more traditional approach of executing business logic against entities persisted to and loaded from a datastore with the addition of also writing events. As an industry we have a great deal of knowledge of building systems in this style. Involving a datastore also allows standard concurrency solutions to be applied.
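Concretely, the command handler loads the current projection, applies the new event, and writes both the event and the updated projection back. A sketch with hypothetical `EventLog` and `ProjectionStore` interfaces (how the two writes are kept consistent, via a transaction, an outbox or similar, is deliberately glossed over here):

```scala
trait EventLog {
  def append(entityId: String, event: AccountEvent): Unit
}

trait ProjectionStore {
  def load(entityId: String): Option[Account]
  def save(entityId: String, account: Account): Unit
}

// Traditional load-apply-save against a datastore, with the event
// also written. Keeping the two writes consistent is where the real
// production care goes.
def deposit(events: EventLog, projections: ProjectionStore,
            entityId: String, amount: BigDecimal): Unit = {
  val current = projections.load(entityId).getOrElse(Account(BigDecimal(0)))
  val event = Deposited(amount)
  events.append(entityId, event)
  projections.save(entityId, current.applyEvent(event))
}
```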

The primary downside of this approach is resistance to change. If you want to reshape the projection of the events you need to restructure the persistent store, and experience with traditional applications suggests that such restructuring is unlikely to happen. The projection is more likely to be modified the way the datastore for an existing application generally is: changes grafted on wherever is most expedient at the moment, with little regard for coherence or long-term maintainability.

If the changes to the projection require data that is only available in the events then you will run into another downside: rebuilding all the projections from the events is slow. You could potentially be looking at hours or days to rebuild. This rebuild could be performed while continuing to run the old version until the new projections are up to date, but this massively increases the cost, complexity and latency of projection changes.
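The rebuild itself is conceptually simple, which is part of what makes its operational cost easy to underestimate. A sketch reusing the earlier hypothetical interfaces (`allEntityIds` is an assumed way of enumerating every entity):

```scala
// Offline rebuild of every persisted projection after the projection
// shape changes; cheap to write, potentially very slow to run.
def rebuildAll(store: EventStore, projections: ProjectionStore,
               allEntityIds: Iterator[String]): Unit =
  allEntityIds.foreach(id => projections.save(id, rebuild(store, id)))
```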

If you are storing projections you are also making your backup and disaster recovery more complicated. The projection must be consistent with the events or the system may behave inappropriately, and keeping backups of multiple (potentially different) datastores consistent at all times is difficult, bordering on impossible. This implies that you must regenerate the projections from the events, but as discussed this is potentially very slow; adding days to your disaster recovery time is unlikely to be acceptable. I have addressed this previously by replicating the events into a DR environment and performing a realtime rebuild from that replication. This ensures that the state is consistent for DR, but requires a near-full-sized active DR environment at all times: a non-trivial ongoing cost you could potentially reduce significantly with an alternate approach.

In between these two options is a third alternative, which addresses both the per-use rebuild cost of the first approach and the resistance to change of the second. It works by maintaining a projection of the events, but doing so only in memory. Once the initial hit of loading the projection has been taken, subsequent operations are extremely fast. These projections are transient, however, living only for the life of the process, so to change them it is only necessary to deploy the new version; the new projection code is then applied immediately.
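A sketch of the idea, again building on the hypothetical interfaces above (note the cache itself does nothing to serialise concurrent commands for a single entity; that is what the actor model below gives you):

```scala
import scala.collection.concurrent.TrieMap

// Transient, per-process cache of projections: the first access to an
// entity pays the full rebuild cost, subsequent accesses are in-memory.
class ProjectionCache(store: EventStore, events: EventLog) {
  private val cache = TrieMap.empty[String, Account]

  def get(entityId: String): Account =
    cache.getOrElseUpdate(entityId, rebuild(store, entityId))

  // Record a new event and keep the cached state in step with it.
  def record(entityId: String, event: AccountEvent): Account = {
    events.append(entityId, event)
    val next = get(entityId).applyEvent(event)
    cache.update(entityId, next)
    next
  }
}
```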

This approach fits very well with an actor model where each entity gets an actor. Systems such as Akka (JVM or .NET) come with persistence libraries that will handle a lot of the event-related heavy lifting for you.
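As a sketch of what that looks like with the classic Akka Persistence API (Akka Typed offers an `EventSourcedBehavior` equivalent; the `Deposit` command is hypothetical, and the state and event types are the ones from earlier):

```scala
import akka.persistence.{PersistentActor, SnapshotOffer}

// Hypothetical command for the sketch.
case class Deposit(amount: BigDecimal)

// One entity instance as a persistent actor: the in-memory state is
// the cached projection, and the library handles event storage.
class AccountActor(entityId: String) extends PersistentActor {
  override def persistenceId: String = s"account-$entityId"

  private var state = Account(BigDecimal(0))

  // Recovery: the library replays snapshots and events into state.
  override def receiveRecover: Receive = {
    case SnapshotOffer(_, snapshot: Account) => state = snapshot
    case event: AccountEvent                 => state = state.applyEvent(event)
  }

  // Commands: persist the event first, then update in-memory state.
  // The actor mailbox serialises commands, giving per-entity concurrency
  // control for free.
  override def receiveCommand: Receive = {
    case Deposit(amount) =>
      persist(Deposited(amount)) { event =>
        state = state.applyEvent(event)
        sender() ! state
      }
  }
}
```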

The primary downside of this approach is that it consumes a lot of memory. However, the amount of RAM systems have available these days is enough to handle many datasets in their entirety. Where this is not true, strategies to unload infrequently used instances can be employed at the cost of some additional implementation complexity.

As discussed above, rebuilding a projection from all of an entity's events may be prohibitively expensive if done whenever the entity is required. If the projection is cached, somewhat more rebuild work is tolerable because its cost is amortised over many operations (with a large latency on the first), but there still remain practical limits. To address this we can use snapshots.

A snapshot is essentially a projection of the events that is periodically stored. When rebuilding the projection you start from the last snapshot rather than the beginning of the events for the entity. This significantly reduces the amount of work required to create the projection.
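A sketch of a snapshot-aware rebuild, assuming hypothetical stores that track the event sequence number each snapshot covers (Akka's persistence libraries provide `saveSnapshot` and `SnapshotOffer` to do the equivalent for you):

```scala
trait SnapshotStore {
  // The most recent snapshot and the event sequence number it covers.
  def latest(entityId: String): Option[(Account, Long)]
  def save(entityId: String, snapshot: Account, seqNr: Long): Unit
}

trait SequencedEventStore {
  // Events strictly after the given sequence number, oldest first.
  def eventsAfter(entityId: String, seqNr: Long): Seq[AccountEvent]
}

// Start from the last snapshot (if any) and replay only the events
// written since, rather than the entire stream from the beginning.
def rebuildFromSnapshot(snapshots: SnapshotStore,
                        events: SequencedEventStore,
                        entityId: String): Account = {
  val (start, seqNr) =
    snapshots.latest(entityId).getOrElse((Account(BigDecimal(0)), 0L))
  events.eventsAfter(entityId, seqNr).foldLeft(start)(_.applyEvent(_))
}
```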

These options are not mutually exclusive. It is perfectly valid to use in-memory caching for some things and persisted projections for others. Persisted projections allow much more consistent access times across large numbers of entities, which may be preferable to taking the hit of reconstituting a projection that is not currently in memory. In-memory caching with unloading may also require additional undesirable complexity; for instance, in a system where actors can be unloaded, it may be necessary to forward all commands for an entity type via a single actor in order to ensure that the relevant instance exists. Some forms of scalability may also be easier when you can do concurrency checks against a datastore (such as with traditional SQL-based optimistic concurrency control), as sketched below.
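For that last point, the usual shape is a version read together with the projection and checked on write. A sketch with a hypothetical `VersionedProjectionStore` (the event append is elided here to keep the conflict handling visible):

```scala
trait VersionedProjectionStore {
  // The projection together with the version it was read at.
  def load(entityId: String): Option[(Account, Long)]
  // Write succeeds only if the stored version still equals `expected`,
  // e.g. UPDATE ... SET version = version + 1 WHERE version = ? in SQL.
  def saveIfVersion(entityId: String, next: Account, expected: Long): Boolean
}

// Classic optimistic concurrency: retry on conflict by reloading and
// reapplying if another writer got in first.
@annotation.tailrec
def depositWithOcc(store: VersionedProjectionStore,
                   entityId: String, amount: BigDecimal): Account = {
  val (current, version) =
    store.load(entityId).getOrElse((Account(BigDecimal(0)), 0L))
  val next = current.applyEvent(Deposited(amount))
  if (store.saveIfVersion(entityId, next, version)) next
  else depositWithOcc(store, entityId, amount)
}
```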

My recommendation would be to first consider in-memory caching of projections with snapshots to see if that will suit your requirements. If your dataset is too large consider implementing unloading of instances in preference to storing them in a datastore, provided that access patterns to your data suit this approach. Before considering building the projection on every usage consider whether the viability of that solution indicates that event sourcing may not be warranted for that part of your system.

Colin Scott
