Tapping into event-driven design can be daunting, even for experienced software engineers. It is easy to get intimidated by the complex concepts surrounding an event bus like Kafka, and to spend all our focus on understanding how to maintain such a complex distributed network. In this post I will tackle a simple*(ish)* use case and provide some hints on how to tap into an already existing event bus architecture.

Just a bit of a disclaimer: I will try to be as tech-agnostic as possible, so the hints mentioned here should be applicable to similar use cases. I will not go into any deeper concepts of message queues like Kafka; if you have a basic understanding of high-level asynchronous communication patterns using a message queue, this post should be useful. I am also not covering the good practices for designing a complete event-driven setup from scratch; I will tackle that in another post.

Let’s get started!

The Problem Statement#

Let’s say there is an e-commerce marketplace platform selling all kinds of products (e.g. Amazon, Walmart, Flipkart), and we need to provide our business customers, who sell their goods on the marketplace, with a view of the percentage of their goods that are “out of stock”.

  • An article is considered out of stock if its sellable stock is 0.
  • Sellable stock is the quota a business customer sets as the stock they would like to sell in a particular target market. For simplicity, the target market is simply a country.
  • Our use case requires an almost real-time display.
  • We are dealing with a lot of filters: the business customers have an option to filter the view based on the article type, country, etc.

Now, let’s say there is already an event-driven ecosystem in place, with a central stock system that publishes a message with the current stock value of an article whenever the stock changes.

This is a “publish-subscribe” design: consumers that want the current sellable stock values can simply subscribe to the events. A couple of consumers already subscribe to this particular “event”, and we would like to introduce a new consumer system to fulfil our use case.

Throughout this post, an event simply refers to the message in the message queue which is triggered whenever there is a change to the stock value. The event is in JSON format and contains a stock snapshot for a given article and country.
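As a concrete illustration, such an event might look like the sketch below. The field names here are hypothetical, not the actual schema of any real stock system:

```python
import json

# A hypothetical stock-change event payload; the field names are
# illustrative, not a real schema.
raw_event = """
{
  "article_id": "SKU-12345",
  "country": "DE",
  "sellable_stock": 0,
  "occurred_at": "2021-06-01T12:00:00Z"
}
"""

def parse_stock_event(payload: str) -> dict:
    """Deserialize the JSON event into a plain dict."""
    event = json.loads(payload)
    # An article is out of stock when its sellable stock is exactly 0.
    event["out_of_stock"] = event["sellable_stock"] == 0
    return event
```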

The Solution Approach#

“This is easily doable, and we will just fine-tune ZooKeeper to read 2 billion events per second, use a timeseries DB, have log compaction and, after adding the correct calculations, jump to Hyperspace”, said an excited developer.

Jokes aside, being overwhelmed is quite alright, and it is common to hear a lot of buzzwords about how the problem can be solved. The good news is: if our architecture design principles are clean, this problem can be approached efficiently, especially if we also split it based on the information flow, which is the key highlight of our solution.

From a pure algorithmic point of view, this is an aggregation problem. We are receiving a stream of data models, which we need to transform or reduce into a model which is an aggregated view of what we want, and it is this aggregated model which we would ultimately like to show to our business customers.

In terms of a high-level architecture, I target three main components, which are essentially independently deployable applications and can be scaled (almost!) independently of each other:

  • Consumer: Reads from the event bus and stores the data into the DB.
  • API: The serving layer with read capabilities only.
  • DB: Shared between the API and the consumer.

What we do is: read events from the message queue → validate and store selected events in the DB → read from the DB in the API layer → serve the result via API responses.

Basic Algorithm#

Let me jot down the high-level algorithm and explain which part is tackled by which application.

1. Read events from the event bus (consumer)

We pretty much have to “subscribe” to the event bus to continuously read any new messages that show up in the queue. In this part we have to guarantee that we read the events in the order they are published: First-In-First-Out. Most message queue systems provide this capability, though often only per partition or per key, which is enough as long as all events for the same article land in the same partition.
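To keep this post tech-agnostic, here is a minimal sketch of FIFO consumption using an in-memory stand-in for the event bus; a production consumer would use the bus's own SDK or client library instead:

```python
from collections import deque

class InMemoryEventBus:
    """Stand-in for the real event bus, used here only to illustrate
    FIFO consumption semantics."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def publish(self, event: dict) -> None:
        self._queue.append(event)

    def poll(self, max_events: int = 100) -> list:
        """Return up to max_events, in the order they were published."""
        batch = []
        while self._queue and len(batch) < max_events:
            batch.append(self._queue.popleft())
        return batch
```

A consumer loop would then repeatedly call `poll()` and hand each batch to the storage step below.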

2. Store to the Database (consumer)

Transform the list of events from step #1 into a list that is acceptable to store in the DB. We can optimise this part with batch requests to the DB for efficiency. We also need to validate events with a DB lookup first; this is expensive, but given the mildly transactional nature of our use case, we can live with the tradeoff.
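A sketch of the validate-then-batch-write idea, with a plain dict standing in for the DB (keyed by article and country, storing the latest snapshot timestamp); the real implementation would issue a batched upsert against whatever database is chosen:

```python
def filter_new_events(events: list, db: dict) -> list:
    """Validation step: keep only events newer than what is stored.
    `db` maps (article_id, country) -> latest snapshot timestamp."""
    fresh = []
    for event in events:
        key = (event["article_id"], event["country"])
        stored_ts = db.get(key)
        # ISO-8601 timestamps in the same format compare lexicographically.
        if stored_ts is None or event["occurred_at"] > stored_ts:
            fresh.append(event)
    return fresh

def store_batch(events: list, db: dict) -> int:
    """Write all validated events in one batch; returns rows written.
    Re-running with the same events writes nothing (idempotent)."""
    fresh = filter_new_events(events, db)
    for event in fresh:
        db[(event["article_id"], event["country"])] = event["occurred_at"]
    return len(fresh)
```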

3. Commit the offsets (consumer)

If using a log-based queue like Kafka, we also need to store the offsets, which record up to what position in the queue we have read the events, in a datastore. This is important for resilience: after restarts or deployments we can resume from where the consumer last finished processing. This step should be idempotent, since the same events may be re-read after a failure.
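A minimal sketch of idempotent offset tracking, again with an in-memory stand-in for the offset datastore (real systems would persist this in the broker or a DB):

```python
class OffsetStore:
    """Tracks, per partition, the last processed position. Committing
    is idempotent: re-committing an older or equal offset is a no-op."""

    def __init__(self) -> None:
        self._offsets: dict = {}

    def commit(self, partition: int, offset: int) -> None:
        current = self._offsets.get(partition, -1)
        if offset > current:
            self._offsets[partition] = offset

    def resume_from(self, partition: int) -> int:
        """Offset of the next unprocessed event after a restart."""
        return self._offsets.get(partition, -1) + 1
```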

Zalando developed an open-source event bus solution, Nakadi, which provides these capabilities out of the box. It exposes RESTful APIs for publishers and consumers and uses Kafka under the hood.

4. Read and serve from the database (API)

The serving layer, a separate API application, reads from the database and serves the requests. For our use case, to keep it simple, we will conduct the map-reduce actions on the fly every time we read the data. The trade-off is increased latency for large data sets, but the advantage is that we do not have to pre-aggregate multiple possibilities for the various filter combinations (country, article type, etc.).

Since the writes are the bottleneck, an optimised RDBMS solution is a good starting point. We can also scale reads horizontally with read replicas, or consider a document store if it better suits our arbitrary query needs.

The Information Flow#

Now that we have broken our algorithm into 4 high-level parts, let us look at the algorithm from an information flow point of view. This lens makes it easier to abstract the architecture with good separation of concerns, and makes it extensible for future requirements.

i. List[SellableStockEvent] → List[SellableStockC]

Transform the list of events into a friendly list of models (here referred to as SellableStockC, where the C suffix stands for consumer). The ordering of the list matters, and our systems should be designed to take that into account; this also helps us design parallelism without race condition risks. In this part we also throw away events we do not need: ones already stored in the DB, as indicated by a latest_snapshot_timestamp field. The resulting List[SellableStockC] will always be smaller than or equal to the original List[SellableStockEvent].
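This contract can be sketched as a small transform; the dataclass fields and event keys are assumptions mirroring the hypothetical event shape from earlier:

```python
from dataclasses import dataclass

@dataclass
class SellableStockC:
    """Consumer-side model, decoupled from the raw event schema."""
    article_id: str
    country: str
    sellable_stock: int
    latest_snapshot_timestamp: str

def to_consumer_models(events: list, known_timestamps: dict) -> list:
    """i. List[SellableStockEvent] -> List[SellableStockC], preserving
    order and dropping events whose snapshot is not newer than what the
    DB already holds ((article_id, country) -> timestamp)."""
    models = []
    for e in events:
        key = (e["article_id"], e["country"])
        if known_timestamps.get(key, "") >= e["occurred_at"]:
            continue  # already stored; throw the event away
        models.append(SellableStockC(
            article_id=e["article_id"],
            country=e["country"],
            sellable_stock=e["sellable_stock"],
            latest_snapshot_timestamp=e["occurred_at"],
        ))
    return models
```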

ii. List[SellableStockC] → Set[SellableStockDB]

Transform the useful list of models into a set to store in the DB. Depending on technology choice, there might be another information flow step to adhere to your database’s data types. The advantage: you can change databases in the future but still have similar interfaces.
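A sketch of this step, assuming the model shapes from the previous one; making the DB model immutable lets it live in a set, and duplicate snapshots collapse naturally:

```python
from dataclasses import dataclass

@dataclass
class SellableStockC:
    article_id: str
    country: str
    sellable_stock: int

@dataclass(frozen=True)
class SellableStockDB:
    """Frozen (hashable) so instances can be set members."""
    article_id: str
    country: str
    sellable_stock: int

def to_db_set(models: list) -> set:
    """ii. List[SellableStockC] -> Set[SellableStockDB]."""
    return {
        SellableStockDB(m.article_id, m.country, m.sellable_stock)
        for m in models
    }
```

Swapping databases later would mean changing only `SellableStockDB` and this mapping, not the consumer model.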

iii. Set[SellableStockDB] → OutOfStockStatus (API layer)

Data is read and transformed into an aggregated model which provides the result — OutOfStockStatus — essentially a percentage of the articles within the filter which are out of stock.

iv. OutOfStockStatus → OutOfStockStatusResponse

Separate the calculated OutOfStockStatus from the API response model, OutOfStockStatusResponse. This helps avoid accidental changes to the API response that would break integrations, and enables good backwards compatibility without needing redundant contract tests.
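A sketch of keeping the internal aggregate and the wire format apart; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class OutOfStockStatus:
    """Internal aggregate computed by the API layer."""
    out_of_stock_count: int
    total_count: int

@dataclass
class OutOfStockStatusResponse:
    """Wire format returned to clients. Changes here are deliberate
    API changes, never side effects of internal refactoring."""
    out_of_stock_percentage: float

def to_response(status: OutOfStockStatus) -> OutOfStockStatusResponse:
    """iv. OutOfStockStatus -> OutOfStockStatusResponse."""
    pct = (100.0 * status.out_of_stock_count / status.total_count
           if status.total_count else 0.0)
    return OutOfStockStatusResponse(out_of_stock_percentage=round(pct, 2))
```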


Here is a combination of both subparts into one final visualisation:

Consumer layer:
  1. Read events from the event bus
     i.  List[SellableStockEvent] => List[SellableStockC]
  2. Store events in the DB
     ii. List[SellableStockC] => Set[SellableStockDB]
  3. Commit the offsets

API layer:
  4. Read and serve from the database
     iii. Set[SellableStockDB] => OutOfStockStatus
     iv.  OutOfStockStatus => OutOfStockStatusResponse

If you design these information contracts, it becomes easier to start development, since different parts can be worked on without waiting for the actual event bus consumption to be finished. This is quite essential: the more you work in event-driven design, the less you will be able to observe changes with the naked eye, and good software architecture within your application is key to a sustainable target picture.

This design is also easy to extend later. For example, if we want to pre-compute aggregations instead of calculating them on the fly, we can add another step between ii and iii without touching the rest of the design.

Common Traps to Avoid#

  • Do not spend all your effort figuring out how to connect to the event bus, coupling your entire design to it. This should be only a subpart of the overall research/spike. It helps to improve the efficiency of your algorithm independently of the available tools; this gives you more control over your systems.

  • Do not leave the migration of historical data for the end. The event bus will probably not store all historical data and will purge it on a regular cadence, and the volume of historical events can be vast.

  • Be wary of your parallelism strategy. The essence of scalability of event-driven design is having some guarantees of order. Do not parallelize in a way that creates race conditions (e.g. two threads or two consumer instances reading events for the same article concurrently).

  • Do not jump into technology choices in haste. You may have a lot of dependency on the event bus and it is always simpler when you have a mature library or SDK. Similarly, do not jump straight to a time-series database — keep in mind that the events are snapshots and you have no idea what the previous value was unless you persist it.

  • Coupling the API and consumer into one application can be dangerous. The maximum number of consumer instances depends on the total parallelism allowed by your event bus (e.g. with 10 Kafka partitions you can have at most 10 effective consumer instances). If the API lives within the consumer, it becomes hard to scale beyond 10, and failure modes become more complex.

  • Not catering for idempotency can be dangerous. Given the nature of this design, you will need to re-run the storage of events into the DB in many cases. Design for this from the start.

  • Always test your throughput before going live. Know how many events you can consume per minute and per hour, and with how much parallelism.

  • Do not skip writing good tests. In many cases the technical debt from not writing tests costs more than a missed deadline. Follow the information flow and write good, testable unit tests for each well-defined contract. Relying solely on functional or manual tests makes debugging time-consuming.
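The parallelism point above can be made concrete with deterministic key-based routing: send all events for one article to the same worker so per-article ordering survives parallel consumption. A sketch (the routing function is illustrative; Kafka achieves the same with key-based partitioning):

```python
import zlib

def worker_for(article_id: str, worker_count: int) -> int:
    """Deterministically route every event for a given article to the
    same worker index. crc32 is stable across runs and processes,
    unlike Python's built-in hash()."""
    return zlib.crc32(article_id.encode("utf-8")) % worker_count
```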

Conclusion#

I hope this case study helped you dive into an actual event-driven design implementation where an event bus already exists and a new system needs to tap into the infrastructure. I am very passionate about distributed systems and will be writing more articles around this topic to make it easier for developers and managers alike to approach them.


Originally published on Medium.