So You Want To Build An Event-Driven System?

Speaker: James Eastham, Senior Cloud Architect, AWS
Source: NDC London 2024 - So You Want To Build An Event Driven System?

Core Thesis

This talk is a crucial reminder that Event-Driven Architecture (EDA) is a socio-technical pattern, not a silver bullet. It's a communication pattern first, and a technical one second. If we adopt it without understanding the immense discipline it requires around language, contracts, and consistency, we risk building systems that are more complex and brittle than the monoliths we sought to escape. As the speaker candidly admits, he has gotten it "terribly, terribly wrong" in the past. We must learn from those mistakes.

A system's success depends on a shared, unambiguous language. For EDA, that language is the events themselves.

Core Tenets & The Operational Pitfalls They Create

1. The Promise: Decoupling vs. The Reality: Temporal Coupling

The primary motivation for EDA is to break the tight, synchronous coupling that creates distributed monoliths. A failure in a non-critical service (like a Loyalty Point Service) should not be able to bring down the core Order Processing Service.

In a synchronous call chain, every service is waiting on a response, so a failure in one service cascades across the system. This temporal coupling is what EDA aims to eliminate.

Takeaway: Removing temporal coupling is the only major benefit of EDA. If your system isn't suffering from this specific problem, the costs of EDA will likely outweigh the benefits.
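The difference between the two call styles can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the `LoyaltyServiceDown` exception, and the in-memory `event_queue` (standing in for a durable broker) are all assumptions.

```python
from collections import deque


class LoyaltyServiceDown(Exception):
    """Raised by the hypothetical loyalty service when it is unavailable."""


def award_points_sync(order_id: str) -> None:
    # Simulates an outage in the non-critical loyalty service.
    raise LoyaltyServiceDown(f"cannot award points for order {order_id}")


def place_order_sync(order_id: str) -> str:
    # The order service waits on the loyalty call, so the outage propagates.
    award_points_sync(order_id)
    return "confirmed"


# A deque stands in for a durable broker topic or queue (assumption).
event_queue: deque = deque()


def place_order_async(order_id: str) -> str:
    # The order service only records that something happened and moves on.
    # The loyalty service can consume the event whenever it is healthy again.
    event_queue.append({"eventType": "orderPlaced", "orderId": order_id})
    return "confirmed"
```

In the synchronous version the order fails whenever the loyalty service does. In the asynchronous version the order is confirmed regardless, at the cost of the points being awarded at some later time.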

2. The Price of Decoupling: Eventual Consistency

To break temporal coupling, you must embrace asynchronicity. This means you are explicitly trading strong consistency for availability.

The Fundamental Trade-Off

You cannot have both strong consistency and the availability that asynchronous decoupling buys. If your business process requires immediate, strong consistency (e.g., a bank transfer), EDA is the wrong tool for that job. Eventual consistency means the system will eventually converge on the correct state, but with no guarantee of when. This is a difficult concept for many business stakeholders to accept and for engineers to debug.

3. Fat vs. Sparse Events

This is the most critical and contentious design decision in EDA, and it directly impacts system coupling and operational load.

| Event Type | Description | Operational Risk | Evolvability Risk |
| --- | --- | --- | --- |
| Sparse Event (Notification) | Contains only an ID (`orderId: "123"`). | ❌ High Risk of Callback Storms: Consumers must call back to the source system to get details. This re-introduces a form of coupling and can easily overload the source service (a "thundering herd" problem). | ✅ Low Risk: The schema is stable. Very easy to evolve. |
| Fat Event (State Transfer) | Contains all data a consumer might need. | ✅ Low Risk: Consumers are autonomous and don't need to call back, protecting the source service. | ❌ Extremely High Risk: The event schema becomes a public, rigid API. A single change can break numerous downstream consumers, creating a deployment nightmare. This is the new monolith. |

Takeaway: There is no easy answer. We must force a deliberate, documented decision for every event type. A default to "fat events" to avoid callbacks often leads to a distributed monolith where the coupling is hidden in the event bus schema.
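To make the trade-off concrete, here is a small sketch of both styles. Everything in it is hypothetical: the `ORDERS` store, the field names `customerId` and `total`, and `get_order`, which stands in for a network call back to the source service.

```python
# Hypothetical data owned by the order service.
ORDERS = {"123": {"orderId": "123", "customerId": "c-9", "total": 42.5}}

# Counts how often consumers call back to the source service.
callback_count = 0


def get_order(order_id: str) -> dict:
    """Stands in for an HTTP call back to the order service."""
    global callback_count
    callback_count += 1
    return ORDERS[order_id]


def sparse_event(order_id: str) -> dict:
    # Notification only: the consumer learns that something happened.
    return {"eventType": "orderPlaced", "orderId": order_id}


def fat_event(order_id: str) -> dict:
    # State transfer: the full order is copied into the event.
    return {"eventType": "orderPlaced", **ORDERS[order_id]}


def handle_sparse(event: dict) -> float:
    # One callback per consumer per event.
    order = get_order(event["orderId"])
    return order["total"]


def handle_fat(event: dict) -> float:
    # No callback, but the consumer now depends on the "total" field name.
    return event["total"]


# Twenty consumers reacting to one sparse event produce twenty callbacks.
for _ in range(20):
    handle_sparse(sparse_event("123"))
```

Renaming `total` in the fat event would break `handle_fat` in every consumer, while the sparse event keeps its schema but pushes load back onto `get_order`.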

4. The Hidden Cost: Governance and Evolvability

Because event schemas are a public API, changing them is dangerous. The talk highlights two non-negotiable practices to manage this, which are significant process overheads:

The Metadata/Data Pattern: Every event must be wrapped in an envelope containing standard metadata (version, traceParent, eventType). This is essential for routing, schema evolution, and observability. Without it, you are flying blind.
Governance via RFC: Changes to an event schema must be treated like a public API change. A "Request for Comments" (RFC) process is required so all consumers can weigh in. This slows down development but prevents widespread outages.
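One possible shape for the metadata/data envelope is sketched below. The three metadata fields come from the notes above; the class names, the `route` helper, and the idea of keying handlers on `(eventType, version)` are assumptions made for illustration. The sample `traceParent` value follows the W3C Trace Context `traceparent` layout (version, trace ID, parent ID, flags).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Metadata:
    eventType: str
    version: str
    traceParent: str


@dataclass(frozen=True)
class Envelope:
    metadata: Metadata
    data: dict


def to_json(envelope: Envelope) -> str:
    # asdict converts the nested dataclasses into plain dictionaries.
    return json.dumps(asdict(envelope))


def route(raw: str, handlers: dict):
    """Pick a handler using only the metadata, never the payload."""
    event = json.loads(raw)
    meta = event["metadata"]
    key = (meta["eventType"], meta["version"])
    handler = handlers.get(key)
    if handler is None:
        raise ValueError(f"no handler registered for {key}")
    return handler(event["data"])


example = Envelope(
    metadata=Metadata(
        eventType="orderPlaced",
        version="1",
        traceParent="00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    ),
    data={"orderId": "123"},
)
```

Because routing looks only at `metadata`, a consumer could register handlers for version 1 and version 2 side by side while a schema change is rolled out.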

Takeaway: EDA is not a "move fast and break things" architecture. It demands a level of process maturity and discipline that many organizations are not prepared for.

5. The Observability Black Hole

In a synchronous world, a request has a single trace. In an asynchronous, event-driven world, a single user action can fan out into dozens of independent event flows.

Quote

"What happens when things break?"

Answering this question becomes far harder. Correlating a failure in a downstream consumer back to the original event that caused it is difficult unless every hop carries a shared identifier.

Takeaway: Tooling for distributed tracing (propagating a traceParent in the event metadata) is not optional; it is a foundational requirement for running these systems in production.
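Propagation can be sketched by hand as follows. This assumes the `traceParent` value uses the W3C Trace Context `traceparent` layout (`version-traceId-parentId-flags`); the helper names and the `loyaltyPointsAwarded` event are hypothetical, and in practice a tracing library would normally do this work.

```python
import secrets


def new_trace_parent() -> str:
    # 16 random bytes for the trace ID, 8 for the span ID, sampled flag set.
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_trace_parent(parent: str) -> str:
    """Keep the trace ID, replace the span ID with a fresh one."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


def handle_order_placed(event: dict) -> dict:
    # The consumer copies the trace ID into every event it publishes, so a
    # failure further downstream can be tied back to the original action.
    incoming = event["metadata"]["traceParent"]
    return {
        "metadata": {
            "eventType": "loyaltyPointsAwarded",
            "version": "1",
            "traceParent": child_trace_parent(incoming),
        },
        "data": {"orderId": event["data"]["orderId"], "points": 10},
    }
```

Every event in the fan-out then shares one trace ID, which is what lets a tracing backend stitch the independent flows back into a single picture.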

Actionable Advice for Our Organization

Challenge the "Why". Start Small. Before we build another event-driven system, we must ask: "What specific problem are we solving that a modular monolith can't?" If the answer isn't clear, we should default to the simpler architecture. Let's pick one small, non-critical area to experiment with these patterns.
Mandate a Ubiquitous Language via Event Storming. The biggest risk is misunderstanding. We should use Event Storming to bring SREs, developers, and business stakeholders into one room to agree on what the business events actually are. This shared language is our best defense against building the wrong thing.
Separate Reads from Writes (CQRS). The talk introduces CQRS as a way to get performant reads from an eventually consistent system. We need to recognize this for what it is: adding another layer of architectural complexity to solve a problem created by our initial choice of EDA. While powerful, it means we now have two data models to maintain, not one.
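The cost of the CQRS point above is easier to see in code. The following is a minimal in-memory sketch, not the speaker's implementation: the store names, the `orderPlaced` event shape, and the explicit `project_pending` step (standing in for an asynchronous consumer) are all assumptions.

```python
from collections import deque

# Write model: the source of truth, keyed by order ID.
write_store: dict = {}

# Events waiting to be projected; stands in for a broker (assumption).
pending_events: deque = deque()

# Read model: a second data model, shaped for one specific query.
orders_by_customer: dict = {}


def place_order(order_id: str, customer_id: str, total: float) -> None:
    """Command side: update the write model and publish an event."""
    write_store[order_id] = {"customerId": customer_id, "total": total}
    pending_events.append(
        {"eventType": "orderPlaced", "orderId": order_id, "customerId": customer_id}
    )


def project_pending() -> None:
    """Projection: apply queued events to the read model."""
    while pending_events:
        event = pending_events.popleft()
        orders = orders_by_customer.setdefault(event["customerId"], [])
        orders.append(event["orderId"])


def get_orders_for_customer(customer_id: str) -> list:
    """Query side: reads never touch the write model."""
    return orders_by_customer.get(customer_id, [])
```

Until `project_pending` runs, `get_orders_for_customer` returns stale data even though the write has succeeded. That gap is the eventual consistency discussed earlier, and the read model is the second data model that now has to be maintained.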

By being intentional and disciplined, we can leverage the power of EDA. But by treating it as a default choice, we risk building complex, brittle systems that are harder to operate and slower to change than what we started with.