DATA ENGINEERING - 18 min read

Building Streaming Data Pipelines: Kafka, Exactly-Once, and Backpressure

Published May 8, 2026 - alphabench Engineering

Batch pipelines are forgiving. If a nightly job fails, you rerun it. Streaming pipelines are not - they run continuously, under load, and the failure modes are subtle: a dropped event here, a duplicate there, a slow consumer that silently stalls the whole system. Getting streaming right is mostly about getting the failure handling right.

This is how we build streaming pipelines that don't lose or duplicate data - the patterns behind systems that ingest millions of events a day without quietly corrupting the numbers downstream.

Why Streaming Breaks Differently

In a batch world, the unit of work is a job with a clear start and end. In a streaming world, the pipeline never stops, so every component has to handle failure while running. A consumer crashes mid-batch. A downstream database slows down. A producer sends a duplicate after a network blip. None of these can be allowed to lose or double-count data.

In streaming, "it worked when I tested it" means nothing. The question is what happens when one component fails while the rest keep running.

Delivery Semantics: The Decision That Shapes Everything

Every streaming pipeline makes a choice, explicitly or by accident, between at-most-once, at-least-once, and exactly-once delivery.

At-most-once can lose data and is rarely acceptable for anything that feeds a number someone trusts. At-least-once never loses data but can deliver duplicates. Exactly-once - or more honestly, effectively-once - delivers each event's effect exactly one time.

True end-to-end exactly-once is hard and sometimes impossible across system boundaries. The practical pattern is at-least-once delivery plus idempotent processing: accept that you might see an event twice, and make processing it twice have the same effect as processing it once. This is almost always the right target, and it's far simpler to reason about than chasing exactly-once everywhere.
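To make the difference concrete, here is a minimal simulation, not real Kafka client code. The only thing that changes between at-most-once and at-least-once is whether the consumer commits its offset before or after processing. All names here (`run_consumer`, `make_handler`, `simulate`, `Crash`) are hypothetical, and the "log" is just an in-memory list.

```python
from typing import Callable, List


class Crash(Exception):
    """Simulated consumer crash."""


def run_consumer(log: List[str], start: int,
                 handle: Callable[[str], None], commit_first: bool) -> int:
    """Consume from `start` until the log ends or the handler crashes.

    Returns the last committed offset, which is where a restart resumes.
    """
    committed = start
    for offset in range(start, len(log)):
        try:
            if commit_first:
                committed = offset + 1  # at-most-once: commit, then process
                handle(log[offset])
            else:
                handle(log[offset])     # at-least-once: process, then commit
                committed = offset + 1
        except Crash:
            break
    return committed


def make_handler(sink: List[str], crash_on: str,
                 crash_after_effect: bool) -> Callable[[str], None]:
    """Build a handler that appends to `sink` and crashes once on `crash_on`."""
    crashed = False

    def handle(event: str) -> None:
        nonlocal crashed
        should_crash = event == crash_on and not crashed
        if should_crash and not crash_after_effect:
            crashed = True
            raise Crash()           # crash before the write happens
        sink.append(event)          # the side effect (a blind insert)
        if should_crash and crash_after_effect:
            crashed = True
            raise Crash()           # crash after the write, before the commit

    return handle


def simulate(commit_first: bool, crash_after_effect: bool) -> List[str]:
    """Run once, crash on event "b", then restart from the committed offset."""
    log = ["a", "b", "c"]
    sink: List[str] = []
    handle = make_handler(sink, "b", crash_after_effect)
    resume_at = run_consumer(log, 0, handle, commit_first)
    run_consumer(log, resume_at, handle, commit_first)
    return sink
```

Under these assumptions, `simulate(commit_first=True, crash_after_effect=False)` returns `["a", "c"]` (event `b` is lost), while `simulate(commit_first=False, crash_after_effect=True)` returns `["a", "b", "b", "c"]` (event `b` is written twice). The second outcome is the one idempotent processing can repair; the first cannot be repaired downstream.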

Idempotency Is How You Sleep at Night

The whole correctness story of a streaming pipeline rests on idempotent writes. If processing an event is idempotent, then duplicates - from retries, redelivery, or reprocessing - are harmless.

We make writes idempotent with deterministic keys: an event carries a stable ID, and the write is an upsert keyed on that ID rather than a blind insert. Reprocessing the same event overwrites the same row instead of creating a second one. For aggregations, we either make the aggregate recomputable from source or track which events have already been applied.
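As an illustration of the upsert pattern, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the `open_store` / `apply_event` helpers are assumptions made for the example, not a schema from any real system, and the `ON CONFLICT ... DO UPDATE` syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed on the event's stable ID (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (event_id TEXT PRIMARY KEY, amount REAL NOT NULL)"
    )
    return conn


def apply_event(conn: sqlite3.Connection, event_id: str, amount: float) -> None:
    """Upsert keyed on event_id: a redelivered event overwrites its own row."""
    conn.execute(
        "INSERT INTO orders (event_id, amount) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET amount = excluded.amount",
        (event_id, amount),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_store()
    apply_event(conn, "evt-1", 10.0)
    apply_event(conn, "evt-1", 10.0)  # duplicate delivery of the same event
    apply_event(conn, "evt-2", 5.0)
    print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

A blind `INSERT` in the same place would either fail on the second delivery or, without the primary key, silently double the total.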

Once writes are idempotent, retries become safe and the entire system gets simpler. You stop trying to guarantee each event is processed exactly once and instead guarantee that processing it any number of times is fine.
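Aggregates cannot be upserted per event, so the alternative mentioned above is to track which events have already been applied. A rough sketch, with a hypothetical `RunningTotal` class that keeps everything in memory; a real system would need to persist the total and the applied IDs together, atomically, which is assumed away here.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RunningTotal:
    """A sum that ignores events it has already seen (in-memory sketch)."""
    total: float = 0.0
    applied_ids: Set[str] = field(default_factory=set)

    def apply(self, event_id: str, amount: float) -> bool:
        """Add `amount` once per event_id; return False for a duplicate."""
        if event_id in self.applied_ids:
            return False  # already counted, so redelivery has no effect
        self.applied_ids.add(event_id)
        self.total += amount
        return True


if __name__ == "__main__":
    revenue = RunningTotal()
    revenue.apply("evt-1", 10.0)
    revenue.apply("evt-1", 10.0)  # retry of the same event
    revenue.apply("evt-2", 2.5)
    print(revenue.total)
```

The set of applied IDs grows without bound in this form. Bounding it (for example by time window) is a design decision that depends on how late a duplicate can plausibly arrive.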

Backpressure: The Failure Mode Nobody Sees Coming

The quietest way a streaming pipeline fails is through missing backpressure. A downstream component - a database, an API, a slow transform - can't keep up. If the pipeline has no backpressure handling, events pile up in memory until the process dies, or worse, get silently dropped.

We handle it by letting the slow component set the pace. With Kafka, consumers pull at their own rate, so a slow consumer naturally slows its own reads without losing data - the log retains events for the topic's configured retention period whether or not they have been read, so nothing is lost as long as lag stays inside that window. The key discipline is to never build unbounded in-memory buffers between stages, and to monitor consumer lag as a first-class health metric. Rising lag is the early warning that something downstream is struggling, long before it becomes an outage.
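The "no unbounded buffers" rule can be sketched with Python's standard `queue.Queue`, whose `maxsize` makes `put` block when the buffer is full. This is an in-process stand-in for the idea, not Kafka code; `run_pipeline` and its parameters are hypothetical.

```python
import queue
import threading
import time
from typing import List

SENTINEL = None  # marks the end of the stream


def run_pipeline(events: List[int], max_buffer: int,
                 consumer_delay: float) -> List[int]:
    """Push events through a bounded buffer to a deliberately slow consumer."""
    buffer: "queue.Queue" = queue.Queue(maxsize=max_buffer)
    received: List[int] = []

    def consume() -> None:
        while True:
            item = buffer.get()
            if item is SENTINEL:
                return
            time.sleep(consumer_delay)  # simulate a slow downstream write
            received.append(item)

    worker = threading.Thread(target=consume)
    worker.start()
    for event in events:
        buffer.put(event)  # blocks while the buffer is full: the producer slows down
    buffer.put(SENTINEL)
    worker.join()
    return received


if __name__ == "__main__":
    out = run_pipeline(list(range(20)), max_buffer=2, consumer_delay=0.001)
    print(len(out), "events delivered, none dropped")
```

Memory use is capped by `max_buffer` regardless of how far the consumer falls behind. The cost is that the producer stalls, which is exactly the signal that should show up as rising lag.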

Observability Is Not Optional

A streaming pipeline you can't see into is a liability. Every pipeline we build emits three things at minimum: throughput (events per second per stage), lag (how far behind real-time each consumer is), and error rate (events that failed and landed in a dead-letter queue).
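The three signals are simple to compute once the counters exist. A sketch with hypothetical names (`StageMetrics`, `consumer_lag`); it measures lag in messages per partition (log end offset minus committed offset), which is one common convention and an assumption here. Measuring lag in time would instead compare event timestamps to the wall clock.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StageMetrics:
    """Counters for one pipeline stage over one reporting window."""
    processed: int = 0
    dead_lettered: int = 0

    def throughput(self, window_seconds: float) -> float:
        """Successfully processed events per second over the window."""
        return self.processed / window_seconds

    def error_rate(self) -> float:
        """Fraction of seen events that ended up in the dead-letter queue."""
        seen = self.processed + self.dead_lettered
        return self.dead_lettered / seen if seen else 0.0


def consumer_lag(end_offsets: Dict[int, int],
                 committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag in messages; an uncommitted partition counts from 0."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


if __name__ == "__main__":
    metrics = StageMetrics(processed=90, dead_lettered=10)
    print(metrics.throughput(window_seconds=30.0), metrics.error_rate())
    print(consumer_lag({0: 100, 1: 50}, {0: 90}))
```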

The dead-letter queue matters as much as the happy path. Events that can't be processed - malformed, schema-violating, referencing missing data - go to a DLQ instead of crashing the pipeline or vanishing. You get to inspect them, fix the cause, and replay, rather than discovering weeks later that a class of events silently disappeared.
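A minimal sketch of the routing logic, assuming events arrive as JSON strings and the dead-letter queue is just a list. `process_batch` and the shape of the dead-letter record are hypothetical; in a real deployment the DLQ would typically be its own topic or table.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def process_batch(
    raw_events: List[str],
    handle: Callable[[Dict[str, Any]], None],
) -> Tuple[int, List[Dict[str, str]]]:
    """Process each event; park failures in a dead-letter list with the reason."""
    processed = 0
    dead_letters: List[Dict[str, str]] = []
    for raw in raw_events:
        try:
            event = json.loads(raw)  # malformed JSON raises a ValueError subclass
            handle(event)            # missing fields raise KeyError / TypeError
            processed += 1
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original payload so the event can be replayed after a fix.
            dead_letters.append(
                {"raw": raw, "error": f"{type(exc).__name__}: {exc}"}
            )
    return processed, dead_letters


if __name__ == "__main__":
    amounts: List[float] = []
    batch = ['{"id": "e1", "amount": 5}', "not json", '{"id": "e2"}']
    ok, dlq = process_batch(batch, lambda e: amounts.append(float(e["amount"])))
    print(ok, [d["error"] for d in dlq])
```

Only expected, data-related exceptions are caught. An unexpected error (a lost database connection, say) should still stop the consumer rather than push healthy events into the DLQ.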

Replay and Backfill

Things go wrong. A bug corrupts a day of aggregates; a downstream schema changes. The pipelines that survive this are the ones built for replay from the start. As long as the event log's retention covers the affected period and processing is idempotent, correcting historical data is a matter of replaying the relevant events through the fixed code - not a frantic manual reconciliation.
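A toy illustration of why replay is safe when writes are keyed. The transforms, the event shape, and the dict standing in for the store are all made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]  # (event_id, price_in_cents), a hypothetical shape


def buggy_transform(cents: int) -> float:
    return cents / 10   # bug: wrong divisor


def fixed_transform(cents: int) -> float:
    return cents / 100


def replay(log: List[Event], transform: Callable[[int], float],
           store: Dict[str, float]) -> None:
    """Re-run every retained event; keyed writes overwrite earlier results."""
    for event_id, cents in log:
        store[event_id] = transform(cents)


if __name__ == "__main__":
    log = [("e1", 1250), ("e2", 300)]
    store: Dict[str, float] = {}
    replay(log, buggy_transform, store)   # the original, corrupted run
    replay(log, fixed_transform, store)   # backfill with the fixed code
    replay(log, fixed_transform, store)   # replaying again changes nothing
    print(store)
```

Because each write lands on the same key, the corrected values replace the corrupted ones, and an accidental second replay is a no-op.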

The Foundation

Get three things right - explicit delivery semantics, idempotent writes, and backpressure with observability - and a streaming pipeline becomes boring in the best way. It ingests, it transforms, it stores, and when something breaks you can see it and replay it.

This is the core of our Data Pipeline Engineering work, and it underpins the trading and inventory systems we've built. For a deeper look at storing the output, see "Why We Chose Event Sourcing for a $200M Trading Platform".

A streaming pipeline is judged entirely by how it behaves when something fails. Design for that, and the happy path takes care of itself.
