Idempotency Is the Property Your Pipelines Are Missing
Most data pipelines break silently when run twice. Idempotency isn't a nice-to-have — it's the property that separates pipelines you can trust from ones you're afraid to touch.
Your pipeline runs. Then something fails at 2 AM. The orchestrator retries. Now you have duplicate rows, double-counted revenue, and a model trained on data that got processed twice.
This is a classic idempotency failure — and it’s more common than any team wants to admit.
What idempotency actually means
An idempotent pipeline produces the same result whether it runs once or ten times. Run it twice on the same input and you get the same output. No duplicates. No ghosts. No “we need to manually clean up the March 14th run.”
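The property can be sketched in a few lines. Here a toy in-memory “table” is keyed by a deterministic row id (the names are illustrative, not from any particular framework):

```python
# Keyed writes: re-running the same batch overwrites rather than duplicates.
def idempotent_load(table: dict, rows: list) -> None:
    for row in rows:
        table[row["id"]] = row  # deterministic key -> same state every run

batch = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

table: dict = {}
idempotent_load(table, batch)
once = dict(table)              # snapshot after the first run
idempotent_load(table, batch)   # the 2 AM retry
assert table == once            # running twice changed nothing
```

The deterministic key is doing all the work: the second run maps each row to the same slot as the first.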
It sounds obvious. Almost no one builds it by default.
Why pipelines fail this test
The culprit is usually INSERT without deduplication. A pipeline reads from source, transforms, and appends to a destination table. Works great the first time. Retry it, and you’ve doubled your data. Retry it five times during an incident, and you’ve got a mess that takes a full day to untangle — if you even catch it.
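The append-only failure mode is easy to reproduce. A sketch, with the destination table modeled as a plain list:

```python
# The non-idempotent version: plain append, no deduplication.
def append_load(table: list, rows: list) -> None:
    table.extend(rows)

batch = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

table: list = []
append_load(table, batch)
append_load(table, batch)  # orchestrator retries after a downstream failure
assert len(table) == 4     # 4 rows, not 2
assert sum(r["amount"] for r in table) == 700  # revenue should have been 350
```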
Other failure modes:
- Mutable intermediate state — writing to temp tables that don’t get cleaned before the next run
- Stateful aggregations — summing or counting without a snapshot boundary, so each run adds to the previous total instead of replacing it
- Non-atomic writes — the pipeline half-finishes, leaves partial data, and now you have no clean way to distinguish what’s good from what isn’t
The pattern that fixes it
The simplest approach: delete-then-insert (or MERGE/UPSERT) with a deterministic key. Before writing data for a given time window or batch, delete whatever’s already there for that window, then insert fresh. Your pipeline becomes safe to retry by default.
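Here is a minimal sketch of delete-then-insert using Python’s built-in sqlite3 (the table and column names are illustrative). The delete and insert share one transaction, so a retry sees either the old window or the new one, never both:

```python
import sqlite3

def load_window(conn: sqlite3.Connection, day: str, rows: list) -> None:
    """Idempotent load for one time window: wipe the window, then insert fresh."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM revenue WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO revenue (day, order_id, amount) VALUES (?, ?, ?)", rows
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (day TEXT, order_id INTEGER, amount REAL)")

rows = [("2024-03-14", 1, 100.0), ("2024-03-14", 2, 250.0)]
load_window(conn, "2024-03-14", rows)
load_window(conn, "2024-03-14", rows)  # retry: safe, no duplicates

(count,) = conn.execute("SELECT COUNT(*) FROM revenue").fetchone()
assert count == 2
```

The deterministic key here is the time window itself. On a warehouse that supports it, MERGE or INSERT … ON CONFLICT accomplishes the same thing in one statement.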
For streaming systems, the same logic applies: use idempotent writes with deduplication on a unique key at the consumer, not at the source. Under at-least-once delivery, duplicates reach the consumer no matter how careful the producer is.
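Consumer-side dedup can be sketched as a closure over a set of seen keys (a real system would persist the seen keys in a keyed state store, not a process-local set; `event_id` is an assumed field name):

```python
def make_consumer(sink: list):
    seen: set = set()
    def handle(event: dict) -> None:
        if event["event_id"] in seen:
            return  # duplicate delivery: drop it
        seen.add(event["event_id"])
        sink.append(event)
    return handle

sink: list = []
consume = make_consumer(sink)
for ev in [{"event_id": "a1", "amount": 100},
           {"event_id": "a1", "amount": 100},   # redelivered after a retry
           {"event_id": "b2", "amount": 250}]:
    consume(ev)
assert len(sink) == 2  # the redelivery was dropped
```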
The key insight is that idempotency is a contract you design in, not a property that emerges on its own. Every write operation in your pipeline should answer: “If this runs twice, what happens?” If the answer is “bad things,” that’s a bug waiting for its moment.
Why this matters in AI pipelines especially
Feature engineering pipelines that double-count values corrupt training data in ways that don’t always surface immediately. The model trains, passes your test distributions, and then underperforms in ways that are hard to attribute. You spend weeks debugging model behavior when the actual issue is that your pipeline ran twice during a deployment in October.
Data issues compound. Idempotency is cheap insurance.
The operational payoff
Teams that build idempotent pipelines by default have dramatically lower incident response costs. When something breaks and the orchestrator retries, they don’t have to page someone at 3 AM to manually roll back data. The retry is safe. The pipeline handles it.
That’s the point: systems you can trust at 3 AM are systems designed for failure, not just for the happy path.
Every pipeline will run twice eventually. The question is whether you built for it.