The Pipeline That Runs Once and Trusts Nothing
Idempotency is table stakes. The next level is building pipelines that assume everything upstream is lying to you.
You made your pipelines idempotent. Good. That’s not enough.
Idempotency means running twice gives you the same answer. It says nothing about whether the answer is right. Your source system could be emitting duplicates, late events, soft-deletes that never propagate, timestamps in three different time zones depending on the region. All of that hits your pipeline looking clean.
The pipeline runs. The data lands. The dashboard says green.
Your numbers are wrong.
This is the failure mode nobody talks about because it’s quiet. No error logs. No alert fires. Just a slow drift between what your data says and what’s actually happening in the world. You find out during a board meeting or a billing audit.
Here’s how we think about it at Sovont:
Pipelines should treat upstream data the way good engineers treat third-party APIs — with documented assumptions and explicit failure modes when those assumptions break.
That means:
Define a freshness contract. If your source should update every hour and it hasn’t in four, that’s data, not silence. Log it, alert on it, decide what to do about it. Don’t just skip forward.
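A freshness contract can be a few lines of code. This is a minimal sketch, assuming an hourly SLA; `check_freshness` and `FRESHNESS_SLA` are hypothetical names, and how you obtain the source's last-update timestamp depends on your orchestrator.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: the source is supposed to update every hour.
FRESHNESS_SLA = timedelta(hours=1)

def check_freshness(last_updated_at, now=None):
    """Return a structured verdict: staleness is data, not silence.

    Emit this into your logs/metrics on every run, and alert on `stale`.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated_at
    return {
        "stale": lag > FRESHNESS_SLA,
        "lag_seconds": lag.total_seconds(),
        "sla_seconds": FRESHNESS_SLA.total_seconds(),
    }
```

The point is the return shape: a record you can log and alert on, rather than a boolean that lets the run silently skip forward.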
Check cardinality, not just row count. A million rows landed. Did any primary keys duplicate? Did an entire segment of the distribution vanish? Row count is a lazy check. Cardinality catches what row count misses.
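The cardinality check is equally small. A sketch, assuming rows arrive as dicts with a primary-key field; `cardinality_report` is a hypothetical helper, and in a warehouse you would express the same thing as `COUNT(*)` vs `COUNT(DISTINCT pk)`.

```python
from collections import Counter

def cardinality_report(rows, key="id"):
    """Compare row count to distinct-key count; surface the duplicates.

    Row count alone passes even when every key arrived twice.
    """
    counts = Counter(r[key] for r in rows)
    return {
        "row_count": len(rows),
        "distinct_keys": len(counts),
        "duplicate_keys": {k: c for k, c in counts.items() if c > 1},
    }
```

If `row_count != distinct_keys`, the report tells you exactly which keys duplicated, which is where debugging starts.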
Make late arrivals a first-class citizen. Late data doesn’t mean bad data — but it does mean you need a reprocessing strategy. If you don’t have one, you have a choice: silently wrong numbers or silent gaps. Neither is acceptable.
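One common reprocessing strategy is watermark-based routing: events older than the watermark minus a lateness budget are set aside, and the aggregation windows they touch are queued for recomputation. A sketch under those assumptions; `route_batch`, the two-hour budget, and hourly windows are all illustrative choices, not a prescription.

```python
from datetime import timedelta

def route_batch(events, watermark, lateness_budget=timedelta(hours=2)):
    """Split a batch into on-time and late events.

    Late events are not dropped: they name the hourly windows whose
    aggregates must be recomputed.
    """
    cutoff = watermark - lateness_budget
    on_time = [e for e in events if e["event_time"] >= cutoff]
    late = [e for e in events if e["event_time"] < cutoff]
    # Windows affected by late data, to feed a reprocessing queue.
    reprocess = {
        e["event_time"].replace(minute=0, second=0, microsecond=0) for e in late
    }
    return on_time, late, sorted(reprocess)
```

The design choice being illustrated: late data triggers explicit, bounded recomputation instead of either silent inclusion (wrong historical numbers) or silent exclusion (gaps).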
Version your expectations. When upstream changes their schema or semantics, you need to know the exact point where your assumptions diverged. Without that, debugging a bad quarter means archaeology.
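The cheapest way to version expectations is to fingerprint them and record the fingerprint on every run. A minimal sketch, assuming the schema is available as a name-to-type mapping; `schema_fingerprint`, `check_schema`, and the example columns are hypothetical.

```python
import hashlib
import json

def schema_fingerprint(schema):
    """Stable hash of column names and types; record one per pipeline run."""
    canonical = json.dumps(sorted(schema.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# The expectation you ship with the pipeline, versioned in source control.
EXPECTED = {"user_id": "bigint", "amount": "numeric", "created_at": "timestamptz"}
EXPECTED_FP = schema_fingerprint(EXPECTED)

def check_schema(observed, run_id):
    """Fail loudly at the exact run where assumptions diverged."""
    fp = schema_fingerprint(observed)
    if fp != EXPECTED_FP:
        raise ValueError(
            f"schema drift at run {run_id}: expected {EXPECTED_FP}, got {fp}"
        )
```

Because every run logs its fingerprint, a bad quarter becomes a diff between two run IDs instead of archaeology. Semantic drift (same schema, changed meaning) still needs human-maintained version notes; the fingerprint only catches structural divergence.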
None of this is sexy. There’s no LLM inside, no vector database, no attention mechanism. Just rigorous engineering around the part of the stack that makes everything else meaningful.
The companies with trustworthy AI products aren’t smarter. They just stopped trusting their pipelines and built verification into the muscle of the system.
Start there before you add another model.