Data Contracts Are How You Stop Breaking Each Other
Without data contracts, every pipeline change is a potential incident. Here's why informal data agreements between teams are a liability — and what to do instead.
At some point, every data team has this moment: a downstream pipeline breaks, the on-call scrambles, and the root cause turns out to be that another team quietly changed a field name, dropped a column, or altered a data type. Nobody told anyone. There was no process for telling them.
This is the data contract problem. And most teams don’t fix it until it costs them enough.
The informal contract you already have
Every time one team produces data that another team consumes, there’s a contract. The producer says: “This table will have these columns, these types, this freshness.” The consumer builds on that assumption.
The problem is when that contract exists only in someone’s head — or worse, in a Slack message from six months ago. Informal contracts break silently. No test fails. No alert fires. The pipeline keeps running, delivering wrong results, until someone notices the numbers are off.
By then, the damage is done.
What a data contract actually is
A data contract is an explicit, versioned agreement between a data producer and its consumers. It specifies:
- Schema: field names, types, nullability
- Semantics: what each field actually means (not just what it’s called)
- Freshness SLA: how current the data will be and when it arrives
- Volume expectations: rough row counts, spikes, and what anomalies look like
- Breaking change policy: how changes are communicated, versioned, and deprecated
It’s not a novel concept. APIs have had versioning and changelogs for decades. Data pipelines deserve the same discipline.
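To make the five elements concrete, here is a minimal sketch of a contract written as plain Python dataclasses. Everything in it is an illustrative assumption rather than a standard: the `FieldSpec` and `DataContract` classes, the `orders` dataset, the `checkout-team` owner, and every number shown.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FieldSpec:
    """One column in the contract (hypothetical structure)."""
    name: str
    dtype: type
    nullable: bool
    description: str  # semantics: what the value means, not just its name


@dataclass(frozen=True)
class DataContract:
    """Hypothetical container for the five elements listed above."""
    dataset: str
    version: str                 # version of the contract itself
    owner: str                   # the producing team
    fields: tuple                # schema + semantics, as FieldSpec entries
    freshness_sla: timedelta     # maximum age of the newest row
    expected_daily_rows: tuple   # (min, max); outside this range is an anomaly
    deprecation_window_days: int # notice consumers get before a breaking change


# Example values are invented for illustration only.
ORDERS_V1 = DataContract(
    dataset="orders",
    version="1.0.0",
    owner="checkout-team",
    fields=(
        FieldSpec("order_id", str, False, "Unique ID assigned at checkout"),
        FieldSpec("amount_cents", int, False, "Order total in cents, tax included"),
        FieldSpec("coupon_code", str, True, "Coupon applied to the order, if any"),
    ),
    freshness_sla=timedelta(hours=1),
    expected_daily_rows=(10_000, 50_000),
    deprecation_window_days=30,
)
```

The same information could live in YAML or JSON Schema just as well. The point of the sketch is only that each bullet above maps to a field someone can review in a pull request.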
Why teams resist this
“It slows us down.” That’s the pushback. And it’s true — in the same way that writing tests slows you down. For about a week. Then it starts saving you from the incidents that used to cost you two days each.
The teams that resist data contracts are usually the ones producing the data. The teams that want them are usually the ones consuming it. That tension is informative. It tells you where the accountability is sitting — and where it isn’t.
How to implement without bureaucracy
You don’t need a platform team and a six-month project. Start small:
1. Write it down, in code. Define schemas using something like Pydantic, dbt’s schema YAML, or a simple JSON Schema file. The format matters less than the habit. Put it in version control.
2. Validate on ingestion. When a consumer pulls data, validate it against the contract. A schema mismatch should fail loudly, not silently propagate.
3. Treat breaking changes like API changes. Major version bump, communication to consumers, deprecation window. Same discipline, different medium.
4. Own it at the producer level. The team generating the data owns the contract. They’re responsible for honoring it. If they change it, they notify consumers. If they can’t honor it, they negotiate.
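Step 2 is the one most easily shown in code. The sketch below uses only the standard library and assumes rows arrive as dictionaries. `ORDERS_CONTRACT`, `ContractViolation`, `validate_row`, and `ingest` are hypothetical names invented for this example. In practice a library such as Pydantic or a JSON Schema validator would likely do this work.

```python
class ContractViolation(Exception):
    """Raised when incoming data does not match the agreed contract."""


# Hypothetical contract: column name -> (expected type, nullable)
ORDERS_CONTRACT = {
    "order_id": (str, False),
    "amount_cents": (int, False),
    "coupon_code": (str, True),
}


def validate_row(row: dict, contract: dict) -> None:
    """Raise ContractViolation if a single row breaks the contract."""
    missing = contract.keys() - row.keys()
    unexpected = row.keys() - contract.keys()
    if missing or unexpected:
        # A renamed column shows up here as one missing and one unexpected name.
        raise ContractViolation(
            f"column mismatch: missing={sorted(missing)}, "
            f"unexpected={sorted(unexpected)}"
        )
    for name, (expected_type, nullable) in contract.items():
        value = row[name]
        if value is None:
            if not nullable:
                raise ContractViolation(f"{name} must not be null")
        elif not isinstance(value, expected_type):
            raise ContractViolation(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )


def ingest(rows: list, contract: dict) -> list:
    """Validate every row before handing the batch to downstream code."""
    for i, row in enumerate(rows):
        try:
            validate_row(row, contract)
        except ContractViolation as exc:
            # Fail the whole batch loudly instead of passing bad rows along.
            raise ContractViolation(f"row {i}: {exc}") from exc
    return rows
```

A producer that renames `amount_cents` to `amount` now stops the consumer's job on the first row with an error naming both columns, rather than letting wrong numbers flow through.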
The leverage is in the failure mode
An informal agreement fails silently, at 3 AM, in production. A data contract fails loudly, at the point of change, in development. That’s not a minor difference — it’s a complete inversion of where the pain lands.
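One way to move the failure to the point of change is a check that runs in the producer's CI and compares the proposed contract against the published one. The sketch below is an assumption about how such a check might look, not a reference to any existing tool. `breaking_changes` and `check_release` are hypothetical names. Contracts are simplified to a mapping of field name to `(type name, nullable)`, and added fields are treated as non-breaking, which strict consumers may not accept.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List the changes in `new` that could break a consumer built on `old`."""
    problems = []
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            problems.append(f"removed or renamed field: {name}")
            continue
        new_type, new_nullable = new[name]
        if new_type != old_type:
            problems.append(f"type change on {name}: {old_type} -> {new_type}")
        if new_nullable and not old_nullable:
            problems.append(f"{name} became nullable")
    return problems


def major(version: str) -> int:
    """Return the major component of a 'MAJOR.MINOR.PATCH' version string."""
    return int(version.split(".")[0])


def check_release(old: dict, old_version: str,
                  new: dict, new_version: str) -> None:
    """Reject a breaking change that ships without a major version bump."""
    problems = breaking_changes(old, new)
    if problems and major(new_version) <= major(old_version):
        raise ValueError(
            "breaking change without a major version bump: "
            + "; ".join(problems)
        )


# Illustrative contracts: field name -> (type name, nullable)
published = {"order_id": ("string", False), "amount_cents": ("int", False)}
proposed = {"order_id": ("string", False), "amount": ("int", False)}
```

Calling `check_release(published, "1.2.0", proposed, "1.3.0")` raises in the producer's pull request, while the same change released as `"2.0.0"` passes and signals to consumers that a migration is required.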
If your pipelines break every time another team ships, you don’t have a pipeline problem. You have a contract problem.