The Real Cost of 'We'll Clean It Later'
Technical debt in data systems doesn't sit quietly. It compounds. Every downstream model, dashboard, and decision built on dirty data pays the price.
“We’ll clean it later” is the most expensive sentence in data engineering.
Not because cleaning is hard. Because later never comes — and in the meantime, you build everything on top of the mess.
How the debt compounds
Bad data doesn’t just sit in a table somewhere, inert. It flows. It gets joined, aggregated, transformed, and eventually used to train a model or drive a dashboard that informs a business decision.
By the time someone notices the numbers are wrong, the bad data has touched:
- The ETL pipeline that ingests it
- The feature store that serves it
- The model trained on it
- The dashboard built from it
- The report the CEO read last quarter
Now cleaning it means unwinding all of that. You’re not fixing a table. You’re auditing a chain.
The “we’ll address it in the next sprint” lie
Sprints end. Priorities shift. The messy customer ID mapping that’s been on the backlog since Q1 is still there in Q4 because something shinier always needed doing.
This is rational behavior in dysfunctional systems. If data quality doesn’t have an owner, it doesn’t get fixed. It gets tolerated. And teams develop muscle memory for working around it — building filters, special cases, and informal knowledge about which columns to trust.
That informal knowledge doesn’t survive turnover.
What this looks like in AI projects
Dirty training data is quiet until it isn’t. Your model learns from whatever you feed it — including the outliers you meant to remove, the nulls you filled with averages, and the labels that were wrong because the upstream system had a bug for six weeks.
You ship the model. It performs fine in testing. It underperforms in production. You spend three weeks debugging model behavior before someone checks the data and finds the issue that’s been there since February.
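A cheap profiling pass before training would have surfaced that issue in minutes, not weeks. Here is a minimal sketch, using only the standard library — the column values, the null-rate threshold, and the valid range are all hypothetical, chosen just to illustrate the kind of check:

```python
# Sketch: profile one numeric column for the problems described above —
# nulls that got silently filled, and values outside any plausible range.
# Thresholds and the valid_range are illustrative assumptions, not a standard.

def profile_column(values, max_null_rate=0.05, valid_range=(0.0, 100.0)):
    """Return simple red flags for a single numeric column."""
    nulls = sum(1 for v in values if v is None)
    null_rate = nulls / len(values)
    lo, hi = valid_range
    out_of_range = [v for v in values
                    if v is not None and not (lo <= v <= hi)]
    return {
        "null_rate": null_rate,
        "null_rate_exceeded": null_rate > max_null_rate,
        "out_of_range_count": len(out_of_range),
    }

# A column where over a third of the values are missing and one is absurd:
flags = profile_column([10.0, 11.0, None, 9.5, None, 10.2, 500.0, None])
```

Run against every column on every load, a report like this makes "the data has been broken since February" impossible to miss.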
The fix is structural, not heroic
You don’t need a data cleaning sprint. You need data contracts — explicit agreements between producers and consumers about what shape the data should be in, with validation that runs automatically.
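In its simplest form, a data contract is just a named set of rules the producer agrees to and the consumer can run mechanically. A minimal sketch in plain Python, with no framework — the field names (`customer_id`, `signup_date`) and rules are hypothetical examples, not a prescribed schema:

```python
# Sketch: a data contract as an explicit, executable list of field rules.
# Field names and rules here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FieldRule:
    name: str
    check: Callable[[object], bool]
    description: str

CUSTOMER_CONTRACT = [
    FieldRule("customer_id",
              lambda v: isinstance(v, str) and len(v) == 8,
              "must be an 8-character string ID"),
    FieldRule("signup_date",
              lambda v: v is not None,
              "must not be null"),
]

def validate(record: dict, contract: list[FieldRule]) -> list[str]:
    """Return violation messages; an empty list means the record passes."""
    return [f"{rule.name}: {rule.description}"
            for rule in contract
            if not rule.check(record.get(rule.name))]

# A record that breaks both rules — a 6-character ID and a null date:
violations = validate({"customer_id": "ABC123", "signup_date": None},
                      CUSTOMER_CONTRACT)
```

The point is not the mechanism — dedicated tools exist for this — but that the agreement is written down as code both sides can run, instead of living in someone’s head.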
You need quality metrics tracked in dashboards the same way application error rates are tracked. You need someone who owns data quality enough to say “this pipeline doesn’t pass, we don’t ship” — and have that carry weight.
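The "doesn’t pass, we don’t ship" rule only carries weight if it is enforced by the pipeline itself, not by a person remembering to check a dashboard. A minimal sketch of such a gate, assuming the metrics are computed upstream — the metric names and thresholds are hypothetical:

```python
# Sketch: a quality gate that compares pipeline metrics against agreed
# thresholds and reports breaches. Names and thresholds are assumptions;
# in a real pipeline the caller would abort the run (non-zero exit) on breach.

THRESHOLDS = {
    "null_rate": 0.01,       # at most 1% nulls in required fields
    "duplicate_rate": 0.005, # at most 0.5% duplicate keys
}

def gate(metrics: dict) -> list[str]:
    """Return threshold breaches; any breach means the run must not ship."""
    return [f"{name}={value:.4f} exceeds {THRESHOLDS[name]}"
            for name, value in metrics.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]]

breaches = gate({"null_rate": 0.03, "duplicate_rate": 0.0})
if breaches:
    # In production: raise SystemExit(1) so the orchestrator halts the run
    # and pages the owner, exactly like a failing application health check.
    pass
```

Wired into the orchestrator, this turns data quality from a backlog item into the same kind of hard failure a broken build is.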
The teams we work with that ship clean, reliable data products have one thing in common: they treat quality as a first-class concern, not a cleanup task for someday.
Someday never comes. The only way out of “we’ll clean it later” is to stop letting later exist.