The Experiment That Never Got Turned Off
That A/B test from eight months ago is still running. So is the one before it. Your production model is now a graveyard of half-decisions.
Somewhere in your production system there’s an experiment that was supposed to run for two weeks. That was eight months ago.
Nobody turned it off. The person who launched it is on a different team. The feature flag is buried in a config nobody reviews. The 15% of users still hitting the old model path? They’re just… there. Quietly getting worse results while everyone else moved on.
This is the experiment graveyard. Most ML teams have one. Few admit it.
Why It Happens
Experiments are easy to launch and painful to clean up. Launching gets you visibility. Cleaning up gets you nothing — except a slightly less chaotic system that nobody notices.
The incentives are backwards.
Add to that: experiment conclusions are often fuzzy. “It’s directionally positive” means nobody wanted to make the call. So the flag stays. The old path stays. The technical debt accrues in silence.
What Actually Lives in There
- Shadow models that are technically live but “not primary”
- A/B variants with no active analysis running
- Canary deployments that canary’d 10% six months ago and never graduated or rolled back
- Feature flags referencing model versions that no longer exist in the registry
- Logging that quietly fails because the schema changed
Every one of these is a liability. A debugging trap. A latency cost. A behavioral inconsistency you’ll spend hours chasing when something goes wrong.
The Fix Isn’t a Process. It’s Ownership.
Checklists help. Experiment TTLs help. But the root problem is that nobody owns cleanup.
Define it explicitly: every experiment has an owner, a conclusion deadline, and a forced-cleanup date. If the deadline passes without a decision, the default is rollback — not continuation. Continuation should require active justification, not passive inertia.
Your experiment framework should make it harder to leave something running than to close it out.
What “Closing Out” Actually Means
- Document the result (even if it’s “inconclusive, moving on”)
- Remove the flag or graduate the variant to 100%
- Archive the old model version, don’t just stop routing to it
- Delete the associated logging branches you’re no longer reading
Boring work. Nobody claps for it. But six months from now, when you’re debugging a latency spike and don’t have to wonder whether it’s the zombie A/B test from Q3, you’ll be glad someone did it.
Schedule the cleanup before you schedule the next experiment.