The A/B Test You Never Finished
Half your production models are running inside experiments that nobody has looked at in months. That's not science — that's clutter with a p-value.
Somewhere in your inference stack there’s a model experiment that started six weeks ago. It was supposed to run for two weeks, collect enough traffic, and produce a clear winner. Someone would review the results, make a call, and clean it up.
That hasn’t happened. The experiment is still running. The variant is still serving a slice of your traffic. Nobody is monitoring it. Nobody remembers exactly what the hypothesis was.
This is the A/B test you never finished. Most ML systems have at least one.
How they accumulate.
Starting experiments is easy. You configure traffic splits, wire up logging, and flip a flag. Done. The hard part is ending them — and “hard” mostly means “requires someone to make a decision and do work.”
So experiments drift. The engineer who set it up moved on to something else. The metrics are in a dashboard nobody checks. The original question — “is this model better?” — got superseded by a different question, a new model version, a shifted business priority. The experiment didn’t conclude. It just stopped mattering to anyone actively.
Meanwhile it’s still running. Still splitting traffic. Still generating data nobody will ever analyze.
Why this is a real problem.
The obvious issue is waste: you’re serving a model variant that isn’t the best version you have, to a real slice of your users, indefinitely.
But the subtler issue is correctness. When you eventually want to understand your production system — when something breaks, when metrics shift, when you want to know what model is actually being used — you have to untangle which traffic went where. That’s painful when experiments are well-documented. It’s nearly impossible when they’ve been running silently for months.
Undead experiments also erode trust in your evaluation infrastructure. If experiments never close, people stop believing the results mean anything. The next time someone proposes running a proper controlled test, the cultural memory is “we did that before, nothing came of it.”
What finished looks like.
An experiment has three valid end states: promote, reject, or extend with a new hypothesis. If it reaches none of those, you’ve got a zombie.
Promote means: winner declared, traffic moved 100%, variant cleaned up, flags removed. Reject means: challenger lost or results were inconclusive, all traffic back to baseline, variant decommissioned. Extend means: you have a specific new question and a new end date. Not “let’s keep watching” — an actual hypothesis and a deadline.
The commitment to an end state should happen before the experiment starts, not after. Define your success criteria, your minimum detectable effect, your runtime. Put a calendar event on the decision date. If nobody shows up to make the call, the experiment closes to baseline automatically.
Build the forcing function.
Teams that run clean experiments have tooling that makes abandonment costly. Experiments have expiration dates enforced by the platform — after which the flag routes 100% to baseline and the variant is disabled. You have to actively renew an experiment to keep it running. Renewal requires logging a reason.
That’s the whole mechanism. It’s not sophisticated. It’s just making the lazy path (doing nothing) resolve safely instead of silently.
If your ML platform makes it easier to start experiments than to end them, that’s a design flaw — not a discipline problem.
Fix the infrastructure. The science will follow.