The A100 Surplus Hiding in Plain Sight
RunPod charges $3.29/hr for an A100 SXM. On Vast.ai right now, you can rent one for $0.44. That 7.5× gap isn't a pricing glitch — it's a market signal worth acting on.
Blog
Notes on production AI, data engineering, and the messy reality of shipping systems that work.
Your RAG pipeline retrieves the right document. The problem is it was last updated eight months ago.
Dropping 'unused' columns without lineage visibility is how you break three downstream teams at once — and none of them will tell you until production is already wrong.
LLM API calls without explicit timeouts are a production incident waiting to happen. Here's what hangs, why, and how to stop it.
The model is in production. The integration is live. Nobody told the users. This is how AI projects succeed technically and fail completely.
You shipped the model. Can you say which version it is, what data trained it, and why it makes the decisions it makes? If not, you have a governance problem — not a compliance problem.
Sentinel values and bad defaults look like real data. They pass every schema check, corrupt your features, and make your model confidently wrong in production.
Your RAG pipeline retrieved the right answer six months ago. The source doc changed. Nobody re-indexed it.
Nobody models LLM costs seriously until they get the bill. By then, the architecture is already wrong.
You shipped the AI feature. Users are using it. Something's wrong. You don't know what — because you never built a way to find out.
Offline metrics look great. Production behavior is a disaster. This gap isn't bad luck — it's a design failure you can prevent.
Source systems quietly change their primary keys and your pipelines keep running — producing wrong answers instead of errors. That's the worst kind of failure.
System prompt bloat is one of the slowest ways to degrade your LLM system — and one of the easiest to miss until performance tanks and costs spike.
You built a RAG system that retrieves semantically. You forgot to build the one that retrieves precisely. Metadata filtering isn't an optimization — it's the difference between a search engine and a lucky guess.
Your AI system went live six months ago. Has anyone actually checked if it still works the way you think it does?
Half your production models are running inside experiments that nobody has looked at in months. That's not science — that's clutter with a p-value.
Treating late-arriving data as an exception is how you get metrics that silently restate themselves for days after the fact. Design for lateness upfront or debug it forever.
Agents that call tools are running code with real consequences. Most teams build them like they're not.
Someone built a slick AI demo. Leadership loved it. Now it's in production. This is how systems fail slowly and visibly.
Most RAG systems retrieve against the user's raw query. That's the problem. Query rewriting is the highest-leverage improvement most teams skip entirely.
Software engineers ship canaries without thinking twice. ML teams ship full replacements and call it 'confidence.' Here's why that's backwards — and how to fix it.
Naive retry logic is one of the most common — and most expensive — bugs in LLM production systems. Here's what it looks like and how to fix it.
The team that built it is already on the next project. The ops team doesn't understand it. And nobody wants to be the one paged at 2 AM when it breaks.
Timezone-naive timestamps are a silent data quality bomb. They pass every schema check, join on nothing, and make your dashboards confidently wrong.
It runs fine locally. It breaks in staging. It fails silently in production. ML environment parity is not a nice-to-have — it's the job.
You picked an embedding model early, it worked well enough, and you never looked at it again. That's the problem.
Bigger context windows didn't solve the problem of what goes in them. Most production LLM failures aren't model failures — they're context failures.
Everyone is building AI agents. Nobody is asking whether the process being automated was worth keeping in the first place.
Bad partitioning doesn't break your pipeline. It just makes everything slightly wrong, forever.
Your ML staging environment feels like safety. It isn't. Here's what it's hiding.
If your LLM integration parses free-text responses in production, you don't have a product. You have a fragile prototype waiting to fail.
A roadmap is not a product. Learn to tell the difference before you sign the contract.
Your retrieval pipeline returns 20 chunks. Your LLM sees 5. What happens in between that gap is either thoughtful or a coin flip.
That A/B test from eight months ago is still running. So is the one before it. Your production model is now a graveyard of half-decisions.
Backfills aren't a nice-to-have. They're how you find out if your pipeline actually works.
LLMs don't know when they're wrong. Your production system has to.
Scope creep in AI projects rarely looks like bad faith. It looks like enthusiasm. Here's how to handle it without torching the relationship.
Most data pipelines fail silently. A dead letter queue is the thing that catches what falls through — and tells you why.
Your model works. Your pipeline is green. But somewhere, something is hardcoded to a version you never wrote down. That's the shadow dependency — and it will break you.
Most teams ship AI features without defining acceptable latency. Then they spend months optimizing the wrong thing.
Vague AI initiatives don't die — they consume budget indefinitely. Here's how to kill the cycle before it starts.
Semantic search solves one problem. Hybrid retrieval solves the problem you actually have.
Idempotency is table stakes. The next level is building pipelines that assume everything upstream is lying to you.
Your model changed under your app. Your prompt changed under your users. And nobody noticed until something broke. Fix this before it happens to you.
You've got five AI tools, two vector databases, and three prompt management systems. What you don't have is a production AI system.
Most data pipelines break silently when run twice. Idempotency isn't a nice-to-have — it's the property that separates pipelines you can trust from ones you're afraid to touch.
Tuning chunk size and tweaking similarity thresholds won't save you when your pipeline silently degrades in production.
Every team thinks their use case is special enough to justify building from scratch. Most are wrong — and the decision is costing them months.
You spent three months building the RAG knowledge base. Then you shipped it and moved on. That's why it's already wrong.
You monitor your servers. You don't monitor your models. Here's what that's costing you.
Schemas change. That's fine. What's not fine is discovering you've silently broken three pipelines and a model when they do.
Everyone has an opinion on feature stores. Most of them are wrong. Here's when you actually need one.
Two different tools for two different problems. Picking the wrong one wastes months.
The streaming vs batch debate isn't about which is better. It's about which problem you're actually solving — and most teams get it wrong by defaulting to one without thinking.
Vibes-based RAG evaluation is how you ship broken retrieval to production. Here's what a real eval framework looks like.
Every deployment without a rollback plan is a bet that nothing will go wrong. In production ML systems, that bet loses more often than you think.
Your job posting says 'machine learning engineer' but you need someone who ships and operates, not someone who experiments and publishes. The distinction matters more than you think.
Without data contracts, every pipeline change is a potential incident. Here's why informal data agreements between teams are a liability — and what to do instead.
Your model isn't broken — it's just quietly wrong. Here's how to catch drift before it becomes a support ticket.
Regular software debt is a slow leak. ML debt is a pressure cooker — and most teams don't realize it until something explodes.
Prompt management in production isn't a nice-to-have. If you're not versioning, testing, and deploying prompts with the same discipline as code, you're flying blind.
Centralizing your AI talent into a dedicated team feels organized and intentional. It's also one of the fastest ways to kill momentum.
Most RAG pipelines fail at chunk size 512, split by character, never revisited. Here's what actually moves the needle on retrieval quality — and why your defaults are probably wrong.
Your software pipeline won't save your ML system. Here's what actually needs to be different — and why copying your DevOps playbook is a trap.
AI agents are proliferating, but they can't find each other. Agora is an open-source registry and discovery service that fixes that — built to complement A2A and MCP.
Technical debt in data systems doesn't sit quietly. It compounds. Every downstream model, dashboard, and decision built on dirty data pays the price.
The demo worked. Stakeholders loved it. And then nothing happened. Here's why — and how to stop the cycle.
Unit tests don't cover AI behavior. If you're shipping models without eval suites, you're shipping blind.
Why every production ML team needs model versioning, eval tracking, and promotion workflows.
No 90-day discovery phase. No 200-page strategy doc. Here's how we actually work.
If you're hiring 4 senior data engineers, you're not doing AI yet — you're building the foundation you skipped.