Sovont · 3 min read

The Partitioning Decision You'll Regret

Bad partitioning doesn't break your pipeline. It just makes everything slightly wrong, forever.

Data Engineering

Nobody notices bad partitioning at first. Queries still run. Dashboards still load. Everything looks fine.

Then six months in, your warehouse bill doubles, your analysts start complaining that “the query feels slow,” and your data team is quietly scheduling full-table scans every four hours because no one wants to touch the partition scheme that’s been in place since the launch.

This is how it goes.

Why Partitioning Decisions Go Wrong

Partitioning is one of those choices you make once and live with for years. The problem: you almost always make it before you understand your actual query patterns.

You partition by ingestion date because that’s the obvious thing to do. Then six months later, 90% of your queries filter by customer_id, which a date partition can’t prune, so they scan every partition anyway. You’re paying for organization you don’t use.

Or you go the other way — partition by a high-cardinality field like user_id because some analyst said it would be faster. Now you have 10 million partitions, metadata overhead is crushing performance, and simple aggregation queries are timing out.

Both of these are real failure modes. Neither is theoretical.
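
To make both failure modes concrete, here is a minimal in-memory sketch. It is not how any particular warehouse implements pruning; the row layout, the column names (ingest_date, customer_id), and both helper functions are assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date

def partition_rows(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def partitions_scanned(partitions, partition_key, filter_col, value):
    """Count the partitions an equality filter has to read.

    Pruning only helps when the filter is on the partition key.
    A filter on any other column has to read every partition.
    """
    if filter_col == partition_key:
        return 1 if value in partitions else 0
    return len(partitions)

# Hypothetical data: 30 days of events for 100 customers.
rows = [
    {"ingest_date": date(2024, 1, day), "customer_id": cust}
    for day in range(1, 31)
    for cust in range(100)
]

by_date = partition_rows(rows, "ingest_date")
by_customer = partition_rows(rows, "customer_id")

# Filtering on the partition key touches one partition.
date_filter = partitions_scanned(by_date, "ingest_date", "ingest_date", date(2024, 1, 15))
# Filtering on customer_id against date partitions touches all of them.
cust_filter = partitions_scanned(by_date, "ingest_date", "customer_id", 42)

print(len(by_date), date_filter, cust_filter)  # 30 1 30
print(len(by_customer))                        # 100
```

The partition count always equals the number of distinct key values. With 100 customers that is harmless; with millions of user IDs it is the metadata problem described above.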

What Good Partitioning Actually Requires

Before you decide, answer three questions:

1. What does 80% of your query traffic filter on? Not what you think it filters on — what it actually does. Pull query logs. Look at WHERE clauses. The partition key should match real access patterns, not hypothetical ones.

2. What’s your data volume per partition? You want partitions that are large enough to avoid metadata overhead, but small enough that queries don’t need to scan 18 months of data to answer a simple question. A reasonable starting target is 100 MB to 1 GB per partition for most analytical workloads, though the right range depends on your engine and file format.

3. Will this key stay useful as the data grows? An event_type column that has 10 values today might have 200 in two years, and every new value is a new partition. Plan for growth or plan for a painful migration.
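
The first two questions can be roughed out in a few lines. This is a sketch under assumptions, not a tool: the regex is a crude heuristic rather than a SQL parser, the sample query log is invented, and the size bounds simply encode a 100 MB to 1 GB starting range as defaults you should tune.

```python
import re
from collections import Counter

# An identifier followed by a comparison operator. A heuristic only:
# it will misread string literals and nested subqueries.
FILTER_COL = re.compile(
    r"\b([A-Za-z_][A-Za-z0-9_]*)\s*(?:=|<|>|\bIN\b|\bBETWEEN\b|\bLIKE\b)",
    re.IGNORECASE,
)
CLAUSE_END = re.compile(r"\b(?:GROUP\s+BY|ORDER\s+BY|LIMIT)\b", re.IGNORECASE)

def filter_columns(sql):
    """Return the set of column names compared in the WHERE clause."""
    parts = re.split(r"\bWHERE\b", sql, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) < 2:
        return set()
    where = CLAUSE_END.split(parts[1], maxsplit=1)[0]
    return {m.group(1).lower() for m in FILTER_COL.finditer(where)}

def filter_share(queries):
    """Fraction of queries that filter on each column."""
    counts = Counter(col for q in queries for col in filter_columns(q))
    return {col: n / len(queries) for col, n in counts.items()}

MB, GB, TB = 1024 ** 2, 1024 ** 3, 1024 ** 4

def size_verdict(total_bytes, partition_count, low=100 * MB, high=1 * GB):
    """Classify the average partition size against an assumed target range."""
    avg = total_bytes / partition_count
    if avg < low:
        return "too small"
    if avg > high:
        return "too large"
    return "ok"

# Hypothetical query log.
query_log = [
    "SELECT * FROM events WHERE customer_id = 42",
    "SELECT count(*) FROM events WHERE customer_id = 7 AND ingest_date >= '2024-01-01'",
    "SELECT * FROM events WHERE customer_id IN (1, 2, 3) ORDER BY ingest_date",
    "SELECT event_type, count(*) FROM events GROUP BY event_type",
]

shares = filter_share(query_log)
print(shares["customer_id"], shares["ingest_date"])  # 0.75 0.25
print(size_verdict(5 * TB, 10_000_000))              # too small
print(size_verdict(5 * TB, 10_000))                  # ok
```

In practice you would feed this from your warehouse’s own query history rather than a hand-written list, and averages hide skew, so check the largest and smallest partitions too.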

When You Get It Wrong

You will eventually need to repartition. The question is whether you’ve built for that possibility.

If your pipeline writes directly to tables with baked-in partition schemes, repartitioning means rewriting history. If you’ve kept raw data clean and separated from your derived tables, it’s painful but recoverable. Build with migration in mind — because you will be wrong at least once.

The teams that handle it best aren’t the ones that got partitioning right on the first try. They’re the ones who kept their raw layer untouched and their transformation logic flexible enough to rebuild.
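
Here is roughly what “flexible enough to rebuild” can look like, as a minimal sketch. Everything in it is hypothetical: the RAW_EVENTS tuple stands in for an untouched raw layer, and build_derived and same_rows are illustrative names, not part of any library.

```python
from collections import defaultdict

# Stand-in for the raw layer: append-only, never rewritten.
RAW_EVENTS = (
    {"id": 1, "ingest_date": "2024-01-01", "customer_id": "a"},
    {"id": 2, "ingest_date": "2024-01-01", "customer_id": "b"},
    {"id": 3, "ingest_date": "2024-01-02", "customer_id": "a"},
)

def build_derived(raw_events, partition_key):
    """Rebuild a derived table from raw events, grouped by `partition_key`.

    The key is a parameter, so changing the scheme is a re-run,
    not a rewrite of the pipeline.
    """
    table = defaultdict(list)
    for event in raw_events:
        table[event[partition_key]].append(event)
    return dict(table)

def same_rows(table_a, table_b):
    """Check two layouts hold the same row ids before swapping them."""
    ids_a = sorted(e["id"] for part in table_a.values() for e in part)
    ids_b = sorted(e["id"] for part in table_b.values() for e in part)
    return ids_a == ids_b

old_layout = build_derived(RAW_EVENTS, "ingest_date")
new_layout = build_derived(RAW_EVENTS, "customer_id")

print(sorted(old_layout), sorted(new_layout))
print(same_rows(old_layout, new_layout))
```

The point of the design is that the raw data never knows about the partition key. Only the derived table does, so being wrong costs you a rebuild and a verification step instead of a rewrite of history.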

Bad partitioning is a slow leak. It doesn’t flood the room — it just raises the water level until someone finally notices.

Fix it before the water reaches the servers.