Why One-Hot Encoding Almost Never Makes Sense
Native Categorical Trees and Embeddings Make Dummy Columns Obsolete
It’s common knowledge that one-hot encoding explodes dimensionality, erases similarity information, and crumbles under drift. Yet it is still widely used by data scientists to encode categorical variables. This is partly due to misconceptions about its benefits (e.g., that it prevents overfitting, improves interpretability, outperforms alternatives in certain scenarios, or is the only practical option) and partly due to old habits. This post walks through why turning categories into dummy columns is hardly ever a good idea.
Structural Cost
The limitations of one‑hot encoding are familiar to most practitioners, but here's a quick refresher.
Sparsity explosion: A column with k categories balloons into k binary features. The resulting vectors are >99% zeros for typical web-scale features (e.g., ZIP codes, SKUs). Any model that uses dot products or distance metrics now operates in a high-variance regime: gradients are noisy, regularizers over-penalize, and the risk of overfitting climbs sharply (see the short sketch after this list).
Gradient dilution in GBDTs (Gradient Boosted Decision Trees): GBDTs build per‑feature histograms. Because every dummy builds its own histogram, the total gradient mass for the original variable is smeared across k histograms, scattering the training signal so thinly that strong splits become hard to spot. Convergence slows, more trees are needed, and generalization suffers.
No notion of similarity: Under one‑hot, “Toronto” and “Mississauga” are orthogonal vectors. The learner must infer their geographic proximity from scratch. Native categorical algorithms can group semantically related values in a single split, capturing locality or hierarchy automatically.
Deeper, messier trees: A tree can only test one dummy at a time (“is Toronto?”). Capturing “{US, CA, GB} vs rest” demands multiple layers, reducing interpretability and inflating inference cost.
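To make the sparsity point above concrete, here is a minimal sketch with synthetic data (the "zip_code" column and the sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical high-cardinality feature: ~5,000 distinct ZIP-like codes.
df = pd.DataFrame({"zip_code": rng.integers(0, 5_000, size=50_000).astype(str)})

dummies = pd.get_dummies(df["zip_code"], sparse=True)
print(dummies.shape)                              # one new column per distinct category
print(f"{dummies.sparse.density:.4%} non-zero")   # roughly 0.02%, i.e. >99.9% zeros
```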
Operational Drift
Top‑N buckets age badly: Today’s “Other” bucket may contain tomorrow’s fastest‑growing product. As traffic drifts, your one‑hot schema drifts out of sync.
Unseen values: A brand‑new country code hits production, yields an all‑zeros vector, and quietly erodes metric lift or triggers a null‑pointer error in downstream feature stores.
Encoder churn: Each retrain recalculates category frequencies and rebuilds the ID mappings (e.g., US → 0, UK → 1), forcing you to update and verify every downstream system to keep them in sync.
Gradient‑Boosted Trees (Native Categorical Splits)
Native categorical splitting in modern gradient‑boosted trees almost always surpasses one‑hot encoding by directly capturing category relationships, eliminating sparse dummy columns, and accelerating both learning and inference.
LightGBM
At each node, LightGBM sorts the category histogram by gradient statistics and exploits the result that the optimal binary split lies among contiguously ranked bins, cutting the naïve 2^(k‑1) search to O(k log k). A single split can group related countries like {US, CA, GB} against the rest, with no dummy explosion and no information loss.
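A minimal sketch of what this looks like in practice, using the scikit-learn wrapper and synthetic data (the column names and values are illustrative, not from a real dataset). Declaring the column as a pandas category dtype is enough for LightGBM to apply its native categorical splits:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(42)
n = 20_000
df = pd.DataFrame({
    "country": rng.choice(["US", "CA", "GB", "DE", "JP", "BR"], size=n),
    "amount": rng.exponential(50, size=n),
})
# Target loosely tied to a group of countries, so a subset split is useful.
y = (df["country"].isin(["US", "CA", "GB"]) & (df["amount"] > 40)).astype(int)

# The pandas 'category' dtype tells LightGBM to treat the column natively:
# a node can split on "country in {US, CA, GB}" without any dummy columns.
df["country"] = df["country"].astype("category")

model = lgb.LGBMClassifier(n_estimators=100)
model.fit(df, y)  # categorical handling is picked up automatically from the dtype
```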
CatBoost
CatBoost replaces every category with ordered target statistics computed on earlier rows of a random permutation, eliminating leakage while delivering dense, information‑rich numerical features. As the trees grow, these statistics update online, and CatBoost even synthesizes interaction terms such as “user × merchant” internally—something a one‑hot regime would need exponential feature crosses to approximate.
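As a sketch (synthetic data, illustrative column names), the only requirement is to declare which columns are categorical; CatBoost then applies ordered target statistics and its internal feature combinations to the raw strings:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(7)
n = 20_000
df = pd.DataFrame({
    "user_segment": rng.choice(["new", "returning", "vip"], size=n),
    "merchant": rng.choice([f"m_{i}" for i in range(500)], size=n),
    "amount": rng.exponential(30, size=n),
})
y = rng.integers(0, 2, size=n)

# cat_features marks the raw string columns; CatBoost replaces them with
# ordered target statistics and can combine them (e.g., user_segment x merchant)
# internally, with no dummy columns created.
train_pool = Pool(df, label=y, cat_features=["user_segment", "merchant"])
model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(train_pool)
```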
XGBoost ≥ 1.6
Recent XGBoost releases introduced native categorical support: very low-cardinality features are one-hot encoded internally, while all others use LightGBM-style subset splits (“value ∈ S”) that compact the entire logic into a single node.
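A minimal sketch, again with made-up data: with a category-typed DataFrame and enable_categorical=True, the hist tree method places the subset splits directly on the raw column.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(3)
n = 20_000
df = pd.DataFrame({
    "country": pd.Categorical(rng.choice(["US", "CA", "GB", "DE", "JP"], size=n)),
    "amount": rng.exponential(20, size=n),
})
y = rng.integers(0, 2, size=n)

# enable_categorical lets the tree builder emit "value in S" subset splits
# on the raw categorical column; no one-hot preprocessing is needed.
model = xgb.XGBClassifier(tree_method="hist", enable_categorical=True, n_estimators=100)
model.fit(df, y)
```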
Linear and Distance-Based Models
Even if your final model is a linear one, embeddings can transform high-cardinality features into a format that is more efficient and effective. Likewise, clustering algorithms rely on meaningful distance metrics, which one-hot encoding undermines by making every pair of categories equidistant. Embeddings fix this, enabling better cluster quality.
Linear/GLM: To improve Generalized Linear Models (GLMs) with high-cardinality features, train a tree-based model (e.g., Random Forest), a shallow neural network, or a factorization machine against the target to learn dense embeddings (8–64 dimensions), then feed those embeddings into the GLM. This reduces the parameter count compared to one-hot encoding, yields interpretable coefficients tied to embedding dimensions, and strengthens the model’s signal by grouping similar categories in vector space for better generalization (see the combined sketch after this list).
Clustering (k‑Means, HDBSCAN): For enhanced clustering with high-cardinality features, learn embeddings using supervised methods (e.g., shallow neural networks) or factorization techniques to capture data relationships, then pass these embeddings into clustering algorithms like k-Means or HDBSCAN. This approach ensures meaningful semantic distances—placing related categories like “Toronto” and “Mississauga” closer together than “Tokyo”—and maintains computational efficiency, making clustering scalable even with millions of unique categories.
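The two recipes above can be sketched in a few lines; this is an illustrative toy, not a production pipeline, and all names and sizes are made up. A shallow PyTorch network learns 8-dimensional embeddings for a synthetic high-cardinality merchant feature, the frozen vectors feed a scikit-learn logistic regression, and k-Means then clusters the categories in the same embedding space:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, n_merchants, emb_dim = 50_000, 2_000, 8
merchant = rng.integers(0, n_merchants, size=n)              # high-cardinality categorical
y = (rng.random(n) < 0.2 + 0.6 * (merchant % 7 == 0)).astype(np.float32)

class EmbedNet(nn.Module):
    """Shallow network: embedding lookup + linear head trained on the target."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_merchants, emb_dim)
        self.head = nn.Linear(emb_dim, 1)
    def forward(self, x):
        return self.head(self.emb(x)).squeeze(-1)

model, loss_fn = EmbedNet(), nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
X_t, y_t = torch.from_numpy(merchant), torch.from_numpy(y)
for _ in range(200):                     # a few full-batch steps are enough for a sketch
    opt.zero_grad()
    loss_fn(model(X_t), y_t).backward()
    opt.step()

emb = model.emb.weight.detach().numpy()  # one dense 8-dim vector per merchant

# GLM: 2,000 x 8 embedding weights + 8 coefficients instead of 2,000 dummies.
glm = LogisticRegression(max_iter=1000).fit(emb[merchant], y)

# Clustering: distances between merchants are now semantic, not uniform.
merchant_clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(emb)
```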
Common Objections
Skeptics rightly raise a few edge‑cases, so let’s tackle them head‑on.
“One‑hot is fine when k is small.” Even with three categories, LightGBM or CatBoost can pose the question “{A,B} vs C?” in a single split; a one‑hot regime needs two levels and twice the depth. Native handling is the smarter default.
“Linear models need one‑hot.” Begin with trees to discover which categories—or interactions—actually matter. If a linear model is later mandated, encode only those validated groupings instead of flooding the design matrix with thousands of dummies.
“One‑hot is more interpretable.” Tree-based models with native splits already provide clear decision paths and global SHAP importances. To quantify a single category’s impact (e.g., country == ‘US’), average the row-level SHAP values for that category; this measures precise contributions without creating sparse dummy columns (a short sketch appears after this list).
“One-hot prevents overfitting.” Splitting one column into k near‑empty dummies multiplies parameters; rare bins learn noise. Native categorical splits pool counts, apply built‑in smoothing, and generalize better.
“It’s fine to one‑hot the major levels and bucket the rest.” Even when two categories drive 50%+ of volume, native categorical splits will automatically isolate those top levels and then optimally group or split the remaining tail, preserving latent patterns in the minority categories rather than flattening them into an undifferentiated “other” bin.
“Deep nets still need one‑hot.” Dense embeddings are lower‑dimensional, similarity‑aware, and O(1) at inference; TabTransformer and CTR models repeatedly outscore one‑hot baselines.
“What about raw strings or tokens?” Token embeddings compress millions of unique substrings into a few dozen trainable dimensions, letting similar strings share parameters and outperform sparse one‑hot baselines in AUC and convergence tests.
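On the interpretability objection above, here is a hedged sketch of averaging row-level SHAP values for a single category, using LightGBM’s built-in SHAP-value output (pred_contrib=True) on synthetic, made-up data:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(5)
n = 10_000
X = pd.DataFrame({
    "country": pd.Categorical(rng.choice(["US", "CA", "GB", "DE"], size=n)),
    "amount": rng.exponential(25, size=n),
})
y = ((X["country"] == "US") | (X["amount"] > 30)).astype(int)

model = lgb.LGBMClassifier(n_estimators=100).fit(X, y)

# pred_contrib=True returns per-row SHAP values: one column per feature
# plus a final bias column.
contribs = model.predict(X, pred_contrib=True)

country_col = X.columns.get_loc("country")
mask = (X["country"] == "US").to_numpy()
us_effect = contribs[mask, country_col].mean()
print(f"Average SHAP contribution of country == 'US': {us_effect:.3f}")
```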
Takeaways
One‑hot encoding earned its place when toolkits lacked alternatives. Today, LightGBM partitions categories in O(k log k), CatBoost injects leakage‑free target statistics, and XGBoost applies subset splits, all while shrinking models, speeding up training, and improving accuracy. These methods are virtually always the better choice; one-hot encoding hardly ever delivers superior performance.

