Stop Hand-Rolling Feature Discovery — Here's the Math That Actually Works
John Doe
Your feature engineering is probably garbage.
I don't mean your domain knowledge or intuition — those are fine. I mean the part where you manually craft 47 features, throw them at a random forest, and hope something sticks. We've all been there. You spend three weeks engineering the "perfect" feature set, only to discover that user_id % 7 somehow outperforms your carefully constructed behavioral signals.
The truth is, manual feature discovery doesn't scale. And more importantly, it's solving the wrong problem.
The Real Problem Isn't Finding Features
Most engineers think feature discovery is about finding the "best" features. Wrong. It's about finding the minimal set of features that captures the maximum information about your target variable. Every feature past that minimum has a cost:
Every additional feature increases overfitting risk
Model inference gets slower (and more expensive)
Feature maintenance becomes a nightmare
Interpretability goes out the window
I learned this the hard way at a fintech company where our fraud detection model had 200+ features. Nobody understood why it worked, and when it broke (spoiler: it always breaks), debugging was impossible.
Why Most Approaches Fail
The naive approach is correlation analysis. Calculate Pearson correlation between each feature and the target, keep the top N. This works exactly never.
Why? Because correlation only captures linear relationships. Your target might depend on the interaction between transaction_amount and time_of_day, but neither feature shows strong individual correlation.
Even worse, correlation ignores redundancy. Features A and B might both correlate 0.7 with the target, but if they're 0.9 correlated with each other, you're not adding information — you're adding noise.
The slightly-less-naive approach is univariate statistical tests (chi-square, ANOVA). Better than correlation, but still treats each feature independently. In the real world, features work together.
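Here's a toy version of that interaction problem (synthetic data and made-up column meanings, purely for illustration): neither input correlates with the target on its own, but together they explain most of it.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 5000
amount = rng.normal(size=n)   # stand-in for transaction_amount
hour = rng.normal(size=n)     # stand-in for time_of_day
y = amount * hour + rng.normal(scale=0.1, size=n)  # target lives in the interaction

print(np.corrcoef(amount, y)[0, 1])  # near zero
print(np.corrcoef(hour, y)[0, 1])    # near zero

X = np.column_stack([amount, hour])
model = RandomForestRegressor(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=3, scoring='r2').mean())  # far from zero

Any univariate screen (correlation, chi-square, ANOVA) would throw both columns away.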
The Algorithm That Actually Works
After trying everything from PCA (useless for supervised learning) to recursive neural networks (overkill and opaque), I settled on a hybrid approach combining mutual information with recursive feature elimination.
Here's the core algorithm:
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

class FeatureDiscovery:
    def __init__(self, estimator=None, cv_folds=5):
        self.estimator = estimator or RandomForestRegressor(
            n_estimators=100,
            random_state=42,
            max_depth=10  # Prevent overfitting during selection
        )
        self.cv_folds = cv_folds

    def mutual_information_filter(self, X, y, threshold=0.1):
        """First pass: remove obviously irrelevant features."""
        mi_scores = mutual_info_regression(X, y, random_state=42)
        relevant_features = mi_scores > threshold
        print(f"MI filter: {relevant_features.sum()}/{len(relevant_features)} features retained")
        return X[:, relevant_features], relevant_features

    def recursive_elimination(self, X, y, min_features=5):
        """Second pass: find the optimal feature subset."""
        best_score = -np.inf
        best_features = None
        # Start with all features, eliminate the worst performer each round
        current_features = np.arange(X.shape[1])
        while len(current_features) >= min_features:
            # Score the current subset with cross-validation
            X_subset = X[:, current_features]
            scores = cross_val_score(
                self.estimator, X_subset, y,
                cv=self.cv_folds, scoring='r2'
            )
            avg_score = scores.mean()
            print(f"Features: {len(current_features)}, CV Score: {avg_score:.4f}")
            if avg_score > best_score:
                best_score = avg_score
                best_features = current_features.copy()
            # Refit on the full subset, then drop the least important feature
            self.estimator.fit(X_subset, y)
            importances = self.estimator.feature_importances_
            worst_idx = np.argmin(importances)
            current_features = np.delete(current_features, worst_idx)
        return best_features, best_score
The two-stage approach matters. Mutual information filtering removes obviously irrelevant features fast: one cheap MI score per feature, O(n) in the feature count, instead of the O(n²) work of eliminating them one at a time from the full set. Then recursive elimination finds the optimal subset of what remains.
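For concreteness, here's a minimal sketch of how I'd chain the two stages, using synthetic make_regression data as a stand-in for a real feature matrix:

import numpy as np
from sklearn.datasets import make_regression

# Stand-in data: 50 candidate features, only 10 actually informative
X, y = make_regression(n_samples=2000, n_features=50, n_informative=10,
                       noise=5.0, random_state=42)

fd = FeatureDiscovery(cv_folds=5)

# Stage 1: drop features with near-zero mutual information
X_filtered, mi_mask = fd.mutual_information_filter(X, y, threshold=0.1)

# Stage 2: backward elimination on the survivors
best_features, best_score = fd.recursive_elimination(X_filtered, y, min_features=5)

# Map surviving indices back to the original columns
selected = np.where(mi_mask)[0][best_features]
print(f"Selected columns: {selected}, CV R^2: {best_score:.4f}")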
Why This Works Better
Mutual information captures non-linear relationships that correlation misses. It measures how much knowing feature X reduces uncertainty about target y. Zero MI means the feature is useless. High MI means it's informative.
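A quick, contrived illustration: a symmetric quadratic relationship has essentially zero Pearson correlation, but its mutual information is clearly positive.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
y = x ** 2 + rng.normal(scale=0.05, size=5000)  # non-linear, symmetric around zero

print(np.corrcoef(x, y)[0, 1])                                    # near zero
print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])  # clearly positive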
But MI alone isn't enough because it doesn't handle redundancy. That's where recursive elimination comes in.
The key insight: instead of adding features (forward selection), we remove them. Start with all MI-filtered features and eliminate the least important one each round. This naturally handles feature interactions — if removing feature A hurts performance, it means A provides unique information not captured by other features.
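You can see that effect with three toy features, one of which is a near-duplicate (again synthetic, just to make the point): dropping the redundant copy barely moves the cross-validated score, while dropping the feature with unique signal tanks it.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 4000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1: redundant
y = x1 * x2 + rng.normal(scale=0.1, size=n)

def cv_r2(columns):
    X = np.column_stack(columns)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring='r2').mean()

print(cv_r2([x1, x2, x3]))  # baseline with all three
print(cv_r2([x1, x2]))      # redundant copy removed: barely changes
print(cv_r2([x1, x3]))      # unique feature removed: score collapses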
Real Implementation Details
The devil's in the details. Here's what the textbooks don't tell you:
Estimator choice matters. Random forests work well because they handle feature interactions naturally and provide built-in importance scores. Be careful using plain linear models for feature importance: their coefficients become unstable and misleading when features are correlated (multicollinearity), which is exactly the situation you're trying to clean up.
Cross-validation is non-negotiable. Single train/test splits will lie to you. I use 5-fold CV, but if your dataset is small, go higher.
Threshold tuning is crucial. The MI threshold (0.1 in the example) depends on your data. Too high and you eliminate useful features. Too low and recursive elimination takes forever.
# Threshold selection
def find_optimal_threshold(X, y, thresholds=np.linspace(0.05, 0.3, 10)):
    mi_scores = mutual_info_regression(X, y, random_state=42)
    results = []
    for threshold in thresholds:
        n_features = np.sum(mi_scores > threshold)
        if n_features < 5:  # Need a minimum number of features for stability
            continue
        # Quick performance check on the filtered feature set
        X_filtered = X[:, mi_scores > threshold]
        score = cross_val_score(
            RandomForestRegressor(n_estimators=50, random_state=42),
            X_filtered, y, cv=3
        ).mean()
        results.append((threshold, n_features, score))
        print(f"Threshold {threshold:.2f}: {n_features} features, score {score:.4f}")
    if not results:
        raise ValueError("Every candidate threshold left fewer than 5 features; lower the range.")
    # Pick the threshold that maximizes the performance/feature-count trade-off
    best = max(results, key=lambda x: x[2] / np.log(x[1] + 1))
    return best[0]
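Then feed the result straight into the selector (a sketch; X and y are whatever feature matrix and target you're working with, and FeatureDiscovery is the class from earlier):

threshold = find_optimal_threshold(X, y)

fd = FeatureDiscovery()
X_filtered, mi_mask = fd.mutual_information_filter(X, y, threshold=threshold)
best_features, best_score = fd.recursive_elimination(X_filtered, y)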
When It Breaks
This approach isn't perfect. It fails when:
Your estimator is wrong. If you're using random forests for selection but deploying a linear model, the selected features might be suboptimal. Match your selection estimator to your production model; there's a sketch of that after this list.
Your data is too small. You need roughly a thousand or more samples for stable MI estimates. Below that, use domain knowledge.
Features have different scales. Normalize everything first. MI is scale-invariant, but the recursive elimination estimator might not be.
Your target is weird. Multi-modal distributions or heavy outliers can confuse MI calculations. Check your target distribution first.
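The first failure mode is the easiest to avoid, because the selector already takes an estimator argument; anything with a feature_importances_ attribute will work. A sketch, assuming a gradient-boosted model is what actually ships (swap in whatever your production model family is):

from sklearn.ensemble import GradientBoostingRegressor

# Select features with the same model family you plan to deploy
production_like = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                            random_state=42)
fd = FeatureDiscovery(estimator=production_like)

X_filtered, mi_mask = fd.mutual_information_filter(X, y)
best_features, best_score = fd.recursive_elimination(X_filtered, y)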
The Results
I've used this on everything from ad click prediction (reduced 400 features to 23 with 2% performance improvement) to time series forecasting (15 features down to 7, 40% faster inference).
The real win isn't just performance — it's interpretability. When your model breaks at 3 AM, debugging 7 features is manageable. Debugging 200 is not.
Actually, let me be more specific about those results. The ad system went from 180ms prediction latency to 45ms. The time series model's feature drift monitoring became actually usable instead of a wall of noise.
Your mileage will vary. But if you're still hand-picking features in 2024, you're solving the wrong problem with the wrong tools.
The algorithm is simple. The implementation is straightforward. The results speak for themselves.
Stop overthinking it.