How To Select From A Million Features
Introduction
How do you do fast feature selection on high-dimensional data? This problem comes up constantly in alpha research. You have a massive feature pool, maybe 5,000 trading signals or 10,000 alternative data features, and you need to select the useful ones.
The standard advice is to compute pairwise correlations and remove redundant features. For 10,000 features, that’s about 50 million correlation calculations, one per distinct pair.
There’s a nice O(N log N) algorithm that avoids most pairwise calculations while still removing redundancy.
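As a quick sanity check on the 50 million figure: N features have N(N-1)/2 distinct pairs. The sketch below computes that count and sets it beside N log2 N. The function name `pairwise_count` is only illustrative, and the N log2 N column ignores constant factors, so treat it as a rough order-of-magnitude comparison rather than a runtime estimate.

```python
from math import comb, log2


def pairwise_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. N * (N - 1) / 2."""
    return comb(n_features, 2)


if __name__ == "__main__":
    # Pool sizes mentioned in the text, plus the million from the title.
    for n in (5_000, 10_000, 1_000_000):
        pairs = pairwise_count(n)
        n_log_n = n * log2(n)  # constant factors ignored
        print(f"N={n:>9,}  pairs={pairs:>15,}  N*log2(N)={n_log_n:>13,.0f}")
```

The pair count grows quadratically, so going from 10,000 to a million features multiplies the number of correlations by roughly 10,000, not 100.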

