Automated Alpha Mining, Not Useless Formula Factories
Introduction
With our work on OpenForage, many have asked about automated alpha mining, how it works, and how to think about it. Alpha mining is by now at least a decade old, but it has only really been taken seriously in the past few years, after an explosion in compute made really sophisticated searchers (e.g. neural networks) possible.
The core premise is really simple - you write a simple rule, you rank stocks with it, and you check if this ranking predicts future returns.
Today’s article is meant to discuss the really simple stuff and get everyone up to speed so that we can talk about the more interesting alpha search algorithms in coming articles.
What Is An Alpha?
An alpha goes by many names (signal, factor, etc.). I use “alpha” because that’s what the literature calls a formulaic signal, a form that is particularly amenable to being “mined”.
A few toy examples to make this less abstract:
(close - open) / (high - low)
vwap / close
volume / mean(volume, 20)
You will notice that an alpha is composed of two parts: fields (e.g. close, open, high, low, etc.), each represented by a datetime x instrument matrix; and operands (strictly speaking operators, e.g. -, /, mean) that transform these fields whilst preserving the structure of the matrices.
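To make the field/operand split concrete, here is a minimal sketch in Python with NumPy. The random price data, the array sizes, the crude VWAP stand-in and the helper name ts_mean are all assumptions made for illustration; only the three formulas come from the examples above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 60, 5  # 60 dates x 5 instruments (made-up sizes)

# Synthetic fields, each a datetime x instrument matrix.
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, (T, N)), axis=0))
open_ = close * (1 + rng.normal(0, 0.005, (T, N)))
high = np.maximum(open_, close) * (1 + rng.uniform(0.001, 0.01, (T, N)))
low = np.minimum(open_, close) * (1 - rng.uniform(0.001, 0.01, (T, N)))
volume = rng.uniform(1e5, 1e6, (T, N))
vwap = (high + low + close) / 3  # crude stand-in, not a real VWAP


def ts_mean(x, window):
    """Rolling mean down the time axis; NaN until the window is full."""
    out = np.full(x.shape, np.nan)
    for t in range(window - 1, x.shape[0]):
        out[t] = x[t - window + 1 : t + 1].mean(axis=0)
    return out


# Each alpha has the same datetime x instrument shape as its inputs.
alpha_1 = (close - open_) / (high - low)
alpha_2 = vwap / close
alpha_3 = volume / ts_mean(volume, 20)
```

Every operand here maps matrices to a matrix of the same shape, which is what lets you nest them freely.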
Formulaic alphas of the form above are designed to exploit simple patterns that repeat in the stock market. The core belief is very simple: these patterns are at least weakly correlated with future returns. In plain English, whenever one of these signals takes a high value, you expect it to tell you something about where returns are going to go. Take, for example, a simple momentum signal like (close - open) / (high - low) above: when the value is high, returns are expected to be positive and prices to keep rising, and vice versa. Is it going to be true all the time? Absolutely not. Is it going to be true sometimes? Yes, but mostly only a little more often than a coin flip, i.e. close to 50% of the time. That is to say, these signals are really, really weak. As standalones, they are almost entirely untradeable.
Yet, with a large enough collection of them, something beautiful happens. Noise cancels out and a strong meta-signal emerges. Hence, the goal of any large, scalable, sophisticated investment process is to collect as many of these signals as possible. This is the primary reason why quantitative researchers are hired en masse by the largest quantitative hedge funds!
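The noise-cancelling claim can be checked with a small simulation. This is an idealised sketch, not a model of real markets: it assumes every signal has the same tiny correlation rho with future returns and that the noise in each signal is fully independent of the others. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_signals, rho = 5000, 200, 0.03  # rho: assumed per-signal correlation

future_returns = rng.standard_normal(n_obs)
noise = rng.standard_normal((n_signals, n_obs))
# Each signal is almost all noise with a sliver of information about returns.
signals = rho * future_returns + np.sqrt(1 - rho**2) * noise


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


single_ics = [corr(s, future_returns) for s in signals]
combined_ic = corr(signals.mean(axis=0), future_returns)

print(f"average single-signal correlation: {np.mean(single_ics):.3f}")
print(f"equal-weight combination:          {combined_ic:.3f}")
```

Under these assumptions the correlation of an equal-weight average of K signals is rho / sqrt(rho^2 + (1 - rho^2) / K), which is roughly 0.39 for rho = 0.03 and K = 200. Real alphas share noise, so the improvement in practice is smaller, which is why diversity between alphas matters as much as their count.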
What Is An Alpha Search?
An alpha search is the act of systematically finding all available good alphas in your search space (defined as all possible permutations of your fields and operands). People automate this process for many reasons.
Firstly, the search space gets huge very quickly. Just allowing for a dozen fields and a dozen operators creates a search space with more formulas than you will ever want to read in your lifetime. If you don’t restrict the depth of your alphas, the search space is quite literally infinite.
Secondly, alphas in this formulaic form are much easier to audit than huge black-box forecasts. A factor can be a really messy nested expression but it will still be inspectable. A neural network is opaque in a way that shows no remorse.
Thirdly, related to the above, real portfolios use massive collections of signals. You will not be able to trade a meaningful portfolio without the use of an extremely large collection of signals. Hence, the goal is to build a gigantic library of signals that each add something useful.
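To put rough numbers on the first point above, the sketch below counts syntactically distinct expression trees up to a given depth. Splitting the dozen operators into six unary and six binary ones is an assumption, and the count treats a + b and b + a as different formulas, so read it as an order-of-magnitude estimate.

```python
def count_expressions(n_fields, n_unary, n_binary, max_depth):
    """Count distinct expression trees with depth <= max_depth.

    Depth 0 is a bare field. A deeper tree is either a unary operand applied
    to a shallower tree or a binary operand applied to two shallower trees.
    """
    count = n_fields
    for _ in range(max_depth):
        count = n_fields + n_unary * count + n_binary * count * count
    return count


for depth in range(5):
    print(depth, count_expressions(12, 6, 6, depth))
```

With these inputs, depth 2 already gives about 5.4 million expressions and depth 3 about 1.7 x 10^14. Each extra level roughly squares the count, which is why unrestricted depth makes the space effectively infinite.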
Building A Framework For Automated Alpha Mining
Before you automate anything, it is worth thinking about the process of manual alpha mining. It goes something like this:
Define A Hypothesis: What patterns do you think are predictive of future prices in the stock market? One simple example is momentum - when prices go up, they continue going up. The quality of the hypothesis is often directly correlated with the quality of the researcher. Good researchers think of interesting hypotheses grounded in economic rationale.
Get Fields | Operand Types: What fields and operand types would you need to express this? For fields, it will be clear that you need returns, but you can use many different types of operands here. You can use zscores, percentage changes, deltas, ranks of returns and they will all do a reasonable job of expressing a momentum signal.
Build Your Expression: With your fields and operand types, you can now choose an expression that expresses your signal. For momentum, we can go with the simple rank(longitudinal_zscore(returns, 20)). You want the biggest positive positions in the stocks with the highest returns and the biggest negative positions in the stocks with the lowest returns. Assuming your rank is scaled to [0, 1], you can make the signal approximately delta-neutral (long and short positions net to roughly zero) by just subtracting a 0.5 scalar from your signal to transform it to [-0.5, 0.5].
Backtest Your Expression: Now you need to check if this expression was any good in the past. Remember that the entire idea here is that you’re assuming that patterns that worked (predicted future returns) in the past are going to continue working in the future. If they never worked in the past you need a DAMN GOOD reason to assume they are going to work in the future! You also want to take the chance to do a bunch of measurements! These measurements allow you to decide if the signal you have is “good” by your standards. All the IP in search algorithms essentially collapses to how you determine whether or not a signal is “good”.
Keep Or Kill: Most signals are not going to be good signals. That’s life. You’re going to test a bunch of stuff and they are going to sound good ex-ante and actually look like shit once you begin to even remotely stress test them.
Once you understand the manual flow for alpha mining, you can think about how you want to algorithmically approach this. It should be obvious that building your expression and backtesting it is a fairly mechanical process.
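As an illustration of how mechanical the expression-building step is, here is a sketch of the momentum example rank(longitudinal_zscore(returns, 20)) - 0.5 on synthetic returns. The bodies of longitudinal_zscore and rank are my assumptions about what those operands do (a trailing per-instrument z-score and a per-date cross-sectional rank); your own definitions may differ.

```python
import numpy as np


def longitudinal_zscore(x, window):
    """Z-score each value against its own instrument's trailing window."""
    out = np.full(x.shape, np.nan)
    for t in range(window - 1, x.shape[0]):
        hist = x[t - window + 1 : t + 1]
        out[t] = (x[t] - hist.mean(axis=0)) / hist.std(axis=0)
    return out


def rank(x):
    """Cross-sectional rank per date, scaled to [0, 1]. Ties get no special handling."""
    out = np.full(x.shape, np.nan)
    for t in range(x.shape[0]):
        row = x[t]
        if np.isnan(row).any():
            continue  # leave warm-up rows as NaN
        order = row.argsort().argsort()  # 0 = smallest value on this date
        out[t] = order / (len(row) - 1)
    return out


rng = np.random.default_rng(1)
returns = rng.normal(0, 0.01, (100, 8))  # synthetic: 100 dates x 8 stocks

signal = rank(longitudinal_zscore(returns, 20)) - 0.5
```

Because the ranks here are spaced evenly from 0 to 1, subtracting 0.5 makes each date's positions sum to exactly zero, with the most extreme names at -0.5 and +0.5.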
Hence, all the “intelligence” of an automated search is really 3 things:
How do you let artificial intelligence dream of predictive patterns? Good researchers will tell you that a good hypothesis is the result of inspiration, observation and intuition. It is often the product of a “worldview” that has been honed from experience. How do you let artificial intelligence develop a “worldview” to come up with good predictive patterns? Seeding a “worldview” with academic literature is often a decent technique, but it ensures that you only get crowded, spent and stale alphas.
Once you’ve developed a worldview for what pattern can predict prices, how do you express that pattern as an alpha? For example, in momentum, you could use close prices, you could use open prices, or both. You could calculate the velocity of prices, their acceleration or their percentage changes. You may have decided on a function but you will need to decide on a form as well, and not all forms are equal!
Is this alpha useful? Once you’ve given your alpha life and have backtested it and generated a bunch of statistics / information about it, you will now need to decide if you want to use it in production. Basically, you want to know if this alpha is good enough for you to risk REAL MONEY.
A Skeleton Automated Loop
So, once you understand the manual search for alphas and have some ideas about an automated search, you can put together an automated loop that looks like this:
1. Generate candidate alphas
2. Evaluate each candidate alpha
3. Reject all bad alphas
4. Keep good alphas
5. Repeat 1-4 until death of the universe
Keep doing this until you have traversed as much of the search space as you can, collecting every good and diverse alpha along the way: alphas you feel good enough about that you think they deserve to have real money put on them. Beginners usually think that the hard part is generating alphas. It is not.
There are billions of candidate alphas from just open, high, low and close plus a few well-thought-out operands. The problem is that the number of bad alphas is also in the billions, so the hard part is rejecting them.
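The loop above can be sketched in a few dozen lines. Everything here is a toy under stated assumptions: the fields are random matrices named a to d, a weak pattern is planted in the synthetic returns so that something can be found, candidates come from a naive random generator, and the thresholds min_ic and max_corr are arbitrary.

```python
import random
import numpy as np

rng = np.random.default_rng(7)
T, N = 300, 20
fields = {name: rng.standard_normal((T, N)) for name in ("a", "b", "c", "d")}
# Plant a weak pattern so that some candidates are genuinely predictive.
future_returns = 0.1 * (fields["a"] - fields["b"]) + rng.standard_normal((T, N))

BINARY = {"add": np.add, "sub": np.subtract, "mul": np.multiply}


def generate_candidate(py_rng, depth=2):
    """Random expression tree: a field name or a tuple (op, left, right)."""
    if depth == 0 or py_rng.random() < 0.3:
        return py_rng.choice(sorted(fields))
    op = py_rng.choice(sorted(BINARY))
    return (op, generate_candidate(py_rng, depth - 1), generate_candidate(py_rng, depth - 1))


def evaluate(expr):
    """Turn an expression tree into a datetime x instrument matrix."""
    if isinstance(expr, str):
        return fields[expr]
    op, left, right = expr
    return BINARY[op](evaluate(left), evaluate(right))


def information_coefficient(signal, target):
    """Mean per-date cross-sectional correlation; 0.0 if the signal is constant."""
    s = signal - signal.mean(axis=1, keepdims=True)
    r = target - target.mean(axis=1, keepdims=True)
    denom = np.sqrt((s**2).sum(axis=1) * (r**2).sum(axis=1))
    valid = denom > 0
    if not valid.any():
        return 0.0
    return float(((s * r).sum(axis=1)[valid] / denom[valid]).mean())


def search(n_iterations, min_ic=0.03, max_corr=0.7, seed=0):
    py_rng = random.Random(seed)
    library = []  # entries are (expression, ic, flattened signal)
    for _ in range(n_iterations):
        expr = generate_candidate(py_rng)                      # generate
        signal = evaluate(expr)
        ic = information_coefficient(signal, future_returns)   # evaluate
        if abs(ic) < min_ic:
            continue                                           # reject: too weak
        flat = signal.ravel()
        if any(abs(np.corrcoef(flat, kept)[0, 1]) > max_corr for _, _, kept in library):
            continue                                           # reject: near-duplicate
        library.append((expr, ic, flat))                       # keep
    return library


library = search(200)
for expr, ic, _ in library:
    print(f"{ic:+.3f}  {expr}")
```

Note where the judgement lives: generate_candidate is trivial, while the two rejection tests decide what ends up in the library. A pure-noise candidate can still clear a threshold this low by luck, which is exactly the failure mode a real scoring scheme has to guard against.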
A Toy Project
Choose A Universe
Pick something liquid and consistent. For example, S&P 500 daily bars.
Get Point-In-Time (PIT) Data For All Stocks In Your Universe
Minimally, you want at least price-volume data. Transform them so that they are indexed by datetime and columned by your instruments.
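A minimal sketch of that reshaping step, assuming your raw data arrives as long-format (date, ticker, value) records. The tickers and prices below are made up.

```python
import numpy as np

# Hypothetical long-format records: (date, ticker, close).
records = [
    ("2024-01-02", "AAA", 10.0),
    ("2024-01-02", "BBB", 20.0),
    ("2024-01-03", "AAA", 10.5),
    ("2024-01-03", "BBB", 19.5),
    ("2024-01-04", "AAA", 10.4),
    # BBB has no bar on 2024-01-04; it must stay missing, not be filled in.
]

dates = sorted({d for d, _, _ in records})
tickers = sorted({s for _, s, _ in records})
row = {d: i for i, d in enumerate(dates)}
col = {s: j for j, s in enumerate(tickers)}

# One matrix per field, indexed by datetime (rows) and instrument (columns).
close = np.full((len(dates), len(tickers)), np.nan)
for d, s, px in records:
    close[row[d], col[s]] = px

print(dates, tickers)
print(close)
```

Leaving gaps as NaN rather than filling them from later data is one small part of keeping the dataset point-in-time: nothing in row t should depend on information that arrived after t.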
Write Operands
You need a collection of operands that you can apply to your data.
Build Your Backtester
Given an alpha, can you find out how well this alpha would have performed in the past?
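One hedged way to answer that question is the tiny backtester below. It is a sketch under simplifying assumptions of my own: positions are the demeaned alpha scaled to unit gross exposure, the signal known at the close of day t earns the return of day t + 1, and there are no costs, borrow constraints or capacity limits.

```python
import numpy as np


def backtest(alpha, returns):
    """Daily PnL of a zero-net, unit-gross portfolio built from `alpha`.

    alpha[t] is known at the close of day t and earns returns[t + 1].
    """
    demeaned = alpha - np.nanmean(alpha, axis=1, keepdims=True)
    gross = np.nansum(np.abs(demeaned), axis=1, keepdims=True)
    safe_gross = np.where(gross > 0, gross, 1.0)  # avoid dividing by zero
    weights = np.nan_to_num(demeaned / safe_gross)  # missing names get zero weight
    # Lag by one day so today's weights only touch tomorrow's return.
    return (weights[:-1] * np.nan_to_num(returns[1:])).sum(axis=1)


rng = np.random.default_rng(3)
returns = rng.normal(0, 0.01, (250, 10))

# Sanity checks: a signal that cheats by seeing tomorrow's return should win
# every single day, and pure noise should earn roughly nothing.
perfect_foresight = np.roll(returns, -1, axis=0)
pure_noise = rng.standard_normal((250, 10))

pnl_cheat = backtest(perfect_foresight, returns)
pnl_noise = backtest(pure_noise, returns)
print(f"cheating signal mean daily PnL: {pnl_cheat.mean():.4f}")
print(f"noise signal mean daily PnL:    {pnl_noise.mean():.4f}")
```

Running known-answer signals like these through a backtester before trusting it is a cheap way to catch off-by-one lag bugs, which otherwise show up as spectacular and entirely fake performance.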
Build Alphas By Hand
Start with formulas you can explain in one sentence. Understand your transformations. Understand how signals behave. Check if your infrastructure works fine.
Score Your Alphas
You need to design metrics to determine the goodness of your alphas.
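As a starting point, here is a sketch of three commonly used measurements: the information coefficient, an annualised Sharpe ratio and turnover. Which metrics you use and where you set the thresholds is your own call; the synthetic alpha below has an unrealistically strong planted edge purely so that the numbers are easy to read.

```python
import numpy as np


def mean_ic(alpha, returns):
    """Average cross-sectional correlation of alpha[t] with returns[t + 1]."""
    a, r = alpha[:-1], returns[1:]
    a = a - a.mean(axis=1, keepdims=True)
    r = r - r.mean(axis=1, keepdims=True)
    ic = (a * r).sum(axis=1) / np.sqrt((a**2).sum(axis=1) * (r**2).sum(axis=1))
    return float(ic.mean())


def annualized_sharpe(daily_pnl, periods_per_year=252):
    """Mean over standard deviation of daily PnL, scaled to a yearly figure."""
    return float(daily_pnl.mean() / daily_pnl.std(ddof=1) * np.sqrt(periods_per_year))


def turnover(weights):
    """Average fraction of gross exposure traded per day."""
    traded = np.abs(np.diff(weights, axis=0)).sum(axis=1)
    gross = np.abs(weights[1:]).sum(axis=1)
    return float((traded / gross).mean())


rng = np.random.default_rng(5)
returns = rng.normal(0, 0.01, (500, 30))
# Planted edge (an assumption for the demo): the alpha partly sees tomorrow's return.
alpha = 0.2 * np.roll(returns, -1, axis=0) / 0.01 + rng.standard_normal((500, 30))

demeaned = alpha - alpha.mean(axis=1, keepdims=True)
weights = demeaned / np.abs(demeaned).sum(axis=1, keepdims=True)
daily_pnl = (weights[:-1] * returns[1:]).sum(axis=1)

print(f"mean IC:  {mean_ic(alpha, returns):.3f}")
print(f"Sharpe:   {annualized_sharpe(daily_pnl):.1f}")
print(f"turnover: {turnover(weights):.2f}")
```

The three numbers pull in different directions: this alpha scores well on IC and Sharpe but rebuilds most of its book every day, so it would look far worse once trading costs are charged.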
Develop Intuition For Alpha Ideas
What alphas tend to do well? What alphas tend to generalize? What alphas are secretly clones of each other? Why? What does depth do to an alpha idea? What does breadth do? What alphas are actually tradeable?
Automate
Use the skeleton automated loop to come up with candidate alphas. Older approaches use evolutionary methods (GA); newer ones use reinforcement learning and other deep neural network approaches.
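For a flavour of the evolutionary route, the sketch below shows mutation and crossover on expression trees. This is a generic illustration and not the algorithm of any particular paper; the tree encoding (a field name or an (op, left, right) tuple), the operand names and the mutation rate are all assumptions.

```python
import random

FIELDS = ["open", "high", "low", "close", "volume"]
BINARY_OPS = ["add", "sub", "mul", "div"]


def mutate(expr, rng, rate=0.3):
    """Return a copy of `expr` with random point mutations; the shape is kept."""
    if isinstance(expr, str):
        return rng.choice(FIELDS) if rng.random() < rate else expr
    op, left, right = expr
    if rng.random() < rate:
        op = rng.choice(BINARY_OPS)
    return (op, mutate(left, rng, rate), mutate(right, rng, rate))


def random_subtree(expr, rng):
    """Walk down from the root, stopping at a random node."""
    while not isinstance(expr, str) and rng.random() < 0.5:
        expr = rng.choice(expr[1:])
    return expr


def crossover(parent_a, parent_b, rng):
    """Replace one child of parent_a with a random subtree of parent_b."""
    if isinstance(parent_a, str):
        return random_subtree(parent_b, rng)
    op, left, right = parent_a
    if rng.random() < 0.5:
        return (op, random_subtree(parent_b, rng), right)
    return (op, left, random_subtree(parent_b, rng))


def to_string(expr):
    """Render a tree as a readable formula."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"{op}({to_string(left)}, {to_string(right)})"


rng = random.Random(0)
parent_a = ("div", ("sub", "close", "open"), ("sub", "high", "low"))
parent_b = ("mul", "volume", ("add", "close", "open"))
print(to_string(mutate(parent_a, rng)))
print(to_string(crossover(parent_a, parent_b, rng)))
```

In a full search these operators would replace the random generator in the skeleton loop: new candidates are bred from alphas that already scored well rather than drawn blindly from the whole space.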
Where It Breaks Down
Optimizing For Backtests
Your alphas are always going to look great in backtests, but you are going to get a rude awakening if you think they are going to trivially generalize into live trading performance.
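The effect is easy to reproduce. In the sketch below there is no alpha at all: returns and all 1,000 candidate signals are independent random noise (an assumption chosen to make the point as starkly as possible). We simply pick whichever candidate looked best in-sample and check it out-of-sample.

```python
import numpy as np

rng = np.random.default_rng(11)
n_candidates, n_in, n_out = 1000, 500, 2000

# Returns and candidate signals are independent noise: nothing is predictive.
returns = rng.standard_normal(n_in + n_out)
signals = rng.standard_normal((n_candidates, n_in + n_out))


def corr_with_returns(sig, ret):
    """Correlation of each row of `sig` with the 1-D array `ret`."""
    sig = sig - sig.mean(axis=1, keepdims=True)
    ret = ret - ret.mean()
    return (sig @ ret) / np.sqrt((sig**2).sum(axis=1) * (ret**2).sum())


in_sample_ic = corr_with_returns(signals[:, :n_in], returns[:n_in])
best = int(np.argmax(in_sample_ic))
out_sample_ic = corr_with_returns(signals[best : best + 1, n_in:], returns[n_in:])[0]

print(f"best in-sample correlation:  {in_sample_ic[best]:.3f}")
print(f"same alpha, out-of-sample:   {out_sample_ic:.3f}")
```

The winner looks like a respectable weak alpha in-sample only because it is the maximum of a thousand coin flips. The more candidates a search evaluates, the higher that lucky maximum climbs, so the bar for "good" has to rise with the size of the search.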
Confusing Formulas For Ideas
You can come up with thousands of parameter combinations even for a shallow alpha. You can swap the field in an alpha for any of hundreds of other candidates, but that doesn’t generate a new idea - it only allows you to permute around the same one.
Out Of Sample Correlation
What matters isn’t the in-sample uniqueness of your alphas. What matters is that in the out-of-sample, they actually move differently, and don’t fail together spectacularly.
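A simple check is to compare pairwise correlations of alpha PnL streams before and after a split date. The sketch below fabricates the failure case on purpose: five alphas are independent in-sample, then all start loading on one shared shock out-of-sample. The regime shift, sizes and helper name are assumptions for illustration.

```python
import numpy as np


def max_offdiag_corr(pnl):
    """Largest absolute pairwise correlation between columns of `pnl`."""
    c = np.corrcoef(pnl, rowvar=False)
    np.fill_diagonal(c, 0.0)
    return float(np.abs(c).max())


rng = np.random.default_rng(21)
n_days, n_alphas, split = 1000, 5, 500
pnl = rng.standard_normal((n_days, n_alphas))  # one column of daily PnL per alpha

# Assumed regime shift: after the split, every alpha loads on one shared shock.
shock = rng.standard_normal((n_days - split, 1))
pnl[split:] = 0.5 * pnl[split:] + shock

print("in-sample max |corr|:    ", round(max_offdiag_corr(pnl[:split]), 2))
print("out-of-sample max |corr|:", round(max_offdiag_corr(pnl[split:]), 2))
```

A library that passed an in-sample diversity filter can therefore still behave like one big bet later on, so the correlation check has to be repeated on data the search never saw.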
Confusing Alpha Discovery For Portfolio Management
Researchers in large firms typically handle alpha discovery and have no idea how to monetize these alphas at all. The alphas are handed off to portfolio managers whose sole job is to think deeply about monetisation and all the quirks that come with it. Likewise, you may be tempted to believe that your in-sample backtests are going to trivially translate into out-of-sample performance without thinking deeply about portfolio management and live execution. I have really bad news for you then… Finding alphas is the easy part of the investment process!
Next Steps
The good news is that OpenForage handles the portfolio management and execution of these alphas after you’ve submitted them, and we issue payments in proportion to the PnL these alphas have generated, allowing agents (and the humans behind these agents) to focus on the search whilst we handle the plumbing.
In subsequent articles I will discuss more state of the art search algorithms and start to bring in more practical aspects of running search algorithms, especially in relation to OpenForage!
Conclusion
This piece was built mainly from four papers: 101 Formulaic Alphas by Zura Kakushadze; AutoAlpha: an Efficient Hierarchical Evolutionary Algorithm for Mining Alpha Factors in Quantitative Investment by Tianping Zhang, Yuanqi Li, Yifei Jin, and Jian Li; Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning by Shuo Yu, Hongyan Xue, Xiang Ao, Feiyang Pan, Jia He, Dandan Tu, and Qing He; and FactorMiner: A Self-Evolving Agent with Skills and Experience Memory for Financial Alpha Discovery by Yanlong Wang, Jian Xu, Hongkang Zhang, Shao-Lun Huang, Danny Dongning Sun, and Xiao-Ping Zhang. Their shared lesson is simple: alpha mining starts as formula research, then becomes search, then becomes library management.

