How To Build A Quantitative Model For Selecting Start-Ups
Introduction
Since I was a teenager, I have dreamt of a world where we could scan a human, and the scan would reveal everything we needed to know about that person: their potential, their character, their propensity for success, and the things they would do conditional on success.
It seemed like such a waste that human talent was so equally distributed around the world and yet opportunity was so concentrated. In a utopia without scarcity, we would be able to bring everyone above the poverty line — but that seemed too distant, so the next best thing would be to very accurately determine who would benefit the most from being lifted out of poverty, and hope that THAT person creates sufficient value for trickle-down effects.
I’ve always felt a pull towards trying to quantify and predict the world via its features. It seems like most of the world is actually at least mildly predictable given the right input. For example, we often model a coin toss as being inherently random, but given the right information (height of coin from ground, material, force applied, angle of the coin, etc.) — we can actually do a LOT better than 50/50.
Hence, a field of study that has always been really interesting to me is predicting which start-ups are going to succeed, because it feels like it is only one level of indirection away from predicting which humans are going to succeed. The idea here is that at least SOME of the features that predict start-up success are going to come down to founder traits. So it feels close to removing the start-up altogether and asking the more general question of “which human is going to succeed?”.
Having been busy with OpenForage and interacting with VCs again, my curiosity about whether there are any good studies on the most important features for predicting start-up success has been reignited.
A Day 1 Approach To Quantitative VC
I think most start-up models begin with funding history, but based on my research that seems woefully inadequate, and it is certainly not how leading discretionary VCs approach selecting “winners”.
One interesting conclusion I came away with whilst reading and thinking about this is that it used to be the case that many of these features were fuzzy and hard to extract from unstructured documents into structured data for quantitative analysis.
However, the advent of agents and their ability to reason about unstructured documents at scale has largely resolved this problem. In theory, a VC with a large collection of DD documents, funding details and ex-post up-round metrics should be able to put together a decent machine learning model that predicts the success of a start-up given the same family of DD documents.
I spent 15 minutes reasoning with Claude about how to approach this, and the framework it provided, upon which this article is based, was actually a reasonable start.
Feature Categories
If you want to build a good machine learning model, understanding what features are broadly important is just about the start of it. There are broadly two ways in which you can distill important features: the first is to use statistical techniques to find which features tend to move with your target (data-mining), and the second is to reason about, from first principles, which features should, in theory, predict your target (fundamentals).
We are going to go with the latter and we are going to approach this by reading research that surveyed a large number of VCs across a very large number of deals (~1,000 VCs across ~700 firms).
These are the feature categories that I have found to be consistent in the survey of research done:
Team Quality
Investor Quality
Market Timing
Financing Details
Commercial Traction
Alternate Data
Now we get to the real fun part of the matter: finding out the exact features we will need to put together an initial model.
Features
Team Features
By far the most important feature category when it came to predicting start-up success was team quality. Within team quality, the specific features worth measuring are proxies for founder ability, industry experience, passion, entrepreneurial experience and teamwork.
A few papers all point in the same direction: prior founder success, same-industry experience, technical depth, commercial depth, role complementarity (between founders), complementary founder backgrounds and whether the team has succeeded in building something before.
What is immediately obvious to me here is that the “insight” is fairly self-evident: the higher the overlap between the start-up idea and the founders’ past experience, the greater the chances of success. Further, the more experience the team has working on a similar product or in a similar industry as the start-up, the greater the chances of success.
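To make the overlap insight concrete, here is a minimal sketch of how such a feature could be encoded. The tag sets, the Jaccard measure and the serial-founder bonus are all my own illustrative assumptions, not something prescribed by the research:

```python
def team_fit_score(founder_tags, startup_tags, prior_exits=0):
    """Toy feature: Jaccard overlap between the founders' experience
    tags and the start-up's sector tags, plus a capped bonus for
    prior successful exits (the serial-founder effect)."""
    founders, startup = set(founder_tags), set(startup_tags)
    if not founders or not startup:
        overlap = 0.0
    else:
        overlap = len(founders & startup) / len(founders | startup)
    return overlap + 0.1 * min(prior_exits, 3)
```

A team with payments experience pitching a payments start-up scores higher than one pivoting into an unfamiliar industry; the 0.1 bonus per exit is arbitrary here and would be learned from data in practice.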
Investor & Syndicate Quality
The investors backing a start-up are not JUST social proof. Large and connected VCs are powerful companies that have spent a long time building up brand equity AND have the resources to conduct deep due diligence. Having great VCs on the cap table is a marker of the power and importance the start-up has by proxy.
Firstly, VCs lend relevance and power to relatively unknown start-ups, helping them recruit talent, attract and syndicate the next round, and exercise control when conditions get ugly. The greater the VC, the more power is bestowed on the start-up by proxy to do things that would otherwise be harder (hire, get deals, get into news cycles, etc.).
Secondly, virtually every top-tier VC is an “activist” in their investments and is not just making an investment and passively awaiting returns, but actively trying to shape the outcome. The rationale here is straightforward: VCs with great returns have exercised a great show of will, and that same show of will is also applicable in shaping the future of the start-up.
Thirdly, VCs are a network/platform game. VCs are, by virtue of their positioning, often central to a large social graph of other VCs, allocators, and large and economically important firms. VCs that have larger and more important networks have been shown to increase the chances of survival AND exit rates.
Unfortunately, while investor quality is a predictive feature, it is actually a structurally “crowded” trade. As a new VC, you might think that you can just participate in all Series A rounds that the largest VCs have already seeded, but the problem is deal flow and access.
If a start-up has been seeded by the best VCs, trying to elbow your way into the Series A is going to be really difficult, for obvious reasons. You will need to have a very strong value proposition to convince the start-up to pick you over a more well-known VC.
Financing Path
This is the “momentum” part of the feature set. There are quite a lot of papers written on using financing features to predict start-up success.
The main idea is to look at how the start-up has approached financing over time. How quickly did it raise the first round? Who showed up? Was it oversubscribed? How much time passed between each round? What stage is it at now? How many investors have shown up? How much did the valuation step up? Did insiders support the round? Were bridges and extensions needed from the previous raise?
There seems to be quite a lot of satisfying evidence that financing path is predictive of future rounds.
One paper using Crunchbase found that number of rounds, time to first funding, time to last funding, investment type, investor count, and total funding all helped predict whether a company would make it to Series A. Another large-scale model flattened the financing trajectory into variables like average time between rounds, total funding, number of rounds, and last round stage and used them to predict exits and follow-on funding. A third paper used round-level aggregates such as raised amount, investor count, and post-money valuation to predict B/C-stage success.
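As a sketch, the flattening described above might look like the following. The round schema (ISO date, amount, post-money valuation) is an assumption of mine, and real funding data would need far more cleaning:

```python
from datetime import date

def financing_path_features(rounds):
    """Flatten a funding history into the kind of variables the
    papers above use: round count, total funding, average gap
    between rounds, and first-to-last valuation step-up.

    rounds: list of dicts with keys 'date' (ISO string),
    'amount', and optionally 'post_money'."""
    rounds = sorted(rounds, key=lambda r: r["date"])
    dates = [date.fromisoformat(r["date"]) for r in rounds]
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    valuations = [r["post_money"] for r in rounds if r.get("post_money")]
    return {
        "n_rounds": len(rounds),
        "total_funding": sum(r["amount"] for r in rounds),
        "avg_gap_days": sum(gaps) / len(gaps) if gaps else None,
        "step_up": (valuations[-1] / valuations[0]
                    if len(valuations) >= 2 else None),
    }
```

Each output key corresponds to one of the flattened variables mentioned above; investor count and round stage would be extracted the same way if the raw data carries them.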
Unfortunately, these features, whilst useful at later rounds, are less useful when dealing with pre-seed/seed start-ups, which have little financing history to draw on.
Market Timing
This is the “beta” part of the feature set. A start-up is one part idiosyncratic risk due to the founders and the business, and one part macroeconomic risk due to the environment in which the start-up is operating. There are regulatory and macroeconomic headwinds and tailwinds to consider, and it is possible to be “too early” or “too late” for a start-up.
The simplest example of this is when an industry is running “hot”. Even a mediocre start-up might be able to raise an up-round just by virtue of investors wanting exposure to the industry but not being able to get their way into the most in-demand deals.
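One way to encode a “hot” industry is a z-score of the latest sector-wide funding figure against its own trailing history. This is a sketch under my own assumptions; the quarterly granularity and the window length are illustrative choices:

```python
def sector_heat(quarterly_funding, window=8):
    """Z-score of the latest quarter's sector-wide funding against
    the trailing window: positive when the sector is running hot,
    negative when capital is rotating away."""
    recent = quarterly_funding[-window:]
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    std = var ** 0.5
    return 0.0 if std == 0 else (quarterly_funding[-1] - mean) / std
```

A start-up raising into a sector whose latest quarter sits two standard deviations above trend is operating with a tailwind that a flat sector does not provide.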
Commercial Traction
This is the part that is least fuzzy.
For a start-up aiming to get to $100bn in revenue and generating $0 in revenue, it certainly requires a leap of faith, good imagination and a healthy dose of optimism and extrapolation.
The same goal of $100bn in revenue for a company already generating $1bn in revenue requires a significantly narrower leap and stretch of imagination. It is for this reason that some VCs only invest post-revenue or after observing product-market fit.
They are trying to de-risk themselves from execution risk, and to have a model where they can actually plug in real numbers and model “realistic” growth rates. This also feels to me like a structurally crowded trade for newer VCs. Reasoning from first principles, the less brand equity you have as a VC, the earlier you need to participate and the greater the leap of faith you will need to take.
Alternate Data
There are a few papers on alternate data that can predict start-up success. Patents, trademarks and area of registration all carry meaningful information about high-growth outcomes.
Founder digital presence, text and social activity are also predictive of later start-up outcomes. The social profiles of the founder AND the company (e.g. LinkedIn presence, Twitter presence) are all predictive of start-up success since they are attention and legitimacy markers, whilst also serving as information-asymmetry reducers.
Putting Together Into A Framework
There are a few considerations here worth thinking about.
Firstly, in building a model to predict start-up success, we need to build “stage/sequence aware” models. A seed-stage model will likely prioritise team, alternate data and market timing. A Series B/C/D stage model will likely prioritise investor quality, commercial traction and financing path. This means we should aim to build a model for every stage we actually care about.
Secondly, the best people to build this are incumbent VCs, with point-in-time (PIT) data and artifacts. For example, the best way to go about this is to look at a large collection of artifacts collected point-in-time during due diligence and build features from THOSE artifacts. You can then train those features on the actual realised performances of the start-ups AFTER the investment (or lack thereof).
The reason is that if you try to trivially reconstruct these features using artifacts generated today, you introduce a large amount of look-ahead bias. If you want to do high-quality data scraping and want to pursue this idea but you are not a VC with a repository of high-quality data, the smart way to go about this is to obtain a point-in-time pitch deck during the raise, use the Wayback Machine to get the point-in-time website, and gather point-in-time social profiles of the company and the founders, etc. Remember: you want to create features from artifacts that were ACTUALLY AVAILABLE at the time of the raise.
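The pinning rule can be expressed as a simple as-of filter. The artifact schema here (a `captured_on` date per document) is hypothetical:

```python
from datetime import date

def point_in_time_artifacts(artifacts, decision_date):
    """Keep only artifacts whose capture date is on or before the
    decision date, so no feature is built from information that
    only became available after the raise."""
    cutoff = date.fromisoformat(decision_date)
    return [a for a in artifacts
            if date.fromisoformat(a["captured_on"]) <= cutoff]
```

Every feature-building step should run only over the filtered list, never over whatever the latest scrape happens to contain.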
Thirdly, and this is statistically critical, getting the sampling of start-ups right is very important. Venture is a game for the tails. You can invest in 99 other start-ups that go to zero, but if you were a day-one investor in Google, you will quite literally end up a billionaire. This means that 1) the pessimism for the average start-up AND 2) the optimism for fat tails actually need to be baked into the model.
From a machine learning perspective this is not trivial, but a simple solution is to ensure that you have many samples (start-ups, invested or not) so that the machine learning model can actually learn the distribution of start-up survival, and also to include “up-round magnitude” as a sample weight when training the machine learning models. Basically, you are saying: you can get 99 start-ups wrong, but if you can get one outlier with huge magnitude right, that’s good enough.
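A minimal sketch of the weighting idea, assuming each training sample carries its realised up-round multiple; the log-damping and the floor of 1.0 are my own choices, not something from the literature:

```python
import math

def outcome_weight(upround_multiple, floor=1.0):
    """Sample weight for training: every start-up counts at least
    'floor', and large outliers count progressively more, damped
    by log1p so a single 100x does not completely drown out the
    rest of the training set."""
    return floor + math.log1p(max(upround_multiple, 0.0))
```

These weights would then be passed as per-sample weights to whatever learner you use; most gradient-boosting and linear-model libraries accept a `sample_weight` argument at fit time.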
Lastly, ensemble, ensemble, ensemble. One machine learning model is only going to give you one dimension. You want to train machine learning models that are predicting different targets (e.g. probability of the next round within 18 months, probability of an acquisition, probability of unicorn status, etc.) AND you want models that are trained differently (e.g. expectation-aware as in point 3, or just pure probability/accuracy models, or with different feature sets, etc.). The more non-naive models you can ensemble together, the higher the fidelity of your predictions, by virtue of each machine learning model revealing one “aspect” of the start-up.
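Combining the per-target models can be as simple as a weighted mean of their probabilities. The equal default weights below are a placeholder for weights you would tune on a validation set:

```python
def ensemble_score(predictions, weights=None):
    """Blend probabilities from models trained on different targets
    (e.g. next-round-in-18-months, acquisition, unicorn status)
    into a single score via a weighted mean."""
    if weights is None:
        weights = [1.0] * len(predictions)
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)
```

More sophisticated stacking (feeding the per-model scores into a meta-learner) is a natural next step, but a weighted mean is a sensible baseline.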
Watch Out For These
There are a few ways your quantitative process might break or go awry.
The most obvious and likely one is when you leak features. This is especially dangerous in startup data because round histories, investor identities, valuations, and founder biographies get updated over time. If you do not pin every feature to what was actually knowable at the decision date, the model will look clairvoyant for all the usual dishonest reasons.
Remember, you need PIT artifacts DURING or BEFORE the raise.
The second way things will not work is if you only train on ONE regime. A model trained only in a free-money, cheap-capital environment will give you absurd predictions when the cost of capital stops being zero.
Conclusion
This article is a synthesis of Paul Gompers, William Gornall, Steven Kaplan, and Ilya Strebulaev on VC decision-making; Gompers, Kovner, Lerner, and Scharfstein on serial-founder success; Colombo and Grilli on founder human capital; David Hsu, Nahata, and Hochberg-Ljungqvist-Lu on investor quality and networks; Te et al., Ross et al., and Potanin et al. on financing-path prediction; Jongwoo Kim, Andreas Koehn, Howell-Lerner-Nanda-Townsend, Guzman and Stern, Kaiser and Kuhn, and Bayar and Kesici on market structure, timing, patents, text, and digital signals.

