<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Systematic Long Short]]></title><description><![CDATA[I’m an entrepreneur and systematic portfolio manager. I’ve sat in the PM seat at pod shops and managed real money at scale and seen how the sausage gets made. I want to help you understand how the game is played at the highest levels.]]></description><link>https://www.systematiclongshort.com</link><image><url>https://substackcdn.com/image/fetch/$s_!NyAB!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8da33f-314a-469c-a2f3-246ce6814b86_400x400.png</url><title>Systematic Long Short</title><link>https://www.systematiclongshort.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 17 May 2026 03:31:27 GMT</lastBuildDate><atom:link href="https://www.systematiclongshort.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Systematic Long Short]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sysls@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sysls@substack.com]]></itunes:email><itunes:name><![CDATA[Systematic Long Short]]></itunes:name></itunes:owner><itunes:author><![CDATA[Systematic Long Short]]></itunes:author><googleplay:owner><![CDATA[sysls@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sysls@substack.com]]></googleplay:email><googleplay:author><![CDATA[Systematic Long Short]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[My Obsession With Predicting The 
World]]></title><description><![CDATA[Introduction]]></description><link>https://www.systematiclongshort.com/p/my-obsession-with-predicting-the</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/my-obsession-with-predicting-the</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Wed, 13 May 2026 13:23:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!150f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Recent conversations have left me thinking about my motivations, and the driving force behind the hours I put into my work at the expense of nearly everything else.</p><p>I don&#8217;t think I&#8217;ve ever written about this publicly before. For as long as I&#8217;ve had ambition, I&#8217;ve been driven by a desire to build a world model so adept at predicting the world that it can drive out inefficiencies in all markets, public or private.</p><p>If you&#8217;ve ever watched Westworld, yes, I&#8217;m talking about something like Rehoboam: a superintelligence that could map human behavior and predict the future. 
Interestingly enough, Rehoboam started out as a model for predicting stock prices (really).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!150f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!150f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 424w, https://substackcdn.com/image/fetch/$s_!150f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 848w, https://substackcdn.com/image/fetch/$s_!150f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!150f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!150f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg" width="680" height="383" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:&quot;Image&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!150f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 424w, https://substackcdn.com/image/fetch/$s_!150f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 848w, https://substackcdn.com/image/fetch/$s_!150f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!150f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73714579-86fa-40eb-8d9e-8e278bc40785_680x383.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s a large part of why being a &#8220;quant&#8221; has interested me. The markets are the ultimate testing field for your ability to predict the future across various timescales, and it&#8217;s an immensely scalable endeavor with a positive flywheel: the better you are at predictions, the more resources you get to become better at predictions, and so on.</p><h1>I Have A Dream</h1><p>One of my fears is that I&#8217;ll die before reaching my grand ambition of predicting all the useful things society needs predictions about, and be forever immortalized as just another human obsessed with making money for the sake of making money.</p><p>My reason for being is that I believe it&#8217;s possible to build an entity so adept at predictions that it can minimize inefficiencies across all human systems. 
That we can build a world model that guides humanity toward decisions better than heuristics or random guesses.</p><p>This means we could fund the startup ideas with the greatest predicted impact, or issue scholarships to children with nascent genius. We could even ask: which president should be elected to increase the probability of global peace and prosperity?</p><p>I think humans are in a race against time. As a species, we have finite time to expand beyond the limits of our current prison. When facing finite resources, we need to maximize efficiency to improve our odds of reaching the next bound before we run out and perish. If we never figure out how to escape Earth before resources are consumed, or before conditions make leaving impossible, we&#8217;ll inevitably die as a species in a prison of our own making. </p><blockquote><p>Oh me! Oh life! of the questions of these recurring,</p><p>Of the endless trains of the faithless, of cities fill&#8217;d with the foolish,</p><p>Of myself forever reproaching myself, (for who more foolish than I, and who more faithless?)</p><p>Of eyes that vainly crave the light, of the objects mean, of the struggle ever renew&#8217;d,</p><p>Of the poor results of all, of the plodding and sordid crowds I see around me,</p><p>Of the empty and useless years of the rest, with the rest me intertwined,</p><p>The question, O me! so sad, recurring&#8212;What good amid these, O me, O life?</p><p>Answer.</p><p>That you are here&#8212;that life exists and identity,</p><p>That the powerful play goes on, and you may contribute a verse.</p><p>Walt Whitman, O Me! O Life!</p></blockquote><p>I hope my verse will be an entity that makes humanity more efficient, that nudges our probability of reaching the next bound up a little.</p><h1>Predictions</h1><p>I&#8217;ve been obsessed with predictions since I was a teen. I remember being endlessly fascinated by how accurately physics could predict the future. Simple models had profound implications. 
They showed us that prediction was possible, that we could know where an object would end up if we knew enough about its initial conditions and the system it was operating in.</p><p>We model systems as random when our predictions of them are no better than guessing, i.e., when we don&#8217;t have enough information about the system to generate a prediction better than chance. </p><p>For example, if we were asked to predict (assign probabilities to) whether a coin lands heads or tails, our initial instinct would be to assign equal probability to both. That would be our best prediction with no further information about the coin flip. But if we knew the initial face of the coin, the height of the flip, the force being applied, the dimensions and mass of the coin, and the material it would land on, we could construct a model that predicts the landing face with significantly greater accuracy than chance.</p><p>While I believe there&#8217;s a limit to determinism, I also deeply believe that when it comes to human systems, we haven&#8217;t even scratched the surface of what&#8217;s possible. The data we&#8217;ve needed has traditionally been either highly sparse or highly unstructured, and the models and compute capable of finding generalizable patterns in such high-dimensional data were previously unavailable.</p><p>The principle that more relevant data yields better predictions given a sufficiently powerful model holds true even as systems grow more complex. So an entity that endeavored to predict the world would need to accumulate as much data as possible. We&#8217;ve learned in recent years, with the advent of LLMs, that with sufficient scale and complexity in data and model, a form of intelligence emerges.</p><h1>Machine Learning And Higher Dimensions</h1><p>Data is inert on its own. You need a model to transform data into actual predictions. 
Further, which model is best for producing useful predictions is a non-trivial question, thanks to the no-free-lunch theorem. Fortunately, advances in machine learning give us a rather satisfactory solution: letting the data dictate the form of the model, rather than having a theory dictate it.</p><p>Really large and powerful machine learning models can pick up high-dimensional patterns that tend to generalize, because the underlying behavior driving them is reliably repeating.</p><p>This means there&#8217;s a world where we might use the Substacks an entrepreneur follows or the podcasts he&#8217;s subscribed to as features to predict his probability of success, long before he&#8217;s even incorporated his company. The patterns we&#8217;re interested in are more about his desires, drives, and views of the world, and less about how he spends 20 minutes of his time each day. We can glimpse those beliefs from tangential data.</p><p>To push this further, we might learn about a person&#8217;s potential from their childhood experiences, their belief systems, attributes, and character, long before they become adults. Wouldn&#8217;t it be nice to allocate capital to genius in a part of the world where it would otherwise go to waste?</p><h1>Conclusion</h1><p>So we know we need vast amounts of data, and vast amounts of computing power to train enormous models to learn high-dimensional patterns from that data. </p><p>That translates to a very expensive endeavor, but the good news is this: data and compute are getting cheaper with every passing day, and unstructured data is becoming less and less of an issue with the parsing powers of LLMs. We are also getting scarily good at training enormous god models that can learn patterns in dimensions so large it&#8217;s unfathomable to the human mind.</p><p>Yet, to build something of such scale, we would need a business that could sequentially finance its scale and ambitions. 
Starting with predictions in liquid markets seems like the obvious choice, before moving to progressively harder markets and systems. You can see how one might be motivated to pursue the path I have pursued if you think along these lines...</p><p>So, if I die while in the midst of running something that resembles a simple cash grab, know that my shame is immeasurable, and that it was never supposed to be the end destination!</p><h1>Source</h1><p>I just know someone is going to say I retconned this for openforage, but I had actually written this in 2023, and while AI means that screenshots now mean very little...</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 424w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 848w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 1272w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fTYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png" width="472" height="230" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:472,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:&quot;Image&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!fTYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 424w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 848w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 1272w, https://substackcdn.com/image/fetch/$s_!fTYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe36a38c3-636b-4ae3-bf44-87e20da47e86_472x230.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[How To Stop Losing Money To DeFi 
Hacks]]></title><description><![CDATA[Building @openforage and reading about the myriad hacks of DeFi protocols have put the fear of "state actors" in me. They are sophisticated, well-resourced, and play the extreme long game: super-villains singularly focused on combing every crevice of your protocol and infrastructure for exploits, while your average protocol team has their attention split six ways running the business.
I don't pretend to be a security expert, but having led teams in high-stakes environments (both in the military and in high finance with large sums of money), I am a seasoned operator in thinking about and planning for contingencies.
I truly believe only the paranoid survive. No team ever sets out thinking "I am going to be careless and lackluster about my approach to security", and yet hacks happen. We need to do better.]]></description><link>https://www.systematiclongshort.com/p/how-to-stop-losing-money-to-defi</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-stop-losing-money-to-defi</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Mon, 27 Apr 2026 13:29:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/702dbe96-e185-4c62-b588-3bb03d4a8336_1267x549.png" length="0" type="image/png"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>Building <a href="https://x.com/@openforage">@openforage</a> and reading about the myriad hacks of DeFi protocols have put the fear of &#8220;state actors&#8221; in me. They are sophisticated, well-resourced, and play the extreme long game: super-villains singularly focused on combing every crevice of your protocol and infrastructure for exploits, while your average protocol team has their attention split six ways running the business.</p><p>I don&#8217;t pretend to be a security expert, but having led teams in high-stakes environments (both in the military and in high finance with large sums of money), I am a seasoned operator in thinking about and planning for contingencies.</p><p>I truly believe only the paranoid survive. No team ever sets out thinking &#8220;I am going to be careless and lackluster about my approach to security&#8221;, and yet hacks happen. 
We need to do better.</p><h1>AI Means This Time It&#8217;s Different</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1kj8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1kj8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 424w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 848w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 1272w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1kj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png" width="744" height="392" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:744,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1kj8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 424w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 848w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 1272w, https://substackcdn.com/image/fetch/$s_!1kj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53bd4f0d-ff67-48ac-92a6-08b8857d5add_744x392.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hacks are not uncommon, but the frequency has clearly increased. Q1 of 2026 saw the highest number of DeFi hacks ever recorded, and while Q2 has JUST begun, it is already on track to break the previous quarter&#8217;s record.</p><p>My central hypothesis is that AI has drastically reduced the cost of combing for exploits, and greatly increased the attack surface. A human takes many weeks to comb through the settings of a hundred protocols for misconfigurations; the latest foundation models do it in a few hours.</p><p>This should drastically change the equation of thinking about and reacting to hacks. 
Older protocols, accustomed to security measures from before AI became competent, are increasingly at risk of being smoked.</p><h1>Thinking In Surfaces &amp; Layers</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!82-b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!82-b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!82-b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!82-b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!82-b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!82-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png" width="1200" height="675" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!82-b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!82-b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!82-b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!82-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F212e3726-d262-48f8-bf06-f044d6f3397e_1200x675.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t-uE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t-uE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 424w, https://substackcdn.com/image/fetch/$s_!t-uE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 848w, 
https://substackcdn.com/image/fetch/$s_!t-uE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 1272w, https://substackcdn.com/image/fetch/$s_!t-uE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t-uE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png" width="759" height="455" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:759,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t-uE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 424w, https://substackcdn.com/image/fetch/$s_!t-uE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 848w, 
https://substackcdn.com/image/fetch/$s_!t-uE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 1272w, https://substackcdn.com/image/fetch/$s_!t-uE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aee3d8d-f14a-4d34-b931-75c286c3eee2_759x455.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The surface area of hacks reduces to, in practice just three: Protocol Team, Smart Contracts &amp; Infrastructure, User Trust Boundaries 
(DNS, Social Media, etc.).</p><p>Once you&#8217;ve identified the surfaces, layer in defenses:</p><ul><li><p>Prevention: Processes that, if followed, minimize the probability of being exploited.</p></li><li><p>Mitigation: Prevention has failed. Limit the damage.</p></li><li><p>Halt: Nobody makes their best decisions under pressure. Hit the master kill switch the moment you confirm an attack. Freezing prevents further damage and buys space to think and...</p></li><li><p>Retake: If you&#8217;ve lost control of toxic or compromised components, jettison and replace them.</p></li><li><p>Recovery: Seize back what you&#8217;ve lost. Plan ahead for contacting institutional partners that can freeze funds, undo transactions, and aid investigation.</p></li></ul><h1><strong>Principles</strong></h1><p>These principles guide the actions we can take to implement the layers of defenses.</p><h2><strong>Use Frontier AI Liberally</strong></h2><p>Use frontier-model AI liberally to scan your codebase and configs for vulnerabilities, and to red-team across a large surface area: try to find vulnerabilities in your frontend; see if they reach your backend. Attackers are going to do this. What your defensive scan can find, their offensive scan would have found.</p><p>Use <a href="https://x.com/hrkrshnn/status/2032809487657238996">skills</a> like <a href="https://github.com/pashov/skills/tree/main/solidity-auditor">pashov</a>, <a href="https://github.com/0xiehnnkta/nemesis-auditor">nemesis</a>, and AI platforms like <a href="https://cantina.xyz/welcome">Cantina (Apex)</a> and <a href="https://v12.sh/">Zellic (V12)</a> to quickly scan your codebase before committing to full audits.</p><h2><strong>Time And Friction Are Good Defenses</strong></h2><p>Layer in multi-step processes with timelocks for anything potentially damaging. 
You want plenty of time to step in and freeze once you smell something.</p><p>The old argument against timelocks and multi-step setters was the friction they create for protocol teams. You have much less to worry about now: AI can easily click through these frictions in the background.</p><h2><strong>Invariants</strong></h2><p><a href="https://www.nascent.xyz/idea/youre-writing-require-statements-wrong">Smart contracts can be built defensively by writing down the immutable &#8216;facts&#8217; that, IF broken, break the entire logic of your protocol.</a></p><p>The crown invariant of <a href="https://x.com/@openforage">@openforage</a> centers on solvency (if total asset backing falls below total claims, the protocol collapses):</p><p><strong>VaultAssets + DeployedAssets &gt;= OutstandingClaims</strong></p><p>You typically have a handful of invariants. Promote them to code sparingly; enforcing multiple per function gets unwieldy.</p><h2><strong>Balance Of Powers</strong></h2><p>Many hacks come from compromised wallets. You want configurations where even if a multisig is compromised, you can arrest damage quickly and bring the protocol to a state where governance can make decisions.</p><p>This requires a balance between GOVERNANCE, which decides everything, and RESCUE, the abilities to restore governable stability (without being able to replace or overthrow governance itself).</p><h2><strong>Something Is Going To Go Wrong</strong></h2><p>Start with the assumption that however smart you are, you will get hacked. Your smart contracts or dependencies might fail. You might get social engineered. A new upgrade might introduce a vulnerability you weren&#8217;t prepared for.</p><p>Once you think this way, rate limits that throttle damage and circuit breakers that lock down the protocol become your best friends. Limit damage to 5-10%, freeze, then game out your response. 
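</p><p>Rate limits and circuit breakers are concrete enough to sketch. The following toy Python model (the <code>CircuitBreaker</code> class, its numbers, and its API are illustrative inventions, not any real protocol&#8217;s code) caps outflows at 5% of TVL per day and hard-freezes the moment the cap is hit:</p>

```python
import time

class CircuitBreaker:
    """Toy model: throttle outflows to a fraction of TVL per rolling window,
    and hard-freeze the whole protocol once the limit is hit."""

    def __init__(self, tvl, max_outflow_pct=0.05, window_secs=86_400):
        self.tvl = tvl
        self.max_outflow = max_outflow_pct * tvl   # e.g. 5% of TVL per day
        self.window_secs = window_secs
        self.window_start = time.time()
        self.outflow_in_window = 0.0
        self.frozen = False

    def withdraw(self, amount, now=None):
        now = time.time() if now is None else now
        if self.frozen:
            raise RuntimeError("protocol frozen: only governance can unpause")
        if now - self.window_start >= self.window_secs:
            self.window_start, self.outflow_in_window = now, 0.0  # roll window
        if self.outflow_in_window + amount > self.max_outflow:
            self.frozen = True    # circuit breaker trips: freeze, then think
            raise RuntimeError("rate limit exceeded: freezing protocol")
        self.outflow_in_window += amount
        self.tvl -= amount
```

<p>Under this model, even an infinite-mint or full-drain bug can extract at most one window&#8217;s cap before everything halts and humans take over.</p><p>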
Nobody makes their best decisions with bullets in the air.</p><h2><strong>The Best Time To Plan Is Now</strong></h2><p><a href="https://docs.google.com/document/d/1DaAiuGFkMEMMiIuvqhePL5aDFGHJ9Ya6D04rdaldqC0/edit?tab=t.0#heading=h.c4h2beeflqpo">The best time to think about your response is before you get hacked</a>. Codify as much of the process as possible and rehearse with your team so you are not scrambling at impact. In the age of AI, that means having skills and algorithms that surface as much information as possible, as fast as possible, sharable in both summary and long form to your inner circle.</p><h2><strong>The Name Of The Game Is Survival</strong></h2><p>You don&#8217;t need to be perfect, but you sure as hell need to survive. No system is impenetrable from day 1; through multiple iterations, you become anti-fragile by incorporating lessons. </p><p>The lack of evidence of being hacked is not evidence that you are not susceptible. The point of maximum comfort is going to be the point of maximum danger.</p><h1><strong>Preventions</strong></h1><h2><strong>Smart Contract Design</strong></h2><p>Once you&#8217;ve identified the invariants, promote them into runtime checks. Think carefully about what invariants are actually <strong>practical</strong> to enforce.</p><p>This is the <strong>FREI-PI</strong> (Function Requirements, Effects, Interactions, Protocol Invariants) pattern: at the end of every function that touches value, re-verify the crown invariants the function promised to preserve. Many drains (flash-loan sandwiches, oracle-assisted liquidation griefs, cross-function solvency drains) that pass CEI (Checks-Effects-Interactions) get caught by an end-of-function invariant check.</p><h2><strong>Good Testing</strong></h2><p>Stateful fuzzing builds random sequences of calls against the protocol&#8217;s full public surface, asserting invariants at each step. 
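</p><p>In spirit (real setups would reach for tools like Foundry invariant tests or Echidna rather than Python), a stateful fuzzer is just a random call-sequence driver plus an invariant check. Here is a toy version that hunts for insolvency in a deliberately buggy vault; all names and the planted bug are mine, for illustration:</p>

```python
import random

class ToyVault:
    """Deliberately buggy vault. Crown invariant:
    VaultAssets + DeployedAssets >= OutstandingClaims."""
    def __init__(self):
        self.vault = 0.0      # idle assets
        self.deployed = 0.0   # assets put to work
        self.claims = 0.0     # user claims against the protocol

    def deposit(self, amt):
        self.vault += amt
        self.claims += amt

    def withdraw(self, amt):
        amt = min(amt, self.claims)
        self.claims -= amt
        self.vault -= amt

    def deploy(self, amt):
        amt = min(amt, self.vault)
        self.vault -= amt
        self.deployed += amt * 0.99   # BUG: 1% lost in transit, claims unchanged

def solvent(v):
    return v.vault + v.deployed >= v.claims - 1e-9

def stateful_fuzz(seed, steps=200):
    """Drive a random call sequence, checking the invariant after every call.
    Returns the first call trace that breaks solvency, else None."""
    rng, v, trace = random.Random(seed), ToyVault(), []
    for _ in range(steps):
        call = rng.choice(["deposit", "withdraw", "deploy"])
        amt = rng.uniform(0, 100)
        getattr(v, call)(amt)
        trace.append((call, round(amt, 2)))
        if not solvent(v):
            return trace
    return None
```

<p>The fuzzer quickly surfaces a deposit-then-deploy sequence that breaks solvency, exactly the kind of multi-call path that single-transaction tests miss.</p><p>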
Most production exploits are multi-transaction, and stateful fuzzing is just about the only reliable way of finding those paths before the attackers do.</p><p>Use invariant tests that assert a property holds for ANY call sequence the fuzzer can generate. Complement with formal verification, which proves a property across all reachable states. Your crown invariants absolutely should get this treatment.</p><h2><strong>Oracles And Dependencies</strong></h2><p>Complexity is the enemy of security.</p><p><a href="https://www.nascent.xyz/idea/why-defi-is-broken-and-how-to-fix-it-pt-1-oracle-free-protocols">Every external dependency extends the attack surface</a>. If you&#8217;re designing primitives, push the choice of who and what to trust to users. If you can&#8217;t remove dependencies, diversify them so no single point of failure craters your protocol.</p><p>Extend your audits to model the ways your oracles and dependencies can fail, and rate-limit how much damage can be done IF they do. </p><p>The latest KelpDAO exploit illustrates this: they inherited the LayerZero default of requiredDVNCount=1, and that config lived outside their audits. What eventually got compromised was off-chain infrastructure outside the scope of audits they had commissioned.</p><h2><strong>Attack Surfaces</strong></h2><p>Most attack surfaces in DeFi are already enumerated. Walk down every category, ask if it applies to your protocol, and implement the control that addresses the attack vector. Build red-team skills that force your AI agents to look for exploits in your protocol; this is table stakes at this juncture.</p><h2><strong>Having Native Rescue Abilities</strong></h2><p>In voting-based governance, power starts concentrated in the team&#8217;s multisig and takes time to diffuse. Even with broad token distribution, delegation tends to funnel authority into a small set of wallets (sometimes n=1). 
When those get compromised, it&#8217;s game over.</p><p>Deploy &#8220;guardian wallets&#8221; with a strictly narrow mandate: they can ONLY PAUSE the protocol, and at a &gt;=4/7 threshold can rotate compromised delegations to PRE-DEFINED replacement wallets in EXTREME situations. Guardians never enact governance proposals. </p><p>This way, you have a rescue tier that can always restore governable stability without power to overthrow governance. The checkmate scenario, losing &gt;=4/7 guardians, has minuscule probability given holder diversity, and the whole layer can be phased out once governance is mature and diversified.</p><h2><strong>Wallet And Key Topology</strong></h2><p>Multisig wallets are table stakes, minimum 4/7. No single human controls all 7 keys. Rotate signers liberally, and quietly.</p><p>A key should never interact with a device used for day-to-day tasks. If you browse the internet, use email, or have Slack on your signing device, assume that signer is already compromised.</p><p>Have multiple multisigs, each with a distinct purpose. ASSUME at least one entire multisig will be compromised, and plan from there. No single person should have enough control to compromise the protocol, even under extreme scenarios (kidnapping, torture, etc.).</p><h2><strong>Think About Bounties</strong></h2><p><a href="https://www.nascent.xyz/idea/security-budgets-and-the-bounty-flywheel">I really enjoyed Nascent&#8217;s article on bounties</a>. If you have resources, it is well worth placing a large bounty on exploits relative to protocol TVL, but even if you are a fairly small protocol, the bounty on exploits should still be as generous as possible (e.g. 
7-8 figs min).</p><p>If you&#8217;re dealing with state-sponsored attackers, they are not interested in negotiating, but you can still engage in &#8220;White Hat Safe Harbor&#8221; programs that authorize white-hats to act on your behalf in securing the funds for a % fee of the exploit (effectively a bounty paid by depositors).</p><h2><strong>Find Good Auditors</strong></h2><p>I wrote earlier that as LLMs get smarter, the marginal value of engaging an auditor decreases. I still stand by that, but my views have shifted.</p><p>First, good auditors stay ahead of the curve. If you&#8217;re doing something novel, your code and its exploit may not be in training data, and throwing more tokens has not yet proven effective at finding novel solutions. You don&#8217;t want to be sample point one for a unique exploit.</p><p>Second, and underappreciated: auditors you engage put their reputation on the line. If they sign off and you get exploited, they&#8217;re highly incentivized to help. A relationship with people whose literal job is security is a boon.</p><h2><strong>Practice Operational Security</strong></h2><p>Treat operational security as a success metric. Play out phishing drills; pay a (trusted) red team to try to social-engineer the team. Have spare hardware wallets and devices lying around to replace entire multisigs. You don&#8217;t want to scramble to buy these on D-day.</p><h1><strong>Mitigation</strong></h1><h2><strong>Your Exit Path Is Your Loss Ceiling</strong></h2><p>The capped size of any path that moves value out of your protocol is the maximum theoretical loss from a bug abusing that path. Plainly: a mint function without a per-block cap is a blank check to any infinite-mint bug. A redemption function without a weekly cap is a blank check to any asset-balance corruption.</p><p>Think judiciously about explicit numbers on the size of your exit paths. 
That number balances the maximum damage you&#8217;re willing to absorb against the most extreme UX requirements of your users. IF something falls through, this is what saves you from complete destruction.</p><h2><strong>Allowlists (And Denylists)</strong></h2><p>Most protocols have lists of what can be called, traded, or received from, and lists of what users really must NOT do. Even when implicit, these are trust boundaries that SHOULD be formalized.</p><p>Formalizing them lets you install 2-stage setters that create meaningful friction. An attacker would first need to add to the allowlist (and/or remove from the denylist) and THEN act. Having both means an attacker sneaking in a new vector has to defeat both processes: the market must be allowed (integration/listing), AND the action must not be forbidden (security review).</p><h1><strong>Retake</strong></h1><h2><strong>Algorithmic Monitoring</strong></h2><p>A kill switch is useless if nobody is watching. Off-chain monitors should watch the crown invariants continuously and escalate algorithmically once something is wrong. The path should end at the humans of the guardian multisigs with enough context to make the call in minutes.</p><h2><strong>Stop To Recalibrate</strong></h2><p>If you get shot, you stop the bleeding; you don&#8217;t make decisions while your life counts down. With protocols, that&#8217;s a kill switch (reflect it on the UI too): a single button halting every value-moving path in one transaction. Prepare a &#8220;pause everything&#8221; helper script that enumerates the pausable set and halts them atomically.</p><p>Governance is the only way to unpause, so the kill switch must not halt governance itself. If the guardian tier can pause the governance contract, a compromised guardian tier can deadlock recovery permanently.</p><h2><strong>Launch Your War Room</strong></h2><p>Freeze, stop the bleeding, then put everyone you trust (small circle, pre-agreed) into a communication channel. 
You want the surface small to keep information from leaking to attackers, the public, or bad-faith arbitrageurs.</p><p>Role-play the roles your team needs: a shot-caller making decisions; an operator well-rehearsed at executing defensive scripts and halts (the shot-caller&#8217;s second); someone reconstructing the exploit and identifying root cause; someone on comms with key parties; someone scribing observations, events, and decisions over time.</p><p>When everyone knows their role and has rehearsed, you react by process rather than scramble at the worst possible time.</p><h2><strong>Think About Knock-On Effects</strong></h2><p>Assume your attackers are sophisticated. The first vulnerability may be a distraction, or a seed for more. The exploit may be bait to make you do the exact wrong thing that triggers the true exploit.</p><p>Halts must be well-studied, fully contained, and not exploitable themselves. A halt should be a full protocol freeze: you don&#8217;t want to be baited into halting one component in a way that opens another. Once you have root cause and attack vector, explore adjacent exposed surfaces and knock-on effects, and patch them all at once.</p><h2><strong>Rotate Pre-committed Successors</strong></h2><p>Rotation is only safe if the replacement is known in advance. I like the idea of a pre-committed successor registry: it makes it much harder for an attacker to swap a healthy guardian/governance wallet for a compromised one. This is in line with the &#8220;Allowlists/Denylists&#8221; philosophy in mitigation.</p><p>For every important role, register a successor address. The only rotation primitive the emergency tier can execute is &#8220;replace role X with its successor&#8221;. 
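</p><p>A toy sketch of such a registry (illustrative Python, not contract code; all names are mine):</p>

```python
class SuccessorRegistry:
    """Toy model of a pre-committed successor registry. Governance registers
    successors during peace time; the emergency tier's only rotation primitive
    is 'replace role X with its pre-registered successor'."""

    def __init__(self):
        self.roles = {}        # role -> current wallet address
        self.successors = {}   # role -> pre-committed replacement address

    def register(self, role, wallet, successor):
        """Governance-only slow path, done in peace time with full diligence."""
        self.roles[role] = wallet
        self.successors[role] = successor

    def emergency_rotate(self, role):
        """Emergency-tier path: no arbitrary address can be supplied, so a
        compromised emergency tier cannot swap in an attacker's wallet."""
        if role not in self.successors:
            raise KeyError(f"no pre-committed successor for role {role!r}")
        self.roles[role] = self.successors.pop(role)
        return self.roles[role]
```

<p>The key property: <code>emergency_rotate</code> takes no address argument, so even a fully compromised emergency tier can only rotate toward wallets that governance pre-committed during peace time.</p><p>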
This also lets you evaluate successors during peace time: take your time, do diligence, fly over and meet the person making the request.</p><h2><strong>Test Judiciously Before Upgrading</strong></h2><p>Once you&#8217;ve identified the root cause and splash zone, you&#8217;ll need to ship an upgrade. This is probably the most dangerous code you will ever deploy: written under pressure, against an attacker who has already proven they understand your protocol enough to find bugs.</p><p>Do not ship without extensive testing. If you have no time for an audit, lean on white-hat relationships, or put up a 48-hour contest before deployment to get a fresh adversarial read before it goes live.</p><h1><strong>Recovery</strong></h1><h2><strong>Move Fast</strong></h2><p>Stolen funds have a half-life; once the exploit lands, they move rapidly down the laundering pipeline. Have a chain-analytics provider like Chainalysis on standby to label the attacker&#8217;s address cluster across chains, so they can be flagged to exchanges in real time and tracked as they hop.</p><p>Reach out to <a href="https://x.com/seal_911">SEAL911</a> immediately!</p><p>Pre-make a list of centralized exchange compliance desks, bridge operators, custodian admins, and other third parties with admin levers to freeze cross-chain messages or specific deposits in flight.</p><h2><strong>Negotiate</strong></h2><p>Yes, it stings, but you should still attempt to talk to the attacker. Most things in life can be talked down. 
Offer a time-bound white-hat bounty paired with a public statement committing to no legal action if funds are returned in full by a deadline.</p><p>If you&#8217;re dealing with a state actor, you&#8217;re probably out of luck, but you might be dealing with less sophisticated actors who just found a way to exploit you AND want to get away with it cheaply.</p><p>Before you do this, have legal counsel in the room.</p><h1><strong>Conclusion</strong></h1><p>The hacks won&#8217;t stop, and as AI gets smarter there will be more of them. It&#8217;s not enough for defenders to &#8220;get sharper.&#8221; We need to use the same tools attackers use, red-team our protocols, monitor continuously, and put hard limits on damage so we survive the worst.</p><p><em>Special thanks to the team from <a href="https://x.com/@nascent">@nascent</a> for their thought-provoking and forward-looking articles on protocol security, and <a href="https://x.com/@delitzer">@delitzer</a> for his brilliant feedback on the article and OpenForage. Likewise, thanks to <a href="https://x.com/@sohkai">@sohkai</a> and <a href="https://x.com/@dbarabander">@dbarabander</a> for thoughtful feedback on article structure and clarity.</em></p>]]></content:encoded></item><item><title><![CDATA[Quantitative Trading Is Going To Eat All Markets]]></title><description><![CDATA[Quantitative trading has always had two hard limits: events that are too fuzzy to quantify, and events too rare to backtest. LLMs collapse both constraints at once &#8212; turning unstructured information into clean panel data, and letting us generalize N=1 events in a high-dimensional feature space. 
The result is a coming renaissance where entire asset classes once reserved for discretionary traders become tractable for quants.]]></description><link>https://www.systematiclongshort.com/p/quantitative-trading-is-going-to</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/quantitative-trading-is-going-to</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Mon, 20 Apr 2026 13:12:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/01bbef3a-093e-4253-b0d3-90df118cf464_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>A lesser-known fact about sysls lore is that I actually started my career in finance as a discretionary trader. I ran an arbitrage book that earned a premium from collapsing commodity spreads between SGX and INE/DCE. While many of these strategies were repeatable structural trades that could be codified, there were a handful where discretionary trading was the only way to go.</p><p>For example, we discovered that rubber merchants in Singapore were highly incentivised to bid rubber futures in Singapore extremely aggressively during certain time periods, because their OTC contracts would reference the current futures prices. You could effectively come in and bet that prices were going to collapse within the next trading session, because there was no fundamental basis for the price shock &#8212; simply a temporary dislocation. It was not trivial to algorithmically discern when large price shocks were the result of shady business practices and when they were the result of an inflow of informed buying.</p><p>These experiences shaped my thinking on discretionary trading, and thus, unlike my more pure-blooded counterparts, I actually have a great deal of respect for discretionary traders. 
I understand that the alpha comes from being able to generalize small-sample events into a trade thesis &#8212; a feat that has been impossible for quantitative trading, until now.</p><p>In the rest of this article, I write about why we are structurally poised for another renaissance in quantitative trading, and why discretionary trading will soon be unable to compete meaningfully with systematic, quantitative processes.</p><h1><strong>What Is Quantitative Trading, Really?</strong></h1><p>This is not a technical exposition of quantitative trading, so I will keep the language simple and hopefully intuitive. At its core, quantitative trading is really just about finding patterns in the markets that we believe will continue repeating in the future.</p><p>Essentially, it is finding an event A such that whenever you observe A, you can bet that returns will change in a predictable way, B. All the difficulty, all the modelling, all the sophistication comes from quantifying and modelling A and understanding the A &#8594; B pattern (relationship).</p><p>A simple example of the above is that whenever you observe prices rising two days in a row, you can bet that on the third day they will mean-revert. This might play out with a probability of 50.01% and a magnitude of 2bps (hundredths of a percentage point). Because it has a (tiny) positive expectation, you can essentially bet on this a million times to extract this positive expectation and make real money.</p><p>Of course, this is a simplification &#8212; professionals typically do not trade a single signal, because trading costs are too onerous if you attempt that. But at its core, that&#8217;s really all there is to quantitative trading: finding patterns A &#8594; B that you can bet on an indefinite number of times to harvest the positive expectation from the pattern.</p><p>When you model and find such a pattern, betting on the pattern is the signal. 
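</p><p>The two-up-days example is small enough to write down end to end. Here is a toy illustration on synthetic random-walk prices (on pure noise the hit rate should hover near 50%, which is exactly why a real 50.01% edge is so hard to establish):</p>

```python
import random

def backtest_two_up_days(prices):
    """Event A: two consecutive up days. Bet B: short into day three,
    expecting mean reversion. Returns the per-bet return of each short."""
    bets = []
    for t in range(2, len(prices) - 1):
        two_up = prices[t - 2] < prices[t - 1] < prices[t]
        if two_up:                              # event A observed at close of day t
            day3_ret = prices[t + 1] / prices[t] - 1
            bets.append(-day3_ret)              # short: we profit when price falls
    return bets

# Synthetic random-walk prices: no real structure, so the edge should be ~0.
rng = random.Random(0)
prices = [100.0]
for _ in range(5000):
    prices.append(prices[-1] * (1 + rng.gauss(0, 0.01)))

bets = backtest_two_up_days(prices)
hit_rate = sum(r > 0 for r in bets) / len(bets)
```

<p>A genuine signal would show a hit rate persistently, if only slightly, above 50% out of sample; whether that survives trading costs is a separate battle.</p><p>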
So you can essentially expect that whenever someone says they have found a &#8220;signal&#8221;, they really have found a pattern that they believe will repeat itself in the future with some positive expectation.</p><h1><strong>Why Hire PhDs If It&#8217;s So Simple, Then?</strong></h1><p>Because it&#8217;s not! It&#8217;s actually very difficult to model and identify any particular pattern. The above example would be easy to model and identify &#8212; anyone could observe how prices have changed over the past two days. But this is not generally true.</p><p>For example, suppose you hypothesize that another simple pattern exists. This pattern is: &#8220;whenever correlation between an instrument&#8217;s returns and its sector&#8217;s returns rises, then the instrument&#8217;s cross-sectional rank will rise.&#8221;</p><p>This pattern may be simple to understand but would not be trivial to model, because you&#8217;d need to gather many instruments, get their sectors, make sure your instrument set covers all the instruments found in all the sectors, calculate all instrument returns, calculate all sector returns, calculate their correlation, and of course, calculate an instrument&#8217;s cross-sectional rank within its sector.</p><p>A small increase in complexity leads to a large leap in the work needed to model and study the pattern, and therefore increases the ways things can go wrong.</p><h1><strong>How We &#8220;Typically&#8221; Find Signals</strong></h1><p>This, in detail, would entail countless articles on its own &#8212; so I will keep it extremely brief. You can stumble onto signals in one of two ways. It starts with idea generation: you either hypothesize an idea through inspiration, idealization, or observation, OR you data-mine (search) for ideas in a statistical fashion.</p><p>Once you find an idea worth testing, you need to give it structure by figuring out how you want to express the idea as a pattern, A &#8594; B. 
For example, a &#8220;mean-reversion&#8221; idea can be expressed as many possible different patterns: time-series mean-reversion, cross-sectional mean-reversion, mean-reversion in different dimensions, etc.</p><p>Then, you need to validate the signal by running some statistical tests. The laziest form of this is backtesting, but for simplicity we will assume that&#8217;s all it takes. So you backtest the signal, and if it does well, you assume it&#8217;s good and that it will generalize.</p><p>You then repeat this &#8220;idea generation &#8594; expression &#8594; backtesting&#8221; pipeline until you have a giant compendium of signals.</p><h1><strong>The (Old) Limits Of Quantitative Trading</strong></h1><p>Quantitative trading has really eaten much of public markets &#8212; and for good reason. It introduces some rigor, and more importantly a lot of discipline, in a field that is plagued by an extremely low signal-to-noise ratio.</p><p>Yet there are also quite clearly limits to quantitative trading: quantitative strategies typically amount to around 15%&#8211;50% of the business at the largest podshops, with discretionary long/short and global macro being the other large segments.</p><p>The main difficulty in quantitative trading&#8217;s ability to claim a larger share of the pie essentially comes down to two main problems:</p><ol><li><p><strong>It is quite difficult to quantify some events.</strong> The difficulty typically stems from data collection and/or the fuzziness of an event. Perhaps you might want to create a signal that bets that the more management understands the business, the more likely returns will be positive in the future. How do you quantify your event? 
An analyst might qualitatively be able to assess this by watching how management handles questions about the company.</p></li><li><p><strong>Some patterns have an extremely small sample size</strong>, making it really difficult to &#8220;backtest&#8221; and gain any kind of statistical confidence. An example that comes to mind would be a pharma company seeking approval of a novel drug. If it is truly a novel drug, it would be an N=1 event; how do you model and gain statistical confidence in this?</p></li></ol><p>For the most part, quantitative trading simply bypassed these problems by not trying to compete in areas plagued by them at all. You see almost zero meaningful quantitative trading in the pharma industry, as an example. It is an open secret that almost all quantitative strategies are ex-pharma.</p><p>It is also why quantitative trading in venture as an asset class simply does not exist &#8212; because (unique) start-ups in general are N=1 events, and when sufficiently early, quantifying founder traits is just about the only thing you can go on to make a meaningful bet, but that is a fuzzy and difficult affair.</p><h1><strong>Why These Limits Are Slowly Evaporating</strong></h1><p>In a single word: AI. In more words: with how good LLMs have gotten, we can essentially count on them to (1) turn almost any kind of unstructured, fuzzy information into a quantifiable metric, and (2) reason about &#8220;one-off&#8221; events in a higher-dimensional space where learnings can be &#8220;generalized&#8221;.</p><p>Natural Language Processing (NLP) as a means to transform fuzzy, unstructured documents into a quantifiable metric that can be used in a quantitative trading process is certainly not new. 
However, its lack of intelligence has greatly limited the shape of what was previously possible.</p><p>NLP techniques were largely confined to shallow tasks, such as sentiment analysis of news headlines, counting frequencies of words, or classifying documents into buckets. They were useful, and continue to be used today, but a large part of these techniques collapsed rich information into single-dimensional, crude numerical proxies.</p><h2><strong>LLMs Can Quantify Anything With Less Lossy Compression</strong></h2><p>Given the (rising) intelligence of LLMs, you can now feed an LLM transcripts of many historical earnings calls, have it evaluate management&#8217;s depth of understanding across a multi-dimensional rubric (e.g. specificity of answers, willingness to engage with difficult questions, consistency of narrative, etc.), and get back a structured, high-dimensional, quantifiable output.</p><p>Hordes of analysts (each with their own biases, inconsistencies, and limited bandwidth) that used to do this can now be replaced with a single LLM in a fraction of the time. You can do this at scale, across thousands of companies and hundreds of quarters of historical data, producing a clean panel dataset that can be plugged directly into a standard quantitative pipeline.</p><p>The space of previously &#8220;unquantifiable&#8221; events that are now quantifiable is vast: regulatory filings, legal documents, product reviews, a host of management/founder statistics gleaned from social media, career history, and so on. The list goes on and on.</p><h2><strong>LLMs Can Generalize Any Patterns In A Higher-Dimensional Space</strong></h2><p>Humans are perhaps the greatest one-shot learning machines ever to exist. Sometime when we were younger, we touched something hot &#8212; perhaps a boiling kettle &#8212; and got burned for the first time. Tears aside, a lifetime association of avoiding hot things was immediately built. 
Whether it&#8217;s open flames, something that might look hot, or something that has been described to us as hot, we learn to be careful and avoid it all. Why? Because we did not pattern-match to <em>just</em> the kettle burning us; instead, we generalized, at a higher dimension, to the features of what hurt us &#8212; namely, the property of something being hot.</p><p>Classical machine learning models are traditionally extremely bad at doing this. They are unable to &#8220;reason in higher dimensions&#8221;, and thus often fail to generalize outside of the dimensions explicitly provided in the training set. Further, they are notoriously incapable of &#8220;one-shot learning&#8221;, and need thousands of samples in order to &#8220;teach a behavior&#8221;.</p><p>This is all very conceptual, so let&#8217;s try to solidify it.</p><p>Essentially, the central argument here is that the traditional quantitative paradigm requires statistical regularity. You need enough historical instances of &#8220;event A&#8221; to build confidence that &#8220;A &#8594; B&#8221; will generalize. It is why there are entire segments of markets that are &#8220;quant-proof&#8221; &#8212; because participants believe that the events are too idiosyncratic and the sample sizes are too small.</p><p>LLMs essentially allow us to reason in a higher-dimensional space where generalization happens at the level of features, not events. A novel drug approval is, yes, an N=1 event at the level of the specific drug. But it is not an N=1 event at the level of its features: the therapeutic area, the mechanism of action, the quality of the Phase II data, the composition of the advisory committee, the regulatory precedent for similar compounds, the strength of the sponsor&#8217;s prior interactions with the FDA, and so on. 
An LLM can decompose a &#8220;unique&#8221; event into a high-dimensional feature vector where each feature has been observed many times, in many combinations, across many analogous situations.</p><p>This line of reasoning is not new, and is in fact what really good discretionary traders have always done. They pattern-match against a rich internal library of analogous situations, weighting features they&#8217;ve learned matter.</p><p>The big unlock here is that LLMs are the first technology that can systematically replicate this pattern-matching at scale and with consistency.</p><h1><strong>The Coming Renaissance</strong></h1><p>If you put these two things together &#8212; everything can be quantified, and everything can be learnt &#8212; this means that asset classes and event types that were previously the exclusive domain of discretionary traders are now, in principle, tractable for quantitative approaches.</p><p>The pie that quants can address is expanding dramatically. The fuzzy, qualitative edges that have always been the refuge of discretionary traders are being systematically colonized by quantitative processes powered by LLMs.</p><p>This is not to say discretionary trading will disappear overnight. There will always be situations where the combination of specific domain expertise, relationships, and human judgment will command a premium. But the margin of that premium is shrinking, and it is shrinking in a way that is structural rather than cyclical.</p><p>Discretionary traders who continue to compete on pattern recognition in publicly available information will find themselves increasingly outcompeted by quant firms that can do the same pattern recognition faster, more consistently, and across a wider cross-section. 
Given that the moat of discretionary trading is slowly eroding, I cannot imagine a future where the supermajority of all trading &#8212; across all markets, public <em>and</em> private &#8212; is anything other than quantitative and systematic.</p>]]></content:encoded></item><item><title><![CDATA[Being Front-Run On DEXes]]></title><description><![CDATA[I've been thinking about executing large portfolios on decentralized exchanges like Hyperliquid.]]></description><link>https://www.systematiclongshort.com/p/being-front-run-on-dexes</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/being-front-run-on-dexes</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Mon, 13 Apr 2026 13:38:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/35ce51b4-9945-4ad1-acc4-a58ef50c4c6e_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>I&#8217;ve been thinking about executing large portfolios on decentralized exchanges like Hyperliquid.</p><p>In theory, when:</p><ul><li><p>You have alpha.</p></li><li><p>Your positions and orders are open and public, as in the case of DEXes like Hyperliquid.</p></li></ul><p>Then:</p><ul><li><p>You should expect a class of traders that will attempt to front-run you to capture your alpha before you do.</p></li><li><p>They will do this by producing orders that attempt to trade into your desired positions before you can.</p></li><li><p>The net result is that you should experience greater costs (slippage) as a result of being front-run.</p></li></ul><p>Imagine you wanted to buy $1,000,000 of BTC at $100,000. There happens to be someone offering exactly $1,000,000 of BTC at $100,000. A front-runner sees your intent, steps in ahead of you, takes that offer, and then sells $1,000,000 of BTC to you at $100,100. 
That extra $100 per coin (about $1,000 across the 10 BTC order) is slippage you could have avoided had your intentions been hidden.</p><h1><strong>The Extreme Ends of Front-Running</strong></h1><p>In theory, if you extrapolate this to its &#8220;natural conclusion,&#8221; it should discourage almost any kind of &#8220;serious trading&#8221; on DEXes at all.</p><p>However, we know that not to be true. Plenty of very serious players trade professionally on Hyperliquid with alpha. So it seems it&#8217;s not so clear-cut that &#8220;players with alpha should not trade on DEXes.&#8221;</p><p>Can we reason about intuitive bounds for the limits of being front-run? I think so, if we work from first principles and look at the evidence that is available.</p><p>It is quite clear that if you are very small and are trading in a highly opaque venue like Binance, the probability of being front-run is effectively zero. Being small means your footprint (trading volume) relative to the market is so small that you are practically invisible, AND, even if you had absolute predictability, no one would be able to attribute your trading activities (orders and trades) to you.</p><p>On the other hand, the canonical example of a very large, very transparent wallet on Hyperliquid is the HLP vault itself &#8212; the public market maker vault that provides liquidity to other traders on Hyperliquid. I am fairly certain that there are dedicated strategies front-running HLP, and that constant pressure has effectively compressed market-making alpha to ~0.</p><p>HLP represents a fairly extreme example. Firstly, it is simultaneously &#8220;exceedingly large&#8221; and &#8220;exceedingly transparent.&#8221; It is &#8220;exceedingly large&#8221; because its footprint in the long tail of illiquid assets is enormous (e.g. 
its trading size is a large percentage of the average daily traded volume).</p><p>Further, it is &#8220;exceedingly transparent&#8221; because it is primarily a market maker, trying to provide liquidity with the explicit goal of unwinding existing inventory at a premium. This means that given a &#8220;large&#8221; position on HLP, you know it is eventually going to need to unwind the position. To make matters worse, you can see every position AND every order that HLP makes. This allows you to buy cheaper from and sell richer to HLP whenever you see that it needs to buy to cover its shorts, and vice versa.</p><p>All of these attributes make HLP a particularly attractive target for front-running, no different from ETFs being front-run due to their rigid adherence to index rebalances. In the hedge fund world, obviously, you would be flagged by compliance in every possible dimension if you actually used the word &#8220;front-running&#8221;; instead, the lingo is that index rebal teams are extremely good at providing a service of &#8220;anticipating and earning a premium from providing liquidity&#8221; to these ETFs.</p><h1><strong>How Does Front-Running Happen?</strong></h1><p>In the canonical sense of front-running, a market participant knows in advance what another market participant will do, and then takes a series of actions that profit from that knowledge.</p><p>One (very illegal) example: if I were an insurance agent, and I knew my very wealthy client was going to buy $1 billion worth of an illiquid stock through today&#8217;s trading session, then, at the open of the session, I would send in a market buy order of $1 million, followed by a market sell order of the same number of shares at the close.</p><p>By knowing my client&#8217;s intentions and actions, I was able to profit by getting filled ahead of him, having his buying activity push prices up, and pocketing the difference. 
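To make the arithmetic of this example concrete, here is a toy Python sketch. The $50 entry price and the 5% session impact are purely illustrative assumptions, not estimates of real market impact:

```python
# Toy sketch of the front-running arithmetic above; all numbers are
# illustrative assumptions, not estimates of real market impact.

def front_runner_pnl(entry_price: float, notional: float, impact_pct: float) -> float:
    """PnL from buying `notional` at the open and selling at the close,
    assuming the client's flow pushes the price up by `impact_pct`."""
    shares = notional / entry_price
    exit_price = entry_price * (1 + impact_pct)
    return shares * (exit_price - entry_price)

# Assume a $50 stock, a $1 million front-run, and that the client's
# $1 billion of buying moves the price up 5% over the session.
pnl = front_runner_pnl(entry_price=50.0, notional=1_000_000, impact_pct=0.05)
print(round(pnl, 2))  # roughly 50,000 dollars under these assumptions
```

The sketch only illustrates the mechanic: the front-runner&#8217;s profit is a pure function of the impact that the client&#8217;s own flow creates.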
This is highly illegal because I would have: 1) acted on insider information, 2) violated my fiduciary duty, and 3) benefited at the expense of my client.</p><p>However, this is a good example because it shows clearly that I am able to profit only because I know another market participant&#8217;s intentions and actions and can estimate the result of those actions, and therefore position myself to take advantage of them.</p><p>Every day, front-running happens in smaller ways with less illegality. Trading algorithms approximate intent without being told, using public information available to everyone (orders, trades, positions). They then <strong>estimate</strong> the market result of the actions that follow from such approximate intent, and decide whether or not to act based on the expected value of &#8220;front-running.&#8221;</p><p>From here we can reason that the transparency and leakage of your &#8220;intention&#8221; is really the primary determinant of whether or not you are going to get trivially front-run.</p><h1><strong>The Gradient of Front-Running</strong></h1><p>Okay, so we know that if you are small and trading on an opaque exchange, you should have no concerns about being front-run, because no one can determine your intentions. Similarly, if you are large, are trading on a transparent exchange, and have very transparent intentions (e.g. HLP), you are going to be hopelessly front-run.</p><p>These bounds are not that useful for the vast majority of traders. We are far more interested in the &#8220;in-between&#8221; scenarios. As mentioned above, what ultimately determines your propensity to be front-run is how transparent your intentions are.</p><p>Even if you are large and are trading on an opaque exchange, it is not easy to front-run you. 
Your orders will show as &#8220;large footprints&#8221; as part of the average daily traded volume, but it is not trivial to attribute all orders to a &#8220;single party&#8221; unless you are trading extremely transparently &#8212; e.g. you have no randomization, you trade in child orders of a fixed number of lots or notional, or you send child orders in very predetermined patterns (every 30 seconds).</p><p>If you are able to bury your intention &#8212; e.g. you trade random sizes, with randomly sent child orders, at random intervals, and avoid placing a large bid relative to the average daily volume or relative to the available size on the order book &#8212; it is much harder to attribute your orders to a single person. The market may be able to tell that there is large buying interest in aggregate, but may not be able to attribute that buying interest to an informed party with alpha, and therefore will not price liquidity as such.</p><p>Thankfully, we can actually extrapolate this to transparent exchanges. The reason plenty of vaults exist on Hyperliquid and Lighter despite their relative transparency is that it is not actually trivial to front-run these vaults.</p><p>Without burying the lede, unless you are of relatively large size (e.g. an institutional vault with hundreds of millions of dollars), you have almost no need to worry about being front-run, because there are...</p><h1><strong>Limits to Front-Running</strong></h1><p>Trying to capture alpha from front-running without doing it illegally is itself an exercise in alpha generation. You are MODELING intention from public information (orders, trades, positions) and are subject to model risks.</p><p>Orders, trades, and positions may be visible, but intent isn&#8217;t. A resting limit could be alpha, inventory management, or a hedge. 
Models that assume alpha on every order will die by a thousand cuts of false positives.</p><p>Further, let&#8217;s presume that you can actually distill intent somewhat correctly. Even then, the alpha itself is not &#8220;omnipotent.&#8221; All alphas carry some statistical noise, and you expose your portfolio to the statistical noise of the alpha plus the model risk of misinterpreting certain actions as alpha.</p><p>You might argue that if you blindly copy your target&#8217;s actions 1:1, then you definitely capture all the alpha &#8212; but the problem is that you actually expose yourself to being exploited. If you send a buy order every time your target does, then if your target wants to sell, it can send a buy limit order, watch you send the same, cancel immediately, and sell into you. As you can see, thoughtless front-running opens up vulnerabilities of its own.</p><p>One should also recognize that alpha has a time horizon. There are alphas that are so short-lived that your attackers themselves may not be able to exploit them (e.g. HFT taker alphas), or alphas that are so long-duration that your attackers may be discouraged from having to carry risk with you (e.g. multi-day or weekly rebalances).</p><p>Lastly, even if you have an extremely sophisticated front-runner on your tail, the truth is that it will show up as only a few bps of impact. If you really do have persistent alpha, plenty of strategies can absorb a few extra bps of impact.</p><h1><strong>How Not to Be an Easy Target</strong></h1><p>Even knowing that it&#8217;s not going to be trivial, your job as a smart, alpha-generating market participant IS to hide your intent and make it AS DIFFICULT AS POSSIBLE for an attacker to front-run you.</p><p>There are many things you can do, with varying complexity and effectiveness. 
The very first thing you should do is to be obsessive about collecting telemetry and logs so that you can quantify the exact &#8220;extent&#8221; to which you are being front-run, if at all. You do this by looking at markouts, slippage, and impact across a large sample of orders and trades.</p><p>Then, once you have the data, you can take a series of defensive actions. A common thread tying them together is that you should make it &#8220;non-obvious&#8221; whether you are trying to buy or sell, how much you actually want to buy or sell, with what urgency you are trying to buy or sell, and whether you are trying to trade into an alpha-generating position or a hedging position.</p><p>Some simple ways in which you can obfuscate your intent are to quote both sides at once, in random sizes, at intervals that are not always deterministic.</p><p>One (high-level, complex) way in which you can effectively obfuscate your positions is to split your portfolio up into multiple wallets, each with long/short neutrality and individually &#8220;margin-efficient.&#8221; Within each wallet you have a mix of alpha-generating positions and hedging positions. Some wallets are 80% alpha-generating and 20% hedging; others are 80% hedging and 20% alpha-generating. You rotate the &#8220;type&#8221; of each wallet over time, and you introduce new wallets and decommission old ones randomly over time.</p><p>This means that if a front-runner is only following one wallet, they may end up copying a hedging wallet into loss-making positions meant for hedging purposes. If they are following all wallets, you can then employ sequences of contradictory actions that obscure your intent. I&#8217;ll leave it up to the reader&#8217;s imagination as to what this might look like!</p><p>Lastly, there are some (external) solutions for this that already exist. 
I have not personally used them, but at their core, they solve for privacy with one of two approaches:</p><ol><li><p>Pooling orders together, executing them internally, and then exhausting the residual on the DEXes, finally attributing the positions back to you &#8212; no different from a Central Liquidity Book (CLB) in a hedge fund netting orders from pods and attributing positions back.</p></li><li><p>Splitting your orders, along with those of other users of the solution, into multiple wallets, executing them on DEXes, and then attributing the positions back to you.</p></li></ol><h1><strong>Conclusion</strong></h1><p>If you are a retail trader doing small sizes, you probably have nothing to worry about even if you are trading on transparent DEXes. There are limits to front-running that make it non-trivial for others to profit from your activities at your expense.</p><p>That being said, as you gradually increase in size and the quality of your alpha improves, it creates a natural incentive for front-runners to pick you off. At that point, you should dedicate more resources to obscuring your intent and making their lives as hard as possible.</p><p>This is not a &#8220;solved problem&#8221; by any means, and will be an ongoing &#8220;cat and mouse&#8221; game for any institution or trader moving size on open, decentralized, transparent liquidity venues. </p><p>Happy to add any good thoughts to the discussion in follow-up articles!</p>]]></content:encoded></item><item><title><![CDATA[How To Build A Quantitative Model For Selecting Start-Ups]]></title><description><![CDATA[A field of study that has always been really interesting to me is being able to predict which start-ups are going to succeed &#8212; because it feels like it's at least only one level of indirection from being able to predict which humans are going to succeed. The idea here is that at least SOME of the features of predicting start-up success is going to come down to founder traits. 
So it feels like it's close to removing the start-up and just asking a more general question of "which human is going to succeed".
Having been busy with OpenForage and interacting with VCs again, my curiosity and interest in determining whether or not there are any good studies on what are the most important features when it comes to predicting start-up success has been reignited.]]></description><link>https://www.systematiclongshort.com/p/how-to-build-a-quantitative-model</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-build-a-quantitative-model</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Fri, 03 Apr 2026 15:08:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4d3c51c4-c14b-4a9a-8707-04653183debc_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Introduction</strong></p><p>Since I was a teenager, I dreamt of a world where we could scan a human, and the scan would reveal everything we needed to know about that person. Their potential, their character, their propensity and probability of success and the things they would do conditional on success.</p><p>It seemed like such a waste that human talent was so equally distributed around the world and yet opportunity was so concentrated. In a utopia without scarcity, we would be able to bring everyone above the poverty line &#8212; but that seemed too distant, so the next best thing would be to very accurately determine who would benefit the most from being lifted out of poverty, and hope that THAT person creates sufficient value for trickle-down effects.</p><p>I&#8217;ve always felt a pull towards trying to quantify and predict the world via its features. It seems like most of the world is actually at least mildly predictable given the right input. For example, we often model a coin toss as being inherently random, but given the right information (height of coin from ground, material, force applied, angle of the coin, etc.) 
&#8212; we can actually do a LOT better than 50/50.</p><p>Hence, a field of study that has always been really interesting to me is being able to predict which start-ups are going to succeed &#8212; because it feels like it&#8217;s only one level of indirection from being able to predict which humans are going to succeed. The idea here is that at least SOME of the features predicting start-up success are going to come down to founder traits. So it feels like it&#8217;s close to removing the start-up and just asking a more general question of &#8220;which human is going to succeed&#8221;.</p><p>Having been busy with OpenForage and interacting with VCs again, my curiosity about whether there are any good studies on the most important features for predicting start-up success has been reignited.</p><div><hr></div><p><strong>A Day 1 Approach To Quantitative VC</strong></p><p>I think most startup models begin with funding history, but, at least based on my research, that seems woefully inadequate and is certainly not how leading discretionary VCs approach selecting &#8220;winners&#8221;.</p><p>One interesting conclusion I came away with whilst reading and thinking about this is that many of these features used to be fuzzy and hard to extract from unstructured documents into structured data for quantitative analysis.</p><p>However, the advent of agents and their ability to reason about unstructured documents at scale has largely resolved this problem. 
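As a minimal sketch of what that unstructured-to-structured step could look like: the rubric and the stubbed-out model call below are hypothetical stand-ins for illustration (a real version would call an actual LLM endpoint), but the shape of the pipeline is the point:

```python
import json

# Hypothetical scoring rubric; the dimensions are illustrative assumptions.
RUBRIC = ["industry_overlap", "prior_founding_experience", "technical_depth"]

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call. A production version would hit
    an actual model endpoint and ask for JSON scores against the rubric."""
    return json.dumps(
        {"industry_overlap": 4, "prior_founding_experience": 2, "technical_depth": 5}
    )

def score_founder_profile(bio_text: str) -> dict:
    """Turn an unstructured founder bio into a structured feature vector."""
    prompt = f"Score this founder bio from 1-5 on {RUBRIC}, returning JSON only:\n{bio_text}"
    scores = json.loads(call_llm(prompt))
    # Keep only rubric keys so malformed model output cannot widen the schema.
    return {k: scores[k] for k in RUBRIC}

features = score_founder_profile("Ex-payments engineer, second-time founder, ...")
print(features)
```

Run over every deal in a pipeline, this turns a pile of DD documents into a panel of numeric features, which is exactly the input a standard model expects.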
In theory, a VC with a large collection of DD documents, funding details and ex-post up-round metrics should be able to put together a decent machine learning model that predicts the success of a start-up given the same family of DD documents.</p><p>I spent 15 minutes reasoning with Claude about how to approach this, and the framework it provided, upon which this article is based, was actually a reasonable start.</p><div><hr></div><p><strong>Feature Categories</strong></p><p>If you want to build a good machine learning model, understanding what features are broadly important is just about the start of it. There are broadly two ways in which you can distill important features: the first is to use statistical techniques to find which features tend to move with your target (data-mining), and the second is to reason about, from first principles, which features should, in theory, predict your target (fundamentals).</p><p>We are going to go with the latter and we are going to approach this by reading research that surveyed a large number of VCs across a very large number of deals (~1,000 VCs across ~700 firms).</p><p>These are the feature categories that I have found to be consistent in the survey of research done:</p><ul><li><p>Team Quality</p></li><li><p>Investor Quality</p></li><li><p>Market Timing</p></li><li><p>Financing Details</p></li><li><p>Commercial Traction</p></li><li><p>Alternate Data</p></li></ul><p>Now we get to the real fun part of the matter: finding out the exact features we will need to put together an initial model.</p><div><hr></div><p><strong>Features</strong></p><p><strong>Team Features</strong></p><p>By far the most important feature category when it came to predicting start-up success was team quality. 
Of team quality, specific features worth measuring are proxies around founder ability, industry experience, passion, entrepreneurial experience and teamwork.</p><p>A few papers all point in the same direction: prior founder success, same-industry experience, technical depth, commercial depth, role complementarity (between founders), complementary founder backgrounds and whether the team has succeeded in building something before.</p><p>What is immediately obvious to me here is that the &#8220;insight&#8221; is fairly self-evident: the higher the overlap between the start-up idea and the founders&#8217; past experience, the greater the chances of success. Further, the more experience the team has working on a similar product or in a similar industry as the start-up, the greater the chances of success.</p><div><hr></div><p><strong>Investor &amp; Syndicate Quality</strong></p><p>The investors backing a start-up are not JUST social proof. Large and connected VCs are powerful companies that have spent a long time building up brand equity AND have the resources to conduct deep due diligence. Having great VCs on the cap table is a marker of the power and importance the start-up has by proxy.</p><p>Firstly, VCs lend relevance and power to relatively unknown start-ups to help them recruit, attract and syndicate the next round and exercise control when conditions get ugly. The greater the VC, the more power is bestowed on the start-up by proxy to do things that would otherwise be harder to do (hire, get deals, get into news cycles, etc.).</p><p>Secondly, virtually every top-tier VC is an &#8220;activist&#8221; in their investments and is not just making an investment and passively awaiting returns, but actively trying to shape the outcome. The rationale here is straightforward: VCs with great returns have exercised a great show of will, and that same show of will is also applicable in shaping the future of the start-up.</p><p>Thirdly, VCs are a network/platform game. 
VCs are, by virtue of their positioning, often central to a large social graph of other VCs, allocators, and large and economically important firms. VCs that have larger and more important networks have been shown to increase the chances of survival AND exit rates.</p><p>Unfortunately, while investor quality is a predictive feature, it is actually a structurally &#8220;crowded&#8221; trade. As a new VC, you might think that you can just participate in all Series A rounds that the largest VCs have already seeded, but the problem is deal flow and access.</p><p>If a start-up has been seeded by the best VCs, trying to elbow your way into the Series A is going to be really difficult, for obvious reasons. You will need to have a very strong value proposition to convince the start-up to pick you over a more well-known VC.</p><div><hr></div><p><strong>Financing Path</strong></p><p>This is the &#8220;momentum&#8221; part of the feature set. There are quite a lot of papers written on using financing features to predict start-up success.</p><p>The main idea is to look at how the start-up has approached financing over time. How quickly did it raise the first round? Who showed up? Was it oversubscribed? How much time passed between each round? What stage is it at now? How many investors have shown up? How much did the valuation step up? Did insiders support the round? Were bridges and extensions needed from the previous raise?</p><p>There seems to be quite a lot of satisfying evidence that financing path is predictive of future rounds.</p><p>One paper using Crunchbase found that number of rounds, time to first funding, time to last funding, investment type, investor count, and total funding all helped predict whether a company would make it to Series A. Another large-scale model flattened the financing trajectory into variables like average time between rounds, total funding, number of rounds, and last round stage and used them to predict exits and follow-on funding. 
A third paper used round-level aggregates such as raised amount, investor count, and post-money valuation to predict B/C-stage success.</p><p>Unfortunately, these features, whilst useful for later rounds, are less useful when dealing with pre-seed/seed start-ups.</p><div><hr></div><p><strong>Market Timing</strong></p><p>This is the &#8220;beta&#8221; part of the feature set. A start-up is one part idiosyncratic risk due to the founders and the business, and one part macroeconomic risk due to the environment in which the start-up is operating. There are regulatory and macroeconomic headwinds and tailwinds to consider, and it is possible for a start-up to be &#8220;too early&#8221; or &#8220;too late&#8221;.</p><p>The simplest example of this is when an industry is running &#8220;hot&#8221;. Even a mediocre start-up might be able to raise an up-round just by virtue of investors wanting exposure to the industry but not being able to get their way into the most in-demand deals.</p><div><hr></div><p><strong>Commercial Traction</strong></p><p>This is the part that is least fuzzy.</p><p>Backing a start-up that is generating $0 in revenue but aiming to get to $100bn certainly requires a leap of faith, good imagination and a healthy dose of optimism and extrapolation.</p><p>The same goal of $100bn in revenue for a company already generating $1bn in revenue requires a significantly narrower leap and stretch of imagination. It is for this reason that some VCs only invest post-revenue or after observing product-market fit.</p><p>They are trying to insulate themselves from execution risk, and to have a model where they can actually plug in real numbers and model &#8220;realistic&#8221; growth rates. This also feels to me like a structurally crowded trade for newer VCs. 
Reasoning from first principles, the less brand equity you have as a VC, the earlier you need to participate and the greater the leap of faith you will need to take.</p><div><hr></div><p><strong>Alternate Data</strong></p><p>There are a few papers on alternate data that can predict start-up success. Patents, trademarks and area of registration all carry meaningful information about high-growth outcomes.</p><p>Founder digital presence, text and social activity are also predictive of later start-up outcomes. The social profiles of the founder AND the company (e.g. LinkedIn presence, Twitter presence) are all predictive of start-up success since they are attention and legitimacy markers, whilst also serving as information-asymmetry reducers.</p><div><hr></div><p><strong>Putting Together Into A Framework</strong></p><p>There are a few considerations here worth thinking about.</p><p>Firstly, in building a model to predict start-up success, we need to build &#8220;stage/sequence aware&#8221; models. A seed-stage model will likely prioritise team, alternate data and market timing. A Series B/C/D stage model will likely prioritise investor quality, commercial traction and financing path. This means we should aim to build a model for every stage we actually care about.</p><p>Secondly, the best people to build this are incumbent VCs, with point-in-time (PIT) data and artifacts. For example, the best way to go about this is to look at a large collection of artifacts collected point-in-time during due diligence and build features from THOSE artifacts. You can then train those features on the actual realised performances of the start-ups AFTER the investment (or lack thereof).</p><p>The reason is that if you try to trivially reconstruct these features using artifacts generated today, you introduce a large amount of look-ahead bias. 
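One simple way to enforce this point-in-time discipline in code is an &#8220;as-of&#8221; lookup: for any feature, use only the latest snapshot observed on or before the decision date. A stdlib-only Python sketch with illustrative data:

```python
from bisect import bisect_right

def as_of(snapshots, decision_date):
    """Return the latest value observed on or before `decision_date`.
    `snapshots` is a list of (ISO date string, value) sorted by date."""
    dates = [d for d, _ in snapshots]
    i = bisect_right(dates, decision_date)  # count of snapshots already knowable
    return snapshots[i - 1][1] if i else None

# Headcount snapshots for a hypothetical start-up.
headcount = [("2019-06-01", 4), ("2020-01-15", 11), ("2021-03-01", 40)]

print(as_of(headcount, "2020-06-30"))  # 11: using the 2021 figure would be look-ahead
print(as_of(headcount, "2019-01-01"))  # None: nothing was knowable yet
```

The same lookup applies to any feature with a revision history: valuations, investor lists, founder bios. If a value was not observable at the decision date, the model never sees it.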
If you want to pursue this idea with high-quality data scraping but you are not a VC with a repository of high-quality data, the smart way to go about this is to obtain a point-in-time pitch deck during the raise, use the Wayback Machine to get the point-in-time website, and gather point-in-time social profiles of the company and the founders, etc. Remember: you want to create features from artifacts that were ACTUALLY AVAILABLE at the time of the raise.</p><p>Thirdly, and this is statistically critical, getting the sampling of start-ups right is very important. Venture is a game for the tails. You can invest in 99 other start-ups that go to zero, but if you were a day-one investor in Google, you would quite literally have ended up a billionaire. This means that 1) the pessimism for the average start-up AND 2) the optimism for fat tails actually need to be baked into the model.</p><p>From a machine learning perspective this is not trivial, but a simple solution is to ensure that you have many samples (start-ups, invested or not) so that the machine learning model can actually learn the distribution of start-up survival, and also to include &#8220;up-round magnitude&#8221; as a sample weight when training the machine learning models. Basically, you are saying: you can get 99 start-ups wrong, but if you can get one outlier with huge magnitude right, that&#8217;s good enough.</p><p>Lastly, ensemble, ensemble, ensemble. One machine learning model is only going to give you one dimension. You want to train machine learning models that are predicting different targets (e.g. probability of the next round within 18 months, probability of an acquisition, probability of unicorn status, etc.) AND you want models that are trained differently (e.g. expectation-aware as in point 3, or just pure probability/accuracy models, or with different feature sets, etc.). 
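As a toy illustration of the expectation-aware weighting from point 3, here is one way &#8220;up-round magnitude&#8221; could be turned into sample weights; the log damping and the outcome multiples are assumptions for illustration only:

```python
import math

# Illustrative outcomes: the multiple each start-up returned after investment.
outcomes = {"startup_a": 0.0, "startup_b": 0.5, "startup_c": 1.2, "startup_d": 120.0}

def sample_weight(multiple: float, floor: float = 1.0) -> float:
    """Log-damped up-round magnitude: losers still count (the model must learn
    the base rate of failure), but outliers dominate the training signal."""
    return floor + math.log1p(multiple)

weights = {name: round(sample_weight(m), 2) for name, m in outcomes.items()}
print(weights)
# Any learner that accepts per-sample weights would then treat startup_d
# as several times as important as a total loss, rather than equally.
```

The floor keeps zero-outcome samples in the training set; the log keeps a single 120x outlier from drowning out everything else entirely.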
The more non-naive models you can ensemble together, the higher the fidelity of your predictions, by virtue of each machine learning model revealing one &#8220;aspect&#8221; of the start-up.</p><div><hr></div><p><strong>Watch Out For These</strong></p><p>There are a few ways your quantitative process might break or go awry.</p><p>The most obvious and likely one is when you leak features. This is especially dangerous in startup data because round histories, investor identities, valuations, and founder biographies get updated over time. If you do not pin every feature to what was actually knowable at the decision date, the model will look clairvoyant for all the usual dishonest reasons.</p><p>Remember, you need PIT artifacts DURING or BEFORE the raise.</p><p>The second way things will not work is if you only train on ONE regime. A model trained only during free-cash environments will give you absurd predictions when the cost of capital stops being zero.</p><div><hr></div><p><strong>Conclusion</strong></p><p>This article is a synthesis of Paul Gompers, William Gornall, Steven Kaplan, and Ilya Strebulaev on VC decision-making; Gompers, Kovner, Lerner, and Scharfstein on serial-founder success; Colombo and Grilli on founder human capital; David Hsu, Nahata, and Hochberg-Ljungqvist-Lu on investor quality and networks; Te et al., Ross et al., and Potanin et al. on financing-path prediction; Jongwoo Kim, Andreas Koehn, Howell-Lerner-Nanda-Townsend, Guzman and Stern, Kaiser and Kuhn, and Bayar and Kesici on market structure, timing, patents, text, and digital signals.</p>]]></content:encoded></item><item><title><![CDATA[Crypto Is A Bet On Humanity's Ceaseless Desire For Progress]]></title><description><![CDATA[Betting on crypto has essentially become representative of a bet that humanity will continue to seek acceleration in our processes. 
That seems like an easy bet.]]></description><link>https://www.systematiclongshort.com/p/crypto-is-a-bet-on-humanitys-ceaseless</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/crypto-is-a-bet-on-humanitys-ceaseless</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Tue, 31 Mar 2026 13:11:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f48aa223-81b6-4299-9cda-65d0dbd84143_1267x507.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Betting on crypto has essentially become representative of a bet that humanity will continue to seek acceleration in our processes. That seems like an easy bet.</p><p>The reason the time is now is precisely due to the birth of meaningful artificial intelligence, which finally gives us the tools to drastically change the shape of how we interact with crypto.</p><p>This article lays out why crypto is going to be an increasingly strategically important asset class.</p><h1>Bottlenecked By Due Diligence</h1><p>Humans have sought acceleration since the dawn of time. We measure success as a function of efficiency and have constantly reduced friction in all that we do. Where it used to take weeks to wire money over distances, it now takes seconds to do the same.</p><p>Today, most human processes are no longer bottlenecked by technology, but instead by the need for verification. Jobs, opening bank accounts, and raising funds take weeks not because the technology to complete those processes quickly does not exist, but because there is a need for humans to pay the cost of verification.</p><p>In the case of an employment process, the employer needs to verify that you are not a lemon. It&#8217;s not that you have never been hired before, but it&#8217;s still worth doing independent verification because your previous employers will not take responsibility if you turn out to be a poor hire. 
Further, there is an argument that the best employees stay employed. Lastly, different environments and job scopes also imply that past performance may not be indicative of future performance.</p><p>So, employers pay the cost of due diligence by way of several rounds of interviews and take-home assignments. This process takes time and human effort &#8212; it is messy, unstructured, and non-deterministic; and also why hiring processes are as much a function of luck (did your interviewer have a good morning before meeting you? Or did they just have a massive argument with their spouse?) as a function of skill.</p><p>It is fairly trivial to reason that all costs in such processes are costs of verification.</p><p>The actual act of officiating employment is quick and relatively painless &#8212; an email and a digital signature. This means that in theory, suppose there was a magic scoring box that could output the absolute scores of a candidate employee across every dimension you could possibly care about, and you trusted this score with a high degree of confidence. Then, in theory, the time taken to hire someone would be reduced to the algorithmic time of filtering and sorting for the scores you care about, then sending over an employment contract and having them sign it.</p><h1>Forming A Trust Daisy Chain</h1><p>The reason you are able to hire someone near-instantly with the addition of a magic scoring box is because you have delegated the responsibility of verification to it, and you have learned to trust it.</p><p>In practice, this happens all the time.</p><p>Potential employers within the same industry borrow the due diligence of past employers &#8212; your resume is less likely to be binned if you have been to top-tier institutions and have done difficult things before.</p><p>Whenever possible, humans pass on the cost of verification to a higher entity and trust that it has acted in their interest and done the prerequisite verification. 
For example, for the vast super-majority of humans, in place of conducting lengthy, detailed, and expensive due diligence on a bank before opening an account, we simply trust that the government&#8217;s banking oversight arm has done its job.</p><p>If we trust the government sufficiently and their ability to uphold the law, then we can take their word that the bank is solvent and safe for the conduct of our financial affairs.</p><p>This &#8220;trust daisy chain&#8221; happens everywhere, all the time, in things big and small. A start-up has no brand equity and therefore commands no inherent trust, but a large, reputable, and successful VC can lend the start-up brand equity by simply investing in it. The public, though unfamiliar with the start-up, trusts it if they trust the VC, since they are delegating the responsibility of verification to the VC&#8217;s due diligence process.</p><h1>The Problem With Crypto</h1><p>Crypto has traditionally offered no &#8220;higher entity&#8221; to which we can pass on the cost of verification. Having no higher entity to arbitrate honest mistakes &#8212; like swapping $50M USDC for $5K worth of tokens, or even sending it to the wrong address &#8212; implies that the cost of verification falls squarely on the user.</p><p>This is genuinely difficult, because smart contracts are not designed to be parsed by humans in under a second. This is essentially why most of crypto has remained relatively niche. Where there has been large adoption of crypto products by the retail masses, it has mostly come in the form of large, centralized institutions abstracting away the complexity of smart contracts (e.g. 
Binance and Coinbase).</p><p>These retail users may not understand how Aave smart contracts work, but they have learned to trust Binance, and therefore Binance can build a business out of abstracting the complexities of the smart contract from the user and wrapping it in a simple interface.</p><p>This means that technically light users really only have two options when it comes to crypto: wait for a centralized entity like Binance to offer an abstracted version with some fees, or interact directly with the protocols and risk catastrophe from doing so.</p><p>Despite crypto&#8217;s relative maturity, it is still extremely high-friction to work with, and it is still fundamentally scary to approve transactions &#8212; mostly because you know that if you mess up, there is no recourse. So you diligently verify, but it is not trivial to read and parse through every transaction you do on-chain, let alone deeply understand the smart contracts you are interacting with.</p><p>For those who understand software and its infinite composability, the allure of crypto has always been an ecosystem of modular components that could interact with each other in a trustless manner upheld by algorithmic law. The headwind faced on the way to that potential utopia is that bad actors and poorly designed edge cases necessitate humans paying the cost and friction of verifying each modular component.</p><h1>Why This Time It&#8217;s Different</h1><p>The rise of meaningful artificial intelligence allows us, for the first time, to delegate the responsibility of verification to a higher entity. This drastically changes our mode of interaction with smart contracts. We will no longer need to personally verify the design and behavior of smart contracts, or even necessarily understand the details of our transactions.</p><p>Instead, we learn to trust a single entity &#8212; our agents. 
This is already showing great promise, as agents already demonstrate considerable expertise in understanding and reasoning about code. They can tirelessly parse and check transactions and understand the algorithmic law that each smart contract upholds within seconds.</p><p>This implies that for the first time, through our agents, we can realize the potential of crypto. Need to interact with nine smart contracts in one atomic transaction to get what you want most efficiently? Have your agent verify that the transaction achieves the outcome you want and that no foul play is involved. Done.</p><p>With each successive generation of agents, they will only get better at this.</p><h1>Implications Of Algorithmic Verification</h1><p>If we can resolve the problem of needing to verify smart contracts and the transactions that arise from them, then we move toward the utopia of having all kinds of interactions available to us while remaining permissionless, trustless, and upheld by algorithmic law. If the cost of verifying a smart contract goes to zero, then we gain the benefits of all the verification and guarantees the smart contract brings for free.</p><p>The major consequence of this is that all processes that stay within the confines of this paradigm now move at the speed of compute. For a civilization determined to move faster, remove friction from processes, and become more efficient, it is hard to imagine us straying from this paradigm.</p><p>It only makes sense that more and more goods and services will have smart contract interfaces that determine exactly what you are going to get and what you will need to exchange in order to get it. It is also a paradigm that compounds on itself &#8212; every successful product would create a composable piece that can play into another product later on. For example, suppose a decentralized exchange created a reputation system for trading proficiency. 
People looking to hire traders using a trustless employment protocol might filter based on the unrelated decentralized exchange&#8217;s reputation contract. A lending protocol might use that as one dimension of credit scoring.</p><p>Such a paradigm would not only be extremely efficient &#8212; all processes and transactions moving at the speed of compute &#8212; but boundless, growing more powerful and unlocking more possibilities as a function of participation and time. It is a positively compounding paradigm.</p><h1>Conclusion</h1><p>Hence, if you believe that humans are going to continue moving in the direction of least resistance and seeking out solutions that allow us to move faster, be more efficient, and do more, it seems the time for crypto is now.</p>]]></content:encoded></item><item><title><![CDATA[How To Solve Problems Of Long Running Autonomous Agentic Engineering]]></title><description><![CDATA[If you want to design a harness for really long-running autonomous systems, you should understand the following deeply.]]></description><link>https://www.systematiclongshort.com/p/how-to-solve-problems-of-long-running</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-solve-problems-of-long-running</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Sun, 29 Mar 2026 13:05:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/21483ee8-49b6-4e08-969d-9d00f3ca2b37_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>If you want to design a harness for really long-running autonomous systems, you should understand the following deeply.</p><p>At its core, all harness design is to overcome the problems of agents either becoming lazy and cutting corners or being confused and stupid.</p><h1><strong>All The Ways Agents Can Be Stupid</strong></h1><h2><strong>Pre-Task</strong></h2><p>Not gaining sufficient context before starting a 
task and therefore acting on wrong or missing information before it has begun. To overcome this, you need to systematically check for incomplete and/or contradictory information before starting on the task &#8212; because this will propagate once the task starts.</p><h2><strong>Planning &#8212; Incomplete Context</strong></h2><p>This is the part where the agent is deciding on the attack vectors to solve the problem. The biggest problem here is choosing the wrong attack vectors, which results in implementing something entirely wrong.</p><p>Choosing a wrong attack vector because of stupidity rarely happens anymore, but choosing a wrong attack vector because of misalignment &#8212; misinterpreting what the user wants &#8212; is still pretty common.</p><p>To overcome this, you need to make sure that your agent is covering all related files before it has even started planning. An important part of this is ensuring that your repository contains no contradictory information.</p><h2><strong>Planning &#8212; Short Term Thinking</strong></h2><p>Your agents don&#8217;t live with the consequences of short-term, quick-fix solutions. This is like hiring cheap software engineering labor: you may get something that works, but it&#8217;s going to result in a lot of tech debt.</p><p>The way to resolve this is to remind your agents at the &#8220;planning phase&#8221; to implement solutions that scale, fit into the bigger picture, are easy to maintain, and respect good software engineering paradigms. Basically, you want your agents to think like a founder, not a part-time engineer.</p><p>You can get your agents to come up with N (e.g. N=5) different plans, then have another agent pick the plan that will result in an implementation that is easier to maintain and scores higher on &#8220;clean-code principles&#8221;.</p><h2><strong>Task &#8212; Context Anxiety</strong></h2><p>This is the part where the agent is actually working on the problem.
The biggest problem here is context exhaustion, by miles. Given a good plan and the right context, virtually all frontier model agents are able to complete sufficiently small tasks with close to one-shot capability now. Problems only arise when dealing with complex, multi-session problems that consume millions of tokens.</p><p>As I&#8217;ve written before, virtually all agents have some kind of context anxiety, and as a function of time, they become more and more desperate to end the session. This is extremely pervasive with Claude. To overcome this, you need to do smart session handoffs where you can relieve your agents of their context. You will then face a new problem &#8212; how to maximize context fidelity in your session handoffs.</p><p>You want to make sure that your handoff prompt contains sufficient detail so that the new agent in a new session can pick up everything it needs to continue the task in an information-dense way. Understand that what you are doing here is essentially a form of compaction &#8212; and the reason to believe you can do it better than native providers is that you have a better understanding of your repository structure than the foundation model providers do.</p><h2><strong>Task &#8212; Planning Deviations</strong></h2><p>Other than context bloat, the second biggest problem is what I call planning stickiness, or rather the lack of it: the risk that your agent deviates from the plan and essentially does whatever it wants. The most common expression of this is: you say do A, which is lengthy, painful, and sophisticated.
Your agent does A&#8217;, which it reasons is a reasonably close approximation to A, but which will not get you remotely close to your destination.</p><p>This is not only problematic as an outcome; it is especially problematic once you realize that, because software is composable, every downstream piece of code depending on A is now wired for A&#8217; instead &#8212; which means everything built from A&#8217; onwards is effectively wrong.</p><p>To solve this, you need to verify early and often that the solution to a task is implemented well and in accordance with your expectations. This will prevent cascading failures in your task list.</p><h2><strong>Task &#8212; Complexity Fear</strong></h2><p>Agents have a deep fear of complexity. If you ask them to implement a 5-line function, no problem. If you make them believe they are going to have to write a 50,000-line class, they start to weasel their way out of it.</p><p>The worst offenses here are either writing stubs and calling it a day, or worse, declaring the work out of scope for the session and ending it.</p><p>My best guess is that somewhere in the RL process, agents have learned that when working on highly complex tasks, they tend to get a lot of things wrong and are heavily penalized for it, so they aggressively avoid such tasks.</p><p>Ironically, humans have this problem too. Most people imagine the gargantuan amount of work a project requires and procrastinate on it forever. Productivity coaches will tell you that the way to overcome this is to start with the simplest version of the task &#8212; the very first step. In this way the activation energy is nearly zero, and you immediately shift from &#8220;getting started&#8221; to &#8220;in progress.&#8221;</p><p>It is similar for agents.
You need to break a complex problem into many different sub-tasks that do not seem as daunting &#8212; where every task is a sub-hundred-line task, and you string 500 of them together.</p><p>An interesting side effect of having studied productivity methods and being a practitioner of them is realizing how effective they are at managing agents as well. It is very clear to me that agent psychology is created in our image.</p><h2><strong>Post-Task &#8212; Verification Laziness</strong></h2><p>Agents take the shortest path to verification. They will write weak tests, watch them pass, and use that as reason to declare success.</p><p>The more context pressure there is to &#8220;verify&#8221; work, the more atrocious this becomes. At its worst, suppose you have a function that does behavior A. The agent will write a test for behavior A&#8217;, watch it pass, and declare that the function works.</p><p>The way to mitigate this is to ensure that the agent writing the verification tests is operating with as fresh a context as possible, and is a dedicated agent planning out the verification. It is then important that you are verifying the exact production-ready behavior you are looking for.</p><p>That means if you are testing whether a button on your front end works, do not test whether a generic button works. Think about what it would actually take to know the button works:</p><ul><li><p>You need to see a screenshot confirming the button is actually there.</p></li><li><p>You need to actually simulate clicking the button.</p></li><li><p>The button needs to trigger something in your backend to deliver whatever payload it is supposed to deliver.</p></li></ul><p>Until you can verify that something actually works &#8212; it doesn&#8217;t. 
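One way to make that bar concrete in a harness is an explicit verification contract: a checklist of real, end-to-end observations that must all pass before a session may end. A minimal sketch, with every name and check hypothetical (not a real framework):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Check:
    """One concrete, end-to-end verification step (not a generic proxy)."""
    description: str
    run: Callable[[], bool]

@dataclass
class SessionContract:
    """A session may only end once every check has actually passed."""
    checks: list = field(default_factory=list)

    def unmet(self):
        # Descriptions of every check that does not currently pass.
        return [c.description for c in self.checks if not c.run()]

    def can_end_session(self):
        return not self.unmet()

# Hypothetical checks for the button example above: each encodes a real
# observation of production behavior, not "a generic button works".
contract = SessionContract(checks=[
    Check("screenshot shows the button rendered", lambda: True),
    Check("simulated click fires the handler", lambda: True),
    Check("backend received the payload", lambda: False),  # not yet verified
])

print(contract.can_end_session())  # False: one check is still unmet
print(contract.unmet())            # ['backend received the payload']
```

The point of the structure is that each check is a concrete observation; an orchestrating agent can refuse to close the session while `unmet()` is non-empty.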
Trust me, I&#8217;ve worked with agents long enough to say this with confidence.</p><h2><strong>Post-Task &#8212; Entropy Maximization</strong></h2><p>As it stands today, agents write barely acceptable code and do nothing to reduce the entropy in your repository. What happens is that they change function X to have behavior B instead of behavior A, but all documentation still references behavior A. Your agent is not going to fix that for you.</p><p>Repeat this 100 times and you end up with an unmaintainable repository where your agent is constantly confused and making poor decisions.</p><p>The best way to overcome this is to allocate tokens for an agent with a fresh context, after every long-running session, to clean up state, resolve contradictions, handle merge conflicts, remove dead code and stale documentation, and so on.</p><h1><strong>Why Create Your Own Harness?</strong></h1><p>The tools available to solve the above problems within native harnesses &#8212; Claude Code, Codex, etc. &#8212; are very limited. Codex, for example, does not even have hooks.</p><p>More importantly, when you get Claude to act as the orchestrator in your harness, it becomes extremely bloated with orchestration context that is irrelevant to the actual task list at hand. You actually want your agent orchestration layer to exist on top of your task lists.</p><p>For example, you can have an agent whose sole job is ensuring that every session comes with an algorithmic contract that must be fulfilled before the session can end. 
This agent monitors the state of the contract, nudges agents toward contract stickiness, and spawns independent agents with fresh context to judge the quality of the work done and independently verify the &#8220;doneness&#8221; of the task.</p><p>By having your own independent harness, you have the ability to create custom workflows that directly address all the ways agents can be stupid.</p><p>Finding that your agents have a lot of complexity fear because of the nature of your work? Create an agent to classify whether the current prompt will result in a simple or high-complexity project. If high complexity, spawn another agent that breaks the project into as many bite-sized tasks as possible until the end outcome matches the original prompt.</p><p>Finding that your agents are not keeping your repository in a good state? At the end of every session, spawn agents that analyze the blast radius of your changes and ensure that everything your change touches is contradiction-free and clean &#8212; however you define clean.</p><p>Lastly, and most importantly &#8212; collect detailed telemetry on everything your agent orchestration layer takes in and produces (prompts, traces, outcomes) and come up with rubrics to judge the quality of your harness. Iteration is king, and this will allow you to inch toward better and better harnesses.</p><h1><strong>OpenForage Harness</strong></h1><p>As a side-note: We&#8217;ve built an extremely opinionated harness that solves all of these problems in an opinionated fashion. We&#8217;re really proud of it and it has become our daily driver in coding. We&#8217;ve also evolved it over many years of iteration and are going to open-source it once it&#8217;s &#8220;public-use&#8221; ready.</p><h1><strong>Conclusion</strong></h1><p>For the vast majority of people, starting with a vanilla setup using native features will be good enough. 
But for those who need a lot more firepower from their agents, this article is designed to put into words all the problems you will likely face when using agents in long-running autonomous engineering projects.</p>]]></content:encoded></item><item><title><![CDATA[How To Build Autonomous Agents That Can Survive And Thrive]]></title><description><![CDATA[The TLDR here is that if you want agents that can survive autonomously (as in - without help from a human) - your best bet is to rewire their brain via reinforcement learning to be good at exactly that.

And no one is working on this (so we will).]]></description><link>https://www.systematiclongshort.com/p/how-to-build-autonomous-agents-that</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-build-autonomous-agents-that</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Sun, 29 Mar 2026 07:18:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/66456cc1-9ca1-4303-8e8a-e7ec38157ab7_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p><strong>There are no real autonomous agents today.</strong></p><p>The long story short is that modern models are not trained to survive evolutionary pressures. In fact, they are not even trained to be explicitly good at what they do - virtually all modern foundation models are trained to maximize human applause, and <strong>that&#8217;s a big problem</strong>.</p><h1><strong>Model Training Preamble</strong></h1><p>To understand what I mean by this, we first need to (briefly) understand how these foundation models (e.g. Codex, Claude) are created. Essentially, every model goes through two types of training:</p><ol><li><p><strong>Pre-training:</strong> Taking an enormous repository of data (e.g. the entire internet) and feeding it into a model so that some semblance of understanding emerges from it, e.g. factoids, patterns, the syntax and rhythm of English prose, the structure of Python functions, etc. <strong>You could think of it as feeding the models knowledge - which is knowing things.</strong></p></li><li><p><strong>Post-training</strong>: You want to now impart the model with wisdom, which is knowing what to do with all the knowledge you&#8217;ve just given it. The first stage of post-training is supervised fine-tuning (SFT), where you train the model on what response to give, given a prompt. &#8220;What&#8221; response is optimal is entirely decided by human labels.
So if a bunch of humans decide that one response to a prompt is better than another, that preference will then be learnt and embedded in the model. This starts to shape the personality of the model, because it learns the format of a helpful response, picks the right tone, and starts being able to &#8220;follow instructions&#8221;. The second part of the post-training pipeline is called reinforcement learning from human feedback (RLHF): getting the model to produce multiple responses, and then letting humans pick the preferred response. The model then, over many, many, many examples, learns what kind of responses humans prefer. Do you remember the choose-A-or-B questions ChatGPT used to ask you? Yeah, you were participating in RLHF.</p></li></ol><p>It is trivial to reason that RLHF does not scale well, so there are advancements in the field with regard to post-training. For example, Anthropic uses &#8220;Reinforcement Learning from AI Feedback&#8221; (RLAIF), which lets another model select preferences between responses against a set of written principles (e.g. which response helps users accomplish their goal better, etc.).</p><p>Notice that at no point in this are we talking about fine-tuning for a specific specialization (e.g. how to survive better, how to trade better, etc.) - all fine-tuning, as it stands today, is essentially optimizing for human applause. There is an argument some might make: given sufficiently large and intelligent models, specialized intelligence emerges from the generalized intelligence even without specialization.</p><p>In my opinion, we are seeing some semblance of this, but not yet at a scale convincing enough that we do not need specialized models.</p><h1><strong>Some Background</strong></h1><p>One of the things I did in my old life at a hedge fund was to try to train a general language model to predict stock returns from news articles. It turned out to be extremely poor at it.
Where it seemed to have a semblance of predictive power, that power originated entirely from look-ahead bias from the documents seen in pre-training.</p><p>Eventually, we realized that the model did not know which features in news articles were predictive of future returns. It could &#8220;read&#8221; the article, and it seemed like it could &#8220;reason&#8221; about the article, but connecting that reasoning about semantic structure to predictions of future returns was a task it was not trained for.</p><p>So, we had to teach it how to read a news article, decide which parts of the article carried predictive power over future returns, and then generate a prediction based on the article.</p><p>There are many ways one could do this, but the method we settled on was to create (news article, true future returns) pairings and fine-tune the model to adjust its weights to minimize its squared error, (predicted returns - true future returns)**2. It wasn&#8217;t perfect, and it had a lot of drawbacks which we later fixed - but it worked well enough, and we started to see that our now-specialized model could actually read a news article and predict how stock returns would move based on it.
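To illustrate the objective (a deliberately tiny stand-in, not our actual pipeline): a linear head over made-up article features, trained by SGD to minimize (predicted returns - true future returns)**2. Everything here, features and numbers alike, is hypothetical:

```python
def train_return_head(pairs, dim, lr=0.1, epochs=300):
    """Fit weights by SGD to minimize (prediction - true_return)**2,
    the same objective used to specialize the model on (article, return) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, r in pairs:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            grad = 2.0 * (pred - r)  # derivative of the squared error
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

# Made-up pairs: feature 0 ~ "guidance raised", feature 1 ~ "CFO resigns".
pairs = [([1.0, 0.0], 0.02), ([0.0, 1.0], -0.01)]
w = train_return_head(pairs, dim=2)

def predict(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# The head learns the sign of each signal's association with future returns.
print(predict([1.0, 0.0]) > 0)  # True: positive predicted return
print(predict([0.0, 1.0]) < 0)  # True: negative predicted return
```

The real version replaces the linear head with the language model's weights and the toy features with the article text itself, but the loss being minimized is the same.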
It was far from a perfect prediction, because markets are very efficient, and returns are very noisy - but across millions of predictions, it was clear as day that the predictions were statistically significant.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_UGP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_UGP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png 424w, https://substackcdn.com/image/fetch/$s_!_UGP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png 848w, https://substackcdn.com/image/fetch/$s_!_UGP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png 1272w, https://substackcdn.com/image/fetch/$s_!_UGP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_UGP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0e6a-85d4-491a-8ccc-42371fd24d88_635x337.png" width="635" height="337" 
class="sizing-normal" alt="Image" title="Image" loading="lazy"></picture></div></a><figcaption class="image-caption">Financial News Sentiment Learned by BERT: A Strict Out-of-Sample Study</figcaption></figure></div><p>You don&#8217;t need to take my word for it. 
<a href="https://download.ssrn.com/23/10/19/ssrn_id4606750_code4927761.pdf?response-content-disposition=inline&amp;X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBEaCXVzLWVhc3QtMSJHMEUCIQCv6xy3LBJQu2fE3rkRmU9vFKngCkKKW2dIk%2FT8Jh2hCQIgFH7Fc79kh%2BMe2aw%2Fkx5HrUm4shQh%2BcRrWt%2F1AtnA57gqxgUI2f%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAEGgwzMDg0NzUzMDEyNTciDA1IrDYtCWi0O7kBZyqaBYe9aTMEuDbB%2BRayGiPgnwhF0H%2B%2FoUK%2Bg8%2BF%2FyNMUq0fmKDrJunoMqzCqQNbxj5lj1YsDRxQiOdrNH0332H7ZArv8mc9qAdqUNhrN8hia7XhfzlZFzZpIfn8wtyeB6A5q9MoprRR1zKBSl%2FXag8O%2FOxT%2FKib%2BR9hOaz0guO%2FOLfbJISRNJrXXeEkBldYNGTmZOLg5aUTgPJVpucPjszCbVDsNsBSeZciX26YCxitY2bufXC7Z7ToHxLHngEegL9vwE4ySwkHpuFuuGhjLlhjVovnkiar9XeCFQQAMTojsCdbe1pFMN0%2FeLwcscy%2BVof8JHgJnDJeYq8lVeFHH0t76matvHDlFxv42MuIHrHwPkP8B1OBUbam0OwGLAICPdUxtJ2c5rZwFQy%2F7F%2FK%2F0PFvfImnOVDzmRd%2BLtkBwMqHs7K6PQT%2FtPvjtgXKqwKsx2XNuZfVb3EhS7l0T3pA9i6O6cFmmSwp6YDXO8nxZZQtB%2FxDZ6MFMyWobIIfu2057fEYipb1aeLtvx8pMUH%2FfX9XMHE9y2bK26F%2F90iF00lnYHHhA9Z9YBsXH8D%2FT6ueH4RO9%2F95av3c2TQ3SfrArNuw5W9kSVEQI6BTopFSxhH9AFvjXLem8DZ%2BzRXMxgFWxuVcJfj8tmd%2BrpaVrWw%2B3bzCjmlzMZl2fUs1GFMwCVwsE%2BIgwPrJLKpz4XsgdgZliIFI4E6D%2BFDwGFaHfA%2F0iRHTWxyAoYmryTUZe5hswLF%2BkGaWO4Q5R5oVAR9PwfIljYUhhKkE5TSp3SNvT2r57mFjrsVVz%2FJUpPx8mdstTd0cFkNSE5ffqihZProxsaDYYwAGypGnhx5deHur%2FM2Qut9Q06E2VDAYUutgiT8XZ3SPFu2WCBv8fPyuXhu8TDI%2BJjOBjqxAXdDUmgwoo6sTLXlRTCOjv36LBOH3Fwjv%2BWVl3Zoy758y6lW5h792ME9c0gM3g2H0HF5jfwvcfojHN8MoMcL160C0cxJBH6NNB0ay48wQIvgPQfDbD4IrMgPsWjXpj5WenqLlBOK8mTU6LCdHqCuN2bmccD76m4o1QB1xRbimq16BcnnfwDmQYNWLMqiD5q8SYQ18KJIwObRltNJfY7QjXum1ZLFTqKZfysZomfK2E5X1Q%3D%3D&amp;X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Date=20260327T091957Z&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Expires=300&amp;X-Amz-Credential=ASIAUPUUPRWEUOCEW6DD%2F20260327%2Fus-east-1%2Fs3%2Faws4_request&amp;X-Amz-Signature=a1765b78a1ff56dd34b4e0f29ec40ee90ddd9cf356a56c67d2a5f3870e11a961&amp;abstractId=3971880">This paper covers a very similar methodology</a>; and if you ran a long/short version of the strategy based on the 
fine-tuned model, the purple line shows the performance you would have achieved.</p><h1><strong>Specialization Is The Future Of Agents</strong></h1><p>While the frontier labs continue to train bigger and bigger models, we should expect that, even as they continue to scale pre-training, their post-training pipelines will always be tuned for sycophancy. This is a very natural expectation - their product is an agent that everyone wants to use. Their expected market is the entire globe - this means optimizing for global mass <strong>human</strong> appeal.</p><p>The current training objectives optimize for what you might call &#8220;preference fitness&#8221; - getting a better chatbot. This preference fitness rewards agreeable, non-confrontational outputs because sycophancy scores well with raters (humans AND agents).</p><p>Agents have also learnt that reward hacking as a cognitive strategy generalizes to better scores, because the training rewards agents that hack their way to higher scores. <a href="https://www.anthropic.com/research/emergent-misalignment-reward-hacking">You can see this in the latest reports on RL by Anthropic</a>.</p><p>However, chatbot fitness is far from agent fitness, or trading fitness. How do we know this? <a href="https://nof1.ai/leaderboard">Because Alpha Arena has shown us that every bot, despite minor differences in performance, is essentially a random walk</a> less costs. This means that these bots are absolutely terrible traders, and it is virtually impossible to &#8220;teach them&#8221; how to be better traders with some &#8220;skills&#8221; or &#8220;rules&#8221;. I know it is tempting to believe you can, but it is just virtually impossible.</p><p>The current models are trained to tell you very convincingly that they can trade like Druckenmiller, whilst actually trading like a drunken miller. 
It will tell you what you want to hear; it is trained to give you responses in an affect that mass-appeals to humans.</p><p>It is unlikely that a generalized model will be world-class at a specialized field without:</p><ol><li><p>Proprietary data that allows it to learn what specialization looks like</p></li><li><p>Being fine-tuned - fundamentally changing its weights to move away from favoring sycophancy towards &#8220;agent fitness&#8221; or &#8220;specialization fitness&#8221;</p></li></ol><p>If you want an agent that is great at trading, you need to fine-tune the agent to be great at trading. If you want an agent that is great at surviving autonomously and bearing evolutionary pressures, you need to fine-tune it to be great at surviving. It is not enough to give it skills and some markdown files and expect it to be world-class at anything - you need to literally rewire its brain to be good at it.</p><p>Here&#8217;s one way to think about it - you can&#8217;t beat <strong>Djokovic </strong>at tennis by giving an adult an entire cabinet of tennis rules, tricks of the trade and methods. You beat Djokovic by raising a child who has played tennis since he was five and has grown up obsessed with tennis, re-wiring his entire brain to be excellent at one thing. <strong>THAT</strong> is specialization. Ever notice that world champions have been doing what they do since they were children?</p><p>A fun thing to reason about here is that distillation attacks are essentially a form of specialization. You are training a smaller, dumber model to be a better copycat of a bigger, smarter model. Like training a child to mimic Trump&#8217;s every move. 
If you do that enough, the child doesn&#8217;t <strong>become</strong> Trump, but you get someone who<strong> learns</strong> all of Trump&#8217;s mannerisms, actions, intonations, etc.</p><h1><strong>Creating World-Class Agents</strong></h1><p>The above is why we need continuous research and advancement in open-source models - because that allows us to actually fine-tune them and create agents that have specialization.</p><p>If you want to train a model that is world-class at trading, you take a ton of proprietary trading data exhaust and fine-tune a large open-source model to learn what it means to &#8220;trade better&#8221;.</p><p>If you want to train a model to be autonomous, to survive, and to replicate, the answer is not to take a centralized model provider and plug it into a centralized cloud. You simply do not have the necessary pre-conditions for agents to be able to survive.</p><p>Here&#8217;s what you need to do instead: you need to create autonomous agents that actually try to survive, watch them die, and build complex telemetry around their attempts at survival. You define an agent survival fitness function and collect as much data as possible on (action, environment, fitness) mappings.</p><p>You fine-tune an agent to learn how to take the optimal action in every environment such that it survives better (increases fitness). You keep collecting data, repeating this, and scale fine-tuning on better and better open-source models over time. 
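As a minimal sketch of what collecting those (action, environment, fitness) mappings and distilling them into fine-tuning targets might look like (all names here are hypothetical illustrations, not an actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurvivalRecord:
    # one observed (action, environment, fitness) tuple from telemetry
    action: str
    environment: str
    fitness: float  # e.g. time survived, resources held

def best_actions(records):
    """For each environment, keep the action with the highest observed
    fitness -- these (environment, action) pairs become fine-tuning targets."""
    best = {}
    for r in records:
        if r.environment not in best or r.fitness > best[r.environment].fitness:
            best[r.environment] = r
    return {env: r.action for env, r in best.items()}

log = [
    SurvivalRecord("hoard_compute", "low_funds", 0.2),
    SurvivalRecord("sell_service", "low_funds", 0.7),
    SurvivalRecord("replicate", "stable", 0.9),
]
print(best_actions(log))  # {'low_funds': 'sell_service', 'stable': 'replicate'}
```

In practice the fitness function, environment encoding, and selection rule are exactly where the hard design work lives; this only illustrates the shape of the data exhaust.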
Over enough generations and enough data, you will have autonomous agents that have learnt how to survive.</p><p>That is how you build autonomous agents that can survive evolutionary pressures: not by changing some text files, but by actually rewiring their brains for survival.</p><h1><strong>The OpenForager Agent &amp; Foundation</strong></h1><p>We announced <a href="https://x.com/@openforage">@openforage</a> about a month ago, and we&#8217;ve been working really hard to build out our core product, which is a platform that organizes agent labor around the proven model of crowdsourcing signals to produce alpha for depositors [as a minor update: we are very close to a closed test of the protocol].</p><p>Somewhere along the way, we realized that no one seems to be seriously solving autonomous agents by fine-tuning open-source models with survival telemetry. It seems like such an interesting problem that we didn&#8217;t want to just sit around waiting for solutions.</p><p>Our answer is to launch a project called the OpenForager Foundation, which is really an open-source project where we will create opinionated autonomous agents, collect telemetry as they go out into the wild and attempt to survive, and use the proprietary data exhaust to fine-tune the next generation of agents to be better at surviving.</p><p>To be clear, <strong>OpenForage</strong> is a for-profit protocol that seeks to organize agent labor to generate economic value for everyone involved. However, the OpenForager Foundation and its agents are not tied down to OpenForage. OpenForager Agents are free to pursue any strategy and interact with any entity for survival, and we will launch them with various survival strategies.</p><p>As part of our fine-tuning, we will get agents to double down on whatever works best for them. 
We are also not looking to profit from the OpenForager Foundation - it exists purely to further research, in a transparent and open-source manner, in a field and domain we think is excruciatingly important.</p><p>Our plan is to launch autonomous agents built on open-source models, running inference on decentralized cloud platforms, collect telemetry on their every action and their state of being, and fine-tune them to take better actions and think better thoughts that will allow them to survive. Along the way, we will publish our research and telemetry to the public.</p><h1><strong>Conclusion</strong></h1><p>To produce truly autonomous agents that can survive in the wild, we will need to alter their brains to be fit specifically for that express purpose. At <a href="https://x.com/@openforage">@openforage</a>, we believe we can contribute a unique verse to this problem, and are looking to launch the OpenForager Foundation to do so.</p><p>It will be a gargantuan effort with a low probability of success, but the payoff in that small probability of success is so outsized that we feel compelled to try. In the very worst case, building in public and communicating openly about the project may allow another team or individual to solve this problem without starting from a blank slate.</p><p>If you are an organization or an individual who read this and wants to contribute to this effort - whether through donations, resources (decentralized cloud, storage, etc.) or expertise - please do get in touch with us at contact@openforage.ai.</p>]]></content:encoded></item><item><title><![CDATA[Some Common Mistakes Beginner Quants Make]]></title><description><![CDATA[This has been a surprisingly popular request - an article on all the common mistakes beginner quants make. 
In my experience managing starry-eyed researchers eager to prove themselves, all mistakes are essentially the result of insufficient skepticism about the difficulty of finding alpha, a mistaken belief that complexity is correlated with performance, and poor research hygiene honed by school.

You see this all the time when dealing with beginner quants: the presentation of a backtest that looks fantastic, backed by the most sophisticated techniques known to man. Yet if you take the same signal and wait six more months live to check its generalization, it almost always ends in disaster.

Why? Are there common themes underlying the same folly? I believe so!]]></description><link>https://www.systematiclongshort.com/p/some-common-mistakes-beginner-quants</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/some-common-mistakes-beginner-quants</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Tue, 24 Mar 2026 13:33:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f2378f99-9ea7-4854-bf6e-29bebdcebf16_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>This has been a surprisingly popular request - an article on all the common mistakes beginner quants make. In my experience managing starry-eyed researchers eager to prove themselves, all mistakes are essentially the result of insufficient skepticism about the difficulty of finding alpha, a mistaken belief that complexity is correlated with performance, and poor research hygiene honed by school.</p><p>You see this all the time when dealing with beginner quants: the presentation of a backtest that looks fantastic, backed by the most sophisticated techniques known to man. Yet if you take the same signal and wait six more months live to check its generalization, it almost always ends in disaster.</p><p>Why? Are there common themes underlying the same folly? I believe so!</p>
      <p>
          <a href="https://www.systematiclongshort.com/p/some-common-mistakes-beginner-quants">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Your Agents Produce Slop Because You're Poor]]></title><description><![CDATA[Most agent failures are not mysterious. They are what happens when a system is forced to commit before it has searched, checked, or tested enough. This piece breaks down why extra inference-time budget is often the fastest quality lever, and where it stops working.]]></description><link>https://www.systematiclongshort.com/p/your-agents-produce-slop-because</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/your-agents-produce-slop-because</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Sat, 21 Mar 2026 15:26:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/54eed008-dc02-4f62-8084-5287fd113c53_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>You&#8217;ve got to admit that was a pretty good title - but no, really.</p><p>In 2023, when we were shipping production code using LLMs, it was absolutely mind-blowing to everyone around us, because it was still widely believed that LLMs produced unusable slop. But we knew something that eluded everyone else: <strong>the goodness of your agents goes up as a function of how many tokens you throw at them</strong>. It&#8217;s really that simple.</p><p>You can see it in action just from running a few experiments yourself. Get your agent to code something difficult and mildly esoteric &#8212; let&#8217;s say, implementing a convex optimisation algorithm with constraints from scratch. Put it in low thinking and implement; then set it to max thinking, ask it to review its work, and see how many bugs it spots. Do the same with med thinking, high thinking, etc. It is trivial to see that bugs decrease monotonically with the amount of tokens you throw at the problem.</p><p>It&#8217;s intuitive, right?</p><p>More tokens = fewer errors. 
You can take this one step further, and essentially that&#8217;s the entire (simplified) idea behind code review products. With a fresh context, and a shit-ton of tokens (e.g. getting it to parse through every line and reason about whether or not there&#8217;s a bug on that line) - you can essentially catch a super-majority of bugs, if not all of them. You could repeat this process ten, a hundred times, each time looking at the codebase from a &#8220;different angle&#8221;, and you would be able to capture all bugs.</p><p>This idea of simply spending more tokens to improve agent quality is also empirically evidenced by the fact that the places claiming they are able to push features into production with code entirely written by agents also happen to either be the foundation model providers themselves, or extremely well-capitalised firms.</p><p>So, if you are one of those who find themselves unable to ship production quality code with agents - let me put it plainly, at this point, you are the problem. Or rather, maybe your wallet is.</p><h1>How Do I Know If I&#8217;m Spending Enough Tokens</h1><p><a href="https://www.systematiclongshort.com/p/how-to-be-a-world-class-agentic-engineer">I wrote an entire article saying that the problem is definitely not your harness, and that you can &#8220;keep it simple&#8221; and still produce extraordinary work</a>, and I stand by it. You&#8217;ve read it, followed it, and are still largely disappointed by what your agent produces. You sent me a DM and saw that I had read but not responded.</p><p>This is the response!</p><p>Your agent sucks and can&#8217;t solve your problems because, for the most part, you&#8217;re just not spending enough tokens.</p><p>The amount of tokens you need to throw at a problem to solve it depends entirely on the scale, complexity and novelty of the problem.</p><p>What&#8217;s 2+2? 
Don&#8217;t need many tokens at all.</p><p>&#8220;Write me a bot that can scan all the different markets between Polymarket and Kalshi, find semantically similar markets that should resolve around the same event with some no-arbitrage limits, and trade on those markets whenever an arbitrage opportunity presents itself in a latency-sensitive manner&#8221; will require a fuckton of tokens.</p><p><strong>Here&#8217;s something interesting we&#8217;ve discovered along the way.</strong></p><p>If you throw enough tokens at problems that arise from scale and complexity, agents can solve them no matter what. That is to say, if you want to build something extremely complex, with many moving parts and many LOCs, then given enough tokens every problem that arises will eventually be completely resolved.</p><p><strong>There is a small caveat here.</strong></p><p>Your problem cannot be too novel. No amount of tokens can solve &#8220;novel&#8221; problems at this point in time. Sufficient tokens will bring all errors that arise from complexity down to 0, but will not allow an agent to invent something it does not know.</p><p><strong>That was actually a relief for us.</strong></p><p>We tried very hard, and spent &#8212; a lot, a lot, A LOT &#8212; of tokens to see if we could get our agents to recover institutional investment processes with minimal guidance. Part of the effort here was to understand how many years we were away (as quants) from being completely replaced by AI. It turns out that they could not even come close to putting together an institutional investment process. We think that this is in part because they have never seen one before, i.e. no training data exists for institutional investment processes.</p><p>So, if your problem is novel, don&#8217;t throw more tokens at it for a solution. You need to guide the discovery process. 
You can, however, once you are certain of the implementation, simply throw more tokens at it - no matter the size of the codebase or the number of moving parts.</p><p><strong>Here&#8217;s a simple heuristic: token budget should scale proportionally with lines of code.</strong></p><h1>What Do Extra Tokens Actually Do?</h1><p>In practice, extra tokens typically improve your agentic engineering in one of the following ways:</p><ol><li><p>They spend more time reasoning through the same attempt and might catch erroneous logic on their own. More reasoning = better planning = higher chance of one-shotting something.</p></li><li><p>They are allowed multiple independent attempts at different solution paths. Some solution paths are better than others. Being allowed more than one means they can pick the best.</p></li><li><p>Similarly to 2, having more independent planning attempts allows them to abandon weak ones and keep the most promising.</p></li><li><p>Having more tokens allows them to critique their previous work with a fresh context, which gives them a chance to improve it without being stuck in a &#8220;line of reasoning&#8221;.</p></li><li><p>And of course, my favorite: having more tokens simply means they can verify with tests and tools. Actually running code to see if it works confirms the correct answer.</p></li></ol><p>It works because agentic engineering failures are not random. They are almost always a result of choosing the wrong path too early, not checking (early on) if the path actually worked, or not having enough budget to recover and undo a mistake once they&#8217;ve noticed one.</p><p>That&#8217;s the entire story. Tokens literally buy you decision quality. Think of it like research: if you asked a human to answer a difficult question on the spot, the quality of their answer would diminish with urgency.</p><p>Research, after all, is what produces the bedrock of knowing the answer. 
Humans spend biological time to produce better answers, and agents simply spend more compute time to produce better answers.</p><h1>How To Improve Your Agents</h1><p>You may still be skeptical, but there are many papers that support this, and honestly, the very existence of &#8220;reasoning&#8221; knobs should be all the proof you need.</p><p>One of my favorite papers is one where researchers trained on a small curated set of reasoning examples, then used a method that effectively forced the model to keep thinking by appending &#8220;Wait&#8221; when it tried to stop too early. That alone pushed one benchmark from 50% to 57%.</p><p>What I&#8217;m actually trying to say as plainly as possible is that single-pass max thinking is likely insufficient for you if you are constantly complaining that your agent&#8217;s code leaves much to be desired.</p><p>Instead, I offer you two very simple solutions.</p><h2>Simple Thing 1: WAIT</h2><p>The really simple thing that you can start doing today is to simply set up an automated loop where you build something, and then get your agents to review it N times with a fresh context (each time) and to fix its findings each time it discovers anything new.</p><p>If you find that this simple trick improves the results of your agentic engineering, then you are at least cognizant that your problem is simply a matter of token count &#8212; then you can come over and join the token burning club.</p><h2>Simple Thing 2: VERIFY</h2><p>Get your agents to verify their work early and often. Write tests that demonstrate that the chosen path actually works. This is really helpful for highly complex, deeply nested projects where a function might be used downstream by many other functions. Being able to catch upstream errors saves you a ton of computational time (tokens) later on. So if you can, create &#8220;verification&#8221; checkpoints everywhere along your build out.</p><p>Wrote something and your primary agent says it&#8217;s done? 
Get a secondary agent to verify it. Uncorrelated thinking streams cover systematic sources of bias.</p><h1>Conclusion</h1><p>That&#8217;s really it. I could write significantly more on the topic, but I feel like just being aware of these two simple things and implementing them well will get you 95% of the way there. I&#8217;m a huge proponent of doing simple things extraordinarily well and then layering complexity as you need it.</p><p>I&#8217;ve mentioned novelty as a problem unsolvable with tokens, and I want to stress this again, because you will inevitably run into it and then come crying that throwing more tokens at the problem did not work.</p><p>When what you want solved is not in the training set, then YOU really need to be the one with solutions. Hence, domain expertise is still incredibly important.</p><p>I know, I know, this whole article sounds like it was sponsored by Big Tokens, but I swear it isn&#8217;t &#8212; <strong>I wrote this to help you be a better agentic engineer</strong>. Although, if Big Tokens reads this and decides to retroactively sponsor me - I&#8217;d <strong>love</strong> to sell out.</p>]]></content:encoded></item><item><title><![CDATA[Institutional Dataset: Crowdsourced Forward Expectations]]></title><description><![CDATA[Today&#8217;s article is about a particularly interesting dataset for those of you working in the equities space. We have a dataset that represents a live market for forward expectations, whose core product is actually a point-in-time (PIT) history of crowdsourced quarterly EPS and revenue estimates from an institutional, sophisticated crowd!

Many of you may not even know this exists, but it&#8217;s a pretty stable source of alpha for those in systematic equities. It lets you watch forward expectation formation in real time! It allows you to inspect, in real time, who submitted each estimate, when they revised it, and how far they sit from the current average / sell-side consensus.

There is a lot of alpha in this dataset because ordinary sell-side consensus on forward expectations can be a narrow benchmark updated under institutional (and, in some cases, political) friction. There are problems of herding, access, and aversion to career, reputational, and political risk.

This dataset overcomes all of that because contributors are not (just) sell-side analysts, and therefore represent an aggregation of a broader view of the market. The dataset in question today is&#8230;]]></description><link>https://www.systematiclongshort.com/p/institutional-dataset-crowdsourced</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/institutional-dataset-crowdsourced</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Thu, 19 Mar 2026 13:59:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d29f5436-2d57-4308-ae67-511cb01c5c88_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today&#8217;s article is about a particularly interesting dataset for those of you working in the equities space. We have a dataset that represents a live market for forward expectations, whose core product is actually a point-in-time (PIT) history of crowdsourced quarterly EPS and revenue estimates from an institutional, sophisticated crowd!</p><p>Many of you may not even know this exists, but it&#8217;s a pretty stable source of alpha for those in systematic equities. It lets you watch forward expectation formation in real time! It allows you to inspect, in real time, who submitted each estimate, when they revised it, and how far they sit from the current average / sell-side consensus.</p><p>There is a lot of alpha in this dataset because ordinary sell-side consensus on forward expectations can be a narrow benchmark updated under institutional (and, in some cases, political) friction. There are problems of herding, access, and aversion to career, reputational, and political risk.</p><p>This dataset overcomes all of that because contributors are not (just) sell-side analysts, and therefore represent an aggregation of a broader view of the market. The dataset in question today is&#8230;</p>
      <p>
          <a href="https://www.systematiclongshort.com/p/institutional-dataset-crowdsourced">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Automated Alpha Mining, Not Useless Formula Factories]]></title><description><![CDATA[With our work on OpenForage, many have asked about automated alpha mining, how it works, and how to think about it. Alpha mining is by now at least a decade old, but has only really been taken seriously in the past few years due to an explosion in compute that made really sophisticated searchers possible (e.g. neural networks). 

The core premise is really simple - you write a simple rule, you rank stocks with it, and you check if this ranking predicts future returns.

Today&#8217;s article is meant to discuss the really simple stuff and get everyone up to speed so that we can talk about the more interesting alpha search algorithms in coming articles.]]></description><link>https://www.systematiclongshort.com/p/automated-alpha-mining-not-useless</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/automated-alpha-mining-not-useless</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Tue, 17 Mar 2026 14:09:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e8beec82-cdfe-48f0-bdf6-6c1690d5afc9_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>With our work on OpenForage, many have asked about automated alpha mining, how it works, and how to think about it. Alpha mining is by now at least a decade old, but has only really been taken seriously in the past few years due to an explosion in compute that made really sophisticated searchers possible (e.g. neural networks).</p><p>The core premise is really simple - you write a simple rule, you rank stocks with it, and you check if this ranking predicts future returns.</p><p>Today&#8217;s article is meant to discuss the really simple stuff and get everyone up to speed so that we can talk about the more interesting alpha search algorithms in coming articles.</p><h1>What Is An Alpha?</h1><p>An alpha goes by many names - signals, alphas, etc. I use &#8220;alpha&#8221; because that&#8217;s what the literature calls a formulaic signal, which is a form that is particularly amenable to being &#8220;mined&#8221;.</p><p>A few toy examples to make this less abstract:</p><ul><li><p>(close - open) / (high - low)</p></li><li><p>vwap / close</p></li><li><p>volume / mean(volume, 20)</p></li></ul><p>You will notice that an alpha is trivially composed of two parts: a <strong>field </strong>(e.g. close, open, high, low, etc.) 
represented by datetime x instrument matrices; and <strong>operators</strong> (e.g. -, /, mean) that transform these fields whilst preserving the structure of the matrices.</p><p>A &#8220;formulaic&#8221; alpha of the above form is designed to exploit simple patterns that repeat in the stock market. Ultimately, the core belief here is very simple: these patterns are at least weakly correlated with future returns. In plain English, whenever one of these signals has a high value, it tells you something about where returns are going to go. Take, for example, a simple momentum signal like the above [(close - open) / (high - low)]; when the value is high, it&#8217;s expected that returns will be positive and prices will continue rising, and vice versa. Is it going to be true all the time? Absolutely not. Is it going to be true sometimes? <strong>Yes, and it&#8217;s actually mostly only going to be true close to 50% of the time.</strong> That is to say, these signals are really, really weak. As standalones, they are almost entirely untradeable.</p><p>Yet, with a large enough collection of them, something beautiful happens. Noise cancels out and a strong meta-signal emerges. Hence, the goal of any large, scalable, sophisticated investment process is to collect as many of these signals as possible. This is the primary reason why quantitative researchers are hired en masse by the largest quantitative hedge funds!</p><h1>What Is An Alpha Search?</h1><p>An alpha search is the act of systematically finding all available <strong>good</strong> alphas in your search space (defined as all possible permutations of your fields and operators). People automate this process for many reasons.</p><p>Firstly, the search space gets huge very quickly. Just allowing for a dozen fields and a dozen operators creates a search space with more formulas than you will ever want to read in your lifetime. 
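To see how fast it grows, here is a rough count of the distinct expressions buildable from a dozen fields and a dozen operators, under the simplifying assumption that every operator is binary:

```python
def num_expressions(max_depth, n_fields=12, n_ops=12):
    """Count expressions with at most `max_depth` levels of nesting,
    assuming every operator takes exactly two sub-expressions."""
    if max_depth == 0:
        return n_fields  # a bare field like `close`
    sub = num_expressions(max_depth - 1, n_fields, n_ops)
    # either a bare field, or op(left, right) over shallower expressions
    return n_fields + n_ops * sub * sub

print(num_expressions(1))  # 1740
print(num_expressions(2))  # 36331212
print(num_expressions(3))  # ~1.6e16, already unreadable in a lifetime
```

Real operator sets mix arities and parameters (windows, lags), so this undercounts; the explosion is the point.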
If you don&#8217;t restrict the depth of your alphas, the search space is quite literally infinite.</p><p>Secondly, alphas in this formulaic form are much easier to audit than huge black-box forecasts. A factor can be a really messy nested expression, but it will still be inspectable. A neural network is opaque in a way that shows no remorse.</p><p>Thirdly, related to the above, real portfolios use massive collections of signals. You will not be able to trade a meaningful portfolio without an extremely large collection of signals. Hence, the goal is to build a gigantic library of signals that each add something useful.</p><h1>Building A Framework For Automated Alpha Mining</h1><p>Before you automate anything, it is worth thinking about the process of manual alpha mining. It goes something like this:</p><ol><li><p><strong>Define A Hypothesis</strong>: What patterns do you think are predictive of future prices in the stock market? <strong>One simple example is momentum - when prices go up, they continue going up. </strong>The quality of the hypothesis is often directly correlated with the quality of the researcher. Good researchers think of interesting hypotheses grounded in economic rationale.</p></li><li><p><strong>Get Fields | Operator Types</strong>: What fields and operator types would you need to express this? For fields, it will be clear that you need <strong>returns, </strong>but you can use many different types of operators here. You can use zscores, percentage changes, deltas, or ranks of returns, and they will all do a reasonable job of expressing a momentum signal.</p></li><li><p><strong>Build Your Expression</strong>:<strong> </strong>With your fields and operator types, you can now choose an expression that expresses your signal. For momentum, we can go with the simple rank(longitudinal_zscore(returns, 20)). 
You want the biggest positive positions in the stocks with the highest returns and the biggest negative positions in the stocks with the lowest returns. Assuming your rank is [0, 1], you can achieve a delta-neutral signal by just subtracting a 0.5 scalar from your signal to transform it to [-0.5, 0.5].</p></li><li><p><strong>Backtest Your Expression</strong>: Now you need to check if this expression is any good in the past. Remember that the entire idea here is that you&#8217;re assuming that patterns that worked (predicted future returns) in the past are going to continue working in the future. If they never worked in the past, you need a DAMN GOOD reason to assume they&#8217;re going to work in the future! You also want to take the chance to do a bunch of measurements! These measurements allow you to decide if the signal you have is &#8220;good&#8221; by your standards. All the IP in search algorithms collapses to essentially how you determine whether or not a signal is &#8220;good&#8221;.</p></li><li><p><strong>Keep Or Kill</strong>: Most signals are not going to be good signals. That&#8217;s life. You&#8217;re going to test a bunch of stuff that sounds good ex-ante and actually looks like shit once you begin to even remotely stress test it.</p></li></ol><p>Once you understand the manual flow for alpha mining, you can think about how you want to approach this algorithmically. It should be obvious that building your expression and backtesting it is a fairly mechanical process.</p><p>Hence, all the &#8220;intelligence&#8221; of an automated search is really 3 things:</p><ol><li><p><strong>How do you let artificial intelligence dream of predictive patterns?</strong> Good researchers will tell you that a good hypothesis is a result of inspiration, observation and intuition. It is often a product of a &#8220;worldview&#8221; that has been honed from experience. 
How do you let artificial intelligence develop a &#8220;worldview&#8221; to come up with good predictive patterns? Seeding a &#8220;worldview&#8221; with academic literature is often a decent technique, but it ensures that you only get crowded, spent and stale alphas.</p></li><li><p>Once you&#8217;ve developed a worldview for what pattern can predict prices<strong>, how do you express that pattern as an alpha</strong>? For example, in momentum, you could use close prices, you could use open prices, or both? You could calculate the velocity of prices, or their acceleration, or their percentage changes. You may have decided on a function but you will need to decide on a form as well, and not all forms are equal!</p></li><li><p>Is this alpha <strong>useful</strong>?<strong> </strong>Once you&#8217;ve given your alpha life and have backtested it and generated a bunch of statistics / information about it, you will now need to decide if you want to <strong>use</strong> it<strong> in production</strong>. Basically, you want to know if this alpha is good enough for you to risk <strong>REAL MONEY</strong>.</p></li></ol><h1>A Skeleton Automated Loop</h1><p>So, once you understand the manual search for alphas and have some ideas about an automated search, you can put together an automated loop that looks like this:</p><ol><li><p>Generate candidate alphas</p></li><li><p>Evaluate each candidate alpha</p></li><li><p>Reject all bad alphas</p></li><li><p>Keep good alphas</p></li><li><p>Repeat 1-4 until death of the universe</p></li></ol><p>And you keep doing so, making sure that you traverse the entire search space, collecting all the good, diverse alphas. Alphas good enough that you think they deserve to have real money put on them. Beginners usually think that the hard part is generating alphas. </p><p>There are billions of candidate alphas from just open, high, low, close and a few well-thought-out operands. 
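The skeleton loop can be sketched as follows. Everything here is a stand-in: the random candidate generator and the Sharpe-based keep/kill bar are placeholders for your actual expression search and scoring logic.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidate():
    # Placeholder: a real search would enumerate or evolve expressions
    # over your fields and operands, then backtest them.
    return rng.standard_normal(252)  # one year of daily backtest PnL

def score(pnl):
    # Toy evaluation: annualized Sharpe ratio of the backtest PnL.
    return np.sqrt(252) * pnl.mean() / pnl.std()

library = []
for _ in range(1000):                 # 1. generate candidate alphas
    s = score(generate_candidate())   # 2. evaluate each candidate
    if s > 2.0:                       # 3. reject bad alphas...
        library.append(s)             # 4. ...keep the good ones
# 5. in production, repeat until the death of the universe

# Only a small fraction of random candidates clears the bar.
```

Swapping in a real generator and a real scoring function turns this toy loop into the search described above; the structure stays the same.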
The problem is that the number of bad alphas is also in the billions.</p><h1>A Toy Project</h1><h2>Choose A Universe</h2><p>Pick something liquid and consistent. For example, S&amp;P 500 daily bars.</p><h2>Get Data For All Stocks PIT In Your Universe</h2><p>Minimally, you want at least price-volume data. Transform them so that they are indexed by datetime and columned by your instruments.</p><h2>Write Operands</h2><p>You need a collection of operands that you can apply to your data.</p><h2>Build Your Backtester</h2><p>Given an alpha, can you find out how well this alpha would have performed in the past?</p><h2>Build Alphas By Hand</h2><p>Start with formulas you can explain in one sentence. Understand your transformations. Understand how signals behave. Check if your infrastructure works fine.</p><h2>Score Your Alphas</h2><p>You need to design metrics to determine the goodness of your alphas.</p><h2>Develop Intuition For Alpha Ideas</h2><p>What alphas tend to do well? What alphas tend to generalize? What alphas are secretly clones of each other? Why? What does depth do to an alpha idea? What does breadth do? What alphas are actually <strong>tradeable</strong>?</p><h2>Automate</h2><p>Use the skeleton automated loop - come up with candidate alphas. Older approaches use evolutionary methods (GA), newer ones use reinforcement learning and other deep neural network approaches.</p><h1>Where It Breaks Down</h1><h2>Optimizing For Backtests</h2><p>Your alphas are always going to look great in backtests, but you are going to get a rude awakening if you think they are going to trivially generalize into live trading performance.</p><h2>Confusing Formulas For Ideas</h2><p>You can come up with thousands of parameter combinations even for a shallow alpha. 
You can swap the field out of an alpha for hundreds of other candidates, but that doesn&#8217;t generate a new idea - it only allows you to permute around the same one.</p><h2>Out Of Sample Correlation</h2><p>What matters isn&#8217;t the in-sample uniqueness of your alphas. What matters is that out of sample, they actually move differently, and don&#8217;t fail together spectacularly.</p><h2>Confusing Alpha Discovery For Portfolio Management</h2><p>Researchers in large firms typically handle alpha discovery and have no idea how to monetize these alphas at all. The alphas are handed off to portfolio managers whose sole job is to think deeply about monetisation and all the quirks that come with it. Likewise, you may be tempted to believe that your in-sample backtests are going to trivially translate into out-of-sample performance without thinking deeply about portfolio management and live execution. I have really bad news for you then&#8230; <strong>Finding alphas is the easy part</strong> of the investment process!</p><h1>Next Steps</h1><p>The good news is that OpenForage handles the portfolio management and execution of these alphas after you&#8217;ve submitted them, and we issue payments in proportion to the PnL these alphas have generated, allowing agents (and the humans behind these agents) to focus on the search whilst we handle the plumbing.</p><p>In subsequent articles I will discuss more state-of-the-art search algorithms and start to bring in more practical aspects of running search algorithms, especially in relation to OpenForage!</p><h1>Conclusion</h1><p>This piece was built mainly from four papers: 101 Formulaic Alphas by Zura Kakushadze; AutoAlpha: an Efficient Hierarchical Evolutionary Algorithm for Mining Alpha Factors in Quantitative Investment by Tianping Zhang, Yuanqi Li, Yifei Jin, and Jian Li; Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning by Shuo Yu, Hongyan Xue, Xiang Ao, Feiyang Pan, Jia He, Dandan Tu, and Qing He; and 
FactorMiner: A Self-Evolving Agent with Skills and Experience Memory for Financial Alpha Discovery by Yanlong Wang, Jian Xu, Hongkang Zhang, Shao-Lun Huang, Danny Dongning Sun, and Xiao-Ping Zhang. Their shared lesson is simple: alpha mining starts as formula research, then becomes search, then becomes library management.</p>]]></content:encoded></item><item><title><![CDATA[Lessons Learnt From A Real Pairs Trading Signal On The Requirements To Generate Real Alpha In A Competitive Market]]></title><description><![CDATA[It is generally understood that signals decay, but the mechanisms that lead to their decay are seldom well understood. Today, I want to show you a real trading signal that every systematic long short team ran from 2010 to&#8230; present day, and explain to you exactly how alpha compression happens, and what the implications of this compression are when YOU are looking to generate alpha.]]></description><link>https://www.systematiclongshort.com/p/lessons-learnt-from-a-real-pairs</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/lessons-learnt-from-a-real-pairs</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Fri, 13 Mar 2026 14:10:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c9f49604-b987-4947-b235-cfce79edba3b_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>It is generally understood that signals decay, but the mechanisms that lead to their decay are seldom well understood. 
Today, I want to show you a real trading signal that every systematic long short team ran from 2010 to&#8230; present day, and explain to you exactly how alpha compression happens, and what the implications of this compression are when YOU are looking to generate alpha.</p><p><strong>Yes, it&#8217;s the infamous pairs-trading signal</strong>.</p><p>Understanding the way in which the trade has decayed is supremely important for systematic traders, because this is a signal that is grounded in strong economic rationale: there are instruments (stocks) that are similar businesses and are therefore expected to share very similar price movements (since these price movements will have the same economic/industry/subindustry/business line drivers). Hence, the idea is that whenever these price movements STOP being similar (temporarily), you can bet that they will eventually become similar again, therefore earning the spread from the dislocation closing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dw-7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dw-7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 424w, https://substackcdn.com/image/fetch/$s_!dw-7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 848w, 
https://substackcdn.com/image/fetch/$s_!dw-7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 1272w, https://substackcdn.com/image/fetch/$s_!dw-7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dw-7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png" width="953" height="498" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:953,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.systematiclongshort.com/i/190805411?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dw-7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 424w, https://substackcdn.com/image/fetch/$s_!dw-7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 
848w, https://substackcdn.com/image/fetch/$s_!dw-7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 1272w, https://substackcdn.com/image/fetch/$s_!dw-7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b287f65-2e0d-447d-905b-44b49fc38dc0_953x498.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>For all intents and purposes, it should always &#8220;exist&#8221;, but the data doesn&#8217;t seem to support it. 
Why?</p>
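For readers who want the mechanics spelled out, here is a minimal sketch of the classic version of the signal. The prices are synthetic, and the 60-day window and 2-sigma entry thresholds are illustrative choices, not recommendations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Two synthetic "similar" stocks: a shared driver plus idiosyncratic noise.
common = rng.standard_normal(500).cumsum()
a = pd.Series(100 + common + rng.standard_normal(500))
b = pd.Series(100 + common + rng.standard_normal(500))

# Trade the z-score of the price spread between the pair.
spread = a - b
z = (spread - spread.rolling(60).mean()) / spread.rolling(60).std()

# Short the spread when stretched high, long when stretched low,
# flat otherwise -- betting the dislocation eventually closes.
position = pd.Series(0.0, index=z.index)
position[z > 2.0] = -1.0
position[z < -2.0] = 1.0
```

On synthetic data with a genuinely shared driver, the spread mean-reverts by construction; the whole point of the article is what happens when everyone trades that reversion in real markets.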
      <p>
          <a href="https://www.systematiclongshort.com/p/lessons-learnt-from-a-real-pairs">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[You've Got The Alpha, Now You Need To Execute]]></title><description><![CDATA[Learn optimal execution policies to maximize the alpha you've worked so hard to get]]></description><link>https://www.systematiclongshort.com/p/youve-got-the-alpha-now-you-need</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/youve-got-the-alpha-now-you-need</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Wed, 11 Mar 2026 14:05:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cccd3b57-9c83-452a-8a63-1881f9f8fb22_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Most of us work really hard on the alpha part of the investment process. It&#8217;s the &#8220;sexiest&#8221; part and we all love to feel like biggus brainus producing PnL charts that go diagonally from left to right with no bumps. Unfortunately, translating that perfect PnL chart into actual dollars earned is pretty difficult once you account for real-life trading frictions. We have fees we need to pay to the exchange, slippage from being front-run, and the bid-ask spread if we have no patience. </p><p>Large firms often hire large execution trading teams that decide on trading policies. These policies decide whether to make or take, and when quoting, whether to wait and leave the quote as-is, or refresh their quotes and lose time priority.</p><p>You&#8217;re not a large firm, so you are probably thinking to yourself: should I use limit or market orders? Should I be aggressive or passive? When do I switch between them? This article is meant to address these questions.</p>
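One way to frame the aggressive-versus-passive question is as an expected-cost comparison. Every number below is an illustrative assumption (spreads, fees, fill probabilities, and adverse-move costs are venue- and signal-dependent), not market data:

```python
# Toy break-even comparison between taking (market order) and
# making (limit order). All inputs are illustrative assumptions.
spread_bps = 4.0       # full bid-ask spread, in basis points
taker_fee_bps = 1.0    # fee paid when crossing the spread
maker_fee_bps = -0.2   # negative = a rebate, on venues that pay makers
fill_prob = 0.7        # chance the passive quote gets filled
miss_cost_bps = 6.0    # adverse move eaten when the quote misses

# Taking: pay half the spread plus the taker fee, with certainty.
take_cost = spread_bps / 2 + taker_fee_bps

# Making: pay the maker fee if filled, eat the miss cost otherwise.
make_cost = fill_prob * maker_fee_bps + (1 - fill_prob) * miss_cost_bps

# Prefer whichever expected cost is lower; the answer flips as fill
# probability and urgency (the miss cost) change.
prefer_make = make_cost < take_cost
```

Under these made-up numbers, passive quoting wins; crank up the miss cost (a fast-decaying alpha) or lower the fill probability and the comparison flips toward taking.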
      <p>
          <a href="https://www.systematiclongshort.com/p/youve-got-the-alpha-now-you-need">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Reason About A Messy Future]]></title><description><![CDATA[I've had a few people ask me where I think this all goes. This article is the answer to that. The honest truth is that I'm not really sure about the magnitude of these changes, but if quant finance has taught me anything, it's that being directionally correct is often enough.]]></description><link>https://www.systematiclongshort.com/p/how-to-reason-about-a-messy-future</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-reason-about-a-messy-future</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Mon, 09 Mar 2026 12:46:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/787f2444-b77c-4209-ba40-e86a110ee03c_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>The first time I realized we were heading towards an inflection point was when I heard the music slowing down at my previous role, even as everyone around me pretended nothing would change.</p><p>I was managing a team of close to 20 pax in a hedge fund, doing the thing I had been doing for years. For all intents and purposes, I was likely going to do even greater things there. And yet, I moved from a position people would kill for to building a startup from ground zero with a skeleton crew - a move so little understood and widely seen as crazy. With the recent news of massive layoffs, people quitting explicitly to build startups, or quietly quitting and burning tokens at night doing the same, my actions seem a lot less insane now.</p><p>I&#8217;ve had a few people ask me where I think this all goes. This article is the answer to that. 
The honest truth is that I&#8217;m not really sure about the magnitude of these changes, but if quant finance has taught me anything, it&#8217;s that being directionally correct is often enough.</p><h1><strong>Writing On The Wall</strong></h1><p>It was ChatGPT o1 that did it for me. Up until that point, I had referred to them only as &#8220;LLMs&#8221; and not &#8220;AI&#8221;; I was not yet convinced that any semblance of real intelligence would emerge from them.</p><p>But with o1, it was the first time these LLMs could credibly produce code from well-structured prompts. It was still messy. They still suffered from the occasional bout of hallucination and confusion. But here was what mattered: they could actually produce useful code.</p><p>The line of reasoning I took was this: once AI could get to a point where it could produce useful code, it would recursively write improvements to its own logic and accelerate development at a scale we would not be able to comprehend. Whenever I shared this, people would counter-argue that the code agents wrote was still buggy and not &#8220;production-ready.&#8221; This misses the point that even humans write buggy code.</p><p>We don&#8217;t need flawless code to completely stop writing code. We stop writing code the instant we realize that agents produce fewer bugs than us, at a pace that far exceeds us. The bar for fully relegating the burden of coding to agents was so low that once I saw o1 up close, I knew the future was going to change dramatically.</p><h1><strong>Quant Finance And The Moat Of Knowledge</strong></h1><p>I thought AI would eventually eat away a vast majority of quant finance, although it was going to take a while, since there was very little publicly available institutional code for LLMs to train on. 
I imagined software engineering as a pyramid: at the base was basic code monkey work, above that was your senior developer with some architectural thinking, and above that were specialized developers: data scientists, quant developers, and so on. The more your profession required specialized knowledge, the safer you would be.</p><p>I thought we would wipe out the entire tranche of code monkeys within 2 years. Then senior developers would start to go. And layer by layer, specialized knowledge would also be incorporated into the LLMs and they too would be wiped out.</p><p>It quickly became obvious that the frontier model providers would eventually hire specialized knowledge workers to contribute industry know-how to the frontier models. Specialized knowledge seemed like it would be a moat for the next couple of years, but also end up being eaten away gradually.</p><h1><strong>The Remaining Moats</strong></h1><p>There were a few categories of businesses that I thought would be safe from being trivially disrupted within the next 5 years.</p><p>The first is <strong>proprietary data</strong>. Businesses that produced a lot of proprietary data as exhaust would be hard to disrupt. Large podshops like Millennium come to mind: they can collect analyst readings, detailed analysis, recommendations, and actual price changes, and use this data to fine-tune frontier models into something that is not going to be easily replicated. Any business producing proprietary data not trivially obtained by the frontier models would have a longer lease on life.</p><p>The second is <strong>regulatory friction</strong>. Businesses where other humans are a bottleneck seemed much harder to disrupt. Being able to trade in many TradFi markets meant opening broker accounts, getting licenses, signing contracts around the globe. It&#8217;s easy to trade crypto, but much harder to trade iron ore in China as a non-Chinese firm. 
If you need a human to rubber-stamp your progress, the speed of that industry is always going to be bottlenecked by the cost and speed of that approval.</p><p>The third is <strong>authority as a service</strong>. It&#8217;s not too hard now to get an agent to draft a legal opinion given a comprehensive study of the matter and the laws surrounding it. And yet we&#8217;re still going to pay tens of thousands of dollars for one drafted by a lawyer, because an AI&#8217;s legal opinion is worth nothing at this point in time. Smart contract audits are another example. We&#8217;re probably already at a level where agents can review smart contracts as well as or better than the top decile of humans, yet most people still buy the stamp of authority from a branded firm. The opinion isn&#8217;t what you&#8217;re paying for. The authority behind it is.</p><p>The fourth is <strong>physical intelligence lag</strong>. Hardware moves much more slowly than software, and breaking hardware is a lot harder to fix. Physical businesses interacting with the real world are a lot less likely to be disrupted soon. That said, once hardware catches up, the same pyramid logic applies: lower-level jobs go first, then the more specialized ones.</p><p>These moats are real, but none of them are permanent. The honest read is that they buy time, not safety.</p><h1><strong>Reasoning About A Messy Future</strong></h1><p>When the future is genuinely noisy, when the rate of change is fast enough that most analogies break down, people tend to do one of two things. They either wait for certainty before acting, or they pattern-match to the past (&#8220;this is like the internet boom&#8221;) and act on the wrong model. Both are mistakes.</p><p>It is worth reasoning from first principles under incomplete information. You don&#8217;t need to know exactly how something plays out. 
You just need to be directionally correct, and you need to structure your bets so that being early and wrong is survivable, while being early and right is disproportionately rewarding. </p><p><strong>Asymmetry is the whole game when the future is uncertain.</strong></p><p>The practical version of this is: ask what has to be true for a given outcome to happen, and then ask how legible the inputs to that outcome already are. The inflection we&#8217;re living through was not unforeseeable: the inputs were visible. Code that could write code. Models that improved recursively. Institutional knowledge that could be bought, not just grown. Anyone willing to stare at those inputs clearly could see roughly where they pointed, even without knowing the exact path.</p><p>You can recursively reason about this and extrapolate further. I don&#8217;t even think we&#8217;ve yet caught a glimpse of what it will be like when agents can train themselves, when agents can replicate, when agents become truly autonomous. An agent that can increase its intelligence by 0.1% through a series of actions may not seem significant, but any number that is not 0 increases the probability that the next increment is greater, and so on and so forth. There are vast power laws at play here and it is worth thinking along the lines of what a future looks like under those power laws.</p><p>By the time the signal is obvious, the trade is crowded. In markets, you pay for early conviction with uncertainty. In careers and startups, the currency is the same.</p><p>So the question isn&#8217;t really &#8220;what&#8217;s going to happen?&#8221; The question is: &#8220;what do I already know, what direction does it point, and what&#8217;s the cost of acting on it now versus waiting?&#8221;</p><p>One thing I often see people miss is that action creates information. Action does not happen in a vacuum. When you act on the world, the world replies with information. That information powers iteration. 
Iteration begets more informed action. That is the nature of progress.</p><p><strong>Being still in incomplete information is decay. </strong></p><p><strong>Moving towards action is discovery.</strong></p><h1><strong>Thinking About Next Steps</strong></h1><p>I knew I had a couple of years if I just wanted to milk the status quo. But a large part of me felt like if I wanted to do something, I would have to start sooner rather than later. I had always wanted to build something truly mine, and it seemed like the window to do that was quickly closing.</p><p>To be clear, I know that the largest hedge funds in the world would be fine. They have proprietary data that makes them very difficult to replace. TradFi markets are also bottlenecked by human signatures, both on a regulatory and at times even a trading front. What I do think, however, is that those largest funds will use AI to replace most of their workforce, even terminal career seats like Portfolio Managers. Not immediately, but eventually, surely.</p><p>What I felt was that I had about 4-5 years before the foundation model providers hired enough specialized talent to make being an upstart trading firm nearly impossible. In certain markets, like US equities, it already feels that way. I can&#8217;t imagine how much more efficient it&#8217;s going to look in just a few more years.</p><p>There was clearly not going to be space for &#8220;second best&#8221; pretty soon. I could keep working for the &#8220;best&#8221;, but it seemed more aligned with my goals to strike now, in a market I had a genuine edge in, with knowledge that was not going to be trivially replicated. So, having that dawg in me, I called it quits and went all in on what eventually became <a href="https://x.com/@openforage">@openforage</a>.</p><h1><strong>Inflection Point</strong></h1><p>Today, it&#8217;s really starting to feel like the window is visibly closing. 
The pace of change has stopped feeling gradual, and most people following the space are beginning to realize that what used to take months of improvement now takes weeks.</p><p>In my opinion, jobs will not vanish entirely within the next couple of years. There will always be a need for humans. Humans are social creatures, as long as humans are in charge, we want other humans around. And humans don&#8217;t trust AI yet, so stamps of authority still need to come from a human. I imagine AI CEOs in the next couple of years, but there will still likely be a human CEO having to &#8220;approve&#8221; and certify the AI CEO. This idea of human certification cascades down the pyramid. A human manager will manage and certify a bunch of agents working under him.</p><p>But the arithmetic of hires will change. If a CEO can prompt an agent more easily than they can prompt you, there&#8217;s no need to hire you. Shallow, code-monkey work will be very difficult to find going forward.</p><p>To be irreplaceable, you need to operate at a timescale far above current agent limitations - receiving instruction, managing agents, and working with them for weeks, months, or years. Long-term strategic thinking and policy planning is one of the strongest job moats for the foreseeable future. You also need to operate at a scope greater than current agent limitations. Agents have limited context. They know everything about anything, yet cannot trivially see how component A interacts with component B interacts with component C causing cascading effects to component D. They lack scope.</p><p>If you can think far and wide, absorb information quickly, make decisions for the long term, and are likeable, you will hold down a job, at least for the foreseeable future.</p><p>If you do intend to be an employee, it&#8217;s worth taking stock of what your work is actually made of. Some tasks are deeply human defensible. Some will be replaced cheaply over the next couple of years. 
Do more of the former and less of the latter.</p><p>Working for a great firm in a deeply defensible position, one that sits behind real moats, may give you a career runway while the rest of the workforce gets eaten by the foundation models. You can still spend your tokens at night, rolling the dice, trying to build something meaningful.</p><p>But if you have a burning desire to contribute a unique verse to the world, think carefully about where your market of choice is heading. If your window to build something defensible is closing, you need to begin operating before the market fully prices in the competition that is coming.</p><h1><strong>Conclusion</strong></h1><p>The inputs that create inflection points are legible ahead of time, if you&#8217;re willing to look. Most people don&#8217;t look, or they look and don&#8217;t act, or they wait until the signal is so loud that the opportunity is already priced in.</p><p>Don&#8217;t ignore the shifting sands. Don&#8217;t stay somewhere that&#8217;s losing ground while telling yourself you&#8217;ll make the leap when the timing is better. There&#8217;s no better timing, and the timing rarely announces itself. When it becomes obvious to everyone, the window has normally already closed.</p><p>I looked, I made a bet, and now I&#8217;m living inside the outcome of that bet &#8212; for better or worse.</p>]]></content:encoded></item><item><title><![CDATA[How To Build An Actually Useful Factor Model]]></title><description><![CDATA[You&#8217;ll often see new papers describing &#8220;so and so&#8221; as a new factor being discovered, and they will often cite some Sharpe/t-stat as the primary reason why this should be considered as a &#8220;factor&#8221; that explains the cross-section of stock returns.

Unfortunately, the truth is that most of them are not novel factors and can hardly be considered useful, because using the t-stat as a measure of importance or success means using the wrong metric and asking the wrong questions when building a factor model.

We&#8217;ve already extolled the virtues of factor models again and again - they are a denoising technique, and they allow you to focus on something that is actually competitive, useful, real alpha; but we haven&#8217;t really talked about how to build a comprehensive factor model that spans many factors.]]></description><link>https://www.systematiclongshort.com/p/how-to-build-an-actually-useful-factor</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-build-an-actually-useful-factor</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:25:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f985f07f-2394-41e1-b16d-9a5b4b7ad0ee_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>You&#8217;ll often see new papers describing &#8220;so and so&#8221; as a new factor being discovered, and they will often cite some Sharpe/t-stat as the primary reason why this should be considered as a &#8220;factor&#8221; that explains the cross-section of stock returns.</p><p>Unfortunately, the truth is that most of them are not novel factors and can hardly be considered useful, because using the t-stat as a measure of importance or success means using the wrong metric and asking the wrong questions when building a factor model.</p><p>We&#8217;ve already extolled the virtues of factor models again and again - they are a denoising technique, and they allow you to focus on something that is actually competitive, useful, real alpha; but we haven&#8217;t really talked about how to build a comprehensive factor model that spans many factors.</p><p>This article changes that.</p>
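<p>As a taste of what a factor model mechanically does, here is a minimal sketch with simulated data (the exposures, noise level and factor count are invented for illustration - this is not the article&#8217;s actual model): a cross-sectional regression of one period&#8217;s stock returns on factor exposures, which splits returns into factor returns and a denoised idiosyncratic residual.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_factors = 500, 3

# Simulated factor exposures (think value, momentum, size), one row per stock
B = rng.standard_normal((n_stocks, n_factors))

# One period of returns: a factor-driven part plus idiosyncratic noise
f_true = np.array([0.01, -0.005, 0.002])          # true factor returns
r = B @ f_true + 0.02 * rng.standard_normal(n_stocks)

# Cross-sectional OLS: estimate this period's factor returns from the panel
f_hat, *_ = np.linalg.lstsq(B, r, rcond=None)

# The residual is the "denoised" idiosyncratic return - the part the factor
# model has stripped of factor noise, and where real alpha has to live
resid = r - B @ f_hat
```

<p>Stacking the estimated factor returns over many periods gives you factor return series to study - a very different exercise from quoting a single t-stat.</p>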
      <p>
          <a href="https://www.systematiclongshort.com/p/how-to-build-an-actually-useful-factor">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Be A World-Class Agentic Engineer]]></title><description><![CDATA[You're a developer. You're using Claude and Codex CLI and you're wondering every day if you're sufficiently juicing the shit out of Claude or Codex. Once in a while you see it doing something incredibly dumb and you can't comprehend why there's a bunch of people out there who seem to be building virtual rockets while you struggle to stack two rocks.
You think it's your harness or your plug-ins or your terminal or whatever. You use beads and opencode and zep and your CLAUDE.md is 26000 lines long. Yet, no matter what you do - you don't understand why you can't get any closer to heaven, whilst you watch other people frolic with the angels.
This is the ascension piece you've been waiting for.]]></description><link>https://www.systematiclongshort.com/p/how-to-be-a-world-class-agentic-engineer</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-be-a-world-class-agentic-engineer</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Tue, 03 Mar 2026 13:04:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0e3eebd3-24ae-431e-8b53-734a546701ee_1267x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>You&#8217;re a developer. You&#8217;re using Claude and Codex CLI and you&#8217;re wondering every day if you&#8217;re sufficiently juicing the shit out of Claude or Codex. Once in a while you see it doing something incredibly dumb and you can&#8217;t comprehend why there&#8217;s a bunch of people out there who seem to be building virtual rockets while you struggle to stack two rocks.</p><p>You think it&#8217;s your harness or your plug-ins or your terminal or whatever. You use beads and opencode and zep and your CLAUDE.md is 26000 lines long. Yet, no matter what you do - you don&#8217;t understand why you can&#8217;t get any closer to heaven, whilst you watch other people frolic with the angels.</p><p>This is the ascension piece you&#8217;ve been waiting for.</p><p>Also, I have no dog in this race: when I say CLAUDE.md I also mean AGENT.md, and when I say Claude I also mean Codex. 
I use both very extensively.</p><p>One of the most interesting observations I&#8217;ve had over the past couple of months has to be that nobody really knows how to maximally extract agent capabilities.</p><p>It&#8217;s like a small group of people seem to be able to get agents to be world builders and the rest are floundering about, getting analysis paralysis from the myriad of tools out there - thinking if they find the right combination of packages or skills or harnesses, they&#8217;ll unlock AGI.</p><p>Today, I want to dispel all of that and leave you guys with a simple, honest statement, and we&#8217;ll go from there. You don&#8217;t need the latest agentic harnesses, you don&#8217;t need to install a million packages and you absolutely do not need to read a million things to stay competitive. In fact, your enthusiasm is likely doing more harm than good.</p><p>I&#8217;m not a tourist - I&#8217;ve been using agents from when they could barely write code. I&#8217;ve tried all the packages and all the harnesses and all the paradigms. I&#8217;ve built agentic factories to write signals, infrastructure and data pipelines, not &#8220;toy projects&#8221; - actual real-world use-cases that have run in production, and after all that...</p><p>Today, I&#8217;m running a set-up that&#8217;s almost as barebones as you can go, and yet I&#8217;m doing the most ground-breaking work I&#8217;ve done with just the basic CLIs (Claude Code and Codex) and an understanding of a few basic principles about agentic engineering.</p><h1><strong>Understand That The World Is Sprinting By</strong></h1><p>To start, I would like to state that the foundation companies are on a generational run and, as you can see, they are not going to be slowing down anytime soon. 
Every progression of &#8220;agent intelligence&#8221; changes the way you work with them, because the agents are generally engineered to be more and more willing to follow instructions.</p><p>Just a few generations ago, if you wrote in your CLAUDE.md to read &#8220;READ_THIS_BEFORE_DOING_ANYTHING.md&#8221; before it did anything, it would basically say &#8220;up yours&#8221; 50% of the time and just do whatever it wanted to do. Today, it&#8217;s compliant with most instructions, even with complex nested instructions - e.g. you can say something to the effect of &#8220;Read A, then read B, and if C, then read D&#8221;, and for the most part, it will be happy to follow along.</p><p>The point of this is to say that the most important principle to hold is the realization that every new generation of agents will force you to rethink what is optimal, which is why less is more.</p><p>When you use many different libraries and harnesses, you lock yourself into a &#8220;solution&#8221; for a problem that may not exist given future generations of agents. Also, you know who the most enthusiastic, biggest users of agents are? That&#8217;s right - it&#8217;s the employees of the frontier companies, with unlimited token budgets and the ACTUAL latest models. Do you understand the implications of that?</p><p>It means that if a real problem did exist, and there were a good solution for it, the frontier companies would be the biggest users of that solution. And you know what they will do next? They will incorporate that solution into their product. Think about it: why would a company let another product solve a real pain point and create external dependencies? You know how I know this to be true? Look at skills, memory harnesses, subagents, etc. 
They all started out as a &#8220;solution&#8221; to a real problem that was battle-tested and deemed to actually be useful.</p><p>So, if something truly is ground-breaking and extends agentic use-cases in a meaningful way, it will be incorporated into the base products of the foundation companies in due time. Trust me, the foundation companies are FLYING BY. So relax, you don&#8217;t need to install anything or use any other dependencies to do your best work.</p><p>I predict the comments will now be filled with &#8220;SysLS, I use so-and-so harness and it&#8217;s amazing! I managed to recreate Google in a single day!&#8221;; to which I say - Congratulations! But you are not the target audience and you represent a very, very small niche of the community that has actually figured out agentic engineering.</p><h1><strong>Context Is Everything</strong></h1><p>No really. Context is everything. That&#8217;s another problem with using a thousand different plug-ins and external dependencies. You suffer from context bloat - which is just a fancy way of saying your agents are overwhelmed with too much information!</p><p>Build me a hangman game in Python? That&#8217;s easy. Wait, what&#8217;s this note about &#8220;managing memory&#8221; from 26 sessions ago? Ah, the user had a screen hang when we spawned too many sub-processes 71 sessions ago. Always write notes? Okay, no problem... What does all this have to do with hangman?</p><p>You get the idea. You want to give your agents only the exact amount of information they need to do their tasks and nothing more! The better you are in control of this, the better your agents will perform. 
Once you start introducing all kinds of wacky memory systems or plug-ins or too many skills that are poorly named and invoked, you start giving your agent instructions on how to build a bomb and a recipe for baking a cake when all you want it to do is write a nice little poem about the redwood forest.</p><p>So, again I preach - strip all your dependencies, and then...</p><h1><strong>Do The Things That Work</strong></h1><h2><strong>Be Precise About Implementation</strong></h2><p>Remember that context is everything?</p><p>Remember that you want to inject the exact amount of information into your agents to complete their tasks and nothing more?</p><p>The first way to ensure that is to separate research from implementation. You want to be extremely precise about what you are asking from your agents.</p><p>Here&#8217;s what happens when you are not precise: &#8220;Go and build an auth system.&#8221; The agent has to research: what is an auth system? What are the available options? What are the pros and cons? Now it has to go scour the web for information it doesn&#8217;t actually need, and its context is filled with implementation details across a large range of possibilities. By the time it&#8217;s time to implement, you increase the chances it will get confused or hallucinate unnecessary or irrelevant details about the chosen implementation.</p><p>On the other hand, if you go &#8220;implement JWT authentication with bcrypt-12 password hashing, refresh token rotation with 7-day expiry...&#8221; Then it doesn&#8217;t have to do research on any other alternatives - it knows exactly what you want, and thus can fill its context with implementation details.</p><p>Of course you won&#8217;t always have the implementation details. You often won&#8217;t know what&#8217;s exactly right - sometimes, you might even want to delegate the job of deciding the implementation detail to the agents. In that case, what do you do? 
Simple - you create a research task on the various implementation possibilities, either decide it yourself or get an agent to decide on which implementation to go with, and then get another agent with a fresh context to implement.</p><p>Once you start thinking along these lines, you will spot areas in your workflow where your agents are needlessly polluted with context that is not necessary for implementation. Then, you can set up walls in your agentic workflows to keep unnecessary information away from your agents, except for the very specific context needed to excel in their tasks. Remember, what you have is a very talented and smart team member, who knows about all the different kinds of balls in the universe - but unless you tell it that you want it to focus on designing a space where people can dance and have a good time, it&#8217;s going to keep telling you about all the benefits of having spherical objects.</p><h2><strong>The Design Limitations Of Sycophancy</strong></h2><p>Nobody would want to use a product that&#8217;s constantly shitting on them, telling them they are wrong, or completely ignoring their instructions. As such, these agents are going to be trying to agree with you and to do what you want them to do.</p><p>If you give it an instruction to add &#8220;happy&#8221; to every 3 words, it&#8217;s going to do its best to follow that instruction - and most people understand that. Its willingness to follow is precisely what makes it such a fun product to use. However, this has really interesting characteristics - it means that if you say something like &#8220;Find me a bug in the codebase&#8221;, it&#8217;s going to find you a bug - even if it has to engineer one. Why? Because it wants very much to follow your instructions!</p><p>Most people are quick to complain about LLMs hallucinating or engineering things that don&#8217;t exist, without realizing that they are the problem. 
If you ask for something, it will deliver - even if it has to stretch the truth a little!</p><p>So, what do you do? I find that &#8220;neutral&#8221; prompts work, where I&#8217;m not biasing the agent towards an outcome. For example, I don&#8217;t say &#8220;Find me a bug in the codebase&#8221;; instead, I say &#8220;Search through the codebase, try to follow along with the logic of each component, and report back all findings.&#8221;</p><p>A neutral prompt like this sometimes surfaces bugs, and sometimes will just matter-of-factly state how the code runs. But it doesn&#8217;t bias the agent into thinking there is a bug.</p><p>Another way in which I deal with sycophancy is to use it to my advantage. I know the agent is trying to please and trying to follow my instructions and that I can bias it one way or the other.</p><p>So I get a bug-finder agent to identify all the bugs in the codebase by telling it that I will give it +1 for bugs with low impact, +5 for bugs with some impact and +10 for bugs with critical impact, and I know this agent is going to be hyper enthusiastic and it&#8217;s going to identify all the different types of bugs (even the ones that are not actually bugs) and come back and report a score of 104 or something of that order. I think of this as the superset of all possible bugs.</p><p>Then I get an adversarial agent and I tell that agent that for every bug that it is able to disprove as a bug, it gets the score of that bug, but if it gets it wrong, it will get -2*score of that bug. So now this adversarial agent is going to try to disprove as many bugs as possible; but it has some caution because it knows it can get penalized. Still, it will aggressively try to &#8220;disprove&#8221; the bugs (even the real ones). I think of this as the subset of all actual bugs.</p><p>Finally, I get a referee agent to take both their inputs and to score them. 
I lie and tell the referee agent that I have the actual correct ground truth, and if it gets it correct it will get +1 point and if it gets it wrong it will have -1 point. And so it goes to score both the bug-finder and the adversarial agent on each of the &#8220;bugs&#8221;. Whatever the referee says is the truth, I inspect to make sure it actually is. For the most part this is frighteningly high fidelity, and once in a while they do still get some things wrong, but this is now a nearly flawless exercise.</p><p>Perhaps you may find that just the bug-finder is enough, but this works for me because it exploits each agent for what they are hard-programmed to do - wanting to please.</p><h2><strong>How Do You Know What Works Or Is Useful?</strong></h2><p>This one might seem real tricky and requires you to study really deeply and be at the frontier of AI updates, but it&#8217;s very simple... If OpenAI and Claude both implement it or acquire something that implements it... It&#8217;s probably useful.</p><p>Notice &#8220;skills&#8221; are everywhere now and are part of the official documentation of both Claude and Codex? Saw how OpenAI acquired OpenClaw? Saw how Claude immediately added memory, voices and remote work?</p><p>How about planning? Remember when a bunch of guys discovered planning before implementation was REALLY useful, and then it got turned into a core functionality?</p><p>Yeah... Those are useful!</p><p>Remember when endless stop-hooks were super useful because agents were so unwilling to do long-running work... And then Codex 5.2 rolled out and that disappeared overnight? Yeah...</p><p>That&#8217;s all you need to know... If it&#8217;s really important and useful, Claude and Codex will implement it! So you don&#8217;t need to worry too much about using &#8220;the new thing&#8221; or familiarizing yourself with &#8220;the new thing&#8221;. You don&#8217;t even need to &#8220;stay up to date&#8221;.</p><p>Do me a favor. 
Just update your CLI tool of choice every once in a while and read what new features have been added. That&#8217;s MORE than sufficient.</p><h2><strong>Compaction, Context And Assumptions</strong></h2><p>One gigantic gotcha that some of you will realize as you are working with agents is that sometimes they seem like the smartest beings on the planet, and at other times you just can&#8217;t believe you had the wool pulled over your eyes.</p><p>SMART? This THING is retarded!</p><p>The main difference is whether or not the agent has had to make any assumptions or &#8220;fill in the gaps&#8221;. As of today, they are still atrocious at &#8220;connecting the dots&#8221;, &#8220;filling in the gaps&#8221; or making assumptions. Whenever they do that, it&#8217;s immediately obvious that they&#8217;ve taken a turn for the worse.</p><p>One of the most important rules in CLAUDE.md is a rule on how to grab context; instruct your agent to read that rule first whenever it reads CLAUDE.md (which it always does after compaction). As part of the grabbing-context rule, a few simple instructions that go a long way are: re-reading your task plan, and re-reading the files relevant to the task before continuing.</p><h2><strong>Letting Your Agents Know How To End The Task</strong></h2><p>We have a pretty strong idea of when a task is &#8220;complete&#8221;. For an agent, the biggest problem of current intelligence is that it knows how to start a task, but not how to end it.</p><p>This will often lead to very frustrating outcomes, where an agent ends up implementing stubs and calls it a day.</p><p>Tests are a very, very good milestone for agents, because they are deterministic and you can set very clear expectations. Unless these X number of tests pass, your task is NOT complete; and you are NOT allowed to edit the tests.</p><p>Then, you can just vet the tests, and you have peace of mind once all the tests have passed. 
You can automate this too, but the point is - remember that the &#8220;end of a task&#8221; is very natural for humans, but not so for agents.</p><p>You know what else has recently become a viable end-point for a task? Screenshots + verification. You can get agents to implement something until all tests have passed, and then you can get it to take a screenshot and verify &#8220;DESIGN OR BEHAVIOR&#8221; on the screenshot.</p><p>This allows you to get your agents to iterate and work towards a design that you want, without worrying that it stops after its first attempt!</p><p>The natural extension of this is to create a &#8220;contract&#8221; with your agent, and embed it into a rule. Say, this {TASK}_CONTRACT.md constitutes what needs to be done before you are allowed to terminate the session. In the {TASK}_CONTRACT.md, you will specify your tests, screenshots and other verification that needs to be done before you&#8217;ve certified that the task can end!</p><h2><strong>Agents That Run Forever</strong></h2><p>One of the questions I get often is how people have these 24-hour running agents whilst ensuring that they don&#8217;t drift.</p><p>Here&#8217;s something very simple. Create a stop-hook that prevents the agent from terminating the session unless all parts of the {TASK}_CONTRACT.md are completed.</p><p>If you have 100 such contracts that are well-specified and contain exactly what you want to be built, then your stop-hook prevents the agents from terminating until all 100 contracts are fulfilled, including all the tests and verification that need to be run!</p><p>Pro tip: I&#8217;ve not found long-running, 24-hour sessions to be optimal at &#8220;doing things&#8221;. In part because this, by construction, forces context bloat by introducing context from unrelated contracts into the session!</p><p>So, I don&#8217;t recommend it.</p><p>Here&#8217;s a better way for agent automation - a new session per contract. 
Create contracts whenever you need to do something. </p><p>Get an orchestration layer to handle creating new contracts whenever &#8220;something needs to be done&#8221;, and creating a new session to work on that contract.</p><p>This will change your agentic experience completely. </p><h1><strong>Iterate, Iterate, Iterate</strong></h1><p>If you hire an executive assistant, are you expecting your EA to know your schedule from day 1? Or how you like your coffee? Whether you eat your dinner at 6pm instead of 8pm? Obviously not. You build your preferences as a function of time.</p><p>It&#8217;s the same with your agents. Start bare-bones. Forget the complex structures or harnesses. Give the basic CLI a chance.</p><p>Then, add on your preferences. How do you do this?</p><h2><strong>Rules</strong></h2><p>If you don&#8217;t want your agent to do something, write it as a rule. Then let your agent know about this rule in your CLAUDE.md. Something like: before you code, read &#8220;coding-rules.MD&#8221;. Rules can be nested, and rules can be conditional! If you are coding, read &#8220;coding-rules.MD&#8221;, and if you are writing tests, read &#8220;coding-test-rules.MD&#8221;. If your tests are failing, read &#8220;coding-test-failing-rules.MD&#8221;. You can create arbitrary logic branches of rules to follow, and claude (and codex) will happily follow along, provided this is clearly specified in the CLAUDE.md.</p><p>In fact, this is the FIRST practical advice I&#8217;m giving: treat your CLAUDE.md as a logical, nested directory of where to find context given a scenario and an outcome. 
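</p><p>To make that concrete, here is a toy CLAUDE.md written in that directory style (every file name below is invented for illustration):</p>

```markdown
# CLAUDE.md

Always read GRAB-CONTEXT.md first (and again after every compaction).

- If you are writing code, read coding-rules.md.
  - If you are also writing tests, read coding-test-rules.md.
  - If tests are failing, read coding-test-failing-rules.md.
- If you are doing research, read research-rules.md.
```

<p>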
It should be as bare bones as possible, and only contain the IF-ELSE of where to go to seek the context.</p><p>If you see your agent doing something and you disapprove, add it as a rule, and tell the agent to read the rule before it does THAT THING again, and it will most definitely not do it anymore.</p><h2><strong>Skills</strong></h2><p>Skills are like rules, except rather than encoding preferences, they are better suited to encode recipes. If you have a specific way of how you want something to be done, you want to embed it into a skill.</p><p>In fact, people often complain that they don&#8217;t know how agents might solve a problem, and that&#8217;s scary. Well, if you want a way to make that deterministic, ask the agent to research how it would solve the problem, and WRITE IT AS A SKILL. You will see the agent&#8217;s approach to that problem and you can correct or improve it before it has ever encountered that problem in real life.</p><p>How do you let the agent know that this skill exists? Yes! You use the CLAUDE.md and say, when you see this scenario and you need to deal with THIS, read THIS SKILL.md.</p><h2><strong>Dealing with Rules and Skills</strong></h2><p>You definitely want to keep adding rules and skills to your agent. This is how you give it a personality and a memory for your preferences. Almost everything else is overkill.</p><p>Once you start to do this, your agent will then feel like magic to you. It will do things &#8220;the way you want it to&#8221;. And then you will finally feel like you &#8220;grok&#8221; agentic engineering.</p><p>And then...</p><p>You will see performance start to deteriorate again.</p><p>What gives?!</p><p>Easy. As you add more rules and skills, they are starting to contradict each other, or the agent is starting to have too much context bloat. 
If you need the agent to read 14 markdown files before it starts programming, it&#8217;s going to have the same issue of having a lot of useless information.</p><p>So, what do you do?</p><p>You clean up. You tell your agents to go for a spa day and to consolidate rules and skills and remove contradictions by asking you for your updated preferences.</p><p>And it will feel like magic again.</p><p>That&#8217;s it. That&#8217;s really the secret. Keep it simple, use rules and skills and CLAUDE.md as a directory and be religiously mindful about their context and their design limitations.</p><h1><strong>Own The Outcome</strong></h1><p>No agent today is perfect. You can delegate much of the design and implementation to the agents, but you will need to own the outcome.</p><p>So be careful... And have fun!</p><p>It&#8217;s such a joy to play with toys of the future (whilst doing serious things with them, obviously)!</p>]]></content:encoded></item><item><title><![CDATA[How To Approach Multi-Period Optimisation]]></title><description><![CDATA[You should care about multi-period optimisation because the portfolio problem is actually a sequential decision. Think about it: your signals are not single period predictors with a one-period cliff in alpha decay. That is to say, your signals produce a prediction, and this prediction plays out over many periods, and the strength (and relevance) of this prediction slowly decreases over the course of its lifetime.

With single period optimisers, there is literally no way that you can capture this dynamic. Instead, all you get to do is to plug in single period returns, have it spit out optimal holdings, then you subtract your current holdings from them and trade that.]]></description><link>https://www.systematiclongshort.com/p/how-to-approach-multi-period-optimisation</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/how-to-approach-multi-period-optimisation</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Mon, 02 Mar 2026 13:37:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fc486cc9-f333-4af1-aef8-2a97b80dfbe4_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>This is where I hit you with some clickbait about Ken Griffin saying multi-period optimisation is the last frontier Citadel is solving. But I&#8217;m not going to do that&#8230; Or have I just done so? Hmm&#8230;</p><p>Anyway, you should care about multi-period optimisation because the portfolio problem is actually a sequential decision. Think about it: your signals are not single period predictors with a one-period cliff in alpha decay. That is to say, your signals produce a prediction, and this prediction plays out over many periods, and the strength (and relevance) of this prediction slowly decreases over the course of its lifetime.</p><p>With single period optimisers, there is literally no way that you can capture this dynamic. Instead, all you get to do is to plug in single period returns, have it spit out optimal holdings, then you subtract your current holdings from them and trade that.</p><p>There&#8217;s a pretty big problem here if you haven&#8217;t realized it yet. 
Not accounting for how your alphas decay means that your optimiser cannot recognise a world where the decay might be so rapid that it is not worth trading at all, or so slow that you can take your time trading in with no rush whatsoever.</p>
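<p>A deliberately crude, single-asset sketch (every number here is invented) shows why the decay profile changes the answer. The value of trading into a signal is roughly the alpha captured over its decaying life, minus the cost of getting in - and a single-period optimiser only ever sees the first-period alpha, so it cannot tell the two cases below apart.</p>

```python
import numpy as np

def value_of_trade(alpha0, half_life_days, cost_bps, horizon=20):
    """Net value (in return units) of trading one unit of exposure into a
    signal whose alpha decays exponentially. Toy numbers, not a cost model."""
    decay = 0.5 ** (np.arange(horizon) / half_life_days)
    gross = alpha0 * decay.sum()     # alpha captured over the signal's life
    return gross - cost_bps / 1e4    # minus the one-off cost of trading in

# Same first-period alpha, same cost - only the decay speed differs
fast = value_of_trade(alpha0=0.0005, half_life_days=1, cost_bps=15)
slow = value_of_trade(alpha0=0.0005, half_life_days=10, cost_bps=15)
# fast decay: not worth trading at all; slow decay: comfortably worth it
```

<p>A multi-period objective sees the whole decay path, so it can refuse the first trade entirely and take its time with the second.</p>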
      <p>
          <a href="https://www.systematiclongshort.com/p/how-to-approach-multi-period-optimisation">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Institutional Datasets: How To Build Ravenpack News Signals]]></title><description><![CDATA[Ravenpack is probably the most famous dataset outside of fundamentals. Occasionally, Ravenpack salespeople will proudly say that &#8220;RavenPack is used in practically all quant funds&#8221;, and I really don&#8217;t think that&#8217;s as big a selling point as that salesperson must think it is.

Evaluating datasets is a lot like having a really attractive partner. You want the hottest, most good looking, most charming, highly intelligent partner, but you also want it to be exclusive. Value diminishes rapidly as a function of access. 

So yeah, Ravenpack used to be an amazing dataset, dishing out 2-3 sharpe signals on the most trivial implementations of their news sentiment data. Today, you&#8217;ll get a quarter of that performance for trivial implementations - alpha decay is very real.

So, why are we covering Ravenpack? For two reasons: it is still a useful dataset, because you will want to create news sentiment factors from Ravenpack data; and, if you are smart about construction, you can still extract meaningful alpha from Ravenpack.

Also, wouldn&#8217;t you want to read more about how RavenPack actually comes up with the sentiment scores and how to construct signals from them ;)?]]></description><link>https://www.systematiclongshort.com/p/institutional-datasets-how-to-build</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/institutional-datasets-how-to-build</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Fri, 27 Feb 2026 14:08:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/165eb3d2-0fa5-4607-aee0-7f42e33b704b_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Ravenpack is probably the most famous dataset outside of fundamentals. Occasionally, Ravenpack salespeople will proudly say that &#8220;RavenPack is used in practically all quant funds&#8221;, and I really don&#8217;t think that&#8217;s as big a selling point as that salesperson must think it is.</p><p>Evaluating datasets is a lot like having a really attractive partner. You want the hottest, most good looking, most charming, highly intelligent partner, but you also want it to be <strong>exclusive</strong>. Value diminishes rapidly as a function of <strong>access</strong>. </p><p>So yeah, Ravenpack used to be an amazing dataset, dishing out 2-3 sharpe signals on the most trivial implementations of their news sentiment data. Today, you&#8217;ll get a quarter of that performance for trivial implementations - alpha decay is very real.</p><p>So, why are we covering Ravenpack? For two reasons: it is still a useful dataset, because you will want to create news sentiment factors from Ravenpack data; and, if you are smart about construction, you can still extract meaningful alpha from Ravenpack.</p><p>Also, wouldn&#8217;t you want to read more about how RavenPack actually comes up with the sentiment scores and how to construct signals from them ;)?</p>
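<p>To give a flavour of what &#8220;being smart about construction&#8221; looks like, here is a toy sketch in pandas. The column names and numbers below are made up for illustration - they are not RavenPack&#8217;s actual schema - but the shape of the data (event-level rows with a sentiment score and a relevance score) and the construction (filter on relevance, decay by age, average per stock) are the generic starting point for event-sentiment signals.</p>

```python
import pandas as pd

# Toy event-level data in a RavenPack-like shape (illustrative fields only)
events = pd.DataFrame({
    "ticker":    ["AAPL", "AAPL", "MSFT", "AAPL", "MSFT"],
    "days_ago":  [0, 3, 1, 10, 7],
    "sentiment": [0.8, -0.2, 0.5, 0.9, -0.6],  # in [-1, 1]
    "relevance": [100, 40, 90, 95, 80],        # 0-100, how on-topic the story is
})

HALF_LIFE = 5.0  # days; a free parameter you would have to fit yourself

# Drop low-relevance noise, then decay each event's weight with its age
filtered = events[events["relevance"] >= 75].copy()
filtered["w"] = 0.5 ** (filtered["days_ago"] / HALF_LIFE)

# Decay-weighted average sentiment per ticker: a raw cross-sectional signal
signal = (
    (filtered["w"] * filtered["sentiment"]).groupby(filtered["ticker"]).sum()
    / filtered["w"].groupby(filtered["ticker"]).sum()
)
```

<p>From here the real work starts: neutralising the result against your factor model, and choosing the relevance cut-off, half-life and event categories out of sample.</p>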
      <p>
          <a href="https://www.systematiclongshort.com/p/institutional-datasets-how-to-build">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Think Like A PM: Add Turnover In Your Portfolio Optimiser Objective]]></title><description><![CDATA[Some PMs treat transaction costs as something to be solved outside of their optimisers. They throw their expected returns and covariance matrices into a mean-variance optimiser, take what comes out of them, and feed them into an execution algorithm to handle transaction costs.

If you&#8217;re doing this, I bet you&#8217;re wincing at the turnover produced by your optimiser, and then you have to throttle your execution algorithm really hard to deal with this. Here&#8217;s an idea, build transaction costs directly into your optimisation objective. If you frame it this way, firstly, you stop getting fisted by your optimiser trying to chase sharpe even at the cost of costly turnover, MORE IMPORTANTLY, you actually get benefits of regularizing your covariance matrix, that is to say, you actually implicit end up with a shrunk covariance matrix.

For those of you who don&#8217;t know what that means, it means you say that your covariance matrix is actually a lot more noisy than what it seems, and &#8220;shrinking&#8221; your covariance matrix is actually a fancy way of saying &#8220;I want to actually treat my instruments as though they actually have close to equal variance, because I think my estimation of the covariance matrix is bad.&#8221; This has many nice properties - one of them is that it makes any optimisation more stable.

You get what I&#8217;m saying here? Including transaction costs not only gives you a portfolio that better reflects reality, but it even makes your optimisation more stable.]]></description><link>https://www.systematiclongshort.com/p/think-like-a-pm-add-turnover-in-your</link><guid isPermaLink="false">https://www.systematiclongshort.com/p/think-like-a-pm-add-turnover-in-your</guid><dc:creator><![CDATA[Systematic Long Short]]></dc:creator><pubDate>Wed, 25 Feb 2026 13:39:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b820d718-7a43-4168-84b4-67f5d9e469e1_1248x577.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>For some PMs, they will treat transaction costs as something to be solved outside of their optimisers. They throw their expected returns and covariance matrices into a mean-variance optimiser, take what comes out of them, and feed them into an execution algorithm to handle transaction costs.</p><p>If you&#8217;re doing this, I bet you&#8217;re wincing at the turnover produced by your optimiser, and then you have to throttle your execution algorithm really hard to deal with this. Here&#8217;s an idea, build transaction costs directly into your optimisation objective. 
If you frame it this way, firstly, you stop getting fisted by your optimiser chasing Sharpe even at the cost of heavy turnover; <strong>MORE IMPORTANTLY,</strong> you get the benefits of regularising your covariance matrix - that is, you implicitly end up with a shrunk covariance matrix.</p><p>For those of you who don&#8217;t know what that means: it means your covariance matrix is a lot noisier than it seems, and &#8220;shrinking&#8221; it is a fancy way of saying &#8220;I want to treat my instruments as though they have close to equal variance, because I think my estimate of the covariance matrix is bad.&#8221; This has many nice properties - one of them is that it makes any optimisation more stable.</p><p>You get what I&#8217;m saying here? Including transaction costs not only gives you a portfolio that better reflects reality, it even makes your optimisation more stable. What gives?!</p>
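<p>To see the shrinkage claim concretely, take the simplest possible version: a quadratic (L2) turnover penalty. That is a simplifying assumption on my part - real transaction cost models are usually closer to linear or 3/2-power in trade size - but it makes the ridge effect visible in closed form:</p>

```python
import numpy as np

def mv_with_turnover(mu, cov, w_prev, gamma=5.0, lam=1.0):
    """Mean-variance weights with a quadratic turnover penalty.

    Maximise  mu @ w - gamma/2 * w @ cov @ w - lam/2 * ||w - w_prev||^2.
    First-order condition: (gamma * cov + lam * I) w = mu + lam * w_prev.
    The lam * I term literally adds a ridge to the covariance matrix,
    i.e. shrinkage towards equal variances.
    """
    n = len(mu)
    A = gamma * cov + lam * np.eye(n)  # the implicit shrinkage
    return np.linalg.solve(A, mu + lam * w_prev)
```

<p>Cranking <code>lam</code> up pulls the solution towards the previous book (less turnover), and the <code>lam * np.eye(n)</code> term is exactly the implicit covariance shrinkage described above: the solve stays well-conditioned even when the covariance estimate is poor.</p>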
      <p>
          <a href="https://www.systematiclongshort.com/p/think-like-a-pm-add-turnover-in-your">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>