A Practical Guide To Cleaning High Frequency Data
Introduction
Almost all of us will deal with tick data at some point in our careers. Even an MFT book will likely consume tick data in order to transform it into some aggregated form. Unlike aggregated price data, tick data arrives in volumes that are tricky to handle in real time, so most practitioners implement some kind of filter, on both historical and live feeds, to catch outliers.
The point of this article is not to argue for the “correct” filter, but to shed light on all the kinds of data problems you might find in the wild, and to make clear that every filter setting is a tradeoff; your job is to manage that tradeoff consciously.
Every tick data pipeline has errors: decimal mistakes, transposition errors, out-of-sequence trades, multi-market discrepancies. These errors corrupt volatility estimates, trigger false signals, and cause backtests to diverge from live performance. But the solution is not simply “remove all errors.”
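To make these failure modes concrete, here is a minimal sketch, with entirely hypothetical tick values, of what a decimal-shift error and an out-of-sequence trade look like in a raw stream, along with naive checks for each:

```python
# Hypothetical raw ticks as (timestamp_ms, price); all values illustrative.
ticks = [
    (1_700_000_000_000, 4512.25),  # normal trade
    (1_700_000_000_050, 451.25),   # decimal-shift error: off by a factor of 10
    (1_700_000_000_100, 4512.50),  # normal trade
    (1_700_000_000_080, 4512.00),  # out-of-sequence: timestamp moves backward
]

def looks_decimal_shifted(prev_price: float, price: float) -> bool:
    """True if price is roughly a power of ten away from the previous price."""
    ratio = price / prev_price
    return any(abs(ratio - 10.0 ** k) < 0.05 * 10.0 ** k for k in (-2, -1, 1, 2))

for (t0, p0), (t1, p1) in zip(ticks, ticks[1:]):
    if looks_decimal_shifted(p0, p1):
        # Note: the reversion back to a normal price trips this check too,
        # which is one reason real filters compare against a trailing
        # reference price rather than the single previous tick.
        print(f"suspect decimal shift at {t1}: {p1} (prev {p0})")
    if t1 < t0:
        print(f"out-of-sequence tick: {t1} arrived after {t0}")
```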
Aggressive cleaning removes real market behavior along with the noise. This creates models that underestimate volatility, and those models fail in production when they encounter the reality they were shielded from.
The central practical concern is the overscrub/underscrub tradeoff. Filter too loosely (underscrub) and errors pass through, leaving the data unusable. Filter too tightly (overscrub) and you strip out legitimate market dynamics.
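One way to see the tradeoff is as a single threshold knob. Below is a minimal sketch, not a production filter, of a rolling median/MAD outlier test on a price stream; the function name, window size, and default k are illustrative choices, not recommendations. Lower k overscrubs (genuine jumps get rejected), higher k underscrubs (bad ticks get through).

```python
from collections import deque
from statistics import median

def mad_filter(prices, window=50, k=5.0):
    """Yield (price, is_outlier) for each tick, using a rolling median/MAD test.

    k is the overscrub/underscrub knob: lower k rejects more legitimate
    jumps; higher k lets more erroneous ticks through.
    """
    buf = deque(maxlen=window)
    for p in prices:
        if len(buf) >= 10:  # wait for a minimal history before judging
            med = median(buf)
            mad = median(abs(x - med) for x in buf) or 1e-12  # guard flat windows
            is_outlier = abs(p - med) > k * mad
        else:
            is_outlier = False
        if not is_outlier:
            buf.append(p)  # only clean ticks update the reference window
        yield p, is_outlier
```

Running the same stream through k=3 and then k=8, and comparing which flagged ticks coincide with genuine news-driven jumps, is exactly the tradeoff the rest of this guide is about managing.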

