Historical Validation Misses Black Swans

Overfitting describes what happens when you treat noise as signal. There is no harm in overfitting if all you are doing is compressing data. However, if you are extrapolating from data then overfitting can be catastrophic.
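
To make the compression-versus-extrapolation distinction concrete, here is a minimal sketch (my own toy data, using NumPy): a high-degree polynomial reproduces its noisy training points almost perfectly, but its prediction just outside the training range is badly wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying trend: y = x plus a little noise.
x_train = np.linspace(0.0, 1.0, 20)
y_train = x_train + 0.1 * rng.standard_normal(20)

# A degree-15 polynomial has enough freedom to fit the noise itself.
coeffs = np.polyfit(x_train, y_train, deg=15)

# Compression: error on the training points themselves is tiny.
train_error = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

# Extrapolation: a short step outside the training range goes badly wrong.
prediction = np.polyval(coeffs, 1.5)  # the true trend would give roughly 1.5

print(f"max training error:  {train_error:.3f}")
print(f"prediction at x=1.5: {prediction:.1f}")
```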

One way to avoid overfitting is a train-test split. The problem is that a train-test split is either vulnerable to black swans or extremely data-inefficient.
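
As a rough illustration (a sketch with a made-up event rate, not a real dataset): if catastrophic events occur at a rate of one in 100,000, an ordinary held-out test set contains so few of them that passing validation says almost nothing about how the model handles them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_total = 1_000_000      # total observations available
black_swan_rate = 1e-5   # hypothetical rate of catastrophic events

# Mark each observation: True = black swan, False = ordinary.
is_black_swan = rng.random(n_total) < black_swan_rate

# A conventional 80/20 train-test split.
shuffled = rng.permutation(n_total)
test_idx = shuffled[: n_total // 5]

print(f"black swans in the whole dataset: {is_black_swan.sum()}")            # about 10
print(f"black swans in the test set:      {is_black_swan[test_idx].sum()}")  # a handful at most
```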

This problem is mostly ignorable in Silicon Valley because failures tend to be short-tailed. It is unlikely for a car crash to kill more than a handful of people. If you train a neural network to drive a self-driving car, then the worst case you need to account for is the car crashing and killing a dozen people. If you want to keep the death toll below one death per million roadtrips, then all you have to do is reduce the odds of a car crash below one in twelve million. If you have twelve million roadtrips in your validation dataset, then you have things under control.
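
Spelling out the arithmetic, assuming the worst case of a dozen deaths per crash:

$$P(\text{crash per roadtrip}) \;\leq\; \frac{10^{-6}\ \text{deaths per roadtrip}}{12\ \text{deaths per crash}} \;=\; \frac{1}{12{,}000{,}000}.$$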

Financial trading systems tend to be highly leveraged. A single failure can wipe out many years of profits. Suppose you are on Wall Street writing an algorithm to trade stocks. To average a profit, your algorithm must weather financial crises without cataclysmic failure 99.999% of the time. If you wanted to validate that by backtesting against historical data, you would need on the order of 100,000 financial crises in your dataset. There aren't that many financial crises in the historical record. You cannot solve this problem by backtesting.
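
Spelling out the requirement: weathering crises 99.999% of the time means a failure rate of at most $10^{-5}$ per crisis, and a backtest can only attest to a failure rate that small if it contains at least

$$n \;\geq\; \frac{1}{1 - 0.99999} \;=\; \frac{1}{10^{-5}} \;=\; 100{,}000\ \text{financial crises}.$$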

More generally, to get $1-\epsilon$ confidence from a validation set, that validation set needs at least $n\geq\frac{1}{\epsilon}$ datapoints. Validation datasets are fine for short-tailed decisions. Validation datasets are inadequate for long-tailed decisions.
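
A minimal sketch of that rule in Python (the function name and the plugged-in failure rates are my own, taken from the two scenarios above):

```python
import math

def min_validation_size(epsilon: float) -> int:
    """Smallest validation set that can attest to a failure rate of epsilon.

    To claim 1 - epsilon confidence you need at least 1/epsilon datapoints;
    anything smaller can easily contain zero failure-inducing examples.
    """
    return math.ceil(1 / epsilon)

# Self-driving car: crash odds of one in twelve million per roadtrip.
print(min_validation_size(1 / 12_000_000))  # roughly 12,000,000 roadtrips

# Trading algorithm: survive a financial crisis 99.999% of the time.
print(min_validation_size(1e-5))            # roughly 100,000 crises
```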

This is why I am so bearish on scaling up today's artificial neural networks (ANNs). I don't doubt they can handle short-tailed decisions. But today's ANNs are validated using train-test splits, and insofar as they rely on those splits for validation, they will be limited in their ability to solve small-data, long-tailed problems.