The Entropy of a Dataset Obeys the Triangle Inequality

May 23, 2021

Let $D$ denote an infinite sequence of datapoints $D=\{D_i\}$. The sequence $D$ has entropy equal to the length of the shortest computer program which can generate $D$.

Our objective is to separate signal from noise. Signal is compressible data. Noise is uncompressible data. Signal generalizes. Noise does not generalize. The raw information of $D$ is divided into signal $Q(D)$ and noise $N(D)$. Let $S(x)$ denote the entropy of $x$.


If you have well-defined priors, infinite compute and you can observe all of $D$ then it is theoretically possible to perfectly separate $D$ into signal $Q(D)$ and noise $N(D)$.

If $D$ is time series data then you cannot observe the whole dataset. Signals can operate on short or long time horizons. If $n$ is small then you can detect signals that operate on short time horizons but you do not have enought data to detect signals that act on long time horizons. The opposite is not true. If you have lots of data then you can detect signal that operate on both short time horizons and long time horizons.

Moreover, if a pattern repeats across two datasets then the entropy of that pattern for the combined dataset will be smaller than the sum of the individual entropies. Entropy thus obeys the triangle inequality.

$$S(A\cup B)\leq S(A)+S(B)$$

This is a strict inequality for noise.

$$S(N(A)\cup N(B))=S(N(A)+S(N(B)))$$

In fact, noise can be defined as information which cannot be cross-correlated to anything.

Similarly, signal can be defined as information which can be cross-correlated. Signal therefore cannot exist in any independent unit of raw data. Signal is a relationship between data. As you collect more data the potential signal increases faster than the potential data because the potential relationships increase faster than the data itself.

Data Efficiency

If we had infinite compute then the best way to compress a dataset would be to run all possible computer programs and then keep the shortest program which generates the dataset. In the abstract sense, signal extraction is equivalent to compression. Therefore generalized compression is also the best way to isolate signal from a dataset.

All of the above implies that we have well-defined priors and infinite compute. In practice, the limiting factor is compute. Machine learning algorithms like neural networks implicitly sacrifice well-defined priors to limit compute costs. Consequently, the signal they extract from $D$ tends to be less than the maximum signal possible to extract $Q$. Let us use $W(\psi,D)$ to denote the signal extracted by an algorithm $\psi$ from dataset $D$.

$$W(\psi,D)\leq Q(D)$$

We can define data efficiency $\eta$ as the ratio of $W$ to $D$.


Given a dataset $D$, the best algorithm $\psi$ to use is the one which maximizes data efficiency $\eta$ (while satisfying your compute limits).