Kicking it up another notch on Overfitting! I’ll keep the math light and make things as intuitive as possible, relying on geometric reasoning. I’ll throw in some mathematical terms here and there, but I’ll skip the heavy details.
Let’s start with a simple alpha (and you can refer to Overfitting Part 2 for this), which is a forecast that takes the following form:

$$\hat{s}_t = a \, s_{t-1}$$
The set of zero-residual points for this alpha is the set of points $s = (s_1, s_2, s_3)$ on the sphere such that:

$$s_t = a \, s_{t-1} \quad \text{for all } t, \text{ for some coefficient } a$$
This means these are the points on the sphere where the model “perfectly predicts” s.
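To make this concrete, here is a minimal sketch (Python with numpy, my own illustration rather than anything from the original article) that samples the zero-residual points and places them on the unit sphere; the condition above forces every solution with $s_1 \neq 0$ to point in the direction $(1, a, a^2)$:

```python
import numpy as np

# Zero-residual points of the AR(1)-style alpha: s2 = a*s1 and s3 = a*s2,
# so every solution with s1 != 0 points in the direction (1, a, a^2).
a = np.linspace(-50, 50, 2001)                  # sweep the coefficient a
dirs = np.stack([np.ones_like(a), a, a**2], axis=1)
pts = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
pts = np.vstack([pts, -pts])                    # antipodal points are solutions too

# As |a| grows the points approach (0, 0, +/-1): the poles, which the
# curve approaches but never contains.
print(pts[[0, 1000, 2000]])
```

Sweeping $a$ over the whole real line (together with the antipodes) traces out exactly the pair of circles recapped below.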
What I want to highlight is how overfitting depends on the “dimension” of the space occupied by all zero-residual points of a model on a sphere. I’ll wrap up with some interesting potential connections to Kolmogorov Complexity, a way of expressing algorithmic complexity.
Recap from Overfitting — Part 2
In the previous article, we looked at a simple AR(1) process and graphed its zero-residual points constrained to a 2D sphere (think of a beach ball). We saw that these points formed a pair of circles with the north and south poles removed. Then, we “fattened up” those circles, covering 5% of the sphere’s surface. Here’s a graph for reference:
The key takeaway: Overfit models take up too much space on the sphere.
If your in-sample data falls outside that fattened-up region, your model is overfit: growing the region until it contains your data would push its coverage past 5% of the sphere, which means your model would perform just as well (in terms of correlation) on more than 5% of randomly generated paths.
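If you want to sanity-check this kind of statement, here is a rough Monte Carlo sketch (again my own, with illustrative parameters): sample uniform random points on the 2-sphere and measure what fraction lands within geodesic distance d of the AR(1) zero-residual curve, i.e. inside the fattened region:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense sample of the AR(1) zero-residual curve: directions (1, a, a^2),
# with a = tan(theta) so the sweep covers the whole real line.
a = np.tan(np.linspace(-np.pi / 2 + 1e-3, np.pi / 2 - 1e-3, 2001))
curve = np.stack([np.ones_like(a), a, a**2], axis=1)
curve /= np.linalg.norm(curve, axis=1, keepdims=True)
curve = np.vstack([curve, -curve])              # include antipodal points

# Uniform random points on the 2-sphere: normalized Gaussian vectors.
pts = rng.standard_normal((5000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# Geodesic (angular) distance from each random point to the nearest curve point.
dist = np.arccos(np.clip(pts @ curve.T, -1.0, 1.0)).min(axis=1)

d = 0.1  # illustrative fattening radius, in radians
print(f"fraction of the sphere within d={d} of the curve: {(dist < d).mean():.3f}")
```

Normalized Gaussian vectors are uniform on the sphere, which is what makes the estimated fraction an honest area measurement.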
A Quick Thought Experiment on Dimension and Fattening
Let’s visualize this on a 2D sphere (the beach ball). Consider these cases:
- The North and South Poles — a 0-dimensional set.
- The Equator — a 1-dimensional set.
- If we fatten up the poles, we get two 2-dimensional disks (think polar ice caps).
- If we fatten up the equator by including all points within a distance d, we get a ring (like the region between the Tropics of Capricorn and Cancer).
Now, here’s the key insight:
- For a fixed d that isn’t too large, the fattened equator covers more of the sphere than the fattened poles.
- Flipping that around: To cover 5% of the sphere, you’d need a larger d for the poles than for the equator.
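This is easy to verify in closed form: by Archimedes’ hat-box theorem, the area of a band on the unit sphere is proportional to its height, so the two polar caps of radius d cover a fraction 1 - cos(d) of the sphere while the equatorial band of half-width d covers sin(d). A quick check:

```python
import numpy as np

# Fraction of the unit 2-sphere covered by each fattened set:
#   two polar caps of geodesic radius d       -> 1 - cos(d)
#   equatorial band of geodesic half-width d  -> sin(d)
d = 0.1
print(f"caps cover {1 - np.cos(d):.4f}, band covers {np.sin(d):.4f}")
# caps cover 0.0050, band covers 0.0998: same d, the band grabs ~20x more area

# Radius d each set needs in order to cover exactly 5% of the sphere:
print(f"caps need d = {np.arccos(0.95):.3f} rad, band needs d = {np.arcsin(0.05):.3f} rad")
# caps need d = 0.318 rad, band needs d = 0.050 rad
```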
Imagine two models:
- Pole Model — Zero residual points at the poles.
- Equator Model — Zero residual points along the equator.
It’s easier to overfit with the Equator Model: to stay inside a region covering only 5% of the sphere, an in-sample point must land far closer to the equator than it would need to land to the poles. The tolerance band is much narrower, so a fit of any given tightness says less, and the risk of overfitting is higher.
Here’s a visualisation of fattening the poles vs. fattening the equator, each covering 5% of the sphere:
Overfitting and Dimension
This leads to an intuitive claim (though not a formal proof):
The higher the dimension of a model’s zero-residual set, the easier it is to overfit.
Even though both the poles and the equator have zero probability of being randomly selected, their fattened-up versions do not. This is the key to understanding overfitting.
Extending to AR(k) Models
Now, let’s move beyond spheres. Consider a general AR(k) model:

$$x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_k x_{t-k} + \varepsilon_t$$
For now, let’s not project onto a sphere (we’ll get to that in a bit). Assume we have an in-sample size of m. The zero-residual set is obtained by setting the residual to 0 and running the recursion forward.
This set is determined entirely by:
- The k-dimensional coefficient vector $(a_1, \dots, a_k)$
- The k-dimensional initial-conditions vector $(x_1, \dots, x_k)$
For fun, let’s take k=2 and m=5:
Then, given $x_1$ and $x_2$, we have:

$$x_3 = a_1 x_2 + a_2 x_1$$
$$x_4 = a_1 x_3 + a_2 x_2 = (a_1^2 + a_2)\,x_2 + a_1 a_2\,x_1$$
$$x_5 = a_1 x_4 + a_2 x_3 = (a_1^3 + 2 a_1 a_2)\,x_2 + (a_1^2 a_2 + a_2^2)\,x_1$$
Using a mathematical tool called the companion matrix, or just brute-force algebra, we can express this as a matrix (whose entries are polynomials in the coefficients) times a vector of initial conditions:

$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ a_2 & a_1 \\ a_1 a_2 & a_1^2 + a_2 \\ a_1^2 a_2 + a_2^2 & a_1^3 + 2 a_1 a_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
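Here’s a short symbolic sketch of that computation (using sympy, my choice of tool rather than the article’s) that builds the 5x2 matrix from the companion matrix and checks it against the recursion:

```python
import sympy as sp

a1, a2, x1, x2 = sp.symbols('a1 a2 x1 x2')

# Companion matrix: maps the state (x_{t-1}, x_t) to (x_t, x_{t+1}).
C = sp.Matrix([[0, 1],
               [a2, a1]])

# x_t is the first entry of C**(t-1) applied to the initial state (x1, x2);
# stacking those first rows gives the 5x2 matrix of polynomial entries.
M = sp.Matrix.vstack(*[(C**t)[0, :] for t in range(5)])
sp.pprint(M.expand())

# Sanity check: the resulting path obeys x_t = a1*x_{t-1} + a2*x_{t-2}.
x = (M * sp.Matrix([x1, x2])).expand()
assert sp.expand(x[4] - (a1 * x[3] + a2 * x[2])) == 0
print("recursion verified")
```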
This structure is linear in the initial conditions and polynomial in the coefficients. Now here’s where things get cool.
Algebraic Geometry and Zero Residual Sets
Either by appealing to some theorems in Algebraic Geometry or by some clever first-principles methods (which I can explain by PM), we can express all AR(2) zero-residual points as the solution set of polynomial equations in 5-dimensional space.
This is the equation you get:

$$\det \begin{pmatrix} x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \\ x_3 & x_4 & x_5 \end{pmatrix} = 0$$
In our special case of AR(2) in 5D, this is just a determinant = 0 equation. In other cases, you have to be a bit more clever to extract the polynomials.
So a point on the zero-residual set should satisfy this equation. Interesting how we magically got rid of expressing it in terms of coefficients: the equation involves the $x_i$ alone.
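You can check this numerically: the rows of the Hankel matrix above satisfy the same order-2 recurrence as the path itself, so its determinant vanishes on every zero-residual AR(2) path. A small sketch (the random numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def ar2_path(a1, a2, x1, x2, m=5):
    """Roll the AR(2) recursion forward with all residuals set to zero."""
    x = [x1, x2]
    for _ in range(m - 2):
        x.append(a1 * x[-1] + a2 * x[-2])
    return x

# Row 3 of H equals a1*(row 2) + a2*(row 1), so det(H) must vanish.
for _ in range(5):
    a1, a2, x1, x2 = rng.standard_normal(4)
    x = ar2_path(a1, a2, x1, x2)
    H = np.array([[x[0], x[1], x[2]],
                  [x[1], x[2], x[3]],
                  [x[2], x[3], x[4]]])
    print(f"det(H) = {np.linalg.det(H):+.2e}")  # ~0 up to floating-point error
```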
Specifically:
- The zero-residual set is an Algebraic Variety (a solution set of polynomial equations).
- It’s kind of like a smooth space (think of a sphere, or a torus, a.k.a. a donut), but it can also contain singular (non-smooth) points, self-intersections, and lower-dimensional pieces.
We have a set of points in 5D defined by a single equation; generically, one equation cuts the dimension down by one, so this is a 4D space. Very hard to visualise.
We can, however, project down to some lower dimensions to get some insights (or not). Here’s how the AR(2) zero-residual set, which is an Algebraic Variety, looks when projected onto a few 3D subspaces (some are indeed quite bizarre):
Notice that it has geometric structure but also intersections and some singularities, unlike a simple sphere or torus. But at least you can discern that these projections are in fact 2D surfaces, as bizarre as they might look.
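If you’d like to generate pictures in this spirit yourself, here’s a rough sketch (my own, using matplotlib): sample the variety by rolling the recursion forward from random coefficients and initial conditions, normalize onto the unit sphere in 5D, and scatter three of the five coordinates. It won’t reproduce the article’s exact projections, but it gives a feel for the shape:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 20_000

# Sample the AR(2) zero-residual variety: random coefficients and initial
# conditions, then roll the recursion forward to get points in 5D.
a = rng.uniform(-1.5, 1.5, size=(n, 2))
X = np.zeros((n, 5))
X[:, :2] = rng.uniform(-1.0, 1.0, size=(n, 2))
for t in range(2, 5):
    X[:, t] = a[:, 0] * X[:, t - 1] + a[:, 1] * X[:, t - 2]

X /= np.linalg.norm(X, axis=1, keepdims=True)   # slice with the unit sphere in 5D

# Shadow of the spherical slice in three of the five coordinates.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 2], X[:, 4], s=1, alpha=0.2)
ax.set_xlabel("x1"); ax.set_ylabel("x3"); ax.set_zlabel("x5")
plt.show()
```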
By the way, here are some graphs of well-known 2D algebraic varieties in 3D space:
Even though there are irregularities and/or intersections, we can still talk about dimension. The largest connected piece of our AR(2) variety behaves like a 4D space sitting inside a 5D space. This aligns with our intuition: our coefficient and initial-condition vectors together give 4 degrees of freedom (2 coefficients + 2 initial conditions), so the solution set should be 4-dimensional.
If we now intersect this variety with the unit sphere in 5D, we reduce the dimension by 1. That means the zero-residual set for in-sample points on the unit 4D sphere in 5D space has dimension 3.
The General Case
A general result holds (which I will not prove, but if anyone wants to reach out, I can show how it’s done):
- For an AR(k) model on an in-sample of size m, the algebraic variety defining the zero-residual set has dimension 2k if k < m/2, and m otherwise. This means that for k ≥ m/2 you basically have the TotalEvil predictor (see Overfitting Part 2; it’s the worst case of overfitting).
- When restricted to the unit sphere in m dimensions (which is itself (m-1)-dimensional), the dimension becomes 2k-1 if k < m/2, and m-1 otherwise: TotalEvil again.
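The k ≥ m/2 degeneracy is easy to see in code: given the path, the residual equations are linear in the coefficients, and once there are at least as many coefficients as equations, (almost) any in-sample path can be fit exactly. A toy check with k=3, m=5 (hypothetical numbers, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# TotalEvil in action: with k >= m/2, an AR(k) model fits (almost) any
# in-sample path of length m exactly. Here k=3, m=5: two residual
# equations in three coefficient unknowns, so a solution exists.
x = rng.standard_normal(5)               # an arbitrary "in-sample" path

A = np.array([[x[2], x[1], x[0]],        # x4 = a1*x3 + a2*x2 + a3*x1
              [x[3], x[2], x[1]]])       # x5 = a1*x4 + a2*x3 + a3*x2
b = x[3:5]

coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)  # underdetermined: exact fit
print("residuals:", A @ coeffs - b)              # ~ (0, 0)
```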
The Bigger Picture: Dimension and Complexity
Now we tie it back to our earlier argument:
- The higher k in AR(k), the larger the dimension.
- A higher-dimensional zero-residual set takes up more space on the sphere. And by the way, while you have to take care of nasty irregularities, you can still fatten up an Algebraic Variety.
- This mirrors the fattening-up argument from earlier, reinforcing the idea that higher-dimensional (in our geometric sense) models overfit more easily.
- Lots of parameters doesn’t, prima facie, imply a higher-dimensional zero-residual set; there could be a lot of collapsing/compressing, so you can’t just count the number of parameters. Conceptually, though, if you do have such compression, it might suggest you’re carrying some visceral fat in your code in the way of parameters.
Interestingly, this reasoning is reminiscent of Kolmogorov Complexity (KC), which measures the length of the shortest algorithm needed to generate an output. Here’s a table comparing the cost of computing an AR(k) zero-residual set in terms of KC and of the dimension of the zero-residual set on the (m-1)-sphere:

| Model | KC (dominant term) | Dimension on the (m-1)-sphere |
|---|---|---|
| AR(k), k < m/2 | storing 2k numbers (k coefficients + k initial conditions) | 2k-1 |
| AR(k), k ≥ m/2 (TotalEvil) | storing the m in-sample points themselves | m-1 |

KC is dominated by the cost of storing the initial conditions and coefficients, and that is also clearly what drives the dimension of the zero-residual set.
There might be a deeper equivalence between these ideas — perhaps overfitting is, in some sense, a function of the compressibility of a model’s zero-residual set. Perhaps Entropy can be brought in somehow.
And that’s a wrap! Hope this gives you a fresh perspective on overfitting, geometry, and complexity.