Overfitting Part 2: How I Learned to Stop Worrying and Love the Sphere

AlphaNova
7 min read · Jan 27, 2025


Ok, we’re going to kick up my ramblings a notch or two. I’ll try to keep the math side as light as possible so that an intelligent reader with some geometry and stats chops can follow along.

You’re given a time series of length M.

Your mission: Build a simple alpha where you optimize some utility function in-sample. Using any or all information up to time i, your alpha predicts the next value in the time series at time i+1. By “simple alpha,” I mean the only information you can use is the time series itself: no other time series allowed. I’m sticking to this restriction for clarity; everything here generalizes to more complex alphas.
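In symbols (loose notation that I’ll lean on below), with θ the model’s parameters, the alpha produces predictions

p_{i+1}(\theta) = F(x_1, \dots, x_i;\, \theta),

and calibration means picking θ to maximize your in-sample utility.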

So here’s what you’ll do: chop the time series into at least two (but possibly three) sections — a training (in-sample) section, and either a testing section or both a validation and testing section. For what follows, we don’t care whether you’ve got one or two extra sections. What’s crucial is that you don’t peek at these sections during the process. Ideally, a gatekeeper should stop you from looking at the testing section entirely, and you definitely shouldn’t cheat by peeking at the validation section to tweak your model or its parameters for the training set. Let’s assume your in-sample section has length m:
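Say the in-sample is the first chunk of the series:

x_{\text{in}} = (x_1, x_2, \dots, x_m), \qquad m < M.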

Utility Function U

Your utility function U measures goodness of fit. I don’t want to write down some complex mathematical expression for its domain, so let’s be a little sloppy and say it’s U(p, x), where p is your prediction time series and x is your in-sample. We’ll assume U is “scale invariant”: multiply your predictions or your time series by any positive constant and U remains unchanged. (Lead-lag correlation has this property, as does the Sharpe Ratio; for the latter you can either prove it from first principles or just note that the Sharpe Ratio is a function of correlation, and functions of scale-invariant metrics are themselves scale-invariant.) For simplicity, let’s roll with correlation. In practice one would minimize the variance of the residual, but in our setting that has the nice property of also optimizing correlation.
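To see why correlation qualifies: for any positive constants a and b,

\mathrm{corr}(a\,p,\; b\,x) = \frac{\mathrm{cov}(a\,p,\, b\,x)}{\sqrt{\mathrm{var}(a\,p)\,\mathrm{var}(b\,x)}} = \frac{ab\,\mathrm{cov}(p, x)}{ab\,\sqrt{\mathrm{var}(p)\,\mathrm{var}(x)}} = \mathrm{corr}(p, x).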

Detrending and Normalizing

A common practice is to detrend and normalize your in-sample time series. If you’re worried about “peeking into the future” in-sample, don’t be — as long as you don’t apply the same treatment to the validation or testing sets. Sure, you could detrend and normalize on an expanding window basis, but that complicates things unnecessarily for this exposition.
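A minimal numpy sketch of what I mean (function and variable names are mine, purely illustrative):

    import numpy as np

    def detrend_and_normalize(x):
        """Remove a linear trend, then scale to unit Euclidean length."""
        t = np.arange(len(x))
        slope, intercept = np.polyfit(t, x, 1)   # least-squares straight-line trend
        resid = x - (slope * t + intercept)      # detrend (this also centers the series)
        return resid / np.linalg.norm(resid)     # now the point sits on the unit sphere

    x = np.cumsum(np.random.randn(1_000))        # a toy series of length M = 1000
    m = 600                                      # in-sample length
    x_in = detrend_and_normalize(x[:m])          # treatment is fit on the in-sample only
    # The validation/testing sections x[m:] are left untouched: no peeking.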

After detrending and normalizing your in-sample, its (Euclidean) length is now 1:
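\| x_{\text{in}} \| = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2} = 1.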

This means the in-sample now “lives” on a high-dimensional unit sphere of dimension m−1. Here’s the definition of a unit n-sphere:
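S^{n} = \{\, y \in \mathbb{R}^{n+1} : \|y\| = 1 \,\}.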

The 0-sphere consists of two points, which you can think of as the north and south poles. The 1-sphere is the familiar circle. The 2-sphere is a beach ball. And it keeps going: spheres all the way down (or up). They’re beautiful, symmetrical objects. You can rotate from any point to any other point on a sphere. Other than dimension 0, the n-sphere is connected, and the complement of a sphere of codimension at least 2 sitting inside an n-sphere is also connected (this follows from Alexander duality: https://en.wikipedia.org/wiki/Alexander_duality). The complement of an (n−1)-sphere, though, has two connected components (think north and south of the equator on the 2-sphere). The list of cool facts about spheres goes on and on: Hopf fibrations, exotic 7-spheres, you name it.

The Simple Alpha

Since you’ve detrended and normalized your in-sample x_in, your simple alpha is now a function of the form:
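F : S^{m-1} \times E \longrightarrow \mathbb{R}^{m} \quad \text{(schematically)}.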

Here E is some parameter space. Note that the range isn’t necessarily on the (m−1)-sphere, but we can always project it onto the (m−1)-sphere (there is some math involved in this, related to some of the ideas below).

Here’s the catch: the (i+1)-th component of F can only depend on components up to i. This restriction makes things interesting. If you simulate trillions of Monte Carlo samples of i.i.d. standard Gaussian normals of size m, then normalize each sample to have unit length, you’ll get points uniformly distributed on the (m−1)-sphere. From both a geometric and a measure-theoretic perspective, this distribution is perfectly “democratic”: every point is as good as any other. The probability measure is just the standard uniform measure on the sphere.
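If you want to convince yourself of the uniformity claim, here’s a quick numpy sketch (far short of trillions of samples, but the idea is the same):

    import numpy as np

    m = 252                                  # in-sample length (arbitrary choice)
    n = 50_000                               # number of Monte Carlo draws
    g = np.random.randn(n, m)                # i.i.d. standard Gaussians
    pts = g / np.linalg.norm(g, axis=1, keepdims=True)   # project each draw onto the sphere
    # By rotational symmetry of the Gaussian, `pts` is uniform on the (m-1)-sphere.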

But your alpha? It imposes restrictions. It carves out a subset of the sphere on which your in-sample, and hence your alpha, is “special,” hopefully allowing you to predict the next value in the series with minimal error. Now imagine you’re given this magical point:

This point is special; we’ll call it AwesomeX. If you’re lucky enough to have AwesomeX, you wouldn’t need to think hard to build a perfect alpha. But odds are, your in-sample isn’t AwesomeX: it’s noisy, random, and crummy. And because of the functional restrictions on F, you can’t just rotate the alpha for AwesomeX to fit your crummy in-sample.

TotalEvil: The Overfitting Monster

Let’s now build the most ridiculously overfitted simple alpha model, which I’ll call TotalEvil:
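One way to write it down (a reconstruction on my part; any formulation with the properties below will do): the model simply ignores the data and outputs its parameter point,

\mathrm{TotalEvil}(x; \theta) = \theta, \qquad \theta \in S^{m-1}.

Since no component of the output depends on x at all, the causality restriction from above is satisfied trivially.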

Note that the parameter space E of TotalEvil is just the entire (m−1)-sphere itself, with points on the sphere playing the role of parameters.

TotalEvil has some wildly evil properties:

  1. It can be calibrated to perfectly fit the in-sample.
  2. It can also perfectly fit any point on the sphere.
  3. The range of possible correlations or Sharpe Ratios it produces by varying parameters is massive — correlations from -1 to 1, Sharpe Ratios from −∞ to ∞.

Point 2 is the textbook definition of overfitting. TotalEvil isn’t predicting anything; it’s just contorting itself to fit whatever data you throw at it. It’s not even really using information from the in-sample; it’s just fitting a meaningless “past.”

Point 3 (and I’m simplifying) is linked to a method used by many practitioners, framed as a hypothesis test whose null is that the model’s Sharpe is really just bad, say, zero. It deters signal monkeys from just tweaking parameters until they get a good Sharpe.

Our approach is more aligned with Point 2, as we shall see.
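Before moving on, here’s a tiny numpy demo of evil properties 2 and 3, using the TotalEvil form I wrote above (my reconstruction):

    import numpy as np

    rng = np.random.default_rng(1)
    m = 100

    def total_evil(x_in, theta):
        """Ignore the data, output the parameter point."""
        return theta

    # Property 2: any normalized target is "perfectly fit" by choosing theta = target.
    x_in = rng.standard_normal(m)
    x_in /= np.linalg.norm(x_in)
    print(np.corrcoef(total_evil(x_in, x_in), x_in)[0, 1])   # 1.0

    # Property 3: sweep theta from x_in to -x_in and the in-sample correlation
    # sweeps (roughly) from +1 down to -1.
    z = rng.standard_normal(m)
    z -= (z @ x_in) * x_in            # make z orthogonal to x_in
    z /= np.linalg.norm(z)
    for t in np.linspace(0.0, np.pi, 5):
        theta = np.cos(t) * x_in + np.sin(t) * z
        print(round(np.corrcoef(total_evil(x_in, theta), x_in)[0, 1], 2))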

Back to Semi-Reality: A Toy Example

To make this concrete, suppose your in-sample size is 3. This means your data lives on a 2-sphere. Consider an AR(1) model:
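x_{i+1} = \phi\, x_i + \varepsilon_{i+1},

where a “perfect fit” means the noise term vanishes, i.e. x_2 = \phi x_1 and x_3 = \phi x_2 for some \phi.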

The points that perfectly fit an AR(1) form two circles (1-spheres) on this 2-sphere, excluding the poles.

Minimizing residual variance here is equivalent to finding the closest point on these circles to your in-sample. Now let’s fatten these circles so their total area is 5% of the sphere.

If your in-sample lies outside this fattened region, we’d say your model is overfit. Why? Because covering your in-sample would require fattening the circles even more, taking up too much space on the sphere, i.e., more than 5% of the sphere’s area. Now, notwithstanding the fact that you probably wouldn’t worry about AR(1) models overfitting and would use other methods to check the suitability of AR, the point stands: overfitting is a statement about models taking up too much space on the sphere and about the structure of the in-sample.
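For the curious, here’s a rough Monte Carlo sketch of the whole check for the m = 3 toy case (illustrative numpy; the angular radius delta is a knob you’d tune until the fattened region covers roughly 5% of the sphere):

    import numpy as np

    rng = np.random.default_rng(0)

    # The AR(1) "perfect fit" set for m = 3: unit vectors proportional to (1, phi, phi^2),
    # together with their antipodes (the two circles, minus the poles).
    phi = np.tan(np.linspace(-np.pi / 2 + 1e-3, np.pi / 2 - 1e-3, 500))
    curve = np.stack([np.ones_like(phi), phi, phi**2], axis=1)
    curve /= np.linalg.norm(curve, axis=1, keepdims=True)
    curve = np.vstack([curve, -curve])

    # Fatten the curve to angular radius delta; estimate its area by Monte Carlo.
    delta = 0.04                       # radians; tune until the printed fraction is ~5%
    pts = rng.standard_normal((10_000, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    nearest = np.arccos(np.clip(pts @ curve.T, -1.0, 1.0)).min(axis=1)
    print("fraction of sphere in fattened region:", (nearest < delta).mean())

    # Is a given normalized in-sample point inside the fattened region?
    x_in = np.array([0.3, 0.5, -0.8])
    x_in /= np.linalg.norm(x_in)
    dist = np.arccos(np.clip(curve @ x_in, -1.0, 1.0)).min()
    print("overfit by this criterion" if dist > delta else "not overfit by this criterion")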

Final Thoughts

Here’s the gist: overfit models take up too much space on the sphere. If your in-sample point lies outside a reasonably fattened “perfect” point set, the model is overfit. This geometric approach to overfitting, while toy-like in this example, generalizes beautifully. Once you compute the fattened region for a given model, it’s reusable, and beyond the beauty of the math, it’s got practical value too.

That’s it!

P.S. Many thanks to Guolong Li for the original concept, from which I went down a deep rabbit hole!
