
Aliens, Cross Sectional Financial Forecasts, Spherical Codes, and Spherical Cap Packing

13 min read · Aug 18, 2025

#AlphaNova #AI #QuantTrading #Crowdsourcing

Preamble

Picture this: an alien spacecraft drifts into orbit, filled with beings of astonishing intelligence but utterly without sight. They have no eyes, no images in their minds. Their sensory endowment is simply tuned differently from ours. Like us, they perceive gravity and interact with the electromagnetic and nuclear forces, but their perception of electromagnetism is split. Unlike humans, they cannot see light; the visual spectrum is lost to them. Instead, they experience a noisy sense of magnetism — the other half of electromagnetism — as if a compass needle were buried deep within their nervous system. For them, “north” is not a direction on a map but a feeling in their bones — a compass needle humming faintly in their minds.

Their compass is terribly noisy. It jitters, wobbles, and drifts, offering only a shaky sense of where true north might be. Each alien interprets this signal differently, some better, some worse.

When they disembark, they are free to choose their drop point anywhere on the planet — a mountain ridge, a desert plain, a lonely coastline — based on what their magnetic sense suggests. Then their mission begins: take 1000 steps, each step guided by that noisy compass. The challenge is simple to state but devilishly hard to achieve: minimize the average angular deviation from the true north pole over the entire journey.

Some aliens march with uncanny steadiness, hugging the pole tightly. Others wobble, drifting in wide arcs that betray the noise of their compass. And of course, no two aliens follow exactly the same path: their noisy perceptions create endless diversity in the trajectories they carve across the sphere.

This whimsical experiment is more than just an alien thought-experiment. It is a parable for our forecasts. Replace aliens with models, the noisy compass with imperfect statistical correlation, and the north pole with ground truth. Each model specifies where to begin, then proceeds to take steps — time steps — always trying to stay “true” on average, though never perfectly. And just as we can cluster aliens by the similarity of their wandering paths, so too can we cluster forecasts by their average correlations.

In the geometry of this problem, the aliens’ mission and ours become the same: how do you arrange as many noisy, north-seeking paths as possible on the sphere without them collapsing into the same trajectory?


Motivation

Back to Planet Earth, as it were.

For the last month or so, I’ve been musing over an interesting problem we at AlphaNova face. We run ML/Data Science competitions and want to onboard as many forecasts as possible, subject to three conditions:

  1. The forecasts aren’t overfit.
  2. They are predictive with reasonable confidence.
  3. They are not too correlated with each other.

This leads to obvious questions: How many forecasts are even possible? What’s a good geometric/topological framework for representing this problem in a way that respects statistics (at least up to second moments)?

The setup below addresses these questions. A more formal math paper is in the works, but this note captures the main ideas.

Cross Sectional Forecasts

Suppose you want to build a cross-sectional forecast of a vector time series of one-step-ahead returns for N assets.

(In reality, trading often requires multi-step-ahead forecasts, which complicates things due to autocorrelation and “effective” sample size reductions. For clarity, we ignore that here.)

Why cross-sectional forecasts? They’re a natural way of creating signals for long/short, dollar-neutral or market-neutral strategies. Effectively, they forecast relative returns versus an equally weighted index of the assets.

Let’s fix a historical sample length M. Denote the realized returns as

G = (g_0, g_1, …, g_{M−1}), with g_i = (g_i1, …, g_iN) ∈ R^N,

where g_ij is the return of asset j over period [i, i+1].

Now, we cross-sectionally z-score each g_ij:

g_ij → (g_ij − mean_j(g_i)) / std_j(g_i),

where the mean and standard deviation are taken across the N assets at each time step i.

Thus each g_i becomes a standardized vector with mean 0 and variance 1 across assets.
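For concreteness, here is a minimal sketch of the cross-sectional z-scoring in NumPy (the array names and simulated data are illustrative, not part of any AlphaNova pipeline):

```python
import numpy as np

def cross_sectional_zscore(G):
    """Z-score each row (time step) of an (M, N) return matrix across the N assets."""
    mu = G.mean(axis=1, keepdims=True)     # cross-sectional mean at each time step
    sigma = G.std(axis=1, keepdims=True)   # cross-sectional std at each time step
    return (G - mu) / sigma                # each row now has mean 0 and variance 1

# Illustrative example: M = 5 time steps, N = 20 assets of simulated returns
rng = np.random.default_rng(0)
G = rng.normal(0.0, 0.02, size=(5, 20))
Z = cross_sectional_zscore(G)
print(Z.mean(axis=1))  # ~0 for every time step
print(Z.std(axis=1))   # ~1 for every time step (so each row has squared norm N)
```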

A forecast sequence is

F = (f_0, f_1, …, f_{M−1}), f_i ∈ R^N,

with the same z-scoring applied to each f_i.

If we treat f_i as the weights of a long/short portfolio at time i, then (as long as there is no autocorrelation) the Sharpe ratio estimate is

SR ≈ mean_i(ρ_i) / std_i(ρ_i), where ρ_i = corr(f_i, g_i) = (1/N) f_i · g_i,

i.e. the time-averaged cross-sectional correlation divided by its time standard deviation.
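A small sketch of that estimator, assuming F and G are already cross-sectionally z-scored (M, N) arrays as above (this is the per-period Sharpe; any annualization is left out):

```python
import numpy as np

def cross_sectional_corr(F, G):
    """Per-time-step correlation rho_i between z-scored forecasts F and returns G, both (M, N).
    For rows with mean 0 and variance 1, the correlation is the mean of the elementwise product."""
    return (F * G).mean(axis=1)

def sharpe_estimate(F, G):
    """Time-averaged cross-sectional correlation divided by its time standard deviation."""
    rho = cross_sectional_corr(F, G)
    return rho.mean() / rho.std()
```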

Spheres and Spherical Codes

Because of z-scoring, each f_i and g_i lies on a sphere:

Σ_j f_ij = 0 and Σ_j f_ij² = N (and likewise for each g_i).

(That is, the vectors live in the hyperplane orthogonal to the all-ones vector, intersected with a standard sphere of radius sqrt(N). The radius sqrt(N) is a detail we can ignore for the geometry: assume unit radius without loss of generality.)

Thus the sequence F is a kind of spherical code, closely related to spherical cap packing in Information Theory and Error-Correcting Codes. The central problem in spherical codes is:

How can we arrange points on a sphere to maximize their minimum pairwise distance?


The distance between two unit vectors is related to their dot product:

||u − v||² = 2 (1 − ⟨u, v⟩).

And since

⟨f_i, g_i⟩ / N

is just the cross-sectional correlation, the geometric language maps naturally to statistics.

Poor Man’s Parallel Transport via Rotations

We want to simplify things further by “centering” the target sequence G.

Without loss of generality, we can assume g_0 sits at the North Pole, which we denote NP.

If g_0 isn’t the South Pole, there is a unique minimal rotation taking g_0 to NP, and applying it to every vector leaves all inner products invariant.

Then, iteratively: at each time step i, let R_i be the minimal rotation with R_i g_i = NP, and apply the same rotation to the forecast, f_i → R_i f_i.

At the end, all g_i map to NP, and the forecast sequence is transformed into

F′ = (R_0 f_0, R_1 f_1, …, R_{M−1} f_{M−1}).

You can refer to our Alien preamble to see the mapping, no pun intended.

This preserves:

  • The correlation of each f_i with g_i.
  • The time-averaged correlation and its variance.
  • Cross-forecast correlations, if we had multiple forecasts.

So we’ve reduced the ground truth to a constant North Pole reference, simplifying geometry.
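Here is one way this recentering could be implemented (a sketch only, assuming the rows of F and G have been rescaled to unit length, i.e. the z-scored rows divided by sqrt(N); the function names are mine):

```python
import numpy as np

def rotation_to(a, b):
    """Minimal rotation R with R @ a = b for unit vectors a, b (a not antipodal to b).
    R acts in the plane spanned by a and b and is the identity on its orthogonal complement."""
    c = float(a @ b)                 # cos(theta)
    w = b - c * a
    s = float(np.linalg.norm(w))     # sin(theta)
    if s < 1e-12:                    # a and b already aligned
        return np.eye(len(a))
    v = w / s
    return (np.eye(len(a))
            + (c - 1.0) * (np.outer(a, a) + np.outer(v, v))
            + s * (np.outer(v, a) - np.outer(a, v)))

def recenter(F, G):
    """Rotate each g_i to the North Pole and apply the same rotation to f_i.
    F, G: (M, N) arrays with unit-length rows. Returns the rotated forecast sequence."""
    NP = G[0]                        # take g_0's direction as the North Pole (equivalent up to one global rotation)
    F_rot = np.empty_like(F)
    for i in range(len(G)):
        R = rotation_to(G[i], NP)    # orthogonal, so all inner products are preserved
        F_rot[i] = R @ F[i]          # now F_rot[i] @ NP equals the original f_i . g_i
    return F_rot
```

Because each R_i is orthogonal, the correlation of f_i with g_i before the rotation equals the correlation of the rotated f_i with NP afterwards, which is exactly the invariance listed above.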

Why bother?

Visualisation

In low dimensions (N ≤ 4), you can actually visualize forecasts on a two-dimensional sphere:

  • A “good” forecast stays close to NP, just like the aliens that get it right on average.
  • An “excellent” one also has low variance in radial distance (stable correlation), just like aliens that march in a tight, noisy circle around the North Pole.

In higher dimensions, you can’t visualize directly — but this rotation framework still frames what “good” means: average correlation > 0 (upper hemisphere). Adjusting for statistical confidence might shift the threshold northward (e.g. toward the Tropic of Cancer).

Overfitting

Even more importantly, this perspective simplifies the geometry of overfitting. In this earlier note, overfitting was described as taking place within a “fattened-up neighborhood” of the ground truth path. Here, the problem becomes simpler: it is equivalent to choosing a spherical cap of angular radius r around the North Pole such that if a model has average correlation below cos(r), it lies outside the cap and is considered overfit.

This reframing turns overfitting from a fuzzy neighborhood problem on a high-dimensional manifold into a clean geometric thresholding problem: measure the angular distance from the North Pole on the sphere.
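As a tiny illustration of that thresholding rule (continuing the unit-row convention from the sketches above; the cap radius r is a free parameter):

```python
import numpy as np

def is_overfit(F_rot, NP, r):
    """Flag a recentered forecast as overfit if its time-averaged correlation with the
    North Pole falls below cos(r), i.e. it lies outside the cap of angular radius r."""
    avg_corr = float((F_rot @ NP).mean())   # rows of F_rot and NP are unit vectors
    return avg_corr < np.cos(r)
```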


Expanding to M Copies of the Sphere

Finally, instead of thinking of F as M points on one sphere, we can equivalently view it as a single point in the product space

S^{N−2} × S^{N−2} × ⋯ × S^{N−2} (M copies).

Think of this as the frames of a crude movie: reality unfolds in spacetime, and we approximate spacetime by a sequence of pictures in pure space.

This perspective makes it clear that:

  • Each forecast is one point in a huge product of spheres.
  • Average correlation corresponds to the inner product in this product space, normalized by M.
  • Clustering and packing questions naturally become sphere packing problems in high dimension.

Here is an example of a cross-sectional forecast for N=4 assets, M=4 timesteps, rotated as in the previous section. We first plot the four observations on a single sphere, and then represent the forecast ensemble as a point in the 4-fold product of spheres.

On one sphere:

[Figure: the four time steps plotted on a single sphere]

On a sequence of 4 copies of the sphere:

[Figure: the same forecast represented as a point in the 4-fold product of spheres]

The point of expanding a forecast over M timesteps into a point on an M-fold product of (N−2)-dimensional spheres is that the geometry of interest — the inner product (correlation) inherited from the ambient (N−1)-dimensional Euclidean space — behaves very nicely under the product structure. Concretely, the inner product of two such forecast sequences in the product space is just the sum of the inner products of their components. Equivalently, the average correlation of a forecast with the North Pole (NP) sequence is simply the product-space inner product with the NP sequence, divided by M.

Remark. This M-fold product construction is exact when there is essentially zero autocorrelation across timesteps. If forecasts have temporal dependence, the effective dimension is lower (roughly scaled by an “effective sample size”), but the cleanest formulation is in the independent case.
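In code, the product-space viewpoint is essentially a reshape: stacking the M rows of a forecast gives a single long vector, and its inner product with the stacked NP sequence, divided by M, is the average correlation (again a sketch under the unit-row convention):

```python
import numpy as np

def product_space_inner(F1, F2):
    """Inner product of two forecast sequences viewed as single points in the M-fold product:
    the sum of the per-time-step inner products."""
    return float((F1 * F2).sum())

def avg_corr_with_np_sequence(F_rot, NP):
    """Average correlation with the NP sequence = product-space inner product / M."""
    M = F_rot.shape[0]
    NP_seq = np.tile(NP, (M, 1))          # the constant sequence (NP, ..., NP)
    return product_space_inner(F_rot, NP_seq) / M
```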

The Packing Problem

With this reformulation, the problem becomes:

How many points can we pack on the M-fold product of spheres such that:

  1. Each has average correlation ≥r≥0 with the NP sequence, and
  2. Pairwise average correlation between forecasts is at most H.

This is a modified spherical cap packing problem, generalized to a product of spheres.

Mathematically, if we let

S = { x ∈ R^N : Σ_j x_j = 0, ||x|| = 1 } ≅ S^{N−2},

then a forecast sequence is a point

F = (f_0, f_1, …, f_{M−1}) ∈ S × S × ⋯ × S (M copies).

The similarity measure between two forecast sequences F and G is

sim(F, G) = (1/M) Σ_i ⟨f_i, g_i⟩,

i.e. the average correlation across timesteps.

The NP sequence, i.e. pure North Pole at every time step, corresponds to (NP, …, NP).

Now the problem is: pack as many points as possible in the M-fold product of (N−2)-spheres, i.e. in

S × S × ⋯ × S (M copies),

subject to (a) average correlation with NP ≥ r, and (b) pairwise average correlation ≤ H.
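A direct, brute-force admissibility check for a candidate set of forecasts (illustrative only; unit-length rows assumed, and the thresholds r and H are inputs):

```python
import numpy as np

def avg_corr(F1, F2):
    """Average per-time-step correlation between two (M, N) forecast sequences with unit rows."""
    return float((F1 * F2).sum(axis=1).mean())

def admissible(forecasts, NP, r, H):
    """Check (a) average correlation with the NP sequence >= r for every forecast, and
    (b) pairwise average correlation <= H for every pair. `forecasts` is a list of (M, N) arrays."""
    M = forecasts[0].shape[0]
    NP_seq = np.tile(NP, (M, 1))
    if any(avg_corr(F, NP_seq) < r for F in forecasts):
        return False
    return all(avg_corr(forecasts[i], forecasts[j]) <= H
               for i in range(len(forecasts))
               for j in range(i + 1, len(forecasts)))
```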

Connection to the Classical Problem

If M=1, this reduces to the classical spherical cap packing problem: arranging points on a single (N−2)-dimensional sphere subject to correlation constraints with NP and with each other.

If M≫1, the situation becomes more flexible because the ambient linear space has dimension

M(N−1).

This increase in dimension means there are far more degrees of freedom for constructing forecasts as M, or even N, grows.

Extremal Cases (with explicit N-dependence)

Let

J(N, M, ρ, H)

be the maximum number of forecasts that can be represented as points in the M-fold product of (N−2)-dimensional spheres,

such that:

  1. Each forecast has average correlation with the North Pole (NP) at least ρ.
  2. For any two forecasts F1, F2 in the set, their pairwise average correlation is at most H.

Note on elbow room: as N increases, the per-timestep sphere S^{N−2} has higher dimension, yielding more geometric “elbow room” to place points subject to the same constraints. (All else equal, larger N increases capacity.)

Case M=1

  • If ρ=0 and H=0: then J(N,1,0,0) = N−1, realized by an orthonormal basis sitting in the upper hemisphere. I leave this as an exercise for the reader.
  • If ρ=0 and H=1: then J(N,1,0,1) = ∞, since identical forecasts are allowed.
  • For 0<H<1: we conjecture that J(N,1,0,H) (suitably interpreted as a continuous relaxation) is at least twice differentiable in H, monotone increasing (J′ ≥ 0), and convex (J′′ > 0). We are fairly sure this is true, but we offer no proof.

Case M>1

  • If ρ=0 and H=0: J(N,M,0,0) = M (N−1), since the dimension of the ambient linear space is M(N-1) (mutual orthogonality in the product embedding).
  • If ρ=0 and H=1: J(N,M,0,1) = ∞.
  • For 0<H<1: a simple linear lower bound is J(N,M,0,H) ≥ M·J(N,1,0,H), obtained by replicating a one-sphere construction across the M copies.
    In practice, because correlations are averaged across copies and you are effectively packing in ambient dimension d = M(N−1), capacity typically grows much faster than linearly in M (closer to high-dimensional spherical code behavior).

We will return to this in the technical paper, but keep in mind: when N=20 and M=10,000, the space of forecasts has dimension up to 180,000, and the number of admissible forecasts is, really, much, much larger.

📦 How Big Can it Be? — Greedy Packing Estimate

To get an intuition for how large the ensemble of admissible forecast vectors can be under a correlation constraint, let’s consider the simplest meaningful case: 20 assets and a single time step, that is, N=20 and M=1 (just one observation).

We ask:

How many forecast vectors can I construct such that every pair has average correlation no more than H?

Greedy Packing Heuristic

A common heuristic for estimating how many well-spaced forecast vectors you can pack on a sphere is to treat each as occupying a “no-go zone” — a spherical cap of a certain angular radius — and then approximate the maximum number by dividing the total sphere area by the exclusion-zone area. This geometric approach is standard in the sphere-packing and spherical-code literature (e.g., see Cohn & Zhao, Sphere packing bounds via spherical codes, 2012):

count ≈ Area(sphere) / Area(exclusion cap).

This works well in moderate to high dimensions and when the correlation threshold is not too tight.
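Here is a sketch of that area-ratio estimate. It uses the disjoint-exclusion-cap convention (each point owns a cap of angular radius arccos(H)/2); other conventions give different constants, so the numbers are order-of-magnitude only and will not exactly reproduce the plot below:

```python
import numpy as np

def cap_fraction(theta, d, steps=200_001):
    """Fraction of the surface of the unit sphere in R^d lying within angular radius theta
    of a pole, via a simple numerical integral of sin(t)^(d-2) over [0, pi]."""
    t = np.linspace(0.0, np.pi, steps)
    w = np.sin(t) ** (d - 2)
    return w[t <= theta].sum() / w.sum()

def greedy_packing_estimate(N, H):
    """Area-ratio estimate of how many unit vectors fit on the sphere with pairwise
    correlation <= H. The vectors live in the (N-1)-dimensional hyperplane orthogonal to
    the all-ones vector, and must be at least arccos(H) apart in angle."""
    d = N - 1                      # ambient dimension of the hyperplane
    theta = np.arccos(H)           # minimum allowed pairwise angle
    return 1.0 / cap_fraction(theta / 2.0, d)

# Illustrative run for N = 20 assets and one time step
for H in (0.3, 0.5):
    print(f"H = {H}: roughly {greedy_packing_estimate(20, H):,.0f} forecasts")
```

With this convention the estimate is in the millions by H = 0.5, consistent in order of magnitude with the “over 1 million models” figure quoted below.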

Example: N=20, ρ=0, M=1

Here’s the greedy estimate of J(20,1,ρ=0,H) as a function of the allowed pairwise correlation H:

[Figure: greedy estimate of J(20, 1, 0, H) as a function of the pairwise correlation threshold H]

As you can see, the number of admissible vectors grows rapidly as H→1. Around H=0.3 the capacity starts ramping up, and by H=0.5 it is possible to fit over 1 million models.

And that is for just one observation. If the number of observations is large, then clearly the number of possible models is astonishingly large!

H-Clustering and Practical Use

Setting aside the technical challenge of deriving sharp bounds for J(N,M,ρ,H) when M>>1, one clear implication is that — when both N (number of assets) and M (sample size) are sufficiently large — the number of theoretically possible forecasts is immense, far larger than the small handful of component models typically used by practitioners. In this sense, the geometry strongly suggests there is far more “capacity” for diverse forecasts than most people exploit.

In our framework, the principle is simple: the more forecasts that are predictive and not overly correlated with each other, the better.

Suppose, for example, that in a competition we receive 500 model submissions that are statistically predictive at some confidence level. We then construct clusters of these forecasts (together with our existing library of models) according to the H-threshold condition. In other words, two models fall in the same cluster if their average correlation is at least H. This clustering is not just an abstract data-analytic exercise: it has a precise geometric meaning on the M-fold product of (N−2)-dimensional spheres. Each cluster corresponds to a connected component in this correlation-threshold geometry.
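A minimal sketch of this H-clustering as connected components of a correlation graph (pure NumPy/Python with illustrative names; `forecasts` is a list of recentered (M, N) arrays with unit rows):

```python
import numpy as np

def h_clusters(forecasts, H):
    """Cluster forecast sequences by the H-threshold: connect two forecasts if their
    average correlation is >= H, then return the connected components (index lists)."""
    n = len(forecasts)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            c = float((forecasts[i] * forecasts[j]).sum(axis=1).mean())
            adj[i][j] = adj[j][i] = (c >= H)

    seen, clusters = set(), []
    for start in range(n):              # depth-first search over the threshold graph
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            k = stack.pop()
            if k in seen:
                continue
            seen.add(k)
            comp.append(k)
            stack.extend(j for j in range(n) if adj[k][j] and j not in seen)
        clusters.append(comp)
    return clusters
```

From each component one can then pick a representative (for example the submission with the highest Sharpe ratio), which is exactly the selection step described next.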

From each cluster, we would then select the representative with the highest Sharpe ratio as the “winning” forecast. That representative would then be onboarded into our production system.

This construction naturally raises two practical questions:

  1. Choosing the optimal H.¹
    A higher value of H produces more clusters, but also means forecasts across clusters are more correlated. A lower H produces fewer clusters, but with larger separations between them. In practice, one may want to tune H to minimize the variance of equally-weighted portfolio sums of forecasts — balancing diversity with reliability.

  2. Detecting “desert” regions.
    Just as a night-time satellite image of the Earth reveals bright cities and dark, sparsely populated deserts, the geometry of our product space will exhibit regions densely filled with models and others left blank. An open problem is how best to detect and exploit these “forecast deserts” — regions of the sphere product far from any existing clusters. Such regions may represent opportunities for constructing entirely new, uncorrelated forecasts.

Stay tuned for more mind-bending AI-quant stuff that is a bit left field from what you’re used to.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

  1. As one AlphaNova scientist on our platform (who is also, incidentally, an investor) pointed out, picking the best-Sharpe-ratio representative from each correlation cluster doesn’t necessarily yield the best ensembled Sharpe. Furthermore, AlphaNova uses a deep neural net whose features include several weak forecasts. Therefore, we will use an algorithm that iteratively adds forecasts from the clusters such that (a) there is at most one forecast per cluster, and (b) at each iteration the loop picks the forecast that most improves the Sharpe ratio downstream of the neural net output. Through warm starts, parallelization, fine-tuning and Net2Net methods, this can be made computationally feasible.
