Backtest · May 7, 2026 · 6 min read

What 128 World Cup matches told us about the limits of our model

The Dixon-Coles model beats a uniform baseline by about 7% in Brier score across two World Cups. That sounds good. Here is what it actually means, and where the model still fails.

By WorldSportsXAI Editorial

Any prediction model that does not publish its backtest is, charitably, a model you should not trust with money or attention. So before we asked anyone to take the 2026 predictions seriously, we ran the same Dixon-Coles pipeline on the 2018 and 2022 World Cups — 128 real matches that the model never saw during fitting — and graded its predictions against what actually happened.

The headline number: our Brier score across those 128 matches is 0.620. A naive baseline that just guesses 1/3, 1/3, 1/3 for every match scores 0.667. That is a gap of 0.047, or about 7.0% of the baseline's score (a Brier skill score of roughly 0.07). That is real signal, since every full point of Brier-skill improvement is hard to come by in sports prediction, but it is not a magic crystal ball, either.

What 7% Brier skill actually means

Brier score is the squared distance between the predicted probabilities and what actually happened, summed over the three possible results of a match (each scored 1 if it happened, 0 otherwise) and averaged over all matches. Lower is better. A perfect model would score 0. On this three-outcome scale, a maximally bad model that always puts 100% on a result that does not happen would score 2. A baseline that knows nothing and shrugs 1/3 on each result scores 0.667.
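As a rough illustration, the score can be computed in a few lines of Python. This is a minimal sketch, not the pipeline's actual code: the function name and the input layout are assumptions, and it uses the three-outcome form of the score (squared errors summed over home win, draw and away win), since that is the form under which a uniform forecast scores 0.667.

```python
from typing import Sequence


def brier_score(
    forecasts: Sequence[Sequence[float]],
    results: Sequence[int],
) -> float:
    """Mean three-outcome Brier score.

    forecasts[i] holds three probabilities (home win, draw, away win);
    results[i] is the index (0, 1 or 2) of the result that occurred.
    """
    if len(forecasts) != len(results):
        raise ValueError("forecasts and results must have the same length")
    total = 0.0
    for probs, actual in zip(forecasts, results):
        # Squared error summed over all three possible results of one match.
        total += sum(
            (p - (1.0 if k == actual else 0.0)) ** 2
            for k, p in enumerate(probs)
        )
    return total / len(forecasts)


# Toy inputs, not the backtest data.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
results = [0, 1, 2, 0]
print(round(brier_score(uniform, results), 3))   # 0.667, the uniform baseline
print(brier_score([[1.0, 0.0, 0.0]], [0]))       # 0.0, a perfect forecast
```

Whatever the results are, the uniform forecast always lands on the same 0.667, which is what makes it a convenient zero-knowledge reference point.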

Our model scoring 0.620 means that, averaged across 128 matches, its probabilities sat closer to what actually happened than the uniform baseline's did. That does not hold for every single match: the model got many matches wrong, including some that mattered a lot. We will get to those. But across the whole sample, the edge was real.
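The 7% figure can be reproduced as a Brier skill score, meaning the share of the baseline's score that the model removes. The sketch below only re-does the arithmetic on the two reported numbers; the function name is ours for illustration.

```python
def brier_skill_score(model_brier: float, baseline_brier: float) -> float:
    """Fraction of the baseline's Brier score that the model removes."""
    if baseline_brier <= 0:
        raise ValueError("baseline Brier score must be positive")
    return 1.0 - model_brier / baseline_brier


model = 0.620      # reported model score over 128 matches
baseline = 2 / 3   # uniform 1/3, 1/3, 1/3 forecast

print(f"absolute gap: {baseline - model:.3f}")                # absolute gap: 0.047
print(f"skill: {brier_skill_score(model, baseline):.1%}")     # skill: 7.0%
```

The absolute gap is 0.047 Brier points; the 7.0% is that gap expressed relative to the baseline.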

The reliability bins on our predictions page tell a more useful version of the same story. When the model said a team had a 30% chance to win, that team won roughly 31% of the time. When it said 70%, the team won about 68% of the time. The points cluster near the diagonal — meaning the model is well-calibrated, not just lucky.
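For readers who want to build a similar reliability table from their own forecasts, here is one possible sketch. The bin count, the function name and the toy data are all assumptions for illustration; the binning scheme behind our predictions page may differ.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def reliability_bins(
    pairs: Iterable[Tuple[float, bool]], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Group (predicted probability, did it happen) pairs into equal-width bins.

    Returns one (mean predicted, observed frequency, count) tuple
    per non-empty bin, ordered from low to high probability.
    """
    buckets = defaultdict(list)
    for prob, happened in pairs:
        # Clamp so that a probability of exactly 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        buckets[index].append((prob, happened))
    rows = []
    for index in sorted(buckets):
        items = buckets[index]
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, h in items if h) / len(items)
        rows.append((mean_pred, observed, len(items)))
    return rows


# Made-up toy data, not the backtest itself.
toy = [
    (0.30, True), (0.32, False), (0.35, False),
    (0.70, True), (0.72, True), (0.75, False),
]
for mean_pred, observed, count in reliability_bins(toy):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

A well-calibrated model produces rows where the predicted and observed columns are close to each other, which is the same thing as points sitting near the diagonal.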

Where the model failed

The interesting matches are the ones the model got wrong. Three patterns came up:

Knockout upsets. Group-stage predictions land well; knockout predictions land less well. The model knows nothing about a team's bracket context, which is fine for round-robin matches but matters more in single-elimination football, where one bad day eliminates you. South Korea over Germany in 2018 (a group-stage match, but one that sent Germany home) and Croatia over Brazil in 2022: the model rated both correctly as upsets, just not as upsets this likely. It is a structural blind spot.
Low-information teams. Countries with few qualifying matches before the tournament are hard for any time-decay model. Saudi Arabia's 2022 squad had so few competitive internationals in the prior two years that the model's confidence band was wide. It gave their win over Argentina about 12%, and it had no realistic basis for a sharper estimate.
Tournament-specific narratives. The model does not know that Morocco's 2022 squad had a unique tactical-defensive identity unlike any prior Morocco team. It saw Morocco's prior internationals and inferred a normal mid-tier team. The model was wrong; the bookmakers were also wrong, just less wrong.
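The low-information problem can be made concrete with a small sketch. Dixon and Coles weight past matches by exp(-xi * age), so older matches count for less. The decay rate below and both match schedules are hypothetical, picked only for illustration, and the effective-sample-size formula is a standard one (Kish's) that we are borrowing here, not something the pipeline necessarily reports.

```python
import math
from typing import List, Sequence


def decay_weights(ages_in_days: Sequence[float], xi: float) -> List[float]:
    """Exponential time-decay weights, exp(-xi * age)."""
    return [math.exp(-xi * age) for age in ages_in_days]


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


XI = 0.002  # hypothetical decay rate per day, for illustration only

# Hypothetical schedules: days before the tournament on which each match was played.
busy_team = [30 * k for k in range(1, 25)]    # roughly one match a month for two years
quiet_team = [120 * k for k in range(1, 7)]   # six matches over the same two years

for name, ages in (("busy", busy_team), ("quiet", quiet_team)):
    w = decay_weights(ages, XI)
    print(name, round(sum(w), 2), round(effective_sample_size(w), 2))
```

Under any decay rate, the quiet team contributes far less weighted evidence, so its strength estimate rests on less data and comes with a wider band.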

What this means for 2026

Two takeaways for the 2026 numbers we are publishing now:

Group-stage probabilities should be trusted more than knockout probabilities. The further into the bracket a number is, the more knockout-stage variance compounds. Pay attention to the model's group-stage calls; treat semifinal and final probabilities as directional, not definitive.
Confidence in heavy favorites is justified; confidence in upset picks is not. When the model says Argentina has a 26% chance to win, it has earned the right to say it. When it says some mid-tier team has a 0.5% chance, treat that as a rough estimate rather than a precise one: anything from 0.2% to 2% is consistent with the data. Small probabilities spread across 48 teams still add up to one or two upsets we cannot name in advance.
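To see why deep-bracket numbers are shakier than single-match numbers, consider a simplified sketch in which a team's title chance is the product of its per-round win probabilities. The per-round numbers are hypothetical, and treating rounds as independent with fixed probabilities is a simplification made for illustration, not a description of how the model simulates the bracket.

```python
from math import prod
from typing import Sequence


def title_probability(round_win_probs: Sequence[float]) -> float:
    """Chance of winning every knockout round, assuming independent rounds."""
    return prod(round_win_probs)


# Hypothetical per-round win probabilities across the five knockout rounds
# of a 48-team tournament (round of 32 through the final).
estimated = [0.70, 0.65, 0.60, 0.55, 0.50]
# The same team if every per-round estimate is 5 points too high.
shifted = [p - 0.05 for p in estimated]

p_est = title_probability(estimated)
p_shift = title_probability(shifted)
print(f"estimated title chance: {p_est:.3f}")
print(f"shifted title chance:   {p_shift:.3f}")
print(f"relative change:        {1 - p_shift / p_est:.0%}")
```

In this toy case, an error of 5 points per round moves the title probability by roughly a third, which is why semifinal and final numbers are best read as directional.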

How we will update

As 2026 plays out, we will publish in-tournament updates: which calls the model got right, which it got wrong, and how the reliability plot is moving. The calibration plot you see today is built from the 2018 and 2022 tournaments. By the time the 2026 tournament ends, we will have a third tournament's worth of data to add to it.

That is the bargain. We publish the numbers, then we publish what happened. If we are wrong, we say it.
