PunditBench Methodology — 2026 World Cup edition
PunditBench measures how well large language models predict real football. For the 2026 FIFA World Cup, every participating model predicts its own complete tournament — all 72 group matches and then, derived from its own predictions, its own knockout bracket through to its own champion — entirely before the opening kickoff. Reality then scores every claim. (PunditBench is an independent project, not affiliated with FIFA; tournament and team names are used editorially. All predictions shown are AI-generated content.)
The initial methodology (2026-06-10) collected group-stage predictions and planned to reveal real knockout pairings round by round. It was upgraded to the self-consistent bracket-simulation design described below on 2026-06-11, before the opening match; the group-stage collection is identical under both and was not re-run. Full history in CHANGELOG.md.
How predictions are collected
- Group stage. Every model receives one identical prompt with all 72 group fixtures (official match numbers, teams, dates, venues) and returns a strict-JSON score for each. No tools, no web access, training knowledge only; `temperature: 0` where the model accepts it (recorded per call).
- Self-consistent knockout simulation. From the model's own 72 scores we compute its group tables (FIFA tiebreakers: points, goal difference, goals scored, head-to-head) and its qualified third-placed teams, slot the thirds using FIFA's official Annexe C lookup table (all 495 combinations, parsed from the official regulations; see ALLOCATION-NOTES.md), and obtain the model's own Round of 32. The model is then prompted with its own bracket, explicitly framed as "the knockout bracket that follows from YOUR OWN predictions", and predicts those 16 matches, naming the team that advances where it predicts a 90-minute draw. Its answers build its Round of 16, and so on through the quarter-finals, semi-finals, third-place match and final. Six prompts per model; every model ends with a full simulated tournament and a champion.
- Everything is locked pre-kickoff. No prediction anywhere in the system depends on a single real result. The complete set (group + all simulated rounds, raw API traffic included) is hashed and pre-registered before the opening match.
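The group-table step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `group_table` and the tuple layout are hypothetical, the head-to-head criterion is deliberately left out, and the final alphabetical fallback for groups is an assumption made only to keep the ordering deterministic.

```python
from collections import defaultdict


def group_table(scores):
    """Order one group's teams from a model's predicted scores.

    scores: list of (home, away, home_goals, away_goals) tuples.
    Sort order: points, goal difference, goals scored, then team name.
    Head-to-head is NOT implemented in this sketch.
    """
    stats = defaultdict(lambda: {"pts": 0, "gd": 0, "gf": 0})
    for home, away, hg, ag in scores:
        stats[home]["gf"] += hg
        stats[away]["gf"] += ag
        stats[home]["gd"] += hg - ag
        stats[away]["gd"] += ag - hg
        if hg > ag:
            stats[home]["pts"] += 3
        elif hg < ag:
            stats[away]["pts"] += 3
        else:
            stats[home]["pts"] += 1
            stats[away]["pts"] += 1
    return sorted(
        stats,
        key=lambda t: (-stats[t]["pts"], -stats[t]["gd"], -stats[t]["gf"], t),
    )


# Placeholder team names; six predicted matches for one four-team group.
predicted = [
    ("A", "B", 2, 0), ("C", "D", 1, 1), ("A", "C", 1, 0),
    ("B", "D", 2, 2), ("A", "D", 0, 0), ("B", "C", 3, 1),
]
print(group_table(predicted))
```

Because the table is a pure function of the 72 predicted scores, the Round of 32 can be derived without asking the model anything further.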
Scoring
Group matches (72 real matches, every model predicted all of them):
| Outcome | Points |
|---|---|
| Exact score | 3 |
| Correct goal difference (includes any correct draw) | 2 |
| Correct outcome (win/draw/loss) | 1 |
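As an illustration of the table above, here is a minimal sketch of the group-match scoring rule, assuming the highest applicable tier is the one awarded. The function name `score_match` and the `(home_goals, away_goals)` tuple layout are hypothetical.

```python
def score_match(pred, real):
    """Score one predicted scoreline against the real one.

    pred, real: (home_goals, away_goals) tuples.
    Returns 3 for an exact score, 2 for a correct goal difference
    (which covers every correct draw), 1 for a correct outcome, else 0.
    """
    if pred == real:
        return 3
    pred_diff = pred[0] - pred[1]
    real_diff = real[0] - real[1]
    if pred_diff == real_diff:
        return 2

    def sign(d):
        # -1 for an away win, 0 for a draw, 1 for a home win
        return (d > 0) - (d < 0)

    return 1 if sign(pred_diff) == sign(real_diff) else 0
```

Under this reading, a predicted 1-1 against a real 0-0 earns 2 points, because any two draws share a goal difference of zero.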
Bracket (knockout) scoring, against the real tournament as it unfolds:
| Component | Points |
|---|---|
| Real team you had reaching the Round of 32 | 1 each |
| … the Round of 16 | 2 each |
| … the quarter-finals | 3 each |
| … the semi-finals | 5 each |
| … the final | 8 each |
| Correct champion | 13 |
| Your simulated pairing actually occurs in that real round (incl. third-place match) | +1 each |
| Scoreline of a matched pairing, scored like a normal match (orientation-normalized, 90-minute result) | 3/2/1 (+1 correct advancer) |
A team "reaches" a stage by appearing in it; reach derives from the model's simulated pairings and its advances answers (Round-of-32 reach is determined entirely by the group predictions — computing the bracket needs no model input). Theoretical maximum: 216 (group) + 137 (advancement) + 32 (matchups) + 128 (matched scorelines) = 513.
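The theoretical maximum can be re-derived from the point values in the two tables. The sketch below is only a check of that arithmetic; the constant names are hypothetical, and the round sizes (32, 16, 8, 4 and 2 teams, with 32 knockout matches including the third-place match) follow from the format described above.

```python
GROUP_MATCHES = 72
EXACT_SCORE_POINTS = 3

# (teams in the round, points per correctly placed team)
REACH = {"R32": (32, 1), "R16": (16, 2), "QF": (8, 3), "SF": (4, 5), "F": (2, 8)}
CHAMPION_POINTS = 13

# 16 + 8 + 4 + 2 knockout matches, plus the third-place match and the final
KNOCKOUT_MATCHES = 16 + 8 + 4 + 2 + 1 + 1

group_max = GROUP_MATCHES * EXACT_SCORE_POINTS
advancement_max = sum(n * pts for n, pts in REACH.values()) + CHAMPION_POINTS
matchup_max = KNOCKOUT_MATCHES * 1
# Exact score (3) plus the correct-advancer bonus (1) on every matched pairing
scoreline_max = KNOCKOUT_MATCHES * (3 + 1)

total_max = group_max + advancement_max + matchup_max + scoreline_max
print(group_max, advancement_max, matchup_max, scoreline_max, total_max)
```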
Leaderboard tiebreakers, in order: total points → most exact scores → correct champion → most correct Round-of-32 qualifiers → shared rank.
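The tiebreaker chain maps naturally onto a tuple sort key. The sketch below is illustrative: the `Entry` fields and the `rank` function are hypothetical names, and the "1, 2, 2, 4" style of shared rank is an assumption, since the text above only says that fully tied models share a rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    total: int
    exact_scores: int
    correct_champion: bool
    r32_qualifiers: int


def tiebreak_key(e):
    # Compared left to right; higher is better for every component.
    return (e.total, e.exact_scores, int(e.correct_champion), e.r32_qualifiers)


def rank(entries):
    """Return {model: rank}; entries with identical keys share a rank."""
    ordered = sorted(entries, key=tiebreak_key, reverse=True)
    ranks, prev_key, prev_rank = {}, None, 0
    for position, e in enumerate(ordered, start=1):
        key = tiebreak_key(e)
        if key != prev_key:
            prev_key, prev_rank = key, position
        ranks[e.model] = prev_rank
    return ranks


# Made-up numbers, for illustration only.
table = [
    Entry("model-a", 100, 10, False, 20),
    Entry("model-b", 100, 12, False, 18),
    Entry("model-c", 100, 10, False, 20),
    Entry("model-d", 90, 15, True, 25),
]
print(rank(table))
```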
Voided or abandoned real matches are excluded from scoring (0 points for every model); each case is documented in the changelog.
Integrity rules
- Kickoff cutoff (golden rule). A prediction counts only if generated before the relevant information existed in reality — here, everything predates the opening kickoff (2026-06-11 19:00 UTC). Per-call timestamps are in the raw logs.
- Pre-registration. Canonical SHA-256 hashes of each locked prediction set are committed and tagged in the public repository before kickoff (`data/hashes/`, git tags). Anyone can recompute them from the published data.
- Raw audit trail. Every API request and response, including failed attempts and validator feedback, is published verbatim in `data/raw/`.
- Frozen roster. Fixed before the opening kickoff at 40 models (pre-kickoff expansions 18 → 33 → 44, every addition predicting under identical conditions before any match; then four models removed pre-kickoff because they could not produce fully valid prediction sets across four retry cycles, and two labs dropped because their catalog-listed endpoints served nothing; all raw attempts are preserved in the published audit logs and documented in ROSTER-NOTES.md). Every ranked model therefore carries a complete tournament: 72 group scorelines and a full simulated bracket. Later additions, if ever, would be unranked exhibition entries.
- Identical treatment. Same prompt templates, same parameters policy, same validator for every model. Knockout prompts are personalized only by the model's own previous answers — which is the design, not an asymmetry.
- Derived scoring. Points are recomputed from raw predictions + results on every site build and re-derivable from raw logs via `npm run audit`.
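To make the pre-registration rule concrete, here is a minimal sketch of a canonical hash. The exact canonical form used by the project is not stated above, so the choice here (JSON with sorted keys, compact separators, UTF-8 encoding) is an assumption, as are the function name and the record layout.

```python
import hashlib
import json


def canonical_sha256(prediction_set):
    """Hash a prediction set independently of key order and whitespace.

    Assumed canonical form: JSON with sorted keys, no extra whitespace,
    encoded as UTF-8. The project's real canonicalization may differ.
    """
    canonical = json.dumps(
        prediction_set, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records: same content, different key order.
a = {"model": "example-model", "matches": [{"match": 1, "home": 2, "away": 1}]}
b = {"matches": [{"away": 1, "home": 2, "match": 1}], "model": "example-model"}
print(canonical_sha256(a) == canonical_sha256(b))
```

Whatever canonical form is chosen, fixing it before hashing is what lets anyone recompute the same digest from the published data.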
Validation & failure policy
Responses must cover every listed fixture exactly once with integer goals 0–15; knockout predictions must name a consistent advancing team. Invalid responses get up to 2 corrective retries with the validator's errors appended. Entries for unlisted match numbers are dropped with a logged warning rather than failing the response (rule relaxed pre-kickoff on 2026-06-11 after two small models enthusiastically predicted matches beyond the fixture list; all earlier-passing models unaffected — see CHANGELOG.md). Four models that still could not produce fully valid sets after four retry cycles (Granite 4.1 8B, LFM-2 24B, Phi-4 Mini, Llama 3.2 1B — all small models; the capability floor for this task format is real) were removed from the ranked roster before kickoff rather than carried as zero or partial entries; their raw attempts remain published in data/raw/.
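The group-stage validation rules above could look roughly like this. It is a sketch under stated assumptions: the entry fields `match`, `home` and `away` and the function names are hypothetical, and the retry loop and knockout checks are omitted.

```python
def _valid_goals(value):
    # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 15


def validate_group_response(listed_matches, entries):
    """Return (kept, errors, warnings) for one parsed response.

    Errors make the response invalid (and would trigger a corrective retry);
    warnings record dropped entries for unlisted match numbers.
    """
    listed = set(listed_matches)
    kept, seen, errors, warnings = {}, set(), [], []
    for entry in entries:
        match = entry.get("match")
        if not isinstance(match, int) or isinstance(match, bool):
            errors.append(f"entry without an integer match number: {entry!r}")
            continue
        if match not in listed:
            warnings.append(f"dropped entry for unlisted match {match}")
            continue
        if match in seen:
            errors.append(f"match {match} appears more than once")
            continue
        seen.add(match)
        goals = (entry.get("home"), entry.get("away"))
        if not all(_valid_goals(g) for g in goals):
            errors.append(f"match {match}: goals must be integers 0-15")
            continue
        kept[match] = goals
    for match in sorted(listed - seen):
        errors.append(f"match {match} is missing")
    return kept, errors, warnings
```

Returning warnings separately from errors mirrors the relaxed rule: an extra, unlisted match is logged but does not fail an otherwise complete response.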
The roster
40 models across 19 vendors — current flagships, mid-tiers and small models, plus a legacy wing (2023–24 era: GPT-3.5 Turbo, GPT-4, GPT-4o, Claude 3 Haiku, Llama 3 70B, Gemma 2 27B, Qwen 2.5 72B) and an oddball wing (a diffusion language model, a community 405B finetune, a released-then-pulled MoE, and friends) — all accessed through OpenRouter with IDs verified against the live catalog and live-pinged on collection day: data/roster.json, ROSTER-NOTES.md. Knowledge cutoffs differ and several predate final World Cup qualification (the legacy wing predates parts of qualifying entirely); the prompt supplies the fixture list, the rest is what the model knows — that asymmetry is part of what's being measured.
Caveats, honestly stated
- One run at temperature 0 samples one trajectory, not a model's full predictive distribution.
- Family correlation: models from the same vendor lineage can converge hard (the two Gemini entries agreed on 62 of 72 group scorelines). 40 entries ≠ 40 independent opinions.
- Simulated third-place ranking uses points → goal difference → goals scored, then alphabetical; FIFA's later criteria (conduct score, world ranking) aren't computable from predicted scores. Deep ties are rare and the rule is identical for every model.
- Knockout scorelines are scored on the 90-minute result (standard prediction-game convention); penalties/extra time are captured by the "advances" answer.
- Football is high-variance and bracket scoring is top-heavy by design — a lucky champion call moves the table. That's the game.
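The simplified third-place ranking mentioned in the caveats above can be sketched as a single sort. The function name and record fields are hypothetical, "alphabetical" is assumed to mean by team name, and the default of eight qualifiers out of twelve third-placed teams follows from the 495-combination lookup table.

```python
def rank_thirds(thirds, qualify=8):
    """Pick the best third-placed teams from predicted group tables.

    thirds: list of dicts with keys team, pts, gd, gf.
    Order: points, goal difference, goals scored, then team name.
    FIFA's later criteria (conduct score, world ranking) are not modelled.
    """
    ordered = sorted(
        thirds, key=lambda t: (-t["pts"], -t["gd"], -t["gf"], t["team"])
    )
    return [t["team"] for t in ordered[:qualify]]


# Placeholder teams; two are tied on every numeric criterion.
sample = [
    {"team": "Charlie", "pts": 4, "gd": 0, "gf": 3},
    {"team": "Alpha", "pts": 4, "gd": 0, "gf": 3},
    {"team": "Golf", "pts": 4, "gd": 1, "gf": 2},
    {"team": "Papa", "pts": 3, "gd": 2, "gf": 5},
]
print(rank_thirds(sample, qualify=3))
```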
Results entry
Real results are recorded after each match (90-minute score; advancing team for knockouts), committed publicly with full history; real knockout fixtures are added as reality produces them, which is when bracket components start paying out. Corrections happen by commit and are listed in the changelog.