PunditBench changelog
Material events affecting scoring, data, or methodology are recorded here (C13/E-transparency).
2026-06-11 (all before the opening kickoff, 19:00 UTC)
- Methodology v2 — self-consistent bracket simulation. Knockout predictions are no longer collected round-by-round against real pairings; instead every model's own group predictions determine its own bracket, which it predicted through to its own champion. All collected pre-kickoff. Scoring extended with bracket components (advancement, matchup hits, matched-pairing scorelines) — see METHODOLOGY.md.
- Third-place allocation: FIFA Annexe C lookup (495 combinations) parsed from the official regulations and machine-validated; used for all simulated brackets (
data/third-allocation.json, ALLOCATION-NOTES.md). - Roster expanded 18 → 34 → 33: 16 models added (live-catalog-verified); OLMo-3 removed — listed in the catalog but no provider serves it (HTTP 404 on every attempt, logged in raw/group).
- Validator relaxed (applies identically to all): entries for unlisted match numbers are now dropped with a logged warning instead of failing the response. Reason: Phi-4 Mini and LFM-2 predicted past the listed fixtures into the knockout bracket; rejecting that measures formatting, not football. All previously-passing models unaffected. Both models passed on rerun.
- Group-stage hashes: 18-model set
9a7d3581…408fbc(tagpredictions-group), expanded 33-model set4afa1910…d3d83d(tagpredictions-group-v2). Full-tournament hash tagged after simulation completes. - Final pre-kickoff roster cut (operator decision): 44 → 40. Four small models — Granite 4.1 8B, LFM-2 24B, Phi-4 Mini, Llama 3.2 1B — failed a fourth retry cycle with byte-identical errors and were removed from the ranked roster entirely rather than carried as zero/partial entries. Their raw attempts stay published in
data/raw/(the small-model capability floor remains documented in ROSTER-NOTES.md). Every ranked model now carries a complete tournament: 40 models, 19 vendors, 40/40 full brackets. Final hash taggedpredictions-full-tournament-v4. - Pre-kickoff retry round (user-requested): the five models without complete brackets each got a fresh attempt. Hunyuan A13B succeeded on rerouted serving (OpenRouter provider variance) and now carries a full bracket — 40/44 complete; Granite 4.1 (partial through R16), LFM-2, Phi-4 Mini and Llama 3.2 1B reproduced their failures exactly and are final per the failure policy. Updated full-tournament hash tagged
predictions-full-tournament-v3. Site copy corrected: failed simulations are stated as final, not "being collected". - Roster expanded 33 → 44 (still pre-kickoff): a legacy wing (GPT-3.5 Turbo, GPT-4, GPT-4o, Claude 3 Haiku, Llama 3 70B, Gemma 2 27B, Qwen 2.5 72B) and an oddball wing (WizardLM-2 8x22B, Hermes 3 405B, Hunyuan A13B, Llama 3.2 1B; Mercury 2 and LFM-2 retagged oddball). All candidates live-pinged before joining. Inflection dropped: both its endpoints return empty content (raw logs kept). Llama 3.2 1B failed the group-stage format in all attempts (stops near match 48) and stands as a disclosed zero entry. Final ranked roster: 44 models, 21 vendors. Updated full-tournament hash tagged
predictions-full-tournament-v2.
2026-06-10
- Project created. Methodology v1 fixed (see METHODOLOGY.md), scoring rules D1, prompt template v1.
- Fixture dataset built from Wikipedia (two independent extraction passes) and cross-verified against ESPN, Sky Sports, FOX Sports and roadtrips.com official match numbering; all 72 group fixtures agreed across sources.
- Roster of 18 models across 10 vendors verified against the live OpenRouter catalog (
data/roster.json).