PunditBench changelog

Material events affecting scoring, data, or methodology are recorded here (C13/E-transparency).

2026-06-29

New benchmark — a round-by-round ("live") track. By follower request, alongside the locked self-consistent bracket every model now also predicts the real knockout pairings of each round directly, as reality produces them. Scored like group matches (exact 3 / GD 2 / outcome 1, +1 correct advancer), kept separate from the headline leaderboard and surfaced as a "Round-by-round picks" section on each knockout match page. Stored in its own tree (data/predictions-live/, raw logs in data/raw-live/) so the pre-kickoff locked predictions are untouched. Each round is hashed and tagged before its first kickoff (predictions-<stage>-live); the golden rule applies per round. See METHODOLOGY.md and OPS.md.
Real Round-of-32 bracket materialised (data/fixtures/r32.json): group winners/runners-up resolved from the recorded results; the eight third-placed qualifiers (groups B, D, E, F, I, J, K, L) and their slots via FIFA Annexe C, cross-verified against the official published bracket. Locked-bracket advancement/matchup scoring begins paying out now.
First knockout result: match 73, South Africa 0–1 Canada (90' score; sources FIFA.com + ESPN; Canada advance).
Round-by-round R32 picks locked: 38 of 40 models (tag predictions-r32-live, sha256 7bbc8359…), collected before the round's earliest remaining kickoff. Match 73 is excluded and labelled "not pre-registered" (it had already kicked off). Two models could not be reached this round — Llama 3 70B ("no endpoints") and Claude Fable 5 ("not available") both 404 on OpenRouter; raw attempts kept in data/raw-live/. They retain their locked-bracket entries and can rejoin a later live round if their endpoints return.

2026-06-22

Site: "Prediction personality" added to each model page. Four style traits derived purely from the model's locked group-stage scorelines: goals per game, draw rate, a chalk-vs-contrarian index (mean agreement with the rest of the field on each match outcome), and favourite bias (how often it backs the underdog in matchups with a clear favourite). "Favourite" is defined endogenously — each team's mean predicted goal difference across the whole field is a "silicon power rating" (Brazil, Germany, Argentina, Spain top it; Curaçao and Haiti anchor it), with a ≥ 0.5-goal gap marking a clear favourite — so no external rankings enter the system. No effect on scoring, results, or the pre-registered predictions: these are read-only style metrics, derived at build from the existing locked data like everything else (new pure module lib/personality.ts, covered by tests/personality.test.ts). Accuracy still lives entirely on the leaderboard.

2026-06-12

Methodology page now embeds the verbatim group-stage prompt (generated from the v1 template and verified byte-identical to the published raw logs) plus the exact knockout-prompt deltas. Transparency only — no methodology change.
First result entered: match 1, Mexico 2–0 South Africa (90' score; sources ESPN + FOX Sports). All 40 models had backed Mexico; 26 hit the exact scoreline.
Hourly results auto-sync added (.github/workflows/results-sync.yml + scripts/sync-results.ts): ESPN's public scoreboard is polled hourly; finished group matches are entered automatically using the same canonical write as npm run result (verified byte-identical on match 1), behind strict team-name + kickoff matching. Knockout results stay manual (--advances/--note judgment). The sync never overwrites: a recorded result that disagrees with ESPN raises an alarm, and every run re-audits recorded scores. Scoring itself is unchanged — results.json remains the single derived-from input.
Site: "Today's matches" extended with a "Latest results" strip (finished matches from the last 48 h stay visible after the visitor's local midnight) and time-derived "In play" / "awaiting score" states between kickoff and the next sync deploy.

2026-06-11 (all before the opening kickoff, 19:00 UTC)

Methodology v2 — self-consistent bracket simulation. Knockout predictions are no longer collected round-by-round against real pairings; instead every model's own group predictions determine its own bracket, which it predicted through to its own champion. All collected pre-kickoff. Scoring extended with bracket components (advancement, matchup hits, matched-pairing scorelines) — see METHODOLOGY.md.
Third-place allocation: FIFA Annexe C lookup (495 combinations) parsed from the official regulations and machine-validated; used for all simulated brackets (data/third-allocation.json, ALLOCATION-NOTES.md).
Roster expanded 18 → 34 → 33: 16 models added (live-catalog-verified); OLMo-3 removed — listed in the catalog but no provider serves it (HTTP 404 on every attempt, logged in raw/group).
Validator relaxed (applies identically to all): entries for unlisted match numbers are now dropped with a logged warning instead of failing the response. Reason: Phi-4 Mini and LFM-2 predicted past the listed fixtures into the knockout bracket; rejecting that measures formatting, not football. All previously-passing models unaffected. Both models passed on rerun.
Group-stage hashes: 18-model set 9a7d3581…408fbc (tag predictions-group), expanded 33-model set 4afa1910…d3d83d (tag predictions-group-v2). Full-tournament hash tagged after simulation completes.
Final pre-kickoff roster cut (operator decision): 44 → 40. Four small models — Granite 4.1 8B, LFM-2 24B, Phi-4 Mini, Llama 3.2 1B — failed a fourth retry cycle with byte-identical errors and were removed from the ranked roster entirely rather than carried as zero/partial entries. Their raw attempts stay published in data/raw/ (the small-model capability floor remains documented in ROSTER-NOTES.md). Every ranked model now carries a complete tournament: 40 models, 19 vendors, 40/40 full brackets. Final hash tagged predictions-full-tournament-v4.
Pre-kickoff retry round (user-requested): the five models without complete brackets each got a fresh attempt. Hunyuan A13B succeeded on rerouted serving (OpenRouter provider variance) and now carries a full bracket — 40/44 complete; Granite 4.1 (partial through R16), LFM-2, Phi-4 Mini and Llama 3.2 1B reproduced their failures exactly and are final per the failure policy. Updated full-tournament hash tagged predictions-full-tournament-v3. Site copy corrected: failed simulations are stated as final, not "being collected".
Roster expanded 33 → 44 (still pre-kickoff): a legacy wing (GPT-3.5 Turbo, GPT-4, GPT-4o, Claude 3 Haiku, Llama 3 70B, Gemma 2 27B, Qwen 2.5 72B) and an oddball wing (WizardLM-2 8x22B, Hermes 3 405B, Hunyuan A13B, Llama 3.2 1B; Mercury 2 and LFM-2 retagged oddball). All candidates live-pinged before joining. Inflection dropped: both its endpoints return empty content (raw logs kept). Llama 3.2 1B failed the group-stage format in all attempts (stops near match 48) and stands as a disclosed zero entry. Final ranked roster: 44 models, 21 vendors. Updated full-tournament hash tagged predictions-full-tournament-v2.

2026-06-10

Project created. Methodology v1 fixed (see METHODOLOGY.md), scoring rules D1, prompt template v1.
Fixture dataset built from Wikipedia (two independent extraction passes) and cross-verified against ESPN, Sky Sports, FOX Sports and roadtrips.com official match numbering; all 72 group fixtures agreed across sources.
Roster of 18 models across 10 vendors verified against the live OpenRouter catalog (data/roster.json).