InsureBench
Changelog
Each public benchmark round uses the same format: editorial interpretation plus required release metadata.
Fixed release checklist
This checklist should be completed before a benchmark round is made public.
- Dataset version, question count, and question-bank changes recorded.
- New and removed models checked and publicly listed.
- Prompt-template hash, parser status, and scoring changes recorded.
- Short editorial findings block written, covering what this round says, what changed, outliers, and what must not be concluded.
- Known limitations reconfirmed before the round goes public.
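As an illustration only, the checklist could be backed by a small completeness check over the release metadata. The sketch below is hypothetical: the snake_case field names are derived from the "Required release metadata" headings further down, and the `missing_metadata` helper and the dictionary shape are assumptions, not part of the InsureBench tooling.

```python
# Hypothetical field names, derived from the "Required release metadata" headings.
REQUIRED_FIELDS = (
    "dataset_version",
    "question_count",
    "question_bank_changes",
    "new_models",
    "removed_models",
    "prompt_template_hash",
    "parser_changes",
    "scoring_changes",
    "known_limitations",
)


def missing_metadata(release: dict) -> list[str]:
    """Return the required fields that are absent or blank in a release record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = release.get(field)
        # An empty list (e.g. no new models) counts as recorded; a blank string does not.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


draft = {
    "dataset_version": "wft-basis-v1",
    "question_count": 80,
    "question_bank_changes": "No substantive changes",
    "new_models": [],
    "removed_models": [],
    "prompt_template_hash": "",  # left blank on purpose for the example
    "parser_changes": "No parser change",
    "scoring_changes": "Score groups added",
    # "known_limitations" intentionally omitted
}

print(missing_metadata(draft))  # ['prompt_template_hash', 'known_limitations']
```

A round would only be ready for publication once the returned list is empty.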
v1.1.0 23 Apr 2026
This round makes the Wft-Basis leaderboard suitable for citation: public field definitions, score groups, and fixed source pages now ship together as one release.
What changed
- Public run exports now include an explicit leaderboard flag (`includedInLeaderboard` / `included_in_leaderboard`).
- The data dictionary now defines the meaning, format, an example, and the interpretation of each public field, in Dutch and English.
- The leaderboard now shows score groups, so that small gaps read as effective ties.
Plausibly explainable outliers
- Models separated by small score gaps can still appear as #1, #2, and #3 within the same score group; the group is therefore more meaningful than the exact position.
- In this release every public export still counts toward the ranking because `includedInLeaderboard` currently maps directly to “published and not archived”.
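A minimal sketch of the flag mapping described above, assuming a run record with `published` and `archived` booleans. Only the flag name comes from this changelog; the `RunExport` shape and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunExport:
    # Hypothetical record shape; the real export has more fields.
    model: str
    published: bool
    archived: bool


def included_in_leaderboard(run: RunExport) -> bool:
    """v1.1.0 mapping as stated in this changelog: published and not archived."""
    return run.published and not run.archived


runs = [
    RunExport("model-a", published=True, archived=False),
    RunExport("model-b", published=True, archived=True),
    RunExport("model-c", published=False, archived=False),
]

# Only runs with the flag set count toward the ranking.
ranked = [run.model for run in runs if included_in_leaderboard(run)]
print(ranked)  # ['model-a']
```

Keeping the flag as its own field means a later release could change the rule without consumers having to re-derive it from publication state.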
What still must not be concluded
- This round does not prove that a model is Wft-compliant or safe for standalone insurance advice.
- It also does not prove that tiny score gaps between models are statistically significant; that requires a separate statistical release.
Highlights
- Governance on the About page expanded with explicit conflicts-of-interest disclosure and traceability through public CSV/JSON.
- New module-scoped routing: /[locale]/wft-basis.
- Score display clarified: both the 40-point score and the raw score are shown.
- Public data export added as CSV and JSON.
- Public data dictionary and score groups added to improve release-by-release interpretability.
Required release metadata
- Dataset version: wft-basis-v1
- Question count: 80 private Wft-Basis practice questions
- Question-bank changes: No substantive question-bank changes in this release; the focus was on public definitions, data explanation, and interpretation guardrails.
- New models: None
- Removed models: None
- Prompt-template hash: Unchanged in this release; publicly visible on methodology and model detail pages.
- Parser changes: No parser change in this release; observability remains publicly traceable through methodology and model detail pages.
- Scoring changes: No new scoring formula; this release does add a public uncertainty layer via score groups for gaps below 1 point.
- Known limitations: External question-bank validation, a human baseline, and formal significance testing are not yet publicly complete.
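One way the score groups could be computed is sketched below. The changelog only states that groups cover gaps below 1 point; measuring each gap against the top score of the current group (rather than against the neighbouring model) is an assumption made here, and the function name, model names, and scores are invented for illustration.

```python
def score_groups(scores: dict[str, float], max_gap: float = 1.0) -> list[list[str]]:
    """Group models whose score is less than max_gap below the group's top score.

    Assumption: the gap is measured against the best score in the group,
    which prevents a long chain of small gaps from merging distant scores.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups: list[list[str]] = []
    top = None
    for model, score in ordered:
        if top is None or top - score >= max_gap:
            # Gap of max_gap or more: start a new group led by this model.
            groups.append([model])
            top = score
        else:
            groups[-1].append(model)
    return groups


# Illustrative 40-point scores, not real results.
example = {
    "model-a": 36.5,
    "model-b": 36.0,
    "model-c": 35.75,
    "model-d": 34.5,
    "model-e": 33.0,
}

print(score_groups(example))
# [['model-a', 'model-b', 'model-c'], ['model-d'], ['model-e']]
```

In this example the first three models occupy positions #1 to #3 but share one group, which is the situation the outlier note above describes.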