InsureBench

Changelog

Each public benchmark round uses the same format: editorial interpretation plus required release metadata.

Fixed release check

This checklist should be completed before a benchmark round is made public.

Dataset version, question count, and question-bank changes recorded.
New and removed models checked and publicly listed.
Prompt-template hash, parser status, and scoring changes recorded.
A short editorial findings block written, covering what this round says, what changed, outliers, and conclusions that must not be drawn.
Known limitations reconfirmed before the round goes public.

v1.1.0 23 Apr 2026

This round makes the Wft-Basis leaderboard readable enough to cite: public field definitions, score groups, and fixed source pages now ship as one release.

What changed

Public run exports now include an explicit leaderboard flag (`includedInLeaderboard` / `included_in_leaderboard`).
The data dictionary now defines the meaning, format, an example, and the interpretation of each public field, in Dutch and English.
The leaderboard now shows score groups, so that small gaps are read as effectively tied.
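As a rough illustration of how score groups could work, the sketch below puts a model in the same group as long as its score is less than 1 point below the top score of that group. The 1-point threshold comes from the scoring note in this release. The grouping rule itself (distance to the group leader rather than to the nearest neighbour), the function name, and the model names and scores are assumptions made for the example, not the published method or real results.

```python
from typing import List, Optional, Tuple

# Threshold taken from the release note "score groups for gaps below 1 point".
GROUP_GAP = 1.0


def score_groups(scores: List[Tuple[str, float]], gap: float = GROUP_GAP) -> List[List[str]]:
    """Group models whose score is less than `gap` below the group's top score.

    Assumption: gaps are measured against the best score in the group,
    so a long chain of small gaps cannot stretch one group indefinitely.
    """
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    groups: List[List[str]] = []
    leader_score: Optional[float] = None
    for model, score in ranked:
        if leader_score is None or leader_score - score >= gap:
            # Gap is a full point or more: start a new group.
            groups.append([model])
            leader_score = score
        else:
            groups[-1].append(model)
    return groups


# Hypothetical scores on the 40-point scale, for illustration only.
example = [("model-a", 36.5), ("model-b", 36.0), ("model-c", 35.8), ("model-d", 34.0)]
print(score_groups(example))
```

With these made-up scores the first three models land in one group and the fourth in a group of its own, which is the reading the leaderboard is meant to encourage: the group matters more than the position inside it.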

Plausibly explainable outliers

Small ranking gaps can still appear as #1, #2, and #3 within the same score group; the group is therefore more meaningful than the exact position.
In this release every public export still counts toward the ranking because `includedInLeaderboard` currently maps directly to “published and not archived”.
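For readers who process the public exports, a minimal sketch of filtering on the flag rather than re-deriving "published and not archived" themselves is shown below. It accepts both spellings of the field. The `model` key, the accepted truthy strings, and the helper names are assumptions for the example; the data dictionary remains the reference for the actual field formats.

```python
from typing import Any, Dict, Iterable, List

# Assumption: text exports may encode the flag as a string such as "true".
TRUE_STRINGS = {"true", "1", "yes"}


def read_flag(row: Dict[str, Any]) -> bool:
    """Read the leaderboard flag under either public spelling."""
    for key in ("includedInLeaderboard", "included_in_leaderboard"):
        if key in row:
            value = row[key]
            if isinstance(value, str):
                return value.strip().lower() in TRUE_STRINGS
            return bool(value)
    raise KeyError("row has no leaderboard flag")


def leaderboard_rows(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the runs that count toward the ranking."""
    return [row for row in rows if read_flag(row)]


# Hypothetical rows; "model" is an illustrative field name.
rows = [
    {"model": "model-a", "includedInLeaderboard": True},
    {"model": "model-b", "included_in_leaderboard": "false"},
]
print([row["model"] for row in leaderboard_rows(rows)])
```

Relying on the flag keeps downstream analyses correct if a later release changes what the flag maps to.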

What still must not be concluded

This round does not prove that a model is Wft-compliant or safe for standalone insurance advice.
It also does not prove that tiny score gaps between models are statistically significant; establishing that requires a separate statistical release.

Highlights

Governance on the About page expanded with explicit conflicts-of-interest disclosure and traceability through public CSV/JSON.
New module-scoped routing: /[locale]/wft-basis.
Score display clarified: both the 40-point score and the raw score are shown.
Public data export added as CSV and JSON.
Public data dictionary and score groups added to improve release-by-release interpretability.

Required release metadata

Dataset version: wft-basis-v1
Question count: 80 private Wft-Basis practice questions
Question-bank changes: No substantive question-bank changes in this release; the focus was on public definitions, data explanation, and interpretation guardrails.
New models: None
Removed models: None
Prompt-template hash: Unchanged in this release; publicly visible on methodology and model detail pages.
Parser changes: No parser change in this release; observability remains publicly traceable through methodology and model detail pages.
Scoring changes: No new scoring formula; this release does add a public uncertainty layer via score groups for gaps below 1 point.
Known limitations: External question-bank validation, a human baseline, and formal significance testing are not yet publicly complete.
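The required metadata above lends itself to a mechanical completeness check before publication. The sketch below is one possible way to do that, assuming the metadata is kept as a simple key/value record. The snake_case key names and the `missing_fields` helper are hypothetical; the values are shortened from this release's entries.

```python
from typing import Any, Dict, List

# Hypothetical key names mirroring the required release metadata entries.
REQUIRED_FIELDS = [
    "dataset_version",
    "question_count",
    "question_bank_changes",
    "new_models",
    "removed_models",
    "prompt_template_hash",
    "parser_changes",
    "scoring_changes",
    "known_limitations",
]


def missing_fields(metadata: Dict[str, Any]) -> List[str]:
    """Return required fields that are absent or blank.

    An explicit "None" string counts as filled in: it records that the
    question was checked and the answer is "nothing changed".
    """
    missing = []
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Values shortened from the v1.1.0 entries above.
v1_1_0 = {
    "dataset_version": "wft-basis-v1",
    "question_count": 80,
    "question_bank_changes": "No substantive changes",
    "new_models": "None",
    "removed_models": "None",
    "prompt_template_hash": "Unchanged in this release",
    "parser_changes": "No parser change",
    "scoring_changes": "No new formula; score groups added",
    "known_limitations": "External validation, human baseline, significance testing pending",
}
print(missing_fields(v1_1_0))
```

An empty result means every required entry has been filled in; it says nothing about whether the entries are accurate, which still needs the editorial check.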
