Each public round now follows the same editorial structure: what this round says, what changed, which outliers are explainable, and what still must not be concluded.

What this round says

This round makes the Wft-Basis leaderboard readable enough to cite: public field definitions, score groups, and fixed source pages now ship together as one release.

What changed since the previous round

Public run exports now include an explicit leaderboard flag (`includedInLeaderboard` / `included_in_leaderboard`).
The data dictionary now pins down the meaning, format, an example, and the interpretation of each public field, in Dutch and English.
The leaderboard now shows score groups, so models separated by small gaps are read as effectively tied.
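
To make the idea of score groups concrete, here is a minimal sketch. The grouping rule is not specified in this text, so the rule used here (a model joins the current group if its score is within a fixed tolerance of that group's top score), the `tolerance` value, and the model names are all assumptions for illustration only.

```python
from typing import Dict, List, Optional


def score_groups(scores: Dict[str, float], tolerance: float = 1.0) -> List[List[str]]:
    """Group models whose score lies within `tolerance` of the group's top score.

    Both the rule and the default tolerance are illustrative assumptions,
    not the leaderboard's documented method.
    """
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    groups: List[List[str]] = []
    group_top: Optional[float] = None
    for name, score in ranked:
        if group_top is None or group_top - score > tolerance:
            groups.append([name])  # gap too large: start a new group
            group_top = score
        else:
            groups[-1].append(name)  # close enough: same group
    return groups


# Hypothetical scores on the 40-point scale.
example = {"model-a": 34.0, "model-b": 33.5, "model-c": 31.0, "model-d": 30.5}
print(score_groups(example))
```

Under these assumptions, `model-a` and `model-b` land in one group and `model-c` and `model-d` in the next, even though the table would still list them as positions 1 to 4.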

Plausibly explainable outliers

Small ranking gaps can still appear as #1, #2, and #3 within the same score group; the group is therefore more meaningful than the exact position.
In this release every public export still counts toward the ranking because `includedInLeaderboard` currently maps directly to “published and not archived”.
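
A small sketch of how a consumer might read the flag and how the current mapping could be expressed. Only the two flag spellings and the "published and not archived" rule come from the text above; the record layout, the `model` key, the `published` / `archived` parameter names, and the handling of string values are assumptions.

```python
from typing import Any, Dict

# The two spellings named in the release notes.
FLAG_KEYS = ("includedInLeaderboard", "included_in_leaderboard")


def is_included(run: Dict[str, Any]) -> bool:
    """Read the leaderboard flag under whichever spelling the record uses."""
    for key in FLAG_KEYS:
        if key in run:
            value = run[key]
            # Assumption: a CSV export may carry the flag as text.
            if isinstance(value, str):
                return value.strip().lower() == "true"
            return bool(value)
    raise KeyError("run record has no leaderboard flag")


def derive_flag(published: bool, archived: bool) -> bool:
    """Current mapping as described: published and not archived."""
    return published and not archived


# Hypothetical records, one per spelling.
runs = [
    {"model": "model-a", "includedInLeaderboard": True},
    {"model": "model-b", "included_in_leaderboard": False},
]
ranked_runs = [run for run in runs if is_included(run)]
print([run["model"] for run in ranked_runs])
```

Filtering on the flag rather than re-deriving it keeps downstream analyses correct if the mapping changes in a later release.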

What you still must not conclude

This round does not prove that a model is Wft-compliant or safe for standalone insurance advice.
It also does not prove that tiny score gaps between models are statistically significant; establishing that requires a separate statistical release.

Quick datapoints from the current leaderboard

Mistral: Mistral Nemo ranks #1 with 22/40 (43/80 raw).
18 of 21 models clear the 68% CDFD pass threshold (86%).
Spread remains meaningful: Gemini 3.1 Pro Preview sits at 34/40, highlighting exam-set sensitivity across models.
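
The arithmetic behind these datapoints can be reproduced with a short sketch. It assumes the 40-point score is a linear rescale of the 80-question raw score with half-up rounding, which is consistent with 43/80 being shown as 22/40 but is not confirmed here; the function names are illustrative.

```python
import math


def rescale_raw(raw_correct: int, raw_total: int = 80, scaled_total: int = 40) -> int:
    """Rescale a raw score to the displayed scale (assumed: round half up)."""
    return math.floor(raw_correct * scaled_total / raw_total + 0.5)


def pass_rate_percent(passed: int, total: int) -> int:
    """Share of models passing, as a whole percentage (round half up)."""
    return math.floor(100 * passed / total + 0.5)


print(rescale_raw(43))            # 43/80 raw on the 40-point scale
print(pass_rate_percent(18, 21))  # 18 of 21 models as a percentage
```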

Data & citation

Download aggregated run data as CSV or JSON. Use the BibTeX entry below for attribution.


@online{insurebench_wft_basis_1_1_0,
  title        = {InsureBench: Wft-Basis AI Benchmark},
  author       = {InsureBench},
  year         = {2026},
  version      = {1.1.0},
  url          = {https://www.insurebench.nl/nl/wft-basis},
  urldate      = {2026-04-23},
  note         = {Public leaderboard, 80 questions, 3 runs per model}
}