InsureBench

Benchmark version 1.1.0 · Last update 23 Apr 2026

Results under CC-BY-4.0. Question bank private.

About the author: marcdiks.nl

Which AI models understand Dutch Wft-Basis knowledge?

InsureBench tests AI models on a private Wft-Basis practice-question set. Phase 2 expands this to open insurance-advice cases.

Key finding

Mistral: Mistral Nemo ranks #1 with 21/40 on the combined Wft and prompt score.

Main limitation

The score measures Wft-Basis knowledge, not whether a model is suitable as a standalone AI adviser.

Supported now

Model X ranks highest on the InsureBench Wft-Basis knowledge benchmark.

Not supported yet

Model X gives the best insurance advice in private simple-risk cases.

Only testable later

Whether AI models are reliable enough for standalone insurance advice requires phase 2-3.

Wft questions: 80
Point scale: 40
Threshold: 68%
Advice domains: 6
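
The figures above can be related to each other with a little arithmetic. The following is a minimal sketch, assuming the 80 raw questions map linearly onto the 40-point scale and that the 68% threshold is applied to that scale. Both assumptions are consistent with the leaderboard row shown on this page (66/80 raw, 33 / 40, Pass +7.8 at a score of 35), but the function names are hypothetical and this is not the benchmark's published scoring code.

```python
RAW_QUESTIONS = 80   # "Wft questions: 80"
SCALE_POINTS = 40    # "Point scale: 40"
THRESHOLD_PCT = 68   # "Threshold: 68%"


def to_scale(raw_correct: int) -> float:
    """Map a raw count of correct answers onto the 40-point scale (assumed linear)."""
    return raw_correct * SCALE_POINTS / RAW_QUESTIONS


def threshold_points() -> float:
    """Express the percentage threshold in points on the 40-point scale."""
    return SCALE_POINTS * THRESHOLD_PCT / 100


def margin(score: float) -> float:
    """Distance between a score and the threshold, rounded to one decimal."""
    return round(score - threshold_points(), 1)


print(to_scale(66))        # raw 66/80 on the 40-point scale
print(threshold_points())  # threshold in points
print(margin(35))          # margin for a headline score of 35
```

Under these assumptions a model needs more than 27 points out of 40 to pass, which is why a score of 35 is reported with a positive margin.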

Leaderboard

Score on a 40-point scale (Wft-Basis equivalent).

A model may appear in multiple rows, one per benchmark type: WFT measures knowledge (multiple choice), Prompt measures advice skills, and Combined measures both.

Scores in the same group differ by less than 1 point and should be read as effectively neck-and-neck.
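
One way to form such groups is sketched below. The page does not state the exact grouping rule, so this sketch assumes that each group is anchored on its highest score and that a model joins the group while it sits less than 1 point below that anchor. The model names and scores in the example are hypothetical.

```python
from string import ascii_uppercase


def group_label(index: int) -> str:
    """Letter label for a group index; falls back to 'G<n>' beyond 26 groups."""
    return ascii_uppercase[index] if index < 26 else f"G{index + 1}"


def assign_groups(scores: dict[str, float], width: float = 1.0) -> dict[str, str]:
    """Assign each model to a score group (assumed rule: anchored on the group's top score)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups: dict[str, str] = {}
    group_index = -1
    group_top = None
    for model, score in ranked:
        # Start a new group when the gap to the current anchor reaches `width`.
        if group_top is None or group_top - score >= width:
            group_index += 1
            group_top = score
        groups[model] = group_label(group_index)
    return groups


# Hypothetical scores, for illustration only.
example = {"model-a": 35.0, "model-b": 34.5, "model-c": 34.0, "model-d": 33.2}
print(assign_groups(example))
```

With an anchored rule, every score in a group is guaranteed to be within 1 point of every other score in that group. A chained rule (comparing each model to its immediate neighbour) would not give that guarantee.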

Leaderboard: AI models ranked by Wft-Basis score (40-point scale)
#	Model	Provider	Open source	Score (40)	WFT	Prompt	Price / M tokens	Result	Last tested
1	Anthropic: Claude Opus 4.8 (Fast)	Anthropic	—	35 / 40 (66/80 raw, Group A)	33 / 40	36 / 40	€9.20 in / €46.00 out	Pass +7.8	01 Jun 2026

Release notes

Each public round now follows the same editorial structure: what this round says, what changed, which outliers are explainable, and what still must not be concluded.

What this round says

This round makes the Wft-Basis leaderboard easier to read and to cite: public field definitions, score groups, and fixed source pages now ship as one release.

What changed since the previous round

Public run exports now include an explicit leaderboard flag (`includedInLeaderboard` / `included_in_leaderboard`).
The data dictionary now fixes meaning, format, example, and interpretation for each public field in Dutch and English.
The leaderboard now shows score groups so small gaps are read as effectively neck-and-neck.
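
A consumer of the public exports could use the leaderboard flag roughly as follows. This is a sketch under stated assumptions: the export layout (a list of records), the `model` field, and the sample values are hypothetical. Only the two flag spellings come from the release notes, which do not say which spelling belongs to which export format. Treating a missing flag as "not included" is also an assumption.

```python
import json

# Both spellings named in the release notes.
FLAG_KEYS = ("includedInLeaderboard", "included_in_leaderboard")


def is_included(record: dict) -> bool:
    """Return True if the record carries a truthy leaderboard flag."""
    for key in FLAG_KEYS:
        if key in record:
            value = record[key]
            # CSV exports would deliver strings, so "false" must not count as true.
            if isinstance(value, str):
                return value.strip().lower() in {"true", "1", "yes"}
            return bool(value)
    return False  # assumption: no flag means not on the leaderboard


# Hypothetical export excerpt, for illustration only.
sample = json.loads("""
[
  {"model": "model-a", "includedInLeaderboard": true},
  {"model": "model-b", "includedInLeaderboard": false},
  {"model": "model-c", "included_in_leaderboard": "true"},
  {"model": "model-d"}
]
""")

included = [record["model"] for record in sample if is_included(record)]
print(included)
```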

Plausibly explainable outliers

Small ranking gaps can still appear as #1, #2, and #3 within the same score group; the group is therefore more meaningful than the exact position.
In this release every public export still counts toward the ranking because `includedInLeaderboard` currently maps directly to “published and not archived”.

What you still must not conclude

This round does not prove that a model is Wft-compliant or safe for standalone insurance advice.
It also does not prove that small score gaps between models are statistically significant; that requires a separate statistical release.

Quick datapoints from the current leaderboard

Mistral: Mistral Nemo ranks #1 with 21/40 on the combined Wft and prompt score.
17 of 21 models clear the combined 68% threshold (81%).
Combined runs give exam knowledge and reviewed advice quality equal weight in the headline score.
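
The equal weighting and the pass-rate percentage can be reproduced with a short calculation. This sketch assumes the combined score is the plain mean of the Wft and Prompt scores, rounded half up to a whole point. That rounding rule is an assumption, chosen because it reproduces the published 35 / 40 from Wft 33 and Prompt 36; Python's built-in `round` would give 34 for the same inputs. The function names are hypothetical.

```python
import math


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 always rounding up (assumed rule)."""
    return math.floor(value + 0.5)


def combined_score(wft: float, prompt: float) -> int:
    """Equal-weight mean of the Wft and Prompt scores on the 40-point scale."""
    return round_half_up((wft + prompt) / 2)


def pass_rate_pct(passed: int, total: int) -> int:
    """Share of models that clear the threshold, as a whole percentage."""
    return round_half_up(100 * passed / total)


print(combined_score(33, 36))  # Wft 33 and Prompt 36 from the leaderboard row
print(pass_rate_pct(17, 21))   # 17 of 21 models clear the threshold
```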

Data & citation

Download aggregated run data as CSV or JSON. Use the BibTeX entry below for attribution.

@online{insurebench_wft_basis_1_1_0,
  title        = {InsureBench: Wft-Basis AI Benchmark},
  author       = {InsureBench},
  year         = {2026},
  version      = {1.1.0},
  url          = {https://www.insurebench.nl/nl/wft-basis},
  urldate      = {2026-04-23},
  note         = {Public leaderboard, 80 questions, 3 runs per model}
}