InsureBench

Methodology

A data-driven evidence file for the Wft-Basis benchmark: enough context for journalists, experts, and researchers without leaking the private test set.

Dataset card

Field	Value
Dataset name	InsureBench Wft-Basis v1.1
Question count	80
Question types	Multiple choice, statements, and short cases
Domain	Wft-Basis
Based on	CDFD learning objectives (eindtermen) 2025 through 2026
Question sources	Original questions, rewritten practice questions, and a validated practice set based on the CDFD learning objectives.
Legal reference date	01 Jan 2026
Last review	Not completed yet
Review status	External review in preparation
Reviewer role	External review in preparation
Question distribution	By CDFD task, test objective (toetsterm), question type, and difficulty.
Difficulty	Low, medium, and high.
Publication policy	Question bank is private; only aggregates, scores, and methodology are published.
Contamination policy	Private set with a held-out policy and canary questions; exact questions are not shared publicly.
Results licence	CC-BY-4.0

Learning objective distribution

This distribution shows which part of the private set maps to each CDFD task. Question texts, options, and correct answers remain private.

Domain	Questions	Weight
Unknown	80	100%

Question type distribution

Type	Questions	Weight
KB	80	100%

Difficulty distribution

Difficulty	Questions	Weight
medium	80	100%

Difference from the official CDFD exam

Source: CDFD Initieel examen Basis, valid from 01 Apr 2026. https://cdfd.nl/initieel-examen-basis/

Feature	Official CDFD Basis exam	InsureBench Wft-Basis v1.1
Duration	120 minutes	n/a (LLM)
Number of questions	42	80
Number of points	63	40 (converted)
Pass threshold	68%	68%
KB questions	21 questions, 21 points	See question type distribution
PG questions	2 questions, 4 points	See question type distribution
VC questions	19 questions, 38 points	See question type distribution

Model version policy

Providers may treat model names as aliases. InsureBench therefore reports the API name used, the gateway route, the test date, the prompt version, and the dataset version. For closed models, some uncertainty remains about internal provider changes.

Field	Why
Provider	Provenance
Gateway	Route
Exact API model name	Reproducibility
Provider release date	Context
Test date	Snapshot in time
Endpoint type	Direct or router
Model alias or snapshot	Important difference
Can provider change silently	Yes / no / unknown
Seed supported	Yes / no
Tools disabled	Yes
System prompt	Empty or stated exactly
Parser version	Verification
Prompt template hash
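
The fields above could be captured in one record per run, as in the sketch below. It is an illustration only, not the InsureBench implementation: the class name `RunRecord`, the field names, and the helper `template_hash` are hypothetical. The source does not say how the prompt template hash is computed, so a SHA-256 digest truncated to 12 hex characters is assumed here.

```python
import hashlib
from dataclasses import dataclass


def template_hash(template: str, length: int = 12) -> str:
    """Short, stable fingerprint of a prompt template.

    Assumption: truncated SHA-256 over the UTF-8 bytes of the template.
    """
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:length]


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run metadata, mirroring the table above."""
    provider: str
    gateway: str
    api_model_name: str        # exact name sent to the API
    test_date: str             # ISO date of the run
    endpoint_type: str         # "direct" or "router"
    is_snapshot: bool          # False when the name is a moving alias
    seed_supported: bool
    parser_version: str
    prompt_template_hash: str
    system_prompt: str = ""    # empty unless stated exactly
    tools_disabled: bool = True
```

Because the record is frozen, a run's metadata cannot be changed after it is logged, which fits the goal of reporting a snapshot in time.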

Model selection criterion

We test publicly available text models from major providers that are available through the chosen gateway on the test date.

External validation

How external domain experts can review this benchmark is described on the external validation page.

Run protocol

Field	Value
Parser version	answer-parser-v2
prompt_template_hash	5a732017979c
Benchmark version	1.1.0
Dataset version	InsureBench Wft-Basis v1.1
Retry policy	Gateway timeouts and transient provider errors are retried a bounded number of times; persistent errors are counted visibly.
Timeout policy	A request timeout may apply per model; the hard cap stays below the serverless limit.
Refusal policy	A refusal scores 0 points and is shown as a refusal.
Rate limit handling	Rate limits pause the run where needed; public API routes have their own rate limiting.
Logging policy	Only aggregated run status, error categories, and cost metadata are made public.
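
The retry policy in the table above can be sketched as a bounded retry loop. This is a minimal illustration under stated assumptions, not the actual pipeline: `TransientError`, `call_with_retries`, the limit of three retries, and the exponential backoff are all hypothetical, since the real limits are not published.

```python
import time
from typing import Callable, Optional, Tuple


class TransientError(Exception):
    """Hypothetical marker for gateway timeouts and temporary provider errors."""


def call_with_retries(
    request: Callable[[], str],
    max_retries: int = 3,          # assumed bound
    base_delay_s: float = 1.0,     # assumed backoff base
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], int]:
    """Return (response, retries_used); response is None when the job failed."""
    retries = 0
    while True:
        try:
            return request(), retries
        except TransientError:
            if retries >= max_retries:
                # Persistent failure: reported to the caller so it can be counted.
                return None, retries
            sleep(base_delay_s * 2 ** retries)  # wait longer after each failure
            retries += 1
```

Returning the retry count alongside the response lets the caller record both retries and failed jobs per run instead of hiding them.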

Error handling by run

Errors are not a weakness when measured well. We therefore publish aggregated error categories by run.

Model	No letter	Parse fail	Refusal	Timeout	Retry	Failed jobs
GPT-4.1 (OpenAI)	0	0	0	0	0	0
Claude Haiku 4.5 (Anthropic)	0	0	0	0	0	0
Claude Opus 4.7 (Anthropic)	0	0	0	0	0	0
Mistral: Mistral Nemo (Mistralai)	0	0	0	0	0	0
GPT-4o (OpenAI)
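
Per-model counts like those in the table above could be produced from a stream of logged events, as in the following sketch. The category labels, the function name `aggregate_errors`, and the event format (pairs of model name and category) are assumptions made for illustration; the source does not describe how the pipeline stores these events.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical labels matching the six columns of the table above.
CATEGORIES = ("no_letter", "parse_fail", "refusal", "timeout", "retry", "failed_job")


def aggregate_errors(
    models: List[str],
    events: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, int]]:
    """Count events per model and category; models without events get zeros."""
    counts = {model: Counter() for model in models}
    for model, category in events:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        counts.setdefault(model, Counter())[category] += 1
    # Emit every category for every model so each row has the same columns.
    return {m: {c: counter[c] for c in CATEGORIES} for m, counter in counts.items()}
```

Listing the models up front means a model with a clean run still gets an explicit row of zeros.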

Statistical method

The Wft score is the mean across attempts, converted to a 40-point scale with a 68% pass threshold. The standard deviation on model pages mainly measures pipeline stability at temperature 0 and top_p 1. Differences within a small margin are treated as ties until confidence intervals and p-values are added in a dedicated statistics release.
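
The score conversion can be illustrated with a short sketch. It assumes that every question carries equal weight and that each attempt is recorded as a count of correct answers out of the same number of questions; the function names are hypothetical and this is not the benchmark's actual scoring code. Exact fractions are used so that a result lying exactly on the threshold is not affected by floating-point rounding.

```python
from fractions import Fraction
from typing import List

SCALE_POINTS = 40                     # target scale, from the text above
PASS_THRESHOLD = Fraction(68, 100)    # 68% pass threshold, from the text above


def mean_fraction(correct_per_attempt: List[int], total_questions: int) -> Fraction:
    """Mean share of correct answers across attempts of equal length."""
    if not correct_per_attempt or total_questions <= 0:
        raise ValueError("need at least one attempt and a positive question count")
    return Fraction(sum(correct_per_attempt), len(correct_per_attempt) * total_questions)


def wft_score(correct_per_attempt: List[int], total_questions: int) -> float:
    """Mean across attempts, converted to the 40-point scale."""
    return float(mean_fraction(correct_per_attempt, total_questions) * SCALE_POINTS)


def passes(correct_per_attempt: List[int], total_questions: int) -> bool:
    """True when the mean share of correct answers is at least 68%."""
    return mean_fraction(correct_per_attempt, total_questions) >= PASS_THRESHOLD
```

For example, two attempts with 56 and 54 correct answers out of 80 average 68.75%, which converts to 27.5 points and clears the threshold of 27.2 points (68% of 40).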

Download results

The public CSV and JSON exports also include `includedInLeaderboard` per run. In this release the flag still maps one-to-one to “published and not archived”, so every exported run currently counts toward the public ranking.
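
A consumer of the JSON export could use the flag as sketched below. Only the field name `includedInLeaderboard` comes from the text above; the assumption that the export is a JSON array of run objects, the `runId` field in the usage example, and the function names are hypothetical.

```python
import json
from typing import Any, Dict, List


def included_in_leaderboard(published: bool, archived: bool) -> bool:
    """Current rule as described above: published and not archived."""
    return published and not archived


def leaderboard_runs(export_json: str) -> List[Dict[str, Any]]:
    """Keep only the runs whose includedInLeaderboard flag is true.

    Assumption: the export is a JSON array with one object per run.
    """
    runs = json.loads(export_json)
    return [run for run in runs if run.get("includedInLeaderboard") is True]


# Usage with a made-up export of two runs.
sample = '[{"runId": "a", "includedInLeaderboard": true}, {"runId": "b", "includedInLeaderboard": false}]'
selected = leaderboard_runs(sample)
```

Filtering on the flag rather than re-deriving it from publication status keeps downstream analyses correct if a later release changes how the flag is set.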

Available downloads: CSV, JSON, Methodology JSON, Data dictionary

How to cite

Diks, M. (2026). InsureBench v1.1.0 [Benchmark]. insurebench.nl

@misc{insurebench2026,
  author       = {Diks, Marc},
  title        = {InsureBench: AI Proficiency on the Dutch Wft-Basis Exam},
  year         = {2026},
  version      = {1.1.0},
  url          = {https://insurebench.nl},
  note         = {Benchmark with private Wft-Basis question set}
}