InsureBench
Methodology
A data-driven evidence file for the Wft-Basis benchmark: enough context for journalists, experts, and researchers without leaking the private test set.
Dataset card
| Field | Value |
|---|---|
| Dataset name | InsureBench Wft-Basis v1.1 |
| Question count | 80 |
| Question types | Multiple choice, statements, and short cases |
| Domain | Wft-Basis |
| Based on | CDFD learning outcomes (eindtermen), 2025 through 2026 |
| Question sources | Original questions, rewritten practice questions, and a validated practice set based on the CDFD learning outcomes. |
| Legal reference date | 01 Jan 2026 |
| Last review | Not completed yet |
| Review status | External review in preparation |
| Reviewer role | External review in preparation |
| Question distribution | By CDFD task, assessment criterion (toetsterm), question type, and difficulty. |
| Difficulty | Low, medium, and high. |
| Publication policy | Question bank is private; only aggregates, scores, and methodology are published. |
| Contamination policy | Private set with a held-out policy and canary questions; exact questions are not shared publicly. |
| Results licence | CC-BY-4.0 |
Learning objective distribution
This distribution shows which part of the private set maps to each CDFD task. Question texts, options, and correct answers remain private.
| Domain | Questions | Weight |
|---|---|---|
| Unknown | 80 | 100% |
Question type distribution
| Type | Questions | Weight |
|---|---|---|
| KB | 80 | 100% |
Difficulty distribution
| Difficulty | Questions | Weight |
|---|---|---|
| Medium | 80 | 100% |
Difference from the official CDFD exam
Source: CDFD Initieel examen Basis, valid from 01 Apr 2026. https://cdfd.nl/initieel-examen-basis/
| Feature | Official CDFD Basis exam | InsureBench Wft-Basis v1.1 |
|---|---|---|
| Duration | 120 minutes | n/a (LLM) |
| Number of questions | 42 | 80 |
| Number of points | 63 | 40 (converted) |
| Pass threshold | 68% | 68% |
| KB questions | 21 questions, 21 points | See question type distribution |
| PG questions | 2 questions, 4 points | See question type distribution |
| VC questions | 19 questions, 38 points | See question type distribution |
Model version policy
Model names can act as aliases at providers. InsureBench therefore reports the API name used, the gateway route, the test date, the prompt version, and the dataset version. For closed models, some residual uncertainty remains about internal provider changes.
| Field | Why |
|---|---|
| Provider | Origin |
| Gateway | Route |
| Exact API model name | Reproducibility |
| Provider release date | Context |
| Test date | Point-in-time snapshot |
| Endpoint type | Direct or router |
| Model alias or snapshot | Important distinction |
| Provider can change silently | Yes / no / unknown |
| Seed supported | Yes / no |
| Tools disabled | Yes |
| System prompt | Empty or stated exactly |
| Parser version | Verification |
| Prompt template hash |
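The fields in this table could be captured per run in a record like the one below. This is a minimal sketch, not InsureBench's actual schema: the class name `RunMetadata`, every attribute name, the helper `reproducibility_warnings`, and all example values except the parser version and prompt template hash (taken from the run protocol on this page) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical per-run record mirroring the model version policy fields."""
    provider: str
    gateway: str
    api_model_name: str
    provider_release_date: Optional[str]  # None when the provider publishes no date
    test_date: str
    endpoint_type: str                    # "direct" or "router"
    is_snapshot: bool                     # False means the name is an alias
    provider_can_change_silently: Optional[bool]  # None means unknown
    seed_supported: bool
    tools_disabled: bool
    system_prompt: str                    # empty string or the exact prompt
    parser_version: str
    prompt_template_hash: str


def reproducibility_warnings(meta: RunMetadata) -> List[str]:
    """Collect reasons why a run may be hard to reproduce exactly."""
    warnings = []
    if not meta.is_snapshot:
        warnings.append("model name is an alias, not a pinned snapshot")
    if meta.provider_can_change_silently is not False:
        warnings.append("provider may change the model silently (yes or unknown)")
    if not meta.tools_disabled:
        warnings.append("tools were enabled during the run")
    return warnings


# Placeholder values for illustration only.
example = RunMetadata(
    provider="example-provider",
    gateway="example-gateway",
    api_model_name="example-model-latest",
    provider_release_date=None,
    test_date="2026-01-01",
    endpoint_type="router",
    is_snapshot=False,
    provider_can_change_silently=None,
    seed_supported=False,
    tools_disabled=True,
    system_prompt="",
    parser_version="answer-parser-v2",
    prompt_template_hash="5a732017979c",
)
example_warnings = reproducibility_warnings(example)
```

Treating "unknown" the same as "yes" for silent provider changes is a conservative choice in this sketch; the policy above only states that the value is recorded.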
Model selection criterion
We test publicly available text models from major providers that are available through the chosen gateway on the test date.
External validation
How external domain experts can review this benchmark is described on the external validation page.
Run protocol
| Field | Value |
|---|---|
| Parser version | answer-parser-v2 |
| prompt_template_hash | 5a732017979c |
| Benchmark version | 1.1.0 |
| Dataset version | InsureBench Wft-Basis v1.1 |
| Retry policy | Gateway timeouts and transient provider errors are retried a bounded number of times; persistent errors are counted visibly. |
| Timeout policy | A request timeout may apply per model; the hard cap stays below the serverless limit. |
| Refusal policy | A refusal scores 0 points and is shown as a refusal. |
| Rate limit handling | Rate limits pause the run where needed; public API routes have their own rate limiting. |
| Logging policy | Only aggregated run status, error categories, and cost metadata are made public. |
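The retry and refusal policies can be illustrated with a short sketch. It assumes a retry limit of 2, one point per question, and a hypothetical `TransientError` exception standing in for gateway timeouts and temporary provider errors; none of these names or values come from the InsureBench pipeline.

```python
import time
from typing import Callable, Optional, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for a gateway timeout or temporary provider error."""


def call_with_bounded_retries(
    request: Callable[[], str],
    max_retries: int = 2,          # assumed limit, not the real setting
    backoff_seconds: float = 0.0,  # assumed linear backoff step
) -> Tuple[Optional[str], int, bool]:
    """Return (answer or None, retries used, failed flag)."""
    retries = 0
    while True:
        try:
            return request(), retries, False
        except TransientError:
            if retries >= max_retries:
                # Persistent error: report it instead of hiding it.
                return None, retries, True
            retries += 1
            time.sleep(backoff_seconds * retries)


def points_for(answer: Optional[str], correct: str, refused: bool) -> int:
    """A refusal or a failed job scores 0; otherwise 1 point per correct answer."""
    if refused or answer is None:
        return 0
    return int(answer == correct)


# Simulated request that fails twice and then answers "B".
calls = {"n": 0}

def flaky_request() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientError("simulated gateway timeout")
    return "B"

def always_failing() -> str:
    raise TransientError("simulated provider outage")

ok_result = call_with_bounded_retries(flaky_request)
failed_result = call_with_bounded_retries(always_failing)
```

Returning the retry count and the failed flag alongside the answer is what would let a run fill the Retry and Failed jobs columns of an error table.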
Error handling by run
Errors are not a weakness when measured well. We therefore publish aggregated error categories by run.
| Model | No letter | Parse fail | Refusal | Timeout | Retry | Failed jobs |
|---|---|---|---|---|---|---|
| GPT-4.1 (OpenAI) | 0 | 0 | 0 | 0 | 0 | 0 |
| Claude Haiku 4.5 (Anthropic) | 0 | 0 | 0 | 0 | 0 | 0 |
| Claude Opus 4.7 (Anthropic) | 0 | 0 | 0 | 0 | 0 | 0 |
| Mistral: Mistral Nemo (Mistralai) | 0 | 0 | 0 | 0 | 0 | 0 |
| GPT-4o (OpenAI) |
Statistical method
The Wft score is the mean across attempts, converted to a 40-point scale with a 68% pass threshold. The standard deviation on model pages mainly measures pipeline stability at temperature 0 and top_p 1. Differences within a small margin are treated as ties until confidence intervals and p-values are added in a dedicated statistics release.
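The score calculation can be sketched as follows. The sketch assumes that each attempt is summarised as the fraction of the 80 questions answered correctly, that the pass threshold is inclusive, and that the reported spread is a sample standard deviation; the function names and the example attempt counts are hypothetical.

```python
from statistics import mean, stdev
from typing import List

SCALE_POINTS = 40      # target scale stated in the methodology
PASS_FRACTION = 0.68   # pass threshold stated in the methodology


def wft_score(attempt_fractions: List[float]) -> float:
    """Mean fraction correct across attempts, converted to the 40-point scale."""
    return mean(attempt_fractions) * SCALE_POINTS


def passes(score: float) -> bool:
    """Assumes the threshold is inclusive (score >= 68% of 40 points)."""
    return score >= PASS_FRACTION * SCALE_POINTS


def attempt_spread(attempt_fractions: List[float]) -> float:
    """Sample standard deviation of the per-attempt scores on the 40-point scale."""
    return stdev(f * SCALE_POINTS for f in attempt_fractions)


# Hypothetical example: three attempts with 56, 54, and 55 correct out of 80.
attempts = [56 / 80, 54 / 80, 55 / 80]
score = wft_score(attempts)
passed = passes(score)
spread = attempt_spread(attempts)
```

With these example numbers the mean fraction is 55/80, which converts to 27.5 points and clears the 27.2-point threshold (68% of 40).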
Download results
The public CSV and JSON exports also include `includedInLeaderboard` per run. In this release the flag still maps one-to-one to “published and not archived”, so every exported run currently counts toward the public ranking.
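A consumer of the JSON export might select leaderboard runs like this. Only the `includedInLeaderboard` field is taken from the text above; the top-level list layout, the `model` and `score` keys, and the sample values are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def leaderboard_runs(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only runs whose includedInLeaderboard flag is explicitly true."""
    return [run for run in runs if run.get("includedInLeaderboard") is True]


# Hypothetical export content; real exports may be structured differently.
sample_export = json.loads("""
[
  {"model": "model-a", "score": 30.0, "includedInLeaderboard": true},
  {"model": "model-b", "score": 28.5, "includedInLeaderboard": false},
  {"model": "model-c", "score": 27.0}
]
""")

ranked = sorted(leaderboard_runs(sample_export), key=lambda r: r["score"], reverse=True)
ranked_models = [run["model"] for run in ranked]
```

Filtering on the flag rather than assuming every exported run counts keeps downstream code correct if a later release stops mapping the flag one-to-one to "published and not archived".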
How to cite
Diks, M. (2026). InsureBench v1.1.0 [Benchmark]. insurebench.nl
@misc{insurebench2026,
author = {Diks, Marc},
title = {InsureBench: AI Proficiency on the Dutch Wft-Basis Exam},
year = {2026},
version = {1.1.0},
url = {https://insurebench.nl},
note = {Benchmark with private Wft-Basis question set}
}