InsureBench tests AI models on a private Wft-Basis practice-question set. Phase 2 expands this to open insurance-advice cases.
Anthropic: Claude Opus 4.8 (Fast) ranks #1 with 36/40 on the prompt review score.
The score measures Wft-Basis knowledge, not whether a model is suitable as a standalone AI adviser.
Model X ranks highest on the InsureBench Wft-Basis knowledge benchmark.
Model X gives the best insurance advice in private simple-risk cases.
Whether AI models are reliable enough for standalone insurance advice requires phase 2-3.
Score on a 40-point scale (Wft-Basis equivalent).
Scores in the same group differ by less than 1 point and should be read as effectively neck-and-neck.
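The grouping rule can be sketched as follows. This is a minimal illustration, not InsureBench's published method: the function name `assign_groups`, the `tolerance` parameter, and the choice to measure the gap from each group's top score are all assumptions. The published groups are presumably computed on unrounded scores, which this page does not show.

```python
from string import ascii_uppercase


def assign_groups(scores, tolerance=1.0):
    """Label scores with group letters, highest score first.

    Assumed rule: a score stays in the current group while it is less than
    `tolerance` points below that group's top score; otherwise a new group
    starts. Supports at most 26 groups (A to Z).
    """
    labelled = []
    group_top = None
    index = -1
    for score in sorted(scores, reverse=True):
        if group_top is None or group_top - score >= tolerance:
            index += 1          # open the next group
            group_top = score   # this score becomes the group's reference
        labelled.append((score, ascii_uppercase[index]))
    return labelled


# Rounded scores as displayed on the leaderboard
displayed = [31, 30, 28, 27, 26, 26, 26, 25, 19, 17]
for score, group in assign_groups(displayed):
    print(f"{score} / 40 -> Group {group}")
```

Applied to the rounded leaderboard scores, this sketch yields the same letters as the published groups (A through H, with the three 26-point models sharing Group E), but that agreement does not confirm the rule itself.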
| # | Model | Provider | Open source | Score (40) | WFT | Prompt | Price / M tokens | Result | Last tested |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | OpenAI | — | 31 / 40 (78%), review, Group A | 29 / 40 | 31 / 40 | €2.30 in / €9.20 out | Pass +3.8 | 29 Apr 2026 |
| 2 | Claude Opus 4.7 | Anthropic | — | 30 / 40 (75%), review, Group B | 34 / 40 | 30 / 40 | €4.60 in / €23.00 out | Pass +2.8 | 28 Apr 2026 |
| 3 | GPT-4.1 | OpenAI | — | 28 / 40 (71%), review, Group C | 32 / 40 | 28 / 40 | €1.84 in / €7.36 out | Pass +0.8 | 28 Apr 2026 |
| 4 | DeepSeek V3.2 | DeepSeek | Open source | 27 / 40 (67%), review, Group D | 28 / 40 | 27 / 40 | €0.23 in / €0.35 out | Fail -0.2 | 29 Apr 2026 |
| 5 | DeepSeek-R1 | DeepSeek | Open source | 26 / 40 (65%), review, Group E | 25 / 40 | 26 / 40 | €0.64 in / €2.30 out | Fail -1.2 | 29 Apr 2026 |
| 6 | DeepSeek: DeepSeek V4 Pro | DeepSeek | Open source | 26 / 40 (66%), review, Group E | 29 / 40 | 26 / 40 | €0.40 in / €0.80 out | Fail -1.2 | 04 May 2026 |
| 7 | MoonshotAI: Kimi K2.5 | MoonshotAI | — | 26 / 40 (64%), review, Group E | 30 / 40 | 26 / 40 | €0.40 in / €1.84 out | Fail -1.2 | 06 May 2026 |
| 8 | OpenAI: GPT-5.5 Pro | OpenAI | — | 25 / 40 (63%), review, Group F | 27 / 40 | 25 / 40 | €27.60 in / €165.60 out | Fail -2.2 | 02 May 2026 |
| 9 | Mistral: Mistral Nemo | Mistral AI | Open source | 19 / 40 (47%), review, Group G | 22 / 40 | 19 / 40 | €0.02 in / €0.03 out | Fail -8.2 | 01 May 2026 |
| 10 | Claude Haiku 4.5 | Anthropic | — | 17 / 40 (42%), review, Group H | 28 / 40 | 17 / 40 | €0.92 in / €4.60 out | Fail -10.2 | 28 Apr 2026 |

Each public round now follows the same editorial structure: what this round says, what changed, which outliers are explainable, and what still must not be concluded.
This round makes the Wft-Basis leaderboard citable: public field definitions, score groups, and fixed source pages now ship as one release.
Download aggregated run data as CSV or JSON. Use the BibTeX entry below for attribution.

@online{insurebench_wft_basis_1_1_0,
  title = {InsureBench: Wft-Basis AI Benchmark},
  author = {InsureBench},
  year = {2026},
  version = {1.1.0},
  url = {https://www.insurebench.nl/nl/wft-basis},
  urldate = {2026-04-23},
  note = {Public leaderboard, 80 questions, 3 runs per model}
}
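The Result column of the leaderboard appears to be the score minus a fixed pass mark. The sketch below reproduces it under that assumption: the pass mark of 27.2 points is not stated on this page but is inferred from the table (for example 31 − 3.8), and it equals 68% of 40. The names `PASS_MARK` and `result` are illustrative, and treating a score exactly at the pass mark as a pass is also an assumption.

```python
PASS_MARK = 27.2  # assumption: inferred from the table, equal to 68% of 40


def result(score, pass_mark=PASS_MARK):
    """Format a score as 'Pass +x.x' or 'Fail -x.x' relative to the pass mark."""
    margin = round(score - pass_mark, 1)  # rounding hides float noise
    verdict = "Pass" if margin >= 0 else "Fail"
    return f"{verdict} {margin:+.1f}"


# Displayed scores for ranks 1, 4, and 10
for score in (31, 27, 17):
    print(score, "->", result(score))
```

For the three scores shown, the function returns `Pass +3.8`, `Fail -0.2`, and `Fail -10.2`, matching the corresponding table rows.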