InsureBench tests AI models on a private Wft-Basis practice-question set. Phase 2 expands this to open insurance-advice cases.
Anthropic: Claude Opus 4.8 (Fast) ranks #1 with 35/40 on the combined Wft and prompt score.
The score measures Wft-Basis knowledge, not whether a model is suitable as a standalone AI adviser.
Supported: Model X ranks highest on the InsureBench Wft-Basis knowledge benchmark.
Not supported: Model X gives the best insurance advice in private simple-risk cases.
Still open: whether AI models are reliable enough for standalone insurance advice. Answering that requires phases 2 and 3.
Score on a 40-point scale (Wft-Basis equivalent).
A model may appear in multiple rows, one per benchmark type: WFT measures knowledge (multiple choice), Prompt measures advice skills, and Combined measures both.
Scores in the same group differ by less than 1 point and should be read as effectively neck-and-neck.
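One way to read the grouping rule is sketched below in Python. It is only an illustration: the `assign_groups` helper, and the rule of opening a new group once a score sits a full point or more below the top score of the current group, are assumptions and not InsureBench's published method. The input is the list of combined scores from the leaderboard.

```python
from string import ascii_uppercase


def assign_groups(scores, width=1.0):
    """Label scores A, B, C, ... so members of a group lie within `width` points.

    Hypothetical helper: walks the distinct scores from high to low and opens
    a new group when a score is `width` or more below the group's top score.
    """
    labels = {}
    group_top = None
    group_index = -1
    for score in sorted(set(scores), reverse=True):
        if group_top is None or group_top - score >= width:
            group_index += 1
            group_top = score
        labels[score] = ascii_uppercase[group_index]
    # Return one label per input score, in the original order.
    return [labels[s] for s in scores]


# Combined scores of the ten leaderboard rows, in rank order.
combined = [35, 34, 34, 33, 33, 32, 32, 32, 31, 31]
print(assign_groups(combined))
```

Under this assumed rule the ten combined scores fall into groups A to E in the same pattern as the group labels shown in the leaderboard.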
| # | Model | Provider | Open source | Score (40) | WFT | Prompt | Price / M tokens | Result | Last tested |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Anthropic: Claude Opus 4.8 (Fast) | Anthropic | — | 35 / 40 (66/80 raw, Group A) | 33 / 40 | 36 / 40 | €9.20 in / €46.00 out | Pass +7.8 | 01 Jun 2026 |
| 2 | Gemini 3.1 Pro Preview | Google | — | 34 / 40 (68/80 raw, Group B) | 34 / 40 | 33 / 40 | €1.84 in / €11.04 out | Pass +6.8 | 04 May 2026 |
| 3 | Qwen: Qwen3.6 Plus | Qwen | Open source | 34 / 40 (63/80 raw, Group B) | 32 / 40 | 35 / 40 | €0.30 in / €1.79 out | Pass +6.8 | 07 May 2026 |
| 4 | Mistral Large 3 | Mistral | Open source | 33 / 40 (63/80 raw, Group C) | 31 / 40 | 34 / 40 | €0.46 in / €1.38 out | Pass +5.8 | 28 Apr 2026 |
| 5 | xAI: Grok 4.3 | xAI | — | 33 / 40 (67/80 raw, Group C) | 33 / 40 | 33 / 40 | €1.15 in / €2.30 out | Pass +5.8 | 07 May 2026 |
| 6 | Claude Opus 4.7 | Anthropic | — | 32 / 40 (67/80 raw, Group D) | 34 / 40 | 30 / 40 | €4.60 in / €23.00 out | Pass +4.8 | 28 Apr 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | — | 32 / 40 (59/80 raw, Group D) | 29 / 40 | 35 / 40 | €2.76 in / €13.80 out | Pass +4.8 | 29 Apr 2026 |
| 8 | Google: Gemini 3.5 Flash | Google | — | 32 / 40 (64/80 raw, Group D) | 32 / 40 | 31 / 40 | €1.38 in / €8.28 out | Pass +4.8 | 01 Jun 2026 |
| 9 | Gemini 2.5 Flash | Google | — | 31 / 40 (61/80 raw, Group E) | 30 / 40 | 31 / 40 | €0.28 in / €2.30 out | Pass +3.8 | 29 Apr 2026 |
| 10 | GPT 5.2 | OpenAI | — | 31 / 40 (59/80 raw, Group E) | 29 / 40 | 32 / 40 | €19.32 in / €154.56 out | Pass +3.8 | 29 Apr 2026 |

Each public round now follows the same editorial structure: what this round says, what changed, which outliers are explainable, and what still must not be concluded.
This round makes the Wft-Basis leaderboard readable and citable: public field definitions, score groups, and fixed source pages now ship as one release.
Download aggregated run data as CSV or JSON. Use the BibTeX entry below for attribution.

@online{insurebench_wft_basis_1_1_0,
title = {InsureBench: Wft-Basis AI Benchmark},
author = {InsureBench},
year = {2026},
version = {1.1.0},
url = {https://www.insurebench.nl/nl/wft-basis},
urldate = {2026-04-23},
note = {Public leaderboard, 80 questions, 3 runs per model}
}
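The Score, WFT, Prompt, and Result columns of the leaderboard can be cross-checked with a few lines of Python. Two rules in this sketch are inferred from the table values and are not stated by InsureBench: the combined score appears to equal the mean of WFT and Prompt rounded half up, and each pass margin appears to equal the combined score minus 27.2 points (for example 35 − 7.8 = 27.2, which is 68% of 40). The function names are hypothetical.

```python
import math

PASS_MARK = 27.2  # assumption: inferred from the table, e.g. 35 - 7.8


def combined_score(wft, prompt):
    """Mean of the two 40-point scores, rounded half up (assumed rule)."""
    # math.floor(x + 0.5) is used because Python's round() rounds halves to even.
    return math.floor((wft + prompt) / 2 + 0.5)


def pass_margin(score, pass_mark=PASS_MARK):
    """Points above (positive) or below (negative) the assumed pass mark."""
    return round(score - pass_mark, 1)


# (model, WFT, Prompt) taken from three leaderboard rows.
rows = [
    ("Anthropic: Claude Opus 4.8 (Fast)", 33, 36),
    ("Claude Opus 4.7", 34, 30),
    ("GPT 5.2", 29, 32),
]
for model, wft, prompt in rows:
    score = combined_score(wft, prompt)
    print(f"{model}: {score} / 40, margin {pass_margin(score):+.1f}")
```

Because the displayed scores are themselves rounded, and the raw counts are reported over 3 runs per model, this check only confirms that the published columns are mutually consistent. It does not reproduce the underlying scoring.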