InsureBench

External validation of the Wft-Basis benchmark

This page explains how domain experts can review the benchmark without making the private question bank public.

Status

Public evidence layer live

External validation status

External review in preparation

Purpose of this page

External validation and scope boundaries

Why external validation matters

The Wft-Basis benchmark is now publicly documented, but because the question bank remains private, its content still needs external substantive review. External validation should show whether coverage, substantive correctness, and difficulty hold up under scrutiny from outside the internal team.

What experts can review

Whether the question bank covers the relevant Wft-Basis learning objectives in a balanced way.
Whether question content is substantively correct as of the chosen legal reference date.
Whether difficulty is reasonably aligned with the official CDFD exam.
Whether public claims remain aligned with the actual evidence layer.
Whether the boundary between knowledge benchmarking and advice suitability is sharp enough.
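
As an illustration of what a "balanced" coverage check could look like in practice, the sketch below compares the observed share of questions per learning objective with a set of target shares. Everything in it is an assumption made for illustration: the function name `coverage_report`, the objective labels "A", "B", and "C", the target shares, and the 5-percentage-point tolerance are all hypothetical and do not come from the benchmark's methodology.

```python
from collections import Counter


def coverage_report(question_objectives, target_shares, tolerance=0.05):
    """Compare observed question shares per learning objective with target shares.

    question_objectives: one objective label per question (hypothetical labels).
    target_shares: dict of objective label -> intended share of the question bank.
    tolerance: allowed absolute deviation before an objective is flagged (assumed value).
    Returns a dict: objective -> (observed_share, target_share, within_tolerance).
    """
    total = len(question_objectives)
    if total == 0:
        raise ValueError("question sample is empty")
    counts = Counter(question_objectives)
    report = {}
    # Include objectives that appear only in the targets or only in the sample.
    for objective in sorted(set(target_shares) | set(counts)):
        observed = counts.get(objective, 0) / total
        target = target_shares.get(objective, 0.0)
        report[objective] = (observed, target, abs(observed - target) <= tolerance)
    return report


# Made-up sample: 10 questions spread over three placeholder objectives.
sample = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
targets = {"A": 0.4, "B": 0.3, "C": 0.3}

report = coverage_report(sample, targets)
for objective, (observed, target, ok) in report.items():
    status = "ok" if ok else "review"
    print(f"{objective}: observed {observed:.0%}, target {target:.0%}, {status}")
```

A reviewer would substitute the real learning-objective list and whatever weighting they consider appropriate; the tally only flags where a closer look may be warranted and does not by itself judge coverage.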

What experts cannot review here

The full private question bank is not publicly shared.
Raw model answers and internal review notes remain private.
This page does not claim official CDFD approval or certification.
This page does not claim official exam equivalence.
This page does not prove that models can safely provide standalone insurance advice.

Intended external validation scope

Review of question-bank coverage and learning-objective distribution.
Review of substantive correctness as of the chosen legal reference date.
Review of difficulty and reasonableness relative to the official exam.
Review of public interpretation, limitations, and claim boundaries.

Intended reviewer roles: Not yet publicly specified

Explicitly out of scope

No request to formally approve the benchmark as a whole.
No public release of the private question bank as a precondition for review.
No endorsement of commercial suitability or advice safety.

Evidence available to reviewers

Methodology — Full dataset card, run protocol, and error statistics.
Methodology JSON — Machine-readable methodology export.
Runs CSV — Public aggregated run data per benchmark round.
Runs JSON — Machine-readable public run export by benchmark round.
Data dictionary — Definitions of public fields and interpretation boundaries.
Changelog — Traceability of release changes, limitations, and findings.
Validation — Existing explanation of review, contamination, and governance.
Roadmap — Outstanding validation steps and phase boundaries.
Conflict-of-interest statement — Context on ownership and public accountability.

Outstanding validation steps

Formally select the external reviewer or partner.
Deliver a review pack with scoped review questions.
Publish review date, reviewer role, and assessment scope.
Summarize what was and was not externally reviewed.
Reflect outcomes traceably in the changelog, methodology, and this page.

How experts can contribute

Contributions are welcome from Wft experts, compliance specialists, and exam-design experts. Input is most relevant for learning-objective coverage, substantive correctness, difficulty, and claim boundaries.

To register or get in touch, use the public contact point. Relevant findings should then be reflected publicly and traceably in the methodology, the changelog, or this page.

Contributions are meant as scrutiny, not endorsement.

Boundaries of the current benchmark

Closed models can change internally without full public visibility.
A private set limits contamination, but does not eliminate it entirely.
External validation is not legal or official certification.
Wft-Basis phase 1 remains a knowledge benchmark, not proof of safe advice deployment.