InsureBench
Validation
This page shows how the benchmark remains reviewable without publishing the private question bank or raw model outputs.
Core explanation
InsureBench does not publish the question bank. Keeping it private limits copyright risk and reduces the chance that the questions contaminate future model training data. To keep the benchmark reviewable, we publish a dataset card, the learning-objective distribution, a model card, run metadata, and aggregated results. We also arrange for external Wft and compliance experts to review the question bank and rubric periodically.
How questions are created
The private set contains 80 Wft-Basis practice questions based on CDFD learning objectives. The questions are distributed across CDFD task, question type, and difficulty so the set does not lean on one narrow slice of the domain.
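The balance described above can be checked with a simple tally. The following is a minimal sketch, not the InsureBench tooling: the field names (`task`, `question_type`, `difficulty`) and the `max_share` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def distribution(questions, key):
    """Return the share of questions per value of `key` (e.g. 'difficulty')."""
    counts = Counter(q[key] for q in questions)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def overrepresented(questions, key, max_share):
    """List the values of `key` whose share exceeds `max_share` (hypothetical threshold)."""
    shares = distribution(questions, key)
    return sorted(value for value, share in shares.items() if share > max_share)
```

Running `overrepresented` once per dimension gives a quick signal that one task, question type, or difficulty level dominates the set. Only the resulting shares would need to be published, not the questions themselves.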
Who reviews the work
The author maintains the benchmark and publishes the dataset card, model card, run metadata, and aggregated results for each release. External Wft and compliance experts may periodically review the question bank and rubric.
How often it is revised
Review status: external review in preparation. Last recorded review: not completed yet. Changes to benchmark logic and governance are recorded publicly in the changelog.
How errors are reported
Factual errors, unclear claims, or methodological issues should be reported through the public contact point. Corrections should be traceable in the changelog, methodology page, or dataset metadata.
How experts can contribute
External experts can contribute as reviewers of question quality, rubric logic, or compliance boundaries. Contributions are meant as scrutiny and refinement, not marketing endorsement.
How contamination is limited
The private set is governed by a held-out policy and includes canary questions; the exact questions are not shared publicly.
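The page does not describe how the canary questions work. One common approach, shown here purely as an illustrative sketch, is to embed a unique marker string in held-out items and scan model outputs for it. The marker prefix and both function names are hypothetical.

```python
import uuid


def make_canary(prefix="INSUREBENCH-CANARY"):
    """Create a unique marker string (the prefix is a hypothetical choice)."""
    return f"{prefix}-{uuid.uuid4()}"


def leaked_canaries(output, canaries):
    """Return the canary markers that appear, case-insensitively, in a model output."""
    lowered = output.lower()
    return [c for c in canaries if c.lower() in lowered]
```

A non-empty result suggests that held-out material may have reached a model's training data. An empty result does not prove the absence of contamination.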
How model versions are pinned
Model names can act as aliases at providers. InsureBench therefore reports the API name used, the gateway route, the test date, the prompt version, and the dataset version. For closed models, some residual uncertainty remains about internal provider changes.
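The five reported fields can be captured in one immutable record per run. This is a minimal sketch under assumptions: the class name, the field names, and the short fingerprint are illustrative and are not the published InsureBench metadata schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """One benchmark run; field names are hypothetical."""
    api_model_name: str   # exact name sent to the provider API, not a marketing alias
    gateway_route: str    # route through which the request was sent
    test_date: str        # ISO 8601 date of the run
    prompt_version: str
    dataset_version: str

    def fingerprint(self):
        """Short, stable hash of all fields, useful for comparing runs."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]
```

Two runs with the same fingerprint used the same recorded settings. Because a provider can change a closed model behind an unchanged API name, an identical fingerprint still does not guarantee identical model behaviour.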
How the advice rubric is validated
The prompt benchmark for advice quality is still under development. The rubric should only carry public weight after blind scoring, expert review, and example cases show that reviewers reach comparable outcomes consistently.
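The page does not name a statistic for "comparable outcomes". One widely used option for two reviewers is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The function below is a sketch of that statistic, offered as an assumption about how agreement could be measured.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers who scored the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must score the same non-empty set of items")
    n = len(ratings_a)
    # Share of items on which the two reviewers gave the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's own score frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical score throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Any threshold the rubric must reach before it carries public weight would be a separate governance decision.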
Which limitations remain
A private question bank does not remove all uncertainty. Closed models can still change internally, external review is not official certification, and phase 1 remains a knowledge benchmark rather than evidence of safe advice deployment.
Contact and contributions
Report an issue
Use the public contact point for factual corrections, methodology questions, or unclear claim wording.
Open contact details
Contribute as an expert
Contributions are welcome for question validation, rubric review, and claim boundaries. The contribution is intended as substantive scrutiny, not endorsement.
View governance