InsureBench

Validation

This page shows how the benchmark remains reviewable without publishing the private question bank or raw model outputs.

Core explanation

InsureBench does not publish the question bank. This limits copyright risk and reduces the chance that the questions end up in future model training data. To keep the benchmark reviewable, we publish a dataset card, learning-objective distribution, model card, run metadata, and aggregated results. We also arrange periodic review of the question bank and rubric by external Wft and compliance experts.

How questions are created

The private set uses 80 Wft-Basis practice questions based on CDFD learning objectives. The questions are distributed across CDFD task, question type, and difficulty so the set does not lean on one narrow slice of the domain.

Who reviews the work

The author maintains the benchmark and publishes the dataset card, model card, run metadata, and aggregated results for each release. External Wft and compliance experts may periodically review the question bank and rubric.

How often it is revised

Review status: external review in preparation. Last recorded review: not completed yet. Changes to benchmark logic and governance are recorded publicly in the changelog.

How errors are reported

Factual errors, unclear claims, or methodological issues should be reported through the public contact point. Corrections should be traceable in the changelog, methodology page, or dataset metadata.

How experts can contribute

External experts can contribute as reviewers of question quality, rubric logic, or compliance boundaries. Contributions are meant as scrutiny and refinement, not marketing endorsement.

How contamination is limited

The private set is protected by a held-out policy and canary questions; the exact questions are not shared publicly.
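As an illustration only, a canary check could look like the sketch below. The page does not describe how InsureBench builds or checks its canary questions, so the marker format, the secret key, and the function names `make_canary` and `find_leaked_canaries` are all assumptions made for this example.

```python
import hashlib
import hmac


def make_canary(secret_key: bytes, dataset_version: str, index: int) -> str:
    """Derive a unique marker for one canary question.

    Hypothetical format: the marker is keyed with a private secret so it
    cannot be guessed from the dataset version alone.
    """
    message = f"{dataset_version}:{index}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return f"CANARY-{digest[:16]}"


def find_leaked_canaries(text: str, canaries: list[str]) -> list[str]:
    """Return the canary markers that appear verbatim in a piece of text,
    for example a model response or a scraped public document."""
    return [marker for marker in canaries if marker in text]
```

A marker that shows up in model output or in a public corpus would suggest that part of the private set has leaked. An empty result does not prove the set is uncontaminated.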

How model versions are pinned

Providers may treat model names as aliases that point to different underlying versions over time. InsureBench therefore reports the API name used, the gateway route, the test date, the prompt version, and the dataset version. For closed models, some uncertainty remains about internal changes on the provider side.
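The five reported fields can be pictured as one record per run. The sketch below is a minimal illustration, not the published InsureBench schema: the class name `RunMetadata`, the field names, and the JSON serialization are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """One benchmark run, pinned by the fields the page lists (hypothetical schema)."""

    api_model_name: str   # exact name sent to the provider API, not a marketing name
    gateway_route: str    # route or gateway through which the model was reached
    test_date: str        # date of the run, as an ISO 8601 string
    prompt_version: str   # version label of the prompt template
    dataset_version: str  # version label of the private question set


def to_json(meta: RunMetadata) -> str:
    """Serialize the record with sorted keys so two runs can be diffed line by line."""
    return json.dumps(asdict(meta), sort_keys=True, indent=2)
```

Making the record immutable (`frozen=True`) reflects the intent that run metadata is written once per run and not edited afterwards. Any concrete values used with this class would be placeholders.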

How the advice rubric is validated

The prompt benchmark for advice quality is still under development. The rubric should only carry public weight after blind scoring, expert review, and example cases show that reviewers reach comparable outcomes consistently.
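One common way to quantify whether two reviewers reach comparable outcomes is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The page does not name an agreement statistic, so the choice of kappa and the function below are assumptions offered as a sketch.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters who each labeled the same items.

    Returns 1.0 for perfect agreement and 0.0 for agreement at chance level.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected if each rater assigned labels independently
    # at their own observed label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, if reviewer A scores four cases as pass, pass, fail, fail and reviewer B scores them pass, fail, fail, fail, raw agreement is 0.75, chance agreement is 0.5, and kappa is 0.5. What threshold counts as "comparable" would be a separate decision for the rubric owners.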

Which limitations remain

A private question bank does not remove all uncertainty. Closed models can still change internally, external review is not official certification, and phase 1 remains a knowledge benchmark rather than evidence of safe advice deployment.

Contact and contributions

Report an issue

Use the public contact point for factual corrections, methodology questions, or unclear claim wording.

Open contact details

Contribute as an expert

Contributions are welcome for question validation, rubric review, and claim boundaries. The contribution is intended as substantive scrutiny, not endorsement.

View governance