InsureBench
Data dictionary
This page explains the public fields in the InsureBench JSON and CSV exports so scores, metadata, and leaderboard status can be read without extra explanation.
Sources
The definitions below apply to all three public exports: the JSON export, the CSV export, and the run-detail export.
Public fields
| Field | Public? | Format | Example | Meaning | Interpretation |
|---|---|---|---|---|---|
| runId | Yes | UUID string | 640fcf56-71b6-48b5-ba79-2170330179b0 | Unique identifier of a public benchmark run. | Use this field to trace one specific run across the JSON, CSV, or detail export. |
| modelSlug | Yes | kebab-case string | gpt-4o | Stable public slug for the model inside InsureBench. | Useful for links, filters, and model detail pages. |
| modelName | Yes | Human-readable name | GPT-4o | Public display name of the tested model. | — |
| provider | Yes | Text | OpenAI | The provider or model maker as InsureBench shows it publicly. | — |
| gatewayId | Yes | Provider/model route | openai/gpt-4o | The exact gateway route used to run the benchmark. | This supports reproducibility, because public model names can act as aliases. |
| versionPinned | Yes | Text / model route | openai/gpt-4o | The model version or route that InsureBench reports as the pinned reference. | — |
| triggeredAt | Yes | ISO datetime | 2026-05-07T13:22:15.562+00:00 | The timestamp when the benchmark run was started. | Use this field to judge how recent a result is. |
| score40 | Yes | Number on a 40-point scale | 33 | Public headline score for the run, converted to the Wft-Basis 40-point scale. | A higher score means a stronger result on this benchmark; it does not automatically mean the model is suitable for standalone advice. |
| rawMean | Yes | Decimal between 0 and 1 | 0.8333 | The average raw result across the underlying attempts before conversion to the 40-point scale. | — |
| stdev | Yes | Decimal | 0.0059 | The standard deviation across the underlying attempts within the same run. | This mainly reflects pipeline stability and does not on its own prove substantive quality. |
| passed | Yes | Boolean | true | Whether the run reaches the configured pass threshold. | — |
| questionCount | Yes | Integer | 80 | Number of questions used to score this run. | — |
| passThreshold | Yes | Decimal between 0 and 1 | 0.68 | The pass threshold as a raw ratio, before conversion to the 40-point scale. | 0.68 corresponds to 68% of the raw scoring basis. |
| includedInLeaderboard | Yes | Boolean | true | Indicates whether this run counts toward the public ranking. | In this release this field still maps one-to-one to “published and not archived”. |
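The fields above can be read into typed records. The following is a minimal sketch, not an official client. It assumes the JSON export is a top-level array of objects keyed by the field names in the table, which this page does not confirm. The names `PublicRun`, `parse_run`, `leaderboard_runs`, and `SAMPLE_EXPORT` are hypothetical, the second sample record is invented for illustration, and ordering by `score40` is an assumption about how the ranking is built.

```python
import json
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sample: the first record uses the example values from the
# table; the second record is invented to show an excluded run.
SAMPLE_EXPORT = """[
  {"runId": "640fcf56-71b6-48b5-ba79-2170330179b0", "modelSlug": "gpt-4o",
   "triggeredAt": "2026-05-07T13:22:15.562+00:00", "score40": 33,
   "passed": true, "includedInLeaderboard": true},
  {"runId": "00000000-0000-0000-0000-000000000000", "modelSlug": "example-model",
   "triggeredAt": "2026-05-01T09:00:00.000+00:00", "score40": 25,
   "passed": false, "includedInLeaderboard": false}
]"""


@dataclass(frozen=True)
class PublicRun:
    """Subset of the public fields described in the table."""
    run_id: str
    model_slug: str
    triggered_at: datetime
    score40: float
    passed: bool
    included_in_leaderboard: bool


def parse_run(record: dict) -> PublicRun:
    """Map one exported record (camelCase keys) to a typed object."""
    return PublicRun(
        run_id=record["runId"],
        model_slug=record["modelSlug"],
        triggered_at=datetime.fromisoformat(record["triggeredAt"]),
        score40=float(record["score40"]),
        passed=bool(record["passed"]),
        included_in_leaderboard=bool(record["includedInLeaderboard"]),
    )


def leaderboard_runs(runs: list[PublicRun]) -> list[PublicRun]:
    """Keep runs that count toward the ranking, highest score40 first."""
    included = [r for r in runs if r.included_in_leaderboard]
    return sorted(included, key=lambda r: r.score40, reverse=True)


runs = [parse_run(r) for r in json.loads(SAMPLE_EXPORT)]
ranked = leaderboard_runs(runs)
```

Filtering on `includedInLeaderboard` rather than on `passed` keeps the two concepts separate: a run can pass the threshold without counting toward the public ranking, and the reverse.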
Which data is not public
InsureBench does not publish question text, answer options, correct answers, raw model output, review notes, or other internal scoring details. Only aggregates, methodology, and run metadata are public.
How to read the scores
A higher `score40` means a stronger result on this private Wft-Basis knowledge benchmark. It does not prove that a model meets Wft obligations, gives safe standalone advice, or is operationally deployable without human oversight.
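To relate `rawMean` and `passThreshold` to the 40-point scale, the sketch below assumes a simple linear conversion (raw ratio times 40) with rounding to the nearest whole point. This page only states that the score is "converted", so both the formula and the rounding are assumptions; they happen to reproduce the example values in the table. The function name `to_40_point_scale` is hypothetical.

```python
def to_40_point_scale(raw_ratio: float) -> float:
    """Assumed linear conversion from a 0..1 ratio to the 40-point scale."""
    return raw_ratio * 40


raw_mean = 0.8333        # example rawMean from the table
pass_threshold = 0.68    # example passThreshold from the table

# Assumed rounding to the nearest whole point for the headline score.
headline_score = round(to_40_point_scale(raw_mean))

# The threshold expressed on the same scale, rounded to one decimal.
threshold_on_scale = round(to_40_point_scale(pass_threshold), 1)

# Assumed pass rule: the raw mean must reach the raw threshold.
# Whether the comparison is inclusive is not stated on this page.
would_pass = raw_mean >= pass_threshold
```

Under these assumptions the example run lands at 33 points against a threshold of about 27.2 points.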