InsureBench

Data dictionary

This page explains the public fields in the InsureBench JSON and CSV exports so scores, metadata, and leaderboard status can be read without extra explanation.

Sources

The definitions below apply to the public JSON, CSV, and run-detail exports.

JSON | CSV | Methodology

Public fields

Field	Public?	Format	Example	Meaning	Interpretation
runId	Yes	UUID string	640fcf56-71b6-48b5-ba79-2170330179b0	Unique identifier of a public benchmark run.	Use this field to trace one specific run across the JSON, CSV, or detail export.
modelSlug	Yes	kebab-case string	gpt-4o	Stable public slug for the model inside InsureBench.	Useful for links, filters, and model detail pages.
modelName	Yes	Human-readable name	GPT-4o	Public display name of the tested model.	—
provider	Yes	Text	OpenAI	The provider or model maker as InsureBench shows it publicly.	—
gatewayId	Yes	Provider/model route	openai/gpt-4o	The exact gateway route used to run the benchmark.	This helps reproducibility when public model names may act as aliases.
versionPinned	Yes	Text / model route	openai/gpt-4o	The model version or route that InsureBench reports as the pinned reference.	—
triggeredAt	Yes	ISO datetime	2026-05-07T13:22:15.562+00:00	The timestamp when the benchmark run was started.	Use this field to judge how recent a result is.
score40	Yes	Number on a 40-point scale	33	Public headline score for the run, converted to the Wft-Basis 40-point scale.	A higher score means a stronger result on this benchmark; it does not by itself make a model suitable for standalone advice.
rawMean	Yes	Decimal between 0 and 1	0.8333	The average raw result across the underlying attempts before conversion to the 40-point scale.	—
stdev	Yes	Decimal	0.0059	The standard deviation across the underlying attempts within the same run.	This mainly reflects pipeline stability and does not on its own prove substantive quality.
passed	Yes	Boolean	true	Whether the run reaches the configured pass threshold.	—
questionCount	Yes	Integer	80	Number of questions used to score this run.	—
passThreshold	Yes	Decimal between 0 and 1	0.68	The pass threshold as a raw ratio, before conversion to the 40-point scale.	0.68 corresponds to 68% of the raw scoring basis.
includedInLeaderboard	Yes	Boolean	true	Indicates whether this run counts toward the public ranking.	In this release this field still maps one-to-one to “published and not archived”.
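
As a rough illustration of how these fields might be consumed, the sketch below loads one run record in Python. The field names and example values come from the table above. The `PublicRun` class, the `parse_run` helper, and the assumption that the JSON export holds one flat object per run are hypothetical, since the exact layout of the export is not described on this page.

```python
import json
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class PublicRun:
    """Hypothetical container mirroring the public fields in the table."""
    runId: str
    modelSlug: str
    modelName: str
    provider: str
    gatewayId: str
    versionPinned: str
    triggeredAt: datetime
    score40: float
    rawMean: float
    stdev: float
    passed: bool
    questionCount: int
    passThreshold: float
    includedInLeaderboard: bool


def parse_run(record: dict) -> PublicRun:
    """Build a PublicRun from one decoded JSON object (assumed flat)."""
    # Pick out only the documented public fields; a missing one raises KeyError.
    values = {f.name: record[f.name] for f in fields(PublicRun)}
    # triggeredAt is documented as an ISO datetime string with a UTC offset.
    values["triggeredAt"] = datetime.fromisoformat(values["triggeredAt"])
    return PublicRun(**values)


# Sample record assembled from the example column of the table.
SAMPLE = """
{
  "runId": "640fcf56-71b6-48b5-ba79-2170330179b0",
  "modelSlug": "gpt-4o",
  "modelName": "GPT-4o",
  "provider": "OpenAI",
  "gatewayId": "openai/gpt-4o",
  "versionPinned": "openai/gpt-4o",
  "triggeredAt": "2026-05-07T13:22:15.562+00:00",
  "score40": 33,
  "rawMean": 0.8333,
  "stdev": 0.0059,
  "passed": true,
  "questionCount": 80,
  "passThreshold": 0.68,
  "includedInLeaderboard": true
}
"""

run = parse_run(json.loads(SAMPLE))

# Keep only runs that count toward the public ranking.
ranked = [r for r in [run] if r.includedInLeaderboard]
```

Keeping the attribute names identical to the export keys avoids a separate mapping step; a real consumer may prefer its own naming convention.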

Which data is not public

InsureBench does not publish question text, answer options, correct answers, raw model output, review notes, or other internal scoring details. Only aggregates, methodology, and run metadata are public.

How to read the scores

A higher `score40` means a stronger result on this private Wft-Basis knowledge benchmark. It does not prove that a model meets Wft obligations, gives safe standalone advice, or is operationally deployable without human oversight.
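
The relationship between the raw ratio and the 40-point scale can be sketched with the example values from the field table. The sketch assumes a plain linear conversion (`rawMean * 40`) followed by rounding, and that a run passes when `rawMean` is at least `passThreshold`. Neither rule is spelled out on this page, so the helper names and both rules are assumptions; the methodology page is the place to confirm the actual definitions.

```python
SCALE_MAX = 40  # top of the Wft-Basis 40-point scale


def to_scale40(raw_ratio: float) -> float:
    """Assumed linear conversion from a 0..1 ratio to the 40-point scale."""
    return raw_ratio * SCALE_MAX


def reaches_threshold(raw_mean: float, pass_threshold: float) -> bool:
    """Assumed pass rule: the raw mean must be at least the threshold."""
    return raw_mean >= pass_threshold


# Example values taken from the field table.
raw_mean = 0.8333
pass_threshold = 0.68

score_unrounded = to_scale40(raw_mean)          # about 33.33 points
threshold_points = to_scale40(pass_threshold)   # about 27.2 points
headline = round(score_unrounded)               # 33 under the assumed rounding
is_pass = reaches_threshold(raw_mean, pass_threshold)
```

Under these assumptions the example run lands at 33 points, which agrees with the example `score40`, and clears a threshold of roughly 27.2 points. That agreement is consistent with the assumed conversion but does not confirm it.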
