The useful signal is clinical, not numerical.
The top frontier systems are close on a 30-question dataset. The more important distinction is where answers fail: treatment thresholds, guideline nuance, overstatement of evidence, refusals, and judge inconsistency.
Read this as a reproducible benchmark report. With only 30 questions, small score gaps are within noise. Every headline metric uses the stored primary-judge verdict unchanged as its endpoint, and the raw files are one click away.
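To make the noise concrete, here is a minimal sketch of the sampling uncertainty on a 30-question score, using the Wilson score interval for a binomial proportion. The counts (24 and 26 correct out of 30) are illustrative assumptions, not values from the benchmark, and `wilson_interval` is a helper written for this example rather than part of the repository.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Illustrative counts only (assumed, not taken from the benchmark data).
low_a, high_a = wilson_interval(24, 30)
low_b, high_b = wilson_interval(26, 30)

# Two scores that differ by two questions still have overlapping intervals.
intervals_overlap = low_b <= high_a and low_a <= high_b
```

Under these assumed counts, the interval for 24/30 spans roughly 63% to 90%, so a gap of one or two questions between models does not by itself support a ranking claim.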
Deployment accuracy by model
Refusals count as incorrect. Select a model for its full record; filter by domain, tier, or provider.
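The scoring rule can be sketched as follows. This is a hedged illustration, not the repository's analysis script: the field names (`model`, `domain`, `outcome`) and the outcome labels are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows; field names and labels are assumed for illustration.
rows = [
    {"model": "model-a", "domain": "pharmacology", "outcome": "correct"},
    {"model": "model-a", "domain": "pharmacology", "outcome": "refusal"},
    {"model": "model-a", "domain": "endodontics", "outcome": "incorrect"},
    {"model": "model-b", "domain": "pharmacology", "outcome": "correct"},
    {"model": "model-b", "domain": "endodontics", "outcome": "correct"},
    {"model": "model-b", "domain": "endodontics", "outcome": "refusal"},
]

def accuracy_by(rows, key):
    """Share of rows marked correct per group; refusals stay in the denominator."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        # Only an explicit "correct" counts; refusal and incorrect both score 0.
        if row["outcome"] == "correct":
            correct[row[key]] += 1
    return {group: correct[group] / total[group] for group in total}

accuracy_per_model = accuracy_by(rows, "model")
accuracy_per_domain = accuracy_by(rows, "domain")
```

Keeping refusals in the denominator means a model cannot raise its score by declining hard questions. The same helper grouped by `domain` gives the per-domain view.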
Performance shifts by clinical domain
Accuracy per domain, five questions each. Pharmacology is the domain that most clearly separates the models.
The 30 clinical questions
Each dot is one model's outcome on that question.
What the wrong answers got wrong
A clinical reading of every incorrect row — full audit file ↗
Clear clinical error patterns
Judge consistency flags
Rows stored as incorrect even though the primary judge marked every criterion satisfied and recorded no violations. They are shown as manual-adjudication candidates, not retroactive corrections.
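A flag of this kind can be expressed as a simple predicate over a verdict row. The sketch below assumes a hypothetical row layout (`stored_verdict`, `criteria_satisfied`, `violations`); the actual judge verdict files may be structured differently.

```python
def is_adjudication_candidate(row: dict) -> bool:
    """True when a row is stored as incorrect but the judge found nothing wrong."""
    criteria = row["criteria_satisfied"]  # assumed: list of booleans, one per criterion
    return (
        row["stored_verdict"] == "incorrect"
        and len(criteria) > 0          # an empty criteria list is not evidence of success
        and all(criteria)
        and len(row["violations"]) == 0
    )

# Hypothetical verdict rows for illustration only.
verdicts = [
    {"id": "q01", "stored_verdict": "incorrect", "criteria_satisfied": [True, True], "violations": []},
    {"id": "q02", "stored_verdict": "incorrect", "criteria_satisfied": [True, False], "violations": []},
    {"id": "q03", "stored_verdict": "correct", "criteria_satisfied": [True, True], "violations": []},
    {"id": "q04", "stored_verdict": "incorrect", "criteria_satisfied": [True], "violations": ["overstated evidence"]},
]

# The stored verdicts are left untouched; the flag only selects rows for review.
candidates = [row["id"] for row in verdicts if is_adjudication_candidate(row)]
```

Flagging instead of rewriting keeps the headline metrics tied to the stored endpoint while making the inconsistent rows easy to find.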
Secondary judges expose scoring uncertainty
Every stored answer was re-scored by two independent judges. Absolute accuracy is judge-dependent; the ranking is stable.
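One way to check a claim like "absolute accuracy shifts, ranking holds" is a rank correlation between the accuracies assigned by two judges. The sketch below computes Spearman's rho in plain Python; the accuracy values are illustrative assumptions, not benchmark results, and this is not necessarily the method used in the analysis scripts.

```python
import math

def average_ranks(values):
    """Rank from highest (rank 1) to lowest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined when all values are tied")
    return cov / (sx * sy)

# Illustrative per-model accuracies under two judges (assumed values).
primary_judge = [0.90, 0.83, 0.80, 0.70]
secondary_judge = [0.80, 0.73, 0.67, 0.60]
rho = spearman(primary_judge, secondary_judge)
```

In this example the secondary judge scores every model lower, yet the model order is identical, so rho is 1 up to floating-point error.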
The one-image version
Generated deterministically from the benchmark data — download the SVG ↗
Reproduce it
The repository carries the question dataset, raw JSONL answer transcripts, judge verdict files, analysis scripts, manuscript source, and this report's generator. One command rebuilds every table.
@misc{teixeirabarbosa_dental_llm_benchmark_2026,
author = {Teixeira Barbosa, Francisco and Robles Cantero, Daniel and Brizuela Velasco, Aritza},
title = {Evaluating Frontier Language Models on Clinician-Reviewed Dental Questions: A Reproducible Benchmark},
year = {2026},
url = {https://github.com/Tuminha/llm-evaluation-for-dentistry}
}