Dental LLM Benchmark — Interactive Report

The useful signal is clinical, not numerical.

The top frontier systems are close on a 30-question dataset. The more important distinction is where answers fail: treatment thresholds, guideline nuance, overstatement of evidence, refusals, and judge inconsistency.

Read this as a reproducible benchmark report. With only 30 questions, small score gaps are noisy. Every headline metric keeps the primary judge's stored verdict as its endpoint, and the raw files are one click away.
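To see why small gaps on 30 questions are noisy, here is a minimal sketch of a 95% Wilson score interval for a proportion. The report does not say which interval method it uses, so the choice of Wilson is an assumption, and the counts 27/30 and 25/30 are illustrative values, not results from the benchmark.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for correct/total (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative counts only (assumed, not taken from the report).
interval_a = wilson_interval(27, 30)
interval_b = wilson_interval(25, 30)

# Two intervals overlap when each lower bound sits below the other's upper bound.
overlap = interval_a[0] < interval_b[1] and interval_b[0] < interval_a[1]
print(interval_a, interval_b, overlap)
```

At this sample size each interval spans roughly twenty percentage points or more, so a two-question gap between models does not separate them.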

Deployment accuracy by model

Refusals count as incorrect. Select a model for its full record; filter by domain, tier, or provider.
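The scoring rule above can be sketched as follows. The row layout is hypothetical: the field names `model` and `outcome`, and the outcome labels `correct`, `incorrect`, and `refusal`, are assumptions for illustration and may not match the repository's files.

```python
from collections import defaultdict


def deployment_accuracy(rows):
    """Accuracy per model where every refusal stays in the denominator."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        totals[row["model"]] += 1
        # Only an explicit "correct" scores; refusals and errors both count as misses.
        if row["outcome"] == "correct":
            correct[row["model"]] += 1
    return {model: correct[model] / totals[model] for model in totals}


# Hypothetical rows, not benchmark data.
example_rows = [
    {"model": "model-a", "outcome": "correct"},
    {"model": "model-a", "outcome": "refusal"},
    {"model": "model-a", "outcome": "incorrect"},
    {"model": "model-a", "outcome": "correct"},
    {"model": "model-b", "outcome": "correct"},
    {"model": "model-b", "outcome": "correct"},
]
example_accuracy = deployment_accuracy(example_rows)
```

Keeping refusals in the denominator means a model cannot raise its score by declining hard questions.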


Performance shifts by clinical domain

Accuracy per domain, five questions each. Pharmacology separates the field.
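One way to quantify which domain "separates the field" is the gap between the best and worst model accuracy within each domain. This is a sketch under assumptions: the report does not define the measure, and the field names `domain`, `model`, and `outcome` are hypothetical, as is the `endodontics` label in the sample rows.

```python
from collections import defaultdict


def domain_spread(rows):
    """Return, per domain, max minus min model accuracy."""
    counts = defaultdict(lambda: [0, 0])  # (domain, model) -> [correct, total]
    for row in rows:
        cell = counts[(row["domain"], row["model"])]
        cell[0] += row["outcome"] == "correct"
        cell[1] += 1

    accuracies = defaultdict(list)
    for (domain, _model), (n_correct, n_total) in counts.items():
        accuracies[domain].append(n_correct / n_total)

    return {domain: max(accs) - min(accs) for domain, accs in accuracies.items()}


# Hypothetical rows, not benchmark data.
sample = [
    {"domain": "pharmacology", "model": "model-a", "outcome": "correct"},
    {"domain": "pharmacology", "model": "model-a", "outcome": "correct"},
    {"domain": "pharmacology", "model": "model-b", "outcome": "incorrect"},
    {"domain": "pharmacology", "model": "model-b", "outcome": "refusal"},
    {"domain": "endodontics", "model": "model-a", "outcome": "correct"},
    {"domain": "endodontics", "model": "model-a", "outcome": "incorrect"},
    {"domain": "endodontics", "model": "model-b", "outcome": "incorrect"},
    {"domain": "endodontics", "model": "model-b", "outcome": "correct"},
]
spread = domain_spread(sample)
```

With five questions per domain, each model's domain accuracy moves in steps of 20 points, so the spread is a coarse signal rather than a precise estimate.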

The 30 clinical questions

Each dot is one model's outcome on that question.


What the wrong answers got wrong

A clinical reading of every incorrect row — full audit file ↗

Clear clinical error patterns

Judge consistency flags

Rows stored as incorrect although the primary judge marked every criterion satisfied and recorded no violations. They are shown as manual-adjudication candidates, not retroactive corrections.
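The flag described above amounts to a simple filter over the verdict rows. The sketch below assumes a hypothetical schema: `stored_outcome` (a string), `criteria_satisfied` (a list of booleans, one per rubric criterion), and `violations` (a list of strings). The real verdict files may be laid out differently.

```python
def adjudication_candidates(rows):
    """Select rows stored as incorrect despite a clean primary-judge rubric."""
    flagged = []
    for row in rows:
        criteria = row["criteria_satisfied"]
        clean_rubric = (
            len(criteria) > 0          # an empty rubric is not evidence of success
            and all(criteria)          # every criterion marked satisfied
            and not row["violations"]  # no violations recorded
        )
        if row["stored_outcome"] == "incorrect" and clean_rubric:
            flagged.append(row)
    return flagged


# Hypothetical rows, not benchmark data.
verdicts = [
    {"id": "q1", "stored_outcome": "incorrect", "criteria_satisfied": [True, True], "violations": []},
    {"id": "q2", "stored_outcome": "incorrect", "criteria_satisfied": [True, False], "violations": []},
    {"id": "q3", "stored_outcome": "correct", "criteria_satisfied": [True, True], "violations": []},
    {"id": "q4", "stored_outcome": "incorrect", "criteria_satisfied": [True], "violations": ["overstated evidence"]},
]
candidates = adjudication_candidates(verdicts)
```

The filter only lists candidates for human review. It leaves the stored outcome untouched, in line with the report's choice not to apply retroactive corrections.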

Secondary judges expose scoring uncertainty

Every stored answer was re-scored by two independent judges. Absolute accuracy is judge-dependent; the ranking is stable.
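Ranking stability across judges can be checked with a rank correlation. The report does not state which statistic it uses, so Kendall's tau-a below is an assumed choice, and the scores are illustrative values rather than benchmark results.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two {model: accuracy} mappings.

    Pairs tied under either judge count as neither concordant nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both judges must score the same models")
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")

    concordant = discordant = 0
    for first, second in combinations(models, 2):
        product = (scores_a[first] - scores_a[second]) * (scores_b[first] - scores_b[second])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative scores (assumed): the second judge is stricter but orders models the same way.
primary = {"model-a": 0.9, "model-b": 0.8, "model-c": 0.7}
secondary = {"model-a": 0.8, "model-b": 0.7, "model-c": 0.5}
tau = kendall_tau_a(primary, secondary)
```

A tau near 1 alongside different absolute accuracies is the pattern the paragraph describes: the judges disagree on how strict to be, but agree on the order.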

The one-image version

Generated deterministically from the benchmark data — download the SVG ↗

Visual summary of the dental LLM benchmark results and clinical error audit.

Reproduce it

The repository carries the question dataset, raw JSONL answer transcripts, judge verdict files, analysis scripts, manuscript source, and this report's generator. One command rebuilds every table.

@misc{teixeirabarbosa_dental_llm_benchmark_2026,
  author = {Teixeira Barbosa, Francisco and Robles Cantero, Daniel and Brizuela Velasco, Aritza},
  title  = {Evaluating Frontier Language Models on Clinician-Reviewed Dental Questions: A Reproducible Benchmark},
  year   = {2026},
  url    = {https://github.com/Tuminha/llm-evaluation-for-dentistry}
}

The best score on 30 clinician-reviewed dental questions. The intervals overlap, so the ranking is not the story.