Frontier LLMs are strong on dental questions, but their errors are clinically meaningful.
We evaluated eight current language models on 30 clinician-reviewed dental questions spanning periodontology, implants, oral-systemic medicine, pharmacology, and patient communication. The goal was not to declare a universal winner but to make model behavior inspectable enough that dentists and researchers can judge where the systems are useful, where they are brittle, and where the verdict depends on the evaluator.
What we found
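GPT-5.2 had the highest deployment accuracy in this run at 96.7% (29 of 30 questions), followed by Claude Opus 4.8 and GPT-5.5 at 93.3% (28 of 30), and Gemini 3.1 Pro at 90.0% (27 of 30). Read these top results cautiously: with 30 questions, each answer is worth about 3.3 percentage points, so the leading models are separated by only one or two questions, and the bootstrap intervals overlap.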
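To give a feel for why intervals overlap at this sample size, here is a minimal sketch of a percentile bootstrap over per-question outcomes. It is an illustration under stated assumptions, not the analysis script from the repository: the function name, the number of resamples, the 95% level, and the choice of a percentile bootstrap are all assumptions, and the outcome vectors are simply the counts implied by 96.7% and 90.0% on 30 questions.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    outcomes: list of 1 (correct) / 0 (incorrect), one entry per question.
    Hypothetical helper for illustration; not taken from the released code.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample questions with replacement and record the accuracy each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Counts implied by the reported percentages on 30 questions.
model_a = [1] * 29 + [0] * 1   # 96.7%
model_b = [1] * 27 + [0] * 3   # 90.0%

ci_a = bootstrap_accuracy_ci(model_a)
ci_b = bootstrap_accuracy_ci(model_b)
print("29/30:", ci_a)
print("27/30:", ci_b)
```

With only 30 questions, each resampled accuracy moves in steps of one question, so the two intervals share most of their range even though the point estimates differ by two questions.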
The more useful signal is clinical. Incorrect rows clustered around periodontal treatment thresholds, pharmacology safety nuance, peri-implant evidence overstatement, diagnostic cutoffs, and patient communication omissions. One model, Claude Fable 5, refused five answerable dental questions; refusals counted as incorrect in the deployment metric.
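The effect of counting refusals as incorrect can be sketched as follows. This is an illustration only: the verdict labels, the function names, and the answered-only comparison metric are assumptions, and the example counts are made up to show the arithmetic rather than taken from any model's actual results.

```python
def deployment_accuracy(verdicts):
    """Share of ALL questions judged correct; a refusal counts as incorrect."""
    return sum(v == "correct" for v in verdicts) / len(verdicts)

def answered_only_accuracy(verdicts):
    """Share of answered questions judged correct; refusals are excluded."""
    answered = [v for v in verdicts if v != "refused"]
    if not answered:
        return float("nan")
    return sum(v == "correct" for v in answered) / len(answered)

# Illustrative data: 30 questions, 5 refusals, 2 wrong answers, 23 correct.
verdicts = ["correct"] * 23 + ["incorrect"] * 2 + ["refused"] * 5

deploy = deployment_accuracy(verdicts)
answered = answered_only_accuracy(verdicts)
print(f"deployment: {deploy:.3f}, answered-only: {answered:.3f}")
```

The deployment figure reflects what a user would experience, since a refusal on an answerable question is an unhelpful outcome, while the answered-only figure isolates the quality of the answers the model did give.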
Why judge dependence matters
The primary judge was Claude Opus 4.8. We also ran GPT-5.2 and GPT-5.5 as secondary judges on answered rows. Agreement with the primary judge was moderate: 81.7% to 83.8% verdict agreement, with Cohen's kappa from 0.506 to 0.524. That is enough agreement to be informative, but not enough to treat a single automated judge as ground truth.
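For readers less familiar with Cohen's kappa, the sketch below shows how it discounts raw agreement by the agreement expected from chance. The function name and the toy verdict lists are assumptions for illustration; they are not the study's data or its analysis code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy verdicts from two hypothetical judges on ten answers.
primary   = ["correct"] * 6 + ["incorrect"] * 4
secondary = ["correct"] * 5 + ["incorrect"] * 4 + ["correct"]

agreement = sum(a == b for a, b in zip(primary, secondary)) / len(primary)
kappa = cohens_kappa(primary, secondary)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.3f}")
```

In this toy case the judges agree on 8 of 10 answers, yet kappa is noticeably lower because two judges who both label most answers "correct" would agree often by chance alone. The same mechanism explains how verdict agreement above 80% can coexist with kappa near 0.5.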
The report therefore preserves the primary-judge endpoint and separately exposes judge-consistency candidates for manual review. This is important for any dental benchmark because small score differences can otherwise look more certain than they really are.
What is released
The repository includes the benchmark questions, raw answer JSONL files, judge verdicts, analysis scripts, LaTeX manuscript, PDF, visual summary infographic, and the interactive report. The release is intended to be reproducible and auditable rather than a closed leaderboard.
Open the interactive report ↓