J Oral Pathol Med. 2026 Apr 12. doi: 10.1111/jop.70140. Online ahead of print.
ABSTRACT
BACKGROUND: Artificial intelligence (AI) technologies, particularly large language models (LLMs) such as ChatGPT, are increasingly utilised in medical education and clinical information retrieval. Nevertheless, their capacity to accurately reproduce recommendations from established clinical practice guidelines (CPGs) has not been thoroughly examined. The present study evaluated the concordance between responses generated by GPT-5 and recommendations contained in German CPGs addressing oral potentially malignant disorders (OPMDs) and oral carcinomas (OCs).
METHODS: A cross-sectional analytical comparison was performed between GPT-5 outputs and German CPG recommendations available as of October 2025. Individual guideline statements were entered verbatim into GPT-5, which was asked to confirm or reject each statement. To assess methodological robustness, inverted versions of the same statements were additionally tested. The free version of GPT-5 was used without internet connectivity, ensuring that responses originated solely from the model's internal training data. Accuracy was defined as the proportion of correctly classified statements, and concordance between guideline content and model responses was quantified using Cohen's κ.
RESULTS: Two German CPGs comprising 111 recommendations were included: the S2k guideline for OPMDs (15 recommendations) and the S3 guideline for OCs (96 recommendations). GPT-5 correctly affirmed all authentic recommendations and rejected all inverted statements. Agreement between guideline statements and GPT-5 responses was perfect when the original recommendations were analysed (κ = 1.0) and remained very high when both original and inverted statements were evaluated jointly (κ = 0.96). The majority of references cited within the guidelines were published in English (> 93%) and originated from outside Germany (> 77%).
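For readers unfamiliar with the agreement statistic used above, the following minimal Python sketch computes Cohen's κ for a binary (valid/invalid) classification task of this kind. The confusion-matrix counts in the first example are hypothetical and not taken from the study; the second example mirrors the study's joint design of 111 authentic plus 111 inverted statements, all classified correctly.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 confusion matrix.

    a = both reference and model say 'valid'
    d = both say 'invalid'
    b, c = the two kinds of disagreement
    """
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    # chance agreement from the marginal totals of each rater
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical imperfect classifier, for illustration only:
print(round(cohens_kappa(50, 5, 10, 35), 3))  # kappa well below 1

# 111 authentic + 111 inverted statements, all classified correctly:
print(cohens_kappa(111, 0, 0, 111))  # 1.0
```

Note that when every item falls into a single true category (as with the 111 original recommendations alone), expected agreement approaches 1 and κ becomes unstable, which is why testing inverted statements strengthens the design.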
CONCLUSION: When guideline recommendations were presented verbatim, GPT-5 demonstrated complete concordance with German oral oncology CPGs. These findings indicate that the model is capable of recognising and retrieving established guideline information. However, this experimental design evaluates recognition of existing statements rather than autonomous clinical reasoning. At present, LLMs should therefore be regarded primarily as educational and informational tools rather than a replacement for expert clinical judgement in oral oncology.
PMID:41967843 | DOI:10.1111/jop.70140