Diagnostics (Basel). 2026 Apr 15;16(8):1171. doi: 10.3390/diagnostics16081171.
ABSTRACT
Background/Objectives: Multimodal large language models (MLLMs) are increasingly evaluated for diagnostic tasks in medical imaging, including radiographic interpretation. However, most studies focus on binary fracture detection and rarely assess clinically relevant fracture characteristics such as displacement or intra-articular extension, which influence treatment decisions. In addition, most evaluations rely on single-run inference designs that cannot assess response reproducibility. This study evaluated the diagnostic performance and inter-run reliability of five MLLMs for radiographic assessment of distal radius fractures. Methods: Fifty fracture-positive distal radius radiographs were evaluated by five MLLMs (ChatGPT 5.3, Gemini 3.1 Pro, Claude Opus 4.6, Grok 4.1, and ERNIE 5.0) across five independent zero-shot inference runs (50 radiographs × 5 models × 5 runs; n = 1250 observations). Diagnostic tasks included fracture detection, intra-articular extension, and displacement. Sex and age were exploratory endpoints. Performance was summarized using sensitivity (fracture detection) and accuracy (other tasks), with inter-run reliability assessed via Fleiss' κ. Results: Performance varied across tasks and models. Fracture detection sensitivity ranged from 39.6% to 99.6%, with two models exceeding 90%. Intra-articular extension accuracy ranged from 51.6% to 55.6%, consistent with chance-level performance. Displacement classification accuracy ranged from 34.8% to 70.4%. One model achieved substantial inter-run agreement across binary tasks (κ > 0.60), whereas two models showed only slight agreement (κ < 0.20). Conclusions: Only two models exceeded 90% sensitivity for fracture detection, while intra-articular extension accuracy remained at chance level (≤55.6%). Substantial inter-run reliability (κ > 0.60) was observed in only one model. These findings indicate that current MLLMs do not reliably support multidimensional fracture assessment and that single-run evaluations overestimate robustness.
PMID:42072797 | PMC:PMC13115499 | DOI:10.3390/diagnostics16081171
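
The two summary metrics named in the Methods, sensitivity on a fracture-positive set and Fleiss' κ across repeated runs, can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `fleiss_kappa` and `sensitivity` and the toy ratings are hypothetical, and the example assumes each run of one model is treated as a "rater" and each radiograph as a "subject", with binary labels (1 = fracture reported, 0 = not reported).

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for one model.

    ratings: list of rows, one row per radiograph (subject),
    each row holding one label per inference run (rater).
    Every row must have the same number of labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})

    # counts[i][j] = number of runs assigning subject i to category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Observed agreement per subject, then averaged over subjects
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Expected agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Only one category was ever used: kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def sensitivity(ratings):
    """Share of positive calls when every radiograph is truly fracture-positive."""
    flat = [label for row in ratings for label in row]
    return sum(flat) / len(flat)


# Toy data (NOT study data): 4 radiographs x 5 runs for one hypothetical model
toy_runs = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
]

if __name__ == "__main__":
    print("sensitivity:", sensitivity(toy_runs))
    print("Fleiss' kappa:", fleiss_kappa(toy_runs))
```

In the toy data the model flags a fracture in 60% of observations, yet its run-to-run agreement is below chance (κ is negative), which is the kind of gap between a single-run accuracy figure and reproducibility that the abstract describes. In practice an established implementation, such as the one in `statsmodels`, would normally be preferred over hand-rolled code.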

