J Hand Surg Glob Online. 2026 Mar 6;8(3):100978. doi: 10.1016/j.jhsg.2026.100978. eCollection 2026 May.
ABSTRACT
PURPOSE: To evaluate and compare the responses of ChatGPT and Google Gemini to common patient questions about scaphoid fracture and scaphoid nonunion, and to compare responses between hand fellowship-trained orthopedic and plastic surgeons.
METHODS: A list of 30 common patient questions about scaphoid fracture and nonunion was developed and classified using Norman Webb's Depth of Knowledge (DOK) levels 1-4. Each question was input into ChatGPT-4o and Google Gemini 2.0 Flash. An evaluation guide was created with four domains for each response, each rated on a Likert scale from 1 to 5: accuracy, clarity, Artificial Intelligence Response Metric, and comparison to an in-person clinician interaction. Responses were evaluated by three orthopedic and three plastic hand surgeons. Statistical comparisons were performed using nonparametric tests to assess differences between AI platforms, domains, question complexity, and surgical specialty.
RESULTS: There were no significant differences between mean Likert scale scores for ChatGPT and Google Gemini. Plastic surgeons rated responses higher than orthopedic surgeons, both overall and for ChatGPT specifically. Google Gemini performed better for DOK level 2 and level 3 questions, whereas ChatGPT's responses had greater clarity. For both platforms, ratings for clinician comparability across all DOK levels were significantly lower than scores for all other metrics.
CONCLUSIONS: Our findings suggest that ChatGPT and Google Gemini offer clinical utility for patient care regarding scaphoid fracture and nonunion. However, clinician comparability was not a strength of either platform, highlighting a key area for improvement for AI-based large language models in clinical application.
TYPE OF STUDY/LEVEL OF EVIDENCE: Diagnostic V.
PMID:41847241 | PMC:PMC12989955 | DOI:10.1016/j.jhsg.2026.100978

