J Orthop Surg (Hong Kong). 2026 Jan-Apr;34(1):10225536261416605. doi: 10.1177/10225536261416605. Epub 2026 Jan 9.
ABSTRACT
Background: Integrating large language models (LLMs) into decision-making and education has shown promise across various healthcare disciplines. This study aimed to evaluate the performance of four leading LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) in accurately responding to structured multiple-choice and open-ended queries about complex case scenarios in hand surgery.

Methods: A prospective cross-sectional analysis was conducted using 50 clinically relevant, guideline-based case scenarios developed for hand surgery. Each scenario consisted of four open-ended and two multiple-choice questions, totaling 300 questions per LLM (50 scenarios × 6 questions). Responses were independently assessed by blinded expert reviewers using a standardized six-point Likert scale evaluating accuracy, completeness, and adherence to international surgical guidelines.

Results: In multiple-choice queries, Gemini (5.9 ± 0.2) and Grok (5.9 ± 0.1) outperformed ChatGPT (5.7 ± 0.3; p = 0.031 and p = 0.009, respectively) and DeepSeek (5.6 ± 0.4; p = 0.004 and p = 0.001, respectively). In open-ended queries, Gemini (accuracy 5.6 ± 0.3) and Grok (accuracy 5.5 ± 0.4) demonstrated superior results across all measured dimensions (accuracy, completeness, and guideline adherence), markedly surpassing ChatGPT (accuracy 5.1 ± 0.5; p < 0.001) and DeepSeek (accuracy 4.9 ± 0.6; p < 0.001). Notably, Gemini and Grok performed consistently well with minimal variability, whereas ChatGPT and, in particular, DeepSeek exhibited considerable inconsistency in complex clinical judgments.

Conclusion: Gemini 2 and Grok 3 showed reliable and clinically relevant performance, positioning them as promising adjunctive tools for decision-making and education in hand surgery. The limitations of ChatGPT-5 and the more significant shortcomings of DeepSeek underscore the necessity for cautious deployment and continued refinement.
PMID:41512247 | DOI:10.1177/10225536261416605

