Hand Surg Rehabil. 2025 Oct 28:102530. doi: 10.1016/j.hansur.2025.102530. Online ahead of print.
ABSTRACT
INTRODUCTION: Large language models (LLMs) have gained increasing popularity in several medical disciplines. In orthopedic research, however, their integration into routine practice has been questioned, as they do not seem to outperform experienced clinicians. Meanwhile, research on the role of artificial intelligence in hand surgery remains limited. This study aims to evaluate two LLMs commonly used in medicine, Generative Pre-trained Transformer (ChatGPT) and Claude, in the clinical hand surgery setting.
METHODS: Ten questions pertinent to common hand surgical diagnoses were formulated as prompts and entered into ChatGPT and Claude in a systematic manner. The generated responses were evaluated anonymously by hand surgeons, who assessed their quality according to the QUEST criteria. Gwet's AC2 was used to evaluate inter-rater agreement.
RESULTS: Overall, ChatGPT and Claude performed statistically similarly across the five QUEST dimensions: (1) Quality of information, (2) Understanding and reasoning, (3) Expression style and persona, (4) Safety and harm, and (5) Trust and confidence, although with relatively modest scores. Agreement between hand surgeons across all measurements was low according to Gwet's AC2 (0.29).
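Note: the abstract names Gwet's AC2 but does not detail its computation. Purely as an illustrative sketch (not the study's analysis code), the two-rater form with linear ordinal weights can be computed roughly as below; the function name, weighting scheme, and example ratings are assumptions, and published analyses typically rely on an established implementation such as the irrCAC package.

    import numpy as np

    def gwet_ac2(ratings_a, ratings_b, categories):
        """Illustrative Gwet's AC2 for two raters on an ordinal scale (linear weights).
        A sketch under stated assumptions, not the study's analysis code."""
        q = len(categories)
        idx = {c: i for i, c in enumerate(categories)}
        # Linear agreement weights: 1 on the diagonal, shrinking with category distance
        w = 1 - np.abs(np.subtract.outer(np.arange(q), np.arange(q))) / (q - 1)
        # Joint proportion table: p[k, l] = share of items rated k by rater A and l by rater B
        n = len(ratings_a)
        p = np.zeros((q, q))
        for a, b in zip(ratings_a, ratings_b):
            p[idx[a], idx[b]] += 1.0 / n
        pa = np.sum(w * p)                        # weighted observed agreement
        pi = (p.sum(axis=1) + p.sum(axis=0)) / 2  # average marginal propensities
        pe = (w.sum() / (q * (q - 1))) * np.sum(pi * (1 - pi))  # chance agreement
        return (pa - pe) / (1 - pe)

    # Hypothetical example: two raters scoring six responses on a 1-5 scale
    print(gwet_ac2([4, 3, 5, 2, 4, 3], [4, 4, 5, 2, 3, 3], categories=[1, 2, 3, 4, 5]))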
CONCLUSIONS: ChatGPT and Claude perform similarly when presented with various common hand surgery-related questions. However, they demonstrate significant limitations in clinical accuracy and reliability, which form the core foundation of patient safety, treatment efficiency, and evidence-based practice. Furthermore, as the perceived performance of ChatGPT and Claude seems to differ between individual hand surgeons, these LLMs in their current state are not suitable for routine clinical use in hand surgery.
LEVEL OF EVIDENCE: V.
PMID:41167474 | DOI:10.1016/j.hansur.2025.102530