J Hand Surg Glob Online. 2025 Oct 10;7(6):100845. doi: 10.1016/j.jhsg.2025.100845. eCollection 2025 Nov.
ABSTRACT
PURPOSE: The role of artificial intelligence (AI) in medicine is rapidly evolving, with potential to improve both the clinician and patient experience. We sought to evaluate whether popular AI text-to-image generators could create anatomically accurate images of common hand surgery procedures. We hypothesized that the AI-generated images would not be adequate as patient education materials.
METHODS: We queried six AI text-to-image generators: Craiyon, DALL-E, DeepSeek, Gemini, Midjourney, and Stable Diffusion. Each was given the prompt, "Create an anatomically accurate image with labels of [Condition] surgical approach to be used as a visual aid for patient education," with the following conditions inserted: carpal tunnel syndrome, Dupuytren contracture, trigger finger, thumb carpometacarpal arthritis, and de Quervain tenosynovitis. Images were then graded on five criteria: legibility, detail and clarity, anatomical realism and accuracy, appropriate surgical site, and lack of fabricated anatomy. Each image could score a maximum of 2 points per criterion, and total scores were compared against an assumed Control score of 10 points.
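The scoring arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' grading tool: the criterion identifiers, the `total_score` helper, the assumption that each criterion is graded on an integer 0 to 2 scale, and the example grades are all hypothetical. Only the five criteria, the 2-point maximum, and the 10-point Control score come from the abstract.

```python
# Criteria named in METHODS; identifiers are hypothetical spellings.
CRITERIA = (
    "legibility",
    "detail_and_clarity",
    "anatomical_realism_and_accuracy",
    "appropriate_surgical_site",
    "lack_of_fabricated_anatomy",
)
MAX_PER_CRITERION = 2  # stated in the abstract
# Assumed Control score: the maximum on every criterion (5 x 2 = 10).
CONTROL_TOTAL = MAX_PER_CRITERION * len(CRITERIA)


def total_score(scores: dict[str, int]) -> int:
    """Sum per-criterion grades after checking that they are valid.

    Assumes integer grades from 0 to 2; the abstract states only the maximum.
    """
    if set(scores) != set(CRITERIA):
        raise ValueError("scores must cover exactly the five criteria")
    for name, value in scores.items():
        if not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_CRITERION}")
    return sum(scores.values())


# Hypothetical grades for one image: detailed and legible, but with
# fabricated anatomy and a partly incorrect surgical site.
example = {
    "legibility": 1,
    "detail_and_clarity": 2,
    "anatomical_realism_and_accuracy": 1,
    "appropriate_surgical_site": 1,
    "lack_of_fabricated_anatomy": 0,
}
example_total = total_score(example)
```

Under these assumptions, an image's total is compared against `CONTROL_TOTAL`, so an image that is visually detailed can still fall well short of the Control if it contains invented structures.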
RESULTS: A total of 1,500 images were generated and reviewed. When comparing total scores, all AI generators scored significantly lower than the Control, except for DALL-E's images of Dupuytren contracture. For the image detail and clarity category, DALL-E, DeepSeek, Gemini, and Midjourney all scored similarly to the Control and to each other. For the remaining criteria (legibility, anatomical realism and accuracy, appropriate surgical site, and lack of fabricated anatomy), each of the AI generators scored significantly lower than the Control. In total, 99.8% of images contained at least some degree of fabricated anatomy. DALL-E consistently had the highest scores in each category, while Craiyon had the lowest.
CONCLUSIONS: Although the AI generators successfully produced highly detailed and visually engaging images, they failed to portray accurate anatomy and often included fictitious structures. Further work is needed to train and fine-tune AI models to produce accurate and appropriate images.
TYPE OF STUDY/LEVEL OF EVIDENCE: Therapeutic V.
PMID:41141329 | PMC:PMC12547223 | DOI:10.1016/j.jhsg.2025.100845