Life (Basel). 2026 May 2;16(5):763. doi: 10.3390/life16050763.
ABSTRACT
Multimodal large language models (MLLMs) are increasingly considered for cancer diagnostic support, yet their suitability for histopathological image interpretation remains inadequately characterized. We evaluated six contemporary general-purpose MLLMs (Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, ChatGPT 5.3, Grok 4.2, Gemini 3.1 Pro) on 58 paired hematoxylin and eosin (H&E)-stained breast cancer histopathology images (26 malignant, 32 benign) and corresponding nuclei segmentation masks. Each case was classified five times per model under three conditions, image only (IMAGE), mask only (MASK), and both combined (BOTH), yielding 5220 observations. Mean accuracy dropped from 69.4% (IMAGE) to 49.6% (MASK), below the majority-class baseline of 55.2%. Providing the mask together with the image did not improve classification (68.0%), and for ChatGPT 5.3 produced a net loss of 31 correct predictions. Models maintained elevated mean confidence (67.6) under MASK despite near-random accuracy, and reasoning categories shifted in 67.5% of matched case-run pairs between modalities. Under the conditions tested, current general-purpose MLLMs exhibit strong dependence on visual surface features, fail to effectively integrate spatial structural information, and maintain confidence independent of accuracy. These behavioral limitations are directly relevant to the safe deployment of MLLMs in cancer diagnostic workflows.
PMID:42195318 | PMC:PMC13208433 | DOI:10.3390/life16050763

