Behav Sci (Basel). 2026 May 19;16(5):822. doi: 10.3390/bs16050822.
ABSTRACT
Artificial intelligence (AI) is increasingly integrated into assessment contexts, yet evidence on the reliability and measurement properties of large language models (LLMs) in high-stakes evaluation settings remains limited. This study examines the performance and reproducibility of contemporary multimodal LLMs in a structured medical assessment environment. A cross-sectional dual-setup design was applied using a complete national medical licensing examination (240 multiple-choice items, including image-based questions). Setup 1 evaluated ten models in a single run to characterize overall performance. Setup 2 assessed six models across five independent runs each to quantify measurement stability. Accuracy with 95% confidence intervals, inter-run agreement using Cohen's kappa, and paired comparisons using McNemar's test were analyzed. Accuracy ranged from 72.08% to 92.92%. All models demonstrated near-perfect inter-run agreement (mean κ ≥ 0.96) with minimal variability. After correction for multiple comparisons, only a small number of pairwise comparisons remained significant, indicating convergence among leading systems. In an exploratory submodule, performance on the small set of image-based items was comparable to or slightly higher than performance on text-only items. These findings demonstrate that multimodal LLMs achieve high accuracy and high inter-run reproducibility on a large-scale assessment, supporting their use as objects of AI-based assessment research while leaving questions of cognitive equivalence with human examinees beyond the scope of accuracy-based evaluation.
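The three statistics named in the abstract (accuracy with a 95% confidence interval, Cohen's kappa between runs, and McNemar's test between paired models) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function names are hypothetical, and the choice of a Wilson score interval and of the exact binomial form of McNemar's test are assumptions, since the abstract does not state which variants were used. The counts 223 and 173 are inferred from the reported accuracies of 92.92% and 72.08% on 240 items.

```python
import math


def wilson_ci(correct, n, z=1.96):
    """95% Wilson score interval for a proportion (assumed CI method)."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def cohen_kappa(run_a, run_b):
    """Cohen's kappa between the answer choices of two runs on the same items."""
    n = len(run_a)
    # Observed agreement: share of items answered identically in both runs.
    p_o = sum(a == b for a, b in zip(run_a, run_b)) / n
    # Chance agreement from each run's marginal answer frequencies.
    labels = set(run_a) | set(run_b)
    p_e = sum((run_a.count(l) / n) * (run_b.count(l) / n) for l in labels)
    if p_e == 1:
        return 1.0  # both runs always give the same single answer
    return (p_o - p_e) / (1 - p_e)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from per-item correctness flags."""
    # Only discordant items (one model right, the other wrong) carry information.
    b = sum(x and not y for x, y in zip(correct_a, correct_b))
    c = sum(y and not x for x, y in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Counts inferred from the reported accuracy range on 240 items.
    for correct in (173, 223):
        lo, hi = wilson_ci(correct, 240)
        print(f"{correct}/240: accuracy {correct / 240:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

In this sketch, kappa is computed on the chosen answer options rather than on correctness flags, so it measures whether a model gives the same answer across runs. That is one plausible reading of "inter-run agreement", not a confirmed detail of the study.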
PMID:42193701 | PMC:PMC13203780 | DOI:10.3390/bs16050822

