Large language models for risk-of-bias assessment in randomised clinical trials - a comparative validation study

Written on 29/03/2026
by Lauri Nyrhi

EBioMedicine. 2026 Apr;126:106238. doi: 10.1016/j.ebiom.2026.106238. Epub 2026 Mar 28.

ABSTRACT

BACKGROUND: Large language models (LLMs) are emerging tools for evidence synthesis. Risk of bias (RoB) assessment of trials remains an essential but time-consuming step, and judgements are inconsistent even amongst experts. Early LLM studies showed mixed reliability. Advances in reasoning-enabled models warrant evaluation of their accuracy and consistency for RoB screening across randomised trials to reduce reviewer workload.

METHODS: We conducted a preregistered comparative validation study (March 11 to May 19, 2025) of four LLMs (ChatGPT o3, DeepSeek v3, Google Gemini Flash 2.0, and Grok 3) prompted with full-text randomised clinical trial articles and protocols. Two corpora were analysed: 100 RCTs from recent Cochrane reviews (RoB 1) and 100 RCTs from meta-analyses in high-impact journals (RoB 2). The reference standard was published human RoB judgements. The primary outcome was interobserver reliability (Cohen κ, 95% CI); secondary outcomes were intraobserver agreement and diagnostic accuracy (sensitivity, specificity, predictive values, F-score).
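The primary outcome, Cohen's κ, corrects observed rater agreement for the agreement expected by chance. A minimal sketch of the computation, with purely hypothetical RoB judgements (not data from this study):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labelled independently.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-trial RoB calls from a human reviewer and an LLM.
human = ["low", "low", "high", "high", "low", "high"]
model = ["low", "high", "high", "low", "low", "high"]
print(round(cohen_kappa(human, model), 2))  # → 0.33
```

Values near 0 indicate chance-level agreement; values approaching 1 indicate near-perfect agreement, which is why the κ estimates reported below (roughly 0.06 to 0.39) count as modest.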

FINDINGS: For RoB 1, interobserver agreement ranged from κ 0.27 (95% CI 0.07-0.46) with Gemini Flash 2.0 to κ 0.39 (0.20-0.59) with DeepSeek v3. For RoB 2, agreement was lower, from κ 0.06 (-0.07 to 0.18) with ChatGPT o3 to κ 0.13 (-0.04 to 0.31) with Gemini. Diagnostic performance was limited, with sensitivity ranging from 0.05 to 0.55, specificity from 0.78 to 0.99, PPV from 0.31 to 0.50, and NPV from 0.48 to 0.61 across models, with models consistently over-flagging concerns.
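The diagnostic metrics above all derive from a 2×2 confusion table against the human reference standard. A short sketch, treating a "high risk of bias" call as the positive class; the counts are hypothetical (chosen only so the resulting values fall inside the ranges reported above), not the study's data:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV and F-score from 2x2 counts,
    with 'high risk of bias' as the positive class."""
    sens = tp / (tp + fn)  # rule-in: flagged among truly biased trials
    spec = tn / (tn + fp)  # rule-out: cleared among truly unbiased trials
    ppv = tp / (tp + fp)   # flagged trials that are truly biased
    npv = tn / (tn + fn)   # cleared trials that are truly unbiased
    f1 = 2 * ppv * sens / (ppv + sens)
    return sens, spec, ppv, npv, f1

# Hypothetical counts for one model over a corpus of 100 trials.
sens, spec, ppv, npv, f1 = diagnostic_metrics(tp=10, fp=12, fn=32, tn=46)
```

High specificity with low sensitivity and low NPV, as in the ranges reported, means a "low risk" call from these models cannot safely rule bias out, which motivates the supervised-use recommendation below.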

INTERPRETATION: None of the evaluated LLMs were sufficiently reliable for fully autonomous RoB assessment. DeepSeek v3 and ChatGPT o3 approximated human performance best on RoB 1, but RoB 2 rule-in and rule-out performance remained modest. Current use should be supervised, with possible application of LLMs for triage or as a second assessor. Major improvements in protocol retrieval, task-specific tuning, and calibrated thresholds, prospectively validated, are needed for safe stand-alone deployment.

FUNDING: This study received no financial support.

PMID:41905260 | PMC:PMC13054287 | DOI:10.1016/j.ebiom.2026.106238