October 29, 2025
Nazmus Khan, BSc(Hons), MBChB, from McMaster University, compared the performance of four large language models (LLMs) in answering 46 IBD-related clinical questions from the American Gastroenterological Association (AGA) Digestive Diseases Self-Education Program. ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic), and OpenEvidence were tested in May 2025 on accuracy and use of references in answering questions about ulcerative colitis (UC), Crohn’s disease (CD), IBD complications, and extraintestinal manifestations. Each question was presented in both multiple-choice and short open-answer formats. All four models answered multiple-choice questions more accurately (65.2% to 87.0%) than open-answer questions (54.4% to 65.2%). OpenEvidence performed best on the multiple-choice questions (87.0%), whereas Gemini was the most accurate on the open-answer questions (65.2%). OpenEvidence provided references in 100% of its responses, whereas Gemini and Claude did so inconsistently and ChatGPT provided none. The authors concluded that these LLMs need optimization before deployment in IBD clinical or educational settings. (P1028)
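The accuracy figures above are simple proportions of correct answers per model and question format. A minimal sketch of that tallying, with invented model names and toy answer data (none of it from the study):

```python
def accuracy(responses, answer_key):
    """Fraction of questions a model answered correctly."""
    correct = sum(responses[q] == a for q, a in answer_key.items())
    return correct / len(answer_key)

# Toy answer key and model responses -- invented for illustration only.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}
model_responses = {
    "Model X": {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "A"},
    "Model Y": {"Q1": "B", "Q2": "C", "Q3": "A", "Q4": "C"},
}

for name, resp in model_responses.items():
    print(name, f"{accuracy(resp, answer_key):.1%}")
```

In the real study, open-answer responses would additionally require human or rubric-based grading before the same proportion could be computed.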
Mao-Yuan Chen, MD, from the University of California San Francisco, used GPT-4o1 to analyze the root causes of diagnostic delay, based on gastroenterology, primary care, and inpatient notes written 3 years before and 2 years after the date of IBD diagnosis for 406 patients at UCSF. The model was asked to extract how each patient first presented, the time of IBD symptom onset, the time of the first visit to an external GI or non-GI provider for IBD, the time of the first visit to an internal GI provider, and the times of endoscopy and diagnosis. The proportion of patients moving between, and the time spent between, symptom onset, non-GI or external GI visit, internal GI visit, endoscopy, and diagnosis were then mapped. The analysis revealed that the transition of patients between non-GI/external GI and internal GI visits accounted for the greatest delays and could be targeted to accelerate diagnosis. (P1079)
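The delay analysis above amounts to computing the intervals between dated care milestones for each patient and summarizing them across the cohort. A minimal sketch of that interval mapping, using hypothetical field names and invented dates (not the study's data or code):

```python
from datetime import date
from statistics import median

# Hypothetical milestone dates for two illustrative patients;
# field names and values are invented for this sketch.
patients = [
    {"symptom_onset": date(2019, 1, 10), "external_visit": date(2019, 6, 2),
     "gi_visit": date(2020, 3, 15), "endoscopy": date(2020, 4, 1),
     "diagnosis": date(2020, 4, 1)},
    {"symptom_onset": date(2020, 5, 1), "external_visit": date(2020, 7, 20),
     "gi_visit": date(2021, 2, 10), "endoscopy": date(2021, 3, 5),
     "diagnosis": date(2021, 3, 12)},
]

STAGES = ["symptom_onset", "external_visit", "gi_visit", "endoscopy", "diagnosis"]

def interval_days(record):
    """Days spent in each transition between consecutive milestones."""
    return {f"{a} -> {b}": (record[b] - record[a]).days
            for a, b in zip(STAGES, STAGES[1:])}

# Median delay per transition across the cohort highlights the bottleneck.
medians = {k: median(interval_days(p)[k] for p in patients)
           for k in interval_days(patients[0])}
for transition, days in medians.items():
    print(transition, days)
```

With real data, the transition with the largest median (here, the hand-off from external/non-GI care to the internal GI clinic) is the natural target for shortening time to diagnosis.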
Aryan Ayati, MD, MPH, from the University of California San Francisco, asked whether AI could be used to identify patients with undiagnosed IBD in the electronic medical record. The team conducted a retrospective cohort study in which GPT-4o was trained on 216 structured features predictive of IBD. IBD risk was scored at every patient encounter for 298 patients with pre-diagnosis IBD symptoms and 3000 randomly sampled controls with no IBD-related symptoms. The highest-performing model achieved an area under the precision-recall curve (AUPRC) of 0.90 and an area under the receiver operating characteristic curve (AUROC) of 0.98. The model saved a median of 20 months in diagnostic lead time. The low false-positive rate supports the use of the model in screening for IBD. (P1181)
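AUPRC and AUROC summarize how well the model's risk scores separate cases from controls across all possible decision thresholds. A minimal pure-Python sketch of both metrics on toy labels and scores (not the study's data or implementation):

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (Mann-Whitney U statistic; ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(labels, scores):
    """Area under the precision-recall curve via average precision:
    mean of the precision at each rank where a true positive appears."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, ap = 0, 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / i  # precision at this recall increment
    return ap / sum(labels)

# Toy example: higher score = higher predicted IBD risk; 1 = true case.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
print(round(auroc(labels, scores), 3))
print(round(auprc(labels, scores), 3))
```

AUPRC is the more informative of the two when cases are rare, as in this cohort (298 cases vs 3000 controls), because it is not inflated by the large number of easy true negatives.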
Darren Thomason, MDA, from Interactive Health Inc, described the development of a machine learning algorithm to classify the presence or absence of ulcers in ileocolonoscopy videos from patients with CD in the phase 2 SERENITY trial. The algorithm was trained on 472 videos and validated on 119 videos. In the SERENITY trial, each video was assigned a score by one of three readers; this score determined whether each video was labeled as showing ulcers in the colon or ileum. The machine learning model achieved an AUROC of 0.94 for detecting ulcers in the colon and an AUROC of 0.87 for detecting ulcers in the ileum. Future development is focused on expanding the model to discriminate disease severity. (P3296)
Bhavana Baraskar, MD, from Mary Washington Healthcare, described the results of a systematic review of 16 studies evaluating machine learning and deep learning models for detecting mucosal healing, graded by the Mayo Endoscopic Subscore (MES), in endoscopy images from patients with UC. The studies included more than 200,000 endoscopy images and videos from more than 11,000 patients with UC. The most commonly used models were convolutional neural networks (CNNs), with AUROCs ranging from 0.70 to 0.99. These results indicate high accuracy for detecting mucosal healing, outperforming expert assessments, particularly in distinguishing early disease (MES 1). (P3318)
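CNNs score each endoscopy frame by sliding small learned filters over the image and building up feature maps. A minimal pure-Python sketch of that core convolution operation on a toy grayscale patch (the filter weights here are illustrative, not learned):

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image,
    summing elementwise products at each position (the core CNN op)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Toy 4x4 "frame" with a dark-to-bright boundary, and a simple
# vertical-edge filter (weights chosen by hand for illustration).
frame = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
edge_filter = [[1, -1],
               [1, -1]]

response = conv2d(frame, edge_filter)
# Large-magnitude responses mark the boundary between regions.
print(response)
```

A real CNN stacks many such filters with learned weights, nonlinearities, and pooling layers, ending in a classifier head that maps the feature maps to an MES grade or a healed/not-healed probability.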
Aryan Gupta, from the Bangalore Medical College and Research Institute, conducted a meta-analysis of articles assessing the diagnostic accuracy of AI models in UC. The review included 5 studies of various AI models applied to 10,632 diagnoses of UC. The pooled accuracy for diagnosing UC across all studies was 0.78, with CNN models being the most accurate. (P3352)
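A pooled accuracy like the 0.78 reported is typically a weighted combination of per-study estimates, with larger studies contributing more. A minimal sketch of sample-size-weighted pooling on invented study numbers (not the meta-analysis's actual data or method):

```python
def pooled_accuracy(studies):
    """Weight each study's accuracy by its number of diagnoses.
    This is a simple fixed-effect-style pooling; published meta-analyses
    often use random-effects models to account for between-study variance."""
    total = sum(n for _, n in studies)
    return sum(acc * n for acc, n in studies) / total

# (accuracy, number of diagnoses) pairs -- invented for illustration.
studies = [(0.82, 2400), (0.74, 1800), (0.70, 900), (0.85, 1500), (0.76, 1200)]
print(round(pooled_accuracy(studies), 3))
```

The choice of weighting scheme matters when study sizes and accuracies vary widely, which is why meta-analyses usually report heterogeneity statistics alongside the pooled estimate.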