Large language model performance on pediatric questions: a new challenge
Colosimo, Simone; Indolfi, Cristiana; Miraglia del Giudice, Michele; Rossi, Francesca
2025
Abstract
Background: This study investigates the application and efficacy of large language models (LLMs) in pediatric medicine, focusing on their capability to assist in training and decision support for healthcare professionals. Given the unique challenges in pediatric care, such as age-specific conditions and dosing requirements, we aim to evaluate the performance of various LLMs in this specialized field.

Methods: We conducted a comparative analysis of several LLMs, including Claude 3-OPUS, ChatGPT 3.5 and 4, Gemini AI, Llama 2 70B, and Mixtral 8x7B. The models were tested on 227 multiple-choice pediatric questions in Italian before and after undergoing specialized training. The training data consisted of pediatric articles from a medical journal, ensuring compliance with HIPAA regulations by using de-identified and anonymous data.

Results: The performance of the LLMs varied significantly. ChatGPT 3.5 improved from 65.20% to 83.70% accuracy (P<0.01) after training, while ChatGPT 4 increased from 77.09% to 91.62% (P<0.01). Gemini 1.0 and Mixtral 8x7B recorded accuracies of 70.48% and 71.37% respectively, both showing significant improvements post-training. Llama 2 70B had a lower performance, improving from 47.58% to 52.86%. Claude 3-OPUS demonstrated robust performance with an 82.82% accuracy pre-training, improving to 95.59% post-training.

Conclusions: Our analysis confirms the effectiveness of LLMs in pediatric medicine, highlighting their potential in training and decision support. The study emphasizes the need for specific training datasets that reflect the complexities of pediatric conditions to tailor the models accurately. Moreover, there is a significant opportunity to utilize open-source models as a foundation for developing customized systems through training on dedicated datasets.
These strategies promise to enhance the accuracy and accessibility of pediatric healthcare, ultimately improving outcomes for young patients.
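As a rough illustration of how a pre/post accuracy gain on a fixed question set can be checked for significance: the abstract does not state which statistical test was used, so the sketch below assumes an unpaired two-proportion z-test on the 227 questions (a paired test such as McNemar's, using per-question agreement, would be stricter for the same data). The correct-answer counts are reconstructed from the reported percentages and are therefore approximate.

```python
from math import sqrt, erf

def two_proportion_z(correct_pre: int, correct_post: int, n: int):
    """Unpaired two-proportion z-test comparing two accuracies on n questions.

    Returns the z statistic and the two-sided p-value under the pooled
    null hypothesis that both accuracies are equal.
    """
    p_pre, p_post = correct_pre / n, correct_post / n
    p_pool = (correct_pre + correct_post) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p_post - p_pre) / se
    # Two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# ChatGPT 3.5: 65.20% -> 83.70% of 227 questions (approx. 148 vs 190 correct)
z, p = two_proportion_z(round(0.6520 * 227), round(0.8370 * 227), 227)
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```

Under these assumptions the 65.20% to 83.70% jump yields a p-value well below 0.01, consistent with the significance level reported in the abstract.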


