Artificial intelligence for solving pediatric clinical cases: A Retrieval-Augmented approach utilizing Llama3.2 and structured references
Colosimo S.; Frattolillo V.; Masino M.; Miraglia del Giudice E.; Marzuillo P.
2025
Abstract
Background: The “hallucinations” of Large Language Models (LLMs) raise concerns about their accuracy in pediatrics. This study aimed to evaluate whether integrating information from the Nelson Textbook of Pediatrics through a Retrieval-Augmented Generation (RAG) system could enhance the performance of Llama3.2 in addressing complex pediatric clinical cases. Methods: We assessed the performance of the RAG system using 1,713 multiple-choice pediatric clinical questions from the MedQA dataset (n = 1,572) and Archives of Disease in Childhood–Education and Practice (n = 141). Each question was presented to Llama3.2 both in standalone mode and with RAG integration. The percentages of correct answers in the two configurations were compared using the chi-square test. A p-value < 0.05 was considered statistically significant. Results: The RAG-integrated system significantly outperformed standalone Llama3.2, achieving an overall accuracy of 67.78% (1,161/1,713) compared to 46.18% (791/1,713) for Llama3.2 alone (p = 1.5e-112). The improvement was consistent across all pediatric subspecialties. Conclusions: Incorporating RAG systems into clinical decision-making can enhance reliability and safety.
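The accuracy figures in the Results can be re-derived from the reported counts, and the comparison described in the Methods can be sketched in a few lines. The block below is a minimal illustration, not the authors' analysis code: the helper name `chi_square_2x2` is hypothetical, and it assumes a Pearson chi-square test on a 2x2 table without continuity correction. The abstract does not state which variant of the test was used, so the p-value computed here should not be expected to match the reported figure.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are the two configurations; columns are correct / incorrect answers.
    Hypothetical helper written for this illustration only.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [correct_a + correct_b, grand_total - correct_a - correct_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts taken from the abstract.
n_questions = 1713
rag_correct = 1161
base_correct = 791

rag_accuracy = 100 * rag_correct / n_questions
base_accuracy = 100 * base_correct / n_questions
statistic, p_value = chi_square_2x2(
    rag_correct, n_questions, base_correct, n_questions
)

print(f"RAG accuracy:        {rag_accuracy:.2f}%")
print(f"Standalone accuracy: {base_accuracy:.2f}%")
print(f"chi-square = {statistic:.1f}, p = {p_value:.2e}")
```

The two accuracies round to the percentages given in the abstract. Under any common variant of the test, the difference is significant at the stated 0.05 threshold.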


