Artificial intelligence for solving pediatric clinical cases: A Retrieval-Augmented approach utilizing Llama3.2 and structured references

Colosimo S.; Frattolillo V.; Masino M.; Miraglia del Giudice E.; Marzuillo P.
2025

Abstract

Background: The “hallucinations” of Large Language Models (LLMs) raise concerns about their accuracy in pediatrics. This study aimed to evaluate whether integrating information from the Nelson Textbook of Pediatrics through a Retrieval-Augmented Generation (RAG) system could enhance the performance of Llama3.2 in addressing complex pediatric clinical cases.

Methods: We assessed the performance of the RAG system using 1,713 multiple-choice pediatric clinical questions drawn from the MedQA dataset (n = 1,572) and from Archives of Disease in Childhood–Education and Practice (n = 141). Each question was presented to Llama3.2 twice: once in standalone mode and once with RAG integration. The percentages of correct answers in the two configurations were compared using the chi-square test, with p < 0.05 considered statistically significant.

Results: The RAG-integrated system significantly outperformed standalone Llama3.2, achieving an overall accuracy of 67.78% (1,161/1,713) compared with 46.18% (791/1,713) for Llama3.2 alone (p = 1.5 × 10⁻¹¹²). The improvement was consistent across all pediatric subspecialties.

Conclusions: Incorporating RAG systems into clinical decision-making can enhance reliability and safety.
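The comparison described in the Methods can be illustrated with a minimal sketch. It recomputes the two accuracies from the counts reported in the Results and applies an unpaired Pearson chi-square test (one degree of freedom, no continuity correction) to the resulting 2×2 table. The abstract does not state whether the authors used a paired or unpaired test or any correction, so that choice is an assumption, as are the function and variable names; the p-value from this sketch therefore need not match the reported figure.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and the upper-tail p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the abstract.
TOTAL = 1713
RAG_CORRECT = 1161
STANDALONE_CORRECT = 791

acc_rag = RAG_CORRECT / TOTAL
acc_standalone = STANDALONE_CORRECT / TOTAL

# Rows: configuration (RAG, standalone); columns: correct, incorrect.
chi2, p_value = chi_square_2x2(
    RAG_CORRECT, TOTAL - RAG_CORRECT,
    STANDALONE_CORRECT, TOTAL - STANDALONE_CORRECT,
)

print(f"RAG accuracy:        {acc_rag:.2%}")
print(f"Standalone accuracy: {acc_standalone:.2%}")
print(f"chi-square = {chi2:.2f}, p = {p_value:.3g}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

Because the same 1,713 questions were answered in both configurations, a paired test such as McNemar's would also be a defensible choice; it requires the per-question outcomes, which the abstract does not provide.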

Use this identifier to cite or link to this document: https://hdl.handle.net/11591/578849