The avatars' speech intelligibility: comparison between two different speech generation methodologies for facial animation
Federico Cioffi; Massimiliano Masullo; Aniello Pascale; Luigi Maffei
2025
Abstract
Research on speech intelligibility has shown that visual cues, such as facial movements synchronized with acoustic cues, significantly affect listeners' effort during communication tasks. A mismatch between these elements can increase cognitive load and reduce comprehension, adversely affecting speech intelligibility. The task is even more critical in noisy environments, where listeners must discern speech against challenging background noise. As virtual environments become more interactive, communication with avatars is increasingly prevalent, requiring a comprehensive understanding of these dynamics to ensure effective interaction. Utilizing Unreal Engine's MetaHuman technology, the present study compares two speech generation methodologies (synthesized text-to-speech vs. human voice recording) for driving automatically generated facial animations, through a laboratory experiment that investigated how these affect avatars' speech intelligibility under adverse acoustic conditions. Thirty-six words from the Diagnostic Rhyme Test (DRT) were recorded by a human speaker and generated with text-to-speech software to drive the animations. Participants were presented with 72 animations against babble noise at a fixed signal-to-noise ratio of -13 dB. The study showed that animations driven by the human voice recording significantly improved the avatars' speech intelligibility compared with those driven by the synthesized voice.
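
For reference, presenting speech against babble noise at a fixed signal-to-noise ratio amounts to scaling the noise so that the speech-to-noise power ratio matches the target value before summing the two signals. The following is a minimal Python sketch of that mixing step, under stated assumptions: the function name and the placeholder signals are illustrative and not taken from the study's implementation.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
        then return the mixture. Both inputs are mono float arrays at the
        same sample rate (illustrative assumption)."""
        # Match lengths: tile or trim the noise to cover the speech signal.
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[:len(speech)]

        # Average power (mean square) of each signal.
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)

        # Gain so that 10*log10(p_speech / (gain**2 * p_noise)) == snr_db.
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise

    # Example usage with the fixed SNR reported in the abstract:
    # speech, babble = ...  # placeholder arrays for a DRT word and babble noise
    # mixture = mix_at_snr(speech, babble, snr_db=-13.0)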


