Cross-Data Multilevel Attention for Depression Detection: Analyzing the Interplay Between Read and Spontaneous Speech
Esposito, Anna;
2024
Abstract
This work proposes a novel Cross-Data Multilevel Attention (CDMA) approach for multi-type speech-based depression detection, encompassing both read and spontaneous speech. The main novelty lies in analyzing the unique and common representations of the two types of speech and integrating them into a unified end-to-end framework with novel Intra-Type Multi-Local Attention (IT-MLA) and Cross-Type Global Attention (CT-GA) mechanisms. In particular, IT-MLA highlights depression-relevant information unique to either read or spontaneous speech via intra-modal attention-aware interactions. CT-GA, in turn, emphasises the depression-relevant information common to both read and spontaneous speech, with each type guided by the other. These multiple enhanced representations are aggregated to produce the final predictions. Experiments conducted on a publicly available corpus of 104 speakers (including 52 diagnosed with depression by professional psychiatrists) demonstrate that the proposed CDMA achieves an F1 score of up to 92.5%, the highest performance recorded on this dataset.


