This study focuses on improving the accuracy of child-adult speaker classification in dyadic interactions, particularly in the context of Autism Spectrum Disorder assessments. This classification has traditionally relied on audio-only modeling. The study instead incorporates visual cues, such as lip motion, through active speaker detection and visual processing models. The proposed framework comprises video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific predictions. The results show that this visually aided pipeline improves classification accuracy and robustness.
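To make the late-fusion step concrete, the sketch below combines per-utterance child/adult class probabilities from an audio model and a visual (lip-motion) model using a weighted average. The function name, weighting scheme, and label ordering are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fuse_modalities(audio_probs, visual_probs, audio_weight=0.5):
    """Late fusion of modality-specific predictions (illustrative sketch).

    audio_probs, visual_probs: arrays of shape (n_utterances, 2) holding
    [child, adult] probabilities from each modality-specific classifier.
    audio_weight: relative weight of the audio modality (assumed value).
    Returns the fused per-utterance label: 0 = child, 1 = adult.
    """
    audio_probs = np.asarray(audio_probs, dtype=float)
    visual_probs = np.asarray(visual_probs, dtype=float)
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    return fused.argmax(axis=1)

# Example: two utterances, audio and visual models disagree on the second one.
audio = [[0.8, 0.2], [0.4, 0.6]]
visual = [[0.7, 0.3], [0.9, 0.1]]
print(fuse_modalities(audio, visual))  # e.g. [0, 0]
```

A weighted average of class probabilities is one common way to realize late fusion; the actual combination rule and weights would need to follow the paper.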


Publication date: 4 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.01867