This article investigates the use of both verbal and non-verbal cues for device-directed speech detection (DDSD), the task of distinguishing queries directed at a voice assistant from background speech. Specifically, it focuses on combining prosody features (non-verbal cues) with verbal cues for DDSD. The study found that incorporating prosody improves DDSD performance by up to 8.5% in terms of false acceptance (FA) rate. Furthermore, applying modality dropout during training improved these models by a further 7.4% when evaluated with missing modalities at inference time.
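The modality dropout idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: all names and the dropout rate are assumptions. During training, one modality's embedding is occasionally zeroed out before fusion, so the fused model learns to remain accurate when a modality is missing at inference.

```python
import random

def modality_dropout(verbal_emb, prosody_emb, p_drop=0.3, rng=None):
    """Fuse two modality embeddings, randomly dropping one during training.

    Hypothetical sketch: with probability p_drop, exactly one of the two
    embeddings is replaced by a zero vector of the same length, then the
    two are concatenated (simple late fusion).
    """
    rng = rng or random.Random()
    if rng.random() < p_drop:
        # Drop exactly one modality, chosen uniformly at random.
        if rng.random() < 0.5:
            verbal_emb = [0.0] * len(verbal_emb)
        else:
            prosody_emb = [0.0] * len(prosody_emb)
    return verbal_emb + prosody_emb

# With p_drop=0.0 both modalities pass through unchanged.
fused = modality_dropout([0.2, 0.5], [0.1, 0.9], p_drop=0.0)
```

At inference, a genuinely missing modality is fed in as the same zero vector, matching the conditions the model saw in training.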
Publication date: 23 Oct 2023
Project Page: https://arxiv.org/abs/2310.15261v1
Paper: https://arxiv.org/pdf/2310.15261