This article introduces a system for real-time, continuous turn-taking prediction in spoken dialogue systems (SDSs). Despite recent progress in large language models, practical SDSs still typically handle turn-taking in a simplistic manner, which often leads to long response delays or frequent interruptions. The proposed system addresses this problem with a model called voice activity projection (VAP), which maps dialogue audio to the future voice activities of the dialogue participants and does so continuously over time. The model processes the audio with contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. The article also examines the effect of the input context audio length and demonstrates that the system can operate in real time with minimal performance degradation.
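
To make the idea of a bounded input context concrete, the following is a minimal sketch of a streaming loop that keeps only the most recent frames of audio and calls a predictor on that window at every step. It is not the VAP model or the authors' implementation: `RollingContext`, `run_stream`, and `dummy_predict` are hypothetical names, each frame is assumed to be a pair of per-speaker feature values, and the predictor is a placeholder that simply averages the window.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

# Assumption: one frame holds one feature value per speaker (two speakers).
Frame = Tuple[float, float]


class RollingContext:
    """Keep only the most recent `max_frames` frames of dialogue audio."""

    def __init__(self, max_frames: int) -> None:
        if max_frames <= 0:
            raise ValueError("max_frames must be positive")
        # A deque with maxlen drops the oldest frame automatically when full.
        self.frames: Deque[Frame] = deque(maxlen=max_frames)

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)

    def window(self) -> List[Frame]:
        return list(self.frames)


def dummy_predict(window: List[Frame]) -> Tuple[float, float]:
    """Placeholder for a trained model: mean feature value per speaker."""
    n = len(window)
    return (sum(f[0] for f in window) / n, sum(f[1] for f in window) / n)


def run_stream(
    stream: Iterable[Frame],
    predict: Callable[[List[Frame]], Tuple[float, float]],
    max_frames: int,
) -> List[Tuple[float, float]]:
    """Call `predict` once per incoming frame on the bounded context window."""
    ctx = RollingContext(max_frames)
    outputs = []
    for frame in stream:
        ctx.push(frame)
        outputs.append(predict(ctx.window()))
    return outputs


if __name__ == "__main__":
    # Speaker A is active for two frames, then speaker B takes over.
    stream = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
    for step, out in enumerate(run_stream(stream, dummy_predict, max_frames=2)):
        print(step, out)
```

In a sketch like this, `max_frames` bounds both the memory used and the work done per step, which is the kind of trade-off between context length and real-time operation that the article studies.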

 

Publication date: 10 Jan 2024
Project Page: https://arxiv.org/abs/2401.04868v1
Paper: https://arxiv.org/pdf/2401.04868