The article presents a framework for continuous target speaker extraction (C-TSE), which aims to improve the extraction of a target speaker’s voice from a mixture of sounds. The framework combines a target speaker voice activation detection (TSVAD) model with a target speaker extraction (TSE) model. The authors propose Attention-target speaker voice activation detection (A-TSVAD), which directly generates timestamps of the target speaker. The framework is evaluated using diarization and enhancement metrics. The results show that A-TSVAD significantly reduces diarization errors and that integrating it with TSE in a sequential cascaded manner improves extraction accuracy.
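To make the idea of a sequential cascade concrete, the following is a minimal sketch and not the authors' implementation. It assumes the detection stage yields frame-level activity probabilities that are thresholded into timestamps (A-TSVAD itself is described as generating timestamps directly), and it treats the extraction model as an arbitrary callable. The names `activity_to_timestamps`, `cascade`, `frame_shift_s`, `threshold`, and `extractor` are hypothetical and chosen only for illustration.

```python
import numpy as np


def activity_to_timestamps(probs, frame_shift_s=0.01, threshold=0.5):
    """Convert frame-level activity probabilities into (start, end) times in seconds.

    Thresholding is an assumption of this sketch, standing in for a detector
    that outputs target-speaker timestamps.
    """
    active = np.asarray(probs) >= threshold
    # Pad with False so every active run has both a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]  # end frame is exclusive
    return [(float(s) * frame_shift_s, float(e) * frame_shift_s)
            for s, e in zip(starts, ends)]


def cascade(mixture, sample_rate, timestamps, extractor):
    """Run a placeholder extractor only on detected target-speaker segments.

    Samples outside the detected segments are left at zero, which models
    target-speaker absence. `extractor` is a hypothetical stand-in for a TSE
    model and must return an array of the same length as its input.
    """
    mixture = np.asarray(mixture, dtype=float)
    output = np.zeros_like(mixture)
    for start_s, end_s in timestamps:
        lo = int(round(start_s * sample_rate))
        hi = min(int(round(end_s * sample_rate)), len(mixture))
        output[lo:hi] = extractor(mixture[lo:hi])
    return output
```

A call such as `cascade(mixture, 16000, activity_to_timestamps(probs), extractor)` would then pass only the detected segments to the extractor. The design choice illustrated here is that detection errors propagate directly into extraction, which is one reason the timestamp quality of the first stage matters in a cascaded setup.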
Publication date: 31 Jan 2024
Project Page: herbhezhao.github.io/Continuous-Target-Speech-Extraction
Paper: https://arxiv.org/pdf/2401.15993