Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

The article presents a framework for continuous target speaker extraction (C-TSE), which aims to extract the target speaker’s voice from a mixture of sounds. The framework combines a target speaker voice activation detection (TSVAD) model with a target speaker extraction (TSE) model. The authors propose Attention-target speaker voice activation detection (A-TSVAD), which directly generates timestamps of the target speaker. The framework is evaluated using diarization and enhancement metrics. The results show that A-TSVAD significantly reduces diarization errors, and that integrating it with TSE in a sequential cascaded manner improves extraction accuracy.
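
To make the sequential cascade concrete, here is a minimal Python sketch of the general idea: a detector's per-frame scores are turned into timestamps, and an extractor runs only on those segments. It is not the authors' implementation. The function names, the 0.5 threshold, the `hop` parameter, and the choice to output silence outside detected segments are all assumptions made for illustration, and the `extract` callable is a stand-in for a real TSE model.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]  # [start, end) in frame indices


def frames_to_segments(probs: Sequence[float], threshold: float = 0.5) -> List[Segment]:
    """Turn per-frame target-speaker activity scores into [start, end) segments.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i                      # a segment opens
        elif not active and start is not None:
            segments.append((start, i))    # a segment closes
            start = None
    if start is not None:                  # segment still open at the end
        segments.append((start, len(probs)))
    return segments


def cascade_extract(
    mixture: Sequence[float],
    segments: List[Segment],
    extract: Callable[[Sequence[float]], Sequence[float]],
    hop: int = 1,
) -> List[float]:
    """Run `extract` only on detected segments and output silence elsewhere.

    `hop` is the (hypothetical) number of samples per detector frame.
    """
    output = [0.0] * len(mixture)
    for start, end in segments:
        lo = min(start * hop, len(mixture))
        hi = min(end * hop, len(mixture))
        chunk = list(extract(mixture[lo:hi]))
        if len(chunk) != hi - lo:
            raise ValueError("extract must preserve the segment length")
        output[lo:hi] = chunk
    return output


# Toy usage: the "extractor" here just halves the amplitude.
probs = [0.1, 0.8, 0.9, 0.2, 0.7]
mixture = [1.0, 2.0, 3.0, 4.0, 5.0]
segments = frames_to_segments(probs)
extracted = cascade_extract(mixture, segments, lambda x: [0.5 * s for s in x])
print(segments)   # [(1, 3), (4, 5)]
print(extracted)  # [0.0, 1.0, 1.5, 0.0, 2.5]
```

In this toy setup, frames where the target speaker is judged absent are never passed to the extractor, which is one plausible reading of why a cascade can help when the target speaker is not continuously present.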

Publication date: 31 Jan 2024
Project Page: herbhezhao.github.io/Continuous-Target-Speech-Extraction
Paper: https://arxiv.org/pdf/2401.15993
