The article presents a multichannel speaker diarization and separation system driven by spatial-temporal activity information. The architecture combines array signal processing units with deep learning units. For diarization, a Spatial Activity-driven Speaker Diarization network (SASDnet) estimates each speaker's activity from a spatial coherence matrix. For separation, the authors propose a Global and Local Activity-driven Speaker Extraction network (GLASEnet). The authors report that the system achieves superior speaker diarization, counting, and separation performance at low computational complexity.
Publication date: 31 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.16850
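
To make the idea of a spatial coherence matrix more concrete, the following is a minimal sketch in Python with NumPy, not the paper's implementation. It assumes a simple spatial feature (each channel's STFT divided by a reference channel, then normalized to unit length across channels) and measures how similar the feature vectors of any two time frames are, averaged over frequency. The function name `spatial_coherence_matrix` and the parameters `ref` and `eps` are hypothetical and chosen for illustration only; the feature definition used by SASDnet may differ.

```python
import numpy as np


def spatial_coherence_matrix(stft, ref=0, eps=1e-12):
    """Frame-by-frame spatial coherence (illustrative sketch, assumed feature).

    stft: complex array of shape (channels, freqs, frames).
    Returns a real array of shape (frames, frames) with values in [0, 1].
    """
    # Assumed spatial feature: relative transfer function w.r.t. a reference channel.
    rel = stft / (stft[ref:ref + 1] + eps)
    # Normalize each (freq, frame) vector to unit length across channels.
    rel = rel / (np.linalg.norm(rel, axis=0, keepdims=True) + eps)
    # Inner product across channels for every pair of frames, per frequency.
    inner = np.einsum("cfi,cfj->fij", rel.conj(), rel)
    # Average the coherence magnitude over frequency.
    return np.abs(inner).mean(axis=0)
```

Under these assumptions, frames dominated by the same spatially fixed speaker yield entries close to 1, while frames dominated by speakers at different positions yield lower values. That block structure is the kind of cue a diarization network could use to estimate per-speaker activity.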