The article is an extensive review of audio-visual speaker tracking, a field that has attracted growing interest because of its academic value and practical applications. Audio-visual speaker tracking uses audio and visual data to estimate speaker positions, and it supports applications in human-computer interaction, speech recognition, and other areas. The paper introduces the Bayesian filters used in speaker tracking and discusses how deep learning techniques have influenced the field. The authors also summarize existing trackers and their performance. The paper closes with a discussion of how audio-visual speaker tracking connects to related areas such as speech separation and distributed speaker tracking.
Publication date: 25 Oct 2023
Project Page: https://arxiv.org/abs/2310.14778v1
Paper: https://arxiv.org/pdf/2310.14778