The article introduces Synchformer, a new model for audio-visual synchronization aimed at ‘in-the-wild’ videos, such as those found on YouTube, where synchronization cues can be sparse. The authors propose a training method that decouples feature extraction from synchronization modeling via multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. Training is also scaled to AudioSet, a million-scale ‘in-the-wild’ dataset. The authors further investigate evidence attribution techniques for interpretability and explore a new capability for synchronization models: audio-visual synchronizability, i.e. predicting whether an audio-visual pair can be synchronized at all.
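To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the idea: stage 1 pre-trains the audio and visual feature extractors with a segment-level contrastive (InfoNCE-style) objective, and stage 2 trains a lightweight synchronization transformer on the resulting segment features to classify the temporal offset. This is not the authors' implementation; all module names, dimensions, segment counts, and the offset grid are illustrative assumptions.

```python
# Hedged sketch of a two-stage synchronization pipeline (assumed, not Synchformer's code).
# Stage 1: segment-level audio-visual contrastive pre-training of the feature extractors.
# Stage 2: a small transformer over (frozen) segment features that classifies the offset.

import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_contrastive_loss(a_feats, v_feats, temperature=0.07):
    """Symmetric InfoNCE over per-segment audio/visual embeddings.

    a_feats, v_feats: (batch * num_segments, dim) embeddings where matching rows
    come from the same temporal segment of the same clip.
    """
    a = F.normalize(a_feats, dim=-1)
    v = F.normalize(v_feats, dim=-1)
    logits = a @ v.t() / temperature                  # pairwise similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class SyncTransformer(nn.Module):
    """Stage-2 synchronization module: takes audio and visual segment features
    from the pre-trained (frozen) extractors and predicts a discrete offset class."""

    def __init__(self, dim=512, num_offset_classes=21, num_layers=3):
        super().__init__()
        self.modality_emb = nn.Embedding(2, dim)      # 0 = audio, 1 = visual
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_offset_classes)

    def forward(self, a_segs, v_segs):
        # a_segs: (B, Sa, dim), v_segs: (B, Sv, dim) — frozen extractor outputs
        B = a_segs.size(0)
        a = a_segs + self.modality_emb.weight[0]
        v = v_segs + self.modality_emb.weight[1]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), a, v], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])               # offset-class logits from CLS token


if __name__ == "__main__":
    # Toy shapes: 4 clips, 14 segments per modality, 512-d features (all illustrative).
    B, S, D = 4, 14, 512
    a_segs, v_segs = torch.randn(B, S, D), torch.randn(B, S, D)

    # Stage 1 loss on flattened per-segment embeddings.
    loss1 = segment_contrastive_loss(a_segs.reshape(-1, D), v_segs.reshape(-1, D))

    # Stage 2 offset classification on the segment features.
    model = SyncTransformer(dim=D)
    logits = model(a_segs, v_segs)                    # (B, num_offset_classes)
    print(loss1.item(), logits.shape)
```

The appeal of decoupling the stages is that the heavy feature extractors can be pre-trained once at scale, while the comparatively small synchronization head is trained and iterated on cheaply.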


Publication date: 31 Jan 2024
Project Page: robots.ox.ac.uk/~vgg/research/synchformer
Paper: https://arxiv.org/pdf/2401.16423