The article discusses a proposed system for sound design that extracts an onset track of repetitive actions from a video, which is used in conjunction with an audio or textual embedding to condition a diffusion model trained to generate a new, synchronized sound-effects audio track. This approach gives the sound designer full creative control while removing the burden of manually synchronizing audio with video. It simplifies sonification, since editing the onset track or swapping the conditioning embedding is easier than editing the audio track itself. The authors provide sound examples, source code, and pre-trained models for reproducibility.
Publication date: 25 Oct 2023
Project Page: https://mcomunita.github.io/diffusion-sfx
Paper: https://arxiv.org/pdf/2310.15247
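Below is a minimal PyTorch sketch of the conditioning idea described above, not the authors' implementation (their source code is linked on the project page). All class and variable names here are hypothetical: a sparse onset track aligned with the audio is fed as an extra input channel, while a text/audio embedding modulates the denoiser that the diffusion objective trains to predict noise.

```python
# Hypothetical sketch: conditioning a diffusion denoiser on an onset track
# plus a global text/audio embedding. Shapes and schedule are simplified.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise from noisy audio, an onset envelope
    aligned to the audio, and a global conditioning embedding."""
    def __init__(self, embed_dim=512, hidden=256):
        super().__init__()
        # Onset track is concatenated with the noisy audio as a second channel.
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=9, padding=4),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.GELU(),
            nn.Conv1d(hidden, 1, kernel_size=9, padding=4),
        )
        # The embedding (e.g. from a text or audio encoder) modulates the
        # hidden activations with a FiLM-style scale and shift.
        self.film = nn.Linear(embed_dim, 2 * hidden)

    def forward(self, noisy_audio, onsets, embedding, t):
        # noisy_audio, onsets: (batch, 1, samples); embedding: (batch, embed_dim)
        # t: diffusion timestep, ignored in this simplified sketch.
        x = torch.cat([noisy_audio, onsets], dim=1)
        h = self.net[0:2](x)                         # first conv + activation
        scale, shift = self.film(embedding).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        return self.net[2:](h)                       # remaining layers predict noise

# Usage: one denoising training step with synthetic tensors.
model = ConditionedDenoiser()
audio = torch.randn(4, 1, 16384)                     # clean target SFX
onsets = (torch.rand(4, 1, 16384) > 0.99).float()    # sparse onset impulses from video
embed = torch.randn(4, 512)                          # text/audio conditioning embedding
t = torch.rand(4)                                    # diffusion time in [0, 1]
noise = torch.randn_like(audio)
noisy = audio + t.view(-1, 1, 1) * noise             # simplistic noising schedule
pred = model(noisy, onsets, embed, t)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```

The point of the sketch is the separation of roles: the onset track carries the timing ("when" a sound happens), while the embedding carries the timbre ("what" it sounds like), so either can be edited independently without touching the generated audio directly.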