The article discusses a proposed sound-design system that extracts repetitive actions from a video and uses them, together with audio or textual embeddings, to condition a diffusion model trained to generate a new, synchronized sound effects track. This approach gives the sound designer full creative control while removing the burden of manually synchronizing audio with video. It also simplifies the sonification workflow, since editing the onset track or swapping the conditioning embedding is easier than editing the audio track itself. The authors provide sound examples, source code, and pre-trained models for reproducibility.
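
To make the described pipeline more concrete, here is a minimal, illustrative sketch in PyTorch of how a binary onset track (derived from the detected video actions) and a global text/audio embedding might condition a diffusion sampler. Every class, shape, and the sampling loop below are placeholder assumptions for illustration only; they do not reflect the authors' released code, model architecture, or pre-trained checkpoints.

```python
# Illustrative sketch only: onset-conditioned diffusion sampling.
# All names and hyperparameters are hypothetical, not the authors' API.
import torch
import torch.nn as nn


class OnsetEncoder(nn.Module):
    """Encodes a binary onset track (1 at action onsets, 0 elsewhere)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Conv1d(1, hidden, kernel_size=9, padding=4)

    def forward(self, onsets: torch.Tensor) -> torch.Tensor:
        # onsets: (batch, 1, time) -> (batch, hidden, time)
        return self.net(onsets)


class ConditionalDenoiser(nn.Module):
    """Toy denoiser conditioned on onset features and a global embedding."""
    def __init__(self, hidden: int = 64, embed_dim: int = 512):
        super().__init__()
        self.onset_enc = OnsetEncoder(hidden)
        self.embed_proj = nn.Linear(embed_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv1d(1 + hidden, hidden, 9, padding=4), nn.SiLU(),
            nn.Conv1d(hidden, 1, 9, padding=4),
        )

    def forward(self, noisy_audio, onsets, cond_embed, t):
        # Broadcast the global embedding over time and add the onset features.
        # (The diffusion step t is ignored here to keep the toy model small.)
        cond = self.onset_enc(onsets) + self.embed_proj(cond_embed)[..., None]
        return self.net(torch.cat([noisy_audio, cond], dim=1))


@torch.no_grad()
def sample(model, onsets, cond_embed, steps: int = 50):
    """Rough DDPM-style ancestral sampling loop (illustrative only)."""
    x = torch.randn(onsets.shape[0], 1, onsets.shape[-1])
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        eps = model(x, onsets, cond_embed, i)
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        x = (x - coef * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x  # generated waveform aligned with the onset track


if __name__ == "__main__":
    model = ConditionalDenoiser()
    onsets = torch.zeros(1, 1, 16000)        # 1 s track at 16 kHz (placeholder)
    onsets[..., [2000, 8000, 14000]] = 1.0   # three detected action onsets
    cond_embed = torch.randn(1, 512)         # stand-in for a text/audio embedding
    audio = sample(model, onsets, cond_embed)
    print(audio.shape)  # torch.Size([1, 1, 16000])
```

The sketch highlights the key design point made in the article: synchronization lives entirely in the onset track, while the timbre of the generated effect is steered by the conditioning embedding, so either can be edited independently of the audio itself. For the actual implementation, see the project page and paper linked below.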

Publication date: 25 Oct 2023
Project Page: https://mcomunita.github.io/diffusion-sfx
Paper: https://arxiv.org/pdf/2310.15247