The paper introduces uSee (Unified Speech Enhancement and Editing), a model built on conditional diffusion models that handles several speech enhancement and editing tasks within a single generative framework. Generation is steered by multiple types of conditions, including self-supervised learning (SSL) embeddings and suitable text prompts. In the reported experiments, uSee outperforms other generative speech enhancement models on speech denoising and dereverberation. It can also edit speech according to a text description of the desired environmental sound, a target signal-to-noise ratio (SNR), and a room impulse response (RIR).
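
To make the SNR and RIR conditions more concrete, the sketch below shows the standard signal-processing operations those quantities describe: convolving dry speech with a room impulse response to add reverberation, and scaling a noise clip so the mixture hits a target SNR. This is a generic illustration, not the uSee pipeline. The function names `convolve_rir`, `mix_at_snr` and `measured_snr_db`, the sine-wave "speech", the white "noise" and the exponentially decaying toy RIR are all hypothetical stand-ins chosen for the example.

```python
import numpy as np


def convolve_rir(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate reverberation by convolving dry speech with a room impulse response.

    The full convolution is truncated to the input length so the output stays
    time-aligned with the original signal.
    """
    return np.convolve(speech, rir)[: len(speech)]


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise scaled so that 10*log10(P_speech / P_noise) equals snr_db.

    Assumes the noise clip is at least as long as the speech.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power is p_speech / 10^(snr_db / 10); rescale to match it.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


def measured_snr_db(speech: np.ndarray, mixture: np.ndarray) -> float:
    """Recover the SNR (in dB) of a mixture relative to its speech component."""
    residual = mixture - speech
    return float(10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 16000
    t = np.arange(sr) / sr
    speech = 0.5 * np.sin(2 * np.pi * 220 * t)  # hypothetical stand-in for a clean utterance
    noise = rng.standard_normal(sr)  # hypothetical stand-in for an environmental noise clip
    # Toy exponentially decaying RIR (hypothetical; real RIRs are measured or simulated).
    rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800)
    rir[0] = 1.0  # direct-path component
    reverberant = convolve_rir(speech, rir)
    noisy = mix_at_snr(reverberant, noise, snr_db=5.0)
    print(round(measured_snr_db(reverberant, noisy), 2))
```

Measuring the SNR back from the mixture should reproduce the requested 5 dB, which is a quick sanity check that the noise scaling follows the definition of SNR.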

Publication date: 4 Oct 2023
Project Page: https://muqiaoy.github.io/usee
Paper: https://arxiv.org/pdf/2310.00900