This paper proposes and evaluates DyDecNet, a novel end-to-end trainable neural network for counting the number of distinct sounds in raw audio, a problem that has been underexplored despite its importance in many fields. DyDecNet applies dyadic decomposition to progressively split the raw waveform along the frequency axis, producing a time-frequency representation in a multi-stage, coarse-to-fine manner. The work also introduces an energy-gain normalization to compensate for variation in sound loudness and spectral overlap, and designs three polyphony-aware metrics to better quantify the difficulty of sound counting. Experiments on several datasets demonstrate DyDecNet's superiority and its potential for other acoustic tasks.
Publication date: 29 Dec 2023
Project Page: github.com/yuhanghe01/SoundCount
Paper: https://arxiv.org/pdf/2312.16149
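To make the coarse-to-fine dyadic decomposition concrete, here is a minimal sketch of the idea using a fixed Haar-style filter pair: at each stage every frequency band is split into a low and a high half-band and downsampled, so after d stages the waveform becomes 2^d sub-bands forming a time-frequency grid. This is only an illustration of the decomposition scheme; the actual DyDecNet uses learned filters and intermediate network layers, and the function names below are hypothetical.

```python
import numpy as np

def haar_split(x):
    """Split a 1-D signal into low/high half-bands (Haar analysis filters)."""
    if len(x) % 2:  # zero-pad to even length so we can pair samples
        x = np.append(x, 0.0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # coarse (low-frequency) band
    high = (even - odd) / np.sqrt(2.0)  # detail (high-frequency) band
    return low, high

def dyadic_decompose(x, depth):
    """Recursively split every band `depth` times (wavelet-packet style).

    Returns an array of shape (2**depth, len(x) / 2**depth):
    one row per frequency band, columns indexing time.
    """
    bands = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        bands = [half for b in bands for half in haar_split(b)]
    return np.stack(bands)

# Example: a 64-sample waveform decomposed over 3 stages -> 8 bands of 8 samples.
tf = dyadic_decompose(np.sin(np.linspace(0, 20, 64)), depth=3)
```

Because the Haar filter pair is orthogonal, the decomposition preserves signal energy, which is one reason dyadic schemes are a natural front-end for loudness-sensitive tasks like counting.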