The paper presents a synthetic speech detector called the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT). The detector converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. Evaluated on the ASVspoof2019 dataset, it performed better than other spectrogram-based approaches to synthetic speech detection. It also generalized well to the In-the-Wild dataset and was robust to compression, detecting telephone-quality synthetic speech better than several existing methods. The need for such a detector arises from the increasing use of synthetic speech for malicious purposes such as spreading misinformation, committing financial fraud, and impersonating people.
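
To make the "processes it in patches" step concrete, here is a minimal sketch of splitting a mel-spectrogram into non-overlapping patches that a transformer could consume as tokens. It is an illustration under stated assumptions, not the paper's implementation: the function name `split_into_patches`, the 16x16 patch size, the 80x400 spectrogram shape, and the choice to crop rather than pad are all hypothetical, and a random array stands in for a real mel-spectrogram.

```python
import numpy as np

def split_into_patches(spec: np.ndarray, patch_h: int = 16, patch_w: int = 16) -> np.ndarray:
    """Split a (n_mels, n_frames) array into flattened, non-overlapping patches.

    Patch size is an illustrative assumption, not a value taken from the paper.
    """
    n_mels, n_frames = spec.shape
    rows = n_mels // patch_h    # number of patches along the frequency axis
    cols = n_frames // patch_w  # number of patches along the time axis
    # Assumption: crop any remainder that does not fill a whole patch
    # (padding would be an equally reasonable choice).
    spec = spec[: rows * patch_h, : cols * patch_w]
    # Reshape to (rows, patch_h, cols, patch_w), then group each patch's pixels together.
    patches = spec.reshape(rows, patch_h, cols, patch_w).transpose(0, 2, 1, 3)
    # One row per patch, ordered frequency-band by frequency-band, left to right in time.
    return patches.reshape(rows * cols, patch_h * patch_w)

# Stand-in for a real mel-spectrogram: 80 mel bands x 400 frames (hypothetical sizes).
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 400))
tokens = split_into_patches(mel)
print(tokens.shape)  # (125, 256): 5 x 25 patches, each 16 x 16 values
```

In a full model, each flattened patch would typically be projected to an embedding vector and combined with positional information before entering the transformer; those stages are omitted here.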

Publication date: 23 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.14205