The article discusses a new method for detecting deepfake videos. Deepfakes are synthetic media, including images, videos, and audio, that can be used for unethical purposes. The proposed method uses a multi-modal, self-supervised feature extractor to exploit inconsistencies between the audio and visual streams of multimedia content. Specifically, the model uses AV-HuBERT to extract visual and acoustic features, and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. A separate transformer-based video model exploits facial features and captures the spatial and temporal artifacts introduced during deepfake generation. The authors report that the model outperforms existing approaches and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.02733
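The summary above describes a two-branch design: fused AV-HuBERT audio/visual features processed by a multi-scale temporal CNN, plus a transformer over face features, combined for classification. The following PyTorch sketch illustrates how such a pipeline could be wired together. It is a minimal illustration, not the paper's implementation: all module names, feature dimensions, kernel sizes, and the fusion/classification head are assumptions, and AV-HuBERT feature extraction is stubbed out as precomputed tensors.

```python
# Illustrative sketch only; dimensions and module structure are assumptions,
# not the configuration from the paper (arXiv:2311.02733).
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes over time,
    approximating a multi-scale temporal CNN over audio-visual features."""

    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(out_dim * len(kernel_sizes), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, out_dim)
        x = x.transpose(1, 2)  # Conv1d expects (batch, channels, time)
        feats = [torch.relu(branch(x)) for branch in self.branches]
        x = torch.cat(feats, dim=1).transpose(1, 2)
        return self.proj(x)


class AVDeepfakeDetector(nn.Module):
    """Two-branch detector: (1) concatenated AV-HuBERT audio+visual features
    through a multi-scale temporal CNN; (2) a transformer encoder over
    per-frame face features standing in for the video-artifact model."""

    def __init__(self, av_dim=1024, face_dim=768, hidden=256):
        super().__init__()
        self.av_temporal = MultiScaleTemporalConv(av_dim * 2, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=face_dim, nhead=8, batch_first=True
        )
        self.face_transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.face_proj = nn.Linear(face_dim, hidden)
        self.classifier = nn.Linear(hidden * 2, 2)  # real vs. fake logits

    def forward(self, audio_feats, visual_feats, face_feats):
        # audio_feats, visual_feats: (B, T, av_dim), assumed precomputed
        # by AV-HuBERT; face_feats: (B, T, face_dim) from a face encoder.
        av = torch.cat([audio_feats, visual_feats], dim=-1)
        av = self.av_temporal(av).mean(dim=1)  # temporal average pooling
        face = self.face_transformer(face_feats)
        face = self.face_proj(face).mean(dim=1)
        return self.classifier(torch.cat([av, face], dim=-1))


if __name__ == "__main__":
    model = AVDeepfakeDetector()
    B, T = 2, 50  # batch of 2 clips, 50 frames each (illustrative)
    logits = model(
        torch.randn(B, T, 1024),  # AV-HuBERT audio features (assumed dim)
        torch.randn(B, T, 1024),  # AV-HuBERT visual features (assumed dim)
        torch.randn(B, T, 768),   # per-frame face embeddings (assumed dim)
    )
    print(logits.shape)  # torch.Size([2, 2])
```

The key design point this sketch captures is late fusion: cross-modal temporal consistency and visual artifact cues are modeled in separate branches and only merged at the classification head, so each branch can specialize in the inconsistencies it is suited to detect.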