The article discusses a new method for detecting deepfake videos. Deepfakes are synthetic media, including images, videos, and audio, that can be used for unethical purposes. The proposed method uses a multi-modal, self-supervised feature extractor to exploit inconsistencies between the audio and visual streams of multimedia content. Specifically, it uses the AV-HuBERT model to extract visual and acoustic features, and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. A second, transformer-based video model exploits facial features and captures the spatial and temporal artifacts introduced during deepfake generation. The model outperforms existing methods, achieving state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
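The paper summary above describes a two-branch architecture: an audio-visual feature extractor whose outputs pass through a multi-scale temporal CNN, fused with features from a separate video transformer. Below is a minimal PyTorch sketch of how such a pipeline might be wired, assuming pre-extracted per-frame features; all module and parameter names (`MultiScaleTemporalConv`, `AVDeepfakeDetector`, `av_dim`, etc.) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Hypothetical multi-scale temporal CNN: parallel 1-D convolutions
    with different kernel sizes over the time axis, then a projection."""
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, x):                    # x: (batch, time, dim)
        x = x.transpose(1, 2)                # -> (batch, dim, time) for Conv1d
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.proj(y.transpose(1, 2))  # back to (batch, time, dim)

class AVDeepfakeDetector(nn.Module):
    """Fuses audio-visual features (e.g. from a frozen AV-HuBERT encoder)
    with features from a video transformer, then classifies real vs. fake."""
    def __init__(self, av_dim=768, vid_dim=768, hidden=256):
        super().__init__()
        self.temporal = MultiScaleTemporalConv(av_dim)
        self.head = nn.Sequential(
            nn.Linear(av_dim + vid_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),            # logits: real vs. fake
        )

    def forward(self, av_feats, vid_feats):
        # av_feats:  (batch, time, av_dim)  from the audio-visual extractor
        # vid_feats: (batch, time, vid_dim) from the video transformer
        t = self.temporal(av_feats).mean(dim=1)  # pool over time
        v = vid_feats.mean(dim=1)
        return self.head(torch.cat([t, v], dim=-1))
```

The key design point the sketch illustrates is the fusion step: temporal audio-visual inconsistency cues and spatial facial artifacts are learned by separate branches and only combined at the classification head. The exact fusion strategy, pooling, and dimensions used by the authors are detailed in the paper linked below.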
Publication date: 8 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.02733