The article presents FLIP, a method for cross-domain face anti-spoofing (FAS) that leverages natural-language guidance. The authors show that initializing a Vision Transformer (ViT) backbone with multimodal (vision-language) pre-trained weights improves generalizability on the FAS task. They then align the image representation with an ensemble of textual class descriptions, which improves generalization in low-data regimes, and introduce a multimodal contrastive learning strategy that further strengthens feature generalization and bridges the gap between source and target domains. The method outperforms the state of the art, achieving better zero-shot transfer performance; a sketch of the alignment idea follows below.
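To make the alignment idea concrete, here is a minimal sketch (not the authors' code) of zero-shot real/spoof scoring with a CLIP-style model: each class is represented by the mean embedding of an ensemble of textual descriptions, and the image embedding is compared to both class embeddings by cosine similarity. The checkpoint name, prompt texts, and function names are illustrative assumptions; the actual prompts and training objectives are in the FLIP repository.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative prompt ensembles for the two FAS classes (assumed, not the
# paper's exact prompt list).
REAL_PROMPTS = [
    "This is an image of a real face",
    "This is a photo of a bonafide face",
]
SPOOF_PROMPTS = [
    "This is an image of a spoof face",
    "This is a photo of a fake face",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def class_embedding(prompts):
    """Average the L2-normalized text embeddings of a prompt ensemble."""
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.mean(dim=0)

@torch.no_grad()
def classify(image: Image.Image):
    """Zero-shot real/spoof probabilities via image-text cosine similarity."""
    img_inputs = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    classes = torch.stack(
        [class_embedding(REAL_PROMPTS), class_embedding(SPOOF_PROMPTS)]
    )
    classes = classes / classes.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_emb @ classes.T  # temperature-scaled similarity
    return logits.softmax(dim=-1)  # [P(real), P(spoof)]
```

Averaging over several phrasings of each class makes the text anchor less sensitive to any single prompt's wording, which is what helps in the low-data regime described above.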

Publication date: 29 Sep 2023
Project Page: https://github.com/koushiksrivats/FLIP
Paper: https://arxiv.org/pdf/2309.16649