This article introduces EmoCLIP, a vision-language model that uses sample-level text descriptions as natural language supervision to learn richer latent representations for zero-shot classification. The model is evaluated via zero-shot classification on four popular dynamic facial expression recognition (FER) datasets, where it delivers significant improvements over baseline methods, surpassing CLIP by over 10% in Weighted Average Recall (WAR) and 5% in Unweighted Average Recall (UAR). EmoCLIP also performs well on the downstream task of mental health symptom estimation, matching or exceeding state-of-the-art methods and showing strong agreement with human experts (a sketch of the zero-shot inference step and the two reported metrics follows below).
Publication date: 25 Oct 2023
Project Page: https://github.com/NickyFot/EmoCLIP
Paper: https://arxiv.org/pdf/2310.16640
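
As a rough illustration only, the minimal sketch below shows how zero-shot prediction works in a CLIP-style joint embedding space (pick the class whose text embedding is most similar to the video embedding) and how WAR and UAR are computed. It assumes precomputed embeddings; the encoder calls are omitted, and the names `zero_shot_predict` and `war_uar` are hypothetical, not part of the EmoCLIP repository's API.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import recall_score


def zero_shot_predict(video_emb: torch.Tensor, class_text_embs: torch.Tensor) -> int:
    """Return the index of the class whose text embedding is closest to the video embedding.

    video_emb:       (d,)   embedding of one video clip
    class_text_embs: (C, d) embedding of one textual description per class
    """
    video_emb = F.normalize(video_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ video_emb  # cosine similarity per class
    return int(sims.argmax().item())


def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Weighted / Unweighted Average Recall, the two metrics reported above."""
    war = recall_score(y_true, y_pred, average="weighted")  # recall weighted by class support (= overall accuracy)
    uar = recall_score(y_true, y_pred, average="macro")     # unweighted mean of per-class recalls
    return war, uar


if __name__ == "__main__":
    # Toy usage with random embeddings in place of real video/text encoders.
    d, num_classes = 512, 7
    class_embs = torch.randn(num_classes, d)
    preds, labels = [], []
    for _ in range(20):
        clip_emb = torch.randn(d)
        preds.append(zero_shot_predict(clip_emb, class_embs))
        labels.append(np.random.randint(num_classes))
    print(war_uar(np.array(labels), np.array(preds)))
```

Note that UAR treats every class equally, which is why it is commonly reported alongside WAR on FER datasets with imbalanced emotion classes.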