The article introduces FunnyNet-W, a model that uses cross- and self-attention to fuse visual, audio, and text cues for predicting funny moments in videos. Unlike most methods that depend on ground-truth subtitles, FunnyNet-W works directly with raw video: it combines video frames and the audio track with text obtained automatically via a speech-to-text model to understand scenes and detect humor. It also proposes an unsupervised approach for labeling funny audio moments. Evaluated on five datasets, it identifies funny moments from multimodal cues and sets a new state of the art for humor detection.
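To make the fusion idea concrete, below is a minimal sketch (not the authors' implementation) of cross- plus self-attention over visual, audio, and text embeddings, followed by a binary funny/not-funny classifier. All dimensions, module names, and the fusion order are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical multimodal fusion block: cross-attention then self-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-attention: visual tokens query the audio+text context.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention refines the fused sequence.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, vis, aud, txt):
        # vis/aud/txt: (batch, tokens, dim) embeddings from per-modality encoders,
        # e.g. a visual backbone, an audio encoder, and a speech-to-text + text encoder.
        context = torch.cat([aud, txt], dim=1)
        fused, _ = self.cross_attn(query=vis, key=context, value=context)
        fused = self.norm1(fused + vis)
        refined, _ = self.self_attn(fused, fused, fused)
        refined = self.norm2(refined + fused)
        # Pool over tokens and predict a single funniness logit per clip.
        return self.classifier(refined.mean(dim=1)).squeeze(-1)

# Usage with random features standing in for encoder outputs.
model = CrossAttentionFusion()
vis = torch.randn(2, 8, 256)    # e.g. 8 frame tokens
aud = torch.randn(2, 4, 256)    # e.g. 4 audio tokens
txt = torch.randn(2, 16, 256)   # e.g. 16 ASR/subtitle tokens
logits = model(vis, aud, txt)   # shape: (2,)
```

In this sketch the visual stream acts as the query while audio and text serve as context; the actual model's attention layout and training setup are described in the paper linked below.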

Publication date: 11 Jan 2024
Project Page: https://doi.org/
Paper: https://arxiv.org/pdf/2401.04210