This research presents an approach to improving human action recognition that uses knowledge distillation to combine Convolutional Neural Network (CNN) and Vision Transformer (ViT) models. The aim is to enhance the efficiency of smaller student models by transferring knowledge from larger teacher models. The research uses a Vision Transformer as the student model and a convolutional network as the teacher model. Experiments on the Stanford 40 dataset show that the proposed approach significantly improves human action recognition accuracy compared with regular network training. The findings highlight the potential of combining local and global features in action recognition tasks.
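The teacher-to-student transfer described above can be illustrated with a minimal sketch of a distillation loss. The summary does not state the exact objective used in the paper, so this assumes the widely used soft-target formulation: a cross-entropy term on the true label plus a temperature-scaled KL divergence between teacher and student predictions. The function names, the `temperature` and `alpha` defaults, and the three-class logits are illustrative assumptions, not values from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Assumed soft-target distillation loss (not confirmed by the source).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence between softened teacher and student distributions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -log_softmax(student_logits)[label]

    # Soft-target term: KL(teacher || student) at the given temperature.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))

    # The T^2 factor keeps the soft-target gradient scale comparable
    # across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits for one image over three action classes.
teacher_logits = [2.0, 0.5, -1.0]   # e.g. output of a CNN teacher
student_logits = [-1.0, 0.5, 2.0]   # e.g. output of a ViT student
loss = distillation_loss(student_logits, teacher_logits, label=0)
```

In a training loop under these assumptions, the teacher's logits would be computed without gradient updates and only the student's parameters would be optimized against this loss.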
Publication date: 3 Nov 2023
Project Page: N/A
Paper: https://arxiv.org/pdf/2311.01283