Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

This research presents an approach to improving human action recognition through knowledge distillation, combining Convolutional Neural Network (CNN) and Vision Transformer (ViT) models. The aim is to enhance the efficiency of a smaller student model by transferring knowledge from a larger teacher model. The work uses a Vision Transformer as the student and a convolutional network as the teacher. Experiments on the Stanford 40 dataset show that the proposed approach significantly improves action recognition accuracy compared with regular network training. The findings highlight the potential of combining local and global features in action recognition tasks.
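To make the teacher-to-student transfer concrete, here is a minimal sketch of a distillation loss. The summary does not state the exact objective the paper uses, so this assumes the common soft-target formulation: a cross-entropy term on the true label plus a temperature-scaled KL divergence between teacher and student distributions. The function names, the `temperature` and `alpha` parameters, and their default values are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Assumed soft-target distillation loss for one example.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence between softened teacher and student distributions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL(teacher || student) at the given temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

    # The T^2 factor keeps the soft term's gradient scale comparable
    # across temperatures.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Hypothetical logits: teacher (CNN) and student (ViT) outputs for 3 classes.
teacher = [3.0, 1.0, 0.2]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(student, teacher, label=0)
```

In a real training loop the teacher logits would come from a frozen CNN and the loss would be averaged over a batch; a tensor library would replace the plain lists used here.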

Publication date: 3 Nov 2023
Project Page: N/A
Paper: https://arxiv.org/pdf/2311.01283
