This study addresses complex activity detection (CompAD) in videos, which extends action analysis to long-term activities. The authors propose a hybrid graph network that combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. They also introduce a novel feature-extraction technique that builds spatiotemporal tubes for the active elements (agents) in the local scene: individual objects are detected, tracked across frames, and 3D features are then extracted from each agent tube as well as from the overall scene. The proposed framework outperforms previous state-of-the-art methods on three benchmark datasets: ActivityNet-1.3, Thumos-14, and ROAD.
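The tube-building step (detect, then track, then pool features per tube) can be sketched as follows. This is a hypothetical illustration, not the authors' code: it links per-frame detections into spatiotemporal tubes by greedy IoU matching against each tube's most recent box; in the paper's pipeline, 3D features would then be extracted from each tube and from the whole scene.

```python
# Hypothetical sketch of agent-tube construction via greedy IoU tracking.
# Boxes are [x1, y1, x2, y2]; `frames` is a list of per-frame detection lists.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tubes(frames, thr=0.5):
    """Greedily link detections across frames into tubes.

    Each tube is a list of (frame_index, box) pairs; a detection joins the
    tube whose last box (from the previous frame) overlaps it most, if the
    IoU exceeds `thr`; otherwise it starts a new tube.
    """
    tubes = []
    for t, boxes in enumerate(frames):
        unmatched = list(range(len(boxes)))
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1:
                continue  # tube already terminated
            best, best_iou = None, thr
            for j in unmatched:
                s = iou(last_box, boxes[j])
                if s > best_iou:
                    best, best_iou = j, s
            if best is not None:
                tube.append((t, boxes[best]))
                unmatched.remove(best)
        for j in unmatched:  # unmatched detections seed new tubes
            tubes.append([(t, boxes[j])])
    return tubes
```

For example, two agents detected over three frames (one leaving after frame two) yield two tubes of lengths 3 and 2; each tube would then be cropped from the video and passed through a 3D backbone to obtain its node feature for the local scene graph.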
Publication date: 27 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.17493