This paper presents an approach to automatic video description, focusing on manipulation actions. Unlike existing methods that generate single-sentence descriptions, the authors propose two frameworks that produce descriptions at different levels of detail, better conveying the hierarchical structure of these actions. The first is a hybrid statistical method that requires less training data because it models uncertainties within the video clips statistically. The second is an end-to-end method that is more data-hungry, connecting the visual encoder to the language decoder without any intermediate processing step. Both frameworks use LSTM stacks to provide different levels of description granularity, and both are shown to produce more realistic descriptions than competing approaches.
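
The end-to-end variant is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: a visual encoder summarizes per-frame features, and a stack of LSTM decoders emits one description per granularity level, each level conditioned on the state of the level above it. All module names, dimensions, and the exact conditioning scheme are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's code) of an
# encoder-decoder captioner where a stack of LSTMs yields descriptions
# at several granularity levels, coarse to fine.
import torch
import torch.nn as nn


class HierarchicalCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512,
                 vocab_size=10000, num_levels=3):
        super().__init__()
        # Summarize pre-extracted per-frame visual features (e.g. from a
        # pretrained CNN) into a clip-level hidden state.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # One LSTM decoder per granularity level; level i starts from the
        # final state of level i-1, giving coarse-to-fine conditioning.
        self.decoders = nn.ModuleList(
            [nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
             for _ in range(num_levels)]
        )
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_levels)]
        )

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim)
        # captions: list of num_levels token tensors, each (batch, seq_len)
        _, state = self.encoder(frame_feats)  # (h, c) summarizing the clip
        logits = []
        for decoder, head, tokens in zip(self.decoders, self.heads, captions):
            out, state = decoder(self.embed(tokens), state)
            logits.append(head(out))  # (batch, seq_len, vocab_size)
        return logits


# Smoke test with random inputs: 2 clips, 16 frames, 3 granularity levels.
model = HierarchicalCaptioner()
feats = torch.randn(2, 16, 2048)
caps = [torch.randint(0, 10000, (2, 8)) for _ in range(3)]
for level, lg in enumerate(model(feats, caps)):
    print(f"level {level}: logits {tuple(lg.shape)}")
```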

Publication date: 14 Nov 2023
Project Page: https://arxiv.org/abs/2311.07285
Paper: https://arxiv.org/pdf/2311.07285