The paper presents an approach for automatic video description, focusing on manipulation actions. Unlike existing methods that generate single-sentence descriptions, the authors propose two frameworks that produce descriptions at different levels of detail, thereby better conveying the hierarchical structure of these actions. The first is a hybrid statistical method that requires less training data because it models the uncertainties within video clips statistically. The second is an end-to-end method that is more data-hungry, connecting the visual encoder directly to the language decoder without any intermediate processing step. Both frameworks use LSTM stacks to provide different levels of description granularity, as sketched below. The methods are shown to produce more realistic descriptions than competing approaches.
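The summary does not specify the exact architecture, so the following is only a minimal PyTorch sketch of the general idea behind the end-to-end variant: pre-extracted frame features feed a stack of LSTMs, with each level of the stack emitting word predictions at one granularity and conditioning on the level below it. All names and dimensions (`MultiGranularityCaptioner`, `feat_dim`, `num_levels`, etc.) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch; not the authors' implementation.
import torch
import torch.nn as nn

class MultiGranularityCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, num_levels=3):
        super().__init__()
        # Project per-frame visual features (e.g., from a CNN) into the LSTM input space.
        self.visual_proj = nn.Linear(feat_dim, hidden_dim)
        # One LSTM per granularity level; level k > 0 also sees level k-1's hidden states.
        self.lstms = nn.ModuleList(
            nn.LSTM(hidden_dim if k == 0 else 2 * hidden_dim, hidden_dim, batch_first=True)
            for k in range(num_levels)
        )
        # One word-prediction head per level, so each level describes at its own detail.
        self.word_heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_levels)
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) pre-extracted frame features.
        x = self.visual_proj(frame_feats)
        outputs, prev = [], None
        for lstm, head in zip(self.lstms, self.word_heads):
            inp = x if prev is None else torch.cat([x, prev], dim=-1)
            prev, _ = lstm(inp)
            outputs.append(head(prev))  # (batch, time, vocab) word logits per level
        return outputs

feats = torch.randn(2, 16, 2048)       # 2 clips, 16 frames each (dummy features)
level_logits = MultiGranularityCaptioner()(feats)
print([t.shape for t in level_logits])  # one logits tensor per granularity level
```

Stacking the decoders this way lets a coarse level summarize the whole action while deeper levels, conditioned on the coarse states, spell out finer steps; the hybrid statistical framework would replace the direct encoder-decoder coupling with an intermediate statistical model of the clip uncertainties.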
Publication date: 14 Nov 2023
arXiv: https://arxiv.org/abs/2311.07285
Paper (PDF): https://arxiv.org/pdf/2311.07285