The availability of online videos has transformed how people access information. In the medical domain, instructional videos can provide visual answers to health-related questions. This paper proposes an approach to creating two large-scale datasets, HealthVidQA-CRF and HealthVidQA-Prompt, to address the scarcity of large-scale datasets in the medical domain. The authors also propose monomodal and multimodal approaches that localize visual answers to natural language questions within medical videos. Their findings suggest these datasets can improve performance on medical visual answer localization tasks.
Publication date: 21 Sep 2023
Project Page: https://arxiv.org/abs/2309.12224v1
Paper: https://arxiv.org/pdf/2309.12224