RAVEN is a novel unconditional video generative model designed to capture long-term spatial and temporal dependencies. It models an entire video sequence with a hybrid explicit-implicit tri-plane representation, which reduces computational complexity and makes video generation more efficient. An integrated optical-flow-based module further strengthens the model, enabling it to synthesize high-fidelity video clips at a resolution of 256×256 pixels that run longer than 5 seconds at 30 fps. The approach is validated on three datasets comprising both synthetic and real video clips.
Publication date: 11 Jan 2024
Project Page: https://arxiv.org/abs/2401.06035
Paper: https://arxiv.org/pdf/2401.06035
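
To make the tri-plane idea in the summary more concrete, here is a minimal sketch of how a video point (x, y, t) could be looked up in three 2D feature planes (xy, xt, yt), and why that is cheaper than a dense space-time grid. This is not RAVEN's actual implementation: the function names, the random plane contents, the bilinear interpolation, the summation of the three features, and the choice of full-resolution planes for the cell count are all assumptions made for illustration.

```python
import random


def make_plane(rows, cols, channels, rng):
    """Build a rows x cols grid of feature vectors (random placeholder values)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(channels)]
             for _ in range(cols)]
            for _ in range(rows)]


def bilinear(plane, u, v):
    """Interpolate a feature vector at normalised coords u (rows), v (cols) in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    r, c = u * (rows - 1), v * (cols - 1)
    r0, c0 = int(r), int(c)
    r1, c1 = min(r0 + 1, rows - 1), min(c0 + 1, cols - 1)
    fr, fc = r - r0, c - c0
    channels = len(plane[0][0])
    return [
        (1 - fr) * (1 - fc) * plane[r0][c0][k]
        + (1 - fr) * fc * plane[r0][c1][k]
        + fr * (1 - fc) * plane[r1][c0][k]
        + fr * fc * plane[r1][c1][k]
        for k in range(channels)
    ]


def triplane_feature(planes, x, y, t):
    """Combine features from the xy, xt and yt planes at normalised (x, y, t).

    Summation is an assumed aggregation; other schemes (e.g. concatenation) exist.
    """
    f_xy = bilinear(planes["xy"], y, x)  # rows index y, cols index x
    f_xt = bilinear(planes["xt"], t, x)  # rows index t, cols index x
    f_yt = bilinear(planes["yt"], t, y)  # rows index t, cols index y
    return [a + b + c for a, b, c in zip(f_xy, f_xt, f_yt)]


def dense_cells(height, width, frames):
    """Number of cells in a dense space-time feature grid."""
    return height * width * frames


def triplane_cells(height, width, frames):
    """Number of cells across the three planes of a tri-plane."""
    return height * width + width * frames + height * frames


# Tiny hypothetical planes: 8x8 pixels, 6 time steps, 4 feature channels.
rng = random.Random(0)
planes = {
    "xy": make_plane(8, 8, 4, rng),
    "xt": make_plane(6, 8, 4, rng),
    "yt": make_plane(6, 8, 4, rng),
}
feature = triplane_feature(planes, x=0.25, y=0.5, t=0.75)
print(len(feature), "channels at the queried point")

# 5 seconds at 30 fps is 150 frames; compare storage at 256x256.
print(dense_cells(256, 256, 150), "dense cells vs", triplane_cells(256, 256, 150), "tri-plane cells")
```

Under these assumptions, a 150-frame clip at 256×256 needs 9,830,400 cells in a dense grid but only 142,336 cells across the three planes. In a full model the combined feature would typically be passed to a learned decoder to produce pixel values; that step is omitted here.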