The article presents LaVie, a text-to-video generation framework that builds on a pre-trained text-to-image model. It is a cascade of three video latent diffusion models: a base text-to-video model, a temporal interpolation model, and a video super-resolution model. To capture temporal correlations in video data, LaVie adds simple temporal self-attention combined with rotary positional encoding. The authors find that joint image-video fine-tuning is crucial for producing high-quality results. To improve LaVie’s performance, they also introduce Vimeo25M, a diverse dataset of 25 million text-video pairs. The authors report state-of-the-art performance, both quantitatively and qualitatively, and demonstrate the pre-trained models in long video generation and personalized video synthesis applications.
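To make the temporal self-attention idea more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, a `(frames, height, width, channels)` feature layout, and the common half-split form of rotary encoding; the function names `rotary_embed` and `temporal_self_attention` and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rotary_embed(x, base=10000.0):
    """Rotate feature pairs by an angle proportional to the frame index.

    x has shape (..., frames, dim) with an even dim. The frame axis is
    treated as the position axis (an assumption of this sketch).
    """
    frames, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)              # (half,)
    angles = np.arange(frames)[:, None] * inv_freq[None, :]   # (frames, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def temporal_self_attention(video, w_q, w_k, w_v):
    """Single-head attention where each pixel attends only across frames.

    video: (frames, height, width, channels)
    w_q, w_k, w_v: (channels, channels) projection matrices
    """
    f, h, w, c = video.shape
    # Treat every spatial location as a separate sequence of f frame tokens.
    tokens = video.reshape(f, h * w, c).transpose(1, 0, 2)    # (h*w, f, c)
    q = rotary_embed(tokens @ w_q)
    k = rotary_embed(tokens @ w_k)
    v = tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)            # (h*w, f, f)
    scores = scores - scores.max(axis=-1, keepdims=True)      # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                         # (h*w, f, c)
    return out.transpose(1, 0, 2).reshape(f, h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(16, 4, 4, 8))                    # toy latent video
    w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(temporal_self_attention(video, w_q, w_k, w_v).shape)
```

Because the rotation angle depends only on the frame index, the query-key dot product depends on the relative distance between two frames rather than their absolute positions, which is the usual motivation for rotary encoding on a temporal axis.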

Publication date: 26 Sep 2023
Project Page: https://vchitect.github.io/LaVie-project/
Paper: https://arxiv.org/pdf/2309.15103