The paper introduces a new category of diffusion models built on a state space architecture for image data. These models, called Diffusion State Space Models (DiS), treat all inputs, including the timestep, the condition, and the noisy image patches, as tokens. The study covers both unconditional and class-conditional image generation, showing that DiS models perform comparably to or better than CNN-based and Transformer-based U-Net architectures of similar size. DiS models also exhibit strong scalability: models with higher Gflops consistently achieve lower FID. In latent space, DiS-H/2 models match the performance of prior diffusion models on class-conditional ImageNet benchmarks at 256×256 and 512×512 resolution while significantly reducing the computational load.
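The all-inputs-as-tokens design can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the random stand-in projections, and the simple scalar timestep embedding are all assumptions; the actual DiS model uses learned embeddings and feeds the resulting sequence to a state space backbone.

```python
import numpy as np

def patchify(img, patch=2):
    """Split a (C, H, W) image into non-overlapping patches,
    returning a (num_patches, patch*patch*C) matrix."""
    C, H, W = img.shape
    gh, gw = H // patch, W // patch
    x = img.reshape(C, gh, patch, gw, patch)
    return x.transpose(1, 3, 2, 4, 0).reshape(gh * gw, patch * patch * C)

def build_token_sequence(img, t, label, dim, patch=2, num_classes=10, seed=0):
    """Illustrative only: concatenate a timestep token, a class-condition
    token, and the projected image-patch tokens into one sequence, as DiS
    treats all three kinds of input uniformly as tokens."""
    rng = np.random.default_rng(seed)
    patches = patchify(img, patch)                        # (N, patch*patch*C)
    proj = rng.standard_normal((patches.shape[1], dim))   # stand-in for a learned linear embedding
    patch_tokens = patches @ proj                         # (N, dim)
    time_token = np.full((1, dim), float(t) / 1000.0)     # stand-in for a learned timestep embedding
    class_table = rng.standard_normal((num_classes, dim)) # stand-in class-embedding table
    cond_token = class_table[label:label + 1]             # (1, dim)
    return np.concatenate([time_token, cond_token, patch_tokens], axis=0)

# A 4-channel 8x8 latent with patch size 2 yields 16 patch tokens,
# plus one timestep token and one condition token: 18 tokens total.
seq = build_token_sequence(np.zeros((4, 8, 8)), t=500, label=3, dim=32)
print(seq.shape)  # (18, 32)
```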

Publication date: 8 Feb 2024
Project Page: https://github.com/feizc/DiS
Paper: https://arxiv.org/pdf/2402.05608