PAE: What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
This project presents PAE (Prior-Aligned AutoEncoder), a tokenizer framework that explicitly shapes a diffusion-friendly latent manifold for latent diffusion models. Instead of relying solely on reconstruction fidelity or passively inheriting pretrained representations, PAE identifies three key properties of a diffusion-friendly latent space (spatial structure coherence, local manifold continuity, and global manifold semantics) and optimizes them through targeted prior-alignment regularizations. On ImageNet 256×256, PAE achieves a new state-of-the-art gFID of 1.03 with up to 13× faster convergence than RAE under the same LightningDiT setup.
Prior alignment constructs a diffusion-friendly latent manifold. Left: Compared with reconstruction-oriented counterparts, the prior-aligned latent manifold is more structurally coherent, locally continuous, and semantically organized. Right: PAE yields faster convergence, better generation quality, and robust few-step sampling performance.
Class-conditional samples generated by PAE with LightningDiT-XL/1 on ImageNet 256×256.
Updates
- [2026.05.09] We release PAE. Code and pretrained models are now available!
Highlights
- New Perspective: We study what makes a latent manifold diffusion-friendly, identifying three key properties: spatial structure coherence, local manifold continuity, and global manifold semantics.
- Explicit Manifold Shaping: PAE turns these properties into explicit training objectives via three prior-alignment regularizations (SSR, MCR, SCR), rather than leaving them to emerge indirectly.
- 13× Faster Convergence: PAE reaches performance comparable to RAE with up to 13× fewer training epochs under the same LightningDiT setup.
- State-of-the-Art: Achieves gFID 1.03 on ImageNet 256×256, the best result among all compared methods.
- Encoder-Agnostic: Compatible with multiple VFM backbones including DINOv2, SigLIP2, DINOv3, and MAE.
Architecture
Overview of the PAE framework. A frozen VFM provides stable representation features. DAM injects pixel detail while preserving the VFM as the dominant semantic source. Three prior-alignment objectives explicitly shape the latent manifold.
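As a rough illustration of how three prior-alignment objectives could be combined with a reconstruction term during tokenizer training, here is a minimal sketch. It assumes a simple weighted sum; this README does not specify the actual loss form or weights, so `PriorAlignmentWeights`, `total_tokenizer_loss`, and the default values are hypothetical names and numbers, not the released implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorAlignmentWeights:
    """Hypothetical loss weights; the actual values are not given in this README."""
    ssr: float = 1.0  # weight of the SSR regularization term
    mcr: float = 1.0  # weight of the MCR regularization term
    scr: float = 1.0  # weight of the SCR regularization term


def total_tokenizer_loss(recon: float, ssr: float, mcr: float, scr: float,
                         weights: PriorAlignmentWeights = PriorAlignmentWeights()) -> float:
    """Combine a reconstruction loss with the three regularizers.

    Assumption: the terms are combined as a weighted sum. The inputs are
    already-computed scalar loss values for one training step.
    """
    return (recon
            + weights.ssr * ssr
            + weights.mcr * mcr
            + weights.scr * scr)


if __name__ == "__main__":
    # Example values are made up purely to show the call pattern.
    loss = total_tokenizer_loss(recon=0.5, ssr=0.1, mcr=0.2, scr=0.3)
    print(f"total loss: {loss:.3f}")
```

In a real training loop each term would be a tensor produced by the corresponding objective, and the weights would come from the training configuration shipped with the code release.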
Acknowledgement
Our work builds upon the foundations laid by many excellent projects in the field. We thank the authors of LightningDiT, RAE, GAE, and ADM, and we are grateful for their contributions to the community.