PAE: What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

This project presents PAE (Prior-Aligned AutoEncoder), a tokenizer framework that explicitly shapes a diffusion-friendly latent manifold for latent diffusion models. Instead of relying solely on reconstruction fidelity or passively inheriting pretrained representations, PAE identifies three key properties of a diffusion-friendly latent space (spatial structure coherence, local manifold continuity, and global manifold semantics) and optimizes them through targeted prior-alignment regularizations. On ImageNet 256×256, PAE achieves a new state-of-the-art gFID of 1.03 with up to 13× faster convergence than RAE under the same LightningDiT setup.


Prior alignment constructs a diffusion-friendly latent manifold. Left: Compared with reconstruction-oriented counterparts, the prior-aligned latent manifold is more structurally coherent, locally continuous, and semantically organized. Right: PAE yields faster convergence, better generation quality, and robust few-step sampling performance.


Class-conditional samples generated by PAE with LightningDiT-XL/1 on ImageNet 256×256.

🔥 Updates

  • [2026.05.09] 🚀 🚀 🚀 We release PAE. Code and pretrained models are now available!

✨ Highlights

  • 🎯 New Perspective: We study what makes a latent manifold diffusion-friendly, identifying three key properties: spatial structure coherence, local manifold continuity, and global manifold semantics.
  • πŸ—οΈ Explicit Manifold Shaping: PAE turns these properties into explicit training objectives via three prior-alignment regularizations (SSR, MCR, SCR), rather than leaving them to emerge indirectly.
  • ⚑ 13Γ— Faster Convergence: PAE reaches performance comparable to RAE with up to 13Γ— fewer training epochs under the same LightningDiT setup.
  • πŸ† State-of-the-Art: Achieves gFID 1.03 on ImageNet 256Γ—256, the best result among all compared methods.
  • πŸ”„ Encoder-Agnostic: Compatible with multiple VFM backbones including DINOv2, SigLIP2, DINOv3, and MAE.

πŸ›οΈ Architecture


Overview of the PAE framework. A frozen VFM provides stable representation features. DAM injects pixel detail while preserving the VFM as the dominant semantic source. Three prior-alignment objectives explicitly shape the latent manifold.
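
As a rough illustration of how several regularizers can be attached to a tokenizer's training objective, the sketch below combines a reconstruction loss with three prior-alignment terms as a weighted sum. This is an assumption made for illustration only: the weighted-sum form, the names `PriorAlignmentWeights` and `total_loss`, and every numeric value are hypothetical and are not taken from the PAE code or paper. How SSR, MCR, and SCR are actually computed is not described here, so they appear only as precomputed scalars.

```python
from dataclasses import dataclass


@dataclass
class PriorAlignmentWeights:
    """Hypothetical weights for the three prior-alignment terms."""
    ssr: float = 1.0
    mcr: float = 1.0
    scr: float = 1.0


def total_loss(recon: float, ssr: float, mcr: float, scr: float,
               w: PriorAlignmentWeights) -> float:
    """Weighted sum of a reconstruction loss and three regularizers.

    Each argument is an already-computed scalar loss value; this function
    only shows the combination step, not how any term is derived.
    """
    return recon + w.ssr * ssr + w.mcr * mcr + w.scr * scr


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the function.
    weights = PriorAlignmentWeights(ssr=1.0, mcr=0.5, scr=2.0)
    print(total_loss(recon=0.5, ssr=0.25, mcr=0.5, scr=0.125, w=weights))
```

Setting all three weights to zero reduces this sketch to a reconstruction-only objective, which is the kind of baseline the overview contrasts against.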

❤️ Acknowledgement

Our work builds upon the foundations laid by many excellent projects in the field. We thank the authors of LightningDiT, RAE, GAE, and ADM, and we are grateful for their contributions to the community.
