Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
Abstract
Quant VideoGen addresses the KV-cache memory bottleneck in autoregressive video diffusion models through semantic-aware smoothing and progressive residual quantization, reducing KV-cache memory by up to 7.0× with under 4% end-to-end latency overhead.
Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV-cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV-cache budgets restrict the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV-cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic-Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while enabling a smooth quality-memory trade-off. Across LongCat-Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV-cache memory by up to 7.0× with less than 4% end-to-end latency overhead while consistently outperforming existing baselines in generation quality.
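As a rough illustration of the two ideas the abstract names, the sketch below pairs a simple temporal-delta form of smoothing with a two-stage, per-group 2-bit residual quantizer applied to per-frame KV slices. The delta rule, group size, stage count, and all function names are assumptions made for illustration; the paper's actual Semantic-Aware Smoothing and Progressive Residual Quantization designs are not specified in the abstract.

```python
# Illustrative sketch only: the smoothing rule (frame delta), the symmetric
# per-group quantizer, and the two-stage refinement below are assumptions that
# mirror the described ideas (temporal redundancy -> low-magnitude residuals;
# coarse-to-fine multi-stage quantization), not the authors' implementation.
import torch

def quantize_groupwise(x: torch.Tensor, n_bits: int, group_size: int = 64):
    """Symmetric per-group quantization; assumes numel divisible by group_size."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 1 for signed 2-bit
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    codes = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
    return codes.to(torch.int8), scale, orig_shape

def dequantize_groupwise(codes, scale, orig_shape):
    return (codes.float() * scale).reshape(orig_shape)

def smooth_and_quantize_kv(kv_frames: torch.Tensor, n_stages: int = 2, n_bits: int = 2):
    """
    kv_frames: [T, H, L, D] per-frame K (or V) cache slices.
    Step 1 (assumed smoothing): subtract the running reference so temporally
    redundant content cancels, leaving low-magnitude residuals.
    Step 2 (progressive residual quantization): quantize the residual coarsely,
    then quantize what the previous stage missed, refining stage by stage.
    """
    prev = torch.zeros_like(kv_frames[0])
    stored, recon_frames = [], []
    for frame in kv_frames:
        residual = frame - prev                        # quantization-friendly signal
        recon, stages = torch.zeros_like(residual), []
        for _ in range(n_stages):                      # coarse-to-fine stages
            codes, scale, shape = quantize_groupwise(residual - recon, n_bits)
            stages.append((codes, scale, shape))
            recon = recon + dequantize_groupwise(codes, scale, shape)
        stored.append(stages)                          # only low-bit codes are kept
        prev = prev + recon                            # advance the running reference
        recon_frames.append(prev.clone())
    return stored, torch.stack(recon_frames)

if __name__ == "__main__":
    kv = torch.randn(8, 4, 256, 64).cumsum(dim=0) * 0.1   # slowly drifting toy cache
    _, recon = smooth_and_quantize_kv(kv)
    print("relative error:", ((recon - kv).norm() / kv.norm()).item())
```

In this sketch the running reference `prev` is advanced from the dequantized reconstruction rather than the full-precision frame, so the stored low-bit codes and the cache the decoder would see stay consistent instead of drifting apart over long horizons.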
Community
Efficient long video generation, designed for world models and autoregressive video generation applications.
The following similar papers were recommended by the Librarian Bot via the Semantic Scholar API:
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026)
- Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion (2026)
- HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming (2025)
- XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression (2026)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache (2026)
- TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation (2026)
- Efficient Autoregressive Video Diffusion with Dummy Head (2026)