# ScragVAE: Improved VAE Decoder for ACE-Step 1.5

A fine-tuned AutoencoderOobleck decoder that improves audio fidelity in the ACE-Step 1.5 music generation pipeline. Drop-in compatible with all existing ACE-Step DiT checkpoints.
## What is this?

ACE-Step 1.5 uses a VAE (Variational Autoencoder) to convert between audio waveforms and the latent space that the DiT diffusion model operates in. The original VAE decoder attenuates high-frequency content, resulting in audio with reduced clarity and detail above 6 kHz.

ScragVAE retrains the decoder half of the VAE to better reconstruct upper harmonics, transient detail, and spectral "air", while keeping the encoder frozen so all existing DiT models remain fully compatible.
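
To make this concrete, here is a minimal sketch of the decoder swap, assuming both VAEs load via diffusers' `AutoencoderOobleck` (the original-VAE path is a placeholder):

```python
import torch
from diffusers import AutoencoderOobleck

original = AutoencoderOobleck.from_pretrained("path/to/original-acestep-vae")  # hypothetical path
scrag = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")

waveform = torch.randn(1, 2, 48_000)  # one second of stereo audio at 48 kHz

# The encoders are identical, so latents from either VAE work with any DiT
latents = original.encode(waveform).latent_dist.sample()

# Same latents, different decoders: ScragVAE reconstructs more HF detail
audio_original = original.decode(latents).sample
audio_scrag = scrag.decode(latents).sample
```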
## Benchmarks

Objective spectral analysis comparing ScragVAE vs. the original ACE-Step 1.5 VAE decoder on identical latents (same seed, same DiT output):
| Metric | ScragVAE | Original VAE | Improvement |
|---|---|---|---|
| Dynamic range | 85.8 dB | 56.5 dB | +29.3 dB |
| HF energy ratio (>8 kHz) | 1.17% | 0.85% | +38% |
| HF energy ratio (>12 kHz) | 0.21% | 0.12% | +83% |
| Band: brilliance (6-12 kHz) | 43.0 dB | 42.4 dB | +0.6 dB |
| Band: air (12-24 kHz) | 30.5 dB | 28.2 dB | +2.3 dB |
| Spectral rolloff (95%) | 3326 Hz | 2901 Hz | +425 Hz |
| Spectral centroid | 3662 Hz | 3447 Hz | +214 Hz (brighter) |

Summary: ScragVAE preserves significantly more high-frequency content (especially 10-20 kHz) and has dramatically better dynamic range, resulting in clearer vocals, crisper transients, and more natural-sounding audio.
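
The exact analysis settings behind these numbers are not published; the sketch below shows one plausible way to compute the centroid, rolloff, and HF energy ratio with NumPy, where the FFT size, hop, and window are assumptions:

```python
import numpy as np

def spectral_metrics(audio: np.ndarray, sr: int = 48_000, n_fft: int = 4096):
    """Average-magnitude-spectrum statistics for a mono waveform."""
    window = np.hanning(n_fft)
    hop = n_fft // 4
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft, hop)]
    energy = (np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1 / sr)

    centroid = (freqs * energy).sum() / energy.sum()       # spectral centroid (Hz)
    cumulative = np.cumsum(energy) / energy.sum()
    rolloff = freqs[np.searchsorted(cumulative, 0.95)]     # 95% rolloff (Hz)
    hf_ratio = energy[freqs > 8_000].sum() / energy.sum()  # HF energy ratio (>8 kHz)
    return centroid, rolloff, hf_ratio
```

Running this on both decoders' outputs for the same latents should reproduce the direction, if not the exact values, of the table above.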
## Files

| File | Format | Size | Use with |
|---|---|---|---|
| `diffusion_pytorch_model.safetensors` | F32 safetensors | 644 MB | Python / Diffusers / HOT-Step 9000 |
| `scragvae-BF16.gguf` | BF16 GGUF | 322 MB | acestep.cpp / HOT-Step CPP |
| `config.json` | JSON | <1 KB | Architecture config (required for both) |
## Usage

### Python / Diffusers

ScragVAE is a drop-in replacement for the ACE-Step VAE. Replace the VAE checkpoint path in your pipeline:
```python
from diffusers import AutoencoderOobleck

# Load ScragVAE instead of the default VAE
vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")

# Use with your existing ACE-Step pipeline
# (replace the vae in your pipeline config or checkpoint directory)
```
Or manually swap the decoder weights in an existing setup:
```python
import torch
from safetensors.torch import load_file

# Load ScragVAE weights
scrag_weights = load_file("diffusion_pytorch_model.safetensors")

# Only decoder.* keys differ; encoder.* are identical to the original
decoder_keys = {k: v for k, v in scrag_weights.items() if k.startswith("decoder.")}
your_vae.load_state_dict(decoder_keys, strict=False)
```
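
As a sanity check before swapping, you can verify that only the decoder changed. This sketch compares ScragVAE's `encoder.*` tensors against the original checkpoint (the original's path is hypothetical):

```python
import torch
from safetensors.torch import load_file

scrag = load_file("diffusion_pytorch_model.safetensors")
original = load_file("original-vae/diffusion_pytorch_model.safetensors")  # hypothetical path

# Every encoder.* tensor should match the original exactly
for key in scrag:
    if key.startswith("encoder."):
        assert torch.equal(scrag[key], original[key]), f"encoder mismatch: {key}"
print("encoder weights identical - safe to swap decoders only")
```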
### acestep.cpp / HOT-Step CPP

Place `scragvae-BF16.gguf` in your models directory alongside the other GGUF files:
```
models/
├── acestep-v15-turbo-BF16.gguf   # DiT
├── acestep-5Hz-lm-BF16.gguf      # LM
├── Qwen3-Embedding-BF16.gguf     # Text encoder
├── vae-BF16.gguf                 # Original VAE
└── scragvae-BF16.gguf            # ← ScragVAE (add this)
```
The engine auto-discovers all VAE GGUFs at startup. In HOT-Step CPP, select ScragVAE from the VAE Decoder dropdown in the Models & Adapters panel.
For acestep.cpp's built-in web UI or API, pass `"vae_model": "scragvae-BF16.gguf"` in your synth request JSON.
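
A minimal sketch of such a request; only the `vae_model` key is documented above, so the endpoint path, port, and the `prompt` field are assumptions to check against your engine's API docs:

```python
import json
import urllib.request

payload = {
    "prompt": "upbeat synthpop, bright female vocals",  # hypothetical field
    "vae_model": "scragvae-BF16.gguf",                  # documented key
}
req = urllib.request.Request(
    "http://localhost:8080/synth",  # hypothetical endpoint and port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    audio_bytes = resp.read()
```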
### Converting from safetensors to GGUF yourself

If you need to reconvert (e.g. after further fine-tuning):

```bash
python engine/convert.py  # scans checkpoints/ and outputs to models/
```

Or use the converter directly:

```python
from convert import convert_model

convert_model("scragvae", "/path/to/scragvae/", "scragvae-BF16.gguf", "vae")
```
## Architecture

ScragVAE uses the same AutoencoderOobleck architecture as the original ACE-Step VAE, with no structural changes. Only the decoder weights differ.
| Parameter | Value |
|---|---|
| Architecture | AutoencoderOobleck |
| Audio channels | 2 (stereo) |
| Sample rate | 48,000 Hz |
| Latent dim | 64 |
| Decoder channels | 128 |
| Channel multiples | [1, 2, 4, 8, 16] |
| Downsampling ratios | [2, 4, 4, 6, 10] |
| Total ratio | 1920× |
| Activation | Snake |
| Weight normalization | Yes (fused at load in GGUF) |
| Parameters | 168.7M (encoder + decoder) |
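
The downsampling ratios in the table multiply out to the 1920× total ratio, which also fixes the latent frame rate; the frame-rate arithmetic below is my own derivation, not from the model card:

```python
import math

downsampling_ratios = [2, 4, 4, 6, 10]
total_ratio = math.prod(downsampling_ratios)  # 2 * 4 * 4 * 6 * 10 = 1920
latent_frame_rate = 48_000 / total_ratio      # 25 latent frames per second
```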
## Compatibility

- ✅ All ACE-Step 1.5 DiT checkpoints (turbo, SFT, XL)
- ✅ All LoRA/adapter models
- ✅ Both Python (PyTorch/Diffusers) and C++ (ggml/acestep.cpp) runtimes
- ✅ Encoder weights are identical, so no retraining of upstream models is needed
## Training

### Strategy

Freeze the encoder; train the decoder only. The DiT operates in latent space, so by improving only the decoder, all existing DiT checkpoints remain compatible without retraining.
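
A minimal sketch of this setup, assuming a diffusers-style VAE with `encoder`/`decoder` submodules (the base-VAE path and the optimizer choice are assumptions):

```python
import torch
from diffusers import AutoencoderOobleck

vae = AutoencoderOobleck.from_pretrained("path/to/base-vae")  # hypothetical path

# Freeze the encoder so latents stay compatible with existing DiT checkpoints
for p in vae.encoder.parameters():
    p.requires_grad = False

# Train only the decoder at the documented learning rate
optimizer = torch.optim.AdamW(vae.decoder.parameters(), lr=3e-5)
```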
### Two-phase training
| Parameter | Phase 1 (Warm-up) | Phase 2 (Quality) |
|---|---|---|
| Steps | ~3,000 | ~98,000 |
| Learning rate | 3e-5 | 3e-5 |
| Adversarial weight | 0.5 | 1.5 |
| Feature matching | 5.0 | 3.0 |
| Perceptual weighting | On | Off |
| L1 time domain | 0.0 | 0.05 |
| Discriminator FFT sizes | 6 | 6 (+4096) |
| Spectral loss FFT sizes | – | 9 (32-8192) |
| Multi-res mel loss | – | 4 scales |
| Precision | bf16-mixed | bf16-mixed |
| Effective batch | 16 (8×2 accum) | 16 (8×2 accum) |
| Gradient clip | 1.0 | 1.0 |
### Key changes vs original training

- Disabled perceptual weighting in the spectral loss: the original's perceptual curve de-emphasizes high frequencies, actively suppressing HF reconstruction
- Increased adversarial weight (0.5 → 1.5): forces the decoder to produce more realistic spectral detail
- Reduced feature matching (5.0 → 3.0): less over-smoothing from discriminator feature constraints
- Added L1 time-domain loss (0.05): preserves transient attacks and waveform fidelity
- Added a 4096-point FFT to the discriminator: gives it finer resolution for harmonic content in the 2-8 kHz range
- Added multi-resolution mel-spectrogram loss at 4 scales: captures perceptually relevant frequency content (see the sketch after this list)
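
A hedged sketch of how the Phase 2 weights might combine into a generator loss; the loss-term callables are placeholders standing in for stable-audio-tools implementations, and only the weights come from the tables above:

```python
def generator_loss(real, fake, stft_loss, mel_loss, adv_loss, fm_loss):
    """Placeholder composition of the documented Phase 2 loss terms."""
    return (
        stft_loss(real, fake)                # multi-res spectral loss (9 FFT sizes, 32-8192), no perceptual weighting
        + mel_loss(real, fake)               # multi-resolution mel loss at 4 scales
        + 1.5 * adv_loss(fake)               # adversarial weight (0.5 in Phase 1)
        + 3.0 * fm_loss(real, fake)          # feature matching (5.0 in Phase 1)
        + 0.05 * (real - fake).abs().mean()  # L1 time-domain loss (0.0 in Phase 1)
    )
```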
### Hardware
- GPU: NVIDIA RTX 5090 (32GB)
- Training time: ~8 hours total (Phase 1 + Phase 2)
- Framework: PyTorch + stable-audio-tools
## License

MIT License, same as ACE-Step 1.5.
## Citation

If you use ScragVAE in your work:

```bibtex
@misc{scragvae2026,
  title={ScragVAE: Improved VAE Decoder for ACE-Step 1.5},
  author={Scragnog},
  year={2026},
  url={https://huggingface.co/scragnog/Ace-Step-1.5-ScragVAE}
}
```
## Acknowledgements

- ACE-Step 1.5: the base model and VAE architecture
- stable-audio-tools: training framework
- acestep.cpp: C++ inference engine with GGUF support