# ScragVAE: Improved VAE Decoder for ACE-Step 1.5

A fine-tuned AutoencoderOobleck decoder that improves audio fidelity in the ACE-Step 1.5 music generation pipeline. Drop-in compatible with all existing ACE-Step DiT checkpoints.
## What is this?

ACE-Step 1.5 uses a VAE (Variational Autoencoder) to convert between audio waveforms and the latent space that the DiT diffusion model operates in. The original VAE decoder attenuates high-frequency content, resulting in audio with reduced clarity and detail above 6 kHz.

ScragVAE retrains the decoder half of the VAE to better reconstruct upper harmonics, transient detail, and spectral "air", while keeping the encoder frozen so all existing DiT models remain fully compatible.
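
To make this concrete, here is a minimal sketch of the decoder swap, assuming both VAEs load via diffusers' `AutoencoderOobleck` (the original-VAE path is a placeholder):

```python
import torch
from diffusers import AutoencoderOobleck

original = AutoencoderOobleck.from_pretrained("path/to/original-acestep-vae")  # hypothetical path
scrag = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")

waveform = torch.randn(1, 2, 48_000)  # one second of stereo audio at 48 kHz

# The encoders are identical, so latents from either VAE work with any DiT
latents = original.encode(waveform).latent_dist.sample()

# Same latents, different decoders: ScragVAE reconstructs more HF detail
audio_original = original.decode(latents).sample
audio_scrag = scrag.decode(latents).sample
```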
## Benchmarks

Objective spectral analysis comparing ScragVAE vs. the original ACE-Step 1.5 VAE decoder on identical latents (same seed, same DiT output):
| Metric | ScragVAE | Original VAE | Improvement |
|---|---|---|---|
| Dynamic range | 85.8 dB | 56.5 dB | +29.3 dB |
| HF energy ratio (>8 kHz) | 1.17% | 0.85% | +38% |
| HF energy ratio (>12 kHz) | 0.21% | 0.12% | +83% |
| Band: brilliance (6-12 kHz) | 43.0 dB | 42.4 dB | +0.6 dB |
| Band: air (12-24 kHz) | 30.5 dB | 28.2 dB | +2.3 dB |
| Spectral rolloff (95%) | 3326 Hz | 2901 Hz | +425 Hz |
| Spectral centroid | 3662 Hz | 3447 Hz | +214 Hz (brighter) |

Summary: ScragVAE preserves significantly more high-frequency content (especially 10-20 kHz) and has dramatically better dynamic range, resulting in clearer vocals, crisper transients, and more natural-sounding audio.
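
The exact analysis settings behind these numbers are not published; the sketch below shows one plausible way to compute the centroid, rolloff, and HF energy ratio with NumPy, where the FFT size, hop, and window are assumptions:

```python
import numpy as np

def spectral_metrics(audio: np.ndarray, sr: int = 48_000, n_fft: int = 4096):
    """Average-magnitude-spectrum statistics for a mono waveform."""
    window = np.hanning(n_fft)
    hop = n_fft // 4
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft, hop)]
    energy = (np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1 / sr)

    centroid = (freqs * energy).sum() / energy.sum()       # spectral centroid (Hz)
    cumulative = np.cumsum(energy) / energy.sum()
    rolloff = freqs[np.searchsorted(cumulative, 0.95)]     # 95% rolloff (Hz)
    hf_ratio = energy[freqs > 8_000].sum() / energy.sum()  # HF energy ratio (>8 kHz)
    return centroid, rolloff, hf_ratio
```

Running this on both decoders' outputs for the same latents should reproduce the direction, if not the exact values, of the table above.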
## Files

| File | Format | Size | Use with |
|---|---|---|---|
| `diffusion_pytorch_model.safetensors` | F32 safetensors | 644 MB | Python / Diffusers / HOT-Step 9000 |
| `scragvae-BF16.gguf` | BF16 GGUF | 322 MB | acestep.cpp / HOT-Step CPP |
| `config.json` | JSON | <1 KB | Architecture config (required for both) |
## Usage

### Python / Diffusers

ScragVAE is a drop-in replacement for the ACE-Step VAE. Replace the VAE checkpoint path in your pipeline:
```python
from diffusers import AutoencoderOobleck

# Load ScragVAE instead of the default VAE
vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")

# Use with your existing ACE-Step pipeline
# (replace the vae in your pipeline config or checkpoint directory)
```
Or manually swap the decoder weights in an existing setup:
```python
import torch
from safetensors.torch import load_file

# Load ScragVAE weights
scrag_weights = load_file("diffusion_pytorch_model.safetensors")

# Only decoder.* keys differ; encoder.* are identical to the original
decoder_keys = {k: v for k, v in scrag_weights.items() if k.startswith("decoder.")}
your_vae.load_state_dict(decoder_keys, strict=False)
```
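
As a sanity check before swapping, you can verify that only the decoder changed. This sketch compares ScragVAE's `encoder.*` tensors against the original checkpoint (the original's path is hypothetical):

```python
import torch
from safetensors.torch import load_file

scrag = load_file("diffusion_pytorch_model.safetensors")
original = load_file("original-vae/diffusion_pytorch_model.safetensors")  # hypothetical path

# Every encoder.* tensor should match the original exactly
for key in scrag:
    if key.startswith("encoder."):
        assert torch.equal(scrag[key], original[key]), f"encoder mismatch: {key}"
print("encoder weights identical - safe to swap decoders only")
```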
### acestep.cpp / HOT-Step CPP

Place `scragvae-BF16.gguf` in your models directory alongside the other GGUF files:
```
models/
├── acestep-v15-turbo-BF16.gguf   # DiT
├── acestep-5Hz-lm-BF16.gguf      # LM
├── Qwen3-Embedding-BF16.gguf     # Text encoder
├── vae-BF16.gguf                 # Original VAE
└── scragvae-BF16.gguf            # ← ScragVAE (add this)
```
The engine auto-discovers all VAE GGUFs at startup. In HOT-Step CPP, select ScragVAE from the VAE Decoder dropdown in the Models & Adapters panel.
For acestep.cpp's built-in web UI or API, pass `"vae_model": "scragvae-BF16.gguf"` in your synth request JSON.
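
A minimal sketch of such a request; only the `vae_model` key is documented above, so the endpoint path, port, and the `prompt` field are assumptions to check against your engine's API docs:

```python
import json
import urllib.request

payload = {
    "prompt": "upbeat synthpop, bright female vocals",  # hypothetical field
    "vae_model": "scragvae-BF16.gguf",                  # documented key
}
req = urllib.request.Request(
    "http://localhost:8080/synth",  # hypothetical endpoint and port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    audio_bytes = resp.read()
```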
### Converting from safetensors to GGUF yourself

If you need to reconvert (e.g. after further fine-tuning):

```bash
python engine/convert.py  # scans checkpoints/ and outputs to models/
```

Or use the converter directly:

```python
from convert import convert_model

convert_model("scragvae", "/path/to/scragvae/", "scragvae-BF16.gguf", "vae")
```
## Architecture

ScragVAE uses the same AutoencoderOobleck architecture as the original ACE-Step VAE, with no structural changes. Only the decoder weights differ.
| Parameter | Value |
|---|---|
| Architecture | AutoencoderOobleck |
| Audio channels | 2 (stereo) |
| Sample rate | 48,000 Hz |
| Latent dim | 64 |
| Decoder channels | 128 |
| Channel multiples | [1, 2, 4, 8, 16] |
| Downsampling ratios | [2, 4, 4, 6, 10] |
| Total ratio | 1920× |
| Activation | Snake |
| Weight normalization | Yes (fused at load in GGUF) |
| Parameters | 168.7M (encoder + decoder) |
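
The downsampling ratios in the table multiply out to the 1920× total ratio, which also fixes the latent frame rate; the frame-rate arithmetic below is my own derivation, not from the model card:

```python
import math

downsampling_ratios = [2, 4, 4, 6, 10]
total_ratio = math.prod(downsampling_ratios)  # 2 * 4 * 4 * 6 * 10 = 1920
latent_frame_rate = 48_000 / total_ratio      # 25 latent frames per second
```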
## Compatibility

- ✅ All ACE-Step 1.5 DiT checkpoints (turbo, SFT, XL)
- ✅ All LoRA/adapter models
- ✅ Both Python (PyTorch/Diffusers) and C++ (ggml/acestep.cpp) runtimes
- ✅ Encoder weights are identical, so no retraining of upstream models is needed
## Training

### Strategy

Freeze the encoder; train the decoder only. The DiT operates in latent space, so by improving only the decoder, all existing DiT checkpoints remain compatible without retraining.
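
A minimal sketch of this setup, assuming a diffusers-style VAE with `encoder`/`decoder` submodules (the base-VAE path and the optimizer choice are assumptions):

```python
import torch
from diffusers import AutoencoderOobleck

vae = AutoencoderOobleck.from_pretrained("path/to/base-vae")  # hypothetical path

# Freeze the encoder so latents stay compatible with existing DiT checkpoints
for p in vae.encoder.parameters():
    p.requires_grad = False

# Train only the decoder at the documented learning rate
optimizer = torch.optim.AdamW(vae.decoder.parameters(), lr=3e-5)
```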
### Two-phase training
| Parameter | Phase 1 (Warm-up) | Phase 2 (Quality) |
|---|---|---|
| Steps | ~3,000 | ~98,000 |
| Learning rate | 3e-5 | 3e-5 |
| Adversarial weight | 0.5 | 1.5 |
| Feature matching | 5.0 | 3.0 |
| Perceptual weighting | On | Off |
| L1 time domain | 0.0 | 0.05 |
| Discriminator FFT sizes | 6 | 6 (+4096) |
| Spectral loss FFT sizes | – | 9 (32-8192) |
| Multi-res mel loss | – | 4 scales |
| Precision | bf16-mixed | bf16-mixed |
| Effective batch | 16 (8×2 accum) | 16 (8×2 accum) |
| Gradient clip | 1.0 | 1.0 |
### Key changes vs original training

- Disabled perceptual weighting in the spectral loss: the original's perceptual curve de-emphasizes high frequencies, actively suppressing HF reconstruction
- Increased adversarial weight (0.5 → 1.5): forces the decoder to produce more realistic spectral detail
- Reduced feature matching (5.0 → 3.0): less over-smoothing from discriminator feature constraints
- Added L1 time-domain loss (0.05): preserves transient attacks and waveform fidelity
- Added a 4096-point FFT to the discriminator: gives it finer resolution for harmonic content in the 2-8 kHz range
- Added multi-resolution mel-spectrogram loss at 4 scales: captures perceptually relevant frequency content (see the sketch after this list)
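
A hedged sketch of how the Phase 2 weights might combine into a generator loss; the loss-term callables are placeholders standing in for stable-audio-tools implementations, and only the weights come from the tables above:

```python
def generator_loss(real, fake, stft_loss, mel_loss, adv_loss, fm_loss):
    """Placeholder composition of the documented Phase 2 loss terms."""
    return (
        stft_loss(real, fake)                # multi-res spectral loss (9 FFT sizes, 32-8192), no perceptual weighting
        + mel_loss(real, fake)               # multi-resolution mel loss at 4 scales
        + 1.5 * adv_loss(fake)               # adversarial weight (0.5 in Phase 1)
        + 3.0 * fm_loss(real, fake)          # feature matching (5.0 in Phase 1)
        + 0.05 * (real - fake).abs().mean()  # L1 time-domain loss (0.0 in Phase 1)
    )
```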
### Hardware
- GPU: NVIDIA RTX 5090 (32GB)
- Training time: ~8 hours total (Phase 1 + Phase 2)
- Framework: PyTorch + stable-audio-tools
## License

MIT License, same as ACE-Step 1.5.
## Citation

If you use ScragVAE in your work:

```bibtex
@misc{scragvae2026,
  title={ScragVAE: Improved VAE Decoder for ACE-Step 1.5},
  author={Scragnog},
  year={2026},
  url={https://huggingface.co/scragnog/Ace-Step-1.5-ScragVAE}
}
```
## Acknowledgements

- ACE-Step 1.5: the base model and VAE architecture
- stable-audio-tools: training framework
- acestep.cpp: C++ inference engine with GGUF support