ScragVAE: Improved VAE Decoder for ACE-Step 1.5

A fine-tuned AutoencoderOobleck decoder that improves audio fidelity for the ACE-Step 1.5 music generation pipeline. Drop-in compatible with all existing ACE-Step DiT checkpoints.

What is this?

ACE-Step 1.5 uses a VAE (Variational Autoencoder) to convert between audio waveforms and the latent space that the DiT diffusion model operates in. The original VAE decoder attenuates high-frequency content, resulting in audio with reduced clarity and detail above 6kHz.

ScragVAE retrains the decoder half of the VAE to better reconstruct upper harmonics, transient detail, and spectral "air", while keeping the encoder frozen so all existing DiT models remain fully compatible.
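
Concretely, the round trip in Diffusers looks roughly like this (a minimal sketch; the shapes follow the architecture table below, and the encode/decode output attributes assume Diffusers' usual VAE conventions):

import torch
from diffusers import AutoencoderOobleck

vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")
vae.eval()

# Two seconds of stereo audio at 48 kHz: (batch, channels, samples)
audio = torch.randn(1, 2, 96000)

with torch.no_grad():
    # Encode: 1920x downsampling -> 96000 / 1920 = 50 latent frames of dim 64
    latents = vae.encode(audio).latent_dist.sample()  # (1, 64, 50)
    # Decode: the half that ScragVAE retrains
    recon = vae.decode(latents).sample                # (1, 2, 96000)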

Benchmarks

Objective spectral analysis comparing ScragVAE vs the original ACE-Step 1.5 VAE decoder on identical latents (same seed, same DiT output):

| Metric | ScragVAE | Original VAE | Improvement |
|---|---|---|---|
| Dynamic range | 85.8 dB | 56.5 dB | +29.3 dB |
| HF energy ratio (>8kHz) | 1.17% | 0.85% | +38% |
| HF energy ratio (>12kHz) | 0.21% | 0.12% | +83% |
| Band: brilliance (6–12kHz) | 43.0 dB | 42.4 dB | +0.6 dB |
| Band: air (12–24kHz) | 30.5 dB | 28.2 dB | +2.3 dB |
| Spectral rolloff (95%) | 3326 Hz | 2901 Hz | +425 Hz |
| Spectral centroid | 3662 Hz | 3447 Hz | +214 Hz (brighter) |

Summary: ScragVAE preserves significantly more high-frequency content (especially 10–20kHz) and has dramatically better dynamic range, resulting in clearer vocals, crisper transients, and more natural-sounding audio.
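
These are standard spectral statistics; a minimal sketch of how such numbers can be computed with NumPy, assuming a mono float waveform (librosa offers equivalent helpers):

import numpy as np

def hf_energy_ratio(audio, sr, cutoff_hz=8000):
    # Percentage of spectral energy above cutoff_hz
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    return 100.0 * power[freqs > cutoff_hz].sum() / power.sum()

def spectral_rolloff(audio, sr, pct=0.95):
    # Frequency below which pct of the total spectral energy lies
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    cum = np.cumsum(power)
    return freqs[np.searchsorted(cum, pct * cum[-1])]

def spectral_centroid(audio, sr):
    # Power-weighted mean frequency
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    return (freqs * power).sum() / power.sum()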

Files

| File | Format | Size | Use with |
|---|---|---|---|
| diffusion_pytorch_model.safetensors | F32 safetensors | 644 MB | Python / Diffusers / HOT-Step 9000 |
| scragvae-BF16.gguf | BF16 GGUF | 322 MB | acestep.cpp / HOT-Step CPP |
| config.json | JSON | <1 KB | Architecture config (required for both) |

Usage

Python / Diffusers

ScragVAE is a drop-in replacement for the ACE-Step VAE. Replace the VAE checkpoint path in your pipeline:

from diffusers import AutoencoderOobleck

# Load ScragVAE instead of the default VAE
vae = AutoencoderOobleck.from_pretrained("scragnog/Ace-Step-1.5-ScragVAE")

# Use with your existing ACE-Step pipeline
# (replace the vae in your pipeline config or checkpoint directory)

Or manually swap the decoder weights in an existing setup:

import torch
from safetensors.torch import load_file

# Load ScragVAE weights
scrag_weights = load_file("diffusion_pytorch_model.safetensors")

# Only decoder.* keys differ; encoder.* keys are identical to the original
decoder_keys = {k: v for k, v in scrag_weights.items() if k.startswith("decoder.")}
your_vae.load_state_dict(decoder_keys, strict=False)
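
With strict=False, PyTorch skips the absent encoder.* entries (they stay as already loaded) and returns them in the result's missing_keys, which you can inspect as a sanity check that only decoder weights were replaced.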

acestep.cpp / HOT-Step CPP

Place scragvae-BF16.gguf in your models directory alongside the other GGUF files:

models/
├── acestep-v15-turbo-BF16.gguf    # DiT
├── acestep-5Hz-lm-BF16.gguf       # LM
├── Qwen3-Embedding-BF16.gguf      # Text encoder
├── vae-BF16.gguf                  # Original VAE
└── scragvae-BF16.gguf             # ← ScragVAE (add this)

The engine auto-discovers all VAE GGUFs at startup. In HOT-Step CPP, select ScragVAE from the VAE Decoder dropdown in the Models & Adapters panel.

For acestep.cpp's built-in web UI or API, pass "vae_model": "scragvae-BF16.gguf" in your synth request JSON.
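
A minimal request body might look like this; only the vae_model key is documented above, and the prompt field is just a placeholder for whatever your request already sends:

{
  "prompt": "...",
  "vae_model": "scragvae-BF16.gguf"
}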

Converting from safetensors to GGUF yourself

If you need to reconvert (e.g. after further fine-tuning):

python engine/convert.py  # scans checkpoints/ and outputs to models/

Or use the converter directly:

from convert import convert_model
convert_model("scragvae", "/path/to/scragvae/", "scragvae-BF16.gguf", "vae")

Architecture

ScragVAE uses the same AutoencoderOobleck architecture as the original ACE-Step VAE, with no structural changes. Only the decoder weights differ.

| Parameter | Value |
|---|---|
| Architecture | AutoencoderOobleck |
| Audio channels | 2 (stereo) |
| Sample rate | 48,000 Hz |
| Latent dim | 64 |
| Decoder channels | 128 |
| Channel multiples | [1, 2, 4, 8, 16] |
| Downsampling ratios | [2, 4, 4, 6, 10] |
| Total ratio | 1920× |
| Activation | Snake |
| Weight normalization | Yes (fused at load in GGUF) |
| Parameters | 168.7M (encoder + decoder) |
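
The downsampling ratios multiply out to 2 × 4 × 4 × 6 × 10 = 1920, matching the total ratio above, so 48 kHz audio maps to 48000 / 1920 = 25 latent frames per second. For reference, Snake is the standard x + (1/α)·sin²(αx) activation; a minimal sketch (per-channel α and any log-scale parameterization are implementation details that may differ here):

import torch

def snake(x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # Snake: x + (1/alpha) * sin^2(alpha * x)
    # Derivative is 1 + sin(2 * alpha * x) >= 0, so it never decreases,
    # while the periodic term biases the network toward oscillatory signals
    return x + (1.0 / alpha) * torch.sin(alpha * x) ** 2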

Compatibility

  • ✅ All ACE-Step 1.5 DiT checkpoints (turbo, SFT, XL)
  • ✅ All LoRA/adapter models
  • ✅ Both Python (PyTorch/Diffusers) and C++ (ggml/acestep.cpp) runtimes
  • ✅ Encoder weights are identical, so no upstream models need retraining (verifiable with the check below)
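
You can verify the drop-in claim directly by diffing encoder tensors against the original checkpoint; a sketch, with placeholder paths:

import torch
from safetensors.torch import load_file

orig = load_file("original-vae/diffusion_pytorch_model.safetensors")
scrag = load_file("scragvae/diffusion_pytorch_model.safetensors")

# Every encoder.* tensor should match bit-for-bit; only decoder.* differ
for key in orig:
    if key.startswith("encoder."):
        assert torch.equal(orig[key], scrag[key]), f"encoder mismatch: {key}"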

Training

Strategy

Freeze encoder → train decoder only. The DiT operates in latent space; by only improving the decoder, all existing DiT checkpoints remain compatible without retraining.
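
In PyTorch terms the strategy reduces to freezing the encoder and optimizing only the decoder parameters; a sketch, where the AdamW choice is an assumption and the learning rate matches the table below:

import torch

# vae: the AutoencoderOobleck loaded from the original checkpoint
for p in vae.encoder.parameters():
    p.requires_grad_(False)  # encoder stays bit-identical to the original

optimizer = torch.optim.AdamW(vae.decoder.parameters(), lr=3e-5)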

Two-phase training

| Parameter | Phase 1 (Warm-up) | Phase 2 (Quality) |
|---|---|---|
| Steps | ~3,000 | ~98,000 |
| Learning rate | 3e-5 | 3e-5 |
| Adversarial weight | 0.5 | 1.5 |
| Feature matching | 5.0 | 3.0 |
| Perceptual weighting | On | Off |
| L1 time domain | 0.0 | 0.05 |
| Discriminator FFT sizes | 6 | 6 (+4096) |
| Spectral loss FFT sizes | – | 9 (32–8192) |
| Multi-res mel loss | – | 4 scales |
| Precision | bf16-mixed | bf16-mixed |
| Effective batch | 16 (8×2 accum) | 16 (8×2 accum) |
| Gradient clip | 1.0 | 1.0 |

Key changes vs original training

  • Disabled perceptual weighting in the spectral loss: the original's perceptual curve de-emphasizes high frequencies, actively suppressing HF reconstruction
  • Increased adversarial weight (0.5 → 1.5): forces the decoder to produce more realistic spectral detail
  • Reduced feature matching (5.0 → 3.0): less over-smoothing from discriminator feature constraints
  • Added L1 time-domain loss (0.05): preserves transient attacks and waveform fidelity
  • Added a 4096-point FFT to the discriminator: gives it finer frequency resolution for harmonic content in the 2–8kHz range
  • Added multi-resolution mel-spectrogram loss at 4 scales: captures perceptually relevant frequency content (the L1 and mel terms are sketched below)
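
A minimal sketch of the Phase 2 reconstruction terms (the L1 time-domain and multi-resolution mel losses); the FFT sizes, mel bins, and per-scale weighting are assumptions, and the adversarial, feature-matching, and multi-resolution STFT terms are omitted:

import torch
import torch.nn.functional as F
import torchaudio

SR = 48000
mel_transforms = [
    torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=n_fft, hop_length=n_fft // 4, n_mels=128
    )
    for n_fft in (512, 1024, 2048, 4096)  # 4 scales (assumed sizes)
]

def reconstruction_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L1 in the time domain (weight 0.05) preserves transient attacks
    loss = 0.05 * F.l1_loss(pred, target)
    # Multi-resolution mel loss: compare log-mel magnitudes at each scale
    for mel in mel_transforms:
        loss = loss + F.l1_loss(
            torch.log(mel(pred) + 1e-5), torch.log(mel(target) + 1e-5)
        )
    return loss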

Hardware

  • GPU: NVIDIA RTX 5090 (32GB)
  • Training time: ~8 hours total (Phase 1 + Phase 2)
  • Framework: PyTorch + stable-audio-tools

License

MIT License, same as ACE-Step 1.5.

Citation

If you use ScragVAE in your work:

@misc{scragvae2026,
  title={ScragVAE: Improved VAE Decoder for ACE-Step 1.5},
  author={Scragnog},
  year={2026},
  url={https://huggingface.co/scragnog/Ace-Step-1.5-ScragVAE}
}
