ACE-Step v1.5 – ONNX

ONNX export of ACE-Step/Ace-Step1.5, a text-to-music generation model using flow matching with a Diffusion Transformer (DiT).

Exported for WebGPU inference via ONNX Runtime Web.

Model Components

ACE-Step v1.5 consists of several components that work together:

| Component | Description | FP32 | INT4 | FP16 |
|---|---|---|---|---|
| DiT decoder | Main diffusion transformer (24 layers, 2048 hidden, 8-step turbo) | 6.3 GB | 2.1 GB | – |
| LM (1.7B) | Causal language model for lyric-conditioned generation | 7.4 GB | 5.1 GB | – |
| Text encoder (0.6B) | Qwen3-Embedding for text conditioning | 2.4 GB | 1.7 GB | – |
| Lyric encoder | 8-layer transformer for lyric embeddings | 1.6 GB | 216 MB | – |
| Timbre encoder | 4-layer transformer for reference audio timbre | 806 MB | 108 MB | – |
| VAE decoder | AutoencoderOobleck (latent → stereo 48 kHz waveform) | 337 MB | – | 169 MB |
| Text projector | Linear projection (1024 → 2048) | 8 MB | – | 4 MB |
| Embed tokens | Embedding table lookup for lyrics | 621 MB | – | 311 MB |

Directory Structure

onnx/          # FP32 ONNX models (full precision, for validation)
onnx_q4/       # INT4 weight-only quantized (for WebGPU deployment)
onnx_fp16/     # FP16 models (for conv-heavy / small components)

Usage for WebGPU

For text-to-music generation without the LM, the minimum model set is:

  • onnx_q4/dit_decoder_q4.onnx (2.1 GB)
  • onnx_q4/text_encoder_q4.onnx (1.7 GB)
  • onnx_fp16/text_embed_tokens_fp16.onnx (311 MB)
  • onnx_q4/lyric_encoder_q4.onnx (216 MB)
  • onnx_fp16/vae_decoder_fp16.onnx (169 MB)
  • onnx_q4/timbre_encoder_q4.onnx (108 MB)
  • onnx_fp16/text_projector_fp16.onnx (4 MB)

Total: ~4.6 GB, which fits in 8 GB of VRAM on desktop GPUs.
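A minimal loading sketch with ONNX Runtime Web, assuming a build that ships the WebGPU bundle (e.g. the `onnxruntime-web/webgpu` entry point) and that the files above are served at the paths shown:

```js
import * as ort from 'onnxruntime-web/webgpu';

const opts = { executionProviders: ['webgpu'] };

// One InferenceSession per component; any external-data files must be
// served next to the corresponding .onnx files.
const [dit, textEncoder, lyricEncoder, timbreEncoder, vae] = await Promise.all([
  ort.InferenceSession.create('onnx_q4/dit_decoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/text_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/lyric_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/timbre_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_fp16/vae_decoder_fp16.onnx', opts),
]);
```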

Inference Pipeline

  1. Text encoding: Tokenize caption → text_encoder → text_projector
  2. Lyric encoding: Tokenize lyrics → embed_tokens → lyric_encoder
  3. Timbre encoding: Reference audio latents → timbre_encoder
  4. Condition packing: Concatenate and pack text + lyric + timbre embeddings (JS logic; see the sketch below)
  5. Denoising loop (8 steps): DiT decoder with Euler ODE scheduler
  6. VAE decode: Latents → stereo 48 kHz waveform
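A minimal sketch of the conditioning stages (steps 1–4), assuming sessions created as in the loading example above. All input/output tensor names here are illustrative assumptions; check each exported graph's actual I/O names:

```js
// Sketch of steps 1-4; `s` holds the InferenceSessions created earlier.
async function buildConditioning(s, captionIds, lyricIds, refLatents) {
  // Step 1: caption ids -> Qwen3-Embedding -> 1024-to-2048 projection.
  const enc = await s.textEncoder.run({ input_ids: captionIds });
  const text = await s.textProjector.run({ hidden_states: enc.last_hidden_state });

  // Step 2: lyric ids -> embedding table -> 8-layer lyric encoder.
  const emb = await s.embedTokens.run({ input_ids: lyricIds });
  const lyric = await s.lyricEncoder.run({ inputs_embeds: emb.embeddings });

  // Step 3: reference-audio latents -> timbre embedding.
  const timbre = await s.timbreEncoder.run({ latents: refLatents });

  // Step 4: pack along the sequence axis; concatSeq is a hypothetical
  // helper that concatenates [batch, seq, dim] tensors in plain JS.
  return concatSeq([text.hidden, lyric.hidden, timbre.hidden]);
}
```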

The flow-matching scheduler runs in JavaScript; only the DiT forward pass is in ONNX.
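The scheduler itself is small. A minimal Euler flow-matching sketch, assuming a linear sigma schedule from 1 to 0 (the shipped 8-step turbo schedule may use shifted sigmas) and a DiT output named `velocity` (also an assumption):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Integrates d(latents)/d(sigma) = v from sigma = 1 (noise) to sigma = 0.
async function denoise(dit, cond, latents, dims, steps = 8) {
  // Linear sigmas are an assumption; the turbo schedule may differ.
  const sigmas = Array.from({ length: steps + 1 }, (_, i) => 1 - i / steps);

  for (let i = 0; i < steps; i++) {
    const out = await dit.run({
      hidden_states: new ort.Tensor('float32', latents, dims),
      timestep: new ort.Tensor('float32', Float32Array.from([sigmas[i]]), [1]),
      encoder_hidden_states: cond,
    });
    const v = out.velocity.data;          // output name is an assumption
    const dt = sigmas[i + 1] - sigmas[i]; // negative: steps toward sigma = 0
    for (let j = 0; j < latents.length; j++) latents[j] += dt * v[j];
  }
  return latents; // feed to the VAE decoder (step 6)
}
```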

Technical Details

  • Latent space: 64 channels, 25 Hz frame rate (1920× upsampling to 48 kHz)
  • Denoising: 8-step turbo schedule with flow matching (Euler ODE)
  • Attention: Alternating full + sliding-window (128) bidirectional attention with GQA (16 query / 8 KV heads)
  • Quantization: INT4 weight-only (MatMulNBits, block_size=128, symmetric); see the sketch below
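MatMulNBits stores two 4-bit values per byte with one scale per 128-weight block; symmetric quantization implies an implicit zero point of 8. A conceptual dequantization sketch (the nibble order shown is an assumption, not verified against the ORT kernel):

```js
// Dequantize one 128-weight block: 64 packed bytes plus one FP scale.
function dequantBlock(packed /* Uint8Array(64) */, scale) {
  const out = new Float32Array(128);
  for (let i = 0; i < 64; i++) {
    const b = packed[i];
    out[2 * i]     = ((b & 0x0f) - 8) * scale; // low nibble
    out[2 * i + 1] = ((b >>> 4) - 8) * scale;  // high nibble
  }
  return out;
}
```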

Export Verification

All exports were verified against the PyTorch reference, with the following maximum absolute differences:

| Component | Max Abs Diff |
|---|---|
| VAE decoder | 9.2e-6 |
| Text encoder | 2.3e-4 |
| Embed tokens | 0.0 (exact) |
| DiT decoder | 2.2e-5 |
| LM | 3.2e-3 |
| Lyric encoder | 2.4e-5 |
| Timbre encoder | 1.7e-5 |
| Text projector | 3.6e-6 |
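The check amounts to an elementwise comparison of flattened outputs. A trivial helper of the kind that produces such numbers (a sketch, not the actual export script):

```js
// Max elementwise absolute difference between two flattened outputs.
function maxAbsDiff(a, b) {
  let m = 0;
  for (let i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
  return m;
}
```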

Attribution

This is an ONNX conversion of ACE-Step v1.5 by the ACE-Step team.
