ACE-Step v1.5 – ONNX

ONNX export of ACE-Step/Ace-Step1.5, a text-to-music generation model using flow matching with a Diffusion Transformer (DiT).

Exported for WebGPU inference via ONNX Runtime Web.

Model Components

ACE-Step v1.5 consists of several components that work together:

| Component | Description | FP32 | INT4 | FP16 |
|---|---|---|---|---|
| DiT decoder | Main diffusion transformer (24 layers, 2048 hidden, 8-step turbo) | 6.3 GB | 2.1 GB | – |
| LM (1.7B) | Causal language model for lyric-conditioned generation | 7.4 GB | 5.1 GB | – |
| Text encoder (0.6B) | Qwen3-Embedding for text conditioning | 2.4 GB | 1.7 GB | – |
| Lyric encoder | 8-layer transformer for lyric embeddings | 1.6 GB | 216 MB | – |
| Timbre encoder | 4-layer transformer for reference audio timbre | 806 MB | 108 MB | – |
| VAE decoder | AutoencoderOobleck (latent → stereo 48 kHz waveform) | 337 MB | – | 169 MB |
| Text projector | Linear projection (1024 → 2048) | 8 MB | – | 4 MB |
| Embed tokens | Embedding table lookup for lyrics | 621 MB | – | 311 MB |

Directory Structure

onnx/          # FP32 ONNX models (full precision, for validation)
onnx_q4/       # INT4 weight-only quantized (for WebGPU deployment)
onnx_fp16/     # FP16 models (for conv-heavy / small components)

Usage for WebGPU

For text-to-music generation without the LM, the minimum model set is:

  • onnx_q4/dit_decoder_q4.onnx (2.1 GB)
  • onnx_q4/text_encoder_q4.onnx (1.7 GB)
  • onnx_fp16/text_embed_tokens_fp16.onnx (311 MB)
  • onnx_q4/lyric_encoder_q4.onnx (216 MB)
  • onnx_fp16/vae_decoder_fp16.onnx (169 MB)
  • onnx_q4/timbre_encoder_q4.onnx (108 MB)
  • onnx_fp16/text_projector_fp16.onnx (4 MB)

Total: ~4.6 GB, which fits in 8 GB of VRAM on desktop GPUs.
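A minimal loading sketch with ONNX Runtime Web, assuming a build that ships the WebGPU bundle (e.g. the `onnxruntime-web/webgpu` entry point) and that the files above are served at the paths shown:

```js
import * as ort from 'onnxruntime-web/webgpu';

const opts = { executionProviders: ['webgpu'] };

// One InferenceSession per component; any external-data files must be
// served next to the corresponding .onnx files.
const [dit, textEncoder, lyricEncoder, timbreEncoder, vae] = await Promise.all([
  ort.InferenceSession.create('onnx_q4/dit_decoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/text_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/lyric_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_q4/timbre_encoder_q4.onnx', opts),
  ort.InferenceSession.create('onnx_fp16/vae_decoder_fp16.onnx', opts),
]);
```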

Inference Pipeline

  1. Text encoding: Tokenize caption → text_encoder → text_projector
  2. Lyric encoding: Tokenize lyrics → embed_tokens → lyric_encoder
  3. Timbre encoding: Reference audio latents → timbre_encoder
  4. Condition packing: Concatenate and pack text + lyric + timbre embeddings (JS logic; see the sketch below)
  5. Denoising loop (8 steps): DiT decoder with Euler ODE scheduler
  6. VAE decode: Latents → stereo 48 kHz waveform
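A minimal sketch of the conditioning stages (steps 1–4), assuming sessions created as in the loading example above. All input/output tensor names here are illustrative assumptions; check each exported graph's actual I/O names:

```js
// Sketch of steps 1-4; `s` holds the InferenceSessions created earlier.
async function buildConditioning(s, captionIds, lyricIds, refLatents) {
  // Step 1: caption ids -> Qwen3-Embedding -> 1024-to-2048 projection.
  const enc = await s.textEncoder.run({ input_ids: captionIds });
  const text = await s.textProjector.run({ hidden_states: enc.last_hidden_state });

  // Step 2: lyric ids -> embedding table -> 8-layer lyric encoder.
  const emb = await s.embedTokens.run({ input_ids: lyricIds });
  const lyric = await s.lyricEncoder.run({ inputs_embeds: emb.embeddings });

  // Step 3: reference-audio latents -> timbre embedding.
  const timbre = await s.timbreEncoder.run({ latents: refLatents });

  // Step 4: pack along the sequence axis; concatSeq is a hypothetical
  // helper that concatenates [batch, seq, dim] tensors in plain JS.
  return concatSeq([text.hidden, lyric.hidden, timbre.hidden]);
}
```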

The flow-matching scheduler runs in JavaScript; only the DiT forward pass is in ONNX.
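The scheduler itself is small. A minimal Euler flow-matching sketch, assuming a linear sigma schedule from 1 to 0 (the shipped 8-step turbo schedule may use shifted sigmas) and a DiT output named `velocity` (also an assumption):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Integrates d(latents)/d(sigma) = v from sigma = 1 (noise) to sigma = 0.
async function denoise(dit, cond, latents, dims, steps = 8) {
  // Linear sigmas are an assumption; the turbo schedule may differ.
  const sigmas = Array.from({ length: steps + 1 }, (_, i) => 1 - i / steps);

  for (let i = 0; i < steps; i++) {
    const out = await dit.run({
      hidden_states: new ort.Tensor('float32', latents, dims),
      timestep: new ort.Tensor('float32', Float32Array.from([sigmas[i]]), [1]),
      encoder_hidden_states: cond,
    });
    const v = out.velocity.data;          // output name is an assumption
    const dt = sigmas[i + 1] - sigmas[i]; // negative: steps toward sigma = 0
    for (let j = 0; j < latents.length; j++) latents[j] += dt * v[j];
  }
  return latents; // feed to the VAE decoder (step 6)
}
```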

Technical Details

  • Latent space: 64 channels, 25 Hz frame rate (1920× upsampling to 48 kHz)
  • Denoising: 8-step turbo schedule with flow matching (Euler ODE)
  • Attention: Alternating full + sliding-window (128) bidirectional attention with GQA (16 query / 8 KV heads)
  • Quantization: INT4 weight-only (MatMulNBits, block_size=128, symmetric); see the sketch below
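MatMulNBits stores two 4-bit values per byte with one scale per 128-weight block; symmetric quantization implies an implicit zero point of 8. A conceptual dequantization sketch (the nibble order shown is an assumption, not verified against the ORT kernel):

```js
// Dequantize one 128-weight block: 64 packed bytes plus one FP scale.
function dequantBlock(packed /* Uint8Array(64) */, scale) {
  const out = new Float32Array(128);
  for (let i = 0; i < 64; i++) {
    const b = packed[i];
    out[2 * i]     = ((b & 0x0f) - 8) * scale; // low nibble
    out[2 * i + 1] = ((b >>> 4) - 8) * scale;  // high nibble
  }
  return out;
}
```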

Export Verification

All exports were verified against the PyTorch reference, with the following maximum absolute differences:

| Component | Max Abs Diff |
|---|---|
| VAE decoder | 9.2e-6 |
| Text encoder | 2.3e-4 |
| Embed tokens | 0.0 (exact) |
| DiT decoder | 2.2e-5 |
| LM | 3.2e-3 |
| Lyric encoder | 2.4e-5 |
| Timbre encoder | 1.7e-5 |
| Text projector | 3.6e-6 |
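The check amounts to an elementwise comparison of flattened outputs. A trivial helper of the kind that produces such numbers (a sketch, not the actual export script):

```js
// Max elementwise absolute difference between two flattened outputs.
function maxAbsDiff(a, b) {
  let m = 0;
  for (let i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
  return m;
}
```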

Attribution

This is an ONNX conversion of ACE-Step v1.5 by the ACE-Step team.
