Paper: ACE-Step: A Step Towards Music Generation Foundation Model (2506.00045)
ONNX export of ACE-Step/Ace-Step1.5, a text-to-music generation model using flow matching with a Diffusion Transformer (DiT).
Exported for WebGPU inference via ONNX Runtime Web.
ACE-Step v1.5 consists of several components that work together:
| Component | Description | FP32 | INT4 | FP16 |
|---|---|---|---|---|
| DiT decoder | Main diffusion transformer (24 layers, 2048 hidden, 8-step turbo) | 6.3 GB | 2.1 GB | - |
| LM (1.7B) | Causal language model for lyric-conditioned generation | 7.4 GB | 5.1 GB | - |
| Text encoder (0.6B) | Qwen3-Embedding for text conditioning | 2.4 GB | 1.7 GB | - |
| Lyric encoder | 8-layer transformer for lyric embeddings | 1.6 GB | 216 MB | - |
| Timbre encoder | 4-layer transformer for reference audio timbre | 806 MB | 108 MB | - |
| VAE decoder | AutoencoderOobleck (latent → stereo 48 kHz waveform) | 337 MB | - | 169 MB |
| Text projector | Linear projection (1024 → 2048) | 8 MB | - | 4 MB |
| Embed tokens | Embedding table lookup for lyrics | 621 MB | - | 311 MB |
onnx/ # FP32 ONNX models (full precision, for validation)
onnx_q4/ # INT4 weight-only quantized (for WebGPU deployment)
onnx_fp16/ # FP16 models (for conv-heavy / small components)
For text-to-music generation without the LM, the minimum model set is:
- onnx_q4/dit_decoder_q4.onnx (2.1 GB)
- onnx_q4/text_encoder_q4.onnx (1.7 GB)
- onnx_fp16/text_embed_tokens_fp16.onnx (311 MB)
- onnx_q4/lyric_encoder_q4.onnx (216 MB)
- onnx_fp16/vae_decoder_fp16.onnx (169 MB)
- onnx_q4/timbre_encoder_q4.onnx (108 MB)
- onnx_fp16/text_projector_fp16.onnx (4 MB)

Total: ~4.6 GB, which fits in 8 GB of VRAM on desktop GPUs.
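The total can be checked with a few lines of TypeScript. This is only a sketch of the arithmetic: the `ModelFile` type and the helper names are illustrative, 1 GB is treated as 1000 MB, and the figure covers weights only, not activations or other runtime buffers.

```typescript
// Sizes in MB, copied from the list above (1 GB treated as 1000 MB).
interface ModelFile {
  path: string;
  sizeMB: number;
}

const minimumSet: ModelFile[] = [
  { path: "onnx_q4/dit_decoder_q4.onnx", sizeMB: 2100 },
  { path: "onnx_q4/text_encoder_q4.onnx", sizeMB: 1700 },
  { path: "onnx_fp16/text_embed_tokens_fp16.onnx", sizeMB: 311 },
  { path: "onnx_q4/lyric_encoder_q4.onnx", sizeMB: 216 },
  { path: "onnx_fp16/vae_decoder_fp16.onnx", sizeMB: 169 },
  { path: "onnx_q4/timbre_encoder_q4.onnx", sizeMB: 108 },
  { path: "onnx_fp16/text_projector_fp16.onnx", sizeMB: 4 },
];

// Sum the weight sizes of all files in the set.
function totalSizeMB(files: ModelFile[]): number {
  return files.reduce((sum, f) => sum + f.sizeMB, 0);
}

// True if the weights alone fit within the given memory budget.
function fitsInBudget(files: ModelFile[], budgetMB: number): boolean {
  return totalSizeMB(files) <= budgetMB;
}

const totalMB = totalSizeMB(minimumSet);
console.log(`Total weights: ${(totalMB / 1000).toFixed(1)} GB`);
console.log(`Fits in 8 GB: ${fitsInBudget(minimumSet, 8000)}`);
```

The same helper can be reused to estimate the footprint when the 5.1 GB INT4 LM is added to the set.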
The flow-matching scheduler runs in JavaScript; only the DiT forward pass is in ONNX.
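To illustrate this split, here is a minimal sketch of an Euler flow-matching loop that treats the DiT forward pass as an opaque async callback. Several things are assumptions rather than details of this export: the `VelocityFn` signature is hypothetical, the convention is t = 1 for noise and t = 0 for data, and the timesteps are uniformly spaced. The schedule the turbo model actually uses may differ.

```typescript
// Hypothetical stand-in for the DiT ONNX forward pass: given the current
// latent x and timestep t, return the predicted velocity.
type VelocityFn = (x: Float32Array, t: number) => Promise<Float32Array>;

// One explicit Euler update: x_next = x + dt * v.
function eulerStep(x: Float32Array, v: Float32Array, dt: number): Float32Array {
  const next = new Float32Array(x.length);
  for (let j = 0; j < x.length; j++) next[j] = x[j] + dt * v[j];
  return next;
}

// Integrate from t = 1 (noise) to t = 0 (data) with uniform steps (assumed).
async function eulerSample(
  noise: Float32Array,
  predictVelocity: VelocityFn,
  numSteps = 8,
): Promise<Float32Array> {
  let x: Float32Array = noise;
  const dt = -1 / numSteps; // negative: time runs from 1 down to 0
  for (let i = 0; i < numSteps; i++) {
    const t = 1 - i / numSteps;
    const v = await predictVelocity(x, t);
    x = eulerStep(x, v, dt);
  }
  return x;
}

// Toy check: with the constant velocity (noise - target), the straight-line
// path from noise to target is integrated exactly.
const noise = new Float32Array([1, -2, 0.5]);
const target = new Float32Array([0, 0.25, -1]);
const toyVelocity: VelocityFn = async () => noise.map((n, j) => n - target[j]);
eulerSample(noise, toyVelocity).then((x) => console.log(Array.from(x)));
```

In a real pipeline the callback would wrap a call to the DiT ONNX session and pass the conditioning tensors alongside the latent.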
All exports were verified against the PyTorch reference. Maximum absolute differences per component:
| Component | Max Diff |
|---|---|
| VAE decoder | 9.2e-6 |
| Text encoder | 2.3e-4 |
| Embed tokens | 0.0 (exact) |
| DiT decoder | 2.2e-5 |
| LM | 3.2e-3 |
| Lyric encoder | 2.4e-5 |
| Timbre encoder | 1.7e-5 |
| Text projector | 3.6e-6 |
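The metric in the table is the largest element-wise absolute difference between the ONNX output and the PyTorch output for the same input. A minimal sketch of that comparison follows, assuming both outputs are available as flat `Float32Array` buffers. The function name is illustrative, and the test inputs used for the figures above are not specified here.

```typescript
// Largest element-wise absolute difference between two equally sized buffers.
// Returns NaN if any compared element is NaN, so bad outputs are not hidden.
function maxAbsDiff(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`length mismatch: ${a.length} vs ${b.length}`);
  }
  let maxDiff = 0;
  for (let i = 0; i < a.length; i++) {
    const d = Math.abs(a[i] - b[i]);
    if (Number.isNaN(d)) return NaN;
    if (d > maxDiff) maxDiff = d;
  }
  return maxDiff;
}

// Example with made-up values (not real model outputs).
const reference = new Float32Array([0.5, -1, 2]);
const exported = new Float32Array([0.5, -1.25, 2.5]);
console.log(maxAbsDiff(reference, exported)); // 0.5
```

A value of exactly 0.0, as reported for the embedding lookup, means every compared element is identical.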
This is an ONNX conversion of ACE-Step v1.5 by the ACE-Step team.
Base model: ACE-Step/Ace-Step1.5