Vocal Separation Core (Mel-Band RoFormer): ONNX / fp16 / WebGPU

ONNX export of a Mel-Band RoFormer vocal source-separation core, packaged for the musetric packages/ai runtime (onnxruntime-web on WebGPU).

The graph is the neural network core: it takes a precomputed STFT representation and returns per-bin complex masks. The mel-band gather/average tables are baked into the ONNX graph, so the host does not need sidecar /tables/* assets. STFT, iSTFT, chunking and complex packing run host-side (browser WGSL + FFT). This is not a drop-in PyTorch checkpoint.
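As an illustration of the host-side chunking step, the sketch below splits one channel of PCM into fixed-length chunks. It is not the musetric implementation: the helper name `splitIntoChunks`, the non-overlapping split, and the zero-padded final chunk are all assumptions made for this example. A production host may well overlap chunks and cross-fade them on recombination.

```typescript
// Hypothetical helper (not part of musetric packages/ai).
// Assumptions: non-overlapping chunks, final chunk zero-padded to full length.
function splitIntoChunks(samples: Float32Array, chunkSamples: number): Float32Array[] {
  if (!Number.isInteger(chunkSamples) || chunkSamples <= 0) {
    throw new RangeError('chunkSamples must be a positive integer');
  }
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSamples) {
    // A new Float32Array is zero-filled, so a short tail is padded implicitly.
    const chunk = new Float32Array(chunkSamples);
    // subarray clamps its end index to samples.length.
    chunk.set(samples.subarray(start, start + chunkSamples));
    chunks.push(chunk);
  }
  return chunks;
}
```

Each chunk would then go through the host STFT, the ONNX core, mask application and iSTFT before the chunks are joined back together.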

Intended uses & limitations

Intended:

  • Vocals / instrumental separation as the first stage of an audio pipeline.
  • Client/edge inference via WebGPU through onnxruntime-web.

Out of scope:

  • Standalone use without a host that computes the STFT input, applies the per-bin masks, and runs iSTFT (see musetric packages/ai).
  • Use in other training frameworks; this is an inference-only export.

Limitations:

  • Static time window T = 1101 frames (~11 s), the model's full reference context.
  • The graph uses com.microsoft fused ops (FastGelu, MultiHeadAttention, RMSNormalization) and is tuned for the WebGPU execution provider.
  • Training-data provenance of the upstream weights is undocumented.
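The fixed shape and the "~11 s" figure can be checked from the STFT parameters this card gives in its usage snippet (n_fft=2048, hop=441, 44.1 kHz, stereo). The arithmetic below is a sketch under two assumptions the card does not state explicitly: that the STFT is centered, so the frame count is 1 + samples / hop, and that the 2050-wide axis is 1025 frequency bins times 2 channels.

```typescript
// Parameters taken from the usage snippet in this card.
const SAMPLE_RATE = 44100;
const N_FFT = 2048;
const HOP = 441;
const FRAMES = 1101; // static time window T
const CHANNELS = 2;  // stereo

// One-sided spectrum: n_fft / 2 + 1 bins per channel.
const freqBins = N_FFT / 2 + 1;                  // 1025
// Assumed interpretation of "freq*2": bins times channels.
const freqAxis = freqBins * CHANNELS;            // 2050
// Assumed centered STFT: frames = 1 + samples / hop.
const chunkSamples = (FRAMES - 1) * HOP;         // 485100
const chunkSeconds = chunkSamples / SAMPLE_RATE; // 11
// Size of one [1, 2050, 1101, 2] float32 tensor.
const elementCount = 1 * freqAxis * FRAMES * 2;  // 4514100
const bytesFp32 = elementCount * 4;              // 18056400 (~17.2 MiB)
```

Because the window is static, shorter audio has to be padded up to a full chunk and longer audio has to be split before it reaches the graph.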

How to use

The session runs the core; the host supplies stft_repr and consumes masks.

import * as ort from 'onnxruntime-web/webgpu';

// .onnx and .onnx.data must sit in the same directory; .data loads automatically.
const session = await ort.InferenceSession.create(
  'syhft_core_folded_fp16_webgpu.onnx',
  { executionProviders: ['webgpu'] },
);

// stftRepr: Float32Array of shape [1, 2050, 1101, 2], produced host-side from one
// ~11 s audio chunk (n_fft=2048, hop=441, 44.1 kHz, stereo).
const input = new ort.Tensor('float32', stftRepr, [1, 2050, 1101, 2]);
const { masks } = await session.run({ stft_repr: input });
// masks: float32 [1, 2050, 1101, 2] -> apply to STFT, then iSTFT host-side.

See the musetric packages/ai host code for the full STFT/iSTFT and chunk-recombination pipeline.
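For orientation, applying a per-bin complex mask amounts to an element-wise complex multiplication of the STFT by the mask. The sketch below assumes the trailing axis of both tensors holds interleaved (real, imaginary) pairs, which this card does not spell out; `applyComplexMask` is a hypothetical name, not a musetric API.

```typescript
// Hypothetical helper: element-wise complex multiply of STFT by mask.
// Assumption: both arrays are flat, row-major, with the last axis = [re, im].
function applyComplexMask(stft: Float32Array, mask: Float32Array): Float32Array {
  if (stft.length !== mask.length || stft.length % 2 !== 0) {
    throw new RangeError('stft and mask must have the same even length');
  }
  const out = new Float32Array(stft.length);
  for (let i = 0; i < stft.length; i += 2) {
    const sr = stft[i];
    const si = stft[i + 1];
    const mr = mask[i];
    const mi = mask[i + 1];
    // (sr + i*si) * (mr + i*mi)
    out[i] = sr * mr - si * mi;
    out[i + 1] = sr * mi + si * mr;
  }
  return out;
}
```

With the session output from the snippet above, a call would look like `applyComplexMask(stftRepr, masks.data as Float32Array)`, and the result would feed the host-side iSTFT.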

Variant & files

  • Precision: fp16 weights, fp32 graph I/O, fp32 RMSNorm islands.
  • Mel-band gather/average tables embedded into the graph output tail.
  • WebGPU hardening: fused FastGelu, MultiHeadAttention, and RMSNormalization ops, plus wide Concat/Split nodes rewritten into <=15-wide trees. These are compatibility and performance changes only; values are preserved apart from the fp16 conversion.

| File | Size | SHA256 |
| --- | --- | --- |
| syhft_core_folded_fp16_webgpu.onnx | 5,308,300 B | dde2bfe8f85d2c12efa24ce4d45cc13e8709b8a72e277a93f130d496d948e918 |
| syhft_core_folded_fp16_webgpu.onnx.data | 741,190,540 B | b08cfc80905e3560a4dd5d30f641299a47dd96d309ebbe9524d9d6c9d2a0356f |
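The published digests can be checked against downloaded bytes before loading the model. The sketch below uses the standard Web Crypto `crypto.subtle.digest` call; the helper names `sha256Hex` and `verifyFile` are hypothetical. It buffers the whole file in memory, which is fine for the small .onnx graph but heavy for the ~741 MB .data file, so treat it as an illustration rather than a recommended loader.

```typescript
// Hypothetical helper: hex-encoded SHA-256 of an in-memory buffer (Web Crypto).
async function sha256Hex(data: ArrayBuffer): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', data);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}

// Hypothetical helper: download a file and compare it to an expected digest.
// Note: this holds the entire response in memory.
async function verifyFile(url: string, expectedHex: string): Promise<boolean> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`download failed with HTTP ${response.status}`);
  }
  const actual = await sha256Hex(await response.arrayBuffer());
  return actual === expectedHex.toLowerCase();
}
```

A call such as `verifyFile('syhft_core_folded_fp16_webgpu.onnx', 'dde2bfe8...')` would use the full digest from the file listing above.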

Signature (opset ai.onnx 23 + com.microsoft 1, IR 10):

| Tensor | Type | Shape | Meaning |
| --- | --- | --- | --- |
| stft_repr (in) | float32 | [1, 2050, 1101, 2] | batch, freq*2, time, complex |
| masks (out) | float32 | [1, 2050, 1101, 2] | per-bin complex masks, already gathered/averaged from mel bands |

Validation

This fp16/WebGPU export vs the PyTorch first-stage reference at the same T = 1101 (isolates conversion + execution-provider error):

| Metric | Value |
| --- | --- |
| SNR (vocals) | ~46–49 dB |
| correlation | ~0.999 |
| NaN / silent gaps | 0 |

T = 1101 is the model's full reference context, so there is no context-window penalty; the gap is conversion and execution-provider (EP) error only, which is numerically small. Re-run the parity gate on the exact published bytes before relying on these figures.
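When re-running the parity gate, the two headline metrics can be computed along the following lines. This sketch uses common definitions (SNR as 10·log10 of reference energy over error energy, and Pearson correlation); the card does not state the exact formulas behind the published numbers, so the definitions and the names `snrDb` and `pearson` are assumptions.

```typescript
// Hypothetical helper: SNR in dB of an estimate against a reference signal.
// Returns Infinity when the two signals are identical.
function snrDb(reference: Float32Array, estimate: Float32Array): number {
  if (reference.length !== estimate.length) throw new RangeError('length mismatch');
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const err = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += err * err;
  }
  return 10 * Math.log10(signal / noise);
}

// Hypothetical helper: Pearson correlation coefficient of two signals.
function pearson(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length || a.length === 0) throw new RangeError('length mismatch');
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < a.length; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= a.length;
  meanB /= b.length;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < a.length; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Here `reference` would be the PyTorch first-stage vocals and `estimate` the vocals reconstructed from this export on the same chunk.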

Source & lineage

Code license and weight license are separate; ONNX conversion does not change the weight license. Lineage is documented only as far as it can be verified.

Upstream does not document the base checkpoint it fine-tuned from, so we do not assert a chain we cannot verify. This export preserves the upstream MIT license; we do not claim authorship of the original weights.

License & citation

MIT, inherited from the upstream weights.

@article{wang2023melbandroformer,
  title={Mel-Band RoFormer for Music Source Separation},
  author={Wang, Ju-Chiang and Lu, Wei-Tsung and Won, Minz},
  journal={arXiv preprint arXiv:2310.01809},
  year={2023}
}