Vocal Separation Core (Mel-Band RoFormer): ONNX / fp16 / WebGPU
ONNX export of a Mel-Band RoFormer vocal source-separation core, packaged for
the musetric packages/ai runtime
(onnxruntime-web on WebGPU).
The graph is the neural network core: it takes a precomputed STFT
representation and returns per-bin complex masks. The mel-band gather/average
tables are baked into the ONNX graph, so the host does not need sidecar
/tables/* assets. STFT, iSTFT, chunking and complex packing run host-side
(browser WGSL + FFT). This is not a drop-in PyTorch checkpoint.
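The host-side complex packing can be sketched as below. This is a minimal sketch, not the musetric implementation: `packStft` and the `ComplexSpectrogram` input layout are hypothetical, and the row order (row = bin * channels + channel, so 1025 bins x 2 stereo channels = 2050 rows) is an assumption that should be checked against the host code before use.

```typescript
// Hypothetical helper: pack per-channel complex STFT data into one flat
// Float32Array laid out row-major as [1, bins * channels, frames, 2].
// ASSUMPTION: channels are interleaved inside each frequency bin,
// i.e. row = bin * channels + channel.
const BINS = 1025;     // n_fft / 2 + 1 for n_fft = 2048
const CHANNELS = 2;    // stereo
const FRAMES = 1101;   // static time window T

interface ComplexSpectrogram {
  re: Float32Array;    // length bins * frames, index = bin * frames + frame
  im: Float32Array;    // same layout as re
}

function packStft(
  channels: ComplexSpectrogram[],
  bins: number = BINS,
  frames: number = FRAMES,
): Float32Array {
  const channelCount = channels.length;
  const out = new Float32Array(bins * channelCount * frames * 2);
  channels.forEach((spec, ch) => {
    for (let bin = 0; bin < bins; bin++) {
      const row = bin * channelCount + ch;
      for (let t = 0; t < frames; t++) {
        const src = bin * frames + t;
        const dst = (row * frames + t) * 2;
        out[dst] = spec.re[src];       // real part
        out[dst + 1] = spec.im[src];   // imaginary part
      }
    }
  });
  return out;
}
```

Under these assumptions, the result of `packStft([left, right])` can be passed directly as the data of the `[1, 2050, 1101, 2]` input tensor.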
Intended uses & limitations
Intended:
- Vocals / instrumental separation as the first stage of an audio pipeline.
- Client/edge inference via WebGPU through
onnxruntime-web.
Out of scope:
- Standalone use without a host that computes the STFT input, applies the
per-bin masks, and runs iSTFT (see musetric packages/ai).
- Use in other training frameworks: this is an inference-only export.
Limitations:
- Static time window T = 1101 (~11 s): the model's full reference context.
- The graph uses com.microsoft fused ops (FastGelu, MultiHeadAttention, RMSNormalization) and is tuned for the WebGPU execution provider.
- Training-data provenance of the upstream weights is undocumented.
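The relationship between the static window and chunk length can be worked out with a little arithmetic. The sketch below assumes a centered STFT where frames = floor(samples / hop) + 1 and uses the hop and sample rate stated in this card; `chunkCount` is a hypothetical helper that assumes non-overlapping chunks with a zero-padded tail, which may differ from the host's actual chunking.

```typescript
const SAMPLE_RATE = 44100; // Hz
const HOP = 441;           // STFT hop in samples
const T = 1101;            // static time window in frames

// ASSUMPTION: centered STFT, so frames = floor(samples / hop) + 1.
function framesForSamples(samples: number, hop: number = HOP): number {
  return Math.floor(samples / hop) + 1;
}

// Inverse of the above for a chunk that fills the window exactly.
function samplesForFrames(frames: number, hop: number = HOP): number {
  return (frames - 1) * hop;
}

const chunkSamples = samplesForFrames(T);        // 485100 samples
const chunkSeconds = chunkSamples / SAMPLE_RATE; // 11 seconds

// Hypothetical: number of non-overlapping chunks needed for a track,
// with the last chunk zero-padded up to chunkSamples.
function chunkCount(totalSamples: number, chunk: number = chunkSamples): number {
  return Math.max(1, Math.ceil(totalSamples / chunk));
}
```

Because T is static, shorter inputs need padding up to `chunkSamples` and longer ones need to be split before inference.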
How to use
The session runs the core; the host supplies stft_repr and consumes masks.
import * as ort from 'onnxruntime-web/webgpu';
// .onnx and .onnx.data must sit in the same directory; .data loads automatically.
const session = await ort.InferenceSession.create(
'syhft_core_folded_fp16_webgpu.onnx',
{ executionProviders: ['webgpu'] },
);
// stftRepr: Float32Array of shape [1, 2050, 1101, 2], produced host-side from one
// ~11 s audio chunk (n_fft=2048, hop=441, 44.1 kHz, stereo).
const input = new ort.Tensor('float32', stftRepr, [1, 2050, 1101, 2]);
const { masks } = await session.run({ stft_repr: input });
// masks: float32 [1, 2050, 1101, 2] -> apply to STFT, then iSTFT host-side.
See the musetric packages/ai host code for the full STFT/iSTFT and
chunk-recombination pipeline.
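Applying the masks is an elementwise complex multiplication of the STFT by the mask. The following is a minimal sketch under the assumption that both buffers share the same `[1, 2050, 1101, 2]` layout with interleaved (real, imaginary) pairs in the last dimension; `applyComplexMasks` is a hypothetical name, not part of the host code.

```typescript
// Hypothetical helper: multiply each complex STFT bin by its complex mask.
// ASSUMPTION: both arrays store interleaved (re, im) pairs in the same order.
function applyComplexMasks(stft: Float32Array, masks: Float32Array): Float32Array {
  if (stft.length !== masks.length || stft.length % 2 !== 0) {
    throw new Error("stft and masks must have equal, even lengths");
  }
  const out = new Float32Array(stft.length);
  for (let i = 0; i < stft.length; i += 2) {
    const sr = stft[i];
    const si = stft[i + 1];
    const mr = masks[i];
    const mi = masks[i + 1];
    out[i] = sr * mr - si * mi;      // real part of (sr + i*si)(mr + i*mi)
    out[i + 1] = sr * mi + si * mr;  // imaginary part
  }
  return out;
}
```

With the session output from the snippet above, this would be called as `applyComplexMasks(stftRepr, masks.data as Float32Array)`, and the result unpacked and passed to the host iSTFT.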
Variant & files
- Precision: fp16 weights, fp32 graph I/O, fp32 RMSNorm islands.
- Mel-band gather/average tables embedded into the graph output tail.
- WebGPU hardening: FastGelu, MultiHeadAttention, RMSNormalization, and wide
Concat/Split rewritten into <=15-wide trees. Compat/perf only; values preserved apart from fp16 conversion.
| File | Size | SHA256 |
|---|---|---|
| syhft_core_folded_fp16_webgpu.onnx | 5,308,300 B | dde2bfe8f85d2c12efa24ce4d45cc13e8709b8a72e277a93f130d496d948e918 |
| syhft_core_folded_fp16_webgpu.onnx.data | 741,190,540 B | b08cfc80905e3560a4dd5d30f641299a47dd96d309ebbe9524d9d6c9d2a0356f |
Signature (opset ai.onnx 23 + com.microsoft 1, IR 10):
| Tensor | Type | Shape | Meaning |
|---|---|---|---|
| stft_repr (in) | float32 | [1, 2050, 1101, 2] | batch, freq*2, time, complex |
| masks (out) | float32 | [1, 2050, 1101, 2] | per-bin complex masks, already gathered/averaged from mel bands |
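Since the shapes are static, a host can cheaply check a buffer against the signature before calling the session. The sketch below is illustrative only; `elementCount` and `assertMatchesShape` are hypothetical helpers, and the byte figure covers one fp32 tensor (the input or the output), not total GPU memory use.

```typescript
// Static I/O shape from the signature table above.
const STFT_SHAPE = [1, 2050, 1101, 2] as const;

// Product of all dimensions.
function elementCount(shape: readonly number[]): number {
  return shape.reduce((n, d) => n * d, 1);
}

// Hypothetical guard: throw early if a buffer does not match the shape.
function assertMatchesShape(
  data: Float32Array,
  shape: readonly number[] = STFT_SHAPE,
): void {
  const expected = elementCount(shape);
  if (data.length !== expected) {
    throw new Error(`expected ${expected} floats, got ${data.length}`);
  }
}

const tensorFloats = elementCount(STFT_SHAPE);                       // 4514100
const tensorBytes = tensorFloats * Float32Array.BYTES_PER_ELEMENT;   // 18056400
```

Each fp32 I/O tensor is therefore about 18 MB per chunk.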
Validation
This fp16/WebGPU export vs the PyTorch first-stage reference at the same T = 1101 (isolates conversion + execution-provider error):
| Metric | Value |
|---|---|
| SNR (vocals) | ~46-49 dB |
| correlation | ~0.999 |
| NaN / silent gaps | 0 |
T = 1101 is the model's full reference context, so there is no context-window penalty; the gap is conversion + EP error only, which is numerically small. Re-run the parity gate on the exact published bytes before relying on it.
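A parity check of this kind can be sketched as follows. The exact metric definitions behind the table are not documented here, so this assumes the common ones: SNR as 10 * log10(reference energy / error energy) and Pearson correlation over time-domain samples. All function names are hypothetical.

```typescript
// SNR in dB of an estimate against a reference signal.
// ASSUMPTION: SNR = 10 * log10(sum(ref^2) / sum((ref - est)^2)).
function snrDb(reference: Float32Array, estimate: Float32Array): number {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const r = reference[i];
    const d = r - estimate[i];
    signal += r * r;
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise); // Infinity when the signals are identical
}

// Pearson correlation coefficient between two equal-length signals.
function correlation(a: Float32Array, b: Float32Array): number {
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}

// Count NaN or infinite samples in an output buffer.
function countNonFinite(x: Float32Array): number {
  let count = 0;
  for (let i = 0; i < x.length; i++) {
    if (!Number.isFinite(x[i])) count++;
  }
  return count;
}
```

A gate built on these would compare the WebGPU output against the PyTorch reference for the same chunk and fail if SNR or correlation falls below a chosen threshold or if any non-finite sample appears.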
Source & lineage
Code license and weight license are separate; ONNX conversion does not change the weight license. Documented only as far as it is verifiable.
- Architecture: Mel-Band RoFormer (arXiv:2310.01809).
- Reference implementation: lucidrains/BS-RoFormer.
- Training framework / config: ZFTurbo Music-Source-Separation-Training.
- Direct weight source: SYH99999/MelBandRoformerBigSYHFTV1Fast@96f4ae8e3f690e51ef26b3bef84531c944f5341b, MIT.
The base checkpoint the upstream fine-tuned from is not documented upstream; we do not assert a chain we cannot verify. This export preserves the upstream MIT license; we do not claim authorship of the original weights.
License & citation
MIT, inherited from the upstream weights.
@article{wang2023melbandroformer,
title={Mel-Band RoFormer for Music Source Separation},
author={Wang, Ju-Chiang and Lu, Wei-Tsung and Won, Minz},
journal={arXiv preprint arXiv:2310.01809},
year={2023}
}