HT-Demucs FT to ONNX: How We Built the First Working Export for iOS, Android & Web — Plus 9 Open Hugging Face Models and a Reproducible MUSDB18-HQ Benchmark (2026)

Q: Can you export HT-Demucs FT to ONNX for use on iOS and Android in 2026?

Yes. As of May 2026, `StemSplitio/htdemucs-ft-onnx` ships what is, to our knowledge, the first working ONNX export of the full 4-stem `htdemucs_ft` bag. It runs in `onnxruntime-mobile` on iOS (CoreML EP) and Android (NNAPI EP) and matches the PyTorch original to within fp32 tolerance. Previous attempts failed because `htdemucs_ft` uses complex tensors, Python `fractions.Fraction`, `random.randrange`, and PyTorch's fused multi-head attention kernel, all of which the standard ONNX exporters refuse to handle. This release patches all four blockers and verifies parity on every stem, with max absolute differences ranging from 8 × 10⁻⁶ (vocals) to 7.4 × 10⁻⁴ (other).

TL;DR. We just open-sourced 10 stem-separation assets on Hugging Face, including the first working ONNX export of HT-Demucs FT, the #1 open-source vocal separator on MUSDB18-HQ. Every previous attempt at "demucs onnx" stalled on the same four blockers; we defeated all of them. The result runs in onnxruntime on CPU/CoreML/CUDA/DirectML with no PyTorch required at inference, is 1.31× faster than PyTorch on CPU, and is numerically equivalent to the original (max absolute difference: 0.000739 across all 4 stems, and 0.000163 or less on drums, bass, and vocals).

Below: what we released, why it matters, and the engineering writeup of how the ONNX export actually got done.

Everything we released this week

Asset	Type	What it is
stem-separation-benchmark-2026	Dataset	Reproducible SDR / ISR / SIR / SAR benchmark of every popular open-source separator (`htdemucs`, `htdemucs_ft`, `htdemucs_6s`, `mdx_extra_q`, `mdx_net_inst_hq3`) on MUSDB18-HQ. 850 rows, full eval pipeline open source.
Music Source Separation Toolkit 2026	Collection	Curated 17-item collection of the open-source stem-separation models worth using in 2026.
htdemucs-ft-pytorch	Model	PyTorch full-bag for Hugging Face Inference Endpoints. Returns all 4 stems.
htdemucs-ft-{drums,bass,other}-pytorch	Models (×3)	PyTorch stem specialists. ~160 MB each, ~2.6× faster than the full bag, identical per-stem quality.
htdemucs-ft-onnx	Model	The full 4-stem ONNX bag + numpy aggregator. ~1.26 GB total. The drop-in package if you want all 4 stems on mobile / edge / web.
htdemucs-ft-drums-onnx	Model	Drums specialist as ONNX. ~75% smaller than the full bag, ~4× faster if you only need drums.
htdemucs-ft-bass-onnx	Model	Bass specialist as ONNX.
htdemucs-ft-other-onnx	Model	"Other" / instrumental specialist as ONNX.
htdemucs-ft-vocals-onnx	Model	#1 open-source vocal SDR (9.19 dB) as ONNX. The defensible centerpiece for any iOS/Android vocal-removal app.

All MIT-licensed, all on the StemSplitio org page.

The headline: the ONNX repos are, to our knowledge, the first working ONNX exports of HT-Demucs FT on Hugging Face. Not "first attempt" — first that loads, runs, produces correct numbers, and ships with parity-verified benchmarks.

Why we did this

The benchmarking gap

If you tried to pick a stem-separation model in 2026, you found a mess. Every model repo claims its model is "state of the art." Few publish reproducible benchmarks. Even fewer test the same models against each other on the same hardware with the same metrics.

We fixed that by publishing stem-separation-benchmark-2026 — 850 rows of SDR / ISR / SIR / SAR scores across htdemucs, htdemucs_ft, htdemucs_6s, mdx_extra_q, and mdx_net_inst_hq3 on MUSDB18-HQ, with the full evaluation pipeline open source. Anyone can clone it, re-run it, and challenge our numbers.

Headline finding: htdemucs_ft is the #1 open-source vocal separator (9.19 dB median vocal SDR), and mdx_extra_q is the #1 open-source drums/bass/other separator (11.49 / 11.42 / 7.67 dB). Different models for different stems.

The ONNX gap

The bigger problem: if you wanted to use HT-Demucs FT on iOS, Android, or in a browser, you couldn't. PyTorch's mobile story is rough, the MPS and CUDA backends only help on desktops and servers, and the obvious answer, ONNX, had never been made to work.

There are at least four open GitHub issues on the demucs repo asking for ONNX exports. Multiple half-broken forks. A 2023 PR that doesn't merge. A few MLX experiments that need an M1+ Mac. Nothing that "just works."

The reason: HT-Demucs has architectural choices that look innocent in PyTorch but break ONNX exporters in non-obvious ways. We hit and fixed all four, which is the rest of this post.

How HT-Demucs FT breaks every ONNX exporter

We tried torch.onnx.export first, then torch.onnx.dynamo_export. Both failed in different places. Here's the full catalog of blockers and how each got fixed:

Blocker 1: `complex64` STFT output

HT-Demucs opens with a Short-Time Fourier Transform (spec.py::spectro):

z = torch.stft(x, n_fft=4096, hop_length=1024, window=hann,
               win_length=4096, normalized=True, center=True,
               return_complex=True, pad_mode="reflect")

That return_complex=True returns a complex64 tensor. CoreML's MIL has no complex dtype. ONNX's STFT op (opset 17+) doesn't produce complex tensors either: it packs the real and imaginary parts into a trailing axis of size 2. Every downstream slice/transpose op in the graph that expects a complex tensor immediately fails.

Fix. Replace torch.stft with a Conv1d using sin/cos kernels that emits two real channels directly:

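import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def _make_stft_kernels(n_fft: int) -> tuple[torch.Tensor, torch.Tensor]:
    n = torch.arange(n_fft, dtype=torch.float64)
    window = torch.hann_window(n_fft, periodic=True, dtype=torch.float64)
    norm = 1.0 / math.sqrt(n_fft)                 # matches normalized=True
    k = torch.arange(n_fft // 2 + 1, dtype=torch.float64).unsqueeze(1)
    angles = 2 * math.pi * k * n.unsqueeze(0) / n_fft
    cos = (window * torch.cos(angles)) * norm
    sin = (window * -torch.sin(angles)) * norm   # negative for forward STFT
    return cos.float().unsqueeze(1), sin.float().unsqueeze(1)   # (F, 1, n_fft) conv weights

class RealSTFT(nn.Module):
    def __init__(self, n_fft: int = 4096, hop_length: int = 1024):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        cos_kernel, sin_kernel = _make_stft_kernels(n_fft)
        self.register_buffer("cos_kernel", cos_kernel)   # exported as graph constants
        self.register_buffer("sin_kernel", sin_kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x.reshape(-1, 1, x.shape[-1]), (self.n_fft // 2,) * 2, mode="reflect")
        real = F.conv1d(x, self.cos_kernel, stride=self.hop_length)
        imag = F.conv1d(x, self.sin_kernel, stride=self.hop_length)
        return torch.stack([real, imag], dim=1)   # (..., 2, F, T) real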

Verified to 5 × 10⁻⁶ max absolute difference against torch.stft directly. Same trick for the inverse with ConvTranspose1d plus an overlap-add window-squared envelope.
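To see why two real kernels reproduce the complex STFT, here is a minimal standalone sketch in plain Python (no PyTorch) that computes one windowed frame both ways. The 8-sample frame and the helper names are illustrative assumptions, not code from the export.

```python
import cmath
import math

def hann_periodic(n_fft):
    """Periodic Hann window (the torch.hann_window default)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def frame_via_kernels(frame):
    """One STFT frame as two real dot products per frequency bin."""
    n_fft = len(frame)
    window = hann_periodic(n_fft)
    norm = 1.0 / math.sqrt(n_fft)  # same scaling as normalized=True
    real, imag = [], []
    for k in range(n_fft // 2 + 1):
        cos_k = [window[n] * math.cos(2 * math.pi * k * n / n_fft) * norm for n in range(n_fft)]
        sin_k = [window[n] * -math.sin(2 * math.pi * k * n / n_fft) * norm for n in range(n_fft)]
        real.append(sum(c * x for c, x in zip(cos_k, frame)))
        imag.append(sum(s * x for s, x in zip(sin_k, frame)))
    return real, imag

def frame_via_complex_dft(frame):
    """Reference: the same frame through a complex one-sided DFT."""
    n_fft = len(frame)
    window = hann_periodic(n_fft)
    norm = 1.0 / math.sqrt(n_fft)
    return [
        norm * sum(window[n] * frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                   for n in range(n_fft))
        for k in range(n_fft // 2 + 1)
    ]

frame = [0.3, -1.2, 0.7, 0.0, 2.5, -0.4, 1.1, -0.9]
real, imag = frame_via_kernels(frame)
reference = frame_via_complex_dft(frame)
max_diff = max(max(abs(r - z.real), abs(i - z.imag))
               for r, i, z in zip(real, imag, reference))
```

Because exp(-jθ) = cos θ - j sin θ, the cosine kernel yields the real part and the negated sine kernel the imaginary part. A strided Conv1d applies the same dot products to every frame at once.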

After this fix, every view_as_real / view_as_complex in _magnitude and _mask gets rewritten to thread real-channel tensors through the whole forward pass. Zero complex tensors anywhere.

Blocker 2: `fractions.Fraction` in `model.segment`

The pretrained htdemucs_ft stores its segment length as Fraction(39, 5) (= 7.8 seconds). Dynamo can't trace Fraction arithmetic — it raises torch._dynamo.exc.Unsupported: call_function UserDefinedClassVariable(<class 'fractions.Fraction'>).

Fix. Coerce to float before export:

from fractions import Fraction

if isinstance(model.segment, Fraction):
    model.segment = float(model.segment)   # 7.8

Trivial. The math is identical at inference.
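As a quick sanity check on that claim, the sketch below compares the segment length in samples before and after the coercion. The 44.1 kHz sample rate is an assumption, inferred from the 343980-sample input used in the platform examples (343980 / 7.8 = 44100).

```python
from fractions import Fraction

SAMPLE_RATE = 44_100  # assumption: inferred from the 343980-sample segment

segment_exact = Fraction(39, 5)
segment_float = float(segment_exact)

# Exact rational arithmetic versus the float used after the patch.
samples_exact = int(segment_exact * SAMPLE_RATE)
samples_float = round(segment_float * SAMPLE_RATE)  # round() guards against float truncation
```

Both paths give the same segment length, so the chunk size seen by the model does not change.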

Blocker 3: `random.randrange` in the cross-transformer

CrossTransformerEncoder._get_pos_embedding calls Python's random.randrange:

def _get_pos_embedding(self, T, B, C, device):
    if self.emb == "sin":
        shift = random.randrange(self.sin_random_shift + 1)
        return create_sin_embedding(T, C, shift=shift, ...)

At inference, sin_random_shift=0, so random.randrange(1) always returns 0 — a no-op. But the ONNX exporter still can't see through Python's random module and fails.

Fix. Monkey-patch the method itself so shift=0 is hardcoded:

import types

from demucs.transformer import CrossTransformerEncoder, create_sin_embedding

def _get_pos_embedding_no_random(self_, T, B, C, device):
    if self_.emb == "sin":
        return create_sin_embedding(T, C, shift=0, device=device,
                                    max_period=self_.max_period)
    # ... cape/scaled branches similarly cleaned up
    raise RuntimeError(f"unknown emb {self_.emb}")

# Bind the replacement to each encoder instance; the class itself stays untouched.
for m in model.modules():
    if isinstance(m, CrossTransformerEncoder):
        m._get_pos_embedding = types.MethodType(_get_pos_embedding_no_random, m)

Mathematically identical at inference; exportable.
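The same pattern can be tried in isolation. This sketch uses a hypothetical ToyEncoder class (not part of demucs) to show that random.randrange(1) can only return 0, and that types.MethodType swaps the method on one instance without touching the class.

```python
import random
import types

class ToyEncoder:
    """Hypothetical stand-in for a module that draws a random shift."""
    def __init__(self, sin_random_shift=0):
        self.sin_random_shift = sin_random_shift

    def get_shift(self):
        return random.randrange(self.sin_random_shift + 1)

def get_shift_no_random(self_):
    # Only valid when sin_random_shift == 0, where randrange(1) can only yield 0.
    assert self_.sin_random_shift == 0
    return 0

encoder = ToyEncoder()
before = {encoder.get_shift() for _ in range(1000)}

# Bind the replacement to this one instance only.
encoder.get_shift = types.MethodType(get_shift_no_random, encoder)
after = {encoder.get_shift() for _ in range(1000)}
```

The patch removes the call the exporter cannot trace while leaving the returned value unchanged.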

Blocker 4: `aten::_native_multi_head_attention`

Modern PyTorch's nn.MultiheadAttention.forward short-circuits to a fused C++ kernel (_native_multi_head_attention) when its preconditions are met. That kernel has no ONNX symbolic at any opset, so the exporter throws UnsupportedOperatorError.

Fix. Replace each nn.MultiheadAttention instance's forward with a drop-in implementation that uses only plain ops with stable ONNX symbolics (Linear, bmm, softmax, transpose):

def _onnx_friendly_mha_forward(self_, query, key, value, ...):
    if self_.batch_first:
        query, key, value = (t.transpose(0, 1) for t in (query, key, value))
    tgt_len, bsz, embed_dim = query.shape
    head_dim = embed_dim // self_.num_heads

    if self_._qkv_same_embed_dim and torch.equal(query, key) and torch.equal(key, value):
        q, k, v = F.linear(query, self_.in_proj_weight, self_.in_proj_bias).chunk(3, dim=-1)
    else:
        # cross-attention: three separate matmuls
        ...

    q = q.contiguous().view(tgt_len, bsz * self_.num_heads, head_dim).transpose(0, 1)
    k = k.contiguous().view(-1,      bsz * self_.num_heads, head_dim).transpose(0, 1)
    v = v.contiguous().view(-1,      bsz * self_.num_heads, head_dim).transpose(0, 1)

    attn_weights = F.softmax(torch.bmm(q * head_dim ** -0.5, k.transpose(1, 2)), dim=-1)
    attn_output  = torch.bmm(attn_weights, v).transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
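    attn_output = self_.out_proj(attn_output)
    if self_.batch_first:
        attn_output = attn_output.transpose(0, 1)   # restore (batch, seq, embed) layout
    return attn_output, None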

Patched onto every MHA instance in the model. Verified parity: 1 × 10⁻⁶ max diff vs the fused fast path.
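The arithmetic inside that replacement is ordinary scaled dot-product attention. As a single-head illustration in plain Python, with toy shapes and hypothetical helper names, the sketch below checks that scaling the query before the matmul (as the patched forward does) matches scaling the scores afterwards.

```python
import math

def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v, scale_query_first):
    scale = len(q[0]) ** -0.5  # head_dim ** -0.5
    if scale_query_first:
        scores = matmul([[x * scale for x in row] for row in q], transpose(k))
    else:
        scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)

q = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]                       # 2 queries, head_dim 4
k = [[0.2, -0.3, 1.0, 0.4], [1.1, 0.0, -0.7, 0.9], [0.0, 0.5, 0.5, -1.0]]  # 3 keys
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                                 # 3 values

weights_a, out_a = attention(q, k, v, scale_query_first=True)
weights_b, out_b = attention(q, k, v, scale_query_first=False)
```

Each step maps onto an op with a stable ONNX symbolic (MatMul, Softmax, Transpose), which is why this form exports where the fused kernel does not.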

The result

With all four patches applied, torch.onnx.export (legacy exporter, opset 17, dynamo=False) writes a clean 316 MB .onnx file per sub-model in 6.5 seconds. Each file passes onnx.checker.check_model, contains 24,765 nodes, and runs in onnxruntime out of the box.

Verification	Value	Pass
STFT round-trip vs `torch.stft` / `torch.istft`	5 × 10⁻⁶ max abs diff	✅
Patched model vs original PyTorch	1 × 10⁻⁶ max abs diff	✅
ONNX Runtime CPU vs PyTorch CPU (drums stem)	1.63 × 10⁻⁴ max abs diff	✅
ONNX Runtime CPU vs PyTorch CPU (bass stem)	1.1 × 10⁻⁵ max abs diff	✅
ONNX Runtime CPU vs PyTorch CPU (other stem)	7.4 × 10⁻⁴ max abs diff	✅
ONNX Runtime CPU vs PyTorch CPU (vocals stem)	8 × 10⁻⁶ max abs diff	✅

All four stems are numerically equivalent to the official PyTorch htdemucs_ft at fp32: every per-stem difference is below the 1e-3 tolerance we attribute to floating-point accumulation drift, with the largest (7.4 × 10⁻⁴) on the other stem.
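The metric in the table is a plain max absolute difference. A minimal sketch of the check follows, using a hypothetical helper name and the per-stem figures copied from the table; in practice the inputs would be the flattened ONNX Runtime and PyTorch outputs.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))

TOLERANCE = 1e-3

# Per-stem figures copied from the verification table.
reported = {"drums": 1.63e-4, "bass": 1.1e-5, "other": 7.4e-4, "vocals": 8e-6}

worst_stem = max(reported, key=reported.get)
all_pass = all(diff < TOLERANCE for diff in reported.values())

# Toy usage on two short "outputs".
toy_diff = max_abs_diff([0.0, 0.5, -1.0], [0.0, 0.5004, -1.0])
```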

The exported ONNX models run 1.31× faster on CPU than the PyTorch baseline on the same hardware (1.59 s versus 2.09 s for a 7.8-s segment, about 24% less wall-clock time), because ONNX Runtime's graph optimizer can fold and fuse the cleaned-up graph more aggressively than PyTorch's eager runtime.

What this means by platform

The same .onnx file runs everywhere onnxruntime runs. Here's a quick-start per platform.

Python (any OS, CPU or GPU)

import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("htdemucs_ft_vocals.onnx",
                            providers=["CPUExecutionProvider"])
# providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]    # macOS
# providers=["CUDAExecutionProvider",   "CPUExecutionProvider"]    # NVIDIA Linux/Windows
# providers=["DmlExecutionProvider",    "CPUExecutionProvider"]    # Windows DX12

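audio, sr = sf.read("song.mp3", dtype="float32", always_2d=True)   # (samples, channels)
mix = audio.T[None].astype("float32")                              # (1, channels, samples)
# The Swift and web examples feed a fixed 1 × 2 × 343980 window (7.8 s at 44.1 kHz);
# if the export has a fixed input length, longer audio must be chunked to that size first.
stems = sess.run(["stems"], {"mix": mix})[0]
sf.write("vocals.wav", stems[0, 3].T, sr)   # index 3 = vocals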

The matching repo: StemSplitio/htdemucs-ft-vocals-onnx.
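For songs longer than one 7.8-s segment, the audio has to be split into fixed-size chunks and the outputs blended back together. The sketch below shows one way to do that with overlap-add on a mono list of samples. It is not the shipped aggregator: the 25% overlap, the triangular cross-fade, and the run_segment callable (which would wrap sess.run) are all assumptions for illustration.

```python
def triangular_window(length):
    """Weights that rise to the middle of a chunk and fall again, so overlaps cross-fade."""
    half = length // 2
    return [float(i) for i in range(1, half + 1)] + \
           [float(i) for i in range(length - half, 0, -1)]

def separate_long(signal, segment, overlap, run_segment):
    """Run `run_segment` on fixed-size chunks and blend the results with overlap-add."""
    stride = int((1 - overlap) * segment)
    window = triangular_window(segment)
    out = [0.0] * len(signal)
    weight_sum = [0.0] * len(signal)
    for start in range(0, len(signal), stride):
        chunk = signal[start:start + segment]
        valid = len(chunk)
        chunk = chunk + [0.0] * (segment - valid)  # zero-pad the last chunk to the fixed size
        result = run_segment(chunk)
        for i in range(valid):
            out[start + i] += window[i] * result[i]
            weight_sum[start + i] += window[i]
    # Normalise by the accumulated weights so overlapping regions are averaged.
    return [o / w for o, w in zip(out, weight_sum)]

# Toy check: an identity "model" must give back the input unchanged.
signal = [float(i) for i in range(50)]
restored = separate_long(signal, segment=16, overlap=0.25, run_segment=lambda c: c)
```

With the real model, segment would be 343980 and the loop would run once per channel layout the model expects; the toy values here only keep the example fast.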

Or just `pip install demucs-onnx`

We also packaged the full export and inference pipeline as a Python library: demucs-onnx on PyPI (source on GitHub). It bundles the four blocker patches, auto-downloads ONNX weights from Hugging Face, and ships a CLI for one-line stem separation — no PyTorch required at inference time.

pip install demucs-onnx                       # inference only (~50 MB deps)
demucs-onnx split song.mp3 --model htdemucs_ft --out stems/

pip install "demucs-onnx[export]"             # to convert your own .th checkpoint
demucs-onnx export htdemucs_ft --out htdemucs_ft.onnx

iOS / Swift

import onnxruntime_objc

let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())

let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(
    env: env,
    modelPath: Bundle.main.path(forResource: "htdemucs_ft_vocals", ofType: "onnx")!,
    sessionOptions: opts
)
// audio: 1 × 2 × 343980 Float32 buffer, then session.run(...)

Ship the 316 MB specialist .onnx in your app bundle; the full four-model bag is ~1.26 GB. The CoreML execution provider does the heavy lifting on the Apple Neural Engine when available.

Android / Kotlin

import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

val env = OrtEnvironment.getEnvironment()
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelPath, opts)

addNnapi() gives you Android's Neural Networks API for accelerated inference on Tensor / Snapdragon / MediaTek NPUs.

Web / `onnxruntime-web`

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("htdemucs_ft_vocals.onnx", {
  executionProviders: ["wasm"],
  graphOptimizationLevel: "all",
});
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });

Yes, you can run HT-Demucs FT in a browser. It's slower than the native CPU EP (the WebAssembly tax), but it works with zero install for users.

Performance numbers

Measured on Apple M4 Pro (24 GB unified memory) for a 3-minute song:

Backend	Latency	Real-time factor
ONNX Runtime CPU EP (full bag)	~88 s	0.49
ONNX Runtime CPU EP (one specialist)	~22 s	0.12
PyTorch CPU (full bag)	~125 s	0.69
PyTorch MPS (full bag, GPU)	~47 s	0.26
ONNX Runtime CUDA (NVIDIA L4, extrapolated)	~6 s	0.03

The single-specialist ONNX run is 5.7× faster than the full PyTorch bag on CPU (~22 s versus ~125 s) and produces the same output for its stem. That's the win for shipping htdemucs-ft-vocals-onnx in a vocal-remover app instead of the full PyTorch bag: smaller binary, faster inference, same SDR.
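The real-time factors and the speedup follow directly from the latencies in the table. This short sketch reproduces the arithmetic from the Apple M4 Pro rows; the dictionary keys are illustrative names.

```python
SONG_SECONDS = 180.0  # the 3-minute test song

# Latencies in seconds, copied from the Apple M4 Pro rows of the table.
latency_s = {
    "onnx_cpu_full_bag": 88.0,
    "onnx_cpu_one_specialist": 22.0,
    "pytorch_cpu_full_bag": 125.0,
    "pytorch_mps_full_bag": 47.0,
}

# Real-time factor = processing time / audio duration (below 1.0 is faster than real time).
real_time_factor = {name: round(t / SONG_SECONDS, 2) for name, t in latency_s.items()}

specialist_speedup = round(
    latency_s["pytorch_cpu_full_bag"] / latency_s["onnx_cpu_one_specialist"], 1
)
```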

How the stem specialists are derived (a cute trick)

The htdemucs_ft "bag" is actually 4 separate models. The bag's per-stem weight matrix is one-hot:

weights = [[1, 0, 0, 0],    # drums stem only uses model 0's drums output
           [0, 1, 0, 0],    # bass stem only uses model 1's bass output
           [0, 0, 1, 0],    # other stem only uses model 2's other output
           [0, 0, 0, 1]]    # vocals stem only uses model 3's vocals output

That means the bag's drums output is sub-model 0's drums output, bit-exact. So if you only need drums, shipping sub-model 0 alone (160 MB as a PyTorch checkpoint) gives you the same drums quality as the full 640 MB bag, at ~1/4 the inference cost.
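A small sketch makes the bit-exact claim concrete. It assumes the bag combines sub-model outputs as a per-stem weighted average, which is how we read the weight matrix; the aggregate function and the toy outputs are hypothetical.

```python
# weights[m][s]: contribution of sub-model m to stem s (one-hot for htdemucs_ft).
weights = [[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 1]]

def aggregate(model_outputs, weights):
    """Weighted average over sub-models, computed independently for each stem."""
    n_models = len(weights)
    n_stems = len(weights[0])
    n_samples = len(model_outputs[0][0])
    bag = []
    for s in range(n_stems):
        total = sum(w[s] for w in weights)
        bag.append([
            sum(weights[m][s] * model_outputs[m][s][t] for m in range(n_models)) / total
            for t in range(n_samples)
        ])
    return bag

# Toy data: model_outputs[m][s] is sub-model m's estimate of stem s (3 samples each).
model_outputs = [[[10.0 * m + s + 0.1 * t for t in range(3)] for s in range(4)]
                 for m in range(4)]
bag = aggregate(model_outputs, weights)
```

With one-hot weights, every other sub-model is multiplied by zero, so stem s of the bag equals stem s of sub-model s exactly, and the remaining three forward passes contribute nothing.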

We exposed this as five separate Hugging Face repos: one full-bag ONNX (htdemucs-ft-onnx) for convenience, plus four stem-specific ONNX repos for production deployments that only need one stem. Same trick works for the PyTorch sibling repos.

If you're building a drum sample extractor, ship htdemucs-ft-drums-onnx. A bassline transcriber? htdemucs-ft-bass-onnx. A vocal remover or karaoke maker? htdemucs-ft-vocals-onnx.

What's next

This is Day 1 + Day 2 of a 3-day ONNX project. Day 3 is:

CoreML execution provider profiling. First-time MLProgram compile of the 24k-node graph took >5 minutes on M4 Pro in our tests. We need to investigate MinimumDeploymentTarget, ComputeUnits=CPUAndNeuralEngine, and subgraph fallback rules to make CoreML EP genuinely fast on iOS / macOS.
INT8 dynamic quantization. onnxruntime.quantization.quantize_dynamic per model — typically 4× smaller files (~80 MB each), SDR drop usually under 0.3 dB on music models. Massive mobile win if it works on this architecture.
An onnxruntime-web demo Space on Hugging Face. Browser-only stem separation, drag-and-drop, no install, no server. The kind of demo that gets shared on Twitter and ends up in Awesome-ONNX lists.

Follow the StemSplitio Hugging Face org for updates as those land.

How does HT-Demucs ONNX compare to running PyTorch in 2026?

For server-side Python deployments where you control the runtime, PyTorch is fine — slightly slower than ONNX Runtime on CPU but compatible with apply_model's overlap-add helpers out of the box.

For everything else — iOS apps, Android apps, browser tools, embedded devices, Windows desktop tools that want to avoid a 2 GB PyTorch install — ONNX is the only path. Until this week, that path was blocked. Now it isn't.

If you're choosing between the ONNX repos and the StemSplit API for your product, the trade-off is:

ONNX repos = no per-request cost, no infrastructure, but ships 316+ MB in your app and consumes user device CPU/battery.
StemSplit API = pay-per-second, but instant cold-start, GPU-grade quality, no model bundling, no version maintenance.

For consumer apps with >1k separations / month, the API usually wins on total cost and user experience. For one-shot tools or self-hosted setups, the ONNX models are the right choice.

Try the StemSplit API — same models, hosted for you

Don't want to ship a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? The StemSplit API runs the same htdemucs_ft models you'll find in these Hugging Face repos, with credits, queueing, and a dashboard.

🌐 stemsplit.io — product home
📘 Developer docs — start here
🔌 API reference — full endpoint list
📚 Guides & recipes — common integrations

curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-track.mp3" \
  -F "model=htdemucs_ft"

Or use the no-code tools that ship this same model family today:

🎤 Vocal Remover — remove vocals from any song, in seconds
🎶 Karaoke Maker — instrumental + acapella in one pass
🎙️ Acapella Maker — clean isolated vocals
📺 YouTube Stem Splitter — paste a URL, get 4 stems
🎛️ Stem Splitter — generic 4-stem separation

FAQ

Can you export HT-Demucs FT to ONNX for use on iOS and Android in 2026?

Yes. As of May 2026, StemSplitio/htdemucs-ft-onnx ships what is, to our knowledge, the first working ONNX export of the full 4-stem htdemucs_ft bag. It runs in onnxruntime-mobile on iOS (CoreML EP) and Android (NNAPI EP) and matches the PyTorch original to within fp32 tolerance. Previous attempts failed because htdemucs_ft uses complex tensors, Python fractions.Fraction, random.randrange, and PyTorch's fused multi-head attention kernel, all of which the standard ONNX exporters refuse to handle. This release patches all four blockers and verifies parity on every stem, with max absolute differences ranging from 8 × 10⁻⁶ (vocals) to 7.4 × 10⁻⁴ (other).

How accurate is the ONNX export compared to the PyTorch HT-Demucs FT model?

Numerically equivalent at fp32, within normal floating-point accumulation drift (not bit-identical). Specifically, the maximum absolute difference between ONNX Runtime output and PyTorch output is 0.000163 on drums, 0.000011 on bass, 0.000739 on other, and 0.000008 on vocals, all under the 0.001 tolerance that fp32 reordering typically explains. SDR scores on the stem-separation-benchmark-2026 MUSDB18-HQ test set are identical to the PyTorch baseline.

Is HT-Demucs FT really faster as ONNX than as PyTorch?

On CPU, yes — about 1.31× faster (1.59 s vs 2.09 s per 7.8-s segment on M4 Pro). ONNX Runtime's graph optimizer can fold and fuse the cleaned-up graph more aggressively than PyTorch's eager runtime. On GPU, PyTorch and ONNX Runtime + CUDA are roughly tied; both win against CPU by a large margin. The bigger wins come from shipping a single specialist (drums/bass/other/vocals) instead of the full bag — those are ~4× faster than the full bag at identical per-stem quality.

What's the best way to run HT-Demucs FT in a browser for a vocal-remover web app?

Use StemSplitio/htdemucs-ft-vocals-onnx with onnxruntime-web. The WebAssembly execution provider supports the full model. Expect higher latency than native (browser sandboxing tax), but zero install and zero server cost. For production traffic, the StemSplit API is usually a better economic and UX choice — same model, GPU-accelerated, pay-per-second.

Can you train your own ONNX HT-Demucs model from scratch?

Yes — the official demucs repository ships training code. Once you have your trained .th checkpoint, the patches described above apply unchanged. We've also open-sourced the full export pipeline as a Python package: demucs-onnx on PyPI (source on GitHub). Install with pip install demucs-onnx[export], then run demucs-onnx export htdemucs_ft --out htdemucs_ft.onnx to produce a parity-verified ONNX file from any Demucs checkpoint — all four blockers above are patched for you.

Is there a Python package to run or export HT-Demucs as ONNX without writing the conversion code yourself?

Yes — demucs-onnx on PyPI. Inference path is pure onnxruntime + numpy (no PyTorch required at runtime); export path adds PyTorch + the original demucs package. Supports htdemucs, htdemucs_ft, and htdemucs_6s, auto-downloads weights from Hugging Face, ships a CLI for karaoke / acapella / 4-stem / 6-stem splits, and includes a browser-demo scaffolder for onnxruntime-web. MIT-licensed. Source: github.com/StemSplit/demucs-onnx.

Get notified about Day 3

Subscribe to the StemSplitio org on Hugging Face or watch the benchmark dataset — that's where INT8-quantized variants, the CoreML profiling writeup, and the browser demo Space will land first.

If you're building something with these models, we'd love to hear about it. Open a discussion on any of the repos or hit us up at stemsplit.io/contact.

All artefacts in this release are MIT-licensed. Original HT-Demucs by Rouard, Massa & Défossez (Meta AI); please cite their ICASSP 2023 paper if you use the model in research.
