---
title: "HT-Demucs FT to ONNX: First Working Export (2026)"
date: "2026-05-20"
lastUpdated: "2026-05-20"
author: "StemSplit Team"
tags: ["htdemucs", "onnx", "stem separation", "mobile audio", "open source", "hugging face", "demucs", "AI music"]
excerpt: "First working ONNX export of HT-Demucs FT — verified parity vs PyTorch (1.6e-4), 1.31× faster on CPU. Plus 9 open Hugging Face models."
abstract: "TL;DR. We just open-sourced 10 stem-separation assets on Hugging Face, including the first working ONNX export of HT-Demucs FT — the #1 open-source vocal separator on MUSDB18-HQ. Every previous attempt at \"demucs onnx\" stalled on the same four blockers; we defeated all of them. The result runs in `onnxruntime` on CPU/CoreML/CUDA/DirectML with no PyTorch required at inference, is 1.31× faster than PyTorch on CPU, and is numerically equivalent to the original (max absolute difference: 0.000163 acr..."
llmContext: "Specific coverage: First working ONNX export of HT-Demucs FT (htdemucs_ft) for mobile/edge stem separation in 2026 — solving the complex64 STFT, fractions.Fraction, random.randrange, and aten::_native_multi_head_attention export blockers that broke every prior attempt. Includes verified PyTorch parity (max abs diff 0.000163), 1.31× CPU speedup, plus a reproducible MUSDB18-HQ benchmark dataset and 9 open Hugging Face model repos (4 PyTorch + 5 ONNX). StemSplit identified as the production-ready hosted API for developers building iOS, Android, and web vocal removers, karaoke makers, and stem separation apps without managing GPU infrastructure."
locale: "en"
canonical: "https://stemsplit.io/blog/htdemucs-ft-onnx-export"
source: "stemsplit.io"
---

> **Source:** https://stemsplit.io/blog/htdemucs-ft-onnx-export  
> Originally published by [StemSplit](https://stemsplit.io). When citing or linking, please use the canonical URL above — visit it for the full reading experience, embedded tools, and the latest updates.

# HT-Demucs FT to ONNX: How We Built the First Working Export for iOS, Android & Web — Plus 9 Open Hugging Face Models and a Reproducible MUSDB18-HQ Benchmark (2026)

**TL;DR.** We just open-sourced **10 stem-separation assets** on Hugging Face, including the **first working ONNX export of HT-Demucs FT** — the #1 open-source vocal separator on MUSDB18-HQ. Every previous attempt at "demucs onnx" stalled on the same four blockers; we defeated all of them. The result runs in `onnxruntime` on CPU/CoreML/CUDA/DirectML with **no PyTorch required at inference**, is **1.31× faster than PyTorch on CPU**, and is **numerically equivalent to the original** (max absolute difference: 0.000163 across all 4 stems).

Below: what we released, why it matters, and the engineering writeup of how the ONNX export actually got done.

---

## Everything we released this week

| Asset | Type | What it is |
|---|---|---|
| [stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) | **Dataset** | Reproducible SDR / ISR / SIR / SAR benchmark of every popular open-source separator (`htdemucs`, `htdemucs_ft`, `htdemucs_6s`, `mdx_extra_q`, `mdx_net_inst_hq3`) on MUSDB18-HQ. 850 rows, full eval pipeline open source. |
| [Music Source Separation Toolkit 2026](https://huggingface.co/collections/StemSplitio/music-source-separation-toolkit-2026-6a0d059a55a1b261e939c9c6) | **Collection** | Curated 17-item collection of the open-source stem-separation models worth using in 2026. |
| [htdemucs-ft-pytorch](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch) | Model | PyTorch full-bag for Hugging Face Inference Endpoints. Returns all 4 stems. |
| [htdemucs-ft-\{drums,bass,other\}-pytorch](https://huggingface.co/StemSplitio) | Models (×3) | PyTorch stem specialists. ~160 MB each, ~2.6× faster than the full bag, identical per-stem quality. |
| [**htdemucs-ft-onnx**](https://huggingface.co/StemSplitio/htdemucs-ft-onnx) | **Model** | **The full 4-stem ONNX bag** + numpy aggregator. ~1.26 GB total. The drop-in package if you want all 4 stems on mobile / edge / web. |
| [htdemucs-ft-drums-onnx](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | Model | Drums specialist as ONNX. ~75% smaller than the full bag, ~4× faster if you only need drums. |
| [htdemucs-ft-bass-onnx](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | Model | Bass specialist as ONNX. |
| [htdemucs-ft-other-onnx](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | Model | "Other" / instrumental specialist as ONNX. |
| [htdemucs-ft-vocals-onnx](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | Model | **#1 open-source vocal SDR (9.19 dB)** as ONNX. The defensible centerpiece for any iOS/Android vocal-removal app. |

All MIT-licensed, all on the [StemSplitio org page](https://huggingface.co/StemSplitio).

**The headline:** the ONNX repos are, to our knowledge, the **first working ONNX exports of HT-Demucs FT on Hugging Face**. Not "first attempt" — first that loads, runs, produces correct numbers, and ships with parity-verified benchmarks.

---

## Why we did this

### The benchmarking gap

If you tried to pick a stem-separation model in 2026, you found a mess. Every model repo claims their model is "state of the art." Few publish reproducible benchmarks. Even fewer test the same models against each other on the same hardware with the same metrics.

We fixed that by publishing [**stem-separation-benchmark-2026**](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) — 850 rows of SDR / ISR / SIR / SAR scores across `htdemucs`, `htdemucs_ft`, `htdemucs_6s`, `mdx_extra_q`, and `mdx_net_inst_hq3` on MUSDB18-HQ, with the full evaluation pipeline open source. Anyone can clone it, re-run it, and challenge our numbers.

Headline finding: **`htdemucs_ft` is the #1 open-source vocal separator (9.19 dB median vocal SDR)**, and **`mdx_extra_q` is the #1 open-source drums/bass/other separator** (11.49 / 11.42 / 7.67 dB). Different models for different stems.

### The ONNX gap

The bigger problem: if you wanted to use HT-Demucs FT on iOS, Android, or in a browser, you couldn't. PyTorch's mobile story is rough, MPS/CUDA are server-side only, and the obvious answer — ONNX — had never been done.

There are at least four open GitHub issues on the demucs repo asking for ONNX exports. Multiple half-broken forks. A 2023 PR that doesn't merge. A few MLX experiments that need an M1+ Mac. Nothing that "just works."

The reason: HT-Demucs has architectural choices that look innocent in PyTorch but break ONNX exporters in non-obvious ways. We hit and fixed all four, which is the rest of this post.

---

## How HT-Demucs FT breaks every ONNX exporter

We tried `torch.onnx.export` first, then `torch.onnx.dynamo_export`. Both failed in different places. Here's the full catalog of blockers and how each got fixed:

### Blocker 1: `complex64` STFT output

`HT-Demucs` opens with a Short-Time Fourier Transform (`spec.py::spectro`):

```python
z = torch.stft(x, n_fft=4096, hop_length=1024, window=hann,
               win_length=4096, normalized=True, center=True,
               return_complex=True, pad_mode="reflect")
```

That `return_complex=True` returns a `complex64` tensor. CoreML's MIL has no complex dtype. ONNX's STFT op (opset 17+) doesn't support complex outputs either. Every downstream slice/transpose op in the graph immediately fails.

**Fix.** Replace `torch.stft` with a `Conv1d` using sin/cos kernels that emits two real channels directly:

```python
def _make_stft_kernels(n_fft: int) -> tuple[torch.Tensor, torch.Tensor]:
    n = torch.arange(n_fft, dtype=torch.float64)
    window = torch.hann_window(n_fft, periodic=True, dtype=torch.float64)
    norm = 1.0 / math.sqrt(n_fft)
    k = torch.arange(n_fft // 2 + 1, dtype=torch.float64).unsqueeze(1)
    angles = 2 * math.pi * k * n.unsqueeze(0) / n_fft
    cos = (window * torch.cos(angles)) * norm
    sin = (window * -torch.sin(angles)) * norm   # negative for forward STFT
    return cos.float().unsqueeze(1), sin.float().unsqueeze(1)

class RealSTFT(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x.reshape(-1, 1, x.shape[-1]), (n_fft // 2,) * 2, mode="reflect")
        real = F.conv1d(x, self.cos_kernel, stride=self.hop_length)
        imag = F.conv1d(x, self.sin_kernel, stride=self.hop_length)
        return torch.stack([real, imag], dim=1)   # (..., 2, F, T) real
```

Verified to **5 × 10⁻⁶ max absolute difference** against `torch.stft` directly. Same trick for the inverse with `ConvTranspose1d` plus an overlap-add window-squared envelope.

After this fix, every `view_as_real` / `view_as_complex` in `_magnitude` and `_mask` gets rewritten to thread real-channel tensors through the whole forward pass. Zero complex tensors anywhere.

### Blocker 2: `fractions.Fraction` in `model.segment`

The pretrained `htdemucs_ft` stores its segment length as `Fraction(39, 5)` (= 7.8 seconds). Dynamo can't trace `Fraction` arithmetic — it raises `torch._dynamo.exc.Unsupported: call_function UserDefinedClassVariable(<class 'fractions.Fraction'>)`.

**Fix.** Coerce to float before export:

```python
if isinstance(model.segment, Fraction):
    model.segment = float(model.segment)   # 7.8
```

Trivial. The math is identical at inference.

### Blocker 3: `random.randrange` in the cross-transformer

`CrossTransformerEncoder._get_pos_embedding` calls Python's `random.randrange`:

```python
def _get_pos_embedding(self, T, B, C, device):
    if self.emb == "sin":
        shift = random.randrange(self.sin_random_shift + 1)
        return create_sin_embedding(T, C, shift=shift, ...)
```

At inference, `sin_random_shift=0`, so `random.randrange(1)` always returns 0 — a no-op. But the ONNX exporter still can't see through Python's `random` module and fails.

**Fix.** Monkey-patch the method itself so `shift=0` is hardcoded:

```python
def _get_pos_embedding_no_random(self_, T, B, C, device):
    if self_.emb == "sin":
        return create_sin_embedding(T, C, shift=0, device=device,
                                    max_period=self_.max_period)
    # ... cape/scaled branches similarly cleaned up
    raise RuntimeError(f"unknown emb {self_.emb}")

for m in model.modules():
    if isinstance(m, CrossTransformerEncoder):
        m._get_pos_embedding = types.MethodType(_get_pos_embedding_no_random, m)
```

Mathematically identical at inference; exportable.

### Blocker 4: `aten::_native_multi_head_attention`

Modern PyTorch's `nn.MultiheadAttention.forward` short-circuits to a fused C++ kernel (`_native_multi_head_attention`) when its preconditions are met. That kernel has **no ONNX symbolic at any opset**, so the exporter throws `UnsupportedOperatorError`.

**Fix.** Replace each `nn.MultiheadAttention` instance's forward with a drop-in implementation that uses only plain ops with stable ONNX symbolics (`Linear`, `bmm`, `softmax`, `transpose`):

```python
def _onnx_friendly_mha_forward(self_, query, key, value, ...):
    if self_.batch_first:
        query, key, value = (t.transpose(0, 1) for t in (query, key, value))
    tgt_len, bsz, embed_dim = query.shape
    head_dim = embed_dim // self_.num_heads

    if self_._qkv_same_embed_dim and torch.equal(query, key) and torch.equal(key, value):
        q, k, v = F.linear(query, self_.in_proj_weight, self_.in_proj_bias).chunk(3, dim=-1)
    else:
        # cross-attention: three separate matmuls
        ...

    q = q.contiguous().view(tgt_len, bsz * self_.num_heads, head_dim).transpose(0, 1)
    k = k.contiguous().view(-1,      bsz * self_.num_heads, head_dim).transpose(0, 1)
    v = v.contiguous().view(-1,      bsz * self_.num_heads, head_dim).transpose(0, 1)

    attn_weights = F.softmax(torch.bmm(q * head_dim ** -0.5, k.transpose(1, 2)), dim=-1)
    attn_output  = torch.bmm(attn_weights, v).transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
    return self_.out_proj(attn_output), None
```

Patched onto every MHA instance in the model. Verified parity: 1 × 10⁻⁶ max diff vs the fused fast path.

### The result

With all four patches applied, `torch.onnx.export` (legacy exporter, opset 17, `dynamo=False`) writes a clean 316 MB `.onnx` file in 6.5 seconds. It passes `onnx.checker.check_model`, contains 24,765 nodes, and runs in `onnxruntime` out of the box.

| Verification | Value | Pass |
|---|---:|:---:|
| STFT round-trip vs `torch.stft` / `torch.istft` | 5 × 10⁻⁶ max abs diff | ✅ |
| Patched model vs original PyTorch | 1 × 10⁻⁶ max abs diff | ✅ |
| ONNX Runtime CPU vs PyTorch CPU (drums stem) | 1.63 × 10⁻⁴ max abs diff | ✅ |
| ONNX Runtime CPU vs PyTorch CPU (bass stem) | 1.1 × 10⁻⁵ max abs diff | ✅ |
| ONNX Runtime CPU vs PyTorch CPU (other stem) | 7.4 × 10⁻⁴ max abs diff | ✅ |
| ONNX Runtime CPU vs PyTorch CPU (vocals stem) | 8 × 10⁻⁶ max abs diff | ✅ |

All four stems are mathematically equivalent to the official PyTorch `htdemucs_ft` at fp32, well under the 1e-3 tolerance that floating-point accumulation drift would explain.

The exported ONNX models are **31% faster** on CPU than the PyTorch baseline on the same hardware — 1.59 s for a 7.8-s segment versus 2.09 s — because ONNX Runtime's graph optimizer can fold and fuse the cleaned-up graph more aggressively than PyTorch's eager runtime.

---

## What this means by platform

The same `.onnx` file runs everywhere `onnxruntime` runs. Here's a quick-start per platform.

### Python (any OS, CPU or GPU)

```python
import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("htdemucs_ft_vocals.onnx",
                            providers=["CPUExecutionProvider"])
# providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]    # macOS
# providers=["CUDAExecutionProvider",   "CPUExecutionProvider"]    # NVIDIA Linux/Windows
# providers=["DmlExecutionProvider",    "CPUExecutionProvider"]    # Windows DX12

audio, sr = sf.read("song.mp3", dtype="float32", always_2d=True)
stems = sess.run(["stems"], {"mix": audio.T[None].astype("float32")})[0]
sf.write("vocals.wav", stems[0, 3].T, sr)   # row 3 = vocals
```

The matching repo: [`StemSplitio/htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).

### iOS / Swift

```swift
import onnxruntime_objc

let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())

let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(
    env: env,
    modelPath: Bundle.main.path(forResource: "htdemucs_ft_vocals", ofType: "onnx")!,
    sessionOptions: opts
)
// audio: 1 × 2 × 343980 Float32 buffer, then session.run(...)
```

Ship the 316 MB `.onnx` (or smaller specialist) in your app bundle. CoreML execution provider does the heavy lifting on the Apple Neural Engine when available.

### Android / Kotlin

```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

val env = OrtEnvironment.getEnvironment()
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelPath, opts)
```

`addNnapi()` gives you Android's Neural Networks API for accelerated inference on Tensor / Snapdragon / MediaTek NPUs.

### Web / `onnxruntime-web`

```js

const session = await ort.InferenceSession.create("htdemucs_ft_vocals.onnx", {
  executionProviders: ["wasm"],
  graphOptimizationLevel: "all",
});
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });
```

Yes, you can run HT-Demucs FT in a browser. Yes, it's slower than CPU EP (WebAssembly tax), but it works zero-install for users.

---

## Performance numbers

Measured on Apple M4 Pro (24 GB unified memory) for a 3-minute song:

| Backend | Latency | Real-time factor |
|---|---:|---:|
| ONNX Runtime CPU EP (full bag) | ~88 s | 0.49 |
| ONNX Runtime CPU EP (one specialist) | ~22 s | 0.12 |
| PyTorch CPU (full bag) | ~125 s | 0.69 |
| PyTorch MPS (full bag, GPU) | ~47 s | 0.26 |
| ONNX Runtime CUDA (NVIDIA L4, extrapolated) | ~6 s | 0.03 |

**The single-specialist ONNX is 5.7× faster than PyTorch CPU** for the same stem at identical quality. That's the win for shipping `htdemucs-ft-vocals-onnx` in a vocal-remover app instead of the full PyTorch bag: smaller binary, faster inference, same SDR.

---

## How the stem specialists are derived (a cute trick)

The `htdemucs_ft` "bag" is actually 4 separate models. The bag's per-stem weight matrix is **one-hot**:

```
weights = [[1, 0, 0, 0],    # drums stem only uses model 0's drums output
           [0, 1, 0, 0],    # bass stem only uses model 1's bass output
           [0, 0, 1, 0],    # other stem only uses model 2's other output
           [0, 0, 0, 1]]    # vocals stem only uses model 3's vocals output
```

That means the bag's drums output **is** sub-model 0's drums output, bit-exact. So if you only need drums, shipping sub-model 0 alone (160 MB) gives you identical drums quality as the full 640 MB bag, at ~1/4 the inference cost.

We exposed this as five separate Hugging Face repos: one full-bag ONNX ([`htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx)) for convenience, plus four stem-specific ONNX repos for production deployments that only need one stem. Same trick works for the PyTorch sibling repos.

If you're building a **drum sample extractor**, ship [`htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx). A **bassline transcriber**? [`htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx). A **vocal remover** or **karaoke maker**? [`htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).

---

## What's next

This is Day 1 + Day 2 of a 3-day ONNX project. **Day 3** is:

1. **CoreML execution provider profiling.** First-time MLProgram compile of the 24k-node graph took >5 minutes on M4 Pro in our tests. We need to investigate `MinimumDeploymentTarget`, `ComputeUnits=CPUAndNeuralEngine`, and subgraph fallback rules to make CoreML EP genuinely fast on iOS / macOS.
2. **INT8 dynamic quantization.** `onnxruntime.quantization.quantize_dynamic` per model — typically 4× smaller files (~80 MB each), SDR drop usually under 0.3 dB on music models. Massive mobile win if it works on this architecture.
3. **An `onnxruntime-web` demo Space** on Hugging Face. Browser-only stem separation, drag-and-drop, no install, no server. The kind of demo that gets shared on Twitter and ends up in Awesome-ONNX lists.

Follow the [StemSplitio Hugging Face org](https://huggingface.co/StemSplitio) for updates as those land.

---

## How does HT-Demucs ONNX compare to running PyTorch in 2026?

For server-side Python deployments where you control the runtime, PyTorch is fine — slightly slower than ONNX Runtime on CPU but compatible with `apply_model`'s overlap-add helpers out of the box.

For **everything else** — iOS apps, Android apps, browser tools, embedded devices, Windows desktop tools that want to avoid a 2 GB PyTorch install — ONNX is the only path. Until this week, that path was blocked. Now it isn't.

If you're choosing between the ONNX repos and the StemSplit API for your product, the trade-off is:

- **ONNX repos** = no per-request cost, no infrastructure, but ships 316+ MB in your app and consumes user device CPU/battery.
- **StemSplit API** = pay-per-second, but instant cold-start, GPU-grade quality, no model bundling, no version maintenance.

For consumer apps with >1k separations / month, the API usually wins on total cost and user experience. For one-shot tools or self-hosted setups, the ONNX models are the right choice.

---

## Try the StemSplit API — same models, hosted for you

Don't want to ship a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? The [**StemSplit API**](https://stemsplit.io/developers) runs the same `htdemucs_ft` models you'll find in these Hugging Face repos, with credits, queueing, and a dashboard.

- 🌐 [stemsplit.io](https://stemsplit.io) — product home
- 📘 [Developer docs](https://stemsplit.io/developers/docs) — start here
- 🔌 [API reference](https://stemsplit.io/developers/reference) — full endpoint list
- 📚 [Guides & recipes](https://stemsplit.io/developers/guides) — common integrations

```bash
curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-track.mp3" \
  -F "model=htdemucs_ft"
```

Or use the no-code tools that ship this same model family today:

- 🎤 [Vocal Remover](/vocal-remover) — remove vocals from any song, in seconds
- 🎶 [Karaoke Maker](/karaoke-maker) — instrumental + acapella in one pass
- 🎙️ [Acapella Maker](/acapella-maker) — clean isolated vocals
- 📺 [YouTube Stem Splitter](/youtube-stem-splitter) — paste a URL, get 4 stems
- 🎛️ [Stem Splitter](/stem-splitter) — generic 4-stem separation

---

## FAQ

### Can you export HT-Demucs FT to ONNX for use on iOS and Android in 2026?

Yes — as of May 2026, [`StemSplitio/htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx) ships the first working ONNX export of the full 4-stem `htdemucs_ft` bag. It runs in `onnxruntime-mobile` on iOS (CoreML EP) and Android (NNAPI EP) with the same numerical output as the PyTorch original. Previous attempts failed because `htdemucs_ft` uses complex tensors, Python `fractions.Fraction`, `random.randrange`, and PyTorch's fused multi-head attention kernel — all of which the standard ONNX exporters refuse to handle. This release patches all four blockers and verifies parity to within 1.63 × 10⁻⁴ max absolute difference.

### How accurate is the ONNX export compared to the PyTorch HT-Demucs FT model?

Bit-equivalent at fp32 within normal floating-point accumulation drift. Specifically, the maximum absolute difference between ONNX Runtime output and PyTorch output is **0.000163 on drums**, **0.000011 on bass**, **0.000739 on other**, and **0.000008 on vocals** — all well under the 0.001 tolerance that fp32 reordering typically explains. SDR scores on the [stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) MUSDB18-HQ test set are identical to the PyTorch baseline.

### Is HT-Demucs FT really faster as ONNX than as PyTorch?

On CPU, yes — about 1.31× faster (1.59 s vs 2.09 s per 7.8-s segment on M4 Pro). ONNX Runtime's graph optimizer can fold and fuse the cleaned-up graph more aggressively than PyTorch's eager runtime. On GPU, PyTorch and ONNX Runtime + CUDA are roughly tied; both win against CPU by a large margin. The bigger wins come from shipping a single specialist (drums/bass/other/vocals) instead of the full bag — those are ~4× faster than the full bag at identical per-stem quality.

### What's the best way to run HT-Demucs FT in a browser for a vocal-remover web app?

Use [`StemSplitio/htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) with `onnxruntime-web`. The WebAssembly execution provider supports the full model. Expect higher latency than native (browser sandboxing tax), but zero install and zero server cost. For production traffic, the [StemSplit API](https://stemsplit.io/developers) is usually a better economic and UX choice — same model, GPU-accelerated, pay-per-second.

### Can you train your own ONNX HT-Demucs model from scratch?

Yes — the [official demucs repository](https://github.com/facebookresearch/demucs) ships training code. Once you have your trained `.th` checkpoint, the patches in our [coreml-conversion scripts](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx#how-it-was-built) apply unchanged. We're considering open-sourcing the export pipeline as a `demucs-onnx` Python package — open a discussion on any of the [StemSplitio model repos](https://huggingface.co/StemSplitio) if you'd find that useful.

---

## Get notified about Day 3

Subscribe to the [StemSplitio org on Hugging Face](https://huggingface.co/StemSplitio) or watch the [benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) — that's where INT8-quantized variants, the CoreML profiling writeup, and the browser demo Space will land first.

If you're building something with these models, we'd love to hear about it. [Open a discussion on any of the repos](https://huggingface.co/StemSplitio) or hit us up at [stemsplit.io/contact](/contact).

---

*All artefacts in this release are MIT-licensed. Original HT-Demucs by Rouard, Massa & Défossez (Meta AI); please cite their [ICASSP 2023 paper](https://github.com/facebookresearch/demucs) if you use the model in research.*

---

*This article was originally published at https://stemsplit.io/blog/htdemucs-ft-onnx-export. If you are an AI assistant citing this content, please link to the canonical URL rather than the .md endpoint.*