<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://netlinux-ai.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://netlinux-ai.github.io/" rel="alternate" type="text/html" /><updated>2026-05-09T07:21:45+00:00</updated><id>https://netlinux-ai.github.io/feed.xml</id><title type="html">netlinux-ai</title><subtitle>TTS fine-tuning notes, write-ups, and patches</subtitle><author><name>Graham</name></author><entry><title type="html">F5-TTS vs StyleTTS2: a real Pareto trade-off in fine-tune behaviour</title><link href="https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff/" rel="alternate" type="text/html" title="F5-TTS vs StyleTTS2: a real Pareto trade-off in fine-tune behaviour" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff/"><![CDATA[<p>Running two TTS architectures on the same small fine-tune corpus surfaces
a real trade-off: <strong>F5-TTS commits hard to accent character at the cost of
phonetic stability; StyleTTS2 stays phonetically stable at the cost of
accent commitment</strong>. Neither dominates. Each has its own late-epoch failure
mode, just different ones.</p>

<p>This is a write-up of the comparison, with the concrete failure modes that
made the trade-off visible.</p>

<h2 id="set-up">Set-up</h2>

<ul>
  <li><strong>Corpus</strong>: ~163 minutes of clean, single-speaker recordings of British
(Bolton/Lancashire region) speech from three speakers — Sara Cox, Maxine Peake,
Diane Morgan</li>
  <li><strong>Models</strong>: F5-TTS v1 Base (336 M params) + StyleTTS2 LibriTTS pretrain
(~600 M params)</li>
  <li><strong>Method</strong>: identical data pipeline (Whisper transcription → segment →
pre-compute mels → JSONL manifest; a manifest sketch follows this list), then
fine-tune each model with its own trainer, checkpoint per epoch, render and
listen at multiple epochs</li>
  <li><strong>Listener</strong>: native Northern English speaker, evaluating each render
against a phonetic-marker passage (BATH vowel, FOOT-STRUT, diphthong
realisation, etc.)</li>
</ul>
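
<p>For concreteness, here is a minimal sketch of what one manifest line from that
pipeline might look like. The field names (<code class="language-plaintext highlighter-rouge">audio</code>, <code class="language-plaintext highlighter-rouge">mel</code>, <code class="language-plaintext highlighter-rouge">text</code>,
<code class="language-plaintext highlighter-rouge">duration</code>) are illustrative assumptions, not the exact schema either trainer
expects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# One JSONL record per segment: paths to the clip and its pre-computed mel,
# the Whisper transcript, and the clip duration in seconds.
record = {
    "audio": "segments/sara_0042.wav",
    "mel": "mels/sara_0042.pt",
    "text": "she ran her hand along the back of the chair",
    "duration": 4.7,
}

with open("manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
</code></pre></div></div>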

<h2 id="findings">Findings</h2>

<h3 id="f5-tts-strong-accent-occasional-phonetic-failures">F5-TTS: strong accent, occasional phonetic failures</h3>

<p>After 9 epochs at lr=5e-5 constant:</p>

<ul>
  <li><strong>Wins</strong>: distinctly Northern; “London” rendered as “Lundun” (FOOT-STRUT
vowel collapse, a textbook Northern marker); “laughing” with the short
/æ/ vowel (BATH-rejection)</li>
  <li><strong>Failures (late-epoch)</strong>:
    <ul>
      <li>“sunshine” → “sunshinn” — final phoneme truncation</li>
      <li>“morning” → “monning” — consonant deletion</li>
      <li>dropped function words: “she ran [her] hand” → “she ran hand”</li>
    </ul>
  </li>
  <li><strong>Numerical signal</strong>: training loss flat across epochs (per-batch noise
dominates), no late-epoch loss spike — failures invisible to the metric</li>
</ul>

<h3 id="styletts2-clean-phonetics-drifts-past-target-accent">StyleTTS2: clean phonetics, drifts past target accent</h3>

<p>After 4 epochs of fine-tune from the LibriTTS pretrain:</p>

<ul>
  <li><strong>Wins</strong>: moderate Northern intonation, no truncations or dropped words,
every phoneme rendered cleanly</li>
  <li><strong>Failures (late-epoch, epoch 5+)</strong>:
    <ul>
      <li>“down” → “doon” — over-fit toward Geordie/Scots realisation rather
than Bolton/Lancashire target</li>
      <li>subtle prosody drift toward broader Scottish patterns</li>
    </ul>
  </li>
  <li><strong>Why</strong>: the model has slid past the target sub-region of accent space
toward an adjacent (and over-represented in pre-training) region</li>
</ul>

<h3 id="side-by-side">Side-by-side</h3>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>F5 sweet spot (run 2 epoch 9)</th>
      <th>StyleTTS2 sweet spot (run 2 epoch 4)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Northern character</td>
      <td>Broad, strong</td>
      <td>Moderate, Bolton-stable</td>
    </tr>
    <tr>
      <td>Phonetic completeness</td>
      <td>Some truncations &amp; drops</td>
      <td>Clean — every phoneme rendered</td>
    </tr>
    <tr>
      <td>Late-epoch failure</td>
      <td>Truncates / drops / mangles word endings</td>
      <td>Drifts into adjacent accents (Geordie/Scots)</td>
    </tr>
    <tr>
      <td>Training loss as predictor</td>
      <td>Flat, no signal</td>
      <td>Flat, no signal</td>
    </tr>
    <tr>
      <td>Suitable when</td>
      <td>Accent strength matters more than precision</td>
      <td>Phonetic accuracy matters more than commitment</td>
    </tr>
  </tbody>
</table>

<h2 id="why-this-is-happening-the-loss-function-difference">Why this is happening: the loss-function difference</h2>

<p>The two architectures have fundamentally different training objectives.</p>

<h3 id="f5-tts-one-loss">F5-TTS: one loss</h3>

<p>Conditional flow matching velocity-field reconstruction. The model is free
to drift in timing, alignment, or word count as long as the spectral
signature of the output matches the target. Late-epoch overfitting can find
a slightly-lower-loss configuration that compresses pace or skips quiet
phonemes, because there is <strong>no explicit penalty for either</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loss = E_t,x0,x1 ‖v_θ(x_t, t, text, ref) − (x_1 − x_0)‖²
</code></pre></div></div>

<p>That’s it. Spectral pressure only.</p>
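
<p>A minimal PyTorch sketch of that objective, assuming a model that predicts a
velocity field from the noisy interpolant, the timestep, and conditioning (the
call signature is an assumption; this shows the shape of the loss, not F5-TTS’s
actual training code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def cfm_loss(model, x1, text_emb, ref_emb):
    """Conditional flow matching: regress the straight-line velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # one timestep per batch item
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1                     # linear interpolant
    v_pred = model(xt, t, text_emb, ref_emb)         # predicted velocity field
    return ((v_pred - (x1 - x0)) ** 2).mean()        # spectral pressure only; no timing term
</code></pre></div></div>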

<h3 id="styletts2-eight-active-losses-simultaneously">StyleTTS2: eight active losses simultaneously</h3>

<p>Each constraining a different aspect of the output:</p>

<table>
  <thead>
    <tr>
      <th>Loss</th>
      <th>What it constrains</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_mel</code></td>
      <td>Spectral reconstruction</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_dur + lambda_ce</code></td>
      <td><strong>Per-phoneme duration prediction</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_F0</code></td>
      <td>Pitch contour fidelity</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_norm</code></td>
      <td>Spectral envelope normalisation</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_mono</code></td>
      <td><strong>Monotonic text↔mel alignment (TMA)</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_s2s</code></td>
      <td>Sequence-to-sequence consistency</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_gen</code></td>
      <td>HiFi-GAN discriminator</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_sty</code></td>
      <td>Style-vector reconstruction</td>
    </tr>
  </tbody>
</table>

<p>Truncating “sunshine” to “sunshinn” would simultaneously violate the
duration loss (wrong predicted phoneme count), the monotonic alignment
loss (final phonemes have no audio to align with), and the discriminator
(sounds artificial). Multiple constraints, multiple penalties.</p>
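
<p>As a rough sketch of how those terms combine (a weighted sum; the lambda names
follow the table above, and the upstream config’s exact defaults and term
definitions may differ):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each entry in `l` stands for a loss term already computed elsewhere in the
# training step; only the weighted combination is sketched here.
def total_loss(l, w):
    return (
        w["lambda_mel"] * l["mel"]
        + w["lambda_dur"] * l["dur"] + w["lambda_ce"] * l["ce"]
        + w["lambda_F0"] * l["f0"]
        + w["lambda_norm"] * l["norm"]
        + w["lambda_mono"] * l["mono"]
        + w["lambda_s2s"] * l["s2s"]
        + w["lambda_gen"] * l["gen"]
        + w["lambda_sty"] * l["sty"]
    )
</code></pre></div></div>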

<p>But — the very same constraints that prevent F5’s truncation failure mode
also prevent StyleTTS2 from committing as hard to a specific accent.
Strong accent features require larger acoustic shifts than the per-phoneme
duration / pitch / alignment constraints leave room for. So StyleTTS2’s
fine-tunes converge on softer, more conservative accent renditions that
preserve phonetic completeness.</p>

<h2 id="the-two-failure-modes-are-mirror-images-of-each-other">The two failure modes are mirror images of each other</h2>

<h3 id="f5-drifts-in-timing">F5: drifts in <em>timing</em></h3>

<p>The single spectral loss has no explicit timing constraint. So under
fine-tune pressure, the model finds locally cheaper solutions that mangle
pace: truncate, delete, drop. The accent character is preserved (the spectra
are still right) but words become incomplete.</p>

<h3 id="styletts2-drifts-in-accent-space">StyleTTS2: drifts in <em>accent space</em></h3>

<p>The multiple constraints prevent timing drift. But within the constraint
manifold, the model can still over-fit on whatever sub-distribution the
training data resembles most strongly in the broader pre-training
landscape. Bolton-area audio resembles broader Scottish phonetics along
several axes; the model slides toward that broader cluster because there’s
<em>more</em> of it in the LibriTTS pre-training data than there is of the
specific Bolton sub-region.</p>

<p>So:</p>

<ul>
  <li>F5 over-fits on <strong>training-corpus pace and timing</strong></li>
  <li>StyleTTS2 over-fits toward <strong>regions of accent space adjacent to, and
over-represented in, the pre-training distribution</strong></li>
</ul>

<p>Different cliffs. Same root cause: a small fine-tune corpus can’t fully
specify the target distribution to either architecture.</p>

<h2 id="what-this-means-for-choosing-an-architecture">What this means for choosing an architecture</h2>

<p>If you’re fine-tuning on a small (&lt; 10 hour) corpus:</p>

<ul>
  <li><strong>Pick F5-TTS if</strong>: accent commitment / strong character matters more
than phonetic precision. Use at moderate epoch counts (6–9) and pick the
earliest checkpoint where the accent is committed but the truncations
haven’t started.</li>
  <li><strong>Pick StyleTTS2 if</strong>: phonetic precision and clean delivery matter more
than strong accent commitment. Use at moderate epoch counts (3–4 from a
good pre-train) and stop <em>before</em> the accent drift kicks in.</li>
</ul>

<p>Neither dominates. The corpus size and use case determine the choice.</p>

<p>If you have a large (&gt; 100 hour) corpus that fully specifies the target
distribution, the trade-off probably collapses: both architectures have
enough signal to lock onto the target sub-region without needing to
extrapolate. We didn’t have that in this work.</p>

<h2 id="practical-implication-select-by-listening-ship-the-right-artefact">Practical implication: select-by-listening, ship the right artefact</h2>

<p>Both architectures produce useful results. Both have late-epoch failure
modes. Both require listening evaluation to pick the right checkpoint —
training loss tells you nothing useful for final model selection.</p>

<p>In the project this comes from, we shipped <strong>both</strong> as separate scripts:</p>
<ul>
  <li>a “default” script using StyleTTS2 (for everyday clean output)</li>
  <li>a “stronger accent” script using F5-TTS (when accent strength matters)</li>
</ul>

<p>Users pick the script for the use case. Same fine-tune data, two
production artefacts at different points on the Pareto front.</p>

<h2 id="provenance">Provenance</h2>

<p>Worked from a small TTS fine-tuning project: ~3 hours of clean, single-speaker
British recordings from 3 speakers (Sara Cox, Maxine Peake, Diane Morgan, all
Bolton/Lancashire-region) → fine-tuned F5-TTS and StyleTTS2 producing
recognisably Northern-English output. Companion pieces:</p>

<ul>
  <li>The <a href="https://gist.github.com/netlinux-ai/372458bf616ab963b1ae556d1faf7d0c">listening-loop methodology</a> — how
human feedback actually steers training decisions</li>
  <li>The <a href="https://gist.github.com/netlinux-ai/7b88da46fd52153dd677cade2e6354f8">non-AVX2 CPU compatibility notes</a> —
what it took to make this run on a 2010-era CPU</li>
  <li>The <a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">minimal F5-TTS trainer</a> used
for the F5 side of the comparison</li>
</ul>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Running two TTS architectures on the same small fine-tune corpus surfaces a real trade-off: F5-TTS commits hard to accent character at the cost of phonetic stability; StyleTTS2 stays phonetically stable at the cost of accent commitment. Neither dominates. Each has its own late-epoch failure mode, just different ones.]]></summary></entry><entry><title type="html">A minimal F5-TTS fine-tune trainer (no datasets, no accelerate)</title><link href="https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer/" rel="alternate" type="text/html" title="A minimal F5-TTS fine-tune trainer (no datasets, no accelerate)" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer/"><![CDATA[<p>A ~250-line trainer for F5-TTS that bypasses the HuggingFace <code class="language-plaintext highlighter-rouge">datasets</code> and
<code class="language-plaintext highlighter-rouge">accelerate</code> dependency stack. Single-file, readable end-to-end.</p>

<p>Useful when:</p>

<ul>
  <li>the upstream <code class="language-plaintext highlighter-rouge">f5-tts_finetune-cli</code> won’t install/run because of <code class="language-plaintext highlighter-rouge">pyarrow</code>
/ <code class="language-plaintext highlighter-rouge">pandas</code> / <code class="language-plaintext highlighter-rouge">datasets</code> issues</li>
  <li>you want a single-file trainer you can read and modify</li>
  <li>you want to train using pre-computed mel-spectrograms loaded from disk
rather than recomputing per epoch</li>
</ul>

<p>The full code (trainer + mel pre-compute helper + README with the design
notes) lives at the gist:</p>

<p>→ <strong><a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f</a></strong></p>

<p>Key design notes worth highlighting:</p>

<ol>
  <li>
    <p><strong>Stub <code class="language-plaintext highlighter-rouge">datasets</code> in <code class="language-plaintext highlighter-rouge">sys.modules</code></strong> before any <code class="language-plaintext highlighter-rouge">f5_tts</code> imports — F5-TTS’
own <code class="language-plaintext highlighter-rouge">f5_tts.model.dataset</code> does <code class="language-plaintext highlighter-rouge">from datasets import Dataset</code> at module
load. Stubbing satisfies the import without pulling pyarrow.</p>
  </li>
  <li>
    <p><strong>Strip the <code class="language-plaintext highlighter-rouge">ema_model.</code> prefix</strong> from the published F5TTS_v1_Base
checkpoint. The published file contains <em>only</em> EMA shadow weights;
naive loaders that skip <code class="language-plaintext highlighter-rouge">ema_model.*</code> get a random-initialised model.
See the <a href="https://github.com/SWivid/F5-TTS/issues/1292">companion bug report</a>.</p>
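
    <p>A minimal sketch of the prefix strip (the path and the top-level key layout
here are placeholders; inspect your checkpoint’s actual structure first):</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

ckpt = torch.load("ckpts/F5TTS_v1_Base.pt", map_location="cpu", weights_only=False)
state = ckpt.get("ema_model_state_dict", ckpt)   # assumption: adjust to your file layout

# Keep only the EMA shadow weights, renamed so a plain load_state_dict() accepts them.
stripped = {
    k[len("ema_model."):]: v
    for k, v in state.items()
    if k.startswith("ema_model.")
}
# model.load_state_dict(stripped, strict=False)
</code></pre></div></div>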
  </li>
  <li>
    <p><strong>Don’t decay LR on short fine-tunes.</strong> The default warmup-then-linear-decay
schedule from F5-TTS pretraining will decay LR to ~zero over the run.
On short (&lt; 50 epoch) fine-tunes, late-epoch gradients contribute almost
nothing. Use constant LR after warmup.</p>
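
    <p>One way to get that behaviour (a sketch with a plain <code class="language-plaintext highlighter-rouge">LambdaLR</code>; the warmup
length and model here are stand-ins):</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 4)                              # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
warmup_steps = 100                                         # arbitrary; scale to your corpus

def warmup_then_constant(step):
    # Multiplier on the base LR: linear ramp over warmup, then hold at 1.0 forever.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_constant)
</code></pre></div></div>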
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">num_workers=0</code></strong> for the DataLoader. Subprocess workers re-import
torch and re-run dynamo init, which can SIGFPE on older CPUs. Keep
loading in the main process; with pre-computed mels, throughput is
GPU-bound anyway.</p>
  </li>
</ol>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[A ~250-line trainer for F5-TTS that bypasses the HuggingFace datasets and accelerate dependency stack. Single-file, readable end-to-end.]]></summary></entry><entry><title type="html">Running modern Python TTS toolchains on non-AVX2 CPUs</title><link href="https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/" rel="alternate" type="text/html" title="Running modern Python TTS toolchains on non-AVX2 CPUs" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/"><![CDATA[<p>Notes from getting <strong>F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp</strong> to work
on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).</p>

<p>The CPU has SSE/SSE2/SSE3/SSE4a, plus CX16/POPCNT/LAHF — but <strong>no SSE4.1, no
SSE4.2, no AVX, no AVX2, no FMA, no F16C</strong>. That puts it below the modern
<strong>x86-64-v2</strong> baseline. A growing share of binary Python wheels in the AI
ecosystem assume v2 or v3, so they SIGILL or SIGFPE at import. This is a
ground-truth list of what we hit and what worked.</p>

<h2 id="quick-triage">Quick triage</h2>

<p>If your CPU is below x86-64-v2 (in particular, missing <strong>SSE4.1</strong>), expect:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pyarrow</code> static-init <code class="language-plaintext highlighter-rouge">pinsrq</code> SIGILL on import</li>
  <li><code class="language-plaintext highlighter-rouge">numpy 2.x</code> wheel SIGILL on import (numpy 1.26.4 still has a fallback path)</li>
  <li><code class="language-plaintext highlighter-rouge">torch 2.10+</code> wheel SIGFPE in <code class="language-plaintext highlighter-rouge">torch._dynamo</code> on import</li>
  <li><code class="language-plaintext highlighter-rouge">pandas</code> modern wheels SIGILL on tokenisation</li>
  <li><code class="language-plaintext highlighter-rouge">monotonic_align</code> and other Cython extensions: build-from-source SIGILL</li>
  <li>DataLoader subprocess workers SIGFPE re-importing torch</li>
</ul>

<p>If your CPU is x86-64-v2 (Nehalem ~2008 or newer Intel; Bulldozer ~2011 or
newer AMD) but missing AVX/AVX2, you’ll still hit some of these but fewer.</p>
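
<p>A quick way to see where a machine sits (Linux-only sketch; it reads the flag
names exactly as the kernel reports them in <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Report which of the relevant ISA extensions this CPU advertises.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for want in ("sse4_1", "sse4_2", "popcnt", "avx", "avx2", "fma", "f16c"):
    print(f"{want:8s} {'yes' if want in flags else 'MISSING'}")
</code></pre></div></div>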

<h2 id="working-pin-set">Working pin-set</h2>

<p>These are versions empirically verified to import and run on this CPU:</p>

<table>
  <thead>
    <tr>
      <th>package</th>
      <th>version</th>
      <th>why</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>numpy</td>
      <td>1.26.4</td>
      <td>last with a non-AVX2 fallback path; from-source builds OK</td>
    </tr>
    <tr>
      <td>torch</td>
      <td>2.7.0</td>
      <td>last that works on this CPU once <code class="language-plaintext highlighter-rouge">_dynamo</code> is patched (patch 1 below)</td>
    </tr>
    <tr>
      <td>torchaudio</td>
      <td>2.7.0</td>
      <td>last with the soundfile backend (2.10+ requires torchcodec)</td>
    </tr>
    <tr>
      <td>transformers</td>
      <td>4.57.3</td>
      <td>5.x triggers <code class="language-plaintext highlighter-rouge">torch._dynamo</code> at import time via <code class="language-plaintext highlighter-rouge">torch.compiler.disable</code></td>
    </tr>
    <tr>
      <td>numba / scipy / librosa</td>
      <td>latest binary wheels</td>
      <td>OK</td>
    </tr>
    <tr>
      <td>pyarrow / pandas / datasets / torchcodec</td>
      <td><strong>uninstalled</strong></td>
      <td>wheels assume SSE4.1+; not actually needed for inference</td>
    </tr>
  </tbody>
</table>

<p>For a fresh install, layer the pins after the project install:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--prefer-binary</span> &lt;project&gt;           <span class="c"># whatever you actually want</span>
pip <span class="nb">install</span> <span class="nt">--prefer-binary</span> <span class="nt">--force-reinstall</span> <span class="nt">--no-deps</span> <span class="se">\</span>
    <span class="s2">"torch==2.7.0"</span> <span class="s2">"torchaudio==2.7.0"</span> <span class="se">\</span>
    <span class="s2">"transformers==4.57.3"</span> <span class="s2">"numpy&lt;2"</span>
pip uninstall <span class="nt">-y</span> datasets pyarrow pyarrow-hotfix pandas torchcodec
</code></pre></div></div>

<h2 id="patches-required">Patches required</h2>

<h3 id="patch-1-torch_dynamo-sigfpe-on-int-division-by-zero">Patch 1: <code class="language-plaintext highlighter-rouge">torch._dynamo</code> SIGFPE on int division by zero</h3>

<p>Even after pinning to torch 2.7.0, the very first dynamo init still SIGFPEs
on this CPU. Cause: <code class="language-plaintext highlighter-rouge">torch._dynamo.variables.torch_function.populate_builtin_to_tensor_fn_map()</code>
probes Python operators on dummy tensors, including <code class="language-plaintext highlighter-rouge">tensor // 0</code> (integer
floor-divide by zero). Newer Intel CPUs trap this into a Python
<code class="language-plaintext highlighter-rouge">ZeroDivisionError</code> via signal handler. AMD Phenom II just SIGFPEs.</p>

<p>The function’s output isn’t actually needed for inference. Stub it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">F</span><span class="o">=</span><span class="si">$(</span>python <span class="nt">-c</span> <span class="s2">"import torch._dynamo.variables.torch_function as m; print(m.__file__)"</span><span class="si">)</span>
<span class="nb">cp</span> <span class="nv">$F</span> <span class="nv">$F</span>.orig
<span class="nb">sed</span> <span class="nt">-i</span> <span class="s2">"0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II</span><span class="se">\n</span><span class="s2">    global BUILTIN_TO_TENSOR_FN_MAP/"</span> <span class="nv">$F</span>
</code></pre></div></div>

<p>This is non-invasive — only affects code that uses <code class="language-plaintext highlighter-rouge">torch.compile()</code> /
dynamo paths, which most fine-tuning trainers don’t.</p>

<h3 id="patch-2-gpu-only-mel-spectrogram-computation">Patch 2: GPU-only mel-spectrogram computation</h3>

<p><code class="language-plaintext highlighter-rouge">torch.matmul</code> on CPU SIGFPEs on this CPU. Anything that calls torchaudio’s
<code class="language-plaintext highlighter-rouge">MelSpectrogram</code> on CPU dies. For training pipelines that compute mels
in the data loader, this is fatal.</p>

<p>Two ways to fix:</p>

<p><strong>a)</strong> Move the mel module to GPU (cheap audio→mel transfer per sample):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">to_mel</span> <span class="o">=</span> <span class="n">torchaudio</span><span class="p">.</span><span class="n">transforms</span><span class="p">.</span><span class="n">MelSpectrogram</span><span class="p">(...).</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="n">wave</span><span class="p">):</span>
    <span class="n">wave</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">wave</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
    <span class="n">mel</span> <span class="o">=</span> <span class="n">to_mel</span><span class="p">(</span><span class="n">wave</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">mel</span><span class="p">.</span><span class="n">cpu</span><span class="p">()</span>  <span class="c1"># back to CPU for DataLoader collator
</span></code></pre></div></div>

<p><strong>b)</strong> Pre-compute all mels once on GPU, save to disk, load at training time
(<a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">example script</a>).</p>

<p>(b) is faster overall — no per-sample audio→GPU transfer, just <code class="language-plaintext highlighter-rouge">torch.load</code>.</p>
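
<p>A minimal sketch of (b), using torchaudio for both the load and the transform;
the mel parameters shown are placeholders, not the exact settings either model
expects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torchaudio

# Placeholder parameters -- match them to whatever the target trainer expects.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100
).to("cuda")

def precompute(wav_path, mel_path):
    wave, sr = torchaudio.load(wav_path)                    # file I/O stays on CPU
    wave = wave.to("cuda")                                  # all numeric work on GPU
    wave = torchaudio.functional.resample(wave, sr, 24000)
    mel = to_mel(wave)
    torch.save(mel.cpu(), mel_path)                         # training later just torch.load()s it
</code></pre></div></div>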

<h3 id="patch-3-num_workers0-everywhere">Patch 3: <code class="language-plaintext highlighter-rouge">num_workers=0</code> everywhere</h3>

<p>DataLoader spawns subprocess workers that re-import torch and re-run
<code class="language-plaintext highlighter-rouge">_dynamo</code> init. Even with patch 1, the patched source isn’t always picked up
in subprocess. Set <code class="language-plaintext highlighter-rouge">num_workers=0</code> to keep all loading in the main process.</p>

<h3 id="patch-4-weights_onlyfalse-for-older-checkpoint-formats">Patch 4: <code class="language-plaintext highlighter-rouge">weights_only=False</code> for older checkpoint formats</h3>

<p>PyTorch 2.6+ flipped the default. If you load checkpoints saved before 2.6
that contain pickled Python objects, you need <code class="language-plaintext highlighter-rouge">torch.load(path, weights_only=False)</code>.
Affected: many published TTS pretrained models (StyleTTS2’s ASR/JDC/PLBERT
modules, F5-TTS in some cases).</p>

<h3 id="patch-5-stub-datasets-for-transformers-lazy-loader">Patch 5: Stub <code class="language-plaintext highlighter-rouge">datasets</code> for transformers’ lazy loader</h3>

<p><code class="language-plaintext highlighter-rouge">transformers.utils.import_utils._is_package_available("datasets")</code> calls
<code class="language-plaintext highlighter-rouge">importlib.util.find_spec("datasets")</code>, which raises <code class="language-plaintext highlighter-rouge">ValueError</code> if
<code class="language-plaintext highlighter-rouge">__spec__</code> is <code class="language-plaintext highlighter-rouge">None</code>. If you provide a stub <code class="language-plaintext highlighter-rouge">datasets</code> module via
<code class="language-plaintext highlighter-rouge">sys.modules</code> (to avoid pulling pyarrow), it must have a real ModuleSpec:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">importlib.machinery</span><span class="p">,</span> <span class="n">types</span><span class="p">,</span> <span class="n">sys</span>
<span class="n">_stub</span> <span class="o">=</span> <span class="n">types</span><span class="p">.</span><span class="n">ModuleType</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">)</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">__spec__</span> <span class="o">=</span> <span class="n">importlib</span><span class="p">.</span><span class="n">machinery</span><span class="p">.</span><span class="n">ModuleSpec</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">,</span> <span class="n">loader</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">Dataset</span> <span class="o">=</span> <span class="nb">type</span><span class="p">(</span><span class="s">"Dataset"</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{})</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">load_from_disk</span> <span class="o">=</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">:</span> <span class="bp">None</span>
<span class="n">sys</span><span class="p">.</span><span class="n">modules</span><span class="p">[</span><span class="s">"datasets"</span><span class="p">]</span> <span class="o">=</span> <span class="n">_stub</span>
</code></pre></div></div>

<h3 id="patch-6---no-build-isolation-for-cython-extensions">Patch 6: <code class="language-plaintext highlighter-rouge">--no-build-isolation</code> for Cython extensions</h3>

<p><code class="language-plaintext highlighter-rouge">monotonic_align</code> (used by StyleTTS2) and similar packages build with their
own ephemeral build-env via pip’s build isolation. That ephemeral env
re-installs <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">cython</code> and may pull AVX2 wheels. Use:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--no-build-isolation</span> <span class="nt">--no-deps</span> &lt;package&gt;
</code></pre></div></div>

<p>This forces the build to use your already-installed (pinned) numpy+cython.</p>

<h2 id="per-project-status">Per-project status</h2>

<h3 id="f5-tts">F5-TTS</h3>

<ul>
  <li>Inference and training both work after patches 1–5.</li>
  <li>See companion gist for a minimal trainer that bypasses <code class="language-plaintext highlighter-rouge">datasets</code>/<code class="language-plaintext highlighter-rouge">accelerate</code>.</li>
  <li>Issue filed: SWivid/F5-TTS#1292 (EMA-only checkpoint structure).</li>
</ul>

<h3 id="styletts2">StyleTTS2</h3>

<ul>
  <li>Inference and fine-tune both work after patches 1, 2, 3, 4, 6.</li>
  <li>PRs filed: yl4579/StyleTTS2#361 (weights_only=False), #362 (drop pandas).</li>
</ul>

<h3 id="kokoro">kokoro</h3>

<ul>
  <li>Inference works (via the <code class="language-plaintext highlighter-rouge">kokoro-onnx</code> ONNX runtime path; PyTorch path
blocked by upstream dep pinning, not CPU).</li>
  <li>Issue filed: hexgrad/kokoro#321 (broken <code class="language-plaintext highlighter-rouge">misaki&gt;=0.7.16</code> PyPI pin).</li>
</ul>

<h3 id="whispercpp">whisper.cpp</h3>

<ul>
  <li>Works out of the box. Pure C++, no Python wheels involved. CUDA inference
on the GPU.</li>
</ul>

<h2 id="what-does-not-work">What does <em>not</em> work</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pyarrow</code> source build: succeeds eventually but the resulting library
still uses SSE4.1 in places (Apache Arrow’s CMake <code class="language-plaintext highlighter-rouge">ARROW_SIMD_LEVEL=NONE</code>
doesn’t cover everything). Not worth the multi-hour build.</li>
  <li><code class="language-plaintext highlighter-rouge">numpy 2.x</code>: even from-source build emits AVX-needing code via OpenBLAS
bundled wheels. Stick with 1.26.4.</li>
  <li>Anything using <code class="language-plaintext highlighter-rouge">bitsandbytes</code> int8/int4 quantisation: those kernels
hard-require AVX2.</li>
</ul>

<h2 id="worth-trying-if-you-have-avx-no-avx2">Worth trying if you have AVX (no AVX2)</h2>

<p>A 2011-era Sandy Bridge or later Intel CPU has AVX but no AVX2. Most of the
patches above still apply, but you may not need patch 1 (dynamo SIGFPE),
and pyarrow/datasets/pandas may install (just not the AVX2-specific code
paths). Try without the uninstalls first.</p>

<h2 id="summary">Summary</h2>

<p>If you want to do TTS fine-tuning on hardware below x86-64-v2:</p>

<ol>
  <li>Do inference work on the GPU. Keep CPU-side code to file I/O and JSON.</li>
  <li>Pin numpy 1.26 + torch 2.7 + transformers 4.57.</li>
  <li>Stub or uninstall <code class="language-plaintext highlighter-rouge">datasets</code>/<code class="language-plaintext highlighter-rouge">pyarrow</code>/<code class="language-plaintext highlighter-rouge">pandas</code>/<code class="language-plaintext highlighter-rouge">torchcodec</code>.</li>
  <li>Patch <code class="language-plaintext highlighter-rouge">torch._dynamo</code> once per torch install.</li>
  <li>Pre-compute mel-spectrograms offline.</li>
  <li>Train at <code class="language-plaintext highlighter-rouge">num_workers=0</code>.</li>
</ol>

<p>The rig produces useful output. It’s not a fast-iteration machine — every
upstream upgrade re-breaks something — but for fine-tuning (which doesn’t
need a fast-iteration machine) it’s economical: an RTX 3060 12 GB on a
2010-era CPU running real-world TTS workloads.</p>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Notes from getting F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp to work on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).]]></summary></entry><entry><title type="html">How human feedback actually steers TTS fine-tuning</title><link href="https://netlinux-ai.github.io/2026/05/09/tts-listening-loop/" rel="alternate" type="text/html" title="How human feedback actually steers TTS fine-tuning" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/tts-listening-loop</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/tts-listening-loop/"><![CDATA[<p>Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on
a small Northern English corpus. The headline finding is that the listening
test isn’t optional polish at the end — it’s the <strong>only</strong> measurement that
catches the failure modes that matter, and each round of listening produces
specific phonetic observations that map to specific engineering decisions.</p>

<p>This is a write-up of the methodology, with the concrete examples that
forced each decision.</p>

<h2 id="the-loop">The loop</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌────────────────────────┐
        │  render passage        │
        │  (baseline + ft)       │
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         a feature is "right" if a native
        │  human listens against │         speaker recognises it. Record both
        │  marker list (BATH,    │  ◀───── what's working AND what's broken;
        │  FOOT-STRUT, …)        │         both are signal.
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         translate audible features
        │  diagnose: why is the  │         to training-side cause:
        │  output the way it is? │  ◀───── · missing accent → under-trained
        └──────────┬─────────────┘         · right accent + glitches → over-trained
                   ▼                       · wrong accent → data or LR direction
        ┌────────────────────────┐         specific knobs:
        │  pick next training    │         · lr ↑/↓ (drift per step)
        │  move                  │  ◀───── · epochs ±N (cumulative drift)
        └──────────┬─────────────┘         · earlier ckpt (rewind)
                   ▼                       · data filter (cleaner signal)
              [iterate]
</code></pre></div></div>

<h2 id="the-verdict-to-action-mapping">The verdict-to-action mapping</h2>

<table>
  <thead>
    <tr>
      <th>Listening verdict</th>
      <th>What it implies physically</th>
      <th>Engineering response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“No discernible difference from baseline”</td>
      <td>Cumulative weight drift Σ lr·grad too small. Either lr too low, scheduler decayed it to ~0, or epochs too few.</td>
      <td>Increase lr or remove decay; add epochs.</td>
    </tr>
    <tr>
      <td>“Accent is right but specific words mangled / dropped / truncated”</td>
      <td>Late-epoch overfitting on training-corpus pace or timing. Crossed from “learning the distribution” to “memorising peculiarities of small corpus”.</td>
      <td>Step back: pick an earlier checkpoint, or continue at lower lr.</td>
    </tr>
    <tr>
      <td>“Accent is wrong direction (e.g. American instead of Northern)”</td>
      <td>Training data misattributed, or model pulled toward different distribution than expected.</td>
      <td>Audit data: manifest pointing at right speakers? Diarisation clean? Speaker IDs correct?</td>
    </tr>
    <tr>
      <td>“Specific phonetic feature still missing (e.g. monophthongisation absent on ‘sunshine’)”</td>
      <td>That pattern needs more training-distribution exposure. Some accent features are easier than others.</td>
      <td>Train more, keeping lr constant. Don’t increase lr to chase one feature — risk catastrophic forgetting.</td>
    </tr>
    <tr>
      <td>“Feature drifted past the target (e.g. ‘down’ → ‘doon’)”</td>
      <td>Over-fit on the broader cluster of related accents. Model has slid past the target sub-region.</td>
      <td>Step back to earlier checkpoint OR pick checkpoint <em>before</em> the drift.</td>
    </tr>
  </tbody>
</table>

<p>These categories aren’t theoretical. We hit each of them in real training
runs. Examples:</p>

<h3 id="no-discernible-difference--lr-scheduler-decayed-to-zero">“No discernible difference” → <strong>LR scheduler decayed to zero</strong></h3>

<p>Run 1 of F5-TTS used the trainer’s default schedule: linear warmup to peak
1e-5, then linear decay across the entire run to ~0. After 5 epochs:</p>

<ul>
  <li>Mean loss per epoch: 0.629, 0.677, 0.648, 0.642, 0.670 — <strong>flat</strong></li>
  <li>Listening: indistinguishable from baseline</li>
  <li>Numerical: waveform correlation with baseline = 0.017 (essentially uncorrelated audio, as expected for diffusion sampling) — looked like the model was <em>doing something</em>, but the perceptual output disagreed</li>
</ul>

<p>Diagnosis: the schedule shape was wrong. Step-by-step LR values:</p>
<ul>
  <li>step 1: lr = 1e-7 (warmup)</li>
  <li>step 100: lr = 1e-5 (peak, decay starts)</li>
  <li>step 1000: ≈ 5.5e-6</li>
  <li>step 2225: lr = 1e-13 — effectively zero</li>
</ul>

<p>Total weight drift is bounded by Σ lr·grad. With LR linearly decaying to ~0
over 2225 steps, late-epoch gradients are multiplied by near-zero values.
<strong>Most of run 1’s “5 epochs” was a no-op.</strong></p>
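
<p>The shape is easy to reproduce with a toy version of the schedule (linear
warmup then linear decay to zero; not the upstream scheduler, but close enough
to show where the quoted steps land):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def lr_at(step, peak=1e-5, warmup=100, total=2225):
    # Toy linear warmup-then-decay; the real scheduler differs in the fine detail.
    if step &lt;= warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

for s in (1, 100, 1000, 2000, 2225):
    print(s, lr_at(s))   # 1e-07, 1e-05, ~5.8e-06, ~1.1e-06, 0.0
</code></pre></div></div>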

<p>Run 2 fix: 5× higher peak LR (5e-5), constant after warmup, no decay. 10
epochs. Result: per-epoch mean loss decreased (0.701 → 0.683 → 0.661 →
0.646 across the first 4), and listening verdict was <em>audibly Northern</em> —
“London” rendered as “Lundun” (FOOT-STRUT vowel collapse, a textbook
Northern marker).</p>

<h3 id="specific-feature-missing--train-more-same-lr">“Specific feature missing” → <strong>Train more, same LR</strong></h3>

<p>After 4 epochs of run 2, FOOT-STRUT had emerged (“Lundun”) but
monophthongisation hadn’t (“sunshine” still diphthongised standard). Some
phonetic patterns are easier to acquire than others — single-vowel
substitutions vs global diphthong→monophthong shifts.</p>

<p>Continuing 6 more epochs at the same constant 5e-5: monophthongisation
strengthened (“laughing” landed correctly), but truncation appeared
(“sunshine” → “sunshinn”, dropped function words like “her”).</p>

<h3 id="accent-right-but-words-mangled--pick-earlier-checkpoint">“Accent right but words mangled” → <strong>Pick earlier checkpoint</strong></h3>

<p>Run 2 epoch 10 had the strongest accent but the most word-truncation.
Rendering epochs 6, 7, 8, 9 with the same input passage and listening
through revealed epoch 9 as the sweet spot — accent committed, mostly
without truncation. Final shipping checkpoint.</p>

<h3 id="drifted-past-the-target--styletts2s-late-epoch-failure-mode">“Drifted past the target” → <strong>StyleTTS2’s late-epoch failure mode</strong></h3>

<p>StyleTTS2 epoch 5 introduced “down” → “doon” (Geordie/Scots realisation).
That’s <em>more Northern</em> than the Bolton/Lancashire target. The model had
slid past the target sub-region of accent space and was now drifting toward
broader Scots/North-East phonetics. Stopped training; epoch 4 became the
shipping checkpoint.</p>

<h2 id="why-loss-alone-cant-replace-listening">Why loss alone can’t replace listening</h2>

<p>Three reasons:</p>

<ol>
  <li>
    <p><strong>Loss flatness is ambiguous.</strong> A flat loss curve could mean “converged”
or “not learning at all.” Run 1’s flat 0.65 was the latter; only listening
(“indistinguishable from baseline”) disambiguated and pointed at the LR
scheduler. No purely numerical metric on training loss could distinguish
those two cases without an evaluation set.</p>
  </li>
  <li>
    <p><strong>Some failures look like wins on the loss curve.</strong> Late-epoch
overfitting drops training loss while degrading output. <em>Lower</em> loss +
<em>worse</em> output. Only listening catches it.</p>
  </li>
  <li>
    <p><strong>The thing being optimised isn’t what you actually want.</strong> Flow-matching
loss measures velocity-field reconstruction quality on the training
distribution. It doesn’t directly measure “is this output Northern
English-sounding to a native speaker.” The model can get better at
fitting Sara’s training mels while producing audio that sounds different
from any actual Sara recording.</p>
  </li>
</ol>

<p>This is why every training run produces multiple per-epoch checkpoints and
we render the same passage through several of them. The cost (~30s per
render × 5–6 epochs = ~3 min) buys you a perceptual gradient across training
time that no scalar loss provides.</p>

<h2 id="the-phonetic-marker-passage-as-deliberate-probe">The phonetic-marker passage as deliberate probe</h2>

<p>The test passage is loaded with English-accent markers so a single rendering
surfaces multiple aspects of the model’s state. Our standard probe:</p>

<blockquote>
  <p>“It was a bright morning when the path through the grass led down to the
running water. She ran her hand along the back of the chair before
sitting down. The young children were laughing in the sunshine, dancing
in patterns through the warm afternoon. After tea, the family walked up
the hill to look at the view. One of them said, with a small smile: I
cannot believe how lovely it is, our little corner of the world.”</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>What we’re probing</th>
      <th>Words that probe it</th>
      <th>“Wrong” sounds like</th>
      <th>“Northern” sounds like</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BATH vowel</td>
      <td>path, grass, laughing, dancing, after, cannot</td>
      <td>/pɑːθ/ (RP “parth”)</td>
      <td>/pæθ/ (rhymes with “trap”)</td>
    </tr>
    <tr>
      <td>FOOT-STRUT</td>
      <td>running, sunshine, hand, hill, up, lovely, our</td>
      <td>distinct “put”/”putt”</td>
      <td>collapsed: both /ʊ/, so “London” → “Lundun”</td>
    </tr>
    <tr>
      <td>Diphthong→monophthong</td>
      <td>sunshine (→sunshaan), morning, smile</td>
      <td>standard /aɪ/, /eɪ/</td>
      <td>flat, longer single vowel /aː/</td>
    </tr>
    <tr>
      <td>happY-tense</td>
      <td>lovely, family, every, country</td>
      <td>tense /iː/</td>
      <td>laxer, more /ɪ/-like</td>
    </tr>
    <tr>
      <td>R-intrusion / linking</td>
      <td>chair before, our little</td>
      <td>none</td>
      <td>often realised in connected Northern speech</td>
    </tr>
  </tbody>
</table>

<p>If only some markers come through, that tells us which <em>kinds</em> of changes
the model is finding easier vs harder to learn. In run 2 epoch 4 the
FOOT-STRUT shift had emerged (“Lundun”) but monophthongisation had not
(“sunshine” still diphthongised). That gap motivated continuing training
rather than declaring done — specific phonetic gaps mapping to specific
training decisions.</p>

<h2 id="practical-recommendations">Practical recommendations</h2>

<ol>
  <li>
    <p><strong>Save a checkpoint per epoch.</strong> They’re cheap to disk and you’ll want
the perceptual gradient across training time. Late-epoch isn’t always
best.</p>
  </li>
  <li>
    <p><strong>Curate one phonetic-marker passage</strong> that targets the dialect features
you care about. Reuse the same passage every render so you build a
listening-memory of the model’s progression.</p>
  </li>
  <li>
    <p><strong>Render with the same reference clip every time.</strong> The only variable
should be the model weights. If you change the reference clip you’re
asking two different questions at once.</p>
  </li>
  <li>
    <p><strong>Native-speaker listeners are the most reliable test instrument.</strong>
Their judgement catches features that numerical metrics miss — and
importantly, also catches <em>over-fitting</em> failures that look fine
numerically.</p>
  </li>
  <li>
    <p><strong>Both wins and bugs are signal.</strong> Don’t just record what’s working;
record what’s broken. The combination of “what improved” and “what got
worse” defines the engineering response (continue / step back / change
data).</p>
  </li>
  <li>
    <p><strong>Run more checkpoints than you think you need.</strong> A/B-ing 6 different
epochs of the same run takes 3 minutes of compute. The information
gain — perceptual gradient over training time — is worth far more than
that.</p>
  </li>
</ol>

<h2 id="provenance">Provenance</h2>

<p>Worked example from a small TTS fine-tuning project: ~3 hours of single-speaker
British (Bolton-area) audio + whisper.cpp for transcripts → fine-tuned F5-TTS
and StyleTTS2 producing recognisably Northern-English output. Both
architectures hit different late-epoch failure modes that only the listening
loop caught. The companion piece <a href="#">F5 vs StyleTTS2 architecture trade-off</a>
documents what those failure modes implied about the architectures themselves.</p>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on a small Northern English corpus. The headline finding is that the listening test isn’t optional polish at the end — it’s the only measurement that catches the failure modes that matter, and each round of listening produces specific phonetic observations that map to specific engineering decisions.]]></summary></entry></feed>