<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://netlinux-ai.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://netlinux-ai.github.io/" rel="alternate" type="text/html" /><updated>2026-05-09T07:21:45+00:00</updated><id>https://netlinux-ai.github.io/feed.xml</id><title type="html">netlinux-ai</title><subtitle>TTS fine-tuning notes, write-ups, and patches</subtitle><author><name>Graham</name></author><entry><title type="html">F5-TTS vs StyleTTS2: a real Pareto trade-off in fine-tune behaviour</title><link href="https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff/" rel="alternate" type="text/html" title="F5-TTS vs StyleTTS2: a real Pareto trade-off in fine-tune behaviour" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/f5-vs-styletts2-tradeoff/"><![CDATA[<p>Running two TTS architectures on the same small fine-tune corpus surfaces
a real trade-off: <strong>F5-TTS commits hard to accent character at the cost of
phonetic stability; StyleTTS2 stays phonetically stable at the cost of
accent commitment</strong>. Neither dominates. Each has its own late-epoch failure
mode, just different ones.</p>

<p>This is a write-up of the comparison, with the concrete failure modes that
made the trade-off visible.</p>

<h2 id="set-up">Set-up</h2>

<ul>
  <li><strong>Corpus</strong>: ~163 minutes of clean, single-speaker recordings of British
(Bolton/Lancashire region) speech from three speakers — Sara Cox, Maxine Peake,
Diane Morgan</li>
  <li><strong>Models</strong>: F5-TTS v1 Base (336 M params) + StyleTTS2 LibriTTS pretrain
(~600 M params)</li>
  <li><strong>Method</strong>: identical data pipeline (Whisper transcription → segment →
pre-compute mels → JSONL manifest; a manifest sketch follows this list), then
fine-tune each model with its own trainer, checkpoint per epoch, render and
listen at multiple epochs</li>
  <li><strong>Listener</strong>: native Northern English speaker, evaluating each render
against a phonetic-marker passage (BATH vowel, FOOT-STRUT, diphthong
realisation, etc.)</li>
</ul>
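
<p>For concreteness, here is a minimal sketch of what one manifest line from that
pipeline might look like. The field names (<code class="language-plaintext highlighter-rouge">audio</code>, <code class="language-plaintext highlighter-rouge">mel</code>, <code class="language-plaintext highlighter-rouge">text</code>,
<code class="language-plaintext highlighter-rouge">duration</code>) are illustrative assumptions, not the exact schema either trainer
expects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# One JSONL record per segment: paths to the clip and its pre-computed mel,
# the Whisper transcript, and the clip duration in seconds.
record = {
    "audio": "segments/sara_0042.wav",
    "mel": "mels/sara_0042.pt",
    "text": "she ran her hand along the back of the chair",
    "duration": 4.7,
}

with open("manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
</code></pre></div></div>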

<h2 id="findings">Findings</h2>

<h3 id="f5-tts-strong-accent-occasional-phonetic-failures">F5-TTS: strong accent, occasional phonetic failures</h3>

<p>After 9 epochs at lr=5e-5 constant:</p>

<ul>
  <li><strong>Wins</strong>: distinctly Northern; “London” rendered as “Lundun” (FOOT-STRUT
vowel collapse, a textbook Northern marker); “laughing” with the short
/æ/ vowel (BATH-rejection)</li>
  <li><strong>Failures (late-epoch)</strong>:
    <ul>
      <li>“sunshine” → “sunshinn” — final phoneme truncation</li>
      <li>“morning” → “monning” — consonant deletion</li>
      <li>dropped function words: “she ran [her] hand” → “she ran hand”</li>
    </ul>
  </li>
  <li><strong>Numerical signal</strong>: training loss flat across epochs (per-batch noise
dominates), no late-epoch loss spike — failures invisible to the metric</li>
</ul>

<h3 id="styletts2-clean-phonetics-drifts-past-target-accent">StyleTTS2: clean phonetics, drifts past target accent</h3>

<p>After 4 epochs of fine-tune from the LibriTTS pretrain:</p>

<ul>
  <li><strong>Wins</strong>: moderate Northern intonation, no truncations or dropped words,
every phoneme rendered cleanly</li>
  <li><strong>Failures (late-epoch, epoch 5+)</strong>:
    <ul>
      <li>“down” → “doon” — over-fit toward Geordie/Scots realisation rather
than Bolton/Lancashire target</li>
      <li>subtle prosody drift toward broader Scottish patterns</li>
    </ul>
  </li>
  <li><strong>Why</strong>: the model has slid past the target sub-region of accent space
toward an adjacent (and over-represented in pre-training) region</li>
</ul>

<h3 id="side-by-side">Side-by-side</h3>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>F5 sweet spot (run 2 epoch 9)</th>
      <th>StyleTTS2 sweet spot (run 2 epoch 4)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Northern character</td>
      <td>Broad, strong</td>
      <td>Moderate, Bolton-stable</td>
    </tr>
    <tr>
      <td>Phonetic completeness</td>
      <td>Some truncations &amp; drops</td>
      <td>Clean — every phoneme rendered</td>
    </tr>
    <tr>
      <td>Late-epoch failure</td>
      <td>Truncates / drops / mangles word endings</td>
      <td>Drifts into adjacent accents (Geordie/Scots)</td>
    </tr>
    <tr>
      <td>Training loss as predictor</td>
      <td>Flat, no signal</td>
      <td>Flat, no signal</td>
    </tr>
    <tr>
      <td>Suitable when</td>
      <td>Accent strength matters more than precision</td>
      <td>Phonetic accuracy matters more than commitment</td>
    </tr>
  </tbody>
</table>

<h2 id="why-this-is-happening-the-loss-function-difference">Why this is happening: the loss-function difference</h2>

<p>The two architectures have fundamentally different training objectives.</p>

<h3 id="f5-tts-one-loss">F5-TTS: one loss</h3>

<p>Conditional flow matching velocity-field reconstruction. The model is free
to drift in timing, alignment, or word count as long as the spectral
signature of the output matches the target. Late-epoch overfitting can find
a slightly-lower-loss configuration that compresses pace or skips quiet
phonemes, because there is <strong>no explicit penalty for either</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loss = E_t,x0,x1 ‖v_θ(x_t, t, text, ref) − (x_1 − x_0)‖²
</code></pre></div></div>

<p>That’s it. Spectral pressure only.</p>
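
<p>A minimal PyTorch sketch of that objective, assuming a model that predicts a
velocity field from the noisy interpolant, the timestep, and conditioning (the
call signature is an assumption; this shows the shape of the loss, not F5-TTS’s
actual training code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def cfm_loss(model, x1, text_emb, ref_emb):
    """Conditional flow matching: regress the straight-line velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # one timestep per batch item
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1                     # linear interpolant
    v_pred = model(xt, t, text_emb, ref_emb)         # predicted velocity field
    return ((v_pred - (x1 - x0)) ** 2).mean()        # spectral pressure only; no timing term
</code></pre></div></div>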

<h3 id="styletts2-eight-active-losses-simultaneously">StyleTTS2: eight active losses simultaneously</h3>

<p>Each constraining a different aspect of the output:</p>

<table>
  <thead>
    <tr>
      <th>Loss</th>
      <th>What it constrains</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_mel</code></td>
      <td>Spectral reconstruction</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_dur + lambda_ce</code></td>
      <td><strong>Per-phoneme duration prediction</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_F0</code></td>
      <td>Pitch contour fidelity</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_norm</code></td>
      <td>Spectral envelope normalisation</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_mono</code></td>
      <td><strong>Monotonic text↔mel alignment (TMA)</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_s2s</code></td>
      <td>Sequence-to-sequence consistency</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_gen</code></td>
      <td>HiFi-GAN discriminator</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda_sty</code></td>
      <td>Style-vector reconstruction</td>
    </tr>
  </tbody>
</table>

<p>Truncating “sunshine” to “sunshinn” would simultaneously violate the
duration loss (wrong predicted phoneme count), the monotonic alignment
loss (final phonemes have no audio to align with), and the discriminator
(sounds artificial). Multiple constraints, multiple penalties.</p>
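
<p>As a rough sketch of how those terms combine (a weighted sum; the lambda names
follow the table above, and the upstream config’s exact defaults and term
definitions may differ):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each entry in `l` stands for a loss term already computed elsewhere in the
# training step; only the weighted combination is sketched here.
def total_loss(l, w):
    return (
        w["lambda_mel"] * l["mel"]
        + w["lambda_dur"] * l["dur"] + w["lambda_ce"] * l["ce"]
        + w["lambda_F0"] * l["f0"]
        + w["lambda_norm"] * l["norm"]
        + w["lambda_mono"] * l["mono"]
        + w["lambda_s2s"] * l["s2s"]
        + w["lambda_gen"] * l["gen"]
        + w["lambda_sty"] * l["sty"]
    )
</code></pre></div></div>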

<p>But — the very same constraints that prevent F5’s truncation failure mode
also prevent StyleTTS2 from committing as hard to a specific accent.
Strong accent features require larger acoustic shifts than the per-phoneme
duration / pitch / alignment constraints leave room for. So StyleTTS2’s
fine-tunes converge on softer, more conservative accent renditions that
preserve phonetic completeness.</p>

<h2 id="the-two-failure-modes-are-mirror-images-of-each-other">The two failure modes are mirror images of each other</h2>

<h3 id="f5-drifts-in-timing">F5: drifts in <em>timing</em></h3>

<p>The single spectral loss has no explicit timing constraint. So under
fine-tune pressure, the model finds locally cheaper solutions that mangle
pace: truncate, delete, drop. The accent character is preserved (the spectra
are still right) but words become incomplete.</p>

<h3 id="styletts2-drifts-in-accent-space">StyleTTS2: drifts in <em>accent space</em></h3>

<p>The multiple constraints prevent timing drift. But within the constraint
manifold, the model can still over-fit on whatever sub-distribution the
training data resembles most strongly in the broader pre-training
landscape. Bolton-area audio resembles broader Scottish phonetics along
several axes; the model slides toward that broader cluster because there’s
<em>more</em> of it in the LibriTTS pre-training data than there is of the
specific Bolton sub-region.</p>

<p>So:</p>

<ul>
  <li>F5 over-fits on <strong>training-corpus pace and timing</strong></li>
  <li>StyleTTS2 over-fits toward <strong>regions of accent space adjacent to, and
over-represented in, the pre-training distribution</strong></li>
</ul>

<p>Different cliffs. Same root cause: a small fine-tune corpus can’t fully
specify the target distribution to either architecture.</p>

<h2 id="what-this-means-for-choosing-an-architecture">What this means for choosing an architecture</h2>

<p>If you’re fine-tuning on a small (&lt; 10 hour) corpus:</p>

<ul>
  <li><strong>Pick F5-TTS if</strong>: accent commitment / strong character matters more
than phonetic precision. Use at moderate epoch counts (6–9) and pick the
earliest checkpoint where the accent is committed but the truncations
haven’t started.</li>
  <li><strong>Pick StyleTTS2 if</strong>: phonetic precision and clean delivery matter more
than strong accent commitment. Use at moderate epoch counts (3–4 from a
good pre-train) and stop <em>before</em> the accent drift kicks in.</li>
</ul>

<p>Neither dominates. The corpus size and use case determine the choice.</p>

<p>If you have a large (&gt; 100 hour) corpus that fully specifies the target
distribution, the trade-off probably collapses: both architectures have
enough signal to lock onto the target sub-region without needing to
extrapolate. We didn’t have that in this work.</p>

<h2 id="practical-implication-select-by-listening-ship-the-right-artefact">Practical implication: select-by-listening, ship the right artefact</h2>

<p>Both architectures produce useful results. Both have late-epoch failure
modes. Both require listening evaluation to pick the right checkpoint —
training loss tells you nothing useful for final model selection.</p>

<p>In the project this comes from, we shipped <strong>both</strong> as separate scripts:</p>
<ul>
  <li>a “default” script using StyleTTS2 (for everyday clean output)</li>
  <li>a “stronger accent” script using F5-TTS (when accent strength matters)</li>
</ul>

<p>Users pick the script for the use case. Same fine-tune data, two
production artefacts at different points on the Pareto front.</p>

<h2 id="provenance">Provenance</h2>

<p>Worked from a small TTS fine-tuning project: ~3 hours of clean, single-speaker
British recordings from 3 speakers (Sara Cox, Maxine Peake, Diane Morgan, all
Bolton/Lancashire-region) → fine-tuned F5-TTS and StyleTTS2 producing
recognisably Northern-English output. Companion pieces:</p>

<ul>
  <li>The <a href="https://gist.github.com/netlinux-ai/372458bf616ab963b1ae556d1faf7d0c">listening-loop methodology</a> — how
human feedback actually steers training decisions</li>
  <li>The <a href="https://gist.github.com/netlinux-ai/7b88da46fd52153dd677cade2e6354f8">non-AVX2 CPU compatibility notes</a> —
what it took to make this run on a 2010-era CPU</li>
  <li>The <a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">minimal F5-TTS trainer</a> used
for the F5 side of the comparison</li>
</ul>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Running two TTS architectures on the same small fine-tune corpus surfaces a real trade-off: F5-TTS commits hard to accent character at the cost of phonetic stability; StyleTTS2 stays phonetically stable at the cost of accent commitment. Neither dominates. Each has its own late-epoch failure mode, just different ones.]]></summary></entry><entry><title type="html">A minimal F5-TTS fine-tune trainer (no datasets, no accelerate)</title><link href="https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer/" rel="alternate" type="text/html" title="A minimal F5-TTS fine-tune trainer (no datasets, no accelerate)" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/minimal-f5tts-trainer/"><![CDATA[<p>A ~250-line trainer for F5-TTS that bypasses the HuggingFace <code class="language-plaintext highlighter-rouge">datasets</code> and
<code class="language-plaintext highlighter-rouge">accelerate</code> dependency stack. Single-file, readable end-to-end.</p>

<p>Useful when:</p>

<ul>
  <li>the upstream <code class="language-plaintext highlighter-rouge">f5-tts_finetune-cli</code> won’t install/run because of <code class="language-plaintext highlighter-rouge">pyarrow</code>
/ <code class="language-plaintext highlighter-rouge">pandas</code> / <code class="language-plaintext highlighter-rouge">datasets</code> issues</li>
  <li>you want a single-file trainer you can read and modify</li>
  <li>you want to train using pre-computed mel-spectrograms loaded from disk
rather than recomputing per epoch</li>
</ul>

<p>The full code (trainer + mel pre-compute helper + README with the design
notes) lives at the gist:</p>

<p>→ <strong><a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f</a></strong></p>

<p>Key design notes worth highlighting:</p>

<ol>
  <li>
    <p><strong>Stub <code class="language-plaintext highlighter-rouge">datasets</code> in <code class="language-plaintext highlighter-rouge">sys.modules</code></strong> before any <code class="language-plaintext highlighter-rouge">f5_tts</code> imports — F5-TTS’
own <code class="language-plaintext highlighter-rouge">f5_tts.model.dataset</code> does <code class="language-plaintext highlighter-rouge">from datasets import Dataset</code> at module
load. Stubbing satisfies the import without pulling pyarrow.</p>
  </li>
  <li>
    <p><strong>Strip the <code class="language-plaintext highlighter-rouge">ema_model.</code> prefix</strong> from the published F5TTS_v1_Base
checkpoint. The published file contains <em>only</em> EMA shadow weights;
naive loaders that skip <code class="language-plaintext highlighter-rouge">ema_model.*</code> get a random-initialised model.
See the <a href="https://github.com/SWivid/F5-TTS/issues/1292">companion bug report</a>.</p>
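
    <p>A minimal sketch of the prefix strip (the path and the top-level key layout
here are placeholders; inspect your checkpoint’s actual structure first):</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

ckpt = torch.load("ckpts/F5TTS_v1_Base.pt", map_location="cpu", weights_only=False)
state = ckpt.get("ema_model_state_dict", ckpt)   # assumption: adjust to your file layout

# Keep only the EMA shadow weights, renamed so a plain load_state_dict() accepts them.
stripped = {
    k[len("ema_model."):]: v
    for k, v in state.items()
    if k.startswith("ema_model.")
}
# model.load_state_dict(stripped, strict=False)
</code></pre></div></div>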
  </li>
  <li>
    <p><strong>Don’t decay LR on short fine-tunes.</strong> The default warmup-then-linear-decay
schedule from F5-TTS pretraining will decay LR to ~zero over the run.
On short (&lt; 50 epoch) fine-tunes, late-epoch gradients contribute almost
nothing. Use constant LR after warmup.</p>
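
    <p>One way to get that behaviour (a sketch with a plain <code class="language-plaintext highlighter-rouge">LambdaLR</code>; the warmup
length and model here are stand-ins):</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 4)                              # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
warmup_steps = 100                                         # arbitrary; scale to your corpus

def warmup_then_constant(step):
    # Multiplier on the base LR: linear ramp over warmup, then hold at 1.0 forever.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_constant)
</code></pre></div></div>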
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">num_workers=0</code></strong> for the DataLoader. Subprocess workers re-import
torch and re-run dynamo init, which can SIGFPE on older CPUs. Keep
loading in the main process; with pre-computed mels, throughput is
GPU-bound anyway.</p>
  </li>
</ol>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[A ~250-line trainer for F5-TTS that bypasses the HuggingFace datasets and accelerate dependency stack. Single-file, readable end-to-end.]]></summary></entry><entry><title type="html">Running modern Python TTS toolchains on non-AVX2 CPUs</title><link href="https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/" rel="alternate" type="text/html" title="Running modern Python TTS toolchains on non-AVX2 CPUs" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/"><![CDATA[<p>Notes from getting <strong>F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp</strong> to work
on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).</p>

<p>The CPU has SSE/SSE2/SSE3/SSE4a, plus CX16/POPCNT/LAHF — but <strong>no SSE4.1, no
SSE4.2, no AVX, no AVX2, no FMA, no F16C</strong>. That puts it below the modern
<strong>x86-64-v2</strong> baseline. A growing share of binary Python wheels in the AI
ecosystem assume v2 or v3, so they SIGILL or SIGFPE at import. This is a
ground-truth list of what we hit and what worked.</p>

<h2 id="quick-triage">Quick triage</h2>

<p>If your CPU is below x86-64-v2 (in particular, missing <strong>SSE4.1</strong>), expect:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pyarrow</code> static-init <code class="language-plaintext highlighter-rouge">pinsrq</code> SIGILL on import</li>
  <li><code class="language-plaintext highlighter-rouge">numpy 2.x</code> wheel SIGILL on import (numpy 1.26.4 still has a fallback path)</li>
  <li><code class="language-plaintext highlighter-rouge">torch 2.10+</code> wheel SIGFPE in <code class="language-plaintext highlighter-rouge">torch._dynamo</code> on import</li>
  <li><code class="language-plaintext highlighter-rouge">pandas</code> modern wheels SIGILL on tokenisation</li>
  <li><code class="language-plaintext highlighter-rouge">monotonic_align</code> and other Cython extensions: build-from-source SIGILL</li>
  <li>DataLoader subprocess workers SIGFPE re-importing torch</li>
</ul>

<p>If your CPU is x86-64-v2 (Nehalem ~2008 or newer Intel; Bulldozer ~2011 or
newer AMD) but missing AVX/AVX2, you’ll still hit some of these but fewer.</p>
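
<p>A quick way to see where a machine sits (Linux-only sketch; it reads the flag
names exactly as the kernel reports them in <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Report which of the relevant ISA extensions this CPU advertises.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for want in ("sse4_1", "sse4_2", "popcnt", "avx", "avx2", "fma", "f16c"):
    print(f"{want:8s} {'yes' if want in flags else 'MISSING'}")
</code></pre></div></div>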

<h2 id="working-pin-set">Working pin-set</h2>

<p>These are versions empirically verified to import and run on this CPU:</p>

<table>
  <thead>
    <tr>
      <th>package</th>
      <th>version</th>
      <th>why</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>numpy</td>
      <td>1.26.4</td>
      <td>last with a non-AVX2 fallback path; from-source builds OK</td>
    </tr>
    <tr>
      <td>torch</td>
      <td>2.7.0</td>
      <td>last that works on this CPU once <code class="language-plaintext highlighter-rouge">_dynamo</code> is patched (patch 1 below)</td>
    </tr>
    <tr>
      <td>torchaudio</td>
      <td>2.7.0</td>
      <td>last with the soundfile backend (2.10+ requires torchcodec)</td>
    </tr>
    <tr>
      <td>transformers</td>
      <td>4.57.3</td>
      <td>5.x triggers <code class="language-plaintext highlighter-rouge">torch._dynamo</code> at import time via <code class="language-plaintext highlighter-rouge">torch.compiler.disable</code></td>
    </tr>
    <tr>
      <td>numba / scipy / librosa</td>
      <td>latest binary wheels</td>
      <td>OK</td>
    </tr>
    <tr>
      <td>pyarrow / pandas / datasets / torchcodec</td>
      <td><strong>uninstalled</strong></td>
      <td>wheels assume SSE4.1+; not actually needed for inference</td>
    </tr>
  </tbody>
</table>

<p>For a fresh install, layer the pins after the project install:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--prefer-binary</span> &lt;project&gt;           <span class="c"># whatever you actually want</span>
pip <span class="nb">install</span> <span class="nt">--prefer-binary</span> <span class="nt">--force-reinstall</span> <span class="nt">--no-deps</span> <span class="se">\</span>
    <span class="s2">"torch==2.7.0"</span> <span class="s2">"torchaudio==2.7.0"</span> <span class="se">\</span>
    <span class="s2">"transformers==4.57.3"</span> <span class="s2">"numpy&lt;2"</span>
pip uninstall <span class="nt">-y</span> datasets pyarrow pyarrow-hotfix pandas torchcodec
</code></pre></div></div>

<h2 id="patches-required">Patches required</h2>

<h3 id="patch-1-torch_dynamo-sigfpe-on-int-division-by-zero">Patch 1: <code class="language-plaintext highlighter-rouge">torch._dynamo</code> SIGFPE on int division by zero</h3>

<p>Even after pinning to torch 2.7.0, the very first dynamo init still SIGFPEs
on this CPU. Cause: <code class="language-plaintext highlighter-rouge">torch._dynamo.variables.torch_function.populate_builtin_to_tensor_fn_map()</code>
probes Python operators on dummy tensors, including <code class="language-plaintext highlighter-rouge">tensor // 0</code> (integer
floor-divide by zero). Newer Intel CPUs trap this into a Python
<code class="language-plaintext highlighter-rouge">ZeroDivisionError</code> via signal handler. AMD Phenom II just SIGFPEs.</p>

<p>The function’s output isn’t actually needed for inference. Stub it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">F</span><span class="o">=</span><span class="si">$(</span>python <span class="nt">-c</span> <span class="s2">"import torch._dynamo.variables.torch_function as m; print(m.__file__)"</span><span class="si">)</span>
<span class="nb">cp</span> <span class="nv">$F</span> <span class="nv">$F</span>.orig
<span class="nb">sed</span> <span class="nt">-i</span> <span class="s2">"0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II</span><span class="se">\n</span><span class="s2">    global BUILTIN_TO_TENSOR_FN_MAP/"</span> <span class="nv">$F</span>
</code></pre></div></div>

<p>This is non-invasive — only affects code that uses <code class="language-plaintext highlighter-rouge">torch.compile()</code> /
dynamo paths, which most fine-tuning trainers don’t.</p>

<h3 id="patch-2-gpu-only-mel-spectrogram-computation">Patch 2: GPU-only mel-spectrogram computation</h3>

<p><code class="language-plaintext highlighter-rouge">torch.matmul</code> on CPU SIGFPEs on this CPU. Anything that calls torchaudio’s
<code class="language-plaintext highlighter-rouge">MelSpectrogram</code> on CPU dies. For training pipelines that compute mels
in the data loader, this is fatal.</p>

<p>Two ways to fix:</p>

<p><strong>a)</strong> Move the mel module to GPU (cheap audio→mel transfer per sample):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">to_mel</span> <span class="o">=</span> <span class="n">torchaudio</span><span class="p">.</span><span class="n">transforms</span><span class="p">.</span><span class="n">MelSpectrogram</span><span class="p">(...).</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="n">wave</span><span class="p">):</span>
    <span class="n">wave</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">wave</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
    <span class="n">mel</span> <span class="o">=</span> <span class="n">to_mel</span><span class="p">(</span><span class="n">wave</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">mel</span><span class="p">.</span><span class="n">cpu</span><span class="p">()</span>  <span class="c1"># back to CPU for DataLoader collator
</span></code></pre></div></div>

<p><strong>b)</strong> Pre-compute all mels once on GPU, save to disk, load at training time
(<a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f">example script</a>).</p>

<p>(b) is faster overall — no per-sample audio→GPU transfer, just <code class="language-plaintext highlighter-rouge">torch.load</code>.</p>
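
<p>A minimal sketch of (b), using torchaudio for both the load and the transform;
the mel parameters shown are placeholders, not the exact settings either model
expects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torchaudio

# Placeholder parameters -- match them to whatever the target trainer expects.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100
).to("cuda")

def precompute(wav_path, mel_path):
    wave, sr = torchaudio.load(wav_path)                    # file I/O stays on CPU
    wave = wave.to("cuda")                                  # all numeric work on GPU
    wave = torchaudio.functional.resample(wave, sr, 24000)
    mel = to_mel(wave)
    torch.save(mel.cpu(), mel_path)                         # training later just torch.load()s it
</code></pre></div></div>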

<h3 id="patch-3-num_workers0-everywhere">Patch 3: <code class="language-plaintext highlighter-rouge">num_workers=0</code> everywhere</h3>

<p>DataLoader spawns subprocess workers that re-import torch and re-run
<code class="language-plaintext highlighter-rouge">_dynamo</code> init. Even with patch 1, the patched source isn’t always picked up
in subprocess. Set <code class="language-plaintext highlighter-rouge">num_workers=0</code> to keep all loading in the main process.</p>

<h3 id="patch-4-weights_onlyfalse-for-older-checkpoint-formats">Patch 4: <code class="language-plaintext highlighter-rouge">weights_only=False</code> for older checkpoint formats</h3>

<p>PyTorch 2.6+ flipped the default. If you load checkpoints saved before 2.6
that contain pickled Python objects, you need <code class="language-plaintext highlighter-rouge">torch.load(path, weights_only=False)</code>.
Affected: many published TTS pretrained models (StyleTTS2’s ASR/JDC/PLBERT
modules, F5-TTS in some cases).</p>

<h3 id="patch-5-stub-datasets-for-transformers-lazy-loader">Patch 5: Stub <code class="language-plaintext highlighter-rouge">datasets</code> for transformers’ lazy loader</h3>

<p><code class="language-plaintext highlighter-rouge">transformers.utils.import_utils._is_package_available("datasets")</code> calls
<code class="language-plaintext highlighter-rouge">importlib.util.find_spec("datasets")</code>, which raises <code class="language-plaintext highlighter-rouge">ValueError</code> if
<code class="language-plaintext highlighter-rouge">__spec__</code> is <code class="language-plaintext highlighter-rouge">None</code>. If you provide a stub <code class="language-plaintext highlighter-rouge">datasets</code> module via
<code class="language-plaintext highlighter-rouge">sys.modules</code> (to avoid pulling pyarrow), it must have a real ModuleSpec:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">importlib.machinery</span><span class="p">,</span> <span class="n">types</span><span class="p">,</span> <span class="n">sys</span>
<span class="n">_stub</span> <span class="o">=</span> <span class="n">types</span><span class="p">.</span><span class="n">ModuleType</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">)</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">__spec__</span> <span class="o">=</span> <span class="n">importlib</span><span class="p">.</span><span class="n">machinery</span><span class="p">.</span><span class="n">ModuleSpec</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">,</span> <span class="n">loader</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">Dataset</span> <span class="o">=</span> <span class="nb">type</span><span class="p">(</span><span class="s">"Dataset"</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{})</span>
<span class="n">_stub</span><span class="p">.</span><span class="n">load_from_disk</span> <span class="o">=</span> <span class="k">lambda</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">:</span> <span class="bp">None</span>
<span class="n">sys</span><span class="p">.</span><span class="n">modules</span><span class="p">[</span><span class="s">"datasets"</span><span class="p">]</span> <span class="o">=</span> <span class="n">_stub</span>
</code></pre></div></div>

<h3 id="patch-6---no-build-isolation-for-cython-extensions">Patch 6: <code class="language-plaintext highlighter-rouge">--no-build-isolation</code> for Cython extensions</h3>

<p><code class="language-plaintext highlighter-rouge">monotonic_align</code> (used by StyleTTS2) and similar packages build with their
own ephemeral build-env via pip’s build isolation. That ephemeral env
re-installs <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">cython</code> and may pull AVX2 wheels. Use:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--no-build-isolation</span> <span class="nt">--no-deps</span> &lt;package&gt;
</code></pre></div></div>

<p>This forces the build to use your already-installed (pinned) numpy+cython.</p>

<h2 id="per-project-status">Per-project status</h2>

<h3 id="f5-tts">F5-TTS</h3>

<ul>
  <li>Inference and training both work after patches 1–5.</li>
  <li>See companion gist for a minimal trainer that bypasses <code class="language-plaintext highlighter-rouge">datasets</code>/<code class="language-plaintext highlighter-rouge">accelerate</code>.</li>
  <li>Issue filed: SWivid/F5-TTS#1292 (EMA-only checkpoint structure).</li>
</ul>

<h3 id="styletts2">StyleTTS2</h3>

<ul>
  <li>Inference and fine-tune both work after patches 1, 2, 3, 4, 6.</li>
  <li>PRs filed: yl4579/StyleTTS2#361 (weights_only=False), #362 (drop pandas).</li>
</ul>

<h3 id="kokoro">kokoro</h3>

<ul>
  <li>Inference works (via the <code class="language-plaintext highlighter-rouge">kokoro-onnx</code> ONNX runtime path; PyTorch path
blocked by upstream dep pinning, not CPU).</li>
  <li>Issue filed: hexgrad/kokoro#321 (broken <code class="language-plaintext highlighter-rouge">misaki&gt;=0.7.16</code> PyPI pin).</li>
</ul>

<h3 id="whispercpp">whisper.cpp</h3>

<ul>
  <li>Works out of the box. Pure C++, no Python wheels involved. CUDA inference
on the GPU.</li>
</ul>

<h2 id="what-does-not-work">What does <em>not</em> work</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pyarrow</code> source build: succeeds eventually but the resulting library
still uses SSE4.1 in places (Apache Arrow’s CMake <code class="language-plaintext highlighter-rouge">ARROW_SIMD_LEVEL=NONE</code>
doesn’t cover everything). Not worth the multi-hour build.</li>
  <li><code class="language-plaintext highlighter-rouge">numpy 2.x</code>: even from-source build emits AVX-needing code via OpenBLAS
bundled wheels. Stick with 1.26.4.</li>
  <li>Anything using <code class="language-plaintext highlighter-rouge">bitsandbytes</code> int8/int4 quantisation: those kernels
hard-require AVX2.</li>
</ul>

<h2 id="worth-trying-if-you-have-avx-no-avx2">Worth trying if you have AVX (no AVX2)</h2>

<p>A 2011-era Sandy Bridge or later Intel CPU has AVX but no AVX2. Most of the
patches above still apply, but you may not need patch 1 (dynamo SIGFPE),
and pyarrow/datasets/pandas may install (just not the AVX2-specific code
paths). Try without the uninstalls first.</p>

<h2 id="summary">Summary</h2>

<p>If you want to do TTS fine-tuning on hardware below x86-64-v2:</p>

<ol>
  <li>Do inference work on the GPU. Keep CPU-side code to file I/O and JSON.</li>
  <li>Pin numpy 1.26 + torch 2.7 + transformers 4.57.</li>
  <li>Stub or uninstall <code class="language-plaintext highlighter-rouge">datasets</code>/<code class="language-plaintext highlighter-rouge">pyarrow</code>/<code class="language-plaintext highlighter-rouge">pandas</code>/<code class="language-plaintext highlighter-rouge">torchcodec</code>.</li>
  <li>Patch <code class="language-plaintext highlighter-rouge">torch._dynamo</code> once per torch install.</li>
  <li>Pre-compute mel-spectrograms offline.</li>
  <li>Train at <code class="language-plaintext highlighter-rouge">num_workers=0</code>.</li>
</ol>

<p>The rig produces useful output. It’s not a fast-iteration machine — every
upstream upgrade re-breaks something — but for fine-tuning (which doesn’t
need a fast-iteration machine) it’s economical: an RTX 3060 12 GB on a
2010-era CPU running real-world TTS workloads.</p>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Notes from getting F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp to work on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).]]></summary></entry><entry><title type="html">How human feedback actually steers TTS fine-tuning</title><link href="https://netlinux-ai.github.io/2026/05/09/tts-listening-loop/" rel="alternate" type="text/html" title="How human feedback actually steers TTS fine-tuning" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://netlinux-ai.github.io/2026/05/09/tts-listening-loop</id><content type="html" xml:base="https://netlinux-ai.github.io/2026/05/09/tts-listening-loop/"><![CDATA[<p>Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on
a small Northern English corpus. The headline finding is that the listening
test isn’t optional polish at the end — it’s the <strong>only</strong> measurement that
catches the failure modes that matter, and each round of listening produces
specific phonetic observations that map to specific engineering decisions.</p>

<p>This is a write-up of the methodology, with the concrete examples that
forced each decision.</p>

<h2 id="the-loop">The loop</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌────────────────────────┐
        │  render passage        │
        │  (baseline + ft)       │
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         a feature is "right" if a native
        │  human listens against │         speaker recognises it. Record both
        │  marker list (BATH,    │  ◀───── what's working AND what's broken;
        │  FOOT-STRUT, …)        │         both are signal.
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         translate audible features
        │  diagnose: why is the  │         to training-side cause:
        │  output the way it is? │  ◀───── · missing accent → under-trained
        └──────────┬─────────────┘         · right accent + glitches → over-trained
                   ▼                       · wrong accent → data or LR direction
        ┌────────────────────────┐         specific knobs:
        │  pick next training    │         · lr ↑/↓ (drift per step)
        │  move                  │  ◀───── · epochs ±N (cumulative drift)
        └──────────┬─────────────┘         · earlier ckpt (rewind)
                   ▼                       · data filter (cleaner signal)
              [iterate]
</code></pre></div></div>

<h2 id="the-verdict-to-action-mapping">The verdict-to-action mapping</h2>

<table>
  <thead>
    <tr>
      <th>Listening verdict</th>
      <th>What it implies physically</th>
      <th>Engineering response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“No discernible difference from baseline”</td>
      <td>Cumulative weight drift Σ lr·grad too small. Either lr too low, scheduler decayed it to ~0, or epochs too few.</td>
      <td>Increase lr or remove decay; add epochs.</td>
    </tr>
    <tr>
      <td>“Accent is right but specific words mangled / dropped / truncated”</td>
      <td>Late-epoch overfitting on training-corpus pace or timing. Crossed from “learning the distribution” to “memorising peculiarities of small corpus”.</td>
      <td>Step back: pick an earlier checkpoint, or continue at lower lr.</td>
    </tr>
    <tr>
      <td>“Accent is wrong direction (e.g. American instead of Northern)”</td>
      <td>Training data misattributed, or model pulled toward different distribution than expected.</td>
      <td>Audit data: manifest pointing at right speakers? Diarisation clean? Speaker IDs correct?</td>
    </tr>
    <tr>
      <td>“Specific phonetic feature still missing (e.g. monophthongisation absent on ‘sunshine’)”</td>
      <td>That pattern needs more training-distribution exposure. Some accent features are easier than others.</td>
      <td>Train more, keeping lr constant. Don’t increase lr to chase one feature — risk catastrophic forgetting.</td>
    </tr>
    <tr>
      <td>“Feature drifted past the target (e.g. ‘down’ → ‘doon’)”</td>
      <td>Over-fit on the broader cluster of related accents. Model has slid past the target sub-region.</td>
      <td>Step back to earlier checkpoint OR pick checkpoint <em>before</em> the drift.</td>
    </tr>
  </tbody>
</table>

<p>These categories aren’t theoretical. We hit each of them in real training
runs. Examples:</p>

<h3 id="no-discernible-difference--lr-scheduler-decayed-to-zero">“No discernible difference” → <strong>LR scheduler decayed to zero</strong></h3>

<p>Run 1 of F5-TTS used the trainer’s default schedule: linear warmup to peak
1e-5, then linear decay across the entire run to ~0. After 5 epochs:</p>

<ul>
  <li>Mean loss per epoch: 0.629, 0.677, 0.648, 0.642, 0.670 — <strong>flat</strong></li>
  <li>Listening: indistinguishable from baseline</li>
  <li>Numerical: waveform correlation with baseline = 0.017 (essentially uncorrelated audio, as expected for diffusion sampling) — looked like the model was <em>doing something</em>, but the perceptual output disagreed</li>
</ul>

<p>Diagnosis: the schedule shape was wrong. Step-by-step LR values:</p>
<ul>
  <li>step 1: lr = 1e-7 (warmup)</li>
  <li>step 100: lr = 1e-5 (peak, decay starts)</li>
  <li>step 1000: ≈ 5.5e-6</li>
  <li>step 2225: lr = 1e-13 — effectively zero</li>
</ul>

<p>Total weight drift is bounded by Σ lr·grad. With LR linearly decaying to ~0
over 2225 steps, late-epoch gradients are multiplied by near-zero values.
<strong>Most of run 1’s “5 epochs” was a no-op.</strong></p>
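
<p>The shape is easy to reproduce with a toy version of the schedule (linear
warmup then linear decay to zero; not the upstream scheduler, but close enough
to show where the quoted steps land):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def lr_at(step, peak=1e-5, warmup=100, total=2225):
    # Toy linear warmup-then-decay; the real scheduler differs in the fine detail.
    if step &lt;= warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

for s in (1, 100, 1000, 2000, 2225):
    print(s, lr_at(s))   # 1e-07, 1e-05, ~5.8e-06, ~1.1e-06, 0.0
</code></pre></div></div>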

<p>Run 2 fix: 5× higher peak LR (5e-5), constant after warmup, no decay. 10
epochs. Result: per-epoch mean loss decreased (0.701 → 0.683 → 0.661 →
0.646 across the first 4), and listening verdict was <em>audibly Northern</em> —
“London” rendered as “Lundun” (FOOT-STRUT vowel collapse, a textbook
Northern marker).</p>

<h3 id="specific-feature-missing--train-more-same-lr">“Specific feature missing” → <strong>Train more, same LR</strong></h3>

<p>After 4 epochs of run 2, FOOT-STRUT had emerged (“Lundun”) but
monophthongisation hadn’t (“sunshine” still diphthongised standard). Some
phonetic patterns are easier to acquire than others — single-vowel
substitutions vs global diphthong→monophthong shifts.</p>

<p>Continuing 6 more epochs at the same constant 5e-5: monophthongisation
strengthened (“laughing” landed correctly), but truncation appeared
(“sunshine” → “sunshinn”, dropped function words like “her”).</p>

<h3 id="accent-right-but-words-mangled--pick-earlier-checkpoint">“Accent right but words mangled” → <strong>Pick earlier checkpoint</strong></h3>

<p>Run 2 epoch 10 had the strongest accent but the most word-truncation.
Rendering epochs 6, 7, 8, 9 with the same input passage and listening
through revealed epoch 9 as the sweet spot — accent committed, mostly
without truncation. Final shipping checkpoint.</p>

<h3 id="drifted-past-the-target--styletts2s-late-epoch-failure-mode">“Drifted past the target” → <strong>StyleTTS2’s late-epoch failure mode</strong></h3>

<p>StyleTTS2 epoch 5 introduced “down” → “doon” (Geordie/Scots realisation).
That’s <em>more Northern</em> than the Bolton/Lancashire target. The model had
slid past the target sub-region of accent space and was now drifting toward
broader Scots/North-East phonetics. Stopped training; epoch 4 became the
shipping checkpoint.</p>

<h2 id="why-loss-alone-cant-replace-listening">Why loss alone can’t replace listening</h2>

<p>Three reasons:</p>

<ol>
  <li>
    <p><strong>Loss flatness is ambiguous.</strong> A flat loss curve could mean “converged”
or “not learning at all.” Run 1’s flat 0.65 was the latter; only listening
(“indistinguishable from baseline”) disambiguated and pointed at the LR
scheduler. No purely numerical metric on training loss could distinguish
those two cases without an evaluation set.</p>
  </li>
  <li>
    <p><strong>Some failures look like wins on the loss curve.</strong> Late-epoch
overfitting drops training loss while degrading output. <em>Lower</em> loss +
<em>worse</em> output. Only listening catches it.</p>
  </li>
  <li>
    <p><strong>The thing being optimised isn’t what you actually want.</strong> Flow-matching
loss measures velocity-field reconstruction quality on the training
distribution. It doesn’t directly measure “is this output Northern
English-sounding to a native speaker.” The model can get better at
fitting Sara’s training mels while producing audio that sounds different
from any actual Sara recording.</p>
  </li>
</ol>

<p>This is why every training run produces multiple per-epoch checkpoints and
we render the same passage through several of them. The cost (~30s per
render × 5–6 epochs = ~3 min) buys you a perceptual gradient across training
time that no scalar loss provides.</p>

<h2 id="the-phonetic-marker-passage-as-deliberate-probe">The phonetic-marker passage as deliberate probe</h2>

<p>The test passage is loaded with English-accent markers so a single rendering
surfaces multiple aspects of the model’s state. Our standard probe:</p>

<blockquote>
  <p>“It was a bright morning when the path through the grass led down to the
running water. She ran her hand along the back of the chair before
sitting down. The young children were laughing in the sunshine, dancing
in patterns through the warm afternoon. After tea, the family walked up
the hill to look at the view. One of them said, with a small smile: I
cannot believe how lovely it is, our little corner of the world.”</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>What we’re probing</th>
      <th>Words that probe it</th>
      <th>“Wrong” sounds like</th>
      <th>“Northern” sounds like</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BATH vowel</td>
      <td>path, grass, laughing, dancing, after, cannot</td>
      <td>/pɑːθ/ (RP “parth”)</td>
      <td>/pæθ/ (rhymes with “trap”)</td>
    </tr>
    <tr>
      <td>FOOT-STRUT</td>
      <td>running, sunshine, hand, hill, up, lovely, our</td>
      <td>distinct “put”/”putt”</td>
      <td>collapsed: both /ʊ/, so “London” → “Lundun”</td>
    </tr>
    <tr>
      <td>Diphthong→monophthong</td>
      <td>sunshine (→sunshaan), morning, smile</td>
      <td>standard /aɪ/, /eɪ/</td>
      <td>flat, longer single vowel /aː/</td>
    </tr>
    <tr>
      <td>happY-tense</td>
      <td>lovely, family, every, country</td>
      <td>tense /iː/</td>
      <td>laxer, more /ɪ/-like</td>
    </tr>
    <tr>
      <td>R-intrusion / linking</td>
      <td>chair before, our little</td>
      <td>none</td>
      <td>often realised in connected Northern speech</td>
    </tr>
  </tbody>
</table>

<p>If only some markers come through, that tells us which <em>kinds</em> of changes
the model is finding easier vs harder to learn. In run 2 epoch 4 the
FOOT-STRUT shift had emerged (“Lundun”) but monophthongisation had not
(“sunshine” still diphthongised). That gap motivated continuing training
rather than declaring done — specific phonetic gaps mapping to specific
training decisions.</p>

<h2 id="practical-recommendations">Practical recommendations</h2>

<ol>
  <li>
    <p><strong>Save a checkpoint per epoch.</strong> They’re cheap to disk and you’ll want
the perceptual gradient across training time. Late-epoch isn’t always
best.</p>
  </li>
  <li>
    <p><strong>Curate one phonetic-marker passage</strong> that targets the dialect features
you care about. Reuse the same passage every render so you build a
listening-memory of the model’s progression.</p>
  </li>
  <li>
    <p><strong>Render with the same reference clip every time.</strong> The only variable
should be the model weights. If you change the reference clip you’re
asking two different questions at once.</p>
  </li>
  <li>
    <p><strong>Native-speaker listeners are the most reliable test instrument.</strong>
Their judgement catches features that numerical metrics miss — and
importantly, also catches <em>over-fitting</em> failures that look fine
numerically.</p>
  </li>
  <li>
    <p><strong>Both wins and bugs are signal.</strong> Don’t just record what’s working;
record what’s broken. The combination of “what improved” and “what got
worse” defines the engineering response (continue / step back / change
data).</p>
  </li>
  <li>
    <p><strong>Run more checkpoints than you think you need.</strong> A/B-ing 6 different
epochs of the same run takes 3 minutes of compute. The information
gain — perceptual gradient over training time — is worth far more than
that.</p>
  </li>
</ol>

<h2 id="provenance">Provenance</h2>

<p>Worked example from a small TTS fine-tuning project: ~3 hours of single-speaker
British (Bolton-area) audio + whisper.cpp for transcripts → fine-tuned F5-TTS
and StyleTTS2 producing recognisably Northern-English output. Both
architectures hit different late-epoch failure modes that only the listening
loop caught. The companion piece <a href="#">F5 vs StyleTTS2 architecture trade-off</a>
documents what those failure modes implied about the architectures themselves.</p>]]></content><author><name>Graham</name></author><summary type="html"><![CDATA[Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on a small Northern English corpus. The headline finding is that the listening test isn’t optional polish at the end — it’s the only measurement that catches the failure modes that matter, and each round of listening produces specific phonetic observations that map to specific engineering decisions.]]></summary></entry></feed>