F5-TTS vs StyleTTS2: a real Pareto trade-off in fine-tune behaviour
Running two TTS architectures on the same small fine-tune corpus surfaces a real trade-off: F5-TTS commits hard to accent character at the cost of phonetic stability; StyleTTS2 stays phonetically stable at the cost of accent commitment. Neither dominates. Both have late-epoch failure modes, just different ones.
This is a write-up of the comparison, with the concrete failure modes that made the trade-off visible.
Set-up
- Corpus: ~163 minutes of clean British (Bolton/Lancashire region) audio across three speakers: Sara Cox, Maxine Peake, Diane Morgan (every segment single-speaker)
- Models: F5-TTS v1 Base (336M params) and StyleTTS2 from the LibriTTS pretrain (~600M params)
- Method: identical data pipeline (Whisper transcription → segment → pre-compute mels → JSONL manifest), then fine-tune each model with its own trainer, checkpoint per epoch, render and listen at multiple epochs
- Listener: native Northern English speaker, evaluating each render against a phonetic-marker passage (BATH vowel, FOOT-STRUT, diphthong realisation, etc.)
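The shared manifest step of the pipeline is small enough to sketch. A minimal version, assuming hypothetical segment records from the transcription/segmentation stage (field names and paths are illustrative, not the exact project schema):

```python
import json

def write_manifest(segments, path):
    """Write one JSON object per line: audio path, transcript, duration."""
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(json.dumps({
                "audio": seg["audio"],        # path to the segment wav
                "text": seg["text"],          # Whisper transcript
                "duration": seg["duration"],  # seconds, for batching/filtering
            }, ensure_ascii=False) + "\n")

# illustrative record, not real corpus data
segments = [
    {"audio": "clips/0001.wav", "text": "she ran her hand", "duration": 1.4},
]
write_manifest(segments, "manifest.jsonl")
```

Both trainers then consume the same `manifest.jsonl`, which is what makes the comparison apples-to-apples.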
Findings
F5-TTS: strong accent, occasional phonetic failures
After 9 epochs at lr=5e-5 constant:
- Wins: distinctly Northern; “London” rendered as “Lundun” (FOOT-STRUT vowel collapse, a textbook Northern marker); “laughing” with the short /æ/ vowel (no BATH broadening)
- Failures (late-epoch):
- “sunshine” → “sunshinn” — final phoneme truncation
- “morning” → “monning” — consonant deletion
- dropped function words: “she ran [her] hand” → “she ran hand”
- Numerical signal: training loss flat across epochs (per-batch noise dominates), no late-epoch loss spike — failures invisible to the metric
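The "flat loss" point is easy to reproduce synthetically: when per-batch variance dwarfs the epoch-to-epoch trend, epoch means are statistically indistinguishable. A toy illustration (the numbers are made up, not taken from the runs):

```python
import random
random.seed(0)

def epoch_mean(base, noise_sd, batches=200):
    # mean of per-batch losses: a tiny real trend buried in batch noise
    return sum(random.gauss(base, noise_sd) for _ in range(batches)) / batches

# a 0.2%-per-epoch trend under 30% per-batch noise: epoch means look flat
means = [epoch_mean(1.0 - 0.002 * e, 0.30) for e in range(9)]
print([round(m, 3) for m in means])
```

Qualitative regressions like truncation cost almost nothing in this metric, so the curve stays flat while the renders get worse.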
StyleTTS2: clean phonetics, drifts past target accent
After 4 epochs of fine-tune from the LibriTTS pretrain:
- Wins: moderate Northern intonation, no truncations or dropped words, every phoneme rendered cleanly
- Failures (late-epoch, epoch 5+):
- “down” → “doon” — over-fit toward Geordie/Scots realisation rather than Bolton/Lancashire target
- subtle prosody drift toward broader Scottish patterns
- Why: the model has slid past the target sub-region of accent space toward an adjacent (and over-represented in pre-training) region
Side-by-side
| | F5 sweet spot (run 2 epoch 9) | StyleTTS2 sweet spot (run 2 epoch 4) |
|---|---|---|
| Northern character | Broad, strong | Moderate, Bolton-stable |
| Phonetic completeness | Some truncations & drops | Clean — every phoneme rendered |
| Late-epoch failure | Truncates / drops / mangles word endings | Drifts into adjacent accents (Geordie/Scots) |
| Training loss as predictor | Flat, no signal | Flat, no signal |
| Suitable when | Accent strength matters more than precision | Phonetic accuracy matters more than commitment |
Why this is happening: the loss-function difference
The two architectures have fundamentally different training objectives.
F5-TTS: one loss
Conditional flow matching velocity-field reconstruction. The model is free to drift in timing, alignment, or word count as long as the spectral signature of the output matches the target. Late-epoch overfitting can find a slightly-lower-loss configuration that compresses pace or skips quiet phonemes, because there is no explicit penalty for either.
loss = E_{t, x_0, x_1} ‖ v_θ(x_t, t, text, ref) − (x_1 − x_0) ‖²
That’s it. Spectral pressure only.
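A numpy sketch of that conditional flow-matching objective, with a stand-in predictor (in the real trainer the velocity field is predicted by the transformer from text and reference audio):

```python
import numpy as np
rng = np.random.default_rng(0)

def cfm_loss(predict_velocity, x0, x1, t):
    """Conditional flow matching: regress the velocity (x1 - x0)
    at the interpolated point x_t = (1 - t) * x0 + t * x1."""
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

x0 = rng.standard_normal((80, 100))  # noise "mel"
x1 = rng.standard_normal((80, 100))  # target "mel"
oracle = lambda x_t, t: x1 - x0      # stand-in model: the true velocity
print(cfm_loss(oracle, x0, x1, 0.3))  # 0.0 for the oracle
```

Note what is absent: nothing in the objective counts phonemes, checks alignment, or penalises a dropped word, which is exactly the opening the late-epoch failures exploit.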
StyleTTS2: eight active losses simultaneously
Each constraining a different aspect of the output:
| Loss | What it constrains |
|---|---|
| lambda_mel | Spectral reconstruction |
| lambda_dur + lambda_ce | Per-phoneme duration prediction |
| lambda_F0 | Pitch contour fidelity |
| lambda_norm | Spectral envelope normalisation |
| lambda_mono | Monotonic text↔mel alignment (TMA) |
| lambda_s2s | Sequence-to-sequence consistency |
| lambda_gen | HiFi-GAN discriminator |
| lambda_sty | Style-vector reconstruction |
Truncating “sunshine” to “sunshinn” would simultaneously violate the duration loss (wrong predicted phoneme count), the monotonic alignment loss (final phonemes have no audio to align with), and the discriminator (sounds artificial). Multiple constraints, multiple penalties.
But — the very same constraints that prevent F5’s truncation failure mode also prevent StyleTTS2 from committing as hard to a specific accent. Strong accent features require larger acoustic shifts than the per-phoneme duration / pitch / alignment constraints leave room for. So StyleTTS2’s fine-tunes converge on softer, more conservative accent renditions that preserve phonetic completeness.
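Schematically, the total objective is a weighted sum of those terms, and a drift that raises several terms at once is penalised multiply. The lambda values below are placeholders for illustration, not StyleTTS2's actual config defaults:

```python
# illustrative weights only, not StyleTTS2's shipped config values
LAMBDAS = {
    "mel": 5.0, "dur": 1.0, "ce": 20.0, "F0": 1.0,
    "norm": 1.0, "mono": 1.0, "s2s": 1.0, "gen": 1.0, "sty": 1.0,
}

def total_loss(terms, lambdas=LAMBDAS):
    """terms: dict of per-loss scalar values, keyed like `lambdas`."""
    return sum(lambdas[k] * v for k, v in terms.items())

# truncating a word ending bumps dur, mono and gen simultaneously
clean = {k: 0.1 for k in LAMBDAS}
truncated = dict(clean, dur=0.5, mono=0.4, gen=0.3)
print(total_loss(clean), total_loss(truncated))
```

The same coupling cuts the other way: a large acoustic shift toward a broad accent also moves duration, pitch, and alignment terms, so the optimiser settles for softer renditions.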
The two failure modes are mirror images of each other
F5: drifts in timing
The single spectral loss has no explicit timing constraint, so under fine-tune pressure the model finds locally cheaper solutions that mangle pace: truncate, delete, drop. The accent character is preserved (the spectra are still right) but words become incomplete.
StyleTTS2: drifts in accent space
The multiple constraints prevent timing drift. But within the constraint manifold, the model can still over-fit toward whichever sub-distribution of the pre-training landscape the fine-tune data most resembles. Bolton-area audio resembles broader Scottish phonetics along several axes, and the model slides toward that cluster because the LibriTTS pre-training data contains more of it than of the specific Bolton sub-region.
So:
- F5 over-fits on training-corpus pace and timing
- StyleTTS2 over-fits toward adjacent regions of accent space that are over-represented in pre-training
Different cliffs. Same root cause: a small fine-tune corpus can’t fully specify the target distribution to either architecture.
What this means for choosing an architecture
If you’re fine-tuning on a small (< 10 hour) corpus:
- Pick F5-TTS if: accent commitment / strong character matters more than phonetic precision. Use at moderate epoch counts (6–9) and pick the earliest checkpoint where the accent is committed but the truncations haven’t started.
- Pick StyleTTS2 if: phonetic precision and clean delivery matter more than strong accent commitment. Use at moderate epoch counts (3–4 from a good pre-train) and stop before the accent drift kicks in.
Neither dominates. The corpus size and use case determine the choice.
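Either way, the checkpoint-picking rule in the two bullets above reduces to the same loop. A sketch, where `accent_ok` and `artifact_free` stand in for the human listening pass (not for any automated metric):

```python
def pick_checkpoint(epochs, accent_ok, artifact_free):
    """Earliest epoch whose render both commits to the accent and is
    free of truncation/drift artifacts. Both predicates come from
    listening; training loss plays no role in the decision."""
    for e in epochs:
        if accent_ok(e) and artifact_free(e):
            return e
    return None

# toy predicates: accent locks in at epoch 6, artifacts appear at epoch 10
print(pick_checkpoint(range(1, 13),
                      accent_ok=lambda e: e >= 6,
                      artifact_free=lambda e: e < 10))  # -> 6
```

With F5 the `artifact_free` window closes on truncations; with StyleTTS2 it closes on accent drift, but the selection logic is identical.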
If you have a large (> 100 hour) corpus that fully specifies the target distribution, the trade-off probably collapses: both architectures have enough signal to lock onto the target sub-region without needing to extrapolate. We didn’t have that in this work.
Practical implication: select-by-listening, ship the right artefact
Both architectures produce useful results. Both have late-epoch failure modes. Both require listening evaluation to pick the right checkpoint — training loss tells you nothing useful for final model selection.
In the project this comes from, we shipped both as separate scripts:
- a “default” script using StyleTTS2 (for everyday clean output)
- a “stronger accent” script using F5-TTS (when accent strength matters)
Users pick the script for the use case. Same fine-tune data, two production artefacts at different points on the Pareto front.
Provenance
Worked from a small TTS fine-tuning project: ~3 hours of British audio from 3 speakers (Sara Cox, Maxine Peake, Diane Morgan, all Bolton/Lancashire region), fine-tuning F5-TTS and StyleTTS2 to produce recognisably Northern-English output. Companion pieces:
- The listening-loop methodology — how human feedback actually steers training decisions
- The non-AVX2 CPU compatibility notes — what it took to make this run on a 2010-era CPU
- The minimal F5-TTS trainer used for the F5 side of the comparison