Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on a small Northern English corpus. The headline finding is that the listening test isn’t optional polish at the end — it’s the only measurement that catches the failure modes that matter, and each round of listening produces specific phonetic observations that map to specific engineering decisions.

This is a write-up of the methodology, with the concrete examples that forced each decision.

The loop

        ┌────────────────────────┐
        │  render passage        │
        │  (baseline + ft)       │
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         a feature is "right" if a native
        │  human listens against │         speaker recognises it. Record both
        │  marker list (BATH,    │  ◀───── what's working AND what's broken;
        │  FOOT-STRUT, …)        │         both are signal.
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         translate audible features
        │  diagnose: why is the  │         to training-side cause:
        │  output the way it is? │  ◀───── · missing accent → under-trained
        └──────────┬─────────────┘         · right accent + glitches → over-trained
                   ▼                       · wrong accent → data or LR direction
        ┌────────────────────────┐         specific knobs:
        │  pick next training    │         · lr ↑/↓ (drift per step)
        │  move                  │  ◀───── · epochs ±N (cumulative drift)
        └──────────┬─────────────┘         · earlier ckpt (rewind)
                   ▼                       · data filter (cleaner signal)
              [iterate]

The verdict-to-action mapping

Each verdict maps to a physical cause and an engineering response:

  • Verdict: “No discernible difference from baseline”
    Cause: cumulative weight drift Σ lr·grad too small. Either lr is too low, the scheduler decayed it to ~0, or there were too few epochs.
    Response: increase lr or remove the decay; add epochs.

  • Verdict: “Accent is right but specific words are mangled, dropped, or truncated”
    Cause: late-epoch overfitting on the training corpus’s pace and timing; the model crossed from “learning the distribution” to “memorising peculiarities of a small corpus”.
    Response: step back to an earlier checkpoint, or continue at a lower lr.

  • Verdict: “Accent is in the wrong direction (e.g. American instead of Northern)”
    Cause: training data misattributed, or the model pulled toward a different distribution than expected.
    Response: audit the data. Is the manifest pointing at the right speakers? Is diarisation clean? Are speaker IDs correct?

  • Verdict: “A specific phonetic feature is still missing (e.g. no monophthongisation on ‘sunshine’)”
    Cause: that pattern needs more exposure to the training distribution; some accent features are easier to acquire than others.
    Response: train more at constant lr. Don’t increase lr to chase one feature; you risk catastrophic forgetting.

  • Verdict: “A feature drifted past the target (e.g. ‘down’ → ‘doon’)”
    Cause: overfitting on the broader cluster of related accents; the model has slid past the target sub-region.
    Response: rewind to a checkpoint from before the drift.
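
If you keep machine-readable run notes, the mapping collapses to a small lookup. A sketch; the verdict labels and action strings are our own shorthand, not any library’s API:

    # Verdict-to-action table as a lookup; labels are our run-notes shorthand.
    VERDICT_TO_ACTION = {
        "no_difference":       "raise lr or remove decay; add epochs",
        "accent_ok_words_bad": "rewind to an earlier checkpoint, or lower lr",
        "wrong_accent":        "audit data: manifest, diarisation, speaker IDs",
        "feature_missing":     "train more at the same lr; don't chase with lr",
        "drifted_past_target": "rewind to a checkpoint before the drift",
    }

    def next_move(verdict: str) -> str:
        """Look up the engineering response for a listening verdict."""
        return VERDICT_TO_ACTION[verdict]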

These categories aren’t theoretical. We hit each of them in real training runs. Examples:

“No discernible difference” → LR scheduler decayed to zero

Run 1 of F5-TTS used the trainer’s default schedule: linear warmup to peak 1e-5, then linear decay across the entire run to ~0. After 5 epochs:

  • Mean loss per epoch: 0.629, 0.677, 0.648, 0.642, 0.670 — flat
  • Listening: indistinguishable from baseline
  • Numerical: waveform correlation with baseline = 0.017, essentially uncorrelated audio. That is expected under stochastic diffusion sampling even when nothing has been learned, so the number suggested the model was doing something while the perceptual output disagreed.
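
The correlation check is cheap to script but, under stochastic diffusion sampling, near zero even for two renders that sound identical, which is why it could not disambiguate. A minimal sketch, assuming two WAV renders of the same passage; the filenames are hypothetical:

    import numpy as np
    import soundfile as sf

    # hypothetical filenames: a baseline render vs a fine-tuned render
    a, sr_a = sf.read("baseline_render.wav")
    b, sr_b = sf.read("finetuned_render.wav")
    assert sr_a == sr_b, "compare renders at the same sample rate"

    # downmix to mono if needed, trim to common length, correlate
    a = a.mean(axis=1) if a.ndim > 1 else a
    b = b.mean(axis=1) if b.ndim > 1 else b
    n = min(len(a), len(b))
    print(f"waveform correlation: {np.corrcoef(a[:n], b[:n])[0, 1]:.3f}")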

Diagnosis: the schedule shape was wrong. Step-by-step LR values:

  • step 1: lr = 1e-7 (warmup)
  • step 100: lr = 1e-5 (peak, decay starts)
  • step 1000: ≈ 5.5e-6
  • step 2225: lr = 1e-13 — effectively zero

Total weight drift is bounded by Σ lr·grad. With LR linearly decaying to ~0 over 2225 steps, late-epoch gradients are multiplied by near-zero values. Most of run 1’s “5 epochs” was a no-op.
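
To make that bound concrete, you can reconstruct run 1’s schedule from the step values above and sum the lr terms. Because lr falls linearly, the unspent budget shrinks quadratically; the last quarter of the run contributes only a few percent of Σ lr. A sketch, assuming the breakpoints quoted above:

    import numpy as np

    PEAK, WARMUP, TOTAL = 1e-5, 100, 2225  # run 1 values quoted above

    def run1_lr(step):
        # linear warmup to PEAK, then linear decay to ~0 at the final step
        if step < WARMUP:
            return PEAK * (step + 1) / WARMUP
        return PEAK * (TOTAL - step) / (TOTAL - WARMUP)

    lrs = np.array([run1_lr(s) for s in range(TOTAL)])
    for frac in (0.25, 0.50, 0.75):
        cut = int(TOTAL * frac)
        print(f"first {frac:.0%} of steps: {lrs[:cut].sum() / lrs.sum():.0%} of the lr budget")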

Run 2 fix: 5× higher peak LR (5e-5), constant after warmup, no decay. 10 epochs. Result: per-epoch mean loss decreased (0.701 → 0.683 → 0.661 → 0.646 across the first 4), and listening verdict was audibly Northern — “London” rendered as “Lundun” (FOOT-STRUT vowel collapse, a textbook Northern marker).
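
The fix is a one-liner in PyTorch. A sketch assuming a 100-step warmup (the peak lr is run 2’s; the model object is a stand-in for the real network):

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(4, 4)          # stand-in for the actual TTS model
    PEAK_LR, WARMUP_STEPS = 5e-5, 100

    optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

    # scale factor on PEAK_LR: linear ramp to 1.0, then hold constant -- no decay
    scheduler = LambdaLR(optimizer, lr_lambda=lambda s: min(1.0, (s + 1) / WARMUP_STEPS))
    # per training step: optimizer.step(); scheduler.step()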

“Specific feature missing” → Train more, same LR

After 4 epochs of run 2, FOOT-STRUT had emerged (“Lundun”) but monophthongisation hadn’t (“sunshine” still carried the standard diphthong). Some phonetic patterns are easier to acquire than others: single-vowel substitutions versus global diphthong-to-monophthong shifts.

Continuing 6 more epochs at the same constant 5e-5: monophthongisation strengthened (“laughing” landed correctly), but truncation appeared (“sunshine” → “sunshinn”, and function words like “her” were dropped).

“Accent right but words mangled” → Pick earlier checkpoint

Run 2 epoch 10 had the strongest accent but the most word truncation. Rendering epochs 6 through 9 with the same input passage and listening through each revealed epoch 9 as the sweet spot: accent committed, mostly free of truncation. That became the shipping checkpoint.

“Drifted past the target” → StyleTTS2’s late-epoch failure mode

StyleTTS2 epoch 5 introduced “down” → “doon” (Geordie/Scots realisation). That’s more Northern than the Bolton/Lancashire target. The model had slid past the target sub-region of accent space and was now drifting toward broader Scots/North-East phonetics. Stopped training; epoch 4 became the shipping checkpoint.

Why loss alone can’t replace listening

Three reasons:

  1. Loss flatness is ambiguous. A flat loss curve could mean “converged” or “not learning at all.” Run 1’s flat 0.65 was the latter; only listening (“indistinguishable from baseline”) disambiguated and pointed at the LR scheduler. No purely numerical metric on training loss could distinguish those two cases without an evaluation set.

  2. Some failures look like wins on the loss curve. Late-epoch overfitting drops training loss while degrading output. Lower loss + worse output. Only listening catches it.

  3. The thing being optimised isn’t what you actually want. Flow-matching loss measures velocity-field reconstruction quality on the training distribution. It doesn’t directly measure “does this output sound Northern English to a native speaker”. The model can get better at fitting the training speaker’s (Sara’s) mels while producing audio that sounds different from any actual recording of her.
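
For concreteness: in its common linear-path form (our paraphrase, not the F5-TTS paper’s exact notation), the conditional flow-matching objective has the shape

    L = E over t, x₀, x₁ of ‖ v_θ((1−t)·x₀ + t·x₁, t) − (x₁ − x₀) ‖²

where x₀ is noise, x₁ a training mel, and v_θ the learned velocity field. “Sounds Northern to a native speaker” appears nowhere in it.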

This is why every training run produces multiple per-epoch checkpoints and we render the same passage through several of them. The cost (~30s per render × 5–6 epochs = ~3 min) buys you a perceptual gradient across training time that no scalar loss provides.
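
The render pass itself is just a loop over checkpoints with everything else held fixed. A sketch; synthesize.py, the checkpoint layout, and the filenames are hypothetical stand-ins for whatever inference entry point your stack exposes:

    import subprocess
    from pathlib import Path

    PASSAGE = Path("probe_passage.txt").read_text()  # the marker passage below
    REF_CLIP = "reference.wav"                       # same reference clip every render

    for ckpt in sorted(Path("checkpoints").glob("epoch_*.pt")):
        subprocess.run(
            ["python", "synthesize.py",              # hypothetical wrapper script
             "--ckpt", str(ckpt),
             "--ref", REF_CLIP,
             "--text", PASSAGE,
             "--out", f"renders/{ckpt.stem}.wav"],
            check=True,
        )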

The phonetic-marker passage as deliberate probe

The test passage is loaded with English-accent markers so a single rendering surfaces multiple aspects of the model’s state. Our standard probe:

“It was a bright morning when the path through the grass led down to the running water. She ran her hand along the back of the chair before sitting down. The young children were laughing in the sunshine, dancing in patterns through the warm afternoon. After tea, the family walked up the hill to look at the view. One of them said, with a small smile: I cannot believe how lovely it is, our little corner of the world.”

  • BATH vowel. Probed by: path, grass, laughing, dancing, after, cannot.
    Wrong: /pɑːθ/ (RP “parth”). Northern: /pæθ/ (rhymes with “trap”).

  • FOOT-STRUT. Probed by: running, sunshine, hand, hill, up, lovely, our.
    Wrong: distinct “put”/“putt”. Northern: collapsed, both /ʊ/, so “London” → “Lundun”.

  • Diphthong → monophthong. Probed by: sunshine (→ “sunshaan”), morning, smile.
    Wrong: standard /aɪ/, /eɪ/. Northern: a flat, longer single vowel /aː/.

  • happY vowel. Probed by: lovely, family.
    Wrong: tense /iː/. Northern: laxer, more /ɪ/-like.

  • R-intrusion / linking r. Probed by: “chair before”, “our little”.
    Wrong: absent. Northern: often realised in connected speech.

If only some markers come through, that tells us which kinds of changes the model is finding easier vs harder to learn. In run 2 epoch 4 the FOOT-STRUT shift had emerged (“Lundun”) but monophthongisation had not (“sunshine” still diphthongised). That gap motivated continuing training rather than declaring done — specific phonetic gaps mapping to specific training decisions.
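
To make listening notes diffable across renders, the marker table turns into a checklist. The structure below is our own convention; the word lists are a subset taken from the passage:

    # Lexical-set markers with probe words (a subset) from the passage above.
    MARKERS = {
        "BATH":        ["path", "grass", "laughing", "dancing", "after", "cannot"],
        "FOOT-STRUT":  ["running", "sunshine", "up", "lovely"],
        "monophthong": ["sunshine", "morning", "smile"],
        "happY":       ["lovely", "family"],
        "linking-r":   ["chair before", "our little"],
    }

    def blank_scoresheet() -> dict:
        # one verdict per marker per render, filled in by ear:
        # None = not judged yet, True = Northern realisation, False = missing/wrong
        return {marker: None for marker in MARKERS}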

Practical recommendations

  1. Save a checkpoint per epoch. They’re cheap to keep on disk, and you’ll want the perceptual gradient across training time; the latest epoch isn’t always the best. (A one-line sketch follows this list.)

  2. Curate one phonetic-marker passage that targets the dialect features you care about. Reuse the same passage every render so you build a listening-memory of the model’s progression.

  3. Render with the same reference clip every time. The only variable should be the model weights. If you change the reference clip you’re asking two different questions at once.

  4. Native-speaker listeners are the most reliable test instrument. Their judgement catches features that numerical metrics miss — and importantly, also catches over-fitting failures that look fine numerically.

  5. Both wins and bugs are signal. Don’t just record what’s working; record what’s broken. The combination of “what improved” and “what got worse” defines the engineering response (continue / step back / change data).

  6. Run more checkpoints than you think you need. A/B-ing 6 different epochs of the same run takes 3 minutes of compute. The information gain — perceptual gradient over training time — is worth far more than that.
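
Recommendation 1 costs one line in most PyTorch training loops. A sketch; the epoch body and the naming scheme are our own convention:

    import torch

    model = torch.nn.Linear(4, 4)   # stand-in for the TTS model
    for epoch in range(10):
        # ... one epoch of training here ...
        # zero-padded names keep renders sorted in epoch order
        torch.save(model.state_dict(), f"epoch_{epoch:02d}.pt")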

Provenance

Worked example from a small TTS fine-tuning project: ~3 hours of single-speaker British (Bolton-area) audio + WhisperCPP for transcripts → fine-tuned F5-TTS and StyleTTS2 producing recognisably Northern-English output. Both architectures hit different late-epoch failure modes that only the listening loop caught. The companion piece F5 vs StyleTTS2 architecture trade-off documents what those failure modes implied about the architectures themselves.