Notes from getting F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp to work on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).

The CPU has SSE/SSE2/SSE3/SSE4a, plus CX16/POPCNT/LAHF — but no SSE4.1, no SSE4.2, no AVX, no AVX2, no FMA, no F16C. That puts it below the modern x86-64-v2 baseline. A growing share of binary Python wheels in the AI ecosystem assume v2 or v3, so they SIGILL or SIGFPE at import. This is a ground-truth list of what we hit and what worked.

Quick triage

If your CPU is below x86-64-v2 (in particular, missing SSE4.1), expect:

  • pyarrow static-init pinsrq SIGILL on import
  • numpy 2.x wheel SIGILL on import (numpy 1.26.4 still has a fallback path)
  • torch 2.10+ wheel SIGFPE in torch._dynamo on import
  • pandas modern wheels SIGILL on tokenisation
  • monotonic_align and other Cython extensions: build-from-source SIGILL
  • DataLoader subprocess workers SIGFPE re-importing torch

If your CPU is x86-64-v2 (Nehalem ~2008 or newer Intel; Bulldozer ~2011 or newer AMD) but missing AVX/AVX2, you’ll still hit some of these but fewer.
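To see where a given box falls, you can classify its /proc/cpuinfo flags against the psABI microarchitecture levels. A minimal sketch (feature sets are abridged to the flags relevant to the failures above; names follow Linux's flag spelling):

```python
# Classify a set of Linux CPU-flag names into an x86-64 psABI level.
# Feature lists are abridged to the flags that matter for the wheels above.
LEVELS = [
    ("x86-64-v2", {"cx16", "popcnt", "ssse3", "sse4_1", "sse4_2"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def x86_64_level(flags):
    """Return the highest x86-64-vN level the flag set satisfies."""
    level, required = "x86-64-v1", set()
    for name, feats in LEVELS:
        required |= feats            # levels are cumulative
        if not required <= flags:
            break
        level = name
    return level

# The Phenom II X6 flag set from above: SSE4a but no SSE4.1/SSSE3 -> v1 only.
phenom = {"sse", "sse2", "sse3", "sse4a", "cx16", "popcnt", "lahf_lm"}
print(x86_64_level(phenom))          # x86-64-v1

# On Linux, feed it the live flags:
# flags = set(next(l for l in open("/proc/cpuinfo") if l.startswith("flags")).split()[2:])
```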

Working pin-set

These are versions empirically verified to import and run on this CPU:

package                                   version        why
numpy                                     1.26.4         last with a non-AVX2 fallback path; from-source builds OK
torch                                     2.7.0          last whose _dynamo init survives with the one-line stub (patch 1)
torchaudio                                2.7.0          last with the soundfile backend (2.10+ requires torchcodec)
transformers                              4.57.3         5.x triggers torch._dynamo at import time via torch.compiler.disable
numba / scipy / librosa                   latest         binary wheels OK
pyarrow / pandas / datasets / torchcodec  (uninstalled)  wheels assume SSE4.1+; not actually needed for inference

For a fresh install, layer the pins after the project install:

pip install --prefer-binary <project>           # whatever you actually want
pip install --prefer-binary --force-reinstall --no-deps \
    "torch==2.7.0" "torchaudio==2.7.0" \
    "transformers==4.57.3" "numpy<2"
pip uninstall -y datasets pyarrow pyarrow-hotfix pandas torchcodec

Patches required

Patch 1: torch._dynamo SIGFPE on int division by zero

Even after pinning to torch 2.7.0, the very first dynamo init still SIGFPEs on this CPU. Cause: torch._dynamo.variables.torch_function.populate_builtin_to_tensor_fn_map() probes Python operators on dummy tensors, including tensor // 0 (integer floor-divide by zero). Newer Intel CPUs trap this into a Python ZeroDivisionError via a signal handler. AMD Phenom II just SIGFPEs.

The function’s output isn’t actually needed for inference. Stub it:

F=$(python -c "import torch._dynamo.variables.torch_function as m; print(m.__file__)")
cp "$F" "$F.orig"
sed -i "0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II\n    global BUILTIN_TO_TENSOR_FN_MAP/" "$F"
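The 0,/…/ address and empty-regex reuse are GNU-sed-specific, so it is worth dry-running the expression on a throwaway file before touching site-packages (the file contents below are a made-up abridgement of the real module):

```shell
cat > /tmp/demo_torch_function.py <<'EOF'
def populate_builtin_to_tensor_fn_map():
    global BUILTIN_TO_TENSOR_FN_MAP
    global BUILTIN_TO_TENSOR_FN_MAP
EOF
# Same substitution as above; the 0,/re/ address limits it to the first match.
sed -i "0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II\n    global BUILTIN_TO_TENSOR_FN_MAP/" /tmp/demo_torch_function.py
cat /tmp/demo_torch_function.py
```

The early return lands on line 2, before the first global, and only once even though the pattern occurs twice.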

This is non-invasive — it only affects code that uses torch.compile() / dynamo paths, which most fine-tuning trainers don’t use.

Patch 2: GPU-only mel-spectrogram computation

torch.matmul on the CPU backend SIGFPEs on this machine. Anything that calls torchaudio’s MelSpectrogram on the CPU dies with it. For training pipelines that compute mels in the data loader, this is fatal.

Two ways to fix:

a) Move the mel module to GPU (cheap audio→mel transfer per sample):

to_mel = torchaudio.transforms.MelSpectrogram(...).to("cuda")
def preprocess(wave):
    wave = torch.from_numpy(wave).to("cuda")
    mel = to_mel(wave)
    return mel.cpu()  # back to CPU for DataLoader collator

b) Pre-compute all mels once on GPU, save to disk, load at training time (example script).

(b) is faster overall — no per-sample audio→GPU transfer, just torch.load.
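The shape of option (b), sketched with numpy stand-ins so it runs anywhere (in the real pipeline, the placeholder transform is torchaudio's MelSpectrogram on CUDA, and the cache uses torch.save/torch.load instead of np.save/np.load):

```python
import os
import tempfile

import numpy as np

def precompute_mels(waves, cache_dir):
    """One-time pass: compute a 'mel' per utterance and cache it to disk.
    np.abs(rfft) is a stand-in for the GPU MelSpectrogram."""
    os.makedirs(cache_dir, exist_ok=True)
    for name, wave in waves.items():
        mel = np.abs(np.fft.rfft(wave))
        np.save(os.path.join(cache_dir, name + ".npy"), mel)

def load_mel(name, cache_dir):
    """Training-time path: pure file I/O, nothing that can SIGFPE."""
    return np.load(os.path.join(cache_dir, name + ".npy"))

cache = tempfile.mkdtemp()
precompute_mels({"utt1": np.random.randn(2048).astype(np.float32)}, cache)
mel = load_mel("utt1", cache)
```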

Patch 3: num_workers=0 everywhere

DataLoader spawns subprocess workers that re-import torch and re-run the _dynamo init. Even with patch 1, the patched source isn’t always picked up in those subprocesses. Set num_workers=0 to keep all loading in the main process.

Patch 4: weights_only=False for older checkpoint formats

PyTorch 2.6+ flipped the default. If you load checkpoints saved before 2.6 that contain pickled Python objects, you need torch.load(path, weights_only=False). Affected: many published TTS pretrained models (StyleTTS2’s ASR/JDC/PLBERT modules, F5-TTS in some cases).

Patch 5: Stub datasets for transformers’ lazy loader

transformers.utils.import_utils._is_package_available("datasets") calls importlib.util.find_spec("datasets"), which raises ValueError if __spec__ is None. If you provide a stub datasets module via sys.modules (to avoid pulling pyarrow), it must have a real ModuleSpec:

import importlib.machinery, types, sys
_stub = types.ModuleType("datasets")
_stub.__spec__ = importlib.machinery.ModuleSpec("datasets", loader=None)
_stub.Dataset = type("Dataset", (), {})
_stub.load_from_disk = lambda *a, **kw: None
sys.modules["datasets"] = _stub
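A self-contained check of why the real ModuleSpec matters: find_spec consults sys.modules first, returns the stub’s spec when it is set, and raises ValueError when __spec__ is None (which is exactly what breaks transformers’ probe). The nospec_demo name is a hypothetical module used only to show the failure mode:

```python
import importlib.machinery
import importlib.util
import sys
import types

# Stub with a real ModuleSpec, as above: the availability probe succeeds.
stub = types.ModuleType("datasets")
stub.__spec__ = importlib.machinery.ModuleSpec("datasets", loader=None)
sys.modules["datasets"] = stub
print(importlib.util.find_spec("datasets") is not None)   # True

# Stub without a spec: find_spec raises ValueError, which is the error
# transformers' _is_package_available would hit.
bad = types.ModuleType("nospec_demo")
bad.__spec__ = None
sys.modules["nospec_demo"] = bad
try:
    importlib.util.find_spec("nospec_demo")
except ValueError as e:
    print("ValueError:", e)
```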

Patch 6: --no-build-isolation for Cython extensions

monotonic_align (used by StyleTTS2) and similar packages build with their own ephemeral build-env via pip’s build isolation. That ephemeral env re-installs numpy and cython and may pull AVX2 wheels. Use:

pip install --no-build-isolation --no-deps <package>

This forces the build to use your already-installed (pinned) numpy+cython.

Per-project status

F5-TTS

  • Inference and training both work after patches 1–5.
  • See companion gist for a minimal trainer that bypasses datasets/accelerate.
  • Issue filed: SWivid/F5-TTS#1292 (EMA-only checkpoint structure).

StyleTTS2

  • Inference and fine-tune both work after patches 1, 2, 3, 4, 6.
  • PRs filed: yl4579/StyleTTS2#361 (weights_only=False), #362 (drop pandas).

kokoro

  • Inference works (via the kokoro-onnx ONNX runtime path; PyTorch path blocked by upstream dep pinning, not CPU).
  • Issue filed: hexgrad/kokoro#321 (broken misaki>=0.7.16 PyPI pin).

whisper.cpp

  • Works out of the box. Pure C++, no Python wheels involved. CUDA inference on the GPU.

What does not work

  • pyarrow source build: succeeds eventually but the resulting library still uses SSE4.1 in places (Apache Arrow’s CMake ARROW_SIMD_LEVEL=NONE doesn’t cover everything). Not worth the multi-hour build.
  • numpy 2.x: even a from-source build ends up with AVX-requiring code via the bundled OpenBLAS. Stick with 1.26.4.
  • Anything using bitsandbytes int8/int4 quantisation: those kernels hard-require AVX2.

Worth trying if you have AVX (no AVX2)

A 2011-era Sandy Bridge or 2012-era Ivy Bridge Intel CPU has AVX but no AVX2 (AVX2 arrived with Haswell in 2013). Most of the patches above still apply, but you may not need patch 1 (dynamo SIGFPE), and pyarrow/datasets/pandas may install (just not the AVX2-specific code paths). Try without the uninstalls first.

Summary

If you want to do TTS fine-tuning on hardware below x86-64-v2:

  1. Do inference work on the GPU. Keep CPU-side code to file I/O and JSON.
  2. Pin numpy 1.26 + torch 2.7 + transformers 4.57.
  3. Stub or uninstall datasets/pyarrow/pandas/torchcodec.
  4. Patch torch._dynamo once per torch install.
  5. Pre-compute mel-spectrograms offline.
  6. Train at num_workers=0.

The rig produces useful output. It’s not a fast-iteration machine — every upstream upgrade re-breaks something — but fine-tuning doesn’t need fast iteration, and the combination is economical: an RTX 3060 12 GB on a 2010-era CPU running real-world TTS workloads.