Whisper and STT on Apple Silicon
Whisper Notes from My Mac (September 2025)
I spent the past few days gluing together a speech‑to‑text workflow on my Apple Silicon MacBook (macOS 26.0, Python 3.12). Most of what follows is a log of what actually happened: which warnings popped up, how I worked around them, and why I eventually leaned on whisper.cpp. I’m keeping this grounded so I (or anyone else) can repeat the steps later without guesswork.
Environment Snapshot
All version numbers below come straight from the venv and host (sw_vers, python -V, importlib.metadata.version).
| Component | Version / detail |
|---|---|
| macOS | 26.0 |
| Python | 3.12.5 |
| transformers | 4.46.1 |
| torch | 2.8.0 |
| soundfile | 0.13.1 |
| HF checkpoint | openai/whisper-large-v3-turbo |
| whisper.cpp | CLI at $HOME/bin/whisper-cli, model ggml-large-v3-turbo.bin |
All test clips were mono WAV (16 kHz).
1. What Hugging Face Whisper Looked Like on This Machine
1.1 Device and dtype quirks
If I let Transformers pick the Metal backend (device="mps"), generate() occasionally crashed with RuntimeError: Input type (float) and bias type (c10::Half) should be the same. The model had moved to float16 on GPU while the feature extractor ran on CPU in float32. To keep things reliable, I now move the model back to CPU before creating the pipeline (and the pipeline itself runs with device=-1). CUDA users don’t usually see this; on Apple Silicon the CPU path was the safe bet.
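In the session helper, the workaround boils down to something like the following minimal sketch (it assumes the same openai/whisper-large-v3-turbo checkpoint; the key part is forcing CPU and float32 before wrapping the pipeline):
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# In the real session the model may already sit on MPS in float16; force it back
# so the float32 features from the extractor match the weight dtype.
model = model.to("cpu").to(torch.float32)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=-1,  # -1 = CPU
)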
1.2 Greedy decoding vs. beam search
Short clips were the first surprise: greedy decoding auto-detected Urdu and produced شکریہ. That’s technically correct, but I wanted Devanagari, so I forced the language prompt and enabled beam search:
import torch

# processor and model are the openai/whisper-large-v3-turbo objects loaded above;
# input_features comes from processor(audio_np, sampling_rate=16_000, return_tensors="pt").
forced = processor.get_decoder_prompt_ids(
    language="hi", task="transcribe", no_timestamps=True
)
result = model.generate(
    input_features,
    forced_decoder_ids=forced,
    num_beams=5,
    max_new_tokens=64,
    attention_mask=torch.ones(input_features.shape[:-1], dtype=torch.long),
)
text = processor.batch_decode(result, skip_special_tokens=True)[0]
The pipeline also honours generate_kwargs, so I now pass language and num_beams explicitly every time.
1.3 Long audio chunking
My first attempt at chunking was a naïve slice (range(0, len, chunk_len)), and later segments went missing because I wasn’t overlapping windows or reusing prompts. Switching to the built-in pipeline fixed that for me:
import soundfile as sf
from transformers import pipeline

# 16 kHz mono WAV, as noted in the environment snapshot
audio_np, sr = sf.read("audata/audio/audio_message.wav")

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    device=-1,
)
result = asr(
    {"array": audio_np, "sampling_rate": sr},
    chunk_length_s=30,
    stride_length_s=[10, 10],
    generate_kwargs={"language": "hi", "task": "transcribe", "num_beams": 5},
)
Letting the pipeline handle overlap and context reuse was much less error-prone than reinventing it. In my session helper I also move the cached model back to CPU before instantiating the pipeline so I don’t trip over Metal-related dtype issues.
1.4 Warnings worth heeding
- return_dict_in_generate vs output_scores: if you want per-token scores, you must set both; otherwise the scores are silently dropped (a minimal sketch follows right after this list).
- Language auto-detect: after PR #28687 (Mar 2024), multilingual Whisper defaults to detection + transcription. I now always set language="hi" (or whatever I need) to avoid unexpected translations.
- past_key_values: Transformers 4.47 moves to EncoderDecoderCache. If the legacy tuple is still needed, return_legacy_cache=True keeps things working for now (see the 4.47 release notes).
- Attention mask: pad token equals EOS, so the model can’t infer the mask. Passing an explicit attention_mask stopped the warning and gave me more predictable behaviour (documented in the Whisper generation guide).
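The scores sketch, reusing model, input_features, and forced from section 1.2 (both flags are standard generate() arguments):
out = model.generate(
    input_features,
    forced_decoder_ids=forced,
    return_dict_in_generate=True,  # return a structured output object instead of a bare tensor
    output_scores=True,            # without this, out.scores stays None
)
print(len(out.scores))  # one logits entry per generated token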
1.5 Throughput snapshot
Transcribing a ~90 s clip on CPU, beam search enabled, took roughly 45 s and produced a readable transcript with a few garbled phrases. Good enough for experiments, not ideal for “fast feedback”.
2. What Changed When I Tried whisper.cpp
2.1 CLI run
I pointed the whisper.cpp CLI at the same ~90 s clip using:
whisper-cli \
-m /Users/jaju/github-others/ggerganov/whisper.cpp/models/ggml-large-v3-turbo.bin \
-l hi \
--output-json -of outputs/sample_cpp \
audata/audio/audio_message.wav
The run finished in about 29 s on the same Mac, the transcript looked cleaner, and I didn’t see any repeated segments. For larger batches, that speed difference is hard to ignore.
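To drive the same call from Python, I wrap it with subprocess. This is a rough sketch rather than the project's actual helper; it uses exactly the flags shown above and assumes the usual whisper.cpp behaviour of writing <prefix>.json when --output-json and -of are given:
import json
import subprocess
from pathlib import Path

def run_whisper_cli(wav_path, model_path, language="hi", out_prefix="outputs/sample_cpp"):
    # Same flags as the CLI invocation above.
    subprocess.run(
        [
            "whisper-cli",
            "-m", str(model_path),
            "-l", language,
            "--output-json",
            "-of", out_prefix,
            str(wav_path),
        ],
        check=True,
    )
    # Read back the JSON that whisper-cli wrote next to the -of prefix.
    return json.loads(Path(f"{out_prefix}.json").read_text(encoding="utf-8"))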
2.2 HTTP server
Starting the whisper.cpp server (whisper-server) gave me a simple REST endpoint:
curl 127.0.0.1:8181/inference \
-F file=@audata/audio/audio_message.wav \
-F language="hi" \
-F temperature="0.0" \
-F temperature_inc="0.0" \
-F response_format="verbose_json"
The response includes segment start/end offsets and text. I normalised the whitespace when parsing so the session log just sees plain paragraphs. Latency was around 10–12 s for the same clip, already faster than the HF pipeline.
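The parsing step is roughly this sketch (not the exact session code; the endpoint and form fields mirror the curl call above, and the segment field names are my assumption about the verbose_json layout):
import requests

def transcribe_via_server(wav_path, url="http://127.0.0.1:8181/inference", language="hi"):
    with open(wav_path, "rb") as f:
        resp = requests.post(
            url,
            files={"file": f},
            data={
                "language": language,
                "temperature": "0.0",
                "temperature_inc": "0.0",
                "response_format": "verbose_json",
            },
        )
    resp.raise_for_status()
    payload = resp.json()
    # Collapse whitespace inside each segment so the session log sees plain text.
    return [
        {
            "start": seg.get("start"),
            "end": seg.get("end"),
            "text": " ".join(seg.get("text", "").split()),
        }
        for seg in payload.get("segments", [])
    ]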
2.3 What I have not finished yet
whisper.cpp also supports grammar-based decoding (GBNF). I haven’t tried that in this project, so it’s on the “explore later” list rather than a recommendation.
3. Picking a Backend Day to Day
| Scenario | HF Transformers | whisper.cpp |
|---|---|---|
| Need LoRA / fine-tuning | ✅ | ⚠️ (inference only) |
| Python ecosystem (datasets, analytics) | ✅ | Use CLI / bindings |
| Deployment on macOS without Python | ⚠️ | ✅ |
| Real-time or near real-time | ⚠️ (CPU bound) | ✅ (Metal + quantisation) |
| Grammar/structured output | Custom work required | Built-in (needs more exploration) |
So far the balance for me looks like: use whisper.cpp for fast, repeatable inference; keep Transformers around when I need to experiment or compose models in Python.
4. Wiring whisper.cpp into the Existing Session Code
I extended the session API to support both backends. Usage now looks like this:
from whisper_steer import new_session
sess = new_session() # reads WHISPER_STEER_* env vars
# Run via HF (local pipeline)
hf_res = sess.run_asr("audio_message", language="hi", backend="local")
# Run via whisper.cpp server (defaults to http://127.0.0.1:8181)
server_res = sess.run_asr("audio_message", language="hi", backend="server")
The CLI wrapper shares the same options:
PYTHONPATH=src python tools/quick_transcribe.py audio_message \
--language hi \
--backend server \
--no-log
If I don’t set WHISPER_STEER_SERVER_URL, the code falls back to http://127.0.0.1:8181. Logged results include segments and timestamps regardless of backend.
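The fallback itself is just an environment lookup, roughly (a sketch of the resolution logic, not the exact whisper_steer code):
import os

DEFAULT_SERVER_URL = "http://127.0.0.1:8181"
server_url = os.environ.get("WHISPER_STEER_SERVER_URL", DEFAULT_SERVER_URL)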
5. Practical To‑Dos and Future Work
- Keep whisper.cpp as the first-line inference engine on Mac. It’s faster and produces fewer artefacts out of the box. I can still parse the JSON output, run transliteration, and save the same metadata I used with HF.
- Stay with HF when I need training or cross-model experiments. LoRA/adapters aren’t an option in whisper.cpp, so Transformers remains the place for that work.
- Write down decoding settings. The biggest time sink was forgetting which clip was decoded with which language prompt or beam width. Every run now logs those parameters explicitly.
- Evaluate grammar constraints separately. whisper.cpp’s grammar tooling looks promising, but I want a dedicated test suite before trusting it in production.
- Support post-edit/transliteration in Python. The server already returns clean text; I’ll keep running transliteration (indic-transliteration) and domain-specific post-edit passes after the transcript comes back. A rough sketch of that pass follows below.
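As a reminder of what that pass can look like, here is a minimal sketch with indic-transliteration; the Devanagari-to-IAST direction is only an example, not necessarily what the project ships:
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def romanise(text: str) -> str:
    # Devanagari -> IAST is an illustrative direction; swap schemes as needed.
    return transliterate(text, sanscript.DEVANAGARI, sanscript.IAST)

print(romanise("शुक्रिया"))  # e.g. the short clip from section 1.2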
That’s as far as I took things this round. If I revisit the project, grammar constraints and transliteration quality will probably be next on my list.