Whisper and STT on Apple Silicon

Whisper Notes from My Mac (September 2025)

I spent the past few days gluing together a speech‑to‑text workflow on my Apple Silicon MacBook (macOS 26.0, Python 3.12). Most of what follows is a log of what actually happened: which warnings popped up, how I worked around them, and why I eventually leaned on whisper.cpp. I’m keeping this grounded so I (or anyone else) can repeat the steps later without guesswork.


Environment Snapshot

All version numbers below come straight from the venv and host (sw_vers, python -V, importlib.metadata.version).

Component        Version / detail
macOS            26.0
Python           3.12.5
transformers     4.46.1
torch            2.8.0
soundfile        0.13.1
HF checkpoint    openai/whisper-large-v3-turbo
whisper.cpp      CLI at $HOME/bin/whisper-cli, model ggml-large-v3-turbo.bin

All test clips were mono WAV (16 kHz).


1. What Hugging Face Whisper Looked Like on This Machine

1.1 Device and dtype quirks

If I ran the model on the Metal backend (device="mps"), generate() occasionally crashed with RuntimeError: Input type (float) and bias type (c10::Half) should be the same: the model had been cast to float16 on the GPU while the feature extractor still produced float32 tensors on the CPU. To keep things reliable, I now move the model back to CPU before creating the pipeline (and the pipeline itself runs with device=-1). CUDA users don't usually see this; on Apple Silicon the CPU path was the safe bet.
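
For reference, this is roughly the shape of that workaround; a minimal sketch, assuming only the checkpoint from the environment snapshot:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-large-v3-turbo"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Keep everything on CPU in float32 so the feature extractor's float32
# output never meets a half-precision model sitting on MPS.
model = model.to("cpu").to(torch.float32)
model.eval()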

1.2 Language forcing and beam search

Short clips were the first surprise: greedy decoding auto-detected Urdu and produced شکریہ ("thank you"). That's technically correct, but I wanted Devanagari, so I forced the language prompt and enabled beam search:

import torch

# `processor` and `model` come from the CPU setup above;
# `input_features` is the feature-extractor output for one clip.
forced = processor.get_decoder_prompt_ids(
    language="hi", task="transcribe", no_timestamps=True
)
result = model.generate(
    input_features,
    forced_decoder_ids=forced,
    num_beams=5,
    max_new_tokens=64,
    # Mask over the feature frames (batch, n_frames), not the mel bins.
    attention_mask=torch.ones(
        (input_features.shape[0], input_features.shape[-1]), dtype=torch.long
    ),
)
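
generate() returns token ids, so the last step is decoding them back to text; a one-liner, assuming the snippet above ran as-is:

text = processor.batch_decode(result, skip_special_tokens=True)[0]
print(text)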

The pipeline also honours generate_kwargs, so I now pass language and num_beams explicitly every time.

1.3 Long audio chunking

My first attempt at chunking was a naïve slice (range(0, len(audio), chunk_len)), and later segments went missing because I wasn't overlapping windows or reusing prompts across chunks. Switching to the built-in pipeline fixed that for me:

import soundfile as sf
from transformers import pipeline

# 16 kHz mono WAV, as noted in the environment snapshot.
audio_np, sr = sf.read("audata/audio/audio_message.wav", dtype="float32")

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    device=-1,  # CPU; sidesteps the MPS dtype mismatch above
)

result = asr(
    {"array": audio_np, "sampling_rate": sr},
    chunk_length_s=30,
    stride_length_s=[10, 10],
    generate_kwargs={"language": "hi", "task": "transcribe", "num_beams": 5},
)

Letting the pipeline handle overlap and context reuse was much less error-prone than reinventing it. In my session helper I also move the cached model back to CPU before instantiating the pipeline so I don’t trip over Metal-related dtype issues.
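
Roughly, that helper does the following (a sketch, not the actual session code; build_cpu_pipeline and its arguments are illustrative):

from transformers import pipeline

def build_cpu_pipeline(model, processor):
    # Reuse the already-loaded model instead of letting the pipeline
    # instantiate a fresh copy, and pin it to CPU/float32.
    model = model.to("cpu").float()
    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        device=-1,
    )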

1.4 Warnings worth heeding

  1. return_dict_in_generate vs output_scores: if you want per-token scores, you must set both; otherwise the scores are silently dropped (see the snippet after this list).
  2. Language auto-detect: after PR #28687 (Mar 2024), multilingual Whisper defaults to detection + transcription. I now always set language="hi" (or whatever I need) to avoid unexpected translations.
  3. past_key_values: Transformers 4.47 moves to EncoderDecoderCache. If the legacy tuple is still needed, return_legacy_cache=True keeps things working for now (see the 4.47 release notes).
  4. Attention mask: pad token equals EOS, so the model can’t infer the mask. Passing an explicit attention_mask stopped the warning and gave me more predictable behaviour (documented in the Whisper generation guide).
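
For the first warning, this is what "set both" looks like in practice; a minimal sketch reusing forced and input_features from section 1.2:

out = model.generate(
    input_features,
    forced_decoder_ids=forced,
    num_beams=5,
    max_new_tokens=64,
    return_dict_in_generate=True,  # return a structured output object...
    output_scores=True,            # ...that actually carries the per-step scores
)
print(out.sequences.shape)  # generated token ids
print(len(out.scores))      # one score tensor per generated step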

1.5 Throughput snapshot

Transcribing a ~90 s clip on CPU, beam search enabled, took roughly 45 s and produced a readable transcript with a few garbled phrases. Good enough for experiments, not ideal for “fast feedback”.


2. What Changed When I Tried whisper.cpp

2.1 CLI run

I pointed the whisper.cpp CLI at the same ~90 s clip using:

whisper-cli \
  -m /Users/jaju/github-others/ggerganov/whisper.cpp/models/ggml-large-v3-turbo.bin \
  -l hi \
  --output-json -of outputs/sample_cpp \
  audata/audio/audio_message.wav

The run finished in about 29 s on the same Mac, the transcript looked cleaner, and I didn’t see any repeated segments. For larger batches, that speed difference is hard to ignore.
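
Because of --output-json -of outputs/sample_cpp, the run also leaves outputs/sample_cpp.json behind. A minimal sketch of reading it back into Python, assuming a transcription list with per-segment offsets in milliseconds and text; the field names are from memory, so double-check them against the actual file:

import json
from pathlib import Path

data = json.loads(Path("outputs/sample_cpp.json").read_text(encoding="utf-8"))

# Field names assumed; adjust if your whisper.cpp build emits a different layout.
for seg in data.get("transcription", []):
    start_ms = seg["offsets"]["from"]
    end_ms = seg["offsets"]["to"]
    print(f"[{start_ms / 1000:7.2f} - {end_ms / 1000:7.2f}] {seg['text'].strip()}")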

2.2 HTTP server

Starting the whisper.cpp server (whisper-server) gave me a simple REST endpoint:

curl 127.0.0.1:8181/inference \
  -F file=@audata/audio/audio_message.wav \
  -F language="hi" \
  -F temperature="0.0" \
  -F temperature_inc="0.0" \
  -F response_format="verbose_json"

The response includes segment start/end offsets and text. I normalised the whitespace when parsing so the session log just sees plain paragraphs. Latency was around 10–12 s for the same clip, already faster than the HF pipeline.
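
The same request from Python, plus the whitespace clean-up, looks roughly like this; the text and segments keys are assumptions based on the verbose_json response described above, so verify them against a real response:

import re
import requests

with open("audata/audio/audio_message.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8181/inference",
        files={"file": f},
        data={
            "language": "hi",
            "temperature": "0.0",
            "temperature_inc": "0.0",
            "response_format": "verbose_json",
        },
        timeout=120,
    )
payload = resp.json()

# Collapse stray newlines and double spaces so the session log sees plain paragraphs.
text = re.sub(r"\s+", " ", payload["text"]).strip()
segments = payload.get("segments", [])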

2.3 What I have not finished yet

whisper.cpp also supports grammar-based decoding (GBNF). I haven’t tried that in this project, so it’s on the “explore later” list rather than a recommendation.


3. Picking a Backend Day to Day

Scenario                                  HF Transformers         whisper.cpp
Need LoRA / fine-tuning                   ✅                      ⚠️ (inference only)
Python ecosystem (datasets, analytics)    ✅                      Use CLI / bindings
Deployment on macOS without Python        ⚠️                      ✅
Real-time or near real-time               ⚠️ (CPU bound)          ✅ (Metal + quantisation)
Grammar/structured output                 Custom work required    Built-in (needs more exploration)

So far the balance for me looks like: use whisper.cpp for fast, repeatable inference; keep Transformers around when I need to experiment or compose models in Python.


4. Wiring whisper.cpp into the Existing Session Code

I extended the session API to support both backends. Usage now looks like this:

from whisper_steer import new_session

sess = new_session()  # reads WHISPER_STEER_* env vars

# Run via HF (local pipeline)
hf_res = sess.run_asr("audio_message", language="hi", backend="local")

# Run via whisper.cpp server (defaults to http://127.0.0.1:8181)
server_res = sess.run_asr("audio_message", language="hi", backend="server")

The CLI wrapper shares the same options:

PYTHONPATH=src python tools/quick_transcribe.py audio_message \
    --language hi \
    --backend server \
    --no-log

If I don’t set WHISPER_STEER_SERVER_URL, the code falls back to http://127.0.0.1:8181. Logged results include segments and timestamps regardless of backend.
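
The fallback itself is nothing fancier than an environment lookup; a hypothetical sketch (the names below are illustrative, not the actual whisper_steer internals):

import os

DEFAULT_SERVER_URL = "http://127.0.0.1:8181"

def resolve_server_url() -> str:
    # An explicit env var wins; otherwise use the local whisper-server default.
    return os.environ.get("WHISPER_STEER_SERVER_URL", DEFAULT_SERVER_URL)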


5. Practical To‑Dos and Future Work

  1. Keep whisper.cpp as the first-line inference engine on Mac. It’s faster and produces fewer artefacts out of the box. I can still parse the JSON output, run transliteration, and save the same metadata I used with HF.
  2. Stay with HF when I need training or cross-model experiments. LoRA/adapters aren’t an option in whisper.cpp, so Transformers remains the place for that work.
  3. Write down decoding settings. The biggest time sink was forgetting which clip was decoded with which language prompt or beam width. Every run now logs those parameters explicitly.
  4. Evaluate grammar constraints separately. whisper.cpp’s grammar tooling looks promising, but I want a dedicated test suite before trusting it in production.
  5. Support post-edit/transliteration in Python. The server already returns clean text; I'll keep running transliteration (indic-transliteration) and domain-specific post-edit passes after the transcript comes back (a sketch follows this list).
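
For the transliteration pass, a tiny sketch using indic-transliteration's sanscript API; the Devanagari-to-IAST direction here is only an example of the kind of post-edit step I mean:

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

hi_text = "शुक्रिया"  # e.g. a Devanagari transcript coming back from the server
print(transliterate(hi_text, sanscript.DEVANAGARI, sanscript.IAST))  # roughly "śukriyā"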

That’s as far as I took things this round. If I revisit the project, grammar constraints and transliteration quality will probably be next on my list.