Two Ways a Machine Can Listen: Qwen3-ASR vs Voxtral Realtime
A detailed, source-checked architectural comparison of two speech-to-text systems — one centered on segment-style decoding, one designed for native realtime decoding.
1. Why This Comparison Matters
This article compares two open ASR systems that target similar use cases but are architecturally very different:
- Qwen3-ASR-0.6B (Qwen)
- Voxtral-Mini-4B-Realtime-2602 (Mistral)
Both convert speech to text. Both target practical deployment. But their core inference loops and runtime assumptions are not the same.
That difference matters for engineering decisions in realtime products:
- how to design backend interfaces,
- how to segment and emit partial/final text,
- where to place latency controls,
- and how much of a stack should be model-agnostic.
The focus is architecture and system behavior, not benchmark ranking. Checkpoint-specific numeric values are anchored to released configuration files (config.json, preprocessor_config.json, params.json), with explicit notes where values come from implementation code, paper-reported totals, or runtime behavior.
2. Background: Common STT Architecture Families
There is no canonical finite list of STT architectures. Most production systems are variants, combinations, or evolutions of a small set of recurring families.
2.1 CTC (Connectionist Temporal Classification)
- Typical shape: encoder-only acoustic model, optional external LM for rescoring.
- Intuition: align frame-level audio evidence with token outputs under a monotonic constraint.
- Practical profile: efficient and robust; widely used in offline and near-realtime systems.
2.2 RNN-T / Transducer
- Typical shape: encoder + prediction network + joiner.
- Intuition: produce tokens incrementally while maintaining monotonic alignment.
- Practical profile: a dominant choice for streaming ASR in many production deployments.
2.3 Attention Encoder-Decoder (Seq2Seq)
- Typical shape: encoder builds acoustic representation; decoder autoregressively generates text.
- Intuition: learn richer sequence mapping with flexible conditioning.
- Practical profile: strong quality in offline settings; streaming variants exist but require additional design constraints.
2.4 Two-Pass Hybrids
- Typical shape: fast first pass (often streaming) + slower second pass rescoring/correction.
- Intuition: trade latency and quality by splitting tasks across two inference stages.
2.5 Speech-LLM Hybrids
- Typical shape: speech/audio encoder feeding a language-model decoder.
- Intuition: leverage LM capabilities (prompting, formatting control, multilingual generalization) in ASR.
Qwen3-ASR and Voxtral Realtime both fall into this last category — they fuse an audio encoder with a language model decoder — but they do so in fundamentally different ways.
Key takeaway: a useful cross-model abstraction should target behavioral contracts (session lifecycle, events, capabilities), not force identical internals.
3. Shared Signal Path
Both models follow the same high-level pipeline from waveform to text:
flowchart LR
A["Speech waveform"] --> B["Microphone / ADC (16 kHz mono)"]
B --> C["Feature extraction (mel frames)"]
C --> D["Neural audio-text model"]
D --> E["Token stream"]
E --> F["Transcript text"]
The major differences are in:
- how audio features are encoded,
- how audio and text streams are fused,
- and how decoding steps are scheduled over time.
4. Building Blocks (a brief primer)
4.1 Mel Spectrograms: What the Model Actually “Sees”
Raw audio at 16 kHz means 16,000 samples per second. Most ASR models operate on short-time spectral features instead of raw waveform samples.
The audio is sliced into overlapping windows (typically 25ms wide, sliding every 10ms), and for each window, the energy at different frequency bands is measured. The “mel” part refers to a perceptual scale — it spaces frequency bands the way human hearing works, with finer resolution at low frequencies.
Time →
┌──────────────────────────┐
High freq │░░▓▓░░░░░░▓▓▓▓░░░░░░░░░░░│
│░▓▓▓░░░░░▓▓▓▓▓▓░░░░░░░░░░│
│▓▓▓▓▓░░░▓▓▓▓▓▓▓▓░░░░░░░░░│
│▓▓▓▓▓▓░▓▓▓▓▓▓▓▓▓▓░░░░░░░░│
Low freq │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░▓▓░░░│
└──────────────────────────┘
"Hello" "world" (silence)
Both checkpoints use 128 mel bins and hop length 160 (10ms at 16 kHz), so both produce ~100 mel frames per second.
If the duration is D seconds:
- frame count: T ≈ 100 × D
- mel features: X ∈ ℝ^(128 × T)
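As a concrete illustration, a generic log-mel frontend with these shared hyperparameters can be sketched with librosa. This is illustrative only: both checkpoints ship their own feature extractors with their own normalization, so treat this as a shape check rather than a drop-in replacement.

```python
# Illustrative log-mel frontend with the shared hyperparameters
# (128 bins, window/n_fft 400, hop 160 at 16 kHz). Not the checkpoints' own extractor.
import numpy as np
import librosa

def log_mel(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=128
    )
    return np.log10(np.maximum(mel, 1e-10))   # shape: [128, T], T ≈ 100 × duration

audio = np.zeros(5 * 16_000, dtype=np.float32)   # 5 s of audio
print(log_mel(audio).shape)                      # (128, 501) — about 100 frames per second
```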
4.2 Embeddings and Dimensional Alignment
Transformers operate on embeddings — fixed-length vectors (typically 1024 to 3072 numbers long) that represent meaning in a geometric space. Things that are similar end up close together: the embedding for “dog” is near “puppy” and far from “algebra.”
For multimodal ASR, audio representations must be mapped into the decoder’s hidden space (or a compatible fusion space). The details differ:
- Qwen3-ASR: audio embeddings replace placeholder tokens in a text prompt sequence.
- Voxtral Realtime: audio and text embeddings are added together at each timestep.
Both approaches require audio and text to share the same vector space and dimensionality.
4.3 Bidirectional vs Causal Attention
A transformer processes a sequence of embeddings. Its core mechanism is attention: each position can “look at” other positions to gather context.
Bidirectional Causal
(Qwen3-ASR encoder) (Voxtral encoder)
Position: 1 2 3 4 5 Position: 1 2 3 4 5
1 ✓ ✓ ✓ ✓ ✓ 1 ✓ · · · ·
2 ✓ ✓ ✓ ✓ ✓ 2 ✓ ✓ · · ·
3 ✓ ✓ ✓ ✓ ✓ 3 ✓ ✓ ✓ · ·
4 ✓ ✓ ✓ ✓ ✓ 4 ✓ ✓ ✓ ✓ ·
5 ✓ ✓ ✓ ✓ ✓ 5 ✓ ✓ ✓ ✓ ✓
✓ = can attend · = cannot attend
- Bidirectional: every position sees every other. Richer representations, but requires the complete input.
- Causal: each position only sees past/present. Required for strict realtime — future frames haven’t arrived yet.
4.4 Autoregressive Decoding and KV Cache
Both models generate text autoregressively: they predict one token at a time, and each prediction depends on all previous tokens.
Step 1: [audio context] → predict "The"
Step 2: [audio context] + "The" → predict " revenue"
Step 3: [audio context] + "The revenue" → predict " increased"
...
The decoder maintains a KV cache — a running memory of previously computed attention keys and values — so each new token doesn’t reprocess the full history.
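A toy numpy sketch makes the saving concrete: each new step appends one key/value pair and attends over the stored history instead of re-encoding the whole sequence. Single head, identity projections as stand-ins for the learned Q/K/V matrices.

```python
# Toy single-head attention step with a KV cache (numpy, identity projections).
import numpy as np

class CachedAttention:
    def __init__(self, d: int = 64):
        self.d = d
        self.K = np.zeros((0, d))      # cached keys, one row per past position
        self.V = np.zeros((0, d))      # cached values

    def step(self, x_new: np.ndarray) -> np.ndarray:
        q = k = v = x_new              # stand-ins for learned Q/K/V projections
        self.K = np.vstack([self.K, k])          # append this step's key/value once...
        self.V = np.vstack([self.V, v])
        scores = self.K @ q / np.sqrt(self.d)    # ...and attend over the cached history
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V              # context vector used to predict the next token

attn = CachedAttention()
for _ in range(3):                     # three decode steps; the cache grows by one row each
    ctx = attn.step(np.random.randn(64))
```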
5. Qwen3-ASR-0.6B: Architecture and Runtime Behavior
5.1 Source-Checked Configuration
All values from config.json and preprocessor_config.json:
If you’re skimming: this checkpoint is relatively compact (18-layer audio encoder + 28-layer decoder at hidden size 1024).
| Component | Value | Source field |
|---|---|---|
| Sample rate | 16,000 Hz | preprocessor_config.json |
| Mel bins | 128 | feature_size: 128 |
| n_fft / hop_length | 400 / 160 | preprocessor_config.json |
| Conv downsampling hidden | 480 | downsample_hidden_size |
| Conv strides | [2, 2, 2] (8× total) | 3 Conv2D layers |
| Audio encoder layers | 18 | encoder_layers |
| Audio encoder dim | 896 | d_model |
| Audio encoder heads | 14 | encoder_attention_heads |
| Audio encoder FFN | 3584 | encoder_ffn_dim |
| Audio output projection | 1024 | output_dim |
| Chunk / window | n_window=50, n_window_infer=800 | audio_config |
| Decoder layers | 28 | num_hidden_layers |
| Decoder hidden | 1024 | hidden_size |
| Decoder query heads | 16 | num_attention_heads |
| Decoder KV heads | 8 | num_key_value_heads |
| Decoder head dim | 128 | head_dim |
| Decoder FFN | 3072 | intermediate_size |
| RoPE theta | 1,000,000 | rope_theta |
| Vocabulary | 151,936 | vocab_size |
| Tied embeddings | yes | tie_word_embeddings: true |
| Audio token ID | 151,676 | audio_token_id |
5.2 End-to-End Flow
flowchart LR
A["Audio (16 kHz)"] --> B["Whisper-style mel (128 × T)"]
B --> C["3× Conv2D downsampling (8× in time)"]
C --> D["Audio encoder (18L, d=896, bidirectional/chunked)"]
D --> E["Project 896 → 1024"]
E --> F["Replace audio-pad slots in prompt"]
F --> G["Qwen decoder (28L, d=1024, GQA 16q/8kv)"]
G --> H["Autoregressive token output"]
5.3 Step-by-Step
Feature Extraction
The audio signal is converted into a mel spectrogram using the Whisper feature extractor (n_fft=400, hop_length=160):
mel ∈ ℝ^(128 × T) where T ≈ 100 × duration_seconds
For a 5-second clip: mel has shape [128, 500].
Convolutional Downsampling
Three 2D convolutions with stride 2 shrink both time and frequency:
Input: [128 freq × T time × 1 channel]
Conv1: stride 2, GELU → [64 × T/2 × 480]
Conv2: stride 2, GELU → [32 × T/4 × 480]
Conv3: stride 2, GELU → [16 × T/8 × 480]
Flatten + project: → [T/8, 896]
After 8× time compression: roughly 12.5 frames per second, one embedding per ~80ms.
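A PyTorch sketch of the shape bookkeeping. Kernel size 3 and padding 1 are assumptions for illustration; config.json pins only the three stride-2 stages, the 480-dim hidden width, and the 896-dim projection target.

```python
# Shape bookkeeping for the 8× temporal downsampling (PyTorch sketch, assumed k=3, p=1).
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(1, 480, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(480, 480, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(480, 480, kernel_size=3, stride=2, padding=1), nn.GELU(),
)
proj = nn.Linear(480 * 16, 896)        # flatten the 16 remaining freq bins per frame

mel = torch.randn(1, 1, 128, 500)      # [batch, channel, 128 mel bins, 500 frames] = 5 s
x = convs(mel)                         # [1, 480, 16, 63]  (freq/8, time/8)
x = x.permute(0, 3, 1, 2).flatten(2)   # [1, 63, 7680]  one vector per ~80 ms
print(proj(x).shape)                   # torch.Size([1, 63, 896])
```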
Transformer Encoder (Bidirectional, Chunked)
18 transformer layers, each with:
- Bidirectional self-attention: every frame can see every other frame
- Feed-forward: GELU, 896 → 3584 → 896
- LayerNorm, residual connections
Symbolically, each block:
h'_l = h_l + MHA(LayerNorm(h_l))
h_{l+1} = h'_l + FFN(LayerNorm(h'_l))
For long audio, attention is restricted to within chunks using a block-diagonal mask:
Audio frames: [ chunk 1 | chunk 2 | chunk 3 ]
Attention mask: ┌───┬───┬───┐
│ ✓ │ │ │ ← chunk 1 attends only to chunk 1
├───┼───┼───┤
│ │ ✓ │ │ ← chunk 2 attends only to chunk 2
├───┼───┼───┤
│ │ │ ✓ │ ← chunk 3 attends only to chunk 3
└───┴───┴───┘
flowchart TB
subgraph C1["Chunk 1"]
A1["f1"] --- A2["f2"] --- A3["f3"]
end
subgraph C2["Chunk 2"]
B1["f4"] --- B2["f5"] --- B3["f6"]
end
subgraph C3["Chunk 3"]
D1["f7"] --- D2["f8"] --- D3["f9"]
end
The final output is projected to decoder dimension:
encoder_out ∈ ℝ^(N × 896) → LN → GELU(Linear) → Linear → ℝ^(N × 1024)
Placeholder Replacement (How Audio Meets Text)
The model uses a chat-style prompt with audio placeholder tokens:
<|im_start|>system
{optional context}<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|>×N<|audio_end|><|im_end|>
<|im_start|>assistant
language English<asr_text>
The N pad token embeddings are replaced with the encoder output embeddings:
Before replacement:
[sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
↓ ↓ ↓ ↓ ↓ ↓
text_emb text_emb pad_emb pad_emb ... pad_emb text_emb
After replacement:
[sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
↓ ↓ ↓ ↓ ↓ ↓
text_emb text_emb aud₁ aud₂ ... audₙ text_emb
Formally, let E ∈ ℝ^(L × 1024) be prompt embeddings with audio-pad positions p₁...pₙ:
E'ᵢ = Aⱼ   if i = pⱼ   (replace with audio embedding)
E'ᵢ = Eᵢ   otherwise   (keep text embedding)
The decoder sees a single mixed embedding sequence. It doesn’t know which positions carry audio vs. text information.
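A tensor-level sketch of the replacement (PyTorch-style). The audio token ID 151676 comes from config.json; the embedding table here is a toy stand-in for the decoder's.

```python
# Sketch of placeholder-replacement fusion: overwrite <|audio_pad|> slots with audio vectors.
import torch

def splice_audio(input_ids, text_embed, audio_embeds, audio_token_id=151_676):
    """input_ids: [L], audio_embeds: [N, 1024] from the audio encoder + projection."""
    E = text_embed(input_ids)                        # [L, 1024] prompt embeddings
    pad_positions = (input_ids == audio_token_id)    # the N <|audio_pad|> slots
    assert int(pad_positions.sum()) == audio_embeds.shape[0]
    E = E.clone()
    E[pad_positions] = audio_embeds                  # audio vectors replace pad embeddings
    return E                                         # mixed audio/text sequence for the decoder

vocab, d = 151_936, 1024
emb = torch.nn.Embedding(vocab, d)                   # toy stand-in for the decoder embedding table
ids = torch.tensor([1, 2, 151_676, 151_676, 3])      # prompt with two audio-pad slots
with torch.no_grad():
    mixed = splice_audio(ids, emb, torch.randn(2, d))   # [5, 1024]
```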
Text Decoding
Standard autoregressive generation:
y_t ~ p(y_t | y_<t, E')
The decoder is a 28-layer Qwen3-style transformer with Grouped-Query Attention (GQA): 16 query heads share 8 KV heads (2:1 ratio), halving KV cache size. The feed-forward uses SwiGLU:
FFN(x) = W_down( SiLU(W_gate(x)) ⊙ W_up(x) )
In common Qwen3-ASR decoding templates, generation stops on EOS token IDs (for example, 151645 and 151643); exact stop sets are tokenizer/runtime dependent.
5.4 Pipeline Summary (1 second of audio)
16,000 samples
→ mel: [128, 100] (128 bins × 100 frames)
→ after convs: [~13, 896] (12.5 frames/sec)
→ after enc: [~13, 1024] (projected to decoder dim)
→ replaces ~13 pad tokens in prompt
→ decoder generates ~5-15 text tokens until EOS
5.5 Runtime Implications
Qwen’s official model card reports unified online/offline inference support in their serving stack. In many custom deployments (including ours), integration is segment/batch-style: periodic windows or VAD-bounded turns.
This is important operationally: “supports streaming” at model/tooling level does not force a particular app-level control loop. The bidirectional encoder fundamentally benefits from seeing full audio, so segment-style integration is a natural fit. In this article, “segment-style” describes an integration pattern, not a hard model impossibility.
5.6 Generation Loop Shape
# Simplified pseudocode for segment-style decoding
mel = featurize(audio_segment)                              # [128, T]
audio_embeds = encode_and_project(mel)                      # [N, 1024]
prompt_embeds = replace_pads(build_prompt(), audio_embeds)  # audio injected into prompt

cache = init_kv_cache()
logits, cache = decoder_step(prompt_embeds, cache)          # prefill: process full prompt once
for step in range(max_tokens):                              # unconstrained loop
    token = sample(logits)
    if token in eos_ids:
        break
    yield token
    logits, cache = decoder_step(embed(token), cache)       # feed only the new token back in
The loop runs until the model decides to stop. There is no fixed relationship between the number of audio frames and the number of generated tokens.
6. Voxtral-Mini-4B-Realtime-2602: Architecture and Runtime Behavior
6.1 Source-Checked Configuration
All values from params.json:
If you’re skimming: this checkpoint is larger and explicitly realtime-oriented (causal 32-layer audio encoder + 26-layer 3072-d decoder with delay conditioning).
| Component | Value | Source field |
|---|---|---|
| Sample rate | 16,000 Hz | audio_encoding_args.sampling_rate |
| Frame rate (output) | 12.5 Hz | audio_encoding_args.frame_rate |
| Mel bins | 128 | audio_encoding_args.num_mel_bins |
| Window / hop | window=400, hop=160 | audio_encoding_args |
| Global log mel max | 1.5 | audio_encoding_args.global_log_mel_max |
| Conv stem 1 | causal Conv1D, 128→1280, k=3, s=1 | encoder implementation¹ |
| Conv stem 2 | causal Conv1D, 1280→1280, k=3, s=2 | encoder implementation¹ |
| Audio encoder layers | 32 | encoder_args.n_layers |
| Audio encoder dim | 1280 | encoder_args.dim |
| Audio encoder heads | 32 | encoder_args.n_heads |
| Audio encoder KV heads | 32 (full MHA) | encoder_args.n_kv_heads |
| Audio encoder head dim | 64 | encoder_args.head_dim |
| Audio encoder FFN | 5120 | encoder_args.hidden_dim |
| Audio encoder FFN type | SwiGLU | encoder_args.ffn_type |
| Audio encoder norm | RMSNorm | encoder_args.norm_type |
| Audio encoder position | RoPE | encoder_args.pos_embed |
| Audio encoder attention | causal | encoder_args.causal: true |
| Audio encoder window | 750 frames | encoder_args.sliding_window |
| Encoder biases | yes | encoder_args.use_biases: true |
| Downsample factor | 4 | downsample_args.downsample_factor |
| Adapter MLP | 5120→3072→3072 | (1280×4 concat → project)¹ |
| Decoder layers | 26 | n_layers |
| Decoder dim | 3072 | dim |
| Decoder heads | 32 | n_heads |
| Decoder KV heads | 8 | n_kv_heads |
| Decoder head dim | 128 | head_dim |
| Decoder FFN | 9216 | hidden_dim |
| Decoder biases | no | use_biases: false |
| Decoder sliding window | 8192 | sliding_window |
| Max sequence length | 131,072 | model_max_length |
| Vocabulary | 131,072 (Tekken) | vocab_size |
| Tied embeddings | yes | tied_embeddings: true |
| RoPE theta | 1,000,000 | rope_theta |
| Ada RMS-Norm enabled | yes | ada_rms_norm_t_cond: true |
| Ada RMS-Norm inner dim | 32 | ada_rms_norm_t_cond_dim |
| Total parameters | ~4.4B | encoder 970M + adapter 25M + decoder 3.4B |
¹ Conv stem and adapter MLP architecture details come from implementation code (mlx-audio, vLLM), not from params.json directly. The params file specifies the encoder dimensions and downsample factor; the conv topology is an implementation detail.
frame_rate=12.5 here refers to the stream-synchronous decoding rate after internal downsampling. Frontend mel extraction is still ~100 Hz, with an intermediate ~50 Hz stage after the conv stem.
Note on n_fft: params.json does not specify n_fft. The value 512 used in some implementations (e.g., mlx-audio) is likely an implementation default. Only window_size=400 and hop_length=160 are specified in the primary config.
6.2 End-to-End Flow
flowchart LR
A["Audio (16 kHz)"] --> B["Mel features (128 × T, 100 Hz)"]
B --> C["Causal Conv stem → 50 Hz, d=1280"]
C --> D["Causal encoder (32L, d=1280, sliding window=750)"]
D --> E["Downsample ×4 + adapter MLP → d=3072, 12.5 Hz"]
E --> F["Per-step fusion: audio_t + text_embed(prev token)"]
F --> G["LM decoder (26L, d=3072, GQA 32q/8kv, AdaNorm)"]
G --> H["Token stream (lexical + padding/boundary control tokens)"]
6.3 Step-by-Step
Feature Extraction
Same 128-bin mel spectrogram as Qwen3-ASR, but with Slaney-normalized filterbanks and a different log normalization:
log_spec = log10(max(mel, 1e-10))
log_spec = clamp(log_spec, min=global_log_mel_max - 8.0)
log_spec = (log_spec + 4.0) / 4.0
where global_log_mel_max = 1.5 (from params.json). Not interchangeable with Whisper’s feature extractor.
Causal Convolutional Stem
Two 1D convolutions (not 2D like Qwen3) with causal padding — each output frame depends only on current and past input frames:
Input: [128, T] (128 mel bins as input channels)
Conv1: k=3, s=1, causal → [1280, T] (expand channels, keep time)
Conv2: k=3, s=2, causal → [1280, T/2] (halve time)
Causal vs. standard convolution (kernel=3):
Standard: output[t] depends on input[t-1], input[t], input[t+1]
↑ needs future!
Causal: output[t] depends on input[t-2], input[t-1], input[t]
✓ only past/present
Output rate: 50 Hz (one embedding every 20ms). Dimension: 1280.
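A minimal PyTorch sketch of the causal stem. Kernel sizes and strides follow the table above; the GELU activations are an assumption for illustration.

```python
# Causal 1D convolution by left-padding: pad (k - 1) zeros on the left so
# output[t] never sees input[t+1:].
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, stride=stride)

    def forward(self, x):                              # x: [batch, channels, time]
        return self.conv(F.pad(x, (self.pad, 0)))      # pad the past only, never the future

stem = nn.Sequential(
    CausalConv1d(128, 1280, 3, stride=1), nn.GELU(),   # expand channels, keep 100 Hz
    CausalConv1d(1280, 1280, 3, stride=2), nn.GELU(),  # halve time → 50 Hz
)
print(stem(torch.randn(1, 128, 100)).shape)            # torch.Size([1, 1280, 50])
```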
Causal Transformer Encoder
32 transformer layers process the convolution output. Every frame can only attend to past and present frames — never future frames.
The sliding window of 750 frames provides ~15 seconds of audio context at 50 Hz. This bounds memory for arbitrarily long audio:
Sliding window attention (window=5, for illustration):
Frame: 1 2 3 4 5 6 7 8 9 10
1 [✓ · · · · · · · · · ]
2 [✓ ✓ · · · · · · · · ]
3 [✓ ✓ ✓ · · · · · · · ]
4 [✓ ✓ ✓ ✓ · · · · · · ]
5 [✓ ✓ ✓ ✓ ✓ · · · · · ]
6 [· ✓ ✓ ✓ ✓ ✓ · · · · ] ← frame 1 falls out of window
7 [· · ✓ ✓ ✓ ✓ ✓ · · · ]
...
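The same pattern can be written directly as a boolean mask (conceptual sketch; the real encoder uses window = 750 at 50 Hz):

```python
# Causal sliding-window attention mask (True = may attend).
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)     # query positions
    j = torch.arange(seq_len).unsqueeze(0)     # key positions
    return (j <= i) & (i - j < window)         # past/present only, within the window

print(sliding_window_causal_mask(8, 3).int())
# row t has ones at columns t-2..t — older frames fall out of the window
```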
Each layer uses:
- Causal self-attention with sliding window and RoPE (θ = 10⁶)
- SwiGLU FFN: 1280 → 5120 → 1280
- RMSNorm (pre-norm)
- Full MHA (32 heads, 32 KV heads — no GQA in the encoder)
- Biases on attention projections (unlike the decoder)
4× Downsampling + Adapter MLP
The encoder output (50 Hz, 1280-dim) is too frequent for the decoder. Four consecutive frames are concatenated and projected:
encoder output: e₁ e₂ e₃ e₄ e₅ e₆ e₇ e₈ ... (50 Hz, 1280-dim)
└──────┬──────┘ └──────┬──────┘
concat(4) concat(4)
↓ ↓
[e₁;e₂;e₃;e₄] [e₅;e₆;e₇;e₈] (12.5 Hz, 5120-dim)
↓ ↓
GELU(Linear(5120→3072)) (12.5 Hz, 3072-dim)
↓ ↓
Linear(3072→3072) (12.5 Hz, 3072-dim)
↓ ↓
a₁ a₂ audio embeddings
One audio embedding per 80ms, dimension 3072 — matching the decoder hidden size.
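A sketch of the concat-then-project adapter. Dimensions come from params.json; the exact layer layout is an implementation detail, so the internals here are assumptions.

```python
# 4× downsample by concatenating consecutive frames, then a small MLP to the decoder dim.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioAdapter(nn.Module):
    def __init__(self, enc_dim=1280, dec_dim=3072, factor=4):
        super().__init__()
        self.factor = factor
        self.fc1 = nn.Linear(enc_dim * factor, dec_dim)   # 5120 → 3072
        self.fc2 = nn.Linear(dec_dim, dec_dim)            # 3072 → 3072

    def forward(self, x):                                 # x: [batch, T50, 1280] at 50 Hz
        b, t, d = x.shape
        x = x[:, : t - t % self.factor]                   # drop any ragged tail
        x = x.reshape(b, -1, d * self.factor)             # concat 4 frames → [b, T/4, 5120]
        return self.fc2(F.gelu(self.fc1(x)))              # [b, T/4, 3072] at 12.5 Hz

print(AudioAdapter()(torch.randn(1, 50, 1280)).shape)     # torch.Size([1, 12, 3072])
```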
Additive Fusion (How Audio Meets Text)
Instead of replacing placeholders, Voxtral adds audio and text embeddings at every timestep:
z_t = A_t + e(y_{t-1})
where A_t ∈ ℝ^3072 is the adapted audio embedding and e(y_{t-1}) ∈ ℝ^3072 is the embedding of the previous token.
A concrete trace of “The revenue increased” with delay = 6 frames (480ms):
Time Audio embed Previous token Fused input Output
───── ──────────────── ──────────────── ───────────────── ────────────
0ms a₀ (silence) + embed(BOS) = z₀ → [P] (padding)
80ms a₁ (silence) + embed([P]) = z₁ → [P]
160ms a₂ ("Th-") + embed([P]) = z₂ → [P]
240ms a₃ ("-e re-") + embed([P]) = z₃ → [P]
320ms a₄ ("-venue") + embed([P]) = z₄ → [P]
400ms a₅ (" incr-") + embed([P]) = z₅ → [P]
480ms a₆ ("-eased") + embed([P]) = z₆ → [W] (boundary)
560ms a₇ (...) + embed([W]) = z₇ → "The"
640ms a₈ (...) + embed("The") = z₈ → [W]
720ms a₉ (...) + embed([W]) = z₉ → " revenue"
800ms a₁₀ (...) + embed("revenue")= z₁₀ → [W]
880ms a₁₁ (...) + embed([W]) = z₁₁ → " increased"
...
[P] = “nothing to say yet.” [W] = “word boundary.” Both are filtered from final output.
Why addition, not concatenation? With concatenation, the dimension doubles (6144) and every layer must distinguish which half is audio vs. text. With addition, dimension stays at 3072 and the network learns to disentangle both signals from the combined representation. More parameter-efficient, no architectural changes needed in the decoder.
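A minimal sketch of a single fused decode step under this scheme. The decoder here is a random stand-in for the 26-layer LM, and the token handling is simplified; the real runtime also carries the KV cache and delay conditioning.

```python
# One timeline-locked decode step: add the current audio embedding to the embedding
# of the previously emitted token, then run one decoder step.
import torch
import torch.nn as nn

tok_embed = nn.Embedding(131_072, 3072)        # Tekken-sized toy embedding table

def decoder(z, cache):                         # stand-in for the 26-layer LM + output head
    return torch.randn(131_072), cache

def fused_step(audio_frame_embed, prev_token_id, cache):
    z = audio_frame_embed + tok_embed(torch.tensor(prev_token_id))   # both live in R^3072
    logits, cache = decoder(z, cache)          # one decode step per 80 ms frame
    return int(torch.argmax(logits)), cache    # may be lexical text, [P], or [W]

token, cache = fused_step(torch.randn(3072), prev_token_id=1, cache=None)
```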
Language Decoder with Ada RMS-Norm
The decoder is initialized from Ministral 3B, a general-purpose language model — not a narrow transcription specialist.
The decoder uses GQA with 32 query heads and 8 KV heads (4:1 ratio), SwiGLU FFN (3072 → 9216 → 3072), and a sliding window of 8192 tokens for bounded-memory decoding on arbitrarily long sequences.
Ada RMS-Norm allows a single model to operate at any delay from 80ms to 2400ms. Given target delay τ:
1. Sinusoidal time embedding:
t_embed = sinusoidal_encoding(τ) ∈ ℝ^3072
2. Per-layer MLP (inner dim = 32, tiny):
g_l(τ) = Linear₂(GELU(Linear₁(t_embed))) ∈ ℝ^3072
3. Applied to the FFN branch only:
r_attn = Attention(RMSNorm(h)) ← NOT conditioned on delay
h' = h + r_attn
r_ffn = FFN(RMSNorm(h') ⊙ (1 + g_l(τ))) ← conditioned on delay
y = h' + r_ffn
The ⊙ is element-wise multiplication. When g_l(τ) = 0, this is standard RMSNorm. Non-zero values selectively amplify or suppress dimensions based on delay. Smaller delay → more aggressive early hypotheses; larger delay → wait for more acoustic evidence.
The paper reports that this adds roughly 5M parameters in total. g_l(τ) is computed once per inference session and reused.
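A sketch of the conditioning path described above. Shapes follow params.json; the sinusoidal embedding and gate details are simplified for illustration.

```python
# Delay-conditioned gate applied to the FFN branch only (simplified sketch).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal(tau: float, dim: int = 3072) -> torch.Tensor:
    half = dim // 2
    freqs = torch.exp(-math.log(10_000.0) * torch.arange(half) / half)
    ang = tau * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)])            # [3072]

class AdaGate(nn.Module):
    """Per-layer gate g_l(tau), computed once per session (inner dim 32)."""
    def __init__(self, dim=3072, inner=32):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, inner), nn.Linear(inner, dim)

    def forward(self, t_embed):
        return self.fc2(F.gelu(self.fc1(t_embed)))                # [3072]

gate = AdaGate()(sinusoidal(480.0))    # computed once for delay τ = 480 ms
# Inside a decoder block:
#   h = h + attn(rmsnorm(h))                       # not delay-conditioned
#   h = h + ffn(rmsnorm(h) * (1 + gate))           # FFN branch conditioned on delay
```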
6.4 Pipeline Summary (1 second of audio)
16,000 samples
→ mel: [128, 100] (128 bins × 100 frames, 100 Hz)
→ after convs: [1280, 50] (50 Hz)
→ after enc: [50, 1280] (32 transformer layers)
→ after adapt: [12, 3072] (12.5 Hz, 3072-dim)
→ 12 generation steps, one per frame
→ ~3-8 text tokens + [P]/[W] control tokens
6.5 Runtime Implications
Voxtral’s architecture is natively streaming. A practical integration:
- feeds incremental audio frames,
- advances one decode step per 80ms audio frame,
- emits partial text continuously,
- finalizes according to endpointing/segmentation policy.
6.6 Generation Loop Shape
# Simplified pseudocode for realtime streaming
session = start_stream(delay_ms=480)
for chunk in incoming_audio_chunks:
session.feed_audio(chunk)
while session.has_next_frame_step():
token = session.step() # one decode step per 80ms frame
if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
yield token
for token in session.end_stream():
if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
yield token
The loop is timeline-locked: exactly one generation step per 80ms audio frame. This is structurally different from unconstrained “generate until EOS” loops.
7. Side-by-Side Comparison
7.1 Runtime Flow
sequenceDiagram
participant Mic
participant Q as Qwen-style Pipeline
participant V as Voxtral-style Pipeline
Mic->>Q: audio segment (buffered)
Q->>Q: encode full segment
Q->>Q: decode tokens to EOS
Q-->>Mic: transcript chunk
Mic->>V: audio chunk 1
V->>V: frame-aligned decode steps
V-->>Mic: partial tokens
Mic->>V: audio chunk 2
V->>V: continue frame-aligned steps
V-->>Mic: more partial/final tokens
7.2 Tracing “The revenue increased” Through Both
Qwen3-ASR (segment-style):
1. Capture: accumulate audio in ring buffer until silence/trigger
2. Frontend: mel [128 × 200] from 2 seconds of audio
3. Encode: conv → 18 transformer layers → project → [~25, 1024]
4. Fuse: build prompt, replace 25 pad slots with audio embeddings
5. Decode: generate autoregressively until EOS
Step 1 → "The"
Step 2 → " revenue"
Step 3 → " increased"
Step 4 → "."
Step 5 → EOS
Total latency = (time to accumulate segment) + encode + decode
Voxtral Realtime (streaming, 480ms delay):
1. Session open with τ = 6 frames (480ms)
2. Feed audio continuously
3. Per-frame decode:
t=0ms a₀ + embed(BOS) → [P]
t=80ms a₁ + embed([P]) → [P]
t=160ms a₂ + embed([P]) → [P]
t=240ms a₃ + embed([P]) → [P]
t=320ms a₄ + embed([P]) → [P]
t=400ms a₅ + embed([P]) → [P]
t=480ms a₆ + embed([P]) → [W] (first word ready)
t=560ms a₇ + embed([W]) → "The" ← text appears
t=640ms a₈ + embed("The") → [W]
t=720ms a₉ + embed([W]) → " revenue"
...
Text arrives incrementally, ~480ms behind the speaker.
No need to wait for silence.
7.3 Consolidated Comparison Table
| Aspect | Qwen3-ASR-0.6B | Voxtral-Mini-4B-Realtime-2602 |
|---|---|---|
| Parameters | 0.6B | 4.4B |
| Mel features | 128 bins, Whisper frontend | 128 bins, Slaney-normalized |
| STFT | n_fft=400, hop=160 | window=400, hop=160² |
| Conv downsampling | 3× Conv2D stride 2 (8× total) | 2× CausalConv1D (2×) + adapter 4× (8× total) |
| Audio encoder | 18L, d=896, 14 heads, FFN 3584 | 32L, d=1280, 32 heads, FFN 5120 |
| Encoder attention | Bidirectional, block-chunked | Causal, sliding window 750 |
| Encoder norm | LayerNorm | RMSNorm |
| Encoder position | Sinusoidal (fixed) | RoPE (rotary) |
| Encoder GQA | Full MHA (14/14) | Full MHA (32/32) |
| Audio → decoder dim | 1024 | 3072 |
| Fusion method | Placeholder replacement | Additive per timestep |
| Decoder | 28L, d=1024 | 26L, d=3072 |
| Decoder heads / KV | 16 / 8 (GQA 2:1) | 32 / 8 (GQA 4:1) |
| Decoder FFN | 3072 | 9216 |
| Decoder context | Full sequence | Sliding window 8192 |
| Delay conditioning | None | Ada RMS-Norm (dim 32, paper-reported ~5M params) |
| Vocabulary | 151,936 (Qwen3) | 131,072 (Tekken) |
| Streaming | Official stack supports online/offline; many app integrations remain segment-style | Native realtime |
| Generation loop | Until EOS (unconstrained) | One token per 80ms frame (timeline-locked) |
| Decoder origin | ASR-specific | Ministral 3B (general LLM) |
| Quantized size | ~0.6 GB (8-bit) | ~3.1 GB (4-bit) |
² n_fft not specified in params.json. Implementations may vary.
8. The Decoder as Language Model
8.1 What This Means
Qwen3-ASR’s decoder is a small (0.6B) ASR-focused decoder. In practice, it is typically used for transcription rather than reliable instruction-style rewriting/summarization, so post-transcription cleanup is usually delegated to a separate LLM.
Voxtral’s decoder is Ministral 3B: a general-purpose language model that was taught to understand audio. This is a fundamentally different capability profile.
8.2 Can Voxtral’s Decoder Rewrite Text After Transcription?
Once the audio stream ends, the model could theoretically continue generating in text-only mode. With no audio signal, additive fusion becomes 0 + text_embed = text_embed — effectively a standard LM forward pass.
This is untested. The model was trained with audio always present. Whether fine-tuning preserved or destroyed Ministral 3B’s text-only instruction-following is an empirical question that should be validated before building architecture around it.
A practical policy:
- Attempt backend-native rewrite only if capability is validated.
- Otherwise fall back to a dedicated text LLM post-processor.
8.3 Can You Swap the Decoder for a Different Language Model?
The paper’s training recipe is designed to support this:
- Phase 1 (5% of training): Freeze decoder. Train only encoder + adapter. LR: 4×10⁻⁴. This lets the randomly initialized encoder produce useful embeddings without destabilizing the pre-trained decoder.
- Phase 2 (95% of training): Unfreeze everything. Train end-to-end. LR: 6×10⁻⁵, batch size ~370 hours, AdamW with z-loss regularization.
To swap the decoder:
- Pick an LM with compatible architecture. If hidden dim ≠ 3072, retrain the adapter.
- Add Ada RMS-Norm conditioning (~5M extra params).
- Follow the same two-phase recipe with speech + word-level timestamps in target language(s).
The bottleneck is data: the DSM framework requires knowing exactly when each word was spoken to construct training targets. Large-scale word-level timestamp annotations are expensive.
9. Abstraction Design Implications
9.1 What Can Be Unified
- Input audio format contract (16 kHz mono float32).
- Output text event contract (partial/final tokens, segment boundaries).
- Session lifecycle and error handling.
- Metrics and tracing hooks.
9.2 What Should Remain Backend-Specific
- Feature extractor internals (Whisper vs. Slaney mel).
- Tokenizer internals and special token semantics.
- Generation loop mechanics (unconstrained vs. timeline-locked).
- Delay / endpointing control policy.
9.3 Failure Modes to Avoid
Over-abstraction: if the abstraction hides critical timing semantics, you get incorrect latency assumptions, brittle endpointing, and misleading capability signals.
Under-abstraction: if every backend leaks all internals upward, you get duplicated pipeline logic, inconsistent event semantics, and harder testing.
9.4 Practical Interface Shape
A capability-driven session interface:
- start_session(...)
- feed_audio(...)
- poll_events(...)
- end_session(...)
with explicit capability flags:
- native_streaming: bool
- delay_control: bool
- word_timestamps: bool
- rewrite_experimental: bool
This preserves model differences without duplicating pipeline/UI logic.
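One possible shape for such an interface, written as a Python protocol. The names here are illustrative, not an existing library API.

```python
# Capability-driven session interface (illustrative sketch).
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol

@dataclass(frozen=True)
class Capabilities:
    native_streaming: bool
    delay_control: bool
    word_timestamps: bool
    rewrite_experimental: bool

@dataclass
class TranscriptEvent:
    text: str
    is_final: bool
    segment_id: int

class ASRSession(Protocol):
    capabilities: Capabilities

    def start_session(self, sample_rate: int = 16_000, delay_ms: Optional[int] = None) -> None: ...
    def feed_audio(self, pcm_f32_mono: bytes) -> None: ...
    def poll_events(self) -> Iterable[TranscriptEvent]: ...
    def end_session(self) -> Iterable[TranscriptEvent]: ...
```

Backends that lack a capability (for example, delay_control on a segment-style backend) simply report it as False, and the application layer decides how to degrade.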
10. Runtime Considerations
10.1 Throughput vs Decode Schedule
Two distinct constraints:
- Segment-style: total decode time must be less than segment wall time.
- Timeline-locked: each decode step must complete within its 80ms frame budget.
Similar in spirit, different in failure mode.
10.2 Endpointing
Even with native realtime models, turn segmentation remains an application-level choice:
- VAD-driven endpointing,
- model-token-driven endpointing (sequences of [P] tokens as a silence signal; see the sketch after this list),
- or hybrid endpointing.
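As an example of the model-token-driven option referenced above, a sketch that closes a turn after a run of consecutive [P] predictions. The threshold is illustrative.

```python
# Treat a run of consecutive [P] (padding) tokens as silence and close the segment.
def endpoint_on_padding(token_stream, pad_token, silence_frames=10):
    """With 80 ms frames, silence_frames=10 ≈ 800 ms of predicted silence."""
    run = 0
    for token in token_stream:
        if token == pad_token:
            run += 1
            if run == silence_frames:
                yield "END_OF_TURN"      # hand off to the segmentation policy
        else:
            run = 0
            yield token

stream = ["[P]"] * 12 + ["The"]
print(list(endpoint_on_padding(stream, pad_token="[P]")))   # ['END_OF_TURN', 'The']
```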
10.3 Memory
Realtime note-taking products may co-host:
- ASR backend (~0.6 GB or ~3.1 GB),
- optional intent model,
- optional rewrite LLM (~3-4 GB).
If Voxtral’s decoder can handle rewriting, total memory drops significantly. If not, Voxtral (3.1 GB) + external LLM (3-4 GB) may strain 16 GB machines.
10.4 Partial Transcript Stability
Aggressive low-delay settings increase early revisions; conservative delay reduces revisions but increases perceived lag. This tradeoff is orthogonal to raw WER and should be evaluated separately.
11. Source Validation Notes
11.1 Common Misreports for Qwen3-ASR-0.6B
Values frequently wrong in derivative writeups:
| Field | Often stated | Actual (from config.json) |
|---|---|---|
| Decoder hidden size | 2048 | 1024 |
| Audio encoder dim | 1024 | 896 |
| Audio encoder layers | 24 | 18 |
| Audio encoder FFN | 4096 | 3584 |
| Audio output projection | 2048 | 1024 |
These mismatches arise because many discussions blend paper-level reference architectures, larger Qwen3 variants, and checkpoint-specific configs. For deployment, the checkpoint config is the source of truth.
11.2 Voxtral Variant Confusion
The earlier Voxtral-Mini-3B-2507 (non-realtime) has a different architecture:
- Bidirectional encoder (not causal)
- Cross-attention fusion (not additive)
- Different decoder dimensions
Values from the non-realtime variant should not be mixed with the Realtime-2602 checkpoint. Always verify against the specific checkpoint’s params.json.
11.3 Validation Checklist
When documenting ASR checkpoints:
- Prefer config.json / params.json over third-party summaries.
- Explicitly verify: layer counts, hidden/FFN dims, head counts, sliding windows, vocab size, special token IDs.
- Cite both paper and checkpoint config; treat checkpoint config as definitive.
- Pin commit hashes for strict reproducibility.
12. References
Primary sources used for numbers and claims:
- Qwen3-ASR-0.6B model card: huggingface.co/Qwen/Qwen3-ASR-0.6B
- Qwen3-ASR-0.6B config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/config.json
- Qwen3-ASR-0.6B preprocessor_config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/preprocessor_config.json
- Qwen3-ASR technical report: arxiv.org/abs/2601.21337
- Voxtral-Mini-4B-Realtime-2602 model card: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
- Voxtral-Mini-4B-Realtime-2602 params.json: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602/blob/main/params.json
- Voxtral Realtime paper (DSM architecture): arxiv.org/abs/2602.11298
- MLX-community Voxtral 4-bit weights: huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit
Written while building voissistant, a real-time speech-to-text system for Apple Silicon using MLX.