Two Ways a Machine Can Listen: Qwen3-ASR vs Voxtral Realtime

A detailed, source-checked architectural comparison of two speech-to-text systems — one centered on segment-style decoding, one designed for native realtime decoding.


1. Why This Comparison Matters

This article compares two open ASR systems that target similar use cases but are architecturally very different:

  • Qwen3-ASR-0.6B (Qwen)
  • Voxtral-Mini-4B-Realtime-2602 (Mistral)

Both convert speech to text. Both target practical deployment. But their core inference loops and runtime assumptions are not the same.

That difference matters for engineering decisions in realtime products:

  • how to design backend interfaces,
  • how to segment and emit partial/final text,
  • where to place latency controls,
  • and how much of a stack should be model-agnostic.

The focus is architecture and system behavior, not benchmark ranking. Checkpoint-specific numeric values are anchored to released configuration files (config.json, preprocessor_config.json, params.json), with explicit notes where values come from implementation code, paper-reported totals, or runtime behavior.


2. Background: Common STT Architecture Families

There is no canonical finite list of STT architectures. Most production systems are variants, combinations, or evolutions of a small set of recurring families.

2.1 CTC (Connectionist Temporal Classification)

  • Typical shape: encoder-only acoustic model, optional external LM for rescoring.
  • Intuition: align frame-level audio evidence with token outputs under a monotonic constraint.
  • Practical profile: efficient and robust; widely used in offline and near-realtime systems.

2.2 RNN-T / Transducer

  • Typical shape: encoder + prediction network + joiner.
  • Intuition: produce tokens incrementally while maintaining monotonic alignment.
  • Practical profile: a dominant choice for streaming ASR in many production deployments.

2.3 Attention Encoder-Decoder (Seq2Seq)

  • Typical shape: encoder builds acoustic representation; decoder autoregressively generates text.
  • Intuition: learn richer sequence mapping with flexible conditioning.
  • Practical profile: strong quality in offline settings; streaming variants exist but require additional design constraints.

2.4 Two-Pass Hybrids

  • Typical shape: fast first pass (often streaming) + slower second pass rescoring/correction.
  • Intuition: trade latency and quality by splitting tasks across two inference stages.

2.5 Speech-LLM Hybrids

  • Typical shape: speech/audio encoder feeding a language-model decoder.
  • Intuition: leverage LM capabilities (prompting, formatting control, multilingual generalization) in ASR.

Qwen3-ASR and Voxtral Realtime both fall into this last category — they fuse an audio encoder with a language model decoder — but they do so in fundamentally different ways.

Key takeaway: a useful cross-model abstraction should target behavioral contracts (session lifecycle, events, capabilities), not force identical internals.


3. Shared Signal Path

Both models follow the same high-level pipeline from waveform to text:

flowchart LR
    A["Speech waveform"] --> B["Microphone / ADC (16 kHz mono)"]
    B --> C["Feature extraction (mel frames)"]
    C --> D["Neural audio-text model"]
    D --> E["Token stream"]
    E --> F["Transcript text"]

The major differences are in:

  • how audio features are encoded,
  • how audio and text streams are fused,
  • and how decoding steps are scheduled over time.

4. Building Blocks (a brief primer)

4.1 Mel Spectrograms: What the Model Actually “Sees”

Raw audio at 16 kHz means 16,000 samples per second. Most ASR models operate on short-time spectral features instead of raw waveform samples.

The audio is sliced into overlapping windows (typically 25ms wide, sliding every 10ms), and for each window, the energy at different frequency bands is measured. The “mel” part refers to a perceptual scale — it spaces frequency bands the way human hearing works, with finer resolution at low frequencies.

                    Time →
            ┌──────────────────────────┐
  High freq │░░▓▓░░░░░░▓▓▓▓░░░░░░░░░░░│
            │░▓▓▓░░░░░▓▓▓▓▓▓░░░░░░░░░░│
            │▓▓▓▓▓░░░▓▓▓▓▓▓▓▓░░░░░░░░░│
            │▓▓▓▓▓▓░▓▓▓▓▓▓▓▓▓▓░░░░░░░░│
  Low freq  │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░▓▓░░░│
            └──────────────────────────┘
              "Hello"       "world"  (silence)

Both checkpoints use 128 mel bins and hop length 160 (10ms at 16 kHz), so both produce ~100 mel frames per second.

If duration is D seconds:

  • T ≈ 100 × D
  • mel features: X ∈ ℝ^(128 × T)
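
A quick arithmetic check of these shapes, as a minimal sketch in plain Python (frame counts can differ by one or two depending on padding conventions):

SAMPLE_RATE = 16_000
HOP_LENGTH = 160              # 10ms hop shared by both checkpoints
N_MELS = 128

def approx_mel_shape(duration_seconds: float) -> tuple[int, int]:
    # ~100 frames per second of audio; the exact count depends on STFT padding.
    num_samples = int(duration_seconds * SAMPLE_RATE)
    num_frames = num_samples // HOP_LENGTH
    return (N_MELS, num_frames)

print(approx_mel_shape(5.0))  # (128, 500)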

4.2 Embeddings and Dimensional Alignment

Transformers operate on embeddings — fixed-length vectors (1,024 dimensions in Qwen3-ASR’s decoder, 3,072 in Voxtral’s) that represent meaning in a geometric space. Similar things end up close together: the embedding for “dog” is near “puppy” and far from “algebra.”

For multimodal ASR, audio representations must be mapped into the decoder’s hidden space (or a compatible fusion space). The details differ:

  • Qwen3-ASR: audio embeddings replace placeholder tokens in a text prompt sequence.
  • Voxtral Realtime: audio and text embeddings are added together at each timestep.

Both approaches require audio and text to share the same vector space and dimensionality.

4.3 Bidirectional vs Causal Attention

A transformer processes a sequence of embeddings. Its core mechanism is attention: each position can “look at” other positions to gather context.

  Bidirectional                 Causal
  (Qwen3-ASR encoder)          (Voxtral encoder)

  Position: 1 2 3 4 5          Position: 1 2 3 4 5
       1    ✓ ✓ ✓ ✓ ✓               1    ✓ · · · ·
       2    ✓ ✓ ✓ ✓ ✓               2    ✓ ✓ · · ·
       3    ✓ ✓ ✓ ✓ ✓               3    ✓ ✓ ✓ · ·
       4    ✓ ✓ ✓ ✓ ✓               4    ✓ ✓ ✓ ✓ ·
       5    ✓ ✓ ✓ ✓ ✓               5    ✓ ✓ ✓ ✓ ✓

  ✓ = can attend    · = cannot attend
  • Bidirectional: every position sees every other. Richer representations, but requires the complete input.
  • Causal: each position only sees past/present. Required for strict realtime — future frames haven’t arrived yet.
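
To make the two patterns concrete, here is a small numpy sketch of the masks (True = may attend); it shows only the masking rule, not a full attention implementation:

import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    # Every position may attend to every other position (needs the complete input).
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    # Position i may attend only to positions j <= i (past and present).
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(5).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]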

4.4 Autoregressive Decoding and KV Cache

Both models generate text autoregressively: they predict one token at a time, and each prediction depends on all previous tokens.

Step 1: [audio context]                   → predict "The"
Step 2: [audio context] + "The"           → predict " revenue"
Step 3: [audio context] + "The revenue"   → predict " increased"
...

The decoder maintains a KV cache — a running memory of previously computed attention keys and values — so each new token doesn’t reprocess the full history.
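
A toy numpy sketch of why the cache helps: each step computes keys/values only for the new position, appends them, and attends over the cached history (single head, illustrative only):

import numpy as np

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
cached_K, cached_V = [], []                     # one cached entry per processed position

def decode_step(x_new: np.ndarray) -> np.ndarray:
    cached_K.append(x_new @ W_k)                # compute K/V for the new position only
    cached_V.append(x_new @ W_v)
    K, V = np.stack(cached_K), np.stack(cached_V)
    scores = K @ (x_new @ W_q) / np.sqrt(d)     # new query attends over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # context vector for the new token

for _ in range(3):
    decode_step(rng.standard_normal(d))         # history grows; nothing is recomputed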


5. Qwen3-ASR-0.6B: Architecture and Runtime Behavior

5.1 Source-Checked Configuration

All values from config.json and preprocessor_config.json:

If you’re skimming: this checkpoint is relatively compact (18-layer audio encoder + 28-layer decoder at hidden size 1024).

Component                  Value                              Source field
Sample rate                16,000 Hz                          preprocessor_config.json
Mel bins                   128                                feature_size: 128
n_fft / hop_length         400 / 160                          preprocessor_config.json
Conv downsampling hidden   480                                downsample_hidden_size
Conv strides               [2, 2, 2] (8× total)               3 Conv2D layers
Audio encoder layers       18                                 encoder_layers
Audio encoder dim          896                                d_model
Audio encoder heads        14                                 encoder_attention_heads
Audio encoder FFN          3584                               encoder_ffn_dim
Audio output projection    1024                               output_dim
Chunk / window             n_window=50, n_window_infer=800    audio_config
Decoder layers             28                                 num_hidden_layers
Decoder hidden             1024                               hidden_size
Decoder query heads        16                                 num_attention_heads
Decoder KV heads           8                                  num_key_value_heads
Decoder head dim           128                                head_dim
Decoder FFN                3072                               intermediate_size
RoPE theta                 1,000,000                          rope_theta
Vocabulary                 151,936                            vocab_size
Tied embeddings            yes                                tie_word_embeddings: true
Audio token ID             151,676                            audio_token_id

5.2 End-to-End Flow

flowchart LR
    A["Audio (16 kHz)"] --> B["Whisper-style mel (128 × T)"]
    B --> C["3× Conv2D downsampling (8× in time)"]
    C --> D["Audio encoder (18L, d=896, bidirectional/chunked)"]
    D --> E["Project 896 → 1024"]
    E --> F["Replace audio-pad slots in prompt"]
    F --> G["Qwen decoder (28L, d=1024, GQA 16q/8kv)"]
    G --> H["Autoregressive token output"]

5.3 Step-by-Step

Feature Extraction

The audio signal is converted into a mel spectrogram using the Whisper feature extractor (n_fft=400, hop_length=160):

mel ∈ ℝ^(128 × T)     where T ≈ 100 × duration_seconds

For a 5-second clip: mel has shape [128, 500].

Convolutional Downsampling

Three 2D convolutions with stride 2 shrink both time and frequency:

Input:  [128 freq × T time × 1 channel]
Conv1:  stride 2, GELU → [64 × T/2 × 480]
Conv2:  stride 2, GELU → [32 × T/4 × 480]
Conv3:  stride 2, GELU → [16 × T/8 × 480]
Flatten + project:      → [T/8, 896]

After 8× time compression: roughly 12.5 frames per second, one embedding per ~80ms.

Transformer Encoder (Bidirectional, Chunked)

18 transformer layers, each with:

  • Bidirectional self-attention: every frame can see every other frame
  • Feed-forward: GELU, 896 → 3584 → 896
  • LayerNorm, residual connections

Symbolically, each block:

h'_l = h_l + MHA(LayerNorm(h_l))
h_{l+1} = h'_l + FFN(LayerNorm(h'_l))

For long audio, attention is restricted to within chunks using a block-diagonal mask:

Audio frames:     [  chunk 1   |  chunk 2   |  chunk 3   ]

Attention mask:   ┌───┬───┬───┐
                  │ ✓ │   │   │  ← chunk 1 attends only to chunk 1
                  ├───┼───┼───┤
                  │   │ ✓ │   │  ← chunk 2 attends only to chunk 2
                  ├───┼───┼───┤
                  │   │   │ ✓ │  ← chunk 3 attends only to chunk 3
                  └───┴───┴───┘
flowchart TB
    subgraph C1["Chunk 1"]
    A1["f1"] --- A2["f2"] --- A3["f3"]
    end
    subgraph C2["Chunk 2"]
    B1["f4"] --- B2["f5"] --- B3["f6"]
    end
    subgraph C3["Chunk 3"]
    D1["f7"] --- D2["f8"] --- D3["f9"]
    end
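
A numpy sketch of the block-diagonal mask (chunk sizes here are illustrative; the real sizes follow n_window / n_window_infer from the config):

import numpy as np

def block_diagonal_mask(chunk_sizes: list[int]) -> np.ndarray:
    # Frames attend only to frames within the same chunk.
    n = sum(chunk_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in chunk_sizes:
        mask[start:start + size, start:start + size] = True
        start += size
    return mask

print(block_diagonal_mask([3, 3, 3]).astype(int))  # three chunks of three frames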

The final output is projected to decoder dimension:

encoder_out ∈ ℝ^(N × 896) → LN → GELU(Linear) → Linear → ℝ^(N × 1024)

Placeholder Replacement (How Audio Meets Text)

The model uses a chat-style prompt with audio placeholder tokens:

<|im_start|>system
{optional context}<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|>×N<|audio_end|><|im_end|>
<|im_start|>assistant
language English<asr_text>

The N pad token embeddings are replaced with the encoder output embeddings:

Before replacement:
  [sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
       ↓            ↓          ↓      ↓           ↓         ↓
    text_emb     text_emb   pad_emb pad_emb ... pad_emb  text_emb

After replacement:
  [sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
       ↓            ↓          ↓      ↓           ↓         ↓
    text_emb     text_emb   aud₁    aud₂   ... audₙ     text_emb

Formally, let E ∈ ℝ^(L × 1024) be prompt embeddings with audio-pad positions p₁...pₙ:

  • E'ᵢ = Aⱼ if i = pⱼ (replace with audio embedding)
  • E'ᵢ = Eᵢ otherwise (keep text embedding)

The decoder sees a single mixed embedding sequence. It doesn’t know which positions carry audio vs. text information.
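
A minimal numpy sketch of the replacement step (real implementations perform the same scatter on tensors, keyed by positions of the audio pad token):

import numpy as np

AUDIO_TOKEN_ID = 151_676                        # audio_token_id from config.json

def inject_audio(prompt_embeds: np.ndarray,     # [L, 1024] embedded prompt
                 audio_embeds: np.ndarray,      # [N, 1024] projected encoder output
                 token_ids: np.ndarray) -> np.ndarray:
    pad_positions = np.where(token_ids == AUDIO_TOKEN_ID)[0]
    assert len(pad_positions) == len(audio_embeds), "one pad slot per audio embedding"
    mixed = prompt_embeds.copy()
    mixed[pad_positions] = audio_embeds         # overwrite pad embeddings in place
    return mixed                                # the decoder sees one mixed sequence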

Text Decoding

Standard autoregressive generation:

y_t ~ p(y_t | y_<t, E')

The decoder is a 28-layer Qwen3-style transformer with Grouped-Query Attention (GQA): 16 query heads share 8 KV heads (2:1 ratio), halving KV cache size. The feed-forward uses SwiGLU:

FFN(x) = W_down( SiLU(W_gate(x)) ⊙ W_up(x) )

In common Qwen3-ASR decoding templates, generation stops on EOS token IDs (for example, 151645 and 151643); exact stop sets are tokenizer/runtime dependent.

5.4 Pipeline Summary (1 second of audio)

16,000 samples
  → mel:         [128, 100]     (128 bins × 100 frames)
  → after convs: [~13, 896]    (12.5 frames/sec)
  → after enc:   [~13, 1024]   (projected to decoder dim)
  → replaces ~13 pad tokens in prompt
  → decoder generates ~5-15 text tokens until EOS

5.5 Runtime Implications

Qwen’s official model card reports unified online/offline inference support in their serving stack. In many custom deployments (including ours), integration is segment/batch-style: periodic windows or VAD-bounded turns.

This is important operationally: “supports streaming” at model/tooling level does not force a particular app-level control loop. The bidirectional encoder fundamentally benefits from seeing full audio, so segment-style integration is a natural fit. In this article, “segment-style” describes an integration pattern, not a hard model impossibility.

5.6 Generation Loop Shape

# Simplified pseudocode for segment-style decoding
mel = featurize(audio_segment)
audio_embeds = encode_and_project(mel)       # [N, 1024]
prompt_embeds = replace_pads(audio_embeds)   # audio injected into prompt

cache = init_kv_cache()
logits, cache = decoder_prefill(prompt_embeds, cache)    # process the full prompt once
for step in range(max_tokens):               # unconstrained loop
    token = sample(logits)
    if token in eos_ids:
        break
    yield token
    logits, cache = decoder_step(embed(token), cache)    # feed only the new token

The loop runs until the model decides to stop. There is no fixed relationship between the number of audio frames and the number of generated tokens.


6. Voxtral-Mini-4B-Realtime-2602: Architecture and Runtime Behavior

6.1 Source-Checked Configuration

All values from params.json:

If you’re skimming: this checkpoint is larger and explicitly realtime-oriented (causal 32-layer audio encoder + 26-layer 3072-d decoder with delay conditioning).

Component                   Value                                         Source field
Sample rate                 16,000 Hz                                     audio_encoding_args.sampling_rate
Frame rate (output)         12.5 Hz                                       audio_encoding_args.frame_rate
Mel bins                    128                                           audio_encoding_args.num_mel_bins
Window / hop                window=400, hop=160                           audio_encoding_args
Global log mel max          1.5                                           audio_encoding_args.global_log_mel_max
Conv stem 1                 causal Conv1D, 128→1280, k=3, s=1             encoder implementation¹
Conv stem 2                 causal Conv1D, 1280→1280, k=3, s=2            encoder implementation¹
Audio encoder layers        32                                            encoder_args.n_layers
Audio encoder dim           1280                                          encoder_args.dim
Audio encoder heads         32                                            encoder_args.n_heads
Audio encoder KV heads      32 (full MHA)                                 encoder_args.n_kv_heads
Audio encoder head dim      64                                            encoder_args.head_dim
Audio encoder FFN           5120                                          encoder_args.hidden_dim
Audio encoder FFN type      SwiGLU                                        encoder_args.ffn_type
Audio encoder norm          RMSNorm                                       encoder_args.norm_type
Audio encoder position      RoPE                                          encoder_args.pos_embed
Audio encoder attention     causal                                        encoder_args.causal: true
Audio encoder window        750 frames                                    encoder_args.sliding_window
Encoder biases              yes                                           encoder_args.use_biases: true
Downsample factor           4                                             downsample_args.downsample_factor
Adapter MLP                 5120→3072→3072 (1280×4 concat → project)      implementation¹
Decoder layers              26                                            n_layers
Decoder dim                 3072                                          dim
Decoder heads               32                                            n_heads
Decoder KV heads            8                                             n_kv_heads
Decoder head dim            128                                           head_dim
Decoder FFN                 9216                                          hidden_dim
Decoder biases              no                                            use_biases: false
Decoder sliding window      8192                                          sliding_window
Max sequence length         131,072                                       model_max_length
Vocabulary                  131,072 (Tekken)                              vocab_size
Tied embeddings             yes                                           tied_embeddings: true
RoPE theta                  1,000,000                                     rope_theta
Ada RMS-Norm enabled        yes                                           ada_rms_norm_t_cond: true
Ada RMS-Norm inner dim      32                                            ada_rms_norm_t_cond_dim
Total parameters            ~4.4B                                         encoder 970M + adapter 25M + decoder 3.4B

¹ Conv stem and adapter MLP architecture details come from implementation code (mlx-audio, vLLM), not from params.json directly. The params file specifies the encoder dimensions and downsample factor; the conv topology is an implementation detail.

frame_rate=12.5 here refers to the stream-synchronous decoding rate after internal downsampling. Frontend mel extraction is still ~100 Hz, with an intermediate ~50 Hz stage after the conv stem.

Note on n_fft: params.json does not specify n_fft. The value 512 used in some implementations (e.g., mlx-audio) is likely an implementation default. Only window_size=400 and hop_length=160 are specified in the primary config.

6.2 End-to-End Flow

flowchart LR
    A["Audio (16 kHz)"] --> B["Mel features (128 × T, 100 Hz)"]
    B --> C["Causal Conv stem → 50 Hz, d=1280"]
    C --> D["Causal encoder (32L, d=1280, sliding window=750)"]
    D --> E["Downsample ×4 + adapter MLP → d=3072, 12.5 Hz"]
    E --> F["Per-step fusion: audio_t + text_embed(prev token)"]
    F --> G["LM decoder (26L, d=3072, GQA 32q/8kv, AdaNorm)"]
    G --> H["Token stream (lexical + padding/boundary control tokens)"]

6.3 Step-by-Step

Feature Extraction

Same 128-bin mel spectrogram as Qwen3-ASR, but with Slaney-normalized filterbanks and a different log normalization:

log_spec = log10(max(mel, 1e-10))
log_spec = clamp(log_spec, min=global_log_mel_max - 8.0)
log_spec = (log_spec + 4.0) / 4.0

where global_log_mel_max = 1.5 (from params.json). Not interchangeable with Whisper’s feature extractor.
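
In numpy, the normalization above looks roughly like this (mel is the raw 128 × T mel matrix; the filterbank construction itself is implementation-specific):

import numpy as np

GLOBAL_LOG_MEL_MAX = 1.5                        # from params.json

def normalize_log_mel(mel: np.ndarray) -> np.ndarray:
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, GLOBAL_LOG_MEL_MAX - 8.0)   # clamp from below
    return (log_spec + 4.0) / 4.0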

Causal Convolutional Stem

Two 1D convolutions (not 2D like Qwen3) with causal padding — each output frame depends only on current and past input frames:

Input:   [128, T]  (128 mel bins as input channels)
Conv1:   k=3, s=1, causal → [1280, T]     (expand channels, keep time)
Conv2:   k=3, s=2, causal → [1280, T/2]   (halve time)

Causal vs. standard convolution (kernel=3):

Standard:  output[t] depends on input[t-1], input[t], input[t+1]
                                                       ↑ needs future!
Causal:    output[t] depends on input[t-2], input[t-1], input[t]
                                                        ✓ only past/present

Output rate: 50 Hz (one embedding every 20ms). Dimension: 1280.
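
One common way to implement a causal 1D convolution is to left-pad the input by kernel_size - 1 before a standard valid convolution; a single-channel numpy sketch:

import numpy as np

def causal_conv1d(x: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    # Left-pad so output[t] depends only on x[:t+1]; no future samples leak in.
    k = len(kernel)
    x_padded = np.concatenate([np.zeros(k - 1), x])
    out = np.array([x_padded[t:t + k] @ kernel for t in range(len(x))])
    return out[::stride]                        # stride 2 halves the frame rate

print(causal_conv1d(np.arange(10.0), np.array([0.2, 0.3, 0.5]), stride=2))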

Causal Transformer Encoder

32 transformer layers process the convolution output. Every frame can only attend to past and present frames — never future frames.

The sliding window of 750 frames provides ~15 seconds of audio context at 50 Hz. This bounds memory for arbitrarily long audio:

Sliding window attention (window=5, for illustration):

Frame:     1  2  3  4  5  6  7  8  9  10
   1      [✓  ·  ·  ·  ·  ·  ·  ·  ·  · ]
   2      [✓  ✓  ·  ·  ·  ·  ·  ·  ·  · ]
   3      [✓  ✓  ✓  ·  ·  ·  ·  ·  ·  · ]
   4      [✓  ✓  ✓  ✓  ·  ·  ·  ·  ·  · ]
   5      [✓  ✓  ✓  ✓  ✓  ·  ·  ·  ·  · ]
   6      [·  ✓  ✓  ✓  ✓  ✓  ·  ·  ·  · ]  ← frame 1 falls out of window
   7      [·  ·  ✓  ✓  ✓  ✓  ✓  ·  ·  · ]
   ...

Each layer uses:

  • Causal self-attention with sliding window and RoPE (θ = 10⁶)
  • SwiGLU FFN: 1280 → 5120 → 1280
  • RMSNorm (pre-norm)
  • Full MHA (32 heads, 32 KV heads — no GQA in the encoder)
  • Biases on attention projections (unlike the decoder)

4× Downsampling + Adapter MLP

The encoder output (50 Hz, 1280-dim) is too frequent for the decoder. Four consecutive frames are concatenated and projected:

encoder output:  e₁  e₂  e₃  e₄  e₅  e₆  e₇  e₈  ...   (50 Hz, 1280-dim)
                 └──────┬──────┘  └──────┬──────┘
                   concat(4)         concat(4)
                      ↓                  ↓
                [e₁;e₂;e₃;e₄]    [e₅;e₆;e₇;e₈]          (12.5 Hz, 5120-dim)
                      ↓                  ↓
              GELU(Linear(5120→3072))                       (12.5 Hz, 3072-dim)
                      ↓                  ↓
              Linear(3072→3072)                             (12.5 Hz, 3072-dim)
                      ↓                  ↓
                     a₁                 a₂                  audio embeddings

One audio embedding per 80ms, dimension 3072 — matching the decoder hidden size.
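
A shape-level numpy sketch of the concat-then-project adapter (weights are random stand-ins; only the shapes mirror the table above):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5120, 3072)).astype(np.float32) * 0.02   # Linear 5120 → 3072
W2 = rng.standard_normal((3072, 3072)).astype(np.float32) * 0.02   # Linear 3072 → 3072

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapt(encoder_out: np.ndarray) -> np.ndarray:
    # encoder_out: [T, 1280] at 50 Hz; drop any remainder so T divides by 4.
    T = encoder_out.shape[0] - encoder_out.shape[0] % 4
    grouped = encoder_out[:T].reshape(T // 4, 4 * 1280)   # concat 4 frames → 5120-dim
    return gelu(grouped @ W1) @ W2                        # [T/4, 3072] at 12.5 Hz

print(adapt(rng.standard_normal((50, 1280)).astype(np.float32)).shape)   # (12, 3072)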

Additive Fusion (How Audio Meets Text)

Instead of replacing placeholders, Voxtral adds audio and text embeddings at every timestep:

z_t = A_t + e(y_{t-1})

where A_t ∈ ℝ^3072 is the adapted audio embedding and e(y_{t-1}) ∈ ℝ^3072 is the embedding of the previous token.

A concrete trace of “The revenue increased” with delay = 6 frames (480ms):

Time   Audio embed       Previous token    Fused input         Output
─────  ────────────────  ────────────────  ─────────────────   ────────────
 0ms   a₀ (silence)    + embed(BOS)      = z₀                → [P] (padding)
80ms   a₁ (silence)    + embed([P])      = z₁                → [P]
160ms  a₂ ("Th-")      + embed([P])      = z₂                → [P]
240ms  a₃ ("-e re-")   + embed([P])      = z₃                → [P]
320ms  a₄ ("-venue")   + embed([P])      = z₄                → [P]
400ms  a₅ (" incr-")   + embed([P])      = z₅                → [P]
480ms  a₆ ("-eased")   + embed([P])      = z₆                → [W] (boundary)
560ms  a₇ (...)        + embed([W])      = z₇                → "The"
640ms  a₈ (...)        + embed("The")    = z₈                → [W]
720ms  a₉ (...)        + embed([W])      = z₉                → " revenue"
800ms  a₁₀ (...)       + embed("revenue")= z₁₀               → [W]
880ms  a₁₁ (...)       + embed([W])      = z₁₁               → " increased"
...

[P] = “nothing to say yet.” [W] = “word boundary.” Both are filtered from final output.

Why addition, not concatenation? With concatenation, the dimension doubles (6144) and every layer must distinguish which half is audio vs. text. With addition, dimension stays at 3072 and the network learns to disentangle both signals from the combined representation. More parameter-efficient, no architectural changes needed in the decoder.

Language Decoder with Ada RMS-Norm

The decoder is initialized from Ministral 3B, a general-purpose language model — not a narrow transcription specialist.

The decoder uses GQA with 32 query heads and 8 KV heads (4:1 ratio), SwiGLU FFN (3072 → 9216 → 3072), and a sliding window of 8192 tokens for bounded-memory decoding on arbitrarily long sequences.

Ada RMS-Norm allows a single model to operate at any delay from 80ms to 2400ms. Given target delay τ:

1. Sinusoidal time embedding:
   t_embed = sinusoidal_encoding(τ)          ∈ ℝ^3072

2. Per-layer MLP (inner dim = 32, tiny):
   g_l(τ) = Linear₂(GELU(Linear₁(t_embed))) ∈ ℝ^3072

3. Applied to the FFN branch only:
   r_attn = Attention(RMSNorm(h))            ← NOT conditioned on delay
   h' = h + r_attn
   r_ffn = FFN(RMSNorm(h') ⊙ (1 + g_l(τ)))  ← conditioned on delay
   y = h' + r_ffn

The ⊙ is element-wise multiplication. When g_l(τ) = 0, this reduces to standard RMSNorm. Non-zero values selectively amplify or suppress dimensions based on the delay. Smaller delay → more aggressive early hypotheses; larger delay → wait for more acoustic evidence.

The paper reports that this adds roughly 5M parameters in total. g_l(τ) is computed once per inference session and reused.
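
A hedged numpy sketch of the conditioned norm on the FFN branch (the sinusoidal frequencies, initialization, and the ReLU stand-in for GELU are assumptions; only the overall shape follows the description above):

import numpy as np

DIM, COND_DIM = 3072, 32

def sinusoidal(tau_ms: float, dim: int = DIM) -> np.ndarray:
    # Transformer-style sinusoidal embedding of the delay value.
    half = dim // 2
    freqs = np.exp(-np.log(10_000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(tau_ms * freqs), np.cos(tau_ms * freqs)])

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
W_in = rng.standard_normal((DIM, COND_DIM)) * 0.02   # per-layer conditioning MLP, inner dim 32
W_out = np.zeros((COND_DIM, DIM))                    # zero-init means g(τ) = 0 at start

def ada_rms_norm(h: np.ndarray, tau_ms: float) -> np.ndarray:
    g = np.maximum(sinusoidal(tau_ms) @ W_in, 0.0) @ W_out   # ReLU here; GELU in the model
    return rms_norm(h) * (1.0 + g)                           # g = 0 → plain RMSNorm

x = rng.standard_normal(DIM)
assert np.allclose(ada_rms_norm(x, 480.0), rms_norm(x))      # zero conditioning ≡ RMSNorm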

6.4 Pipeline Summary (1 second of audio)

16,000 samples
  → mel:         [128, 100]     (128 bins × 100 frames, 100 Hz)
  → after convs: [1280, 50]    (50 Hz)
  → after enc:   [50, 1280]    (32 transformer layers)
  → after adapt: [12, 3072]    (12.5 Hz, 3072-dim)
  → 12 generation steps, one per frame
  → ~3-8 text tokens + [P]/[W] control tokens

6.5 Runtime Implications

Voxtral’s architecture is natively streaming. A practical integration:

  • feeds incremental audio frames,
  • advances one decode step per 80ms audio frame,
  • emits partial text continuously,
  • finalizes according to endpointing/segmentation policy.

6.6 Generation Loop Shape

# Simplified pseudocode for realtime streaming
session = start_stream(delay_ms=480)

for chunk in incoming_audio_chunks:
    session.feed_audio(chunk)
    while session.has_next_frame_step():
        token = session.step()  # one decode step per 80ms frame
        if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
            yield token

for token in session.end_stream():
    if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
        yield token

The loop is timeline-locked: exactly one generation step per 80ms audio frame. This is structurally different from unconstrained “generate until EOS” loops.


7. Side-by-Side Comparison

7.1 Runtime Flow

sequenceDiagram
    participant Mic
    participant Q as Qwen-style Pipeline
    participant V as Voxtral-style Pipeline

    Mic->>Q: audio segment (buffered)
    Q->>Q: encode full segment
    Q->>Q: decode tokens to EOS
    Q-->>Mic: transcript chunk

    Mic->>V: audio chunk 1
    V->>V: frame-aligned decode steps
    V-->>Mic: partial tokens
    Mic->>V: audio chunk 2
    V->>V: continue frame-aligned steps
    V-->>Mic: more partial/final tokens

7.2 Tracing “The revenue increased” Through Both

Qwen3-ASR (segment-style):

1. Capture: accumulate audio in ring buffer until silence/trigger
2. Frontend: mel [128 × 200] from 2 seconds of audio
3. Encode: conv → 18 transformer layers → project → [~25, 1024]
4. Fuse: build prompt, replace 25 pad slots with audio embeddings
5. Decode: generate autoregressively until EOS

   Step 1 → "The"
   Step 2 → " revenue"
   Step 3 → " increased"
   Step 4 → "."
   Step 5 → EOS

   Total latency = (time to accumulate segment) + encode + decode

Voxtral Realtime (streaming, 480ms delay):

1. Session open with τ = 6 frames (480ms)
2. Feed audio continuously
3. Per-frame decode:

   t=0ms    a₀ + embed(BOS)       → [P]
   t=80ms   a₁ + embed([P])       → [P]
   t=160ms  a₂ + embed([P])       → [P]
   t=240ms  a₃ + embed([P])       → [P]
   t=320ms  a₄ + embed([P])       → [P]
   t=400ms  a₅ + embed([P])       → [P]
   t=480ms  a₆ + embed([P])       → [W]      (first word ready)
   t=560ms  a₇ + embed([W])       → "The"    ← text appears
   t=640ms  a₈ + embed("The")     → [W]
   t=720ms  a₉ + embed([W])       → " revenue"
   ...

   Text arrives incrementally, ~480ms behind the speaker.
   No need to wait for silence.

7.3 Consolidated Comparison Table

Aspect                  Qwen3-ASR-0.6B                          Voxtral-Mini-4B-Realtime-2602
Parameters              0.6B                                    4.4B
Mel features            128 bins, Whisper frontend              128 bins, Slaney-normalized
STFT                    n_fft=400, hop=160                      window=400, hop=160²
Conv downsampling       3× Conv2D stride 2 (8× total)           2× CausalConv1D (2×) + adapter 4× (8× total)
Audio encoder           18L, d=896, 14 heads, FFN 3584          32L, d=1280, 32 heads, FFN 5120
Encoder attention       Bidirectional, block-chunked            Causal, sliding window 750
Encoder norm            LayerNorm                               RMSNorm
Encoder position        Sinusoidal (fixed)                      RoPE (rotary)
Encoder GQA             Full MHA (14/14)                        Full MHA (32/32)
Audio → decoder dim     1024                                    3072
Fusion method           Placeholder replacement                 Additive per timestep
Decoder                 28L, d=1024                             26L, d=3072
Decoder heads / KV      16 / 8 (GQA 2:1)                        32 / 8 (GQA 4:1)
Decoder FFN             3072                                    9216
Decoder context         Full sequence                           Sliding window 8192
Delay conditioning      None                                    Ada RMS-Norm (dim 32, paper-reported ~5M params)
Vocabulary              151,936 (Qwen3)                         131,072 (Tekken)
Streaming               Official stack: online/offline;         Native realtime
                        app integrations often segment-style
Generation loop         Until EOS (unconstrained)               One token per 80ms frame (timeline-locked)
Decoder origin          ASR-specific                            Ministral 3B (general LLM)
Quantized size          ~0.6 GB (8-bit)                         ~3.1 GB (4-bit)

² n_fft not specified in params.json. Implementations may vary.


8. The Decoder as Language Model

8.1 What This Means

Qwen3-ASR’s decoder is small and ASR-focused (the whole checkpoint is 0.6B parameters). In practice, it is used for transcription rather than reliable instruction-style rewriting or summarization, so post-transcription cleanup is usually delegated to a separate LLM.

Voxtral’s decoder is Ministral 3B: a general-purpose language model that was taught to understand audio. This is a fundamentally different capability profile.

8.2 Can Voxtral’s Decoder Rewrite Text After Transcription?

Once the audio stream ends, the model could theoretically continue generating in text-only mode. With no audio signal, additive fusion becomes 0 + text_embed = text_embed — effectively a standard LM forward pass.

This is untested. The model was trained with audio always present. Whether fine-tuning preserved or destroyed Ministral 3B’s text-only instruction-following is an empirical question that should be validated before building architecture around it.

A practical policy:

  1. Attempt backend-native rewrite only if capability is validated.
  2. Otherwise fall back to a dedicated text LLM post-processor.

8.3 Can You Swap the Decoder for a Different Language Model?

The paper’s training recipe is designed to support this:

  1. Phase 1 (5% of training): Freeze decoder. Train only encoder + adapter. LR: 4×10⁻⁴. This lets the randomly initialized encoder produce useful embeddings without destabilizing the pre-trained decoder.

  2. Phase 2 (95% of training): Unfreeze everything. Train end-to-end. LR: 6×10⁻⁵, batch size equivalent to ~370 hours of audio, AdamW with z-loss regularization. (The freezing schedule is sketched below.)
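
In PyTorch-style pseudocode (model, train, and the step counts are placeholders, in the same simplified style as the loops above):

# Simplified pseudocode for the two-phase recipe
# Phase 1 (~5% of steps): freeze the decoder, train only encoder + adapter.
for p in model.decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=4e-4)
train(model, optimizer, steps=phase1_steps)

# Phase 2 (~95% of steps): unfreeze everything, lower the learning rate.
for p in model.decoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
train(model, optimizer, steps=phase2_steps)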

To swap the decoder:

  1. Pick an LM with compatible architecture. If hidden dim ≠ 3072, retrain the adapter.
  2. Add Ada RMS-Norm conditioning (~5M extra params).
  3. Follow the same two-phase recipe with speech + word-level timestamps in target language(s).

The bottleneck is data: the DSM framework requires knowing exactly when each word was spoken to construct training targets. Large-scale word-level timestamp annotations are expensive.


9. Abstraction Design Implications

9.1 What Can Be Unified

  • Input audio format contract (16 kHz mono float32).
  • Output text event contract (partial/final tokens, segment boundaries).
  • Session lifecycle and error handling.
  • Metrics and tracing hooks.

9.2 What Should Remain Backend-Specific

  • Feature extractor internals (Whisper vs. Slaney mel).
  • Tokenizer internals and special token semantics.
  • Generation loop mechanics (unconstrained vs. timeline-locked).
  • Delay / endpointing control policy.

9.3 Failure Modes to Avoid

Over-abstraction: if the abstraction hides critical timing semantics, you get incorrect latency assumptions, brittle endpointing, and misleading capability signals.

Under-abstraction: if every backend leaks all internals upward, you get duplicated pipeline logic, inconsistent event semantics, and harder testing.

9.4 Practical Interface Shape

A capability-driven session interface:

  • start_session(...)
  • feed_audio(...)
  • poll_events(...)
  • end_session(...)

with explicit capability flags:

  • native_streaming: bool
  • delay_control: bool
  • word_timestamps: bool
  • rewrite_experimental: bool

This preserves model differences without duplicating pipeline/UI logic.
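
One possible shape in Python (names and types are illustrative, not an existing API):

from dataclasses import dataclass
from typing import Iterable, Optional, Protocol

@dataclass(frozen=True)
class Capabilities:
    native_streaming: bool
    delay_control: bool
    word_timestamps: bool
    rewrite_experimental: bool

@dataclass(frozen=True)
class TranscriptEvent:
    kind: str                       # "partial" | "final" | "boundary" | "error"
    text: str
    start_ms: Optional[int] = None
    end_ms: Optional[int] = None

class SttSession(Protocol):
    capabilities: Capabilities      # a backend factory's start_session(...) returns this object
    def feed_audio(self, pcm_16khz_mono: bytes) -> None: ...
    def poll_events(self) -> Iterable[TranscriptEvent]: ...
    def end_session(self) -> Iterable[TranscriptEvent]: ...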


10. Runtime Considerations

10.1 Throughput vs Decode Schedule

Two distinct constraints:

  1. Segment-style: total decode time must be less than segment wall time.
  2. Timeline-locked: each decode step must complete within its 80ms frame budget.

Similar in spirit, different in failure mode.

10.2 Endpointing

Even with native realtime models, turn segmentation remains an application-level choice:

  • VAD-driven endpointing,
  • model-token-driven endpointing (sequences of [P] tokens as a silence signal; see the sketch below),
  • or hybrid endpointing.
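
For the model-token-driven variant, a minimal sketch: treat a run of consecutive [P] tokens as silence and close the turn once the run exceeds a threshold (token name and threshold are illustrative):

SILENCE_FRAMES_TO_ENDPOINT = 10          # 10 frames × 80ms = 800ms of "nothing to say"

def detect_endpoints(token_stream, pad_token="[P]"):
    silence_run = 0
    for token in token_stream:
        if token == pad_token:
            silence_run += 1
            if silence_run == SILENCE_FRAMES_TO_ENDPOINT:
                yield "END_OF_TURN"      # hand off to finalization / segmentation logic
        else:
            silence_run = 0
            yield token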

10.3 Memory

Realtime note-taking products may co-host:

  • ASR backend (~0.6 GB or ~3.1 GB),
  • optional intent model,
  • optional rewrite LLM (~3-4 GB).

If Voxtral’s decoder can handle rewriting, total memory drops significantly. If not, Voxtral (3.1 GB) + external LLM (3-4 GB) may strain 16 GB machines.

10.4 Partial Transcript Stability

Aggressive low-delay settings increase early revisions; conservative delay reduces revisions but increases perceived lag. This tradeoff is orthogonal to raw WER and should be evaluated separately.


11. Source Validation Notes

11.1 Common Misreports for Qwen3-ASR-0.6B

Values frequently wrong in derivative writeups:

Field                      Often stated    Actual (from config.json)
Decoder hidden size        2048            1024
Audio encoder dim          1024            896
Audio encoder layers       24              18
Audio encoder FFN          4096            3584
Audio output projection    2048            1024

These mismatches arise because many discussions blend paper-level reference architectures, larger Qwen3 variants, and checkpoint-specific configs. For deployment, the checkpoint config is the source of truth.

11.2 Voxtral Variant Confusion

The earlier Voxtral-Mini-3B-2507 (non-realtime) has a different architecture:

  • Bidirectional encoder (not causal)
  • Cross-attention fusion (not additive)
  • Different decoder dimensions

Values from the non-realtime variant should not be mixed with the Realtime-2602 checkpoint. Always verify against the specific checkpoint’s params.json.

11.3 Validation Checklist

When documenting ASR checkpoints:

  1. Prefer config.json / params.json over third-party summaries.
  2. Explicitly verify layer counts, hidden/FFN dims, head counts, sliding windows, vocab size, and special token IDs.
  3. Cite both the paper and the checkpoint config; treat the checkpoint config as definitive.
  4. Pin commit hashes for strict reproducibility.
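
A sketch of items 1, 2, and 4, assuming the checkpoint is pulled from the Hugging Face Hub with huggingface_hub (field names follow the tables above; nesting differs between checkpoints, so the helper searches recursively):

import json
from huggingface_hub import hf_hub_download

def find_field(node, key):
    # Recursively search a possibly nested config dict for the first matching key.
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_field(value, key)
            if found is not None:
                return found
    return None

path = hf_hub_download("Qwen/Qwen3-ASR-0.6B", "config.json")   # pass revision=... to pin a commit
config = json.load(open(path))
for key in ("num_hidden_layers", "hidden_size", "num_key_value_heads", "vocab_size"):
    print(key, "=", find_field(config, key))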

12. References

Primary sources used for numbers and claims:

  1. Qwen3-ASR-0.6B model card: huggingface.co/Qwen/Qwen3-ASR-0.6B
  2. Qwen3-ASR-0.6B config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/config.json
  3. Qwen3-ASR-0.6B preprocessor_config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/preprocessor_config.json
  4. Qwen3-ASR technical report: arxiv.org/abs/2601.21337
  5. Voxtral-Mini-4B-Realtime-2602 model card: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
  6. Voxtral-Mini-4B-Realtime-2602 params.json: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602/blob/main/params.json
  7. Voxtral Realtime paper (DSM architecture): arxiv.org/abs/2602.11298
  8. MLX-community Voxtral 4-bit weights: huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit

Written while building voissistant, a real-time speech-to-text system for Apple Silicon using MLX.