Two Ways a Machine Can Listen: Qwen3-ASR vs Voxtral Realtime

A detailed, source-checked architectural comparison of two speech-to-text systems — one centered on segment-style decoding, one designed for native realtime decoding.


1. Why This Comparison Matters

This article compares two open ASR systems that target similar use cases but are architecturally very different:

  • Qwen3-ASR-0.6B (Qwen)
  • Voxtral-Mini-4B-Realtime-2602 (Mistral)

Both convert speech to text. Both target practical deployment. But their core inference loops and runtime assumptions are not the same.

That difference matters for engineering decisions in realtime products:

  • how to design backend interfaces,
  • how to segment and emit partial/final text,
  • where to place latency controls,
  • and how much of a stack should be model-agnostic.

The focus is architecture and system behavior, not benchmark ranking. Checkpoint-specific numeric values are anchored to released configuration files (config.json, preprocessor_config.json, params.json), with explicit notes where values come from implementation code, paper-reported totals, or runtime behavior.


2. Background: Common STT Architecture Families

There is no canonical finite list of STT architectures. Most production systems are variants, combinations, or evolutions of a small set of recurring families.

2.1 CTC (Connectionist Temporal Classification)

  • Typical shape: encoder-only acoustic model, optional external LM for rescoring.
  • Intuition: align frame-level audio evidence with token outputs under a monotonic constraint.
  • Practical profile: efficient and robust; widely used in offline and near-realtime systems.

2.2 RNN-T / Transducer

  • Typical shape: encoder + prediction network + joiner.
  • Intuition: produce tokens incrementally while maintaining monotonic alignment.
  • Practical profile: a dominant choice for streaming ASR in many production deployments.

2.3 Attention Encoder-Decoder (Seq2Seq)

  • Typical shape: encoder builds acoustic representation; decoder autoregressively generates text.
  • Intuition: learn richer sequence mapping with flexible conditioning.
  • Practical profile: strong quality in offline settings; streaming variants exist but require additional design constraints.

2.4 Two-Pass Hybrids

  • Typical shape: fast first pass (often streaming) + slower second pass rescoring/correction.
  • Intuition: trade latency and quality by splitting tasks across two inference stages.

2.5 Speech-LLM Hybrids

  • Typical shape: speech/audio encoder feeding a language-model decoder.
  • Intuition: leverage LM capabilities (prompting, formatting control, multilingual generalization) in ASR.

Qwen3-ASR and Voxtral Realtime both fall into this last category — they fuse an audio encoder with a language model decoder — but they do so in fundamentally different ways.

Key takeaway: a useful cross-model abstraction should target behavioral contracts (session lifecycle, events, capabilities), not force identical internals.


3. Shared Signal Path

Both models follow the same high-level pipeline from waveform to text:

flowchart LR
    A["Speech waveform"] --> B["Microphone / ADC (16 kHz mono)"]
    B --> C["Feature extraction (mel frames)"]
    C --> D["Neural audio-text model"]
    D --> E["Token stream"]
    E --> F["Transcript text"]

The major differences are in:

  • how audio features are encoded,
  • how audio and text streams are fused,
  • and how decoding steps are scheduled over time.

4. Building Blocks (a brief primer)

4.1 Mel Spectrograms: What the Model Actually “Sees”

Raw audio at 16 kHz means 16,000 samples per second. Most ASR models operate on short-time spectral features instead of raw waveform samples.

The audio is sliced into overlapping windows (typically 25ms wide, sliding every 10ms), and for each window, the energy at different frequency bands is measured. The “mel” part refers to a perceptual scale — it spaces frequency bands the way human hearing works, with finer resolution at low frequencies.

                    Time →
            ┌──────────────────────────┐
  High freq │░░▓▓░░░░░░▓▓▓▓░░░░░░░░░░░│
            │░▓▓▓░░░░░▓▓▓▓▓▓░░░░░░░░░░│
            │▓▓▓▓▓░░░▓▓▓▓▓▓▓▓░░░░░░░░░│
            │▓▓▓▓▓▓░▓▓▓▓▓▓▓▓▓▓░░░░░░░░│
  Low freq  │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░▓▓░░░│
            └──────────────────────────┘
              "Hello"       "world"  (silence)

Both checkpoints use 128 mel bins and hop length 160 (10ms at 16 kHz), so both produce ~100 mel frames per second.

If duration is D seconds:

  • T ≈ 100 × D
  • mel features: X ∈ ℝ^(128 × T)
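
A quick arithmetic check of these shapes, as a minimal sketch in plain Python (frame counts can differ by one or two depending on padding conventions):

SAMPLE_RATE = 16_000
HOP_LENGTH = 160              # 10ms hop shared by both checkpoints
N_MELS = 128

def approx_mel_shape(duration_seconds: float) -> tuple[int, int]:
    # ~100 frames per second of audio; the exact count depends on STFT padding.
    num_samples = int(duration_seconds * SAMPLE_RATE)
    num_frames = num_samples // HOP_LENGTH
    return (N_MELS, num_frames)

print(approx_mel_shape(5.0))  # (128, 500)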

4.2 Embeddings and Dimensional Alignment

Transformers operate on embeddings — fixed-length vectors (1,024 dimensions in Qwen3-ASR’s decoder, 3,072 in Voxtral’s) that represent meaning in a geometric space. Similar things end up close together: the embedding for “dog” is near “puppy” and far from “algebra.”

For multimodal ASR, audio representations must be mapped into the decoder’s hidden space (or a compatible fusion space). The details differ:

  • Qwen3-ASR: audio embeddings replace placeholder tokens in a text prompt sequence.
  • Voxtral Realtime: audio and text embeddings are added together at each timestep.

Both approaches require audio and text to share the same vector space and dimensionality.

4.3 Bidirectional vs Causal Attention

A transformer processes a sequence of embeddings. Its core mechanism is attention: each position can “look at” other positions to gather context.

  Bidirectional                 Causal
  (Qwen3-ASR encoder)          (Voxtral encoder)

  Position: 1 2 3 4 5          Position: 1 2 3 4 5
       1    ✓ ✓ ✓ ✓ ✓               1    ✓ · · · ·
       2    ✓ ✓ ✓ ✓ ✓               2    ✓ ✓ · · ·
       3    ✓ ✓ ✓ ✓ ✓               3    ✓ ✓ ✓ · ·
       4    ✓ ✓ ✓ ✓ ✓               4    ✓ ✓ ✓ ✓ ·
       5    ✓ ✓ ✓ ✓ ✓               5    ✓ ✓ ✓ ✓ ✓

  ✓ = can attend    · = cannot attend
  • Bidirectional: every position sees every other. Richer representations, but requires the complete input.
  • Causal: each position only sees past/present. Required for strict realtime — future frames haven’t arrived yet.
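
To make the two patterns concrete, here is a small numpy sketch of the masks (True = may attend); it shows only the masking rule, not a full attention implementation:

import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    # Every position may attend to every other position (needs the complete input).
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    # Position i may attend only to positions j <= i (past and present).
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(5).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]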

4.4 Autoregressive Decoding and KV Cache

Both models generate text autoregressively: they predict one token at a time, and each prediction depends on all previous tokens.

Step 1: [audio context]                   → predict "The"
Step 2: [audio context] + "The"           → predict " revenue"
Step 3: [audio context] + "The revenue"   → predict " increased"
...

The decoder maintains a KV cache — a running memory of previously computed attention keys and values — so each new token doesn’t reprocess the full history.
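
A toy numpy sketch of why the cache helps: each step computes keys/values only for the new position, appends them, and attends over the cached history (single head, illustrative only):

import numpy as np

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
cached_K, cached_V = [], []                     # one cached entry per processed position

def decode_step(x_new: np.ndarray) -> np.ndarray:
    cached_K.append(x_new @ W_k)                # compute K/V for the new position only
    cached_V.append(x_new @ W_v)
    K, V = np.stack(cached_K), np.stack(cached_V)
    scores = K @ (x_new @ W_q) / np.sqrt(d)     # new query attends over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # context vector for the new token

for _ in range(3):
    decode_step(rng.standard_normal(d))         # history grows; nothing is recomputed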


5. Qwen3-ASR-0.6B: Architecture and Runtime Behavior

5.1 Source-Checked Configuration

All values from config.json and preprocessor_config.json:

If you’re skimming: this checkpoint is relatively compact (18-layer audio encoder + 28-layer decoder at hidden size 1024).

Component                  Value                              Source field
Sample rate                16,000 Hz                          preprocessor_config.json
Mel bins                   128                                feature_size: 128
n_fft / hop_length         400 / 160                          preprocessor_config.json
Conv downsampling hidden   480                                downsample_hidden_size
Conv strides               [2, 2, 2] (8× total)               3 Conv2D layers
Audio encoder layers       18                                 encoder_layers
Audio encoder dim          896                                d_model
Audio encoder heads        14                                 encoder_attention_heads
Audio encoder FFN          3584                               encoder_ffn_dim
Audio output projection    1024                               output_dim
Chunk / window             n_window=50, n_window_infer=800    audio_config
Decoder layers             28                                 num_hidden_layers
Decoder hidden             1024                               hidden_size
Decoder query heads        16                                 num_attention_heads
Decoder KV heads           8                                  num_key_value_heads
Decoder head dim           128                                head_dim
Decoder FFN                3072                               intermediate_size
RoPE theta                 1,000,000                          rope_theta
Vocabulary                 151,936                            vocab_size
Tied embeddings            yes                                tie_word_embeddings: true
Audio token ID             151,676                            audio_token_id

5.2 End-to-End Flow

flowchart LR
    A["Audio (16 kHz)"] --> B["Whisper-style mel (128 × T)"]
    B --> C["3× Conv2D downsampling (8× in time)"]
    C --> D["Audio encoder (18L, d=896, bidirectional/chunked)"]
    D --> E["Project 896 → 1024"]
    E --> F["Replace audio-pad slots in prompt"]
    F --> G["Qwen decoder (28L, d=1024, GQA 16q/8kv)"]
    G --> H["Autoregressive token output"]

5.3 Step-by-Step

Feature Extraction

The audio signal is converted into a mel spectrogram using the Whisper feature extractor (n_fft=400, hop_length=160):

mel ∈ ℝ^(128 × T)     where T ≈ 100 × duration_seconds

For a 5-second clip: mel has shape [128, 500].

Convolutional Downsampling

Three 2D convolutions with stride 2 shrink both time and frequency:

Input:  [128 freq × T time × 1 channel]
Conv1:  stride 2, GELU → [64 × T/2 × 480]
Conv2:  stride 2, GELU → [32 × T/4 × 480]
Conv3:  stride 2, GELU → [16 × T/8 × 480]
Flatten + project:      → [T/8, 896]

After 8× time compression: roughly 12.5 frames per second, one embedding per ~80ms.

Transformer Encoder (Bidirectional, Chunked)

18 transformer layers, each with:

  • Bidirectional self-attention: every frame can see every other frame
  • Feed-forward: GELU, 896 → 3584 → 896
  • LayerNorm, residual connections

Symbolically, each block:

h'_l = h_l + MHA(LayerNorm(h_l))
h_{l+1} = h'_l + FFN(LayerNorm(h'_l))

For long audio, attention is restricted to within chunks using a block-diagonal mask:

Audio frames:     [  chunk 1   |  chunk 2   |  chunk 3   ]

Attention mask:   ┌───┬───┬───┐
                  │ ✓ │   │   │  ← chunk 1 attends only to chunk 1
                  ├───┼───┼───┤
                  │   │ ✓ │   │  ← chunk 2 attends only to chunk 2
                  ├───┼───┼───┤
                  │   │   │ ✓ │  ← chunk 3 attends only to chunk 3
                  └───┴───┴───┘
flowchart TB
    subgraph C1["Chunk 1"]
    A1["f1"] --- A2["f2"] --- A3["f3"]
    end
    subgraph C2["Chunk 2"]
    B1["f4"] --- B2["f5"] --- B3["f6"]
    end
    subgraph C3["Chunk 3"]
    D1["f7"] --- D2["f8"] --- D3["f9"]
    end
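
A numpy sketch of the block-diagonal mask (chunk sizes here are illustrative; the real sizes follow n_window / n_window_infer from the config):

import numpy as np

def block_diagonal_mask(chunk_sizes: list[int]) -> np.ndarray:
    # Frames attend only to frames within the same chunk.
    n = sum(chunk_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in chunk_sizes:
        mask[start:start + size, start:start + size] = True
        start += size
    return mask

print(block_diagonal_mask([3, 3, 3]).astype(int))  # three chunks of three frames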

The final output is projected to decoder dimension:

encoder_out ∈ ℝ^(N × 896) → LN → GELU(Linear) → Linear → ℝ^(N × 1024)

Placeholder Replacement (How Audio Meets Text)

The model uses a chat-style prompt with audio placeholder tokens:

<|im_start|>system
{optional context}<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|>×N<|audio_end|><|im_end|>
<|im_start|>assistant
language English<asr_text>

The N pad token embeddings are replaced with the encoder output embeddings:

Before replacement:
  [sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
       ↓            ↓          ↓      ↓           ↓         ↓
    text_emb     text_emb   pad_emb pad_emb ... pad_emb  text_emb

After replacement:
  [sys_tokens] [user_tokens] [pad₁] [pad₂] ... [padₙ] [asst_tokens]
       ↓            ↓          ↓      ↓           ↓         ↓
    text_emb     text_emb   aud₁    aud₂   ... audₙ     text_emb

Formally, let E ∈ ℝ^(L × 1024) be prompt embeddings with audio-pad positions p₁...pₙ:

  • E'ᵢ = Aⱼ if i = pⱼ (replace with audio embedding)
  • E'ᵢ = Eᵢ otherwise (keep text embedding)

The decoder sees a single mixed embedding sequence. It doesn’t know which positions carry audio vs. text information.
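
A minimal numpy sketch of the replacement step (real implementations perform the same scatter on tensors, keyed by positions of the audio pad token):

import numpy as np

AUDIO_TOKEN_ID = 151_676                        # audio_token_id from config.json

def inject_audio(prompt_embeds: np.ndarray,     # [L, 1024] embedded prompt
                 audio_embeds: np.ndarray,      # [N, 1024] projected encoder output
                 token_ids: np.ndarray) -> np.ndarray:
    pad_positions = np.where(token_ids == AUDIO_TOKEN_ID)[0]
    assert len(pad_positions) == len(audio_embeds), "one pad slot per audio embedding"
    mixed = prompt_embeds.copy()
    mixed[pad_positions] = audio_embeds         # overwrite pad embeddings in place
    return mixed                                # the decoder sees one mixed sequence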

Text Decoding

Standard autoregressive generation:

y_t ~ p(y_t | y_<t, E')

The decoder is a 28-layer Qwen3-style transformer with Grouped-Query Attention (GQA): 16 query heads share 8 KV heads (2:1 ratio), halving KV cache size. The feed-forward uses SwiGLU:

FFN(x) = W_down( SiLU(W_gate(x)) ⊙ W_up(x) )

In common Qwen3-ASR decoding templates, generation stops on EOS token IDs (for example, 151645 and 151643); exact stop sets are tokenizer/runtime dependent.

5.4 Pipeline Summary (1 second of audio)

16,000 samples
  → mel:         [128, 100]     (128 bins × 100 frames)
  → after convs: [~13, 896]    (12.5 frames/sec)
  → after enc:   [~13, 1024]   (projected to decoder dim)
  → replaces ~13 pad tokens in prompt
  → decoder generates ~5-15 text tokens until EOS

5.5 Runtime Implications

Qwen’s official model card reports unified online/offline inference support in their serving stack. In many custom deployments (including ours), integration is segment/batch-style: periodic windows or VAD-bounded turns.

This is important operationally: “supports streaming” at model/tooling level does not force a particular app-level control loop. The bidirectional encoder fundamentally benefits from seeing full audio, so segment-style integration is a natural fit. In this article, “segment-style” describes an integration pattern, not a hard model impossibility.

5.6 Generation Loop Shape

# Simplified pseudocode for segment-style decoding
mel = featurize(audio_segment)
audio_embeds = encode_and_project(mel)       # [N, 1024]
prompt_embeds = replace_pads(audio_embeds)   # audio injected into prompt

cache = init_kv_cache()
logits, cache = decoder_prefill(prompt_embeds, cache)    # process the full prompt once
for step in range(max_tokens):               # unconstrained loop
    token = sample(logits)
    if token in eos_ids:
        break
    yield token
    logits, cache = decoder_step(embed(token), cache)    # feed only the new token

The loop runs until the model decides to stop. There is no fixed relationship between the number of audio frames and the number of generated tokens.


6. Voxtral-Mini-4B-Realtime-2602: Architecture and Runtime Behavior

6.1 Source-Checked Configuration

All values from params.json:

If you’re skimming: this checkpoint is larger and explicitly realtime-oriented (causal 32-layer audio encoder + 26-layer 3072-d decoder with delay conditioning).

Component                   Value                                         Source field
Sample rate                 16,000 Hz                                     audio_encoding_args.sampling_rate
Frame rate (output)         12.5 Hz                                       audio_encoding_args.frame_rate
Mel bins                    128                                           audio_encoding_args.num_mel_bins
Window / hop                window=400, hop=160                           audio_encoding_args
Global log mel max          1.5                                           audio_encoding_args.global_log_mel_max
Conv stem 1                 causal Conv1D, 128→1280, k=3, s=1             encoder implementation¹
Conv stem 2                 causal Conv1D, 1280→1280, k=3, s=2            encoder implementation¹
Audio encoder layers        32                                            encoder_args.n_layers
Audio encoder dim           1280                                          encoder_args.dim
Audio encoder heads         32                                            encoder_args.n_heads
Audio encoder KV heads      32 (full MHA)                                 encoder_args.n_kv_heads
Audio encoder head dim      64                                            encoder_args.head_dim
Audio encoder FFN           5120                                          encoder_args.hidden_dim
Audio encoder FFN type      SwiGLU                                        encoder_args.ffn_type
Audio encoder norm          RMSNorm                                       encoder_args.norm_type
Audio encoder position      RoPE                                          encoder_args.pos_embed
Audio encoder attention     causal                                        encoder_args.causal: true
Audio encoder window        750 frames                                    encoder_args.sliding_window
Encoder biases              yes                                           encoder_args.use_biases: true
Downsample factor           4                                             downsample_args.downsample_factor
Adapter MLP                 5120→3072→3072 (1280×4 concat → project)      implementation¹
Decoder layers              26                                            n_layers
Decoder dim                 3072                                          dim
Decoder heads               32                                            n_heads
Decoder KV heads            8                                             n_kv_heads
Decoder head dim            128                                           head_dim
Decoder FFN                 9216                                          hidden_dim
Decoder biases              no                                            use_biases: false
Decoder sliding window      8192                                          sliding_window
Max sequence length         131,072                                       model_max_length
Vocabulary                  131,072 (Tekken)                              vocab_size
Tied embeddings             yes                                           tied_embeddings: true
RoPE theta                  1,000,000                                     rope_theta
Ada RMS-Norm enabled        yes                                           ada_rms_norm_t_cond: true
Ada RMS-Norm inner dim      32                                            ada_rms_norm_t_cond_dim
Total parameters            ~4.4B                                         encoder 970M + adapter 25M + decoder 3.4B

¹ Conv stem and adapter MLP architecture details come from implementation code (mlx-audio, vLLM), not from params.json directly. The params file specifies the encoder dimensions and downsample factor; the conv topology is an implementation detail.

frame_rate=12.5 here refers to the stream-synchronous decoding rate after internal downsampling. Frontend mel extraction is still ~100 Hz, with an intermediate ~50 Hz stage after the conv stem.

Note on n_fft: params.json does not specify n_fft. The value 512 used in some implementations (e.g., mlx-audio) is likely an implementation default. Only window_size=400 and hop_length=160 are specified in the primary config.

6.2 End-to-End Flow

flowchart LR
    A["Audio (16 kHz)"] --> B["Mel features (128 × T, 100 Hz)"]
    B --> C["Causal Conv stem → 50 Hz, d=1280"]
    C --> D["Causal encoder (32L, d=1280, sliding window=750)"]
    D --> E["Downsample ×4 + adapter MLP → d=3072, 12.5 Hz"]
    E --> F["Per-step fusion: audio_t + text_embed(prev token)"]
    F --> G["LM decoder (26L, d=3072, GQA 32q/8kv, AdaNorm)"]
    G --> H["Token stream (lexical + padding/boundary control tokens)"]

6.3 Step-by-Step

Feature Extraction

Same 128-bin mel spectrogram as Qwen3-ASR, but with Slaney-normalized filterbanks and a different log normalization:

log_spec = log10(max(mel, 1e-10))
log_spec = clamp(log_spec, min=global_log_mel_max - 8.0)
log_spec = (log_spec + 4.0) / 4.0

where global_log_mel_max = 1.5 (from params.json). Not interchangeable with Whisper’s feature extractor.
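
In numpy, the normalization above looks roughly like this (mel is the raw 128 × T mel matrix; the filterbank construction itself is implementation-specific):

import numpy as np

GLOBAL_LOG_MEL_MAX = 1.5                        # from params.json

def normalize_log_mel(mel: np.ndarray) -> np.ndarray:
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, GLOBAL_LOG_MEL_MAX - 8.0)   # clamp from below
    return (log_spec + 4.0) / 4.0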

Causal Convolutional Stem

Two 1D convolutions (not 2D like Qwen3) with causal padding — each output frame depends only on current and past input frames:

Input:   [128, T]  (128 mel bins as input channels)
Conv1:   k=3, s=1, causal → [1280, T]     (expand channels, keep time)
Conv2:   k=3, s=2, causal → [1280, T/2]   (halve time)

Causal vs. standard convolution (kernel=3):

Standard:  output[t] depends on input[t-1], input[t], input[t+1]
                                                       ↑ needs future!
Causal:    output[t] depends on input[t-2], input[t-1], input[t]
                                                        ✓ only past/present

Output rate: 50 Hz (one embedding every 20ms). Dimension: 1280.
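
One common way to implement a causal 1D convolution is to left-pad the input by kernel_size - 1 before a standard valid convolution; a single-channel numpy sketch:

import numpy as np

def causal_conv1d(x: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    # Left-pad so output[t] depends only on x[:t+1]; no future samples leak in.
    k = len(kernel)
    x_padded = np.concatenate([np.zeros(k - 1), x])
    out = np.array([x_padded[t:t + k] @ kernel for t in range(len(x))])
    return out[::stride]                        # stride 2 halves the frame rate

print(causal_conv1d(np.arange(10.0), np.array([0.2, 0.3, 0.5]), stride=2))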

Causal Transformer Encoder

32 transformer layers process the convolution output. Every frame can only attend to past and present frames — never future frames.

The sliding window of 750 frames provides ~15 seconds of audio context at 50 Hz. This bounds memory for arbitrarily long audio:

Sliding window attention (window=5, for illustration):

Frame:     1  2  3  4  5  6  7  8  9  10
   1      [✓  ·  ·  ·  ·  ·  ·  ·  ·  · ]
   2      [✓  ✓  ·  ·  ·  ·  ·  ·  ·  · ]
   3      [✓  ✓  ✓  ·  ·  ·  ·  ·  ·  · ]
   4      [✓  ✓  ✓  ✓  ·  ·  ·  ·  ·  · ]
   5      [✓  ✓  ✓  ✓  ✓  ·  ·  ·  ·  · ]
   6      [·  ✓  ✓  ✓  ✓  ✓  ·  ·  ·  · ]  ← frame 1 falls out of window
   7      [·  ·  ✓  ✓  ✓  ✓  ✓  ·  ·  · ]
   ...

Each layer uses:

  • Causal self-attention with sliding window and RoPE (θ = 10⁶)
  • SwiGLU FFN: 1280 → 5120 → 1280
  • RMSNorm (pre-norm)
  • Full MHA (32 heads, 32 KV heads — no GQA in the encoder)
  • Biases on attention projections (unlike the decoder)

4× Downsampling + Adapter MLP

The encoder output (50 Hz, 1280-dim) is too frequent for the decoder. Four consecutive frames are concatenated and projected:

encoder output:  e₁  e₂  e₃  e₄  e₅  e₆  e₇  e₈  ...   (50 Hz, 1280-dim)
                 └──────┬──────┘  └──────┬──────┘
                   concat(4)         concat(4)
                      ↓                  ↓
                [e₁;e₂;e₃;e₄]    [e₅;e₆;e₇;e₈]          (12.5 Hz, 5120-dim)
                      ↓                  ↓
              GELU(Linear(5120→3072))                       (12.5 Hz, 3072-dim)
                      ↓                  ↓
              Linear(3072→3072)                             (12.5 Hz, 3072-dim)
                      ↓                  ↓
                     a₁                 a₂                  audio embeddings

One audio embedding per 80ms, dimension 3072 — matching the decoder hidden size.
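
A shape-level numpy sketch of the concat-then-project adapter (weights are random stand-ins; only the shapes mirror the table above):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5120, 3072)).astype(np.float32) * 0.02   # Linear 5120 → 3072
W2 = rng.standard_normal((3072, 3072)).astype(np.float32) * 0.02   # Linear 3072 → 3072

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapt(encoder_out: np.ndarray) -> np.ndarray:
    # encoder_out: [T, 1280] at 50 Hz; drop any remainder so T divides by 4.
    T = encoder_out.shape[0] - encoder_out.shape[0] % 4
    grouped = encoder_out[:T].reshape(T // 4, 4 * 1280)   # concat 4 frames → 5120-dim
    return gelu(grouped @ W1) @ W2                        # [T/4, 3072] at 12.5 Hz

print(adapt(rng.standard_normal((50, 1280)).astype(np.float32)).shape)   # (12, 3072)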

Additive Fusion (How Audio Meets Text)

Instead of replacing placeholders, Voxtral adds audio and text embeddings at every timestep:

z_t = A_t + e(y_{t-1})

where A_t ∈ ℝ^3072 is the adapted audio embedding and e(y_{t-1}) ∈ ℝ^3072 is the embedding of the previous token.

A concrete trace of “The revenue increased” with delay = 6 frames (480ms):

Time   Audio embed       Previous token    Fused input         Output
─────  ────────────────  ────────────────  ─────────────────   ────────────
 0ms   a₀ (silence)    + embed(BOS)      = z₀                → [P] (padding)
80ms   a₁ (silence)    + embed([P])      = z₁                → [P]
160ms  a₂ ("Th-")      + embed([P])      = z₂                → [P]
240ms  a₃ ("-e re-")   + embed([P])      = z₃                → [P]
320ms  a₄ ("-venue")   + embed([P])      = z₄                → [P]
400ms  a₅ (" incr-")   + embed([P])      = z₅                → [P]
480ms  a₆ ("-eased")   + embed([P])      = z₆                → [W] (boundary)
560ms  a₇ (...)        + embed([W])      = z₇                → "The"
640ms  a₈ (...)        + embed("The")    = z₈                → [W]
720ms  a₉ (...)        + embed([W])      = z₉                → " revenue"
800ms  a₁₀ (...)       + embed("revenue")= z₁₀               → [W]
880ms  a₁₁ (...)       + embed([W])      = z₁₁               → " increased"
...

[P] = “nothing to say yet.” [W] = “word boundary.” Both are filtered from final output.

Why addition, not concatenation? With concatenation, the dimension doubles (6144) and every layer must distinguish which half is audio vs. text. With addition, dimension stays at 3072 and the network learns to disentangle both signals from the combined representation. More parameter-efficient, no architectural changes needed in the decoder.

Language Decoder with Ada RMS-Norm

The decoder is initialized from Ministral 3B, a general-purpose language model — not a narrow transcription specialist.

The decoder uses GQA with 32 query heads and 8 KV heads (4:1 ratio), SwiGLU FFN (3072 → 9216 → 3072), and a sliding window of 8192 tokens for bounded-memory decoding on arbitrarily long sequences.

Ada RMS-Norm allows a single model to operate at any delay from 80ms to 2400ms. Given target delay τ:

1. Sinusoidal time embedding:
   t_embed = sinusoidal_encoding(τ)          ∈ ℝ^3072

2. Per-layer MLP (inner dim = 32, tiny):
   g_l(τ) = Linear₂(GELU(Linear₁(t_embed))) ∈ ℝ^3072

3. Applied to the FFN branch only:
   r_attn = Attention(RMSNorm(h))            ← NOT conditioned on delay
   h' = h + r_attn
   r_ffn = FFN(RMSNorm(h') ⊙ (1 + g_l(τ)))  ← conditioned on delay
   y = h' + r_ffn

The ⊙ is element-wise multiplication. When g_l(τ) = 0, this reduces to standard RMSNorm. Non-zero values selectively amplify or suppress dimensions based on the delay. Smaller delay → more aggressive early hypotheses; larger delay → wait for more acoustic evidence.

The paper reports that this adds roughly 5M parameters in total. g_l(τ) is computed once per inference session and reused.
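
A hedged numpy sketch of the conditioned norm on the FFN branch (the sinusoidal frequencies, initialization, and the ReLU stand-in for GELU are assumptions; only the overall shape follows the description above):

import numpy as np

DIM, COND_DIM = 3072, 32

def sinusoidal(tau_ms: float, dim: int = DIM) -> np.ndarray:
    # Transformer-style sinusoidal embedding of the delay value.
    half = dim // 2
    freqs = np.exp(-np.log(10_000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(tau_ms * freqs), np.cos(tau_ms * freqs)])

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
W_in = rng.standard_normal((DIM, COND_DIM)) * 0.02   # per-layer conditioning MLP, inner dim 32
W_out = np.zeros((COND_DIM, DIM))                    # zero-init means g(τ) = 0 at start

def ada_rms_norm(h: np.ndarray, tau_ms: float) -> np.ndarray:
    g = np.maximum(sinusoidal(tau_ms) @ W_in, 0.0) @ W_out   # ReLU here; GELU in the model
    return rms_norm(h) * (1.0 + g)                           # g = 0 → plain RMSNorm

x = rng.standard_normal(DIM)
assert np.allclose(ada_rms_norm(x, 480.0), rms_norm(x))      # zero conditioning ≡ RMSNorm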

6.4 Pipeline Summary (1 second of audio)

16,000 samples
  → mel:         [128, 100]     (128 bins × 100 frames, 100 Hz)
  → after convs: [1280, 50]    (50 Hz)
  → after enc:   [50, 1280]    (32 transformer layers)
  → after adapt: [12, 3072]    (12.5 Hz, 3072-dim)
  → 12 generation steps, one per frame
  → ~3-8 text tokens + [P]/[W] control tokens

6.5 Runtime Implications

Voxtral’s architecture is natively streaming. A practical integration:

  • feeds incremental audio frames,
  • advances one decode step per 80ms audio frame,
  • emits partial text continuously,
  • finalizes according to endpointing/segmentation policy.

6.6 Generation Loop Shape

# Simplified pseudocode for realtime streaming
session = start_stream(delay_ms=480)

for chunk in incoming_audio_chunks:
    session.feed_audio(chunk)
    while session.has_next_frame_step():
        token = session.step()  # one decode step per 80ms frame
        if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
            yield token

for token in session.end_stream():
    if token not in {PAD_TOKEN, WORD_BOUNDARY_TOKEN}:
        yield token

The loop is timeline-locked: exactly one generation step per 80ms audio frame. This is structurally different from unconstrained “generate until EOS” loops.


7. Side-by-Side Comparison

7.1 Runtime Flow

sequenceDiagram
    participant Mic
    participant Q as Qwen-style Pipeline
    participant V as Voxtral-style Pipeline

    Mic->>Q: audio segment (buffered)
    Q->>Q: encode full segment
    Q->>Q: decode tokens to EOS
    Q-->>Mic: transcript chunk

    Mic->>V: audio chunk 1
    V->>V: frame-aligned decode steps
    V-->>Mic: partial tokens
    Mic->>V: audio chunk 2
    V->>V: continue frame-aligned steps
    V-->>Mic: more partial/final tokens

7.2 Tracing “The revenue increased” Through Both

Qwen3-ASR (segment-style):

1. Capture: accumulate audio in ring buffer until silence/trigger
2. Frontend: mel [128 × 200] from 2 seconds of audio
3. Encode: conv → 18 transformer layers → project → [~25, 1024]
4. Fuse: build prompt, replace 25 pad slots with audio embeddings
5. Decode: generate autoregressively until EOS

   Step 1 → "The"
   Step 2 → " revenue"
   Step 3 → " increased"
   Step 4 → "."
   Step 5 → EOS

   Total latency = (time to accumulate segment) + encode + decode

Voxtral Realtime (streaming, 480ms delay):

1. Session open with τ = 6 frames (480ms)
2. Feed audio continuously
3. Per-frame decode:

   t=0ms    a₀ + embed(BOS)       → [P]
   t=80ms   a₁ + embed([P])       → [P]
   t=160ms  a₂ + embed([P])       → [P]
   t=240ms  a₃ + embed([P])       → [P]
   t=320ms  a₄ + embed([P])       → [P]
   t=400ms  a₅ + embed([P])       → [P]
   t=480ms  a₆ + embed([P])       → [W]      (first word ready)
   t=560ms  a₇ + embed([W])       → "The"    ← text appears
   t=640ms  a₈ + embed("The")     → [W]
   t=720ms  a₉ + embed([W])       → " revenue"
   ...

   Text arrives incrementally, ~480ms behind the speaker.
   No need to wait for silence.

7.3 Consolidated Comparison Table

Aspect                  Qwen3-ASR-0.6B                          Voxtral-Mini-4B-Realtime-2602
Parameters              0.6B                                    4.4B
Mel features            128 bins, Whisper frontend              128 bins, Slaney-normalized
STFT                    n_fft=400, hop=160                      window=400, hop=160²
Conv downsampling       3× Conv2D stride 2 (8× total)           2× CausalConv1D (2×) + adapter 4× (8× total)
Audio encoder           18L, d=896, 14 heads, FFN 3584          32L, d=1280, 32 heads, FFN 5120
Encoder attention       Bidirectional, block-chunked            Causal, sliding window 750
Encoder norm            LayerNorm                               RMSNorm
Encoder position        Sinusoidal (fixed)                      RoPE (rotary)
Encoder GQA             Full MHA (14/14)                        Full MHA (32/32)
Audio → decoder dim     1024                                    3072
Fusion method           Placeholder replacement                 Additive per timestep
Decoder                 28L, d=1024                             26L, d=3072
Decoder heads / KV      16 / 8 (GQA 2:1)                        32 / 8 (GQA 4:1)
Decoder FFN             3072                                    9216
Decoder context         Full sequence                           Sliding window 8192
Delay conditioning      None                                    Ada RMS-Norm (dim 32, paper-reported ~5M params)
Vocabulary              151,936 (Qwen3)                         131,072 (Tekken)
Streaming               Official stack: online/offline;         Native realtime
                        app integrations often segment-style
Generation loop         Until EOS (unconstrained)               One token per 80ms frame (timeline-locked)
Decoder origin          ASR-specific                            Ministral 3B (general LLM)
Quantized size          ~0.6 GB (8-bit)                         ~3.1 GB (4-bit)

² n_fft not specified in params.json. Implementations may vary.


8. The Decoder as Language Model

8.1 What This Means

Qwen3-ASR’s decoder is small and ASR-focused (the whole checkpoint is 0.6B parameters). In practice, it is used for transcription rather than reliable instruction-style rewriting or summarization, so post-transcription cleanup is usually delegated to a separate LLM.

Voxtral’s decoder is Ministral 3B: a general-purpose language model that was taught to understand audio. This is a fundamentally different capability profile.

8.2 Can Voxtral’s Decoder Rewrite Text After Transcription?

Once the audio stream ends, the model could theoretically continue generating in text-only mode. With no audio signal, additive fusion becomes 0 + text_embed = text_embed — effectively a standard LM forward pass.

This is untested. The model was trained with audio always present. Whether fine-tuning preserved or destroyed Ministral 3B’s text-only instruction-following is an empirical question that should be validated before building architecture around it.

A practical policy:

  1. Attempt backend-native rewrite only if capability is validated.
  2. Otherwise fall back to a dedicated text LLM post-processor.

8.3 Can You Swap the Decoder for a Different Language Model?

The paper’s training recipe is designed to support this:

  1. Phase 1 (5% of training): Freeze decoder. Train only encoder + adapter. LR: 4×10⁻⁴. This lets the randomly initialized encoder produce useful embeddings without destabilizing the pre-trained decoder.

  2. Phase 2 (95% of training): Unfreeze everything. Train end-to-end. LR: 6×10⁻⁵, batch size equivalent to ~370 hours of audio, AdamW with z-loss regularization. (The freezing schedule is sketched below.)
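
In PyTorch-style pseudocode (model, train, and the step counts are placeholders, in the same simplified style as the loops above):

# Simplified pseudocode for the two-phase recipe
# Phase 1 (~5% of steps): freeze the decoder, train only encoder + adapter.
for p in model.decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=4e-4)
train(model, optimizer, steps=phase1_steps)

# Phase 2 (~95% of steps): unfreeze everything, lower the learning rate.
for p in model.decoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
train(model, optimizer, steps=phase2_steps)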

To swap the decoder:

  1. Pick an LM with compatible architecture. If hidden dim ≠ 3072, retrain the adapter.
  2. Add Ada RMS-Norm conditioning (~5M extra params).
  3. Follow the same two-phase recipe with speech + word-level timestamps in target language(s).

The bottleneck is data: the DSM framework requires knowing exactly when each word was spoken to construct training targets. Large-scale word-level timestamp annotations are expensive.


9. Abstraction Design Implications

9.1 What Can Be Unified

  • Input audio format contract (16 kHz mono float32).
  • Output text event contract (partial/final tokens, segment boundaries).
  • Session lifecycle and error handling.
  • Metrics and tracing hooks.

9.2 What Should Remain Backend-Specific

  • Feature extractor internals (Whisper vs. Slaney mel).
  • Tokenizer internals and special token semantics.
  • Generation loop mechanics (unconstrained vs. timeline-locked).
  • Delay / endpointing control policy.

9.3 Failure Modes to Avoid

Over-abstraction: if the abstraction hides critical timing semantics, you get incorrect latency assumptions, brittle endpointing, and misleading capability signals.

Under-abstraction: if every backend leaks all internals upward, you get duplicated pipeline logic, inconsistent event semantics, and harder testing.

9.4 Practical Interface Shape

A capability-driven session interface:

  • start_session(...)
  • feed_audio(...)
  • poll_events(...)
  • end_session(...)

with explicit capability flags:

  • native_streaming: bool
  • delay_control: bool
  • word_timestamps: bool
  • rewrite_experimental: bool

This preserves model differences without duplicating pipeline/UI logic.
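
One possible shape in Python (names and types are illustrative, not an existing API):

from dataclasses import dataclass
from typing import Iterable, Optional, Protocol

@dataclass(frozen=True)
class Capabilities:
    native_streaming: bool
    delay_control: bool
    word_timestamps: bool
    rewrite_experimental: bool

@dataclass(frozen=True)
class TranscriptEvent:
    kind: str                       # "partial" | "final" | "boundary" | "error"
    text: str
    start_ms: Optional[int] = None
    end_ms: Optional[int] = None

class SttSession(Protocol):
    capabilities: Capabilities      # a backend factory's start_session(...) returns this object
    def feed_audio(self, pcm_16khz_mono: bytes) -> None: ...
    def poll_events(self) -> Iterable[TranscriptEvent]: ...
    def end_session(self) -> Iterable[TranscriptEvent]: ...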


10. Runtime Considerations

10.1 Throughput vs Decode Schedule

Two distinct constraints:

  1. Segment-style: total decode time must be less than segment wall time.
  2. Timeline-locked: each decode step must complete within its 80ms frame budget.

Similar in spirit, different in failure mode.

10.2 Endpointing

Even with native realtime models, turn segmentation remains an application-level choice:

  • VAD-driven endpointing,
  • model-token-driven endpointing (sequences of [P] tokens as a silence signal; see the sketch below),
  • or hybrid endpointing.
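
For the model-token-driven variant, a minimal sketch: treat a run of consecutive [P] tokens as silence and close the turn once the run exceeds a threshold (token name and threshold are illustrative):

SILENCE_FRAMES_TO_ENDPOINT = 10          # 10 frames × 80ms = 800ms of "nothing to say"

def detect_endpoints(token_stream, pad_token="[P]"):
    silence_run = 0
    for token in token_stream:
        if token == pad_token:
            silence_run += 1
            if silence_run == SILENCE_FRAMES_TO_ENDPOINT:
                yield "END_OF_TURN"      # hand off to finalization / segmentation logic
        else:
            silence_run = 0
            yield token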

10.3 Memory

Realtime note-taking products may co-host:

  • ASR backend (~0.6 GB or ~3.1 GB),
  • optional intent model,
  • optional rewrite LLM (~3-4 GB).

If Voxtral’s decoder can handle rewriting, total memory drops significantly. If not, Voxtral (3.1 GB) + external LLM (3-4 GB) may strain 16 GB machines.

10.4 Partial Transcript Stability

Aggressive low-delay settings increase early revisions; conservative delay reduces revisions but increases perceived lag. This tradeoff is orthogonal to raw WER and should be evaluated separately.


11. Source Validation Notes

11.1 Common Misreports for Qwen3-ASR-0.6B

Values frequently wrong in derivative writeups:

Field                      Often stated    Actual (from config.json)
Decoder hidden size        2048            1024
Audio encoder dim          1024            896
Audio encoder layers       24              18
Audio encoder FFN          4096            3584
Audio output projection    2048            1024

These mismatches arise because many discussions blend paper-level reference architectures, larger Qwen3 variants, and checkpoint-specific configs. For deployment, the checkpoint config is the source of truth.

11.2 Voxtral Variant Confusion

The earlier Voxtral-Mini-3B-2507 (non-realtime) has a different architecture:

  • Bidirectional encoder (not causal)
  • Cross-attention fusion (not additive)
  • Different decoder dimensions

Values from the non-realtime variant should not be mixed with the Realtime-2602 checkpoint. Always verify against the specific checkpoint’s params.json.

11.3 Validation Checklist

When documenting ASR checkpoints:

  1. Prefer config.json / params.json over third-party summaries.
  2. Explicitly verify layer counts, hidden/FFN dims, head counts, sliding windows, vocab size, and special token IDs.
  3. Cite both the paper and the checkpoint config; treat the checkpoint config as definitive.
  4. Pin commit hashes for strict reproducibility.
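
A sketch of items 1, 2, and 4, assuming the checkpoint is pulled from the Hugging Face Hub with huggingface_hub (field names follow the tables above; nesting differs between checkpoints, so the helper searches recursively):

import json
from huggingface_hub import hf_hub_download

def find_field(node, key):
    # Recursively search a possibly nested config dict for the first matching key.
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_field(value, key)
            if found is not None:
                return found
    return None

path = hf_hub_download("Qwen/Qwen3-ASR-0.6B", "config.json")   # pass revision=... to pin a commit
config = json.load(open(path))
for key in ("num_hidden_layers", "hidden_size", "num_key_value_heads", "vocab_size"):
    print(key, "=", find_field(config, key))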

12. References

Primary sources used for numbers and claims:

  1. Qwen3-ASR-0.6B model card: huggingface.co/Qwen/Qwen3-ASR-0.6B
  2. Qwen3-ASR-0.6B config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/config.json
  3. Qwen3-ASR-0.6B preprocessor_config.json: huggingface.co/Qwen/Qwen3-ASR-0.6B/blob/main/preprocessor_config.json
  4. Qwen3-ASR technical report: arxiv.org/abs/2601.21337
  5. Voxtral-Mini-4B-Realtime-2602 model card: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
  6. Voxtral-Mini-4B-Realtime-2602 params.json: huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602/blob/main/params.json
  7. Voxtral Realtime paper (DSM architecture): arxiv.org/abs/2602.11298
  8. MLX-community Voxtral 4-bit weights: huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit

Written while building voissistant, a real-time speech-to-text system for Apple Silicon using MLX.