
IBM Releases Two Open-Source Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and a Non-Autoregressive Variant for Fast Inference

IBM has released two new open speech recognition models – Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR – and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.

The pair addresses a problem enterprise AI teams know well: most automatic speech recognition (ASR) systems either demand heavy compute or sacrifice accuracy to stay within budget. IBM's bet is that careful architecture decisions can let you have it both ways.

What These Models Actually Do

Granite Speech 4.1 2B is a compact speech-and-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, specializes in ASR – specifically targeting latency-sensitive deployments – and supports English, French, German, Spanish, and Portuguese, but not Japanese. That is a notable difference: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM also quietly released a third companion to these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what – and exactly when – is a must.

Word Error Rate (WER) is the key metric for transcription quality; lower is better. A WER of 5% means that 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B posts an average WER of 5.33. Drilling into the benchmark data: the model achieves a WER of 1.33 on LibriSpeech clean and 2.5 on LibriSpeech other.
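For intuition about what the metric actually measures, here is a minimal, textbook-style sketch of WER as word-level edit distance over reference length. This is just an illustration, not IBM's evaluation harness or the leaderboard's scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```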

Architecture, Explained

Both models share the same three-part design at a high level – a speech encoder, a speech-to-text modality adapter, and a language model – although the decoding method differs significantly.

The first component is the speech encoder. It uses 16 conformer blocks trained with Connectionist Temporal Classification (CTC), with two classification heads – one for graphemic (character-level) output and one for BPE units – and frame importance sampling to focus training on the informative parts of the audio. A conformer is a neural network block that combines convolution (good at capturing local acoustic patterns) with self-attention (good at capturing long-range dependencies). CTC is a training objective that lets a model learn from audio-text pairs without requiring precise frame-level alignment.
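The key property of CTC is that the loss is computed from unaligned (audio, transcript) pairs. The PyTorch sketch below shows that objective on dummy encoder outputs; the frame counts, batch size, and vocabulary are placeholders, not IBM's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- not the real Granite Speech configuration.
T, B, vocab = 200, 4, 64          # encoder frames, batch size, BPE vocabulary (blank at index 0)

# Pretend these are frame-level logits from the conformer encoder's CTC head.
log_probs = torch.randn(T, B, vocab, requires_grad=True).log_softmax(dim=-1)

# Unaligned target token ids and their lengths -- CTC needs no frame-level alignment.
targets = torch.randint(1, vocab, (B, 50))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(20, 50, (B,), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients would flow back into the encoder during training
```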

The second component is the speech-to-text modality adapter. A 2-layer windowed query transformer (Q-Former) operates on blocks of 15 1024-dimensional acoustic embeddings from the last conformer block, downsampling them by a factor of 5 using 3 trainable queries per block and per layer, for an acoustic embedding rate of roughly 10 Hz at the LLM input. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).
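To make the downsampling concrete, here is a toy sketch of the windowed learned-query idea: a few trainable query vectors cross-attend to each block of 15 acoustic frames and emit 3 summary vectors per block, a 5x reduction. This is an illustration of the mechanism only, not IBM's Q-Former implementation, and the single attention layer here stands in for their 2-layer design.

```python
import torch
import torch.nn as nn

class WindowedQueryPooler(nn.Module):
    """Compress each window of `window` acoustic frames into `n_queries` embeddings
    via cross-attention from trainable queries (a simplified, single-layer Q-Former)."""
    def __init__(self, dim=1024, window=15, n_queries=3, n_heads=8):
        super().__init__()
        self.window, self.n_queries = window, n_queries
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                        # x: (batch, frames, dim)
        b, t, d = x.shape
        t = (t // self.window) * self.window     # drop the ragged tail for simplicity
        blocks = x[:, :t].reshape(b * (t // self.window), self.window, d)
        q = self.queries.unsqueeze(0).expand(blocks.size(0), -1, -1)
        pooled, _ = self.attn(q, blocks, blocks) # queries attend to the frames in their window
        return pooled.reshape(b, -1, d)          # 5x fewer time steps than the input

pooler = WindowedQueryPooler()
acoustic = torch.randn(2, 150, 1024)             # 150 encoder frames per utterance
print(pooler(acoustic).shape)                    # torch.Size([2, 30, 1024])
```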

The third component is the language model. Granite Speech 4.1 2B uses granite-4.0-1b-base with a 128k context length, fine-tuned during the speech training stages. In the NAR variant, this becomes a 1B-parameter LLM editor – granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context – adapted with LoRA at rank 128 applied to both the attention and MLP layers.
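For readers replicating a similar setup, a rank-128 LoRA over attention and MLP projections would look roughly like this with the peft library. The Hugging Face repo id and the target module names below are assumptions (typical for transformer-style decoders), not details taken from IBM's training code.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base LLM named in the article; the exact Hugging Face repo id is an assumption.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.0-1b-base")

lora_cfg = LoraConfig(
    r=128,                      # LoRA rank reported for the NAR variant
    lora_alpha=256,             # assumed scaling factor; not stated in the article
    target_modules=[            # attention + MLP projections (assumed module names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```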

Autoregressive vs. Non-Autoregressive Tradeoff

This is where the two models diverge most sharply, with direct implications for production deployment.

In the standard Granite Speech 4.1 2B, text is generated autoregressively – one token at a time, each conditioned on every token before it. This produces accurate, stable transcripts with full AST support, keyword-biased recognition, and punctuation, but decoding is inherently sequential and slow to scale.
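The cost of that approach is easy to see in pseudocode: each new token requires another forward pass through the decoder, so latency grows with transcript length. A generic greedy decoding loop (not Granite-specific, shown purely to illustrate the sequential dependency) looks like this:

```python
import torch

def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=256):
    """Generic autoregressive greedy decoding: one forward pass per generated token.
    prompt_ids: tensor of shape (1, prompt_length) for a single utterance."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                 # full forward pass at every step
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)    # token t+1 depends on tokens 1..t
        if next_id.item() == eos_id:
            break
    return ids
```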

Granite Speech 4.1 2B-NAR takes a very different approach. Rather than generating tokens one at a time, it edits the CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with far faster inference than autoregressive decoding. IBM calls this the NLE (Non-autoregressive LLM-based Editing) architecture. More precisely: the CTC encoder produces an initial draft transcript, that draft is placed into the input sequence, and the bidirectional LLM predicts an edit – copy, insert, delete, or substitute – at every position simultaneously in one pass.
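To illustrate the edit-based idea (this is a toy illustration, not IBM's NLE implementation), imagine the model emits one edit operation per draft position; applying those edits to the CTC draft yields the final transcript in a single pass, with no token-by-token loop.

```python
# Toy illustration of non-autoregressive editing over a CTC draft transcript.
# Each draft position gets one predicted operation; "insert" carries the token to add
# before that position. The edits below are hand-written for the example.

draft = ["the", "cat", "sat", "on", "teh", "mat", "mat"]
edits = [
    ("copy", None), ("copy", None), ("copy", None), ("copy", None),
    ("substitute", "the"),          # fix the CTC misrecognition
    ("copy", None),
    ("delete", None),               # drop the duplicated word
]

def apply_edits(draft, edits):
    out = []
    for token, (op, arg) in zip(draft, edits):
        if op == "copy":
            out.append(token)
        elif op == "substitute":
            out.append(arg)
        elif op == "insert":
            out.extend([arg, token])
        # "delete": emit nothing for this position
    return " ".join(out)

print(apply_edits(draft, edits))    # "the cat sat on the mat"
```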

The NAR model measured an RTFx of about 1820 on a single H100 GPU using batched inference at a batch size of 128. RTFx (real-time factor multiplier) measures how many times faster than real time the model can process audio – an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical note for developers: the NAR model requires flash_attention_2, because that attention backend supports sequence packing and respects the is_causal=False flag.
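In practice that means passing the Flash Attention 2 backend when loading the model. A hedged sketch with Hugging Face transformers follows; the repository id and the processor/model classes are assumptions, so check the model card on Hugging Face for the exact names and prompt format.

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Repo id and auto classes are assumptions -- consult the Hugging Face model card.
model_id = "ibm-granite/granite-speech-4.1-2b-nar"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # backend the article says is required
    device_map="cuda",
)
```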

Training Data and Infrastructure

The two models were trained on different data mixes. The standard model was trained on 174,000 hours of audio from publicly available ASR and AST corpora, plus synthetic datasets built to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on approximately 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.

The infrastructure gap between the two is equally telling. Training the standard model took 30 days – 26 days for the speech encoder and 4 days for the remaining stages – on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs – far lighter training, reflecting how much simpler edit prediction is than fully autoregressive generation.

Key Takeaways

Here are 5 short takeaways:

  • IBM has released two open source ASR models — Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive) — both ~2B parameters, and Apache 2.0 licensed.
  • The standard model achieves an average WER of 5.33 on the Open ASR Leaderboard, supports 6 ASR languages (including Japanese), bidirectional speech translation, keyword biasing, and punctuation – competing with models many times its size.
  • The NAR model trades capability for speed – it drops Japanese, AST, and keyword biasing, but delivers ~1820 RTFx on a single H100 GPU by editing the CTC hypothesis in one forward pass rather than generating tokens one at a time.
  • The architecture has three main components: a 16-layer CTC-trained conformer encoder with two classification heads, a 2-layer windowed Q-Former projector that downsamples audio to a 10 Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
  • A third variant, Granite Speech 4.1 2B-Plus, is also available; it extends the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.

Check out the Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR model cards on Hugging Face.


