Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model That Transcribes Six Languages with DiffusionGemma’s Parallel Denoising Decoder

Interfaze, a young YC startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small. The model transcribes audio with a diffusion decoder, not an autoregressive one. The team describes it as the first open-source multilingual diffusion ASR model. One adapter handles six languages. The research team trained only 42M parameters on top of the frozen 26B backbone. That’s about 0.16% of the model’s weights.
Two terms matter here. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model applies the diffusion approach to speech-to-text.
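To make the contrast concrete, here is a minimal sketch that only counts decoder forward passes under each style. The function name and the example token counts are illustrative assumptions, and the sketch ignores that a single pass can cost a different amount in each style.

```python
def forward_passes(transcript_tokens: int, denoising_steps: int) -> dict:
    """Count decoder forward passes under each decoding style (illustrative only)."""
    return {
        # Autoregressive: one pass per emitted token, so the count grows with length.
        "autoregressive": transcript_tokens,
        # Diffusion: each step updates the whole canvas, so the count is the step budget.
        "diffusion": denoising_steps,
    }


for n_tokens in (20, 80, 160):
    print(n_tokens, forward_passes(n_tokens, denoising_steps=16))
```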
The TL;DR
- Claimed by the Interfaze team to be the first open-source multilingual diffusion ASR model: six languages from a single ~42M parameter adapter.
- It transcribes with DiffusionGemma’s diffusion decoder, which uses uniform random-token diffusion, not absorbing (mask) diffusion.
- Transcription cost scales with denoising steps, not transcript length.
- Leads diffusion peers on LibriSpeech test-clean (6.6% WER vs 8.3% for Whisfusion) but trails autoregressive Whisper.
- The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.
What is diffusion-gemma-asr-small?
diffusion-gemma-asr-small is an ASR model that converts speech to text using a separate diffusion decoder. That decoder is DiffusionGemma, Google’s 26B mixture-of-experts model. DiffusionGemma activates 4B parameters, routing to at most 8 of its 128 experts. It generates text with discrete diffusion instead of autoregression.
The diffusion details matter. Most diffusion LLMs use absorbing (mask) diffusion. DiffusionGemma uses uniform random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps the confident predictions and re-randomizes the rest. After a few steps the noise resolves into text.
Interfaze added audio to this model. Out of the box, DiffusionGemma takes text, images, and video. It does not take audio. The repo ships only the trained adapter, about 42M parameters. The frozen backbones download separately from their own repositories.
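The loop described above can be sketched as a toy. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in that cheats by reading the target string (a real model would condition on audio), and the confidence threshold and step count are assumptions chosen for illustration.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the network: one (token, confidence) guess per position."""
    guesses = []
    for current, truth in zip(canvas, target):
        # Positions that already hold the right token are predicted with full confidence.
        conf = 1.0 if current == truth else rng.random()
        guesses.append((truth, conf))
    return guesses


def denoise(target, steps=8, threshold=0.6, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: a fixed-length canvas of random vocabulary tokens.
    canvas = [rng.choice(VOCAB) for _ in target]
    for step in range(steps):
        guesses = toy_denoiser(canvas, target, rng)  # every position in one pass
        last = step == steps - 1
        # Keep confident guesses, re-randomize the rest; commit everything on the last step.
        canvas = [
            tok if (conf >= threshold or last) else rng.choice(VOCAB)
            for tok, conf in guesses
        ]
    return "".join(canvas)


print(denoise("hello world"))
```

The canvas length never changes, and the number of passes is fixed by `steps` rather than by the length of the text.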
How does it work?
The model does not feed raw waveforms to the LLM. The first attempt tried that and failed. A frozen LLM has never seen a spectrogram, and its embedding space has no concept of formants or phones. The model learned to ignore the audio and produce fluent nonsense.
The working design uses a frozen whisper-small encoder. It serves only as a feature extractor, not a decoder. Whisper converts 30 seconds of audio into 1500 frames. Each frame carries 768-dimensional acoustic features. A small trainable projector then compresses these frames. It uses 8× downsampling conv layers and a linear map. The output is 188 “audio tokens” of size 2816. These tokens are scattered into reserved <|audio|> placeholder slots. LoRA adapters let the backbone attend to this new modality. The decoder then adds a 192-token canvas and denoises it over about 16 steps.
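The frame counts above can be checked with the standard 1-D convolution length formula. The article only says the projector downsamples 8×; the choice of three stride-2 layers with kernel 3 and padding 1 below is an assumption made for this sketch, not the confirmed architecture.

```python
def conv_out_len(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def audio_token_count(frames: int = 1500, layers: int = 3) -> int:
    """Frames left after `layers` stride-2 convolutions (assumed layout, 2**3 = 8x)."""
    n = frames
    for _ in range(layers):
        n = conv_out_len(n)
    return n


whisper_dim, llm_dim = 768, 2816  # feature sizes quoted in the article
tokens = audio_token_count()
print((1500, whisper_dim), "->", (tokens, llm_dim))  # 1500 -> 750 -> 375 -> 188
```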
The pipeline, from the model card, consists of:
raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
─► scatter into
Unblocking training
The first training run stalled. The loss plateaued around 8. The failure was circular. The projector started from random initialization, so its output was noise. Attention then learned to ignore it. Almost no gradient reached the projector. The model did not learn.
The fix supervised the projector directly. The research team passed the 188 audio tokens through DiffusionGemma’s frozen lm_head and applied a CTC loss against the transcript. CTC stands for Connectionist Temporal Classification. It aligns audio frames with text without requiring attention.
This broke the deadlock. The audio embeddings became direct word predictions. The CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell from 90% → 52% → 14.6% → 6.6% over ten epochs.
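For readers new to CTC, the loss described above can be sketched with the textbook forward algorithm in plain Python. This is an illustration of the objective, not the team's training code; in practice a framework implementation would be used, and the blank index of 0 is an assumption.

```python
import math


def logsumexp(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T rows (one per audio token), each a list of per-vocab log-probabilities.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


# Two frames, vocabulary {blank, 1}, uniform predictions, target [1]:
# three of the four alignments are valid, so the loss is -log(0.75).
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
print(round(ctc_loss(uniform, [1]), 4))  # 0.2877
```

Because every audio token receives a gradient through this sum over alignments, the projector gets a training signal even when attention in the backbone is ignoring it.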
Performance on benchmarks
WER stands for Word Error Rate, where lower is better. CER stands for Character Error Rate. The model was trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer and 16 diffusion steps.
| benchmark | metric | result |
|---|---|---|
| LibriSpeech test-clean (en) | WER | 6.6% |
| FLEURS English | WER | 15.7% |
| VoxPopuli English | WER | 18.5% |
| FLEURS Hindi | CER | 15.8% |
| FLEURS Mandarin | CER | 29.6% |
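
As a reminder of what the two metrics in the table measure, here is a minimal sketch of WER and CER as edit distance divided by reference length. Lowercasing is only a crude stand-in for the Whisper text normalizer, and counting spaces as characters in CER is an assumption; conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; the reference must contain at least one word."""
    ref = reference.lower().split()
    return edit_distance(ref, hypothesis.lower().split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; the reference must be non-empty."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```
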
Against other diffusion or non-autoregressive ASR models, it leads.
| model | approach | LibriSpeech test-clean WER |
|---|---|---|
| TransFusion (2022) | multinomial diffusion | ~6–7% (proof of concept) |
| Whisfusion (Aug 2025) | Whisper-large-v3 + masked diffusion | 8.3% |
| diffusion-gemma-asr-small (2026) | Whisper-small + DiffusionGemma | 6.6% |
Against autoregressive Whisper, it trails. The team attributes this gap to data, not architecture.
| benchmark | ours | whisper-small | whisper-large-v3 |
|---|---|---|---|
| LibriSpeech test-clean | 6.6% | ~3.4% | ~2.0% |
| FLEURS-en | 15.7% | ~9–10% | ~4–5% |
| VoxPopuli-en | 18.5% | ~9–11% | ~7–10% |
The denoising-step sweep shows an almost flat curve.
| steps | FLEURS-en WER | speed |
|---|---|---|
| 8 | 15.7% | 14.9× real time |
| 16 | 15.6% | 10.3× |
| 32 | 15.2% | 6.5× |
| 48 | 15.6% | 4.7× |
Going from 8 steps to 48 buys about 0.1 WER point. It costs about 3× the latency. The model converges in approximately 8 parallel passes. That’s about 0.7–1.5s of model time for a 10-second clip.
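The latency figures follow directly from the speed column: model time is clip duration divided by the real-time multiple. A small worked check, using only the numbers from the sweep table (the variable and function names are illustrative):

```python
# steps -> speed as a multiple of real time, copied from the sweep table
sweep = {8: 14.9, 16: 10.3, 32: 6.5, 48: 4.7}


def model_seconds(clip_seconds: float, speedup: float) -> float:
    """Seconds of model time needed for a clip at a given real-time multiple."""
    return clip_seconds / speedup


for steps, speedup in sweep.items():
    print(f"{steps:>2} steps: {model_seconds(10.0, speedup):.2f} s for a 10 s clip")

print(f"latency ratio, 48 vs 8 steps: {sweep[8] / sweep[48]:.2f}x")
```

The 0.7–1.5s range quoted above corresponds to the 8-step and 32-step rows; at 48 steps the same clip takes a little over 2 seconds.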
Use cases with examples
- Batch transcription pipelines benefit from parallel decoding. Cost is set by denoising steps, not clip length. A 10-second clip requires about the same number of passes as a shorter one.
- Multilingual transcription works on a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin. Teams avoid loading a different model for each language.
- Non-autoregressive ASR research gains a reproducible baseline. The recipe pairs a frozen LLM with a small adapter. Researchers can extend it with more audio data or a larger encoder.
Getting started
The model lives on the Hugging Face Hub. The repo ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support requires transformers from main.
pip install torch peft soundfile librosa huggingface_hub
"transformers @ git+
Then write in Python:
import sys, soundfile as sf
from huggingface_hub import snapshot_download
repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small") # adapter, ~170 MB
sys.path.insert(0, repo)
from inference import load, transcribe
# Loads frozen DiffusionGemma-26B + whisper-small + this adapter.
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")
wav, sr = sf.read("audio.wav") # 16 kHz mono float32
print(transcribe(wav, model, tok, fe, max_steps=16))
The command line route also works within the downloaded repo:
python inference.py audio.wav
The max_steps argument trades accuracy for speed. Setting it to 8 is near-best and faster. The default is 16. The base models load under their own licenses: DiffusionGemma under Gemma terms, whisper-small under MIT.
Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.



