Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models That Beat gpt-realtime-translate in Accuracy and Latency


Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both cover five languages and stream results live in the browser.

Gradium claims a better accuracy-to-latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice control, including cloning, which gpt-realtime-translate lacks.

The TL;DR

Gradium has introduced two models for real-time speech translation: stt-translate (speech → text) and s2s-translate (speech → speech).
They cover five languages (EN, FR, DE, ES, PT) and 20 pairs, collapsing the usual 3-model cascade into 2.
Accuracy leads gemini-3.5-live-translate in BLEU and MetricX, and beats gpt-realtime-translate in BLEU (comparable on MetricX).
Latency averages 3.0s, ahead of gpt-realtime-translate (3.6s) and just behind gemini-3.5-live-translate (2.9s).
Unlike gpt-realtime-translate, you can choose the output voice or clone your own, over a single duplex WebSocket.

stt-translate

stt-translate takes speech in one language and outputs translated text in another. It supports English (EN), French (FR), German (DE), Spanish (ES) and Portuguese (PT).

Any source maps to any target in that entire set. That’s 20 language pairs in total, in all directions.
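The count of 20 follows from treating each direction as its own pair: five sources times four remaining targets. A minimal sketch of that arithmetic is below; the lowercase language codes are an assumption for illustration, not a list taken from the Gradium API.

```python
from itertools import permutations

# Assumed lowercase codes for the five supported languages (illustrative only).
LANGUAGES = ["en", "fr", "de", "es", "pt"]

# Ordered (source, target) pairs where source and target differ.
pairs = list(permutations(LANGUAGES, 2))

print(len(pairs))  # 5 sources * 4 targets = 20
```

Because the pairs are ordered, ("en", "fr") and ("fr", "en") are counted separately, which is what "in all directions" means here.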

The main design choice is to fold two steps into one. Transcription and translation happen in one pass, within the speech model. There is no intermediate transcript to wait for and no handoff between systems.

According to Gradium, this approach draws on the Hibiki-Zero framework. The model reaches low latency and high accuracy in combination with reinforcement learning. This also means fewer moving parts in the pipeline.

s2s-translate

s2s-translate turns speech in one language into speech in another, end to end. It builds on stt-translate and pairs it with the Gradium TTS model in a single service.

It streams audio over WebSocket. You get both the synthesized audio and the translated text as they are produced.

That removes the integration task. You don't wire STT and TTS together yourself or manage two connections. The server runs the pipeline and streams the results back.

Audio input is PCM at 24 kHz, 16-bit signed mono. The audio output is PCM at 48 kHz, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.
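For raw PCM, the byte size of a chunk follows directly from the sample rate, sample width, and channel count. The helper below is a hypothetical illustration (it is not part of the Gradium SDK) that shows where the 1920-byte, 40 ms chunk used in the SDK example in this article comes from.

```python
def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes in one chunk of raw PCM audio (defaults: 16-bit mono)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels

print(pcm_chunk_bytes(24_000, 40))  # input side: 960 samples * 2 bytes = 1920
print(pcm_chunk_bytes(48_000, 40))  # output side: 1920 samples * 2 bytes = 3840
```

The same duration of audio is twice as large on the output side because the output sample rate is doubled.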

How Gradium Measures Quality: BLEU and MetricX

Translation quality is not a single number, so Gradium reports two complementary metrics:

BLEU (Bilingual Evaluation Understudy) is a long-standing standard for machine translation (Papineni et al.). It measures the n-gram overlap between the model output and a reference human translation. It ranges from 0 to 100, and higher is better.

BLEU is fast, reproducible, and comparable across systems. Its limitation is that it rewards exact word matches. A correct translation that uses different words can be penalized.
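To make that limitation concrete, here is a simplified sketch of clipped n-gram precision, one ingredient of BLEU. It is not the full metric (real BLEU combines precisions for n = 1 to 4 and applies a brevity penalty), and the example sentences are invented for illustration; they are not from Gradium's benchmark.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also appear in the reference,
    with each n-gram counted at most as often as the reference has it."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the meeting starts at noon"
paraphrase = "the session begins at midday"  # same meaning, different words

print(clipped_precision(reference, reference, 1))   # 1.0
print(clipped_precision(paraphrase, reference, 1))  # 0.4 (only "the" and "at" match)
print(clipped_precision(paraphrase, reference, 2))  # 0.0 (no shared bigrams)
```

The paraphrase is a reasonable translation, yet it scores poorly on overlap alone, which is why an overlap metric is usually paired with a learned one.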

MetricX is a learned quality metric, developed by Google (Juraska et al.). It predicts how a human would rate the translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.

The two capture different failure modes. BLEU assesses lexical fidelity; MetricX assesses semantic adequacy.

Benchmark

Gradium benchmarks on a proprietary conversational speech dataset. The data covers everyday topics such as work, travel, and weather, rather than written text.

Against gemini-3.5-live-translate, Gradium leads in both BLEU and MetricX. Against gpt-realtime-translate, Gradium leads in BLEU and is comparable in MetricX.

Capability	Gradium	`gpt-realtime-translate`	`gemini-3.5-live-translate`
Average latency (all pairs)	3.0s	3.6s	2.9s
BLEU (higher is better)	Leads both	Lower than Gradium	Lower than Gradium
MetricX (lower error is better)	Comparable to GPT; leads Gemini	Comparable to Gradium	Higher error than Gradium
Select the output voice	Yes (catalogue)	No	Not stated
Clone your own voice	Yes	No	Not stated
Languages	5 languages, 20 pairs	Not stated	Not stated

Accuracy (BLEU and MetricX) is measured on stt-translate's translations; latency is for the full s2s-translate pipeline. Read it as a tradeoff, not a clean sweep. Gemini is a little faster; Gradium is more accurate and adds voice control.

Why Two Models Beat Three

A typical speech-to-speech stack uses three models: speech-to-text, then text-to-text translation, and then text-to-speech. Each stage is a separate inference call. Each adds processing time and a handoff.

Gradium uses two. stt-translate does transcription and translation in one pass. The dedicated text-to-text stage disappears completely.

That removes one full model from the critical path, along with its latency and handoff. The end-to-end path is shorter than a three-model cascade of the same quality.

The numbers back the design. s2s-translate averages 3.0s across all language pairs. That beats gpt-realtime-translate at 3.6s and sits close to gemini-3.5-live-translate at 2.9s.
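As a quick check on the size of those gaps, the sketch below turns the three reported averages into relative differences. Only the latency figures come from the article; the helper name is made up for illustration.

```python
# Reported average end-to-end latencies in seconds (from the article).
LATENCY_S = {
    "s2s-translate": 3.0,
    "gpt-realtime-translate": 3.6,
    "gemini-3.5-live-translate": 2.9,
}

def relative_gap_pct(ours: float, theirs: float) -> float:
    """Percentage by which `ours` differs from `theirs` (negative means faster)."""
    return (ours - theirs) / theirs * 100

ours = LATENCY_S["s2s-translate"]
for name in ("gpt-realtime-translate", "gemini-3.5-live-translate"):
    print(f"vs {name}: {relative_gap_pct(ours, LATENCY_S[name]):+.1f}%")
# vs gpt-realtime-translate: -16.7%
# vs gemini-3.5-live-translate: +3.4%
```

On these averages, Gradium is roughly a sixth faster than gpt-realtime-translate and a few percent slower than gemini-3.5-live-translate.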

Use Cases with Examples

Live dubbing and localization: Clone the presenter's voice once. Translate a French keynote into Spanish that sounds like a native speaker.
Multilingual voice agents: Route a support call through s2s-translate. The English-speaking agent hears the German caller in English, and the agent's answers stream back in German.
Real-time meetings: Pipe microphone audio over WebSocket. Each participant receives translated speech and a transcript in their own language.
Accessibility and subtitles: Use stt-translate on its own if you only need text. Provide live translated captions without audio synthesis.

Translate in a Few Lines of Code

The Python SDK streams audio through a speech-to-speech endpoint and returns the translated audio along with its transcript.

import asyncio
import numpy as np
from gradium import client as gradium_client

grc = gradium_client.GradiumClient()  # reads GRADIUM_API_KEY from the environment

setup = {
    "model_name": "s2s-translate",
    "input_format": "pcm_24000",        # 24 kHz, 16-bit signed mono input
    "output_format": "pcm_48000",       # 48 kHz, 16-bit signed mono output
    "voice_id": "cLONiZ4hQ8VpQ4Sz",     # must be a voice in the target language
    "stt_model_name": "stt-translate",
    "tts_model_name": "default",
    "target_language": "en",
}

# Raw 24 kHz, 16-bit mono PCM bytes (from a file, buffer, or microphone).
with open("input_24k_mono.pcm", "rb") as f:
    pcm = f.read()

async def main() -> np.ndarray:
    audio_out: list[bytes] = []
    async with grc.s2s_realtime(wait_for_ready_on_start=True, **setup) as s2s:
        async def send_loop():
            for i in range(0, len(pcm), 1920):       # 1920 bytes = 40 ms at 24 kHz
                await s2s.send_audio(pcm[i : i + 1920])
            await s2s.send_eos()                     # signal end of input

        async def recv_loop():
            async for msg in s2s:
                if msg["type"] == "audio":
                    audio_out.append(msg["audio"])           # translated speech (bytes)
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)  # translated transcript
                elif msg["type"] == "end_of_stream":
                    break

        # asyncio.TaskGroup requires Python 3.11 or later
        async with asyncio.TaskGroup() as tg:
            tg.create_task(send_loop())
            tg.create_task(recv_loop())

    return np.frombuffer(b"".join(audio_out), dtype=np.int16)  # 48 kHz mono PCM

translated_pcm = asyncio.run(main())

The SDK offers three ways to drive S2S. Use s2s_realtime for live sources, s2s_stream for iterable audio sources, and s2s for buffered files. All three talk to wss://api.gradium.ai/api/speech/s2s.
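To listen to the result, the raw 48 kHz, 16-bit mono PCM can be wrapped in a WAV container using only the Python standard library. This is a sketch, not part of the Gradium SDK: save_pcm16_wav is a hypothetical helper, and a synthetic tone stands in for real translated audio so the snippet runs without an API key.

```python
import math
import os
import struct
import tempfile
import wave

def save_pcm16_wav(path: str, pcm_bytes: bytes, sample_rate_hz: int = 48_000) -> None:
    """Write little-endian 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)

# Stand-in for translated audio: 0.5 s of a 440 Hz tone at 48 kHz.
tone = [int(12_000 * math.sin(2 * math.pi * 440 * t / 48_000)) for t in range(24_000)]
demo_pcm = struct.pack(f"<{len(tone)}h", *tone)

demo_path = os.path.join(tempfile.mkdtemp(), "translated_demo.wav")
save_pcm16_wav(demo_path, demo_pcm)
```

With the SDK example above, the equivalent call would be save_pcm16_wav("translated.wav", translated_pcm.tobytes()), assuming the returned array holds little-endian int16 samples.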

Strengths and Weaknesses

Strengths

Single-pass stt-translate removes one model from the latency path
Leads gemini-3.5-live-translate on both BLEU and MetricX
Output voice selection and cloning, which gpt-realtime-translate lacks
A single duplex WebSocket replaces the hand-wired STT-plus-TTS pipeline

Weaknesses

Five languages at launch, with only 20 pairs in that set
gemini-3.5-live-translate's average latency is slightly lower at 2.9s
MetricX is only comparable to, not ahead of, gpt-realtime-translate
Benchmarks use a proprietary data set, so external replication is limited
