Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models That Beat gpt-realtime-translate in Accuracy and Latency

Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both work across five languages and stream results live in the browser.
Gradium claims a better accuracy-to-latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice control, including cloning, which gpt-realtime-translate lacks.
The TL;DR
- Gradium has introduced two models for real-time speech translation: stt-translate (speech → text) and s2s-translate (speech → speech).
- They cover five languages (EN, FR, DE, ES, PT) and 20 pairs, collapsing the usual 3-model cascade into 2.
- Accuracy leads gemini-3.5-live-translate in BLEU and MetricX, and beats gpt-realtime-translate in BLEU (comparable in MetricX).
- Latency averages 3.0s, ahead of gpt-realtime-translate (3.6s) and just behind gemini-3.5-live-translate (2.9s).
- Unlike gpt-realtime-translate, you can choose the output voice or clone your own, over a single duplex WebSocket.
stt-translate
stt-translate takes speech in one language and returns translated text in another. It supports English (EN), French (FR), German (DE), Spanish (ES) and Portuguese (PT).
Any source maps to any target within that set. That's 20 language pairs in total, counting both directions.
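The pair count follows from ordered pairs of distinct languages: 5 sources times 4 remaining targets gives 20. A quick sketch, using the language codes listed above (the variable names are illustrative, not part of any Gradium API):

```python
from itertools import permutations

# Language codes as listed in the article.
LANGUAGES = ["en", "fr", "de", "es", "pt"]

# Ordered (source, target) pairs where source and target differ.
pairs = list(permutations(LANGUAGES, 2))

print(len(pairs))  # 20
```

Because the pairs are ordered, FR → ES and ES → FR count separately.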
The main design choice is to fold two steps into one. Transcription and translation happen in a single pass, inside the speech model. There is no intermediate transcript to wait for and no handoff between systems.
According to Gradium, this approach draws on the Hibiki-Zero framework. The model reaches low latency and high accuracy with the help of reinforcement learning. The result is fewer moving parts in the pipeline.
s2s-translate
s2s-translate turns speech in one language into speech in another, end to end. It builds on stt-translate and pairs it with Gradium's TTS model in a single service.
It streams audio over a WebSocket. You receive both the synthesized audio and the translated text as they are produced.
That removes the integration work. You don't wire STT and TTS together yourself or manage two connections. The server runs the pipeline and streams the results back.
Audio input is PCM at 24 kHz, 16-bit signed mono. The audio output is PCM at 48 kHz, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.
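For streaming, these formats translate directly into chunk sizes. The sketch below computes how many bytes of raw PCM cover a given duration; the 40 ms frame length is taken from the SDK example later in this article, and the helper name is hypothetical.

```python
def pcm_chunk_bytes(sample_rate_hz: int, duration_ms: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM needed to hold duration_ms of audio."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample * channels

print(pcm_chunk_bytes(24_000, 40))  # 1920 bytes of input per 40 ms
print(pcm_chunk_bytes(48_000, 40))  # 3840 bytes of output per 40 ms
```

The defaults assume 16-bit (2-byte) mono samples, matching the PCM formats described above.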
How Gradium Measures Quality: BLEU and MetricX
Translation quality is not a single number, so Gradium reports two complementary metrics:
BLEU (Bilingual Evaluation Understudy) is a long-standing standard for machine translation (Papineni et al.). It measures the n-gram overlap between the model output and a reference human translation. It ranges from 0 to 100, and higher is better.
BLEU is fast, reproducible, and comparable across systems. Its limitation is that it rewards exact word matches. A correct translation that uses different words can be penalized.
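To see why paraphrases lose points, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not full BLEU (which combines precisions for several n-gram orders and applies a brevity penalty), and the toy sentences are illustrative, not drawn from Gradium's benchmark.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of hypothesis n-grams found in the reference, crediting each
    n-gram at most as many times as it appears in the reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

reference = "the cat sat on the mat"
print(clipped_precision("the cat sat on the mat", reference))      # 1.0
print(clipped_precision("a feline rested on the rug", reference))  # 2 of 6 unigrams match
```

The second hypothesis carries roughly the same meaning but shares only "on" and "the" with the reference, so its overlap score drops sharply.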
MetricX is a learned quality metric, developed by Google (Juraska et al.). It predicts how a human would rate the translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.
The two catch different failures. BLEU assesses lexical fidelity; MetricX assesses semantic adequacy.
Benchmark
Gradium benchmarks on a proprietary conversational speech dataset. The data covers everyday topics such as work, travel, and weather, rather than read-aloud text.
Against gemini-3.5-live-translate, Gradium leads in both BLEU and MetricX. Against gpt-realtime-translate, Gradium leads in BLEU and is comparable in MetricX.
| Capability | Gradium | gpt-realtime-translate | gemini-3.5-live-translate |
|---|---|---|---|
| Average latency (all pairs) | 3.0s | 3.6s | 2.9s |
| BLEU (higher is better) | Leads both | Lower than Gradium | Lower than Gradium |
| MetricX (lower error is better) | Comparable to GPT; leads Gemini | Comparable to Gradium | Higher error than Gradium |
| Choose the output voice | Yes (catalog) | No | Not stated |
| Clone your own voice | Yes | No | Not stated |
| Languages | 5 languages, 20 pairs | Not stated | Not stated |
Accuracy (BLEU and MetricX) is measured on stt-translate's translations; latency is for the full s2s-translate pipeline. Read it as a tradeoff, not a clean sweep. Gemini is slightly faster; Gradium is more accurate and adds voice control.
Why Two Models Beat Three
A typical speech-to-speech stack uses three models: speech-to-text, then text-to-text translation, then text-to-speech. Each stage is a separate model call. Each adds processing time and a handoff.
Gradium uses two. stt-translate does transcription and translation in one pass. The dedicated text-to-text stage disappears completely.
That removes one full model from the critical path, along with its latency and handoff. The end-to-end path is shorter than a three-model cascade of the same quality.
The numbers back the design. s2s-translate averages 3.0s across all language pairs. That beats gpt-realtime-translate at 3.6s and sits close to gemini-3.5-live-translate at 2.9s.
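To put those averages in relative terms, the sketch below computes the percentage gap between the reported figures. The latencies are the ones quoted in this article; the helper name is illustrative.

```python
# Average latencies in seconds, as reported in the article.
LATENCY_S = {
    "s2s-translate": 3.0,
    "gpt-realtime-translate": 3.6,
    "gemini-3.5-live-translate": 2.9,
}

def relative_gap(ours: float, theirs: float) -> float:
    """Percent difference of ours relative to theirs (negative means ours is faster)."""
    return (ours - theirs) / theirs * 100

ours = LATENCY_S["s2s-translate"]
for name, theirs in LATENCY_S.items():
    if name != "s2s-translate":
        print(f"vs {name}: {relative_gap(ours, theirs):+.1f}%")
```

On these figures, s2s-translate is about 17% faster than gpt-realtime-translate and about 3% slower than gemini-3.5-live-translate.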
Use Cases with Examples
- Live dubbing and localization: Clone the presenter's voice once. Translate a French keynote into Spanish that sounds like a native speaker.
- Multilingual voice agents: Route a support call through s2s-translate. The English-speaking agent hears the German caller in English, and the agent's answer is streamed back in German.
- Real-time meetings: Pipe microphone audio over WebSocket. Each participant receives translated speech and a transcript in their own language.
- Accessibility and subtitles: Use stt-translate on its own if you only need text. Provide live translated captions without audio generation.
Translate in a Few Lines of Code
The Python SDK streams audio through a speech-to-speech endpoint and returns the translated audio along with its transcript.
```python
import asyncio

import numpy as np
from gradium import client as gradium_client

grc = gradium_client.GradiumClient()  # reads GRADIUM_API_KEY from the environment

setup = {
    "model_name": "s2s-translate",
    "input_format": "pcm_24000",  # 24 kHz, 16-bit signed mono input
    "output_format": "pcm_48000",  # 48 kHz, 16-bit signed mono output
    "voice_id": "cLONiZ4hQ8VpQ4Sz",  # must be a voice in the target language
    "stt_model_name": "stt-translate",
    "tts_model_name": "default",
    "target_language": "en",
}

# Raw 24 kHz, 16-bit mono PCM bytes (from a file, buffer, or microphone).
with open("input_24k_mono.pcm", "rb") as f:
    pcm = f.read()


async def main() -> np.ndarray:
    audio_out: list[bytes] = []
    async with grc.s2s_realtime(wait_for_ready_on_start=True, **setup) as s2s:

        async def send_loop():
            for i in range(0, len(pcm), 1920):  # 1920 bytes = 40 ms at 24 kHz
                await s2s.send_audio(pcm[i : i + 1920])
            await s2s.send_eos()  # signal end of input

        async def recv_loop():
            async for msg in s2s:
                if msg["type"] == "audio":
                    audio_out.append(msg["audio"])  # translated speech (bytes)
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)  # translated transcript
                elif msg["type"] == "end_of_stream":
                    break

        # Send and receive concurrently (asyncio.TaskGroup requires Python 3.11+).
        async with asyncio.TaskGroup() as tg:
            tg.create_task(send_loop())
            tg.create_task(recv_loop())

    return np.frombuffer(b"".join(audio_out), dtype=np.int16)  # 48 kHz mono PCM


translated_pcm = asyncio.run(main())
```
The SDK offers three ways to drive S2S. Use s2s_realtime for live sources, s2s_stream for iterator-style sources, and s2s for buffered files. All three talk to wss://api.gradium.ai/api/speech/s2s.
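The example above ends with raw 48 kHz PCM samples, which most players will not open directly. A minimal sketch for wrapping that output in a WAV container with Python's standard wave module follows; the helper name and output path are hypothetical, and one second of silence stands in for real translated audio.

```python
import os
import tempfile
import wave

def save_pcm_as_wav(pcm: bytes, path: str, sample_rate_hz: int = 48_000) -> None:
    """Wrap raw 16-bit signed mono PCM in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)

# Placeholder input: one second of silence at 48 kHz.
out_path = os.path.join(tempfile.gettempdir(), "translated.wav")
save_pcm_as_wav(b"\x00\x00" * 48_000, out_path)
```

With the SDK example, the call would be `save_pcm_as_wav(translated_pcm.tobytes(), out_path)`, assuming a little-endian machine so that the NumPy bytes match WAV's byte order.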
Strengths and Weaknesses
Strengths
- Single-pass stt-translate removes one model from the latency path
- Leads gemini-3.5-live-translate on both BLEU and MetricX
- Output voice selection and cloning, which gpt-realtime-translate lacks
- A single duplex WebSocket replaces the hand-wired STT-plus-TTS pipeline
Weaknesses
- Five languages at launch, with only 20 pairs in that set
- gemini-3.5-live-translate's latency is slightly lower at 2.9s
- MetricX is only comparable to, not ahead of, gpt-realtime-translate
- Benchmarks use a proprietary dataset, so external replication is limited