
Inworld AI Introduces Realtime TTS-2: A Closed-Loop Voice Model That Adapts to How You Really Speak

Voice AI has a dirty secret: most of it was never designed for conversation. The dominant paradigm – text in, audio out – traces its lineage to audiobook narration and voiceover production, where the model never hears the person on the other end. That's fine if you're producing a podcast intro. It falls apart when a frustrated user is trying to get support from an AI agent at 11pm.

Inworld AI is taking aim at exactly that with Realtime TTS-2, a new voice model released as a research preview on the Inworld API and the Inworld Realtime API. The model hears the full audio of the exchange, picks up on the user's tone, pacing, and mood, and takes voice direction in plain English, the way engineers prompt LLMs.

What’s Really Different Here

The central architectural difference in TTS-2 is that it operates as a closed loop. The model takes the actual audio of the previous exchange as input, not just a transcription – it hears how the user sounded. That is not a trivial difference. The transcript of "okay, okay" gives you the words. The sound of "okay, okay" tells you whether someone is relaxed, resigned, or sarcastic. TTS-2 is designed to use that signal.

The same line lands differently after a joke than after bad news, and the model knows the difference because it has heard the previous turn. Tone, pacing, and mood carry forward automatically. Simply put, conversation audio flows through a Realtime session on its own, without developers needing to manage explicit prior_audio fields or build extra pipelines. A minimal sketch of the idea follows.
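To make the contrast concrete, here is what closed-loop input could look like from a developer's seat. Everything in this sketch is an assumption for illustration – the base URL, the synthesize endpoint, the field names, and the voice ID are not the documented Inworld API; only the core idea, that the request carries the previous turn's audio alongside the text, comes from the announcement.

```python
import base64
import requests

API_KEY = "YOUR_INWORLD_API_KEY"      # placeholder credential
BASE_URL = "https://api.inworld.ai"   # hypothetical base URL for this sketch

def synthesize_reply(text: str, prior_turn_wav: bytes | None = None) -> bytes:
    """Sketch of a context-aware TTS call.

    A classic TTS request sends only text and gets audio back. The
    closed-loop idea is that the request also carries the raw audio of
    the previous conversational turn, so the model hears how the user
    sounded, not just what they said. 'prior_audio' is a hypothetical
    field name used purely for illustration.
    """
    payload = {"text": text, "voice_id": "support_agent"}  # illustrative fields
    if prior_turn_wav is not None:
        payload["prior_audio"] = base64.b64encode(prior_turn_wav).decode("ascii")

    resp = requests.post(
        f"{BASE_URL}/tts/v1/synthesize",  # hypothetical endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # audio bytes for the agent's reply

# The same sentence after two different user turns: the transcript is
# identical, but the audio context should change the delivery.
# reply = synthesize_reply("Okay, let's take a look.", prior_turn_wav=user_audio)
```

In the Realtime API, per the announcement, even this plumbing disappears: the session carries the conversation audio forward on its own.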

Four Skills, One Model

The Inworld team ships TTS-2 with four headline capabilities, positioning the combination, rather than any single piece, as the differentiator.

  1. Voice Guidance: Lets developers direct delivery with natural-language commands inline at runtime. Instead of selecting from a static enum like [sad] or [excited], developers pass free-form bracketed directions such as [speak sadly, as if something bad just happened] directly in the text. Long, descriptive directions beat short labels – the model responds much better to full context than to one-word tags. Non-verbal inline markers such as [laugh], [sigh], [breathe], [clear_throat], and [cough] can be dropped anywhere in the text where the moment should occur, and the model renders them as sound events, not spoken words (see the first sketch after this list).
  2. Conversational Awareness: The closed-loop design described above – a break from previous-generation TTS models, which treat each sentence as a stateless generation call.
  3. Multilingual Support: A single voice's identity is preserved across more than 100 languages, including language switches within a single generation. No language flag is required – the model handles switches automatically, keeping timbre, pitch, and character consistent throughout. High-resource languages approach native-speaker quality, while the long tail is flagged as launch-window territory, consistent with the model's research-preview status.
  4. Advanced Voice Design: Generates a saved voice from a written description; no reference audio is required. Developers can describe a character in prose, save the result as a reusable voice, and call it like any other voice in the app. Voice Design ships with three stability modes: Clear (for expressive consumer chat and companion experiences), Balanced (for most agent automation tasks), and Stable (for IVR and professional deployments where pitch drift is unacceptable). A sketch of this flow follows the list as well.
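To see what capability #1 might look like in practice, here is a sketch of a synthesis request using inline guidance. The bracketed tags themselves ([speak sadly, ...], [sigh]) are quoted from the announcement; the endpoint, request fields, and client code around them are the same assumptions carried over from the sketch above.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"
BASE_URL = "https://api.inworld.ai"   # hypothetical base URL

# Guidance is written inline, in plain English, inside the text itself.
# Long, descriptive directions reportedly beat one-word labels, and
# non-verbal markers like [sigh] are rendered as sound events, not read aloud.
text = (
    "[speak sadly, as if something bad just happened] "
    "I checked with the billing team. [sigh] "
    "Your refund was rejected, but I can escalate it right now. "
    # No language flag needed: the model is said to keep the same voice
    # identity across a mid-generation language switch.
    "Un momento, por favor."
)

resp = requests.post(
    f"{BASE_URL}/tts/v1/synthesize",  # hypothetical endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": text, "voice_id": "support_agent"},  # illustrative fields
    timeout=30,
)
resp.raise_for_status()
with open("reply.wav", "wb") as f:
    f.write(resp.content)
```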
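Voice Design (capability #4) could plausibly be exercised the same way, with prose in place of reference audio. Only the workflow – describe a character, pick a stability mode, save a reusable voice – comes from the announcement; the voices:design endpoint and every field name below are hypothetical.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"
BASE_URL = "https://api.inworld.ai"   # hypothetical base URL

# Voice Design: create a saved, reusable voice from prose alone - no
# reference audio. The three mode names (Clear / Balanced / Stable) come
# from the announcement; "stability" as a field name is an assumption.
resp = requests.post(
    f"{BASE_URL}/voices/v1/voices:design",  # hypothetical endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "description": (
            "A warm, unhurried woman in her 50s with a light Irish accent, "
            "patient and reassuring, like a favorite librarian."
        ),
        "stability": "Stable",  # Stable suits IVR, where pitch drift is unacceptable
    },
    timeout=30,
)
resp.raise_for_status()
voice_id = resp.json()["voice_id"]  # illustrative response shape
# voice_id can now be called like any other voice in the app.
```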

The Conversational Layer Underneath

Beyond the four headline capabilities, Inworld points to a set of behaviors that push the output into what it describes as "thoughtful person" territory. The most interesting part of the technology is the contrast: the model reproduces natural disfluencies well – ums, self-corrections, filled pauses, and mid-thought restarts – in a way that reads as warmth and presence rather than malfunction. Notably, different speaker profiles cluster fillers differently, and the model follows that rhythm: filler-as-energy sounds different from filler-as-hesitation. Voice cloning is also supported with a two-step API: upload a reference sample (5–15 seconds, clean, single speaker) to /voices/v1/voices:clone, get back a voice ID, and use it like any other voice (sketched below).
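The /voices/v1/voices:clone path in this sketch is quoted from the announcement; the authentication, request body, and response shape are assumptions for illustration.

```python
import base64
import requests

API_KEY = "YOUR_INWORLD_API_KEY"
BASE_URL = "https://api.inworld.ai"   # hypothetical base URL

# Step 1: upload a clean, single-speaker reference sample (5-15 seconds).
with open("reference.wav", "rb") as f:
    sample_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{BASE_URL}/voices/v1/voices:clone",  # path quoted from the article
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"audio": sample_b64},            # request shape is an assumption
    timeout=60,
)
resp.raise_for_status()
voice_id = resp.json()["voice_id"]         # illustrative response field

# Step 2: use the returned ID like any other voice in synthesis requests.
print("cloned voice ready:", voice_id)
```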

Where It Sits In The Stack

TTS-2 is one layer in Inworld's broader Realtime API pipeline. The full stack includes Realtime STT, which transcribes and profiles a speaker in a single pass – capturing age, pitch, tone, speaking style, emotional state, and cadence as structured signals over the same connection; a Realtime Router that routes across 200+ models and tools, choosing the right one based on the user's state and the state of the conversation; and TTS-2 as the output layer. The pipeline runs over a single continuous WebSocket connection, with sub-200ms median time-to-first-audio at the TTS layer; a client-side sketch follows.

(data as of May 5, 2026)
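Here is a rough sketch of what driving that single-WebSocket pipeline could look like from the client side. The wss:// URL and the message schema are invented for illustration; only the single-continuous-connection design and the latency claim come from the announcement.

```python
import asyncio
import json

import websockets  # pip install websockets

WS_URL = "wss://realtime.inworld.ai/v1/session"  # hypothetical endpoint
API_KEY = "YOUR_INWORLD_API_KEY"

def play(frame: bytes) -> None:
    """Placeholder: route synthesized audio frames to your playback device."""

async def run_session(mic_chunks):
    """One continuous WebSocket: mic audio goes up, agent audio comes down.

    STT, speaker profiling, routing, and TTS-2 all live server-side on
    this same stream; the client just exchanges frames and events.
    """
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # websockets >= 14 uses additional_headers=; older versions use extra_headers=
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:

        async def upstream():
            for chunk in mic_chunks:      # raw PCM frames from the microphone
                await ws.send(chunk)

        async def downstream():
            async for message in ws:
                if isinstance(message, bytes):
                    # TTS-2 output; the article claims sub-200ms median
                    # time-to-first-audio at this layer.
                    play(message)
                else:
                    event = json.loads(message)  # e.g. profiling/routing events
                    print("event:", event.get("type"))

        await asyncio.gather(upstream(), downstream())

# asyncio.run(run_session(microphone_frames()))  # microphone_frames is yours to supply
```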

Broader Context

Realtime TTS 1.5 already sits at #1 in the Artificial Analysis Speech Arena (as of May 5, 2026), ahead of Google (#2) and ElevenLabs (#3). The introduction of TTS-2 signals that Inworld considers raw audio quality a solved problem and is now competing on behavior: context awareness, steerability, and identity consistency across languages.


Check out the Docs and Technical details. Also, feel free to follow us on Twitter and don't forget to join our 130k+ ML SubReddit and subscribe to our Newsletter. Are you on Telegram? You can join us there too.


