AI Sparks

StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Understanding

StepFun, an AI lab based in Shanghai, has released StepAudio 2.5 Realtime, a real-time speech language model with fully customizable capabilities.

StepAudio 2.5 Realtime is a voice model that works in real time. Unlike pipeline-based systems that split speech recognition, reasoning, and synthesis into sequential steps, this is an end-to-end model: audio goes in and audio comes out of one integrated system. The model supports Chinese and English.

The model is accessed through a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime, and the model string is step-2.5-realtime.
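As a minimal sketch of how a client might prepare that connection, the snippet below only builds the URL and headers that a WebSocket library would need. The endpoint and model string come from the announcement. Passing the model as a `model` query parameter, sending the key as a Bearer token, and the `STEPFUN_API_KEY` variable name are assumptions, so check StepFun's API documentation before relying on them.

```python
import os
from urllib.parse import urlencode

REALTIME_URL = "wss://api.stepfun.com/v1/realtime"  # endpoint named in the article
MODEL = "step-2.5-realtime"                          # model string named in the article


def build_connection(api_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) to hand to a WebSocket client.

    Assumptions, not confirmed by the article: the model is selected with a
    `model` query parameter and the API key is sent as a Bearer token.
    """
    url = f"{REALTIME_URL}?{urlencode({'model': MODEL})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    # STEPFUN_API_KEY is a hypothetical environment variable name.
    url, headers = build_connection(os.environ.get("STEPFUN_API_KEY", "sk-placeholder"))
    print(url)
```

The returned pair can be passed to whichever WebSocket client the application already uses. No network call is made here.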

Three Technical Pillars

The StepFun research team describes three basic architectural elements behind the model:

1. Million-Scale Persona Data Augmentation

Starting from 10,000+ high-quality, natively recorded personas, StepFun used algorithmic augmentation to build a persona feature matrix. This is combined with millions of real-world conversation samples for training. The goal is generalization, specifically stable performance on difficult, long-tail conversation topics.

Instead of manually labeling millions of persona samples, the StepFun team used algorithmic expansion from a curated seed set.
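StepFun has not published how the expansion works. Purely as an illustration of how a small seed set can grow combinatorially, the toy sketch below crosses each seed persona with every combination of a few variation axes. The persona names, axes, and `expand` helper are all hypothetical.

```python
from itertools import product

# Hypothetical seeds and variation axes; none of these names come from StepFun.
seed_personas = ["calm teacher", "cheerful barista"]
axes = {
    "mood": ["neutral", "tired", "excited"],
    "pace": ["slow", "fast"],
}


def expand(seeds, axes):
    """Cross every seed with every combination of axis values."""
    names = list(axes)
    rows = []
    for seed, combo in product(seeds, product(*axes.values())):
        rows.append({"seed": seed, **dict(zip(names, combo))})
    return rows


matrix = expand(seed_personas, axes)
print(len(matrix))  # 2 seeds * 3 moods * 2 paces = 12 rows
```

By the same arithmetic, 10,000 seeds with 100 attribute combinations each would give 1,000,000 rows, which is one way a "million-scale" matrix could arise from a seed set of that size.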

2. Roleplay-Specific RLHF Alignment

A well-known failure mode in conversational AI is “out-of-character” (OOC) behavior, where the model drifts away from its defined persona during a conversation. The StepFun team developed a dedicated RLHF (Reinforcement Learning from Human Feedback) process aimed specifically at persona consistency in role-play scenarios. RLHF is a training method in which human preference signals are used to train a reward model, which in turn guides the behavior of the language model. Applying it directly to role stability is an intentional design choice.
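The announcement does not say which objective StepFun uses for its reward model. A common choice for training a reward model from pairwise human preferences is the Bradley-Terry loss, sketched here under that assumption. In this setting, `reward_chosen` would be the reward model's score for the in-character reply that annotators preferred, and `reward_rejected` the score for the out-of-character one.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the reward model scores the preferred reply higher.
print(preference_loss(0.0, 0.0))  # equal scores give log(2)
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Minimizing this loss over many labeled pairs pushes the reward model to rank in-character replies above out-of-character ones. That reward then serves as the training signal for the policy.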

3. Integrated Understanding and Generation

StepAudio 2.5 Realtime inherits the capabilities of StepAudio 2.5 TTS and deeply integrates speech understanding and generation through reinforcement learning. This allows for what StepFun calls “global scene-level tone adjustment” and “sentence detail sculpting.” The model can set the overall emotional register of a response while adjusting the finer acoustic details within individual sentences.

Paralinguistic Understanding

A technically distinctive area of this model is paralinguistic understanding. Paralinguistics refers to non-verbal acoustic information in speech, such as pitch, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can infer the user’s state and underlying intent. For example, it can detect fatigue from a low voice or frustration from a fast speaking rate. Capturing these signals requires the model to work on audio features rather than text alone.
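An end-to-end model learns these cues directly from audio, so the following is not how StepAudio works internally. It is only a hand-crafted illustration of two of the signals named above, speaking rate and pauses, computed from hypothetical word-level timestamps. The `Word` type and both helpers are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the utterance
    end: float    # seconds from the start of the utterance


def speaking_rate(words: list[Word]) -> float:
    """Words per second from the first word's start to the last word's end."""
    duration = words[-1].end - words[0].start
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    return len(words) / duration


def pause_ratio(words: list[Word]) -> float:
    """Fraction of the utterance spent in silence between consecutive words."""
    duration = words[-1].end - words[0].start
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    silence = sum(max(0.0, b.start - a.end) for a, b in zip(words, words[1:]))
    return silence / duration


utterance = [Word("I", 0.0, 0.2), Word("am", 0.5, 0.7), Word("fine", 1.0, 1.5)]
print(speaking_rate(utterance), pause_ratio(utterance))
```

Neither feature is visible in a plain transcript of "I am fine", which is why text-only pipelines lose this information.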

StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, which measures perception of speaking rate, emotion, age, and other acoustic characteristics.

Benchmark Results

StepFun’s research team ran an extensive set of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.

The human evaluation is based on real conversations that people held in a mobile app. Scores:

  • Human evaluation (subjective): 80.41
  • General discussion (objective): 86.36
  • Vehicle status (objective): 84.80
  • Spoken QA, which includes 11 audio comprehension tasks (objective): 79.80
  • Paralinguistic comprehension (objective): 82.18

Key Takeaways

  • StepAudio 2.5 Realtime is an end-to-end real-time voice model released by Shanghai-based StepFun.
  • It uses roleplay-specific RLHF and million-scale persona data augmentation to keep personas consistent.
  • The model ranked first in all five benchmarks, tested in April 2026.
  • Paralinguistic understanding (perceiving tone, speaking rate, and emotion in the audio) is a core technical differentiator.
  • API access is via WebSocket at wss://api.stepfun.com/v1/realtime with the model string step-2.5-realtime.



Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
