StepFun Releases StepAudio 2.5 Realtime: End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Understanding

pleasuremandarya@gmail.com 24/05/2026

StepFun, an AI lab based in Shanghai, has released StepAudio 2.5 Realtime, a real-time speech language model with fully customizable capabilities.

Unlike pipeline-based systems that split speech recognition, reasoning, and synthesis into sequential steps, StepAudio 2.5 Realtime is an end-to-end model: audio in and audio out within one integrated system. The model supports Chinese and English.

The model is accessed through a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime, and the model string is step-2.5-realtime.
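To make the connection details concrete, here is a minimal sketch of how a client might assemble the request. Only the endpoint and the model string come from the announcement. Passing the model as a `model` query parameter and sending the key as a Bearer `Authorization` header are assumptions made for illustration, so check StepFun's API documentation before relying on either.

```python
from urllib.parse import urlencode

REALTIME_ENDPOINT = "wss://api.stepfun.com/v1/realtime"  # from the announcement
MODEL_STRING = "step-2.5-realtime"                        # from the announcement


def build_realtime_request(api_key: str, model: str = MODEL_STRING) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for opening a realtime WebSocket session.

    ASSUMPTION: the model is selected with a `model` query parameter and
    the key is sent as a Bearer token. Neither detail is confirmed by the
    announcement.
    """
    if not api_key:
        raise ValueError("api_key must be non-empty")
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_realtime_request("sk-example")  # placeholder key
    print(url)
```

The returned URL and headers can be handed to whichever WebSocket client library you prefer. Keeping this step free of network calls makes it easy to unit test.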

Three Technical Pillars

The StepFun research team describes three core technical components behind the model:

1. Million-Scale Persona Data Augmentation

Starting from 10,000+ high-quality, natively recorded personas, StepFun used algorithmic augmentation to build a persona feature matrix. This is combined with millions of real-world conversation samples for training. The goal is generalization, specifically stable performance on difficult, long-tail conversation topics.

Instead of manually labeling millions of persona samples, the StepFun team used algorithmic expansion from a curated seed set.
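The announcement does not disclose how the expansion algorithm works. As a toy illustration of how a small seed set can grow into a much larger persona space, the sketch below takes the Cartesian product of a few attribute axes. The attribute names and values are hypothetical, and this is not presented as StepFun's method.

```python
from itertools import product


def expand_personas(seed_attributes: dict[str, list[str]]) -> list[dict[str, str]]:
    """Combine every value of every attribute axis into one persona each."""
    names = list(seed_attributes)
    value_lists = (seed_attributes[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    # Hypothetical attribute axes, chosen only for illustration.
    seeds = {
        "tone": ["warm", "dry"],
        "pace": ["slow", "medium", "fast"],
        "role": ["tutor", "companion"],
    }
    personas = expand_personas(seeds)
    print(len(personas))  # 2 * 3 * 2 = 12 combinations
```

The count grows multiplicatively with each added axis, which is why a seed set in the thousands can plausibly cover a million-scale space.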

2. Roleplay-Specific RLHF Alignment

A well-known failure mode in conversational AI is “out-of-character” (OOC) behavior, where the model drifts away from its defined persona mid-conversation. The StepFun team developed a dedicated RLHF (Reinforcement Learning from Human Feedback) stage that specifically targets persona consistency in role-play scenarios. RLHF is a training method in which human preference signals are used to train a reward model, which in turn guides the behavior of the language model. Applying it directly to role stability is an intentional design choice.
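The source does not describe StepFun's reward model or its loss function. For readers new to RLHF, the sketch below shows the Bradley-Terry pairwise objective that is commonly used to train reward models from preference pairs. Here the "chosen" reply would be the in-character one and the "rejected" reply the out-of-character one; that pairing is an assumption for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the preferred reply
    well above the rejected one, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # in exp() when |margin| is large.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical reward scores for an in-character vs. OOC reply.
    print(pairwise_preference_loss(2.0, -1.0))  # correct ordering, low loss
    print(pairwise_preference_loss(-1.0, 2.0))  # wrong ordering, high loss
```

Minimizing this loss over many labeled pairs pushes the reward model to rank persona-consistent replies higher, and that ranking signal is what the policy is later optimized against.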

3. Integrated Understanding and Generation

StepAudio 2.5 Realtime inherits the capabilities of StepAudio 2.5 TTS and deeply integrates speech understanding and generation through reinforcement learning. This allows for what StepFun calls “global scene-level tone adjustment” and “sentence detail sculpting.” The model can set the overall emotional register for a response while adjusting finer acoustic details within individual sentences.

Paralinguistic Understanding

A technically distinct area of this model is paralinguistic understanding. Paralinguistics refers to non-verbal acoustic information in speech: things like pitch, speaking rate, pauses, sighs, and laughter. By analyzing these cues, the model can infer the user’s state and underlying intentions. For example, it can detect fatigue from a low voice or frustration from a fast speaking rate. Capturing these signals requires the model to work on audio features rather than text alone.

StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, which measures perception of speaking rate, emotion, age, and other acoustic characteristics.
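To show the kind of signal a text transcript discards, here is a small sketch that derives two timing cues, speaking rate and pause ratio, from word-level timestamps. It is only an illustration of paralinguistic features in general. An end-to-end model would learn such cues from audio directly, and the function name and input format are hypothetical.

```python
def paralinguistic_summary(word_spans: list[tuple[float, float]]) -> dict[str, float]:
    """Summarize timing cues from non-overlapping (start_sec, end_sec) word spans."""
    if not word_spans:
        raise ValueError("word_spans must be non-empty")
    spans = sorted(word_spans)
    total = spans[-1][1] - spans[0][0]          # first word start to last word end
    if total <= 0:
        raise ValueError("spans must cover a positive duration")
    voiced = sum(end - start for start, end in spans)
    return {
        "words_per_second": len(spans) / total,
        "pause_ratio": (total - voiced) / total,  # share of time spent silent
    }


if __name__ == "__main__":
    # Four half-second words spread over four seconds, with long gaps.
    spans = [(0.0, 0.5), (0.5, 1.0), (2.0, 2.5), (3.5, 4.0)]
    print(paralinguistic_summary(spans))
    # {'words_per_second': 1.0, 'pause_ratio': 0.5}
```

The same four words transcribe to identical text whether spoken in two seconds or four, which is why these cues have to come from the audio side.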

Benchmark results

StepFun’s research team ran an extensive set of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.

The human evaluation was based on real conversations that people held through a mobile app. Scores:

Human evaluation (subjective): 80.41
General conversation (objective): 86.36
In-vehicle scenario (objective): 84.80
Spoken QA, which includes 11 audio comprehension tasks (objective): 79.80
Paralinguistic comprehension (objective): 82.18

Key Takeaways

StepAudio 2.5 Realtime is an end-to-end real-time voice model, released by Shanghai-based StepFun.
It uses roleplay-specific RLHF and million-scale persona data augmentation to keep personas stable and consistent.
The model ranked first in all five benchmarks, tested in April 2026.
Paralinguistic understanding (perceiving tone, quality, and emotion in the voice) is a core technical differentiator.
API access is via WebSocket at wss://api.stepfun.com/v1/realtime with the model string step-2.5-realtime.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
