Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on agentic workloads, such as long-horizon software engineering tasks.

The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under five minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.

The TL;DR

prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
GLM-5 trained on SWE tasks at 131k sequence length, with steps under 5 minutes, on 28 H200 nodes.
Asynchronous RL decouples the trainer from inference so each can be optimized independently.
Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
Training uses 3-D parallelism (FSDP, EP, CP) and block-scaled FP8.

What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It targets post-training of large open-source models on agentic tasks. Version 0.6.0 extends this to the trillion-parameter MoE scale.

The example in the announcement is zai-org/GLM-5.1. The configuration also applies to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

A full GLM-5.1 run starts with a single command on a Slurm cluster.

uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

Role of asynchronous RL

Agentic tasks have long-tailed outliers. Some coding rollouts run for hours. Waiting for them before each policy update would leave the GPUs idle.
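A rough sketch of the idle-time arithmetic behind that claim. It assumes one rollout per GPU slot and uses made-up durations; neither comes from the announcement. Under a synchronous barrier every slot waits for the slowest rollout, so utilization is total busy time divided by slots times the longest duration.

```python
def sync_utilization(durations_min):
    """Fraction of slot-time spent generating when all slots wait for the slowest rollout."""
    longest = max(durations_min)
    return sum(durations_min) / (len(durations_min) * longest)


# Hypothetical batch: seven short rollouts and one long-tail outlier (minutes).
durations = [5, 5, 5, 5, 5, 5, 5, 120]
print(f"{sync_utilization(durations):.1%}")  # 16.1%
```

With a single two-hour outlier among five-minute rollouts, the slots are busy only about a sixth of the time.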

Asynchronous RL avoids this. The trainer and inference systems are decoupled. They run and scale independently. The inference policy is updated as soon as a training step completes.

There is one point of synchronization: the policy update. prime-rl pushes new weights as soon as they are available. Rollouts that are already in flight keep their existing prefix cache. So a single rollout may mix tokens from several policy versions.

New rollouts behave differently. They fill their own KV cache from scratch, even when the prefix is identical. A KV-cache salt enforces this. Requests generated by a policy that is too old are discarded. The max_off_policy_steps value controls that limit.
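The two rules above can be sketched in a few lines. This is a minimal illustration, not prime-rl's code: the `Rollout` class, `kv_cache_salt`, `is_too_stale`, and the strict "greater than" comparison are all assumptions; only the name `max_off_policy_steps` comes from the text.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    start_version: int  # policy version that was live when the rollout began


def kv_cache_salt(policy_version: int) -> str:
    """Hypothetical salt: differs per policy version, so cached prefixes never cross versions."""
    return hashlib.sha256(f"policy-{policy_version}".encode()).hexdigest()[:16]


def is_too_stale(rollout: Rollout, current_version: int, max_off_policy_steps: int) -> bool:
    """Assumed rule: discard once the rollout lags the trainer by more than the limit."""
    return current_version - rollout.start_version > max_off_policy_steps


print(kv_cache_salt(3) != kv_cache_salt(4))            # True: a new policy gets a fresh cache namespace
print(is_too_stale(Rollout("fix the bug", 1), 10, 8))  # True: lag of 9 exceeds the limit of 8
```

Keying the salt on the policy version means two rollouts share cached prefixes only when they started under the same weights.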

Inference optimization

Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.

FP8 inference: Low precision speeds up both prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.

Wide Expert Parallelism: Wide EP distributes experts across ≥32 GPUs. It pairs with a large data-parallel size, for example 32. Each GPU holds a different subset of experts and acts as its own data-parallel rank. Synchronization occurs at each layer, through dispatch and combine operations.
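To make the dispatch step concrete, here is a small sketch of how routed tokens could be grouped by the GPU that owns each expert. The contiguous, even sharding and the expert count of 256 are assumptions for illustration; the announcement does not describe the placement scheme.

```python
from collections import defaultdict


def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Assumed placement: experts are split into equal contiguous ranges, one per EP rank."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch(token_to_experts, num_experts: int, ep_size: int):
    """Group (token, expert) pairs by the EP rank that holds the expert."""
    per_rank = defaultdict(list)
    for token_id, experts in enumerate(token_to_experts):
        for expert_id in experts:
            rank = expert_owner(expert_id, num_experts, ep_size)
            per_rank[rank].append((token_id, expert_id))
    return dict(per_rank)


# Two tokens, each routed to three experts (hypothetical routing decisions).
routes = [[0, 9, 255], [8, 15, 16]]
print(dispatch(routes, num_experts=256, ep_size=32))
```

Every MoE layer repeats this grouping, sends each group to its rank, and then combines the expert outputs back per token, which is why synchronization happens once per layer.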

Prefill/Decode disaggregation: Some model↔env pairs reach a 4:1 prefill:decode token ratio. Workers that handle both phases see higher end-to-end latency. That reduces the benefits of PipelineRL. P/D disaggregation separates prefill workers from decode workers. Long tool outputs then no longer stall the decode workers.

KV cache management: High concurrency requires a large KV cache. prime-rl supports tiered offloading to CPU and disk. vLLM's native offloading creates one pool per worker. The Mooncake Store instead pools RAM and disk across all nodes in the cluster.

Request routing: prime-rl ships a vllm-router fork by default. It also supports the NVIDIA Dynamo router as a gateway. The routers score workers by KV cache overlap, queue depth, and live load.
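As an illustration of scoring on those three signals, the sketch below picks the worker with the best weighted score. The `Worker` fields, the linear formula, and the weights are all hypothetical; neither router's actual scoring function is described in the announcement.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int  # prompt tokens already present in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    load: float                # live utilization in [0, 1]


def score(worker: Worker, prompt_tokens: int,
          w_overlap: float = 1.0, w_queue: float = 0.1, w_load: float = 0.5) -> float:
    """Hypothetical score: reward KV overlap, penalize queueing and load."""
    overlap = worker.cached_prefix_tokens / max(prompt_tokens, 1)
    return w_overlap * overlap - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers, prompt_tokens: int) -> Worker:
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    Worker("A", cached_prefix_tokens=900, queue_depth=2, load=0.8),
    Worker("B", cached_prefix_tokens=0, queue_depth=0, load=0.1),
    Worker("C", cached_prefix_tokens=900, queue_depth=8, load=0.9),
]
print(pick_worker(workers, prompt_tokens=1000).name)  # A
```

Worker A wins because reusing 90% of the prefix outweighs its moderate queue, while worker C has the same overlap but is too backed up.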

Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures the routing decisions made during inference. It then replays them directly in the trainer. This reduces the KL mismatch by almost an order of magnitude. The routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Configured PyTorch functions handle the processing.
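A minimal sketch of the replay idea for a single token in a single layer, under stated assumptions: the trainer reuses the expert indices recorded at inference time but computes gate weights from its own logits, here with a softmax over the selected experts. The gating formula, the function names, `top_k=8`, and 2-byte indices are assumptions for illustration; only the 78 layers, the 131k sequence length, and the payload shape come from the article.

```python
import math


def topk_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def gate_weights(logits, experts):
    """Softmax over the selected experts' logits only (assumed gating rule)."""
    peak = max(logits[e] for e in experts)
    exps = [math.exp(logits[e] - peak) for e in experts]
    total = sum(exps)
    return [x / total for x in exps]


def route(trainer_logits, k, replayed_experts=None):
    """Use the recorded experts when available, otherwise the trainer's own top-k."""
    experts = list(replayed_experts) if replayed_experts is not None else topk_indices(trainer_logits, k)
    return experts, gate_weights(trainer_logits, experts)


def replay_payload_bytes(num_layers, top_k, seq_len, bytes_per_index=2):
    """Size of one rollout's [num_layers, top_k, seq_len] index tensor."""
    return num_layers * top_k * seq_len * bytes_per_index


trainer_logits = [2.0, 1.9, 0.1, 2.1]             # near-ties make top-k sensitive to numerics
print(route(trainer_logits, k=2)[0])               # [3, 0]: what the trainer would pick alone
print(route(trainer_logits, k=2, replayed_experts=[0, 1])[0])  # [0, 1]: what inference actually used
print(replay_payload_bytes(78, 8, 131072))         # 163577856 bytes for one full-length rollout
```

Near-tied logits are where small numerical differences flip the top-k choice, and replay removes exactly that source of divergence. The per-rollout size also shows why the indices add up quickly across a batch.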

Training optimization

The trainer builds on torchtitan, PyTorch’s native training framework. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

Strategy	What it shards	Main use	Key details
FSDP (FSDP2)	Parameters, gradients, optimizer states	Baseline memory reduction	All-gathers weights on demand, layer by layer, with `fully_shard`
Expert Parallelism (EP)	Experts within MoE layers	Reduces per-layer working memory	`all2all` dispatch/combine; torch-native or DeepEP
Context Parallelism (CP)	The sequence dimension	Activation memory for long contexts	Ulysses (default) or Ring Attention

EP exists because individual layers are still large after FSDP. With 78 layers and 800B parameters in float32, fully gathering one layer requires about 40GB. Prefetching the next layer to overlap communication pushes that closer to 80GB. Setting EP=8 sends tokens to the experts instead of gathering full expert weights. The torch-native all2all is slightly faster within a single node. DeepEP wins once EP spans multiple nodes.
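The 40GB and 80GB figures follow from simple arithmetic. The sketch below assumes parameters are spread evenly across the 78 layers and ignores embeddings and other non-layer weights; the function name is illustrative.

```python
def gathered_layer_gb(total_params, num_layers, bytes_per_param=4, layers_resident=1):
    """Memory (decimal GB) needed to hold fully gathered layers on one GPU."""
    params_per_layer = total_params / num_layers
    return params_per_layer * bytes_per_param * layers_resident / 1e9


one_layer = gathered_layer_gb(800e9, 78)                      # current layer only
with_prefetch = gathered_layer_gb(800e9, 78, layers_resident=2)  # current plus prefetched layer
print(f"{one_layer:.1f} GB, {with_prefetch:.1f} GB")  # 41.0 GB, 82.1 GB
```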

CP matters at 131k+ sequence lengths. There, activations dominate memory, not parameters. GLM-5 uses DSA, which is not directly compatible with Ulysses or Ring Attention. So prime-rl ships its own custom implementation.

FP8 training. prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, because of quantization overhead. Its real value is matching the trainer's precision to that of inference. That reduces the KL mismatch and stabilizes training.
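For readers unfamiliar with block scaling, here is a pure-Python simulation of the idea: each block of values gets its own scale so that its largest element maps to the FP8 E4M3 maximum of 448. This is a sketch of the general technique, not the DeepGEMM kernels; the 1-D blocks of 128 elements, the rounding helper, and the function names are assumptions for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_e4m3(v):
    """Round to the nearest E4M3-representable value (simulation only, no NaN handling)."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # subnormals reuse the smallest normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits
    return math.copysign(round(a / step) * step, v)


def quantize_blocks(values, block=128):
    """Quantize with one scale per block; returns (quantized values, scales)."""
    quantized, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(x) for x in chunk)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_e4m3(x / scale) for x in chunk)
    return quantized, scales


def dequantize_blocks(quantized, scales, block=128):
    return [x * scales[i // block] for i, x in enumerate(quantized)]


# One outlier in the second block; everything else is small.
values = [0.001] * 128 + [1000.0] + [0.001] * 127
q, s = quantize_blocks(values, block=128)
per_block = dequantize_blocks(q, s, block=128)
q1, s1 = quantize_blocks(values, block=256)  # a single scale for the whole tensor
per_tensor = dequantize_blocks(q1, s1, block=256)
print(per_block[0], per_tensor[0])
```

With one scale per block, the outlier only affects the block it sits in, so the first block keeps its small values. With a single scale for the whole tensor, those values round to zero.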

Use cases with examples

Long-horizon SWE agents: Train the model on real repository issues. Rollouts can span hundreds of turns and tool calls. The P/D split keeps decode latency predictable here.
1T-scale post-training on few nodes: The GLM-5 run fits on 28 H200 nodes. Wide EP and KV offloading increase concurrency and throughput.
Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means more stable training.

pleasuremandarya@gmail.com 5 hours ago