Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on agentic workloads, such as long-horizon software engineering tasks.
The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stay under five minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.
The TL;DR
- prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
- GLM-5 was trained on SWE tasks at 131k sequence length, with sub-5-minute steps, on 28 H200 nodes.
- Asynchronous RL decouples the trainer from inference so each can be optimized independently.
- Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
- Training uses 3-D parallelism (FSDP, EP, CP) and block-scaled FP8.
What is prime-rl 0.6.0?
prime-rl is an open framework for asynchronous reinforcement learning. It targets post-training of large open-source models on agentic tasks. Version 0.6.0 extends this to the trillion-parameter MoE scale.
The example in the announcement is zai-org/GLM-5.1. The configuration also applies to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.
A full GLM-5.1 run starts with a single command on a Slurm cluster:
uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd
Role of asynchronous RL
Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would leave the GPUs idle.
Asynchronous RL avoids this. The trainer and inference systems are decoupled. They run and scale independently. The inference policy is updated as soon as a training step completes.
There is one point of synchronization: the policy update. prime-rl pushes new weights as soon as they are available. Rollouts that are already in flight keep using their existing prefix cache. So a single rollout may mix tokens from several policy versions.
New rollouts behave differently. They prefill their own KV cache, even when their prefixes match an earlier request. A KV-cache salt enforces this. Requests that fall too far behind the current policy are discarded. The max_off_policy_steps value controls that limit.
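The two mechanisms above can be sketched in a few lines. This is an illustrative sketch only, not prime-rl's code: the `Rollout` class, `kv_cache_key`, and `is_too_stale` are hypothetical names, and the strict `>` comparison is an assumption about how the limit is applied. Only `max_off_policy_steps` comes from the announcement.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    start_step: int  # trainer step of the policy that started this rollout


def kv_cache_key(prompt_prefix: str, policy_step: int) -> str:
    """Hash a prefix together with a salt derived from the policy version.

    Identical prefixes under different weights get different keys, so a new
    rollout cannot reuse KV entries computed by an older policy.
    """
    salted = f"{policy_step}:{prompt_prefix}".encode()
    return hashlib.sha256(salted).hexdigest()


def is_too_stale(rollout: Rollout, current_step: int, max_off_policy_steps: int) -> bool:
    """Assumed rule: discard a rollout once it lags the trainer by more than the limit."""
    return current_step - rollout.start_step > max_off_policy_steps
```

With a limit of 2, a rollout started at step 3 is still usable at step 5, while one started at step 2 would be dropped.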
Inference optimization
Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.
FP8 inference: Lower precision speeds up both prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
Wide Expert Parallelism: Wide EP distributes experts across ≥32 GPUs. It pairs with a large data-parallel size, for example 32. Each GPU holds a different subset of experts and stores only those weights. Synchronization occurs at each layer, through dispatch and combine operations.
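The dispatch and combine steps can be illustrated with a single-process toy. This is a sketch under simplifying assumptions: real systems use all-to-all collectives (for example DeepEP) across GPUs, gate weights are omitted, and every name here (`dispatch`, `combine`, `expert_fn`) is hypothetical rather than part of prime-rl or vLLM.

```python
from collections import defaultdict


def dispatch(tokens_per_rank, routes_per_rank, experts_per_rank):
    """Send each (token, expert) pair to the rank that owns that expert."""
    inbox = defaultdict(list)  # dest_rank -> [(src_rank, token_idx, expert_id, value)]
    for src, (tokens, routes) in enumerate(zip(tokens_per_rank, routes_per_rank)):
        for idx, (value, experts) in enumerate(zip(tokens, routes)):
            for expert_id in experts:
                dest = expert_id // experts_per_rank  # contiguous expert placement
                inbox[dest].append((src, idx, expert_id, value))
    return inbox


def combine(inbox, tokens_per_rank, expert_fn):
    """Run the owning rank's expert on each received token and sum results at the source."""
    out = [[0.0] * len(tokens) for tokens in tokens_per_rank]
    for items in inbox.values():
        for src, idx, expert_id, value in items:
            out[src][idx] += expert_fn(expert_id, value)
    return out


# Toy setup: 2 ranks, 2 experts per rank, top-2 routing, scalar "tokens".
tokens = [[1.0, 2.0], [3.0]]
routes = [[[0, 3], [1, 2]], [[0, 1]]]
inbox = dispatch(tokens, routes, experts_per_rank=2)
outputs = combine(inbox, tokens, expert_fn=lambda e, v: (e + 1) * v)
```

Each rank only ever evaluates the experts it owns, which is why tokens move between GPUs at every MoE layer instead of expert weights.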
Prefill/decode (P/D) disaggregation: Some model↔env pairs reach a 4:1 prefill:decode token ratio. Workers that handle both phases would increase end-to-end latency. That reduces the benefits of PipelineRL. P/D disaggregation separates prefill workers from decode workers. Long tool outputs then no longer stall the decode workers.
KV cache management: High concurrency requires large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM's native offloading creates a separate pool per worker. Mooncake Store instead pools RAM and disk across all nodes in the cluster.
Request routing: prime-rl ships a vllm-router fork by default. It also supports the NVIDIA Dynamo router as a gateway. The routers score workers by KV cache overlap, queue depth, and live load.
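As a rough illustration of scoring on those three signals, the sketch below ranks workers with a weighted sum. The `Worker` fields, the linear form, and the weights are all assumptions made for illustration; they are not the scoring function of vllm-router or the Dynamo router.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int  # request tokens already present in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    active_load: float         # 0.0 (idle) to 1.0 (saturated)


def score(worker, request_tokens, w_overlap=1.0, w_queue=0.1, w_load=0.5):
    """Higher is better: reward cache overlap, penalize queueing and load."""
    overlap = worker.cached_prefix_tokens / max(request_tokens, 1)
    return w_overlap * overlap - w_queue * worker.queue_depth - w_load * worker.active_load


def pick_worker(workers, request_tokens):
    """Route the request to the best-scoring worker."""
    return max(workers, key=lambda w: score(w, request_tokens))
```

A worker holding most of a request's prefix wins unless its queue or load is high enough to outweigh the saved prefill, which is the trade-off a cache-aware router has to make.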
Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures the routing decisions made during inference and replays them directly in the trainer. This reduces the KL mismatch by almost an order of magnitude. The routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Configured PyTorch functions handle the processing.
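A back-of-the-envelope estimate shows why this payload gets large. The figures below are assumptions for illustration: 78 layers (the layer count quoted later in this article, treating every layer as MoE), top_k of 8, 2 bytes per expert index, and 256 rollouts that all reach the full 131,072 tokens. The actual model configuration and index width may differ.

```python
def routing_payload_bytes(num_layers, top_k, seq_len, bytes_per_index=2):
    """Size of one rollout's routed-expert tensor of shape [num_layers, top_k, seq_len]."""
    return num_layers * top_k * seq_len * bytes_per_index


# Assumed values, see the note above.
per_rollout = routing_payload_bytes(num_layers=78, top_k=8, seq_len=131_072)
per_step = per_rollout * 256  # upper bound: every rollout at full length

print(f"per rollout: {per_rollout / 1e6:.0f} MB, per step: {per_step / 1e9:.1f} GB")
```

Under these assumptions one batch is already in the tens of GB, so wider index types or several steps of rollouts in flight push the total considerably higher.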
Training optimization
The trainer builds on torchtitan, PyTorch’s native training framework. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.
| Strategy | What it shards | Main use | Key details |
|---|---|---|---|
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory reduction | Gathers weights on demand, layer by layer, with fully_shard |
| Expert Parallelism (EP) | Experts within each MoE layer | Reduces per-layer working memory | all2all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |
EP exists because layers are still too large after FSDP. With 78 layers and 800B parameters in float32, fully gathering one layer requires about 40GB. Prefetching a second layer to overlap communication with compute pushes that closer to 80GB. Setting EP=8 sends tokens to the experts instead of gathering full expert weights. The torch-native all2all is slightly faster within a single node. DeepEP wins when EP spans multiple nodes.
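The 40GB and 80GB figures follow from simple arithmetic. The sketch below assumes parameters are spread evenly across layers and ignores embeddings and other non-layer weights; the function name is hypothetical.

```python
def gathered_layer_gb(total_params, num_layers, bytes_per_param, layers_resident=1):
    """Memory needed to hold fully gathered (unsharded) layers, in GB (1e9 bytes)."""
    params_per_layer = total_params / num_layers
    return params_per_layer * bytes_per_param * layers_resident / 1e9


one_layer = gathered_layer_gb(800e9, 78, bytes_per_param=4)
with_prefetch = gathered_layer_gb(800e9, 78, bytes_per_param=4, layers_resident=2)

print(f"one layer: {one_layer:.0f} GB, with one prefetched layer: {with_prefetch:.0f} GB")
```

Roughly 10 billion parameters per layer at 4 bytes each gives about 41GB, and keeping a second gathered layer resident doubles that.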
CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA, which is not directly compatible with Ulysses or Ring Attention. So prime-rl ships its own custom implementation.
FP8 training. prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, due to quantization overhead. Its real value is matching the trainer's precision to that of inference. That reduces the KL mismatch and stabilizes training.
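To make "block-scaled" concrete, the sketch below quantizes a 1-D list in blocks of 128 values, each with its own scale, in the spirit of the 1x128 activation tiles described for DeepSeek V3. It is a pure-Python emulation written for illustration, not DeepGEMM or prime-rl code: `round_to_e4m3` only keeps 4 significant bits and clamps to ±448, and it ignores subnormals and the exponent range of real FP8.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3
BLOCK = 128       # values sharing one scale factor


def round_to_e4m3(x: float) -> float:
    """Crude E4M3 emulation: clamp to +-448, keep 1 implicit + 3 explicit mantissa bits."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)    # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    mantissa = round(mantissa * 16) / 16  # 4 significant bits
    return math.ldexp(mantissa, exponent)


def quantize_blockwise(values):
    """Return (quantized values, per-block scales); each block's max maps to 448."""
    quantized, scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales):
    """Multiply each value by the scale of the block it belongs to."""
    return [q * scales[i // BLOCK] for i, q in enumerate(quantized)]
```

Because each block carries its own scale, a block of small values and a block of large values both use the full FP8 range, which is the property that keeps quantization error bounded per block.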
Use cases with examples
- Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. The P/D split keeps decode latency predictable here.
- 1T-scale post-training on few nodes: The GLM-5 run fits on 28 H200 nodes. Wide EP and KV offloading increase concurrency and throughput.
- Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means more stable training.



