Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model with 3B Active Parameters for Agentic Coding

pleasuremandarya@gmail.com 11/06/2026


This week, the Cohere AI team released its first developer-facing coding model, called ‘North Mini Code’. North Mini Code is open-weight and aimed at software developers. It is a mixture-of-experts (MoE) model with 30B total parameters. Only 3B of those parameters are active per token.

The release is positioned around “autonomous” AI. The idea is simple: run capable models on your own terms. Small, efficient coding models let teams run them independently, without large GPU clusters. North Mini Code targets that gap directly.

North Mini Code

North Mini Code is a 30B-A3B parameter model. A3B stands for three billion active parameters per forward pass. Cohere has optimized it for three jobs: code generation, agentic software engineering, and terminal tasks. The model is text-in, text-out. It takes no image or video input.

The context window is 256K tokens. The maximum output length is 64K tokens. Cohere lists a minimum hardware bar of a single H100 at FP8. The weights are released under Apache 2.0 on Hugging Face. You can also access the model through the Cohere API, Model Vault, and OpenRouter.

Field	North-Mini-Code-1.0
License	Apache 2.0
Model size	30B total; 3B active
Context length	256K total; 64K maximum generation
Optimized for	Code generation, agentic software engineering, terminal tasks
Availability	Hugging Face, Cohere API, Cohere Model Vault, OpenRouter
Hardware (minimum)	1× H100 @ FP8
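
The single-H100 minimum above can be sanity-checked with rough arithmetic. The sketch below estimates weight memory only. It assumes 1 byte per parameter at FP8, 2 bytes at BF16, and an 80 GB H100; it ignores the KV cache, activations, and runtime overhead, so real requirements will be higher.

```python
# Rough weight-memory estimate. Assumptions (not from Cohere): 1 byte per
# parameter at FP8, 2 bytes at BF16, an 80 GB H100. KV cache is ignored.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}

def weight_gb(total_params: float, dtype: str) -> float:
    """Return approximate weight size in GB (10^9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9

TOTAL_PARAMS = 30e9   # 30B total parameters
ACTIVE_PARAMS = 3e9   # 3B active per token
H100_GB = 80          # assumed memory of one H100

fp8_gb = weight_gb(TOTAL_PARAMS, "fp8")
bf16_gb = weight_gb(TOTAL_PARAMS, "bf16")
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP8 weights:  ~{fp8_gb:.0f} GB (fits one H100: {fp8_gb < H100_GB})")
print(f"BF16 weights: ~{bf16_gb:.0f} GB (fits one H100: {bf16_gb < H100_GB})")
print(f"Active share per token: {active_fraction:.0%}")
```

Note that all 30B parameters must be resident in memory even though only 3B are used for any one token; the sparsity saves compute, not weight storage.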

Architecture

North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention interleaves two types, sliding-window and global, in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embedding at all. The feed-forward block holds 128 experts. Eight experts are active per token. Each expert is an FFN using SwiGLU.
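
One way to picture the 3:1 attention mix is as a repeating layer schedule. The sketch below assumes three sliding-window layers followed by one global layer, and a hypothetical depth of 8 layers; the article gives only the ratio, not the ordering or the layer count.

```python
# Illustrative layer schedule for a 3:1 sliding-window-to-global mix.
# Assumptions: the ordering (3 sliding, then 1 global) and num_layers are
# hypothetical; the article only states the ratio.
def attention_schedule(num_layers: int, sliding_per_global: int = 3) -> list[dict]:
    period = sliding_per_global + 1
    schedule = []
    for i in range(num_layers):
        is_global = (i % period) == period - 1
        schedule.append({
            "layer": i,
            "attention": "global" if is_global else "sliding_window",
            # Per the article: RoPE on sliding-window layers, no positional
            # embedding on global layers.
            "positional": None if is_global else "rope",
        })
    return schedule

for entry in attention_schedule(num_layers=8):
    print(entry)
```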

The router applies a sigmoid before top-k selection. One dense layer sits before the sparse layers. That mix keeps per-token compute small while expanding total capacity. Cohere released the weights in BF16.
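
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Cohere's implementation: the logits are made up, and renormalizing the selected scores so they sum to 1 is an assumption, since the article does not say how gate weights are scaled.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def route_token(logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token from per-expert router logits.

    Scores are sigmoid(logit), computed independently per expert, then the k
    highest are kept. Renormalizing the kept scores to sum to 1 is an
    assumption made for this sketch.
    """
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]

# 128 made-up logits standing in for a learned router's output.
logits = [math.sin(i * 0.37) * 3.0 for i in range(128)]
chosen = route_token(logits, k=8)
print([i for i, _ in chosen])
print(f"experts used per token: {len(chosen) / len(logits):.2%}")
```

Unlike a softmax router, the sigmoid scores each expert independently, so raising one expert's logit does not lower the others' scores before selection.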

Post-training ran in two phases. Two stages of cascaded supervised fine-tuning (SFT) came first. Then came reinforcement learning with verifiable rewards (RLVR). Post-training focuses on agentic coding. The model also supports interleaved reasoning and native tool use.

Benchmarks

Cohere reports a score of 33.4 on the Artificial Analysis Coding Index. It describes this as competitive among models of similar size. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2. It also used Terminal-Bench Hard, SciCode, and LiveCodeBench v6.

The methodology is straightforward. SWE-Bench used the SWE-agent v1.1.0 harness. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark was run with three seeds and the results were averaged. Sampling used temperature 1.0 and top_p 0.95.
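
The three-seed averaging described above amounts to the following. The scores in this sketch are placeholders chosen for illustration and are not Cohere's results; reporting the standard deviation alongside the mean is an addition here, not something the article says Cohere did.

```python
from statistics import mean, stdev

def summarize_seeds(scores_by_seed: dict[int, float]) -> dict[str, float]:
    """Average one benchmark's scores across seeds and report the spread."""
    values = list(scores_by_seed.values())
    return {
        "mean": mean(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
        "n_seeds": len(values),
    }

# Placeholder scores for illustration only; these are NOT Cohere's results.
example = {0: 50.0, 1: 52.0, 2: 54.0}
summary = summarize_seeds(example)
print(summary)
```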

Speed

In Cohere’s internal tests against Devstral Small 2, North Mini Code achieved up to 2.8x higher throughput. That figure holds at the same concurrency and on the same hardware. It also showed a 30% edge in inter-token latency. Time to first token (TTFT) was close between the two. Devstral Small 2 retained a small TTFT lead.

Metric	North Mini Code vs Devstral Small 2
Throughput	Up to 2.8x higher (same concurrency and hardware)
Inter-token latency	30% better for North Mini Code
Time to first token	Slightly behind Devstral Small 2
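
For readers who want to reproduce this kind of comparison, the three metrics in the table can be computed from token arrival timestamps. The sketch below uses common definitions for a single stream; the article does not describe Cohere's exact measurement setup, and its throughput figure is measured under concurrency rather than per stream. The timestamps are synthetic.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean inter-token latency, and throughput for one stream.

    request_start and token_times are timestamps in seconds; token_times[i]
    is when the i-th output token arrived.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": sum(gaps) / len(gaps),
        "tokens_per_s": len(token_times) / total,
    }

# Synthetic timestamps for illustration; not measurements of either model.
stamps = [0.5 + 0.02 * i for i in range(101)]
metrics = latency_metrics(0.0, stamps)
print(metrics)
```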

Use Cases with examples

Cohere built North Mini Code for agentic workflows.

Three patterns stand out in its positioning:

Sub-agent orchestration: A lead agent delegates subtasks to helper agents. For example: one agent writes unit tests while another fixes the code that fails them.
System architecture mapping: The model reads a repository and maps its structure. For example: tracing how services call each other before a major refactor.
Code review: The model reviews a diff for problems. For example: flagging an unhandled null dereference before the change is merged.
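
The sub-agent pattern in the list above can be sketched as a lead function that fans work out to helpers. Everything here is hypothetical scaffolding: the helper names are invented, and each helper is a plain function standing in for a model call so the example runs without any API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical helper agents. In practice each would call a model; here they
# are plain functions so the sketch runs without any API.
def write_tests(task: str) -> str:
    return f"[tests] drafted unit tests for: {task}"

def fix_code(task: str) -> str:
    return f"[fix] proposed patch for: {task}"

HELPERS: dict[str, Callable[[str], str]] = {"tests": write_tests, "fix": fix_code}

def orchestrate(subtasks: list[tuple[str, str]]) -> list[str]:
    """Lead agent: send each (helper_name, task) to its helper in parallel
    and collect results in the original order."""
    with ThreadPoolExecutor(max_workers=len(HELPERS)) as pool:
        futures = [pool.submit(HELPERS[name], task) for name, task in subtasks]
        return [f.result() for f in futures]

results = orchestrate([
    ("tests", "parse_config() edge cases"),
    ("fix", "parse_config() crashes on empty file"),
])
for line in results:
    print(line)
```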

Terminal tasks suit the model as well. For example: listing files, running a build, and parsing the output to find errors.
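
A single terminal tool of the kind described above can be as small as the sketch below. The function names are hypothetical, the error filter is deliberately naive, and the "build" is a stand-in Python one-liner that fails on purpose so the example runs anywhere.

```python
import subprocess
import sys

def run_terminal_tool(command: list[str], timeout_s: int = 60) -> dict:
    """Run one command and return what an agent would read back."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}

def find_error_lines(result: dict) -> list[str]:
    """Naive filter: keep output lines that mention 'error' (case-insensitive)."""
    lines = (result["stdout"] + "\n" + result["stderr"]).splitlines()
    return [ln for ln in lines if "error" in ln.lower()]

# Stand-in for a failing build step, so the example runs anywhere Python does.
fake_build = [sys.executable, "-c",
              "import sys; print('compiling...'); "
              "sys.stderr.write('main.c:12: error: missing semicolon\\n'); sys.exit(1)"]
result = run_terminal_tool(fake_build)
print(result["exit_code"], find_error_lines(result))
```

Passing the command as a list rather than a shell string avoids shell injection when the arguments come from model output.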

Getting started

The fastest way to start is Hugging Face Transformers. Install Transformers from source for this model. Recommended sampling is temperature 1.0 and top_p 0.95.

# Install Transformers from source (required for this model):
# pip install "git+
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)

For serving, vLLM is supported. You need vLLM installed from the main branch plus Cohere’s melody library (cohere_melody). Correct response parsing depends on it.

uv pip install "git+
uv pip install "cohere_melody>=0.9.0"

vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
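
Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default address, http://localhost:8000, and it has not been run against this model; the request uses the sampling settings recommended earlier in this section.

```python
import json
import urllib.request

# Assumption: the server started above listens on vLLM's default
# http://localhost:8000 and exposes the OpenAI-compatible chat endpoint.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "CohereLabs/North-Mini-Code-1.0"

def build_payload(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat request using the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }

def chat(prompt: str) -> str:
    """Send the request and return the assistant's text. Needs a live server."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(json.dumps(build_payload("Write a Python palindrome checker."), indent=2))
```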

Quantized builds are available for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading it. Cohere offers free access through OpenCode and a hosted Hugging Face Space.

Key Takeaways

Cohere’s first coding model, North Mini Code, is a 30B mixture-of-experts model that activates 3B parameters per token.
It runs on a single H100 at FP8, with 256K context and 64K maximum output.
Weights are shipped under Apache 2.0, although the Hugging Face card adds a non-commercial note.
Cohere reports a score of 33.4 on the Artificial Analysis Coding Index, and up to 2.8x the throughput of Devstral Small 2.
It is designed for agentic coding: sub-agent orchestration, architecture mapping, and code review, with native tool use.

Marktechpost’s Interactive Explainer

Cohere · Open-Weight Coding Model

North Mini Code

Cohere’s first developer coding model: a 30B mixture-of-experts model that activates only 3B parameters per token, designed for agentic software engineering and terminal tasks.

30B total parameters
3B active / token
256K context
64K maximum output
1× H100 @ FP8

The model at a glance

Open weights, released on June 9, 2026. Text in, text out.

Size

30B total / 3B active

Architecture

Sparse MoE (decoder-only)

Minimum hardware

1× H100 @ FP8

License

Apache 2.0 (see note)

Context window

The exact limits are 256K tokens of context and 64K tokens of maximum generation.

Optimized for

Code generation
Agentic software engineering
Terminal tasks

Agentic use cases

Sub-agent orchestration
System architecture mapping
Code review

License note: Cohere’s blog says Apache 2.0. The Hugging Face card adds an acceptable use supplement and a non-commercial note. Check both before you use the model.

Forward pass

The MoE block is where the sparsity occurs.

Input tokens

The text is tokenized and fed to the decoder-only model. The model is text in, text out.

The router

Each MoE block has 128 experts. The router selects 8 of them for each token.

8 / 128 experts

6.25% of the experts are used for each token, so per-token compute stays small.

Reported performance

Figures are from Cohere. Independent verification is still important.

Artificial Analysis Coding Index: 33.4

Throughput vs Devstral Small 2: up to 2.8×

Inter-token latency: 30% better

Throughput (higher is better)

North Mini Code: up to 2.8×

Devstral Small 2: 1.0× (baseline)

Time-to-first-token was closely matched, with Devstral Small 2 holding a slight edge.

Benchmarks: SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, Terminal-Bench Hard, SciCode, LiveCodeBench v6. Harnesses: SWE-agent v1.1.0 (SWE-Bench), ReAct harness with one terminal tool (Terminal-Bench v2), Terminus-2 (Terminal-Bench Hard). Each benchmark used 3 seeds, averaged, at temperature 1.0 and top_p 0.95.

Quick start

Hugging Face Transformers, installed from source. Recommended sampling: temperature 1.0, top_p 0.95.

# Install Transformers from source, then:
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "CohereLabs/North-Mini-Code-1.0"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python palindrome checker."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024,
                     do_sample=True, temperature=1.0, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:]))

Serve with vLLM (+ cohere_melody)
Try it free in OpenCode
Native tool use + interleaved reasoning

Quantized: Ollama, LM Studio, llama.cpp
Also in Cohere API, Model Vault, OpenRouter
