AI Sparks

Liquid AI Releases LFM2.5-8B-A1B: In-Device MoE Model With 8.3B Value and 1.5B Functional Parameters

Liquid AI just shipped LFM2.5-8B-A1B. It’s a Mixture-of-Experts (MoE) device model built for toolkits. The model holds a total of 8.3B parameters but activates only 1.5B per token. That sparsity is what allows it to run on consumer hardware.

The release follows the LFM2-8B-A1B, which the Liquid AI team previously published. LFM2.5 is a new family of hybrid models for device implementation. This version adds a 128K context window, reasoning, and increased training.

What is LFM2.5-8B-A1B

The model uses a small MoE design. It unlocks 1.5B of a total of 8.3B parameters per forward pass. That keeps each token generated cheap to count.

The building has 24 floors. Eighteen are LIV blocks with two gates; six are GQA layers. It includes MoE, GQA, and short gated blocks. The length of the context is 131,072 tokens. The model includes nine languages, including Arabic, Chinese, and Japanese.

The Liquid AI team recommends a temperature of 0.2, a top_k of 80, and a penalty_repetition of 1.05.

Unlike its predecessor, the LFM2.5-8B-A1B is a concept-only model. It presents a clear train of thought before its final answer. The Liquid AI team chose this because MoE models work in computationally constrained settings. The small number of active parameters makes each logic token cost-effective.

What Has Changed Since the LFM2-8B-A1B

Liquid increased the context window from 32,768 tokens to 128,000. Pre-training is rated from 12T to 38T tokens. The vocabulary has doubled from 65,536 tokens to 128,000.

A larger vocabulary encodes non-Latin texts very well. The Liquid AI team reports the strongest compression gains in Hindi, Thai, Vietnamese, Indonesian, and Arabic. The rest of the architecture remains the same as the LFM2-8B-A1B.

How Liquid AI Trained

The Liquid AI team stretched the token in place instead of retraining it from scratch. Continued BPE compilation training since the original compilation on the multilingual corpus. New embedding lines begin as a description of the decomposition of their smaller tokens. A short two-stage practice then restores quality.

The content expansion came in two phases. The 2T token training section reached 32K, focusing on logic, math, and tool use. Raising the base of RoPE θ, and the stage of 400B tokens, reached 128K.

Two classes of reinforcement studies are directed at known failure modes. A preference optimization section reduces ‘doom loops’ in long-term tracking. It also distributes the maximum number of possibilities to the possible paths. A separate award for RL formatting discourages looping restart words like ‘Wait…’. Another RL class uses a reward based on avg@k to cut out missing views. The goal is to avoid questions beyond reliable information.

The Benchmark Case

The LFM2.5-8B-A1B improves on its predecessor across the board. The AA-Omniscience Non-Hallucination Rate jumped from 7.46 to 63.47. FEval increased from 79.44 to 91.84. MATH500 increased from 74.80 to 88.76. Tau² Telecom rose from 13.60 to 88.07.

The Liquid AI team compared the model with other dense and MoE methods. In the next command, match Gemma-4-26B-A4B-IT to IFEval. Do so in the active parameter calculation section. In Tau² Telecom, it gets a score of 88.07, ahead of the biggest models.

The avg@k award conveys the lowest rate of hallucinations. Accuracy remains reasonable with model size. In agent benchmarks, it remains competitive with larger models.

Benchmark LFM2-8B-A1B LFM2.5-8B-A1B Δ
AA-Omniscience Non-Hallucination Rate 7.46 63.47 +56.01
FEval 79.44 91.84 +12.40
MATH500 74.80 88.76 +13.96
Tau² Telecom 13.60 88.07 +74,47

The model runs with one-day support across the entire inference ecosystem. Frameworks include llama.cpp, MLX, vLLM, and SGlang. ONNX and Liquid’s LEAP edge platform are also supported.

On the CPU, it issues 253 tokens/s on the M5 Max. It reaches 146 tokens/s on the Ryzen AI Max+ 395. It stays under 6 GB of memory all around. On the phone, it handles about 30 tokens/s.

On one NVIDIA H100 SXM5, the throughput hits 18.5K tokens per second. That’s over 1.6B tokens per day at high compatibility.

Using the tool, LFM2.5 writes Pythonic calls automatically. They appear between <|tool_call_start|> again <|tool_call_end|> special tokens. You can output this to JSON in the system notification.

Power and What to Watch

Power:

  • Enables only 1.5B parameters, keeping guesswork cheap on edge hardware
  • Following the competitive instructions and agency scores for its size class
  • 128K content window and nine language input
  • Open weight under the LFM1.0 license, with basic and post-training checkpoints

A Must Watch:

  • The amount of information is limited from the calculation of the smallest effective parameter
  • Not suitable for heavy programming or QA that requires information without retrieval
  • Thought output only adds thought tokens to every turn
  • Text only; this variant has no vision or audio input

Marktechpost Visual Explainer

Device Model Guide

LFM2.5-8B-A1B

Mixture-of-Experts on-device Liquid AI model, built to beat complex tools and instructions following consumer hardware.

8.3B absolute parameters
1.5B active
128K context
only thinking
open weight

What It Is

A small MoE that activates 1.5B of 8.3B parameters per token

  • 24 layers – LIV blocks have 18 dual gates and 6 GQA layers.
  • It includes The MoE, GQAagain short convolution gate blocks.
  • The length of the thread 131,072 tokens; covers 9 languages.
  • Reasons only: presents a clear train of thought before responding.
  • Recommended parameters: temperature 0.2, up_k80, repetition_penalty 1.05.

What Has Changed Since the LFM2-8B-A1B

Greater context, more training, wider vocabulary

Content window

32,768 → 128,000

Processes long documents and long-term reasons.

Training tokens

12T → 38T

Combined pre-training and great RL.

Vocabulary size

65,536 → 128,000

Tokenize non-Latin texts very well.

The best benefits of compression

5 languages

Hindi, Thai, Vietnamese, Indonesian, Arabic.

How He Was Trained

Token extension, staged content growth, targeted RL

  • Tokenizer: expanded locally, with continued BPE training including a multilingual corpus.
  • Context: 2T-token midtraining phase to 32K, then base RoPE θ and 400B tokens to 128K.
  • Loops of Doom: better doing preferences redistributes the abundance of opportunities in meaningful ways.
  • A separate award for RL formatting discourages looping restart words such as “Wait…”.
  • Hallucinations: RL reward based on avg@k encourages self-discipline more than reliable information.

Benchmarks vs LFM2-8B-A1B

The biggest benefits are not hallucinating and using tools

Benchmark LFM2 LFM2.5 Δ
AA-Omniscience Non-Hallucination Rate 7.46 63.47 +56.01
FEval 79.44 91.84 +12.40
MATH500 74.80 88.76 +13.96
Tau² Telecom 13.60 88.07 +74,47

In IFEval it is similar to Gemma-4-26B-A4B-IT in the active parameter calculation part.

Inference Performance

Faster with CPU and GPU, with one day frame support

CPU code

253 tokens/s

M5 Max, memory less than 6 GB. 146 tok/s on Ryzen AI Max+ 395.

On the phone

~30 tok/s

It works locally and privately on the device.

GPU transfer

18.5K tok/s

High consensus, >1.6B tokens/day on one H100.

One day support

llama.cpp, MLX, vLLM, SGlang.

Also ONNX and Liquid’s LEAP.

Use of Tools and Agents

Pythonic function calls, ideal for on-device agents

  • By default, you write Pythonic function calls in the middle <|tool_call_start|> again <|tool_call_end|> tokens.
  • You can extract this from JSON work calls to system notification.
  • I LocalCowork the demo is running 67 tools on the other side 13 MCP servers.
  • It runs on a single laptop — no clouds, no API keys, no data leaving the machine.

Run it

Serve in two rows, or load straight

# Serve with vLLM (OpenAI-compatible API)
pip install vllm
vllm serve "LiquidAI/LFM2.5-8B-A1B"

# Or load directly with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "LiquidAI/LFM2.5-8B-A1B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

Recommended

Agent workflow
Use of the tool
Scheduled output
Multilingual assistants
Assistants on the device

It’s not worth it

It’s a tough program
QA that requires information without retrieval

Key Takeaways

  • Liquid AI’s LFM2.5-8B-A1B holds a total of 8.3B parameters but activates only 1.5B per token.
  • It’s just a concept, with a 128K core window and nine language inputs.
  • The Non-Hallucination Rate jumped from 7.46 to 63.47 over LFM2-8B-A1B; FEval reached 91.84.
  • It determines 253 tok/s on the M5 Max under 6 GB, and ~30 tok/s on the phone.
  • Day-one support includes llama.cpp, MLX, vLLM, and SGlang, with open source and post-trained weights.

Check it out Model weights again Technical details. Also, feel free to follow us Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? contact us


Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button