URL: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B

LFM2.5-8B-A1B

⚠️ Important: The tokenizer was updated after the original release to fix tool-calling issues in llama.cpp. If you downloaded LFM2.5-8B-A1B before commit feb5e04, please re-download the tokenizer files. The GGUF files have also been re-converted with the updated tokenizer.

LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.

  • On-device personal assistant: Designed to power real-life applications by chaining tool calls and following complex instructions on all devices.
  • Compressed performance: Competitive with much larger dense and MoE models on instruction following and agentic tasks.
  • Unmatched throughput: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Find more information about LFM2.5-8B-A1B in our blog post.

[Figure: AA-Omniscience Index results]

*AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.

🗒️ Model Details

| Model | Parameters | Description |
| --- | --- | --- |
| LFM2.5-8B-A1B-Base | 8.3B total / 1.5B active | Pre-trained base model for fine-tuning |
| LFM2.5-8B-A1B | 8.3B total / 1.5B active | Reasoning-tuned general-purpose model |

LFM2.5-8B-A1B is a general-purpose text-only model with the following features:

  • Total parameters: 8.3B
  • Active parameters: 1.5B
  • Number of layers: 24 (18 double-gated LIV conv + 6 GQA)
  • Training budget: 38 trillion tokens
  • Context length: 128,000
  • Vocabulary size: 128,000
  • Languages: English, Arabic, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish
  • Generation parameters: We recommend the following parameters:
    • temperature: 0.2
    • top_k: 80
    • repetition_penalty: 1.05
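
To make the three recommended settings concrete, the sketch below applies them to a toy logit vector in plain Python. It follows the convention commonly used by sampling libraries (repetition penalty, then temperature, then top-k, then softmax). The function name `sample_next_token` and the toy inputs are illustrative assumptions, not part of any LFM2.5 API; real inference engines perform the same steps on tensors.

```python
import math
import random


def sample_next_token(logits, previous_tokens, temperature=0.2, top_k=80,
                      repetition_penalty=1.05, rng=random):
    """Pick one token id from raw logits (illustrative sketch only)."""
    scores = list(logits)

    # Repetition penalty: lower the score of every token already generated.
    for tok in set(previous_tokens):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty

    # Temperature (must be > 0): values below 1 sharpen the distribution.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring token ids.
    k = min(top_k, len(scores))
    kept = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

    # Softmax over the kept ids (subtract the max for numerical stability).
    top = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - top) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy vocabulary of three tokens; token 1 has already been generated once.
next_id = sample_next_token([0.0, 5.0, 4.9], previous_tokens=[1], top_k=2)
```

With a temperature as low as 0.2, the highest-scoring candidates dominate, so sampling stays close to greedy decoding while still allowing some variation.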

| Model | Description |
| --- | --- |
| LFM2.5-8B-A1B | Original model checkpoint in native format. Best for fine-tuning or inference with Transformers, vLLM, and SGLang. |
| LFM2.5-8B-A1B-GGUF | Quantized format for llama.cpp and compatible tools. Optimized for edge inference and local deployment. |
| LFM2.5-8B-A1B-ONNX | ONNX Runtime format for cross-platform deployment. |
| LFM2.5-8B-A1B-MLX | MLX format for Apple Silicon. Optimized for fast inference on Mac devices. |

We recommend using LFM2.5-8B-A1B for agentic workflows, tool use, structured outputs, multilingual assistants, and on-device personal-assistant applications. It is not the best fit for heavy programming or knowledge-intensive question answering without retrieval.

Chat Template

LFM2.5 uses a ChatML-like format. See the Chat Template documentation for details. Example:

<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant

Because LFM2.5-8B-A1B is a reasoning model, assistant turns contain an explicit chain of thought before the final answer. You can use tokenizer.apply_chat_template() to format your messages automatically.
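
For readers who want to see how the layout above is assembled, here is a minimal sketch that rebuilds the example prompt by hand. The helper `render_chatml` is hypothetical and only mirrors the text shown in this section; the template shipped with the tokenizer is authoritative and may handle tools and reasoning content differently, so prefer tokenizer.apply_chat_template() in real code.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the prompt layout shown above (illustrative only)."""
    parts = ["<|startoftext|>"]
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What is C. elegans?"},
]
prompt = render_chatml(messages)
print(prompt)
```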

Tool Use

LFM2.5 supports function calling in four steps:

  1. Function definition: Provide the list of tools as a JSON object in the system prompt, or use tokenizer.apply_chat_template() with tools=....
  2. Function call: By default, LFM2.5 writes Pythonic function calls (a Python list between <|tool_call_start|> and <|tool_call_end|> special tokens), as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
  3. Function execution: Execute the call and return the result with the tool role.
  4. Final answer: LFM2.5 interprets the tool output and returns a plain-text answer addressing the original prompt.

See the Tool Use documentation for the full guide. Example:

<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
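
Steps 2 and 3 can be sketched on the client side as follows. This is one possible approach under stated assumptions, not Liquid AI's reference implementation: the helper names `parse_tool_calls` and `run_tool_calls`, the `TOOLS` registry, and the stub `get_candidate_status` are all hypothetical, and only keyword arguments with literal values are handled. The calls are parsed with Python's `ast` module rather than `eval`, so model output is never executed directly.

```python
import ast
import json
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(assistant_text):
    """Return (name, kwargs) pairs for each Pythonic call in the text."""
    calls = []
    for payload in TOOL_CALL_RE.findall(assistant_text):
        tree = ast.parse(payload.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            raise ValueError("expected a Python list of calls")
        for node in tree.body.elts:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected plain function calls")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are supported here")
            # literal_eval accepts AST nodes and only allows literal values.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id):
    """Hypothetical stand-in for a real backend lookup."""
    return {"candidate_id": candidate_id, "status": "Interview Scheduled"}


TOOLS = {"get_candidate_status": get_candidate_status}


def run_tool_calls(assistant_text):
    """Execute the parsed calls and build the message for the tool role."""
    results = [TOOLS[name](**kwargs) for name, kwargs in parse_tool_calls(assistant_text)]
    return {"role": "tool", "content": json.dumps(results)}


assistant_text = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)
tool_message = run_tool_calls(assistant_text)
```

The resulting `tool_message` can be appended to the conversation before asking the model for its final answer (step 4).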

🏃 Inference

LFM2.5-8B-A1B is supported by many inference frameworks. See the Inference documentation for the full list.

| Name | Description | Docs | Notebook |
| --- | --- | --- | --- |
| Transformers | Simple inference with direct access to model internals. | Link | Colab link |
| vLLM | High-throughput production deployments with GPU. | Link | Colab link |
| llama.cpp | Cross-platform inference with CPU offloading. | Link | Colab link |
| MLX | Apple's machine learning framework optimized for Apple Silicon. | Link | - |
| LM Studio | Desktop application for running LLMs locally. | Link | - |

Quick start with Transformers (compatible with transformers>=5.0.0):

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "LiquidAI/LFM2.5-8B-A1B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    # attn_implementation="flash_attention_2",  # uncomment on a compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is C. elegans?"

# Format the conversation with the model's chat template and tokenize it.
# return_dict=True makes the ["input_ids"] lookup below explicit.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
)["input_ids"].to(model.device)

# Sample with the recommended generation parameters.
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.2,
    top_k=80,
    repetition_penalty=1.05,
    max_new_tokens=8192,
    streamer=streamer,
)

🔧 Fine-Tuning

We recommend fine-tuning LFM2.5 for your specific use case to achieve the best results.

| Name | Description | Docs | Notebook |
| --- | --- | --- | --- |
| CPT (Unsloth) | Continued Pre-Training using Unsloth for text completion. | Link | Colab link |
| CPT (Unsloth) | Continued Pre-Training using Unsloth for translation. | Link | Colab link |
| SFT (Unsloth) | Supervised Fine-Tuning with LoRA using Unsloth. | Link | Colab link |
| SFT (TRL) | Supervised Fine-Tuning with LoRA using TRL. | Link | Colab link |
| DPO (TRL) | Direct Preference Optimization with LoRA using TRL. | Link | Colab link |
| GRPO (Unsloth) | GRPO with LoRA using Unsloth. | Link | Colab link |
| GRPO (TRL) | GRPO with LoRA using TRL. | Link | Colab link |

📊 Performance

Improvements over LFM2-8B-A1B

Thanks to reasoning, scaled-up pre-training, and large-scale RL, LFM2.5-8B-A1B improves over its predecessor across the board:

| Benchmark | LFM2-8B-A1B | LFM2.5-8B-A1B | Δ |
| --- | --- | --- | --- |
| AA-Omniscience Index | -78.42 | -24.70 | +53.72 |
| AA-Omniscience Accuracy | 7.33 | 8.67 | +1.34 |
| AA-Omniscience Non-Hallucination Rate | 7.46 | 63.47 | +56.01 |
| IFEval | 79.44 | 91.84 | +12.40 |
| IFBench | 26.00 | 56.47 | +30.47 |
| Multi-IF | 58.54 | 79.93 | +21.39 |
| MATH500 | 74.80 | 88.76 | +13.96 |
| AIME25 | 20.00 | 42.53 | +22.53 |
| BFCLv3 | 45.07 | 64.36 | +19.29 |
| BFCLv4 | 25.52 | 48.50 | +22.98 |
| Tau² Telecom | 13.60 | 88.07 | +74.47 |
| Tau² Retail | 7.02 | 39.82 | +32.80 |

Knowledge and instruction following

| Model | Parameters | AA-Omniscience Index | AA-Omniscience Accuracy | AA-Omniscience Non-Hallucination Rate | IFEval | IFBench | Multi-IF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LFM2.5-8B-A1B | 8B/A1B | -24.70 | 8.67 | 63.47 | 91.84 | 56.47 | 79.93 |
| Granite-4.0-H-Tiny | 7B/A1B | -75.50 | 9.37 | 6.38 | 82.23 | 21.28 | 59.00 |
| Qwen3.5-4B | 4B | -51.53 | 17.20 | 16.99 | 87.80 | 50.38 | 67.43 |
| Qwen3-30B-A3B-Thinking-2507 | 30.5B/3.3B | -51.31 | 18.80 | 13.87 | 90.82 | 51.11 | 79.04 |
| Gemma-4-E2B-IT | 5.1B | -72 | 7.00 | 15.05 | 82.93 | 33.53 | 69.70 |
| Gemma-4-E4B-IT | 8B | -50.67 | 8.10 | 36.06 | 87.74 | 39.48 | 77.58 |
| Gemma-4-26B-A4B-IT | 26B/4B | -62.07 | 14.37 | 10.75 | 91.40 | 47.25 | 82.06 |
| gpt-oss-20b | 21B/3.6B | -49.17 | 14.57 | 24.50 | 86.73 | 58.65 | 76.64 |

Math and agentic workflows

| Model | Parameters | MATH500 | AIME25 | AIME26 | BFCLv3 | BFCLv4 | Tau² Telecom | Tau² Retail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LFM2.5-8B-A1B | 8B/A1B | 88.76 | 42.53 | 50.00 | 64.79 | 49.73 | 88.07 | 39.82 |
| Granite-4.0-H-Tiny | 7B/A1B | 59.20 | 4.93 | 3.33 | 56.89 | 28.52 | 16.67 | 18.42 |
| Qwen3.5-4B | 4B | 80.76 | 54.28 | 58.33 | 71.06 | 54.01 | 87.72 | 71.93 |
| Qwen3-30B-A3B-Thinking-2507 | 30.5B/3.3B | 86.48 | 71.67 | 66.67 | 73.39 | 50.53 | 21.93 | 56.14 |
| Gemma-4-E2B-IT | 5.1B | 64.00 | 26 | 30 | 56.44 | 31.91 | 22.37 | 18.95 |
| Gemma-4-E4B-IT | 8B | 65.00 | 34.33 | 40.67 | 57.31 | 33.92 | 26.75 | 42.11 |

CPU Inference

[Figure: CPU inference performance]

GPU Inference

LFM2.5-8B-A1B is the fastest model in its size class, reaching 18.5K output tokens per second at high concurrency, or about 1.6B tokens per day, on a single H100.
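
The daily figure follows directly from the per-second throughput. The short calculation below assumes the quoted 18.5K tokens per second is sustained for a full 24 hours and treats it as exactly 18,500, although the published number is presumably rounded.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tokens_per_second = 18_500  # "18.5K" from the text, presumably rounded
tokens_per_day = tokens_per_second * SECONDS_PER_DAY

print(f"{tokens_per_day:,} tokens/day (~{tokens_per_day / 1e9:.2f}B)")
# prints: 1,598,400,000 tokens/day (~1.60B)
```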

[Figure: GPU inference performance]

📬 Contact

Citation

@article{liquidAI20268BA1B,
 author = {Liquid AI},
 title = {LFM2.5-8B-A1B: Personal Assistant On Your Laptop},
 journal = {Liquid AI Blog},
 year = {2026},
 note = {www.liquid.ai/blog/lfm2-5-8b-a1b},
}
@article{liquidai2025lfm2,
 title = {LFM2 Technical Report},
 author = {Liquid AI},
 journal = {arXiv preprint arXiv:2511.23404},
 year = {2025}
}