LFM2-24B-A2B

LFM2 is a family of hybrid models designed for on-device deployment. LFM2-24B-A2B is the largest model in the family, scaling the architecture to 24 billion parameters while keeping inference efficient.

Best-in-class efficiency: A 24B MoE model with only 2B active parameters per token, fitting in 32 GB of RAM for deployment on consumer laptops and desktops.
Fast edge inference: 112 tok/s decode on AMD CPU, 293 tok/s on H100. Fits in 32 GB of RAM, with day-one support for llama.cpp, vLLM, and SGLang.
Predictable scaling: Quality improves log-linearly from 350M to 24B total parameters, confirming the LFM2 hybrid architecture scales reliably across nearly two orders of magnitude.
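
As a rough sanity check on the memory and scaling claims above, the sketch below estimates weight storage at two precisions and the span of the parameter range. The bits-per-weight value for the 4-bit case is an assumption (roughly 4.8 bits is a common ballpark for mixed 4-bit quantization, not a figure from this card), and the estimate ignores the KV cache, activations, and runtime overhead.

```python
import math


def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring runtime overhead."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 24e9  # total parameters, from the model card

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)   # unquantized BF16 weights
q4_gb = weight_memory_gb(TOTAL_PARAMS, 4.8)    # ASSUMED average for a 4-bit quant

# Span of the scaling range: 350M to 24B total parameters.
orders_of_magnitude = math.log10(24e9 / 350e6)

print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{q4_gb:.1f} GB")
print(f"Scaling range: {orders_of_magnitude:.2f} orders of magnitude")
```

Under these assumptions, the BF16 weights alone would exceed 32 GB while a 4-bit quantization would leave headroom, which suggests the 32 GB figure refers to a quantized build.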

Find more information about LFM2-24B-A2B in our blog post.

🗒️ Model Details

LFM2-24B-A2B is a general-purpose instruct model (without reasoning traces) with the following features:

Property	LFM2-8B-A1B	LFM2-24B-A2B
Total parameters	8.3B	24B
Active parameters	1.5B	2.3B
Layers	24 (18 conv + 6 attn)	40 (30 conv + 10 attn)
Context length	32,768 tokens	32,768 tokens
Vocabulary size	65,536	65,536
Training precision	Mixed BF16/FP8	Mixed BF16/FP8
Training budget	12 trillion tokens	17 trillion tokens
License	LFM Open License v1.0	LFM Open License v1.0

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese

Generation parameters:

temperature: 0.1
top_k: 50
repetition_penalty: 1.05
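
To give an intuition for what these three parameters do, here is a minimal pure-Python sketch of how a sampler might transform raw logits before drawing a token. It follows a common convention (repetition penalty, then temperature, then top-k); the function name `adjust_logits` and the toy vocabulary are hypothetical, and real inference frameworks may differ in ordering and details.

```python
import math


def adjust_logits(logits, generated_ids, temperature=0.1, top_k=50,
                  repetition_penalty=1.05):
    """Return sampling probabilities after penalty, temperature, and top-k."""
    scores = list(logits)

    # Repetition penalty: make already-generated tokens slightly less likely.
    for tok in set(generated_ids):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring tokens (ties at the cutoff survive).
    k = min(top_k, len(scores))
    cutoff = sorted(scores, reverse=True)[k - 1]
    scores = [s if s >= cutoff else -math.inf for s in scores]

    # Softmax over the surviving tokens.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.9, 0.5, -1.0]  # hypothetical 4-token vocabulary
probs = adjust_logits(toy_logits, generated_ids=[0], top_k=2)
print([round(p, 3) for p in probs])
```

With a temperature as low as 0.1, small logit differences are amplified tenfold, so sampling stays close to greedy decoding while the mild repetition penalty discourages loops.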

We recommend the following use cases:

Agentic tool use: Native function calling, web search, structured outputs. Ideal as the fast inner-loop model in multi-step agent pipelines.
Offline document summarization and Q&A: Run entirely on consumer hardware for privacy-sensitive workflows (legal, medical, corporate).
Privacy-preserving customer support agent: Deployed on-premise at a company, handles multi-turn support conversations with tool access (database lookups, ticket creation) without data leaving the network.
Local RAG pipelines: Serve as the generation backbone in retrieval-augmented setups on a single machine without GPU servers.

We don't recommend using it for coding, as it wasn't optimized for this purpose.

Chat Template

LFM2-24B-A2B uses a ChatML-like format. See the Chat Template documentation for details. Example:

<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant

You can use tokenizer.apply_chat_template() to format your messages automatically.
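
For illustration only, the layout shown in the example can be reproduced by hand with a few lines of Python. This is a sketch of the visible format, not the tokenizer's actual template: `render_chat` is a hypothetical helper, and the real template may differ in whitespace and also handles tools, so prefer tokenizer.apply_chat_template() in practice.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages in the ChatML-like layout shown in the example."""
    parts = ["<|startoftext|>"]
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What is C. elegans?"},
]
prompt = render_chat(messages)
print(prompt)
```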

Tool Use

LFM2-24B-A2B supports function calling as follows:

Function definition: We recommend providing the list of tools as a JSON object in the system prompt. You can also use the tokenizer.apply_chat_template() function with tools.
Function call: By default, LFM2-24B-A2B writes Pythonic function calls (a Python list between the <|tool_call_start|> and <|tool_call_end|> special tokens) as the assistant answer. You can override this behavior by asking the model, in the system prompt, to output JSON function calls.
Function execution: The function call is executed, and the result is returned in a message with the "tool" role.
Final answer: LFM2-24B-A2B interprets the outcome of the function call to address the original user prompt in plain text.

See the Tool Use documentation for the full guide. Example:

<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
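
The function-execution step is left to the calling application. One way to handle the default Pythonic format is sketched below. It assumes calls use keyword arguments with literal values, as in the example; `parse_tool_calls`, the `TOOLS` registry, and the canned `get_candidate_status` implementation are hypothetical names introduced for illustration.

```python
import ast
import json
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(assistant_text):
    """Extract (name, kwargs) pairs from Pythonic tool calls in a reply."""
    calls = []
    for block in TOOL_CALL_RE.findall(assistant_text):
        tree = ast.parse(block.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            raise ValueError("expected a Python list of calls")
        for node in tree.body.elts:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected a plain function call")
            # literal_eval accepts only literals, so model output is never executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id):
    """Hypothetical stand-in for a real lookup; returns canned data."""
    return {"candidate_id": candidate_id, "status": "Interview Scheduled"}


TOOLS = {"get_candidate_status": get_candidate_status}  # allow-list of callables

reply = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)

calls = parse_tool_calls(reply)
results = [TOOLS[name](**kwargs) for name, kwargs in calls]
tool_message = {"role": "tool", "content": json.dumps(results)}
print(tool_message)
```

Parsing with `ast` rather than `eval` keeps the model's output as data, and dispatching through an explicit registry means only functions you have listed can be invoked. The resulting `tool_message` can be appended to the conversation before asking the model for its final answer.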

🏃 Inference

LFM2-24B-A2B is supported by many inference frameworks. See the Inference documentation for the full list.

Name	Description	Docs	Notebook
Transformers	Simple inference with direct access to model internals.	Link	Colab link
vLLM	High-throughput production deployments with GPU.	Link	Colab link
llama.cpp	Cross-platform inference with CPU offloading.	Link	Colab link
MLX	Apple's machine learning framework optimized for Apple Silicon.	Link	—
LM Studio	Desktop application for running LLMs locally.	Link	—

Here's a quick start example with Transformers (compatible with transformers>=5.0.0):

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "LiquidAI/LFM2-24B-A2B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    # attn_implementation="flash_attention_2",  # uncomment on a compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Stream decoded text to stdout as it is generated, without echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is C. elegans?"

# Request a dict explicitly so input_ids and attention_mask are both passed on.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

# Sampling settings match the recommended generation parameters.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
    streamer=streamer,
)

🔧 Fine-Tuning

Name	Description	Docs	Notebook
CPT (Unsloth)	Continued Pre-Training using Unsloth for text completion.	Link	Colab link
CPT (Unsloth)	Continued Pre-Training using Unsloth for translation.	Link	Colab link
SFT (Unsloth)	Supervised Fine-Tuning with LoRA using Unsloth.	Link	Colab link
SFT (TRL)	Supervised Fine-Tuning with LoRA using TRL.	Link	Colab link
DPO (TRL)	Direct Preference Optimization with LoRA using TRL.	Link	Colab link
GRPO (Unsloth)	GRPO with LoRA using Unsloth.	Link	Colab link
GRPO (TRL)	GRPO with LoRA using TRL.	Link	Colab link

📊 Performance

CPU Inference

We compared LFM2-24B-A2B against two popular MoE models of similar size: Qwen3-30B-A3B-Instruct-2507 (30.5B total, 3.3B active parameters) and gpt-oss-20b (21B total, 3.6B active parameters). We measured both prefill and decode throughput with Q4_K_M versions of these models using llama.cpp on an AMD Ryzen AI Max+ 395.

[Figure: CPU prefill and decode throughput for LFM2-24B-A2B, Qwen3-30B-A3B-Instruct-2507, and gpt-oss-20b]

GPU Inference

We also report throughput (total tokens / wall time) achieved with vLLM on a single H100 SXM5 GPU.

[Figure: vLLM throughput on a single H100 SXM5 GPU]
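
The throughput metric used here (total tokens divided by wall time) can be computed with a small helper like the one below. This is a sketch, not the benchmark harness behind the reported numbers: `measure` and `fake_generate` are hypothetical, and whether "total tokens" counts prompt tokens, generated tokens, or both is not specified in this section.

```python
import time


def throughput_tok_per_s(total_tokens, wall_seconds):
    """Throughput as total tokens divided by wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return total_tokens / wall_seconds


def measure(generate_fn, prompts):
    """Time a batch of generations; generate_fn returns a token count per prompt."""
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return throughput_tok_per_s(total_tokens, elapsed)


def fake_generate(prompt):
    """Hypothetical stand-in: pretends each prompt costs 512 tokens."""
    time.sleep(0.01)
    return 512


rate = measure(fake_generate, ["prompt A", "prompt B"])
print(f"{rate:.0f} tok/s")
```

In a real measurement, `fake_generate` would be replaced by a call into the inference engine that returns the number of tokens processed for each request.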

📬 Contact

Got questions or want to connect? Join our Discord community.
If you are interested in custom solutions with edge deployment, please contact our sales team.

Citation

@article{liquidAI202624B,
 author = {Liquid AI},
 title = {LFM2-24B-A2B: Scaling Up the LFM2 Architecture},
 journal = {Liquid AI Blog},
 year = {2026},
 note = {www.liquid.ai/blog/},
}

@article{liquidai2025lfm2,
 title={LFM2 Technical Report},
 author={Liquid AI},
 journal={arXiv preprint arXiv:2511.23404},
 year={2025}
}

URL: https://huggingface.co/LiquidAI/LFM2-24B-A2B
