
LFM2.5-8B-A1B-ONNX

ONNX export of LFM2.5-8B-A1B for cross-platform inference.

LFM2.5-8B-A1B is a Mixture of Experts model with 8B total parameters and about 1B active parameters per token. It uses 32 experts with 4 experts activated per token, combining the efficiency of sparse models with the quality of larger dense models.
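
To make the sparsity concrete, here is a minimal NumPy sketch of top-k expert routing with 32 experts and 4 active per token. Only those two numbers come from the description above. The router weights, the toy hidden size, and the softmax over the selected experts are illustrative assumptions, since this card does not describe the model's actual gating function.

```python
import numpy as np

NUM_EXPERTS = 32   # from the model description
TOP_K = 4          # experts activated per token

def route_tokens(hidden, router_weight, top_k=TOP_K):
    """Generic top-k gating (an assumption, not the model's confirmed router).

    hidden:        (num_tokens, hidden_dim)
    router_weight: (hidden_dim, num_experts)
    Returns (expert_ids, gates), both of shape (num_tokens, top_k).
    """
    scores = hidden @ router_weight                       # (tokens, experts)
    # Indices of the top_k highest-scoring experts per token (unordered).
    expert_ids = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    top_scores = np.take_along_axis(scores, expert_ids, axis=-1)
    # Softmax over the selected experts only, so each token's gates sum to 1.
    top_scores = top_scores - top_scores.max(axis=-1, keepdims=True)
    gates = np.exp(top_scores)
    gates /= gates.sum(axis=-1, keepdims=True)
    return expert_ids, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((3, 16))                     # 3 tokens, toy hidden size
router_weight = rng.standard_normal((16, NUM_EXPERTS))    # hypothetical router
expert_ids, gates = route_tokens(hidden, router_weight)
print(expert_ids.shape, gates.sum(axis=-1))
```

Each token touches only 4 of the 32 expert blocks, which is why the active parameter count per token is far below the total.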

Recommended Variants

Precision	Size	Use Case
Q4F16	~4.7GB	Recommended (Q4 MoE + FP16 dense)
FP16	~15.8GB	Higher quality
Q4	~5.2GB	Smallest size
Q8	~30.4GB	Highest-fidelity quantized variant

Note: This model is too large for WebGPU browser inference.

Validation

This export was validated against the local PyTorch reference for LiquidAI/LFM2.5-8B-A1B.

FP32 padded-batch parity passed for both left and right padding, with cosine similarity 1.0000 and top-5 overlap 5/5 at the last valid token for each row.
Q4 decoder and coherence checks passed the repository thresholds. Average coherence similarity: 0.7144.
Q4F16 was runtime-validated on CPUExecutionProvider and matched the same decoder/coherence thresholds as Q4. Average coherence similarity: 0.7145.
Q8 decoder and coherence checks passed, and stayed very close to the PyTorch reference. Average coherence similarity: 0.9975.
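
For readers who want to run a similar parity check, the two FP32 metrics named above (cosine similarity and top-5 overlap at the last valid token of each row) can be sketched as follows. This is not the repository's validation script, which is not shown here. The function names and the random stand-in logits are hypothetical.

```python
import numpy as np

def last_valid_index(attention_mask):
    """Index of the last non-padding token per row (left or right padding)."""
    seq_len = attention_mask.shape[1]
    # Reverse each row, find the first 1, then map back to the original index.
    return seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)

def parity_metrics(ref_logits, test_logits, attention_mask, k=5):
    """Cosine similarity and top-k overlap at the last valid token of each row."""
    rows = np.arange(ref_logits.shape[0])
    idx = last_valid_index(attention_mask)
    ref = ref_logits[rows, idx].astype(np.float64)        # (batch, vocab)
    test = test_logits[rows, idx].astype(np.float64)
    cosine = (ref * test).sum(-1) / (
        np.linalg.norm(ref, axis=-1) * np.linalg.norm(test, axis=-1)
    )
    ref_top = np.argsort(ref, axis=-1)[:, -k:]
    test_top = np.argsort(test, axis=-1)[:, -k:]
    overlap = [
        len(set(r) & set(t)) for r, t in zip(ref_top.tolist(), test_top.tolist())
    ]
    return cosine, overlap

# Stand-in data: row 0 is left-padded, row 1 is right-padded.
mask = np.array([[0, 0, 1, 1], [1, 1, 1, 0]], dtype=np.int64)
rng = np.random.default_rng(0)
ref_logits = rng.standard_normal((2, 4, 100))
cosine, overlap = parity_metrics(ref_logits, ref_logits.copy(), mask)
print(cosine, overlap)
```

In a real check, `ref_logits` would come from the PyTorch model and `test_logits` from the ONNX session, both run on the same padded batch.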

Model Files

onnx/
├── model.onnx # FP32 model graph
├── model.onnx_data* # FP32 weights
├── model_fp16.onnx # FP16 model graph
├── model_fp16.onnx_data* # FP16 weights
├── model_q4.onnx # Q4 model graph
├── model_q4.onnx_data* # Q4 weights
├── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
├── model_q4f16.onnx_data* # Q4F16 weights
├── model_q8.onnx # Q8 model graph
└── model_q8.onnx_data* # Q8 weights

* Large models split weights across multiple files:
 model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
 All data files must be in the same directory as the .onnx file.
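
Before creating a session, it can help to confirm that the weight shards actually sit next to the graph file. The sketch below assumes only the naming pattern described above. `find_external_data` is a hypothetical helper, and the temporary directory stands in for a real download.

```python
import tempfile
from pathlib import Path

def find_external_data(onnx_path):
    """List external-data files located next to an .onnx graph file.

    Matches <name>.onnx_data, <name>.onnx_data_1, <name>.onnx_data_2, ...
    """
    onnx_path = Path(onnx_path)
    return sorted(onnx_path.parent.glob(onnx_path.name + "_data*"))

# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    onnx_dir = Path(tmp) / "onnx"
    onnx_dir.mkdir()
    for fname in [
        "model_q4.onnx",
        "model_q4.onnx_data",
        "model_q4.onnx_data_1",
        "model_q8.onnx_data",      # belongs to a different variant
    ]:
        (onnx_dir / fname).touch()
    names = [p.name for p in find_external_data(onnx_dir / "model_q4.onnx")]

print(names)
```

An empty result for a variant you intend to load is a sign that the download pattern did not include its `_data` files.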

Python

Installation

pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
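
After installing, you may want to check which execution providers your build exposes. The sketch below uses `onnxruntime.get_available_providers()` and a hypothetical helper, `pick_providers`, that prefers CUDA and falls back to CPU. The preference order is an assumption. This card reports runtime validation only on CPUExecutionProvider (for Q4F16).

```python
def pick_providers(available):
    """Order execution providers by preference, keeping CPU as a fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

try:
    import onnxruntime
    available = onnxruntime.get_available_providers()
except ImportError:  # onnxruntime is not installed in this environment
    available = []

print(pick_providers(available))
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`.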

Inference

from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoTokenizer
import numpy as np
import onnxruntime

# 1. Load config, tokenizer, and model
model_id = "LiquidAI/LFM2.5-8B-A1B-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eos_token_id = config.eos_token_id

filename = "model_q4f16.onnx" # Options: model.onnx, model_fp16.onnx, model_q4.onnx, model_q4f16.onnx, model_q8.onnx
model_path = snapshot_download(repo_id=model_id, allow_patterns=f"onnx/{filename}*")
session = onnxruntime.InferenceSession(f"{model_path}/onnx/{filename}")
input_names = {inp.name for inp in session.get_inputs()}

# 2. Prepare inputs
prompt = "What is C. elegans?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="np",
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
batch_size = input_ids.shape[0]

past_cache_values = {}
# Initialize the caches, matching each cache input's floating-point dtype
for inp in session.get_inputs():
    name = inp.name
    shape = inp.shape
    dtype = np.float32 if inp.type == "tensor(float)" else np.float16
    if name.startswith("past_key_values"):
        # Attention cache: the sequence axis starts at length 0
        past_cache_values[name] = np.zeros([batch_size, shape[1], 0, shape[3]], dtype=dtype)
    elif name.startswith("past_conv"):
        # Conv state: fixed-size buffer, zero-initialized
        past_cache_values[name] = np.zeros([batch_size, shape[1], shape[2]], dtype=dtype)

position_ids = np.arange(input_ids.shape[1], dtype=np.int64).reshape(1, -1)

# 3. Generation loop
max_new_tokens = 256
generated_tokens = np.array([[]], dtype=np.int64)
cur_len = input_ids.shape[1]
# Output names do not change between steps, so look them up once
output_names = [out.name for out in session.get_outputs()]
for i in range(max_new_tokens):
    if i == 0:
        # Prefill: feed the whole prompt
        ids = input_ids
        pos = position_ids
    else:
        # Decode: feed only the most recently generated token
        ids = generated_tokens[:, -1:]
        pos = np.array([[cur_len - 1]], dtype=np.int64)

    feed = {
        "input_ids": ids,
        "attention_mask": attention_mask,
        **past_cache_values,
    }
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    logits = outputs[0]
    next_token = logits[:, -1].argmax(-1, keepdims=True)  # greedy decoding

    generated_tokens = (
        next_token if generated_tokens.shape[1] == 0
        else np.concatenate([generated_tokens, next_token], axis=-1)
    )
    attention_mask = np.concatenate(
        [attention_mask, np.ones_like(next_token, dtype=np.int64)],
        axis=-1,
    )

    # Map each "present*" output back onto its matching "past*" input
    cache_outputs = {
        name: value
        for name, value in zip(output_names[1:], outputs[1:])
    }
    for key in past_cache_values:
        present_key = key.replace("past_key_values", "present").replace("past_conv", "present_conv")
        past_cache_values[key] = cache_outputs[present_key]

    cur_len += 1
    if np.isin(next_token, eos_token_id).any():
        break

    # Stream the token as soon as it is generated
    print(tokenizer.decode(next_token[0]), end="", flush=True)
print()

# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
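
The loop above decodes greedily with `argmax`. If you prefer sampled output, a temperature and top-k sampler along the following lines could replace that step. It is a sketch under stated assumptions: `sample_next_token` is a hypothetical helper, and the default `temperature` and `top_k` values are illustrative, not settings recommended by this card.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample from the last-position logits; returns int64 of shape (batch, 1)."""
    rng = rng or np.random.default_rng()
    last = logits[:, -1].astype(np.float64)               # (batch, vocab)
    if temperature <= 0:
        # Fall back to greedy decoding.
        return last.argmax(-1, keepdims=True).astype(np.int64)
    last = last / temperature
    k = min(top_k, last.shape[-1])
    # Mask out everything below the k-th largest logit in each row.
    kth = np.partition(last, -k, axis=-1)[:, -k][:, None]
    last = np.where(last < kth, -np.inf, last)
    probs = np.exp(last - last.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    picks = [rng.choice(probs.shape[-1], p=row) for row in probs]
    return np.array(picks, dtype=np.int64)[:, None]

# Stand-in logits with shape (batch, seq_len, vocab).
demo_logits = np.random.default_rng(0).standard_normal((2, 3, 10))
tok = sample_next_token(demo_logits, top_k=1, rng=np.random.default_rng(1))
print(tok.shape)
```

To use it, swap the `argmax` line in the generation loop for `next_token = sample_next_token(logits)`. The returned array has the same shape and dtype that the rest of the loop expects.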

License

This model is released under the LFM 1.0 License.


Model tree for LiquidAI/LFM2.5-8B-A1B-ONNX

Base model: LiquidAI/LFM2.5-8B-A1B-Base
Finetuned: LiquidAI/LFM2.5-8B-A1B
Quantized (51): this model

URL: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-ONNX
