Recommended Model and Feature Matrices

Although vLLM TPU’s new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, models with vision encoders, MLA, etc.).

If you’d like us to prioritize something specific, please submit a GitHub feature request here.

Recommended Models

These tables show the models currently tested for accuracy and performance.

Models

Model	Type	Unit Test	Correctness Test	Performance Test
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅	✅	✅
google/gemma-3-27b-it	Text	✅	✅	✅
meta-llama/Llama-3.1-8B-Instruct	Text	✅	✅	✅
meta-llama/Llama-3.3-70B-Instruct	Text	✅	✅	✅
Qwen/Qwen3-30B-A3B	Text	✅	✅	✅
Qwen/Qwen3-32B	Text	✅	✅	✅
Qwen/Qwen3-4B	Text	✅	✅	✅
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅	✅	✅
Qwen/Qwen3.5-397B-A17B	Text	✅	✅	✅
google/gemma-4-26B-A4B-it	Multimodal	✅	✅	❌
google/gemma-4-31B-it	Multimodal	✅	✅	❌
openai/gpt-oss-120b	Text	✅	✅	❓
deepseek-ai/DeepSeek-R1	Text	✅	❓	❓
moonshotai/Kimi-K2.6	Text	✅	❓	❓
deepseek-ai/DeepSeek-OCR	Multimodal	❓	❓	❓
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3.5-9B	Multimodal	❓	❓	❓
deepseek-ai/DeepSeek-Math-V2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.1	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓	❓	❓
MiniMaxAI/MiniMax-M2.5	Text	❓	❓	❓
moonshotai/Kimi-K2-Thinking	Text	❓	❓	❓
openai/gpt-oss-20b	Text	❓	❓	❓
zai-org/GLM-5	Text	❓	❓	❓

Model	Type	Unit Test	Correctness Test	Performance Test
google/gemma-4-31B-it	Multimodal	✅	✅	✅
google/gemma-3-27b-it	Text	✅	✅	✅
meta-llama/Llama-3.1-8B-Instruct	Text	✅	✅	✅
meta-llama/Llama-3.3-70B-Instruct	Text	✅	✅	✅
Qwen/Qwen3-30B-A3B	Text	✅	✅	✅
Qwen/Qwen3-32B	Text	✅	✅	✅
Qwen/Qwen3-4B	Text	✅	✅	✅
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅	✅	✅
Qwen/Qwen3.5-397B-A17B	Text	✅	✅	✅
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅	✅	❌
Qwen/Qwen3-Embedding-8B	Embedding	✅	✅	❓
deepseek-ai/DeepSeek-R1	Text	✅	✅	❓
openai/gpt-oss-120b	Text	✅	✅	❓
google/gemma-4-E2B-it	Multimodal	✅	❌	❓
google/gemma-4-E4B-it	Multimodal	✅	❌	❓
moonshotai/Kimi-K2.6	Text	✅	❓	❓
google/gemma-4-26B-A4B-it	Multimodal	❌	❓	❓
deepseek-ai/DeepSeek-OCR	Multimodal	❓	❓	❓
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3.5-9B	Multimodal	❓	❓	❓
deepseek-ai/DeepSeek-Math-V2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.1	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓	❓	❓
MiniMaxAI/MiniMax-M2.5	Text	❓	❓	❓
moonshotai/Kimi-K2-Thinking	Text	❓	❓	❓
openai/gpt-oss-20b	Text	❓	❓	❓
zai-org/GLM-5	Text	❓	❓	❓
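
As a rough illustration of how to read the matrix above, the sketch below parses a few rows copied from it and keeps only the models that pass all three test columns. The helper name `fully_validated` and the embedded `MATRIX` sample are illustrative assumptions, not part of vLLM or tpu-inference.

```python
import csv
import io

# A few rows copied from the table above, kept tab-separated as on this page.
# This is only a sample; paste the full table to check every model.
MATRIX = """Model\tType\tUnit Test\tCorrectness Test\tPerformance Test
google/gemma-3-27b-it\tText\t✅\t✅\t✅
Qwen/Qwen2.5-VL-7B-Instruct\tMultimodal\t✅\t✅\t❌
openai/gpt-oss-120b\tText\t✅\t✅\t❓
moonshotai/Kimi-K2.6\tText\t✅\t❓\t❓
"""

TEST_COLUMNS = ("Unit Test", "Correctness Test", "Performance Test")


def fully_validated(matrix_text):
    """Return the models whose three test columns are all marked passing."""
    rows = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    return [
        row["Model"]
        for row in rows
        if all(row[column].strip() == "✅" for column in TEST_COLUMNS)
    ]


print(fully_validated(MATRIX))
```

A model marked ❓ in any column is treated the same as a failing one here, which is the conservative choice when picking a starting point.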

Embedding Models

Model	Type	Unit Test	Accuracy/Correctness	Benchmark
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3.5-9B	Multimodal	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-OCR	Multimodal	❓ Untested	❓ Untested	❓ Untested
google/gemma-4-26B-A4B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
google/gemma-4-31B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
MiniMaxAI/MiniMax-M2.5	Text	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-30B-A3B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-32B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-4B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3.5-397B-A17B	Text	✅ Passing	✅ Passing	✅ Passing
deepseek-ai/DeepSeek-Math-V2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-R1	Text	✅ Passing	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.1	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓ Untested	❓ Untested	❓ Untested
google/gemma-3-27b-it	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.1-8B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.3-70B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
moonshotai/Kimi-K2-Thinking	Text	❓ Untested	❓ Untested	❓ Untested
moonshotai/Kimi-K2.6	Text	✅ Passing	❓ Untested	❓ Untested
openai/gpt-oss-120b	Text	✅ Passing	✅ Passing	❓ Untested
openai/gpt-oss-20b	Text	❓ Untested	❓ Untested	❓ Untested
zai-org/GLM-5	Text	❓ Untested	❓ Untested	❓ Untested

Model	Type	Unit Test	Accuracy/Correctness	Benchmark
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3.5-9B	Multimodal	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-OCR	Multimodal	❓ Untested	❓ Untested	❓ Untested
google/gemma-4-26B-A4B-it	Multimodal	not enough HBM	not enough HBM	not enough HBM
google/gemma-4-31B-it	Multimodal	✅ Passing	✅ Passing	not enough HBM
MiniMaxAI/MiniMax-M2.5	Text	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-30B-A3B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-32B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-4B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	not enough HBM	not enough HBM	not enough HBM
Qwen/Qwen3.5-397B-A17B	Text	not enough HBM	not enough HBM	not enough HBM
deepseek-ai/DeepSeek-Math-V2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-R1	Text	not enough HBM	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.1	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓ Untested	❓ Untested	❓ Untested
google/gemma-3-27b-it	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.1-8B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.3-70B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
moonshotai/Kimi-K2-Thinking	Text	❓ Untested	❓ Untested	❓ Untested
moonshotai/Kimi-K2.6	Text	not enough HBM	❓ Untested	❓ Untested
openai/gpt-oss-120b	Text	not enough HBM	not enough HBM	❓ Untested
openai/gpt-oss-20b	Text	❓ Untested	❓ Untested	❓ Untested
zai-org/GLM-5	Text	❓ Untested	❓ Untested	❓ Untested

Recommended Features

This table shows the features currently tested for accuracy and performance.

Feature	Flax	Torchax	Default
async scheduler	✅	✅	✅
Chunked Prefill	✅	✅	✅
DCN-based P/D disaggregation	✅	✅	✅
LoRA_Torch	✅	✅	✅
Out-of-tree model support	✅	✅	✅
Prefix Caching	✅	✅	✅
Single Program Multi Data	✅	✅	✅
Speculative Decoding: Ngram	✅	✅	✅
KV Cache Offload	✅	❌	✅
Multimodal Inputs	✅	❌	✅
Speculative Decoding: Eagle3	✅	❌	✅
hybrid kv cache	❓	❓	❓
multi-host	❓	❓	❓
runai_model_streamer_loader	❓	❓	❓
sampling_params	❓	❓	❓
Single-Host-P-D-disaggregation	❓	❓	❓
structured_decoding	❓	❓	❓

Feature	Flax	Torchax	Default
async scheduler	✅	✅	✅
Chunked Prefill	✅	✅	✅
DCN-based P/D disaggregation	✅	✅	✅
KV Cache Offload	✅	✅	✅
LoRA_Torch	✅	✅	✅
Prefix Caching	✅	✅	✅
Single Program Multi Data	✅	✅	✅
Speculative Decoding: Eagle3	✅	✅	✅
Speculative Decoding: Ngram	✅	✅	✅
Multimodal Inputs	✅	❌	✅
Out-of-tree model support	❌	❌	❌
multi-host	❓	❌	❓
hybrid kv cache	❓	❓	❓
runai_model_streamer_loader	❓	❓	❓
sampling_params	❓	❓	❓
Step Pooling (Embedding)	❓	❓	❓
structured_decoding	❓	❓	❓

Kernel Support

This table tracks high-level correctness and performance validation for distributed compute kernels.

Feature	Correctness Test	Performance Test
Collective Communication Matmul	✅	❓
MLA	❓	❓
MoE	❓	❓
Quantized Attention	❓	❓
Quantized KV Cache	❓	❓
Quantized Matmul	❓	❓
Ragged Paged Attention V3	✅	✅

Microbenchmark Kernel Support

This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.

Category	Test	W16A16	W8A8	W8A16	W4A4	W4A8	W4A16
MoE	Fused MoE	❓	❓	❓	❓	❓	❓
MoE	gmm	❓	❓	❓	❓	❓	❓
Dense	All-gather matmul	❓	❓	❓	❓	❓	❓
Attention	Generic Ragged Paged Attention V3	❓	❓	❓	❓	❓	❓
Attention	MLA	❓	❓	❓	❓	❓	❓
Attention	Ragged Paged Attention V3 Head_Dim 64	❓	❓	❓	❓	❓	❓

Note: For attention kernels, W[x]A[y] means the KV cache (W) is stored at x-bit precision and the computation (A) runs at y-bit precision.
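
To make the attention-kernel reading of W[x]A[y] concrete, here is a minimal sketch that splits a label into its two bit widths and estimates how many KV cache bytes one token occupies. The function names and the model shape are hypothetical and chosen only for illustration; the estimate ignores quantization scales and any padding a real kernel may add.

```python
import re


def parse_precision_label(label):
    """Split a label such as 'W8A16' into its (x, y) bit widths."""
    match = re.fullmatch(r"W(\d+)A(\d+)", label)
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def kv_cache_bytes_per_token(label, num_layers, num_kv_heads, head_dim):
    """Estimate KV cache bytes stored per token for an attention kernel label.

    Counts one key vector and one value vector per layer and KV head.
    For attention kernels, the W bit width is the KV cache precision.
    """
    kv_bits, _compute_bits = parse_precision_label(label)
    return 2 * num_layers * num_kv_heads * head_dim * kv_bits // 8


# Hypothetical model shape, used only to compare the labels.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
for label in ("W16A16", "W8A16", "W4A16"):
    print(label, kv_cache_bytes_per_token(label, **shape))
```

Under these assumptions, halving the KV cache bit width halves the per-token cache footprint, while the compute precision (the A part) does not affect storage.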

Parallelism Support

This table shows the current parallelism support status.

Feature	Flax (single-host)	Flax (multi-host)	Torchax (single-host)	Torchax (multi-host)
PP	✅	✅	✅	✅
DP	✅	❓	✅	❓
EP	✅	❓	✅	❓
TP	✅	❓	✅	❓
CP	❓	❓	❓	❓
SP (vote to prioritize)	❓	❓	❓	❓

Feature	Flax (single-host)	Flax (multi-host)	Torchax (single-host)	Torchax (multi-host)
PP	✅	✅	✅	✅
DP	✅	❓	✅	❓
TP	✅	❓	✅	❓
EP	❌	❓	✅	❓
SP (vote to prioritize)	❌	❓	❓	❓
CP	❓	❓	❓	❓

Quantization Support

This table shows the current quantization support status.

Checkpoint dtype	Method	Supported Hardware Acceleration	Flax	Torchax
FP4 W4A16	mxfp4	v7	❓	❓
FP8 W8A16	compressed-tensor	v7	❓	❓
FP8 W8A8	compressed-tensor	v7	❓	❓
INT4 W4A16	awq	v5, v6	❓	❓
INT8 W8A8	compressed-tensor	v5, v6	❓	❓

Note: This table covers checkpoint loading compatibility only.

Checkpoint dtype	Method	Supported Hardware Acceleration	Flax	Torchax
FP4 W4A16	mxfp4	v7	❓	❓
FP8 W8A16	compressed-tensor	v7	❓	❓
FP8 W8A8	compressed-tensor	v7	❓	❓
INT4 W4A16	awq	v5, v6	❓	❓
INT8 W8A8	compressed-tensor	v5, v6	❓	❓
NVFP4 W4A16	modelopt_fp4	v7	❓	❓
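
The "Checkpoint dtype" column pairs a number format with a W[x]A[y] precision. As a back-of-the-envelope aid, the sketch below splits such an entry and estimates the weight-only memory footprint for a given parameter count. Both helper names are illustrative assumptions, and the estimate deliberately ignores quantization scales, zero points, and any layers a checkpoint leaves unquantized, so real footprints will be somewhat larger.

```python
def parse_checkpoint_dtype(entry):
    """Split an entry like 'FP8 W8A16' into (format, weight bits, activation bits)."""
    number_format, precision = entry.split()
    weight_part, activation_part = precision[1:].split("A")
    return number_format, int(weight_part), int(activation_part)


def weight_gib(num_params, weight_bits):
    """Approximate weight storage in GiB: parameters x bits per weight / 8."""
    return num_params * weight_bits / 8 / 2**30


# Hypothetical parameter count, used only to compare the table's dtypes.
num_params = 30_000_000_000
for entry in ("FP4 W4A16", "FP8 W8A16", "INT8 W8A8"):
    number_format, weight_bits, activation_bits = parse_checkpoint_dtype(entry)
    print(f"{entry}: about {weight_gib(num_params, weight_bits):.1f} GiB of weights")
```

Comparing this estimate with the HBM available on the target hardware gives a quick first check before attempting to load a checkpoint.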