
URL: https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/


Recommended Model and Feature Matrices

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, +vision encoders, MLA, etc.).

If you’d like us to prioritize something specific, please submit a GitHub feature request here.

Recommended Models

These tables show the models currently tested for accuracy and performance.

Models

Model Type Unit Test Correctness Test Performance Test
Qwen/Qwen2.5-VL-7B-Instruct Multimodal
google/gemma-3-27b-it Text
meta-llama/Llama-3.1-8B-Instruct Text
meta-llama/Llama-3.3-70B-Instruct Text
Qwen/Qwen3-30B-A3B Text
Qwen/Qwen3-32B Text
Qwen/Qwen3-4B Text
Qwen/Qwen3-Coder-480B-A35B-Instruct Text
Qwen/Qwen3.5-397B-A17B Text
google/gemma-4-26B-A4B-it Multimodal
google/gemma-4-31B-it Multimodal
openai/gpt-oss-120b Text
deepseek-ai/DeepSeek-R1 Text
moonshotai/Kimi-K2.6 Text
deepseek-ai/DeepSeek-OCR Multimodal
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal
Qwen/Qwen3-VL-8B-Instruct Multimodal
Qwen/Qwen3.5-9B Multimodal
deepseek-ai/DeepSeek-Math-V2 Text
deepseek-ai/DeepSeek-V3.1 Text
deepseek-ai/DeepSeek-V3.2 Text
deepseek-ai/DeepSeek-V3.2-Speciale Text
MiniMaxAI/MiniMax-M2.5 Text
moonshotai/Kimi-K2-Thinking Text
openai/gpt-oss-20b Text
zai-org/GLM-5 Text
Model Type Unit Test Correctness Test Performance Test
google/gemma-4-31B-it Multimodal
google/gemma-3-27b-it Text
meta-llama/Llama-3.1-8B-Instruct Text
meta-llama/Llama-3.3-70B-Instruct Text
Qwen/Qwen3-30B-A3B Text
Qwen/Qwen3-32B Text
Qwen/Qwen3-4B Text
Qwen/Qwen3-Coder-480B-A35B-Instruct Text
Qwen/Qwen3.5-397B-A17B Text
Qwen/Qwen2.5-VL-7B-Instruct Multimodal
Qwen/Qwen3-Embedding-8B Embedding
deepseek-ai/DeepSeek-R1 Text
openai/gpt-oss-120b Text
google/gemma-4-E2B-it Multimodal
google/gemma-4-E4B-it Multimodal
moonshotai/Kimi-K2.6 Text
google/gemma-4-26B-A4B-it Multimodal
deepseek-ai/DeepSeek-OCR Multimodal
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal
Qwen/Qwen3-VL-8B-Instruct Multimodal
Qwen/Qwen3.5-9B Multimodal
deepseek-ai/DeepSeek-Math-V2 Text
deepseek-ai/DeepSeek-V3.1 Text
deepseek-ai/DeepSeek-V3.2 Text
deepseek-ai/DeepSeek-V3.2-Speciale Text
MiniMaxAI/MiniMax-M2.5 Text
moonshotai/Kimi-K2-Thinking Text
openai/gpt-oss-20b Text
zai-org/GLM-5 Text

Embedding Models

Model Type Unit Test Accuracy/Correctness Benchmark
Qwen/Qwen2.5-VL-7B-Instruct Multimodal ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3-VL-8B-Instruct Multimodal ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3.5-9B Multimodal ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-OCR Multimodal ❓ Untested ❓ Untested ❓ Untested
google/gemma-4-26B-A4B-it Multimodal ✅ Passing ✅ Passing ✅ Passing
google/gemma-4-31B-it Multimodal ✅ Passing ✅ Passing ✅ Passing
MiniMaxAI/MiniMax-M2.5 Text ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3-30B-A3B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-32B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-4B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3.5-397B-A17B Text ✅ Passing ✅ Passing ✅ Passing
deepseek-ai/DeepSeek-Math-V2 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-R1 Text ✅ Passing ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.1 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.2 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale Text ❓ Untested ❓ Untested ❓ Untested
google/gemma-3-27b-it Text ✅ Passing ✅ Passing ✅ Passing
meta-llama/Llama-3.1-8B-Instruct Text ✅ Passing ✅ Passing ✅ Passing
meta-llama/Llama-3.3-70B-Instruct Text ✅ Passing ✅ Passing ✅ Passing
moonshotai/Kimi-K2-Thinking Text ❓ Untested ❓ Untested ❓ Untested
moonshotai/Kimi-K2.6 Text ✅ Passing ❓ Untested ❓ Untested
openai/gpt-oss-120b Text ✅ Passing ✅ Passing ❓ Untested
openai/gpt-oss-20b Text ❓ Untested ❓ Untested ❓ Untested
zai-org/GLM-5 Text ❓ Untested ❓ Untested ❓ Untested
Model Type Unit Test Accuracy/Correctness Benchmark
Qwen/Qwen2.5-VL-7B-Instruct Multimodal ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct Multimodal ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3-VL-8B-Instruct Multimodal ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3.5-9B Multimodal ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-OCR Multimodal ❓ Untested ❓ Untested ❓ Untested
google/gemma-4-26B-A4B-it Multimodal not enough HBM not enough HBM not enough HBM
google/gemma-4-31B-it Multimodal ✅ Passing ✅ Passing not enough HBM
MiniMaxAI/MiniMax-M2.5 Text ❓ Untested ❓ Untested ❓ Untested
Qwen/Qwen3-30B-A3B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-32B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-4B Text ✅ Passing ✅ Passing ✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct Text not enough HBM not enough HBM not enough HBM
Qwen/Qwen3.5-397B-A17B Text not enough HBM not enough HBM not enough HBM
deepseek-ai/DeepSeek-Math-V2 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-R1 Text not enough HBM ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.1 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.2 Text ❓ Untested ❓ Untested ❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale Text ❓ Untested ❓ Untested ❓ Untested
google/gemma-3-27b-it Text ✅ Passing ✅ Passing ✅ Passing
meta-llama/Llama-3.1-8B-Instruct Text ✅ Passing ✅ Passing ✅ Passing
meta-llama/Llama-3.3-70B-Instruct Text ✅ Passing ✅ Passing ✅ Passing
moonshotai/Kimi-K2-Thinking Text ❓ Untested ❓ Untested ❓ Untested
moonshotai/Kimi-K2.6 Text not enough HBM ❓ Untested ❓ Untested
openai/gpt-oss-120b Text not enough HBM not enough HBM ❓ Untested
openai/gpt-oss-20b Text ❓ Untested ❓ Untested ❓ Untested
zai-org/GLM-5 Text ❓ Untested ❓ Untested ❓ Untested
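
Several rows above are marked "not enough HBM". The page does not spell out the exact criterion, but a back-of-the-envelope estimate of the weight footprint shows why very large checkpoints may not fit. The following is a minimal sketch, not the project's own check: the function names, the 80% headroom factor, and the chip count and per-chip HBM size in the usage lines are all hypothetical, and parameter counts are read off the model names. It ignores KV cache and activations, so real requirements are higher.

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_hbm(
    params_billion: float,
    bits_per_weight: int,
    hbm_gib_per_chip: float,  # hypothetical value, supplied by the caller
    num_chips: int,           # hypothetical value, supplied by the caller
    headroom: float = 0.8,    # assumed fraction of HBM usable for weights
) -> bool:
    """Return True if the weights fit in the assumed HBM budget."""
    budget_gib = hbm_gib_per_chip * num_chips * headroom
    return weight_footprint_gib(params_billion, bits_per_weight) <= budget_gib


# Hypothetical host: 8 chips with 32 GiB of HBM each.
# An 8B-parameter model at 16 bits per weight needs roughly 15 GiB.
small_ok = fits_in_hbm(8, 16, hbm_gib_per_chip=32, num_chips=8)
# A 480B-parameter model at 16 bits per weight needs roughly 894 GiB.
large_ok = fits_in_hbm(480, 16, hbm_gib_per_chip=32, num_chips=8)
```

Under these assumed numbers the 8B model fits and the 480B model does not, which is consistent with the pattern in the table, though the actual hardware configuration behind each row may differ.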

Recommended Features

This table shows the features currently tested for accuracy and performance.

Feature Flax Torchax Default
async scheduler
Chunked Prefill
DCN-based P/D disaggregation
LoRA_Torch
Out-of-tree model support
Prefix Caching
Single Program Multi Data
Speculative Decoding: Ngram
KV Cache Offload
Multimodal Inputs
Speculative Decoding: Eagle3
hybrid kv cache
multi-host
runai_model_streamer_loader
sampling_params
Single-Host-P-D-disaggregation
structured_decoding
Feature Flax Torchax Default
async scheduler
Chunked Prefill
DCN-based P/D disaggregation
KV Cache Offload
LoRA_Torch
Prefix Caching
Single Program Multi Data
Speculative Decoding: Eagle3
Speculative Decoding: Ngram
Multimodal Inputs
Out-of-tree model support
multi-host
hybrid kv cache
runai_model_streamer_loader
sampling_params
Step Pooling (Embedding)
structured_decoding

Kernel Support

This table tracks high-level correctness and performance validation for distributed compute kernels.

Feature Correctness Test Performance Test
Collective Communication Matmul
MLA
MoE
Quantized Attention
Quantized KV Cache
Quantized Matmul
Ragged Paged Attention V3

Microbenchmark Kernel Support

This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.

Category Test W16A16 W8A8 W8A16 W4A4 W4A8 W4A16
MoE Fused MoE
MoE gmm
Dense All-gather matmul
Attention Generic Ragged Paged Attention V3
Attention MLA
Attention Ragged Paged Attention V3 (Head_Dim 64)

Note: For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are bit widths.
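
The W[x]A[y] column labels can be split into their two bit widths mechanically. The following is a minimal sketch; `Scheme` and `parse_scheme` are hypothetical helper names introduced here for illustration and are not part of vLLM or tpu-inference.

```python
import re
from typing import NamedTuple


class Scheme(NamedTuple):
    # For attention kernels, per the note above, W is the KV cache precision.
    weight_bits: int
    # A is the compute (activation) precision.
    activation_bits: int


_SCHEME_RE = re.compile(r"^W(\d+)A(\d+)$")


def parse_scheme(label: str) -> Scheme:
    """Parse a label such as 'W8A16' into its two bit widths."""
    match = _SCHEME_RE.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return Scheme(int(match.group(1)), int(match.group(2)))


# The six precision columns used in the microbenchmark table.
COLUMNS = ["W16A16", "W8A8", "W8A16", "W4A4", "W4A8", "W4A16"]
parsed = {label: parse_scheme(label) for label in COLUMNS}
```

For example, `parsed["W8A16"]` holds 8 bits for W and 16 bits for A, which for an attention kernel reads as an 8-bit KV cache with 16-bit compute.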


Parallelism Support

This table shows the current parallelism support status.

Feature Flax (Single-host) Flax (Multi-host) Torchax (Single-host) Torchax (Multi-host)
PP
DP
EP
TP
CP
SP (vote to prioritize)
Feature Flax (Single-host) Flax (Multi-host) Torchax (Single-host) Torchax (Multi-host)
PP
DP
TP
EP
SP (vote to prioritize)
CP

Quantization Support

This table shows the current quantization support status.

Checkpoint dtype Method Supported Hardware Acceleration Flax Torchax
FP4 W4A16 mxfp4 v7
FP8 W8A16 compressed-tensor v7
FP8 W8A8 compressed-tensor v7
INT4 W4A16 awq v5, v6
INT8 W8A8 compressed-tensor v5, v6

Note: This table reflects checkpoint loading compatibility tests only.

Checkpoint dtype Method Supported Hardware Acceleration Flax Torchax
FP4 W4A16 mxfp4 v7
FP8 W8A16 compressed-tensor v7
FP8 W8A8 compressed-tensor v7
INT4 W4A16 awq v5, v6
INT8 W8A8 compressed-tensor v5, v6
NVFP4 W4A16 modelopt_fp4 v7

Note: This table reflects checkpoint loading compatibility tests only.
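
To look up which checkpoint formats the table lists for a given hardware label, the rows can be transcribed into a small lookup. This is only an illustrative sketch: `QUANT_ROWS` and `entries_for_hardware` are hypothetical names, the rows are copied from the table directly above, and treating the first two tokens of each row (for example "FP4 W4A16") as a single checkpoint dtype cell is an assumption about the flattened layout. The Flax and Torchax status columns are not represented.

```python
# (checkpoint dtype, method, hardware labels) as transcribed from the table.
QUANT_ROWS = [
    ("FP4 W4A16", "mxfp4", ("v7",)),
    ("FP8 W8A16", "compressed-tensor", ("v7",)),
    ("FP8 W8A8", "compressed-tensor", ("v7",)),
    ("INT4 W4A16", "awq", ("v5", "v6")),
    ("INT8 W8A8", "compressed-tensor", ("v5", "v6")),
    ("NVFP4 W4A16", "modelopt_fp4", ("v7",)),
]


def entries_for_hardware(hardware: str) -> list[tuple[str, str]]:
    """Return (checkpoint dtype, method) pairs listed for a hardware label."""
    return [
        (dtype, method)
        for dtype, method, labels in QUANT_ROWS
        if hardware in labels
    ]


v6_entries = entries_for_hardware("v6")
v7_entries = entries_for_hardware("v7")
```

As the note above says, a match here indicates only that checkpoint loading was exercised for that combination, not that accuracy or performance was validated.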