Recommended Model and Feature Matrices
Although vLLM TPU’s new unified backend makes high-performance serving possible out of the box with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.
We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).
If you’d like us to prioritize something specific, please submit a GitHub feature request.
Recommended Models
These tables show the models currently tested for accuracy and performance.
Models
Embedding Models
| Model | Type | UnitTest | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | ✅ Passing | ✅ Passing | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Model | Type | UnitTest | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | not enough HBM | not enough HBM | not enough HBM |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | not enough HBM |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | not enough HBM | not enough HBM | not enough HBM |
| Qwen/Qwen3.5-397B-A17B | Text | not enough HBM | not enough HBM | not enough HBM |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | not enough HBM | not enough HBM | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
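The "not enough HBM" entries above can be sanity-checked with a back-of-the-envelope weight footprint estimate. The sketch below is illustrative only: the helper name `weight_footprint_gib` is hypothetical, the parameter count is read off the model name (an assumption), and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
def weight_footprint_gib(num_params: float, weight_bits: int) -> float:
    """Rough weight-only memory estimate in GiB.

    Assumes every parameter is stored at `weight_bits` precision and
    ignores KV cache, activations, and framework overhead.
    """
    total_bytes = num_params * weight_bits / 8
    return total_bytes / 2**30


# Assumption: Qwen/Qwen3-32B has roughly 32 billion parameters (taken from its name).
for bits in (16, 8, 4):
    gib = weight_footprint_gib(32e9, bits)
    print(f"~32B params at {bits}-bit weights: about {gib:.1f} GiB")
```

Comparing such an estimate against the total HBM available across the chips in a slice gives a first indication of whether a model can load at all; the real requirement is higher once the KV cache is included.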
Recommended Features
This table shows the features currently tested for accuracy and performance.
| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | ✅ | ✅ | ✅ |
| Chunked Prefill | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| KV Cache Offload | ✅ | ❌ | ✅ |
| Multimodal Inputs | ✅ | ❌ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ❌ | ✅ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| Single-Host-P-D-disaggregation | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |
| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | ✅ | ✅ | ✅ |
| Chunked Prefill | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ✅ | ✅ |
| KV Cache Offload | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| Multimodal Inputs | ✅ | ❌ | ✅ |
| Out-of-tree model support | ❌ | ❌ | ❌ |
| multi-host | ❓ | ❌ | ❓ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| Step Pooling (Embedding) | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |
Kernel Support
This table tracks high-level correctness and performance validation for distributed compute kernels.
| Feature | CorrectnessTest | PerformanceTest |
|---|---|---|
| Collective Communication Matmul | ✅ | ❓ |
| MLA | ❓ | ❓ |
| MoE | ❓ | ❓ |
| Quantized Attention | ❓ | ❓ |
| Quantized KV Cache | ❓ | ❓ |
| Quantized Matmul | ❓ | ❓ |
| Ragged Paged Attention V3 | ✅ | ✅ |
Microbenchmark Kernel Support
This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.
| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| MoE | gmm | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Dense | All‑gather matmul | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Generic Ragged Paged Attention V3 | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | MLA | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Ragged Paged Attention V3 Head_Dim 64 | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
Note: For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are bit widths.
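As a small illustration of the W[x]A[y] column labels, the sketch below splits a label into its two bit widths. The names `Precision` and `parse_precision` are hypothetical helpers written for this example and are not part of vLLM or tpu-inference.

```python
import re
from typing import NamedTuple


class Precision(NamedTuple):
    w_bits: int  # W: weight precision (KV cache precision for attention kernels)
    a_bits: int  # A: activation / compute precision


_LABEL = re.compile(r"W(\d+)A(\d+)")


def parse_precision(label: str) -> Precision:
    """Split a label such as 'W8A16' into its W and A bit widths."""
    match = _LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return Precision(int(match.group(1)), int(match.group(2)))


for column in ["W16A16", "W8A8", "W8A16", "W4A4", "W4A8", "W4A16"]:
    p = parse_precision(column)
    print(f"{column}: W is {p.w_bits}-bit, A is {p.a_bits}-bit")
```

Read this way, W8A16 on an attention kernel means an 8-bit KV cache with 16-bit compute.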
Parallelism Support
This table shows the current parallelism support status.
| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| PP | ✅ | ✅ | ✅ | ✅ |
| DP | ✅ | ❓ | ✅ | ❓ |
| EP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |
| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| PP | ✅ | ✅ | ✅ | ✅ |
| DP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| EP | ❌ | ❓ | ✅ | ❓ |
| SP (vote to prioritize) | ❌ | ❓ | ❓ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
Quantization Support
This table shows the current quantization support status.
| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |
Note: This table only tests checkpoint loading compatibility.
| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |
| NVFP4 W4A16 | modelopt_fp4 | v7 | ❓ | ❓ |
Note: This table only tests checkpoint loading compatibility.
