Step-3.7-Flash Guide¶
Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth. Key highlights:
- Multimodal Understanding: Native vision encoder for image understanding, supporting single and multi-image inputs alongside text.
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) with a 3:1 ratio and an aggressive 512-token window, ensuring consistent performance across massive datasets while significantly reducing computational overhead.
- Sparse Mixture-of-Experts: Only 11B active parameters out of 198B total parameters.
- Multi-Layer Multi-Token Prediction (MTP): Equipped with 3-way Multi-Token Prediction (MTP-3) for complex, multi-step reasoning chains with immediate responsiveness.
Installing vLLM¶
uvvenv
source.venv/bin/activate
uvpipinstallvllm--torch-backendauto
Serving with vLLM¶
Official Provided Formats¶
Step-3.7-Flash provides three precision options, You can choose the appropriate model based on your needs.
Deployment¶
For FP8 model¶
vllmservestepfun-ai/Step-3.7-Flash-FP8\
--served-model-namestep3p7-flash\
--tensor-parallel-size8\
--enable-expert-parallel\
--disable-cascade-attn\
--reasoning-parserstep3p5\
--enable-auto-tool-choice\
--tool-call-parserstep3p5\
--speculative-config'{"method": "mtp", "num_speculative_tokens": 3}'\
--trust-remote-code
For BF16 model¶
vllmservestepfun-ai/Step-3.7-Flash\
--served-model-namestep3p7-flash-bf16\
--tensor-parallel-size8\
--enable-expert-parallel\
--disable-cascade-attn\
--reasoning-parserstep3p5\
--enable-auto-tool-choice\
--tool-call-parserstep3p5\
--speculative-config'{"method": "mtp", "num_speculative_tokens": 3}'\
--trust-remote-code
For NVFP4 model¶
Compared to standard precisions, running the FP4 quantized version requires modelopt activation and FP8 KV Cache alignment.
vllmservestepfun-ai/Step-3.7-Flash-NVFP4\
--served-model-namestep3p7\
--tensor-parallel-size4\
--gpu-memory-utilization0.9\
--enable-expert-parallel\
--trust-remote-code\
--quantizationmodelopt\
--kv-cache-dtypefp8\
--reasoning-parserstep3p5\
--enable-auto-tool-choice\
--tool-call-parserstep3p5\
--speculative-config'{"method": "mtp", "num_speculative_tokens": 3}'\
--async-scheduling
