RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

by dhlee - opened 14 days ago

dhlee

14 days ago

•

edited 14 days ago

MTP drafter weight shape mismatch when using speculative decoding with `--load-format safetensors`

Environment

Hardware: 2× DGX Spark (clustered, TP=2)
Docker image: eugr/spark-vllm-docker (main branch, as of 2026-06-05)
Model: stepfun-ai/Step-3.7-Flash-NVFP4
vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (cu132)
Launch flags: --load-format safetensors --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Error

The worker crashes during MTP drafter weight loading with the following error:

RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

Traceback summary:

  File ".../vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
    weight_loader(param, loaded_weight)
  File ".../vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
    param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

The main model (14 safetensors shards) loads successfully. The crash happens only when the MTP drafter subsequently attempts to load its own weights.

Observation

vLLM's step3p5_mtp.py appears to allocate the drafter's vocab embedding with size 2048 along dimension 1, but the corresponding weight in this NVFP4 checkpoint has size 4096 along that dimension. This suggests that the drafter architecture vLLM assumes may not match the weight layout of this specific quantized checkpoint.
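
One way to check which tensors in the checkpoint are actually 4096 wide is to read the shapes straight from the safetensors header, without loading any weights. The sketch below relies only on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper name `read_safetensors_shapes` is made up for illustration, and the real tensor names in this checkpoint have not been verified here.

```python
import json
import struct


def read_safetensors_shapes(path):
    """Return {tensor_name: (dtype, shape)} read from a .safetensors header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned
        # little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The header itself is a JSON object keyed by tensor name.
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

For example, calling `read_safetensors_shapes("model-mtp-bf16.safetensors")` and printing the entries whose name contains `embed` (an assumed substring) would show whether the stored embedding width is 2048 or 4096.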

huangyu-nv

StepFun org 11 days ago

A few things to check. The mismatch most likely comes from the runtime loader, the vLLM build, a stale checkpoint revision, or the --load-format safetensors override, not from the checkpoint itself:

1. Refresh the local checkpoint to the latest HF revision, especially config.json, model.safetensors.index.json, and model-mtp-bf16.safetensors.
2. Verify the text_config per-layer lists are length 48:

python3 -c "import json; c=json.load(open('config.json'))['text_config']; print({k:len(c[k]) for k in ['layer_types','partial_rotary_factors','swiglu_limits','swiglu_limits_shared','rope_theta']})"

3. Try without forcing --load-format safetensors, so vLLM uses its default loader path.
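
The config check suggested above can also be written as a small script that reports only the lists whose length is not 48, which is easier to read than the one-liner. This is a sketch: the key names and the expected length of 48 come from the suggestion above, while the function names `check_per_layer_lengths` and `check_config_file` are made up for illustration.

```python
import json

# Per-layer keys named in the suggestion above.
PER_LAYER_KEYS = [
    "layer_types",
    "partial_rotary_factors",
    "swiglu_limits",
    "swiglu_limits_shared",
    "rope_theta",
]


def check_per_layer_lengths(text_config, expected=48):
    """Return {key: length} for keys whose list length differs from `expected`.

    A key that is missing or is not a list is reported with length None.
    """
    problems = {}
    for key in PER_LAYER_KEYS:
        value = text_config.get(key)
        length = len(value) if isinstance(value, list) else None
        if length != expected:
            problems[key] = length
    return problems


def check_config_file(path="config.json", expected=48):
    """Load config.json and check its text_config section."""
    with open(path) as f:
        return check_per_layer_lengths(json.load(f)["text_config"], expected)
```

An empty result from `check_config_file()` means every listed key is a list of length 48. Any other result names the keys to look at after refreshing the checkpoint.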

dhlee

11 days ago

•

edited 11 days ago

> A few things to check. The mismatch most likely comes from the runtime loader, the vLLM build, a stale checkpoint revision, or the --load-format safetensors override, not from the checkpoint itself:
>
> Refresh the local checkpoint to the latest HF revision, especially config.json, model.safetensors.index.json, and model-mtp-bf16.safetensors.
> Verify the text_config per-layer lists are length 48:
>
> python3 -c "import json; c=json.load(open('config.json'))['text_config']; print({k:len(c[k]) for k in ['layer_types','partial_rotary_factors','swiglu_limits','swiglu_limits_shared','rope_theta']})"
>
> Try without forcing --load-format safetensors, so vLLM uses its default loader path.

The result is the same.
I am not the only one seeing this. Many others in the DGX Spark community report the same issue:

https://forums.developer.nvidia.com/t/step-3-7-flash-is-supported-in-community-docker-on-dgx-spark/371652/49

(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [gpu_model_runner.py:5116] Loading drafter model...
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:36 [modelopt.py:1022] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [__init__.py:962] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:37 [vllm.py:2203] `torch.compile` is turned on, but the model 0xSero/Step-3.7-Flash-173B does not support it. Please open an issue on GitHub if you want it to be supported.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 107.20 GiB. Available RAM: 49.01 GiB.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:952] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (107.20 GiB) exceeds 90% of available RAM (49.01 GiB).
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:04<00:52, 4.06s/it]
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] WorkerProc failed to start.
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] Traceback (most recent call last):
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 855, in worker_main
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] worker = WorkerProc(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 634, in __init__
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.worker.load_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 349, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5118, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.drafter.load_model(self.model)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1199, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.model = self._get_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1184, in _get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] model = get_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 143, in get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return loader.load_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.load_weights(model, model_config)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 394, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] weight_loader(param, loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

URL: https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4/discussions/4
