The Dataset Viewer has been disabled on this dataset.

Transformers Continuous Batching Scripts

GPU inference scripts using transformers' native continuous batching (CB). No vLLM dependency required.

Why transformers CB?

Instant new model support - works with any model supported by transformers, including newly released architectures. No waiting for vLLM to add support.
No dependency headaches - no vLLM, flashinfer, or custom wheel indexes. Just transformers + accelerate.
Simple HF Jobs setup - no Docker image needed. Just hf jobs uv run.
~95% of vLLM throughput - uses PagedAttention and continuous scheduling for near-vLLM performance.

Available Scripts

generate-responses.py

Generate responses for prompts in a dataset. Supports chat messages and plain text prompts.

Quick Start

# Local (requires GPU)
uv run generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --prompt-column question

# HF Jobs (single GPU)
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --prompt-column question \
 --max-tokens 1024

# HF Jobs (multi-GPU for larger models)
hf jobs uv run --flavor l4x4 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \
 --messages-column messages \
 --max-batch-tokens 2048 \
 --max-tokens 4096

Example with SmolTalk2

# Generate responses for SmolTalk2 chat data
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 HuggingFaceTB/smoltalk2 username/smoltalk2-responses \
 --subset SFT \
 --split OpenHermes_2.5_no_think \
 --messages-column messages \
 --max-tokens 256

Parameters

Parameter	Default	Description
`--model-id`	`Qwen/Qwen3-4B-Instruct-2507`	Any HF causal LM model
`--messages-column`	`messages`	Column with chat messages
`--prompt-column`	-	Column with plain text prompts (alternative to messages)
`--output-column`	`response`	Name for the generated response column
`--temperature`	`0.7`	Sampling temperature
`--top-p`	`0.8`	Top-p (nucleus) sampling
`--top-k`	`20`	Top-k sampling
`--max-tokens`	`4096`	Maximum tokens to generate per response
`--repetition-penalty`	`1.0`	Repetition penalty
`--max-batch-tokens`	`512`	Token budget per scheduling step (see below)
`--dtype`	`bfloat16`	Model precision (`bfloat16`, `float16`, `float32`)
`--attn-implementation`	`paged	sdpa`
`--max-samples`	all	Limit to N samples (useful for testing)
`--hf-token`	-	HF token (or use `HF_TOKEN` env var)
`--skip-long-prompts`	`True`	Skip prompts exceeding context length

Tuning `--max-batch-tokens`

This is the key performance parameter. It controls how many tokens the continuous batching scheduler processes per step:

Too low (e.g., 128): GPU underutilized, slow throughput
Too high (e.g., 8192): May cause out-of-memory errors
Default 512: Conservative, works on most GPUs
Recommended for A100/H100: 2048-4096
Recommended for L4: 512-1024

If you hit OOM errors, reduce this value or switch to --dtype float16.

Current Limitations

Single GPU only - device_map="auto" (pipeline parallelism) doesn't work with CB's PagedAttention cache. Transformers does have tensor parallelism (tp_plan="auto") for supported models, but it requires torchrun and is undocumented with CB. For now, use a model that fits on one GPU (e.g., 8B in bf16 on A10G/L4 with 24GB).
Text-only - no vision-language model support yet.

When to use this vs vLLM

	Transformers CB	vLLM
Best for	New/niche models, simple setup, avoiding dependency issues	Maximum throughput, production serving
Model support	Any transformers model, immediately	Popular models, may lag on new architectures
Dependencies	`transformers` + `accelerate`	`vllm` + `flashinfer` + custom indexes
Docker image	Not needed	`vllm/vllm-openai` recommended
Multi-GPU	Single GPU only (for now)	Tensor parallelism
Performance	~95% of vLLM for text generation	Fastest for supported models
VLM support	Not yet	Yes

Rule of thumb: Use transformers CB when you want simplicity and broad model support. Use vLLM when you need maximum throughput with well-supported models.

Downloads last month: 27

URL: https://huggingface.co/datasets/uv-scripts/transformers-inference

⇱ uv-scripts/transformers-inference · Datasets at Hugging Face

Transformers Continuous Batching Scripts

Why transformers CB?

Available Scripts

generate-responses.py

Quick Start

Example with SmolTalk2

Parameters

Tuning `--max-batch-tokens`

Current Limitations

When to use this vs vLLM

URL: https://huggingface.co/datasets/uv-scripts/transformers-inference

⇱ uv-scripts/transformers-inference · Datasets at Hugging Face

Transformers Continuous Batching Scripts

Why transformers CB?

Available Scripts

generate-responses.py

Quick Start

Example with SmolTalk2

Parameters

Tuning --max-batch-tokens

Current Limitations

When to use this vs vLLM

Tuning `--max-batch-tokens`