VOOZH about

URL: https://huggingface.co/datasets/uv-scripts/transformers-inference

⇱ uv-scripts/transformers-inference · Datasets at Hugging Face


The Dataset Viewer has been disabled on this dataset.

Transformers Continuous Batching Scripts

GPU inference scripts using transformers' native continuous batching (CB). No vLLM dependency required.

Why transformers CB?

  • Instant new model support - works with any model supported by transformers, including newly released architectures. No waiting for vLLM to add support.
  • No dependency headaches - no vLLM, flashinfer, or custom wheel indexes. Just transformers + accelerate.
  • Simple HF Jobs setup - no Docker image needed. Just hf jobs uv run.
  • ~95% of vLLM throughput - uses PagedAttention and continuous scheduling for near-vLLM performance.

Available Scripts

generate-responses.py

Generate responses for prompts in a dataset. Supports chat messages and plain text prompts.

Quick Start

# Local (requires GPU)
uv run generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --prompt-column question

# HF Jobs (single GPU)
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --prompt-column question \
 --max-tokens 1024

# HF Jobs (multi-GPU for larger models)
hf jobs uv run --flavor l4x4 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 username/input-dataset \
 username/output-dataset \
 --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \
 --messages-column messages \
 --max-batch-tokens 2048 \
 --max-tokens 4096

Example with SmolTalk2

# Generate responses for SmolTalk2 chat data
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 https://huggingface.co/datasets/uv-scripts/transformers-inference/raw/main/generate-responses.py \
 HuggingFaceTB/smoltalk2 username/smoltalk2-responses \
 --subset SFT \
 --split OpenHermes_2.5_no_think \
 --messages-column messages \
 --max-tokens 256

Parameters

Parameter Default Description
--model-id Qwen/Qwen3-4B-Instruct-2507 Any HF causal LM model
--messages-column messages Column with chat messages
--prompt-column - Column with plain text prompts (alternative to messages)
--output-column response Name for the generated response column
--temperature 0.7 Sampling temperature
--top-p 0.8 Top-p (nucleus) sampling
--top-k 20 Top-k sampling
--max-tokens 4096 Maximum tokens to generate per response
--repetition-penalty 1.0 Repetition penalty
--max-batch-tokens 512 Token budget per scheduling step (see below)
--dtype bfloat16 Model precision (bfloat16, float16, float32)
--attn-implementation `paged sdpa`
--max-samples all Limit to N samples (useful for testing)
--hf-token - HF token (or use HF_TOKEN env var)
--skip-long-prompts True Skip prompts exceeding context length

Tuning --max-batch-tokens

This is the key performance parameter. It controls how many tokens the continuous batching scheduler processes per step:

  • Too low (e.g., 128): GPU underutilized, slow throughput
  • Too high (e.g., 8192): May cause out-of-memory errors
  • Default 512: Conservative, works on most GPUs
  • Recommended for A100/H100: 2048-4096
  • Recommended for L4: 512-1024

If you hit OOM errors, reduce this value or switch to --dtype float16.

Current Limitations

  • Single GPU only - device_map="auto" (pipeline parallelism) doesn't work with CB's PagedAttention cache. Transformers does have tensor parallelism (tp_plan="auto") for supported models, but it requires torchrun and is undocumented with CB. For now, use a model that fits on one GPU (e.g., 8B in bf16 on A10G/L4 with 24GB).
  • Text-only - no vision-language model support yet.

When to use this vs vLLM

Transformers CB vLLM
Best for New/niche models, simple setup, avoiding dependency issues Maximum throughput, production serving
Model support Any transformers model, immediately Popular models, may lag on new architectures
Dependencies transformers + accelerate vllm + flashinfer + custom indexes
Docker image Not needed vllm/vllm-openai recommended
Multi-GPU Single GPU only (for now) Tensor parallelism
Performance ~95% of vLLM for text generation Fastest for supported models
VLM support Not yet Yes

Rule of thumb: Use transformers CB when you want simplicity and broad model support. Use vLLM when you need maximum throughput with well-supported models.

Downloads last month
27