HunyuanVideo-1.5

🎬 HunyuanVideo-1.5: A leading lightweight video generation model

HunyuanVideo-1.5 is a video generation model that delivers top-tier quality with only 8.3B parameters, significantly lowering the barrier to usage. It runs smoothly on consumer-grade GPUs, making it accessible for every developer and creator. This repository provides the implementation and tools needed to generate creative videos.

👏 Join our WeChat and Discord | 💻 Official website Try our model!

🔥🔥🔥 News

🚀 Dec 23, 2025: FP8 GEMM inference is now supported! 🔥🔥🔥🆕
🚀 Dec 05, 2025: New Release: We now release the 480p I2V step-distilled model, which generates videos in 8 or 12 steps (recommended)! On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds. The step-distilled model maintains comparable quality to the original model while achieving significant speedup. See Step Distillation Comparison for detailed quality comparisons. For even faster generation, you can also try 4 steps (faster speed with slightly reduced quality). To enable the step-distilled model, run generate.py with the --enable_step_distill parameter. See Usage for detailed usage instructions. 🔥🔥🔥🆕
📚 Dec 05, 2025: Training Code & LoRA Tuning Script Released: We now open-source the training code for HunyuanVideo-1.5! The training script (train.py) provides a full training pipeline with support for distributed training, FSDP, context parallel, gradient checkpointing, and more. HunyuanVideo-1.5 is trained using the Muon optimizer, which we have open-sourced in the Training section. If you would like to continue training our model or fine-tune it with LoRA, please use the Muon optimizer. See Training section for detailed usage instructions. 🔥🔥🔥🆕
🎉 Diffusers Support: HunyuanVideo-1.5 is now available on Hugging Face Diffusers! Check out Diffusers collection for easy integration. 🔥🔥🔥🆕
🚀 Nov 27, 2025: We now support cache inference (deepcache, teacache, taylorcache), achieving significant speedup! Pull the latest code to try it.
🚀 Nov 24, 2025: We now support deepcache inference.
👋 Nov 20, 2025: We release the inference code and model weights of HunyuanVideo-1.5.

🧩 Community Contributions

If you develop with or use HunyuanVideo-1.5 in your projects, please let us know.

Diffusers - HunyuanVideo-1.5 Diffusers: Official Hugging Face Diffusers integration for HunyuanVideo-1.5. Easily use HunyuanVideo-1.5 with the Diffusers library for seamless integration into your projects. See Usage with Diffusers section for details.
ComfyUI - ComfyUI: A powerful and modular diffusion model GUI with a graph/nodes interface. ComfyUI supports HunyuanVideo-1.5 with various engineering optimizations for fast inference. We provide a ComfyUI Usage Guide for HunyuanVideo-1.5.
Community-implemented ComfyUI Plugin - comfyui_hunyuanvideo_1.5_plugin: A community-implemented ComfyUI plugin for HunyuanVideo-1.5, offering both simplified and complete node sets for quick usage or deep workflow customization, with built-in automatic model download support.
LightX2V - LightX2V: A lightweight and efficient video generation framework that integrates HunyuanVideo-1.5, supporting multiple engineering acceleration techniques for fast inference.
Wan2GP v9.62 - Wan2GP: WanGP is a very low-VRAM app (as low as 6 GB of VRAM for HunyuanVideo-1.5) that supports a LoRA accelerator for 8-step generation and offers tools to facilitate video generation.
ComfyUI-MagCache - ComfyUI-MagCache: MagCache is a training-free caching approach that accelerates video generation by estimating fluctuating differences among model outputs across timesteps. It achieves 1.7x speedup for HunyuanVideo-1.5 with 20 inference steps.

📑 Open-source Plan

HunyuanVideo-1.5 (T2V/I2V)
- Inference Code and checkpoints
- ComfyUI Support
- LightX2V Support
- Diffusers Support
- Release all model weights (Sparse attention, distill model, and SR models)

📖 Introduction

We present HunyuanVideo-1.5, a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models. By releasing the code and weights of HunyuanVideo-1.5, we provide the community with a high-performance foundation that significantly lowers the cost of video creation and research, making advanced video generation more accessible to all.

✨ Key Features

Lightweight High-Performance Architecture: We propose an efficient architecture that integrates an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE, achieving compression ratios of 16× in spatial dimensions and 4× along the temporal axis. Additionally, the SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal KV blocks, which significantly reduces computational overhead for long video sequences and accelerates inference, achieving an end-to-end speedup of 1.87× for 10-second 720p video synthesis compared to FlashAttention-3.

👁 HunyuanVideo-1.5 DiT
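
As a rough illustration of the compression ratios quoted above, the following sketch estimates the latent grid that the DiT operates on. The function name `latent_shape` is hypothetical, and the temporal rule `(frames - 1) / 4 + 1` is an assumption based on the usual convention for causal video VAEs (it agrees with the 4n+1 frame-count constraint in the Training section); the actual VAE code may pad or round differently.

```python
def latent_shape(num_frames: int, height: int, width: int,
                 spatial_ratio: int = 16, temporal_ratio: int = 4) -> tuple[int, int, int]:
    """Estimate (latent_frames, latent_height, latent_width) for a video.

    Assumption: a causal VAE encodes the first frame on its own and turns
    every following group of `temporal_ratio` frames into one latent frame.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_ratio != 0:
        raise ValueError(f"num_frames must have the form {temporal_ratio}*n + 1")
    if height % spatial_ratio or width % spatial_ratio:
        raise ValueError("height and width must be multiples of the spatial ratio")
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return latent_frames, height // spatial_ratio, width // spatial_ratio


# 121 frames (the default --video_length) at 1280x720
print(latent_shape(121, 720, 1280))  # (31, 45, 80)
```

Under these assumptions a 121-frame 720p clip shrinks from 121 × 720 × 1280 pixel positions to 31 × 45 × 80 latent positions, which is what makes attention over the whole clip tractable.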

Video Super-Resolution Enhancement: We develop an efficient few-step super-resolution network that upscales outputs to 1080p. It enhances sharpness while correcting distortions, thereby refining details and overall visual texture.

👁 HunyuanVideo-1.5 VSR

End-to-End Training Optimization: This work employs a multi-stage, progressive training strategy covering the entire pipeline from pre-training to post-training. Combined with the Muon optimizer to accelerate convergence, this approach holistically refines motion coherence, aesthetic quality, and human preference alignment, achieving professional-grade content generation.

📜 System Requirements

Hardware Requirements

GPU: NVIDIA GPU with CUDA support
Minimum GPU Memory: 14 GB (with model offloading enabled)

Note: The memory requirements above are measured with model offloading enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.

Software Requirements

Operating System: Linux
Python: Python 3.10 or higher
CUDA: Compatible CUDA version for your PyTorch installation
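
The requirements above can be checked before installation with a small script. This is a minimal sketch: `check_requirements` is a hypothetical helper, and the GPU memory figure is supplied by the caller (for example, read from `nvidia-smi`) so that the script does not depend on PyTorch being installed yet.

```python
import platform
import sys

MIN_PYTHON = (3, 10)
MIN_GPU_MEMORY_GB = 14  # documented minimum, with model offloading enabled


def check_requirements(python_version=None, system=None, gpu_memory_gb=None):
    """Return a list of human-readable problems; an empty list means OK.

    `python_version` and `system` default to the current interpreter and OS.
    `gpu_memory_gb` is optional and skipped when it is None.
    """
    python_version = tuple(python_version or sys.version_info[:2])
    system = system or platform.system()
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )
    if system != "Linux":
        problems.append(f"unsupported operating system: {system}")
    if gpu_memory_gb is not None and gpu_memory_gb < MIN_GPU_MEMORY_GB:
        problems.append(
            f"{gpu_memory_gb} GB of GPU memory is below the {MIN_GPU_MEMORY_GB} GB minimum"
        )
    return problems


print(check_requirements((3, 11), "Linux", 24))  # []
print(check_requirements((3, 9), "Darwin", 8))
```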

🛠️ Dependencies and Installation

Step 1: Clone the Repository

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5

Step 2: Install Basic Dependencies

pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

Step 3: Install Attention Libraries

Flash Attention: Install Flash Attention for faster inference and reduced GPU memory consumption. Detailed installation instructions are available at Flash Attention.

Flex-Block-Attention: flex-block-attn is only required for sparse attention, which speeds up inference. Install it with the following commands:

git clone https://github.com/Tencent-Hunyuan/flex-block-attn.git
cd flex-block-attn
git submodule update --init --recursive
python3 setup.py install

SageAttention: To enable SageAttention for faster inference, install it with the following commands:

Note: Enabling SageAttention will automatically disable Flex-Block-Attention.

git clone https://github.com/cooper1637/SageAttention.git
cd SageAttention 
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional
python3 setup.py install

SGL-Kernel: To enable FP8 GEMM for the transformer, install sgl-kernel with the following command:
```
pip install sgl-kernel==0.3.18
```

🧱 Download Pretrained Models

💡 Some distillation models and sparse attention models are still coming soon (see the Model Cards table). Please stay tuned for the latest updates on the Hugging Face Model Card.

Download the pretrained models before generating videos. Detailed instructions are available at checkpoints-download.md.

Model Cards

ModelName	Download
HunyuanVideo-1.5-480P-T2V	480P-T2V
HunyuanVideo-1.5-480P-I2V	480P-I2V
HunyuanVideo-1.5-480P-T2V-cfg-distill	480P-T2V-cfg-distill
HunyuanVideo-1.5-480P-I2V-cfg-distill	480P-I2V-cfg-distill
HunyuanVideo-1.5-480P-I2V-step-distill	480P-I2V-step-distill
HunyuanVideo-1.5-720P-T2V	720P-T2V
HunyuanVideo-1.5-720P-I2V	720P-I2V
HunyuanVideo-1.5-720P-T2V-cfg-distill	Coming soon
HunyuanVideo-1.5-720P-I2V-cfg-distill	720P-I2V-cfg-distill
HunyuanVideo-1.5-720P-T2V-sparse-cfg-distill	Coming soon
HunyuanVideo-1.5-720P-I2V-sparse-cfg-distill	720P-I2V-sparse-cfg-distill
HunyuanVideo-1.5-720P-sr-step-distill	720P-sr
HunyuanVideo-1.5-1080P-sr-step-distill	1080P-sr
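
The naming pattern in the table can be captured in a small lookup, which helps scripts fail early when they ask for a model that is not yet released. This is only a sketch: `MODEL_CARDS` and `model_name` are hypothetical names, the availability flags are copied from the Download column above, and the two super-resolution models are left out for brevity.

```python
# True = a download link is listed in the table, False = "Coming soon".
MODEL_CARDS = {
    "HunyuanVideo-1.5-480P-T2V": True,
    "HunyuanVideo-1.5-480P-I2V": True,
    "HunyuanVideo-1.5-480P-T2V-cfg-distill": True,
    "HunyuanVideo-1.5-480P-I2V-cfg-distill": True,
    "HunyuanVideo-1.5-480P-I2V-step-distill": True,
    "HunyuanVideo-1.5-720P-T2V": True,
    "HunyuanVideo-1.5-720P-I2V": True,
    "HunyuanVideo-1.5-720P-T2V-cfg-distill": False,
    "HunyuanVideo-1.5-720P-I2V-cfg-distill": True,
    "HunyuanVideo-1.5-720P-T2V-sparse-cfg-distill": False,
    "HunyuanVideo-1.5-720P-I2V-sparse-cfg-distill": True,
}


def model_name(resolution: str, task: str, variant: str = "") -> str:
    """Build a name such as 'HunyuanVideo-1.5-720P-I2V-cfg-distill'.

    Raises ValueError for names that are not in the table and
    RuntimeError for models that are listed as coming soon.
    """
    name = f"HunyuanVideo-1.5-{resolution.upper()}-{task.upper()}"
    if variant:
        name += f"-{variant}"
    if name not in MODEL_CARDS:
        raise ValueError(f"unknown model: {name}")
    if not MODEL_CARDS[name]:
        raise RuntimeError(f"{name} is listed as coming soon")
    return name


print(model_name("480p", "i2v", "step-distill"))
```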

📝 Prompt Guide

Prompt Writing Handbook

Prompt enhancement plays a crucial role in enabling our model to generate high-quality videos. Longer and more detailed prompts significantly improve the generated video, so we encourage you to craft comprehensive and descriptive prompts. We recommend that community partners consult our official guide on how to write effective prompts.

Reference: HunyuanVideo-1.5 Prompt Handbook

System Prompts for Automatic Prompt Enhancement

For users seeking to optimize prompts for other large models, it is recommended to consult the definition of t2v_rewrite_system_prompt in the file hyvideo/utils/rewrite/t2v_prompt.py to guide text-to-video rewriting. Similarly, for image-to-video rewriting, refer to the definition of i2v_rewrite_system_prompt in hyvideo/utils/rewrite/i2v_prompt.py.

🔑 Inference

Inference with Source Code

For prompt rewriting, we recommend using Gemini or models deployed via vLLM. This codebase currently only supports models compatible with the vLLM API. If you wish to use Gemini, you will need to implement your own interface calls.

For models with a vLLM API, note that T2V (text-to-video) and I2V (image-to-video) have different recommended models and environment variables:

T2V: use Qwen3-235B-A22B-Thinking-2507, configure T2V_REWRITE_BASE_URL and T2V_REWRITE_MODEL_NAME
I2V: use Qwen3-VL-235B-A22B-Instruct, configure I2V_REWRITE_BASE_URL and I2V_REWRITE_MODEL_NAME

You may set the above model names to any other vLLM-compatible models you have deployed (including HuggingFace models).
Rewriting is enabled by default (--rewrite defaults to true); to disable it explicitly, use --rewrite false or --rewrite 0. If no vLLM endpoint is configured, the pipeline runs without remote rewriting.
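
To make the environment-variable convention concrete, the sketch below shows one way a script could decide whether a rewrite endpoint is configured for a given task. The helper `rewrite_endpoint` and the example URL are hypothetical; only the four variable names come from this README, and the repository's own lookup logic may differ.

```python
import os


def rewrite_endpoint(task: str, env=None):
    """Return (base_url, model_name) for 't2v' or 'i2v', or None if unset.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    prefix = task.upper()
    if prefix not in ("T2V", "I2V"):
        raise ValueError("task must be 't2v' or 'i2v'")
    base_url = env.get(f"{prefix}_REWRITE_BASE_URL")
    model = env.get(f"{prefix}_REWRITE_MODEL_NAME")
    if not base_url or not model:
        return None  # no endpoint configured: run without remote rewriting
    return base_url, model


example_env = {
    "T2V_REWRITE_BASE_URL": "http://localhost:8000/v1",  # hypothetical vLLM server
    "T2V_REWRITE_MODEL_NAME": "Qwen3-235B-A22B-Thinking-2507",
}
print(rewrite_endpoint("t2v", example_env))
print(rewrite_endpoint("i2v", example_env))  # None: the I2V variables are not set
```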

Example: Generate a video (works for both T2V and I2V; set IMAGE_PATH=none for T2V or provide an image path for I2V)

💡 Tip: For faster inference, you can enable the step-distilled model with the --enable_step_distill parameter. The step-distilled model (480p I2V) generates videos in 8 or 12 steps (recommended), reducing end-to-end generation time by up to 75% on RTX 4090 while maintaining comparable quality.

Tips: If your GPU memory is > 14GB but you encounter OOM (Out of Memory) errors during generation, you can try setting the following environment variable before running:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
Tips: If you have limited CPU memory and encounter OOM during inference, you can try disabling overlapped group offloading by adding the following argument:
--overlap_group_offloading false

export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export T2V_REWRITE_MODEL_NAME="<your_model_name>"
export I2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export I2V_REWRITE_MODEL_NAME="<your_model_name>"

PROMPT='A girl holding a paper with words "Hello, world!"'

IMAGE_PATH=/path/to/image.png # Optional, none or <image path> to enable i2v mode
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts # Path to pretrained model

# Configuration for faster inference
N_INFERENCE_GPU=8 # Parallel inference GPU count
CFG_DISTILLED=true # Inference with CFG distilled model, 2x speedup
SAGE_ATTN=true # Inference with SageAttention
SPARSE_ATTN=false # Inference with sparse attention (only 720p models are equipped with sparse attention). Please ensure flex-block-attn is installed
OVERLAP_GROUP_OFFLOADING=true # Only valid when group offloading is enabled, significantly increases CPU memory usage but speeds up inference
ENABLE_CACHE=true # Enable feature cache during inference. Significantly speeds up inference.
CACHE_TYPE=deepcache # Support: deepcache, teacache, taylorcache
ENABLE_STEP_DISTILL=true # Enable step distilled model for 480p I2V, recommended 8 or 12 steps, up to 6x speedup


# Configuration for better quality
REWRITE=true # Enable prompt rewriting. Please ensure rewrite vLLM server is deployed and configured.
ENABLE_SR=true # Enable super resolution


torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
 --prompt "$PROMPT" \
 --image_path $IMAGE_PATH \
 --resolution $RESOLUTION \
 --aspect_ratio $ASPECT_RATIO \
 --seed $SEED \
 --rewrite $REWRITE \
 --cfg_distilled $CFG_DISTILLED \
 --enable_step_distill $ENABLE_STEP_DISTILL \
 --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
 --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
 --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
 --sr $ENABLE_SR --save_pre_sr_video \
 --output_path $OUTPUT_PATH \
 --model_path $MODEL_PATH
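
When generation is launched from Python rather than a shell, the same command can be assembled as an argument list, which avoids shell-quoting problems with prompts that contain quotes. The helper `build_generate_command` is a hypothetical sketch; it only uses flags documented in this README and renders booleans as `true`/`false`, as the shell example does.

```python
import shlex


def build_generate_command(prompt, model_path, resolution="480p", n_gpus=1,
                           image_path="none", **flags):
    """Assemble a torchrun argv list for generate.py.

    Extra keyword arguments become `--name value` pairs; booleans are
    rendered as the strings 'true' / 'false'.
    """
    cmd = [
        "torchrun", f"--nproc_per_node={n_gpus}", "generate.py",
        "--prompt", prompt,
        "--image_path", image_path,  # "none" selects text-to-video mode
        "--resolution", resolution,
        "--model_path", model_path,
    ]
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_generate_command(
    'A girl holding a paper with words "Hello, world!"',
    model_path="./ckpts",
    n_gpus=8,
    cfg_distilled=True,
    enable_cache=True,
    cache_type="deepcache",
    sr=False,
)
print(shlex.join(cmd))
# To launch generation, pass the list to subprocess.run(cmd, check=True).
```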

Command Line Arguments

Argument	Type	Required	Default	Description
`--prompt`	str	Yes	-	Text prompt for video generation
`--negative_prompt`	str	No	`''`	Negative prompt for video generation
`--resolution`	str	Yes	-	Video resolution: `480p` or `720p`
`--model_path`	str	Yes	-	Path to pretrained model directory
`--aspect_ratio`	str	No	`16:9`	Aspect ratio of the output video
`--num_inference_steps`	int	No	`50`	Number of inference steps
`--video_length`	int	No	`121`	Number of frames to generate
`--seed`	int	No	`123`	Random seed for reproducibility
`--image_path`	str	No	`None`	Path to reference image (enables i2v mode). Use `none` or `None` to explicitly use text-to-video mode
`--output_path`	str	No	`None`	Output file path (if not provided, saves to `./outputs/output_{transformer_version}_{timestamp}.mp4`)
`--sr`	bool	No	`true`	Enable super resolution (use `--sr false` or `--sr 0` to disable)
`--save_pre_sr_video`	bool	No	`false`	Save original video before super resolution (use `--save_pre_sr_video` or `--save_pre_sr_video true` to enable, only effective when super resolution is enabled)
`--rewrite`	bool	No	`true`	Enable prompt rewriting (use `--rewrite false` or `--rewrite 0` to disable, may result in lower quality video generation)
`--cfg_distilled`	bool	No	`false`	Enable CFG distilled model for faster inference (~2x speedup, use `--cfg_distilled` or `--cfg_distilled true` to enable)
`--enable_step_distill`	bool	No	`false`	Enable step distilled model for 480p I2V (recommended 8 or 12 steps, ~75% speedup on RTX 4090, use `--enable_step_distill` or `--enable_step_distill true` to enable)
`--sparse_attn`	bool	No	`false`	Enable sparse attention for faster inference (~1.5-2x speedup, requires H-series GPUs, auto-enables CFG distilled, use `--sparse_attn` or `--sparse_attn true` to enable)
`--offloading`	bool	No	`true`	Enable CPU offloading (use `--offloading false` or `--offloading 0` to disable for faster inference if GPU memory allows)
`--group_offloading`	bool	No	`None`	Enable group offloading (default: None, automatically enabled if offloading is enabled. Use `--group_offloading` or `--group_offloading true/1` to enable, `--group_offloading false/0` to disable)
`--overlap_group_offloading`	bool	No	`true`	Enable overlap group offloading (default: true). Significantly increases CPU memory usage but speeds up inference. Use `--overlap_group_offloading` or `--overlap_group_offloading true/1` to enable, `--overlap_group_offloading false/0` to disable
`--dtype`	str	No	`bf16`	Data type for transformer: `bf16` (faster, lower memory) or `fp32` (better quality, slower, higher memory)
`--use_sageattn`	bool	No	`false`	Enable SageAttention (use `--use_sageattn` or `--use_sageattn true/1` to enable, `--use_sageattn false/0` to disable)
`--sage_blocks_range`	str	No	`0-53`	SageAttention blocks range (e.g., `0-5` or `0,1,2,3,4,5`)
`--enable_cache`	bool	No	`false`	Enable cache for transformer (use `--enable_cache` or `--enable_cache true/1` to enable, `--enable_cache false/0` to disable)
`--cache_type`	str	No	`deepcache`	Cache type for transformer (e.g., `deepcache, teacache, taylorcache`)
`--no_cache_block_id`	str	No	`53`	Blocks to exclude from deepcache (e.g., `0-5` or `0,1,2,3,4,5`)
`--cache_start_step`	int	No	`11`	Start step to skip when using cache
`--cache_end_step`	int	No	`45`	End step to skip when using cache
`--total_steps`	int	No	`50`	Total inference steps
`--cache_step_interval`	int	No	`4`	Step interval to skip when using cache

Note: Use --nproc_per_node to specify the number of GPUs. For example, --nproc_per_node=8 uses 8 GPUs.
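
The block-range strings accepted by `--sage_blocks_range` and `--no_cache_block_id` come in two documented forms, `0-5` and `0,1,2,3,4,5`. The sketch below shows how such a string could be expanded into block indices. `parse_block_ids` is a hypothetical helper, not the parser used by generate.py, and accepting a mix such as `0-2,7` is an assumption.

```python
def parse_block_ids(spec: str) -> list[int]:
    """Expand '0-5', '0,1,2' or a mix such as '0-2,7' into sorted block ids."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part}")
            ids.update(range(start, end + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_block_ids("0-5"))           # [0, 1, 2, 3, 4, 5]
print(len(parse_block_ids("0-53")))     # 54
```

The default `0-53` expands to 54 indices, which suggests that the range covers every transformer block.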

Optimal Inference Configurations

The following table provides the optimal inference configurations (CFG scale, embedded CFG scale, flow shift, and inference steps) for each model to achieve the best generation quality:

Model	CFG Scale	Embedded CFG Scale	Flow Shift	Inference Steps
480p T2V	6	None	5	50
480p I2V	6	None	5	50
720p T2V	6	None	9	50
720p I2V	6	None	7	50
480p T2V CFG Distilled	1	None	5	50
480p I2V CFG Distilled	1	None	5	50
480p I2V Step Distilled	1	None	7	8 or 12 (recommended)
720p T2V CFG Distilled	1	None	9	50
720p I2V CFG Distilled	1	None	7	50
720p T2V CFG Distilled Sparse	1	None	9	50
720p I2V CFG Distilled Sparse	1	None	7	50
480→720 SR Step Distilled	1	None	2	6
720→1080 SR Step Distilled	1	None	2	8

Please note that the CFG-distilled models we provide must use 50 steps to generate correct results.
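
For scripted use, the table above can be encoded as a lookup so that a run can be checked against the recommended settings before it starts. This is a sketch with hypothetical names (`InferenceConfig`, `CONFIGS`, `uses_recommended_steps`); the values are copied from the table, and the Embedded CFG Scale column is omitted because it is None for every model.

```python
from typing import NamedTuple


class InferenceConfig(NamedTuple):
    cfg_scale: float
    flow_shift: float
    recommended_steps: tuple[int, ...]


CONFIGS = {
    "480p_t2v": InferenceConfig(6, 5, (50,)),
    "480p_i2v": InferenceConfig(6, 5, (50,)),
    "720p_t2v": InferenceConfig(6, 9, (50,)),
    "720p_i2v": InferenceConfig(6, 7, (50,)),
    "480p_t2v_cfg_distilled": InferenceConfig(1, 5, (50,)),
    "480p_i2v_cfg_distilled": InferenceConfig(1, 5, (50,)),
    "480p_i2v_step_distilled": InferenceConfig(1, 7, (8, 12)),
    "720p_t2v_cfg_distilled": InferenceConfig(1, 9, (50,)),
    "720p_i2v_cfg_distilled": InferenceConfig(1, 7, (50,)),
    "720p_t2v_cfg_distilled_sparse": InferenceConfig(1, 9, (50,)),
    "720p_i2v_cfg_distilled_sparse": InferenceConfig(1, 7, (50,)),
    "480_to_720_sr_step_distilled": InferenceConfig(1, 2, (6,)),
    "720_to_1080_sr_step_distilled": InferenceConfig(1, 2, (8,)),
}


def uses_recommended_steps(model: str, steps: int) -> bool:
    """True if `steps` matches a recommended step count for `model`."""
    return steps in CONFIGS[model].recommended_steps


print(CONFIGS["720p_t2v"])
print(uses_recommended_steps("480p_t2v_cfg_distilled", 30))  # False
```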

Usage with Diffusers

HunyuanVideo-1.5 is available on Hugging Face Diffusers! You can easily use it with the Diffusers library:

Basic Usage:

import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda:0"
prompt = 'A girl holding a paper with words "Hello, world!"'  # example prompt
seed = 1  # any integer; fixes the random generator for reproducibility

pipe = HunyuanVideo15Pipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # keep idle sub-models on the CPU to save GPU memory
pipe.vae.enable_tiling()  # decode in tiles to limit peak VAE memory

generator = torch.Generator(device=device).manual_seed(seed)

video = pipe(
    prompt=prompt,
    generator=generator,
    num_frames=121,
    num_inference_steps=50,
).frames[0]

export_to_video(video, "output.mp4", fps=24)

Optimized Usage with Attention Backend:

HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

We recommend installing kernels (pip install kernels) to access prebuilt attention kernels.

import torch
from diffusers import HunyuanVideo15Pipeline, attention_backend
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda:0"
prompt = 'A girl holding a paper with words "Hello, world!"'  # example prompt
seed = 1  # any integer; fixes the random generator for reproducibility

pipe = HunyuanVideo15Pipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # keep idle sub-models on the CPU to save GPU memory
pipe.vae.enable_tiling()  # decode in tiles to limit peak VAE memory

generator = torch.Generator(device=device).manual_seed(seed)

# Run the pipeline with a prebuilt attention kernel.
with attention_backend("_flash_3_hub"):  # or `"flash_hub"` if you are not on H100/H800
    video = pipe(
        prompt=prompt,
        generator=generator,
        num_frames=121,
        num_inference_steps=50,
    ).frames[0]

export_to_video(video, "output.mp4", fps=24)

For more details, please visit HunyuanVideo-1.5 Diffusers Collection.

🎓 Training

HunyuanVideo-1.5 is trained using the Muon optimizer, which accelerates convergence and improves training stability. The Muon optimizer combines momentum-based updates with Newton-Schulz orthogonalization for efficient optimization of large-scale video generation models.
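
The orthogonalization step mentioned above can be illustrated with a small, dependency-free sketch. It uses the classic cubic Newton-Schulz iteration `X <- 1.5*X - 0.5*X*X^T*X` on plain Python lists. Muon implementations typically use a tuned higher-order polynomial and run on GPU tensors, so the function names, the iteration count, and the `lr` and `beta` values here are illustrative assumptions, not the optimizer shipped in this repository.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to `m`.

    The matrix is first divided by its Frobenius norm so that all singular
    values are at most 1, which keeps the cubic iteration convergent.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    if fro == 0:
        return [list(row) for row in m]
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified update: accumulate momentum, orthogonalize, apply."""
    buf = [[beta * b + g for b, g in zip(br, gr)] for br, gr in zip(momentum_buf, grad)]
    update = newton_schulz_orthogonalize(buf)
    new_weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return new_weight, buf


x = newton_schulz_orthogonalize([[2.0, 0.0], [1.0, 1.0]])
gram = matmul(x, transpose(x))  # close to the 2x2 identity matrix
print([[round(v, 6) for v in row] for row in gram])
```

The point of the orthogonalized update is that every direction of the momentum matrix contributes with roughly equal magnitude, regardless of how large or small its singular value was.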

Quick Start

The training script (train.py) provides a complete training pipeline for HunyuanVideo-1.5. Here's how to use it:

1. Implement Your DataLoader

Replace the create_dummy_dataloader() function in train.py with your own implementation. Your dataset's __getitem__ method should return a single sample.

Required fields:
- "pixel_values": torch.Tensor - Video: [C, F, H, W] or Image: [C, H, W]
  - Pixel values must be in range [-1, 1]
  - Note: For video data, temporal dimension F must be 4n+1 (e.g., 1, 5, 9, 13, 17, ...)
- "text": str - Text prompt for this sample
- "data_type": str - "video" or "image"
Optional fields (for performance optimization):
- "latents": Pre-encoded VAE latents (skips VAE encoding for faster training)
- "byt5_text_ids" and "byt5_text_mask": Pre-tokenized byT5 inputs

See the create_dummy_dataloader() function in train.py for detailed format documentation.
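
Two of the constraints above, the 4n+1 frame count and the [-1, 1] pixel range, are easy to get wrong when preparing data. The helpers below are a dependency-free sketch with hypothetical names; a real dataset would apply the same arithmetic to tensors, and the 8-bit input range is an assumption about the source video.

```python
REQUIRED_KEYS = {"pixel_values", "text", "data_type"}


def snap_frame_count(num_frames: int) -> int:
    """Largest frame count of the form 4n+1 that does not exceed num_frames."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return num_frames - (num_frames - 1) % 4


def to_signed_unit(value: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


def missing_fields(sample: dict) -> set:
    """Return the required keys that a sample dict does not provide."""
    return REQUIRED_KEYS - set(sample)


print(snap_frame_count(120))                       # 117
print(to_signed_unit(0), to_signed_unit(255))      # -1.0 1.0
print(missing_fields({"text": "a cat", "data_type": "video"}))  # {'pixel_values'}
```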

2. Run Training

Single GPU:

python train.py --pretrained_model_root <path_to_pretrained_model> [other args]

Multi-GPU:

N=8
torchrun --nproc_per_node=$N train.py --pretrained_model_root <path_to_pretrained_model> [other args]

Example:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --learning_rate 1e-5 \
 --batch_size 1 \
 --max_steps 10000 \
 --output_dir ./outputs \
 --enable_fsdp \
 --enable_gradient_checkpointing \
 --sp_size 8

3. Key Training Parameters

Parameter	Description	Default
`--pretrained_model_root`	Path to pretrained model (required)	-
`--learning_rate`	Learning rate	1e-5
`--batch_size`	Batch size	1
`--max_steps`	Maximum training steps	10000
`--warmup_steps`	Warmup steps	500
`--gradient_accumulation_steps`	Gradient accumulation steps	1
`--enable_fsdp`	Enable FSDP for distributed training	true
`--enable_gradient_checkpointing`	Enable gradient checkpointing	true
`--sp_size`	Sequence parallelism size (must divide world_size)	8
`--i2v_prob`	Probability of i2v task for video data	0.3
`--use_muon`	Use Muon optimizer	true
`--resume_from_checkpoint`	Resume from checkpoint directory	None
`--use_lora`	Enable LoRA fine-tuning	false
`--lora_r`	LoRA rank	8
`--lora_alpha`	LoRA alpha scaling parameter	16
`--lora_dropout`	LoRA dropout rate	0.0
`--pretrained_lora_path`	Path to pretrained LoRA adapter	None

4. Monitor Training

Checkpoints are saved to output_dir at intervals specified by --save_interval
Validation videos are generated at intervals specified by --validation_interval
Training logs are printed to console at intervals specified by --log_interval

5. Resume Training

Use --resume_from_checkpoint <checkpoint_dir> to resume from a saved checkpoint:

python train.py \
 --pretrained_model_root <path> \
 --resume_from_checkpoint ./outputs/checkpoint-1000

6. LoRA Fine-tuning

To enable LoRA fine-tuning, add --use_lora to your training command. LoRA adapters will be saved in the checkpoint directory under lora/:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --use_lora \
 --lora_r 8 \
 --lora_alpha 16 \
 --learning_rate 1e-4 \
 --output_dir ./outputs

To load a pretrained LoRA adapter, use --pretrained_lora_path:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --use_lora \
 --pretrained_lora_path ./outputs/checkpoint-1000/lora/default
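
To give a feel for what `--lora_r` and `--lora_alpha` control, the sketch below computes the scaling factor and the number of trainable parameters for one adapted weight matrix. It assumes the standard LoRA formulation, in which the update is `(alpha / r) * B @ A`; whether train.py uses exactly this scaling is not stated here, and the 4096 × 4096 layer size is a hypothetical example.

```python
def lora_scale(r: int, alpha: float) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


scale = lora_scale(8, 16)                      # the defaults --lora_r 8, --lora_alpha 16
adapter = lora_param_count(4096, 4096, 8)      # hypothetical 4096 x 4096 projection
full = 4096 * 4096
print(scale)            # 2.0
print(adapter)          # 65536
print(adapter / full)   # 0.00390625
```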

📊 Evaluation

Rating

We assess text-to-video generation using a comprehensive rating methodology that considers five key dimensions: text-video consistency, visual quality, structural stability, motion effects, and the aesthetic quality of individual frames. For image-to-video generation, the evaluation encompasses image-video consistency, instruction responsiveness, visual quality, structural stability, and motion effects.

👁 rating result of t2v

👁 rating result of i2v

GSB

The GSB (Good/Same/Bad) approach is widely used to evaluate the relative performance of two models based on overall video perception quality. We carefully construct 300 diverse text prompts and 300 image samples to cover balanced application scenarios for both text-to-video and image-to-video tasks. For each prompt or image input, an equal number of video samples are generated by each model in a single run to ensure comparability. To maintain fairness, inference is performed only once per input without any cherry-picking of results. All competing models are evaluated using their default configurations. The evaluation is conducted by over 100 professional assessors.

👁 gsb result of t2v

👁 gsb result of i2v

Inference speed

We report inference speed with basic engineering-level acceleration techniques enabled on 8 H800 GPUs to demonstrate practical performance achievable in real-world deployment scenarios. Please note that in this experiment we do not pursue the most extreme acceleration at the cost of generation quality; instead, we aim for notable speed improvements while maintaining nearly identical output quality.

We report the total inference time for 50 diffusion steps for HunyuanVideo 1.5 below:

👁 Image

🎬 More Examples

Features	Demo1	Demo2
Strong Instruction Following
Smooth Motion Generation
Cinematic Aesthetics
Text Rendering
Physics Compliance
Camera Movement
Multi-Style Support
High Image-Video Consistency	👁 Image	👁 Image

📚 Citation

@misc{hunyuanvideo2025,
 title={HunyuanVideo 1.5 Technical Report}, 
 author={Tencent Hunyuan Foundation Model Team},
 year={2025},
 eprint={2511.18870},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2511.18870}, 
}

🙏 Acknowledgements

We would like to thank the contributors to Transformers, Diffusers, HuggingFace, and Qwen-VL for their open research and exploration.

Paper for tencent/HunyuanVideo-1.5

Paper • 2511.18870 • Published Nov 24, 2025 • 29

Evaluation results

meituan-longcat/WBench leaderboard
Wbench Navi View evaluation results source
78.2
Wbench Full View evaluation results source
74.3

URL: https://huggingface.co/tencent/HunyuanVideo-1.5
