VOOZH about

URL: https://huggingface.co/tencent/HunyuanVideo-1.5

⇱ tencent/HunyuanVideo-1.5 · Hugging Face


中文文档

HunyuanVideo-1.5

👁 HunyuanVideo-1.5 Logo

🎬 HunyuanVideo-1.5: A leading lightweight video generation model

HunyuanVideo-1.5 is a video generation model that delivers top-tier quality with only 8.3B parameters, significantly lowering the barrier to usage. It runs smoothly on consumer-grade GPUs, making it accessible for every developer and creator. This repository provides the implementation and tools needed to generate creative videos.

👏 Join our WeChat and Discord | 💻 Official website Try our model!  

🔥🔥🔥 News

  • 🚀 Dec 23, 2025: Fp8 gemm inference is supported! 🔥🔥🔥🆕
  • 🚀 Dec 05, 2025: New Release: We now release the 480p I2V step-distilled model, which generates videos in 8 or 12 steps (recommended)! On RTX 4090, end-to-end generation time is reduced by 75%, and a single RTX 4090 can generate videos within 75 seconds. The step-distilled model maintains comparable quality to the original model while achieving significant speedup. See Step Distillation Comparison for detailed quality comparisons. For even faster generation, you can also try 4 steps (faster speed with slightly reduced quality). To enable the step-distilled model, run generate.py with the --enable_step_distill parameter. See Usage for detailed usage instructions. 🔥🔥🔥🆕
  • 📚 Dec 05, 2025: Training Code & LoRA Tuning Script Released: We now open-source the training code for HunyuanVideo-1.5! The training script (train.py) provides a full training pipeline with support for distributed training, FSDP, context parallel, gradient checkpointing, and more. HunyuanVideo-1.5 is trained using the Muon optimizer, which we have open-sourced in the Training section. If you would like to continue training our model or fine-tune it with LoRA, please use the Muon optimizer. See Training section for detailed usage instructions. 🔥🔥🔥🆕
  • 🎉 Diffusers Support: HunyuanVideo-1.5 is now available on Hugging Face Diffusers! Check out Diffusers collection for easy integration. 🔥🔥🔥🆕
  • 🚀 Nov 27, 2025: We now support cache inference (deepcache, teacache, taylorcache), achieving significant speedup! Pull the latest code to try it.
  • 🚀 Nov 24, 2025: We now support deepcache inference.
  • 👋 Nov 20, 2025: We release the inference code and model weights of HunyuanVideo-1.5.

🎥 Demo

🧩 Community Contributions

If you develop/use HunyuanVideo-1.5 in your projects, welcome to let us know.

  • Diffusers - HunyuanVideo-1.5 Diffusers: Official Hugging Face Diffusers integration for HunyuanVideo-1.5. Easily use HunyuanVideo-1.5 with the Diffusers library for seamless integration into your projects. See Usage with Diffusers section for details.

  • ComfyUI - ComfyUI: A powerful and modular diffusion model GUI with a graph/nodes interface. ComfyUI supports HunyuanVideo-1.5 with various engineering optimizations for fast inference. We provide a ComfyUI Usage Guide for HunyuanVideo-1.5.

  • Community-implemented ComfyUI Plugin - comfyui_hunyuanvideo_1.5_plugin: A community-implemented ComfyUI plugin for HunyuanVideo-1.5, offering both simplified and complete node sets for quick usage or deep workflow customization, with built-in automatic model download support.

  • LightX2V - LightX2V: A lightweight and efficient video generation framework that integrates HunyuanVideo-1.5, supporting multiple engineering acceleration techniques for fast inference.

  • Wan2GP v9.62 - Wan2GP: WanGP is a very low VRAM app (as low 6 GB of VRAM for Hunyuan Video 1.5) supports Lora Accelerator for a 8 steps generation and offers tools to facilitate Video Generation.

  • ComfyUI-MagCache - ComfyUI-MagCache: MagCache is a training-free caching approach that accelerates video generation by estimating fluctuating differences among model outputs across timesteps. It achieves 1.7x speedup for HunyuanVideo-1.5 with 20 inference steps.

📑 Open-source Plan

  • HunyuanVideo-1.5 (T2V/I2V)
    • Inference Code and checkpoints
    • ComfyUI Support
    • LightX2V Support
    • Diffusers Support
    • Release all model weights (Sparse attention, distill model, and SR models)

📋 Table of Contents

📖 Introduction

We present HunyuanVideo-1.5, a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention(SSTA), enhanced bilingual understanding through glyph-aware text encoding , progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models. By releasing the code and weights of HunyuanVideo-1.5, we provide the community with a high-performance foundation that significantly lowers the cost of video creation and research, making advanced video generation more accessible to all.

✨ Key Features

  • Lightweight High-Performance Architecture: We propose an efficient architecture that integrates an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE, achieving compression ratios of 16× in spatial dimensions and 4× along the temporal axis. Additionally, the innovative SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal kv blocks, significantly reduces computational overhead for long video sequences and accelerates inference, achieving an end-to-end speedup of $1.87 \times$ in 10-second 720p video synthesis compared to FlashAttention-3.
  • Video Super-Resolution Enhancement: We develop an efficient few-step super-resolution network that upscales outputs to 1080p. It enhances sharpness while correcting distortions, thereby refining details and overall visual texture.
  • End-to-End Training Optimization: This work employs a multi-stage, progressive training strategy covering the entire pipeline from pre-training to post-training. Combined with the Muon optimizer to accelerate convergence, this approach holistically refines motion coherence, aesthetic quality, and human preference alignment, achieving professional-grade content generation.

📜 System Requirements

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support

  • Minimum GPU Memory: 14 GB (with model offloading enabled)

    Note: The memory requirements above are measured with model offloading enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.

Software Requirements

  • Operating System: Linux
  • Python: Python 3.10 or higher
  • CUDA: Compatible CUDA version for your PyTorch installation

🛠️ Dependencies and Installation

Step 1: Clone the Repository

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5

Step 2: Install Basic Dependencies

pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

Step 3: Install Attention Libraries

  • Flash Attention: Install Flash Attention for faster inference and reduced GPU memory consumption. Detailed installation instructions are available at Flash Attention.

  • Flex-Block-Attention: flex-block-attn is only required for sparse attention to achieve faster inference and can be installed by the following command:

    git clone https://github.com/Tencent-Hunyuan/flex-block-attn.git
    cd flex-block-attn
    git submodule update --init --recursive
    python3 setup.py install
    
  • SageAttention: To enable SageAttention for faster inference, you need to install it by the following command:

    Note: Enabling SageAttention will automatically disable Flex-Block-Attention.

    git clone https://github.com/cooper1637/SageAttention.git
    cd SageAttention 
    export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional
    python3 setup.py install
    
  • SGL-Kernel: To enable fp8 gemm for transformer, you need to install it by the following command:

    pip install sgl-kernel==0.3.18
    

🧱 Download Pretrained Models

💡 Distillation models and sparse attention models are still coming soon. Please stay tuned for the latest updates on the Hugging Face Model Card.

Download the pretrained models before generating videos. Detailed instructions are available at checkpoints-download.md.

Model Cards

ModelName Download
HunyuanVideo-1.5-480P-T2V 480P-T2V
HunyuanVideo-1.5-480P-I2V 480P-I2V
HunyuanVideo-1.5-480P-T2V-cfg-distill 480P-T2V-cfg-distill
HunyuanVideo-1.5-480P-I2V-cfg-distill 480P-I2V-cfg-distill
HunyuanVideo-1.5-480P-I2V-step-distill 480P-I2V-step-distill
HunyuanVideo-1.5-720P-T2V 720P-T2V
HunyuanVideo-1.5-720P-I2V 720P-I2V
HunyuanVideo-1.5-720P-T2V-cfg-distill Comming soon
HunyuanVideo-1.5-720P-I2V-cfg-distill 720P-I2V-cfg-distill
HunyuanVideo-1.5-720P-T2V-sparse-cfg-distill Comming soon
HunyuanVideo-1.5-720P-I2V-sparse-cfg-distill 720P-I2V-sparse-cfg-distill
HunyuanVideo-1.5-720P-sr-step-distill 720P-sr
HunyuanVideo-1.5-1080P-sr-step-distill 1080P-sr

📝 Prompt Guide

Prompt Writing Handbook

Prompt enhancement plays a crucial role in enabling our model to generate high-quality videos. By writing longer and more detailed prompts, the generated video will be significantly improved. We encourage you to craft comprehensive and descriptive prompts to achieve the best possible video quality. we recommend community partners consulting our official guide on how to write effective prompts.

Reference: HunyuanVideo-1.5 Prompt Handbook

System Prompts for Automatic Prompt Enhancement

For users seeking to optimize prompts for other large models, it is recommended to consult the definition of t2v_rewrite_system_prompt in the file hyvideo/utils/rewrite/t2v_prompt.py to guide text-to-video rewriting. Similarly, for image-to-video rewriting, refer to the definition of i2v_rewrite_system_prompt in hyvideo/utils/rewrite/i2v_prompt.py.

🔑 Inference

Inference with Source Code

For prompt rewriting, we recommend using Gemini or models deployed via vLLM. This codebase currently only supports models compatible with the vLLM API. If you wish to use Gemini, you will need to implement your own interface calls.

For models with a vLLM API, note that T2V (text-to-video) and I2V (image-to-video) have different recommended models and environment variables:

You may set the above model names to any other vLLM-compatible models you have deployed (including HuggingFace models).
Rewriting is enabled by default (--rewrite defaults to true); to disable it explicitly, use --rewrite false or --rewrite 0. If no vLLM endpoint is configured, the pipeline runs without remote rewriting.

Example: Generate a video (works for both T2V and I2V; set IMAGE_PATH=none for T2V or provide an image path for I2V)

💡 Tip: For faster inference speed, you can enable the step-distilled model using the --enable_step_distill parameter. The step-distilled model (480p I2V) can generate videos in 8 or 12 steps (recommended), achieving up to 75% speedup on RTX 4090 while maintaining comparable quality.

Tips: If your GPU memory is > 14GB but you encounter OOM (Out of Memory) errors during generation, you can try setting the following environment variable before running:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

Tips: If you have limited CPU memory and encounter OOM during inference, you can try disable overlapped group offloading by adding the following argument:

--overlap_group_offloading false
export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export T2V_REWRITE_MODEL_NAME="<your_model_name>"
export I2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export I2V_REWRITE_MODEL_NAME="<your_model_name>"

PROMPT='A girl holding a paper with words "Hello, world!"'

IMAGE_PATH=/path/to/image.png # Optional, none or <image path> to enable i2v mode
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
MODEL_PATH=./ckpts # Path to pretrained model

# Configuration for faster inference
N_INFERENCE_GPU=8 # Parallel inference GPU count
CFG_DISTILLED=true # Inference with CFG distilled model, 2x speedup
SAGE_ATTN=true # Inference with SageAttention
SPARSE_ATTN=false # Inference with sparse attention (only 720p models are equipped with sparse attention). Please ensure flex-block-attn is installed
OVERLAP_GROUP_OFFLOADING=true # Only valid when group offloading is enabled, significantly increases CPU memory usage but speeds up inference
ENABLE_CACHE=true # Enable feature cache during inference. Significantly speeds up inference.
CACHE_TYPE=deepcache # Support: deepcache, teacache, taylorcache
ENABLE_STEP_DISTILL=true # Enable step distilled model for 480p I2V, recommended 8 or 12 steps, up to 6x speedup


# Configuration for better quality
REWRITE=true # Enable prompt rewriting. Please ensure rewrite vLLM server is deployed and configured.
ENABLE_SR=true # Enable super resolution


torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
 --prompt "$PROMPT" \
 --image_path $IMAGE_PATH \
 --resolution $RESOLUTION \
 --aspect_ratio $ASPECT_RATIO \
 --seed $SEED \
 --rewrite $REWRITE \
 --cfg_distilled $CFG_DISTILLED \
 --enable_step_distill $ENABLE_STEP_DISTILL \
 --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
 --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
 --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
 --sr $ENABLE_SR --save_pre_sr_video \
 --output_path $OUTPUT_PATH \
 --model_path $MODEL_PATH

Command Line Arguments

Argument Type Required Default Description
--prompt str Yes - Text prompt for video generation
--negative_prompt str No '' Negative prompt for video generation
--resolution str Yes - Video resolution: 480p or 720p
--model_path str Yes - Path to pretrained model directory
--aspect_ratio str No 16:9 Aspect ratio of the output video
--num_inference_steps int No 50 Number of inference steps
--video_length int No 121 Number of frames to generate
--seed int No 123 Random seed for reproducibility
--image_path str No None Path to reference image (enables i2v mode). Use none or None to explicitly use text-to-video mode
--output_path str No None Output file path (if not provided, saves to ./outputs/output_{transformer_version}_{timestamp}.mp4)
--sr bool No true Enable super resolution (use --sr false or --sr 0 to disable)
--save_pre_sr_video bool No false Save original video before super resolution (use --save_pre_sr_video or --save_pre_sr_video true to enable, only effective when super resolution is enabled)
--rewrite bool No true Enable prompt rewriting (use --rewrite false or --rewrite 0 to disable, may result in lower quality video generation)
--cfg_distilled bool No false Enable CFG distilled model for faster inference (~2x speedup, use --cfg_distilled or --cfg_distilled true to enable)
--enable_step_distill bool No false Enable step distilled model for 480p I2V (recommended 8 or 12 steps, ~75% speedup on RTX 4090, use --enable_step_distill or --enable_step_distill true to enable)
--sparse_attn bool No false Enable sparse attention for faster inference (~1.5-2x speedup, requires H-series GPUs, auto-enables CFG distilled, use --sparse_attn or --sparse_attn true to enable)
--offloading bool No true Enable CPU offloading (use --offloading false or --offloading 0 to disable for faster inference if GPU memory allows)
--group_offloading bool No None Enable group offloading (default: None, automatically enabled if offloading is enabled. Use --group_offloading or --group_offloading true/1 to enable, --group_offloading false/0 to disable)
--overlap_group_offloading bool No true Enable overlap group offloading (default: true). Significantly increases CPU memory usage but speeds up inference. Use --overlap_group_offloading or --overlap_group_offloading true/1 to enable, --overlap_group_offloading false/0 to disable
--dtype str No bf16 Data type for transformer: bf16 (faster, lower memory) or fp32 (better quality, slower, higher memory)
--use_sageattn bool No false Enable SageAttention (use --use_sageattn or --use_sageattn true/1 to enable, --use_sageattn false/0 to disable)
--sage_blocks_range str No 0-53 SageAttention blocks range (e.g., 0-5 or 0,1,2,3,4,5)
--enable_cache bool No false Enable cache for transformer (use --enable_cache or --enable_cache true/1 to enable, --enable_cache false/0 to disable)
--cache_type str No deepcache Cache type for transformer (e.g., deepcache, teacache, taylorcache)
--no_cache_block_id str No 53 Blocks to exclude from deepcache (e.g., 0-5 or 0,1,2,3,4,5)
--cache_start_step int No 11 Start step to skip when using cache
--cache_end_step int No 45 End step to skip when using cache
--total_steps int No 50 Total inference steps
--cache_step_interval int No 4 Step interval to skip when using cache

Note: Use --nproc_per_node to specify the number of GPUs. For example, --nproc_per_node=8 uses 8 GPUs.

Optimal Inference Configurations

The following table provides the optimal inference configurations (CFG scale, embedded CFG scale, flow shift, and inference steps) for each model to achieve the best generation quality:

Model CFG Scale Embedded CFG Scale Flow Shift Inference Steps
480p T2V 6 None 5 50
480p I2V 6 None 5 50
720p T2V 6 None 9 50
720p I2V 6 None 7 50
480p T2V CFG Distilled 1 None 5 50
480p I2V CFG Distilled 1 None 5 50
480p I2V Step Distilled 1 None 7 8 or 12 (recommended)
720p T2V CFG Distilled 1 None 9 50
720p I2V CFG Distilled 1 None 7 50
720p T2V CFG Distilled Sparse 1 None 9 50
720p I2V CFG Distilled Sparse 1 None 7 50
480→720 SR Step Distilled 1 None 2 6
720→1080 SR Step Distilled 1 None 2 8

Please note that the cfg distilled model we provided, must use 50 steps to generate correct results.

Usage with Diffusers

HunyuanVideo-1.5 is available on Hugging Face Diffusers! You can easily use it with the Diffusers library:

Basic Usage:

import torch

dtype = torch.bfloat16
device = "cuda:0"

from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideo15Pipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v", torch_dtype=dtype)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

generator = torch.Generator(device=device).manual_seed(seed)

video = pipe(
 prompt=prompt,
 generator=generator,
 num_frames=121,
 num_inference_steps=50,
).frames[0]

export_to_video(video, "output.mp4", fps=24)

Optimized Usage with Attention Backend:

HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

We recommend installing kernels (pip install kernels) to access prebuilt attention kernels.

import torch

dtype = torch.bfloat16
device = "cuda:0"

from diffusers import HunyuanVideo15Pipeline, attention_backend
from diffusers.utils import export_to_video

pipe = HunyuanVideo15Pipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v", torch_dtype=dtype)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

generator = torch.Generator(device=device).manual_seed(seed)

with attention_backend("_flash_3_hub"): # or `"flash_hub"` if you are not on H100/H800
 video = pipe(
 prompt=prompt,
 generator=generator,
 num_frames=121,
 num_inference_steps=50,
 ).frames[0]
 export_to_video(video, "output.mp4", fps=24)

For more details, please visit HunyuanVideo-1.5 Diffusers Collection.

🎓 Training

HunyuanVideo-1.5 is trained using the Muon optimizer, which accelerates convergence and improves training stability. The Muon optimizer combines momentum-based updates with Newton-Schulz orthogonalization for efficient optimization of large-scale video generation models.

Quick Start

The training script (train.py) provides a complete training pipeline for HunyuanVideo-1.5. Here's how to use it:

1. Implement Your DataLoader

Replace the create_dummy_dataloader() function in train.py with your own implementation. Your dataset's __getitem__ method should return a single sample.

  • Required fields:

    • "pixel_values": torch.Tensor - Video: [C, F, H, W] or Image: [C, H, W]
      • Pixel values must be in range [-1, 1]
      • Note: For video data, temporal dimension F must be 4n+1 (e.g., 1, 5, 9, 13, 17, ...)
    • "text": str - Text prompt for this sample
    • "data_type": str - "video" or "image"
  • Optional fields (for performance optimization):

    • "latents": Pre-encoded VAE latents (skips VAE encoding for faster training)
    • "byt5_text_ids" and "byt5_text_mask": Pre-tokenized byT5 inputs

See the create_dummy_dataloader() function in train.py for detailed format documentation.

2. Run Training

Single GPU:

python train.py --pretrained_model_root <path_to_pretrained_model> [other args]

Multi-GPU:

N=8
torchrun --nproc_per_node=$N train.py --pretrained_model_root <path_to_pretrained_model> [other args]

Example:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --learning_rate 1e-5 \
 --batch_size 1 \
 --max_steps 10000 \
 --output_dir ./outputs \
 --enable_fsdp \
 --enable_gradient_checkpointing \
 --sp_size 8

3. Key Training Parameters

Parameter Description Default
--pretrained_model_root Path to pretrained model (required) -
--learning_rate Learning rate 1e-5
--batch_size Batch size 1
--max_steps Maximum training steps 10000
--warmup_steps Warmup steps 500
--gradient_accumulation_steps Gradient accumulation steps 1
--enable_fsdp Enable FSDP for distributed training true
--enable_gradient_checkpointing Enable gradient checkpointing true
--sp_size Sequence parallelism size (must divide world_size) 8
--i2v_prob Probability of i2v task for video data 0.3
--use_muon Use Muon optimizer true
--resume_from_checkpoint Resume from checkpoint directory None
--use_lora Enable LoRA fine-tuning false
--lora_r LoRA rank 8
--lora_alpha LoRA alpha scaling parameter 16
--lora_dropout LoRA dropout rate 0.0
--pretrained_lora_path Path to pretrained LoRA adapter None

4. Monitor Training

  • Checkpoints are saved to output_dir at intervals specified by --save_interval
  • Validation videos are generated at intervals specified by --validation_interval
  • Training logs are printed to console at intervals specified by --log_interval

5. Resume Training

Use --resume_from_checkpoint <checkpoint_dir> to resume from a saved checkpoint:

python train.py \
 --pretrained_model_root <path> \
 --resume_from_checkpoint ./outputs/checkpoint-1000

6. LoRA Fine-tuning

To enable LoRA fine-tuning, add --use_lora to your training command. LoRA adapters will be saved in the checkpoint directory under lora/:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --use_lora \
 --lora_r 8 \
 --lora_alpha 16 \
 --learning_rate 1e-4 \
 --output_dir ./outputs

To load a pretrained LoRA adapter, use --pretrained_lora_path:

torchrun --nproc_per_node=8 train.py \
 --pretrained_model_root ./ckpts \
 --use_lora \
 --pretrained_lora_path ./outputs/checkpoint-1000/lora/default

📊 Evaluation

Rating

We assess text-to-video generation using a comprehensive rating methodology that considers five key dimensions: text-video consistency, visual quality, structural stability, motion effects, and the aesthetic quality of individual frames. For image-to-video generation, the evaluation encompasses image-video consistency, instruction responsiveness, visual quality, structural stability, and motion effects.


GSB

The GSB(Good/Same/Bad) approach is widely used to evaluate the relative performance of two models based on overall video perception quality.We carefully construct 300 diverse text prompts and 300 image samples to cover balanced application scenarios for both text-to-video and image-to-video tasks. For each prompt or image input, an equal number of video samples are generated by each model in a single run to ensure comparability. To maintain fairness, inference is performed only once per input without any cherry-picking of results. All competing models are evaluated using their default configurations. The evaluation is conducted by over 100 professional assessors


Inference speed

We report inference speed with basic engineering-level acceleration techniques enabled on 8 H800 GPUs to demonstrate practical performance achievable in real-world deployment scenarios. Please note that in this experiment, we do not pursue the most extreme acceleration at the cost of generation quality, but rather to achieve notable speed improvements while maintaining nearly identical output quality.

We report the total inference time for 50 diffusion steps for HunyuanVideo 1.5 below:

🎬 More Examples

Features Demo1 Demo2
Strong Instruction Following
Smooth Motion Generation
Cinematic Aesthetics
Text Rendering
Physics Compliance
Camera Movement
Multi-Style Support
High Image-Video Consistency 👁 Image
👁 Image

📚 Citation

@misc{hunyuanvideo2025,
 title={HunyuanVideo 1.5 Technical Report}, 
 author={Tencent Hunyuan Foundation Model Team},
 year={2025},
 eprint={2511.18870},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2511.18870}, 
}

🙏 Acknowledgements

We would like to thank the contributors to the Transformers, Diffusers , HuggingFace and Qwen-VL, for their open research and exploration.

🌟 Github Star History

Downloads last month
1,940

Model tree for tencent/HunyuanVideo-1.5

Adapters
1 model
Finetunes
12 models
Quantizations
5 models

Spaces using tencent/HunyuanVideo-1.5 100

Collection including tencent/HunyuanVideo-1.5

Paper for tencent/HunyuanVideo-1.5