VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
News
- [2025.01.24] Online demos are available: VideoLLaMA3-Image-7B, VideoLLaMA3-7B.
- [2025.01.22] Released the models and inference code of VideoLLaMA 3.
Introduction
VideoLLaMA 3 represents a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.
Model Zoo
| Model | Base Model | HF Link |
|---|---|---|
| VideoLLaMA3-7B | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B |
| VideoLLaMA3-2B (This Checkpoint) | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |
We have also uploaded the fine-tuned vision encoder of VideoLLaMA3-7B for wider use:
| Model | Base Model | HF Link |
|---|---|---|
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
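For scripts that switch between checkpoints, the language-model rows of the Model Zoo table can be captured in a small lookup. This is only a minimal sketch: the `CHECKPOINTS` dictionary and the `select_checkpoint` helper are hypothetical names introduced here for illustration, and only the repository ids are taken from the table.

```python
# Hypothetical lookup built from the Model Zoo table.
# Only the repository ids come from the table; the keys and helper are illustrative.
CHECKPOINTS = {
    ("7B", False): "DAMO-NLP-SG/VideoLLaMA3-7B",
    ("2B", False): "DAMO-NLP-SG/VideoLLaMA3-2B",
    ("7B", True): "DAMO-NLP-SG/VideoLLaMA3-7B-Image",
    ("2B", True): "DAMO-NLP-SG/VideoLLaMA3-2B-Image",
}


def select_checkpoint(size: str, image_only: bool = False) -> str:
    """Return the Hugging Face repo id for a model size ("2B" or "7B") and variant."""
    key = (size.upper(), image_only)
    if key not in CHECKPOINTS:
        raise ValueError(f"Unknown checkpoint: size={size!r}, image_only={image_only}")
    return CHECKPOINTS[key]


if __name__ == "__main__":
    print(select_checkpoint("2b"))
    print(select_checkpoint("7B", image_only=True))
```

The returned string can be used wherever a `model_name` is expected when loading a checkpoint.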
Main Results
An asterisk (*) denotes reproduced results.
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the 2B checkpoint. trust_remote_code=True lets transformers run the
# custom model code that ships with the checkpoint repository.
model_name = "DAMO-NLP-SG/VideoLLaMA3-2B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

video_path = "put your video path here"
question = "Describe this video in detail."

# Video conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": video_path, "fps": 1, "max_frames": 128}},
            {"type": "text", "text": question},
        ],
    },
]

# Tokenize the text and preprocess the video frames, then move tensors to the GPU.
inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Match the pixel tensor dtype to the bfloat16 model weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate up to 128 new tokens and decode the first (only) sequence.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
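When asking several questions about different videos, it can help to wrap the message structure used above in a function. The sketch below rebuilds the same conversation layout. `build_video_conversation` and `estimate_frame_count` are hypothetical helpers, not part of the VideoLLaMA 3 code. The frame estimate additionally assumes that `fps` is a sampling rate in frames per second and that `max_frames` is an upper bound on the number of sampled frames, which this page does not state explicitly.

```python
import math


def build_video_conversation(video_path, question, fps=1, max_frames=128,
                             system_prompt="You are a helpful assistant."):
    """Build the message list in the same layout as the Quick Start example."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "video",
                 "video": {"video_path": video_path, "fps": fps, "max_frames": max_frames}},
                {"type": "text", "text": question},
            ],
        },
    ]


def estimate_frame_count(duration_s, fps=1, max_frames=128):
    """Rough estimate of how many frames would be sampled from a video.

    ASSUMPTION: frames are sampled at `fps` frames per second and capped at
    `max_frames`; the actual sampling logic of the processor may differ.
    """
    if duration_s < 0 or fps <= 0 or max_frames <= 0:
        raise ValueError("duration_s must be >= 0; fps and max_frames must be > 0")
    # Always count at least one frame, never more than the cap.
    return min(max_frames, max(1, math.floor(duration_s * fps)))


if __name__ == "__main__":
    conversation = build_video_conversation("clip.mp4", "Describe this video in detail.")
    print(conversation[1]["content"][0])
    print(estimate_frame_count(600))
```

The returned list can be passed as the `conversation=` argument of the processor call shown in the Quick Start. Under the stated assumption, videos longer than `max_frames / fps` seconds would all hit the same frame cap.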
Citation
If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
@article{damonlpsg2025videollama3,
title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
journal={arXiv preprint arXiv:2501.13106},
year={2025},
url = {https://arxiv.org/abs/2501.13106}
}
@article{damonlpsg2024videollama2,
title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
journal={arXiv preprint arXiv:2406.07476},
year={2024},
url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
author = {Zhang, Hang and Li, Xin and Bing, Lidong},
journal = {arXiv preprint arXiv:2306.02858},
year = {2023},
url = {https://arxiv.org/abs/2306.02858}
}
