URL: https://huggingface.co/Dream-org/Dream-VL-7B


Dream-VL 7B

Dream-VL 7B is an open diffusion vision-language model trained on 12M multimodal examples from the MAmmoTH-VL-Instruct-12M dataset. The model takes language instructions and images as input and generates language outputs.
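
To give a rough feel for what "diffusion" decoding means for a language model, here is a toy sketch. It is not Dream-VL's sampler: there is no model, the random scores stand in for model confidence, and `toy_unmask` and its even reveal schedule are hypothetical names and choices made up for this illustration. It only shows the general idea of starting from a fully masked response and revealing tokens over several steps, in any order rather than left to right.

```python
import random

MASK = None  # stand-in for a masked position (illustrative only)


def toy_unmask(length, steps, seed=0):
    """Reveal `length` masked positions over at most `steps` rounds.

    Random scores play the role of model confidence; a real model would
    predict tokens and score them instead.
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    history = []
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Reveal the k highest-scoring masked positions this round.
        ranked = sorted(masked, key=lambda i: rng.random(), reverse=True)
        for i in ranked[:k]:
            seq[i] = f"tok{i}"
        history.append(list(seq))
    return seq, history


final, history = toy_unmask(length=8, steps=4)
masked_per_step = [sum(tok is MASK for tok in snapshot) for snapshot in history]
print(masked_per_step)  # [6, 4, 2, 0]
```

In this toy schedule, using as many steps as there are positions reveals one position per step, while fewer steps reveal several positions at once.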

All Dream-VL checkpoints, as well as our training codebase, are released under the Apache 2.0 License.

For full details, please read our blog and the paper: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone.

Model Summary

Getting Started

import torch
from transformers import AutoProcessor, AutoModel

model_name = "Dream-org/Dream-VL-7B"

model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Method 1: load the image yourself with PIL
from PIL import Image
import requests

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": "Describe this image"}],
    }
]
# Render the chat template to a prompt string; tokenization happens in the processor call
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)
inputs = processor(
    text=[text], images=[image], padding=True, return_tensors="pt"
)

# Method 2: use qwen_vl_utils (uncomment to use instead of Method 1)
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]
# text = processor.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True
# )
# from qwen_vl_utils import process_vision_info
# image_inputs, video_inputs = process_vision_info(messages)
# inputs = processor(
#     text=[text],
#     images=image_inputs,
#     videos=video_inputs,
#     padding=True,
#     return_tensors="pt",
# )

# Move the tensors to the GPU, then pass input_ids positionally and the
# remaining processor outputs as keyword arguments
inputs = inputs.to("cuda")
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=128,
    output_history=True,
    return_dict_in_generate=True,
    steps=128,
    temperature=0.1,
    top_p=1,
    alg="maskgit_plus",
    alg_temp=0,
    use_cache=False,
    **inputs
)

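# Drop the prompt tokens from each sequence and decode only the generated part
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]

# Print each generation, truncated at the first end-of-sequence token
for j, generation in enumerate(generations):
    print("output:", j, generation.split(processor.tokenizer.eos_token)[0])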


# output: The image depicts a serene beach scene featuring a young woman and a golden retriever.
# The woman, dressed in a plaid shirt and dark pants, is seated on the sandy shore, smiling warmly at the camera.
# The golden retriever, adorned with a colorful harness, sits attentively beside her, its gaze fixed on the woman.
# The background reveals the vast expanse of the ocean, with waves gently kissing the shore. The sky above is a clear blue, suggesting a sunny day.
# The overall atmosphere exudes a sense of peace and companionship between the woman and her dog.
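
The last step above cuts each decoded string at the first end-of-sequence token. If you post-process many generations, a small helper can make that step reusable. The sketch below is an illustration only: `trim_at_eos` is a hypothetical name, the `EOS` string is a stand-in (in practice you would pass `processor.tokenizer.eos_token`), and the whitespace stripping is an addition that the original snippet does not perform.

```python
def trim_at_eos(decoded, eos_token):
    """Return the text before the first end-of-sequence marker, stripped."""
    # Guard against a missing or empty marker: str.split(None) would split on
    # whitespace, and str.split("") raises ValueError.
    if not eos_token:
        return decoded.strip()
    return decoded.split(eos_token, 1)[0].strip()


# Stand-in values for illustration; real text comes from processor.tokenizer.decode(...)
EOS = "<|endoftext|>"
raw = ["A dog on a beach." + EOS + EOS, "No marker here"]
cleaned = [trim_at_eos(text, EOS) for text in raw]
print(cleaned)  # ['A dog on a beach.', 'No marker here']
```

Splitting with a maximum of one split avoids scanning past the first marker when the tail is padded with repeated end-of-sequence tokens.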

Citation

BibTeX:

@article{ye2025dreamvla,
 title={Dream-VL \& Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
 author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
 journal={arXiv preprint arXiv:2512.22615},
 year={2025}
}

Model size: 8B params · Tensor type: BF16 (Safetensors)