Caption3o-XL-2B-Qwen2VL

The Caption3o-XL-2B-Qwen2VL model is a fine-tuned version of Qwen2-VL-2B-Instruct, tailored for Image Captioning and Vision Language Attribution. This variant is designed to generate precise, highly descriptive captions with a focus on defining visual properties, object attributes, and scene details across a wide spectrum of images and aspect ratios.

Key Highlights

Vision Language Attribution (VLA): Specially fine-tuned to attribute and define visual properties of objects, scenes, and environments.
Detailed Object Definitions: Generates captions with rich attribute descriptions, aiming for more precise outputs than general-purpose captioners.
High-Fidelity Descriptions: Handles general, artistic, technical, abstract, and low-context images with descriptive depth.
Robust Across Aspect Ratios: Captions images regardless of format: wide, tall, square, or irregular.
Variational Detail Control: Supports both concise summaries and fine-grained attributions depending on prompt structure.
Foundation on Qwen2-VL Architecture: Leverages Qwen2-VL-2B-Instruct’s multimodal reasoning for visual comprehension and instruction-following.
Multilingual Capability: Captions in English by default, and can be steered toward other languages through prompt engineering.
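
Because detail level and output language are both controlled through the prompt, it can help to keep the prompt wording in one place. The following is a minimal sketch, not part of the model's API: the helper name `build_caption_messages`, the preset keys, and the "Respond in ..." suffix are hypothetical. The two prompt strings are the ones used elsewhere in this card.

```python
from typing import Any

# Prompt presets. The strings come from this card; the "concise"/"detailed"
# keys are hypothetical labels chosen for this sketch.
DETAIL_PROMPTS = {
    "concise": "Caption the image precisely.",
    "detailed": "Describe this image with detailed attributes and properties.",
}


def build_caption_messages(
    image: str, detail: str = "detailed", language: str = "English"
) -> list[dict[str, Any]]:
    """Build a single-turn chat message list in the Qwen2-VL message format."""
    if detail not in DETAIL_PROMPTS:
        raise ValueError(f"unknown detail level: {detail!r}")
    prompt = DETAIL_PROMPTS[detail]
    # Assumption: a plain-language instruction is enough to switch languages.
    if language != "English":
        prompt += f" Respond in {language}."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed as `messages` to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written message list in the quick start.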

model type: experimental

General Query: Caption the image precisely.

Quick Start with Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Caption3o-XL-2B-Qwen2VL", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Caption3o-XL-2B-Qwen2VL")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Render the chat prompt and collect the image inputs referenced in the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the new caption is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
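
For very large or unusually shaped images, the upstream Qwen2-VL documentation describes `min_pixels` and `max_pixels` arguments to `AutoProcessor.from_pretrained` that bound how many visual tokens an image is resized to, with each token covering a 28x28 pixel area. The sketch below only computes those bounds. The helper name `pixel_bounds` is hypothetical, the 256 to 1280 token range is the example from the base model's documentation, and it is an assumption that the same settings suit this fine-tune.

```python
# Pixels covered by one visual token in Qwen2-VL, per the base model's documentation.
PATCH_PIXELS = 28 * 28


def pixel_bounds(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token budget into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PATCH_PIXELS, max_tokens * PATCH_PIXELS


min_pixels, max_pixels = pixel_bounds(256, 1280)

# Applying the bounds (commented out because it downloads the processor):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "prithivMLmods/Caption3o-XL-2B-Qwen2VL",
#     min_pixels=min_pixels,
#     max_pixels=max_pixels,
# )
```

A smaller upper bound lowers memory use and latency at the cost of fine detail, which matters for attribute-level captions, so the range is worth tuning per dataset.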

Intended Use

Generating attribute-rich image captions for research, dataset creation, and AI training.
Vision-language attribution for object detection, scene understanding, and dataset annotation.
Supporting creative, artistic, and technical applications requiring detailed descriptions.
Captioning across varied aspect ratios, unusual visual styles, and non-standard datasets.
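
For the dataset-creation and annotation use cases above, captions are often stored one record per line in a JSONL file. The following is a minimal sketch under stated assumptions: `write_caption_records` and the record fields are hypothetical, and `caption_fn` stands in for any function that wraps the quick start inference code and returns a caption string for an image path and prompt.

```python
import json
from typing import Callable, Iterable


def write_caption_records(
    image_paths: Iterable[str],
    caption_fn: Callable[[str, str], str],
    out_path: str,
    prompt: str = "Caption the image precisely.",
) -> int:
    """Caption each image and write one JSON record per line; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for path in image_paths:
            record = {
                "image": str(path),
                "prompt": prompt,  # kept so captions can be traced back to the prompt used
                "caption": caption_fn(str(path), prompt).strip(),
            }
            # ensure_ascii=False keeps non-English captions readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Storing the prompt alongside each caption makes it easier to audit outputs later, since caption tone and detail depend on prompt phrasing.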

Limitations

May over-attribute or infer properties not explicitly visible in ambiguous images.
Outputs can vary in tone depending on prompt phrasing.
Accuracy may degrade on synthetic or highly abstract visual domains.

Model size: 2B params (Safetensors)
Tensor type: F16

Model tree for prithivMLmods/Caption3o-XL-2B-Qwen2VL

Base model: Qwen/Qwen2-VL-2B
Finetuned: Qwen/Qwen2-VL-2B-Instruct
Finetuned: this model
Quantizations: 2 models

URL: https://huggingface.co/prithivMLmods/Caption3o-XL-2B-Qwen2VL
