Qwen2.5-VL-3B-Instruct-Unredacted-MAX
Qwen2.5-VL-3B-Instruct-Unredacted-MAX is an optimized release built on top of huihui-ai/Qwen2.5-VL-3B-Instruct-abliterated. This version focuses on improved model packaging, updated compatibility with modern Transformers pipelines, and stable multimodal inference behavior, while preserving the core vision-language reasoning capabilities of the original architecture. The result is a compact 3B vision-language model designed for efficient deployment, research experimentation, and multimodal application development.
Key Highlights
- Optimized Release Packaging: Streamlined repository structure for smoother loading, inference, and deployment workflows.
- Modern Transformers Compatibility: Updated to ensure stable integration with recent Hugging Face Transformers versions.
- 3B Vision-Language Architecture: Built on Qwen2.5-VL-3B-Instruct, balancing multimodal capability with lightweight deployment requirements.
- Stable Multimodal Inference: Designed for consistent performance across image-text reasoning tasks.
- Efficient Caption Generation: Produces structured, descriptive outputs suitable for annotation and dataset building.
- Dynamic Resolution Support: Retains native handling of varying image resolutions and aspect ratios.
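To make the dynamic-resolution point concrete, the sketch below estimates how an input size might map onto a visual-token budget. It assumes the 28-pixel granularity and the `min_pixels` / `max_pixels` budget convention described for the upstream Qwen2.5-VL family; the function name `fit_to_pixel_budget` and the default budget values are illustrative and are not part of this repository.

```python
import math

# Assumption: 14-px vision patches merged 2x2, so sizes snap to multiples of 28.
FACTOR = 28


def fit_to_pixel_budget(height, width,
                        min_pixels=256 * FACTOR * FACTOR,
                        max_pixels=1280 * FACTOR * FACTOR):
    """Return (new_height, new_width, visual_tokens) for an image of the given size.

    The image is snapped to multiples of FACTOR, then scaled down (or up) with
    its aspect ratio roughly preserved so the pixel count fits the budget.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")

    # Snap each side to the nearest multiple of FACTOR (at least one cell).
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(FACTOR, math.floor(height / scale / FACTOR) * FACTOR)
        w = max(FACTOR, math.floor(width / scale / FACTOR) * FACTOR)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR

    tokens = (h // FACTOR) * (w // FACTOR)
    return h, w, tokens


print(fit_to_pixel_budget(720, 1280))
```

The upstream Qwen2.5-VL documentation describes passing `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained` to trade visual detail against memory; an estimate like this can help pick those values before loading the model.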
Base Model Signatures:
This model was re-sharded from the base model and repackaged for recent Transformers versions. Base model: https://huggingface.co/huihui-ai/Qwen2.5-VL-3B-Instruct-abliterated
Quick Start with Transformers
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX",
    torch_dtype="auto",
    device_map="auto",
)

# The processor handles both the chat template and image preprocessing.
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX"
)

# A single user turn: one image followed by a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption for this image."},
        ],
    }
]

# Render the prompt text and collect the image/video inputs it refers to.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

# Move inputs to wherever the model was placed instead of hard-coding "cuda".
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated caption is decoded.
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
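The message structure used in the Quick Start generalizes to other images and prompts. The following is a minimal sketch, assuming a hypothetical helper named `build_caption_messages` (it is not part of Transformers or `qwen_vl_utils`), that builds the same single-turn layout for one or more image references.

```python
def build_caption_messages(image_refs,
                           prompt="Provide a detailed caption for this image."):
    """Build a single-turn chat message list: image entries, then one text prompt.

    image_refs may be a single string or a list of strings. Each string is
    passed through unchanged as the "image" field of a content entry.
    """
    if isinstance(image_refs, str):
        image_refs = [image_refs]
    if not image_refs:
        raise ValueError("at least one image reference is required")

    # Images come first, followed by the instruction text, as in the Quick Start.
    content = [{"type": "image", "image": ref} for ref in image_refs]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_caption_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
)
print(messages[0]["content"][-1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as above. The upstream Qwen2.5-VL documentation describes URLs, `file://` paths, and base64 data as accepted image references; that behavior belongs to `qwen_vl_utils` and is assumed here rather than verified for this release.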
Intended Use
- Multimodal AI research and evaluation
- Image captioning and dataset generation pipelines
- Vision-language prototyping and experimentation
- Lightweight deployment in constrained environments
- Development of multimodal applications and tools
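For the captioning and dataset-generation use case, one possible shape for the pipeline is sketched below. It assumes a hypothetical callable `caption_fn` that wraps the Quick Start inference code and returns a caption string for an image path; the record layout (`image`, `caption`) and the function name `write_caption_dataset` are illustrative choices, not a format this model requires.

```python
import json
from pathlib import Path


def write_caption_dataset(image_paths, caption_fn, out_path):
    """Caption each image and write one JSON object per line (JSONL).

    caption_fn is a hypothetical callable: image path (str) -> caption (str).
    Returns the number of records written.
    """
    out_path = Path(out_path)
    written = 0
    with out_path.open("w", encoding="utf-8") as f:
        for path in image_paths:
            caption = caption_fn(str(path)).strip()
            if not caption:
                continue  # skip images for which no caption was produced
            record = {"image": str(path), "caption": caption}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Keeping the model call behind `caption_fn` makes the writer easy to test with a stub and lets the same loop be reused with different prompts or decoding settings.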
Limitations & Risks
Important Note: This model inherits behavior and constraints from its base architecture.
- Performance depends on image quality, resolution, and prompt design
- May produce incomplete or inaccurate interpretations in complex scenes
- Requires adequate GPU resources for stable inference
- Output consistency varies with decoding settings and runtime optimization
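Because output consistency depends on decoding settings, it can help to fix those settings explicitly rather than rely on defaults. The sketch below assumes a hypothetical helper `generation_kwargs`; the keys are standard `generate` arguments in Transformers, while the sampling values (`temperature=0.7`, `top_p=0.9`) are illustrative and not recommendations from the model authors.

```python
def generation_kwargs(deterministic=True, max_new_tokens=256):
    """Return keyword arguments for model.generate.

    deterministic=True selects greedy decoding, which removes sampling
    randomness; deterministic=False enables sampling with example values.
    """
    kwargs = {"max_new_tokens": max_new_tokens}
    if deterministic:
        # Greedy decoding: always pick the highest-probability next token.
        kwargs["do_sample"] = False
    else:
        # Illustrative sampling settings; tune for the task at hand.
        kwargs.update(do_sample=True, temperature=0.7, top_p=0.9)
    return kwargs


print(generation_kwargs())
```

These can be passed as `model.generate(**inputs, **generation_kwargs())`. Greedy decoding removes sampling variance, though small numeric differences across hardware or precision settings may still change outputs occasionally.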
