URL: https://huggingface.co/prithivMLmods/Qwen3-VL-8B-Instruct-Unredacted-MAX

Qwen3-VL-8B-Instruct-Unredacted-MAX

Qwen3-VL-8B-Instruct-Unredacted-MAX is an optimized release built on top of huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated. This version focuses on packaging improvements, inference stability, and compatibility with recent Transformers versions, while preserving the multimodal reasoning capabilities of the base architecture. The result is an 8B vision-language model intended for research, structured captioning, and multimodal experimentation.

Key Highlights

  • Optimized Release Pipeline: Improved repository structure and loading consistency for smoother deployment and inference.

  • Modern Transformers Integration: Updated compatibility for recent Hugging Face Transformers versions and vision-language utilities.

  • 8B Vision-Language Architecture: Built on Qwen3-VL-8B-Instruct, offering strong reasoning ability across image-text tasks with balanced compute requirements.

  • Stable Multimodal Inference: Improved consistency for caption generation, visual reasoning, and structured outputs.

  • High-Quality Caption Generation: Produces detailed, structured descriptions suitable for dataset creation, annotation workflows, and accessibility applications.

  • Dynamic Resolution Handling: Maintains native support for variable image resolutions and aspect ratios.
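Because the model accepts variable resolutions, the number of visual tokens (and so memory and latency) grows with image size. The sketch below gives a rough way to reason about that growth. The function name is hypothetical, and the defaults `patch_size=16` and `merge_size=2` are assumptions that this model card does not confirm; check the repository's preprocessor configuration for the actual values and rounding rule.

```python
import math


def estimate_visual_tokens(width: int, height: int,
                           patch_size: int = 16, merge_size: int = 2) -> int:
    """Rough count of visual tokens for a single image.

    Assumption: each side is rounded up to a multiple of
    patch_size * merge_size, and every merge_size x merge_size group
    of patches becomes one token. Real preprocessing may round or
    clamp differently.
    """
    cell = patch_size * merge_size  # pixels covered by one token per side
    cols = math.ceil(width / cell)
    rows = math.ceil(height / cell)
    return rows * cols


# Doubling both sides roughly quadruples the token count.
print(estimate_visual_tokens(1280, 720))
print(estimate_visual_tokens(2560, 1440))
```

Under these assumptions, downscaling very large images before inference is the simplest way to keep the token count bounded.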


Base Model Signatures

This model was re-sharded from the base model and updated for compatibility with recent Transformers versions. Base model: https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated


Quick Start with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3-VL-8B-Instruct-Unredacted-MAX",
    torch_dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3-VL-8B-Instruct-Unredacted-MAX"
)

# A single user turn containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption for this image."},
        ],
    }
]

# Render the chat template to a prompt string (tokenization happens below).
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Fetch and preprocess the images/videos referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)

# Move inputs to the model's device rather than hard-coding "cuda".
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so that only the newly generated text is decoded.
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
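
To caption more than one image, the message structure from the Quick Start can be produced by a small helper. This is a minimal sketch: `build_caption_messages` is a hypothetical name, and the file paths are placeholders. It only builds the message list; each conversation would then go through the same template, preprocessing, and generation steps shown above.

```python
from typing import Any

DEFAULT_PROMPT = "Provide a detailed caption for this image."


def build_caption_messages(image: str,
                           prompt: str = DEFAULT_PROMPT) -> list[dict[str, Any]]:
    """Return a single-turn conversation in the Quick Start message format.

    `image` is a path or URL string, as in the Quick Start example.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# One conversation per image; these paths are placeholders.
batch = [build_caption_messages(path) for path in ["a.jpg", "b.jpg"]]
print(len(batch))
```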

Intended Use

  • Multimodal research and vision-language evaluation
  • Image captioning and dataset generation pipelines
  • Red-teaming and robustness testing of VLMs
  • Creative and descriptive visual storytelling tasks
  • AI system prototyping with image-text reasoning components
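
For the captioning and dataset-generation use case, one common pattern is to write one JSON record per image. The sketch below assumes a caption function is supplied by the caller (for example, a wrapper around the Quick Start code). `write_caption_records` and `fake_caption` are hypothetical names, and the stub stands in for real model output.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def write_caption_records(image_refs: Iterable[str],
                          caption_fn: Callable[[str], str],
                          out: TextIO) -> int:
    """Write one JSON line per image and return the number of records."""
    count = 0
    for ref in image_refs:
        record = {"image": ref, "caption": caption_fn(ref)}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def fake_caption(ref: str) -> str:
    # Placeholder for a real model call; it does not run any model.
    return f"placeholder caption for {ref}"


buffer = io.StringIO()
n = write_caption_records(["a.jpg", "b.jpg"], fake_caption, buffer)
print(n)
print(buffer.getvalue(), end="")
```

Passing the caption function in as an argument keeps the file-writing logic testable without loading the model.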

Limitations & Risks

Important Note: This model inherits the behavioral characteristics of its base model (huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated) and of that model's fine-tuning process.

  • Performance depends on image quality, prompt clarity, and decoding settings
  • May produce incomplete or inconsistent reasoning in complex visual scenes
  • Requires moderate to high VRAM for stable inference depending on resolution
  • Output quality varies across domains such as medical, artistic, or technical imagery
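
As a rough guide to the VRAM point above, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. The model page reports 9B params in BF16 (2 bytes each). The function name is hypothetical, and the figure excludes activations, the KV cache, and visual tokens, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GiB.

    bytes_per_param defaults to 2, which corresponds to BF16.
    """
    return num_params * bytes_per_param / 1024**3


# 9B parameters in BF16, as listed on the model page.
print(round(weight_memory_gib(9e9), 1))  # 16.8
```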

Model size: 9B params (Safetensors, tensor type BF16)