URL: https://huggingface.co/prithivMLmods/Caption-Pro

Caption-Pro

Caption-Pro is an image captioning and annotation model optimized for producing detailed, structured JSON output. Built on a vision-language architecture (Qwen2-VL) with enhanced OCR and multilingual support, it extracts captions and annotations from images for integration into your applications.

Key Enhancements:

  • Advanced Image Understanding: Fine-tuned on millions of annotated images, Caption-Pro delivers precise comprehension and interpretation of visual content.
  • Optimized for JSON Output: Produces structured JSON containing captions and detailed annotations, suitable for integration with databases, APIs, and automation pipelines.
  • Enhanced OCR Capabilities: Accurately extracts textual content from images in multiple languages, including English, Chinese, Japanese, Korean, Arabic, and more.
  • Multimodal Processing: Seamlessly handles both image and text inputs, generating comprehensive annotations based on the provided image.
  • Multilingual Support: Recognizes and processes text within images across various languages.
  • Secure and Optimized Model Weights: Employs safetensors for efficient and secure model loading.

How to Use

import torch  # needed for torch.bfloat16 in the optional flash-attention variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the Caption-Pro model with optimized parameters
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Caption-Pro", torch_dtype="auto", device_map="auto"
)

# Recommended acceleration for performance optimization
# (requires the flash-attn package and a supported GPU):
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Caption-Pro",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the default processor for Caption-Pro
processor = AutoProcessor.from_pretrained("prithivMLmods/Caption-Pro")

# Define the input messages with both an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://flux-generated.com/sample_image.jpeg",
            },
            {"type": "text", "text": "Provide detailed captions and annotations for this image in JSON format."},
        ],
    }
]

# Prepare the input for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
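
The decoded output above is a list of plain strings, so the JSON still has to be parsed before it can be used downstream. The following is a minimal sketch of one way to do that, assuming the model may wrap its JSON in a Markdown code fence or surround it with extra text. The helper name `extract_json` and the `caption` field in the sample are hypothetical; the model card does not specify an output schema.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example fence-safe
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(raw: str):
    """Return the first JSON object or array found in `raw`, or None if there is none."""
    # Prefer the contents of a fenced block when the model emitted one.
    fenced = FENCED_BLOCK.search(raw)
    candidate = fenced.group(1) if fenced else raw

    decoder = json.JSONDecoder()
    # Try to decode from every position that could start an object or array.
    for match in re.finditer(r"[{\[]", candidate):
        try:
            value, _end = decoder.raw_decode(candidate, match.start())
            return value
        except json.JSONDecodeError:
            continue
    return None


# Hypothetical model output, for illustration only.
sample = "Result:\n" + FENCE + 'json\n{"caption": "a dog"}\n' + FENCE
parsed = extract_json(sample)
```

With the snippet from this section, the call would be `extract_json(output_text[0])`. A `None` result can also mean the JSON was cut off by the `max_new_tokens=256` limit, in which case raising that limit is the first thing to try.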

Key Features

  1. Annotation-Ready Training Data
    • Trained using a diverse dataset of annotated images to ensure high-quality structured output.

  2. Optical Character Recognition (OCR)
    • Robustly extracts and processes text from images in various languages and scripts.

  3. Structured JSON Output
    • Generates detailed captions and annotations in standardized JSON format for easy downstream integration.

  4. Image & Text Processing
    • Capable of handling both visual and textual inputs, delivering comprehensive and context-aware annotations.

  5. Conversational Annotation Generation
    • Supports multi-turn interactions, enabling detailed and iterative refinement of annotations.

  6. Secure and Efficient Model Weights
    • Uses safetensors for enhanced security and optimized model performance.
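
The multi-turn refinement mentioned in item 5 can be sketched as extending the `messages` list from "How to Use" with the model's previous reply and a follow-up request. This is an illustrative sketch only: the helper `add_turn` is hypothetical, and it assumes the chat template accepts assistant turns in the same list-of-parts format used for the user turn.

```python
def add_turn(messages, assistant_reply, follow_up):
    """Return a new message list with the model's reply and a follow-up user prompt appended.

    The input list is left unmodified.
    """
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up}]},
    ]


# First turn, in the same shape as the example in "How to Use".
first_turn = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://flux-generated.com/sample_image.jpeg"},
            {"type": "text", "text": "Provide detailed captions and annotations for this image in JSON format."},
        ],
    }
]

# Hypothetical previous reply and refinement request.
conversation = add_turn(
    first_turn,
    assistant_reply='{"caption": "a dog"}',
    follow_up="Add any text visible in the image to the annotations.",
)
```

The extended list would then go through the same `apply_chat_template`, `process_vision_info`, and `generate` steps as a single-turn request.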

Caption-Pro streamlines the process of generating image captions and annotations, making it an ideal solution for applications that require detailed visual content analysis and structured data integration.

Model details

  • Format: Safetensors
  • Model size: 2B params
  • Tensor type: BF16
  • Base model: Qwen/Qwen2-VL-2B (Caption-Pro is a fine-tune of it)
  • Derived models: 1 merge, 2 quantizations