Voozh

A fine-tuned version of the moondream2 model, trained on the gokaygokay/random_instruct_docci dataset. It produces extremely detailed image captions.

pip install transformers timm einops bitsandbytes accelerate flash-attn

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image

DEVICE = "cuda"  # set to "cpu" if no GPU is available
# float16 is not well supported on CPU, so fall back to float32 there
DTYPE = torch.float32 if DEVICE == "cpu" else torch.float16
MODEL_ID = "fal-ai/moondream2-docci-instruct"
revision = "3ec40c7b6b5d87bc0c51edee45e21f5f29b449d8"  # pinned model revision

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    revision=revision,
)
# FlashAttention 2 needs a CUDA GPU, so only request it when running on one
attn_kwargs = {"attn_implementation": "flash_attention_2"} if DEVICE == "cuda" else {}
moondream = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    device_map={"": DEVICE},
    revision=revision,
    **attn_kwargs,
)
moondream.eval()  # inference mode

image_path = "<your_image_path>"
image = Image.open(image_path).convert("RGB")
# Encode the image once, then ask the model a question about it
md_answer = moondream.answer_question(
    moondream.encode_image(image),
    "what is this picture about",
    tokenizer=tokenizer,
)

print(md_answer)
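
The snippet above hard-codes the device and always requests FlashAttention 2, which only works on a CUDA GPU with the flash-attn package installed. The following is a minimal sketch of how those settings could be picked automatically. The helper names `pick_load_settings` and `detect_settings` are hypothetical and not part of this model or of transformers; the dtype is returned as a string so the decision logic can be checked without a GPU.

```python
import importlib.util


def pick_load_settings(cuda_available: bool, flash_attn_installed: bool) -> dict:
    """Hypothetical helper: choose device, dtype name, and attention backend."""
    if not cuda_available:
        # On CPU use float32 and skip FlashAttention, which needs a GPU.
        return {"device": "cpu", "dtype": "float32", "attn_implementation": None}
    settings = {"device": "cuda", "dtype": "float16", "attn_implementation": None}
    if flash_attn_installed:
        settings["attn_implementation"] = "flash_attention_2"
    return settings


def detect_settings() -> dict:
    """Probe the local environment; treats a missing torch install as CPU-only."""
    try:
        import torch

        cuda = torch.cuda.is_available()
    except ImportError:
        cuda = False
    # find_spec returns None when the flash_attn package is not installed.
    flash = importlib.util.find_spec("flash_attn") is not None
    return pick_load_settings(cuda, flash)


if __name__ == "__main__":
    print(detect_settings())
```

To use the result, pass `getattr(torch, settings["dtype"])` as `torch_dtype` and `{"": settings["device"]}` as `device_map`, and add `attn_implementation` to the `from_pretrained` call only when its value is not `None`.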

Model size: 2B params (Safetensors)
Tensor type: F16

URL: https://huggingface.co/fal/moondream2-docci-instruct
