URL: https://huggingface.co/mahmoudalrefaey/clarivis-r1


📊 Clarivis-R1

Chart Reasoning Vision-Language Model

Structured chain-of-thought reasoning over complex chart images



🧠 Overview

Clarivis-R1 is a vision-language model fine-tuned for complex chart reasoning. Given a chart image and an analytical question, the model returns an explicit chain-of-thought trace followed by a concise final answer, structured as:

<think> ... step-by-step reasoning ... </think>
<answer> ... final answer ... </answer>

Built on google/gemma-4-E2B-it and trained on 300,000 chart reasoning samples from IBM Granite's ChartNet dataset, Clarivis-R1 targets use cases where interpretability matters: document intelligence, chart QA, RAG pipelines, and agentic workflows.
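
Downstream code usually needs the reasoning trace and the final answer as separate fields. The following is a minimal sketch of one way to split a response in the <think>/<answer> format described above. The names `ParsedResponse` and `parse_response` are hypothetical helpers, not part of the model or of any library, and the sketch assumes each tag appears at most once.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    """Reasoning trace and final answer; None when a tag pair is missing."""
    reasoning: Optional[str]
    answer: Optional[str]


# DOTALL lets "." match newlines, since both blocks usually span several lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Extract the first <think> and <answer> blocks from generated text."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


parsed = parse_response("<think>\nstep 1\n</think>\n<answer>\nTV-Y7\n</answer>")
print(parsed.answer)
```

Returning None for a missing block, rather than raising, lets a caller decide how to treat truncated generations, for example when max_new_tokens cuts off the closing tag.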


📈 Benchmark Results

Benchmark | Split | Metric | Score
ChartQA | test (2,500 samples) | Relaxed accuracy (±5%) | 51.64%

Evaluated using greedy decoding (do_sample=False, max_new_tokens=1024). Relaxed accuracy accepts a numerical answer that falls within ±5% of the reference value, which is the standard ChartQA evaluation protocol.
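
To make the metric concrete, here is a rough sketch of relaxed accuracy as it is commonly implemented: numeric answers count as correct within a 5% relative error, and everything else falls back to case-insensitive exact match. This is an illustration under those assumptions, not the evaluation script used for the score above; the helper names and the number-cleaning rules (stripping commas and a trailing percent sign) are hypothetical.

```python
from typing import Optional, Sequence


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer such as '1,250' or '12.5%'; None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """True if the prediction matches the target under the relaxed rule."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            # Relative error is undefined at zero, so require an exact match.
            return pred_num == 0.0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    # Non-numeric answers: case-insensitive exact match.
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions: Sequence[str], targets: Sequence[str],
                     tolerance: float = 0.05) -> float:
    """Fraction of predictions that pass relaxed_match."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    hits = sum(relaxed_match(p, t, tolerance) for p, t in pairs)
    return hits / len(pairs)


print(relaxed_accuracy(["104", "TV-Y7", "106"], ["100", "tv-y7", "100"]))
```

On a 2,500-question test split, a score of 51.64% corresponds to 1,291 answers accepted under this kind of rule.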


💡 Example Output

Chart: Bar chart of Netflix content ratings by type (Movies vs TV Shows)

Question:

"Which rating category has the most balanced distribution between Movies and TV Shows?"

Response:

<think>
The question asks for the rating category where the number of Movies and
TV Shows are most balanced, i.e., where counts are closest to each other.

From the chart:
- TV-MA: Movies ≈ 2050, TV Shows ≈ 1150 → difference ≈ 900
- TV-14: Movies ≈ 1430, TV Shows ≈ 700 → difference ≈ 730
- TV-PG: Movies ≈ 540, TV Shows ≈ 320 → difference ≈ 220
- TV-Y7: Movies ≈ 150, TV Shows ≈ 200 → difference ≈ 50 ✓
- TVY: Movies ≈ 130, TV Shows ≈ 170 → difference ≈ 40

The smallest absolute difference belongs to TVY and TV-Y7.
Considering both absolute difference and visual proximity, TV-Y7
is the most balanced category.
</think>

<answer>
TV-Y7
</answer>

🏋️ Training Details

Base Model

google/gemma-4-E2B-it: nominally 2B parameters (the parameter count under Infrastructure reports about 5.16B in total), 4-bit quantized via Unsloth

Dataset

Property | Value
Source | ibm-granite/ChartNet (reasoning split)
Size | 300,000 samples
Format | (chart image, complex question, full reasoning chain) triplets
Chart types | Bar, stacked bar, line, bubble, scatter, grouped bar, area, and more
Answer format | <think>...</think><answer>...</answer>

LoRA Configuration

r = 32
lora_alpha = 32
lora_dropout = 0.05
target_modules = "all-linear"
bias = "none"

# All layer types fine-tuned:
finetune_vision_layers = True
finetune_language_layers = True
finetune_attention_modules = True
finetune_mlp_modules = True
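
As a rough guide to what r and lora_alpha mean here, the sketch below computes the LoRA scaling factor and the number of trainable parameters a rank-r adapter adds to one linear layer. It assumes the standard LoRA formulation (update scaled by lora_alpha / r, not the rank-stabilized variant, which the configuration above does not mention). The 2048 x 2048 layer size is a made-up example, not a dimension taken from the model.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_params_for_linear(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (r, in_features) and B has shape (out_features, r),
    so the adapter adds r * (in_features + out_features) weights.
    """
    return r * (in_features + out_features)


# Values from the configuration above.
scale = lora_scaling(r=32, lora_alpha=32)

# Hypothetical layer size, used only to illustrate the formula.
added = lora_params_for_linear(in_features=2048, out_features=2048, r=32)

print(scale, added)
```

With r equal to lora_alpha the update is applied at unit scale, so the learning rate alone controls how strongly the adapters move the base weights.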

Training Configuration

per_device_train_batch_size = 1
gradient_accumulation_steps = 4 # effective batch size = 4
learning_rate = 2e-5
lr_scheduler_type = "cosine"
warmup_ratio = 0.03
num_train_epochs = 1
max_seq_length = 2048
optim = "adamw_torch"
max_grad_norm = 0.5
weight_decay = 0.001
dataloader_num_workers = 0 # required for PIL image safety
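
The values above imply a fixed step budget and learning-rate shape. The following sketch derives steps per epoch and warmup steps from the configuration, then evaluates a linear-warmup plus cosine-decay schedule. It assumes a single GPU, a dataset of exactly 300,000 samples, and the usual form of a cosine schedule with warmup; `lr_at` is a hypothetical helper written for illustration, not the trainer's own scheduler.

```python
import math

DATASET_SIZE = 300_000       # samples, from the Dataset table
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 1                 # assumption: single-GPU training
BASE_LR = 2e-5
WARMUP_RATIO = 0.03

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
steps_per_epoch = math.ceil(DATASET_SIZE / effective_batch)
warmup_steps = math.ceil(steps_per_epoch * WARMUP_RATIO)


def lr_at(step: int, total_steps: int = steps_per_epoch) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, warmup_steps)
print(lr_at(warmup_steps), lr_at(steps_per_epoch // 2))
```

Under these assumptions one epoch is 75,000 optimizer steps, of which the first 2,250 are warmup.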

Infrastructure

Property | Value
Platform | AWS SageMaker (single GPU)
Framework | Unsloth + Hugging Face TRL
Total steps | ~72,750 of 75,000 (97% of 1 epoch)
Training time | ~5 days
Trainable params | 57,360,384 of 5,161,657,888 (1.11%)
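
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below recomputes the epoch fraction, the number of samples actually seen, the trainable-parameter share, and an approximate time per optimizer step. It assumes an effective batch size of 4 and treats "~5 days" as exactly 5 days, so the per-step time is only a ballpark.

```python
COMPLETED_STEPS = 72_750
PLANNED_STEPS = 75_000
EFFECTIVE_BATCH = 4            # per_device_train_batch_size * gradient_accumulation_steps
TRAINABLE_PARAMS = 57_360_384
TOTAL_PARAMS = 5_161_657_888
TRAINING_DAYS = 5              # approximate, per the table

epoch_fraction = COMPLETED_STEPS / PLANNED_STEPS
samples_seen = COMPLETED_STEPS * EFFECTIVE_BATCH
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
seconds_per_step = TRAINING_DAYS * 24 * 3600 / COMPLETED_STEPS

print(f"epoch fraction:   {epoch_fraction:.2%}")
print(f"samples seen:     {samples_seen:,}")
print(f"trainable share:  {trainable_pct:.2f}%")
print(f"seconds per step: {seconds_per_step:.1f}")
```

Stopping at step 72,750 means roughly 291,000 of the 300,000 samples were used for training.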

Loss Curve

Step | Training loss
100 | 2.39
5,000 | 0.43
14,000 | 0.39
24,000 | 0.376
44,000 | 0.361
72,750 | ~0.35
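
One way to read these numbers is as perplexity, the exponential of the loss. The sketch below does that conversion, assuming the reported training loss is a mean per-token cross-entropy in nats, which is typical for supervised fine-tuning but is not stated in the table.

```python
import math

# (step, training loss) pairs from the table above.
LOSS_CURVE = [
    (100, 2.39),
    (5_000, 0.43),
    (14_000, 0.39),
    (24_000, 0.376),
    (44_000, 0.361),
    (72_750, 0.35),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


for step, loss in LOSS_CURVE:
    print(f"step {step:>6}: loss {loss:.3f} -> perplexity {perplexity(loss):.2f}")
```

Read this way, most of the improvement happens in the first few thousand steps, and the remainder of the epoch refines an already low loss.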

🚀 Quick Start

Installation

pip install unsloth transformers torch pillow

Inference

# ── Imports ──────────────────────────────────────────────────────────
# Unsloth recommends importing it before transformers.
from unsloth import FastVisionModel
from PIL import Image
from transformers import TextStreamer

# ── Load model ───────────────────────────────────────────────────────
model, processor = FastVisionModel.from_pretrained(
    model_name="mahmoudalrefaey/clarivis-r1",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

# ── Input ────────────────────────────────────────────────────────────
image = Image.open("your_chart.png").convert("RGB")

instruction = "Your question here."

messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]

# ── Tokenize ─────────────────────────────────────────────────────────
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    image,
    input_text,
    add_special_tokens=False,  # the chat template already inserted them
    return_tensors="pt",
).to("cuda")

# ── Generate ─────────────────────────────────────────────────────────
text_streamer = TextStreamer(processor, skip_prompt=True)  # print tokens as they arrive

result = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=2048,
    use_cache=True,
    do_sample=False,  # greedy (deterministic) decoding
)

⚠️ Important: Use do_sample=False (greedy, deterministic decoding) for chart reasoning tasks. Do not pass temperature=0.0 to request deterministic output: it raises a ValueError.
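
The reason a zero temperature is rejected can be seen in a toy version of the sampling step: temperature divides the logits, so zero is undefined, while greedy decoding simply takes the highest-scoring token. The functions below are an illustrative sketch written for this note, not the transformers implementation.

```python
import math
from typing import List, Sequence


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the highest logit: what do_sample=False selects at each step."""
    return max(range(len(logits)), key=lambda i: logits[i])


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Sampling distribution after temperature scaling."""
    if temperature <= 0:
        # Dividing by zero is undefined, so a zero temperature has to be rejected.
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.1, 2.5, -1.0]
print(greedy_pick(logits))
print(softmax_with_temperature(logits, temperature=0.1))
```

As the temperature shrinks toward zero, the distribution concentrates on the token that greedy_pick returns, which is why greedy decoding is the right way to ask for the deterministic limit.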


✅ Intended Use

Use case | Status
Chart question answering | ✅ Supported
Structured reasoning traces | ✅ Supported
RAG pipelines over chart documents | ✅ Supported
Document intelligence | ✅ Supported
Agentic workflows | ✅ Supported
Precise small-value bar reading | ⚠️ May hallucinate
Stacked segment decomposition | ⚠️ Limited at 2B scale

⚠️ Limitations

  • Stacked bar decomposition: Precisely isolating individual colored segments in stacked charts is challenging at the 2B parameter scale; total bar heights are read more reliably than individual segments
  • Small value precision: Values in the single-digit to ~20 range on crowded axes may be approximated or hallucinated
  • Single epoch training: Training covered about 97% of one epoch, so each sample was seen at most once
  • Out-of-domain charts: Performance may degrade on chart styles not well-represented in ChartNet (e.g. Sankey diagrams, radar charts)

📄 Citation

If you use Clarivis-R1 in your research or build on it, please cite:

@misc{clarivis-r1-2025,
 author = {Mahmoud Alrefaey},
 title = {Clarivis-R1: Chart Reasoning Vision-Language Model},
 year = {2025},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/mahmoudalrefaey/clarivis-r1}},
}

📜 License

This model is built on Gemma 4 and is subject to the Gemma Terms of Use. Please review the license terms before any commercial use.


🙏 Acknowledgements

📦 Dataset | IBM Granite ChartNet
⚡ Training | Unsloth (2x faster fine-tuning)
🤗 Framework | Hugging Face TRL (SFTTrainer)
🧱 Base Model | Google Gemma 4

Built with ❤️ by Mahmoud Alrefaey
Safetensors · Model size: 5B params · Tensor type: BF16
