Omni-R1-Zero

Overview

Omni-R1-Zero is trained without multimodal annotations. It bootstraps step-wise visualizations from text-only chain-of-thought (CoT) seeds such as M3CoT, then follows the same PeSFT+PeRPO recipe as Omni-R1 to learn interleaved multimodal reasoning.

Usage

# 1) Import & load
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "ModalityDance/Omni-R1-Zero"  # or a local checkpoint path
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input
prompt = "You are a helpful assistant.\nUser: Which of these would appear shinier when polished? A. Metal spoon B. Wooden spoon\nThink with images first, the image reasoning process and answer are enclosed within <reserved12856> <reserved12857> and <reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"

inputs = processor(
    prompt,
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# 3) Call the model
outputs = model.generate(
    **inputs,
    max_length=4096,  # cap on total sequence length (prompt + generated tokens)
    do_sample=True,   # sample instead of greedy decoding
    temperature=1.0,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
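
The prompt above asks the model to wrap its image reasoning in <reserved12856> ... <reserved12857> and its answer in <reserved12866> ... <reserved12867>. A minimal sketch of pulling those two spans out of the decoded string follows. It assumes the tags appear as literal substrings in the decoded text and that the completion starts after the last "Assistant:" marker; the helper names `extract_span` and `parse_output` are hypothetical and not part of the official repository. Image tokens inside the reasoning span are not rendered here; the repository scripts handle interleaved decoding.

```python
import re

# Tag strings copied from the prompt above. Treating them as literal
# substrings of the decoded output is an assumption of this sketch.
REASONING_TAGS = ("<reserved12856>", "<reserved12857>")
ANSWER_TAGS = ("<reserved12866>", "<reserved12867>")


def extract_span(text, tags):
    """Return the stripped text between the first open/close tag pair, or None."""
    open_tag, close_tag = tags
    pattern = re.escape(open_tag) + r"(.*?)" + re.escape(close_tag)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_output(decoded):
    """Split a decoded generation into its reasoning and answer spans."""
    # The prompt itself mentions the tags, so keep only the text after the
    # last "Assistant:" marker before searching (assumed prompt layout).
    completion = decoded.rsplit("Assistant:", 1)[-1]
    return {
        "reasoning": extract_span(completion, REASONING_TAGS),
        "answer": extract_span(completion, ANSWER_TAGS),
    }


if __name__ == "__main__":
    # Synthetic string for illustration only; not real model output.
    demo = (
        "User: ... <reserved12856> <reserved12857> and "
        "<reserved12866> <reserved12867> ...\nAssistant: "
        "<reserved12856>Metal reflects light.<reserved12857>"
        "<reserved12866>A<reserved12867>"
    )
    print(parse_output(demo))
```

With the usage example above, `parse_output(text)` would be called on the decoded string; either field is None when the model did not emit the corresponding tag pair.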

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1

License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.

Citation

@misc{cheng2026omnir1unifiedgenerativeparadigm,
 title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, 
 author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
 year={2026},
 eprint={2601.09536},
 archivePrefix={arXiv},
 primaryClass={cs.AI},
 url={https://arxiv.org/abs/2601.09536}, 
}