Motion-O: Motion-Aware Trajectory Reasoning for Video

Motion-O is a family of Qwen2.5-VL models fine-tuned for motion-aware trajectory reasoning in videos. The models are introduced in the paper "Motion-o: Trajectory-Grounded Video Reasoning" (arXiv:2603.18856).

The models learn to produce structured <think>...</think> chains with <obj>, <box>, <t>, and <motion> tags that describe object motion over time, and to answer a final question about the video.
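
The exact layout of the tags is not spelled out here, so the following is only a minimal sketch of how such a response might be post-processed. It assumes that each reasoning step is written as `<obj>…</obj><box>[x1, y1, x2, y2]</box><t>seconds</t>`, optionally followed by `<motion>…</motion>`, and that the final answer follows `</think>`. The `MotionStep` class, the `parse_response` function, and the sample string are hypothetical and do not come from the Motion-O code base.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MotionStep:
    obj: str                 # object description from <obj>
    box: Tuple[float, ...]   # numbers found inside <box>
    t: float                 # timestamp from <t>
    motion: Optional[str]    # text from <motion>, if present


# ASSUMED layout (not confirmed by the model card):
# <obj>..</obj><box>..</box><t>..</t> with an optional trailing <motion>..</motion>.
STEP_RE = re.compile(
    r"<obj>(?P<obj>.*?)</obj>\s*"
    r"<box>(?P<box>.*?)</box>\s*"
    r"<t>(?P<t>.*?)</t>"
    r"(?:\s*<motion>(?P<motion>.*?)</motion>)?",
    re.DOTALL,
)
NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_response(text: str) -> Tuple[List[MotionStep], str]:
    """Return the parsed reasoning steps and the text that follows </think>."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = think.group(1) if think else ""
    answer = text[think.end():].strip() if think else text.strip()

    steps: List[MotionStep] = []
    for m in STEP_RE.finditer(reasoning):
        t_match = NUM_RE.search(m.group("t"))
        if t_match is None:
            continue  # skip steps without a numeric timestamp
        box = tuple(float(v) for v in NUM_RE.findall(m.group("box")))
        motion = m.group("motion")
        steps.append(
            MotionStep(
                obj=m.group("obj").strip(),
                box=box,
                t=float(t_match.group()),
                motion=motion.strip() if motion else None,
            )
        )
    return steps, answer


# Hypothetical response string, written by hand for illustration only.
sample = (
    "<think>The <obj>red car</obj><box>[10, 20, 110, 80]</box><t>1.0</t> "
    "<motion>moving right</motion> then <obj>red car</obj>"
    "<box>[60, 20, 160, 80]</box><t>2.5</t></think> The car moves right."
)
steps, answer = parse_response(sample)
```

If the real outputs order or nest the tags differently, only `STEP_RE` should need to change; the rest of the function just collects whatever the pattern matches.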

Available variants

The default variant lives at the repository root; the other two live in subfolders. Each entry gives the location in this repository and the checkpoint path, then the variant's name and description:

- (root) – grpo_dense_t07_4737145/checkpoint-800/merged
  - Name: Motion-O (no visual grounding)
  - Description: GRPO on STGR with motion-aware rewards; no explicit open-o3 visual grounding.
- open-o3-mcot – open-o3_grpo_v3_074917638/checkpoint-600/merged
  - Name: Open-o3 + MCoT (with visual grounding)
  - Description: Open-o3-Video style model with multi-chain-of-thought and explicit visual grounding.
- open-o3-mcot-no-vg – open-o3_grpo_v2_4896760/checkpoint-1000/merged
  - Name: Open-o3 + MCoT (no visual grounding)
  - Description: Same training recipe as open-o3-mcot, but without the additional visual-grounding objective.

How to load

# Qwen2.5-VL checkpoints are loaded with the vision-language class rather than
# AutoModelForCausalLM; this needs a transformers release with Qwen2.5-VL support.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

REPO_ID = "bishoygaloaa/motion-o"

# 1) Motion-O (no visual grounding) – repo root
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID)

# 2) Open-o3 + MCoT (with visual grounding)
model_vg = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    REPO_ID,
    subfolder="open-o3-mcot",
    torch_dtype="auto",
)
processor_vg = AutoProcessor.from_pretrained(
    REPO_ID,
    subfolder="open-o3-mcot",
)

# 3) Open-o3 + MCoT (no visual grounding)
model_no_vg = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    REPO_ID,
    subfolder="open-o3-mcot-no-vg",
    torch_dtype="auto",
)
processor_no_vg = AutoProcessor.from_pretrained(
    REPO_ID,
    subfolder="open-o3-mcot-no-vg",
)
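
Because the three variants differ only in the `subfolder` argument, the repeated calls can be driven from a small lookup table. The sketch below is one possible way to do that; the `pretrained_args` helper and the `"motion-o"` key for the repo-root variant are hypothetical names chosen for illustration, and the snippet builds arguments only, without downloading anything.

```python
from typing import Dict, Optional

REPO_ID = "bishoygaloaa/motion-o"

# Variant key -> subfolder inside the repository (None means the repo root).
# The "motion-o" key is a made-up label for the root variant.
VARIANT_SUBFOLDERS: Dict[str, Optional[str]] = {
    "motion-o": None,
    "open-o3-mcot": "open-o3-mcot",
    "open-o3-mcot-no-vg": "open-o3-mcot-no-vg",
}


def pretrained_args(variant: str) -> Dict[str, str]:
    """Build the keyword arguments that select one variant in from_pretrained."""
    if variant not in VARIANT_SUBFOLDERS:
        raise ValueError(
            f"unknown variant {variant!r}; choose from {sorted(VARIANT_SUBFOLDERS)}"
        )
    args = {"pretrained_model_name_or_path": REPO_ID}
    subfolder = VARIANT_SUBFOLDERS[variant]
    if subfolder is not None:
        args["subfolder"] = subfolder
    return args


# Example use with the processor (commented out to avoid a download here):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(**pretrained_args("open-o3-mcot"))
```

The same dictionary can be unpacked into the model's `from_pretrained` call together with `torch_dtype="auto"`.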

Citation

If you use Motion-O in your work, please cite:

@article{galoaa2026motion,
 title = {Motion-Aware Trajectory Reasoning for Video Understanding},
 author = {Galoaa, Bishoy* and Moezzi, Shayda* and Bai, Xiangyu and Ostadabbas, Sarah},
 journal = {arXiv preprint arXiv:2603.18856},
 year = {2026},
 url = {https://arxiv.org/abs/2603.18856}
}

Base model: Qwen/Qwen2.5-VL-7B-Instruct

Paper: arXiv:2603.18856 (published Mar 19)

URL: https://huggingface.co/bishoygaloaa/motion-o