VOOZH about

URL: https://huggingface.co/TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT

⇱ TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT Β· Hugging Face


πŸ‘ MapTrace Task Illustration

Qwen3.5-0.8B β€” MapTrace Path Planning (Partial Fine-Tune)

This model is a Partial Fine-Tune of Qwen/Qwen3.5-0.8B specifically trained for visual path planning on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.


πŸ—ΊοΈ Task Description

MapTrace is a visual navigation benchmark that requires a model to:

  1. Understand a map image with a marked start location (green dot) and end location (red dot).
  2. Trace a realistic, traversable path between the two points.
  3. Output the path as a list of normalized (x, y) coordinates: [(x1, y1), (x2, y2), ...]

Example Input

You are provided an image of a path with a start location denoted in green 
and an end location denoted in red. 
The normalized xy-coordinates of the start location are (0.77, 0.47) and 
of the end location (0.54, 0.54). 
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] 
of the path between the start and end location. 
Ensure that the path follows the traversable locations of the map.

Example Output

[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]

πŸ‹οΈ Training Strategy

The model was trained using a Partial Fine-Tuning approach to balance quality and computational efficiency:

Setting Value
Base Model Qwen/Qwen3.5-0.8B
Strategy Partial Fine-Tuning (Last 6 of 24 layers)
Dataset google/MapTrace (maptrace_20k subset)
Vision Encoder βœ… Trained (not frozen)
Effective Batch Size 128 (1 per GPU Γ— 32 grad_accum Γ— 4 GPUs)
Learning Rate 1e-5 with Warmup (100 steps)
Max Sequence Length 2048 tokens
Mixed Precision bf16
Gradient Checkpointing βœ… Enabled
Image Resolution Cap 1024 Γ— 768 pixels (VRAM safety)
Hardware 4Γ— NVIDIA RTX 4090 24GB

Why Partial Fine-Tuning?

Instead of full fine-tuning or LoRA, we opted for unlocking the last 6 transformer layers and the visual encoder for gradient updates. This approach:

  • Preserves the model's general language understanding from the foundation weights.
  • Directly adapts the "reasoning" layers to the spatial coordinate prediction task.
  • Avoids LoRA's approximation error β€” weights are modified directly.

Training Highlights

  • Resumed from checkpoint-50 β†’ checkpoint-100 β†’ final across multiple sessions.
  • VRAM stability was achieved by setting max_pixels=1024x768 as a safety cap, preventing OOM spikes from high-resolution map images.
  • Loss masking was applied so the model only learns from the assistant's coordinate response, not the user's prompt or image tokens.
  • Final Token Accuracy: ~~84.5%, Final Loss: ~0.37.

πŸš€ Inference

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
 model_id,
 torch_dtype=torch.bfloat16,
 device_map="auto",
 trust_remote_code=True,
)
model.eval()

def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
 image = Image.open(image_path).convert("RGB")
 
 prompt_text = (
 f"You are provided an image of a path with a start location denoted in green "
 f"and an end location denoted in red. \n"
 f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
 f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
 f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
 f"of the path between the start and end location. \n"
 f"Ensure that the path follows the traversable locations of the map."
 )

 messages = [
 {
 "role": "user",
 "content": [
 {"type": "image"},
 {"type": "text", "text": prompt_text},
 ],
 }
 ]

 text = processor.apply_chat_template(
 messages,
 tokenize=False,
 add_generation_prompt=True,
 )

 inputs = processor(
 text=[text],
 images=[image],
 return_tensors="pt",
 padding=True,
 min_pixels=256 * 28 * 28,
 max_pixels=1024 * 768,
 ).to(model.device)

 with torch.no_grad():
 generated_ids = model.generate(
 **inputs,
 max_new_tokens=512,
 do_sample=False,
 temperature=0.0,
 )
 
 output = processor.decode(
 generated_ids[0][inputs.input_ids.shape[1]:],
 skip_special_tokens=True
 )
 return output.strip()


# Example usage
result = predict_path(
 image_path="map.png",
 start_xy=(0.7695, 0.4727),
 end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]

Important: Use the raw prompt format shown above (not apply_chat_template with add_generation_prompt=True). Since the training data did not include <think> tokens, applying the default Qwen3.5 template would inject thinking tags and cause degraded output.


πŸ“Š Training Metrics

Epoch Step Loss Token Accuracy Learning Rate
0.50 71 0.5044 70.35% 6.9e-06
...
0.71 100 0.4614 81.57% 9.9e-06
0.85 120 0.4294 82.71% 9.73e-06
1.00 141 0.4147 83.29% 9.39e-06
1.42 200 0.3841 84.32% 4.31e-06
1.50 211 0.3741 84.50% 3.47e-06
1.63 224 0.3669 84.82% 1.95e-06

πŸ—‚οΈ Dataset

The model was trained on google/MapTrace (the maptrace_20k subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.


⚠️ Limitations

  • Map-Specific: The model is trained exclusively for top-down map images; it is not a general path planner.
  • Coordinate Precision: Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
  • Resolution Cap: Images larger than 1024Γ—768 will be downscaled to this resolution during inference to match training conditions.

πŸ“œ License

This model is released under the Apache 2.0 license, consistent with the base model.

Downloads last month
5
Safetensors
Model size
0.9B params
Tensor type
BF16
Β·

Model tree for TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT

Finetuned
(232)
this model

Dataset used to train TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT

Space using TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT 1