URL: https://huggingface.co/prithivMLmods/proxima-ocr-d.markdown-post3.0.l

proxima-ocr-d.markdown-post3.0.l

proxima-ocr-d.markdown-post3.0.l is an experimental multimodal document AI model fine-tuned from Qwen3-VL-8B-Instruct and optimized for high-precision OCR and structured document reconstruction. The model converts documents into Markdown, HTML-Markdown, and hybrid enriched documentation formats. These outputs can embed inline code in several programming languages and reconstruct complex layouts such as tables, forms, and mathematical content.

Key Enhancements

  • Dynamic Markdown Reconstruction: Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.

  • Inline Code and Language Embedding: Direct adaptation of Python, JavaScript, LaTeX, and shell syntax into reconstructed documents for technical and research documentation.

  • High Fidelity OCR and Visual Parsing: Accurate recognition of text across structured and unstructured scanned documents, including multi-page layout reasoning.

  • Complex Layout Interpretation: Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.

  • Document Retrieval and Semantic Linking: Efficient multi-page chunking with cross-reference recognition and content traceability.

  • Multimodal Long Reasoning: Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts.


Note: This model is a stage progression model, and it may currently contain artifacts.


Example Preview

[1] Markdown HTML

[Figure: input image, Markdown preview of page 1, Markdown preview of page 2]

[2] JSON Nodes

[Figure: input image, node preview of page 1, node preview of page 2]

[3] YAML Nodes

[Figure: input image, node preview of page 1, node preview of page 2]

Quick Start with Transformers

    from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    MODEL_ID = "prithivMLmods/proxima-ocr-d.markdown-post3.0.l"

    # Load the model; dtype and device placement are chosen automatically.
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )

    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # A single user turn: one image plus the conversion instruction.
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                },
                {"type": "text", "text": "Convert to Markdown."},
            ],
        }
    ]

    # Render the chat prompt and collect the vision inputs.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Move inputs to the model's device instead of hard-coding "cuda".
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=2048)
    # Drop the prompt tokens so only newly generated tokens are decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)
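
For multi-page documents, one possible approach is to run the quick-start pipeline once per page and join the results. The sketch below only prepares the inputs and merges the outputs. The helper names `build_page_messages` and `join_pages`, the file names, and the `---` page separator are illustrative assumptions, not part of the model card or of any library.

```python
from pathlib import Path


def build_page_messages(image_paths, prompt="Convert to Markdown."):
    """Return one single-turn conversation per page image.

    Each conversation uses the same message schema as the quick start.
    """
    conversations = []
    for path in image_paths:
        conversations.append([
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": str(path)},
                    {"type": "text", "text": prompt},
                ],
            }
        ])
    return conversations


def join_pages(page_outputs, separator="\n\n---\n\n"):
    """Concatenate per-page Markdown strings, skipping empty results."""
    return separator.join(text.strip() for text in page_outputs if text.strip())


# Hypothetical page scans; replace with real paths or URLs.
pages = [Path("scan_page1.png"), Path("scan_page2.png")]
conversations = build_page_messages(pages)
```

Each entry of `conversations` can be passed through `apply_chat_template`, `process_vision_info`, and `generate` exactly as `messages` is in the quick start, and the decoded strings handed to `join_pages`.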

Intended Use

  • OCR to Markdown or HTML-Markdown conversion
  • Complex document reconstruction and formatting regeneration
  • Multi-page document reasoning and retrieval
  • Table extraction and structured output transformation
  • Mathematical OCR and LaTeX conversion
  • Form extraction and structured entity generation
  • Knowledge base indexing and large document QA
  • Documentation regeneration for enterprise automation
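
As an illustration of the table extraction use case, the following sketch turns a Markdown pipe table, of the kind the model may emit, into a list of row dictionaries. It assumes the text contains a single, well-formed pipe table without escaped `|` characters; `parse_markdown_table` is a hypothetical helper, and the sample table is invented for the example rather than taken from real model output.

```python
def parse_markdown_table(markdown):
    """Parse a single Markdown pipe table into a list of row dicts."""
    # Keep only lines that look like table rows.
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def split_row(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


# Invented sample, standing in for model output.
sample = """| Item | Qty |
|------|-----|
| Pen  | 2   |
| Ink  | 10  |"""

rows = parse_markdown_table(sample)
```

All cell values stay as strings; converting quantities or dates is left to the caller, since the model card does not specify an output schema.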

Limitations

  • Accuracy may drop on severely damaged or poorly scanned images
  • Long sequences and multi-page documents require significant GPU VRAM
  • Language accuracy varies for low-resource scripts
  • Complex objects such as mixed-orientation blocks may require secondary post-processing
  • May occasionally produce formatting misalignment in highly irregular layouts
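
One common way to reduce memory pressure from large scans is to downscale each page to a fixed pixel budget before inference. This is a general technique and not something the model card prescribes; the helper `fit_to_pixel_budget` and the budget values below are assumptions chosen for illustration, and aggressive downscaling may hurt OCR accuracy on small text.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels):
    """Return (new_width, new_height) preserving aspect ratio within max_pixels."""
    if width * height <= max_pixels:
        return width, height  # already within budget, leave unchanged
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
new_width, new_height = fit_to_pixel_budget(2480, 3508, 1_500_000)
```

The resulting size can be applied with an image library of your choice, for example Pillow's `Image.resize((new_width, new_height))`, before the image is placed in the message.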

Training Details

| Parameter | Value |
|---|---|
| Dataset Size | approx. 544K entries (modular combination of open source data and synthetic document data from Gemini 3 Pro) |
| Architecture | Qwen3VLForConditionalGeneration |
| Training Time | approx. 17,040 seconds (4 h 44 m) |
| Precision | bfloat16 |
| Hardware | 4x H100 SXM (320 GB VRAM) |
| System Memory | 752 GB RAM |
| CPU | 80 vCPU |
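
The derived figures in the table can be checked with a few lines of arithmetic. The 80 GB per-GPU figure is an assumption based on the standard H100 SXM configuration; the model card only states the 320 GB total.

```python
# Training time: convert seconds to hours and minutes.
training_seconds = 17_040
hours, remainder = divmod(training_seconds, 3600)
minutes = remainder // 60

# Aggregate VRAM: 4 GPUs at an assumed 80 GB each.
gpu_count = 4
vram_per_gpu_gb = 80  # assumption: standard H100 SXM memory size
total_vram_gb = gpu_count * vram_per_gpu_gb

print(f"{hours} h {minutes} m, {total_vram_gb} GB VRAM")
```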

Model size: 9B params (Safetensors, tensor type BF16)
