
URL: https://apify.com/shoebill-dev27/doc-structure-extractor

Document Structure Extractor - Markdown to JSON outline · Apify


๐Ÿ‘ Document Structure Extractor โ€” Markdown to JSON outline avatar

Document Structure Extractor โ€” Markdown to JSON outline

Pricing

Pay per usage


Turn Markdown documents into structured JSON: nested heading tree with section text, fenced code blocks, links, parsed tables, and size statistics. Pure parsing, no LLM cost.


Rating: 0.0 (0)

Developer: Shinobu Otani (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 6 days ago

Document Structure Extractor

Turn Markdown documents into structured JSON: heading tree, section text, code blocks, links, and parsed tables. Pure parsing, deterministic, no LLM cost.

What it does

For each input document it extracts:

  • Title (first # heading) and preamble text
  • Nested section tree: level, heading, body text, character counts, children. Fenced code blocks are never miscounted as headings.
  • Code blocks with language tags and line numbers
  • Links ([text](url))
  • Tables parsed into header + rows
  • Stats: lines, characters, heading and code-block counts
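The extraction described above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the actor's actual implementation: the function name `extract`, the regular expressions, and the link item keys `text` and `url` are hypothetical, only ATX headings (`#` to `######`) and backtick fences are recognized, and preamble text, tables, and per-section character counts are left out for brevity.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")   # ATX headings only
FENCE = re.compile(r"^`{3}\s*(\S*)")              # opening or closing fence
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")   # [text](url)


def _finish(node):
    """Join the collected body lines of a section and of its children."""
    node["text"] = "\n".join(node["text"]).strip()
    for child in node["children"]:
        _finish(child)


def extract(markdown):
    """Hypothetical sketch: heading tree, code blocks, links, and stats."""
    lines = markdown.split("\n")
    root = {"level": 0, "children": []}
    stack = [root]            # path from the root to the current section
    code_blocks, links = [], []
    open_block = None         # the code block being read, if any
    headings = 0

    for number, line in enumerate(lines, start=1):
        fence = FENCE.match(line)
        if open_block is not None:
            # Inside a fence nothing is a heading, not even "# text".
            if fence:
                open_block["code"] = "\n".join(open_block["code"])
                code_blocks.append(open_block)
                open_block = None
            else:
                open_block["code"].append(line)
            continue
        if fence:
            open_block = {"lang": fence.group(1), "code": [], "line": number}
            continue
        heading = HEADING.match(line)
        if heading:
            headings += 1
            level = len(heading.group(1))
            # Climb until the top of the stack is a shallower section.
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {"level": level, "heading": heading.group(2),
                    "text": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
            continue
        links.extend({"text": t, "url": u} for t, u in LINK.findall(line))
        if len(stack) > 1:    # text before the first heading is skipped here
            stack[-1]["text"].append(line)

    for section in root["children"]:
        _finish(section)
    title = next((s["heading"] for s in root["children"] if s["level"] == 1), None)
    return {
        "title": title,
        "sections": root["children"],
        "code_blocks": code_blocks,
        "links": links,
        "stats": {"lines": len(lines), "chars": len(markdown),
                  "headings": headings, "code_blocks": len(code_blocks)},
    }
```

The stack keeps the chain of open sections, so a new heading attaches to the nearest shallower heading above it. Checking the fence state before the heading pattern is what keeps a `#` comment inside a code block out of the tree.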

Input

{
  "documents": ["# Guide\n\nIntro.\n\n## Setup\n\n```bash\npip install x\n```"]
}
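One way to send this input from Python is sketched below. It assumes the third-party `apify-client` package and takes the actor ID `shoebill-dev27/doc-structure-extractor` from the page URL; the helper names `build_run_input` and `run_extractor` are hypothetical, and the API token is supplied by the caller.

```python
def build_run_input(markdown_texts):
    """Wrap raw Markdown strings in the input shape shown above."""
    return {"documents": list(markdown_texts)}


def run_extractor(token, markdown_texts):
    """Run the actor and return its dataset items (one per document).

    Assumption: the actor ID is the one in the page URL.
    """
    # Imported lazily so the payload helper works without the package.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("shoebill-dev27/doc-structure-extractor").call(
        run_input=build_run_input(markdown_texts)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Passing Python strings avoids hand-escaping newlines: the client serializes the payload to JSON, which produces the `\n` sequences seen in the sample input.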

Output (one dataset item per document)

{
  "title": "Guide",
  "sections": [
    {
      "level": 1, "heading": "Guide", "text": "Intro.",
      "children": [{"level": 2, "heading": "Setup", "...": "..."}]
    }
  ],
  "code_blocks": [{"lang": "bash", "code": "pip install x", "line": 7}],
  "links": [],
  "tables": [],
  "stats": {"lines": 9, "chars": 52, "headings": 2, "code_blocks": 1}
}
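Because `sections` is a nested tree, a short recursive walk is enough to consume it. The sketch below turns one output item into an indented outline; the function name `outline` is hypothetical, and it relies only on the `heading` and `children` keys shown above, treating a missing `children` key as an empty list.

```python
def outline(sections, depth=0):
    """Return one indented line per heading, in document order."""
    lines = []
    for section in sections:
        lines.append("  " * depth + section["heading"])
        lines.extend(outline(section.get("children", []), depth + 1))
    return lines


item = {
    "sections": [
        {"level": 1, "heading": "Guide", "text": "Intro.",
         "children": [{"level": 2, "heading": "Setup"}]}
    ]
}
print("\n".join(outline(item["sections"])))
```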

Typical uses

  • Building tables of contents / outlines for documentation sites
  • Feeding section-level structure into RAG ingestion pipelines
  • Auditing docs: section sizes, code-block coverage, dead-link candidates
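For the RAG and auditing uses above, the section tree can be flattened into one record per section that carries its heading path. This is a rough sketch, not part of the actor: the name `iter_chunks`, the ` > ` path separator, and the record keys are hypothetical, and the character count is computed locally from `text` because the actor's own field name for it is not shown on this page.

```python
def iter_chunks(sections, path=()):
    """Yield one record per non-empty section, depth first."""
    for section in sections:
        here = path + (section["heading"],)
        text = section.get("text", "")
        if text:
            yield {"path": " > ".join(here), "text": text, "chars": len(text)}
        yield from iter_chunks(section.get("children", []), here)


sections = [
    {"level": 1, "heading": "Guide", "text": "Intro.",
     "children": [{"level": 2, "heading": "Setup", "text": "Run the installer."}]}
]
chunks = list(iter_chunks(sections))
```

The same records serve an audit: sorting them by `chars` surfaces oversized sections, and sections that yield no record have no body text at all.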
