Document Structure Extractor — Markdown to JSON outline

Pricing

Pay per usage

Turn Markdown documents into structured JSON: nested heading tree with section text, fenced code blocks, links, parsed tables, and size statistics. Pure parsing, no LLM cost.

Developer

Shinobu Otani

Maintained by Community

Last modified

6 days ago

Document Structure Extractor

Turn Markdown documents into structured JSON — heading tree, section text, code blocks, links, and parsed tables. Pure parsing, deterministic, no LLM cost.

What it does

For each input document it extracts:

Title (first # heading) and preamble text
Nested section tree: level, heading, body text, character counts, children. Lines inside fenced code blocks are never counted as headings
Code blocks with language tags and line numbers
Links ([text](url))
Tables parsed into header + rows
Stats: lines, characters, heading and code-block counts
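
The fence-aware scan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actor's actual implementation: the function name `scan` and both regular expressions are hypothetical, and the sketch covers headings, code blocks, and stats only (no section tree, links, or tables).

```python
import re

# Hypothetical patterns: an opening/closing fence and an ATX heading.
FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})\s*([^\s`]*)")
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def scan(markdown: str) -> dict:
    """Collect headings and fenced code blocks, skipping '#' lines inside fences."""
    headings, code_blocks = [], []
    open_fence = None   # marker string of the currently open fence, if any
    block, buffer = None, []
    lines = markdown.split("\n")
    for number, line in enumerate(lines, start=1):
        m = FENCE.match(line)
        if open_fence is None:
            if m:
                # Opening fence: remember its marker, language tag, and line.
                open_fence = m.group(1)
                block = {"lang": m.group(2), "line": number}
                buffer = []
            else:
                h = HEADING.match(line)
                if h:
                    headings.append(
                        {"level": len(h.group(1)), "heading": h.group(2), "line": number}
                    )
        elif (m and m.group(1)[0] == open_fence[0]
              and len(m.group(1)) >= len(open_fence) and not m.group(2)):
            # Closing fence: same character, at least as long, no info string.
            code_blocks.append(
                {"lang": block["lang"], "code": "\n".join(buffer), "line": block["line"]}
            )
            open_fence = None
        else:
            # Inside a fence: plain code, even if the line starts with '#'.
            buffer.append(line)
    # An unclosed fence is silently dropped in this sketch.
    return {
        "headings": headings,
        "code_blocks": code_blocks,
        "stats": {
            "lines": len(lines),
            "chars": len(markdown),
            "headings": len(headings),
            "code_blocks": len(code_blocks),
        },
    }
```

Tracking the open fence marker is what keeps a shell comment such as `# install deps` inside a code block from being reported as a level-1 heading.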

Input

{
  "documents": ["# Guide\n\nIntro.\n\n## Setup\n\n```bash\npip install x\n```"]
}
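
Because each document travels as a single JSON string, its newlines have to be escaped as `\n`. A small helper can take care of that; the name `build_input` is hypothetical, and the only thing taken from the example above is the `documents` key.

```python
import json


def build_input(documents: list[str]) -> str:
    """Serialize Markdown strings into the {"documents": [...]} input shape."""
    # json.dumps escapes newlines and quotes inside each document string.
    return json.dumps({"documents": documents})
```

Passing the contents of several `.md` files as the list would produce one input with several documents, and per the heading below, one dataset item each.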

Output (one dataset item per document)

{
  "title": "Guide",
  "sections": [
    {
      "level": 1, "heading": "Guide", "text": "Intro.",
      "children": [{"level": 2, "heading": "Setup", "...": "..."}]
    }
  ],
  "code_blocks": [{"lang": "bash", "code": "pip install x", "line": 7}],
  "links": [],
  "tables": [],
  "stats": {"lines": 9, "chars": 52, "headings": 2, "code_blocks": 1}
}
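
Since `sections` is a tree, consumers typically walk it recursively. The sketch below turns one output item into an indented outline. It relies only on the `heading` and `children` keys shown above; the function name `outline` is hypothetical, and a section without a `children` key is assumed to be a leaf.

```python
def outline(sections, depth=0):
    """Return the heading tree as indented bullet lines, one per section."""
    lines = []
    for section in sections:
        lines.append("  " * depth + "- " + section["heading"])
        # Recurse into nested sections; a missing key is treated as no children.
        lines.extend(outline(section.get("children", []), depth + 1))
    return lines
```

For the sample item, `"\n".join(outline(item["sections"]))` gives a two-line outline with `Setup` nested under `Guide`.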

Typical uses

Building tables of contents / outlines for documentation sites
Feeding section-level structure into RAG ingestion pipelines
Auditing docs: section sizes, code-block coverage, dead-link candidates
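
For the RAG and auditing uses, one possible approach is to flatten the section tree into records that carry their heading path and text length. This is a sketch, not part of the actor: the name `chunks` and the record keys `path` and `chars` are hypothetical, the length is computed locally rather than read from the output, and sections without a `text` key are treated as empty.

```python
def chunks(sections, path=()):
    """Yield one record per section with its full heading path and text size."""
    for section in sections:
        here = path + (section["heading"],)
        text = section.get("text", "")
        yield {"path": " > ".join(here), "text": text, "chars": len(text)}
        # Children inherit the parent's heading path.
        yield from chunks(section.get("children", []), here)
```

Sorting the records by `chars` is a quick way to spot empty or oversized sections before ingestion.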

URL: https://apify.com/shoebill-dev27/doc-structure-extractor