RAG-Ready Documentation Scraper

Pricing

from $3.99 / 1,000 results

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Developer

Alaricus

Maintained by Community

Last modified

a month ago

What does the RAG-Ready Documentation Scraper do?

The RAG-Ready Documentation Scraper is a web crawler and content parser built for LLM, vector database, and Retrieval-Augmented Generation (RAG) pipelines. It extracts clean, structured Markdown from documentation sites and standard websites, using framework-specific content detection to strip clutter (navigation panels, header menus, search boxes, cookie consent forms, and footer noise) so that only the main content body remains.

To make the outputs immediately ready for ingestion, the actor performs semantic paragraph-based chunking with configurable character sizes and contextual overlaps. It also parses XML sitemaps automatically to crawl entire documentation trees with zero extra configuration.
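The chunking behaviour described above can be illustrated with a minimal sketch. This is not the actor's implementation: the function names `chunk_markdown` and `split_long`, the sentence regex, and the choice to prepend the overlap as the tail of the previous chunk are all assumptions made for illustration. Only the parameter names `chunk_size` and `chunk_overlap` come from the actor's input.

```python
import re


def split_long(text: str, size: int) -> list[str]:
    """Split one oversized paragraph on sentence ends, then by characters."""
    pieces: list[str] = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # A single sentence longer than `size` is cut character-wise.
        while len(sentence) > size:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:size])
            sentence = sentence[size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= size:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(markdown: str, chunk_size: int = 1500,
                   chunk_overlap: int = 200) -> list[str]:
    """Paragraph-based chunking with a character overlap (hypothetical helper)."""
    # Reserve room for the overlap prefix plus the "\n\n" that joins it.
    budget = chunk_size - chunk_overlap - 2
    if chunk_overlap < 0 or budget <= 0:
        raise ValueError("chunk_overlap must be >= 0 and well below chunk_size")

    # 1. Break the document into units no larger than the budget.
    units: list[str] = []
    for para in re.split(r"\n\s*\n", markdown.strip()):
        para = para.strip()
        if para:
            units.extend([para] if len(para) <= budget else split_long(para, budget))

    # 2. Greedily pack whole units into chunk bodies.
    bodies: list[str] = []
    current = ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if len(candidate) <= budget:
            current = candidate
        else:
            bodies.append(current)
            current = unit
    if current:
        bodies.append(current)

    # 3. Prefix every chunk after the first with the tail of the previous body.
    chunks: list[str] = []
    for i, body in enumerate(bodies):
        if i and chunk_overlap:
            body = bodies[i - 1][-chunk_overlap:] + "\n\n" + body
        chunks.append(body)
    return chunks
```

With this packing scheme no chunk exceeds `chunk_size` characters, because the overlap is budgeted before paragraphs are packed. Whether the actor counts the overlap inside or on top of `chunk_size` is not stated in this listing.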

Key Features

🧹 Boilerplate Layout Scrubbing: Automatically detects and isolates main documentation content layouts. Eliminates menus, headers, sidebars, footer links, and cookie alerts.
🧩 Semantic Chunking: Splits extracted Markdown documents on paragraph boundaries. A paragraph that exceeds the chunk size is split further by sentence or by character, and a configurable overlap between consecutive chunks helps preserve context.
📄 XML Sitemap Parsing: Simply supply a sitemap.xml URL as a starting point and the scraper will auto-discover and queue all links in the sitemap.
📦 Framework Adaptation: Built-in optimized container detection for popular documentation builders:
- Docusaurus
- GitBook
- Sphinx
- ReadTheDocs
- Auto-Detect (for any generic blog, API reference, or standard page)
🖼️ Image & Link Toggles: Include or strip images (![alt](url)) and hyperlinks ([text](url)) on demand depending on your RAG embedding requirements.
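To illustrate the sitemap feature listed above, here is a minimal sketch of how page URLs can be pulled out of sitemap XML with Python's standard library. The function name `extract_sitemap_urls` and the split between page URLs and nested sitemap URLs are assumptions for illustration; the listing does not describe how the actor handles sitemap index files.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}loc'."""
    return tag.rsplit("}", 1)[-1]


def extract_sitemap_urls(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, nested_sitemap_urls) found in sitemap XML.

    Hypothetical helper: a <urlset> yields page URLs, while a
    <sitemapindex> yields URLs of further sitemaps to fetch.
    """
    root = ET.fromstring(xml_text)
    locs: list[str] = []
    for entry in root:
        # Only look at direct <url> / <sitemap> entries, so extension
        # elements that also use a <loc> tag are not picked up.
        if _local(entry.tag) not in ("url", "sitemap"):
            continue
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local(root.tag) == "sitemapindex":
        return [], locs
    return locs, []
```

A crawler built on this would queue the page URLs directly and fetch each nested sitemap URL in turn, stopping once a page limit such as `max_pages` is reached.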

Input Parameters

Parameter	Type	Default	Description
Start URLs (`start_urls`)	`Array`	Required	List of documentation base URLs or XML sitemap URLs.
Documentation Framework (`framework`)	`String`	`auto`	Choose target framework (`auto`, `docusaurus`, `gitbook`, `sphinx`, `readthedocs`) to improve main content wrapper detection.
Enable Semantic Chunking (`enable_chunking`)	`Boolean`	`true`	When enabled, splits Markdown outputs into semantic chunks on paragraph boundaries.
Chunk Size (`chunk_size`)	`Integer`	`1500`	Target character size of each chunk.
Chunk Overlap (`chunk_overlap`)	`Integer`	`200`	Overlap character length between sequential chunks.
Maximum Pages to Scrape (`max_pages`)	`Integer`	`50`	Maximum number of pages the crawler will visit.
Include Image Links (`include_images`)	`Boolean`	`true`	Retain image Markdown tags in extracted text.
Include Hyperlinks (`include_links`)	`Boolean`	`true`	Retain anchor link Markdown tags in extracted text.
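The effect of the `include_images` and `include_links` toggles in the table above can be approximated with two regular expressions. This is a sketch under assumptions, not the actor's code: it assumes that disabling images removes the whole image tag and that disabling links keeps the anchor text while dropping the URL. The helper name `apply_toggles` is hypothetical, and nested constructs such as linked images are not handled.

```python
import re

# ![alt](url): the whole tag is removed when images are disabled.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
# [text](url) not preceded by "!": replaced by its text when links are disabled.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")


def apply_toggles(markdown: str, include_images: bool = True,
                  include_links: bool = True) -> str:
    """Strip image tags and/or unwrap hyperlinks (assumed behaviour)."""
    if not include_images:
        markdown = IMAGE_RE.sub("", markdown)
    if not include_links:
        markdown = LINK_RE.sub(r"\1", markdown)
    return markdown
```

Stripping URLs before embedding can reduce noise in the vectors, while keeping them is useful when the retrieved chunk should cite its sources.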

Input Example

{
  "start_urls": [
    {
      "url": "https://docusaurus.io/docs"
    }
  ],
  "framework": "docusaurus",
  "enable_chunking": true,
  "chunk_size": 1500,
  "chunk_overlap": 200,
  "max_pages": 100,
  "include_images": true,
  "include_links": true
}
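The input above could be built and submitted from Python roughly as follows. This sketch assumes the `apify-client` package is installed and that the actor ID matches the path of the listing URL (`alaricus/rag-docs-markdown-scraper`). The helpers `build_input` and `run_scraper` are hypothetical names, and the defaults mirror the Input Parameters table.

```python
ACTOR_ID = "alaricus/rag-docs-markdown-scraper"  # assumed from the listing URL
FRAMEWORKS = {"auto", "docusaurus", "gitbook", "sphinx", "readthedocs"}


def build_input(urls: list[str], framework: str = "auto",
                max_pages: int = 50, **overrides) -> dict:
    """Build an actor input dict using the documented defaults."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework!r}")
    run_input = {
        "start_urls": [{"url": url} for url in urls],
        "framework": framework,
        "enable_chunking": True,
        "chunk_size": 1500,
        "chunk_overlap": 200,
        "max_pages": max_pages,
        "include_images": True,
        "include_links": True,
    }
    run_input.update(overrides)  # e.g. chunk_size=800
    return run_input


def run_scraper(token: str, run_input: dict) -> list[dict]:
    """Start the actor, wait for it to finish, and return the dataset items."""
    # Imported here so build_input stays usable without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `run_scraper(token, build_input(["https://docusaurus.io/docs"], framework="docusaurus", max_pages=100))` would submit the same input as the example above, where `token` is your Apify API token.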

Output Data Structure

The results are pushed directly to your Apify dataset. Each item represents a scraped page and has the following schema:

{
  "url": "https://docs.gitbook.com/",
  "title": "Overview | GitBook Documentation",
  "markdown": "# Overview\n\nWelcome to the GitBook documentation portal...",
  "chunks": [
    "# Overview\n\nWelcome to the GitBook documentation portal...",
    "To start configuring your docs, see the Git Sync integration guide..."
  ],
  "chunk_count": 2
}
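Before loading items like the one above into a vector database, it is common to flatten each page into one record per chunk. The sketch below shows one way to do that. The record fields (`id`, `text`, `chunk_index`) and the function name `to_chunk_records` are assumptions, as is the fallback to the full `markdown` when no `chunks` are present (for instance when chunking is disabled).

```python
def to_chunk_records(item: dict) -> list[dict]:
    """Flatten one dataset item into per-chunk records with page metadata."""
    # Assumed fallback: treat the whole page as a single chunk if none exist.
    chunks = item.get("chunks") or [item.get("markdown", "")]
    return [
        {
            "id": f"{item['url']}#chunk-{index}",  # stable, page-scoped identifier
            "text": chunk,
            "url": item["url"],
            "title": item.get("title", ""),
            "chunk_index": index,
        }
        for index, chunk in enumerate(chunks)
        if chunk  # skip empty strings
    ]
```

Keeping `url` and `title` on every record lets a retrieval step cite the source page for each chunk it returns.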

Pricing: Pay-Per-Event (PPE)

This Actor uses the transparent Pay-Per-Event pricing model, meaning you only pay for the pages you successfully scrape.

Price per 1,000 pages: $3.99
Price per page: $0.00399
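As a worked example of the per-page price, the following sketch estimates the cost of a run from a page count. It assumes the listed $0.00399 rate applies uniformly to every successfully scraped page. The listing quotes the price as "from $3.99", so the actual rate may differ, and the helper name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.00399")  # USD, taken from the listed per-page price


def estimate_cost(pages: int) -> Decimal:
    """Estimated charge in USD for `pages` successfully scraped pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages
```

Under that assumption, a run capped at the default `max_pages` of 50 would cost at most about $0.20.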

Feedback & Customizations

If you encounter any issues, need to request a specific feature, or require a custom scraping solution for your business, feel free to get in touch.

Developer: bd.pascari@gmail.com

URL: https://apify.com/alaricus/rag-docs-markdown-scraper
