rag-docs-scraper

Deprecated

Pricing

Pay per usage

Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.

Rating: 0.0 (0)

Developer: Hastin S.

Maintained by Community

Last modified: a month ago

AI Documentation & RAG Scraper 🤖📄

The AI Documentation & RAG Scraper transforms messy technical documentation into clean, structured Markdown. It is designed for RAG (Retrieval-Augmented Generation) pipelines, LLM fine-tuning, and AI agents.

Stop feeding your AI noisy HTML. Get the clean text you need, instantly.

✨ Key Features

Markdown Optimized: Converts HTML to clean Markdown while preserving headings, code blocks, and tables.
Noise Removal: Identifies and strips navbars, footers, sidebars, and cookie banners so that only the main content remains.
Modern Web Support: Powered by Playwright, so it can render JavaScript-heavy documentation sites (React, Docusaurus, GitBook, Next.js).
Recursive Crawling: Provide a homepage, and the scraper follows internal links to cover the documentation set, up to the configured page limit.
AI-Agent Ready: Each page is returned as a structured record (URL, title, Markdown, timestamp) suitable for loading into vector databases (Pinecone, Weaviate) or uploading directly to ChatGPT/Claude.
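
The recursive crawling feature depends on deciding which links count as "internal". The Actor's own rule is not documented here, so the following is only a minimal sketch under one plausible assumption: a link is internal when it resolves to the same host as the page it was found on. The function name `internal_links` is hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep same-host links, deduplicated."""
    host = urlparse(page_url).netloc
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links and drop the #fragment so that anchors on
        # the same page are not queued as separate pages.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc != host:
            continue  # assumption: "internal" means same host
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = internal_links(
    "https://docs.apify.com/platform/",
    ["actors", "/api#auth", "https://apify.com/store", "mailto:x@example.com", "/api"],
)
print(links)
```

A real crawler would likely also restrict links to a path prefix and respect the page limit; both are left out to keep the sketch short.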

🚀 How to Use

Input URLs: Enter the starting URL of the documentation you want to scrape (e.g., https://docs.apify.com/).
Set Page Limit: Define how many pages you want to crawl to stay within your budget.
Run & Download: Start the Actor and download your results in JSON, CSV, or Excel.

🛠️ Input Configuration

Field	Type	Description
Start URLs	Array	The entry points for the crawl. Supports multiple URLs.
Max Pages	Integer	The maximum number of pages to crawl (default: 50).
Proxy	Object	Apify Proxy settings, used to improve success rates and reduce rate limiting.
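
The table above lists display labels only, not the JSON keys the Actor expects. As a hedged illustration, the sketch below builds an input object using key names that are assumptions (`startUrls`, `maxPages`, `proxyConfiguration`, `useApifyProxy`), chosen to resemble common Apify input conventions; the helper `build_input` is hypothetical. Only the default of 50 pages comes from the table.

```python
import json


def build_input(start_urls, max_pages=50, use_apify_proxy=True):
    """Build an Actor input dict. All key names below are assumed, not confirmed."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_pages < 1:
        raise ValueError("max_pages must be a positive integer")
    return {
        # Assumed key for the "Start URLs" field (array of entry points).
        "startUrls": [{"url": url} for url in start_urls],
        # Assumed key for the "Max Pages" field (documented default: 50).
        "maxPages": max_pages,
        # Assumed key for the "Proxy" field (object).
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


actor_input = build_input(["https://docs.apify.com/"])
print(json.dumps(actor_input, indent=2))
```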

📊 Sample Output

{
  "url": "https://crawlee.dev/docs/quick-start",
  "title": "Quick Start | Crawlee",
  "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n```bash\nnpm install crawlee playwright\n```",
  "scrapedAt": "2026-05-07T12:00:00Z"
}
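
Before a record like the one above goes into a vector database, its `markdown` field is usually split into smaller chunks. The Actor does not document any chunking of its own, so this is just one possible approach: a sketch that splits on heading lines while leaving fenced code blocks intact. The function `split_by_heading` and the chunk shape are hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so this example nests safely in Markdown


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks that each start at a heading line.

    Lines inside fenced code blocks are never treated as headings, so a
    shell comment such as "# install" does not start a new chunk.
    """
    chunks: list[dict] = []
    heading = ""
    body: list[str] = []
    in_code = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_code = not in_code
        if not in_code and re.match(r"#{1,6}\s", line):
            flush()
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Record shaped like the sample output above.
record = {
    "url": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start | Crawlee",
    "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n"
    + FENCE + "bash\nnpm install crawlee playwright\n" + FENCE,
    "scrapedAt": "2026-05-07T12:00:00Z",
}

# Attach the source URL to every chunk so retrieved text can be cited.
chunks = [{"url": record["url"], **c} for c in split_by_heading(record["markdown"])]
print(len(chunks), chunks[0]["heading"])
```

Splitting on headings keeps each chunk aligned with a documentation section; long sections may still need a second pass that splits by size.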

You might also like

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

RAG-Ready Documentation Scraper

alaricus/rag-docs-markdown-scraper

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Alaricus

Documentation Crawler for RAG

liquid_bark/docs-crawler-for-rag

Specialized crawler for developer documentation sites. Detects frameworks (Docusaurus, GitBook, ReadTheDocs, MkDocs, Sphinx), extracts clean content, and outputs semantically chunked Markdown optimized for RAG pipelines.

Izz

Docs to Markdown + AI Embeddings → Vector DB Crawler

badruddeen/docs-to-markdown-ai-embeddings---vector-db-crawler

Turn any documentation site into clean Markdown, intelligently chunked content with embeddings (Azure/OpenAI), and directly upsert into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus — ready for RAG, AI assistants, and semantic search in minutes.

Badruddeen Naseem

Rating: 5.0 (1)

HTML to Markdown — clean conversion, boilerplate stripping

shoebill-dev27/html-to-markdown

Convert scraped HTML into clean Markdown and plain text: headings, nested lists, links, images, code blocks, blockquotes, and tables. Drops scripts, styles, and structural boilerplate (nav/footer/aside) so only content remains. Pure parsing, no LLM cost.

Shinobu Otani

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

Docs-to-RAG AI Crawler

charitable_jeopardy/WebScraperAp

Stop wasting space on website headers, footers, cookie banners, and navigation menus. Extract clean body text, chunk it for RAG, and detect page changes across runs crawling public docs, blogs, and knowledge bases,

charitable_jeopardy

Website to Text & Markdown — AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

Hitman studio

Tech Docs to LLM-Ready Markdown

hedelka/tech-docs-scraper

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

Dmitry Goncharov

Document Structure Extractor — Markdown to JSON outline

shoebill-dev27/doc-structure-extractor

Turn Markdown documents into structured JSON: nested heading tree with section text, fenced code blocks, links, parsed tables, and size statistics. Pure parsing, no LLM cost.

Shinobu Otani

URL: https://apify.com/marbled_jury/my-actor

AI Documentation & RAG Scraper | Convert Docs to Markdown [DEPRECATED] · Apify
