URL: https://apify.com/marbled_jury/my-actor

AI Documentation & RAG Scraper | Convert Docs to Markdown [DEPRECATED] · Apify


rag-docs-scraper

Deprecated

Pricing

Pay per usage

Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.

Rating

0.0

(0)

Developer

Hastin S.

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: a month ago

AI Documentation & RAG Scraper 🤖📄

The AI Documentation & RAG Scraper is a high-performance tool designed to transform messy technical documentation into clean, structured Markdown. It is specifically optimized for RAG (Retrieval-Augmented Generation) pipelines, LLM fine-tuning, and AI agents.

Stop feeding your AI noisy HTML. Get the clean text you need, instantly.


✨ Key Features

  • Markdown Optimized: Automatically converts HTML to clean Markdown while preserving headers, code blocks, and tables.
  • Noise Removal: Smartly identifies and strips out navbars, footers, sidebars, and cookie banners to focus only on the content.
  • Modern Web Support: Powered by Playwright, it easily handles JavaScript-heavy documentation sites (React, Docusaurus, GitBook, Next.js).
  • Recursive Crawling: Provide a homepage, and the scraper will automatically follow internal links to map out the entire documentation set.
  • AI-Agent Ready: Output is structured perfectly for Vector Databases (Pinecone, Weaviate) or direct upload to ChatGPT/Claude.

🚀 How to Use

  1. Input URLs: Enter the starting URL of the documentation you want to scrape (e.g., https://docs.apify.com/).
  2. Set Page Limit: Define how many pages you want to crawl to stay within your budget.
  3. Run & Download: Start the Actor and download your results in JSON, CSV, or Excel.
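
The interaction between recursive crawling and the page limit can be sketched as a breadth-first traversal that stays on the starting host and stops once the limit is reached. This is an illustrative sketch only: `crawl_internal` and the `get_links` callback are hypothetical, and the Actor's real link-following logic is not documented here.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_internal(start_url: str,
                   get_links: Callable[[str], Iterable[str]],
                   max_pages: int = 50) -> list[str]:
    """Breadth-first crawl limited to the start URL's host and to max_pages pages."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so each page is queued once.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing the link extractor in as a callback keeps the traversal independent of how pages are fetched, so the same loop works with a browser-based fetcher or with an in-memory map of pages.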

🛠️ Input Configuration

| Field | Type | Description |
| --- | --- | --- |
| Start URLs | Array | The entry points for the crawl. Supports multiple URLs. |
| Max Pages | Integer | The maximum number of pages to crawl (default: 50). |
| Proxy | Object | Uses Apify Proxy to ensure high success rates and avoid rate limits. |
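
As a rough illustration of the three fields above, the sketch below builds and validates an input object. The JSON key names (`startUrls`, `maxPages`, `proxyConfiguration`) and the `build_run_input` helper are assumptions based on common Apify conventions. The listing only shows the display labels, so the Actor's real input schema may differ.

```python
import json


def build_run_input(start_urls, max_pages=50, use_proxy=True):
    """Build an input object mirroring the three documented fields.

    Key names are assumed, not taken from the Actor's published schema.
    """
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("max_pages must be a positive integer")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "proxyConfiguration": {"useApifyProxy": use_proxy},
    }


if __name__ == "__main__":
    run_input = build_run_input(["https://docs.apify.com/"], max_pages=25)
    print(json.dumps(run_input, indent=2))
```

Validating the page limit before starting a run is a cheap guard, since the limit is what keeps a crawl within budget.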

📊 Sample Output

{
  "url": "https://crawlee.dev/docs/quick-start",
  "title": "Quick Start | Crawlee",
  "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n```bash\nnpm install crawlee playwright\n```",
  "scrapedAt": "2026-05-07T12:00:00Z"
}
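
Before loading records like the one above into a vector database, a common step is to split the `markdown` field into smaller chunks. The sketch below shows one possible approach, splitting on headings while leaving fenced code blocks intact. The `chunk_by_heading` helper and the chunk fields (`url`, `heading`, `text`) are hypothetical and not part of the Actor's output.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def chunk_by_heading(record: dict) -> list[dict]:
    """Split record["markdown"] into one chunk per heading.

    Lines starting with '#' inside a fenced code block are not treated as headings.
    """
    chunks, current, heading = [], [], None
    in_fence = False

    def flush():
        text = "\n".join(current).strip()
        if text:
            chunks.append({"url": record["url"], "heading": heading, "text": text})

    for line in record["markdown"].split("\n"):
        if line.startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else re.match(r"#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the previous chunk before starting a new one
            current, heading = [], match.group(1).strip()
        current.append(line)
    flush()
    return chunks
```

Keeping the source `url` on every chunk lets a retrieval pipeline cite the original page, and tracking the fence state avoids splitting on shell comments such as `# install dependencies` inside code samples.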

You might also like

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

7

RAG-Ready Documentation Scraper

alaricus/rag-docs-markdown-scraper

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Documentation Crawler for RAG

liquid_bark/docs-crawler-for-rag

Specialized crawler for developer documentation sites. Detects frameworks (Docusaurus, GitBook, ReadTheDocs, MkDocs, Sphinx), extracts clean content, and outputs semantically chunked Markdown optimized for RAG pipelines.

Docs to Markdown + AI Embeddings → Vector DB Crawler

badruddeen/docs-to-markdown-ai-embeddings---vector-db-crawler

Turn any documentation site into clean Markdown, intelligently chunked content with embeddings (Azure/OpenAI), and directly upsert into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus, ready for RAG, AI assistants, and semantic search in minutes.

Badruddeen Naseem

8

5.0

(1)

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown, ready for RAG, embeddings, and AI agents.

Dev with Bobby

11

Docs-to-RAG AI Crawler

charitable_jeopardy/WebScraperAp

Stop wasting space on website headers, footers, cookie banners, and navigation menus. Extract clean body text, chunk it for RAG, and detect page changes across runs when crawling public docs, blogs, and knowledge bases.

charitable_jeopardy

1

Website to Text & Markdown β€” AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

2

Tech Docs to LLM-Ready Markdown

hedelka/tech-docs-scraper

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

Dmitry Goncharov

25