Llm Ready Documentation Scraper

Pricing

Pay per usage

Developers and AI agents need to read documentation (e.g. Stripe Docs, Next.js Docs), but standard scrapers return noisy HTML that includes navigation bars, headers and footers, ads, and cookie banners. This Actor returns content-only Markdown, suitable for vectorization and semantic search.

Rating

0.0

(0)

Developer

Sean

Maintained by Community

Actor stats

Last modified: 6 months ago

LLM-Ready Documentation Scraper

Crawl any documentation website and get clean, formatted Markdown perfect for LLMs and RAG (Retrieval-Augmented Generation) applications.

🎯 Problem

Developers and AI agents need to read documentation (Stripe Docs, Next.js Docs, etc.), but standard scrapers return messy HTML with navbars, footers, and ads. This Actor solves that by delivering pure, clean Markdown.

✨ Features

Clean Markdown Output: Strips navigation, sidebars, footers, scripts, and ads
Smart Content Detection: Automatically finds the main content area
Token Counting: Each page includes token count for LLM context planning
Merge Mode: Combine all pages into a single full_documentation.md file
Configurable Depth: Control how deep to crawl
URL Filtering: Include/exclude patterns using globs
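
The per-page token count is what makes context planning possible. As a minimal sketch, assuming each dataset item carries the `tokenCount` field shown in the Output section, a consumer could pick pages that fit a model's context window like this. The function name `selectWithinBudget` and the greedy strategy are illustrative assumptions, not part of the Actor.

```typescript
// Shape of one dataset item, following the Actor's documented output.
interface ScrapedPage {
  url: string;
  title: string;
  markdown: string;
  tokenCount: number;
  scrapedAt: string;
}

// Hypothetical helper: walk the pages in order and keep each one whose
// tokenCount still fits in the remaining budget. Pages that would
// overflow the budget are skipped, later smaller pages may still fit.
function selectWithinBudget(pages: ScrapedPage[], maxTokens: number): ScrapedPage[] {
  const selected: ScrapedPage[] = [];
  let used = 0;
  for (const page of pages) {
    if (used + page.tokenCount <= maxTokens) {
      selected.push(page);
      used += page.tokenCount;
    }
  }
  return selected;
}
```

A greedy pass keeps crawl order, which usually matches the reading order of the docs. If relevance matters more than order, sort the pages by a relevance score before calling the helper.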

📥 Input

| Field | Type | Description |
| --- | --- | --- |
| `startUrl` | String | The root URL of the documentation site |
| `maxDepth` | Number | Maximum link depth to crawl (default: 10) |
| `maxPages` | Number | Maximum pages to scrape (default: 100) |
| `includeGlobs` | Array | URL patterns to include |
| `excludeGlobs` | Array | URL patterns to exclude |
| `excludeElements` | String | CSS selectors to remove |
| `contentSelector` | String | CSS selector for main content |
| `mergeOutput` | Boolean | Combine all pages into one file |
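
For callers working in TypeScript, the input fields above can be written down as a type. This is a sketch under assumptions: the table does not say which fields are required, so treating everything except `startUrl` as optional is a guess, and the helper `withDefaults` is hypothetical. Only the two defaults stated in the table (10 and 100) are applied.

```typescript
// Input fields as listed in the table. Optionality is an assumption.
interface ScraperInput {
  startUrl: string;
  maxDepth?: number;
  maxPages?: number;
  includeGlobs?: string[];
  excludeGlobs?: string[];
  excludeElements?: string;
  contentSelector?: string;
  mergeOutput?: boolean;
}

// Hypothetical helper: fill in the documented defaults for maxDepth and
// maxPages, leaving every other field exactly as the caller provided it.
function withDefaults(input: ScraperInput): ScraperInput {
  return {
    ...input,
    maxDepth: input.maxDepth ?? 10,
    maxPages: input.maxPages ?? 100,
  };
}
```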

📤 Output

Each page is stored in the dataset with:

{
  "url": "https://docs.example.com/api/auth",
  "title": "Authentication",
  "markdown": "# Authentication\n\nThis guide covers...",
  "tokenCount": 1523,
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}

When mergeOutput is enabled, a combined full_documentation.md is saved to the Key-Value Store.
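
The README does not describe how the merged file is laid out. As a rough sketch of what a merge step could look like on the consumer side, the following joins per-page Markdown into one document. The source comment, the `---` separator, and the name `mergePages` are all assumptions for illustration, not the Actor's actual format.

```typescript
// Shape of one dataset item, following the documented output.
interface ScrapedPage {
  url: string;
  title: string;
  markdown: string;
  tokenCount: number;
  scrapedAt: string;
}

// Hypothetical merge: prefix each page with an HTML comment naming its
// source URL, then join the pages with a horizontal rule.
function mergePages(pages: ScrapedPage[]): { markdown: string; totalTokens: number } {
  const sections = pages.map(
    (page) => `<!-- Source: ${page.url} -->\n\n${page.markdown.trim()}`
  );
  return {
    markdown: sections.join("\n\n---\n\n") + "\n",
    // Sum of per-page counts; the added separators are not counted.
    totalTokens: pages.reduce((sum, page) => sum + page.tokenCount, 0),
  };
}
```

Because the separators and source comments add a few tokens of their own, `totalTokens` is a slight underestimate of the merged document's real size.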

🚀 Usage Examples

Crawl Stripe Docs

{
  "startUrl": "https://stripe.com/docs/api",
  "maxPages": 50,
  "mergeOutput": true
}
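
An input like this can also be submitted programmatically. The sketch below builds a request for Apify's REST API, assuming the synchronous `run-sync-get-dataset-items` endpoint and the Actor ID `direct_duty/llm-ready-documentation-scraper` taken from this page's URL. The helper `buildRunRequest` and the `RunRequest` type are hypothetical names, and the token value is a placeholder.

```typescript
// Plain description of an HTTP request, ready to hand to an HTTP client.
interface RunRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: build the request that starts a run and returns
// its dataset items in the response.
function buildRunRequest(actorId: string, token: string, input: object): RunRequest {
  // The API addresses Actors as "username~actor-name".
  const apiActorId = actorId.replace("/", "~");
  return {
    url: `https://api.apify.com/v2/acts/${apiActorId}/run-sync-get-dataset-items`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(input),
  };
}

const request = buildRunRequest(
  "direct_duty/llm-ready-documentation-scraper",
  "YOUR_APIFY_TOKEN", // placeholder, not a real token
  { startUrl: "https://stripe.com/docs/api", maxPages: 50, mergeOutput: true }
);
```

The resulting object can be passed to any HTTP client, for example `fetch(request.url, request)`. Synchronous runs are time-limited, so a large crawl may be better served by starting the run asynchronously and reading the dataset afterwards.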

Crawl with Custom Content Selector

{
  "startUrl": "https://nextjs.org/docs",
  "contentSelector": ".docs-content",
  "excludeElements": "nav, footer, .sidebar, .carbon-ads",
  "maxDepth": 3
}

🔧 Technical Details

Built with TypeScript and the Apify SDK
Uses CheerioCrawler for fast HTML parsing
Turndown library for HTML-to-Markdown conversion
gpt-tokenizer for accurate token counting

📝 License

ISC

URL: https://apify.com/direct_duty/llm-ready-documentation-scraper