Docs Markdown Rag Ready Crawler

Pricing

from $5.00 / 1,000 results

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Developer

Dev with Bobby

Maintained by Community

Last modified: 5 months ago

Docs Markdown RAG-Ready Crawler

An Apify Actor that crawls documentation websites and converts them into clean markdown with RAG-ready chunks for embeddings. Includes internal link graphs and content hashes for change detection.

Features

Markdown Conversion - Converts HTML content to clean, well-formatted markdown
RAG-Ready Chunks - Automatically splits content into chunks optimized for embedding models
Dual Crawler Support - Playwright for JavaScript SPAs, Cheerio for static HTML (faster)
Link Graph - Extracts internal link relationships for building knowledge graphs
Content Hashing - SHA-256 hashes for detecting content changes
Smart Content Extraction - Automatically identifies main content and removes navigation/noise
URL Normalization - Handles query params, trailing slashes, and tracking parameters

Output Datasets

The crawler emits four record types, distinguished by the `_datasetType` field:

Pages (`_datasetType: 'pages'`)

Full page data including:

url, normalizedUrl, canonicalUrl
title, h1, language
text - Plain text content
markdown - Converted markdown
excerpt - First 300 characters
depth - Crawl depth from start URL
referrers - URLs that linked to this page
outgoingInternalLinks, outgoingExternalLinks
contentHash - SHA-256 hash of markdown content
fetchedAt - ISO timestamp
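
Because every page record carries a `contentHash`, two crawls can be compared without diffing the markdown itself. The following is a minimal sketch, assuming each crawl has already been loaded as a list of page dicts with the `normalizedUrl` and `contentHash` fields listed above; the helper name `diff_crawls` is hypothetical and not part of the Actor.

```python
def diff_crawls(old_pages, new_pages):
    """Classify pages as added, removed, or changed between two crawls.

    Assumes each page dict has `normalizedUrl` and `contentHash` keys.
    """
    old = {p["normalizedUrl"]: p["contentHash"] for p in old_pages}
    new = {p["normalizedUrl"]: p["contentHash"] for p in new_pages}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same URL in both crawls, but the hash differs.
        "changed": sorted(u for u in old.keys() & new.keys() if old[u] != new[u]),
    }
```

Only the pages reported under `added` and `changed` would then need to be re-embedded.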

Chunks (`_datasetType: 'chunks'`)

RAG-ready content chunks:

chunkId - Stable unique identifier
url, normalizedUrl
chunkIndex - Position in document
headingPath - Array of parent headings (e.g., ["Getting Started", "Installation"])
markdown, text - Chunk content
charStart, charEnd - Character positions in original document
chunkHash - Hash of chunk content
pageContentHash - Hash of parent page
tokenEstimate - Approximate token count
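
One common way to use `headingPath` is to prepend it to the chunk text before embedding, so the vector also reflects where the chunk sits in the document. This is a sketch of that idea, not something the Actor does for you; the function name `embedding_text` and the `" > "` separator are assumptions.

```python
def embedding_text(chunk, separator=" > "):
    """Prefix a chunk's plain text with its heading breadcrumb.

    `chunk` is assumed to be a dict with the `headingPath` and `text`
    fields described above.
    """
    breadcrumb = separator.join(chunk.get("headingPath") or [])
    if not breadcrumb:
        return chunk["text"]
    return f"{breadcrumb}\n\n{chunk['text']}"
```

For a chunk under `["Getting Started", "Installation"]`, the text sent to the embedding model would start with `Getting Started > Installation`.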

Edges (`_datasetType: 'edges'`)

Internal link graph:

from - Source URL (normalized)
to - Target URL (normalized)
type - Link type (a[href])
anchorText - Link text
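
The edge records can be turned into an adjacency structure with a few lines of code. The sketch below assumes a list of edge dicts with the `from` and `to` fields above; `build_link_graph` and `most_linked` are hypothetical helper names.

```python
from collections import defaultdict


def build_link_graph(edges):
    """Return (outgoing, incoming) adjacency maps keyed by normalized URL."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for edge in edges:
        outgoing[edge["from"]].add(edge["to"])
        incoming[edge["to"]].add(edge["from"])
    return outgoing, incoming


def most_linked(incoming, top=5):
    """URLs with the most distinct referrers, ties broken alphabetically."""
    return sorted(incoming, key=lambda url: (-len(incoming[url]), url))[:top]
```

Pages with many distinct referrers are often hub or overview pages, which can be a useful signal when ranking retrieval results.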

Issues (`_datasetType: 'issues'`)

Crawl errors and warnings:

type - Error type
url - Affected URL
message - Error message
severity - Error severity level
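
When all record types arrive in one stream of items, they can be separated by `_datasetType` before further processing. A minimal sketch, assuming `items` is an iterable of dicts exported from a run; the function name and the `"unknown"` fallback key are assumptions.

```python
from collections import defaultdict


def split_by_dataset_type(items):
    """Group exported records by their `_datasetType` field."""
    groups = defaultdict(list)
    for item in items:
        # Records without the field are kept under "unknown" rather than dropped.
        groups[item.get("_datasetType", "unknown")].append(item)
    return dict(groups)
```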

Input Configuration

Parameter	Type	Default	Description
`domain`	string	required	Domain to crawl (e.g., `https://docs.example.com`)
`startUrls`	array	`[]`	Override start URLs (optional)
`maxPages`	integer	`200`	Maximum pages to crawl (1-10,000)
`maxDepth`	integer	`4`	Maximum crawl depth (1-10)
`makeRagReady`	boolean	`true`	Generate RAG-ready chunks
`mode`	string	`"docs"`	Extraction mode: `docs`, `article`, `generic`
`output`	string	`"all"`	Output: `all`, `pagesOnly`, `chunksOnly`, `edgesOnly`
`crawlerType`	string	`"playwright"`	Engine: `playwright` (for SPAs) or `cheerio` (for static)
`includeSubdomains`	boolean	`false`	Also crawl subdomains
`respectRobotsTxt`	boolean	`true`	Follow robots.txt rules
`removeSelectors`	array	`["nav", "aside", ...]`	CSS selectors to remove
`allowPatterns`	array	`[]`	Regex patterns for URLs to include
`denyPatterns`	array	`[".utm_.", ...]`	Regex patterns for URLs to exclude
`stripQueryParams`	boolean	`true`	Remove query parameters from URLs
`chunkTargetChars`	integer	`2500`	Target chunk size in characters (500-10,000)
`chunkMaxChars`	integer	`4500`	Maximum chunk size in characters (1,000-20,000)
`minChunkChars`	integer	`400`	Minimum chunk size in characters (100-2,000)
`proxyConfiguration`	object	-	Apify proxy settings

Example Input

{
  "domain": "https://docs.convex.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "makeRagReady": true,
  "mode": "docs",
  "output": "all",
  "crawlerType": "playwright",
  "chunkTargetChars": 2500,
  "chunkMaxChars": 4500
}
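
The same input can be assembled and range-checked in code before a run is started. The sketch below is an illustration under stated assumptions: `build_input` and `run_crawler` are hypothetical helpers, the ranges are copied from the parameter table above, and the Actor ID is inferred from the store URL. `run_crawler` uses the `apify-client` Python package and needs an Apify API token.

```python
ACTOR_ID = "devwithbobby/docs-markdown-rag-ready-crawler"  # assumed from the store URL

# Inclusive (min, max) ranges as documented in the parameter table.
RANGES = {
    "maxPages": (1, 10_000),
    "maxDepth": (1, 10),
    "chunkTargetChars": (500, 10_000),
    "chunkMaxChars": (1_000, 20_000),
    "minChunkChars": (100, 2_000),
}


def build_input(domain, **overrides):
    """Build a run input dict and reject values outside the documented ranges."""
    run_input = {"domain": domain, **overrides}
    for key, (low, high) in RANGES.items():
        if key in run_input and not low <= run_input[key] <= high:
            raise ValueError(f"{key}={run_input[key]} is outside {low}-{high}")
    return run_input


def run_crawler(token, run_input):
    """Start the Actor, wait for it to finish, and return all dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_crawler(token, build_input("https://docs.convex.dev", maxPages=500))` would mirror the JSON example above.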

Crawler Types

Playwright (default)

Best for: JavaScript SPAs, React/Vue/Next.js documentation sites
Waits for domcontentloaded + content selectors for fast, reliable extraction
Slower than Cheerio, but handles dynamically rendered content
Timeout: 90 seconds per page (45s navigation)

Cheerio

Best for: Static HTML sites, traditional documentation
Much faster (no browser required)
Lower resource usage
Timeout: 45 seconds per page

Content Extraction

The crawler tries a prioritized list of selectors to find the main content:

Docs mode tries (in order):

main, article, [role="main"]
.content, .markdown, .prose
.theme-doc-markdown, .md-content, .docs-content
Falls back to body

Automatically removes noise elements:

nav, aside, header, footer
.toc, .sidebar, .navigation, .menu
Any custom selectors you specify

Chunking Strategy

Content is split into chunks based on:

Heading boundaries - New chunks at #, ##, ###, #### headings
Target size - Aims for ~2,500 characters per chunk by default (`chunkTargetChars`)
Max size - Hard limit at 4,500 characters by default (`chunkMaxChars`)
Min size - Avoids tiny chunks under 400 characters by default (`minChunkChars`)
Paragraph preservation - Splits at paragraph boundaries when possible
Sentence preservation - Falls back to sentence/word boundaries for very long paragraphs

Each chunk includes its headingPath for context, making it ideal for RAG systems.
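
The strategy above can be approximated in a few dozen lines. This is a simplified sketch, not the Actor's implementation: it splits at `#` to `####` headings, tracks the heading path, and packs paragraphs up to the target size. It omits minimum-size merging, cuts overlong paragraphs by length rather than at sentence or word boundaries, and keeps headings only in `headingPath`. The function name `chunk_markdown` is hypothetical, and `target` is assumed to be no larger than `max_chars`.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")


def chunk_markdown(markdown, target=2500, max_chars=4500):
    """Split markdown at headings, then pack paragraphs up to `target` chars."""
    sections, path, body, in_fence = [], [], [], False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((list(path), text))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2).strip()]
            body = []
        else:
            body.append(line)
    flush()

    chunks = []
    for heading_path, text in sections:
        current = ""
        for para in re.split(r"\n\s*\n", text):
            # Length-based cut for paragraphs longer than the hard limit.
            for start in range(0, len(para), max_chars):
                piece = para[start:start + max_chars]
                candidate = f"{current}\n\n{piece}" if current else piece
                if current and len(candidate) > target:
                    chunks.append({"headingPath": heading_path, "markdown": current})
                    current = piece
                else:
                    current = candidate
        if current:
            chunks.append({"headingPath": heading_path, "markdown": current})
    return chunks
```

Because a chunk only grows while it stays at or under `target`, and a single paragraph piece never exceeds `max_chars`, no chunk produced by this sketch is longer than the hard limit.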

Local Development

# Install dependencies
npm install
# Run locally
apify run
# Run with input
apify run --input='{"domain": "https://docs.example.com"}'
# Deploy to Apify
apify push

Technical Notes

9MB Limit: Apify dataset items have a ~9MB limit. Pages exceeding this are automatically truncated (with truncated: true flag).
URL Normalization: URLs are normalized (HTTPS, no trailing slashes, tracking params stripped) for deduplication.
Content Hashes: Use contentHash and chunkHash fields to detect content changes between crawls.
Stable Chunk IDs: chunkId is deterministic, derived from the URL, position, and content, so an unchanged chunk keeps the same ID across crawls.
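
The URL normalization described above can be sketched with the standard library. The exact rules the Actor applies are not documented here, so the details are assumptions: the tracking-parameter list, lowercasing the host, and dropping the fragment are illustrative choices, and `normalize_url` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)        # assumption: prefixes treated as tracking
TRACKING_KEYS = {"fbclid", "gclid"}  # assumption: extra tracking keys


def normalize_url(url, strip_query=True):
    """Force HTTPS, drop trailing slashes, and strip query or tracking params."""
    parts = urlsplit(url)
    if strip_query:
        query = ""  # mirrors stripQueryParams: true
    else:
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.lower().startswith(TRACKING_PREFIXES)
            and key.lower() not in TRACKING_KEYS
        ]
        query = urlencode(kept)
    path = parts.path.rstrip("/")
    # The fragment is dropped (assumption) so anchors do not create duplicates.
    return urlunsplit(("https", parts.netloc.lower(), path, query, ""))
```

Two URLs that normalize to the same string can then be treated as the same page when joining pages, chunks, and edges.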

Dependencies

Apify SDK - Actor framework
Crawlee - Web scraping library
Playwright - Browser automation
Turndown - HTML to Markdown conversion

License

ISC

URL: https://apify.com/devwithbobby/docs-markdown-rag-ready-crawler