
URL: https://apify.com/devwithbobby/docs-markdown-rag-ready-crawler

Docs Markdown Rag Ready Crawler · Apify


Pricing

from $5.00 / 1,000 results


Docs Markdown Rag Ready Crawler

Turn any documentation site or website into clean, structured markdown, ready for RAG, embeddings, and AI agents.

Rating: 0.0 (0)

Developer: Dev with Bobby (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 11
  • Monthly active users: 2
  • Last modified: 5 months ago

Docs Markdown RAG-Ready Crawler

An Apify Actor that crawls documentation websites and converts them into clean markdown with RAG-ready chunks for embeddings. Includes internal link graphs and content hashes for change detection.

Features

  • Markdown Conversion - Converts HTML content to clean, well-formatted markdown
  • RAG-Ready Chunks - Automatically splits content into chunks optimized for embedding models
  • Dual Crawler Support - Playwright for JavaScript SPAs, Cheerio for static HTML (faster)
  • Link Graph - Extracts internal link relationships for building knowledge graphs
  • Content Hashing - SHA-256 hashes for detecting content changes
  • Smart Content Extraction - Automatically identifies main content and removes navigation/noise
  • URL Normalization - Handles query params, trailing slashes, and tracking parameters

Output Datasets

The crawler generates multiple dataset types (identified by _datasetType):

Pages (_datasetType: 'pages')

Full page data including:

  • url, normalizedUrl, canonicalUrl
  • title, h1, language
  • text - Plain text content
  • markdown - Converted markdown
  • excerpt - First 300 characters
  • depth - Crawl depth from start URL
  • referrers - URLs that linked to this page
  • outgoingInternalLinks, outgoingExternalLinks
  • contentHash - SHA-256 hash of markdown content
  • fetchedAt - ISO timestamp
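
The contentHash field listed above lends itself to incremental re-indexing: only pages whose hash changed between two crawls need to be re-embedded. The following is a minimal sketch of that comparison, not part of the Actor itself. The function name diff_crawls and the sample records (including the placeholder hash values) are hypothetical; it assumes both crawls were exported as lists of page records.

```python
def diff_crawls(previous, current):
    """Compare two lists of page records, keyed by normalizedUrl."""
    old = {p["normalizedUrl"]: p["contentHash"] for p in previous}
    new = {p["normalizedUrl"]: p["contentHash"] for p in current}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both crawls, but the markdown hash differs.
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }


# Hypothetical records with placeholder hash values.
previous_pages = [
    {"normalizedUrl": "https://docs.example.com/a", "contentHash": "hash-a"},
    {"normalizedUrl": "https://docs.example.com/b", "contentHash": "hash-b"},
]
current_pages = [
    {"normalizedUrl": "https://docs.example.com/a", "contentHash": "hash-a"},
    {"normalizedUrl": "https://docs.example.com/b", "contentHash": "hash-b2"},
    {"normalizedUrl": "https://docs.example.com/c", "contentHash": "hash-c"},
]
crawl_diff = diff_crawls(previous_pages, current_pages)
```

Only the URLs under "added" and "changed" would need new embeddings; those under "removed" can be deleted from the index.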

Chunks (_datasetType: 'chunks')

RAG-ready content chunks:

  • chunkId - Stable unique identifier
  • url, normalizedUrl
  • chunkIndex - Position in document
  • headingPath - Array of parent headings (e.g., ["Getting Started", "Installation"])
  • markdown, text - Chunk content
  • charStart, charEnd - Character positions in original document
  • chunkHash - Hash of chunk content
  • pageContentHash - Hash of parent page
  • tokenEstimate - Approximate token count
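
One common way to use headingPath is to prepend it to the chunk text before embedding, so the vector carries the section context. This is a sketch under assumptions: the function name embedding_text, the " > " separator, and the sample chunk are illustrative choices, not something the Actor prescribes.

```python
def embedding_text(chunk, separator=" > "):
    """Prefix the chunk text with its heading breadcrumb, if it has one."""
    path = separator.join(chunk.get("headingPath") or [])
    return f"{path}\n\n{chunk['text']}" if path else chunk["text"]


# Hypothetical chunk record using the fields listed above.
sample_chunk = {
    "headingPath": ["Getting Started", "Installation"],
    "text": "Install the package with your package manager.",
}
prepared = embedding_text(sample_chunk)
```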

Edges (_datasetType: 'edges')

Internal link graph:

  • from - Source URL (normalized)
  • to - Target URL (normalized)
  • type - Link type (a[href])
  • anchorText - Link text
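
Edge records of this shape can be folded into adjacency maps for graph analysis, for example to find the most linked-to pages. A minimal sketch follows; build_graph and the sample edges are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Return (outgoing, incoming) adjacency maps built from edge records."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for edge in edges:
        outgoing[edge["from"]].add(edge["to"])
        incoming[edge["to"]].add(edge["from"])
    return outgoing, incoming


# Hypothetical edge records.
sample_edges = [
    {"from": "https://docs.example.com", "to": "https://docs.example.com/install"},
    {"from": "https://docs.example.com", "to": "https://docs.example.com/api"},
    {"from": "https://docs.example.com/install", "to": "https://docs.example.com/api"},
]
outgoing, incoming = build_graph(sample_edges)

# Pages ranked by number of distinct inbound links, most linked first.
ranked = sorted(incoming, key=lambda url: (-len(incoming[url]), url))
```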

Issues (_datasetType: 'issues')

Crawl errors and warnings:

  • type - Error type
  • url - Affected URL
  • message - Error message
  • severity - Error severity level
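
Since every record carries _datasetType, a consumer can split a mixed export into the four groups described above. A minimal sketch, assuming the dataset has already been loaded as a list of dicts; split_by_type and the sample items are hypothetical.

```python
from collections import defaultdict


def split_by_type(items):
    """Group dataset records by their _datasetType value."""
    buckets = defaultdict(list)
    for item in items:
        # Records without the marker go into an "unknown" bucket.
        buckets[item.get("_datasetType", "unknown")].append(item)
    return dict(buckets)


# Hypothetical mixed export.
sample_items = [
    {"_datasetType": "pages", "url": "https://docs.example.com"},
    {"_datasetType": "chunks", "chunkIndex": 0},
    {"_datasetType": "chunks", "chunkIndex": 1},
    {"_datasetType": "edges", "from": "https://docs.example.com", "to": "https://docs.example.com/api"},
]
buckets = split_by_type(sample_items)
```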

Input Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| domain | string | required | Domain to crawl (e.g., https://docs.example.com) |
| startUrls | array | [] | Override start URLs (optional) |
| maxPages | integer | 200 | Maximum pages to crawl (1-10,000) |
| maxDepth | integer | 4 | Maximum crawl depth (1-10) |
| makeRagReady | boolean | true | Generate RAG-ready chunks |
| mode | string | "docs" | Extraction mode: docs, article, generic |
| output | string | "all" | Output: all, pagesOnly, chunksOnly, edgesOnly |
| crawlerType | string | "playwright" | Engine: playwright (for SPAs) or cheerio (for static) |
| includeSubdomains | boolean | false | Also crawl subdomains |
| respectRobotsTxt | boolean | true | Follow robots.txt rules |
| removeSelectors | array | ["nav", "aside", ...] | CSS selectors to remove |
| allowPatterns | array | [] | Regex patterns for URLs to include |
| denyPatterns | array | [".*utm_.*", ...] | Regex patterns for URLs to exclude |
| stripQueryParams | boolean | true | Remove query parameters from URLs |
| chunkTargetChars | integer | 2500 | Target chunk size (500-10,000) |
| chunkMaxChars | integer | 4500 | Maximum chunk size (1,000-20,000) |
| minChunkChars | integer | 400 | Minimum chunk size (100-2,000) |
| proxyConfiguration | object | - | Apify proxy settings |
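
The ranges and allowed values in the table can be checked client-side before a run is started. The sketch below is an illustration only: validate_input is a hypothetical helper, and the ordering check (minimum <= target <= maximum chunk size) is an assumption that the table does not state.

```python
RANGES = {
    "maxPages": (1, 10_000),
    "maxDepth": (1, 10),
    "chunkTargetChars": (500, 10_000),
    "chunkMaxChars": (1_000, 20_000),
    "minChunkChars": (100, 2_000),
}
CHOICES = {
    "mode": {"docs", "article", "generic"},
    "output": {"all", "pagesOnly", "chunksOnly", "edgesOnly"},
    "crawlerType": {"playwright", "cheerio"},
}
DEFAULTS = {
    "maxPages": 200, "maxDepth": 4, "chunkTargetChars": 2500,
    "chunkMaxChars": 4500, "minChunkChars": 400,
    "mode": "docs", "output": "all", "crawlerType": "playwright",
}


def validate_input(run_input):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not run_input.get("domain"):
        errors.append("domain is required")
    merged = {**DEFAULTS, **run_input}
    for name, (low, high) in RANGES.items():
        value = merged[name]
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name} must be an integer in [{low}, {high}]")
    for name, allowed in CHOICES.items():
        if merged[name] not in allowed:
            errors.append(f"{name} must be one of {sorted(allowed)}")
    # Assumption (not stated in the table): sizes should be ordered.
    sizes = [merged["minChunkChars"], merged["chunkTargetChars"], merged["chunkMaxChars"]]
    if all(isinstance(s, int) for s in sizes) and not sizes[0] <= sizes[1] <= sizes[2]:
        errors.append("expected minChunkChars <= chunkTargetChars <= chunkMaxChars")
    return errors


ok_errors = validate_input({"domain": "https://docs.example.com"})
bad_errors = validate_input({"maxPages": 0, "mode": "wiki"})
```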

Example Input

{
  "domain": "https://docs.convex.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "makeRagReady": true,
  "mode": "docs",
  "output": "all",
  "crawlerType": "playwright",
  "chunkTargetChars": 2500,
  "chunkMaxChars": 4500
}

Crawler Types

Playwright (default)

  • Best for: JavaScript SPAs, React/Vue/Next.js documentation sites
  • Waits for domcontentloaded + content selectors for fast, reliable extraction
  • Slower than Cheerio, but handles dynamically rendered content
  • Timeout: 90 seconds per page (45s navigation)

Cheerio

  • Best for: Static HTML sites, traditional documentation
  • Much faster (no browser required)
  • Lower resource usage
  • Timeout: 45 seconds per page

Content Extraction

The crawler uses smart selectors to find main content:

Docs mode tries (in order):

  1. main, article, [role="main"]
  2. .content, .markdown, .prose
  3. .theme-doc-markdown, .md-content, .docs-content
  4. Falls back to body

Automatically removes noise elements:

  • nav, aside, header, footer
  • .toc, .sidebar, .navigation, .menu
  • Any custom selectors you specify

Chunking Strategy

Content is split into chunks based on:

  1. Heading boundaries - New chunks at #, ##, ###, #### headings
  2. Target size - Aims for ~2,500 characters per chunk
  3. Max size - Hard limit at 4,500 characters
  4. Min size - Avoids tiny chunks under 400 characters
  5. Paragraph preservation - Splits at paragraph boundaries when possible
  6. Sentence preservation - Falls back to sentence/word boundaries for very long paragraphs

Each chunk includes its headingPath for context, making it ideal for RAG systems.
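
The strategy above can be approximated in a few lines. The following is a simplified sketch, not the Actor's actual implementation: chunk_markdown and split_long are hypothetical names, heading text is kept only in headingPath, oversized paragraphs are cut at whitespace rather than at sentence boundaries, and lines starting with "#" inside code blocks would be mistaken for headings. It assumes target <= max_chars.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")


def split_long(text, limit):
    """Yield pieces no longer than limit, cutting at a space when possible."""
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: hard cut
        yield text[:cut].rstrip()
        text = text[cut:].lstrip()
    if text:
        yield text


def chunk_markdown(markdown, target=2500, max_chars=4500, min_chars=400):
    """Return a list of {"headingPath": [...], "markdown": str} dicts."""
    chunks, path, paragraphs, lines = [], [], [], []

    def flush():
        text = "\n\n".join(paragraphs)
        paragraphs.clear()
        if not text:
            return
        prev = chunks[-1] if chunks else None
        # Fold a tiny tail into the previous chunk of the same section if it fits.
        if (prev and len(text) < min_chars and prev["headingPath"] == path
                and len(prev["markdown"]) + 2 + len(text) <= max_chars):
            prev["markdown"] += "\n\n" + text
        else:
            chunks.append({"headingPath": list(path), "markdown": text})

    def end_paragraph():
        if not lines:
            return
        text = "\n".join(lines)
        lines.clear()
        for piece in split_long(text, max_chars):
            size = sum(len(p) + 2 for p in paragraphs)
            if paragraphs and size + len(piece) > target:
                flush()  # adding this piece would exceed the target size
            paragraphs.append(piece)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            end_paragraph()
            flush()  # a heading always starts a new chunk
            level = len(match.group(1))
            del path[level - 1:]
            path.append(match.group(2).strip())
        elif line.strip():
            lines.append(line.rstrip())
        else:
            end_paragraph()
    end_paragraph()
    flush()
    return chunks


# Tiny limits so the behaviour is visible on a short, made-up document.
sample_md = (
    "# Guide\n\nIntro paragraph here.\n\n## Install\n\n"
    "Run the installer first.\n\nThen restart the service now.\n\nDone."
)
sample_chunks = chunk_markdown(sample_md, target=40, max_chars=60, min_chars=10)
```

With these toy limits the "Install" section is split in two because its first two paragraphs together exceed the target, while the short closing paragraph still fits into the second chunk.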

Local Development

# Install dependencies
npm install
# Run locally
apify run
# Run with input
apify run --input='{"domain": "https://docs.example.com"}'
# Deploy to Apify
apify push

Technical Notes

  • 9MB Limit: Apify dataset items have a ~9MB limit. Pages exceeding this are automatically truncated (with truncated: true flag).
  • URL Normalization: URLs are normalized (HTTPS, no trailing slashes, tracking params stripped) for deduplication.
  • Content Hashes: Use contentHash and chunkHash fields to detect content changes between crawls.
  • Stable Chunk IDs: chunkId is deterministic, derived from URL, position, and content, so an unchanged chunk at the same URL and position keeps the same ID across crawls.
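
The URL normalization rule in the notes above can be illustrated with the standard library. This is a sketch of the general technique, not the Actor's code: normalize_url is a hypothetical helper, and lowercasing the host, dropping the fragment, and treating only utm_ parameters as tracking parameters are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: the real list may be longer


def normalize_url(url, strip_query_params=True):
    """Force HTTPS, drop trailing slashes, and strip tracking parameters."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if strip_query_params:
        query = ""
    else:
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.lower().startswith(TRACKING_PREFIXES)
        ]
        query = urlencode(kept)
    # Assumption: fragments never identify a separate page, so drop them.
    return urlunsplit(("https", parts.netloc.lower(), path, query, ""))


normalized_a = normalize_url("http://Docs.Example.com/guide/?utm_source=x#intro")
normalized_b = normalize_url(
    "https://docs.example.com/api?utm_campaign=a&page=2", strip_query_params=False
)
```

Two URLs that normalize to the same string would be treated as one page for deduplication.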

License

ISC
