URL: https://apify.com/vamsi-krishna/docs-to-rag-optimizer


Pricing

from $0.50 / 1,000 pages processed

Docs-to-RAG Optimizer

Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.

Rating: 0.0 (0)

Developer: Vamsi Krishna

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a month ago

Docs to RAG - Documentation to Markdown, JSONL & AI Chunks

Turn public developer documentation into clean, LLM-ready data for RAG pipelines.

This Apify Actor crawls docs websites, removes navigation/sidebar/footer noise, converts pages to Markdown, splits content into semantic chunks, counts tokens, detects duplicates, and exports JSONL files that are easy to load into vector databases and AI search systems.

Best For

  • Building AI assistants over product or developer documentation
  • Preparing docs for OpenAI vector stores, Pinecone, Supabase Vector, Weaviate, Qdrant, Chroma, LangChain, and LlamaIndex
  • Converting Docusaurus, GitBook, MkDocs/Material, MDN-style, and custom docs pages into clean Markdown
  • Creating stable page and chunk records with content hashes for incremental RAG ingestion
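
The incremental ingestion use case in the last item can be sketched as a diff between two runs. This is a minimal sketch, not part of the Actor: diffChunks is a hypothetical helper, and it assumes chunk records expose chunkId and contentHash fields and that chunkId stays the same across runs for unchanged sections.

```javascript
// Hypothetical helper (not part of the Actor): decide which chunks to
// re-embed and which to delete by comparing two runs.
// Assumption: chunkId is stable across runs and contentHash changes
// whenever the chunk content changes.
function diffChunks(previousChunks, currentChunks) {
  const previousHashById = new Map(
    previousChunks.map((chunk) => [chunk.chunkId, chunk.contentHash]),
  );
  const currentIds = new Set(currentChunks.map((chunk) => chunk.chunkId));

  // New chunks and chunks whose hash changed need a fresh embedding.
  const toUpsert = currentChunks.filter(
    (chunk) => previousHashById.get(chunk.chunkId) !== chunk.contentHash,
  );

  // Chunks that disappeared from the current run can be removed from the index.
  const toDelete = previousChunks
    .filter((chunk) => !currentIds.has(chunk.chunkId))
    .map((chunk) => chunk.chunkId);

  return { toUpsert, toDelete };
}
```

Only the records in toUpsert need new embeddings, so unchanged chunks cost nothing on later runs.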

What You Get

  • Clean Markdown for every processed page
  • Page JSON records in the pages dataset
  • Chunk JSON records in the chunks dataset
  • Default dataset records for easy Apify Console/API export
  • Consolidated pages.jsonl and chunks.jsonl exports in the key-value store
  • Token counts for pages and chunks using OpenAI-style tokenization
  • Header-aware RAG chunks with heading paths and previous/next chunk IDs
  • SHA-256 content hashes for pages and chunks
  • Exact duplicate detection with duplicateOf
  • RAG quality score, warnings, and recommendedAction
  • Optional per-page Markdown files in key-value store

Why Use This Instead of a Generic Web Scraper?

Generic website scrapers are useful when you need broad website crawling. This Actor is built specifically for documentation-to-RAG workflows:

  • Docs-specific cleanup for Docusaurus, GitBook, and MkDocs/Material
  • Header-aware chunks instead of fixed character splitting
  • embeddingText on every chunk for direct vector database ingestion
  • Page and chunk JSONL exports for batch pipelines
  • Duplicate detection to avoid embedding the same page twice
  • Quality warnings so bad extractions are visible before you embed them
  • Page-based pricing at $1.00 / 1,000 pages, not per generated chunk
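
To make the contrast between header-aware chunks and fixed character splitting concrete, here is an illustrative sketch. It is not the Actor's implementation: splitByHeadings is a hypothetical function that only splits at Markdown headings and tracks the heading path, and it ignores chunkSize and chunkOverlap entirely.

```javascript
// Illustrative sketch only, not the Actor's implementation.
// Splits Markdown at ATX headings (#, ##, ...) and records the heading
// path for each section. Sections with no body text are dropped.
function splitByHeadings(markdown) {
  const FENCE = '`'.repeat(3); // marker that opens or closes a fenced code block
  const sections = [];
  const stack = []; // open headings, e.g. [{ level: 1, title: 'Introduction' }]
  let current = { headingPath: [], lines: [] };
  let inFence = false;

  const flush = () => {
    const text = current.lines.join('\n').trim();
    if (text) sections.push({ headingPath: current.headingPath, text });
  };

  for (const line of markdown.split('\n')) {
    // Lines starting with '#' inside code fences are not headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.+)$/.exec(line);

    if (match) {
      flush();
      const level = match[1].length;
      // Close headings at the same or a deeper level before opening this one.
      while (stack.length && stack[stack.length - 1].level >= level) stack.pop();
      stack.push({ level, title: match[2].trim() });
      current = { headingPath: stack.map((h) => h.title), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return sections;
}
```

A fixed character splitter would cut wherever the character budget runs out, whereas this approach keeps each section attached to the headings that give it context.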

Supported Documentation Platforms

The Actor is optimized for:

  • Docusaurus
  • GitBook
  • MkDocs / Material for MkDocs

Unknown or custom documentation sites use a Readability-based fallback extractor.

Example Use Cases

  • Crawl https://docusaurus.io/docs and create JSONL chunks for a docs chatbot
  • Convert GitBook docs into Markdown files for an internal knowledge base
  • Extract MkDocs/Material documentation into chunk records for Supabase Vector
  • Deduplicate repeated docs pages before embedding to reduce vector database cost
  • Build an AI search index from public developer documentation

Example Input

{
  "startUrls": [{ "url": "https://docusaurus.io/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "includePatterns": ["^https://docusaurus\\.io/docs"],
  "excludePatterns": ["/blog/"],
  "outputFormats": ["json", "markdown"],
  "chunkingEnabled": true,
  "chunkStrategy": "header-aware",
  "chunkSize": 800,
  "chunkOverlap": 100,
  "deduplicateContent": true,
  "respectRobotsTxt": true,
  "maxConcurrency": 5
}

Example Page Output

{
  "recordType": "page",
  "url": "https://docusaurus.io/docs",
  "canonicalUrl": "https://docusaurus.io/docs",
  "title": "Introduction | Docusaurus",
  "metadata": {
    "docsPlatform": "docusaurus",
    "language": "en"
  },
  "tokenCount": 2189,
  "contentHash": "sha256:...",
  "duplicateOf": null,
  "qualityScore": 95,
  "recommendedAction": "use"
}

Page records also include cleanMarkdown, textContent, headings, codeBlocks, tables, links, qualityWarnings, and crawledAt.

Example Chunk Output

{
  "recordType": "chunk",
  "chunkId": "chunk_abc123_000",
  "sourceUrl": "https://docusaurus.io/docs",
  "pageTitle": "Introduction | Docusaurus",
  "sectionTitle": "Getting started",
  "headingPath": ["Introduction", "Getting started"],
  "embeddingText": "Install Docusaurus and create your first docs site...",
  "tokenCount": 392,
  "chunkIndex": 0,
  "previousChunkId": null,
  "nextChunkId": "chunk_abc123_001",
  "contentHash": "..."
}

Chunk records also include chunkMarkdown, chunkText, and metadata such as docsPlatform, hasCodeBlock, hasTable, sourceLastModified, and sourceContentHash.
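
The previousChunkId and nextChunkId fields make it possible to widen the context around a retrieved chunk. The sketch below shows one way to do that, assuming all chunk records for the run are already loaded in memory; withNeighbors is a hypothetical helper, not something the Actor provides.

```javascript
// Hypothetical helper: return a retrieved chunk together with its
// immediate neighbors, in reading order.
// Assumption: `chunks` holds every chunk record from the run.
function withNeighbors(chunks, chunkId) {
  const byId = new Map(chunks.map((chunk) => [chunk.chunkId, chunk]));
  const hit = byId.get(chunkId);
  if (!hit) return [];

  // A null previousChunkId or nextChunkId marks the first or last chunk
  // of a page; the lookup then returns undefined and is filtered out.
  return [
    byId.get(hit.previousChunkId),
    hit,
    byId.get(hit.nextChunkId),
  ].filter(Boolean);
}
```

Joining the chunkMarkdown of the returned records gives a language model the surrounding section rather than an isolated fragment.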

Copy-Paste API Example

import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/docs-to-rag-optimizer').call({
  startUrls: [{ url: 'https://docusaurus.io/docs' }],
  maxPages: 50,
  includePatterns: ['^https://docusaurus\\.io/docs'],
  outputFormats: ['json', 'markdown'],
  chunkingEnabled: true,
});

// Read the default dataset and keep only the chunk records.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item.recordType === 'chunk');

For embeddings, use embeddingText from chunk records and store sourceUrl, pageTitle, headingPath, contentHash, and metadata as vector metadata.
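
As a rough sketch of that mapping, the function below turns a chunk record into a generic record for a vector store. The output shape (id, text, metadata) and the name toVectorRecord are assumptions made for illustration; adapt them to whatever your vector database client expects.

```javascript
// Hypothetical mapping from a chunk record to a generic vector-store record.
// The { id, text, metadata } shape is an assumption, not an Actor output.
function toVectorRecord(chunk) {
  return {
    id: chunk.chunkId,
    // Text to send to the embedding model.
    text: chunk.embeddingText,
    // Fields stored next to the vector for filtering and citations.
    metadata: {
      ...chunk.metadata,
      sourceUrl: chunk.sourceUrl,
      pageTitle: chunk.pageTitle,
      headingPath: chunk.headingPath,
      contentHash: chunk.contentHash,
    },
  };
}
```

Some vector databases accept only flat, primitive metadata values, in which case headingPath may need to be joined into a single string first.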

Output Locations

  • Named dataset pages: one page record per successfully processed page
  • Named dataset chunks: one chunk record per generated chunk
  • Key-value store pages.jsonl: consolidated page export
  • Key-value store chunks.jsonl: consolidated chunk export
  • Key-value store OUTPUT.json: run summary with counts and export keys
  • Key-value store pages_<sha256>.md: optional per-page Markdown when outputFormats includes markdown
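
The JSONL exports hold one JSON record per line, so they can be loaded with a few lines of code. The sketch below parses the content of chunks.jsonl once you have it as a string or Buffer, for example from the record value returned by the Apify client's key-value store getRecord call. parseJsonl is a hypothetical helper, and the assumption that blank lines may appear and can be skipped is mine.

```javascript
// Hypothetical helper: parse a JSONL export (one JSON object per line)
// into an array of records. Accepts a string or a Buffer.
function parseJsonl(content) {
  return String(content)
    .split('\n')
    .map((line) => line.trim())
    // Skip blank lines, such as a trailing newline at the end of the file.
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}
```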

Pricing

Pricing is based on successfully processed pages:

  • Base price: $1.00 / 1,000 pages
  • Starter discount: $0.90 / 1,000 pages
  • Scale discount: $0.75 / 1,000 pages
  • Business discount: $0.50 / 1,000 pages

The Actor charges the page-processed event only after a page has been crawled, extracted, converted, saved, and chunked when chunking is enabled.

It does not charge per chunk. Large pages may produce many chunks, but billing remains page-based.
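
The arithmetic behind page-based billing can be sketched as follows. The rates are copied from the list above; estimateCostUsd and the tier keys are hypothetical names, and the estimate covers only the page-processed event.

```javascript
// Rates from the pricing list above, in USD per 1,000 processed pages.
const PRICE_PER_1000_PAGES = {
  base: 1.0,
  starter: 0.9,
  scale: 0.75,
  business: 0.5,
};

// Hypothetical helper: estimate the page-processed charge for a run.
// The number of chunks produced does not affect the result.
function estimateCostUsd(pagesProcessed, tier = 'base') {
  const rate = PRICE_PER_1000_PAGES[tier];
  if (typeof rate !== 'number') throw new Error(`Unknown tier: ${tier}`);
  return (pagesProcessed / 1000) * rate;
}
```

For example, a 50-page run at the base price comes to $0.05, and 20,000 pages at the Scale rate come to $15.00.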

Known Limits

  • Private docs behind login are not supported in v1.
  • PDF/DOCX extraction is not included in v1.
  • JavaScript-heavy docs use a Playwright fallback, but static docs are faster and cheaper.
  • Exact duplicate detection uses normalized text hashes; near-duplicate detection is not included yet.

Input Fields

  • startUrls: documentation URLs to start crawling
  • sitemapUrls: optional XML sitemap URLs
  • maxPages: maximum number of successfully processed pages
  • maxDepth: maximum crawl depth
  • includePatterns: JavaScript regex strings for allowed URLs
  • excludePatterns: JavaScript regex strings for blocked URLs
  • crawlOnlyDocs: skip obvious non-doc paths such as blog, pricing, login, legal
  • outputFormats: json, markdown, or both
  • removeSelectors: CSS selectors to remove before extraction
  • keepSelectors: CSS selectors to restrict extraction to specific areas
  • preserveCodeBlocks: keep fenced code blocks
  • preserveTables: keep GitHub-Flavored Markdown tables
  • preserveLinks: keep links in Markdown and JSON
  • chunkingEnabled: generate RAG chunks
  • chunkStrategy: chunking strategy (header-aware)
  • chunkSize: target chunk size in tokens
  • chunkOverlap: approximate chunk overlap in tokens
  • deduplicateContent: mark exact duplicate pages and skip duplicate chunking
  • respectRobotsTxt: respect robots.txt rules
  • maxConcurrency: maximum concurrent requests

URL Pattern Policy

includePatterns and excludePatterns are treated as JavaScript regular expression strings and compiled with new RegExp(pattern).

Example:

{
"includePatterns":["^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript"],
"excludePatterns":["/contributors\\.txt$","/blog/"]
}
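
A quick way to check patterns before starting a crawl is to run them against sample URLs locally. The sketch below compiles each pattern with new RegExp(pattern), as described above. How the Actor combines the two lists is an assumption here: a URL passes when it matches at least one include pattern (or the include list is empty) and no exclude pattern. Both function names are hypothetical.

```javascript
// Hypothetical helper: turn a literal URL prefix into an anchored pattern
// string by escaping regex metacharacters such as the dot.
function prefixToPattern(urlPrefix) {
  return '^' + urlPrefix.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: test a URL against include and exclude patterns.
// Assumption: an empty include list allows everything, and any matching
// exclude pattern blocks the URL.
function isUrlAllowed(url, includePatterns = [], excludePatterns = []) {
  const matchesAny = (patterns) =>
    patterns.some((pattern) => new RegExp(pattern).test(url));

  const included = includePatterns.length === 0 || matchesAny(includePatterns);
  return included && !matchesAny(excludePatterns);
}
```

Escaping matters because an unescaped dot matches any character, so a pattern written as docusaurus.io would also match unintended hosts.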

Quality Signals

Each page includes:

  • qualityScore: deterministic 0-100 score
  • qualityWarnings: extraction/chunking warnings
  • recommendedAction: use, review, or skip

These fields help identify pages that are ready for embedding versus pages that need manual review.
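
One possible way to act on these fields is to sort page records into buckets before embedding. The sketch below is an assumption-laden example, not Actor behavior: triagePages is a hypothetical helper, it sends exact duplicates (records with a non-null duplicateOf) to the skip bucket, and it sends any unrecognized recommendedAction to review.

```javascript
// Hypothetical helper: group page records by what to do with them.
function triagePages(pages) {
  const buckets = { use: [], review: [], skip: [] };
  const knownActions = ['use', 'review', 'skip'];

  for (const page of pages) {
    // Assumption: exact duplicates should not be embedded a second time.
    if (page.duplicateOf) {
      buckets.skip.push(page);
      continue;
    }
    // Assumption: unknown or missing actions fall back to manual review.
    const action = knownActions.includes(page.recommendedAction)
      ? page.recommendedAction
      : 'review';
    buckets[action].push(page);
  }
  return buckets;
}
```

Pages in the use bucket can go straight to embedding, while the review bucket can be inspected together with each page's qualityWarnings.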

Local Development

pnpm install
pnpm run build
pnpm start

Run locally with Apify CLI:

apify run --purge --input-file INPUT.example.json

Search Keywords

RAG, LLM, AI assistant, documentation scraper, docs scraper, Markdown scraper, JSONL export, vector database, embeddings, chunks, semantic chunking, Docusaurus scraper, GitBook scraper, MkDocs scraper, Material for MkDocs, developer docs, AI search, LangChain, LlamaIndex, OpenAI, Pinecone, Supabase Vector.
