Docs-to-RAG Optimizer

Pricing

from $0.50 / 1,000 processed pages

Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.

Developer

Vamsi Krishna

Maintained by Community

Last modified: a month ago

Docs to RAG - Documentation to Markdown, JSONL & AI Chunks

Turn public developer documentation into clean, LLM-ready data for RAG pipelines.

This Apify Actor crawls docs websites, removes navigation/sidebar/footer noise, converts pages to Markdown, splits content into semantic chunks, counts tokens, detects duplicates, and exports JSONL files that are easy to load into vector databases and AI search systems.

Best For

Building AI assistants over product or developer documentation
Preparing docs for OpenAI vector stores, Pinecone, Supabase Vector, Weaviate, Qdrant, Chroma, LangChain, and LlamaIndex
Converting Docusaurus, GitBook, MkDocs/Material, MDN-style, and custom docs pages into clean Markdown
Creating stable page and chunk records with content hashes for incremental RAG ingestion

What You Get

Clean Markdown for every processed page
Page JSON records in the pages dataset
Chunk JSON records in the chunks dataset
Default dataset records for easy Apify Console/API export
Consolidated pages.jsonl and chunks.jsonl exports in the key-value store
Token counts for pages and chunks using OpenAI-style tokenization
Header-aware RAG chunks with heading paths and previous/next chunk IDs
SHA-256 content hashes for pages and chunks
Exact duplicate detection with duplicateOf
RAG quality score, warnings, and recommendedAction
Optional per-page Markdown files in key-value store
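
As a rough illustration of how the content hashes can support incremental ingestion, the sketch below compares the hashes stored from a previous run with the chunk records of a new run. The helper name `planIngestion` and the idea of keeping a `chunkId -> contentHash` map are assumptions made for this example, and it presumes chunk IDs stay stable between runs.

```javascript
// Hypothetical helper: decide which chunks need (re-)embedding between two runs.
// `previousHashes` is a Map of chunkId -> contentHash saved after the last ingestion.
function planIngestion(previousHashes, chunks) {
  const toEmbed = [];
  const unchanged = [];
  const seen = new Set();

  for (const chunk of chunks) {
    seen.add(chunk.chunkId);
    if (previousHashes.get(chunk.chunkId) === chunk.contentHash) {
      unchanged.push(chunk.chunkId); // same content, keep the existing vector
    } else {
      toEmbed.push(chunk.chunkId); // new chunk or changed content
    }
  }

  // Chunks that existed last time but are missing now can be removed from the index.
  const toDelete = [...previousHashes.keys()].filter((id) => !seen.has(id));
  return { toEmbed, unchanged, toDelete };
}
```

Only the IDs in `toEmbed` would need a new embedding call, which is where the cost saving comes from.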

Why Use This Instead of a Generic Web Scraper?

Generic website scrapers are useful when you need broad website crawling. This Actor is built specifically for documentation-to-RAG workflows:

Docs-specific cleanup for Docusaurus, GitBook, and MkDocs/Material
Header-aware chunks instead of fixed character splitting
embeddingText on every chunk for direct vector database ingestion
Page and chunk JSONL exports for batch pipelines
Duplicate detection to avoid embedding the same page twice
Quality warnings so bad extractions are visible before you embed them
Page-based pricing at $1.00 / 1,000 pages, not per generated chunk
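
To make the difference between header-aware and fixed-size splitting concrete, here is a minimal sketch that cuts Markdown at headings and records the heading path of each section. It is illustrative only and is not the Actor's implementation: the function name `splitByHeadings` is hypothetical, and the sketch ignores token limits, overlap, and headings that appear inside code fences.

```javascript
// Illustrative only: split Markdown at ATX headings (#, ##, ...) and track the heading path.
// Text that appears before the first heading is dropped in this simplified version.
function splitByHeadings(markdown) {
  const sections = [];
  const path = []; // path[level - 1] holds the current heading text at that level
  let current = null;

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const level = match[1].length;
      path.length = level - 1; // forget headings at this level and deeper
      path[level - 1] = match[2].trim();
      current = { headingPath: path.filter(Boolean), lines: [] };
      sections.push(current);
    } else if (current && line.trim() !== '') {
      current.lines.push(line);
    }
  }

  // Keep only sections that have body text.
  return sections
    .filter((section) => section.lines.length > 0)
    .map((section) => ({ headingPath: section.headingPath, text: section.lines.join('\n') }));
}
```

A fixed character splitter would cut through the middle of a section; here every piece starts at a heading and carries its position in the document outline.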

Supported Documentation Platforms

The Actor is optimized for:

Docusaurus
GitBook
MkDocs / Material for MkDocs

Unknown or custom documentation sites use a Readability-based fallback extractor.

Example Use Cases

Crawl https://docusaurus.io/docs and create JSONL chunks for a docs chatbot
Convert GitBook docs into Markdown files for an internal knowledge base
Extract MkDocs/Material documentation into chunk records for Supabase Vector
Deduplicate repeated docs pages before embedding to reduce vector database cost
Build an AI search index from public developer documentation

Example Input

{
  "startUrls": [{ "url": "https://docusaurus.io/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "includePatterns": ["^https://docusaurus\\.io/docs"],
  "excludePatterns": ["/blog/"],
  "outputFormats": ["json", "markdown"],
  "chunkingEnabled": true,
  "chunkStrategy": "header-aware",
  "chunkSize": 800,
  "chunkOverlap": 100,
  "deduplicateContent": true,
  "respectRobotsTxt": true,
  "maxConcurrency": 5
}

Example Page Output

{
  "recordType": "page",
  "url": "https://docusaurus.io/docs",
  "canonicalUrl": "https://docusaurus.io/docs",
  "title": "Introduction | Docusaurus",
  "metadata": {
    "docsPlatform": "docusaurus",
    "language": "en"
  },
  "tokenCount": 2189,
  "contentHash": "sha256:...",
  "duplicateOf": null,
  "qualityScore": 95,
  "recommendedAction": "use"
}

Page records also include cleanMarkdown, textContent, headings, codeBlocks, tables, links, qualityWarnings, and crawledAt.

Example Chunk Output

{
  "recordType": "chunk",
  "chunkId": "chunk_abc123_000",
  "sourceUrl": "https://docusaurus.io/docs",
  "pageTitle": "Introduction | Docusaurus",
  "sectionTitle": "Getting started",
  "headingPath": ["Introduction", "Getting started"],
  "embeddingText": "Install Docusaurus and create your first docs site...",
  "tokenCount": 392,
  "chunkIndex": 0,
  "previousChunkId": null,
  "nextChunkId": "chunk_abc123_001",
  "contentHash": "..."
}

Chunk records also include chunkMarkdown, chunkText, and metadata such as docsPlatform, hasCodeBlock, hasTable, sourceLastModified, and sourceContentHash.
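
One possible use of previousChunkId and nextChunkId is to widen a retrieved chunk with its neighbours before handing it to an LLM. The sketch below assumes the chunk records have already been loaded into a `Map` keyed by chunkId; the helper name `withNeighbors` and the `radius` parameter are made up for this example.

```javascript
// Hypothetical helper: return a chunk together with up to `radius` neighbours on each side,
// following the previousChunkId / nextChunkId links of the chunk records.
function withNeighbors(chunksById, chunkId, radius = 1) {
  const center = chunksById.get(chunkId);
  if (!center) return [];

  const before = [];
  let cursor = center;
  for (let i = 0; i < radius && cursor.previousChunkId; i++) {
    cursor = chunksById.get(cursor.previousChunkId);
    if (!cursor) break; // link points to a chunk we did not load
    before.unshift(cursor);
  }

  const after = [];
  cursor = center;
  for (let i = 0; i < radius && cursor.nextChunkId; i++) {
    cursor = chunksById.get(cursor.nextChunkId);
    if (!cursor) break;
    after.push(cursor);
  }

  return [...before, center, ...after]; // reading order
}
```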

Copy-Paste API Example

import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/docs-to-rag-optimizer').call({
  startUrls: [{ url: 'https://docusaurus.io/docs' }],
  maxPages: 50,
  includePatterns: ['^https://docusaurus\\.io/docs'],
  outputFormats: ['json', 'markdown'],
  chunkingEnabled: true,
});

// Read the default dataset and keep only the chunk records.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item.recordType === 'chunk');

For embeddings, use embeddingText from chunk records and store sourceUrl, pageTitle, headingPath, contentHash, and metadata as vector metadata.
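
The mapping described above might look like the following sketch. The `{ id, text, metadata }` shape and the name `toVectorRecord` are assumptions for illustration; each vector database has its own upsert format, so adapt the result accordingly.

```javascript
// Hypothetical mapping from a chunk record to a generic vector-store record.
// The { id, text, metadata } shape is an assumption, not a specific database's API.
function toVectorRecord(chunk) {
  return {
    id: chunk.chunkId,
    text: chunk.embeddingText, // the string to send to the embedding model
    metadata: {
      ...(chunk.metadata ?? {}), // docsPlatform, hasCodeBlock, and so on
      sourceUrl: chunk.sourceUrl,
      pageTitle: chunk.pageTitle,
      // Some vector stores only accept flat metadata values, so join the path.
      headingPath: (chunk.headingPath ?? []).join(' > '),
      contentHash: chunk.contentHash,
    },
  };
}
```

Keeping contentHash in the metadata makes it possible to check later whether a stored vector is still in sync with the source page.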

Output Locations

Named dataset pages: one page record per successfully processed page
Named dataset chunks: one chunk record per generated chunk
Key-value store pages.jsonl: consolidated page export
Key-value store chunks.jsonl: consolidated chunk export
Key-value store OUTPUT.json: run summary with counts and export keys
Key-value store pages_<sha256>.md: optional per-page Markdown when outputFormats includes markdown
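
The JSONL exports hold one JSON record per line, so they can be parsed without a dedicated library. The sketch below assumes the file content is already available as a string, for example after fetching the record with apify-client's `keyValueStore(...).getRecord(...)`; the helper name `parseJsonl` is hypothetical.

```javascript
// Hypothetical helper: parse JSONL text (one JSON object per line) into an array of records.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '') // tolerate blank lines and a trailing newline
    .map((line, index) => {
      try {
        return JSON.parse(line);
      } catch (err) {
        // `index` counts non-empty lines only.
        throw new Error(`Invalid JSON in non-empty line ${index + 1}: ${err.message}`);
      }
    });
}
```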

Pricing

Pricing is based on successfully processed pages:

Base price: $1.00 / 1,000 pages
Starter discount: $0.90 / 1,000 pages
Scale discount: $0.75 / 1,000 pages
Business discount: $0.50 / 1,000 pages

The Actor charges a page-processed event only after a page has been crawled, extracted, converted, and saved, and, when chunking is enabled, chunked.

It does not charge per chunk. Large pages may produce many chunks, but billing remains page-based.
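
For budgeting, the page-based charge can be estimated from the rates listed above. This is a small sketch covering only the Actor's per-page charge; the tier keys (`base`, `starter`, `scale`, `business`) and the function name `estimateCostUsd` are labels chosen for this example.

```javascript
// Rates in USD per 1,000 processed pages, copied from the pricing list above.
const RATE_PER_1000_PAGES = new Map([
  ['base', 1.0],
  ['starter', 0.9],
  ['scale', 0.75],
  ['business', 0.5],
]);

// Hypothetical helper: estimate the page-processed charge for a run.
function estimateCostUsd(pages, tier = 'base') {
  const rate = RATE_PER_1000_PAGES.get(tier);
  if (rate === undefined) throw new Error(`Unknown tier: ${tier}`);
  return (pages / 1000) * rate;
}
```

For example, a 50-page run at the base rate comes to `estimateCostUsd(50)`, which is $0.05, regardless of how many chunks those pages produce.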

Known Limits

Private docs behind login are not supported in v1.
PDF/DOCX extraction is not included in v1.
JavaScript-heavy docs use a Playwright fallback, but static docs are faster and cheaper.
Exact duplicate detection uses normalized text hashes; near-duplicate detection is not included yet.
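
To show what exact duplicate detection over normalized text hashes can look like, here is a minimal sketch. The normalization steps (Unicode NFC, whitespace collapsing, lowercasing) and the choice to store the first page's URL in duplicateOf are assumptions for illustration; the Actor's actual normalization rules and duplicateOf format are not documented here.

```javascript
const { createHash } = require('node:crypto');

// Assumed normalization: Unicode NFC, collapse whitespace, trim, lowercase.
function normalizedHash(text) {
  const normalized = text.normalize('NFC').replace(/\s+/g, ' ').trim().toLowerCase();
  return 'sha256:' + createHash('sha256').update(normalized, 'utf8').digest('hex');
}

// Hypothetical helper: mark each page whose normalized text was already seen earlier.
function markDuplicates(pages) {
  const firstSeen = new Map(); // contentHash -> URL of the first page with that hash
  return pages.map((page) => {
    const contentHash = normalizedHash(page.textContent);
    const duplicateOf = firstSeen.get(contentHash) ?? null;
    if (duplicateOf === null) firstSeen.set(contentHash, page.url);
    return { url: page.url, contentHash, duplicateOf };
  });
}
```

Because the comparison is on hashes, two pages that differ by a single word get different hashes, which is why near-duplicates are not caught by this approach.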

Input Fields

startUrls: documentation URLs to start crawling
sitemapUrls: optional XML sitemap URLs
maxPages: maximum number of successfully processed pages
maxDepth: maximum crawl depth
includePatterns: JavaScript regex strings for allowed URLs
excludePatterns: JavaScript regex strings for blocked URLs
crawlOnlyDocs: skip obvious non-doc paths such as blog, pricing, login, legal
outputFormats: json, markdown, or both
removeSelectors: CSS selectors to remove before extraction
keepSelectors: CSS selectors to restrict extraction to specific areas
preserveCodeBlocks: keep fenced code blocks
preserveTables: keep GitHub-Flavored Markdown tables
preserveLinks: keep links in Markdown and JSON
chunkingEnabled: generate RAG chunks
chunkStrategy: header-aware
chunkSize: target chunk size in tokens
chunkOverlap: approximate chunk overlap in tokens
deduplicateContent: mark exact duplicate pages and skip duplicate chunking
respectRobotsTxt: respect robots.txt rules
maxConcurrency: maximum concurrent requests

URL Pattern Policy

includePatterns and excludePatterns are treated as JavaScript regular expression strings and compiled with new RegExp(pattern).

Example:

{
  "includePatterns": ["^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript"],
  "excludePatterns": ["/contributors\\.txt$", "/blog/"]
}
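
The patterns above can be tried out locally before starting a run. The sketch below compiles them with `new RegExp(pattern)` as described; how the Actor combines the two lists is an assumption here (a URL passes if it matches at least one include pattern, or none are given, and matches no exclude pattern), and `buildUrlFilter` is a hypothetical name.

```javascript
// Assumed semantics: allow a URL when it matches any include pattern (or the include
// list is empty) and matches none of the exclude patterns.
function buildUrlFilter({ includePatterns = [], excludePatterns = [] }) {
  const includes = includePatterns.map((pattern) => new RegExp(pattern));
  const excludes = excludePatterns.map((pattern) => new RegExp(pattern));
  return (url) =>
    (includes.length === 0 || includes.some((re) => re.test(url))) &&
    !excludes.some((re) => re.test(url));
}

const isAllowed = buildUrlFilter({
  includePatterns: ['^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript'],
  excludePatterns: ['/contributors\\.txt$', '/blog/'],
});
```

Note the doubled backslashes: inside a JSON or JavaScript string, `\\.` is needed so the regular expression receives `\.` and matches a literal dot.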

Quality Signals

Each page includes:

qualityScore: deterministic 0-100 score
qualityWarnings: extraction/chunking warnings
recommendedAction: use, review, or skip

These fields help identify pages that are ready for embedding versus pages that need manual review.
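
A simple way to act on these fields is to sort page records into buckets before embedding. In this sketch the helper name `partitionByAction` is hypothetical, and sending unexpected or missing values to the review bucket is a conservative choice made for the example.

```javascript
// Hypothetical helper: group page records by recommendedAction (use, review, or skip).
function partitionByAction(pages) {
  const buckets = { use: [], review: [], skip: [] };
  const known = ['use', 'review', 'skip'];
  for (const page of pages) {
    // Treat missing or unexpected values conservatively as "review".
    const action = known.includes(page.recommendedAction) ? page.recommendedAction : 'review';
    buckets[action].push(page);
  }
  return buckets;
}
```

Pages in `use` can go straight to embedding, while the qualityWarnings of pages in `review` show what to check by hand.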

Local Development

pnpm install
pnpm run build
pnpm start

Run locally with Apify CLI:

apify run --purge --input-file INPUT.example.json

Search Keywords

RAG, LLM, AI assistant, documentation scraper, docs scraper, Markdown scraper, JSONL export, vector database, embeddings, chunks, semantic chunking, Docusaurus scraper, GitBook scraper, MkDocs scraper, Material for MkDocs, developer docs, AI search, LangChain, LlamaIndex, OpenAI, Pinecone, Supabase Vector.

URL: https://apify.com/vamsi-krishna/docs-to-rag-optimizer