RAG Docs Extractor - Documentation to Chunks

Pricing

from $10.00 / 1,000 documents processed

Developer

C. K.

Maintained by Community

Last modified: 6 days ago

RAG Docs Extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata (source URL, heading path, token count). No post-processing. Pay per document processed.

What it does

Most doc scrapers give you raw HTML or a single wall of text. You then spend hours cleaning, splitting, and fixing broken context before anything is usable in a vector store. This Actor eliminates that step entirely.

Give it a documentation URL. It crawls the site, strips navigation/chrome, converts to clean markdown, and splits each page into semantically meaningful chunks that respect heading boundaries. Every chunk includes the metadata you need for retrieval: source URL, heading path (so you know where in the doc tree it came from), and token count (so you can plan your embedding budget).

The output drops straight into any vector store or RAG pipeline without cleanup.

Output format

Each chunk in the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| `content` | string | The chunk text in markdown or plain text |
| `heading_path` | string | Hierarchical path, e.g. `"Guide > Installation > Requirements"` |
| `chunk_index` | integer | Position of this chunk within its source document |
| `token_count` | integer | Token count (cl100k_base encoding) |
| `source_url` | string | The URL this chunk was extracted from |
| `document_title` | string | Page title |
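To make the record shape concrete, here is a minimal Python sketch that models one dataset item with the fields listed above and reshapes it for a vector store. The `Chunk` class, the `chunk_id` hashing scheme, and the `to_vector_record` layout are illustrative assumptions, not part of the Actor; the sample values are made up.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Chunk:
    """One dataset item, using the documented field names."""
    content: str
    heading_path: str
    chunk_index: int
    token_count: int
    source_url: str
    document_title: str


def chunk_id(chunk: Chunk) -> str:
    """Hypothetical stable ID: hash of the source URL plus chunk position."""
    key = f"{chunk.source_url}#{chunk.chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def to_vector_record(chunk: Chunk) -> dict:
    """Shape a chunk as an (id, text, metadata) record for a vector store."""
    return {
        "id": chunk_id(chunk),
        "text": chunk.content,
        "metadata": {
            "source_url": chunk.source_url,
            "heading_path": chunk.heading_path,
            "document_title": chunk.document_title,
            "token_count": chunk.token_count,
        },
    }


# Illustrative values only; this is not real Actor output.
example = Chunk(
    content="Python 3.9 or later is required.",
    heading_path="Guide > Installation > Requirements",
    chunk_index=0,
    token_count=9,
    source_url="https://example.com/docs/install",
    document_title="Installation",
)
record = to_vector_record(example)
```

Deriving the ID from `source_url` and `chunk_index` keeps it stable across re-runs as long as the page splits the same way, which makes upserts into a vector store idempotent.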

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrl` | string | required | Documentation URL to start crawling from |
| `maxPages` | integer | 50 | Maximum pages to crawl |
| `maxChunkTokens` | integer | 512 | Target max tokens per chunk |
| `crawlSameDomain` | boolean | true | Stay within the start URL's domain |
| `pathPrefix` | string | `""` | Only crawl paths starting with this prefix |
| `outputFormat` | string | `"markdown"` | `"markdown"` or `"plain_text"` |
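If you assemble the input programmatically, a small helper can apply the documented defaults and catch obvious mistakes before a run starts. The sketch below is one possible approach; `build_run_input` is a hypothetical helper, and its validation rules are assumptions about reasonable values rather than checks the Actor is known to perform.

```python
from typing import Any

ALLOWED_FORMATS = {"markdown", "plain_text"}


def build_run_input(
    start_url: str,
    max_pages: int = 50,
    max_chunk_tokens: int = 512,
    crawl_same_domain: bool = True,
    path_prefix: str = "",
    output_format: str = "markdown",
) -> dict[str, Any]:
    """Return an input dict with the documented parameter names and defaults."""
    # These checks are assumptions; the Actor may validate differently.
    if not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    if max_pages < 1 or max_chunk_tokens < 1:
        raise ValueError("maxPages and maxChunkTokens must be positive")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "maxChunkTokens": max_chunk_tokens,
        "crawlSameDomain": crawl_same_domain,
        "pathPrefix": path_prefix,
        "outputFormat": output_format,
    }


run_input = build_run_input(
    "https://fastapi.tiangolo.com/",
    max_pages=100,
    path_prefix="/tutorial/",
    max_chunk_tokens=256,
)
```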

Example usage

Single page extraction

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Full docs site

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "pathPrefix": "/tutorial/",
  "maxChunkTokens": 256
}
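The same inputs can be sent from Python. The sketch below assumes the official `apify-client` package and takes the Actor ID from this listing's URL; both are assumptions, so check them against your Apify console. `fetch_chunks` is a hypothetical helper that accepts any object shaped like `ApifyClient`, which keeps it easy to test without network access.

```python
from typing import Any, Iterator

# Actor ID inferred from this page's URL; treat it as an assumption.
ACTOR_ID = "ambitious_door/ragdocs-extractor"


def fetch_chunks(client: Any, run_input: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Run the Actor, wait for it to finish, and yield its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()


# Usage (requires `pip install apify-client` and an API token):
#
#   from apify_client import ApifyClient
#
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   run_input = {
#       "startUrl": "https://fastapi.tiangolo.com/",
#       "maxPages": 100,
#       "pathPrefix": "/tutorial/",
#       "maxChunkTokens": 256,
#   }
#   for chunk in fetch_chunks(client, run_input):
#       print(chunk["heading_path"], chunk["token_count"])
```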

Pricing

This Actor uses the pay-per-event model. You are charged per document (page) successfully processed and chunked. Skipped pages (empty or without extractable content) are not charged.
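For budgeting, the arithmetic can be sketched as follows. It assumes the listed starting price of $10.00 per 1,000 documents applies to your plan; because the listing says "from", your actual rate may differ. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Listed starting price: $10.00 per 1,000 documents processed (assumed flat rate).
PRICE_PER_1000_DOCS = Decimal("10.00")


def estimate_cost(pages_crawled: int, pages_skipped: int = 0) -> Decimal:
    """Estimate the charge in USD; skipped pages are not billed."""
    if pages_crawled < 0 or not 0 <= pages_skipped <= pages_crawled:
        raise ValueError("page counts must satisfy 0 <= skipped <= crawled")
    billable = pages_crawled - pages_skipped
    return (PRICE_PER_1000_DOCS * billable / 1000).quantize(Decimal("0.01"))


full_run = estimate_cost(100)               # 100 billable pages
with_skips = estimate_cost(50, pages_skipped=5)  # 45 billable pages
```

At this rate a 100-page crawl with no skipped pages comes to $1.00.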

How the chunking works

1. HTML cleaning: strips navigation, sidebars, footers, cookie banners, and other non-content elements using a curated set of selectors. Falls back to `<article>`, `<main>`, or `<body>`.
2. Markdown conversion: converts the cleaned HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
3. Semantic splitting: splits on heading boundaries first, then paragraph boundaries, then sentence boundaries. Each chunk inherits the heading hierarchy from its position in the document.
4. Token counting: uses tiktoken's cl100k_base encoding (used by GPT-4 and by OpenAI embedding models such as text-embedding-3-small). Counts are exact for those models and approximate for models with a different tokenizer.
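The splitting order described above (headings, then paragraphs, then sentences) can be sketched in plain Python. This is an illustration under stated assumptions, not the Actor's implementation: `chunk_markdown` and its helpers are hypothetical, heading lines go into `heading_path` rather than `content`, and the default token counter is a whitespace word count so the sketch runs without dependencies.

```python
import re
from typing import Callable

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def _pack(pieces: list[str], max_tokens: int, count: Callable[[str], int], joiner: str) -> list[str]:
    """Greedily merge consecutive pieces while the result fits in max_tokens."""
    packed, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if current and count(candidate) > max_tokens:
            packed.append(current)
            current = piece
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def _split_section(text: str, max_tokens: int, count: Callable[[str], int]) -> list[str]:
    """Split an oversized section on paragraphs, then on sentences."""
    if count(text) <= max_tokens:
        return [text]
    pieces: list[str] = []
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        if count(para) <= max_tokens:
            pieces.append(para)
        else:
            sentences = re.split(r"(?<=[.!?])\s+", para)
            pieces.extend(_pack(sentences, max_tokens, count, " "))
    return _pack(pieces, max_tokens, count, "\n\n")


def chunk_markdown(markdown: str, max_tokens: int = 512,
                   count: Callable[[str], int] = lambda t: len(t.split())) -> list[dict]:
    """Return chunks carrying the heading path of the section they came from."""
    sections: list[tuple[str, str]] = []  # (heading_path, body text)
    stack: list[tuple[int, str]] = []     # open headings as (level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()

    chunks: list[dict] = []
    for path, text in sections:
        for part in _split_section(text, max_tokens, count):
            chunks.append({"content": part, "heading_path": path,
                           "chunk_index": len(chunks), "token_count": count(part)})
    return chunks


doc = ("# Guide\n\nIntro text.\n\n## Installation\n\nRun the installer.\n\n"
       "### Requirements\n\nPython is required. A network connection is required.\n")
chunks = chunk_markdown(doc, max_tokens=4)
```

A single sentence longer than the limit is kept whole, which matches the description of `maxChunkTokens` as a target rather than a hard cap. To count with cl100k_base instead of words, pass `count=lambda t: len(tiktoken.get_encoding("cl100k_base").encode(t))` after installing `tiktoken`.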

Responsible use

- This Actor respects robots.txt by default (enforced by Crawlee).
- It sends a descriptive User-Agent header so site owners can recognize it and block it if they choose.
- Crawlee's built-in autoscaling keeps request rates reasonable and avoids overloading target servers.
- You are responsible for ensuring your use complies with the target site's Terms of Service. Only crawl content you have the right to access and process.
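If you want to check in advance whether a URL is likely to be crawled, Python's standard library can evaluate robots.txt rules directly. The sketch below is independent of the Actor, which relies on Crawlee for enforcement; the `USER_AGENT` value and the `allowed_by_robots` helper are hypothetical, and the rules shown are a made-up sample.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent for illustration; the Actor's real one is not documented here.
USER_AGENT = "example-docs-bot"


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up sample rules.
rules = """\
User-agent: *
Disallow: /private/
"""

docs_ok = allowed_by_robots(rules, "https://example.com/docs/intro")
private_ok = allowed_by_robots(rules, "https://example.com/private/page")
```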

Built with

- Crawlee for reliable crawling (robots.txt compliant)
- BeautifulSoup for HTML parsing
- tiktoken for token counting

URL: https://apify.com/ambitious_door/ragdocs-extractor
