URL: https://apify.com/ambitious_door/ragdocs-extractor

RAG Docs Extractor - Docs to LLM Chunks · Apify


RAG Docs Extractor - Documentation to Chunks

Pricing

from $10.00 / 1,000 documents processed

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata.

Rating

0.0 (0)

Developer

πŸ‘ C. K.

C. K.

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 days ago

RAG Docs Extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata (source URL, heading path, token count). No post-processing. Pay per document processed.

What it does

Most doc scrapers give you raw HTML or a single wall of text. You then spend hours cleaning, splitting, and fixing broken context before anything is usable in a vector store. This Actor eliminates that step entirely.

Give it a documentation URL. It crawls the site, strips navigation/chrome, converts to clean markdown, and splits each page into semantically meaningful chunks that respect heading boundaries. Every chunk includes the metadata you need for retrieval: source URL, heading path (so you know where in the doc tree it came from), and token count (so you can plan your embedding budget).

The output drops straight into any vector store or RAG pipeline without cleanup.

Output format

Each chunk in the dataset contains:

| Field | Type | Description |
|---|---|---|
| content | string | The chunk text in markdown or plain text |
| heading_path | string | Hierarchical path, e.g. "Guide > Installation > Requirements" |
| chunk_index | integer | Position of this chunk within its source document |
| token_count | integer | Token count (cl100k_base encoding) |
| source_url | string | The URL this chunk was extracted from |
| document_title | string | Page title |
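
As a rough illustration of how a consumer might handle these fields, the sketch below types one dataset item and reshapes it into an id/text/metadata record of the kind many vector stores accept. The `Chunk` type, the `to_embedding_record` helper, the ID scheme, and the sample values are all assumptions made for this example; the Actor itself only promises the six fields listed above.

```python
from typing import TypedDict


class Chunk(TypedDict):
    """One dataset item, mirroring the documented output fields."""
    content: str
    heading_path: str
    chunk_index: int
    token_count: int
    source_url: str
    document_title: str


def to_embedding_record(chunk: Chunk) -> dict:
    """Reshape a chunk into an id/text/metadata record (hypothetical layout)."""
    return {
        # Assumed ID scheme: source URL plus the chunk's position on that page.
        "id": f"{chunk['source_url']}#{chunk['chunk_index']}",
        # Prepending the heading path gives the embedded text some context.
        "text": f"{chunk['heading_path']}\n\n{chunk['content']}",
        "metadata": {
            "source_url": chunk["source_url"],
            "document_title": chunk["document_title"],
            "token_count": chunk["token_count"],
        },
    }


# Illustrative sample values, not real Actor output.
example: Chunk = {
    "content": "Python 3.9 or newer is required.",
    "heading_path": "Guide > Installation > Requirements",
    "chunk_index": 2,
    "token_count": 9,
    "source_url": "https://example.com/docs/install",
    "document_title": "Installation",
}
record = to_embedding_record(example)
print(record["id"])
```

Keeping `source_url` and `heading_path` alongside the text is what makes it possible to cite the originating section at retrieval time.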

Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Documentation URL to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl |
| maxChunkTokens | integer | 512 | Target max tokens per chunk |
| crawlSameDomain | boolean | true | Stay within the start URL's domain |
| pathPrefix | string | "" | Only crawl paths starting with this prefix |
| outputFormat | string | "markdown" | "markdown" or "plain_text" |
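
To make the interaction between crawlSameDomain and pathPrefix concrete, here is one plausible reading of those two filters as a small predicate. The `in_scope` function is hypothetical: the Actor's actual matching rules (for example, how it treats subdomains or trailing slashes) are not documented here, so treat this as an assumption rather than a description of its internals.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_url: str,
             crawl_same_domain: bool = True, path_prefix: str = "") -> bool:
    """Return True if `url` would be crawled under the assumed filter rules."""
    target, start = urlparse(url), urlparse(start_url)
    # Assumption: "same domain" means an exact host match with the start URL.
    if crawl_same_domain and target.netloc != start.netloc:
        return False
    # An empty prefix matches every path, which mirrors the default of "".
    return target.path.startswith(path_prefix)


start = "https://fastapi.tiangolo.com/"
print(in_scope("https://fastapi.tiangolo.com/tutorial/first-steps/", start,
               path_prefix="/tutorial/"))
print(in_scope("https://fastapi.tiangolo.com/advanced/", start,
               path_prefix="/tutorial/"))
```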

Example usage

Single page extraction

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Full docs site

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "pathPrefix": "/tutorial/",
  "maxChunkTokens": 256
}
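
The inputs above can also be submitted programmatically. The following is a minimal sketch using the official `apify-client` Python package, with the Actor ID taken from this listing's URL. The helper names, the `APIFY_TOKEN` environment variable, and the assumption that the run result exposes `defaultDatasetId` as a dictionary key are choices made for this example; check the client documentation for the version you have installed.

```python
import os


def build_run_input(start_url: str, max_pages: int = 50,
                    path_prefix: str = "", max_chunk_tokens: int = 512) -> dict:
    """Assemble the Actor input using the documented parameter names."""
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "pathPrefix": path_prefix,
        "maxChunkTokens": max_chunk_tokens,
    }


def run_extractor(run_input: dict) -> list[dict]:
    """Run the Actor and return its chunks (needs network access and a token)."""
    # Imported lazily so the input builder works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    run = client.actor("ambitious_door/ragdocs-extractor").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


full_site_input = build_run_input(
    "https://fastapi.tiangolo.com/",
    max_pages=100,
    path_prefix="/tutorial/",
    max_chunk_tokens=256,
)
print(full_site_input)
```

Calling `run_extractor(full_site_input)` would start a billable run, so it is left out of the snippet on purpose.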

Pricing

This Actor uses the pay-per-event model. You are charged per document (page) successfully processed and chunked. Pages that are skipped (empty or non-content) are not charged.
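
For budgeting, the listed price can be turned into a quick estimate. This sketch assumes the "from $10.00 / 1,000" rate shown on this page applies to every processed document; because that figure is quoted as a starting price, the result is best read as a lower bound. The function and constant names are made up for the example.

```python
# Listed "from" price on this page; actual rates may be higher.
PRICE_PER_1000_DOCS_USD = 10.00


def estimate_cost_usd(pages_processed: int) -> float:
    """Estimate the charge for pages that were processed and chunked."""
    # Skipped pages are not billed, so only count successfully processed ones.
    return pages_processed * PRICE_PER_1000_DOCS_USD / 1000


print(estimate_cost_usd(100))
```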

How the chunking works

  1. HTML cleaning: strips navigation, sidebars, footers, cookie banners, and other non-content elements using a curated set of selectors. Falls back to <article>, <main>, or <body>.
  2. Markdown conversion: converts the cleaned HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
  3. Semantic splitting: splits on heading boundaries first, then paragraph boundaries, then sentence boundaries. Each chunk inherits the heading hierarchy from its position in the document.
  4. Token counting: uses cl100k_base (the encoding used by GPT-4 and OpenAI's text-embedding models) for token counts. Counts are exact for models that use this encoding and approximate for others.
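
The splitting step can be sketched roughly as follows. This is not the Actor's implementation: it is a simplified illustration that splits on headings, then packs paragraphs up to a token budget, while tracking the heading path. It omits sentence-level splitting, does not special-case fenced code blocks (a `#` comment inside one would be misread as a heading), and uses a whitespace word count as a stand-in for cl100k_base. All names are hypothetical.

```python
import re
from typing import Callable

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def approx_tokens(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT cl100k_base."""
    return len(text.split())


def chunk_markdown(markdown: str, max_tokens: int = 512,
                   count: Callable[[str], int] = approx_tokens) -> list[dict]:
    chunks: list[dict] = []
    path: list[str] = []       # current heading stack, e.g. ["Guide", "Installation"]
    paragraph: list[str] = []  # lines of the paragraph being read
    buffer: list[str] = []     # paragraphs packed into the current chunk

    def flush_chunk() -> None:
        if buffer:
            text = "\n\n".join(buffer)
            chunks.append({
                "content": text,
                "heading_path": " > ".join(path),
                "chunk_index": len(chunks),
                "token_count": count(text),
            })
            buffer.clear()

    def flush_paragraph() -> None:
        if paragraph:
            text = "\n".join(paragraph)
            paragraph.clear()
            # Start a new chunk if adding this paragraph would exceed the budget.
            if buffer and count("\n\n".join(buffer + [text])) > max_tokens:
                flush_chunk()
            buffer.append(text)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush_paragraph()
            flush_chunk()  # emit pending text before the heading path changes
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        elif line.strip():
            paragraph.append(line)
        else:
            flush_paragraph()
    flush_paragraph()
    flush_chunk()
    return chunks


doc = "# Guide\n\nIntro text here.\n\n## Installation\n\nRun the installer.\n\nThen restart.\n"
for chunk in chunk_markdown(doc, max_tokens=4):
    print(chunk["chunk_index"], chunk["heading_path"], chunk["token_count"])
```

A single paragraph larger than the budget is emitted as one oversized chunk here; a fuller version would fall back to sentence boundaries at that point, as the steps above describe.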

Responsible use

  • This Actor respects robots.txt by default (enforced by Crawlee).
  • It identifies itself with a descriptive User-Agent header so site owners can recognize and block it.
  • Crawlee's built-in autoscaling keeps request rates reasonable and avoids overloading target servers.
  • You are responsible for ensuring your use complies with the target site's Terms of Service. Only crawl content you have the right to access and process.
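
If you want to check a site's rules yourself before starting a run, Python's standard library can evaluate a robots.txt file. The sketch below is an independent pre-check, not how Crawlee or this Actor enforces robots.txt; the helper name, the sample rules, and the `example-docs-bot` user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate already-downloaded robots.txt content for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "example-docs-bot", "https://example.com/docs/intro"))
print(allowed_by_robots(rules, "example-docs-bot", "https://example.com/private/keys"))
```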
