URL: https://apify.com/commonelements/rag-ready-crawler

RAG Web Crawler: Website to Markdown & RAG Chunks · Apify

RAG Web Crawler: Clean Markdown + Token-Sized Chunks

Under maintenance

Pricing

$2.00 / 1,000 dataset items scraped

Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.

Rating: 0.0 (0)

Developer: Harry Schoeller

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 10 days ago

RAG Web Crawler: Clean Markdown + Token-Sized Chunks, Pay-Per-Result

Turn any website into embeddings-ready chunks with citations and predictable per-chunk pricing. No CSS tuning, no runaway compute bills.

Generic crawlers hand you raw pages and make you build the RAG pipeline yourself. This actor hands you clean, token-sized, deduplicated, citable chunks, at a fixed price per chunk you keep.

What it does

  • Clean LLM-ready Markdown: @mozilla/readability strips nav, footers, ads, and cookie banners; turndown + GFM converts the cleaned DOM to Markdown with heading hierarchy, fenced code blocks, and tables preserved.
  • Structure-aware, token-budgeted chunking: splits on the heading tree, then recursively sub-splits oversized sections to your token budget (default 512) with overlap (default 75). Code blocks and tables are kept intact, never split mid-block.
  • Rich per-chunk provenance: every chunk ships source URL + deep anchor, page title, full headings path, content hash, token count, content type, and language for metadata-filtered vector search and deep-link citations.
  • Dedup + junk filtering: exact content-hash dedup plus 64-bit SimHash near-duplicate collapsing, and low-information / nav-residue chunk filtering.
  • Four output formats: chunks-jsonl (one record per chunk), markdown (one record per page), langchain (drop-in {page_content, metadata} Document JSON), and jsonl-bulk (flat one-record-per-chunk for DB/COPY/pgvector).
  • Incremental / delta sync: on scheduled re-runs, only NEW or CHANGED pages are re-emitted (and billed). Makes daily/weekly crawls cheap.
  • Budget guarantee: maxPages is a hard ceiling; billing is per emitted result, so a runaway crawl can never produce a runaway bill.
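The structure-aware chunking described in the list above can be illustrated with a minimal Python sketch. This is not the actor's implementation: the function names are hypothetical, the whitespace word count is only a stand-in for a real tokenizer, and overlap and sub-splitting of oversized blocks are left out. It shows two ideas only: every heading starts a new chunk and updates the headings path, and a fenced code block is never broken apart.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def estimate_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but keep each fenced code block in one piece."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_markdown(markdown: str, chunk_size: int = 512) -> list[dict]:
    chunks, path, buf = [], [], []

    def flush() -> None:
        if buf:
            text = "\n\n".join(buf)
            chunks.append({
                "headingsPath": list(path),
                "text": text,
                "tokenEstimate": estimate_tokens(text),
            })
            buf.clear()

    for block in split_blocks(markdown):
        heading = re.match(r"(#{1,6})\s+(.*)", block)
        if heading:
            flush()  # a heading always closes the previous chunk
            level = len(heading.group(1))
            del path[level - 1:]  # drop deeper or sibling headings
            path.append(heading.group(2).strip())
        elif buf and estimate_tokens("\n\n".join(buf + [block])) > chunk_size:
            flush()  # adding this block would exceed the budget
        buf.append(block)
    flush()
    return chunks
```

A single block that is larger than the budget (for example a long code block) is emitted whole in this sketch rather than split, which mirrors the "never split mid-block" rule for code but not the recursive sub-splitting the actor applies to oversized prose sections.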

Incremental / delta sync: cheap scheduled re-runs

Turn on Incremental sync (incremental: true) and schedule the actor to run daily or weekly. The first run does a full crawl and seeds a per-URL content-hash state in a named key-value store. Every later run crawls the site, but only re-emits chunks for pages that are new or changed; unchanged pages cost nothing. A typical weekly docs re-crawl re-emits a handful of pages instead of hundreds.

  • State is automatic. The state store name defaults to a deterministic hash of your start URLs, so a scheduled task reuses its own prior state with zero config. Set stateStoreName explicitly to share state across tasks/schedules.
  • forceFullCrawl: true re-emits everything and rebuilds the baseline; use after changing chunking settings or to refresh a stale index.
  • emitDeletions: true writes a tombstone record ({ deleted: true, url, ... }) to a separate deletions dataset for every URL that disappeared since the last run, so downstream vector stores can purge stale vectors. Tombstones are not billed.
  • When incremental is ON, each emitted record carries a change_status (new | changed) in its metadata.

The run summary (OUTPUT key-value record) includes a delta block: pages_new, pages_changed, pages_unchanged, pages_deleted, chunks_skipped_unchanged (the spend you saved), state_store, prior_run_id.
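The page-level counts in the delta block follow from comparing two per-URL hash maps. The sketch below shows that comparison in Python under stated assumptions: the function names are hypothetical, the "sha256:" prefix is taken from the sample output on this page, and how the actor actually lays out its state store is not documented here.

```python
import hashlib


def content_hash(text: str) -> str:
    # Prefix format mirrors the content_hash field in the sample records.
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_delta(prior: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare {url: content_hash} maps from the previous and the current run."""
    return {
        "new": sorted(u for u in current if u not in prior),
        "changed": sorted(u for u in current if u in prior and prior[u] != current[u]),
        "unchanged": sorted(u for u in current if prior.get(u) == current[u]),
        "deleted": sorted(u for u in prior if u not in current),
    }
```

In this model only the "new" and "changed" URLs would be re-emitted, and the "deleted" URLs are the ones that would receive tombstones when emitDeletions is on.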

When all incremental options are OFF (the default), behavior and output are byte-for-byte identical to v1.0.

Output (chunks-jsonl)

{
  "id": "a1f3c9e29b2c4d10",
  "url": "https://docs.example.com/guide/install",
  "title": "Getting Started - Example Docs",
  "chunkIndex": 3,
  "chunkTotal": 11,
  "headingsPath": ["Getting Started", "Setup", "Installation"],
  "text": "## Installation\n\nInstall via npm:\n\n```bash\nnpm install crawlee\n```",
  "tokenEstimate": 498,
  "fetchedAt": "2026-06-20T14:02:11Z",
  "content_hash": "sha256:...",
  "metadata": {
    "source_url": "https://docs.example.com/guide/install",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "anchor": "installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "page_title": "Getting Started - Example Docs",
    "char_count": 2104,
    "content_type": "mixed",
    "language": "en",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}

Each record maps 1:1 to a vector-DB upsert: { id, values=embed(text), metadata }.
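As a rough illustration of that 1:1 mapping, the Python sketch below turns one chunks-jsonl record into an upsert payload. Everything outside the record fields is an assumption: embed is a deterministic placeholder rather than a real embedding model, to_upsert is a hypothetical helper, and dropping null metadata values is a precaution because some vector stores may not accept them.

```python
import hashlib


def embed(text: str, dim: int = 8) -> list[float]:
    # Placeholder only: a pseudo-vector derived from a hash, NOT a real embedding.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def to_upsert(record: dict) -> dict:
    """Map one chunks-jsonl record to {id, values, metadata}."""
    # Assumption: null metadata values are dropped before upserting.
    metadata = {k: v for k, v in record["metadata"].items() if v is not None}
    metadata["headings_path"] = record["headingsPath"]
    return {
        "id": record["id"],
        "values": embed(record["text"]),
        "metadata": metadata,
    }
```

In a real pipeline, embed would call the embedding model of your choice and the returned dict would be passed to your vector store's own upsert call.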

Output (langchain)

One record per chunk, drop-in for LangChain via [Document(**r) for r in dataset]:

{
  "page_content": "## Installation\n\nInstall via npm...",
  "metadata": {
    "id": "a1f3c9e29b2c4d10",
    "source": "https://docs.example.com/guide/install",
    "title": "Getting Started - Example Docs",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "headings_path": ["Getting Started", "Setup", "Installation"],
    "chunk_index": 3,
    "chunk_total": 11,
    "content_type": "mixed",
    "language": "en",
    "token_estimate": 498,
    "char_count": 2104,
    "content_hash": "sha256:...",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}

Output (jsonl-bulk)

Fully flat one-record-per-chunk for generic bulk import (DB COPY / pgvector):

{
  "id": "a1f3c9e29b2c4d10",
  "text": "## Installation\n\nInstall via npm...",
  "source_url": "https://docs.example.com/guide/install",
  "deep_link": "https://docs.example.com/guide/install#installation",
  "canonical_url": "https://docs.example.com/guide/install",
  "title": "Getting Started - Example Docs",
  "headings_path": "Getting Started > Setup > Installation",
  "chunk_index": 3,
  "chunk_total": 11,
  "content_type": "mixed",
  "language": "en",
  "token_estimate": 498,
  "char_count": 2104,
  "content_hash": "sha256:...",
  "last_modified": null,
  "crawl_timestamp": "2026-06-20T14:02:11Z"
}
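Because every jsonl-bulk record is flat, each key can map straight to a table column. The Python sketch below loads such records into a table, using the standard-library sqlite3 module as a stand-in for Postgres/pgvector so that it runs anywhere. The table name chunks and the helper load_chunks are hypothetical; the column list is copied from the sample record above.

```python
import json
import sqlite3

# Column names copied from the sample jsonl-bulk record.
COLUMNS = [
    "id", "text", "source_url", "deep_link", "canonical_url", "title",
    "headings_path", "chunk_index", "chunk_total", "content_type", "language",
    "token_estimate", "char_count", "content_hash", "last_modified",
    "crawl_timestamp",
]


def load_chunks(conn: sqlite3.Connection, jsonl_lines) -> int:
    """Insert flat jsonl-bulk records; re-loading the same id overwrites the row."""
    column_defs = ", ".join(f'"{c}"' for c in COLUMNS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS chunks ({column_defs}, PRIMARY KEY (id))"
    )
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    conn.executemany(
        f"INSERT OR REPLACE INTO chunks ({column_defs}) VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)
```

Keying the table on id makes re-imports idempotent in this sketch; a Postgres version would add an embedding column and typed columns, which are omitted here.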

Input

See .actor/input_schema.json. Key fields: startUrls, crawlScope, maxCrawlDepth, maxPages, renderJs, outputFormat, chunkSize, chunkOverlap, dedupNearDuplicates, filterJunkChunks, and the incremental sync fields incremental, forceFullCrawl, stateStoreName, emitDeletions.
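For orientation, here is a hedged Python sketch that builds an input object from the field names listed above and serializes it to JSON. The values are illustrative assumptions, not documented defaults, except chunkSize 512 and chunkOverlap 75, which this page states. The startUrls shape (a list of objects with a url key) follows a common Apify convention and should be checked against the input schema; crawlScope is left out because its allowed values are not listed here.

```python
import json

run_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],  # assumed shape
    "maxPages": 200,                # hard ceiling on crawled pages (example value)
    "outputFormat": "chunks-jsonl",
    "chunkSize": 512,               # default stated on this page
    "chunkOverlap": 75,             # default stated on this page
    "incremental": True,
    "emitDeletions": True,
}

# JSON text that could be saved as an input file for a local run.
input_json = json.dumps(run_input, indent=2)
print(input_json)
```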

Pricing

Pay-Per-Event. Billable unit = one emitted dataset item. Deduped and junk-filtered chunks are not billed.

Event                              Price
Per chunk emitted (chunks-jsonl)   $0.0008 / chunk ($0.80 / 1,000)
Per page emitted (markdown)        $0.002 / page ($2.00 / 1,000)
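The per-item prices make cost estimates simple arithmetic. The Python sketch below uses only the two prices listed in this section; estimate_cost is a hypothetical helper, and it deliberately raises for formats whose price is not listed here (langchain, jsonl-bulk) rather than guessing.

```python
from decimal import Decimal

# Prices as listed in the pricing section, in USD per emitted item.
PRICES = {
    "chunks-jsonl": Decimal("0.0008"),
    "markdown": Decimal("0.002"),
}


def estimate_cost(items: int, output_format: str = "chunks-jsonl") -> Decimal:
    """Return the cost in USD for a number of emitted (billed) items."""
    if output_format not in PRICES:
        raise KeyError(f"no listed price for output format {output_format!r}")
    return PRICES[output_format] * items
```

For example, 1,000 emitted chunks come to $0.80 and 500 emitted markdown pages come to $1.00. With incremental sync on, only re-emitted items would count toward items.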

Run locally

npm install
npm run build
apify run # reads .actor/INPUT.json

Roadmap (v1.2+)

Inline embeddings, direct vector-DB push (Pinecone/Qdrant/Weaviate/pgvector), missedGraceRuns before tombstoning, Standby low-latency mode.
