RAG Web Crawler: Clean Markdown + Token-Sized Chunks

Under maintenance

Pricing

$2.00 / 1,000 scraped dataset items

Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.

Developer

Harry Schoeller

Maintained by Community

Last modified: 10 days ago

RAG Web Crawler — Clean Markdown + Token-Sized Chunks, Pay-Per-Result

Turn any website into embeddings-ready chunks with citations and predictable per-chunk pricing. No CSS tuning, no runaway compute bills.

Generic crawlers hand you raw pages and make you build the RAG pipeline yourself. This actor hands you clean, token-sized, deduplicated, citable chunks — at a fixed price per chunk you keep.

What it does

Clean LLM-ready Markdown — @mozilla/readability strips nav, footers, ads, and cookie banners; turndown + GFM converts the cleaned DOM to Markdown with heading hierarchy, fenced code blocks, and tables preserved.
Structure-aware, token-budgeted chunking — splits on the heading tree, then recursively sub-splits oversized sections to your token budget (default 512) with overlap (default 75). Code blocks and tables are kept intact, never split mid-block.
Rich per-chunk provenance — every chunk ships source URL + deep anchor, page title, full headings path, content hash, token count, content type, and language for metadata-filtered vector search and deep-link citations.
Dedup + junk filtering — exact content-hash dedup plus 64-bit SimHash near-duplicate collapsing, and low-information / nav-residue chunk filtering.
Four output formats — chunks-jsonl (one record per chunk), markdown (one record per page), langchain (drop-in {page_content, metadata} Document JSON), and jsonl-bulk (flat one-record-per-chunk for DB/COPY/pgvector).
Incremental / delta sync — on scheduled re-runs, only NEW or CHANGED pages are re-emitted (and billed). Makes daily/weekly crawls cheap.
Budget guarantee — maxPages is a hard ceiling; billing is per emitted result, so a runaway crawl can never produce a runaway bill.
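
The structure-aware chunking described above can be illustrated with a minimal Python sketch. It is not the actor's implementation: the names (`split_sections`, `split_blocks`, `chunk_markdown`) are hypothetical, the token count is a rough word-count stand-in for a real tokenizer, and greedy packing replaces the recursive sub-splitting the actor performs. It does keep the two properties named in the list: chunks follow the heading tree, and fenced code blocks are never split.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. Swap in a real tokenizer.
    return len(text.split())


@dataclass
class Chunk:
    headings_path: list[str]
    text: str


def split_sections(markdown: str) -> list[Chunk]:
    """Split on headings and track the heading path; '#' inside code fences is ignored."""
    sections, path, buf, in_fence = [], [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if any(ln.strip() for ln in buf):
                sections.append(Chunk(list(path), "\n".join(buf).strip()))
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2).strip()]
            buf = [line]
        else:
            buf.append(line)
    if any(ln.strip() for ln in buf):
        sections.append(Chunk(list(path), "\n".join(buf).strip()))
    return sections


def split_blocks(text: str) -> list[str]:
    """Break a section into paragraph blocks; a fenced code block stays one block."""
    blocks, buf, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if buf:
                blocks.append("\n".join(buf))
                buf = []
        else:
            buf.append(line)
    if buf:
        blocks.append("\n".join(buf))
    return blocks


def chunk_markdown(markdown: str, chunk_size: int = 512, overlap: int = 75) -> list[Chunk]:
    """Greedily pack blocks into chunks of at most chunk_size estimated tokens."""
    chunks = []
    for section in split_sections(markdown):
        current: list[str] = []
        for block in split_blocks(section.text):
            if current and estimate_tokens("\n\n".join(current + [block])) > chunk_size:
                chunks.append(Chunk(section.headings_path, "\n\n".join(current)))
                # Carry trailing blocks forward as overlap, up to the overlap budget.
                carried: list[str] = []
                for prev in reversed(current):
                    if estimate_tokens("\n\n".join([prev] + carried)) > overlap:
                        break
                    carried.insert(0, prev)
                current = carried
            current.append(block)
        if current:
            chunks.append(Chunk(section.headings_path, "\n\n".join(current)))
    return chunks
```

In this sketch a single block larger than the budget (for example a long code listing) is emitted as one oversized chunk rather than being cut, which mirrors the "never split mid-block" rule at the cost of occasionally exceeding `chunk_size`.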

Incremental / delta sync — cheap scheduled re-runs

Turn on Incremental sync (incremental: true) and schedule the actor to run daily or weekly. The first run does a full crawl and seeds a per-URL content-hash state in a named key-value store. Every later run crawls the site, but only re-emits chunks for pages that are new or changed — unchanged pages cost nothing. A typical weekly docs re-crawl re-emits a handful of pages instead of hundreds.

State is automatic. The state store name defaults to a deterministic hash of your start URLs, so a scheduled task reuses its own prior state with zero config. Set stateStoreName explicitly to share state across tasks/schedules.
forceFullCrawl: true re-emits everything and rebuilds the baseline — use after changing chunking settings or to refresh a stale index.
emitDeletions: true writes a tombstone record ({ deleted: true, url, ... }) to a separate deletions dataset for every URL that disappeared since the last run, so downstream vector stores can purge stale vectors. Tombstones are not billed.
When incremental is ON, each emitted record carries a change_status (new | changed) in its metadata.

The run summary (OUTPUT key-value record) includes a delta block: pages_new, pages_changed, pages_unchanged, pages_deleted, chunks_skipped_unchanged (the spend you saved), state_store, prior_run_id.
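
To make the delta logic concrete, here is a hedged Python sketch of how a per-URL content-hash state could classify pages between runs. The state layout (a plain URL-to-hash mapping) and the function names are assumptions for illustration; the actor's real state format is not documented here. The counter names match the delta block above.

```python
import hashlib


def content_hash(markdown: str) -> str:
    # Same "sha256:<hex>" shape as the content_hash field in the sample output.
    return "sha256:" + hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def classify_pages(prior_state: dict[str, str], crawled: dict[str, str]):
    """prior_state maps url -> hash from the last run; crawled maps url -> markdown now."""
    new_state: dict[str, str] = {}
    to_emit: dict[str, str] = {}  # url -> change_status ("new" | "changed")
    delta = {"pages_new": 0, "pages_changed": 0, "pages_unchanged": 0, "pages_deleted": 0}
    for url, markdown in crawled.items():
        digest = content_hash(markdown)
        new_state[url] = digest
        if url not in prior_state:
            to_emit[url] = "new"
            delta["pages_new"] += 1
        elif prior_state[url] != digest:
            to_emit[url] = "changed"
            delta["pages_changed"] += 1
        else:
            delta["pages_unchanged"] += 1  # skipped, so nothing is emitted or billed
    # URLs present last run but missing now become tombstones (emitDeletions).
    tombstones = [{"deleted": True, "url": url} for url in prior_state if url not in crawled]
    delta["pages_deleted"] = len(tombstones)
    return to_emit, tombstones, new_state, delta
```

Only the URLs in `to_emit` would be chunked and emitted; `new_state` replaces the stored baseline for the next run.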

When all incremental options are OFF (the default), behavior and output are byte-for-byte identical to v1.0.

Output (chunks-jsonl)

{
  "id": "a1f3c9e29b2c4d10",
  "url": "https://docs.example.com/guide/install",
  "title": "Getting Started — Example Docs",
  "chunkIndex": 3,
  "chunkTotal": 11,
  "headingsPath": ["Getting Started", "Setup", "Installation"],
  "text": "## Installation\n\nInstall via npm:\n\n```bash\nnpm install crawlee\n```",
  "tokenEstimate": 498,
  "fetchedAt": "2026-06-20T14:02:11Z",
  "content_hash": "sha256:...",
  "metadata": {
    "source_url": "https://docs.example.com/guide/install",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "anchor": "installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "page_title": "Getting Started — Example Docs",
    "char_count": 2104,
    "content_type": "mixed",
    "language": "en",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}

Each record maps 1:1 to a vector-DB upsert: { id, values=embed(text), metadata }.
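
As a rough illustration of that mapping, the Python sketch below turns one chunks-jsonl record into an upsert payload. `fake_embed` is a placeholder that only produces a deterministic dummy vector, and `to_upsert` is a hypothetical helper name; substitute your embedding model and your vector store's client. Dropping null metadata values is an assumption made because some vector stores reject nulls.

```python
import hashlib


def fake_embed(text: str, dim: int = 8) -> list[float]:
    # Placeholder only: a deterministic dummy vector derived from a hash.
    # Replace with a call to a real embedding model.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def to_upsert(record: dict, embed=fake_embed) -> dict:
    """Map one chunks-jsonl record to {id, values, metadata}."""
    # Assumption: null values are dropped because some stores reject them.
    metadata = {k: v for k, v in record["metadata"].items() if v is not None}
    metadata["headings_path"] = record["headingsPath"]
    metadata["text"] = record["text"]  # keep the chunk text retrievable with the vector
    return {"id": record["id"], "values": embed(record["text"]), "metadata": metadata}
```

Keeping `deep_link` and `headings_path` in the metadata is what makes metadata-filtered search and deep-link citations possible at query time.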

Output (langchain)

One record per chunk, drop-in for LangChain — [Document(**r) for r in dataset]:

{
  "page_content": "## Installation\n\nInstall via npm...",
  "metadata": {
    "id": "a1f3c9e29b2c4d10",
    "source": "https://docs.example.com/guide/install",
    "title": "Getting Started — Example Docs",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "headings_path": ["Getting Started", "Setup", "Installation"],
    "chunk_index": 3,
    "chunk_total": 11,
    "content_type": "mixed",
    "language": "en",
    "token_estimate": 498,
    "char_count": 2104,
    "content_hash": "sha256:...",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}

Output (jsonl-bulk)

Fully flat one-record-per-chunk for generic bulk import (DB COPY / pgvector):

{
  "id": "a1f3c9e29b2c4d10",
  "text": "## Installation\n\nInstall via npm...",
  "source_url": "https://docs.example.com/guide/install",
  "deep_link": "https://docs.example.com/guide/install#installation",
  "canonical_url": "https://docs.example.com/guide/install",
  "title": "Getting Started — Example Docs",
  "headings_path": "Getting Started > Setup > Installation",
  "chunk_index": 3,
  "chunk_total": 11,
  "content_type": "mixed",
  "language": "en",
  "token_estimate": 498,
  "char_count": 2104,
  "content_hash": "sha256:...",
  "last_modified": null,
  "crawl_timestamp": "2026-06-20T14:02:11Z"
}
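
Because every jsonl-bulk record is flat, it can be turned into CSV for a database bulk load with the standard library alone. The sketch below is one possible approach, not part of the actor: `jsonl_to_csv` is a hypothetical helper, the column order is simply the field order of the sample record, and the target table in the comment is invented for illustration.

```python
import csv
import io
import json

# Column order taken from the sample jsonl-bulk record above.
COLUMNS = [
    "id", "text", "source_url", "deep_link", "canonical_url", "title",
    "headings_path", "chunk_index", "chunk_total", "content_type", "language",
    "token_estimate", "char_count", "content_hash", "last_modified", "crawl_timestamp",
]


def jsonl_to_csv(jsonl: str) -> str:
    """Convert jsonl-bulk lines to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # JSON null becomes an empty, unquoted field, which CSV-mode COPY reads as NULL.
        writer.writerow({col: "" if row.get(col) is None else row[col] for col in COLUMNS})
    return out.getvalue()


# Hypothetical load, assuming a table named "chunks" with matching columns:
#   COPY chunks FROM STDIN WITH (FORMAT csv, HEADER true);
```

Multi-line chunk text is safe here because the csv module quotes fields containing newlines.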

Input

See .actor/input_schema.json. Key fields: startUrls, crawlScope, maxCrawlDepth, maxPages, renderJs, outputFormat, chunkSize, chunkOverlap, dedupNearDuplicates, filterJunkChunks, and the incremental sync fields incremental, forceFullCrawl, stateStoreName, emitDeletions.
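
The dedupNearDuplicates field controls the near-duplicate collapsing listed under "What it does". For readers unfamiliar with SimHash, the Python sketch below shows the general technique: exact SHA-256 dedup followed by a 64-bit SimHash with a Hamming-distance threshold. The shingle size, the BLAKE2b feature hash, the threshold of 3 bits, and all function names are assumptions chosen for illustration; the actor's actual parameters are not stated here.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """64-bit SimHash over lowercase 3-word shingles."""
    tokens = re.findall(r"\w+", text.lower())
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * 64
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        feature = int.from_bytes(digest, "big")
        for bit in range(64):
            weights[bit] += 1 if (feature >> bit) & 1 else -1
    # A fingerprint bit is set when the majority of shingles set it.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def dedup(chunks: list[str], max_distance: int = 3) -> list[str]:
    """Drop exact duplicates, then chunks within max_distance bits of a kept chunk."""
    seen_exact: set[str] = set()
    kept_prints: list[int] = []
    kept: list[str] = []
    for text in chunks:
        exact = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if exact in seen_exact:
            continue
        fingerprint = simhash64(text)
        if any(hamming(fingerprint, other) <= max_distance for other in kept_prints):
            continue
        seen_exact.add(exact)
        kept_prints.append(fingerprint)
        kept.append(text)
    return kept
```

The linear scan over kept fingerprints is fine for a sketch; large crawls usually index fingerprints by bit bands to avoid comparing every pair.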

Pricing

Pay-Per-Event. Billable unit = one emitted dataset item. Deduped and junk-filtered chunks are not billed.

Event	Price
Per chunk emitted (chunks-jsonl)	$0.0008 / chunk ($0.80 / 1,000)
Per page emitted (markdown)	$0.002 / page ($2.00 / 1,000)
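
The table translates into simple arithmetic. The Python sketch below estimates a run's cost from the number of emitted items; `estimate_cost` is a hypothetical helper, and the page and chunk counts in the usage lines are made-up inputs. Only the two formats priced in the table are covered, so other formats raise an error rather than guess a price.

```python
from decimal import Decimal

# Per-item prices from the table above (USD).
PRICES = {
    "chunks-jsonl": Decimal("0.0008"),  # per chunk emitted
    "markdown": Decimal("0.002"),       # per page emitted
}


def estimate_cost(items_emitted: int, output_format: str = "chunks-jsonl") -> Decimal:
    """Return the estimated charge in USD for the given number of emitted items."""
    if output_format not in PRICES:
        raise ValueError(f"no price listed for output format {output_format!r}")
    return PRICES[output_format] * items_emitted


# Hypothetical full crawl: 500 pages at about 11 chunks each.
full_crawl = estimate_cost(500 * 11)
# Hypothetical incremental re-run where only 12 pages changed.
weekly_delta = estimate_cost(12 * 11)
```

With these made-up numbers the full crawl comes to $4.40 and the incremental re-run to about $0.11, which is the saving that incremental sync is meant to deliver.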

Run locally

npm install
npm run build
apify run # reads .actor/INPUT.json

Roadmap (v1.2+)

Inline embeddings, direct vector-DB push (Pinecone/Qdrant/Weaviate/pgvector), missedGraceRuns before tombstoning, Standby low-latency mode.

URL: https://apify.com/commonelements/rag-ready-crawler