URL: https://apify.com/awesome_highboy/aeo-rag-ready-content-structurer

Website to Markdown + RAG Chunks for Embeddings & LLMs · Apify


Website & Docs to Markdown + RAG Chunks

Pricing

from $3.00 / 1,000 pages processed

Turn websites & docs into clean Markdown plus token-bounded, embeddings-ready RAG chunks (heading lineage + sha256) ready for Pinecone, Weaviate, Qdrant or pgvector. Optional no-hallucination field extraction and AEO mode (FAQ, answer-first, llms.txt, citations). Robots honored; ownership required.

Rating

0.0

(0)

Developer

Adam

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 4 days ago

RAG-Ready Content Structurer

Turn pages you own into deterministic, embeddings-ready RAG chunks: clean, hashed, and token-bounded.

What it does

This Actor ingests a list of URLs that you own or are authorized to crawl, fetches each page, strips boilerplate, and converts it into clean Markdown. It then produces RAG chunks: semantic (heading-aware) or fixed-size chunks with full heading lineage, a deterministic token estimate, and a sha256 content hash per chunk, ready to upsert into a vector store.

The pipeline is built around a fail-closed gate chain:

  1. Gate A: Ownership attestation. You must attest that you own or are authorized to crawl every URL. If the attestation is not true, the run is rejected before any fetch, with zero events billed.
  2. Gate B: Paid-plan only. Default-deny until the Apify paid-plan run flag is confirmed; free/unknown plans run nothing and are not billed.
  3. Gate C: robots.txt. Always honored and cannot be disabled. Disallowed URLs emit 0 chunks, are charged $0, and are listed in robots_skipped_urls.
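The gate order above can be sketched as a single fail-closed function. This is a minimal illustration, not the Actor's actual code: `run_gates`, `GateResult`, the reason strings, and the single shared robots.txt are all assumptions made for the example (a real crawl would load one robots.txt per host).

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class GateResult:
    allowed_urls: List[str] = field(default_factory=list)
    robots_skipped_urls: List[str] = field(default_factory=list)
    rejected_reason: Optional[str] = None  # set when gate A or B rejects the run


def run_gates(urls, ownership_attestation, paid_plan_confirmed,
              robots_txt_lines, user_agent="*"):
    """Apply gates A, B, C in order; a failed gate stops before any fetch."""
    # Gate A: the attestation must be exactly True, anything else rejects.
    if ownership_attestation is not True:
        return GateResult(rejected_reason="ownership_not_attested")
    # Gate B: default-deny, so an unknown (None) plan flag also rejects.
    if paid_plan_confirmed is not True:
        return GateResult(rejected_reason="paid_plan_not_confirmed")
    # Gate C: robots.txt is always consulted; disallowed URLs are only listed.
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    result = GateResult()
    for url in urls:
        if parser.can_fetch(user_agent, url):
            result.allowed_urls.append(url)
        else:
            result.robots_skipped_urls.append(url)
    return result
```

Only URLs in `allowed_urls` would go on to be fetched and billed; a rejected run carries a reason and no URLs at all.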

Cleaning uses Mozilla Readability + Turndown for boilerplate-stripped Markdown. Chunking is deterministic: identical input yields byte-stable output and the same sha256 content hash, and no chunk exceeds maxTokens.

Note: the token count per chunk is a deterministic word-based estimate used to bound chunk size, not a tiktoken count.
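To make the determinism claims concrete, here is a rough sketch of a word-based token estimate, a sha256 content hash, and a fixed-size chunker with overlap. The exact estimate formula the Actor uses is not documented here, so counting whitespace-separated words is an assumption, as are the function names. Joining words with single spaces also discards the original whitespace, which a real chunker would preserve.

```python
import hashlib


def estimate_tokens(text: str) -> int:
    """Assumed estimate: number of whitespace-separated words."""
    return len(text.split())


def content_hash(text: str) -> str:
    """Prefixed sha256 of the UTF-8 bytes; same text always gives same hash."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def fixed_chunks(text: str, max_tokens: int = 512, overlap_tokens: int = 64):
    """Fixed-size windows of at most max_tokens words, sharing overlap_tokens."""
    step = max_tokens - overlap_tokens
    if step <= 0:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + max_tokens])
        chunks.append({
            "chunk_text": piece,
            "token_count": estimate_tokens(piece),
            "content_hash": content_hash(piece),
        })
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Because every step is a pure function of the input text, running it twice on the same page yields the same chunks and the same hashes.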

Input

Defined by INPUT_SCHEMA.json. Key fields:

| Field | Type | Notes |
| --- | --- | --- |
| source | object | The URLs to ingest (URL list) that you own or are authorized to crawl. Includes maxPages (default 1000, max 50000). Required. |
| ownership_attestation | boolean | You attest you own or are authorized for every URL. Must be true or the run is rejected (zero billing). Required. |
| render | enum | http (default, cheapest) or browser (Playwright/Chromium for JS-heavy pages). |
| chunking | object | strategy (semantic or fixed), maxTokens (128–2048, default 512), overlapTokens (default 64). No chunk exceeds maxTokens. |
| language | string (nullable) | Optional ISO language-code hint (e.g. en). |
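As an illustration of how these fields fit together, the sketch below builds a run input and checks the documented limits before submitting anything. `build_run_input` is a hypothetical helper, and the `urls` key inside `source` is an assumption, since only maxPages is named above; check INPUT_SCHEMA.json for the real shape.

```python
def build_run_input(urls, ownership_attestation, max_pages=1000, render="http",
                    strategy="semantic", max_tokens=512, overlap_tokens=64,
                    language=None):
    """Build an input dict, enforcing the limits listed in the table above."""
    if ownership_attestation is not True:
        raise ValueError("ownership_attestation must be true")
    if not urls:
        raise ValueError("source needs at least one URL")
    # Upper bound 50000 is documented; the lower bound of 1 is an assumption.
    if not 1 <= max_pages <= 50000:
        raise ValueError("maxPages must be between 1 and 50000")
    if render not in ("http", "browser"):
        raise ValueError("render must be 'http' or 'browser'")
    if strategy not in ("semantic", "fixed"):
        raise ValueError("strategy must be 'semantic' or 'fixed'")
    if not 128 <= max_tokens <= 2048:
        raise ValueError("maxTokens must be between 128 and 2048")
    return {
        # "urls" is an assumed key name for the URL list.
        "source": {"urls": list(urls), "maxPages": max_pages},
        "ownership_attestation": True,
        "render": render,
        "chunking": {
            "strategy": strategy,
            "maxTokens": max_tokens,
            "overlapTokens": overlap_tokens,
        },
        "language": language,
    }
```

Validating locally mirrors the Actor's fail-closed behaviour: a bad input is caught before a run is started.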

Output

Defined by dataset_schema.json. Every record carries a record_type. The run path emits:

  • chunk (RAG): chunk_id (position-stable key for idempotent vector-DB upserts; re-crawling updates the same vectors instead of duplicating them), source_url, page_title, section_path (heading lineage, e.g. ["Guide","Setup","Auth"]), heading (immediate section title), chunk_index, section_chunk_index/section_chunk_count (position within the section), chunk_text (clean Markdown), token_count (≤ maxTokens), char_count, word_count, overlap_prev (chunk carries overlap from the previous one), content_hash (sha256:...), language, extracted_fields, retrieved_at, render_mode.
  • run_summary (exactly one per run): pages_requested, pages_fetched, pages_failed ([{url, reason}]; pages that failed extraction are isolated here at zero charge and never crash the run), chunks_emitted, total_tokens, robots_skipped_urls, output_mode.

Chunking quality. Code fences are kept atomic: a chunk never contains a half-open triple-backtick fence, and an oversized code block is split with each piece re-wrapped in its original fence + language, so every chunk is independently valid, embeddable Markdown. Splits prefer natural boundaries (paragraph → sentence → word → char), and a hard character cap (secondary to the token bound) means a whitespace-free blob (minified JS, base64) can't masquerade as a tiny chunk and silently blow an embedding model's real token limit. All output is deterministic and byte-stable across runs.
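The fence re-wrapping described above can be sketched as follows. This is a simplified illustration under stated assumptions: it splits on a line budget (`max_lines`) rather than the Actor's token and character bounds, and `split_code_block` is a hypothetical name.

```python
FENCE = "`" * 3  # built programmatically so this example stays a valid fence


def split_code_block(code: str, language: str, max_lines: int):
    """Split code into pieces of at most max_lines, each with its own fence."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = code.splitlines()
    pieces = []
    for start in range(0, len(lines), max_lines):
        body = "\n".join(lines[start:start + max_lines])
        # Re-wrap every piece with an opening fence + language and a close,
        # so no piece is left with a half-open fence.
        pieces.append(f"{FENCE}{language}\n{body}\n{FENCE}")
    return pieces
```

Each returned piece opens and closes its own fence, so it renders and embeds correctly even when retrieved without its neighbours.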

The dataset schema also defines AEO record types (faq_pair, answer_block, llms_txt, citation_block) and a structured-extraction extracted_fields shape, with prebuilt dataset views (RAG chunks, AEO assets, Run summary). These AEO/extraction outputs are reserved in the schema but are not produced by the current run path; the Actor currently emits RAG chunk records plus one run_summary.

The RAG chunks view is ready to upsert into Pinecone / Weaviate / Qdrant / pgvector.
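One way to consume the dataset is to map chunk records onto a store-agnostic upsert shape, keyed by chunk_id, and to skip chunks whose content_hash has not changed since the last crawl. The sketch below is illustrative only: `to_upsert_items`, the item layout, and the `stored_hashes` map are assumptions, and the embedding and store-specific client calls are left out.

```python
def to_upsert_items(records, stored_hashes=None):
    """Turn chunk records into upsert items, skipping unchanged chunks.

    stored_hashes is a hypothetical {chunk_id: content_hash} map that the
    caller keeps from the previous run.
    """
    stored_hashes = stored_hashes or {}
    items = []
    for rec in records:
        # The dataset also holds one run_summary record; ignore non-chunks.
        if rec.get("record_type") != "chunk":
            continue
        # Same chunk_id and same hash means the stored vector is still valid.
        if stored_hashes.get(rec["chunk_id"]) == rec["content_hash"]:
            continue
        items.append({
            "id": rec["chunk_id"],
            "text": rec["chunk_text"],
            "metadata": {
                "source_url": rec["source_url"],
                "section_path": rec["section_path"],
                "content_hash": rec["content_hash"],
            },
        })
    return items
```

Using chunk_id as the vector ID is what makes a re-crawl overwrite existing vectors rather than duplicate them.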

Pricing

Pay-Per-Event. You are billed only for what actually runs (after the gates), via Actor.charge():

| Event | Price | When charged |
| --- | --- | --- |
| actor_run_start | $0.05 | Once per run, only after the ownership + paid-plan + pilot gates pass. Never on a rejected or free-plan run. |
| page_processed | $0.003 | Per page successfully fetched + converted + chunked. Failed pages and robots-disallowed URLs charge $0. |
| field_extracted | $0.005 | Per (page × requested field) pair returning a non-null value. Reserved for structured extraction, which is not active in the current run path, so this event is not charged today. |

The developer keeps 80% and Apify keeps 20% (standard Apify 80/20 split).

Example run cost for 100 owned pages in RAG mode: $0.05 + (100 × $0.003) = $0.35.
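The same arithmetic can be written as a small estimator. This sketch uses the two prices listed above and leaves out field_extracted, which is not charged today; `estimate_run_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in the totals.

```python
from decimal import Decimal

ACTOR_RUN_START = Decimal("0.05")   # charged once per run that passes the gates
PAGE_PROCESSED = Decimal("0.003")   # charged per successfully processed page


def estimate_run_cost(pages_processed: int) -> Decimal:
    """Estimated USD cost of one run; failed and skipped pages count as 0."""
    if pages_processed < 0:
        raise ValueError("pages_processed cannot be negative")
    return ACTOR_RUN_START + PAGE_PROCESSED * pages_processed
```

Only pages that were actually fetched, converted, and chunked should be passed in, since failed and robots-disallowed pages are not billed.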

Why this Actor

  • Deterministic & idempotent. Cleaning and chunking are pure and byte-stable: identical input produces identical chunks and identical sha256 content hashes, so re-runs and vector-store de-duplication are safe.
  • Ownership-gated and robots-compliant by design. A mandatory ownership/authorization attestation rejects unauthorized runs before any fetch (zero billing), and robots.txt is forced on and cannot be disabled.
  • Paid-plan only, fail-closed billing. Default-deny until a paid plan is confirmed; failed and robots-disallowed pages charge $0.
  • Embeddings-ready output with token bounds. Chunks carry heading lineage and a token estimate that never exceeds your maxTokens, in a schema with a prebuilt view for Pinecone / Weaviate / Qdrant / pgvector.

About

This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is the only billing path and it bills the customer only; the Actor has no payout or money-out capability, and revenue settlement is handled entirely by Apify's monetization rail.

You might also like

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown, ready for RAG, embeddings, and AI agents.

Dev with Bobby

11

Website to Text & Markdown - AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

7

RAG Web Crawler: Clean Markdown + Token-Sized Chunks

commonelements/rag-ready-crawler

Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.

Harry Schoeller

2

Docs to Markdown + AI Embeddings → Vector DB Crawler

badruddeen/docs-to-markdown-ai-embeddings---vector-db-crawler

Turn any documentation site into clean Markdown, intelligently chunked content with embeddings (Azure/OpenAI), and directly upsert into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus, ready for RAG, AI assistants, and semantic search in minutes.

Badruddeen Naseem

8

5.0

RAG Docs Extractor - Documentation to Chunks

ambitious_door/ragdocs-extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata.

RAG Website Crawler - Markdown Chunks for LLMs & MCP

themineworks/rag-crawler

Crawl any website into clean, pre-chunked Markdown with per-chunk token counts for RAG pipelines, vector DBs (Pinecone, Qdrant) and LLM context. MCP-native for Claude & ChatGPT. SPA support via Playwright. Pay only for pages that crawl. A Firecrawl alternative.

The Mine Works

2

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Gabriel Antony Xaviour

9

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

8