Pricing
from $3.00 / 1,000 pages processed
Website & Docs to Markdown + RAG Chunks
Turn websites & docs into clean Markdown plus token-bounded, embeddings-ready RAG chunks (heading lineage + sha256) ready for Pinecone, Weaviate, Qdrant or pgvector. Optional no-hallucination field extraction and AEO mode (FAQ, answer-first, llms.txt, citations). Robots honored; ownership required.
RAG-Ready Content Structurer
Turn pages you own into deterministic, embeddings-ready RAG chunks: clean, hashed, and token-bounded.
What it does
This Actor ingests a list of URLs that you own or are authorized to crawl, fetches each page, strips boilerplate, and converts it into clean Markdown. It then produces RAG chunks: semantic (heading-aware) or fixed-size chunks with full heading lineage, a deterministic token estimate, and a sha256 content hash per chunk, ready to upsert into a vector store.
The pipeline is built around a fail-closed gate chain:
- Gate A: Ownership attestation. You must attest that you own or are authorized for every URL. If the attestation is not true, the run is rejected before any fetch, with zero events billed.
- Gate B: Paid-plan only. Default-deny until the Apify paid-plan run flag is confirmed; free/unknown plans run nothing and are not billed.
- Gate C: robots.txt. Always honored and cannot be disabled. Disallowed URLs emit 0 chunks, are charged 0, and are listed in `robots_skipped_urls`.
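The gate order above can be sketched as a small fail-closed check. This is an illustration under assumptions, not the Actor's source: `check_gates`, `GateResult`, the `paid_plan_confirmed` flag, and the rejection reason strings are hypothetical names, and the robots rules are parsed from an in-memory string with Python's standard `urllib.robotparser`.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class GateResult:
    allowed_urls: List[str] = field(default_factory=list)
    robots_skipped_urls: List[str] = field(default_factory=list)
    rejected_reason: Optional[str] = None  # set when Gate A or Gate B rejects the run


def check_gates(urls, ownership_attestation, paid_plan_confirmed, robots_txt, user_agent="*"):
    """Hypothetical fail-closed chain: a gate that is not explicitly passed rejects."""
    # Gate A: the attestation must be exactly True, otherwise reject before any fetch.
    if ownership_attestation is not True:
        return GateResult(rejected_reason="ownership_attestation_missing")
    # Gate B: default-deny unless a paid plan is positively confirmed (None counts as unknown).
    if paid_plan_confirmed is not True:
        return GateResult(rejected_reason="paid_plan_not_confirmed")
    # Gate C: robots.txt is always applied; disallowed URLs are recorded and never fetched.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    result = GateResult()
    for url in urls:
        if parser.can_fetch(user_agent, url):
            result.allowed_urls.append(url)
        else:
            result.robots_skipped_urls.append(url)
    return result
```

Checking `is not True` rather than truthiness is the design point: a missing, null, or string value fails the gate instead of slipping through.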
Cleaning uses Mozilla Readability + Turndown for boilerplate-stripped Markdown. Chunking is deterministic: identical input yields byte-stable output and the same sha256 content hash, and no chunk exceeds maxTokens.
Note: the token count per chunk is a deterministic word-based estimate used to bound chunk size, not a tiktoken count.
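To make the difference between an estimate and a tokenizer count concrete, here is a minimal sketch of a word-based estimate and a prefixed sha256 hash. The 4-tokens-per-3-words ratio and the lack of any text normalization before hashing are assumptions for illustration; the Actor's actual constants are not documented here.

```python
import hashlib


def estimate_tokens(text: str) -> int:
    """Word-based estimate: ceil(words * 4 / 3). The ratio is an assumed constant."""
    words = len(text.split())
    return (words * 4 + 2) // 3  # integer ceiling, so the result is platform-independent


def content_hash(text: str) -> str:
    """sha256 over the UTF-8 bytes of the chunk text, with a 'sha256:' prefix."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

Both functions are pure, so the same chunk text always yields the same estimate and the same hash. A model-specific tokenizer may report a different count for the same text.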
Input
Defined by INPUT_SCHEMA.json. Key fields:
| Field | Type | Notes |
|---|---|---|
| `source` | object | The list of URLs to ingest, all of which you own or are authorized to crawl. Includes `maxPages` (default 1000, max 50000). Required. |
| `ownership_attestation` | boolean | You attest you own or are authorized for every URL. Must be true or the run is rejected (zero billing). Required. |
| `render` | enum | `http` (default, cheapest) or `browser` (Playwright/Chromium for JS-heavy pages). |
| `chunking` | object | `strategy` (`semantic` or `fixed`), `maxTokens` (128 to 2048, default 512), `overlapTokens` (default 64). No chunk exceeds `maxTokens`. |
| `language` | string (nullable) | Optional ISO language-code hint (e.g. `en`). |
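A run input that follows the table might look like the sketch below, together with a small pre-flight check of the documented limits. The `urls` key inside `source` and the `validate_input` helper are assumptions for illustration; INPUT_SCHEMA.json remains authoritative for the exact shape.

```python
def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the documented limits are met."""
    problems = []
    if run_input.get("ownership_attestation") is not True:
        problems.append("ownership_attestation must be true")
    source = run_input.get("source")
    if not isinstance(source, dict):
        problems.append("source is required")
    elif source.get("maxPages", 1000) > 50000:
        problems.append("source.maxPages must not exceed 50000")
    if run_input.get("render", "http") not in ("http", "browser"):
        problems.append("render must be 'http' or 'browser'")
    chunking = run_input.get("chunking", {})
    strategy = chunking.get("strategy")
    if strategy is not None and strategy not in ("semantic", "fixed"):
        problems.append("chunking.strategy must be 'semantic' or 'fixed'")
    if not 128 <= chunking.get("maxTokens", 512) <= 2048:
        problems.append("chunking.maxTokens must be between 128 and 2048")
    return problems


example_input = {
    "source": {"urls": ["https://example.com/docs/"], "maxPages": 100},  # "urls" is an assumed key
    "ownership_attestation": True,
    "render": "http",
    "chunking": {"strategy": "semantic", "maxTokens": 512, "overlapTokens": 64},
    "language": "en",
}
```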
Output
Defined by dataset_schema.json. Every record carries a record_type. The run path emits:
- `chunk` (RAG):
  - `chunk_id`: position-stable key for idempotent vector-DB upserts; re-crawling updates the same vectors instead of duplicating them.
  - `source_url`, `page_title`
  - `section_path`: heading lineage, e.g. `["Guide","Setup","Auth"]`.
  - `heading`: immediate section title.
  - `chunk_index`
  - `section_chunk_index` / `section_chunk_count`: position within the section.
  - `chunk_text`: clean Markdown.
  - `token_count`: ≤ `maxTokens`.
  - `char_count`, `word_count`
  - `overlap_prev`: chunk carries overlap from the previous one.
  - `content_hash`: `sha256:...`
  - `language`, `extracted_fields`, `retrieved_at`, `render_mode`
- `run_summary` (exactly one per run):
  - `pages_requested`, `pages_fetched`
  - `pages_failed`: `[{url, reason}]`; pages that failed extraction are isolated here at zero charge, never crashing the run.
  - `chunks_emitted`, `total_tokens`, `robots_skipped_urls`, `output_mode`
Chunking quality. Code fences are kept atomic: a chunk never contains a half-open ``` fence. An oversized code block is split with each piece re-wrapped in its original fence + language, so every chunk is independently valid, embeddable Markdown. Splits prefer natural boundaries (paragraph → sentence → word → char), and a hard character cap (secondary to the token bound) means a whitespace-free blob (minified JS, base64) can't masquerade as a tiny chunk and silently blow an embedding model's real token limit. All output is deterministic and byte-stable across runs.
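The fence re-wrapping rule can be illustrated with a short sketch. It is not the Actor's chunker: `split_code_block` is a hypothetical helper that splits by line count rather than by tokens, purely to show that every piece comes out as a closed fence carrying the original language tag.

```python
FENCE = "`" * 3  # built programmatically so the example contains no literal fence


def split_code_block(block: str, max_lines: int) -> list:
    """Split one fenced code block into pieces, each re-wrapped in its own fence."""
    lines = block.strip("\n").split("\n")
    if len(lines) < 2 or not lines[0].startswith(FENCE) or lines[-1].strip() != FENCE:
        raise ValueError("expected a single fenced code block")
    language = lines[0][len(FENCE):].strip()  # e.g. "python", or "" when untagged
    body = lines[1:-1]
    pieces = []
    for start in range(0, len(body), max_lines):
        part = body[start:start + max_lines]
        # Re-open and re-close the fence around every piece.
        pieces.append("\n".join([FENCE + language, *part, FENCE]))
    return pieces or [block.strip("\n")]  # an empty block is returned unchanged
```

Each returned piece contains exactly one opening and one closing fence, so it renders as valid Markdown on its own.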
The dataset schema also defines AEO record types (faq_pair, answer_block, llms_txt, citation_block) and a structured-extraction extracted_fields shape, with prebuilt dataset views (RAG chunks, AEO assets, Run summary). These AEO/extraction outputs are reserved in the schema but are not produced by the current run path; the Actor currently emits RAG chunk records plus one run_summary.
The RAG chunks view is ready to upsert into Pinecone / Weaviate / Qdrant / pgvector.
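Because `chunk_id` is position-stable and `content_hash` changes only when the text changes, a loader can skip unchanged chunks before re-embedding. The sketch below uses a plain dict in place of a real vector store, and `plan_upserts` is a hypothetical helper; a real integration would call the client library of the chosen store.

```python
def plan_upserts(records, stored_hashes):
    """Split dataset records into chunks to (re)embed and chunk ids to skip.

    records: iterable of dataset items (dicts) as emitted by a run.
    stored_hashes: dict mapping chunk_id -> content_hash already in the vector store.
    """
    to_upsert, unchanged = [], []
    for record in records:
        if record.get("record_type") != "chunk":
            continue  # e.g. the single run_summary record
        if stored_hashes.get(record["chunk_id"]) == record["content_hash"]:
            unchanged.append(record["chunk_id"])  # same text as last crawl, skip embedding
        else:
            to_upsert.append(record)  # new chunk, or text changed at this position
    return to_upsert, unchanged
```

Upserting the returned records under their `chunk_id` overwrites the previous vector for that position instead of adding a duplicate.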
Pricing
Pay-Per-Event. You are billed only for what actually runs (after the gates), via Actor.charge():
| Event | Price | When charged |
|---|---|---|
| `actor_run_start` | $0.05 | Once per run, only after the ownership + paid-plan + pilot gates pass. Never on a rejected or free-plan run. |
| `page_processed` | $0.003 | Per page successfully fetched + converted + chunked. Failed pages and robots-disallowed URLs charge $0. |
| `field_extracted` | $0.005 | Per (page × requested field) pair returning a non-null value. Reserved for structured extraction, which is not active in the current run path, so this event is not charged today. |
The developer keeps 80% and Apify keeps 20% (standard Apify 80/20 split).
Example run cost for 100 owned pages in RAG mode:
$0.05 + (100 × $0.003) = $0.35.
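The same arithmetic can be written as a small estimator for other run sizes. This is a sketch built only from the event prices in the table; `estimate_run_cost` is a hypothetical helper, it assumes the run passed all gates, and it counts only successfully processed pages.

```python
from decimal import Decimal

PRICES = {
    "actor_run_start": Decimal("0.05"),
    "page_processed": Decimal("0.003"),
    "field_extracted": Decimal("0.005"),  # reserved; not charged in the current run path
}


def estimate_run_cost(pages_processed: int, fields_extracted: int = 0) -> Decimal:
    """Estimate the charge in USD for one run that passed the gates."""
    return (
        PRICES["actor_run_start"]
        + pages_processed * PRICES["page_processed"]
        + fields_extracted * PRICES["field_extracted"]
    )


print(estimate_run_cost(100))  # 0.350
```

`Decimal` is used instead of floats so that fractions of a cent add up exactly.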
Why this Actor
- Deterministic & idempotent. Cleaning and chunking are pure and byte-stable: identical input produces identical chunks and identical `sha256` content hashes, which makes re-runs and vector-store de-duplication safe.
- Ownership-gated and robots-compliant by design. A mandatory ownership/authorization attestation rejects unauthorized runs before any fetch (zero billing), and `robots.txt` is forced on and cannot be disabled.
- Paid-plan only, fail-closed billing. Default-deny until a paid plan is confirmed; failed and robots-disallowed pages charge $0.
- Embeddings-ready output with token bounds. Chunks carry heading lineage and a token estimate that never exceeds your `maxTokens`, in a schema with a prebuilt view for Pinecone / Weaviate / Qdrant / pgvector.
About
This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is the only billing path and it bills the customer only; the Actor has no payout or money-out capability, and revenue settlement is handled entirely by Apify's monetization rail.
