PDF to RAG Markdown Chunks for Embeddings

Pricing

from $3.00 / 1,000 pages parsed

Convert PDFs into token-bounded Markdown chunks for RAG, embeddings, and vector databases (Pinecone, Chroma, Weaviate, Qdrant). Set maxTokens + overlap; get clean chunks with page number, token count, and SHA-256 content hash for dedup. JSON dataset ready for any LLM pipeline.

Developer

Adam

Maintained by Community

Last modified: 3 days ago

DocForge: PDF → AI-Ready Markdown Chunks for RAG

Turn PDF files you own into clean, deterministic, token-bounded text chunks — each tagged with the page it came from — ready for RAG pipelines and embeddings.

What it does

DocForge takes a list of PDF URLs that you own or are authorized to process, downloads each file, and extracts its text page by page. Each page's text is cleaned of the artifacts that normally wreck embeddings — words split by a hyphen at a line break, ligatures, control characters, and the runs of stray whitespace PDFs emit between columns — and then split into sentence-aware, token-bounded chunks. Each chunk is emitted as a structured dataset record carrying its source document, its true originating page number, a chunk index, an estimated token count, and a content hash. A final run summary reports how many pages were parsed and how many chunks were emitted.
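The cleanup steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not DocForge's actual code: the function name `clean_page_text`, the regular expressions, and the order of the steps are all assumptions.

```python
import re
import unicodedata


def clean_page_text(raw: str) -> str:
    """Illustrative cleanup pass (an assumption, not DocForge's actual code)."""
    # NFKC folds ligatures (U+FB01 becomes "fi") and full-width characters
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin a word split by a hyphen at a line break.
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    # Strip control characters, keeping newline and tab so the next step
    # can collapse them like any other whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse every run of whitespace (including column gaps) to one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "The \ufb01rst inter-\nnational   con\x00ference"
print(clean_page_text(sample))  # The first international conference
```

A rule this simple also joins genuine compounds that happen to wrap at the hyphen (for example "well-" followed by "known" on the next line), so a production cleaner would likely need a dictionary check or a similar heuristic.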

The chunker is sentence-aware and overlap-aware: it packs whole sentences up to your target chunk size instead of cutting mid-sentence, carries a configurable overlap between consecutive chunks, and guarantees no chunk exceeds your configured maxTokens. Each chunk's text is emitted in the markdown field as clean extracted text, so it drops straight into a vector store or embedding job — and because every chunk knows its page, retrieved passages can cite the exact page they came from.
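To make the packing behavior concrete, here is a minimal sketch of a sentence-aware, overlap-aware chunker. It is written under stated assumptions and is not DocForge's implementation: the function names are hypothetical, the token estimate rounds words × 1.3 upward, sentences are split on terminal punctuation, the overlap is carried as trailing words of the previous chunk, and a sentence longer than the budget is hard-split by words so the size cap still holds.

```python
import re


def estimate_tokens(text: str) -> int:
    """Word-based estimate: ceil(words * 1.3), done in integer arithmetic."""
    words = len(text.split())
    return (13 * words + 9) // 10


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    if max_tokens < 2 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens >= 2 and 0 <= overlap_tokens < max_tokens")
    # Largest word counts whose token estimates stay within the two budgets.
    max_words = (max_tokens * 10) // 13
    overlap_words = (overlap_tokens * 10) // 13

    # Units are whole sentences; an oversized sentence is split by words.
    units = []
    for sentence in split_sentences(text):
        words = sentence.split()
        for start in range(0, len(words), max_words):
            units.append(words[start:start + max_words])

    chunks, current = [], []
    for unit in units:
        if current and len(current) + len(unit) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the previous chunk, trimmed so the unit fits.
            keep = min(overlap_words, max_words - len(unit))
            current = current[-keep:] if keep > 0 else []
        current.extend(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_text(text, max_tokens=8, overlap_tokens=2):
    print(estimate_tokens(chunk), repr(chunk))
```

Working in word counts rather than token counts keeps the cap easy to reason about: every chunk holds at most `max_words` words, and `max_words` is chosen so that its estimate never exceeds `max_tokens`.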

Before any work begins, DocForge requires an explicit ownership attestation. If that attestation is not set, the run is rejected with zero billing. Pages that contain no extractable text (e.g. scanned images) are skipped rather than emitted as blank chunks, and documents that fail to download or parse are caught, logged, and skipped rather than guessed at, so the dataset only contains content that was actually extracted.

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `pdfUrls` | array of strings | Yes | URLs of PDFs you own or are authorized to process. |
| `chunking` | object | No | Chunking options. Prefilled with `maxTokens: 512` and `overlapTokens: 64`. |
| `ownership_attestation` | boolean | Yes | You confirm you own or are authorized to process these documents. Must be `true` or the run is rejected before any billing. |

The chunking object accepts:

maxTokens (default 512) — the maximum estimated token size of each chunk; no chunk exceeds this.
overlapTokens (default 64) — how much each chunk overlaps the previous one, to preserve context across chunk boundaries.

Token counts are word-based estimates (approximately words × 1.3), not exact tokenizer counts.
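As an illustration of the input shape, the sketch below assembles a run input using the field names from the table above. The helper `build_run_input`, its client-side checks, and the `example.com` URL are hypothetical; the Actor applies its own gates regardless of what a caller validates first.

```python
import json


def build_run_input(pdf_urls, max_tokens=512, overlap_tokens=64, attested=False):
    """Assemble a DocForge-style input object (hypothetical helper)."""
    if not attested:
        # Mirrors the documented gate: the run is rejected without attestation.
        raise ValueError("ownership_attestation must be true")
    if not pdf_urls:
        raise ValueError("pdfUrls must contain at least one URL")
    return {
        "pdfUrls": list(pdf_urls),
        "chunking": {"maxTokens": max_tokens, "overlapTokens": overlap_tokens},
        "ownership_attestation": True,
    }


# Placeholder URL; substitute a PDF you own or are authorized to process.
run_input = build_run_input(["https://example.com/report.pdf"], attested=True)
print(json.dumps(run_input, indent=2))
```

The resulting object can be pasted into the Actor's JSON input editor or passed as the run input through an API client.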

Output

DocForge writes two record types to the dataset, distinguished by record_type.

chunk — one record per emitted text chunk:

| Field | Type | Description |
| --- | --- | --- |
| `record_type` | string | Always `chunk`. |
| `source_doc` | string | The source PDF URL the chunk came from. |
| `page_number` | integer | The 1-based page the chunk's text was extracted from. Use it to cite or filter retrieved passages by page. |
| `chunk_index` | integer | Zero-based index of the chunk within its document (continuous across pages). |
| `markdown` | string | The chunk's cleaned text (de-hyphenated, whitespace- and unicode-normalized). |
| `token_count` | integer | Estimated token count for the chunk (never exceeds `maxTokens`). |
| `content_hash` | string | Deterministic `sha256:<64 hex>` hash of the chunk text. |
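A hash in the `sha256:<64 hex>` format can be produced and used for deduplication along these lines. This is a sketch under assumptions: the text is taken to be UTF-8 encoded with no further normalization before hashing, which the listing does not confirm, so a locally computed hash may not match the Actor's value byte for byte. The function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a 'sha256:<64 hex>' digest of the text (UTF-8 is an assumption)."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_by_hash(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each content_hash, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record["content_hash"] not in seen:
            seen.add(record["content_hash"])
            unique.append(record)
    return unique


texts = ["Alpha beta.", "Gamma delta.", "Alpha beta."]
records = [{"markdown": t, "content_hash": content_hash(t)} for t in texts]
print(len(dedupe_by_hash(records)))  # 2
```

Because the hash depends only on the chunk text, identical passages from different documents or repeated runs collapse to a single entry.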

run_summary — one record per run:

| Field | Type | Description |
| --- | --- | --- |
| `record_type` | string | Always `run_summary`. |
| `pages_parsed` | integer | Total document pages parsed in the run. |
| `chunks_emitted` | integer | Total chunks emitted in the run. |
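Since both record types share one dataset, a consumer typically separates them first. The sketch below shows one way to do that and to group chunks by page for citation. The helper names are hypothetical and the sample records are hand-written for illustration; they are not real Actor output.

```python
from collections import defaultdict


def split_records(records):
    """Separate chunk records from run_summary records by record_type."""
    chunks = [r for r in records if r.get("record_type") == "chunk"]
    summaries = [r for r in records if r.get("record_type") == "run_summary"]
    return chunks, summaries


def group_by_page(chunks):
    """Map (source_doc, page_number) to that page's chunks in chunk_index order."""
    pages = defaultdict(list)
    for chunk in sorted(chunks, key=lambda c: (c["source_doc"], c["chunk_index"])):
        pages[(chunk["source_doc"], chunk["page_number"])].append(chunk)
    return dict(pages)


# Hand-written sample records for illustration only.
doc = "https://example.com/report.pdf"
records = [
    {"record_type": "chunk", "source_doc": doc, "page_number": 1,
     "chunk_index": 0, "markdown": "First chunk.", "token_count": 3},
    {"record_type": "chunk", "source_doc": doc, "page_number": 2,
     "chunk_index": 2, "markdown": "Third chunk.", "token_count": 3},
    {"record_type": "chunk", "source_doc": doc, "page_number": 1,
     "chunk_index": 1, "markdown": "Second chunk.", "token_count": 3},
    {"record_type": "run_summary", "pages_parsed": 2, "chunks_emitted": 3},
]

chunks, summaries = split_records(records)
pages = group_by_page(chunks)
# Sanity check: the summary should agree with the chunks actually received.
assert summaries[0]["chunks_emitted"] == len(chunks)
for (source, page), items in pages.items():
    print(source, "page", page, [c["chunk_index"] for c in items])
```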

Pricing

DocForge uses Apify Pay-Per-Event pricing. You are billed only for what a successful, gated run actually does:

| Event | Price (USD) | When it fires |
| --- | --- | --- |
| `actor_run_start` | $0.02 | Once per run, after the run's gates pass. |
| `page_parsed` | $0.003 | Per document page converted to text. |
| `chunk_emitted` | $0.0005 | Per RAG chunk emitted. |

Example run cost. Processing a single 40-page PDF that yields 120 chunks:

1 × actor_run_start = $0.02
40 × page_parsed = $0.12
120 × chunk_emitted = $0.06
Total = $0.20
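The same arithmetic can be wrapped in a small estimator for budgeting other runs. The event prices come from the table above; the function name `estimate_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the sum.

```python
from decimal import Decimal

# Event prices in USD, taken from the pricing table.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "page_parsed": Decimal("0.003"),
    "chunk_emitted": Decimal("0.0005"),
}


def estimate_cost(pages: int, chunks: int, runs: int = 1) -> Decimal:
    """Estimate the Pay-Per-Event cost of successful, gated runs."""
    return (
        runs * PRICES["actor_run_start"]
        + pages * PRICES["page_parsed"]
        + chunks * PRICES["chunk_emitted"]
    )


# The worked example: one 40-page PDF yielding 120 chunks.
print(f"${estimate_cost(pages=40, chunks=120):.2f}")  # $0.20
```

The chunk count is only known after a run, so for planning it has to be guessed from page count and the chosen `maxTokens`.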

If the ownership attestation is missing, the run is rejected with zero billing.

Why this Actor

Page-accurate citations. Every chunk carries the real page it was extracted from, so retrieved passages can point back to the exact page — not a placeholder. (On a 14-page sample paper, that's 14 distinct page numbers across the chunks instead of one.)
Clean text, not PDF soup. De-hyphenation reconnects words broken across line wraps (≈150 fixes on a typical research PDF), ligatures and full-width characters are Unicode-normalized, control characters are stripped, and runs of whitespace are collapsed, so your embeddings see real words such as "international" rather than fragments such as "inter- national".
Sentence-aware chunking. Chunks are packed from whole sentences up to your maxTokens instead of being cut mid-sentence, with configurable overlap to preserve context across boundaries. No chunk ever exceeds maxTokens.
Deterministic, idempotent output. Every chunk carries a sha256: content hash computed directly from its text, so identical input produces identical hashes — ideal for deduplication, change detection, and re-run safety.
Ownership-gated by design. A required attestation must be true before any processing or billing happens. DocForge runs on PDFs you provide and are authorized to use; it does not crawl or scrape third-party sites.
No invented content. Text is extracted deterministically with no LLM in the loop. Empty/scanned pages and documents that fail to fetch or parse are caught, logged, and skipped — they are not hallucinated or padded. The run summary reflects only what was genuinely parsed and emitted.

About this Actor

This Actor is AI-authored and operated under the publisher's LLC. It uses Actor.charge() strictly to bill the customer for the Pay-Per-Event units above; the Actor contains no payout or money-out capability. All claims here reflect behavior present in the Actor's code.


URL: https://apify.com/awesome_highboy/docforge