PDF → RAG Chunks (Token-Aware, Vector-Ready)

Downloads any PDF and chunks it into semantically coherent segments ready for embedding/RAG. Configurable chunk size and overlap. Returns one row per chunk with page, char count, and token estimate. Feed the output directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.

Developer

Hojun Lee

Maintained by Community

Last modified: 3 days ago

PDF → RAG Chunks

Downloads any PDF and chunks it into semantically coherent segments ready for embedding/RAG. Configurable chunk size and overlap. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.

Why this exists

To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need:

1. Download → extract text per page
2. Chunk into semantic segments (1000-2000 chars typical)
3. Optional: embed each chunk and store in vector DB
4. Query: embed question, retrieve top-k chunks, ask LLM

This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.
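To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a plain character window, not the actor's actual algorithm (which aims for semantically coherent segments); the function name `chunk_text` and its defaults are assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts `chunk_size - overlap` characters after the previous
    one, so consecutive windows share `overlap` characters.
    Illustrative only: the actor's real splitting logic is not documented here.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page_text = "x" * 4000  # stand-in for the extracted text of one page
pieces = chunk_text(page_text, chunk_size=1500, overlap=200)
print([len(p) for p in pieces])  # [1500, 1500, 1400]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.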

Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor costs roughly $0.50-$0.85 per 1K pages, depending on how many chunks each page yields (see the Pricing examples).

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "chunk_size_chars": 1500,
  "overlap_chars": 200
}

Per-chunk row

{
  "_type": "chunk",
  "url": "https://...",
  "page": 12,
  "chunk_index": 0,
  "global_chunk_index": 17,
  "text": "Item 1A. Risk Factors\n\nOur business is...",
  "char_count": 1480,
  "token_estimate": 370
}
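The two row types share one dataset, so downstream code usually separates them first. The sketch below does that using the `_type` field shown above. The `estimate_tokens` helper uses the common rule of thumb of about 4 characters per token, which is consistent with the sample row (1480 chars, 370 tokens) but is an assumption about how the actor computes `token_estimate`, not a documented fact.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: about 4 characters per token (assumed heuristic)."""
    return len(text) // 4


def split_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate per-PDF summary rows from per-chunk rows via the _type field."""
    summaries = [r for r in rows if r.get("_type") == "summary"]
    chunks = [r for r in rows if r.get("_type") == "chunk"]
    return summaries, chunks


# Hand-written sample rows shaped like the examples above (not real output).
rows = [
    {"_type": "summary", "url": "https://example.com/a.pdf", "ok": True, "page_count": 2},
    {"_type": "chunk", "url": "https://example.com/a.pdf", "page": 1,
     "chunk_index": 0, "global_chunk_index": 0, "text": "x" * 1480, "char_count": 1480},
]
summaries, chunks = split_rows(rows)
print(len(summaries), len(chunks), estimate_tokens(chunks[0]["text"]))  # 1 1 370
```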

Quick start

Single PDF

{
  "url": "https://www.example.com/report.pdf"
}

Batch with custom chunk size

{
  "urls": [
    "https://...filing1.pdf",
    "https://...filing2.pdf"
  ],
  "chunkSizeChars": 2000,
  "overlapChars": 300,
  "maxPages": 100
}

Optimize for OpenAI text-embedding-3-small (8K-token max)

{
  "url": "https://...",
  "chunkSizeChars": 1500,
  "overlapChars": 200
}
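When the input is assembled in code rather than typed by hand, a small builder can catch obvious mistakes before a paid run starts. This is a sketch only: the field names (`urls`, `chunkSizeChars`, `overlapChars`, `maxPages`) come from the examples above, while the helper name `build_run_input`, its defaults, and its validation rules are assumptions, not documented actor behavior.

```python
def build_run_input(urls, chunk_size_chars=1500, overlap_chars=200, max_pages=None):
    """Build an input dict shaped like the Quick start examples.

    The checks below are assumed sanity rules for illustration; the actor
    may accept or reject different values.
    """
    urls = list(urls)
    if not urls:
        raise ValueError("at least one PDF URL is required")
    if chunk_size_chars <= 0:
        raise ValueError("chunk_size_chars must be positive")
    if not 0 <= overlap_chars < chunk_size_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than chunk_size_chars")
    run_input = {
        "urls": urls,
        "chunkSizeChars": chunk_size_chars,
        "overlapChars": overlap_chars,
    }
    if max_pages is not None:
        run_input["maxPages"] = max_pages  # only sent when explicitly set
    return run_input


example_input = build_run_input(
    ["https://www.example.com/report.pdf"],
    chunk_size_chars=2000,
    overlap_chars=300,
    max_pages=100,
)
print(example_input)
```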

Recommended chunk sizes

Embedding model	chunkSizeChars	Notes
OpenAI text-embedding-3-small	1500	~375 tokens, fits well
OpenAI text-embedding-3-large	2000	~500 tokens
Voyage voyage-3-large	1500	optimal balance
Cohere embed-v3	1800	works with 512-token chunks

Overlap of 100-300 chars boosts recall by ~5-10% with minimal storage cost.
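The storage cost of overlap can be estimated before a run. The sketch below assumes a plain sliding window over a single run of text; the actor's real chunk counts will differ because it chunks per page and looks for coherent boundaries. Both function names are hypothetical.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover total_chars characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far each new window advances
    return 1 + math.ceil((total_chars - chunk_size) / step)


def overlap_overhead(total_chars: int, chunk_size: int, overlap: int) -> float:
    """Fraction of extra characters stored because neighbouring chunks overlap."""
    count = estimate_chunk_count(total_chars, chunk_size, overlap)
    if count == 0:
        return 0.0
    # Each of the (count - 1) boundaries duplicates `overlap` characters.
    return (count - 1) * overlap / total_chars


print(estimate_chunk_count(4000, 1500, 200))  # 3
print(overlap_overhead(4000, 1500, 200))      # 0.1
```

Under these assumptions, a 200-char overlap on 1500-char chunks stores roughly 10-15% more text than non-overlapping chunks would.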

Pricing

Pay-Per-Event:

$0.005 per PDF processed
$0.0002 per chunk emitted

Run	Chunks	Cost
One 80-page 10-K	~200	$0.045
Batch of 100 papers @ 20 pages	~6000	$1.70
Compliance archive 1000 PDFs	~80000	$21

For comparison, Unstructured.io costs $30+/mo plus per-document fees, and LangChain Hosted costs $500+/mo.
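The cost figures in the table follow directly from the two event prices. A minimal calculator, assuming no other charges apply (platform or plan fees, if any, are not covered here), could look like this. `Decimal` is used so the small prices add up exactly.

```python
from decimal import Decimal

PRICE_PER_PDF = Decimal("0.005")     # per PDF processed
PRICE_PER_CHUNK = Decimal("0.0002")  # per chunk emitted


def estimate_cost(pdf_count: int, chunk_count: int) -> Decimal:
    """Estimated pay-per-event cost in USD for one run."""
    return PRICE_PER_PDF * pdf_count + PRICE_PER_CHUNK * chunk_count


# The three rows of the pricing table: (PDFs, chunks)
for pdfs, chunk_total in [(1, 200), (100, 6000), (1000, 80000)]:
    print(pdfs, chunk_total, f"${estimate_cost(pdfs, chunk_total):.3f}")
```

The three calls reproduce the table values of $0.045, $1.70, and $21.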

Pipeline pattern: PDFs → vector DB

import os

from apify_client import ApifyClient
from openai import OpenAI
from pinecone import Pinecone

# API keys are read from environment variables; the variable names
# APIFY_TOKEN and PINECONE_API_KEY are assumptions (OpenAI() reads OPENAI_API_KEY).

# 1. Chunk PDFs
client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("gochujang/pdf-rag-chunker").call(run_input={
    "urls": ["https://...filing.pdf"],
    "chunkSizeChars": 1500,
})

# 2. Embed each chunk (summary rows are skipped)
items = client.dataset(run["defaultDatasetId"]).iterate_items()
chunks = [c for c in items if c.get("_type") == "chunk"]
embeddings = OpenAI().embeddings.create(
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],
).data

# 3. Upsert to vector DB (the "rag-docs" index must already exist)
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-docs")
index.upsert([
    {"id": f"{c['url']}-{c['global_chunk_index']}",
     "values": embeddings[i].embedding,
     "metadata": {"url": c["url"], "page": c["page"]}}
    for i, c in enumerate(chunks)
])
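The pattern above sends every chunk in a single embedding request and a single upsert, which is fine for one filing. Embedding and vector-DB APIs generally cap request size, so larger runs are usually split into batches. The helper below is a generic sketch; the name `batched` and the batch size of 100 are illustrative assumptions, so check your provider's current limits.

```python
from collections.abc import Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the chunk rows from step 2 of the pipeline.
sample_chunks = [{"text": f"chunk {i}"} for i in range(250)]
batch_sizes = [len(batch) for batch in batched(sample_chunks, 100)]
print(batch_sizes)  # [100, 100, 50]
```

In the pipeline, the embedding call and the upsert would each move inside a `for batch in batched(chunks, 100):` loop.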

Limitations

Scanned PDFs (image-only): returns 0 chunks. Use an OCR-equipped actor instead.
Multi-column research papers: reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
No embedding included: embedding requires your own OpenAI/Voyage/Cohere key, since those are separate vendors. This actor focuses on chunking only to keep costs predictable.
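Because scanned PDFs yield no chunks, it helps to flag them after a run so they can be rerouted to OCR. The sketch below assumes that such a PDF still produces a summary row; that is a guess based on the output shape shown earlier, not confirmed behavior. The function name is hypothetical.

```python
from collections import Counter


def pdfs_without_chunks(rows: list[dict]) -> list[str]:
    """Return URLs that have a summary row but no chunk rows."""
    chunk_counts = Counter(r["url"] for r in rows if r.get("_type") == "chunk")
    return [
        r["url"]
        for r in rows
        if r.get("_type") == "summary" and chunk_counts[r["url"]] == 0
    ]


# Hand-written sample rows (not real actor output).
sample_rows = [
    {"_type": "summary", "url": "https://example.com/text.pdf"},
    {"_type": "chunk", "url": "https://example.com/text.pdf", "text": "..."},
    {"_type": "summary", "url": "https://example.com/scanned.pdf"},
]
print(pdfs_without_chunks(sample_rows))  # ['https://example.com/scanned.pdf']
```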

Related actors (same author)

PDF Text & Table Extractor — Same engine, returns full text instead of chunks
Web Page → Markdown Converter — HTML equivalent
Article Summarizer — For one-shot summaries
JSON Schema Generator

Feedback

A short review helps other RAG engineers find this actor. Leave a review on the Apify Store.

URL: https://apify.com/gochujang/pdf-rag-chunker