PDF → RAG Chunks (Token-Aware, Vector-Ready)
Downloads any PDF and splits it into semantically coherent chunks ready for embedding/RAG. Chunk size and overlap are configurable. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.
Why this exists
To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need to:
1. Download each PDF and extract its text page by page
2. Chunk the text into semantic segments (1000-2000 chars is typical)
3. Optionally embed each chunk and store it in a vector DB
4. At query time, embed the question, retrieve the top-k chunks, and ask the LLM
This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.
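Step 4 happens outside this actor, but it helps to see what the chunk rows are eventually used for. The following is a minimal sketch of top-k retrieval by cosine similarity. The function names and the toy 3-dimensional vectors are illustrative assumptions; a real setup would use vectors from your embedding model and usually a vector DB instead of a Python list.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, embedded_chunks, k=3):
    """Return the k chunk rows whose vectors are most similar to the query.

    embedded_chunks is a list of (chunk_row, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vec), row) for row, vec in embedded_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]

# Toy vectors stand in for real embeddings (assumption for illustration).
embedded = [
    ({"page": 12, "text": "Item 1A. Risk Factors"}, [0.9, 0.1, 0.0]),
    ({"page": 30, "text": "Revenue by segment"}, [0.1, 0.9, 0.0]),
    ({"page": 55, "text": "Legal proceedings"}, [0.7, 0.2, 0.1]),
]
best = top_k_chunks([1.0, 0.0, 0.0], embedded, k=2)
print([row["page"] for row in best])
```

The retrieved rows keep their `page` and `text` fields, so the prompt sent to the LLM can cite the page each passage came from.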
Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor: $0.50 per 1K pages.
What you get
Summary row (one per PDF)
{"_type":"summary","url":"https://www.sec.gov/.../aapl-10k.pdf","ok":true,"page_count":80,"title":"Apple Inc. - Annual Report 2024","author":"Apple Inc.","chunk_size_chars":1500,"overlap_chars":200}
Per-chunk row
{"_type":"chunk","url":"https://...","page":12,"chunk_index":0,"global_chunk_index":17,"text":"Item 1A. Risk Factors\n\nOur business is...","char_count":1480,"token_estimate":370}
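Because summary and chunk rows arrive in the same dataset, downstream code usually separates them by `_type` first. Here is a small sketch of that step. The helper names are hypothetical, the sample rows are abbreviated, and the 4-characters-per-token ratio is an assumption that happens to match the sample row above (1480 chars, 370 tokens); the actor's exact estimation formula is not documented here.

```python
import json

def split_rows(items):
    """Separate dataset items into per-PDF summaries and per-chunk rows."""
    summaries = [row for row in items if row.get("_type") == "summary"]
    chunks = [row for row in items if row.get("_type") == "chunk"]
    return summaries, chunks

def approx_tokens(char_count, chars_per_token=4):
    """Rough token estimate; the 4 chars/token ratio is an assumption."""
    return round(char_count / chars_per_token)

# Abbreviated sample rows shaped like the examples above.
sample_lines = [
    '{"_type":"summary","url":"https://example.com/a.pdf","ok":true,"page_count":80}',
    '{"_type":"chunk","url":"https://example.com/a.pdf","page":12,"chunk_index":0,'
    '"global_chunk_index":17,"text":"Item 1A. Risk Factors","char_count":1480,"token_estimate":370}',
]
items = [json.loads(line) for line in sample_lines]
summaries, chunks = split_rows(items)
print(len(summaries), len(chunks))
print(approx_tokens(chunks[0]["char_count"]), chunks[0]["token_estimate"])
```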
Quick start
Single PDF
{"url":"https://www.example.com/report.pdf"}
Batch with custom chunk size
{"urls":["https://...filing1.pdf","https://...filing2.pdf"],"chunkSizeChars":2000,"overlapChars":300,"maxPages":100}
Optimize for OpenAI text-embedding-3-small (8K-token max)
{"url":"https://...","chunkSizeChars":1500,"overlapChars":200}
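To make the two size parameters concrete, here is a sketch of a plain sliding-window chunker. It is not the actor's algorithm, which aims for semantically coherent boundaries; it only illustrates how a chunk size and an overlap interact. The function name and defaults are assumptions chosen to mirror `chunkSizeChars` and `overlapChars`.

```python
def chunk_text(text, chunk_size_chars=1500, overlap_chars=200):
    """Split text into fixed-size windows that share overlap_chars characters."""
    if chunk_size_chars <= 0:
        raise ValueError("chunk_size_chars must be positive")
    if not 0 <= overlap_chars < chunk_size_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than chunk_size_chars")

    step = chunk_size_chars - overlap_chars  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size_chars])
        if start + chunk_size_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Tiny example: windows of 4 characters, each sharing 1 character with the previous one.
print(chunk_text("abcdefghij", chunk_size_chars=4, overlap_chars=1))
```

Each window starts `chunk_size_chars - overlap_chars` characters after the previous one, so a larger overlap means more chunks for the same document.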
Recommended chunk sizes
| Embedding model | chunkSizeChars | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1500 | ~375 tokens, fits well |
| OpenAI text-embedding-3-large | 2000 | ~500 tokens |
| Voyage voyage-3-large | 1500 | ~375 tokens, a balanced default |
| Cohere embed-v3 | 1800 | ~450 tokens, stays under the 512-token input limit |
An overlap of 100-300 chars typically improves retrieval recall by ~5-10% at a small storage cost.
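To gauge the storage side of that trade-off, the chunk count can be estimated from the text length. This sketch assumes simple fixed-size windows, which is a simplification of the actor's semantic chunking, so treat the numbers as rough; the function name is hypothetical.

```python
import math

def estimate_chunk_count(total_chars, chunk_size_chars, overlap_chars):
    """Estimate the number of fixed-size windows needed to cover total_chars."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size_chars:
        return 1
    step = chunk_size_chars - overlap_chars
    # One full first window, then one more window per `step` of remaining text.
    return math.ceil((total_chars - chunk_size_chars) / step) + 1

with_overlap = estimate_chunk_count(100_000, 1500, 200)
without_overlap = estimate_chunk_count(100_000, 1500, 0)
print(with_overlap, without_overlap)  # prints: 77 67
```

Under these assumptions, a 200-char overlap on 1500-char chunks adds about 15% more chunks for a 100,000-character document, and since pricing is per chunk, the same ratio applies to the chunk portion of the cost.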
Pricing
Pay-Per-Event:
- $0.005 per PDF processed
- $0.0002 per chunk emitted
| Run | Chunks | Cost |
|---|---|---|
| One 80-page 10-K | ~200 | $0.045 |
| Batch of 100 papers @ 20 pages | ~6000 | $1.70 |
| Compliance archive 1000 PDFs | ~80000 | $21 |
Compare with Unstructured.io ($30+/mo plus per-document fees) or LangChain Hosted ($500+/mo).
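The costs in the table follow directly from the two event prices. A short sketch of the arithmetic, using a hypothetical helper name and the chunk counts from the table as inputs:

```python
PRICE_PER_PDF = 0.005      # USD per PDF processed
PRICE_PER_CHUNK = 0.0002   # USD per chunk emitted

def estimate_cost(pdf_count, chunk_count):
    """Estimated run cost in USD, rounded to 4 decimal places."""
    return round(pdf_count * PRICE_PER_PDF + chunk_count * PRICE_PER_CHUNK, 4)

print(estimate_cost(1, 200))        # one 80-page 10-K
print(estimate_cost(100, 6000))     # batch of 100 papers
print(estimate_cost(1000, 80000))   # compliance archive
```

Actual chunk counts depend on how much text each page contains and on the chunk size and overlap you choose, so the chunk totals are estimates until a run completes.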
Pipeline pattern: PDFs → vector DB
    import os

    import openai
    from apify_client import ApifyClient
    from pinecone import Pinecone

    # API keys are read from the environment: APIFY_TOKEN, OPENAI_API_KEY, PINECONE_API_KEY

    # 1. Chunk PDFs
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor("gochujang/pdf-rag-chunker").call(
        run_input={
            "urls": ["https://...filing.pdf"],
            "chunkSizeChars": 1500,
        }
    )

    # 2. Embed each chunk (summary rows are skipped)
    items = client.dataset(run["defaultDatasetId"]).iterate_items()
    chunks = [c for c in items if c.get("_type") == "chunk"]
    embeddings = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[c["text"] for c in chunks],
    ).data

    # 3. Upsert to vector DB
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-docs")
    index.upsert(
        vectors=[
            {
                "id": f"{c['url']}-{c['global_chunk_index']}",
                "values": embeddings[i].embedding,
                "metadata": {"url": c["url"], "page": c["page"]},
            }
            for i, c in enumerate(chunks)
        ]
    )
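For large runs, sending every chunk in a single embedding or upsert request may exceed per-request limits of the embedding API or vector DB. A minimal batching helper, sketched under the assumption that you pick a batch size that fits your provider's limits (the helper name and the size used below are illustrative, not provider-specified values):

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example with placeholder chunk texts; batch_size=2 is only for demonstration.
texts = ["chunk 0", "chunk 1", "chunk 2", "chunk 3", "chunk 4"]
for batch in batched(texts, batch_size=2):
    print(len(batch), batch)
```

Each batch can then be passed to the embedding call and the upsert call in turn, keeping chunk order so that embeddings line up with their rows.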
Limitations
- Scanned PDFs (image-only): Returns 0 chunks. Use an OCR-equipped actor instead.
- Multi-column research papers: Reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
- No embedding included: Embedding requires your own OpenAI/Voyage/Cohere key (different vendor). We focus on chunking only to keep costs predictable.
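Since image-only PDFs yield no chunks, it can be useful to flag them after a run and route them to OCR. The sketch below assumes that such a PDF still produces a summary row and simply has no chunk rows with the same `url`; that behavior and the helper name are assumptions, so verify them against a real run.

```python
def pdfs_without_chunks(items):
    """Return URLs that have a summary row but no chunk rows."""
    chunked_urls = {row["url"] for row in items if row.get("_type") == "chunk"}
    return [
        row["url"]
        for row in items
        if row.get("_type") == "summary" and row["url"] not in chunked_urls
    ]

# Hypothetical dataset items: one text PDF and one scanned PDF.
items = [
    {"_type": "summary", "url": "https://example.com/text.pdf", "ok": True},
    {"_type": "chunk", "url": "https://example.com/text.pdf", "page": 1, "text": "..."},
    {"_type": "summary", "url": "https://example.com/scanned.pdf", "ok": True},
]
print(pdfs_without_chunks(items))
```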
Related actors (same author)
- PDF Text & Table Extractor: Same engine, returns full text instead of chunks
- Web Page → Markdown Converter: HTML equivalent
- Article Summarizer: For one-shot summaries
- JSON Schema Generator
Feedback
A short review helps RAG engineers find it: Leave a review on Apify Store
