VOOZH about

URL: https://apify.com/gochujang/pdf-rag-chunker

โ‡ฑ PDF โ†’ RAG Chunks (Token-Aware, Vector-Ready) ยท Apify


๐Ÿ‘ PDF โ†’ RAG Chunks (Token-Aware, Vector-Ready) avatar

PDF โ†’ RAG Chunks (Token-Aware, Vector-Ready)

Pricing

Pay per usage

Go to Apify Store

PDF โ†’ RAG Chunks (Token-Aware, Vector-Ready)

Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

๐Ÿ‘ Hojun Lee

Hojun Lee

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

3 days ago

Last modified

Share

PDF โ†’ RAG Chunks

Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.


Why this exists

To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need:

  1. Download โ†’ extract text per page
  2. Chunk into semantic segments (1000-2000 chars typical)
  3. Optional: embed each chunk and store in vector DB
  4. Query: embed question, retrieve top-k chunks, ask LLM

This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.

Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor: $0.50 per 1K pages.


What you get

Summary row (one per PDF)

{
"_type":"summary",
"url":"https://www.sec.gov/.../aapl-10k.pdf",
"ok":true,
"page_count":80,
"title":"Apple Inc. โ€” Annual Report 2024",
"author":"Apple Inc.",
"chunk_size_chars":1500,
"overlap_chars":200
}

Per-chunk row

{
"_type":"chunk",
"url":"https://...",
"page":12,
"chunk_index":0,
"global_chunk_index":17,
"text":"Item 1A. Risk Factors\n\nOur business is...",
"char_count":1480,
"token_estimate":370
}

Quick start

Single PDF

{
"url":"https://www.example.com/report.pdf"
}

Batch with custom chunk size

{
"urls":[
"https://...filing1.pdf",
"https://...filing2.pdf"
],
"chunkSizeChars":2000,
"overlapChars":300,
"maxPages":100
}

Optimize for OpenAI text-embedding-3-small (8K-token max)

{
"url":"https://...",
"chunkSizeChars":1500,
"overlapChars":200
}

Recommended chunk sizes

Embedding modelchunkSizeCharsNotes
OpenAI text-embedding-3-small1500~375 tokens, fits well
OpenAI text-embedding-3-large2000~500 tokens
Voyage voyage-3-large1500optimal balance
Cohere embed-v31800works with 512-token chunks

Overlap of 100-300 chars boosts recall by ~5-10% with minimal storage cost.


Pricing

Pay-Per-Event:

  • $0.005 per PDF processed
  • $0.0002 per chunk emitted
RunChunksCost
One 80-page 10-K~200$0.045
Batch of 100 papers @ 20 pages~6000$1.70
Compliance archive 1000 PDFs~80000$21

vs Unstructured.io ($30+/mo + per-doc) or LangChain Hosted ($500+/mo).


Pipeline pattern: PDFs โ†’ vector DB

import apify_client, openai, pinecone
# 1. Chunk PDFs
client = apify_client.ApifyClient(token)
run = client.actor("gochujang/pdf-rag-chunker").call(run_input={
"urls":["https://...filing.pdf"],
"chunkSizeChars":1500,
})
# 2. Embed each chunk
chunks =list(client.dataset(run["defaultDatasetId"]).iterate_items())
chunks =[c for c in chunks if c.get("_type")=="chunk"]
embeddings = openai.embeddings.create(
model="text-embedding-3-small",
input=[c["text"]for c in chunks],
).data
# 3. Upsert to vector DB
index = pinecone.Index("rag-docs")
index.upsert([
{"id":f"{c['url']}-{c['global_chunk_index']}",
"values": embeddings[i].embedding,
"metadata":{"url": c["url"],"page": c["page"]}}
for i, c inenumerate(chunks)
])

Limitations

  • Scanned PDFs (image-only) โ€” Returns 0 chunks. Use OCR-equipped actor.
  • Multi-column research papers โ€” Reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
  • No embedding included โ€” Embedding requires your own OpenAI/Voyage/Cohere key (different vendor). We focus on chunking only to keep costs predictable.

Related actors (same author)


Feedback

A short review helps RAG engineers find it: Leave a review on Apify Store

You might also like

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Markdown RAG Chunker

codepoetry/markdown-rag-chunker

Chunk any document for RAG โ€” PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

๐Ÿ‘ User avatar

Stas Persiianenko

7

Rag Embedding Generator

labrat011/rag-embedding-generator

Generate vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains with RAG Content Chunker for end-to-end RAG pipelines. Outputs raw vectors ready for any vector database.

Rag Knowledge Graph Builder

cspnair/rag-knowledge-graph-builder

Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.

EU AI Act & Regulation Monitor (RAG-Optimized)

aelix/eu-ai-act-regulation-monitor

Monitors EUR-Lex for EU AI-related legislation and delivers clean, structured Markdown/JSON enriched with CELEX IDs, version hashes, token counts, and vector-DB chunk hints. Ideal for RAG pipelines, legal AI assistants, and compliance dashboards. Premium RAG-Ready Feed: $150.00 per 1,000 results.