URL: https://apify.com/gochujang/pdf-text-extractor

PDF Text & Table Extractor (pdfplumber, batch URLs) · Apify


Pricing

Pay per usage

Download any PDF by URL and extract clean per-page text + detected tables (as 2D arrays) + document metadata (title/author/created/modified). Powered by pdfplumber. Batch up to 50 PDFs. $0.01 per PDF + $0.0005 per page.

Rating: 0.0 (0)

Developer: Hojun Lee

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 7
  • Monthly active users: 6
  • Last modified: 2 days ago


Why this exists

PDFs are how many important documents get distributed: SEC filings, research papers, financial reports, government records. But the raw bytes can't be searched, fed to LLMs, or ingested into databases as-is.

This actor handles the conversion. You give it a URL list; it returns a structured per-page dataset including:

  • Clean extracted text (preserving reading order)
  • Detected tables as 2D arrays (ready for CSV / Sheets export)
  • Document-level metadata (title, author, creation date)

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/Archives/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. - Annual Report 2024",
  "author": "Apple Inc.",
  "creator": "InDesign",
  "producer": "Adobe Distiller",
  "created": "D:20240928081300Z",
  "modified": "D:20240928081400Z"
}
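
The "created" and "modified" values in the summary row look like raw PDF date strings (a "D:" prefix followed by digits and a timezone marker). A minimal sketch for turning them into Python datetime objects follows; the function name `parse_pdf_date` is hypothetical and not part of the actor, and the regex only covers the common "Z" and "+HH'mm'" timezone forms.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches strings such as "D:20240928081300Z" or "D:20240928081300+09'00'".
# Every field after the year is optional in this sketch.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None

    # Missing month/day default to 1, missing time fields default to 0.
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(group) if group else default
        for group, default in zip(match.groups()[:6], defaults)
    )

    tz = None  # stays naive when the string carries no timezone marker
    if match.group(7):  # literal "Z" means UTC
        tz = timezone.utc
    elif match.group(8):  # explicit +HH'mm' or -HH'mm' offset
        offset = timedelta(
            hours=int(match.group(9)), minutes=int(match.group(10) or 0)
        )
        tz = timezone(offset if match.group(8) == "+" else -offset)

    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20240928081300Z")
print(created.isoformat())
```

Out-of-range values (for example month 13) would raise ValueError from the datetime constructor; wrap the call if the input PDFs are untrusted.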

Per-page row

{
  "_type": "page",
  "url": "https://...",
  "page": 12,
  "char_count": 3210,
  "word_count": 524,
  "text": "Item 1A. Risk Factors\n\nOur business...",
  "tables": [
    [
      ["Revenue", "Q1 2024", "Q4 2023"],
      ["iPhone", "$45.96B", "$43.81B"],
      ["Mac", "$9.66B", "$7.61B"]
    ]
  ],
  "table_count": 1
}
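
Because each table arrives as a 2D array, exporting it to CSV needs only the standard library. The sketch below assumes the row shape shown above; `table_to_csv` is a hypothetical helper, and the handling of None cells is a precaution in case the extractor emits them for empty cells.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Write None cells as empty strings so the column count stays stable.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Sample per-page row, trimmed to the fields used here.
page_row = {
    "page": 12,
    "tables": [
        [
            ["Revenue", "Q1 2024", "Q4 2023"],
            ["iPhone", "$45.96B", "$43.81B"],
            ["Mac", "$9.66B", "$7.61B"],
        ]
    ],
}

for index, table in enumerate(page_row["tables"]):
    print(f"# page {page_row['page']}, table {index}")
    print(table_to_csv(table))
```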

Quick start

Single PDF

{
  "url": "https://www.example.com/whitepaper.pdf"
}

Batch of 10-K filings from SEC

{
  "urls": [
    "https://www.sec.gov/Archives/edgar/data/320193/aapl-10k.pdf",
    "https://www.sec.gov/Archives/edgar/data/789019/msft-10k.pdf"
  ],
  "extractTables": true,
  "maxPages": 200
}

Text-only (skip tables for speed)

{
  "url": "https://...",
  "extractTables": false
}
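
The inputs above can also be submitted from code. The following is a sketch using the apify-client Python package, not an official integration: the actor ID is inferred from the store URL, `build_run_input` and `run_extractor` are hypothetical helper names, and the 50-URL limit is taken from the actor description.

```python
ACTOR_ID = "gochujang/pdf-text-extractor"  # assumption: derived from the store URL
MAX_BATCH = 50  # batch limit stated in the actor description


def build_run_input(urls, extract_tables=True, max_pages=None):
    """Build the actor input using the field names shown in the quick start."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one PDF URL is required")
    if len(urls) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} PDFs per run, got {len(urls)}")
    run_input = {"urls": urls, "extractTables": extract_tables}
    if max_pages is not None:
        run_input["maxPages"] = max_pages
    return run_input


def run_extractor(run_input, token):
    """Start the actor, wait for it to finish, and return all dataset rows."""
    # Imported lazily so the input builder works without the dependency.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example usage (needs a real API token and network access):
#   rows = run_extractor(build_run_input(["https://www.example.com/whitepaper.pdf"]), token)
#   pages = [row for row in rows if row["_type"] == "page"]
```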

Pricing

Pay-Per-Event:

  • $0.01 flat per PDF (download + metadata)
  • $0.0005 per page extracted

Run                              | PDFs × Pages | Cost
One 80-page 10-K                 | 1 × 80       | $0.05
Batch of 10 research papers      | 10 × 20      | $0.20
Quarterly: 50 earnings releases  | 50 × 15      | $0.88
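
The figures above follow from the two event prices. A small sketch of the arithmetic, assuming every PDF in a run has the same page count; `estimate_cost` is a hypothetical helper, and Decimal is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

PER_PDF = Decimal("0.01")     # flat fee per PDF
PER_PAGE = Decimal("0.0005")  # fee per extracted page


def estimate_cost(pdf_count, pages_per_pdf):
    """Return the estimated run cost in USD as a Decimal."""
    total_pages = pdf_count * pages_per_pdf
    return pdf_count * PER_PDF + total_pages * PER_PAGE


print(estimate_cost(1, 80))   # 0.0500
print(estimate_cost(10, 20))  # 0.2000
print(estimate_cost(50, 15))  # 0.8750
```

The last case comes to $0.875, which rounds to $0.88.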

Compared with Adobe Acrobat Pro DC ($23/mo) for manual extraction, or DocParser ($199/mo for API access), this is 5-10x cheaper at typical volumes.


Use cases

  1. SEC filings: Pull text + tables from 10-K, 10-Q, 8-K. Combine with our SEC EDGAR Tracker.
  2. Research aggregation: Build a searchable database of academic papers + abstracts
  3. Financial reports: Auto-extract earnings tables from quarterly releases
  4. LLM RAG: Convert PDFs to chunks for vector search / Q&A
  5. Compliance audit: Index every PDF in your corporate document store
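
For the LLM RAG use case, per-page rows can be split into overlapping word windows before embedding. This is one simple chunking approach among many, sketched under the assumption that rows have the "url", "page", and "text" fields shown earlier; `chunk_page` and its default sizes are hypothetical.

```python
def chunk_page(page_row, max_words=200, overlap=40):
    """Split one page's text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = page_row["text"].split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(
            {
                "url": page_row["url"],
                "page": page_row["page"],
                "chunk": len(chunks),  # position of the chunk within the page
                "text": " ".join(words[start:start + max_words]),
            }
        )
        if start + max_words >= len(words):
            break  # this window already reached the end of the page
    return chunks


row = {
    "url": "https://www.example.com/whitepaper.pdf",
    "page": 1,
    "text": " ".join(f"w{i}" for i in range(10)),
}
for chunk in chunk_page(row, max_words=4, overlap=1):
    print(chunk["chunk"], chunk["text"])
```

Keeping "url" and "page" on every chunk lets a retrieval hit be traced back to its source page.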

Limitations

  • Scanned PDFs (image-only): Returns empty text. Use OCR for scanned PDFs.
  • Complex layouts: Multi-column research papers may merge column text awkwardly. Tweak with custom extraction parameters in v0.2.
  • Encrypted PDFs: Will fail with a clear error message.
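
Since scanned pages come back with empty text, a run's dataset can be screened for pages that probably need OCR. A minimal sketch, assuming rows carry the "_type", "url", "page", and "char_count" fields shown earlier; `find_pages_needing_ocr` and the `min_chars` threshold are hypothetical.

```python
def find_pages_needing_ocr(rows, min_chars=1):
    """Map each PDF URL to the page numbers whose text is (nearly) empty."""
    flagged = {}
    for row in rows:
        if row.get("_type") != "page":
            continue  # skip summary rows
        if row.get("char_count", 0) < min_chars:
            flagged.setdefault(row["url"], []).append(row["page"])
    return flagged


rows = [
    {"_type": "summary", "url": "https://www.example.com/scan.pdf", "page_count": 2},
    {"_type": "page", "url": "https://www.example.com/scan.pdf", "page": 1, "char_count": 0},
    {"_type": "page", "url": "https://www.example.com/scan.pdf", "page": 2, "char_count": 1840},
]
print(find_pages_needing_ocr(rows))
```

Raising `min_chars` also catches pages where only a stray header or page number was extracted.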

Data engine

  • pdfplumber v0.11+: Pure-Python, robust, and used in many data-engineering pipelines.
  • No OCR in this actor. For OCR, combine with a separate actor that runs Tesseract or Vision API.
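
To illustrate what a pdfplumber-based extraction of this shape might look like, here is a rough sketch. It is not the actor's source code: the function names, the way `char_count` and `word_count` are computed, and the use of 1-based page numbers are all assumptions made to mirror the output rows shown earlier.

```python
import io
import urllib.request


def build_page_row(url, page_number, text, tables):
    """Assemble a per-page row in the shape shown under 'What you get'."""
    text = text or ""  # treat a missing extraction result as empty text
    return {
        "_type": "page",
        "url": url,
        "page": page_number,
        "char_count": len(text),          # assumption: plain character count
        "word_count": len(text.split()),  # assumption: whitespace-split words
        "text": text,
        "tables": tables,
        "table_count": len(tables),
    }


def extract_pdf(url, extract_tables=True, max_pages=None):
    """Download one PDF and return a summary row followed by page rows."""
    # Imported lazily so build_page_row works without the dependency.
    import pdfplumber  # pip install pdfplumber

    with urllib.request.urlopen(url) as response:
        data = response.read()

    rows = []
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        meta = pdf.metadata or {}
        rows.append(
            {
                "_type": "summary",
                "url": url,
                "ok": True,
                "page_count": len(pdf.pages),
                "title": meta.get("Title"),
                "author": meta.get("Author"),
                "created": meta.get("CreationDate"),
                "modified": meta.get("ModDate"),
            }
        )
        pages = pdf.pages[:max_pages] if max_pages else pdf.pages
        for page in pages:
            tables = page.extract_tables() if extract_tables else []
            rows.append(
                build_page_row(url, page.page_number, page.extract_text(), tables)
            )
    return rows
```

Skipping `extract_tables()` is what makes a text-only run faster, since table detection has to analyze each page's lines and word positions.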

Feedback

A short review helps researchers / analysts find it: Leave a review on Apify Store
