PDF Text & Table Extractor (pdfplumber, batch URLs)

Pricing

Pay per usage

Download any PDF by URL and extract clean per-page text + detected tables (as 2D arrays) + document metadata (title/author/created/modified). Powered by pdfplumber. Batch up to 50 PDFs. $0.01 per PDF + $0.0005 per page.

Rating

0.0

(0)

Developer

Hojun Lee

Maintained by Community

Last modified: 2 days ago

Why this exists

PDFs are how many important documents get distributed: SEC filings, research papers, financial reports, government records. But the raw bytes are not searchable as-is, and they cannot be fed to LLMs or ingested into databases until the text is extracted.

This actor handles the conversion. You give it a URL list; it returns a structured per-page dataset including:

Clean extracted text (preserving reading order)
Detected tables as 2D arrays (ready for CSV / Sheets export)
Document-level metadata (title, author, creation date)

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/Archives/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "creator": "InDesign",
  "producer": "Adobe Distiller",
  "created": "D:20240928081300Z",
  "modified": "D:20240928081400Z"
}
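The `created` and `modified` values in the example above are raw PDF date strings rather than ISO timestamps. A minimal sketch of converting them, assuming the actor passes the strings through unchanged; `parse_pdf_date` is a hypothetical helper, not part of the actor:

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS followed by an optional
# timezone: Z, or +HH'mm' / -HH'mm'. Fields after the year are optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Return an aware datetime for a PDF date string, or None if it does not parse."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, _zulu, sign, tz_h, tz_m = match.groups()
    offset = timedelta(0)
    if sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        if sign == "-":
            offset = -offset
    try:
        # A missing timezone is treated as UTC here, which is an assumption.
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=timezone(offset),
        )
    except ValueError:
        return None  # e.g. month 13 or an out-of-range offset


print(parse_pdf_date("D:20240928081300Z").isoformat())
```

Returning `None` instead of raising keeps a batch run going when a producer writes a malformed date.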

Per-page row

{
  "_type": "page",
  "url": "https://...",
  "page": 12,
  "char_count": 3210,
  "word_count": 524,
  "text": "Item 1A. Risk Factors\n\nOur business...",
  "tables": [
    [
      ["Revenue", "Q1 2024", "Q4 2023"],
      ["iPhone", "$45.96B", "$43.81B"],
      ["Mac", "$9.66B", "$7.61B"]
    ]
  ],
  "table_count": 1
}
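Since each entry in `tables` is a plain 2D array, exporting it to CSV needs only the standard library. The sketch below assumes empty cells may arrive as `null` (Python `None`); the helper names are illustrative and not part of the actor:

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumed: empty cells may be None, so write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def page_tables_to_csv(page_row):
    """Return one CSV string per table found in a per-page row."""
    return [table_to_csv(table) for table in page_row.get("tables", [])]


page_row = {
    "_type": "page",
    "page": 12,
    "tables": [[
        ["Revenue", "Q1 2024", "Q4 2023"],
        ["iPhone", "$45.96B", "$43.81B"],
        ["Mac", "$9.66B", "$7.61B"],
    ]],
}
print(page_tables_to_csv(page_row)[0])
```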

Quick start

Single PDF

{
  "url": "https://www.example.com/whitepaper.pdf"
}

Batch of 10-K filings from SEC

{
  "urls": [
    "https://www.sec.gov/Archives/edgar/data/320193/aapl-10k.pdf",
    "https://www.sec.gov/Archives/edgar/data/789019/msft-10k.pdf"
  ],
  "extractTables": true,
  "maxPages": 200
}

Text-only (skip tables for speed)

{
  "url": "https://...",
  "extractTables": false
}
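For URL lists longer than the 50-PDF batch limit, the inputs can be generated rather than written by hand. A minimal sketch that only builds the JSON payloads; the field names `urls`, `extractTables`, and `maxPages` come from the examples above, while `build_inputs` and the de-duplication step are assumptions for illustration:

```python
def build_inputs(urls, extract_tables=True, max_pages=None, batch_size=50):
    """Split urls into input payloads of at most batch_size URLs each."""
    # Drop duplicate URLs while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    payloads = []
    for start in range(0, len(unique), batch_size):
        payload = {
            "urls": unique[start:start + batch_size],
            "extractTables": extract_tables,
        }
        if max_pages is not None:
            payload["maxPages"] = max_pages
        payloads.append(payload)
    return payloads


urls = [f"https://www.example.com/report-{i}.pdf" for i in range(120)]
for payload in build_inputs(urls, max_pages=200):
    print(len(payload["urls"]), payload["extractTables"], payload["maxPages"])
```

Each payload can then be submitted as the input of one run.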

Pricing

Pay-Per-Event:

$0.01 — flat per PDF (download + metadata)
$0.0005 — per page extracted

Run	PDFs × pages per PDF	Cost
One 80-page 10-K	1 × 80	$0.05
Batch of 10 research papers	10 × 20	$0.20
Quarterly: 50 earnings releases	50 × 15	$0.88 (rounded from $0.875)

Compared with Adobe Acrobat Pro DC ($23/mo) for manual extraction, or DocParser ($199/mo for API access), this is roughly 5-10x cheaper at typical volumes.
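The pricing rule is a flat fee per PDF plus a fee per extracted page, so a run can be estimated before it starts. A small sketch using the two listed prices; `estimate_cost` is a hypothetical helper, and `Decimal` is used only to avoid floating-point rounding noise:

```python
from decimal import Decimal

PER_PDF = Decimal("0.01")      # flat fee per PDF, from the pricing list
PER_PAGE = Decimal("0.0005")   # fee per extracted page, from the pricing list


def estimate_cost(page_counts):
    """Estimate the charge in USD for PDFs with the given page counts."""
    return PER_PDF * len(page_counts) + PER_PAGE * sum(page_counts)


print(estimate_cost([80]))        # one 80-page filing
print(estimate_cost([20] * 10))   # ten 20-page papers
print(estimate_cost([15] * 50))   # fifty 15-page releases
```

The three calls correspond to the rows of the pricing table: $0.05, $0.20, and $0.875.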

Use cases

SEC filings — Pull text + tables from 10-K, 10-Q, 8-K. Combine with our SEC EDGAR Tracker.
Research aggregation — Build a searchable database of academic papers + abstracts
Financial reports — Auto-extract earnings tables from quarterly releases
LLM RAG — Convert PDFs to chunks for vector search / Q&A
Compliance audit — Index every PDF in your corporate document store
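For the RAG use case, per-page rows can be split into overlapping chunks before embedding. The sketch below assumes rows shaped like the per-page example (`url`, `page`, `text`); the word-window strategy, the chunk sizes, and the `chunk_page` name are illustrative choices, not something the actor does for you:

```python
def chunk_page(page_row, max_words=200, overlap=40):
    """Split one per-page row into overlapping word-window chunks."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = page_row.get("text", "").split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({
            "url": page_row["url"],
            "page": page_row["page"],
            "chunk_index": len(chunks),
            "text": " ".join(window),
        })
        if start + max_words >= len(words):
            break  # this window already reached the end of the page
    return chunks


row = {"_type": "page", "url": "https://...", "page": 12,
       "text": " ".join(f"w{i}" for i in range(10))}
for chunk in chunk_page(row, max_words=4, overlap=1):
    print(chunk["chunk_index"], chunk["text"])
```

Keeping `url` and `page` on every chunk lets a retrieved passage be cited back to its source page.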

Limitations

Scanned PDFs (image-only) — Returns empty text. Use OCR for scanned PDFs.
Complex layouts — Multi-column research papers may merge column text awkwardly. Tweak with custom extraction parameters in v0.2.
Encrypted PDFs — Will fail with a clear error message.
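Because scanned pages come back with empty text, they can be detected in the output and routed to OCR separately. A minimal sketch, assuming rows shaped like the examples above; `pages_needing_ocr` and the `min_chars` threshold are hypothetical:

```python
from collections import defaultdict


def pages_needing_ocr(rows, min_chars=1):
    """Map each URL to the page numbers whose extracted text is empty or near empty."""
    flagged = defaultdict(list)
    for row in rows:
        if row.get("_type") != "page":
            continue  # skip summary rows
        # Fall back to measuring the text if char_count is absent.
        chars = row.get("char_count", len(row.get("text", "")))
        if chars < min_chars:
            flagged[row["url"]].append(row["page"])
    return dict(flagged)


rows = [
    {"_type": "summary", "url": "https://.../a.pdf", "ok": True, "page_count": 2},
    {"_type": "page", "url": "https://.../a.pdf", "page": 1, "char_count": 3210},
    {"_type": "page", "url": "https://.../a.pdf", "page": 2, "char_count": 0},
]
print(pages_needing_ocr(rows))
```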

Data engine

pdfplumber v0.11+: pure-Python, built on pdfminer.six, and widely used in data-engineering pipelines.
No OCR in this actor. For OCR, combine with a separate actor that runs Tesseract or Vision API.
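For readers who want to reproduce similar output locally, the sketch below shows roughly how per-page rows could be produced with pdfplumber. It is not the actor's source code: the row layout mirrors the per-page example, the whitespace-based `word_count` is an assumption, and `page_row` and `extract_rows` are hypothetical names. It needs `pip install pdfplumber` and a local file path.

```python
def page_row(url, page_number, text, tables):
    """Build a per-page row shaped like the example output."""
    return {
        "_type": "page",
        "url": url,
        "page": page_number,
        "char_count": len(text),
        "word_count": len(text.split()),  # assumed: whitespace-separated words
        "text": text,
        "tables": tables,
        "table_count": len(tables),
    }


def extract_rows(path, url, extract_tables=True, max_pages=None):
    """Yield one row per page of a local PDF file using pdfplumber."""
    import pdfplumber  # third-party; imported here so page_row works without it

    with pdfplumber.open(path) as pdf:
        pages = pdf.pages if max_pages is None else pdf.pages[:max_pages]
        for page in pages:
            text = page.extract_text() or ""  # image-only pages yield no text
            tables = page.extract_tables() if extract_tables else []
            yield page_row(url, page.page_number, text, tables)
```

Usage would look like `rows = list(extract_rows("report.pdf", "https://..."))`, with document metadata available separately from `pdf.metadata`.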

Related actors (same author)

Web Page → Markdown Converter — HTML version of the same idea
HTML Metadata Extractor
SEC EDGAR Filing Tracker — Get the SEC filing URLs to feed in
JSON Schema Generator

Feedback

A short review helps researchers / analysts find it: Leave a review on Apify Store

URL: https://apify.com/gochujang/pdf-text-extractor