URL: https://apify.com/automation-lab/pdf-text-extractor

PDF Text Extractor - Extract Text & Metadata from PDF Files · Apify


Pricing

Pay per event

PDF Text Extractor

Extract text, metadata, and page-by-page content from PDF files. Provide PDF URLs and get structured JSON with full text, per-page text, page count, author, title, creation date, and more. Export as JSON, CSV, or Excel. No browser or proxy needed.

Rating

0.0

(0)

Developer: Stas Persiianenko (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 85
Monthly active users: 42
Last modified: 2 months ago

What does PDF Text Extractor do?

PDF Text Extractor downloads PDF files from any public URL and extracts structured text, metadata, and per-page content. It returns clean JSON with the full document text, individual page text, page count, and all PDF metadata (title, author, creation date, producer, and more).

Unlike browser-based PDF tools, this actor uses pure server-side processing with no browser overhead. It processes PDFs in parallel for maximum throughput and handles errors gracefully -- if one PDF fails, the rest still complete.

Try it now on the Apify Store with the prefilled example URLs.

Who is PDF Text Extractor for?

AI/ML Engineers and Data Scientists

  • Extract text from research papers, whitepapers, and technical documentation for RAG pipelines
  • Build training datasets from large PDF collections
  • Feed document content into LLMs for summarization and analysis

Legal and Compliance Teams

  • Extract text from contracts, filings, and regulatory documents
  • Build searchable archives from PDF-only document repositories
  • Automate document review workflows

Researchers and Academics

  • Bulk-extract text from academic papers and journal articles
  • Build citation databases from PDF collections
  • Convert lecture notes and course materials to searchable text

Developers and Automation Engineers

  • Integrate PDF text extraction into data pipelines via API
  • Process invoices, receipts, and forms at scale
  • Extract metadata for document management systems

Why use PDF Text Extractor?

  • Pure server-side processing -- no browser, no proxy, near-zero cost per PDF
  • Per-page text extraction -- get text for each individual page, not just the whole document
  • Rich metadata -- title, author, subject, keywords, creator, producer, creation/modification dates, PDF version
  • Parallel processing -- configure concurrency to process multiple PDFs simultaneously
  • Graceful error handling -- failed PDFs don't stop the entire batch
  • API access -- integrate with 5,000+ apps via Zapier, Make, and the Apify API
  • Scheduled runs -- set up recurring extractions for document monitoring
  • Multiple export formats -- JSON, CSV, Excel, XML, HTML

What data can you extract?

Category | Fields
Document text | Full text, per-page text array
Metadata | Title, author, subject, keywords
Producer info | Creator application, producer application
Dates | Creation date, modification date (ISO 8601)
Technical | Page count, PDF version, file size in bytes
Error handling | Error message (null when successful)

Each PDF produces one dataset row with 16 structured fields.

How much does it cost to extract text from PDFs?

PDF Text Extractor uses pay-per-event pricing. You only pay for what you use:

Event | FREE tier | BRONZE | SILVER | GOLD
Run started (one-time) | $0.005 | $0.005 | $0.005 | $0.005
Per PDF extracted | $0.00345 | $0.003 | $0.00234 | $0.0018

Example costs (BRONZE tier):

  • 10 PDFs: $0.005 + 10 x $0.003 = $0.035
  • 100 PDFs: $0.005 + 100 x $0.003 = $0.305
  • 1,000 PDFs: $0.005 + 1,000 x $0.003 = $3.005

With the free $5 Apify credit, you can extract text from roughly 1,400 PDFs at FREE-tier rates (about 1,600 at BRONZE rates) at no cost.
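The pricing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the prices are copied from the pricing table, while the function names `estimate_cost` and `pdfs_within_budget` are hypothetical helpers, not part of the actor or the Apify client.

```python
# Prices in USD, copied from the pricing table above.
RUN_START_FEE = 0.005
PER_PDF_PRICE = {
    "FREE": 0.00345,
    "BRONZE": 0.003,
    "SILVER": 0.00234,
    "GOLD": 0.0018,
}


def estimate_cost(pdf_count: int, tier: str = "BRONZE") -> float:
    """Estimated cost in USD of one run that extracts pdf_count PDFs."""
    if pdf_count < 0:
        raise ValueError("pdf_count must be non-negative")
    return round(RUN_START_FEE + pdf_count * PER_PDF_PRICE[tier], 6)


def pdfs_within_budget(budget: float, tier: str = "FREE") -> int:
    """Largest number of PDFs a single run can extract within budget."""
    remaining = budget - RUN_START_FEE
    if remaining < 0:
        return 0
    # The small epsilon guards against floating-point results such as 1664.9999999.
    return int(remaining / PER_PDF_PRICE[tier] + 1e-9)
```

Calling `estimate_cost(1000, "BRONZE")` reproduces the $3.005 figure from the example costs. The budget helper assumes all PDFs go through a single run, so the start fee is charged once.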

How to extract text from PDF files

  1. Go to the PDF Text Extractor page on Apify Store
  2. Click Try for free to open the actor in Apify Console
  3. Paste your PDF URLs into the PDF URLs field (one per line)
  4. Adjust concurrency and timeout settings if needed
  5. Click Start to begin extraction
  6. Download results in JSON, CSV, or Excel format

Example input

{
    "urls": [
        "https://example.com/report-2024.pdf",
        "https://example.com/whitepaper.pdf",
        "https://example.com/invoice-january.pdf"
    ],
    "includePages": true,
    "maxConcurrency": 5
}

Minimal input

{
    "urls": ["https://example.com/document.pdf"]
}

Input parameters

Parameter | Type | Default | Description
urls | array of strings | (required) | Direct URLs to PDF files
includePages | boolean | true | Include per-page text breakdown
maxConcurrency | integer | 5 | Parallel PDF downloads (1-20)
timeoutPerPdfSecs | integer | 60 | Download timeout per PDF in seconds
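If you assemble the input programmatically, a small helper can apply the defaults and the documented 1-20 concurrency range before the run starts. The sketch below is an assumption-laden illustration: `build_input` is a hypothetical helper, the positive-timeout check is our own guess, and the actor may validate its input differently.

```python
def build_input(urls_text, include_pages=True, max_concurrency=5,
                timeout_per_pdf_secs=60):
    """Turn a one-URL-per-line string into an input dict for the actor."""
    # Drop blank lines and surrounding whitespace.
    urls = [line.strip() for line in urls_text.splitlines() if line.strip()]
    if not urls:
        raise ValueError("urls is required: provide at least one PDF URL")
    # Range taken from the parameter table (1-20).
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    # Assumption: a timeout only makes sense when it is positive.
    if timeout_per_pdf_secs <= 0:
        raise ValueError("timeoutPerPdfSecs must be positive")
    return {
        "urls": urls,
        "includePages": include_pages,
        "maxConcurrency": max_concurrency,
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }
```

The returned dict uses the same keys as the parameter table, so it can be passed as the run input in the API examples.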

Output example

{
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": null,
    "keywords": null,
    "creator": "Acrobat PDFMaker 7.0.7 for Word",
    "producer": "Acrobat Distiller 7.0.5 (Windows)",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "fullText": "PDF Test File Congratulations, your computer is equipped with a PDF reader...",
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File Congratulations, your computer is equipped with a PDF reader..."
        }
    ],
    "pdfVersion": "1.6",
    "fileSizeBytes": 20597,
    "error": null
}
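A record like the one above can be post-processed with the standard library alone. The following is a minimal sketch, assuming the date fields are either null or ISO 8601 strings ending in `Z` as in the example; `parse_iso_utc` and `summarize_item` are hypothetical helper names.

```python
from datetime import datetime


def parse_iso_utc(value):
    """Parse a string such as '2008-06-04T15:44:00.000Z'; None stays None."""
    if value is None:
        return None
    # Swap the trailing 'Z' for an explicit offset so that
    # datetime.fromisoformat also accepts it before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_item(item):
    """Build a small summary from one dataset record."""
    created = parse_iso_utc(item.get("creationDate"))
    modified = parse_iso_utc(item.get("modificationDate"))
    # 'pages' may be absent when includePages is false.
    pages = item.get("pages") or []
    return {
        "fileName": item["fileName"],
        "created": created,
        "secondsUntilModified": (
            (modified - created).total_seconds()
            if created and modified else None
        ),
        "charsPerPage": {p["pageNumber"]: len(p["text"]) for p in pages},
    }
```

Both parsed dates are timezone-aware, so subtracting them is safe even if a future record used a non-UTC offset.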

Tips for best results

  • Start small -- test with 2-3 PDFs first to verify the URLs work and output meets your needs
  • Use direct PDF URLs -- the URL must point directly to a .pdf file, not a page that contains a PDF viewer
  • Disable per-page text for large PDFs -- set includePages: false to reduce output size when processing documents with hundreds of pages
  • Increase timeout for large files -- if you are processing PDFs over 50 MB, increase timeoutPerPdfSecs to 120 or more
  • Check the error field -- failed PDFs still appear in results with an error message, so you can identify and retry them
  • Schedule recurring runs -- use Apify's scheduler to automatically extract new PDFs on a daily or weekly basis
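The "check the error field" tip can be turned into a simple retry step. This is a sketch under two assumptions: failed records still carry their `url` field, and a longer timeout is worth trying. `split_results` and `retry_input` are hypothetical helpers.

```python
def split_results(items):
    """Separate successful records from failed ones using the error field."""
    ok = [item for item in items if item.get("error") is None]
    failed = [item for item in items if item.get("error") is not None]
    return ok, failed


def retry_input(failed, timeout_per_pdf_secs=120):
    """Build a new run input that retries only the failed URLs."""
    return {
        # Assumption: each failed record keeps the URL it was requested with.
        "urls": [item["url"] for item in failed],
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }
```

A longer timeout only helps when the failure was a slow download. Errors such as an invalid PDF structure will most likely fail again, so it is worth reading the error messages before retrying.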

Integrations

  • PDF Text Extractor + Google Sheets -- automatically populate a spreadsheet with extracted text and metadata from new PDF uploads
  • PDF Text Extractor + Slack -- get notified when PDF extraction completes, with a summary of pages processed and any errors
  • PDF Text Extractor + Make/Zapier -- trigger PDF extraction when new files are uploaded to Google Drive, Dropbox, or S3
  • PDF Text Extractor + OpenAI/LLM -- chain extraction with AI summarization to create document summaries from PDF collections
  • Scheduled runs -- monitor a document repository and extract text from newly published PDFs on a schedule
  • Webhooks -- trigger downstream processing immediately when extraction completes

Using the Apify API

Node.js

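import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/pdf-text-extractor').call({
    urls: [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    includePages: true,
});

// Read the extracted records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    // Failed PDFs still appear in the results, with the error field set.
    if (item.error) {
        console.log(`${item.url}: failed (${item.error})`);
        return;
    }
    console.log(`${item.fileName}: ${item.pageCount} pages, ${item.fullText.length} chars`);
});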

Python

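from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/pdf-text-extractor').call(run_input={
    'urls': [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    'includePages': True,
})

# Read the extracted records from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
for item in items:
    # Failed PDFs still appear in the results, with the error field set.
    if item.get('error'):
        print(f"{item['url']}: failed ({item['error']})")
        continue
    print(f"{item['fileName']}: {item['pageCount']} pages, {len(item['fullText'])} chars")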

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~pdf-text-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/report.pdf"],
    "includePages": true
  }'

Use with AI agents via MCP

PDF Text Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client -- this gives you access to all Apify actors, including this one:

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com"
        }
    }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

  • "Use automation-lab/pdf-text-extractor to extract all text from this research paper: https://arxiv.org/pdf/1706.03762"
  • "Extract metadata and page count from these 5 PDF invoices and summarize the results"
  • "Download and extract text from all PDFs linked on this page, then create a summary of each document"

Learn more in the Apify MCP documentation.

Is it legal to extract text from PDFs?

PDF Text Extractor processes publicly accessible PDF files that you provide URLs for. The actor downloads files the same way a web browser would. Always ensure you have the right to access and process the documents you are extracting text from.

For personal data, comply with GDPR and applicable privacy laws. Review the terms of service for any document repositories you are accessing. Apify provides a general web scraping legality guide for reference.

FAQ

How fast is PDF Text Extractor? Processing speed depends on PDF file size and download speed. A typical 1 MB PDF takes 1-3 seconds to download and parse. With maxConcurrency: 10, you can process 100 average-sized PDFs in under a minute.
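As a rough back-of-the-envelope check of that estimate, the sketch below models the run as successive waves of `max_concurrency` PDFs. This is a simplification we assume for illustration: it ignores actor start-up time and uneven file sizes, and `estimate_batch_seconds` is a hypothetical helper.

```python
import math


def estimate_batch_seconds(pdf_count, max_concurrency, seconds_per_pdf):
    """Rough run time if PDFs are processed in waves of max_concurrency."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    waves = math.ceil(pdf_count / max_concurrency)
    return waves * seconds_per_pdf
```

With 100 PDFs, a concurrency of 10, and 1-3 seconds per PDF, this model gives 10 waves, or 10 to 30 seconds of processing.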

How much does it cost to extract text from 1,000 PDFs? At BRONZE tier pricing: $0.005 (start) + 1,000 x $0.003 (per PDF) = $3.005 total. With the free $5 credit, you can process about 1,600 PDFs at BRONZE rates, or roughly 1,400 at FREE-tier rates, at no cost.

Does it work with scanned PDFs? No. This actor extracts embedded text from PDFs. Scanned documents that contain only images (no selectable text) will return empty text. For scanned PDFs, you would need an OCR (Optical Character Recognition) solution.
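One way to spot likely scanned documents in the results is to look for records that succeeded but contain no text. This is a heuristic sketch, not a feature of the actor: `needs_ocr` is a hypothetical helper, and an empty `fullText` can have other causes, such as a genuinely blank PDF.

```python
def needs_ocr(item):
    """True when extraction succeeded but returned no text."""
    text = item.get("fullText") or ""
    return item.get("error") is None and not text.strip()


def ocr_candidates(items):
    """URLs of records that probably need an OCR pass."""
    return [item["url"] for item in items if needs_ocr(item)]
```

The resulting URL list can then be handed to whichever OCR tool you use.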

Why are some PDF fields returning null? Not all PDFs include metadata. The title, author, subject, and keywords fields depend on what the PDF creator set when generating the document. Many auto-generated PDFs leave these fields empty.
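When metadata is missing, downstream code usually needs a fallback label. A minimal sketch, assuming `fileName` and `url` are present on every record as in the output example; `display_title` is a hypothetical helper.

```python
def display_title(item):
    """Prefer the PDF title, then the file name, then the URL."""
    title = (item.get("title") or "").strip()
    return title or item.get("fileName") or item["url"]
```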

Why did a PDF fail with "Invalid PDF structure"? The URL may not point to an actual PDF file. Ensure the URL returns a direct PDF download, not an HTML page with an embedded PDF viewer. Some servers also require specific headers or authentication.
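Before submitting a URL, you can check whether the first bytes of the response look like a PDF. PDF files begin with the header `%PDF-`, while a viewer page usually starts with HTML markup. The sketch below only classifies bytes you have already downloaded; `classify_payload` is a hypothetical helper, and how the actor itself detects invalid files is not documented here.

```python
def classify_payload(first_bytes: bytes) -> str:
    """Classify the start of a downloaded file as 'pdf', 'html', or 'unknown'."""
    head = first_bytes.lstrip()
    # A PDF file starts with the header "%PDF-" followed by its version.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # An HTML page (for example, an embedded viewer) starts with markup.
    if head[:64].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Reading the first kilobyte of the response with any HTTP client is enough to feed this check. A result of "html" suggests the URL points to a viewer page, not the file itself.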

Can I extract text from password-protected PDFs? No. Password-protected (encrypted) PDFs cannot be parsed without the password. The actor will return an error for these files.
