PDF Text Extractor

Pricing

Pay per event

Extract text, metadata, and page-by-page content from PDF files. Provide PDF URLs and get structured JSON with full text, per-page text, page count, author, title, creation date, and more. Export as JSON, CSV, or Excel. No browser or proxy needed.

Developer

Stas Persiianenko

Maintained by Community

What does PDF Text Extractor do?

PDF Text Extractor downloads PDF files from any public URL and extracts structured text, metadata, and per-page content. It returns clean JSON with the full document text, individual page text, page count, and all PDF metadata (title, author, creation date, producer, and more).

Unlike browser-based PDF tools, this actor processes PDFs entirely server-side, with no browser overhead. It downloads and parses PDFs in parallel (1 to 20 at a time, set by maxConcurrency) and isolates failures: if one PDF fails, the rest still complete.

Try it now on the Apify Store with the prefilled example URLs.

Who is PDF Text Extractor for?

AI/ML Engineers and Data Scientists

Extract text from research papers, whitepapers, and technical documentation for RAG pipelines
Build training datasets from large PDF collections
Feed document content into LLMs for summarization and analysis

Legal and Compliance Teams

Extract text from contracts, filings, and regulatory documents
Build searchable archives from PDF-only document repositories
Automate document review workflows

Researchers and Academics

Bulk-extract text from academic papers and journal articles
Build citation databases from PDF collections
Convert lecture notes and course materials to searchable text

Developers and Automation Engineers

Integrate PDF text extraction into data pipelines via API
Process invoices, receipts, and forms at scale
Extract metadata for document management systems

Why use PDF Text Extractor?

Pure server-side processing -- no browser, no proxy, and a per-PDF cost of $0.0018 to $0.00345 depending on tier
Per-page text extraction -- get text for each individual page, not just the whole document
Rich metadata -- title, author, subject, keywords, creator, producer, creation/modification dates, PDF version
Parallel processing -- configure concurrency to process multiple PDFs simultaneously
Graceful error handling -- failed PDFs don't stop the entire batch
API access -- integrate with 5,000+ apps via Zapier, Make, and the Apify API
Scheduled runs -- set up recurring extractions for document monitoring
Multiple export formats -- JSON, CSV, Excel, XML, HTML

What data can you extract?

Category	Fields
Document text	Full text, per-page text array
Metadata	Title, author, subject, keywords
Producer info	Creator application, producer application
Dates	Creation date, modification date (ISO 8601)
Technical	Page count, PDF version, file size in bytes
Error handling	Error message (null when successful)

Each PDF produces one dataset row with 16 structured fields.

How much does it cost to extract text from PDFs?

PDF Text Extractor uses pay-per-event pricing. You only pay for what you use:

Event	FREE tier	BRONZE	SILVER	GOLD
Run started (one-time)	$0.005	$0.005	$0.005	$0.005
Per PDF extracted	$0.00345	$0.003	$0.00234	$0.0018

Example costs (BRONZE tier):

10 PDFs: $0.005 + 10 x $0.003 = $0.035
100 PDFs: $0.005 + 100 x $0.003 = $0.305
1,000 PDFs: $0.005 + 1,000 x $0.003 = $3.005

With the free $5 Apify credit, you can extract text from about 1,440 PDFs at no cost: at the FREE tier rate, ($5 - $0.005) / $0.00345 ≈ 1,447 PDFs in a single run.
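The pricing arithmetic above can be sketched as a small helper. The rates are copied from the pricing table; the function names and the single-run assumption (one start fee per estimate) are illustrative and not part of the actor.

```python
from decimal import Decimal

# Rates copied from the pricing table above (USD).
RUN_START_FEE = Decimal("0.005")
PER_PDF_RATE = {
    "FREE": Decimal("0.00345"),
    "BRONZE": Decimal("0.003"),
    "SILVER": Decimal("0.00234"),
    "GOLD": Decimal("0.0018"),
}


def estimate_cost(pdf_count: int, tier: str = "BRONZE") -> Decimal:
    """Estimated cost of one run that extracts `pdf_count` PDFs."""
    if pdf_count < 0:
        raise ValueError("pdf_count must be non-negative")
    return RUN_START_FEE + pdf_count * PER_PDF_RATE[tier]


def max_pdfs_for_budget(budget: str, tier: str = "FREE") -> int:
    """Largest PDF count whose single-run cost fits within `budget`."""
    remaining = Decimal(budget) - RUN_START_FEE
    if remaining < 0:
        return 0
    return int(remaining // PER_PDF_RATE[tier])


print(estimate_cost(100))                  # 0.305
print(max_pdfs_for_budget("5", "FREE"))    # 1447
print(max_pdfs_for_budget("5", "BRONZE"))  # 1665
```

Decimal is used instead of float so that sub-cent rates add up exactly.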

How to extract text from PDF files

Go to the PDF Text Extractor page on Apify Store
Click Try for free to open the actor in Apify Console
Paste your PDF URLs into the PDF URLs field (one per line)
Adjust concurrency and timeout settings if needed
Click Start to begin extraction
Download results in JSON, CSV, or Excel format

Example input

{
    "urls": [
        "https://example.com/report-2024.pdf",
        "https://example.com/whitepaper.pdf",
        "https://example.com/invoice-january.pdf"
    ],
    "includePages": true,
    "maxConcurrency": 5
}

Minimal input

{
    "urls": ["https://example.com/document.pdf"]
}

Input parameters

Parameter	Type	Default	Description
`urls`	array of strings	(required)	Direct URLs to PDF files
`includePages`	boolean	`true`	Include per-page text breakdown
`maxConcurrency`	integer	`5`	Parallel PDF downloads (1-20)
`timeoutPerPdfSecs`	integer	`60`	Download timeout per PDF in seconds
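As a rough sketch, the input can be assembled and checked on the client before a run is started. The parameter names, defaults, and the 1-20 concurrency range come from the table above; the `build_input` helper and its URL check are hypothetical and do not describe the actor's own validation.

```python
from urllib.parse import urlparse


def build_input(urls, include_pages=True, max_concurrency=5, timeout_per_pdf_secs=60):
    """Build an input dict using the parameter names from the table above."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs make sense as PDF sources.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if timeout_per_pdf_secs <= 0:
        raise ValueError("timeoutPerPdfSecs must be positive")
    return {
        "urls": list(urls),
        "includePages": include_pages,
        "maxConcurrency": max_concurrency,
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


run_input = build_input(["https://example.com/document.pdf"], max_concurrency=10)
print(run_input["maxConcurrency"])  # 10
```

The resulting dict can be passed as the run input in the API examples further down.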

Output example

{
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": null,
    "keywords": null,
    "creator": "Acrobat PDFMaker 7.0.7 for Word",
    "producer": "Acrobat Distiller 7.0.5 (Windows)",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "fullText": "PDF Test File Congratulations, your computer is equipped with a PDF reader...",
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File Congratulations, your computer is equipped with a PDF reader..."
        }
    ],
    "pdfVersion": "1.6",
    "fileSizeBytes": 20597,
    "error": null
}
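For downstream use such as search indexing or RAG, one dataset row can be flattened into per-page records. This is a minimal sketch based only on the field names in the output example; the `page_records` helper is hypothetical, and how `pages` looks when includePages is false (missing, null, or empty) is an assumption that the code handles defensively.

```python
def page_records(item):
    """Turn one dataset row into a list of per-page records."""
    if item.get("error"):
        # Failed PDFs carry an error message and no usable text.
        return []
    pages = item.get("pages") or []
    if not pages:
        # Assumption: without per-page text, fall back to the whole document.
        pages = [{"pageNumber": None, "text": item.get("fullText") or ""}]
    return [
        {
            "url": item["url"],
            "fileName": item.get("fileName"),
            "pageNumber": page.get("pageNumber"),
            "text": page.get("text") or "",
        }
        for page in pages
        if (page.get("text") or "").strip()  # skip blank pages
    ]


row = {
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "fullText": "PDF Test File",
    "pages": [{"pageNumber": 1, "text": "PDF Test File"}],
    "error": None,
}
print(page_records(row)[0]["pageNumber"])  # 1
```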

Tips for best results

Start small -- test with 2-3 PDFs first to verify the URLs work and output meets your needs
Use direct PDF URLs -- the URL must point directly to a .pdf file, not a page that contains a PDF viewer
Disable per-page text for large PDFs -- set includePages: false to reduce output size when processing documents with hundreds of pages
Increase timeout for large files -- if you are processing PDFs over 50 MB, increase timeoutPerPdfSecs to 120 or more
Check the error field -- failed PDFs still appear in results with an error message, so you can identify and retry them
Schedule recurring runs -- use Apify's scheduler to automatically extract new PDFs on a daily or weekly basis
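The "check the error field" tip can be sketched as follows. It relies only on the documented behaviour that `error` is null on success; the helper names and the 120-second retry timeout are illustrative choices.

```python
def split_results(items):
    """Separate successful rows from failed ones using the `error` field."""
    succeeded = [item for item in items if item.get("error") is None]
    failed = [item for item in items if item.get("error") is not None]
    return succeeded, failed


def retry_input(failed, timeout_per_pdf_secs=120):
    """Build a follow-up input that retries only the failed URLs."""
    return {
        "urls": [item["url"] for item in failed],
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


items = [
    {"url": "https://example.com/a.pdf", "error": None},
    {"url": "https://example.com/b.pdf", "error": "Invalid PDF structure"},
]
succeeded, failed = split_results(items)
print(len(succeeded), len(failed))   # 1 1
print(retry_input(failed)["urls"])   # ['https://example.com/b.pdf']
```

Retrying with a longer timeout only helps for slow downloads; a URL that does not serve a PDF will fail again.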

Integrations

PDF Text Extractor + Google Sheets -- automatically populate a spreadsheet with extracted text and metadata from new PDF uploads
PDF Text Extractor + Slack -- get notified when PDF extraction completes, with a summary of pages processed and any errors
PDF Text Extractor + Make/Zapier -- trigger PDF extraction when new files are uploaded to Google Drive, Dropbox, or S3
PDF Text Extractor + OpenAI/LLM -- chain extraction with AI summarization to create document summaries from PDF collections
Scheduled runs -- monitor a document repository and extract text from newly published PDFs on a schedule
Webhooks -- trigger downstream processing immediately when extraction completes

Using the Apify API

Node.js

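import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/pdf-text-extractor').call({
    urls: [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    includePages: true,
});

// Read the extracted rows from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    // Failed PDFs may have no text, so fall back to an empty string.
    const text = item.fullText ?? '';
    console.log(`${item.fileName}: ${item.pageCount} pages, ${text.length} chars`);
});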

Python

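from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/pdf-text-extractor').call(run_input={
    'urls': [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    'includePages': True,
})

# Read the extracted rows from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
for item in items:
    # Failed PDFs may have no text, so fall back to an empty string.
    text = item.get('fullText') or ''
    print(f"{item['fileName']}: {item['pageCount']} pages, {len(text)} chars")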

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~pdf-text-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/report.pdf"],
    "includePages": true
  }'

Use with AI agents via MCP

PDF Text Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client -- this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com"
        }
    }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

"Use automation-lab/pdf-text-extractor to extract all text from this research paper: https://arxiv.org/pdf/1706.03762"
"Extract metadata and page count from these 5 PDF invoices and summarize the results"
"Download and extract text from all PDFs linked on this page, then create a summary of each document"

Learn more in the Apify MCP documentation.

Is it legal to extract text from PDFs?

PDF Text Extractor processes publicly accessible PDF files that you provide URLs for. The actor downloads files the same way a web browser would. Always ensure you have the right to access and process the documents you are extracting text from.

For personal data, comply with GDPR and applicable privacy laws. Review the terms of service for any document repositories you are accessing. Apify provides a general web scraping legality guide for reference.

FAQ

How fast is PDF Text Extractor? Processing speed depends on PDF file size and download speed. A typical 1 MB PDF takes 1-3 seconds to download and parse. With maxConcurrency: 10, you can process 100 average-sized PDFs in under a minute.
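The speed claim above can be checked with a back-of-the-envelope model: PDFs are handled in waves of maxConcurrency, each wave taking roughly the per-PDF time. This is a simplified estimate under that assumption, not a description of the actor's scheduler, and it ignores startup time.

```python
import math


def estimate_run_seconds(pdf_count, max_concurrency, seconds_per_pdf):
    """Rough wall-clock estimate: PDFs processed in waves of `max_concurrency`."""
    waves = math.ceil(pdf_count / max_concurrency)
    return waves * seconds_per_pdf


# 100 PDFs at maxConcurrency 10, using the 1-3 second range quoted above.
print(estimate_run_seconds(100, 10, 1))  # 10
print(estimate_run_seconds(100, 10, 3))  # 30
```

Both ends of the range stay under a minute, which is consistent with the figure quoted above.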

How much does it cost to extract text from 1,000 PDFs? At BRONZE tier pricing: $0.005 (start) + 1,000 x $0.003 (per PDF) = $3.005 total. At FREE tier pricing the same run costs $0.005 + 1,000 x $0.00345 = $3.455, so the free $5 credit covers about 1,440 PDFs at no cost.

Does it work with scanned PDFs? No. This actor extracts embedded text from PDFs. Scanned documents that contain only images (no selectable text) will return empty text. For scanned PDFs, you would need an OCR (Optical Character Recognition) solution.
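Since scanned PDFs come back with empty text rather than an error, they can be flagged after a run and routed to an OCR tool. A minimal sketch using the output fields shown earlier; the `likely_scanned` helper is a hypothetical heuristic, and a PDF with genuinely blank pages would be flagged too.

```python
def likely_scanned(item):
    """Heuristic: parsed successfully, has pages, but no extractable text."""
    if item.get("error") is not None:
        return False  # a failed PDF is a different problem
    text = (item.get("fullText") or "").strip()
    page_count = item.get("pageCount") or 0
    return page_count > 0 and not text


items = [
    {"url": "https://example.com/typed.pdf", "pageCount": 2, "fullText": "Hello", "error": None},
    {"url": "https://example.com/scan.pdf", "pageCount": 3, "fullText": "", "error": None},
]
needs_ocr = [item["url"] for item in items if likely_scanned(item)]
print(needs_ocr)  # ['https://example.com/scan.pdf']
```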

Why are some PDF fields returning null? Not all PDFs include metadata. The title, author, subject, and keywords fields depend on what the PDF creator set when generating the document. Many auto-generated PDFs leave these fields empty.

Why did a PDF fail with "Invalid PDF structure"? The URL may not point to an actual PDF file. Ensure the URL returns a direct PDF download, not an HTML page with an embedded PDF viewer. Some servers also require specific headers or authentication.
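One way to pre-check a URL is to inspect the first bytes of its response: PDF files begin with the marker %PDF-, while an embedded viewer page begins with HTML. The sketch below is a client-side check under that assumption and not the actor's own validation; `classify_payload` is a hypothetical helper, and the 1024-byte window is a lenient choice to tolerate leading junk.

```python
def classify_payload(data: bytes) -> str:
    """Classify the start of a downloaded response as 'pdf', 'html', or 'unknown'."""
    head = data[:1024]
    if b"%PDF-" in head:
        return "pdf"
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


print(classify_payload(b"%PDF-1.6\n%\xe2\xe3\xcf\xd3\n"))         # pdf
print(classify_payload(b"<!DOCTYPE html><html><body>viewer"))      # html
```

Feed it the first kilobyte of the response body fetched with any HTTP client; an "html" result usually means the URL points at a viewer or login page rather than the file itself.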

Can I extract text from password-protected PDFs? No. Password-protected (encrypted) PDFs cannot be parsed without the password. The actor will return an error for these files.

URL: https://apify.com/automation-lab/pdf-text-extractor