Pricing
from $1.50 / 1,000 results
PDF URL to Markdown, Tables & RAG Extractor
Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.
PDF URL Scraper: PDF to Markdown and AI-Ready Document Extractor
PDF URL Scraper converts public PDF URLs into clean Markdown, page-level text, metadata, tables, and AI-ready JSON for RAG pipelines, document automation, research workflows, and downstream Apify Actors.
At a glance: the input is one or more public PDF URLs; the output is page-level dataset rows, full-document Markdown records, metadata, tables, and optional AI-ready chunks. Typical use cases are RAG ingestion and document automation. Limitations, troubleshooting, and cost notes are covered below.
What this Actor does
Give the Actor one PDF URL or a list of PDF URLs. It downloads each PDF, extracts readable content, stores the full document Markdown in the key-value store, and pushes one dataset row per useful processed page.
The default mode does not use an LLM, which keeps small tests and bulk text extraction inexpensive. Optional LLM modes can improve messy pages, extract RAG chunks, and handle harder documents when quality matters more than minimum cost.
Main use cases
- Convert PDF URLs to Markdown for AI prompts and agents.
- Prepare documents for RAG ingestion and vector databases.
- Extract page-level text with source URL and page references.
- Extract tables from financial reports, forms, manuals, procurement documents, and research PDFs.
- Process batches of public PDFs from web scraping or document monitoring workflows.
- Store full-document Markdown and page-level JSON for downstream automation.
Simple input
Most users only need two fields.
{"pdfUrls":["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"],"mode":"no_llm"}
Basic fields
- pdfUrls: One or more public PDF URLs. Put one PDF URL per row. Duplicate URLs are processed once to avoid duplicate output and unnecessary cost.
- mode: Choose no_llm, llm_cheap, or llm_premium.
Mode guide
- no_llm: Fastest and cheapest. Best for normal text PDFs and high-volume extraction.
- llm_cheap: Adds AI-ready cleanup, RAG chunks, table extraction, and OCR fallback at lower LLM cost.
- llm_premium: Uses the premium cleanup path for harder PDFs where output quality matters more than cost.
Legacy API calls using pdfUrl still work. Advanced API users can also use lower-level fields such as advancedMode, maxPages, includeRawText, saveDiagnostics, savePageMarkdown, savePageImages, proxyConfiguration, and custom request headers. These are optional and are not needed for normal runs.
Example batch input
{"pdfUrls":["https://example.com/report-1.pdf","https://example.com/report-2.pdf","https://example.com/report-3.pdf"],"mode":"no_llm"}
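When the URL list comes from another scraper, it can help to validate it before starting a run. The sketch below builds the two-field input shown above. The helper name build_run_input is hypothetical, and the client-side checks are an optional convenience: the Actor is described as deduplicating URLs on its own.

```python
from urllib.parse import urlparse

# Mode names as listed in the mode guide.
ALLOWED_MODES = {"no_llm", "llm_cheap", "llm_premium"}


def build_run_input(urls, mode="no_llm"):
    """Return an Actor input dict with validated, order-preserving unique URLs."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    unique_urls = []
    seen = set()
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue  # skip blanks and exact duplicates
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"Not an HTTP(S) URL: {url!r}")
        seen.add(url)
        unique_urls.append(url)
    if not unique_urls:
        raise ValueError("At least one PDF URL is required")
    return {"pdfUrls": unique_urls, "mode": mode}
```

The returned dict can be passed as run_input to the Apify client or pasted as JSON into the input editor.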
What data you get
The Actor pushes one dataset item per processed page. Each row can include:
- Source URL and final URL after redirects.
- Status and failure details, if a PDF could not be processed.
- File name, file size, content hash, title, author, and PDF dates when available.
- Page number, page text, and page Markdown.
- Tables and table metadata when table extraction is enabled.
- RAG chunks when AI-ready chunking is enabled.
- Language estimate, processing duration, warnings, and download details.
- Key-value store keys for the full Markdown document and optional artifacts.
Full-document Markdown is saved in the key-value store as OUTPUT_MARKDOWN for a single PDF, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches.
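A minimal sketch of the key naming described above. It assumes batch keys are 1-based, zero-padded to three digits, and numbered in input order after duplicates are removed; only the pattern OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002 is stated in this document, so treat the rest as an inference. The function name is hypothetical. When a dataset row carries outputKeys.markdown, prefer that value over a computed key.

```python
def expected_markdown_keys(pdf_count):
    """Guess the key-value store keys for a run with pdf_count unique PDFs."""
    if pdf_count < 1:
        raise ValueError("pdf_count must be at least 1")
    if pdf_count == 1:
        # A single PDF uses the unnumbered key.
        return ["OUTPUT_MARKDOWN"]
    # Assumption: batches are numbered from 001 in input order.
    return [f"OUTPUT_MARKDOWN_{index:03d}" for index in range(1, pdf_count + 1)]
```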
Example output row
{
  "sourceUrl": "https://example.com/document.pdf",
  "finalUrl": "https://example.com/document.pdf",
  "status": "success",
  "recordType": "page",
  "fileName": "document.pdf",
  "pageCount": 12,
  "processedPageCount": 12,
  "pageNumber": 1,
  "mode": "no_llm",
  "processingMode": "fast",
  "markdownText": "Markdown for this page...",
  "pageText": "Raw page text...",
  "tables": [],
  "ragChunks": [],
  "download": {"attempts": 1, "usedProxy": false, "contentType": "application/pdf"},
  "outputKeys": {"markdown": "OUTPUT_MARKDOWN"},
  "warnings": [],
  "errors": []
}
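If you only have the exported dataset rows, page rows like the one above can be stitched back into one Markdown string per document. This is a sketch under the assumption that every page row carries sourceUrl, pageNumber, and markdownText as in the example; the function name is hypothetical, and joining pages with a blank line is an arbitrary choice.

```python
from collections import defaultdict


def markdown_by_document(rows):
    """Group page rows by sourceUrl and join their Markdown in page order."""
    pages = defaultdict(list)
    for row in rows:
        if row.get("recordType") != "page":
            continue  # ignore failure rows and any other record types
        pages[row["sourceUrl"]].append(
            (row.get("pageNumber", 0), row.get("markdownText") or "")
        )
    return {
        url: "\n\n".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for url, items in pages.items()
    }
```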
Failed PDFs still produce a clear failure row when the Actor starts successfully:
{"sourceUrl":"https://example.com/not-a-pdf","status":"failed","recordType":"failure","errors":[{"step":"download","message":"Failed to download PDF after retries"}],"warnings":[]}
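Because failures arrive as ordinary dataset rows, a batch consumer can collect them in one pass. The sketch below assumes the row shape shown in the failure example (status, sourceUrl, and an errors list with step and message); the function name is hypothetical.

```python
def summarize_failures(rows):
    """Map each failed source URL to a list of 'step: message' strings."""
    failures = {}
    for row in rows:
        if row.get("status") != "failed":
            continue
        messages = [
            f'{error.get("step", "unknown")}: {error.get("message", "")}'
            for error in row.get("errors", [])
        ]
        failures[row.get("sourceUrl", "")] = messages
    return failures
```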
How to run on Apify
- Open the Actor page on Apify.
- Paste one or more PDF URLs into pdfUrls.
- Keep mode as no_llm for the cheapest run, or choose an LLM mode when you need cleanup, OCR fallback, tables, or RAG chunks.
- Start the run.
- Open the Dataset tab for page-by-page JSON results.
- Open the Key-value store tab to download full-document Markdown files.
Exporting results
You can export dataset rows from Apify as JSON, CSV, Excel, XML, or RSS. For document-level Markdown, open the run's key-value store and download the OUTPUT_MARKDOWN record or the numbered batch records.
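The Markdown records can also be read programmatically. The helper below is a sketch that works with any object exposing get_record(key) and returning either None or a dict with a "value" entry, which is how the Apify Python client's key-value store client behaves. The function name is hypothetical, and decoding byte values as UTF-8 is an assumption about how the Actor stores Markdown.

```python
def read_markdown_records(store, keys):
    """Return {key: markdown} for every key that exists in the store."""
    documents = {}
    for key in keys:
        record = store.get_record(key)
        if record is None:
            continue  # key not present in this run's store
        value = record["value"]
        if isinstance(value, bytes):
            value = value.decode("utf-8")  # assumption: Markdown is UTF-8 text
        documents[key] = value
    return documents
```

With the Apify client, the store for a finished run would be client.key_value_store(run["defaultKeyValueStoreId"]), and keys would be OUTPUT_MARKDOWN or the numbered batch keys.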
Python API example
from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "pdfUrls": ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"],
    "mode": "no_llm",
}

# Start the Actor and wait for the run to finish.
run = client.actor("thescrapelab/Apify-PDF-url-scraper").call(run_input=run_input)
dataset_id = run["defaultDatasetId"]

# Print the status, page number, and a short Markdown preview for each row.
for item in client.dataset(dataset_id).iterate_items():
    print(item["status"], item.get("pageNumber"), item.get("markdownText", "")[:120])
Advanced options
Advanced options are available through JSON or API input for automation workflows. Use them only when you need tighter control.
- maxPages: Process only the first N pages of each PDF. Useful for samples and cost control.
- includeRawText: Include full raw page text in dataset rows.
- saveDiagnostics: Save page-level diagnostics to the key-value store.
- savePageMarkdown: Save page Markdown records separately.
- savePageImages: Save selected page PNGs in a ZIP file for review.
- maxDownloadMb: Reject PDFs above a configured download size.
- maxRetries: Limit retry attempts for unreliable URLs.
- skipHeadPreflight: Skip the initial HEAD request for servers that block HEAD.
- proxyConfiguration: Use custom proxy settings for sources that block direct requests.
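As an illustration, an advanced input might combine several of these fields. The field names come from the list above, but the value types (integers for limits, booleans for switches) and the specific values are assumptions rather than documented defaults; check the Actor's input schema before relying on them.

```python
import json

# Hypothetical advanced input; value types and values are assumed.
advanced_input = {
    "pdfUrls": ["https://example.com/report-1.pdf"],
    "mode": "no_llm",
    "maxPages": 5,              # sample only the first 5 pages
    "includeRawText": False,    # keep dataset rows small
    "saveDiagnostics": True,    # write page-level diagnostics
    "maxDownloadMb": 50,        # reject larger downloads
    "maxRetries": 2,            # cap retries for flaky URLs
    "skipHeadPreflight": True,  # for servers that block HEAD
}

# Serialize to JSON for the Apify input editor or an API call.
print(json.dumps(advanced_input, indent=2))
```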
Cost and pricing notes
Cost is mainly driven by memory, runtime, page count, storage writes, and whether LLM/OCR features are used.
- Use no_llm for high-volume PDF-to-Markdown extraction.
- Use maxPages when testing large PDFs.
- Avoid savePageImages unless you need visual review artifacts.
- Use LLM modes only when the output quality gain is worth the extra cost.
- The recommended Store pricing model is pay per successful page result, with optional separate LLM page events if monetization is enabled.
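For a rough budget, the listed starting price of $1.50 per 1,000 results can be turned into a per-run estimate. This sketch assumes one billed result per dataset row and treats the listed price as a floor: LLM page events and platform usage, if charged, are not included. The function name is hypothetical.

```python
def estimate_cost_usd(page_results, usd_per_thousand=1.50):
    """Estimate the result-based charge for a number of page rows."""
    if page_results < 0:
        raise ValueError("page_results must not be negative")
    # Price is quoted per 1,000 results, so scale linearly.
    return round(page_results * usd_per_thousand / 1000, 4)
```

For example, three 12-page PDFs produce about 36 page rows, which this estimate prices at roughly five cents.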
Limits and caveats
- The Actor works with public HTTP and HTTPS PDF URLs.
- Password-protected or encrypted PDFs are not supported.
- Some scanned PDFs require OCR, and OCR quality depends on scan quality.
- Complex, nested, or visually designed tables may need review.
- LLM cleanup can improve formatting but may introduce interpretation.
- Very large PDFs can take longer; use maxPages for sampling or testing.
- Duplicate input URLs are ignored at runtime to avoid duplicate results.
Troubleshooting
- If a URL fails, confirm it opens directly in a browser and returns a PDF, not an HTML landing page.
- If a server blocks downloads, try skipHeadPreflight or a proxy configuration.
- If a run is expensive, switch to no_llm, add maxPages, and disable optional artifacts.
- If output is empty, the PDF may be scanned, image-only, encrypted, or blocked by the source server.
- If tables look imperfect, try an LLM mode and review the warnings field.
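The first check, whether a URL really returns a PDF rather than an HTML landing page, can be automated before a run. The sketch below classifies a response from its Content-Type header and first bytes, relying on the fact that PDF files begin with the marker %PDF-. The function name and the three labels are hypothetical, and this is a local pre-check, not a description of how the Actor itself validates downloads.

```python
def classify_response(content_type, first_bytes):
    """Return 'pdf', 'html', or 'unknown' for a downloaded response."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    head = first_bytes.lstrip()[:1024]
    if head.startswith(b"%PDF-"):
        return "pdf"  # the file signature is the strongest signal
    if media_type == "text/html" or head.lower().startswith(
        (b"<!doctype html", b"<html")
    ):
        return "html"  # likely a landing page or login wall
    return "unknown"
```

To use it, request the URL with any HTTP client, then pass the Content-Type header and the first kilobyte of the body.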
FAQ
Can this Actor scrape PDF URLs from a website?
This Actor processes PDF URLs you provide. If you need to discover PDF links from web pages first, run a web crawler or link scraper before this Actor.
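As a rough illustration of that discovery step, the sketch below pulls PDF links out of an HTML page you have already downloaded, using only the Python standard library. It is not part of this Actor; the class and function names are hypothetical, and it only catches links whose path ends in .pdf, so PDFs served from extensionless URLs are missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return unique PDF links from html, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The resulting list can be passed directly as pdfUrls.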
Does it convert PDF to Markdown?
Yes. It saves full-document Markdown in the key-value store and page-level Markdown in the dataset.
Does it use an LLM by default?
No. The default no_llm mode avoids LLM calls for lower cost.
Can it process multiple PDFs in one run?
Yes. Add multiple URLs to pdfUrls. Duplicate URLs are processed once.
Does it support RAG?
Yes. llm_cheap and llm_premium create source-aware RAG chunks by default. Advanced users can also enable RAG chunks through API input.
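Before loading chunks into a vector database, it is usually convenient to flatten them into one record per chunk while keeping the source reference. The internal shape of each ragChunks entry is not documented here, so this sketch treats a chunk as an opaque value and only attaches the sourceUrl and pageNumber of the row it came from. The function name and the chunkIndex field are hypothetical.

```python
def flatten_rag_chunks(rows):
    """Return one record per chunk, tagged with its source URL and page."""
    records = []
    for row in rows:
        for index, chunk in enumerate(row.get("ragChunks") or []):
            records.append(
                {
                    "sourceUrl": row.get("sourceUrl"),
                    "pageNumber": row.get("pageNumber"),
                    "chunkIndex": index,  # position within the page row
                    "chunk": chunk,       # passed through unchanged
                }
            )
    return records
```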
Does it extract tables?
Table extraction is enabled in the LLM modes and can be controlled by advanced options. Complex tables may still need manual review.
What happens if one PDF fails in a batch?
The Actor pushes a failure row for that PDF and continues with the remaining URLs.
What is the best setting for large batches?
Use mode: "no_llm", keep optional artifacts disabled, and use maxPages when you only need a sample.
