URL: https://apify.com/thescrapelab/apify-pdf-url-scraper

PDF to Markdown Converter & AI-Ready Document Extractor · Apify


PDF URL to Markdown, Tables & RAG Extractor

Pricing

from $1.50 / 1,000 results

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Rating

0.0

(0)

Developer

Inus Grobler

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 6
  • Monthly active users: 4
  • Last modified: a day ago

PDF URL Scraper: PDF to Markdown and AI-Ready Document Extractor

PDF URL Scraper converts public PDF URLs into clean Markdown, page-level text, metadata, tables, and AI-ready JSON for RAG pipelines, document automation, research workflows, and downstream Apify Actors.

At a glance: the input is one or more public PDF URLs. The output is page-level dataset rows, Markdown records, metadata, tables, and optional AI-ready chunks. Typical use cases are RAG ingestion and document automation. Limitations, troubleshooting, and cost notes are covered below.

What this Actor does

Give the Actor one PDF URL or a list of PDF URLs. It downloads each PDF, extracts readable content, stores the full document Markdown in the key-value store, and pushes one dataset row per useful processed page.

The default mode does not use an LLM, which keeps small tests and bulk text extraction inexpensive. Optional LLM modes can improve messy pages, extract RAG chunks, and handle harder documents when quality matters more than minimum cost.

Main use cases

  • Convert PDF URLs to Markdown for AI prompts and agents.
  • Prepare documents for RAG ingestion and vector databases.
  • Extract page-level text with source URL and page references.
  • Extract tables from financial reports, forms, manuals, procurement documents, and research PDFs.
  • Process batches of public PDFs from web scraping or document monitoring workflows.
  • Store full-document Markdown and page-level JSON for downstream automation.

Simple input

Most users only need two fields.

{
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "mode": "no_llm"
}

Basic fields

  • pdfUrls: One or more public PDF URLs. Put one PDF URL per row. Duplicate URLs are processed once to avoid duplicate output and unnecessary cost.
  • mode: Choose no_llm, llm_cheap, or llm_premium.

Mode guide

  • no_llm: Fastest and cheapest. Best for normal text PDFs and high-volume extraction.
  • llm_cheap: Adds AI-ready cleanup, RAG chunks, table extraction, and OCR fallback at lower LLM cost.
  • llm_premium: Uses the premium cleanup path for harder PDFs where output quality matters more than cost.

Legacy API calls using pdfUrl still work. Advanced API users can also use lower-level fields such as advancedMode, maxPages, includeRawText, saveDiagnostics, savePageMarkdown, savePageImages, proxyConfiguration, and custom request headers. These are optional and are not needed for normal runs.
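
The two basic fields can be assembled programmatically before a run. The sketch below is illustrative only: `build_run_input` and `ALLOWED_MODES` are hypothetical names, not part of the Actor, and the client-side de-duplication is a convenience, since the Actor already processes duplicate URLs once.

```python
ALLOWED_MODES = {"no_llm", "llm_cheap", "llm_premium"}  # modes named in the mode guide


def build_run_input(pdf_urls, mode="no_llm"):
    """Return an input dict with the two basic fields, pdfUrls and mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    # Drop blank entries and repeats while preserving the original order.
    unique_urls = list(dict.fromkeys(u.strip() for u in pdf_urls if u.strip()))
    if not unique_urls:
        raise ValueError("At least one PDF URL is required")
    return {"pdfUrls": unique_urls, "mode": mode}


example = build_run_input(
    ["https://example.com/report-1.pdf", "https://example.com/report-1.pdf"]
)
print(example)
```

The resulting dict can be passed as the run input in the same way as the hand-written JSON examples.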

Example batch input

{
  "pdfUrls": [
    "https://example.com/report-1.pdf",
    "https://example.com/report-2.pdf",
    "https://example.com/report-3.pdf"
  ],
  "mode": "no_llm"
}

What data you get

The Actor pushes one dataset item per processed page. Each row can include:

  • Source URL and final URL after redirects.
  • Status and failure details, if a PDF could not be processed.
  • File name, file size, content hash, title, author, and PDF dates when available.
  • Page number, page text, and page Markdown.
  • Tables and table metadata when table extraction is enabled.
  • RAG chunks when AI-ready chunking is enabled.
  • Language estimate, processing duration, warnings, and download details.
  • Key-value store keys for the full Markdown document and optional artifacts.

Full-document Markdown is saved in the key-value store as OUTPUT_MARKDOWN for a single PDF, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches.
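
One way to find the right key-value store record for each PDF is to read the `outputKeys.markdown` field from the dataset rows. The sketch below is a minimal illustration, assuming batch rows carry the numbered key in that same field; `markdown_keys_by_source` and the sample rows are hypothetical.

```python
def markdown_keys_by_source(rows):
    """Map each sourceUrl to the key-value store key of its full Markdown."""
    keys = {}
    for row in rows:
        # Failure rows may have no outputKeys, so fall back to an empty dict.
        key = (row.get("outputKeys") or {}).get("markdown")
        if key:
            keys.setdefault(row.get("sourceUrl"), key)
    return keys


# Hypothetical rows shaped like the dataset items described in this README.
sample_rows = [
    {"sourceUrl": "https://example.com/report-1.pdf", "pageNumber": 1,
     "outputKeys": {"markdown": "OUTPUT_MARKDOWN_001"}},
    {"sourceUrl": "https://example.com/report-1.pdf", "pageNumber": 2,
     "outputKeys": {"markdown": "OUTPUT_MARKDOWN_001"}},
    {"sourceUrl": "https://example.com/report-2.pdf", "pageNumber": 1,
     "outputKeys": {"markdown": "OUTPUT_MARKDOWN_002"}},
    {"sourceUrl": "https://example.com/not-a-pdf", "status": "failed"},
]
print(markdown_keys_by_source(sample_rows))
```

With the Apify Python client, each key could then be read from the run's default key-value store, for example through `client.key_value_store(run["defaultKeyValueStoreId"]).get_record(key)`.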

Example output row

{
  "sourceUrl": "https://example.com/document.pdf",
  "finalUrl": "https://example.com/document.pdf",
  "status": "success",
  "recordType": "page",
  "fileName": "document.pdf",
  "pageCount": 12,
  "processedPageCount": 12,
  "pageNumber": 1,
  "mode": "no_llm",
  "processingMode": "fast",
  "markdownText": "Markdown for this page...",
  "pageText": "Raw page text...",
  "tables": [],
  "ragChunks": [],
  "download": {
    "attempts": 1,
    "usedProxy": false,
    "contentType": "application/pdf"
  },
  "outputKeys": {
    "markdown": "OUTPUT_MARKDOWN"
  },
  "warnings": [],
  "errors": []
}

Failed PDFs still produce a clear failure row when the Actor starts successfully:

{
  "sourceUrl": "https://example.com/not-a-pdf",
  "status": "failed",
  "recordType": "failure",
  "errors": [
    {
      "step": "download",
      "message": "Failed to download PDF after retries"
    }
  ],
  "warnings": []
}
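
Because page rows and failure rows share one dataset, downstream code usually needs to separate them. The following is a minimal sketch, assuming rows follow the two example shapes above; `split_rows` and the sample data are hypothetical.

```python
from collections import defaultdict


def split_rows(rows):
    """Group page rows by sourceUrl (sorted by page) and collect failure rows."""
    pages = defaultdict(list)
    failures = []
    for row in rows:
        if row.get("recordType") == "failure" or row.get("status") == "failed":
            failures.append(row)
        else:
            pages[row.get("sourceUrl")].append(row)
    # Sort each document's pages so they can be read in order.
    for page_rows in pages.values():
        page_rows.sort(key=lambda r: r.get("pageNumber") or 0)
    return dict(pages), failures


sample_rows = [
    {"sourceUrl": "https://example.com/document.pdf", "status": "success",
     "recordType": "page", "pageNumber": 2, "markdownText": "Second page"},
    {"sourceUrl": "https://example.com/document.pdf", "status": "success",
     "recordType": "page", "pageNumber": 1, "markdownText": "First page"},
    {"sourceUrl": "https://example.com/not-a-pdf", "status": "failed",
     "recordType": "failure",
     "errors": [{"step": "download", "message": "Failed to download PDF after retries"}]},
]

pages, failures = split_rows(sample_rows)
for failure in failures:
    print("FAILED", failure["sourceUrl"], failure["errors"][0]["message"])
```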

How to run on Apify

  1. Open the Actor page on Apify.
  2. Paste one or more PDF URLs into pdfUrls.
  3. Keep mode as no_llm for the cheapest run, or choose an LLM mode when you need cleanup, OCR fallback, tables, or RAG chunks.
  4. Start the run.
  5. Open the Dataset tab for page-by-page JSON results.
  6. Open the Key-value store tab to download full-document Markdown files.

Exporting results

You can export dataset rows from Apify as JSON, CSV, Excel, XML, or RSS. For document-level Markdown, open the run's key-value store and download the OUTPUT_MARKDOWN record or the numbered batch records.

Python API example

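from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "pdfUrls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    ],
    "mode": "no_llm",
}

# Start the Actor and wait for the run to finish.
# The Actor ID matches the Store URL path (thescrapelab/apify-pdf-url-scraper).
run = client.actor("thescrapelab/apify-pdf-url-scraper").call(run_input=run_input)

# Print a short preview of each dataset row (one row per processed page).
dataset_id = run["defaultDatasetId"]
for item in client.dataset(dataset_id).iterate_items():
    print(item["status"], item.get("pageNumber"), item.get("markdownText", "")[:120])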

Advanced options

Advanced options are available through JSON or API input for automation workflows. Use them only when you need tighter control.

  • maxPages: Process only the first N pages of each PDF. Useful for samples and cost control.
  • includeRawText: Include full raw page text in dataset rows.
  • saveDiagnostics: Save page-level diagnostics to the key-value store.
  • savePageMarkdown: Save page Markdown records separately.
  • savePageImages: Save selected page PNGs in a ZIP file for review.
  • maxDownloadMb: Reject PDFs above a configured download size.
  • maxRetries: Limit retry attempts for unreliable URLs.
  • skipHeadPreflight: Skip the initial HEAD request for servers that block HEAD.
  • proxyConfiguration: Use custom proxy settings for sources that block direct requests.
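
As an illustration of combining these options, the sketch below builds an input for a cheap sampling run. The field names come from the list above, but their value types (an integer for `maxPages`, booleans for the save flags) are assumptions, and `build_sample_input` is a hypothetical helper.

```python
def build_sample_input(pdf_urls, max_pages=3):
    """Build an input that samples the first pages of each PDF without extras."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "pdfUrls": list(pdf_urls),
        "mode": "no_llm",
        "maxPages": max_pages,      # assumed to accept an integer
        "savePageImages": False,    # assumed to accept a boolean
        "saveDiagnostics": False,   # assumed to accept a boolean
    }


sample_input = build_sample_input(["https://example.com/report-1.pdf"], max_pages=2)
print(sample_input)
```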

Cost and pricing notes

Cost is mainly driven by memory, runtime, page count, storage writes, and whether LLM/OCR features are used.

  • Use no_llm for high-volume PDF-to-Markdown extraction.
  • Use maxPages when testing large PDFs.
  • Avoid savePageImages unless you need visual review artifacts.
  • Use LLM modes only when the output quality gain is worth the extra cost.
  • The recommended Store pricing model is pay per successful page result, with optional separate LLM page events if monetization is enabled.
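
For a rough budget, the per-result charge can be estimated from the listed starting price of $1.50 per 1,000 results, with one result per processed page. This is a back-of-the-envelope sketch only: `estimate_result_cost` is a hypothetical helper, the listed price is a "from" price, and LLM or OCR features may add cost that is not modeled here.

```python
def estimate_result_cost(page_count, price_per_1000=1.50):
    """Estimate the per-result charge in USD for a number of page results."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return page_count * price_per_1000 / 1000


# Example: 40 PDFs averaging 50 pages each is 2,000 page results.
print(f"${estimate_result_cost(40 * 50):.2f}")
```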

Limits and caveats

  • The Actor works with public HTTP and HTTPS PDF URLs.
  • Password-protected or encrypted PDFs are not supported.
  • Some scanned PDFs require OCR, and OCR quality depends on scan quality.
  • Complex, nested, or visually designed tables may need review.
  • LLM cleanup can improve formatting but may also reword or reinterpret the source text, so check critical passages against the raw page text.
  • Very large PDFs can take longer; use maxPages for sampling or testing.
  • Duplicate input URLs are processed only once per run, so they do not produce duplicate results.
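
Since only HTTP and HTTPS URLs are supported, a local pre-check can filter a list before it is submitted. The sketch below uses the standard library only; `is_supported_url` is a hypothetical helper and does not reflect how the Actor itself validates input.

```python
from urllib.parse import urlparse


def is_supported_url(url):
    """Return True for absolute http(s) URLs that include a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = [
    "https://example.com/report-1.pdf",
    "ftp://example.com/report-2.pdf",
    "example.com/report-3.pdf",
]
supported = [u for u in candidates if is_supported_url(u)]
print(supported)
```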

Troubleshooting

  • If a URL fails, confirm it opens directly in a browser and returns a PDF, not an HTML landing page.
  • If a server blocks downloads, try skipHeadPreflight or a proxy configuration.
  • If a run is expensive, switch to no_llm, add maxPages, and disable optional artifacts.
  • If output is empty, the PDF may be scanned, image-only, encrypted, or blocked by the source server.
  • If tables look imperfect, try an LLM mode and review the warnings field.
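
The first check above (PDF versus HTML landing page) can be approximated locally by looking at the response's content type and its first bytes, since PDF files begin with the marker `%PDF-`. This is a rough diagnostic sketch, not the Actor's own logic; `classify_response` is a hypothetical helper.

```python
def classify_response(content_type, first_bytes):
    """Classify a download as 'pdf', 'html', or 'unknown' from header and body."""
    head = first_bytes.lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "text/html" or head[:1] == b"<":
        return "html"
    return "unknown"


print(classify_response("application/pdf", b"%PDF-1.7\n"))
print(classify_response("text/html; charset=utf-8", b"<!DOCTYPE html>"))
```

The content type and leading bytes can come from any HTTP client; a result of "html" suggests the URL points to a landing page rather than the file itself.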

FAQ

Can this Actor scrape PDF URLs from a website?

This Actor processes PDF URLs you provide. If you need to discover PDF links from web pages first, run a web crawler or link scraper before this Actor.

Does it convert PDF to Markdown?

Yes. It saves full-document Markdown in the key-value store and page-level Markdown in the dataset.

Does it use an LLM by default?

No. The default no_llm mode avoids LLM calls for lower cost.

Can it process multiple PDFs in one run?

Yes. Add multiple URLs to pdfUrls. Duplicate URLs are processed once.

Does it support RAG?

Yes. llm_cheap and llm_premium create source-aware RAG chunks by default. Advanced users can also enable RAG chunks through API input.

Does it extract tables?

Table extraction is enabled in the LLM modes and can be controlled by advanced options. Complex tables may still need manual review.

What happens if one PDF fails in a batch?

The Actor pushes a failure row for that PDF and continues with the remaining URLs.

What is the best setting for large batches?

Use mode: "no_llm", keep optional artifacts disabled, and use maxPages when you only need a sample.

You might also like

PDF to Markdown RAG-Ready

hedelka/pdf-to-markdown-rag

Premium PDF scraper that preserves tables and structure. Optimized for RAG.

Dmitry Goncharov

10

PDF AI Extractor MCP

devaditya/pdf-ai-extractor-mcp

Extracts text, tables, summaries, and structured data from any PDF using OpenAI, Google Gemini, or Claude. Supports bulk AI processing, clean JSON exports, and an AI-ready MCP mode for agent workflows.

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown, ready for RAG, embeddings, and AI agents.

Dev with Bobby

11

Website to Text & Markdown - AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

2

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, heading-based chunking, LLM-ready Markdown for RAG and AI agents.

PDF Intelligence

marielise.dev/pdf-intelligence

Stop fighting PDFs. Extract text, tables, and insights from any document, scanned or digital. Get RAG-ready chunks for LangChain & LlamaIndex. AI-powered summaries, classification, entity extraction. Use our API keys or bring your own (50% discount). From PDF chaos to clean data in minutes.

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.