URL: https://apify.com/basisweb/pdf-to-markdown-rag

PDF to Markdown & JSON for RAG, LLMs & Vectors · Apify


PDF to Markdown & JSON (RAG-Ready)

Pricing

from $2.00 / 1,000 pages processed

Convert PDFs to clean Markdown and structured JSON (text + tables) for RAG, LLMs, and vector DBs. Batch URLs, pay per page.

Rating: 0.0 (0)

Developer: BasisWeb

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 4 days ago

Convert PDFs into clean Markdown and structured JSON (text + tables) you can drop straight into a RAG pipeline, an LLM prompt, or a vector database. Give it a list of PDF URLs; it returns one record per page.

Think of it as the PDF companion to web crawlers like Website Content Crawler and RAG Web Browser: point it at the PDFs your crawler discovers and get clean, page-level text + tables back.

What it does

  • Downloads each PDF by URL.
  • Extracts text using the PDF's character layout (natural reading order for standard single-column pages) and detects tables, rendering them as Markdown tables and structured rows.
  • Returns one dataset item per page. url, page, totalPages, tableCount, and ok are always present; markdown and/or text + tables are included depending on outputFormat (the default, both, returns all of them).

Use cases

  • RAG ingestion: turn reports, manuals, and whitepapers into clean, page-level Markdown chunks for a vector database.
  • LLM document Q&A: feed structured text and tables to an LLM without copy-paste cleanup.
  • Extract tables from PDF: pull tables out as both Markdown and structured rows.
  • Agent pipelines: chain it after a web crawler so an AI agent can read the PDFs it finds.
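For the table-extraction use case, the structured rows can be turned into keyed records before they go to an LLM or a database. The sketch below assumes that tables is a list of tables, each a list of rows, with the header in the first row, as the example output in this README suggests. The helper name table_to_records is hypothetical and not part of the Actor.

```python
from typing import Dict, List

# One table: a list of rows, each row a list of cell strings.
Table = List[List[str]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Turn one table into a list of dicts keyed by the header row.

    Assumption: the first row holds the column headers.
    """
    if not table:
        return []
    header, *rows = table
    records = []
    for row in rows:
        # Pad short rows so every record carries every header key.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


# A page item trimmed to the fields used here.
page_item = {
    "url": "https://example.com/report.pdf",
    "page": 1,
    "tables": [[["Region", "Revenue"], ["NA", "$4.1M"]]],
}

records = [table_to_records(t) for t in page_item["tables"]]
print(records)  # [[{'Region': 'NA', 'Revenue': '$4.1M'}]]
```

Cells beyond the header width are dropped by zip; adjust that if your PDFs contain ragged tables.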

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pdfUrls | array of URLs | (required) | Direct links to the PDFs to convert. |
| extractTables | boolean | true | Detect tables and render them as Markdown + structured rows. |
| outputFormat | markdown, json, or both | both | What each result includes. |

Example input

{
  "pdfUrls": ["https://example.com/report.pdf"],
  "extractTables": true,
  "outputFormat": "both"
}
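If the input is assembled in code, a small builder can catch mistakes before a run is started. This is a minimal sketch based only on the input table above; build_run_input is a hypothetical helper, and the URL-scheme check is an assumption about what counts as a direct link. Sending the payload to the Actor (for example through Apify's API or client library) is not shown.

```python
import json
from typing import Any, Dict, Iterable

ALLOWED_FORMATS = {"markdown", "json", "both"}


def build_run_input(
    pdf_urls: Iterable[str],
    extract_tables: bool = True,
    output_format: str = "both",
) -> Dict[str, Any]:
    """Build the Actor input, mirroring the documented fields and defaults."""
    urls = [u.strip() for u in pdf_urls if u and u.strip()]
    if not urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    # Assumption: a direct link is an http(s) URL.
    bad = [u for u in urls if not u.lower().startswith(("http://", "https://"))]
    if bad:
        raise ValueError(f"not http(s) URLs: {bad}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "pdfUrls": urls,
        "extractTables": bool(extract_tables),
        "outputFormat": output_format,
    }


run_input = build_run_input(["https://example.com/report.pdf"])
print(json.dumps(run_input, indent=2))
```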

Example output (one item per page)

{
  "url": "https://example.com/report.pdf",
  "page": 1,
  "totalPages": 12,
  "markdown": "Q3 Report\nRevenue grew 18% YoY...\n\n| Region | Revenue |\n| --- | --- |\n| NA | $4.1M |",
  "text": "Q3 Report\nRevenue grew 18% YoY...",
  "tables": [[["Region", "Revenue"], ["NA", "$4.1M"]]],
  "tableCount": 1,
  "ok": true
}

The example above uses the default outputFormat: "both", so it includes every field. Each item also includes ok (set to false on a failed URL or page, with an error field explaining why) and, on pages with no extractable text or tables, a note flag.
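Downstream code usually wants to separate usable pages from skipped and failed ones. The sketch below routes items using the ok, error, and note fields described above. It assumes that note is a truthy value when present and that pages carrying a note still have ok set to true; the exact type and wording of note and error are not specified here, so the sample values are placeholders.

```python
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_items(items: Iterable[Item]) -> Tuple[List[Item], List[Item], List[Item]]:
    """Split dataset items into (content, empty, failed) lists."""
    content: List[Item] = []
    empty: List[Item] = []
    failed: List[Item] = []
    for item in items:
        if not item.get("ok", False):
            failed.append(item)   # failed URL or page; see item["error"]
        elif item.get("note"):
            empty.append(item)    # no extractable text or tables
        else:
            content.append(item)  # real content to pass downstream
    return content, empty, failed


# Sample items; the error and note values are placeholders.
items = [
    {"url": "https://example.com/report.pdf", "page": 1, "ok": True, "markdown": "Q3 Report"},
    {"url": "https://example.com/report.pdf", "page": 2, "ok": True, "note": True},
    {"url": "https://example.com/missing.pdf", "ok": False, "error": "placeholder error text"},
]

content, empty, failed = split_items(items)
print(len(content), len(empty), len(failed))  # 1 1 1
```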

Pricing (pay-per-event)

  • Run start: a small flat fee per run (Apify's built-in start event).
  • Page processed: charged per page that returns real content (text and/or tables).

Pages with no extractable text or tables are returned with a note and are NOT charged. Failed URLs and failed pages are reported with an error and are never charged.

Your spending limit is always respected: set a max cost per run and the Actor stops once it's reached.
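As a rough illustration of the billing rules above, the sketch below counts billable pages and estimates a run's cost. It uses the listed "from $2.00 / 1,000" rate as a default, which may not be the rate that applies to you, and it takes the run start fee as a parameter because the amount is not stated here. The function names are hypothetical, and note is assumed to be truthy on uncharged empty pages.

```python
from typing import Any, Dict, Iterable, List

Item = Dict[str, Any]

# Listed "from" rate; treat it as an assumption, not a quote.
PRICE_PER_1000_PAGES_USD = 2.00


def billable_pages(items: Iterable[Item]) -> int:
    """Count pages that returned real content: ok is true and no note is set."""
    return sum(1 for item in items if item.get("ok") and not item.get("note"))


def estimate_cost_usd(
    items: Iterable[Item],
    start_fee_usd: float = 0.0,  # placeholder: the actual start fee is not given
    price_per_1000: float = PRICE_PER_1000_PAGES_USD,
) -> float:
    """Estimate run cost as start fee plus per-page charges."""
    return start_fee_usd + billable_pages(items) * price_per_1000 / 1000


sample: List[Item] = [
    {"page": 1, "ok": True},
    {"page": 2, "ok": True},
    {"page": 3, "ok": True},
    {"page": 4, "ok": True, "note": True},  # empty or scanned page: not charged
    {"ok": False, "error": "placeholder"},  # failed URL: not charged
]

print(billable_pages(sample))               # 3
print(round(estimate_cost_usd(sample), 6))  # 0.006
```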

Use it with AI agents (Apify MCP)

This Actor is available as a tool for AI agents through Apify's MCP server (mcp.apify.com). An agent can call basisweb/pdf-to-markdown-rag to convert a PDF to Markdown mid-task, then chain the page-level output into the next step. The only required input is pdfUrls, so an agent can invoke it in one shot, and the output schema tells the agent exactly which fields come back (markdown, text, tables, tableCount) before it spends a credit.

Honest notes

  • This handles digital, text-based PDFs. Scanned PDFs (image-only, no text layer) are not OCR'd in this version; those pages come back with a note instead of text and are not charged. OCR is planned for a future version.
  • Each PDF must be under 50 MB. Very large or table-heavy PDFs run best at 2 GB memory or higher (the 50 MB limit caps file size, not parsing memory).
  • You can also parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.

FAQ

Does it work on scanned PDFs? Not in this version. Image-only pages with no text layer come back with a note and are not charged. OCR is planned for a future version.

What does it return per page? One item per page with markdown and/or text + tables (depending on outputFormat), plus url, page, totalPages, tableCount, and ok.

How is it priced? A small per-run start fee plus a per-page fee, charged only for pages with real content. Blank, scanned, and failed pages are never charged.

Can I use it for RAG? Yes, that is the point. The Markdown is clean and page-scoped, so you can chunk and embed it directly.
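One way to chunk the page-level Markdown is sketched below. It splits each page on blank lines, so paragraphs and Markdown tables stay whole, and packs the blocks into chunks of at most max_chars characters, keeping url and page as metadata. The chunk_pages helper, the id format, and the 1,000-character default are illustrative choices; embedding the chunks is left to whichever model or vector database you use.

```python
from typing import Any, Dict, Iterable, List

Item = Dict[str, Any]


def chunk_pages(items: Iterable[Item], max_chars: int = 1000) -> List[Dict[str, Any]]:
    """Pack each page's Markdown blocks into chunks of roughly max_chars."""
    chunks: List[Dict[str, Any]] = []
    for item in items:
        markdown = item.get("markdown") or ""
        if not item.get("ok") or not markdown.strip():
            continue  # skip failed pages and pages without Markdown
        blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
        current = ""
        index = 0

        def emit(text: str, idx: int) -> None:
            chunks.append({
                "id": f"{item['url']}#page={item['page']}&chunk={idx}",
                "url": item["url"],
                "page": item["page"],
                "text": text,
            })

        for block in blocks:
            candidate = f"{current}\n\n{block}" if current else block
            if current and len(candidate) > max_chars:
                emit(current, index)
                index += 1
                current = block  # an oversized block is kept whole
            else:
                current = candidate
        if current:
            emit(current, index)
    return chunks


page = {
    "url": "https://example.com/report.pdf",
    "page": 1,
    "ok": True,
    "markdown": "Q3 Report\nRevenue grew 18% YoY...\n\n| Region | Revenue |\n| --- | --- |\n| NA | $4.1M |",
}

for chunk in chunk_pages([page], max_chars=40):
    print(chunk["id"], repr(chunk["text"][:20]))
```

Because chunks never cross a page boundary, each one can be cited back to a specific page of the source PDF.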

How is this different from parsing PDFs locally? You can parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.

Run locally

apify run

Deploy

apify login
apify push
