PDF to Markdown & JSON (RAG-Ready)

Pricing

from $2.00 / 1,000 pages processed

Convert PDFs to clean Markdown and structured JSON (text + tables) for RAG, LLMs, and vector DBs. Batch URLs, pay per page.

Rating

0.0

(0)

Developer

BasisWeb

Maintained by Community

Actor stats

Last modified

4 days ago

What it does

Downloads each PDF by URL.
Extracts text using the PDF's character layout (natural reading order for standard single-column pages) and detects tables, rendering them as Markdown tables and structured rows.
Returns one dataset item per page. url, page, totalPages, tableCount, and ok are always present; markdown and/or text + tables are included depending on outputFormat (the default, both, returns all of them).
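
As a minimal sketch, a consumer could check each item against the field contract described above. The mapping from outputFormat to optional fields (markdown adds markdown; json adds text and tables) is an assumption read from the description, not a published schema, and missing_fields is a hypothetical helper.

```python
# Fields the description says are present on every item.
ALWAYS_PRESENT = {"url", "page", "totalPages", "tableCount", "ok"}

# ASSUMPTION: which optional fields each outputFormat adds.
FORMAT_FIELDS = {
    "markdown": {"markdown"},
    "json": {"text", "tables"},
    "both": {"markdown", "text", "tables"},
}


def missing_fields(item: dict, output_format: str = "both") -> set:
    """Return the expected field names that are absent from one page item."""
    if output_format not in FORMAT_FIELDS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    expected = ALWAYS_PRESENT | FORMAT_FIELDS[output_format]
    return expected - set(item)
```

An empty result means the item carries everything the chosen format is expected to include.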

Use cases

RAG ingestion: turn reports, manuals, and whitepapers into clean, page-level Markdown chunks for a vector database.
LLM document Q&A: feed structured text and tables to an LLM without copy-paste cleanup.
Extract tables from PDF: pull tables out as both Markdown and structured rows.
Agent pipelines: chain it after a web crawler so an AI agent can read the PDFs it finds.
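
For the RAG ingestion case, the page items might be turned into chunk records along these lines. This is a sketch, not part of the Actor: items_to_chunks, the record shape (id, text, metadata), and the ID scheme are all hypothetical, and a real pipeline would adapt them to its vector database.

```python
def items_to_chunks(items: list) -> list:
    """Build one chunk record per page that has usable Markdown."""
    chunks = []
    for item in items:
        # Skip failed pages and pages without Markdown (e.g. scanned pages).
        if not item.get("ok") or not item.get("markdown"):
            continue
        chunks.append({
            # Hypothetical ID scheme: source URL plus page number.
            "id": f"{item['url']}#page={item['page']}",
            "text": item["markdown"],
            "metadata": {
                "url": item["url"],
                "page": item["page"],
                "totalPages": item["totalPages"],
                "tableCount": item.get("tableCount", 0),
            },
        })
    return chunks
```

Keeping url and page in the metadata lets a retrieval result point back to the exact page it came from.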

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `pdfUrls` | array of URLs | (required) | Direct links to the PDFs to convert. |
| `extractTables` | boolean | `true` | Detect tables and render them as Markdown + structured rows. |
| `outputFormat` | `markdown` \| `json` \| `both` | `both` | What each result includes. |

Example input

{
  "pdfUrls": ["https://example.com/report.pdf"],
  "extractTables": true,
  "outputFormat": "both"
}
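
If the input is assembled in code, a small builder can catch obvious mistakes before a run is started. This is a sketch under assumptions: build_input is a hypothetical helper, and its checks are client-side conveniences based on the input table above, not the Actor's own validation.

```python
VALID_FORMATS = ("markdown", "json", "both")


def build_input(pdf_urls, extract_tables=True, output_format="both") -> dict:
    """Assemble an input object using the three documented fields."""
    urls = [u.strip() for u in pdf_urls if u and u.strip()]
    if not urls:
        raise ValueError("pdfUrls is required and needs at least one URL")
    bad = [u for u in urls if not u.startswith(("http://", "https://"))]
    if bad:
        raise ValueError(f"not an http(s) URL: {bad[0]!r}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"outputFormat must be one of {VALID_FORMATS}")
    return {
        "pdfUrls": urls,
        "extractTables": bool(extract_tables),
        "outputFormat": output_format,
    }
```

Calling build_input(["https://example.com/report.pdf"]) produces the same object as the example input above.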

Example output (one item per page)

{
  "url": "https://example.com/report.pdf",
  "page": 1,
  "totalPages": 12,
  "markdown": "Q3 Report\nRevenue grew 18% YoY...\n\n| Region | Revenue |\n| --- | --- |\n| NA | $4.1M |",
  "text": "Q3 Report\nRevenue grew 18% YoY...",
  "tables": [[["Region", "Revenue"], ["NA", "$4.1M"]]],
  "tableCount": 1,
  "ok": true
}

The example above uses the default outputFormat: "both", so it includes every field. Each item also includes ok (set to false on a failed URL or page, with an error field explaining why) and, on pages with no extractable text or tables, a note flag.
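
The tables field in the example is a list of tables, each a list of rows, each a list of cell strings. A sketch of turning that into keyed records follows; it assumes the first row of every table is a header row, which holds for the example but is not stated as a guarantee, and tables_to_records is a hypothetical helper.

```python
def tables_to_records(tables: list) -> list:
    """Convert each table (a list of rows) into a list of header-keyed dicts."""
    records = []
    for table in tables:
        if len(table) < 2:
            # Header only, or empty: there are no data rows to key.
            records.append([])
            continue
        # ASSUMPTION: the first row holds the column names.
        header, *rows = table
        records.append([dict(zip(header, row)) for row in rows])
    return records
```

For the example item, this yields one table containing one record with the keys Region and Revenue.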

Pricing (pay-per-event)

Run start: a small flat fee per run (Apify's built-in start event).
Page processed: charged per page that returns real content (text and/or tables).

Pages with no extractable text or tables are returned with a note and are NOT charged. Failed URLs and failed pages are reported with an error and are never charged.

Your spending limit is always respected: set a max cost per run and the Actor stops once it's reached.
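
The charging rule above can be sketched as a rough estimate over a run's dataset items. The $2.00 per 1,000 pages rate is the listed "from" price and may not match the rate actually billed, the start fee is not stated here so it defaults to a placeholder of 0.0, and both function names are hypothetical.

```python
def billable_pages(items: list) -> int:
    """Count pages that returned real content: ok is true and no note flag."""
    return sum(1 for item in items if item.get("ok") and not item.get("note"))


def estimate_cost(items: list, price_per_1000_pages: float = 2.00,
                  start_fee: float = 0.0) -> float:
    """Estimate the run cost in USD; both rates are assumptions (see above)."""
    return start_fee + billable_pages(items) * price_per_1000_pages / 1000
```

Pages carrying a note and items with ok set to false are left out of the count, matching the rule that they are not charged.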

Use it with AI agents (Apify MCP)

This Actor is available as a tool for AI agents through Apify's MCP server (mcp.apify.com). An agent can call basisweb/pdf-to-markdown-rag to convert a PDF to Markdown mid-task, then chain the page-level output into the next step. The only required input is pdfUrls, so an agent can invoke it in one shot, and the output schema tells the agent exactly which fields come back (markdown, text, tables, tableCount) before it spends a credit.

Honest notes

This handles digital, text-based PDFs. Scanned PDFs (image-only, no text layer) are not OCR'd in this version; those pages come back with a note instead of text and are not charged. OCR is planned for a future version.
Each PDF must be under 50 MB. Very large or table-heavy PDFs run best at 2 GB memory or higher (the 50 MB limit caps file size, not parsing memory).
You can also parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.
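
Used as an API from Python, a run might look like the sketch below. It assumes the apify-client package, whose ApifyClient exposes actor(...).call(run_input=..., memory_mbytes=...) and dataset(...).iterate_items(), and that call returns a run dictionary with a defaultDatasetId key; check both against the client version you have installed. fetch_pages is a hypothetical helper and takes the client as an argument.

```python
def fetch_pages(client, pdf_urls, memory_mbytes=2048,
                actor_id="basisweb/pdf-to-markdown-rag") -> list:
    """Run the Actor and return its per-page dataset items as a list.

    `client` is expected to behave like apify_client.ApifyClient, e.g.
        from apify_client import ApifyClient
        client = ApifyClient("<YOUR_APIFY_TOKEN>")
    """
    run_input = {"pdfUrls": list(pdf_urls)}  # the only required field
    # 2048 MB follows the 2 GB suggestion for large or table-heavy PDFs.
    run = client.actor(actor_id).call(run_input=run_input,
                                      memory_mbytes=memory_mbytes)
    if not run:
        raise RuntimeError("the Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Passing the client in keeps the function easy to test with a stand-in object and leaves token handling to the caller.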

FAQ

Does it work on scanned PDFs? Not in this version. Image-only pages with no text layer come back with a note and are not charged. OCR is planned for a future version.
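
Until OCR is available, a pipeline could collect the pages that came back with a note and route them to its own OCR step. A minimal sketch, assuming the note flag is truthy on such pages; pages_needing_ocr is a hypothetical helper, and a flagged page may simply be blank rather than scanned.

```python
from collections import defaultdict


def pages_needing_ocr(items: list) -> dict:
    """Map each PDF URL to the page numbers that returned a note, not content."""
    pages = defaultdict(list)
    for item in items:
        # Failed items (ok is false) are a separate case and are skipped here.
        if item.get("ok") and item.get("note"):
            pages[item["url"]].append(item["page"])
    return {url: sorted(numbers) for url, numbers in pages.items()}
```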

What does it return per page? One item per page with markdown and/or text + tables (depending on outputFormat), plus url, page, totalPages, tableCount, and ok.

How is it priced? A small per-run start fee plus a per-page fee, charged only for pages with real content. Blank, scanned, and failed pages are never charged.

Can I use it for RAG? Yes, that is the point. The Markdown is clean and page-scoped, so you can chunk and embed it directly.

How is this different from parsing PDFs locally? You can parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.

Run locally

apify run

Deploy

apify login
apify push

URL: https://apify.com/basisweb/pdf-to-markdown-rag
