PDF to Text API | Document Extraction for LLMs & RAG

Pricing

from $1.00 / 1,000 documents converted

Convert PDF documents in bulk from URLs into clean, raw text. A document scraper for LLMs, vector databases, and RAG pipelines.

Developer

Andok

Maintained by Community

Last modified

3 months ago

PDF to Text Converter for AI & RAG

Extract clean text and metadata from PDF documents at scale for RAG pipelines, search indexing, and LLM ingestion. Point the actor at any PDF URL and get structured text output without installing local tools. Process entire document libraries in a single run.

Features

Full text extraction — extracts all readable text from PDF documents using pdf-parse
Metadata parsing — captures page count, PDF version, author, title, and creation date
Bulk processing — convert hundreds of PDFs in a single run
URL-based input — no file uploads needed, just provide URLs pointing to PDF files
Configurable concurrency — process 1 to 50 PDFs in parallel
Error resilience — failed documents are reported with error details, not skipped silently

Input

Field	Type	Required	Default	Description
`urls`	`array`	Yes	—	List of URLs pointing to PDF files to extract text from
`timeoutSeconds`	`integer`	No	`30`	Maximum seconds to wait for each PDF download
`concurrency`	`integer`	No	`5`	Number of PDFs to process in parallel (1-50)
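
The field names, defaults, and the 1-50 concurrency range in the table above can be checked before a run is started. The following is a minimal sketch of such a check; the helper name `build_run_input` is hypothetical, and the URL-scheme and positive-timeout checks are assumptions rather than documented actor behavior.

```python
from typing import Any


def build_run_input(
    urls: list[str], timeout_seconds: int = 30, concurrency: int = 5
) -> dict[str, Any]:
    """Build an input object using the field names and defaults from the input table."""
    if not urls:
        raise ValueError("urls is required and must contain at least one PDF URL")
    for url in urls:
        # Assumption: the actor downloads over HTTP(S), so other schemes are rejected here.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
    if timeout_seconds <= 0:
        # Assumption: the listing gives no lower bound, a positive value is the safe choice.
        raise ValueError("timeoutSeconds must be positive")
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    return {
        "urls": list(urls),
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }
```

Failing early on the client side avoids starting a run that the actor would reject or that would time out on every document.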

Input Example

{
  "urls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "timeoutSeconds": 30,
  "concurrency": 5
}
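
An input like the one above could be submitted from Python roughly as follows. This is a sketch, not the author's documented method: it assumes the `apify-client` package, whose `ApifyClient` exposes `actor(...).call(run_input=...)` and `dataset(...).iterate_items()`, and it assumes the actor ID `andok/pdf-text-converter`, inferred from the listing URL. The function name `run_pdf_to_text` is hypothetical.

```python
from typing import Any, Iterator

# Assumption: actor ID inferred from the listing URL, not stated in the README.
ACTOR_ID = "andok/pdf-text-converter"


def run_pdf_to_text(client: Any, run_input: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Start one actor run, wait for it to finish, and yield its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    # Each yielded item corresponds to one input PDF.
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

With the real client this would be called as `run_pdf_to_text(ApifyClient("<YOUR_API_TOKEN>"), run_input)` after `from apify_client import ApifyClient`. Passing the client in as a parameter keeps the function easy to exercise with a stand-in object.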

Output

Each PDF produces one dataset item containing the extracted text and document metadata.

Key output fields:

inputUrl (string) — the original PDF URL provided
status (number) — HTTP status code from the download
pageCount (number) — number of pages in the PDF
info (object) — PDF metadata including title, author, creator, producer, and dates
text (string) — the full extracted text content
error (string) — error message if extraction failed, otherwise absent
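
Because failed documents still produce an item, with `error` set, a consumer can separate successes from failures before indexing. A minimal sketch, assuming items are plain dictionaries with the fields listed above; the helper name `split_results` is hypothetical.

```python
from typing import Any, Iterable


def split_results(
    items: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition dataset items into (succeeded, failed) based on the `error` field."""
    succeeded: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for item in items:
        # `error` is absent on success, so a missing or empty value counts as success.
        if item.get("error"):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed
```

The failed list keeps `inputUrl` and `error` together, which is enough to log the problem or retry those URLs in a later run.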

Output Example

{
  "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "status": 200,
  "pageCount": 1,
  "info": {
    "Title": "Dummy PDF file",
    "Author": null,
    "Creator": "Writer",
    "Producer": "OpenOffice.org 2.1",
    "CreationDate": "D:20070223175637+02'00'"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
}
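
The `CreationDate` value in the example is a raw PDF date string (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset), so downstream code that sorts or filters by date needs to parse it. The sketch below is one way to do that with the standard library; the function name `parse_pdf_date` is hypothetical, and it assumes the actor passes the string through unchanged, as the example suggests.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional "Z" or a +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz])|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: object) -> Optional[datetime]:
    """Parse a PDF date string into a datetime, or return None if it does not match."""
    if not isinstance(value, str):
        return None
    match = _PDF_DATE.match(value)
    if match is None:
        return None
    defaults = (0, 1, 1, 0, 0, 0)  # omitted month/day default to 1, time fields to 0
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    utc_flag, sign, tz_hours, tz_minutes = match.groups()[6:]
    tzinfo = None  # no offset in the string: return a naive datetime
    if utc_flag:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(*parts, tzinfo=tzinfo)
    except ValueError:  # e.g. month 13 or an out-of-range offset
        return None
```

Returning `None` for missing or malformed values keeps the caller simple, since metadata fields such as `Author` in the example can be `null`.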

Pricing

Event	Cost
Document Converted	Pay-per-event (see actor pricing page)

The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
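
For budgeting, the listed starting price of $1.00 per 1,000 documents converted gives a rough estimate of run cost and of how many documents fit under a spending cap. This is a sketch only: the actual per-event price depends on the actor pricing page and your plan, and both function names are hypothetical.

```python
from decimal import Decimal

# Assumption: the "from" price shown on the listing; the real rate may differ.
PRICE_PER_1000 = Decimal("1.00")


def estimate_cost(documents: int, price_per_1000: Decimal = PRICE_PER_1000) -> Decimal:
    """Estimated charge in USD for converting `documents` PDFs."""
    return Decimal(documents) * price_per_1000 / 1000


def documents_within_cap(
    max_charge_usd: str, price_per_1000: Decimal = PRICE_PER_1000
) -> int:
    """Largest whole number of documents whose estimated cost stays within the cap."""
    return int(Decimal(max_charge_usd) * 1000 / price_per_1000)
```

Comparing `documents_within_cap` against the length of `urls` before starting a run shows whether the run is likely to stop early at the cap.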

Use Cases

RAG document ingestion — extract text from PDF knowledge bases for vector database indexing
Search indexing — make PDF content searchable by extracting and indexing the text
Compliance review — bulk-extract text from policy documents and contracts for automated analysis
Academic research — convert research papers to plain text for NLP processing and citation analysis
Data migration — extract content from legacy PDF archives into structured text formats
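
For the RAG ingestion use case, the extracted `text` usually has to be split into overlapping chunks before embedding. The actor returns the full text only, so the chunking below is a downstream step and not part of the actor; the function names, the record layout, and the default sizes are illustrative assumptions.

```python
from typing import Any


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def to_chunk_records(item: dict[str, Any], **kwargs: int) -> list[dict[str, Any]]:
    """Turn one dataset item into records ready for embedding (hypothetical layout)."""
    return [
        {"source": item["inputUrl"], "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(item.get("text") or "", **kwargs))
    ]
```

Character-based chunking is the simplest option; a token-based or sentence-aware splitter may suit a particular embedding model better.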

Related Actors

Actor	What it adds
Web Page to Markdown Converter for LLMs	Convert web pages to Markdown alongside your PDF pipeline
Article Text Extractor for TTS & AI	Extract article text from web pages for a complete content pipeline
HTML Table Extractor	Extract structured table data from web pages

URL: https://apify.com/andok/pdf-text-converter