URL: https://apify.com/andok/pdf-text-converter

PDF to Text API | Document Extraction for LLMs & RAG · Apify

Pricing

from $1.00 / 1,000 documents converted

Convert PDF documents in bulk, by URL, into clean raw text. A document scraper for LLMs, vector databases, and RAG pipelines.

Rating: 0.0 (0)

Developer: Andok (maintained by community)

Actor stats: 0 bookmarked, 21 total users, 4 monthly active users, last modified 3 months ago

PDF to Text Converter for AI & RAG

Extract clean text and metadata from PDF documents at scale for RAG pipelines, search indexing, and LLM ingestion. Point the actor at any PDF URL and get structured text output without installing local tools. Process entire document libraries in a single run.

Features

  • Full text extraction: extracts all readable text from PDF documents using pdf-parse
  • Metadata parsing: captures page count, PDF version, author, title, and creation date
  • Bulk processing: convert hundreds of PDFs in a single run
  • URL-based input: no file uploads needed, just provide URLs pointing to PDF files
  • Configurable concurrency: process 1 to 50 PDFs in parallel
  • Error resilience: failed documents are reported with error details, not skipped silently

Input

Field | Type | Required | Default | Description
urls | array | Yes | - | List of URLs pointing to PDF files to extract text from
timeoutSeconds | integer | No | 30 | Maximum seconds to wait for each PDF download
concurrency | integer | No | 5 | Number of PDFs to process in parallel (1-50)

Input Example

{
  "urls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "timeoutSeconds": 30,
  "concurrency": 5
}
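
As an illustration of the input fields above, the following Python sketch builds and validates a run input before starting the Actor. The helper names `build_run_input` and `run_converter` are hypothetical. The Actor ID `andok/pdf-text-converter` is taken from the store URL, and running it through the `apify-client` package is an assumption about how you might call it, not something this page documents.

```python
from typing import Any, Dict, Iterator, List


def build_run_input(urls: List[str], timeout_seconds: int = 30,
                    concurrency: int = 5) -> Dict[str, Any]:
    """Build an input object using the field names from the Input table."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    # Assumption: a non-positive timeout is never useful, so reject it early.
    if timeout_seconds <= 0:
        raise ValueError("timeoutSeconds must be positive")
    return {
        "urls": list(urls),
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }


def run_converter(token: str, run_input: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Start the Actor and yield dataset items.

    Assumes the third-party package is installed (pip install apify-client).
    """
    from apify_client import ApifyClient  # imported lazily on purpose

    client = ApifyClient(token)
    run = client.actor("andok/pdf-text-converter").call(run_input=run_input)
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

Validating locally catches an out-of-range `concurrency` before a run is started; `run_converter` is only needed once you have an API token.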

Output

Each PDF produces one dataset item containing the extracted text and document metadata.

Key output fields:

  • inputUrl (string): the original PDF URL provided
  • status (number): HTTP status code from the download
  • pageCount (number): number of pages in the PDF
  • info (object): PDF metadata including title, author, creator, producer, and dates
  • text (string): the full extracted text content
  • error (string): error message if extraction failed, otherwise absent
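
Because failed documents carry an `error` field instead of being dropped, a consumer will usually want to separate them from successful items. A minimal Python sketch, assuming dataset items arrive as plain dictionaries with the fields listed above (the function name `split_results` is hypothetical):

```python
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_results(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Return (succeeded, failed), using the presence of `error` as the signal."""
    succeeded: List[Item] = []
    failed: List[Item] = []
    for item in items:
        # `error` is absent on success, so a missing or empty value means OK.
        if item.get("error"):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed
```

The failed list keeps each item's `inputUrl`, so those URLs can be retried in a later run.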

Output Example

{
  "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "status": 200,
  "pageCount": 1,
  "info": {
    "Title": "Dummy PDF file",
    "Author": null,
    "Creator": "Writer",
    "Producer": "OpenOffice.org 2.1",
    "CreationDate": "D:20070223175637+02'00'"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
}
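
The `CreationDate` value in the example is a raw PDF date string (`D:YYYYMMDDHHmmSS` followed by a UTC offset), not an ISO timestamp. If you need a real datetime, something like the following Python sketch could convert it. The function name `parse_pdf_date` is hypothetical, and the pattern only covers the common shapes of this format; it returns `None` for anything it does not recognise.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a PDF date string; return None when it is missing or malformed."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tz = None
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tz)
    except ValueError:
        # Out-of-range components such as month 13.
        return None
```

For the example above, `parse_pdf_date("D:20070223175637+02'00'").isoformat()` yields `2007-02-23T17:56:37+02:00`.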

Pricing

Event | Cost
Document Converted | Pay-per-event (see actor pricing page)

The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
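
To get a feel for how a spending cap translates into document counts, here is a small Python sketch of the arithmetic. It assumes the advertised starting rate of $1.00 per 1,000 documents converted applies to your plan; the actual rate may be higher, so treat the constant and both function names as placeholders.

```python
from decimal import Decimal

# Assumption: the "from $1.00 / 1,000 documents" starting rate.
PRICE_PER_1000 = Decimal("1.00")


def estimate_cost(documents: int, price_per_1000: Decimal = PRICE_PER_1000) -> Decimal:
    """Estimated charge in USD for converting `documents` PDFs."""
    if documents < 0:
        raise ValueError("documents must not be negative")
    return Decimal(documents) * price_per_1000 / Decimal(1000)


def max_documents(max_charge_usd: Decimal,
                  price_per_1000: Decimal = PRICE_PER_1000) -> int:
    """How many whole documents fit under a per-run max charge."""
    return int(max_charge_usd * Decimal(1000) // price_per_1000)
```

At the assumed rate, 2,500 documents come to $2.50, and a $5.00 cap covers 5,000 documents before processing stops.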

Use Cases

  • RAG document ingestion: extract text from PDF knowledge bases for vector database indexing
  • Search indexing: make PDF content searchable by extracting and indexing the text
  • Compliance review: bulk-extract text from policy documents and contracts for automated analysis
  • Academic research: convert research papers to plain text for NLP processing and citation analysis
  • Data migration: extract content from legacy PDF archives into structured text formats
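
For the RAG ingestion use case, the extracted `text` usually has to be split into chunks before it is embedded. The Actor itself returns the full text only, so the following Python sketch is a downstream step under simple assumptions: fixed-size character windows with overlap, with the function name `chunk_text` and the default sizes chosen purely for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Character windows are the simplest option; splitting on sentences or tokens generally gives better retrieval quality but needs a tokenizer.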

Related Actors

Actor | What it adds
Web Page to Markdown Converter for LLMs | Convert web pages to Markdown alongside your PDF pipeline
Article Text Extractor for TTS & AI | Extract article text from web pages for a complete content pipeline
HTML Table Extractor | Extract structured table data from web pages
