Pricing
$4.99/month + usage
Fast Pdf Processor
This API is a PDF processing service. Users supply a PDF and choose an operation. Extract Text reads all text from the PDF and returns it as structured JSON, one entry per page. Merge Pages creates a new PDF containing only the pages the user selects.
Last modified: 6 months ago
PDF Processor - Apify Actor Deployment Guide
Overview
The PDF Processor runs as an Apify Actor and provides four main operations:
- Extract Text - Extract text content from all PDF pages
- Merge Pages - Create new PDFs with selected pages only
- HTML to PDF - Convert HTML content to PDF using Playwright
- URL to PDF - Convert web pages to PDF using Playwright
File Structure
    pdf-processor-actor/
    ├── main.py                       # Apify Actor wrapper (main entry point)
    ├── requirements.txt              # Dependencies for Apify deployment
    ├── requirements_apify.txt        # Alternative requirements file
    ├── Dockerfile                    # Docker configuration for Apify
    ├── actor.json                    # Apify Actor configuration
    ├── INPUT_SCHEMA.json             # Input schema definition
    ├── apify_input_schema.json       # Legacy input schema
    ├── apify_output_schema.json      # Output schema definition
    ├── sample_inputs.json            # Example inputs for testing
    ├── test_local.py                 # Local testing script
    ├── n8n_workflow_example.json     # n8n integration example
    ├── n8n_direct_api_workflow.json  # n8n direct API workflow
    ├── QUICK_START.md                # Quick start guide
    ├── apify.json                    # Apify configuration
    ├── actor/                        # Actor configuration directory
    │   ├── actor.json
    │   └── dataset_schema.json
    └── README.md                     # This file
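Before pushing, it can help to confirm that the files the build depends on are actually present. The sketch below is only an illustration: the `REQUIRED_FILES` selection and the `missing_files` helper are hypothetical and not part of the actor, and which files the build truly needs is an assumption you should adjust.

```python
from pathlib import Path

# Hypothetical selection from the tree above; which files the build
# really needs is an assumption, so edit this list for your setup.
REQUIRED_FILES = [
    "main.py",
    "requirements.txt",
    "Dockerfile",
    "actor.json",
    "INPUT_SCHEMA.json",
]


def missing_files(root, required=REQUIRED_FILES):
    """Return the names in `required` that do not exist as files under `root`."""
    root = Path(root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("pdf-processor-actor")
    print("Missing files:", missing or "none")
```

An empty list means every file in the chosen selection was found.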
Deployment Steps
1. Prepare Your Repository
    # Create a new directory for your actor
    mkdir pdf-processor-actor
    cd pdf-processor-actor

    # Copy all the provided files
    cp /path/to/main.py .
    cp /path/to/app.py .
    cp /path/to/requirements_apify.txt .
    cp /path/to/Dockerfile .
    cp /path/to/actor.json .
    cp /path/to/apify_input_schema.json .
    cp /path/to/apify_output_schema.json .
    cp /path/to/sample_inputs.json .
2. Deploy to Apify
Option A: Using Apify CLI
    # Install Apify CLI
    npm install -g apify-cli

    # Login to your Apify account
    apify login

    # Initialize the actor
    apify init

    # Push to Apify platform
    apify push
Option B: Using GitHub Integration
- Push your code to a GitHub repository
- Go to Apify Console
- Click "Actors" → "Create new"
- Choose "From GitHub repository"
- Connect your GitHub repo
- Apify will automatically build and deploy
3. Configure the Actor
In Apify Console:
- Navigate to your actor
- Go to "Settings" tab
- Set the following:
  - Build tag: latest
  - Memory: 512 MB (minimum; increase for complex webpages or large PDFs)
  - Timeout: 300 seconds (adjust based on PDF size and webpage complexity)
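The same settings can also be supplied per run from the Python client instead of being fixed in the Console. A minimal sketch, assuming the `memory_mbytes`, `timeout_secs`, and `build` keyword arguments of `actor.call` in `apify_client`; the `run_options` helper and its power-of-two memory check are illustrative assumptions, not part of the actor.

```python
def run_options(memory_mbytes=512, timeout_secs=300, build="latest"):
    """Build keyword arguments for a single run, mirroring the settings above."""
    # Assumption: the platform expects memory as a power of two, at least 128 MB.
    is_power_of_two = memory_mbytes > 0 and memory_mbytes & (memory_mbytes - 1) == 0
    if memory_mbytes < 128 or not is_power_of_two:
        raise ValueError("memory_mbytes should be a power of two, at least 128")
    if timeout_secs <= 0:
        raise ValueError("timeout_secs must be positive")
    return {
        "memory_mbytes": memory_mbytes,
        "timeout_secs": timeout_secs,
        "build": build,
    }


# Example: give a heavy url-to-pdf run more memory and time.
options = run_options(memory_mbytes=1024, timeout_secs=600)
# With an apify_client actor client: actor.call(run_input={...}, **options)
```

Keeping the defaults equal to the Console values means a run without overrides behaves the same either way.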
4. Test Your Actor
- Go to the "Input" tab
- Use one of the sample inputs:
Extract Text:
{"action":"extract-text","pdfUrl":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}
Merge Pages:
{"action":"merge-pages","pdfUrl":"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf","pageNumbers":[0,2,4]}
HTML to PDF:
{"action":"html-to-pdf","html":"<html><body><h1>Hello World</h1><p>This is a test PDF.</p></body></html>"}
URL to PDF:
{"action":"url-to-pdf","pdfUrl":"https://example.com"}
- Click "Run"
- Check the output in the "Dataset" tab
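The sample inputs above suggest which fields each action needs. The following sketch checks an input before a run is started; the `REQUIRED_FIELDS` table is inferred from those samples rather than taken from the actor's input schema, and `validate_input` is a hypothetical helper.

```python
# Required fields per action, inferred from the sample inputs above
# (an assumption; the authoritative source is INPUT_SCHEMA.json).
REQUIRED_FIELDS = {
    "extract-text": ["pdfUrl"],
    "merge-pages": ["pdfUrl", "pageNumbers"],
    "html-to-pdf": ["html"],
    "url-to-pdf": ["pdfUrl"],
}


def validate_input(run_input):
    """Return a list of problems found in `run_input` (empty if none)."""
    action = run_input.get("action")
    if action not in REQUIRED_FIELDS:
        return [f"unknown action: {action!r}"]
    # Report every required field that is absent or empty.
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS[action]
        if not run_input.get(field)
    ]
    pages = run_input.get("pageNumbers")
    if action == "merge-pages" and pages:
        if not all(isinstance(p, int) and p >= 0 for p in pages):
            problems.append("pageNumbers must be non-negative integers")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with an input at once.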
Usage Examples
Via Apify API
    from apify_client import ApifyClient

    client = ApifyClient('YOUR_API_TOKEN')
    actor = client.actor('YOUR_USERNAME/pdf-processor')

    # Extract text
    run = actor.call(run_input={
        "action": "extract-text",
        "pdfUrl": "https://example.com/document.pdf"
    })

    # HTML to PDF
    run = actor.call(run_input={
        "action": "html-to-pdf",
        "html": "<html><body><h1>Invoice</h1><p>Amount: $100</p></body></html>"
    })

    # URL to PDF
    run = actor.call(run_input={
        "action": "url-to-pdf",
        "pdfUrl": "https://example.com"
    })

    # Get results
    dataset = client.dataset(run['defaultDatasetId'])
    results = list(dataset.iterate_items())
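The example above reads the dataset without checking whether the run succeeded. A hedged sketch of a safer pattern follows; `fetch_run_items` is a hypothetical helper, and it assumes the run dictionary returned by `actor.call` carries a `status` field whose success value is `"SUCCEEDED"`.

```python
def fetch_run_items(client, run):
    """Return the dataset items of a finished run, or raise if it did not succeed."""
    if run is None:
        raise RuntimeError("run did not start or could not be found")
    status = run.get("status")
    # Assumption: a successful run reports the status "SUCCEEDED".
    if status != "SUCCEEDED":
        raise RuntimeError(f"run ended with status {status!r}")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage with the client and actor from the example above:
#   run = actor.call(run_input={"action": "extract-text", "pdfUrl": "..."})
#   items = fetch_run_items(client, run)
```

The helper only relies on `client.dataset(...)` and `iterate_items()`, the same calls the original example uses.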
Via REST API
    # Extract text
    curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer YOUR_API_TOKEN" \
      -d '{"action": "extract-text", "pdfUrl": "https://example.com/document.pdf"}'

    # HTML to PDF
    curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer YOUR_API_TOKEN" \
      -d '{"action": "html-to-pdf", "html": "<html><body><h1>Invoice</h1></body></html>"}'

    # URL to PDF
    curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer YOUR_API_TOKEN" \
      -d '{"action": "url-to-pdf", "pdfUrl": "https://example.com"}'
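For callers who prefer not to shell out to curl, the same request can be assembled with the Python standard library. This is a sketch that only builds the request object; `build_run_request` is a hypothetical helper, and the URL, headers, and body simply mirror the curl calls above.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_USERNAME~pdf-processor",
    "YOUR_API_TOKEN",
    {"action": "extract-text", "pdfUrl": "https://example.com/document.pdf"},
)
# To send it: response = json.load(urllib.request.urlopen(request))
```

Separating construction from sending keeps the request easy to inspect or log before any network call is made.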
Monitoring
- Check logs in the "Runs" tab for debugging
- Monitor performance in the "Analytics" tab
- Set up webhooks for run completion notifications
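Webhooks can also be attached to a single run instead of being configured in the Console. The sketch below builds such a definition; it assumes the Python client accepts a `webhooks` list of dictionaries with `event_types` and `request_url` keys and that the event names shown are the run-finished events. Treat both as assumptions to verify against the client documentation. `completion_webhooks` and the example URL are hypothetical.

```python
def completion_webhooks(request_url):
    """Ad-hoc webhook definitions that fire when a run succeeds or fails."""
    return [
        {
            # Assumed event names for a finished run.
            "event_types": ["ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED"],
            # Endpoint that receives the notification (placeholder value below).
            "request_url": request_url,
        }
    ]


webhooks = completion_webhooks("https://example.com/hooks/pdf-done")
# With an apify_client actor client: actor.call(run_input={...}, webhooks=webhooks)
```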
Cost Estimation
- Compute Units:
  - Text extraction: ~0.001 CU per page
  - Page merging: ~0.002 CU per page
  - HTML/URL to PDF: ~0.005-0.02 CU (depends on complexity and load time)
- Storage: Minimal for text, ~1 MB per 100 pages for generated PDFs
- Bandwidth: Depends on PDF/webpage size (input + output)
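The figures above can be turned into a rough estimate with a few lines of arithmetic. In this sketch the rates are copied from the list, the HTML/URL figure is treated as a per-conversion range (an assumption, since the list does not state a unit), and both helper functions are hypothetical.

```python
# Approximate rates from the list above, as (low, high) compute units.
PER_PAGE_CU = {"extract-text": (0.001, 0.001), "merge-pages": (0.002, 0.002)}
# Assumption: the HTML/URL range applies per conversion, not per page.
PER_RUN_CU = {"html-to-pdf": (0.005, 0.02), "url-to-pdf": (0.005, 0.02)}


def estimate_compute_units(action, pages=1):
    """Return a rough (low, high) compute-unit estimate for one run."""
    if action in PER_PAGE_CU:
        low, high = PER_PAGE_CU[action]
        return (low * pages, high * pages)
    if action in PER_RUN_CU:
        return PER_RUN_CU[action]
    raise ValueError(f"unknown action: {action!r}")


def estimate_storage_mb(pages):
    """Rough size of a generated PDF at ~1 MB per 100 pages."""
    return pages / 100


low, high = estimate_compute_units("extract-text", pages=250)
print(f"extract-text, 250 pages: ~{low:.3f} CU")
```

For example, extracting text from a 250-page PDF comes to roughly 250 × 0.001 = 0.25 CU under these rates.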
Limitations
- Maximum PDF size: 100 MB (configurable)
- Maximum pages to process: 1000 (configurable)
- Timeout: 5 minutes default (configurable)
- HTML/URL to PDF: Requires Playwright/Chrome (included in Docker image)
- Complex JavaScript sites may need additional wait time
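A client can check a document against the default limits above before submitting it. This is a sketch under stated assumptions: `check_limits` is a hypothetical helper, the defaults are taken from the list, and 1 MB is treated as 1024 × 1024 bytes.

```python
MAX_PDF_SIZE_MB = 100  # default from the list above
MAX_PAGES = 1000       # default from the list above


def check_limits(size_bytes, page_count,
                 max_size_mb=MAX_PDF_SIZE_MB, max_pages=MAX_PAGES):
    """Return a list of limit violations for a PDF (empty if it fits)."""
    problems = []
    # Assumption: 1 MB means 1024 * 1024 bytes.
    size_mb = size_bytes / (1024 * 1024)
    if size_mb > max_size_mb:
        problems.append(f"PDF is {size_mb:.1f} MB, limit is {max_size_mb} MB")
    if page_count > max_pages:
        problems.append(f"PDF has {page_count} pages, limit is {max_pages}")
    return problems
```

Because the limits are configurable, they are passed as parameters rather than hard-coded in the checks.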
Support
For issues or questions:
- Check the actor logs for error details
- Verify that the PDF URL is publicly accessible
- Ensure that the requested page numbers are within the document's valid range
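The page-range check in the list above can be done before a merge request is sent. A minimal sketch, assuming page numbers are zero-based (the sample input `[0,2,4]` suggests this, but the source does not say so explicitly); `out_of_range_pages` is a hypothetical helper.

```python
def out_of_range_pages(page_numbers, page_count):
    """Return the requested page numbers that fall outside the document."""
    # Assumption: zero-based indices, so valid values are 0 .. page_count - 1.
    return [p for p in page_numbers if not 0 <= p < page_count]


print(out_of_range_pages([0, 2, 4], page_count=3))  # prints [4]
```

An empty result means every requested page exists in the document.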
License
MIT
