URL: https://apify.com/gagandeo/bulk-pdf-to-json-ocr

Bulk Pdf To Json OCR · Apify


Pricing

from $300.00 / 1,000 results

Bulk Pdf To Json OCR

Convert PDF invoices, menus, images with text and documents into structured JSON. Features hybrid Digital+OCR parsing and AI-powered data extraction.

Rating

0.0

(0)

Developer

Kumar Gagandeo

Maintained by Community

Actor stats

1

Bookmarked

6

Total users

1

Monthly active users

6 months ago

Last modified

PDF to JSON OCR Actor with Gemini AI

This Apify Actor converts PDF files to structured JSON data using intelligent text extraction and Google Gemini AI-powered structuring.

Features

  • 📄 Hybrid Text Extraction: Automatically detects digital text vs scanned images
  • 🔍 OCR Support: Uses Tesseract OCR for scanned documents
  • 🤖 AI Structuring: Powered by Google Gemini 2.0 Flash for intelligent data extraction
  • 📋 Document Types: Optimized for invoices, receipts, menus, resumes, contracts, brochures, and general documents
  • ⚡ Bulk Processing: Process multiple PDFs in a single run

Setup

1. Configure Environment Variables

Copy the example environment file and add your Gemini API key:

$ cp .env.example .env

Edit .env and add your API key:

GEMINI_API_KEY=AIzaSy...your-actual-key-here
GEMINI_MODEL=gemini-2.0-flash-exp

Get your Gemini API key from: https://aistudio.google.com/apikey

2. Install Dependencies

$ pip install -r requirements.txt

3. Run the Actor

$ apify run

Deploy to Apify

apify login
apify push

Input Configuration

Required Fields

  • PDF URLs (startUrls): Array of direct PDF file URLs to process

Optional Fields

  • Enable AI Structuring (structureData): Toggle AI-powered data extraction (default: false)
  • Document Type (documentType): Context for AI extraction - general, invoice, receipt, menu, resume, contract, brochure, specification
  • Max Pages (maxPages): Limit pages processed per PDF (default: 10)

Example Input

{
  "startUrls": [
    {"url": "https://example.com/document.pdf"}
  ],
  "structureData": true,
  "documentType": "invoice",
  "maxPages": 5
}
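
The optional fields and their defaults can be applied to an input like the one above before any PDF is processed. The following is a minimal sketch, not the Actor's actual code: `normalize_input` and the returned key names are hypothetical, and the `general` default for `documentType` is an assumption, since only the defaults for `structureData` and `maxPages` are documented.

```python
from typing import Any

# Document types listed under Optional Fields.
ALLOWED_DOCUMENT_TYPES = {
    "general", "invoice", "receipt", "menu",
    "resume", "contract", "brochure", "specification",
}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply documented defaults and basic checks (hypothetical helper)."""
    # startUrls is the only required field: a list of {"url": ...} objects.
    start_urls = raw.get("startUrls") or []
    urls = [entry["url"] for entry in start_urls if entry.get("url")]
    if not urls:
        raise ValueError("startUrls must contain at least one PDF URL")

    # Assumption: "general" is used when documentType is omitted.
    document_type = raw.get("documentType", "general")
    if document_type not in ALLOWED_DOCUMENT_TYPES:
        raise ValueError(f"Unsupported documentType: {document_type!r}")

    # Documented default: 10 pages per PDF.
    max_pages = int(raw.get("maxPages", 10))
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")

    return {
        "urls": urls,
        # Documented default: AI structuring is off.
        "structure_data": bool(raw.get("structureData", False)),
        "document_type": document_type,
        "max_pages": max_pages,
    }
```

Rejecting an unknown `documentType` early is a design choice of this sketch; the Actor itself may instead fall back to a generic prompt.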

How It Works

  1. Download: Fetches PDF from provided URL
  2. Text Extraction:
    • First attempts digital text extraction (fast)
    • Falls back to OCR if the document appears to be scanned (fewer than 50 extracted characters per page)
  3. AI Structuring (optional):
    • Sends extracted text to Google Gemini AI
    • Returns structured JSON based on document type
  4. Data Storage: Pushes results to Apify dataset
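
The digital-first, OCR-fallback decision in step 2 can be sketched as a small pure function. This is an illustration under stated assumptions, not the Actor's implementation: `needs_ocr` and `extract_text` are hypothetical names, the density is assumed to be the average character count per page, and the OCR step is passed in as a callable so the sketch stays independent of pdfplumber, pdf2image, and pytesseract.

```python
from typing import Callable


def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Return True when digitally extracted text is too sparse to trust.

    Assumption: "character density" means the average number of
    non-whitespace-trimmed characters per page.
    """
    if not page_texts:
        return True  # nothing extracted at all, so treat as scanned
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_text(
    digital_pages: list[str],
    run_ocr: Callable[[], list[str]],
) -> tuple[str, bool]:
    """Return (text, is_ocr_scanned), calling OCR only when needed."""
    if needs_ocr(digital_pages):
        # The slow path: run_ocr would rasterize pages and recognize text.
        return "\n".join(run_ocr()), True
    return "\n".join(digital_pages), False
```

Passing the OCR step as a callable means the expensive rasterize-and-recognize work runs only for documents that fail the density check.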

Output Format

{
  "url": "https://example.com/document.pdf",
  "status": "success",
  "document_type": "invoice",
  "ai_enabled": true,
  "ai_model": "gemini-2.0-flash-exp",
  "is_ocr_scanned": false,
  "page_count": 3,
  "raw_text_preview": "First 500 characters of extracted text...",
  "extracted_data": {
    "invoice_number": "INV-001",
    "date": "2025-12-17",
    "total": "$1,234.56"
  }
}
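
A record with this shape could be assembled as in the sketch below. `build_result` is a hypothetical helper, and the values used when AI structuring is disabled (`ai_model` and `extracted_data` set to `None`) are assumptions, because the example above only shows the AI-enabled case.

```python
from typing import Any, Optional

PREVIEW_LENGTH = 500  # raw_text_preview holds the first 500 characters


def build_result(
    url: str,
    text: str,
    page_count: int,
    is_ocr_scanned: bool,
    document_type: str,
    ai_model: Optional[str] = None,
    extracted_data: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build one dataset item in the documented output shape."""
    return {
        "url": url,
        "status": "success",
        "document_type": document_type,
        # Assumption: AI counts as enabled whenever a model name is given.
        "ai_enabled": ai_model is not None,
        "ai_model": ai_model,
        "is_ocr_scanned": is_ocr_scanned,
        "page_count": page_count,
        "raw_text_preview": text[:PREVIEW_LENGTH],
        "extracted_data": extracted_data,
    }
```

In an Actor built on the Apify SDK for Python, an item like this would typically be stored with `await Actor.push_data(item)`.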

Project Structure

.actor/
├── actor.json # Actor config: name, version, env vars, runtime settings
├── dataset_schema.json # Structure and representation of data produced by an Actor
├── input_schema.json # Input validation & Console form definition
└── output_schema.json # Specifies where an Actor stores its output
src/
└── main.py # Actor entry point with PDF processing logic
.env # Environment variables (API keys) - DO NOT COMMIT!
.env.example # Template for environment variables
storage/ # Local storage (mirrors Cloud during development)
├── datasets/ # Output items (JSON objects)
├── key_value_stores/ # Files, config, INPUT
└── request_queues/ # Pending crawl requests
Dockerfile # Container image definition
requirements.txt # Python dependencies

For more information, see the Actor definition documentation.

Dependencies

  • Apify SDK - Actor runtime framework
  • pdfplumber - Digital PDF text extraction
  • pdf2image - Converts PDF pages to images
  • pytesseract - OCR text recognition
  • httpx - Async HTTP client for downloading PDFs
  • google-generativeai - Google Gemini API client
  • python-dotenv - Environment variable management

Environment Variables

The Actor uses environment variables for configuration. These can be set in the .env file for local development:

  • GEMINI_API_KEY - Your Google Gemini API key (required for AI structuring)
  • GEMINI_MODEL - Model to use (default: gemini-2.0-flash-exp)

For Apify Cloud deployment: Set these as environment variables in the Actor settings on the Apify Console.
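
Reading these two variables with the documented default might look like the sketch below. `load_gemini_settings` is a hypothetical helper rather than the Actor's own code. It reads only from the process environment, so for local development it assumes the `.env` file has already been loaded (for example with python-dotenv's `load_dotenv()`).

```python
import os
from typing import Mapping, Optional

DEFAULT_GEMINI_MODEL = "gemini-2.0-flash-exp"  # documented default


def load_gemini_settings(
    env: Optional[Mapping[str, str]] = None,
    require_key: bool = True,
) -> dict:
    """Read Gemini settings from the environment (hypothetical helper).

    Pass require_key=False when AI structuring is disabled, since the
    API key is only needed for that step.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if require_key and not api_key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; it is required for AI structuring"
        )
    return {
        "api_key": api_key or None,
        # Fall back to the default when GEMINI_MODEL is unset or empty.
        "model": env.get("GEMINI_MODEL") or DEFAULT_GEMINI_MODEL,
    }
```

Failing fast on a missing key gives a clear error at startup instead of a failed API call partway through a bulk run.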

Getting Started

For complete information see this article.

  1. Copy .env.example to .env and add your Gemini API key
  2. Install dependencies: pip install -r requirements.txt
  3. Run the Actor: apify run

Deploy to Apify

Connect Git repository to Apify

If you've created a Git repository for the project, you can connect it to Apify:

  1. Go to the Actor creation page
  2. Click the Link Git Repository button

Push the project from your local machine to Apify

You can also deploy the project from your local machine to Apify without a Git repository.

  1. Log in to Apify. You will need to provide your Apify API token to complete this action.

    $ apify login
  2. Deploy your Actor. This command deploys and builds the Actor on the Apify platform. You can find your newly created Actor under Actors -> My Actors.

    $ apify push
