URL: https://apify.com/junipr/pdf-to-html

PDF to HTML Converter - Tables & Formatting · Apify


Pricing

from $3.90 / 1,000 pages converted

PDF to HTML Converter

Convert PDFs to clean HTML preserving formatting, headings, tables, and layout. Multi-page support with per-page or combined output. OCR fallback for image PDFs. Inline CSS styling. Download via API.

Developer

๐Ÿ‘ junipr

junipr

Maintained by Community

Introduction

Convert any PDF document to clean, semantic HTML that preserves the original document structure. Unlike most PDF-to-HTML tools, which produce visual HTML with absolutely positioned divs and spans (mimicking the PDF layout pixel by pixel), this actor generates real semantic elements: <h1>-<h6> headings, <table> with <thead>/<tbody>, <ul>/<ol> lists, and <p> paragraphs. The output is valid HTML5, screen-reader accessible, and ready for web publishing, CMS import, content migration, or further processing in any pipeline. Batch processing is supported: convert hundreds of PDFs in a single run with configurable styling options.

Why Use This Actor

Most PDF-to-HTML converters produce "visual HTML": absolutely positioned divs that look like the PDF but have no semantic meaning. Tables are rendered as scattered text spans, headings are just larger fonts, and lists become disconnected bullet characters. This actor produces semantic HTML that browsers, search engines, screen readers, and CMS platforms can actually understand.

| Feature | This Actor | pdf2htmlEX | Adobe API | Online Tools |
| --- | --- | --- | --- | --- |
| Semantic HTML | Headings, tables, lists | Absolute positioning | Partial | Rarely |
| Table detection | Proper <table> | Positioned text | Yes | Poor |
| List detection | <ul> + <ol> | None | Partial | None |
| CSS options | Inline / class / none | Inline only | Class-based | Inline |
| Batch processing | Yes (up to 5,000 PDFs) | CLI only | Yes | Single file |
| Cost | $3/1K pages | Free (self-hosted) | $0.05/page | Freemium |
| Setup | Zero config | CLI install + Docker | API key required | Upload UI |

The output is WCAG-friendly: screen readers can navigate headings, read table headers, and traverse list items, which is not possible with visual HTML output.

How to Use

Zero-config example (just provide a PDF URL):

{
  "sources": [
    {"url": "https://example.com/report.pdf"}
  ]
}

Node.js (Apify Client):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('junipr/pdf-to-html').call({
  sources: [{ url: 'https://example.com/document.pdf' }],
  stylingMode: 'class',
  wrapInDocument: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].html);

Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("junipr/pdf-to-html").call(run_input={
    "sources": [{"url": "https://example.com/document.pdf"}],
    "stylingMode": "class",
})
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(items[0]["html"])

Load from Apify Key-Value Store:

{
  "sources": [
    {"kvStoreKey": "my-document.pdf", "kvStoreId": "abc123"}
  ]
}

Input Configuration

All parameters except sources are optional. Common recipes:

Quick conversion (URL only, all defaults):

{"sources": [{"url": "https://example.com/doc.pdf"}]}

Web publishing (styled HTML document with WebP images):

{
  "sources": [{"url": "https://example.com/doc.pdf"}],
  "stylingMode": "class",
  "wrapInDocument": true,
  "imageFormat": "webp",
  "includeDefaultStyles": true
}

CMS import (pure semantic HTML, no styling, no page breaks):

{
  "sources": [{"url": "https://example.com/doc.pdf"}],
  "stylingMode": "none",
  "pageBreakMode": "none",
  "wrapInDocument": false
}

Print archive (preserve all formatting with inline CSS):

{
  "sources": [{"url": "https://example.com/doc.pdf"}],
  "stylingMode": "inline",
  "preserveFontSizes": true,
  "preserveColors": true,
  "preserveFontStyles": true
}

See the Input tab for the full list of parameters with descriptions and defaults.
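
If you assemble these recipes programmatically, a small helper can catch typos before a run starts. The following is a minimal sketch: build_input is a hypothetical helper (not part of the actor or the Apify client), and it validates only the three stylingMode values named in this document. Any other option is passed through unchecked.

```python
from typing import Any

# Styling modes named in this document; anything else is rejected locally.
STYLING_MODES = {"class", "inline", "none"}


def build_input(urls: list[str], styling_mode: str = "class", **options: Any) -> dict[str, Any]:
    """Build a run-input dict. Hypothetical helper, not part of the actor."""
    if not urls:
        raise ValueError("at least one source URL is required")
    if styling_mode not in STYLING_MODES:
        raise ValueError(f"unknown stylingMode: {styling_mode!r}")
    run_input: dict[str, Any] = {
        "sources": [{"url": url} for url in urls],
        "stylingMode": styling_mode,
    }
    # Extra options (e.g. wrapInDocument, pageBreakMode) are copied as given.
    run_input.update(options)
    return run_input


# The "CMS import" recipe from above, built with the helper.
cms_input = build_input(
    ["https://example.com/doc.pdf"],
    styling_mode="none",
    pageBreakMode="none",
    wrapInDocument=False,
)
print(cms_input)
```

The resulting dict can be passed as run_input to the Python client call shown in How to Use.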

Output Format

Each converted PDF produces one dataset item with the full HTML, per-page breakdown, image references, metadata, and conversion stats. Example output (fragment mode):

<h1 class="heading-1">Annual Report 2024</h1>
<p class="paragraph">Revenue grew 23% year-over-year...</p>
<h2 class="heading-2">Financial Summary</h2>
<table class="pdf-table">
  <thead>
    <tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr>
  </thead>
  <tbody>
    <tr><td>Q1</td><td>$2.1M</td><td>18%</td></tr>
    <tr><td>Q2</td><td>$2.5M</td><td>25%</td></tr>
  </tbody>
</table>
<ul class="pdf-list">
  <li>Expanded into 3 new markets</li>
  <li>Launched enterprise tier</li>
</ul>

When wrapInDocument is enabled, the output includes <!DOCTYPE html>, <html>, <head> with meta tags from PDF metadata, and a <style> block with the default or custom stylesheet.

Extracted images are stored in the run's Key-Value Store and referenced in the images array with dimensions, format, and originating page number.
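
Because the output uses real heading and table elements, downstream code can read the structure with a standard HTML parser. The sketch below uses only Python's standard library and a shortened copy of the example fragment. OutlineParser is an illustrative name, and the approach assumes headings and table cells are not nested inside one another.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collect heading text and table rows from semantic HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (level, text)
        self.rows = []        # list of rows, each a list of cell texts
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in HEADING_TAGS or tag in ("th", "td"):
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer).strip()
        if tag in HEADING_TAGS:
            self.headings.append((int(tag[1]), text))
        elif self.rows:
            self.rows[-1].append(text)
        self._capture = None


fragment = """
<h1 class="heading-1">Annual Report 2024</h1>
<h2 class="heading-2">Financial Summary</h2>
<table class="pdf-table">
<thead><tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr></thead>
<tbody><tr><td>Q1</td><td>$2.1M</td><td>18%</td></tr></tbody>
</table>
"""

parser = OutlineParser()
parser.feed(fragment)
print(parser.headings)  # [(1, 'Annual Report 2024'), (2, 'Financial Summary')]
print(parser.rows)      # [['Quarter', 'Revenue', 'Growth'], ['Q1', '$2.1M', '18%']]
```

The same extraction is not practical with absolutely positioned output, where headings and cells are indistinguishable spans.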

Tips and Advanced Usage

  • Multi-column PDFs: Leave detectColumns enabled (it is on by default) to merge multi-column text in natural reading order rather than interleaving columns.
  • Custom CSS: Use customCss with stylingMode: "class" to inject your own styles. Class names like .heading-1, .paragraph, .pdf-table, .pdf-list are consistent across all documents.
  • Scanned PDFs: This actor works with text-based PDFs only. For scanned documents, run them through an OCR actor first, then convert the output.
  • Page selection: Use pageRange to convert only specific pages (e.g., "1-3,7"). You pay only for pages actually converted.
  • Batch optimization: Process up to 5,000 PDFs per run. Each PDF is processed sequentially to manage memory. For very large batches, increase the memory allocation to 4096 MB or higher.
  • CMS integration: Use stylingMode: "none" and wrapInDocument: false for WordPress, Contentful, or Strapi imports โ€” these platforms apply their own styling.
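
For collections larger than the per-run limit mentioned in the batch tip, the source list has to be split across several runs. A minimal sketch follows, assuming the 5,000-PDF limit applies to the number of entries in sources. chunk_sources is a hypothetical helper and the URLs are placeholders.

```python
from typing import Iterator

MAX_SOURCES_PER_RUN = 5000  # batch limit stated in this document


def chunk_sources(urls: list[str], size: int = MAX_SOURCES_PER_RUN) -> Iterator[dict]:
    """Yield one run-input dict per chunk of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield {"sources": [{"url": url} for url in urls[start:start + size]]}


# Placeholder URLs for illustration only.
urls = [f"https://example.com/doc-{i}.pdf" for i in range(12000)]
batches = list(chunk_sources(urls))
print([len(batch["sources"]) for batch in batches])  # [5000, 5000, 2000]
```

Each yielded dict can be submitted as the input of a separate run.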

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $3.00 per 1,000 pages converted ($0.003 per page).

| Scenario | Pages | Cost |
| --- | --- | --- |
| Single 10-page document | 10 | $0.03 |
| Product catalog (100 pages) | 100 | $0.30 |
| Legal contract batch (50 docs x 20 pages) | 1,000 | $3.00 |
| Website migration (500 PDFs x 5 pages) | 2,500 | $7.50 |
| Document archive (10K pages) | 10,000 | $30.00 |

Not billed: pages that fail to convert, scanned pages with no text, pages skipped by pageRange, empty pages, and duplicate PDFs. You only pay for successfully converted pages.

Compared to Adobe Document Services API ($0.05/page = $50/1K pages), this actor is 94% cheaper at any scale.
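
The figures in this section follow from simple per-page arithmetic, sketched below. The rates are copied from the text above ($0.003 per page, and $0.05 per page for the comparison), so treat them as assumptions and check the current store listing before budgeting. estimate_cost is an illustrative helper that counts only billable pages.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")  # $3.00 per 1,000 pages, as stated in this section
ADOBE_PER_PAGE = Decimal("0.05")   # comparison rate quoted in this section


def estimate_cost(billable_pages: int) -> Decimal:
    """Return the estimated cost in dollars for successfully converted pages."""
    if billable_pages < 0:
        raise ValueError("billable_pages must not be negative")
    return PRICE_PER_PAGE * billable_pages


def savings_percent() -> Decimal:
    """Percentage saved per page relative to the comparison rate."""
    return (1 - PRICE_PER_PAGE / ADOBE_PER_PAGE) * 100


# Website migration scenario: 500 PDFs x 5 pages.
print(f"${estimate_cost(500 * 5):.2f}")  # $7.50
print(f"{savings_percent():.0f}%")       # 94%
```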

FAQ

What makes this different from pdf2htmlEX?

pdf2htmlEX produces visual HTML: every text element is absolutely positioned with pixel coordinates, mimicking the PDF layout. While it looks identical to the original, the HTML has no semantic meaning. This actor produces real <h1>, <table>, <ul>, and <p> elements that browsers, search engines, and screen readers understand. The tradeoff is that the output may not be pixel-perfect, but it is actually useful as HTML.

Can it handle tables with merged cells?

Yes. The actor detects table structures and attempts to identify colspan/rowspan relationships. For very complex tables (deeply nested or irregularly merged), it falls back to a <pre> block with formatted text to preserve readability.

What happens with scanned/image-only PDF pages?

Scanned pages are detected automatically and produce a SCANNED_PAGE_DETECTED warning. These pages are skipped (not billed) because there is no text to convert. For scanned documents, use an OCR actor first to extract text, then run this actor on the result.

Does it support password-protected PDFs?

Yes. Provide the password via the password input field (applies to all sources) or per-source via sources[].password. If a PDF is encrypted and no password is provided, you get an ENCRYPTED_NO_PASSWORD error. If the password is wrong, you get an INVALID_PASSWORD error.

Can I customize the CSS output?

Yes. Choose between three styling modes: "class" (CSS classes + <style> block), "inline" (style attributes on each element), or "none" (pure semantic HTML). With class-based styling, you can inject custom CSS via the customCss field and toggle the built-in default styles with includeDefaultStyles.

How are images handled in the output?

When extractImages is enabled, embedded images are extracted from the PDF, converted to your chosen format (PNG, JPEG, or WebP), and stored in the run's Key-Value Store. The HTML output references images via <img> tags, and the images array in the dataset provides each image's KV store key, dimensions, format, and originating page number.

What's the maximum PDF file size?

The limit is configurable via maxFileSizeMb. The default is 100 MB and the maximum is 500 MB. For very large PDFs, increase the actor's memory allocation proportionally. A 200-page PDF with many images may need 4096 MB of memory.

Can I convert only specific pages?

Yes. Use the pageRange field with ranges like "1-5", "1,3,5", or "1-3,7,9-12". Pages outside the range are skipped and not billed. The output includes only the requested pages.
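
To see how many pages a range string selects, and therefore how many pages could be billed at most, the string can be expanded locally. This is a minimal sketch of one plausible reading of the syntax shown above. The actor's own handling of whitespace, duplicates, or out-of-bounds pages is not documented here, so parse_page_range is an assumption and not the actor's parser.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a range string such as "1-3,7,9-12" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page range: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_page_range("1-3,7,9-12")
print(selected)       # [1, 2, 3, 7, 9, 10, 11, 12]
print(len(selected))  # 8
```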
