Web Page to Markdown Extractor

Pricing

from $0.10 / 1,000 http page results

Convert public URLs into clean Markdown, text, metadata, links, images, and optional HTML for AI agents, RAG, support, and automation workflows.

Rating

0.0

(0)

Developer

Hanna Nosova

Maintained by Community

Last modified

a day ago

What does Web Page to Markdown Extractor do?

It fetches public HTTP and HTTPS URLs and returns a structured dataset item for each processed page.

✅ Clean Markdown for prompts, RAG pipelines, and agent context
✅ Plain text for search, classification, and summarization
✅ Page title and description
✅ Final URL and status code
✅ Optional links, images, and truncated HTML
✅ Fast HTTP mode plus browser rendering for JavaScript-heavy pages

Who is it for?

This actor is designed for teams building automations that need dependable page extraction.

🤖 AI-agent builders collecting context for LLM tools
🧠 RAG developers preparing web pages for embeddings
🛟 Support teams turning help-center pages into knowledge snippets
📊 Researchers collecting public article and documentation text
🧩 No-code builders connecting Apify to Make, Zapier, n8n, or Airtable
🧪 QA teams checking public pages for status, metadata, and content

Why use it?

Raw HTML is noisy. Web pages contain scripts, layout markup, navigation, cookie banners, and repeated boilerplate.

This actor gives you a cleaner content layer that is easier to pass to an LLM, save to a vector database, or compare across pages.

Key features

HTTP extraction by default for speed and cost control
Browser mode for JavaScript-rendered content
Auto mode that tries HTTP first and can fall back to browser rendering
Per-page timeout controls
Page count cap to keep runs predictable
HTML byte cap to avoid oversized outputs
Optional link extraction
Optional image extraction
Optional source HTML output
Error rows for failed pages so runs remain inspectable

What data can you extract?

Field	Description
`url`	Original input URL
`finalUrl`	Final URL after redirects or rendering
`statusCode`	HTTP/browser response status when available
`title`	Extracted page title
`description`	Meta description or article excerpt
`markdown`	Clean Markdown content
`text`	Plain text content
`links`	Normalized links and anchor text
`images`	Normalized image URLs and alt text
`metadata`	Meta tags and extraction flags
`renderModeUsed`	`http` or `browser`
`html`	Optional truncated HTML
`error`	Error message for failed pages

How much does it cost to convert web pages to Markdown?

This actor uses pay-per-event pricing.

A small start event is charged once per run.
HTTP page results are charged per page, from $0.10 per 1,000 results.
Browser-rendered results cost more because they require a real browser session.

For the lowest cost, start with renderMode: http. Use browser or auto only when the page needs JavaScript rendering.
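
As a rough illustration of how the pay-per-event pieces add up, here is a minimal sketch of a cost estimate. Only the $0.10 per 1,000 HTTP page results figure comes from the listing above; the `browser_rate_per_1000` and `start_fee` parameters are placeholders (assumptions) that default to zero and should be filled in from the actor's pricing tab.

```python
def estimate_run_cost(http_pages, browser_pages=0, *,
                      http_rate_per_1000=0.10,
                      browser_rate_per_1000=0.0,
                      start_fee=0.0):
    """Return a rough run cost in USD.

    http_rate_per_1000 is the listed "from" price for HTTP page results.
    browser_rate_per_1000 and start_fee are assumptions: they are not stated
    in this document, so they default to 0 until you supply real values.
    """
    if http_pages < 0 or browser_pages < 0:
        raise ValueError("page counts must be non-negative")
    http_cost = http_pages / 1000 * http_rate_per_1000
    browser_cost = browser_pages / 1000 * browser_rate_per_1000
    return round(start_fee + http_cost + browser_cost, 6)


# 5,000 HTTP page results at the listed rate, ignoring the start event.
print(estimate_run_cost(5000))
```

Passing real values for the browser rate makes it easy to compare an all-HTTP batch with a mixed or auto-mode batch before running it.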

Quick start

Open the actor on Apify.
Paste one or more public URLs into startUrls.
Choose renderMode.
Keep maxPages small for your first test.
Run the actor.
Export the dataset as JSON, CSV, XML, Excel, or RSS.

Input

The required input is startUrls.

{
  "startUrls": [
    {"url": "https://example.com"}
  ],
  "renderMode": "http",
  "maxPages": 10,
  "includeLinks": true,
  "includeImages": false,
  "includeHtml": false,
  "timeoutSecs": 20
}

Input fields explained

startUrls - public HTTP or HTTPS pages to process.
renderMode - choose http, browser, or auto.
maxPages - maximum number of input URLs to process.
includeLinks - include page links in the output.
includeImages - include image URLs and alt text.
includeHtml - include truncated page HTML.
waitForSelector - optional selector for browser mode.
timeoutSecs - maximum seconds per page.
maxBytes - maximum HTML bytes converted per page.
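
If you build the input programmatically, a small helper can catch obvious mistakes before a run is started. The sketch below is illustrative only: `build_input` is a hypothetical helper, not part of the actor or the Apify client, and the checks mirror the field descriptions above rather than the actor's actual validation.

```python
from urllib.parse import urlparse

# Render modes listed in the field descriptions above.
RENDER_MODES = {"http", "browser", "auto"}


def build_input(urls, render_mode="http", max_pages=10, **options):
    """Build an input dict in the documented shape (hypothetical helper).

    Extra keyword arguments (for example includeLinks=True or timeoutSecs=20)
    are passed through unchanged.
    """
    if render_mode not in RENDER_MODES:
        raise ValueError(f"renderMode must be one of {sorted(RENDER_MODES)}")
    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Only HTTP and HTTPS URLs are in scope for this actor.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
        start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    return {
        "startUrls": start_urls,
        "renderMode": render_mode,
        "maxPages": max_pages,
        **options,
    }


run_input = build_input(["https://example.com"], includeLinks=True, timeoutSecs=20)
print(run_input)
```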

Output example

{
  "url": "https://example.com/",
  "finalUrl": "https://example.com/",
  "statusCode": 200,
  "title": "Example Domain",
  "description": "This domain is for use in documentation examples without needing permission.",
  "markdown": "This domain is for use in documentation examples...",
  "text": "This domain is for use in documentation examples...",
  "links": [
    {"url": "https://iana.org/domains/example", "text": "Learn more"}
  ],
  "images": [],
  "metadata": {"viewport": "width=device-width, initial-scale=1"},
  "renderModeUsed": "http",
  "error": null
}
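
One common use of an item like the one above is to turn it into a context block for an LLM prompt. The following is a minimal sketch under stated assumptions: `to_prompt_context` and its `max_chars` cutoff are hypothetical, and the field names are taken from the output example.

```python
def to_prompt_context(item, max_chars=4000):
    """Format one dataset item as a prompt-ready text block.

    Returns None for error rows. max_chars is an arbitrary illustrative cap,
    not a limit imposed by the actor.
    """
    if item.get("error"):
        return None
    title = item.get("title") or "(untitled)"
    source = item.get("finalUrl") or item.get("url")
    # Prefer Markdown, fall back to plain text when Markdown is missing.
    body = (item.get("markdown") or item.get("text") or "").strip()
    if len(body) > max_chars:
        body = body[:max_chars].rstrip() + "\n[truncated]"
    return f"# {title}\nSource: {source}\n\n{body}"


item = {
    "url": "https://example.com/",
    "finalUrl": "https://example.com/",
    "title": "Example Domain",
    "markdown": "This domain is for use in documentation examples...",
    "error": None,
}
print(to_prompt_context(item))
```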

Render modes

HTTP

Use HTTP mode for normal articles, documentation, blog posts, help pages, and static websites.

HTTP mode is fastest and usually cheapest.

Browser

Use browser mode when the page needs JavaScript before content appears.

Browser mode is useful for client-rendered sites but costs more and takes longer.

Auto

Auto mode tries HTTP extraction first and can use browser rendering when the page appears to be JavaScript-heavy or nearly empty after HTTP extraction.
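
The HTTP-first fallback idea can be sketched as follows. This is not the actor's implementation: the actual heuristic is not documented here, so the `min_text_chars` threshold is a made-up stand-in for "nearly empty", and `fetch_http` and `fetch_browser` are hypothetical callables that you would supply.

```python
def extract_auto(url, fetch_http, fetch_browser, min_text_chars=200):
    """Try a cheap HTTP fetch first and fall back to a browser fetch.

    fetch_http and fetch_browser are caller-supplied functions that take a
    URL and return a dict with at least a "text" key. min_text_chars is an
    assumed threshold for deciding that the HTTP result is too thin.
    """
    result = fetch_http(url)
    text = (result.get("text") or "").strip()
    if len(text) >= min_text_chars:
        return {**result, "renderModeUsed": "http"}
    # Too little text: the page probably renders its content with JavaScript.
    return {**fetch_browser(url), "renderModeUsed": "browser"}


# Stub fetchers standing in for real HTTP and browser extraction.
def thin_http(url):
    return {"url": url, "text": "Loading..."}


def full_browser(url):
    return {"url": url, "text": "Rendered article body " * 20}


print(extract_auto("https://example.com", thin_http, full_browser)["renderModeUsed"])
```

Because the browser path only runs when the HTTP result looks thin, a batch of mostly static pages stays close to HTTP-mode cost.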

Tips for best results

Start with one URL before running a large batch.
Use HTTP mode unless you know the page needs JavaScript.
Set includeHtml only when you need it.
Set includeImages only when image URLs matter.
Use waitForSelector for browser pages that load content slowly.
Keep timeoutSecs realistic; long timeouts can raise costs.
Use maxBytes to control very large pages.

Common workflows

Turn public documentation pages into Markdown for LLM prompts.
Extract help-center pages into a support knowledge base.
Convert public product pages into readable summaries.
Collect article text for monitoring or research.
Build a URL enrichment step in a no-code automation.
Prepare website content for embeddings and vector search.

Integrations

You can connect this actor to:

Make scenarios for URL enrichment
Zapier workflows for content handoff
n8n automations for AI pipelines
Airtable bases for research tracking
Google Sheets exports for editorial review
Apify webhooks for event-driven processing
Vector databases after dataset export
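
For the vector database case, exported Markdown usually needs to be split into chunks before embedding. Here is a minimal sketch of one way to do that, assuming paragraph boundaries are blank lines; `chunk_markdown` and its `max_chars` limit are illustrative choices, not something the actor provides.

```python
def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are packed greedily into chunks.
    A single paragraph longer than max_chars is hard-split into pieces.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: flush what has been collected so far.
        if current:
            chunks.append(current)
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph with more words."
for chunk in chunk_markdown(sample, max_chars=30):
    print(repr(chunk))
```

Each chunk can then be embedded and stored together with the item's `finalUrl` and `title` so that search results can link back to the source page.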

API usage with Node.js

import { ApifyClient } from 'apify-client';

// Read the API token from the environment instead of hard-coding it.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const run = await client.actor('fetch_cat/web-page-to-markdown-extractor').call({
    startUrls: [{ url: 'https://example.com' }],
    renderMode: 'http',
    maxPages: 1,
});
console.log(run.defaultDatasetId);

API usage with Python

from apify_client import ApifyClient

# Replace 'APIFY_TOKEN' with your Apify API token.
client = ApifyClient('APIFY_TOKEN')
run = client.actor('fetch_cat/web-page-to-markdown-extractor').call(run_input={
    'startUrls': [{'url': 'https://example.com'}],
    'renderMode': 'http',
    'maxPages': 1,
})
print(run['defaultDatasetId'])
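
The run above only prints the dataset ID. To read the extracted pages, you can iterate over the dataset, as in the sketch below. `collect_items` is a hypothetical helper; it relies on the `dataset(...).iterate_items()` call from the `apify-client` package and takes the client as a parameter so it can be exercised with a stub.

```python
def collect_items(client, dataset_id, limit=None):
    """Read items from a dataset into a list.

    client is expected to be an apify_client.ApifyClient (or anything with a
    compatible dataset(...).iterate_items() interface). limit is an optional
    cap on the number of items returned.
    """
    items = []
    for item in client.dataset(dataset_id).iterate_items():
        items.append(item)
        if limit is not None and len(items) >= limit:
            break
    return items


# Usage with the run from the example above (requires a valid token):
# items = collect_items(client, run['defaultDatasetId'])
# for item in items:
#     print(item['url'], item.get('statusCode'), item.get('error'))
```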

API usage with cURL

curl -X POST 'https://api.apify.com/v2/acts/fetch_cat~web-page-to-markdown-extractor/runs?token=APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"startUrls":[{"url":"https://example.com"}],"renderMode":"http","maxPages":1}'

MCP usage

Use this actor from MCP-compatible AI tools through Apify MCP.

MCP server URL pattern:

https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor

Add it to Claude Code with:

claude mcp add apify-web-page-markdown "https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor"

Example MCP JSON configuration:

{
  "mcpServers": {
    "apify-web-page-markdown": {
      "url": "https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor"
    }
  }
}

Example prompts:

"Convert this public URL into Markdown and summarize it."
"Extract the links from this documentation page."
"Fetch these three article URLs and prepare text for a knowledge base."

Limits and scope

This actor processes public pages only.

It does not log in, accept private cookies, submit forms, perform social engagement, or automate arbitrary browser tasks.

Some websites block automated access. In those cases the output may contain a status code or an error message.

FAQ

Can this actor access pages behind a login?

No. It is designed for public URLs only and does not accept private cookies or account sessions.

Should I use HTTP or browser mode?

Use HTTP first. Switch to browser mode only for pages where important content is rendered by JavaScript.

Troubleshooting

Why is the Markdown empty?

The page may require JavaScript rendering, block automated requests, or contain mostly media. Try renderMode: browser with a small maxPages value.

Why did browser mode cost more?

Browser rendering starts a real browser and waits for page content. Use it only for pages that need JavaScript.

Why do I see an error row instead of a failed run?

The actor saves error rows so you can inspect which URLs failed while still getting data for the pages that worked.
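
Because failures arrive as rows rather than as a failed run, a follow-up run can be built directly from the dataset. The sketch below collects URLs that errored or came back with empty Markdown and prepares a browser-mode input for them. `build_retry_input` is a hypothetical helper, and whether a browser retry actually helps depends on why the page failed.

```python
def build_retry_input(items, max_pages=10):
    """Build a browser-mode input for pages that failed or came back empty.

    Rows already rendered in browser mode are skipped, since retrying them
    the same way is unlikely to change the result. Returns None when there
    is nothing to retry.
    """
    retry_urls = []
    for item in items:
        url = item.get("url")
        empty = not (item.get("markdown") or "").strip()
        needs_retry = bool(item.get("error")) or empty
        if (needs_retry and url and url not in retry_urls
                and item.get("renderModeUsed") != "browser"):
            retry_urls.append(url)
    if not retry_urls:
        return None
    retry_urls = retry_urls[:max_pages]
    return {
        "startUrls": [{"url": url} for url in retry_urls],
        "renderMode": "browser",
        "maxPages": len(retry_urls),
    }


rows = [
    {"url": "https://example.com/", "markdown": "Hello", "renderModeUsed": "http", "error": None},
    {"url": "https://example.org/app", "markdown": "", "renderModeUsed": "http", "error": None},
    {"url": "https://example.net/", "error": "Request timed out"},
]
print(build_retry_input(rows))
```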

Legality

Use this actor only for public web pages you are allowed to access and process. Respect website terms, copyright, robots policies where applicable, privacy laws, and platform rules.

Do not use it to access private account data, bypass authentication, or collect sensitive personal information.

Related scrapers

Other actors from the same developer can complement this utility once published:

Website Screenshot Generator: capture visual page snapshots
Source-specific scrapers: use when you need deeper structured data from a single platform
Search and discovery actors: use to find URLs before converting them to Markdown

Changelog

0.1

Initial public-URL to Markdown extraction
HTTP, browser, and auto render modes
Links, images, metadata, text, Markdown, optional HTML
Capped page count, timeout, and HTML size controls

Support

If a URL does not extract as expected, include the input URL, render mode, and a sample dataset item when reporting the issue.

Quick reference

This actor is intentionally scoped as a content extraction utility, not a full web automation agent. Use it when you need flexible Markdown from arbitrary public URLs; use source-specific actors when you need normalized business entities from a known website.

Keep first runs small and review the output before scaling up.
Prefer HTTP mode and use browser mode only when a page needs it.
Enable includeImages only for media discovery and includeHtml only for debugging.
Use maxPages to cap batches, timeoutSecs to control slow pages, maxBytes to control output size, and waitForSelector for browser pages.
Check error for failures, statusCode for blocked pages, renderModeUsed for cost analysis, and metadata.truncated for very large pages.
Export JSON for AI workflows and CSV for spreadsheets.
Use webhooks for automation, MCP for agent workflows, and API calls for production pipelines.
Process public pages only: avoid private data, login-only pages, form submission workflows, and social engagement automation, and respect site rules.

Results are stored in the run's default dataset, one item per URL, and can be downloaded from Apify Console or connected to your AI pipeline. Successful rows contain Markdown and text; error rows remain useful for audits. The dataset schema is designed for table preview, and the API examples use the same input shape as Console.

URL: https://apify.com/fetch_cat/web-page-to-markdown-extractor