Pricing
from $0.10 / 1,000 http page results
Web Page to Markdown Extractor
Convert public URLs into clean Markdown, text, metadata, links, images, and optional HTML for AI agents, RAG, support, and automation workflows.
Convert public web pages into clean Markdown, readable text, metadata, links, images, and optional HTML for AI agents, research workflows, support automation, and no-code tools.
Use this actor when you have a list of public URLs and need LLM-ready page content without writing your own scraper, browser script, or HTML cleanup pipeline.
What does Web Page to Markdown Extractor do?
It fetches public HTTP and HTTPS URLs and returns a structured dataset item for each processed page.
- Clean Markdown for prompts, RAG pipelines, and agent context
- Plain text for search, classification, and summarization
- Page title and description
- Final URL and status code
- Optional links, images, and truncated HTML
- Fast HTTP mode plus browser rendering for JavaScript-heavy pages
Who is it for?
This actor is designed for practical automation teams that need dependable page extraction.
- AI-agent builders collecting context for LLM tools
- RAG developers preparing web pages for embeddings
- Support teams turning help-center pages into knowledge snippets
- Researchers collecting public article and documentation text
- No-code builders connecting Apify to Make, Zapier, n8n, or Airtable
- QA teams checking public pages for status, metadata, and content
Why use it?
Raw HTML is noisy. Web pages contain scripts, layout markup, navigation, cookie banners, and repeated boilerplate.
This actor gives you a cleaner content layer that is easier to pass to an LLM, save to a vector database, or compare across pages.
Key features
- HTTP extraction by default for speed and cost control
- Browser mode for JavaScript-rendered content
- Auto mode that tries HTTP first and can fall back to browser rendering
- Per-page timeout controls
- Page count cap to keep runs predictable
- HTML byte cap to avoid oversized outputs
- Optional link extraction
- Optional image extraction
- Optional source HTML output
- Error rows for failed pages so runs remain inspectable
What data can you extract?
| Field | Description |
|---|---|
| url | Original input URL |
| finalUrl | Final URL after redirects or rendering |
| statusCode | HTTP/browser response status when available |
| title | Extracted page title |
| description | Meta description or article excerpt |
| markdown | Clean Markdown content |
| text | Plain text content |
| links | Normalized links and anchor text |
| images | Normalized image URLs and alt text |
| metadata | Meta tags and extraction flags (for example, truncated) |
| renderModeUsed | http or browser |
| html | Optional truncated HTML |
| error | Error message for failed pages |
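If you consume the dataset from Python, it can help to write the field table down as a type. The sketch below is one way to do that. The field names come from the table; the Python types, the `Link` shape (taken from the output example), and the `missing_fields` helper are assumptions for illustration, not a published schema.

```python
from typing import Any, Optional, TypedDict


class Link(TypedDict):
    url: str
    text: str


class PageResult(TypedDict, total=False):
    """One dataset item; field names follow the table, types are assumed."""
    url: str
    finalUrl: str
    statusCode: Optional[int]
    title: str
    description: str
    markdown: str
    text: str
    links: list[Link]
    images: list[dict[str, Any]]  # item shape is not documented, so left open
    metadata: dict[str, Any]
    renderModeUsed: str  # "http" or "browser"
    html: str  # expected only when includeHtml is enabled
    error: Optional[str]


def missing_fields(item: dict[str, Any]) -> list[str]:
    """Return documented field names that are absent from a dataset item."""
    return [name for name in PageResult.__annotations__ if name not in item]
```

Running `missing_fields` on a sample item is a quick way to notice when an optional field such as `html` was not requested.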
How much does it cost to convert web pages to Markdown?
This actor uses pay-per-event pricing.
- A small start event is charged once per run.
- HTTP page results are charged per page, from $0.10 per 1,000 results.
- Browser-rendered results cost more because they require a real browser session.
For the lowest cost, start with renderMode: http. Use browser or auto only when the page needs JavaScript rendering.
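To budget a batch before running it, you can add up the three charge types yourself. This is a rough sketch: only the HTTP rate ($0.10 per 1,000 results) is stated on this page, so the browser rate and the start fee are required arguments that you fill in from the actor's pricing tab. `estimate_run_cost` is a hypothetical helper, not part of any Apify library.

```python
def estimate_run_cost(
    http_pages: int,
    browser_pages: int,
    *,
    browser_per_1000: float,  # not stated in this README; look it up
    start_fee: float,         # not stated in this README; look it up
    http_per_1000: float = 0.10,  # "from $0.10 / 1,000" as listed above
) -> float:
    """Estimate the USD cost of one run under pay-per-event pricing."""
    if http_pages < 0 or browser_pages < 0:
        raise ValueError("page counts must be non-negative")
    http_cost = http_pages / 1000 * http_per_1000
    browser_cost = browser_pages / 1000 * browser_per_1000
    return start_fee + http_cost + browser_cost
```

For example, `estimate_run_cost(2000, 0, browser_per_1000=0.0, start_fee=0.0)` covers only the HTTP portion of a 2,000-page run.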
Quick start
- Open the actor on Apify.
- Paste one or more public URLs into startUrls.
- Choose renderMode.
- Keep maxPages small for your first test.
- Run the actor.
- Export the dataset as JSON, CSV, XML, Excel, or RSS.
Input
The required input is startUrls.
{
  "startUrls": [{ "url": "https://example.com" }],
  "renderMode": "http",
  "maxPages": 10,
  "includeLinks": true,
  "includeImages": false,
  "includeHtml": false,
  "timeoutSecs": 20
}
Input fields explained
- startUrls - public HTTP or HTTPS pages to process.
- renderMode - choose http, browser, or auto.
- maxPages - maximum number of input URLs to process.
- includeLinks - include page links in the output.
- includeImages - include image URLs and alt text.
- includeHtml - include truncated page HTML.
- waitForSelector - optional selector for browser mode.
- timeoutSecs - maximum seconds per page.
- maxBytes - maximum HTML bytes converted per page.
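If you build the input in code, a small check before starting the run can catch typos early. The sketch below assembles an input object from a list of URLs. `build_input` and `RENDER_MODES` are hypothetical names, and the validation rules are assumptions based on the field descriptions above; the actor may apply its own checks.

```python
from urllib.parse import urlparse

RENDER_MODES = {"http", "browser", "auto"}


def build_input(urls, render_mode="http", max_pages=10, **options):
    """Build an actor input dict, rejecting obviously invalid values."""
    if render_mode not in RENDER_MODES:
        raise ValueError(f"renderMode must be one of {sorted(RENDER_MODES)}")
    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Only public HTTP and HTTPS pages are in scope for this actor.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
        start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("startUrls is required")
    # Extra keyword arguments (includeLinks, timeoutSecs, ...) pass through as-is.
    return {
        "startUrls": start_urls,
        "renderMode": render_mode,
        "maxPages": max_pages,
        **options,
    }
```

For example, `build_input(["https://example.com"], includeLinks=True, timeoutSecs=20)` produces a dict in the same shape as the JSON input shown earlier.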
Output example
{
  "url": "https://example.com/",
  "finalUrl": "https://example.com/",
  "statusCode": 200,
  "title": "Example Domain",
  "description": "This domain is for use in documentation examples without needing permission.",
  "markdown": "This domain is for use in documentation examples...",
  "text": "This domain is for use in documentation examples...",
  "links": [{ "url": "https://iana.org/domains/example", "text": "Learn more" }],
  "images": [],
  "metadata": { "viewport": "width=device-width, initial-scale=1" },
  "renderModeUsed": "http",
  "error": null
}
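Because failed pages are saved as rows with an error value rather than failing the run, a first processing step is usually to separate the two groups. A minimal sketch, assuming items are plain dicts shaped like the output example; `split_results` and `status_counts` are hypothetical helper names.

```python
from collections import Counter


def split_results(items):
    """Separate dataset items into successful rows and error rows."""
    ok, failed = [], []
    for item in items:
        # A row counts as failed when its error field is set and non-empty.
        (failed if item.get("error") else ok).append(item)
    return ok, failed


def status_counts(items):
    """Count rows per statusCode, which helps to spot blocked pages."""
    return Counter(item.get("statusCode") for item in items)
```

Rows without a status code are counted under `None`, so they stay visible in the summary.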
Render modes
HTTP
Use HTTP mode for normal articles, documentation, blog posts, help pages, and static websites.
HTTP mode is the fastest and usually the cheapest option.
Browser
Use browser mode when the page needs JavaScript before content appears.
Browser mode is useful for client-rendered sites but costs more and takes longer.
Auto
Auto mode tries HTTP extraction first and can use browser rendering when the page appears to be JavaScript-heavy or nearly empty after HTTP extraction.
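The exact rule the actor uses to decide on a fallback is not documented here. The sketch below shows one plausible heuristic, purely as an illustration of the idea: treat a page as needing a browser when the extracted text is very short or when most of the HTML is script content. The function name and both thresholds are assumptions.

```python
import re

SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.S | re.I)


def looks_like_it_needs_browser(
    html: str,
    extracted_text: str,
    min_chars: int = 200,           # assumed threshold for "nearly empty"
    max_script_ratio: float = 0.5,  # assumed threshold for "JavaScript-heavy"
) -> bool:
    """Guess whether HTTP extraction missed JavaScript-rendered content."""
    if len(extracted_text.strip()) < min_chars:
        return True
    # Share of the raw HTML that sits inside <script> elements.
    script_chars = sum(len(block) for block in SCRIPT_RE.findall(html))
    return bool(html) and script_chars / len(html) > max_script_ratio
```

A check like this can also be run on your own HTTP results to decide which URLs are worth a second pass in browser mode.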
Tips for best results
- Start with one URL before running a large batch.
- Use HTTP mode unless you know the page needs JavaScript.
- Set includeHtml only when you need it.
- Set includeImages only when image URLs matter.
- Use waitForSelector for browser pages that load content slowly.
- Keep timeoutSecs realistic; long timeouts can raise costs.
- Use maxBytes to control very large pages.
Common workflows
- Turn public documentation pages into Markdown for LLM prompts.
- Extract help-center pages into a support knowledge base.
- Convert public product pages into readable summaries.
- Collect article text for monitoring or research.
- Build a URL enrichment step in a no-code automation.
- Prepare website content for embeddings and vector search.
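For the embeddings workflow, the Markdown field usually needs to be split into smaller pieces before it is embedded. The sketch below groups paragraphs into chunks under a character budget. `chunk_markdown` is a hypothetical helper and the 1,000-character default is an arbitrary assumption; pick a size that suits your embedding model.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated blocks into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized block is kept whole rather than cut mid-sentence.
            current = block
    if current:
        chunks.append(current)
    return chunks
```

Storing each chunk together with the item's finalUrl keeps search results traceable to their source page.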
Integrations
You can connect this actor to:
- Make scenarios for URL enrichment
- Zapier workflows for content handoff
- n8n automations for AI pipelines
- Airtable bases for research tracking
- Google Sheets exports for editorial review
- Apify webhooks for event-driven processing
- Vector databases after dataset export
API usage with Node.js
import { ApifyClient } from 'apify-client';

// Read the API token from the environment instead of hard-coding it.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('fetch_cat/web-page-to-markdown-extractor').call({
    startUrls: [{ url: 'https://example.com' }],
    renderMode: 'http',
    maxPages: 1,
});

console.log(run.defaultDatasetId);
API usage with Python
from apify_client import ApifyClient

# Replace 'APIFY_TOKEN' with your Apify API token.
client = ApifyClient('APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('fetch_cat/web-page-to-markdown-extractor').call(
    run_input={
        'startUrls': [{'url': 'https://example.com'}],
        'renderMode': 'http',
        'maxPages': 1,
    }
)

print(run['defaultDatasetId'])
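The Python example stops at printing the dataset ID. To read the extracted pages, you can pass the same client and run object to a small helper like the one sketched here. It relies on the `dataset(...)` and `iterate_items()` methods of the Python `apify-client` package; `collect_items` itself is a hypothetical name.

```python
def collect_items(client, run):
    """Read every item from the run's default dataset into a list."""
    dataset = client.dataset(run["defaultDatasetId"])
    # iterate_items() pages through the dataset and yields one dict per row.
    return list(dataset.iterate_items())
```

With the client and run from the Python example, `collect_items(client, run)` returns one dict per processed URL; for large batches, iterate instead of building a list.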
API usage with cURL
curl -X POST 'https://api.apify.com/v2/acts/fetch_cat~web-page-to-markdown-extractor/runs?token=APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"startUrls":[{"url":"https://example.com"}],"renderMode":"http","maxPages":1}'
MCP usage
Use this actor from MCP-compatible AI tools through Apify MCP.
MCP server URL pattern:
https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor
Add it to Claude Code with:
$ claude mcp add apify-web-page-markdown https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor
Example MCP JSON configuration:
{
  "mcpServers": {
    "apify-web-page-markdown": {
      "url": "https://mcp.apify.com/?tools=fetch_cat/web-page-to-markdown-extractor"
    }
  }
}
Example prompts:
- "Convert this public URL into Markdown and summarize it."
- "Extract the links from this documentation page."
- "Fetch these three article URLs and prepare text for a knowledge base."
Limits and scope
This actor processes public pages only.
It does not log in, accept private cookies, submit forms, perform social engagement, or automate arbitrary browser tasks.
Some websites block automated access. In those cases the output may contain a status code or an error message.
FAQ
Can this actor access pages behind a login?
No. It is designed for public URLs only and does not accept private cookies or account sessions.
Should I use HTTP or browser mode?
Use HTTP first. Switch to browser mode only for pages where important content is rendered by JavaScript.
Troubleshooting
Why is the Markdown empty?
The page may require JavaScript rendering, block automated requests, or contain mostly media. Try renderMode: browser with a small maxPages value.
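One way to follow this advice on a finished HTTP run is to collect the URLs that came back empty and prepare a small browser-mode retry. A minimal sketch, assuming items are dicts shaped like the output example; `build_browser_retry_input` is a hypothetical helper and the default cap of 5 pages is an arbitrary choice.

```python
def build_browser_retry_input(items, max_pages=5):
    """Build a browser-mode input for HTTP rows whose Markdown came back empty."""
    urls = [
        item["url"]
        for item in items
        if item.get("renderModeUsed") == "http"
        and not (item.get("markdown") or "").strip()
    ]
    if not urls:
        return None  # nothing to retry
    return {
        "startUrls": [{"url": url} for url in urls[:max_pages]],
        "renderMode": "browser",
        "maxPages": max_pages,
    }
```

Keeping the cap small limits browser charges while you confirm that rendering actually fixes the empty pages.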
Why did browser mode cost more?
Browser rendering starts a real browser and waits for page content. Use it only for pages that need JavaScript.
Why do I see an error row instead of a failed run?
The actor saves error rows so you can inspect which URLs failed while still getting data for the pages that worked.
Legality
Use this actor only for public web pages you are allowed to access and process. Respect website terms, copyright, robots policies where applicable, privacy laws, and platform rules.
Do not use it to access private account data, bypass authentication, or collect sensitive personal information.
Related scrapers
Other Anna actors can complement this utility once published:
- Website Screenshot Generator: capture visual page snapshots
- Source-specific scrapers: use when you need deeper structured data from a single platform
- Search and discovery actors: use to find URLs before converting them to Markdown
Changelog
0.1
- Initial public-URL to Markdown extraction
- HTTP, browser, and auto render modes
- Links, images, metadata, text, Markdown, optional HTML
- Capped page count, timeout, and HTML size controls
Support
If a URL does not extract as expected, include the input URL, render mode, and a sample dataset item when reporting the issue.
