Web to Markdown for LLMs

Pricing

from $3.00 / 1,000 pages converted to markdown

Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.

Rating

0.0

(0)

Developer

George Kioko

Maintained by Community

Actor stats

Last modified: 2 months ago

Why This Actor?

LLMs choke on raw HTML. Scripts, styles, navigation, ads — all noise that burns tokens and confuses models. This actor strips all that away and returns clean markdown that your AI can actually reason about.

Raw HTML: 67,841 tokens → costs $0.068 per page (GPT-4)
Markdown: 6,176 tokens → costs $0.006 per page (GPT-4)
↑ 91% savings
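The savings figure follows directly from the two token counts above. The sketch below reproduces that arithmetic; the function name `token_savings_percent` is hypothetical and is not part of the actor.

```python
def token_savings_percent(html_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens removed when going from raw HTML to markdown."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return (1 - markdown_tokens / html_tokens) * 100


# Token counts quoted in the comparison above.
savings = token_savings_percent(67_841, 6_176)
print(f"{savings:.0f}% savings")  # prints: 91% savings
```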

How It Works

┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Any URL  │────▶│ Puppeteer       │────▶│ Clean        │
│          │     │ renders page    │     │ Markdown     │
└──────────┘     │ (JavaScript,    │     │ + metadata   │
                 │ SPAs, dynamic)  │     │ + stats      │
                 └─────────────────┘     └──────────────┘
                          │
                 ┌────────┴────────┐
                 │ Cheerio parses  │
                 │ Turndown        │
                 │ converts to MD  │
                 └─────────────────┘
Noise removed: scripts, styles, nav, footer, ads, popups, modals
Kept: headings, paragraphs, lists, tables, links, images, code blocks
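The actor itself runs on Puppeteer, Cheerio, and Turndown (Node.js), and its source is not shown here. As a rough illustration of the strip-then-convert idea only, here is a toy analogue built on Python's standard-library `html.parser`. The tag sets, the class name `MarkdownSketch`, and the function `html_to_markdown` are all assumptions made for this sketch. It handles just a few tags and does no JavaScript rendering.

```python
from html.parser import HTMLParser

# Assumed subsets of the "noise removed" and "kept" lists above.
NOISE_TAGS = {"script", "style", "nav", "footer"}
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownSketch(HTMLParser):
    """Toy converter: drops noise elements, keeps headings, paragraphs, list items."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0  # > 0 while inside a noise element
        self.prefix = ""      # markdown prefix of the block being read
        self.buffer = []      # text pieces of the block being read
        self.blocks = []      # finished markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif self.noise_depth == 0 and tag in BLOCK_PREFIX:
            self.prefix = BLOCK_PREFIX[tag]
            self.buffer = []  # discard stray text seen before this block

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif self.noise_depth == 0 and tag in BLOCK_PREFIX:
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.blocks.append(self.prefix + text)
            self.buffer = []
            self.prefix = ""

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


page = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><nav><ul><li>Home</li></ul></nav><h1>Title</h1>"
    "<p>Hello <b>world</b>.</p><footer>(c) 2026</footer></body></html>"
)
print(html_to_markdown(page))
```

Only the heading and the paragraph survive; the stylesheet, script, navigation list, and footer are skipped because they sit inside noise elements.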

What Data Does It Extract?

Field	Description
`markdown`	Clean, structured markdown content
`title`	Page title
`description`	Meta description
`author`	Article author (when available)
`publishDate`	Publication date
`language`	Page language
`wordCount`	Total words in markdown
`links`	All links found (text + href)
`images`	All images (src + alt text)
`tableOfContents`	Heading structure for navigation
`stats.htmlTokensEstimate`	Original HTML token count
`stats.markdownTokensEstimate`	Markdown token count
`stats.tokenSavingsPercent`	Percentage of tokens saved
`stats.renderTimeMs`	Page render time

Use Cases

RAG Pipelines — Feed clean web content into vector databases (Pinecone, Weaviate, Chroma). 85% fewer tokens = 85% lower embedding costs.
AI Agent Tool Use — Give your agent a "read the web" tool. Pass any URL, get structured content back. Works with LangChain, LlamaIndex, CrewAI, AutoGen.
Content Repurposing — Convert any article/blog into markdown for your CMS, newsletter, or documentation site.
Training Data — Build LLM training datasets from web content. Clean markdown = higher quality training data.
Competitive Intelligence — Monitor competitor websites and extract structured content for analysis.

Input Parameters

Parameter	Type	Required	Default	Description
`url`	string	Yes*	—	Single URL to convert
`urls`	string[]	Yes*	—	Array of URLs for batch processing
`includeLinks`	boolean	No	true	Include extracted links in output
`includeImages`	boolean	No	true	Include image URLs in output
`includeToc`	boolean	No	false	Include table of contents
`waitFor`	number	No	3000	Wait time (ms) for JS rendering

*Provide either url or urls
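A small helper can assemble an input object from the parameters in the table. This is a sketch under stated assumptions: the function `build_input` is hypothetical, and treating `url` and `urls` as mutually exclusive is an interpretation of the note above, not documented behavior. The key names and defaults come from the table.

```python
from typing import List, Optional


def build_input(
    url: Optional[str] = None,
    urls: Optional[List[str]] = None,
    include_links: bool = True,
    include_images: bool = True,
    include_toc: bool = False,
    wait_for_ms: int = 3000,
) -> dict:
    """Assemble an input object using the parameter names from the table above."""
    # Assumption: exactly one of url / urls should be supplied.
    if (url is None) == (urls is None):
        raise ValueError("provide exactly one of url or urls")
    if wait_for_ms < 0:
        raise ValueError("wait_for_ms must be non-negative")

    payload: dict = {"url": url} if url is not None else {"urls": list(urls)}
    payload.update(
        includeLinks=include_links,
        includeImages=include_images,
        includeToc=include_toc,
        waitFor=wait_for_ms,
    )
    return payload


print(build_input(urls=["https://example.com/a", "https://example.com/b"], include_toc=True))
```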

Output Example

{
  "url": "https://blog.example.com/article",
  "sourceUrl": "https://blog.example.com/article",
  "title": "How AI Agents Read the Web",
  "description": "A guide to building web-reading capabilities for AI agents",
  "author": "Jane Doe",
  "publishDate": "2026-03-25T10:00:00.000Z",
  "language": "en",
  "markdown": "# How AI Agents Read the Web\n\n**Author:** Jane Doe\n**Published:** 2026-03-25\n\n---\n\nAI agents need structured data to reason about web content...",
  "wordCount": 2450,
  "links": [
    {"text": "LangChain docs", "href": "https://docs.langchain.com"},
    {"text": "Vector databases", "href": "https://www.pinecone.io"}
  ],
  "images": [
    {"src": "https://blog.example.com/diagram.png", "alt": "Architecture diagram"}
  ],
  "tableOfContents": [
    {"level": 1, "text": "How AI Agents Read the Web"},
    {"level": 2, "text": "The Problem with Raw HTML"},
    {"level": 2, "text": "The Markdown Solution"}
  ],
  "stats": {
    "htmlSize": 245000,
    "markdownSize": 12400,
    "htmlTokensEstimate": 61250,
    "markdownTokensEstimate": 3100,
    "tokenSavingsPercent": 95,
    "renderTimeMs": 4200
  }
}
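One way to use the structured fields is to turn `tableOfContents` into an indented outline. The helper `render_toc` below is hypothetical. It assumes each entry carries an integer `level` starting at 1 and a `text` string, as in the example above.

```python
def render_toc(entries: list) -> str:
    """Render tableOfContents entries as a nested markdown bullet list."""
    lines = []
    for entry in entries:
        indent = "  " * (entry["level"] - 1)  # two spaces per heading level
        lines.append(f"{indent}- {entry['text']}")
    return "\n".join(lines)


# tableOfContents copied from the output example above.
toc = [
    {"level": 1, "text": "How AI Agents Read the Web"},
    {"level": 2, "text": "The Problem with Raw HTML"},
    {"level": 2, "text": "The Markdown Solution"},
]
print(render_toc(toc))
```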

Performance Benchmarks

Tested across 60 diverse websites:

Site Type	Success Rate	Avg Token Savings	Avg Time
News (BBC, CNN, NYT)	100%	94%	16s
Blogs/Articles	100%	91%	8s
Documentation	100%	92%	5s
Company websites	100%	100%	12s
Wikipedia	100%	73%	7s
E-commerce	80%	90%	10s
Heavy SPAs	60%	54%	6s
Overall	80%	85%	10s

Comparison vs Firecrawl

Feature	This Actor	Firecrawl
Token savings	85% avg	67% avg
Price	$0.003/page	$0.0008-0.005/page
JS rendering	Puppeteer (full)	Playwright
Free tier	Apify free plan	500 credits
Open source	Yes (Apify)	Partial
Batch processing	Yes (urls array)	Yes
Standby API	Yes (instant)	Yes

Standby API (Instant Response)

This actor supports Apify Standby mode for instant HTTP responses:

# Health check
curl "https://george-the-developer--web-to-markdown-llm.apify.actor/" \
  -H "Authorization: Bearer YOUR_TOKEN"

# Convert a URL
curl "https://george-the-developer--web-to-markdown-llm.apify.actor/convert?url=https://example.com" \
  -H "Authorization: Bearer YOUR_TOKEN"
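The same call can be made from Python with only the standard library. This is a sketch, not an official client. The function names are hypothetical, and two things are assumed: that the endpoint accepts a percent-encoded `url` query parameter, and that the response body is JSON shaped like the Output Example.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the curl examples above.
BASE_URL = "https://george-the-developer--web-to-markdown-llm.apify.actor"


def build_convert_request(page_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the /convert endpoint."""
    query = urllib.parse.urlencode({"url": page_url})  # percent-encodes the target URL
    return urllib.request.Request(
        f"{BASE_URL}/convert?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def convert(page_url: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the body, assuming it is JSON."""
    request = build_convert_request(page_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_convert_request("https://example.com", "YOUR_TOKEN")
print(request.full_url)
```

Building the request separately from sending it makes the URL and headers easy to inspect before any network traffic happens.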

Pricing

Pay Per Event: $0.003 per page converted

Volume	Cost	Savings vs Firecrawl
100 pages	$0.30	—
1,000 pages	$3.00	—
10,000 pages	$30.00	—

No monthly subscription. Pay only for what you use.
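The volume table is a straight multiplication by the per-page price. The sketch below uses `Decimal` to avoid floating-point rounding on currency; the function name `conversion_cost` is hypothetical.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")  # USD, from "Pay Per Event" above


def conversion_cost(pages: int) -> Decimal:
    """Total cost in USD for converting the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages


for pages in (100, 1_000, 10_000):
    print(f"{pages:>6} pages: ${conversion_cost(pages):.2f}")
```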

Integrations

Works with any tool that can call an HTTP API:

LangChain: Use as a custom tool in your agent chain
LlamaIndex: Feed markdown into document loaders
n8n / Make: HTTP request node → markdown output
Python: requests.get() → JSON with markdown
Node.js: fetch() → structured response

FAQ

Q: Does it handle JavaScript-rendered pages? A: Yes. Uses Puppeteer with full Chrome to render JavaScript, SPAs, and dynamic content.

Q: What about pages behind logins? A: Currently extracts public content only. Authenticated scraping is on the roadmap.

Q: How accurate is the token estimate? A: Uses the ~4 chars/token heuristic for English text. Actual token counts may vary by model.

Q: Can I process multiple URLs at once? A: Yes. Pass a urls array to process multiple pages in batch mode.
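The 4 chars/token heuristic from the token-estimate answer can be checked against the `stats` block in the Output Example. This sketch assumes the size fields are character counts and that the estimate uses integer division. Neither detail is stated in the source, and `estimate_tokens` is a hypothetical name.

```python
CHARS_PER_TOKEN = 4  # heuristic quoted in the FAQ


def estimate_tokens(char_count: int) -> int:
    """Rough token estimate; integer division is an assumption of this sketch."""
    return char_count // CHARS_PER_TOKEN


# Sizes from the Output Example's stats block.
print(estimate_tokens(245_000))  # htmlSize     -> 61250
print(estimate_tokens(12_400))   # markdownSize -> 3100
```

Both results match the `htmlTokensEstimate` and `markdownTokensEstimate` values in that example.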

Support

Found a bug? Open an issue or DM on Twitter.

URL: https://apify.com/george.the.developer/web-to-markdown-llm