👁 Web Page → Markdown Converter (Trafilatura, LLM-ready) avatar

Web Page → Markdown Converter (Trafilatura, LLM-ready)

Pricing

Pay per usage

👁 Web Page → Markdown Converter (Trafilatura, LLM-ready)

Web Page → Markdown Converter (Trafilatura, LLM-ready)

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

👁 Hojun Lee

Hojun Lee

Maintained by Community

Actor stats

Bookmarked

Total users

Monthly active users

4 days ago

Last modified

Web Page → Markdown Converter

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

Why this exists

Most LLM pipelines need clean article-body text — but raw HTML is 60-90% boilerplate (nav, footer, ads, JS, related stories). Existing solutions:

Browserless / Puppeteer: complex setup, $30+/mo
Mercury Parser: deprecated
Diffbot: $299/mo minimum
Readability.js: requires running Node

This actor wraps trafilatura — the gold-standard Python library used by Common Crawl and most LLM training pipelines — into a one-call API. Pass a URL list, get clean Markdown + metadata back.

What you get per row

Field	Example	Notes
`url`	`https://...`	input URL
`ok`	`true`	did extraction succeed
`title`	`Bitcoin — Wikipedia`	from `<title>` or og
`author`	`Wikipedia contributors`
`description`	`Bitcoin is a cryptocurrency...`
`date_published`	`2025-12-01`
`language`	`en`	auto-detected
`sitename`	`Wikipedia`
`tags`	`["cryptocurrency", "blockchain"]`
`categories`	`["Technology"]`
`image`	`https://...`	hero image
`markdown`	`# Bitcoin\n\nBitcoin is...`	clean body
`char_count`	`48230`
`word_count`	`7842`

Quick start

Single URL

{
"url":"https://en.wikipedia.org/wiki/Bitcoin"
}

Batch of URLs

{
"urls":[
"https://techcrunch.com/article-1",
"https://www.theverge.com/article-2",
"https://www.wired.com/article-3"
],
"includeTables":true,
"deduplicate":true
}

Custom User-Agent (some sites require it)

{
"url":"https://...",
"userAgent":"Mozilla/5.0 (compatible; YourBot/1.0; +https://yourdomain.com/bot)"
}

Pricing

Pay-Per-Event: $0.005 per URL processed.

Run	URLs	Cost
Single article	1	$0.005
Batch of 100	100	$0.50
Daily crawl of 1K URLs	1000	$5.00

Vs Diffbot ($299/mo), Mercury ($199/mo for similar tier), this is 40-60x cheaper for typical volumes.

Common pipeline patterns

Feed to Claude / GPT for summarization

# 1. Extract clean text
curl-X POST "https://api.apify.com/v2/acts/gochujang~web-to-markdown/runs?token=$T"\
-d'{"url":"..."}'
# 2. Pipe markdown to Claude
curl-X POST https://api.anthropic.com/v1/messages \
-d"{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize: $MARKDOWN\"}]}"

RSS-style aggregator

Sitemap URL Discovery to get all article URLs
Filter by lastmod (recent only)
This actor to convert each to Markdown
Store in your DB / Notion / Obsidian

Personal read-it-later

Schedule this actor with your "saved articles" Google Sheet → get clean markdown into Obsidian / Logseq daily.

Use cases

LLM input prep — Clean text for RAG / fine-tuning / summarization
Content curation — Newsletter / digest aggregation
SEO research — Compare clean content across competitors
Archiving — Read-it-later in Markdown format
Translation pipelines — Strip boilerplate before sending to MT

Data source / engine

Engine: trafilatura — actively maintained, used by Common Crawl
Fallback: Returns ok: false with error message if a page can't be extracted (paywall, JS-heavy SPA without SSR, etc.)

Limitations

JS-only sites: Pages that render entirely in client-side JS may return empty markdown. For those, use a browser-rendering actor (Playwright/Puppeteer-based).
Paywalls: This actor doesn't bypass paywalls.
Comments / discussion sections: Off by default; enable with includeComments: true.

Related actors (same author)

HTML Metadata Extractor — Just metadata (OG, Twitter, JSON-LD) without article body
Sitemap URL Discovery — Find all URLs to feed into this actor
PDF Text Extractor — PDF version
JSON Schema Generator

Feedback

A short review helps content/AI engineers find it: Leave a review on Apify Store

👁 Website to Markdown Converter avatar

Website to Markdown Converter

lofomachines/website-to-markdown-converter

Best faster and cheaper way to convert any web page into clean, structured, LLM-ready Markdown.

👁 User avatar

Lofomachines

Site to Markdown — any site to clean, LLM-ready markdown

topsail/site-to-markdown

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

👁 User avatar

Connor Teskey

Crawl4AI Web to Markdown — URL to Clean Markdown for LLM & RAG

bikram07/web-to-markdown-crawl4ai

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown for RAG pipelines, vector databases, and AI agents. Powered by the open-source Crawl4AI engine with real Chromium rendering. Pay per page ($1 / 1,000), failed pages never charged. MCP-ready for Claude & Cursor.

👁 User avatar

Bikram

Website to Markdown – Clean LLM & RAG Content Extractor

dataquarry/website-to-markdown

Convert any public web page to clean, LLM-ready Markdown with metadata — by URL, a list of URLs, or a whole-site crawl. Strips nav/ads/boilerplate, keeps headings/lists/tables/code. Respects robots.txt. No API key.

👁 User avatar

Daniel Brenner

Web to Markdown — AI-Ready Text from Any URL

wsgcjj/web-to-markdown

Convert any web page URL to clean Markdown format. Perfect for LLM training data, RAG pipelines, and AI content processing. Extracts main content, strips ads/nav/footers.

👁 User avatar

陈俊杰

👁 Website Content to Markdown for LLM Training avatar

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

👁 User avatar

EasyApi

319

5.0

👁 Website To Markdown avatar

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds — perfect for AI training data, RAG pipelines, and content archiving.

👁 User avatar

SmartApi

5.0

👁 Ai Ready Web Page To Markdown Converter avatar

Ai Ready Web Page To Markdown Converter

mustafa.irshaid.113/ai-ready-web-page-to-markdown-converter

Convert any webpage into structured Markdown and HTML using just a URL. Get the page title, link, and content—perfect for SEO, devs, and AI crawlers. Fast, clean, and ideal for repurposing or analysis. Start turning websites into Markdown instantly.

👁 User avatar

Mustafa Irshaid

👁 Web to Markdown for LLMs avatar

Web to Markdown for LLMs

george.the.developer/web-to-markdown-llm

Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.

👁 User avatar

George Kioko

👁 Markdown Anything — URL to Markdown avatar

Markdown Anything — URL to Markdown

s-r/markdown-anything

Convert any URL to clean markdown using a 3-provider fallback chain. Batch input, high concurrency.

👁 User avatar

URL: https://apify.com/gochujang/web-to-markdown