VOOZH about

URL: https://apify.com/gochujang/web-to-markdown

⇱ Web Page β†’ Markdown Converter (Trafilatura, LLM-ready) Β· Apify


πŸ‘ Web Page β†’ Markdown Converter (Trafilatura, LLM-ready) avatar

Web Page β†’ Markdown Converter (Trafilatura, LLM-ready)

Pricing

Pay per usage

Go to Apify Store

Web Page β†’ Markdown Converter (Trafilatura, LLM-ready)

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura β€” the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

πŸ‘ Hojun Lee

Hojun Lee

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

4 days ago

Last modified

Share

Web Page β†’ Markdown Converter

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura β€” the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.


Why this exists

Most LLM pipelines need clean article-body text β€” but raw HTML is 60-90% boilerplate (nav, footer, ads, JS, related stories). Existing solutions:

  • Browserless / Puppeteer: complex setup, $30+/mo
  • Mercury Parser: deprecated
  • Diffbot: $299/mo minimum
  • Readability.js: requires running Node

This actor wraps trafilatura β€” the gold-standard Python library used by Common Crawl and most LLM training pipelines β€” into a one-call API. Pass a URL list, get clean Markdown + metadata back.


What you get per row

FieldExampleNotes
urlhttps://...input URL
oktruedid extraction succeed
titleBitcoin β€” Wikipediafrom <title> or og
authorWikipedia contributors
descriptionBitcoin is a cryptocurrency...
date_published2025-12-01
languageenauto-detected
sitenameWikipedia
tags["cryptocurrency", "blockchain"]
categories["Technology"]
imagehttps://...hero image
markdown# Bitcoin\n\nBitcoin is...clean body
char_count48230
word_count7842

Quick start

Single URL

{
"url":"https://en.wikipedia.org/wiki/Bitcoin"
}

Batch of URLs

{
"urls":[
"https://techcrunch.com/article-1",
"https://www.theverge.com/article-2",
"https://www.wired.com/article-3"
],
"includeTables":true,
"deduplicate":true
}

Custom User-Agent (some sites require it)

{
"url":"https://...",
"userAgent":"Mozilla/5.0 (compatible; YourBot/1.0; +https://yourdomain.com/bot)"
}

Pricing

Pay-Per-Event: $0.005 per URL processed.

RunURLsCost
Single article1$0.005
Batch of 100100$0.50
Daily crawl of 1K URLs1000$5.00

Vs Diffbot ($299/mo), Mercury ($199/mo for similar tier), this is 40-60x cheaper for typical volumes.


Common pipeline patterns

Feed to Claude / GPT for summarization

# 1. Extract clean text
curl-X POST "https://api.apify.com/v2/acts/gochujang~web-to-markdown/runs?token=$T"\
-d'{"url":"..."}'
# 2. Pipe markdown to Claude
curl-X POST https://api.anthropic.com/v1/messages \
-d"{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize: $MARKDOWN\"}]}"

RSS-style aggregator

  1. Sitemap URL Discovery to get all article URLs
  2. Filter by lastmod (recent only)
  3. This actor to convert each to Markdown
  4. Store in your DB / Notion / Obsidian

Personal read-it-later

Schedule this actor with your "saved articles" Google Sheet β†’ get clean markdown into Obsidian / Logseq daily.


Use cases

  1. LLM input prep β€” Clean text for RAG / fine-tuning / summarization
  2. Content curation β€” Newsletter / digest aggregation
  3. SEO research β€” Compare clean content across competitors
  4. Archiving β€” Read-it-later in Markdown format
  5. Translation pipelines β€” Strip boilerplate before sending to MT

Data source / engine

  • Engine: trafilatura β€” actively maintained, used by Common Crawl
  • Fallback: Returns ok: false with error message if a page can't be extracted (paywall, JS-heavy SPA without SSR, etc.)

Limitations

  • JS-only sites: Pages that render entirely in client-side JS may return empty markdown. For those, use a browser-rendering actor (Playwright/Puppeteer-based).
  • Paywalls: This actor doesn't bypass paywalls.
  • Comments / discussion sections: Off by default; enable with includeComments: true.

Related actors (same author)


Feedback

A short review helps content/AI engineers find it: Leave a review on Apify Store

You might also like

Website to Markdown Converter

lofomachines/website-to-markdown-converter

Best faster and cheaper way to convert any web page into clean, structured, LLM-ready Markdown.

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

πŸš€ Transform web content into clean, LLM-ready Markdown! πŸ“˜ Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! πŸŒπŸ“πŸ§ 

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds β€” perfect for AI training data, RAG pipelines, and content archiving.

Ai Ready Web Page To Markdown Converter

mustafa.irshaid.113/ai-ready-web-page-to-markdown-converter

Convert any webpage into structured Markdown and HTML using just a URL. Get the page title, link, and contentβ€”perfect for SEO, devs, and AI crawlers. Fast, clean, and ideal for repurposing or analysis. Start turning websites into Markdown instantly.

πŸ‘ User avatar

Mustafa Irshaid

16

Web to Markdown for LLMs

george.the.developer/web-to-markdown-llm

Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.

Markdown Anything β€” URL to Markdown

s-r/markdown-anything

Convert any URL to clean markdown using a 3-provider fallback chain. Batch input, high concurrency.