Website to Markdown for LLM & RAG — Crawl4AI URL to Clean

Pricing

from $1.00 / 1,000 pages converted

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown for RAG, vector databases, and AI agents. Hosted Crawl4AI in a real Chromium browser — renders JavaScript and SPAs, strips boilerplate, and exports JSON/CSV. Callable over MCP from Claude and Cursor.

Rating

0.0

(0)

Developer

Bikram

Maintained by Community

Last modified: 13 hours ago

Website to Markdown for LLM & RAG — Crawl4AI URL to Clean Markdown

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown — without installing or hosting anything. This Actor is a hosted Crawl4AI: it wraps the popular open-source crawler and runs it on Apify with a real Chromium browser, so JavaScript-heavy pages render correctly. Point it at a page, a sitemap, or a whole site and get back boilerplate-free Markdown ready for RAG pipelines, vector databases, fine-tuning datasets, or pasting straight into an LLM context window.

What it does

Turns URLs → clean Markdown (or Markdown + cleaned HTML, or Markdown + metadata/links JSON)
Strips navigation, footers, cookie banners, and sidebars, leaving "fit markdown" optimized for token budgets
Renders pages in a real Chromium browser via Playwright, so SPAs and JavaScript-rendered content convert correctly
Works on single pages, full sitemaps, or breadth-first same-domain crawls (up to 1,000 pages per run)
Writes one queryable dataset item per page — export as JSON, CSV, Excel, or via the Apify API
Callable as an MCP tool from Claude, Cursor, or any MCP client
Charges only for pages that convert successfully — failed, errored, timed-out, and robots-blocked pages are free

How it works

You provide one or more Start URLs and a crawl mode (single, sitemap, or crawl).
The Actor opens each page in headless Chromium and waits for it to render.
Crawl4AI's pruning content filter removes boilerplate (when removeBoilerplate is on) to produce "fit markdown".
Each successfully converted page is pushed to the dataset and one result-item event is charged (plus a single actor-start event when the run begins).
Pages that fail to load, return an HTTP 4xx/5xx, time out, or are disallowed by robots.txt are logged and never charged.
The run stops at maxPages or when your configured max cost is reached, whichever comes first.
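
The charging and stopping rules above can be modeled in a few lines. This is only a rough sketch of the behavior as described on this page, not the Actor's actual code: `simulate_run` and its arguments are hypothetical names, and it assumes the run halts before any charge that would push the total past the configured maximum cost. The prices are taken from the pricing table further down.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # actor-start event, charged once per run
RESULT_ITEM = Decimal("0.003")  # result-item event, charged per converted page


def simulate_run(outcomes, max_pages, max_cost):
    """Model the charging rules for one run.

    outcomes: iterable of booleans, True for a page that converted,
              False for a page that failed, timed out, or was robots-blocked.
    Returns (converted_pages, total_event_cost).
    """
    total = ACTOR_START
    converted = 0
    for ok in outcomes:
        # Assumed stop rule: halt at maxPages, or before a charge
        # that would exceed the configured maximum cost.
        if converted >= max_pages or total + RESULT_ITEM > max_cost:
            break
        if ok:
            converted += 1
            total += RESULT_ITEM
        # Failed pages are logged but add nothing to the total.
    return converted, total
```

For example, a run over four pages where one fails and `max_pages` is 2 ends with two charged pages: `simulate_run([True, False, True, True], 2, Decimal("1"))` returns `(2, Decimal("0.016"))`.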

Input

Field	Type	Default	Description
`startUrls`	array	— (required)	URLs to convert. In `single` mode each URL is converted as-is; in `sitemap` mode each is treated as a sitemap URL or resolved to the site's `sitemap.xml`; in `crawl` mode each is a crawl starting point.
`crawlMode`	string	`single`	`single` (only the listed URLs), `sitemap` (pages from each site's sitemap.xml), or `crawl` (follow same-domain links, breadth-first).
`maxPages`	integer	`10`	Max pages converted across the whole run (1–1000). You're only charged for successful conversions.
`includeLinks`	boolean	`false`	Keep hyperlinks in the Markdown. Disable for cleaner text aimed at embeddings/RAG chunking.
`outputFormat`	string	`markdown`	`markdown`, `markdown+html` (adds cleaned HTML), or `markdown+json` (adds page metadata + link lists).
`removeBoilerplate`	boolean	`true`	Strip nav/footer/cookie-banner noise to produce "fit markdown".
`respectRobotsTxt`	boolean	`true`	Skip pages disallowed by `robots.txt` (skipped pages are not charged).
`proxyConfiguration`	object	none	Optionally route browser traffic through Apify Proxy or custom proxies. Not needed for most public sites.

Input example

{
  "startUrls": [{"url": "https://docs.crawl4ai.com"}],
  "crawlMode": "crawl",
  "maxPages": 50,
  "outputFormat": "markdown",
  "removeBoilerplate": true,
  "respectRobotsTxt": true
}
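
If you assemble this input from code, it can help to check the documented constraints before starting a run. The helper below is a minimal sketch: `build_input` is a hypothetical name, and the checks only mirror the allowed values and the 1 to 1000 range listed in the input table. It is not part of the Actor or of any Apify client library.

```python
ALLOWED_MODES = {"single", "sitemap", "crawl"}
ALLOWED_FORMATS = {"markdown", "markdown+html", "markdown+json"}


def build_input(urls, crawl_mode="single", max_pages=10,
                output_format="markdown", include_links=False,
                remove_boilerplate=True, respect_robots_txt=True):
    """Return an Actor input dict using the field names from the input table."""
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if crawl_mode not in ALLOWED_MODES:
        raise ValueError(f"crawlMode must be one of {sorted(ALLOWED_MODES)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    return {
        # startUrls is a list of objects with a "url" key, as in the example.
        "startUrls": [{"url": u} for u in urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "includeLinks": include_links,
        "outputFormat": output_format,
        "removeBoilerplate": remove_boilerplate,
        "respectRobotsTxt": respect_robots_txt,
    }
```

Calling `build_input(["https://docs.crawl4ai.com"], crawl_mode="crawl", max_pages=50)` produces the same settings as the example above, with `includeLinks` spelled out at its default.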

Output fields

Each successfully converted page becomes one dataset item. These fields are always present:

Field	Type	Description
`url`	string	The final URL of the converted page.
`title`	string | null	Page title from the page metadata (`null` if the page has none).
`markdown`	string	The clean Markdown. "Fit markdown" when `removeBoilerplate` is on, otherwise the full raw Markdown.
`wordCount`	integer	Word count of the `markdown` field.
`crawledAt`	string	ISO-8601 UTC timestamp of when the page was converted.

Additional fields appear depending on outputFormat:

Field	Appears when	Description
`html`	`outputFormat: "markdown+html"`	The cleaned HTML of the page.
`metadata`	`outputFormat: "markdown+json"`	Page metadata object (description, Open Graph tags, etc.).
`links.internal`	`outputFormat: "markdown+json"`	Array of internal link URLs found on the page.
`links.external`	`outputFormat: "markdown+json"`	Array of external link URLs found on the page.

Output example

{
  "url": "https://docs.crawl4ai.com/core/quickstart/",
  "title": "Quick Start - Crawl4AI Documentation",
  "markdown": "# Getting Started with Crawl4AI\n\nWelcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper...",
  "wordCount": 1183,
  "crawledAt": "2026-06-13T10:42:07.512345+00:00"
}
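
A dataset item like this one can be read with the standard library alone. The sketch below is illustrative only: `summarize_item` is a hypothetical helper that relies on the always-present fields listed above, handles a `null` title, and parses `crawledAt` as an ISO-8601 timestamp.

```python
import json
from datetime import datetime, timezone


def summarize_item(raw_json):
    """Parse one dataset item (a JSON string) into a small summary dict."""
    item = json.loads(raw_json)
    # crawledAt is documented as an ISO-8601 UTC timestamp.
    crawled_at = datetime.fromisoformat(item["crawledAt"])
    return {
        "url": item["url"],
        # title may be null when the page has no title in its metadata.
        "title": item["title"] or "(untitled)",
        "word_count": item["wordCount"],
        "crawled_at": crawled_at.astimezone(timezone.utc),
        # Optional fields only exist for markdown+html / markdown+json output.
        "has_html": "html" in item,
        "has_links": "links" in item,
    }
```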

Use cases

RAG / AI engineer — Ingest a documentation site or knowledge base into a vector database. Use sitemap or crawl mode with removeBoilerplate: true so chunks contain content, not nav menus.
AI agent builder — Give an agent a "read this page" tool over MCP. The agent passes a URL, gets clean Markdown back, and reasons over it — no scraping code in your app.
LLM app developer — Pull live web content into a prompt at request time via the Apify API instead of pasting HTML and burning tokens on boilerplate.
Data / ML team — Build fine-tuning or evaluation datasets from public web pages, exported as JSON/CSV from the run's dataset.
Researcher / analyst — Convert a batch of articles or report pages to Markdown for summarization, search, or archival in a single run.
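
For the RAG use case, the `markdown` field usually needs to be split into chunks before embedding. The Actor itself does not chunk; the function below is one simple approach, shown as a sketch under that assumption. `chunk_by_heading` is a hypothetical name, and the splitter is naive: a line beginning with `#` inside a fenced code block would also start a new chunk.

```python
def chunk_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far (if there is one).
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only blank lines.
    return [c for c in chunks if c]
```

Each chunk keeps its heading as the first line, which gives the embedding some context about where the text came from.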

Pricing — pay only for pages you convert

This Actor uses Apify's pay-per-event model with two events:

Event	Price	When it's charged
`actor-start`	$0.01	Once per run, when the Actor starts
`result-item`	$0.003	Once per page successfully converted to Markdown

So a run costs $0.01 to start, then $0.003 per converted page (about $3 per 1,000 pages). Pages that fail to load, return an HTTP error, time out, or are blocked by robots.txt are never charged a result-item. Standard Apify platform usage (compute, and proxy if you enable it) applies to runs as usual. You can set a maximum cost per run in Apify Console — the Actor stops gracefully when that limit is reached.
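
The event arithmetic can be checked with a small calculation. This sketch uses the two prices from the table above; `estimate_event_cost` is a hypothetical helper, and it covers only the Actor's own events, not Apify platform usage such as compute or proxy.

```python
from decimal import Decimal

ACTOR_START_PRICE = Decimal("0.01")   # per run
RESULT_ITEM_PRICE = Decimal("0.003")  # per successfully converted page


def estimate_event_cost(converted_pages, runs=1):
    """Estimate pay-per-event charges in USD for a number of converted pages."""
    if converted_pages < 0 or runs < 1:
        raise ValueError("converted_pages must be >= 0 and runs must be >= 1")
    return runs * ACTOR_START_PRICE + converted_pages * RESULT_ITEM_PRICE
```

One run that converts 1,000 pages comes to $3.01 in events, and 2,500 pages split across three runs comes to $7.53 (three start fees plus 2,500 result items).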

Use from Claude, Cursor & other AI agents (MCP)

This Actor works as a tool over the Model Context Protocol. Add Apify's MCP server to your client and your agent can convert URLs to Markdown on demand:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/sse?actors=bikram07/web-to-markdown-crawl4ai",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Then ask your agent things like: "Fetch https://example.com/blog as Markdown and summarize it" — the agent calls this Actor, gets clean Markdown back, and works with it directly.

You can also call it from code via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"}'
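
The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl command above (same endpoint, token query parameter, and JSON body). `build_request` and `convert` are hypothetical names, the 300-second timeout is an arbitrary assumption, and the response is assumed to be the JSON array of dataset items that the endpoint name suggests.

```python
import json
import urllib.parse
import urllib.request

API_URL = ("https://api.apify.com/v2/acts/"
           "bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items")


def build_request(token, actor_input):
    """Build the POST request without sending it."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(token, actor_input, timeout=300):
    """Send the request and return the parsed JSON response."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `convert("YOUR_APIFY_TOKEN", {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"})` would then start a run and wait for its results.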

FAQ

Is this a subscription? No. It's pay-per-event with no monthly fee. Each run charges a small $0.01 actor-start fee, then $0.003 for each page that successfully converts — nothing else from this Actor. Convert nothing successfully, and you pay only the start fee.

How does the pricing / billing work, and can I get a refund? You're charged one actor-start event ($0.01) per run and one result-item event ($0.003) per successfully converted page. Failed, errored, timed-out, and robots-blocked pages cost no result-item. Because per-page charges only accrue on successful output, there's nothing to refund for failures. To cap spend, set a maximum cost per run in Apify Console — the Actor stops cleanly when the limit is hit.

Does it use official APIs? There is no public "URL-to-Markdown API" to call — the Actor renders each page in a real Chromium browser (via Playwright) and converts the rendered content using the open-source Crawl4AI library. It respects robots.txt by default. Output is the real content of the pages you point it at.

Does it handle JavaScript-rendered pages and SPAs? Yes. Pages are rendered in headless Chromium before conversion, so client-side-rendered content is included — unlike simple HTML-to-Markdown converters that only see the initial HTML.

What's the difference between this and running Crawl4AI locally? The conversion engine is the same library. The difference is operational: no Python/Playwright setup, no server to maintain, an instant REST API and MCP endpoint, parallel scaling, and dataset storage with JSON/CSV export. If you convert millions of pages a month on dedicated hardware, self-hosting can be cheaper; for prototypes through moderate-volume production RAG ingestion, hosted is simpler.

What it does NOT do (limitations)

Not a search engine or content discoverer. It converts the URLs you give it (or links/sitemap entries it follows in crawl/sitemap mode) — it won't find pages from a keyword.
Crawl mode follows same-domain links only, breadth-first, up to depth 3 and up to maxPages. It does not crawl across external domains.
Sitemap mode needs a real sitemap. If a site exposes no sitemap.xml (and you didn't pass a sitemap URL directly), use single or crawl mode instead.
Hard cap of 1,000 pages per run. For larger jobs, split across multiple runs.
No login / form / paywall handling. Pages behind authentication or interactive walls won't convert.
robots.txt is respected by default. Disallowed pages are skipped (and not charged) unless you turn that off.
It does not extract data into custom schemas — output is Markdown (plus optional HTML/metadata/links), not arbitrary structured fields.
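
Working around the 1,000-page cap means batching the URL list yourself. The helper below is a sketch of one way to do that in `single` mode: `split_into_runs` is a hypothetical name, and each returned dict is an input for one separate run.

```python
def split_into_runs(urls, pages_per_run=1000):
    """Split a URL list into Actor inputs of at most pages_per_run URLs each."""
    if not 1 <= pages_per_run <= 1000:
        raise ValueError("pages_per_run must be between 1 and 1000")
    runs = []
    for start in range(0, len(urls), pages_per_run):
        batch = urls[start:start + pages_per_run]
        runs.append({
            "startUrls": [{"url": u} for u in batch],
            "crawlMode": "single",
            # maxPages matches the batch size so every listed URL is attempted.
            "maxPages": len(batch),
        })
    return runs
```

A list of 2,500 URLs yields three inputs of 1,000, 1,000, and 500 URLs. Each run is charged its own `actor-start` event.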

Built on Crawl4AI (Apache 2.0). This Actor is not affiliated with the Crawl4AI project; it packages the library as a hosted service.

URL: https://apify.com/bikram07/web-to-markdown-crawl4ai
