Website to Markdown for LLM & RAG: Crawl4AI URL to Clean Markdown
Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown, without installing or hosting anything. This Actor is a hosted Crawl4AI: it wraps the popular open-source crawler and runs it on Apify with a real Chromium browser, so JavaScript-heavy pages render correctly. Point it at a page, a sitemap, or a whole site and get back boilerplate-free Markdown ready for RAG pipelines, vector databases, fine-tuning datasets, or pasting straight into an LLM context window.
What it does
- Turns URLs into clean Markdown (or Markdown + cleaned HTML, or Markdown + metadata/links JSON)
- Strips navigation, footers, cookie banners, and sidebars, leaving "fit markdown" optimized for token budgets
- Renders pages in a real Chromium browser via Playwright, so SPAs and JavaScript-rendered content convert correctly
- Works on single pages, full sitemaps, or breadth-first same-domain crawls (up to 1,000 pages per run)
- Writes one queryable dataset item per page; export as JSON, CSV, Excel, or via the Apify API
- Callable as an MCP tool from Claude, Cursor, or any MCP client
- Charges only for pages that convert successfully; failed, errored, timed-out, and robots-blocked pages are free
How it works
- You provide one or more Start URLs and a crawl mode (`single`, `sitemap`, or `crawl`).
- The Actor opens each page in headless Chromium and waits for it to render.
- Crawl4AI's pruning content filter removes boilerplate (when `removeBoilerplate` is on) to produce "fit markdown".
- Each successfully converted page is pushed to the dataset and one `result-item` event is charged (plus a single `actor-start` event when the run begins).
- Pages that fail to load, return an HTTP 4xx/5xx, time out, or are disallowed by `robots.txt` are logged and never charged.
- The run stops at `maxPages` or when your configured max cost is reached, whichever comes first.
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | (required) | URLs to convert. In `single` mode each is converted as-is; in `sitemap` mode each is treated as, or resolved to, a `sitemap.xml`; in `crawl` mode each is a crawl starting point. |
| `crawlMode` | string | `single` | `single` (only the listed URLs), `sitemap` (pages from each site's `sitemap.xml`), or `crawl` (follow same-domain links, breadth-first). |
| `maxPages` | integer | `10` | Max pages converted across the whole run (1 to 1,000). You're only charged for successful conversions. |
| `includeLinks` | boolean | `false` | Keep hyperlinks in the Markdown. Disable for cleaner text aimed at embeddings/RAG chunking. |
| `outputFormat` | string | `markdown` | `markdown`, `markdown+html` (adds cleaned HTML), or `markdown+json` (adds page metadata + link lists). |
| `removeBoilerplate` | boolean | `true` | Strip nav/footer/cookie-banner noise to produce "fit markdown". |
| `respectRobotsTxt` | boolean | `true` | Skip pages disallowed by `robots.txt` (skipped pages are not charged). |
| `proxyConfiguration` | object | none | Optionally route browser traffic through Apify Proxy or custom proxies. Not needed for most public sites. |
Input example
{"startUrls":[{"url":"https://docs.crawl4ai.com"}],"crawlMode":"crawl","maxPages":50,"outputFormat":"markdown","removeBoilerplate":true,"respectRobotsTxt":true}
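As a rough sketch, the input above can also be assembled and sanity-checked in code before a run. The field names and allowed values are taken from the Input table; the helper name `build_actor_input` and its validation rules are illustrative assumptions, not something the Actor ships.

```python
import json

# Allowed values as listed in the Input table.
ALLOWED_MODES = {"single", "sitemap", "crawl"}
ALLOWED_FORMATS = {"markdown", "markdown+html", "markdown+json"}


def build_actor_input(urls, crawl_mode="single", max_pages=10,
                      output_format="markdown", remove_boilerplate=True,
                      respect_robots_txt=True, include_links=False):
    """Build an input dict for the Actor (hypothetical helper).

    Defaults mirror the Default column of the Input table.
    """
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if crawl_mode not in ALLOWED_MODES:
        raise ValueError(f"crawlMode must be one of {sorted(ALLOWED_MODES)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    return {
        "startUrls": [{"url": u} for u in urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "includeLinks": include_links,
        "outputFormat": output_format,
        "removeBoilerplate": remove_boilerplate,
        "respectRobotsTxt": respect_robots_txt,
    }


actor_input = build_actor_input(
    ["https://docs.crawl4ai.com"], crawl_mode="crawl", max_pages=50
)
print(json.dumps(actor_input, indent=2))
```

Validating locally only catches obvious mistakes such as a misspelled mode; the Actor's own input schema remains the source of truth.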
Output fields
Each successfully converted page becomes one dataset item. These fields are always present:
| Field | Type | Description |
|---|---|---|
| `url` | string | The final URL of the converted page. |
| `title` | string or null | Page title from the page metadata (`null` if the page has none). |
| `markdown` | string | The clean Markdown. "Fit markdown" when `removeBoilerplate` is on, otherwise the full raw Markdown. |
| `wordCount` | integer | Word count of the `markdown` field. |
| `crawledAt` | string | ISO-8601 UTC timestamp of when the page was converted. |
Additional fields appear depending on outputFormat:
| Field | Appears when | Description |
|---|---|---|
| `html` | `outputFormat: "markdown+html"` | The cleaned HTML of the page. |
| `metadata` | `outputFormat: "markdown+json"` | Page metadata object (description, Open Graph tags, etc.). |
| `links.internal` | `outputFormat: "markdown+json"` | Array of internal link URLs found on the page. |
| `links.external` | `outputFormat: "markdown+json"` | Array of external link URLs found on the page. |
Output example
{"url":"https://docs.crawl4ai.com/core/quickstart/","title":"Quick Start - Crawl4AI Documentation","markdown":"# Getting Started with Crawl4AI\n\nWelcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper...","wordCount":1183,"crawledAt":"2026-06-13T10:42:07.512345+00:00"}
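For RAG ingestion, a dataset item like the one above typically gets split into chunks before embedding. The following is a minimal sketch of heading-based chunking using only the `url`, `title`, and `markdown` fields; the function `chunk_by_heading`, the `max_words` limit, and the sample item's content are assumptions made for illustration.

```python
import re


def chunk_by_heading(item, max_words=200):
    """Split one dataset item's Markdown into chunks (hypothetical helper).

    Sections start at Markdown headings; sections longer than max_words
    are cut into consecutive word windows. Whitespace inside a chunk is
    collapsed to single spaces.
    """
    # Zero-width split: each piece begins at a line starting with 1-6 '#'.
    sections = re.split(r"(?m)^(?=#{1,6} )", item["markdown"])
    chunks = []
    for section in sections:
        words = section.split()
        for start in range(0, len(words), max_words):
            chunks.append({
                "url": item["url"],
                "title": item["title"],
                "text": " ".join(words[start:start + max_words]),
            })
    return chunks


# Made-up sample item with the always-present fields; not real crawl output.
sample_item = {
    "url": "https://docs.crawl4ai.com/core/quickstart/",
    "title": "Quick Start - Crawl4AI Documentation",
    "markdown": "# Getting Started\n\nIntro text here.\n\n## Install\n\nRun the installer.",
}

chunks = chunk_by_heading(sample_item)
for chunk in chunks:
    print(chunk["text"])
```

Carrying `url` and `title` on every chunk keeps the source traceable once the chunks are stored in a vector database.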
Use cases
- RAG / AI engineer: Ingest a documentation site or knowledge base into a vector database. Use `sitemap` or `crawl` mode with `removeBoilerplate: true` so chunks contain content, not nav menus.
- AI agent builder: Give an agent a "read this page" tool over MCP. The agent passes a URL, gets clean Markdown back, and reasons over it, with no scraping code in your app.
- LLM app developer: Pull live web content into a prompt at request time via the Apify API instead of pasting HTML and burning tokens on boilerplate.
- Data / ML team: Build fine-tuning or evaluation datasets from public web pages, exported as JSON/CSV from the run's dataset.
- Researcher / analyst: Convert a batch of articles or report pages to Markdown for summarization, search, or archival in a single run.
Pricing: pay only for pages you convert
This Actor uses Apify's pay-per-event model with two events:
| Event | Price | When it's charged |
|---|---|---|
| `actor-start` | $0.01 | Once per run, when the Actor starts |
| `result-item` | $0.003 | Once per page successfully converted to Markdown |
So a run costs $0.01 to start, then $0.003 per converted page (about $3 per 1,000 pages). Pages that fail to load, return an HTTP error, time out, or are blocked by `robots.txt` are never charged a `result-item`. Standard Apify platform usage (compute, and proxy if you enable it) applies to runs as usual. You can set a maximum cost per run in Apify Console; the Actor stops gracefully when that limit is reached.
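The arithmetic above can be sketched as two small functions: one for the event charges of a run, one for how many pages fit under a chosen cost cap. The prices come from the table; the function names are hypothetical, and the estimate deliberately leaves out platform usage (compute and proxy), which is billed separately.

```python
from decimal import Decimal

# Event prices from the pricing table.
ACTOR_START = Decimal("0.01")
RESULT_ITEM = Decimal("0.003")


def run_cost(converted_pages):
    """Event charges for one run: start fee plus per-page fee."""
    return ACTOR_START + RESULT_ITEM * converted_pages


def pages_within_budget(max_cost):
    """How many successful conversions fit under max_cost for one run.

    Ignores platform usage; a rough planning figure only.
    """
    remaining = Decimal(str(max_cost)) - ACTOR_START
    if remaining < 0:
        return 0
    return int(remaining // RESULT_ITEM)


print(run_cost(1000))             # event charges for 1,000 converted pages
print(pages_within_budget(1.00))  # pages that fit under a $1.00 cap
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.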
Use from Claude, Cursor & other AI agents (MCP)
This Actor works as a tool over the Model Context Protocol. Add Apify's MCP server to your client and your agent can convert URLs to Markdown on demand:
{"mcpServers":{"apify":{"url":"https://mcp.apify.com/sse?actors=bikram07/web-to-markdown-crawl4ai","headers":{"Authorization":"Bearer YOUR_APIFY_TOKEN"}}}}
Then ask your agent things like: "Fetch https://example.com/blog as Markdown and summarize it". The agent calls this Actor, gets clean Markdown back, and works with it directly.
You can also call it from code via the Apify API:
curl -X POST "https://api.apify.com/v2/acts/bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" -H "Content-Type: application/json" -d '{"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"}'
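The same call can be sketched in Python with only the standard library. The endpoint, Actor ID, and payload are copied from the curl command; the helper names `build_request` and `convert`, the 300-second timeout, and the assumption that the response body is a JSON array of dataset items are illustrative and should be checked against the Apify API documentation.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"
ACTOR_ID = "bikram07~web-to-markdown-crawl4ai"


def build_request(token, actor_input):
    """Build the POST request for a synchronous run (hypothetical helper)."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(token, actor_input, timeout=300):
    """Send the request and parse the response as JSON.

    Assumes the endpoint returns the dataset items as a JSON array.
    """
    with urllib.request.urlopen(build_request(token, actor_input), timeout=timeout) as resp:
        return json.load(resp)


request = build_request(
    "YOUR_APIFY_TOKEN",
    {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"},
)
print(request.get_method(), request.full_url)

# With a real token, the run itself would be:
# items = convert("YOUR_APIFY_TOKEN", {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"})
```

Building the request separately from sending it makes the URL and payload easy to inspect without spending a run.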
FAQ
Is this a subscription?
No. It's pay-per-event with no monthly fee. Each run charges a small $0.01 `actor-start` fee, then $0.003 for each page that successfully converts, and nothing else from this Actor. Convert nothing successfully, and you pay only the start fee.
How does the pricing / billing work, and can I get a refund?
You're charged one `actor-start` event ($0.01) per run and one `result-item` event ($0.003) per successfully converted page. Failed, errored, timed-out, and robots-blocked pages cost no `result-item`. Because per-page charges only accrue on successful output, there's nothing to refund for failures. To cap spend, set a maximum cost per run in Apify Console; the Actor stops cleanly when the limit is hit.
Does it use official APIs?
There is no public "URL-to-Markdown API" to call; the Actor renders each page in a real Chromium browser (via Playwright) and converts the rendered content using the open-source Crawl4AI library. It respects `robots.txt` by default. Output is the real content of the pages you point it at.
Does it handle JavaScript-rendered pages and SPAs?
Yes. Pages are rendered in headless Chromium before conversion, so client-side-rendered content is included, unlike simple HTML-to-Markdown converters that only see the initial HTML.
What's the difference between this and running Crawl4AI locally?
The conversion engine is the same library. The difference is operational: no Python/Playwright setup, no server to maintain, an instant REST API and MCP endpoint, parallel scaling, and dataset storage with JSON/CSV export. If you convert millions of pages a month on dedicated hardware, self-hosting can be cheaper; for prototypes through moderate-volume production RAG ingestion, hosted is simpler.
What it does NOT do (limitations)
- Not a search engine or content discoverer. It converts the URLs you give it (or links/sitemap entries it follows in `crawl`/`sitemap` mode); it won't find pages from a keyword.
- Crawl mode follows same-domain links only, breadth-first, up to depth 3 and up to `maxPages`. It does not crawl across external domains.
- Sitemap mode needs a real sitemap. If a site exposes no `sitemap.xml` (and you didn't pass a sitemap URL directly), use `single` or `crawl` mode instead.
- Hard cap of 1,000 pages per run. For larger jobs, split across multiple runs.
- No login / form / paywall handling. Pages behind authentication or interactive walls won't convert.
- `robots.txt` is respected by default. Disallowed pages are skipped (and not charged) unless you turn that off.
- It does not extract data into custom schemas; output is Markdown (plus optional HTML/metadata/links), not arbitrary structured fields.
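For jobs above the 1,000-page cap, one way to split the work is to batch a known URL list into several `single`-mode inputs. This is a sketch under stated assumptions: the helper `split_into_runs` is hypothetical, it presumes you already have the full URL list, and each returned dict would be submitted as a separate run.

```python
def split_into_runs(urls, max_pages_per_run=1000):
    """Split a URL list into per-run inputs of at most 1,000 pages each.

    Hypothetical helper; field names follow the Input table.
    """
    if not 1 <= max_pages_per_run <= 1000:
        raise ValueError("max_pages_per_run must be between 1 and 1000")
    # Drop duplicate URLs while keeping the original order.
    unique = list(dict.fromkeys(urls))
    runs = []
    for start in range(0, len(unique), max_pages_per_run):
        batch = unique[start:start + max_pages_per_run]
        runs.append({
            "startUrls": [{"url": u} for u in batch],
            "crawlMode": "single",
            "maxPages": len(batch),
        })
    return runs


# Placeholder URLs for illustration only.
urls = [f"https://example.com/page/{i}" for i in range(2500)]
runs = split_into_runs(urls)
print([run["maxPages"] for run in runs])
```

Each run is charged its own `actor-start` event, so fewer, fuller batches keep the start fees down.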
Built on Crawl4AI (Apache 2.0). This Actor is not affiliated with the Crawl4AI project; it packages the library as a hosted service.
