
URL: https://apify.com/bikram07/web-to-markdown-crawl4ai

URL to Markdown for LLM & RAG - Crawl4AI · Apify

Website to Markdown for LLM & RAG - Crawl4AI URL to Clean

Pricing

from $1.00 / 1,000 pages converted


Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown for RAG, vector databases, and AI agents. Hosted Crawl4AI in a real Chromium browser: it renders JavaScript and SPAs, strips boilerplate, and exports JSON/CSV. Callable over MCP from Claude and Cursor.


Rating: 0.0 (0)

Developer: Bikram (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 13 hours ago

Website to Markdown for LLM & RAG - Crawl4AI URL to Clean Markdown

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown, without installing or hosting anything. This Actor is a hosted Crawl4AI: it wraps the popular open-source crawler and runs it on Apify with a real Chromium browser, so JavaScript-heavy pages render correctly. Point it at a page, a sitemap, or a whole site and get back boilerplate-free Markdown ready for RAG pipelines, vector databases, fine-tuning datasets, or pasting straight into an LLM context window.

What it does

  • Turns URLs into clean Markdown (or Markdown + cleaned HTML, or Markdown + metadata/links JSON)
  • Strips navigation, footers, cookie banners, and sidebars, leaving "fit markdown" optimized for token budgets
  • Renders pages in a real Chromium browser via Playwright, so SPAs and JavaScript-rendered content convert correctly
  • Works on single pages, full sitemaps, or breadth-first same-domain crawls (up to 1,000 pages per run)
  • Writes one queryable dataset item per page, which you can export as JSON, CSV, or Excel, or read via the Apify API
  • Callable as an MCP tool from Claude, Cursor, or any MCP client
  • Charges only for pages that convert successfully: failed, errored, timed-out, and robots-blocked pages are free

How it works

  1. You provide one or more Start URLs and a crawl mode (single, sitemap, or crawl).
  2. The Actor opens each page in headless Chromium and waits for it to render.
  3. Crawl4AI's pruning content filter removes boilerplate (when removeBoilerplate is on) to produce "fit markdown".
  4. Each successfully converted page is pushed to the dataset and one result-item event is charged (plus a single actor-start event when the run begins).
  5. Pages that fail to load, return an HTTP 4xx/5xx, time out, or are disallowed by robots.txt are logged and never charged.
  6. The run stops at maxPages or when your configured max cost is reached, whichever comes first.
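The charging and stopping rules in the steps above can be sketched as a small simulation. This is an illustration only, not the Actor's actual code: `simulate_run` and its `(url, succeeded)` input are hypothetical, and the assumption that the run stops before a charge would exceed the max cost is mine.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # charged once per run (from the pricing table)
RESULT_ITEM = Decimal("0.003")  # charged per successfully converted page


def simulate_run(outcomes, max_pages, max_cost):
    """Simulate one run over (url, succeeded) pairs.

    Hypothetical helper: returns the converted URLs and the total event charge.
    """
    max_cost = Decimal(str(max_cost))
    charged = ACTOR_START  # the single actor-start event
    converted = []
    for url, succeeded in outcomes:
        if len(converted) >= max_pages:
            break  # maxPages reached
        if charged + RESULT_ITEM > max_cost:
            break  # assumed behaviour: stop before exceeding the max cost
        if not succeeded:
            continue  # failed, timed-out, or robots-blocked: never charged
        converted.append(url)
        charged += RESULT_ITEM
    return converted, charged


pages = [("a", True), ("b", False), ("c", True), ("d", True)]
print(simulate_run(pages, max_pages=2, max_cost=1))
```

With `max_pages=2`, the failed page `b` is skipped without a charge and the run stops after `a` and `c`.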

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | none (required) | URLs to convert. In single mode each is converted as-is; in sitemap mode each is treated as / resolved to a sitemap.xml; in crawl mode each is a crawl starting point. |
| crawlMode | string | single | single (only the listed URLs), sitemap (pages from each site's sitemap.xml), or crawl (follow same-domain links, breadth-first). |
| maxPages | integer | 10 | Max pages converted across the whole run (1-1000). You're only charged for successful conversions. |
| includeLinks | boolean | false | Keep hyperlinks in the Markdown. Disable for cleaner text aimed at embeddings/RAG chunking. |
| outputFormat | string | markdown | markdown, markdown+html (adds cleaned HTML), or markdown+json (adds page metadata + link lists). |
| removeBoilerplate | boolean | true | Strip nav/footer/cookie-banner noise to produce "fit markdown". |
| respectRobotsTxt | boolean | true | Skip pages disallowed by robots.txt (skipped pages are not charged). |
| proxyConfiguration | object | none | Optionally route browser traffic through Apify Proxy or custom proxies. Not needed for most public sites. |
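As a rough sketch, the input can be assembled and sanity-checked client-side before starting a run. `build_input` is a hypothetical helper, not part of the Actor or the Apify SDK. Its field names, defaults, and the 1 to 1000 range come from the table above, and `proxyConfiguration` is left out for brevity.

```python
ALLOWED_MODES = ("single", "sitemap", "crawl")
ALLOWED_FORMATS = ("markdown", "markdown+html", "markdown+json")


def build_input(urls, crawl_mode="single", max_pages=10,
                output_format="markdown", include_links=False,
                remove_boilerplate=True, respect_robots_txt=True):
    """Return a run-input dict using the field names from the input table."""
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if crawl_mode not in ALLOWED_MODES:
        raise ValueError(f"crawlMode must be one of {ALLOWED_MODES}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    return {
        # Each start URL is wrapped as {"url": ...}, as in the input example.
        "startUrls": [{"url": url} for url in urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "includeLinks": include_links,
        "outputFormat": output_format,
        "removeBoilerplate": remove_boilerplate,
        "respectRobotsTxt": respect_robots_txt,
    }


run_input = build_input(["https://docs.crawl4ai.com"], crawl_mode="crawl", max_pages=50)
```

The Actor presumably validates its own input as well; checking locally just surfaces mistakes before a run is started.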

Input example

{
  "startUrls": [{"url": "https://docs.crawl4ai.com"}],
  "crawlMode": "crawl",
  "maxPages": 50,
  "outputFormat": "markdown",
  "removeBoilerplate": true,
  "respectRobotsTxt": true
}

Output fields

Each successfully converted page becomes one dataset item. These fields are always present:

| Field | Type | Description |
| --- | --- | --- |
| url | string | The final URL of the converted page. |
| title | string or null | Page title from the page metadata (null if the page has none). |
| markdown | string | The clean Markdown. "Fit markdown" when removeBoilerplate is on, otherwise the full raw Markdown. |
| wordCount | integer | Word count of the markdown field. |
| crawledAt | string | ISO-8601 UTC timestamp of when the page was converted. |

Additional fields appear depending on outputFormat:

| Field | Appears when | Description |
| --- | --- | --- |
| html | outputFormat: "markdown+html" | The cleaned HTML of the page. |
| metadata | outputFormat: "markdown+json" | Page metadata object (description, Open Graph tags, etc.). |
| links.internal | outputFormat: "markdown+json" | Array of internal link URLs found on the page. |
| links.external | outputFormat: "markdown+json" | Array of external link URLs found on the page. |

Output example

{
  "url": "https://docs.crawl4ai.com/core/quickstart/",
  "title": "Quick Start - Crawl4AI Documentation",
  "markdown": "# Getting Started with Crawl4AI\n\nWelcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper...",
  "wordCount": 1183,
  "crawledAt": "2026-06-13T10:42:07.512345+00:00"
}
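A minimal sketch of reading one dataset item on the consumer side, assuming the item arrives as a JSON string shaped like the example above. `parse_item` is a hypothetical helper. It checks the always-present fields and parses `crawledAt` into a timezone-aware `datetime`.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("url", "title", "markdown", "wordCount", "crawledAt")


def parse_item(raw_json):
    """Parse one dataset item and convert crawledAt to a datetime."""
    item = json.loads(raw_json)
    missing = [name for name in REQUIRED_FIELDS if name not in item]
    if missing:
        raise KeyError(f"dataset item is missing fields: {missing}")
    # crawledAt is documented as an ISO-8601 UTC timestamp.
    item["crawledAt"] = datetime.fromisoformat(item["crawledAt"])
    # Link lists are only present with outputFormat "markdown+json".
    item["internalLinks"] = item.get("links", {}).get("internal", [])
    return item


raw = json.dumps({
    "url": "https://docs.crawl4ai.com/core/quickstart/",
    "title": "Quick Start - Crawl4AI Documentation",
    "markdown": "# Getting Started with Crawl4AI",
    "wordCount": 1183,
    "crawledAt": "2026-06-13T10:42:07.512345+00:00",
})
item = parse_item(raw)
print(item["url"], item["crawledAt"].isoformat())
```

The `internalLinks` key is added by this sketch for convenience; it is not a field the Actor emits.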

Use cases

  • RAG / AI engineer: Ingest a documentation site or knowledge base into a vector database. Use sitemap or crawl mode with removeBoilerplate: true so chunks contain content, not nav menus.
  • AI agent builder: Give an agent a "read this page" tool over MCP. The agent passes a URL, gets clean Markdown back, and reasons over it, with no scraping code in your app.
  • LLM app developer: Pull live web content into a prompt at request time via the Apify API instead of pasting HTML and burning tokens on boilerplate.
  • Data / ML team: Build fine-tuning or evaluation datasets from public web pages, exported as JSON/CSV from the run's dataset.
  • Researcher / analyst: Convert a batch of articles or report pages to Markdown for summarization, search, or archival in a single run.

Pricing: pay only for pages you convert

This Actor uses Apify's pay-per-event model with two events:

| Event | Price | When it's charged |
| --- | --- | --- |
| actor-start | $0.01 | Once per run, when the Actor starts |
| result-item | $0.003 | Once per page successfully converted to Markdown |

So a run costs $0.01 to start, then $0.003 per converted page (about $3 per 1,000 pages). Pages that fail to load, return an HTTP error, time out, or are blocked by robots.txt are never charged a result-item. Standard Apify platform usage (compute, and proxy if you enable it) applies to runs as usual. You can set a maximum cost per run in Apify Console; the Actor stops gracefully when that limit is reached.
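The event arithmetic can be worked through in a few lines. The two prices come from the table above; the helper names are hypothetical, and the figures cover only the two Actor events, not Apify platform usage (compute or proxy).

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # once per run
RESULT_ITEM = Decimal("0.003")  # per successfully converted page


def estimate_event_cost(converted_pages):
    """Event charges in USD for one run with the given number of conversions."""
    return ACTOR_START + RESULT_ITEM * converted_pages


def pages_within_budget(budget):
    """Largest number of converted pages whose event charges fit the budget."""
    budget = Decimal(str(budget))
    if budget < ACTOR_START:
        return 0
    return int((budget - ACTOR_START) // RESULT_ITEM)


print(estimate_event_cost(1000))   # event charges for 1,000 converted pages
print(pages_within_budget("1.00"))  # conversions that fit in a $1.00 cap
```

`Decimal` is used instead of floats so that sums of $0.003 stay exact.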

Use from Claude, Cursor & other AI agents (MCP)

This Actor works as a tool over the Model Context Protocol. Add Apify's MCP server to your client and your agent can convert URLs to Markdown on demand:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/sse?actors=bikram07/web-to-markdown-crawl4ai",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Then ask your agent things like: "Fetch https://example.com/blog as Markdown and summarize it". The agent calls this Actor, gets clean Markdown back, and works with it directly.

You can also call it from code via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"}'
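The same request can be made from Python with only the standard library. This sketch mirrors the curl call above: the endpoint, Actor ID, and input fields are taken from it, while `build_request`, `convert`, and the 300-second timeout are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"
ACTOR_ID = "bikram07~web-to-markdown-crawl4ai"


def build_request(token, run_input):
    """Build the POST request for run-sync-get-dataset-items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(token, run_input, timeout=300):
    """Send the request and return the dataset items as a list of dicts."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_request(
    "YOUR_APIFY_TOKEN",
    {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"},
)
print(request.get_method(), request.full_url)
```

Calling `convert("YOUR_APIFY_TOKEN", {...})` with a real token performs the run and blocks until the items are returned, so the timeout should be sized to the crawl.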

FAQ

Is this a subscription? No. It's pay-per-event with no monthly fee. Each run charges a small $0.01 actor-start fee, then $0.003 for each page that successfully converts, and nothing else from this Actor. Convert nothing successfully, and you pay only the start fee.

How does the pricing / billing work, and can I get a refund? You're charged one actor-start event ($0.01) per run and one result-item event ($0.003) per successfully converted page. Failed, errored, timed-out, and robots-blocked pages cost no result-item. Because per-page charges only accrue on successful output, there's nothing to refund for failures. To cap spend, set a maximum cost per run in Apify Console; the Actor stops cleanly when the limit is hit.

Does it use official APIs? There is no public "URL-to-Markdown API" to call. The Actor renders each page in a real Chromium browser (via Playwright) and converts the rendered content using the open-source Crawl4AI library. It respects robots.txt by default. Output is the real content of the pages you point it at.

Does it handle JavaScript-rendered pages and SPAs? Yes. Pages are rendered in headless Chromium before conversion, so client-side-rendered content is included, unlike simple HTML-to-Markdown converters that only see the initial HTML.

What's the difference between this and running Crawl4AI locally? The conversion engine is the same library. The difference is operational: no Python/Playwright setup, no server to maintain, an instant REST API and MCP endpoint, parallel scaling, and dataset storage with JSON/CSV export. If you convert millions of pages a month on dedicated hardware, self-hosting can be cheaper; for prototypes through moderate-volume production RAG ingestion, hosted is simpler.

What it does NOT do (limitations)

  • Not a search engine or content discoverer. It converts the URLs you give it (or links/sitemap entries it follows in crawl/sitemap mode); it won't find pages from a keyword.
  • Crawl mode follows same-domain links only, breadth-first, up to depth 3 and up to maxPages. It does not crawl across external domains.
  • Sitemap mode needs a real sitemap. If a site exposes no sitemap.xml (and you didn't pass a sitemap URL directly), use single or crawl mode instead.
  • Hard cap of 1,000 pages per run. For larger jobs, split across multiple runs.
  • No login / form / paywall handling. Pages behind authentication or interactive walls won't convert.
  • robots.txt is respected by default. Disallowed pages are skipped (and not charged) unless you turn that off.
  • It does not extract data into custom schemas: output is Markdown (plus optional HTML/metadata/links), not arbitrary structured fields.
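The crawl-mode limits listed above (same-domain, breadth-first, depth 3, capped by maxPages) can be illustrated with a small traversal over an in-memory link graph. This is a sketch of the general technique, not the Actor's implementation: `bfs_same_domain` and `get_links` are hypothetical, and counting the start URL as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_same_domain(start_url, get_links, max_pages=10, max_depth=3):
    """Visit pages breadth-first, staying on the start URL's domain."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])  # (url, depth); start page assumed depth 0
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen or urlparse(link).netloc != domain:
                continue  # skip repeats and external domains
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages.
graph = {
    "https://a.test/": ["https://a.test/1", "https://b.test/x", "https://a.test/2"],
    "https://a.test/1": ["https://a.test/3"],
    "https://a.test/3": ["https://a.test/4"],
    "https://a.test/4": ["https://a.test/5"],
}
order = bfs_same_domain("https://a.test/", lambda url: graph.get(url, []))
print(order)
```

In this toy graph the external `b.test` link is never followed, and `/5` is not reached because it sits at depth 4.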

Built on Crawl4AI (Apache 2.0). This Actor is not affiliated with the Crawl4AI project; it packages the library as a hosted service.
