Site to Markdown — any site to clean, LLM-ready markdown

Pricing

from $1.50 / 1,000 pages

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Rating

0.0

(0)

Developer

Connor Teskey

Maintained by Community

Last modified: 9 days ago

Site to Markdown

Turn any website into clean, LLM-ready markdown — one document per page, with robots.txt compliance locked on.

Built for AI agents, RAG builders, and documentation pipelines that need a website-to-markdown step without running crawler infrastructure. Point it at a URL: it crawls breadth-first, strips navigation, ads, and boilerplate, and keeps only the main content as tidy markdown. If you have been looking for a Firecrawl alternative on Apify for scrape-to-markdown jobs, this is that actor.

What you get

One dataset item per page:

| Field | Meaning |
| --- | --- |
| `url` | The URL that was requested. |
| `finalUrl` | URL after redirects. |
| `status` | HTTP status code (0 when the fetch itself failed). |
| `title` | Page title, when found. |
| `markdown` | Clean, LLM-ready markdown of the page's main content. |
| `text` | Plain-text version (only when `outputFormat` is `markdown+text`). |
| `linksCount` | Number of links discovered on the page. |
| `fetchedAt` | ISO-8601 fetch timestamp. |
| `rendered` | Whether a headless browser rendered the page (always `false` in v1). |
| `error` | Error message when the page failed, otherwise `null`. |

Every run also writes a RUN_SUMMARY record to the key-value store with page counts and a failure breakdown.
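As a rough illustration of how these fields might be consumed downstream, the sketch below types one dataset item and maps it to a generic record for a vector store. The field names come from the table above; the `PageItem` type, the `to_rag_document` helper, and the `{id, text, metadata}` output shape are hypothetical and not part of the actor.

```python
from typing import Optional, TypedDict


class PageItem(TypedDict, total=False):
    """One dataset item, using the field names documented above."""
    url: str
    finalUrl: str
    status: int
    title: Optional[str]
    markdown: str
    text: str  # present only when outputFormat is markdown+text
    linksCount: int
    fetchedAt: str
    rendered: bool
    error: Optional[str]


def to_rag_document(item: PageItem) -> dict:
    """Map an item to a generic {id, text, metadata} record (hypothetical shape)."""
    return {
        # Prefer the post-redirect URL as a stable identifier.
        "id": item.get("finalUrl") or item["url"],
        "text": item["markdown"],
        "metadata": {
            "title": item.get("title"),
            "fetched_at": item.get("fetchedAt"),
            "source_url": item["url"],
        },
    }
```

Keying on `finalUrl` rather than `url` is a design choice for this sketch: it avoids storing the same page twice when two requested URLs redirect to one document.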

Quick start

{
  "startUrls": [{"url": "https://docs.python.org/3/"}],
  "crawlMode": "site-crawl",
  "maxPages": 10,
  "maxDepth": 1
}

A run like this returns one markdown document per crawled page and typically finishes in well under a minute; the verification crawl of docs.python.org converted 5 of 5 pages.
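The same quick-start input could be submitted from Python with the `apify-client` package. This is a minimal sketch, not an official snippet: it assumes the actor ID `topsail/site-to-markdown` (taken from this page's URL), an `APIFY_TOKEN` environment variable, and a client version in which `call()` returns the run as a dict. The helper names are hypothetical.

```python
import os


def build_run_input(start_url: str, max_pages: int = 10, max_depth: int = 1) -> dict:
    """Build the quick-start input shown above for a given start URL."""
    return {
        "startUrls": [{"url": start_url}],
        "crawlMode": "site-crawl",
        "maxPages": max_pages,
        "maxDepth": max_depth,
    }


def run_and_collect(run_input: dict, actor_id: str = "topsail/site-to-markdown") -> list:
    """Start the actor, wait for it to finish, and return all dataset items."""
    # Imported lazily so this module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_and_collect(build_run_input("https://docs.python.org/3/"))` would then return one dict per extracted page.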

Output example

{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "finalUrl": "https://docs.python.org/3/tutorial/index.html",
  "status": 200,
  "title": "The Python Tutorial — Python 3.14.6 documentation",
  "markdown": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data st...",
  "linksCount": 35,
  "fetchedAt": "2026-06-11T00:49:18+00:00",
  "rendered": false,
  "error": null
}

Why this one

Robots-locked by design. Compliance is hard-coded into the crawler call, not an input default someone can flip. That makes the output safe to build a product on.
Selector-free extraction. Main content is found by trafilatura with an automatic readability-style fallback — no CSS selectors to maintain when a site redesigns.
Honest zero-yield. If no pages produce markdown, the run fails with a classified failure breakdown instead of finishing green on an empty dataset.
Precise scope control. Include/exclude glob patterns match against the full URL, exclude wins, and same-domain crawling is the default.
Open foundation. Built on trawl (MIT), a clean-room crawler, with trafilatura as the quality extraction engine — the exact wheel is vendored into the image.
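The scope rule described above (globs matched against the full URL, exclude wins) can be sketched in a few lines. This is an approximation using Python's `fnmatch`, where `*` also matches `/`; the actor's actual glob dialect is not documented here, and `in_scope` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def in_scope(url: str, include=(), exclude=()) -> bool:
    """Return True when a full URL passes the include/exclude globs."""
    # Exclude is checked first, so it wins over any include match.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is in scope.
    if not include:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include)
```

Note that `fnmatch` treats `?` and `[` as wildcards too, so patterns containing query strings would need escaping in this sketch.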

Compliance and reliability

Topsail actors are built compliance-first and ship with built-in retries and per-run health reporting:

robots.txt is always respected — locked on. Every fetch goes through the crawler with robots compliance hard-coded; there is no input to turn it off. Pages disallowed by robots.txt are reported as robots-blocked, never fetched, and robots Crawl-delay is honored when larger than your politeness delay.
This actor reads only the public, static HTML pages you point it at — the same documents any browser receives without logging in — and only where robots.txt permits.
Transient failures retry with backoff (408, 425, 429, and 5xx responses, honoring Retry-After); persistent failures are reported, not hidden.
Every run writes a HEALTH summary (RUN_SUMMARY) to the key-value store with page counts, a failure breakdown — robots-blocked, http-4xx, http-5xx, timeout, extract-fail — and a per-URL failedPages list, so you can see exactly which pages delivered and which were blocked, empty, or erroring. Only successful pages become dataset results.
No PII, no paywalled or login-gated content, no circumvention.
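The retry and delay rules above can be illustrated with three small functions. The retryable status codes and the Crawl-delay rule are taken from the list; the exponential schedule, the `base` and `cap` values, and all function names are assumptions for illustration, since the actor's real backoff parameters are not documented here.

```python
from typing import Optional

RETRYABLE_STATUSES = {408, 425, 429}


def is_retryable(status: int) -> bool:
    """408, 425, 429 and any 5xx response count as transient."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Assumed schedule: exponential doubling, capped.
    delay = min(cap, base * 2 ** attempt)
    # A server-provided Retry-After is honored when it asks for a longer wait.
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay


def effective_delay(politeness: float, crawl_delay: Optional[float]) -> float:
    """Use robots Crawl-delay only when it is larger than the politeness delay."""
    if crawl_delay is not None and crawl_delay > politeness:
        return crawl_delay
    return politeness
```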

Pricing

Pay per result: $1.50 per 1,000 pages successfully extracted ($0.0015 per page), plus a fraction-of-a-cent actor start fee. Every dataset result is one extracted page — robots-blocked pages, failed fetches, and pages dropped by your URL filters never become results, so they cost nothing. The 10-page quick start above costs about two cents.
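The arithmetic behind that estimate is simple enough to sketch. The per-page price comes from the paragraph above; the start fee amount is not stated on this page, so `start_fee_usd` is a placeholder parameter and `estimate_cost` is a hypothetical helper.

```python
PRICE_PER_PAGE_USD = 1.50 / 1000  # $1.50 per 1,000 extracted pages


def estimate_cost(extracted_pages: int, start_fee_usd: float = 0.0) -> float:
    """Estimated run cost in USD; only successfully extracted pages are billed."""
    return round(extracted_pages * PRICE_PER_PAGE_USD + start_fee_usd, 4)
```

For the 10-page quick start, the pages alone come to $0.015; the actor start fee is added on top of that.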

Honest limits

No JavaScript rendering. Static HTML only — SPAs that render entirely client-side will come back thin. Headless rendering is on the roadmap for v2.
No sitemap.xml seeding yet; discovery is link-following from your start URLs.
One markdown document per page; no site-level concatenated export (easy to build downstream from the dataset).
robots.txt compliance cannot be disabled. If your use case requires ignoring robots.txt, this actor is not for you — by design.
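A site-level export of the kind mentioned above could be assembled from the dataset roughly as follows. This is a sketch only: the `concat_site` name, the HTML-comment source marker, and the `---` separator are arbitrary choices, not actor features.

```python
def concat_site(items) -> str:
    """Join per-page markdown into one document, ordered by URL."""
    parts = []
    for item in sorted(items, key=lambda i: i.get("finalUrl") or i["url"]):
        markdown = (item.get("markdown") or "").strip()
        if not markdown:
            continue  # skip pages with no extracted content
        source = item.get("finalUrl") or item["url"]
        # Keep the page URL next to its content so chunks stay traceable.
        parts.append(f"<!-- source: {source} -->\n\n{markdown}")
    return "\n\n---\n\n".join(parts) + "\n"
```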

FAQ

Is this a Firecrawl alternative? For the core scrape and crawl endpoints, yes: website to markdown, one clean document per page, ready for RAG ingestion — as an Apify actor instead of separate infrastructure. It does not replicate Firecrawl's JS rendering or search features in v1.

Can it scrape JavaScript-heavy sites? Not in v1. It fetches static HTML, so server-rendered sites, documentation, and blogs work well; client-side SPAs come back thin.

How do I scrape a single page to markdown? Set crawlMode to single-page and list your URLs in startUrls; each one is converted on its own with no link following.
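For example, a single-page input might be built like this; `single_page_input` is a hypothetical helper, while `crawlMode`, `single-page`, and `startUrls` are the names given in the answer above.

```python
def single_page_input(urls) -> dict:
    """Build an input that converts each URL on its own, with no link following."""
    return {
        "startUrls": [{"url": url} for url in urls],
        "crawlMode": "single-page",
    }
```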

How do I keep a crawl focused on one section of a site? Use full-URL glob patterns: include https://docs.example.com/en/* and exclude */changelog/*, for example. Exclude always wins.

Can I turn off robots.txt compliance? No. It is hard-coded on, with no input to disable it. Disallowed pages are reported as robots-blocked so you can see what was skipped.

More compliant data feeds from Topsail

GTA 6 Countdown & Developments Tracker — countdown, confirmed facts, diffed developments, market odds
Commodity Intel — oil, gold, uranium headlines from permitted sources
Crypto News — BTC/ETH/DeFi headlines from major outlets
AI Research Radar — new papers and lab announcements


URL: https://apify.com/topsail/site-to-markdown
