URL: https://apify.com/topsail/site-to-markdown

Site to Markdown API - Firecrawl Alternative · Apify


Site to Markdown: any site to clean, LLM-ready markdown

Pricing

from $1.50 / 1,000 pages

Scrape any website to clean, LLM-ready markdown: a compliant Firecrawl alternative for RAG ingestion, with robots.txt always on.

Rating: 0.0 (0)

Developer: Connor Teskey (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 9 days ago

Site to Markdown

Turn any website into clean, LLM-ready markdown: one document per page, with robots.txt compliance locked on.

Built for AI agents, RAG builders, and documentation pipelines that need a website-to-markdown step without running crawler infrastructure. Point it at a URL: it crawls breadth-first, strips navigation, ads, and boilerplate, and keeps only the main content as tidy markdown. If you have been looking for a Firecrawl alternative on Apify for scrape-to-markdown jobs, this is that actor.

What you get

One dataset item per page:

| Field | Meaning |
| --- | --- |
| url | The URL that was requested. |
| finalUrl | URL after redirects. |
| status | HTTP status code (0 when the fetch itself failed). |
| title | Page title, when found. |
| markdown | Clean, LLM-ready markdown of the page's main content. |
| text | Plain-text version (only when outputFormat is markdown+text). |
| linksCount | Number of links discovered on the page. |
| fetchedAt | ISO-8601 fetch timestamp. |
| rendered | Whether a headless browser rendered the page (always false in v1). |
| error | Error message when the page failed, otherwise null. |

Every run also writes a RUN_SUMMARY record to the key-value store with page counts and a failure breakdown.

Quick start

{
  "startUrls": [{"url": "https://docs.python.org/3/"}],
  "crawlMode": "site-crawl",
  "maxPages": 10,
  "maxDepth": 1
}

A run like this returns one markdown document per crawled page and typically finishes in well under a minute; the verification crawl of docs.python.org converted 5 of 5 pages.
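
For readers who would rather start the run from code than from the Apify console, here is a minimal sketch using the `apify-client` Python package. The actor ID `topsail/site-to-markdown` is taken from this page's URL; the helper names `build_run_input` and `run_crawl` and the way the token is passed in are illustrative assumptions, not something this page documents.

```python
from typing import Any


def build_run_input(start_url: str, max_pages: int = 10, max_depth: int = 1) -> dict[str, Any]:
    """Build the same input object as the quick start above (hypothetical helper)."""
    return {
        "startUrls": [{"url": start_url}],
        "crawlMode": "site-crawl",
        "maxPages": max_pages,
        "maxDepth": max_depth,
    }


def run_crawl(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("topsail/site-to-markdown").call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_crawl(token, build_run_input("https://docs.python.org/3/"))` with a valid Apify API token should return one dictionary per extracted page, with the fields described under "What you get".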

Output example

{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "finalUrl": "https://docs.python.org/3/tutorial/index.html",
  "status": 200,
  "title": "The Python Tutorial \u2014 Python 3.14.6 documentation",
  "markdown": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data st...",
  "linksCount": 35,
  "fetchedAt": "2026-06-11T00:49:18+00:00",
  "rendered": false,
  "error": null
}
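
When consuming the dataset downstream, it can help to give each item a type. The following is a minimal sketch that assumes the field names from the table above and a timestamp with a numeric offset such as `+00:00`, as in the example. The `PageResult` class and its `from_item` helper are hypothetical names, not part of the actor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass(frozen=True)
class PageResult:
    """One dataset item, with snake_case names for the documented fields."""

    url: str
    final_url: str
    status: int
    title: Optional[str]
    markdown: str
    links_count: int
    fetched_at: datetime
    rendered: bool
    error: Optional[str]
    text: Optional[str] = None  # only set when outputFormat is markdown+text

    @classmethod
    def from_item(cls, item: dict[str, Any]) -> "PageResult":
        return cls(
            url=item["url"],
            final_url=item["finalUrl"],
            status=item["status"],
            title=item.get("title"),
            markdown=item["markdown"],
            links_count=item["linksCount"],
            # Assumes an explicit offset like +00:00, as in the example output.
            fetched_at=datetime.fromisoformat(item["fetchedAt"]),
            rendered=item["rendered"],
            error=item.get("error"),
            text=item.get("text"),
        )
```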

Why this one

  • Robots-locked by design. Compliance is hard-coded into the crawler call, not an input default someone can flip. That makes the output safe to build a product on.
  • Selector-free extraction. Main content is found by trafilatura with an automatic readability-style fallback, so there are no CSS selectors to maintain when a site redesigns.
  • Honest zero-yield. If no pages produce markdown, the run fails with a classified failure breakdown instead of finishing green on an empty dataset.
  • Precise scope control. Include/exclude glob patterns match against the full URL, exclude wins, and same-domain crawling is the default.
  • Open foundation. Built on trawl (MIT), a clean-room crawler, with trafilatura as the quality extraction engine; the exact wheel is vendored into the image.

Compliance and reliability

Topsail actors are built compliance-first and ship with self-healing plumbing:

  • robots.txt is always respected (locked on). Every fetch goes through the crawler with robots compliance hard-coded; there is no input to turn it off. Pages disallowed by robots.txt are reported as robots-blocked, never fetched, and robots Crawl-delay is honored when larger than your politeness delay.
  • This actor reads only the public, static HTML pages you point it at (the same documents any browser receives without logging in) and only where robots.txt permits.
  • Transient failures retry with backoff (408, 425, 429, and 5xx responses, honoring Retry-After); persistent failures are reported, not hidden.
  • Every run writes a HEALTH summary (RUN_SUMMARY) to the key-value store with page counts, a failure breakdown (robots-blocked, http-4xx, http-5xx, timeout, extract-fail), and a per-URL failedPages list, so you can see exactly which pages delivered and which were blocked, empty, or erroring. Only successful pages become dataset results.
  • No PII, no paywalled or login-gated content, no circumvention.
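
The retry rule in the list above can be sketched as a small pure function. This is an illustration only: the set of retryable statuses comes from the text, while the exponential schedule, the base delay, the cap, and the function names are assumptions and may differ from what the actor does internally. Only the delta-seconds form of Retry-After is handled here.

```python
from typing import Optional

# Statuses named in the text as transient, in addition to any 5xx response.
RETRYABLE = {408, 425, 429}


def is_transient(status: int) -> bool:
    """True for 408, 425, 429, and any 5xx status."""
    return status in RETRYABLE or 500 <= status <= 599


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,   # assumed base delay in seconds
    cap: float = 60.0,   # assumed upper bound on the backoff
) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (1-based), or None to give up."""
    if not is_transient(status):
        return None
    backoff = min(cap, base * 2 ** (attempt - 1))
    # Honor a numeric Retry-After header by waiting at least that long.
    if retry_after is not None and retry_after.strip().isdigit():
        return max(backoff, float(retry_after.strip()))
    return backoff
```

A caller would loop while `retry_delay` returns a number and stop, reporting the failure, once it returns None or an attempt limit is reached.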

Pricing

Pay per result: $1.50 per 1,000 pages successfully extracted ($0.0015 per page), plus a fraction-of-a-cent actor start fee. Every dataset result is one extracted page; robots-blocked pages, failed fetches, and pages dropped by your URL filters never become results, so they cost nothing. The 10-page quick start above costs about two cents.
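
The arithmetic behind that estimate can be written out as a short sketch. The per-page price is the one stated above; the start fee is only described as a fraction of a cent, so the default of $0.005 used here is a placeholder assumption, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.0015")  # $1.50 per 1,000 extracted pages


def estimate_cost(extracted_pages: int, start_fee: Decimal = Decimal("0.005")) -> Decimal:
    """Estimated run cost in dollars; start_fee is an assumed placeholder value."""
    if extracted_pages < 0:
        raise ValueError("extracted_pages must be non-negative")
    return PRICE_PER_PAGE * extracted_pages + start_fee
```

With these assumptions, ten extracted pages come to $0.015 in page charges plus the start fee, which is consistent with the "about two cents" figure.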

Honest limits

  • No JavaScript rendering. Static HTML only: SPAs that render entirely client-side will come back thin. Headless rendering is on the roadmap for v2.
  • No sitemap.xml seeding yet; discovery is link-following from your start URLs.
  • One markdown document per page; no site-level concatenated export (easy to build downstream from the dataset).
  • robots.txt compliance cannot be disabled. If your use case requires ignoring robots.txt, this actor is not for you, by design.
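
The site-level export mentioned in the list above can be built from the dataset in a few lines. This is one possible sketch, assuming items shaped like the output example; the `concat_site` name, the HTML comment used to record each source URL, and the `---` separator are illustrative choices.

```python
from typing import Any, Iterable


def concat_site(items: Iterable[dict[str, Any]]) -> str:
    """Join per-page markdown into one document, skipping empty pages."""
    parts = []
    for item in items:
        markdown = (item.get("markdown") or "").strip()
        if not markdown:
            continue
        # Record where each section came from, preferring the post-redirect URL.
        source = item.get("finalUrl") or item.get("url", "")
        parts.append(f"<!-- source: {source} -->\n\n{markdown}")
    return "\n\n---\n\n".join(parts) + "\n" if parts else ""
```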

FAQ

Is this a Firecrawl alternative? For the core scrape and crawl endpoints, yes: website to markdown, one clean document per page, ready for RAG ingestion, as an Apify actor instead of separate infrastructure. It does not replicate Firecrawl's JS rendering or search features in v1.

Can it scrape JavaScript-heavy sites? Not in v1. It fetches static HTML, so server-rendered sites, documentation, and blogs work well; client-side SPAs come back thin.

How do I scrape a single page to markdown? Set crawlMode to single-page and list your URLs in startUrls; each one is converted on its own with no link following.

How do I keep a crawl focused on one section of a site? Use full-URL glob patterns: include https://docs.example.com/en/* and exclude */changelog/*, for example. Exclude always wins.
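
To reason about which URLs a pattern set would keep, the rule above (match against the full URL, exclude wins) can be approximated locally. This sketch uses Python's `fnmatch` as a stand-in; the actor's exact glob dialect is not documented here, so treat the matching details, and the `in_scope` name, as assumptions.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def in_scope(url: str, include: Sequence[str] = (), exclude: Sequence[str] = ()) -> bool:
    """Return True if the full URL passes the include/exclude globs."""
    # Exclude is checked first, so it always wins over include.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is in scope.
    return not include or any(fnmatchcase(url, pattern) for pattern in include)
```

With the patterns from the answer above, `https://docs.example.com/en/guide/intro` is kept, while `https://docs.example.com/en/changelog/v2` is dropped even though it also matches the include pattern.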

Can I turn off robots.txt compliance? No. It is hard-coded on, with no input to disable it. Disallowed pages are reported as robots-blocked so you can see what was skipped.
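
To preview which pages are likely to be reported as robots-blocked, and what delay would apply, the same checks can be approximated with Python's standard-library robots parser. This is a stand-in for illustration: the actor's crawler is not shown on this page, and the helper names and the `example-bot` user agent are assumptions.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def classify(parser: RobotFileParser, user_agent: str, url: str) -> str:
    """Label a URL the way the failure breakdown names disallowed pages."""
    return "allowed" if parser.can_fetch(user_agent, url) else "robots-blocked"


def effective_delay(parser: RobotFileParser, user_agent: str, politeness_delay: float) -> float:
    """Use Crawl-delay only when it is larger than the configured politeness delay."""
    crawl_delay = parser.crawl_delay(user_agent)
    return max(politeness_delay, float(crawl_delay or 0))
```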
