Website Markdown Crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Pricing

from $2.00 / 1,000 websites analyzed

Rating

0.0

(0)

Developer

Ziad Tarik

Maintained by Community

Actor stats

Last modified

3 days ago

Features

Clean Markdown Extraction: Strips page noise (navigation, footers) and keeps only the main content.
Smart Chunking: Splits content into token-sized chunks that respect paragraph boundaries.
Language Filtering: Detects each page's language and can filter pages by it (e.g., only en or fr).
Domain Control: Keeps the crawler scoped to the seed URL's domain.
Regex Exclusions: Skips low-value URLs, such as tag or author pages, using regex patterns.
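
To illustrate the Smart Chunking idea, here is a minimal sketch of paragraph-aware chunking. It is not the Actor's actual implementation: the function names, the 500-token default, and the "about 4 characters per token" estimate are all assumptions made for this example.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Paragraphs are never split, so a single paragraph longer than
    max_tokens becomes its own oversized chunk.
    """
    # Paragraphs are separated by blank lines in Markdown.
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))

    # Mirror the shape of the "chunks" entries in the output record.
    return [
        {"index": i, "content": c, "tokenEstimate": estimate_tokens(c)}
        for i, c in enumerate(chunks)
    ]
```

Packing whole paragraphs keeps each chunk readable on its own, at the cost of chunks that vary in size.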

Output Example

Each crawled page yields a structured JSON record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"}
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    {"index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498}
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
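
As a rough sketch of how records in this shape might be consumed, the snippet below parses a dataset exported as a JSON array and flattens it to one row per chunk. The helper names `load_items` and `flatten_chunks` are hypothetical, and the code assumes only the fields shown in the example record above.

```python
import json


def load_items(json_text: str) -> list[dict]:
    """Parse a dataset export, assumed to be a JSON array of page records."""
    items = json.loads(json_text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of page records")
    return items


def flatten_chunks(items: list[dict]) -> list[dict]:
    """Return one row per chunk, carrying the page URL and title along."""
    rows = []
    for item in items:
        # Pages without a "chunks" field simply contribute no rows.
        for chunk in item.get("chunks", []):
            rows.append(
                {
                    "url": item["url"],
                    "title": item.get("title", ""),
                    "chunk": chunk["index"],
                    "text": chunk["content"],
                }
            )
    return rows
```

For example, `flatten_chunks(load_items(open("dataset.json").read()))` would produce rows ready for embedding, where `dataset.json` is a placeholder file name for the downloaded export.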

Integrations

Connect the crawler directly into your RAG stack.

LlamaIndex

from llama_index.core import Document

# dataset_items: the list of page records from the Actor's dataset
# (for example, the dataset downloaded as JSON after a run).
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]

LangChain

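# Current LangChain releases expose Document from langchain_core;
# langchain.docstore.document is the legacy import path.
from langchain_core.documents import Document as LCDoc

# dataset_items: the same list of page records used in the LlamaIndex example.
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]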

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Site to Markdown — any site to clean, LLM-ready markdown

topsail/site-to-markdown

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Connor Teskey

LLM Markdown Crawler

sleek_waveform/llm-markdown-crawler

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.

Daniel Dimitrov

AI Web Content Crawler - Markdown for LLMs

intelscrape/ai-web-content-crawler

Crawl any website and extract clean Markdown optimized for LLM training, RAG pipelines, and AI knowledge bases - removes boilerplate and outputs structured JSON with URL, title, markdown, and metadata.

IntelScrape

Simple Website Scrapper (markdown format)

manojaditya64/simple-website-scrapper-markdown-format

A simple website scraper that converts websites into Markdown, a format that is easy to use with LLMs. You can feed the Markdown data to an LLM for analysis.

Manojaditya Nadar

5.0

Website to Markdown Converter

lofomachines/website-to-markdown-converter

A faster, cheaper way to convert any web page into clean, structured, LLM-ready Markdown.

Lofomachines

Markdown API

vivid_astronaut/markdown

Fabio Suizu

Website Content Crawler

crawlerbros/website-content-crawler

Crawls websites and extracts clean text, markdown, or HTML content. Ideal for LLM training data, RAG pipelines, and knowledge base building.

Crawler Bros

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

302

1.9

Website Content Crawler — AI & RAG Ready

santamaria-automations/website-content-crawler

Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navs, and footers. Configurable depth, concurrency, and page limits. Pay-per-page.

Ale

URL: https://apify.com/moorish-dev/website-markdown-crawler