URL: https://apify.com/moorish-dev/website-markdown-crawler

Website Markdown Crawler · Apify


Pricing

from $2.00 / 1,000 websites analyzed

Website Markdown Crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Rating

0.0

(0)

Developer

๐Ÿ‘ Ziad Tarik

Ziad Tarik

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 3 days ago

Crawls a website starting from a seed URL and converts every page to clean Markdown optimized for LLM ingestion (LlamaIndex, LangChain, OpenAI, Pinecone). The output includes structured metadata for each page: title, detected language, publication date, headings outline, word count, and chunked content ready to upsert into a vector store.

Features

  • Clean Markdown Extraction: Strips noise (navigation, footers) and keeps only the main content.
  • Smart Chunking: Splits content into token-bounded chunks that respect paragraph boundaries.
  • Language Filtering: Detects each page's language and can keep only selected languages (e.g., en or fr).
  • Domain Control: Keeps the crawler scoped to the seed URL's domain.
  • Regex Exclusions: Skips low-value URLs, such as tag or author pages, that match exclusion patterns.
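
The listing does not describe how these features are implemented. The sketch below is only an illustration of two of them, paragraph-aware chunking and URL scoping, under stated assumptions. The function names, the 4-characters-per-token estimate, the 500-token default, and the exact-host match are all hypothetical and are not taken from the Actor.

```python
import re
from urllib.parse import urlparse


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Paragraphs are never split, so a single paragraph larger than
    max_tokens ends up as one oversized chunk.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return [
        {"index": i, "content": c, "tokenEstimate": estimate_tokens(c)}
        for i, c in enumerate(chunks)
    ]


def should_crawl(url: str, seed_url: str, exclude_patterns: list[str]) -> bool:
    """Keep URLs on the seed's host that match none of the exclusion regexes."""
    if urlparse(url).hostname != urlparse(seed_url).hostname:
        return False
    return not any(re.search(pattern, url) for pattern in exclude_patterns)
```

A real crawler may treat subdomains as in scope and would likely use a proper tokenizer; both choices here are simplifications to keep the example short.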

Output Example

Each crawled page yields a structured JSON record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"}
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    {"index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498}
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
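
As a rough illustration of how such a record could be prepared for a vector store, the sketch below flattens one page into one entry per chunk. The `to_upsert_records` helper, the `url#chunk-N` ID scheme, and the shape of the output records are assumptions made for this example. Each vector store defines its own upsert format, and computing embeddings is left out.

```python
def to_upsert_records(item: dict) -> list[dict]:
    """Flatten one page record into one record per chunk.

    The ID scheme (url + '#chunk-' + index) is a hypothetical choice that
    keeps IDs stable across re-crawls of the same page.
    """
    return [
        {
            "id": f"{item['url']}#chunk-{chunk['index']}",
            "text": chunk["content"],
            "metadata": {
                "url": item["url"],
                "title": item.get("title"),
                "language": item.get("language"),
                "chunkIndex": chunk["index"],
            },
        }
        for chunk in item["chunks"]
    ]


# A trimmed-down copy of the example record, with only the fields used here.
page = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started",
    "language": "en",
    "chunks": [
        {"index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498}
    ],
}
records = to_upsert_records(page)
```

Deriving the ID from the URL and chunk index means a later crawl of the same page overwrites its earlier entries rather than duplicating them, provided the store upserts by ID.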

Integrations

Connect the crawler directly into your RAG stack.

LlamaIndex

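import json
from llama_index.core import Document

# After running the Actor, download the dataset as JSON (the file name here is an assumption)
with open('dataset_items.json', encoding='utf-8') as f:
    dataset_items = json.load(f)

# One Document per chunk, tagged with its source URL and chunk index
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]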

LangChain

from langchain_core.documents import Document as LCDoc

# dataset_items: the list of page records parsed from the Actor's dataset JSON
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]

You might also like

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

LLM Markdown Crawler

sleek_waveform/llm-markdown-crawler

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required; fast and cheap.

๐Ÿ‘ User avatar

Daniel Dimitrov

4

Website to Markdown Converter

lofomachines/website-to-markdown-converter

A faster, cheaper way to convert any web page into clean, structured, LLM-ready Markdown.

Website Content Crawler

crawlerbros/website-content-crawler

Crawls websites and extracts clean text, markdown, or HTML content. Ideal for LLM training data, RAG pipelines, and knowledge base building.

45

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer. Extracts website content, converts it to Markdown, and structures it for LLM training datasets.

๐Ÿ‘ User avatar

mohamed el hadi msaid

302

1.9