Pricing
from $2.00 / 1,000 websites analyzed
Website Markdown Crawler
Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.
Crawls a website starting from a seed URL and converts every page to clean Markdown optimized for LLM ingestion (LlamaIndex, LangChain, OpenAI, Pinecone). The output includes structured metadata for each page: title, detected language, publication date, headings outline, word count, and chunked content ready for vector store upsert.
Features
- Clean Markdown Extraction: Strips noise (navigation, footers) to extract just the main content.
- Smart Chunking: Splits content into token-bounded chunks that respect paragraph boundaries.
- Language Filtering: Can automatically detect and filter pages by language (e.g., only en or fr).
- Domain Control: Keeps the crawler scoped to the seed URL's domain.
- Regex Exclusions: Skip non-valuable URLs like tags or author pages.
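The Smart Chunking feature above can be illustrated with a minimal sketch. This is not the Actor's actual implementation: the function names, the greedy packing strategy, the 500-token default, and the rough "4 characters per token" estimate are all assumptions made for illustration.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return math.ceil(len(text) / 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Paragraphs are never split, so a single paragraph larger than
    max_tokens ends up as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the paragraphs collected so far as one chunk.
        if current:
            content = "\n\n".join(current)
            chunks.append({
                "index": len(chunks),
                "content": content,
                "tokenEstimate": estimate_tokens(content),
            })
            current.clear()

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and estimate_tokens(candidate) > max_tokens:
            flush()
        current.append(para)
    flush()
    return chunks
```

The returned dictionaries mirror the `index`, `content`, and `tokenEstimate` keys seen in the Actor's chunk records, but the actual boundaries the Actor chooses may differ.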
Output Example
Each crawled page yields a structured JSON record:
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"}
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    {"index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498}
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
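As a rough sketch of how such a record might be prepared for a vector store, the helper below flattens one page into one row per chunk. The function name `to_upsert_rows` and the `url#chunk-N` ID scheme are hypothetical choices for illustration; the field names come from the record shown above, and embedding and the actual upsert call are left to whichever store you use.

```python
def to_upsert_rows(item: dict) -> list[dict]:
    """Flatten one page record into one row per chunk.

    The ID format "<url>#chunk-<index>" is an assumption chosen so that
    re-crawling the same page overwrites the same rows.
    """
    rows = []
    for chunk in item.get("chunks", []):
        rows.append({
            "id": f"{item['url']}#chunk-{chunk['index']}",
            "text": chunk["content"],
            "metadata": {
                "url": item["url"],
                "title": item.get("title"),
                "language": item.get("language"),
                "chunk": chunk["index"],
            },
        })
    return rows
```

Keeping the URL and chunk index in the metadata makes it possible to trace a retrieved chunk back to its source page.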
Integrations
Connect the crawler directly into your RAG stack.
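The snippets in this section iterate over a `dataset_items` list. One way to obtain it, assuming the Actor's dataset has been exported as a single JSON array file, is sketched below; the helper name `load_dataset_items` and any file path passed to it are hypothetical.

```python
import json
from pathlib import Path


def load_dataset_items(path: str) -> list[dict]:
    """Load a dataset export, assumed to be a JSON array of page records."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of page records")
    # Keep only records that actually carry chunks.
    return [item for item in items if item.get("chunks")]
```

For example, `dataset_items = load_dataset_items("dataset.json")` would produce the list used by the LlamaIndex and LangChain snippets, where `dataset.json` stands in for wherever the export was saved.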
LlamaIndex
from llama_index.core import Document

# After running the Actor, download the dataset as JSON and parse it
# into dataset_items (a list of page records like the one shown above).
docs = [
    Document(
        text=chunk['content'],
        metadata={'url': item['url'], 'chunk': chunk['index']},
    )
    for item in dataset_items
    for chunk in item['chunks']
]
LangChain
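# Document lives in langchain_core in current LangChain releases;
# older releases exposed it as langchain.docstore.document.Document.
from langchain_core.documents import Document as LCDoc

# dataset_items is the parsed list of page records from the Actor's dataset.
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]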
