Website to Text & Markdown - AI / RAG Content Crawler
Pricing
from $5.00 / 1,000 results
Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.
🕷️ RAG Website Crawler - Markdown + Chunks + PDFs for AI
Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).
Why this one is better
| Feature | Plain content crawlers | RAG Website Crawler |
|---|---|---|
| Clean Markdown | ✅ | ✅ |
| Auto chunks + token counts | ❌ (extra step) | ✅ built-in |
| PDF / Word / Excel extraction | ❌ skipped | ✅ included |
| Anti-block fetching | sometimes | ✅ browser TLS + proxy |
| AI summary per page | ❌ | ✅ optional, your own key |
| robots.txt + trap protection | varies | ✅ built-in |
| GPU needed | ❌ | ❌ 100% CPU |
What you get per page
    {
      "url": "https://site.com/docs/intro",
      "title": "Introduction",
      "markdown": "# Introduction\n\n...",
      "word_count": 812,
      "token_count": 1043,
      "chunk_count": 3,
      "chunks": [
        {"index": 0, "text": "...", "tokens": 500}
      ],
      "is_document": false,
      "depth": 1,
      "content_hash": "…",
      "crawled_at": "2026-06-08T07:00:00Z"
    }
Chunks are ready to embed straight into a vector DB.
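As a minimal sketch of what "ready to embed" can look like in practice, the snippet below flattens one page record into one row per chunk. The field names come from the sample record above; the row layout (`id` / `text` / `metadata`), the helper name `to_embedding_rows`, and all sample values are assumptions made for illustration, not something the Actor emits.

```python
import json

# One dataset item shaped like the sample record above (values are made up).
record = json.loads("""
{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "chunk_count": 2,
  "chunks": [
    {"index": 0, "text": "First chunk text.", "tokens": 4},
    {"index": 1, "text": "Second chunk text.", "tokens": 4}
  ],
  "content_hash": "abc123"
}
""")


def to_embedding_rows(item):
    """Flatten one crawled page into one row per chunk.

    The id / text / metadata layout is a hypothetical convention;
    adapt it to whatever your vector database client expects.
    """
    rows = []
    for chunk in item.get("chunks", []):
        rows.append({
            # content_hash + chunk index gives a stable, repeatable id
            "id": f"{item['content_hash']}-{chunk['index']}",
            "text": chunk["text"],
            "metadata": {
                "url": item["url"],
                "title": item.get("title", ""),
                "tokens": chunk["tokens"],
            },
        })
    return rows


rows = to_embedding_rows(record)
for row in rows:
    print(row["id"], row["metadata"]["url"], row["metadata"]["tokens"])
```

Each row's `text` is what you would pass to your embedding model; the `metadata` travels with the vector so answers can cite the source URL.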
Robust by design
Handles the classic crawler traps automatically:
- Infinite loops / calendar traps → depth + page caps, trap heuristics
- Duplicate URLs / content → URL normalisation + content-hash dedup
- robots.txt & crawl-delay → respected (toggle)
- Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
- Huge pages / memory → size cap, HTTP-only (no heavy browser)
- Dead URLs → limited retries, never re-queued
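To make two of these protections concrete, here is a rough sketch of URL normalisation with content-hash dedup, plus a jittered backoff delay for 429 responses. It is not the Actor's actual code: the normalisation rules, the SHA-256 choice, the "full jitter" strategy, and every name (`normalize_url`, `content_hash`, `Deduper`, `backoff_delay`) are assumptions chosen for illustration.

```python
import hashlib
import random
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalise a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep a port only when it is not the default for the scheme.
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters and drop the fragment, which never reaches the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))


def content_hash(text):
    """Hash whitespace-normalised text so reflowed copies still match."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduper:
    """Remembers which URLs and page bodies have already been seen."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def should_fetch(self, url):
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text):
        digest = content_hash(text)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter, e.g. after an HTTP 429."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


dedup = Deduper()
print(dedup.should_fetch("HTTPS://Site.com:443/docs/?b=2&a=1#top"))  # first visit
print(dedup.should_fetch("https://site.com/docs?a=1&b=2"))           # same page, different spelling
print(round(backoff_delay(attempt=3), 2), "seconds before retry")
```

The second `should_fetch` call returns `False` because both spellings normalise to `https://site.com/docs?a=1&b=2`. Random jitter keeps many workers from retrying in lockstep after a rate limit.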
Input (key options)
- `startUrls`: where to begin
- `maxPages`, `maxDepth`, `sameDomainOnly`, `allowSubdomains`
- `chunkSizeTokens`, `chunkOverlapTokens`
- `includeDocuments`: also crawl linked PDFs/Office files
- `respectRobotsTxt`, `crawlDelaySeconds`, `useProxy`
- `aiProvider` + `aiApiKey` (BYOK): optional per-page AI summary
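The sketch below shows how these options might be combined in a run input, and how a chunk size and overlap typically interact. Only the option names come from the list above; every value, the `startUrls` item shape, and the `chunk_tokens` helper are assumptions, and splitting on whitespace merely stands in for whatever tokenizer the Actor really uses.

```python
# Hypothetical run input: option names from the list above, values assumed.
run_input = {
    "startUrls": [{"url": "https://site.com/docs"}],  # item shape is an assumption
    "maxPages": 200,
    "maxDepth": 3,
    "sameDomainOnly": True,
    "allowSubdomains": False,
    "chunkSizeTokens": 500,
    "chunkOverlapTokens": 50,
    "includeDocuments": True,
    "respectRobotsTxt": True,
    "crawlDelaySeconds": 1,
    "useProxy": False,
}


def chunk_tokens(tokens, chunk_size, overlap):
    """Sliding-window chunking: each chunk repeats the last `overlap` tokens
    of the previous one, so context is not cut off at chunk borders."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Whitespace "tokens" as a stand-in for a real tokenizer (assumption).
words = "a b c d e f g".split()
for chunk in chunk_tokens(words, chunk_size=3, overlap=1):
    print(chunk)
# ['a', 'b', 'c']
# ['c', 'd', 'e']
# ['e', 'f', 'g']
```

With the assumed values above (500 tokens per chunk, 50 overlap), each new chunk starts 450 tokens after the previous one. A larger overlap preserves more context across borders at the cost of storing and embedding more duplicated text.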
Privacy
The AI summary uses your own key (isSecret, encrypted, never logged).
The Actor never ships any built-in key, so nothing of ours can be exposed.
What people use this for (search terms)
Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:
- website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
- data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
- works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
- also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler
Common use cases
- Build an AI chatbot that answers questions about your website or docs
- Feed a company knowledge base into a vector database for RAG
- Turn documentation / help centers into clean Markdown for LLMs
- Collect research content from many pages into one structured dataset
- Extract text from PDFs and documents linked across a site
