URL: https://apify.com/inexhaustible_glass/rag-website-crawler

Website to Text & Markdown - AI / RAG Content Crawler · Apify

Pricing

from $5.00 / 1,000 results

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

Rating

0.0

(0)

Developer

Hitman studio

Maintained by Community

Actor stats

Bookmarked: 1
Total users: 3
Monthly active users: 1
Last modified: 3 days ago

🕷️ RAG Website Crawler: Markdown + Chunks + PDFs for AI

Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).

Why this one is better

| Feature | Plain content crawlers | RAG Website Crawler |
| --- | --- | --- |
| Clean Markdown | ✅ | ✅ |
| Auto chunks + token counts | ❌ (extra step) | ✅ built-in |
| PDF / Word / Excel extraction | ❌ skipped | ✅ included |
| Anti-block fetching | sometimes | ✅ browser TLS + proxy |
| AI summary per page | ❌ | ✅ optional, your own key |
| robots.txt + trap protection | varies | ✅ built-in |
| GPU needed | - | ❌ 100% CPU |

What you get per page

{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [{"index": 0, "text": "...", "tokens": 500}],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}

Chunks are ready to embed straight into a vector DB.
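
As a rough illustration of what "ready to embed" can look like in practice, the sketch below flattens one page record (in the shape shown above) into one record per chunk. The function name `page_to_records`, the record layout, and the hashed-ID convention are assumptions made for this example, not part of the Actor; embedding and the upsert call for your vector database are left out.

```python
import hashlib


def page_to_records(page: dict) -> list[dict]:
    """Flatten one crawled page into one record per chunk (hypothetical helper)."""
    records = []
    for chunk in page.get("chunks", []):
        # Assumed convention: a deterministic ID built from URL + chunk index,
        # so re-crawling the same page overwrites records instead of duplicating them.
        raw_id = f"{page['url']}#{chunk['index']}"
        records.append({
            "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:32],
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title", ""),
                "chunk_index": chunk["index"],
                "tokens": chunk.get("tokens"),
                "content_hash": page.get("content_hash"),
            },
        })
    return records


# Sample item mirroring the fields in the output example above.
page = {
    "url": "https://site.com/docs/intro",
    "title": "Introduction",
    "chunks": [{"index": 0, "text": "...", "tokens": 500}],
    "content_hash": "…",
}
records = page_to_records(page)
print(len(records), records[0]["metadata"]["url"])
```

Each record's `text` would then be passed to the embedding model of your choice, with `id` and `metadata` stored alongside the resulting vector.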

Robust by design

Handles the classic crawler traps automatically:

  • Infinite loops / calendar traps → depth + page caps, trap heuristics
  • Duplicate URLs / content → URL normalisation + content-hash dedup
  • robots.txt & crawl-delay → respected (toggle)
  • Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
  • Huge pages / memory → size cap, HTTP-only (no heavy browser)
  • Dead URLs → limited retries, never re-queued
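
To make a few of these protections concrete, here is a minimal sketch of URL normalisation, content-hash dedup, and a jittered backoff delay using only the Python standard library. It shows one common way to implement these ideas and is not the Actor's actual code; the names `normalise_url`, `content_hash`, `Deduper`, and `backoff_delay`, and the specific normalisation rules, are assumptions for illustration.

```python
import hashlib
import random
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Map equivalent URLs to one key (assumed rules; userinfo is dropped)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = {"http": 80, "https": 443}.get(scheme)
    if port and port != default_port:
        host = f"{host}:{port}"  # keep only non-default ports
    path = parts.path.rstrip("/") or "/"  # treat /docs and /docs/ as the same page
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))  # fragment removed


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace, so trivial reflows still match."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class Deduper:
    """Tracks URLs and page bodies that have already been seen."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()

    def is_new_url(self, url: str) -> bool:
        key = normalise_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text: str) -> bool:
        digest = content_hash(text)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, e.g. after an HTTP 429 response."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


dedup = Deduper()
print(dedup.is_new_url("https://site.com/docs"))        # first visit
print(dedup.is_new_url("HTTPS://Site.com/docs/#top"))   # same page after normalisation
```

Hashing the content as well as the URL catches the case where two different URLs (for example, a print view and the regular page) serve the same text.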

Input (key options)

  • startUrls: where to begin
  • maxPages, maxDepth, sameDomainOnly, allowSubdomains
  • chunkSizeTokens, chunkOverlapTokens
  • includeDocuments: also crawl linked PDFs/Office files
  • respectRobotsTxt, crawlDelaySeconds, useProxy
  • aiProvider + aiApiKey (BYOK): optional per-page AI summary
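
The options above could be assembled into a run input roughly as follows. This is a sketch under stated assumptions: the option names come from the list above, but the value types, the default values, and the `[{"url": ...}]` shape of `startUrls` are guesses and should be checked against the Actor's input schema. `build_run_input` and `run_crawler` are hypothetical helper names; `run_crawler` relies on the `apify-client` package, imported lazily so the input can be built without it.

```python
import json


def build_run_input(
    start_urls: list[str],
    max_pages: int = 50,
    max_depth: int = 2,
    chunk_size_tokens: int = 500,
    chunk_overlap_tokens: int = 50,
) -> dict:
    """Assemble an input object from the documented option names (values assumed)."""
    if chunk_overlap_tokens >= chunk_size_tokens:
        # An overlap as large as the chunk itself would never advance through the text.
        raise ValueError("chunkOverlapTokens must be smaller than chunkSizeTokens")
    return {
        "startUrls": [{"url": u} for u in start_urls],  # assumed shape
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "sameDomainOnly": True,
        "allowSubdomains": False,
        "chunkSizeTokens": chunk_size_tokens,
        "chunkOverlapTokens": chunk_overlap_tokens,
        "includeDocuments": True,
        "respectRobotsTxt": True,
        "crawlDelaySeconds": 1,
        "useProxy": False,
    }


def run_crawler(token: str, run_input: dict) -> list[dict]:
    """Start the Actor and return its dataset items (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # lazy import: only needed for a real run

    client = ApifyClient(token)
    run = client.actor("inexhaustible_glass/rag-website-crawler").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://site.com/docs/intro"])
print(json.dumps(run_input, indent=2))
```

With a valid API token, `run_crawler(token, run_input)` would return the per-page items. The optional `aiProvider` and `aiApiKey` fields are deliberately left out here, so no per-page AI summary is requested.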

Privacy

The optional AI summary runs on your own API key, which is passed as a secret input (isSecret): encrypted and never logged. The Actor ships no built-in key, so there is no key of ours that could be exposed.

What people use this for (search terms)

Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:

  • website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
  • data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
  • works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
  • also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler

Common use cases

  • Build an AI chatbot that answers questions about your website or docs
  • Feed a company knowledge base into a vector database for RAG
  • Turn documentation / help centers into clean Markdown for LLMs
  • Collect research content from many pages into one structured dataset
  • Extract text from PDFs and documents linked across a site
