Website to Text & Markdown — AI / RAG Content Crawler

Pricing

from $5.00 / 1,000 results

Scrape any website into clean text and Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts text from linked PDF, Word, and Excel files. Anti-block and robots.txt-aware. Simple website-to-text for beginners, a full RAG pipeline for pros. CPU only, no API key required.

Developer

Hitman studio

Maintained by Community

Actor stats

Last modified: 3 days ago

🕷️ RAG Website Crawler — Markdown + Chunks + PDFs for AI

Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).

Why this one is better

Feature	Plain content crawlers	RAG Website Crawler
Clean Markdown	✅	✅
Auto chunks + token counts	❌ (extra step)	✅ built-in
PDF / Word / Excel extraction	❌ skipped	✅ included
Anti-block fetching	sometimes	✅ browser TLS + proxy
AI summary per page	❌	✅ optional, your own key
robots.txt + trap protection	varies	✅ built-in
GPU needed	—	❌ no, runs 100% on CPU

What you get per page

{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [{"index": 0, "text": "...", "tokens": 500}],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}

Chunks are ready to embed straight into a vector DB.
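
As a rough sketch of that step, the snippet below flattens one output record into per-chunk items with stable IDs and metadata, a shape most vector DB clients accept once an embedding is attached. The record mirrors the sample above with shortened values; the to_embedding_items helper, the ID scheme, and the metadata keys are assumptions for illustration and not part of the Actor.

```python
import hashlib
import json

# Sample record in the shape shown above (values shortened for illustration).
record = json.loads("""
{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "chunk_count": 2,
  "chunks": [
    {"index": 0, "text": "First chunk text.", "tokens": 4},
    {"index": 1, "text": "Second chunk text.", "tokens": 4}
  ]
}
""")


def to_embedding_items(rec):
    """Flatten one page record into per-chunk items (hypothetical helper)."""
    # Hash the URL so IDs stay stable across runs and are safe to use as keys.
    page_id = hashlib.sha256(rec["url"].encode("utf-8")).hexdigest()[:16]
    items = []
    for chunk in rec.get("chunks", []):
        items.append({
            "id": f"{page_id}-{chunk['index']}",
            "text": chunk["text"],  # this is the string you would embed
            "metadata": {
                "url": rec["url"],
                "title": rec.get("title", ""),
                "tokens": chunk["tokens"],
            },
        })
    return items


items = to_embedding_items(record)
```

Deriving the ID from the URL and chunk index means a re-crawl of the same page overwrites its earlier vectors instead of duplicating them, assuming the vector store upserts by ID.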

Robust by design

Handles the classic crawler traps automatically:

Infinite loops / calendar traps → depth + page caps, trap heuristics
Duplicate URLs / content → URL normalisation + content-hash dedup
robots.txt & crawl-delay → respected (toggle)
Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
Huge pages / memory → size cap, HTTP-only (no heavy browser)
Dead URLs → limited retries, never re-queued
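
To make the dedup row concrete, here is a minimal sketch of URL normalisation plus content-hash dedup using only the Python standard library. It illustrates the general technique and is not the Actor's actual code: the normalisation rules, the TRACKING_PARAMS list, and the Deduper class are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters treated as tracking noise (an assumed list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalise_url(url):
    """Return a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep a non-default port; drop :80 for http and :443 for https.
    port = parts.port
    default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default:
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ))
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def content_hash(text):
    """Hash whitespace-collapsed text so layout-only differences are ignored."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class Deduper:
    """Tracks normalised URLs and content hashes seen during one crawl."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def is_new_url(self, url):
        key = normalise_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text):
        digest = content_hash(text)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

Checking the URL before fetching saves a request, while the content hash catches the case where two genuinely different URLs serve the same page.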

Input (key options)

startUrls — where to begin
maxPages, maxDepth, sameDomainOnly, allowSubdomains
chunkSizeTokens, chunkOverlapTokens
includeDocuments — also crawl linked PDFs/Office files
respectRobotsTxt, crawlDelaySeconds, useProxy
aiProvider + aiApiKey (BYOK) — optional per-page AI summary
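
The interaction between chunkSizeTokens and chunkOverlapTokens can be pictured as a sliding window. The sketch below is an assumption about how such options typically behave, not the Actor's implementation: it uses whitespace-separated words as stand-in tokens (the Actor's real tokenizer is not documented here), and the chunk_tokens function and the small option values are hypothetical.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "index": len(chunks),
            "text": " ".join(window),
            "tokens": len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Illustrative values only; these are not documented defaults.
options = {"chunkSizeTokens": 5, "chunkOverlapTokens": 2}
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, options["chunkSizeTokens"], options["chunkOverlapTokens"])
for chunk in chunks:
    print(chunk["index"], chunk["tokens"], chunk["text"])
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more chunks, and therefore more embeddings, per page.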

Privacy

The optional AI summary uses your own API key. The key is a secret input (isSecret): it is stored encrypted and never logged. The Actor ships no built-in key, so there is no shared key that could be exposed.

What people use this for (search terms)

Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:

website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler

Common use cases

Build an AI chatbot that answers questions about your website or docs
Feed a company knowledge base into a vector database for RAG
Turn documentation / help centers into clean Markdown for LLMs
Collect research content from many pages into one structured dataset
Extract text from PDFs and documents linked across a site

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

RAG Website Crawler - Clean Markdown for LLMs & AI

themineworks/rag-crawler

Crawl any website and extract clean, chunked Markdown ready for RAG pipelines and LLM context. Returns page text, titles and URLs. No API key. Works in Claude, ChatGPT & any MCP-compatible AI agent.

The Mine Works

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

Ken M

Website to Markdown for LLM and RAG

jeweled_jockstrap/my-actor-3

Convert any URL to clean Markdown text for AI applications. Strips HTML and extracts content for LLM training, RAG pipelines, and vector databases. Free Firecrawl alternative.

Juan Triviño

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Group Oject

PDF URL to Markdown, Tables & RAG Extractor

thescrapelab/Apify-PDF-url-scraper

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Inus Grobler

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Manas Mantri

PDF to Text API | Document Extraction for LLMs & RAG

andok/pdf-text-converter

Convert bulk PDF documents via URL into clean, raw text. The perfect document scraper for LLMs, vector databases, and RAG pipelines.

Andok

AI Data Pipeline — Crawl, Chunk & Export to Vector DB

ozapp/ai-data-pipeline

Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.

Ozapp

URL: https://apify.com/inexhaustible_glass/rag-website-crawler