RAG Data Ingestion: Website to AI Knowledge Base

Under maintenance

Pricing

from $1.00 / 1,000 premium scraped pages

Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.

Rating

0.0

(0)

Developer

tekk

Maintained by Community

Actor stats

Last modified: 2 months ago

Universal AI Knowledge Scraper — Premium RAG Ingestion Engine

The high-fidelity bridge between the complex web and your LLM. Convert any website or documentation portal into cleaned, chunked, and token-accurate Markdown optimized for RAG pipelines.

Build production-grade RAG (Retrieval-Augmented Generation) datasets with a single Actor run. Feed the output directly into Pinecone, Weaviate, Qdrant, ChromaDB, or any vector store.

🛡️ Why Use This Actor?

Many scrapers return empty strings on documentation sites that render their content inside Shadow DOM or web components. This Actor was built to solve that "Invisible Web" problem.

Feature	Standard Scrapers	This Actor
Vanilla HTML	✅	✅
Shadow DOM / Web Components	❌ (Empty Output)	✅ (Full Flattening)
Token Tracking	❌ (Manual Regex)	✅ (Native Tiktoken)
Modern Code Blocks	❌ (Garbled)	✅ (Clean GFM)

Built-in Token Counting for Budget Management — Every record includes a usage object with exact token counts, encoding type, and chunk parameters. Enterprise teams can calculate embedding costs before hitting the OpenAI API.
Shadow DOM Extraction — Successfully captures content from Shadow DOM-heavy sites (like Shoelace Web Components) where standard crawlers see nothing.
Zero-Config Extraction — No CSS selectors to maintain. The density-based Readability algorithm adapts to any site layout automatically.
Antifragile Stealth — Bézier-curve mouse simulation and fingerprint rotation make this Actor invisible to Cloudflare, Akamai, and behavioral detection systems.
CU-Optimized — Resource interception blocks images, fonts, and media. You get lower memory usage and higher concurrency at the same price.
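As a rough illustration of the budgeting use case above, the sketch below sums `usage.total_tokens` across records and converts the total into a dollar estimate. The function name `estimate_embedding_cost` and the price of $0.10 per million tokens are assumptions made for this example, not values from the Actor or from any provider's price list.

```python
from typing import Iterable, Mapping


def estimate_embedding_cost(
    records: Iterable[Mapping],
    usd_per_million_tokens: float,
) -> float:
    """Sum usage.total_tokens over all records and convert to USD."""
    total_tokens = sum(record["usage"]["total_tokens"] for record in records)
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Two minimal records carrying only the field this sketch reads.
records = [
    {"usage": {"total_tokens": 1010}},
    {"usage": {"total_tokens": 2990}},
]

# Hypothetical price; replace it with your provider's current rate.
cost = estimate_embedding_cost(records, usd_per_million_tokens=0.10)
print(f"Estimated embedding cost: ${cost:.6f}")
```

If you embed the individual chunks rather than whole pages, overlapping tokens are embedded more than once, so summing each chunk's `token_count` may give a closer estimate than `total_tokens`.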

🚀 Key Features

Hybrid Discovery — Priority parsing of sitemap.xml with fallback to recursive <a> tag extraction.
Universal Extraction — Powered by Mozilla's Readability algorithm with recursive Shadow DOM flattening.
Clean Markdown Output — Converts HTML to Markdown via Turndown with GFM support (tables, code blocks).
Token-Aware Chunking — Splits content using tiktoken (GPT-4o / o1 encodings) into configurable chunk sizes with overlap.
Bloom Filter Dedup — O(1) URL deduplication prevents infinite loops and duplicate scraping.
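To make the Bloom filter deduplication above concrete, here is a minimal sketch of the general technique. It is not the Actor's implementation: the class `BloomFilter`, the helper `should_crawl`, the bit-array size, and the number of hash functions are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True the first time a URL is offered, False on repeats."""
    if url in seen:
        return False
    seen.add(url)
    return True


seen = BloomFilter()
print(should_crawl("https://docs.example.com/api/auth", seen))
print(should_crawl("https://docs.example.com/api/auth", seen))
```

Each lookup costs a fixed number of hash probes regardless of how many URLs have been stored, which is where the constant-time claim comes from. The trade-off is a small false-positive rate: a URL that was never visited can occasionally be reported as seen and skipped, so the bit-array size should be chosen with the expected crawl size in mind.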

📦 Output Format

Each record is a standardized JSON object ready for vector database ingestion:

{
  "metadata": {
    "source_url": "https://docs.example.com/api/auth",
    "title": "Authentication — API Docs",
    "crawled_at": "2026-04-30T13:00:00Z",
    "site_name": "Example Docs",
    "lang": "en"
  },
  "usage": {
    "total_tokens": 1010,
    "total_chunks": 2,
    "encoding": "o200k_base",
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "content": [
    {
      "chunk_id": 1,
      "token_count": 512,
      "text": "### Authentication\n\nAll API requests require a Bearer token..."
    }
  ],
  "raw_markdown": "### Authentication\n\nAll API requests require a Bearer token..."
}
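The `chunk_size` and `chunk_overlap` fields in the record above suggest a sliding window over token IDs. The sketch below shows one common way to do that and to shape the result like the `content` array. The Actor's actual splitting rules are not documented here, so the function names `chunk_tokens` and `to_content` are hypothetical, and the chunk boundaries and counts it produces will not necessarily match the Actor's output.

```python
from typing import Callable, Dict, List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    chunk_size: int = 512,
    chunk_overlap: int = 50,
) -> List[List[int]]:
    """Split token IDs into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


def to_content(
    chunks: List[List[int]],
    decode: Callable[[List[int]], str],
) -> List[Dict]:
    """Build entries shaped like the `content` array in the record."""
    return [
        {"chunk_id": i, "token_count": len(chunk), "text": decode(chunk)}
        for i, chunk in enumerate(chunks, start=1)
    ]


# Stand-in data so the sketch runs without extra dependencies.
# With tiktoken installed, one would typically use:
#   enc = tiktoken.get_encoding("o200k_base")
#   tokens = enc.encode(markdown); decode = enc.decode
tokens = list(range(1010))
chunks = chunk_tokens(tokens, chunk_size=512, chunk_overlap=50)
content = to_content(chunks, decode=lambda ids: f"<{len(ids)} tokens>")
print([entry["token_count"] for entry in content])
```

Working on token IDs rather than characters keeps every chunk within the configured token budget. The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.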

Site to LLM Knowledge Base

adambounhar/site-to-knowledge-base

Turn any website or docs into clean, LLM-ready Markdown for RAG and AI agents — one record per page, each with a token count. Sitemap- and robots.txt-aware, with predictable per-page pricing (no token credits). Simple knowledge-base ingestion.

Mohamed Adam BOUNHAR

Website Markdown Crawler

moorish-dev/website-markdown-crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Ziad Tarik

rag-docs-scraper

marbled_jury/my-actor

Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.

Hastin S.

RAG Spider - Web to Markdown Crawler for AI Training Data

lenient_grove/RAG-Spider

Enterprise-grade web crawler that converts messy websites into clean, chunked Markdown for AI systems. Uses Mozilla Readability for 95% cleaner extraction than competitors. Outputs RAG-ready data with metadata and token estimates. Perfect for building knowledge bases and training AI chatbots.

Tejas Rawool

5.0

AI Content Crawler

kai-agent/ai-content-crawler

Crawl any website and get clean, AI-ready content in markdown format. Perfect for RAG pipelines, LLM training data, and vector database ingestion. Features smart chunking, metadata extraction, and multiple output formats.

Kai Agent

Website Content Crawler

crawlerbros/website-content-crawler

Crawls websites and extracts clean text, markdown, or HTML content. Ideal for LLM training data, RAG pipelines, and knowledge base building.

Crawler Bros

Site to Markdown — any site to clean, LLM-ready markdown

topsail/site-to-markdown

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Connor Teskey

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Web Scraper RAG Ready

traorealexy/Web-Sraper-RAG-Ready

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Alexy Traore

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown, chunked by heading, for RAG and AI agents.

Manas Mantri

URL: https://apify.com/0xysn/rag-data-ingestion-website-to-ai-knowledge-base