
URL: https://apify.com/0xysn/rag-data-ingestion-website-to-ai-knowledge-base


RAG Data Ingestion: Website to AI Knowledge Base

Under maintenance

Pricing

from $1.00 / 1,000 premium scraped pages


Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.


Rating: 0.0 (0)
Developer: tekk (Maintained by Community)
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 months ago

Universal AI Knowledge Scraper: Premium RAG Ingestion Engine

The high-fidelity bridge between the complex web and your LLM. Convert any website or documentation portal into cleaned, chunked, and token-accurate Markdown optimized for RAG pipelines.

Build production-grade RAG (Retrieval-Augmented Generation) datasets with a single Actor run. Feed the output directly into Pinecone, Weaviate, Qdrant, ChromaDB, or any vector store.


πŸ›‘οΈ Why Use This Actor?

Many scrapers return empty output on modern documentation sites that render their content inside Shadow DOM and web components. This Actor was built to solve that "Invisible Web" problem.

Feature | Standard Scrapers | This Actor
Vanilla HTML | ✅ | ✅
Shadow DOM / Web Components | ❌ (Empty Output) | ✅ (Full Flattening)
Token Tracking | ❌ (Manual Regex) | ✅ (Native Tiktoken)
Modern Code Blocks | ❌ (Garbled) | ✅ (Clean GFM)

  • Built-in Token Counting for Budget Management: Every record includes a usage object with exact token counts, encoding type, and chunk parameters. Enterprise teams can calculate embedding costs before hitting the OpenAI API.
  • Shadow DOM Extraction: Successfully captures content from Shadow DOM-heavy sites (like Shoelace Web Components) where standard crawlers see nothing.
  • Zero-Config Extraction: No CSS selectors to maintain. The density-based Readability algorithm adapts to any site layout automatically.
  • Antifragile Stealth: Bézier-curve mouse simulation and fingerprint rotation make this Actor invisible to Cloudflare, Akamai, and behavioral detection systems.
  • CU-Optimized: Resource interception blocks images, fonts, and media. You get lower memory usage and higher concurrency at the same price.
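
As a rough illustration of the budget calculation mentioned in the list above, the sketch below sums usage.total_tokens across records and multiplies by a per-million-token rate. The function name estimate_embedding_cost is hypothetical, the rate is a placeholder rather than a quoted price, and the sketch assumes total_tokens reflects what you will actually embed, which this page does not spell out.

```python
from typing import Iterable, Mapping


def estimate_embedding_cost(
    records: Iterable[Mapping], usd_per_million_tokens: float
) -> float:
    """Sum usage.total_tokens over all records and price them at a flat rate."""
    total_tokens = sum(r["usage"]["total_tokens"] for r in records)
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Two illustrative records; only the field this sketch reads is included.
records = [
    {"usage": {"total_tokens": 1010}},
    {"usage": {"total_tokens": 2990}},
]

# Placeholder rate (assumption): check your embedding provider's current pricing.
cost = estimate_embedding_cost(records, usd_per_million_tokens=0.10)
print(f"Estimated embedding cost: ${cost:.6f}")
```

Running the estimate over the dataset before calling an embedding API gives a spend figure without tokenizing anything a second time.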

🚀 Key Features

  • Hybrid Discovery: Priority parsing of sitemap.xml with fallback to recursive <a> tag extraction.
  • Universal Extraction: Powered by Mozilla's Readability algorithm with recursive Shadow DOM flattening.
  • Clean Markdown Output: Converts HTML to Markdown via Turndown with GFM support (tables, code blocks).
  • Token-Aware Chunking: Splits content using tiktoken (GPT-4o / o1 encodings) into configurable chunk sizes with overlap.
  • Bloom Filter Dedup: O(1) URL deduplication prevents infinite loops and duplicate scraping.
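
To make the token-aware chunking idea concrete, here is a minimal sliding-window sketch over a list of token IDs. It is not the Actor's implementation, which this page does not show; the function name chunk_tokens and the stride rule (chunk_size minus chunk_overlap) are assumptions. In practice the IDs could come from tiktoken, for example tiktoken.get_encoding("o200k_base").encode(text), but plain integers are used here so the sketch has no dependencies.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int = 512, chunk_overlap: int = 50
) -> List[List[int]]:
    """Split token IDs into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += stride
    return chunks


# Stand-in token IDs (assumption): a real run would use an encoder's output.
tokens = list(range(1000))
chunks = chunk_tokens(tokens, chunk_size=512, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.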

📦 Output Format

Each record is a standardized JSON object ready for vector database ingestion:

{
  "metadata": {
    "source_url": "https://docs.example.com/api/auth",
    "title": "Authentication - API Docs",
    "crawled_at": "2026-04-30T13:00:00Z",
    "site_name": "Example Docs",
    "lang": "en"
  },
  "usage": {
    "total_tokens": 1010,
    "total_chunks": 2,
    "encoding": "o200k_base",
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "content": [
    {
      "chunk_id": 1,
      "token_count": 512,
      "text": "### Authentication\n\nAll API requests require a Bearer token..."
    }
  ],
  "raw_markdown": "### Authentication\n\nAll API requests require a Bearer token..."
}
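
One way to prepare such a record for a vector store is to flatten it into one row per chunk, carrying the page metadata along. The sketch below assumes the record shape shown above; the function name record_to_rows, the row layout, and the ID scheme (source URL plus chunk number) are illustrative choices, not part of the Actor's output or of any particular vector database API.

```python
from typing import Dict, List


def record_to_rows(record: Dict) -> List[Dict]:
    """Flatten one page record into per-chunk rows ready for embedding and upsert."""
    meta = record["metadata"]
    rows = []
    for chunk in record["content"]:
        rows.append(
            {
                # Hypothetical ID scheme: stable per URL and chunk number.
                "id": f'{meta["source_url"]}#chunk-{chunk["chunk_id"]}',
                "text": chunk["text"],
                "metadata": {
                    "source_url": meta["source_url"],
                    "title": meta["title"],
                    "token_count": chunk["token_count"],
                },
            }
        )
    return rows


# Trimmed-down record using the same field names as the sample above.
record = {
    "metadata": {
        "source_url": "https://docs.example.com/api/auth",
        "title": "Authentication - API Docs",
    },
    "content": [
        {
            "chunk_id": 1,
            "token_count": 512,
            "text": "### Authentication\n\nAll API requests require a Bearer token...",
        }
    ],
}

rows = record_to_rows(record)
print(rows[0]["id"])
```

Each row's text field is what you would pass to an embedding model, and the id and metadata fields map onto the upsert payload of most vector stores.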
