URL: https://apify.com/charitable_jeopardy/webscraperap

AI & RAG Doc Ingester · Apify


Pricing

from $0.20 / 1,000 pages scraped


Docs-to-RAG AI Crawler

Stop wasting space on website headers, footers, cookie banners, and navigation menus. Extract clean body text, chunk it for RAG, and detect page changes across runs when crawling public docs, blogs, and knowledge bases.

Rating: 0.0 (0)

Developer: charitable_jeopardy (Maintained by Community)

Actor stats: 0 bookmarked, 1 total user, 0 monthly active users, last modified 9 days ago

AI & RAG Documentation Ingester (Pre-Chunked Web Crawler)

Stop wasting LLM tokens and vector DB space on website headers, footers, cookie banners, and navigation menus.

This Actor crawls public documentation sites, blogs, and knowledge bases, extracts only the core body content, and outputs clean, pre-chunked text records mapped to their nearest headings, complete with incremental change detection to keep your vector database synced efficiently.


🎯 Best For

  • RAG & LLM Developers looking to ingest clean documentation, guides, or manuals into vector databases (Pinecone, Qdrant, PGVector, etc.).
  • AI Product Teams building custom customer support agents or search engines over vertical/niche websites.
  • Knowledge Engineers who need to monitor specific websites and ingest only new or updated pages.

Why this is better than a generic crawler

  1. Zero Noise: Automatically strips out navigation links, scripts, CSS, sidebars, newsletter boxes, and cookie overlays before parsing.
  2. Context-Aware Chunking: Instead of naive character splitting, it generates overlapping text blocks and attaches the relevant heading hierarchy (h1–h6) to every single chunk.
  3. Stateful Incremental Ingestion: Uses a persistent Key-Value Store across runs to compare page content hashes. It flags pages as new, changed, or unchanged so you only update changed chunks in your database.
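
To make the chunking idea in point 2 concrete, here is a minimal sketch of overlapping, heading-aware chunking. It is an illustration under stated assumptions, not the Actor's actual implementation: the names `Chunk`, `chunk_text`, and `heading_context` are hypothetical, and headings are assumed to arrive as `(char_offset, level, title)` tuples sorted by offset.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    index: int
    text: str
    char_start: int
    char_end: int
    headings: list = field(default_factory=list)


def heading_context(headings, pos):
    """Return the heading hierarchy that is open at character position `pos`.

    `headings` is a list of (char_offset, level, title) tuples sorted by offset
    (a hypothetical input shape chosen for this sketch).
    """
    stack = []
    for offset, level, title in headings:
        if offset > pos:
            break
        # A new heading closes any open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        stack.append({"level": level, "text": title})
    return stack


def chunk_text(text, headings, size=1000, overlap=150):
    """Split `text` into windows of `size` chars that overlap by `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(
            Chunk(len(chunks), text[start:end], start, end,
                  heading_context(headings, start))
        )
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks
```

With `size=1000` and `overlap=150`, each window starts 850 characters after the previous one, so the last 150 characters of one chunk reappear at the start of the next. The heading context is taken from the chunk's start position, which is one simple choice among several.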

💡 Example Workflow: Ingesting a Blog to Pinecone

  1. Configure Target: Input the seed URL or sitemap (e.g., https://example.com/sitemap.xml).
  2. Filter Blog Posts: Add https://example.com/blog/** to includePatterns, and add tag and author pages to excludePatterns.
  3. Enable Chunking & Change Detection: Set chunkText: true and detectChanges: true.
  4. Configure Output: Set outputFormat to chunks or pagesAndChunks.
  5. Sync: Run the Actor, retrieve only the new or changed chunks from the dataset, and upsert them to your vector database.
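
Step 5 depends on knowing which pages are new or changed. The description above says the Actor compares page content hashes across runs; the sketch below shows that general idea only. The hash algorithm (SHA-256), the function names, and the use of a plain dict in place of a persistent Key-Value Store are all assumptions made for illustration.

```python
import hashlib


def content_hash(text):
    """Hash the cleaned page text (SHA-256 is an assumption, not the Actor's documented choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_pages(pages, previous_hashes):
    """Label each page as 'new', 'changed', or 'unchanged'.

    pages:           dict of url -> cleaned text from the current run
    previous_hashes: dict of url -> hash saved by the previous run
                     (stands in for a persistent Key-Value Store)
    Returns (statuses, updated_hashes).
    """
    statuses = {}
    updated = dict(previous_hashes)  # do not mutate the caller's state
    for url, text in pages.items():
        digest = content_hash(text)
        old = previous_hashes.get(url)
        if old is None:
            statuses[url] = "new"
        elif old != digest:
            statuses[url] = "changed"
        else:
            statuses[url] = "unchanged"
        updated[url] = digest
    return statuses, updated


def urls_to_upsert(statuses):
    """Only new or changed pages need their chunks re-embedded and upserted."""
    return [url for url, status in statuses.items() if status != "unchanged"]
```

Persisting `updated_hashes` after each run and passing it back in on the next run is what makes the comparison stateful.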

📄 Example Output: Chunk Record

Each chunk is a self-contained record ready for embedding generation:

{
  "recordType": "chunk",
  "chunkId": "a8f9c118bc28a192c73d9059f0f9bde0",
  "pageUrl": "https://example.com/docs/getting-started",
  "canonicalUrl": "https://example.com/docs/getting-started",
  "site": "example.com",
  "title": "Getting Started Guide | Documentation",
  "chunkIndex": 0,
  "chunkText": "To install the library, run 'npm install @sdk/core'. Make sure you have Node.js version 20 or higher installed in your environment before initiating setup...",
  "chunkCharStart": 0,
  "chunkCharEnd": 150,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "headingsContext": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"}
  ],
  "language": "en",
  "contentHash": "8f3c9e...",
  "timestamp": "2026-06-06T12:00:00.000Z"
}
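
As a rough illustration of how such a record might be prepared for embedding, the sketch below maps a chunk record to a generic id, text, and metadata payload. The helper names and the payload shape are hypothetical and not tied to any particular vector database client; only the field names read from the chunk come from the record above.

```python
def heading_breadcrumb(chunk):
    """Join the heading hierarchy into one line, e.g. 'Getting Started > Installation'."""
    return " > ".join(h["text"] for h in chunk.get("headingsContext", []))


def to_vector_record(chunk):
    """Build a generic upsert payload (hypothetical shape) from one chunk record."""
    breadcrumb = heading_breadcrumb(chunk)
    # Prepending the headings gives the embedding model the section context.
    if breadcrumb:
        text = f"{breadcrumb}\n\n{chunk['chunkText']}"
    else:
        text = chunk["chunkText"]
    return {
        "id": chunk["chunkId"],
        "text": text,
        "metadata": {
            "url": chunk["canonicalUrl"],
            "title": chunk["title"],
            "chunkIndex": chunk["chunkIndex"],
            "contentHash": chunk["contentHash"],
        },
    }
```

Using `chunkId` as the vector id means a re-run that produces the same id can overwrite the earlier vector rather than duplicate it, assuming the ids are stable across runs, which the listing does not state explicitly.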

βš™οΈ Quick Start

  1. Start URLs / Sitemap URLs: Provide at least one URL. The default input uses https://example.com/ so the Actor produces a small dataset item without setup.
  2. Use Browser Rendering: Toggle on if the page relies heavily on client-side JavaScript (React, Vue, etc.) to render body text.
  3. Max Pages Per Site: Bounded limit (default 1) to keep the prefilled run fast and prevent uncontrolled resource use.
  4. Chunk Size & Overlap: Match this to your LLM's context window guidelines (e.g., size 1000 chars, overlap 150 chars).

Example Input

{
  "startUrls": [{"url": "https://example.com/"}],
  "sitemapUrls": [],
  "maxPagesPerSite": 1,
  "includePatterns": [],
  "excludePatterns": [],
  "crawlDepth": 0,
  "maxCrawlRetries": 1,
  "useBrowserRendering": false,
  "languageDetection": true,
  "chunkText": false,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "outputFormat": "pages",
  "detectChanges": false,
  "storeRawHtml": false,
  "storeCleanText": true
}
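
To show how the default input could be adapted for the blog workflow, here is a small sketch that overrides the defaults and runs a few sanity checks before a run. The `build_input` helper and its validation rules (overlap smaller than chunk size, at least one URL) are assumptions for illustration; the field names and default values are taken from the example input.

```python
import json

# Defaults copied from the example input.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com/"}],
    "sitemapUrls": [],
    "maxPagesPerSite": 1,
    "includePatterns": [],
    "excludePatterns": [],
    "crawlDepth": 0,
    "maxCrawlRetries": 1,
    "useBrowserRendering": False,
    "languageDetection": True,
    "chunkText": False,
    "chunkSize": 1000,
    "chunkOverlap": 150,
    "outputFormat": "pages",
    "detectChanges": False,
    "storeRawHtml": False,
    "storeCleanText": True,
}


def build_input(**overrides):
    """Merge overrides into the defaults and apply basic (assumed) sanity checks."""
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise KeyError(f"unknown input fields: {sorted(unknown)}")
    merged = {**DEFAULT_INPUT, **overrides}
    if not merged["startUrls"] and not merged["sitemapUrls"]:
        raise ValueError("provide at least one start URL or sitemap URL")
    if merged["chunkOverlap"] >= merged["chunkSize"]:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    return merged


# Blog workflow: chunked output with change detection enabled.
blog_input = build_input(
    startUrls=[{"url": "https://example.com/blog/"}],
    includePatterns=["https://example.com/blog/**"],
    chunkText=True,
    detectChanges=True,
    outputFormat="chunks",
)
print(json.dumps(blog_input, indent=2))
```

The printed JSON can be pasted into the Actor's input editor. Note that `maxPagesPerSite` stays at its default of 1 here, so a real crawl would need a higher limit.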
