
URL: https://apify.com/adinfosys-labs/rag-ready-web-scraper-smart-chunker-for-ai-knowledge-bases

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases · Apify


πŸ‘ RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases avatar

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

Pricing

from $0.01 / 1,000 results


RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. It generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases.


Developer

πŸ‘ Artashes Arakelyan

Artashes Arakelyan

Maintained by Community

3 months ago

Last modified


🚀 RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

Collect clean, deduplicated, relevance-filtered web content and export it as RAG-ready chunks optimized for LLM pipelines and vector databases.

The RAG-Ready Web Scraper is an AI-focused crawler that collects, cleans, filters, deduplicates, scores, and chunks web content into knowledge-base datasets for Retrieval-Augmented Generation (RAG) systems.

Stop feeding garbage into your vector database.

🧠 Why This Actor Exists Most web scrapers simply dump raw HTML. That leads to: ❌ Boilerplate (menus, navigation, cookie banners) ❌ Duplicate articles across domains ❌ Thin pages and link-heavy content ❌ Irrelevant pages ❌ Higher embedding costs ❌ Poor retrieval quality ❌ LLM hallucinations This Actor fixes the ingestion layer of your AI pipeline. It doesn’t just scrape. It performs a full AI ingestion workflow: Fetch β†’ Clean β†’ Filter β†’ Deduplicate β†’ Score β†’ Chunk β†’ Export The result is a clean dataset ready for vector databases and LLM retrieval.


🔥 What Makes This Actor Different

Unlike generic scrapers, this Actor is RAG-first.

✅ Boilerplate Removal

Automatically removes:
• navigation menus
• footers
• cookie banners
• UI elements
• scripts and styles

Result: clean semantic text ready for embeddings.
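
The removal step above can be illustrated with a minimal sketch built on Python's standard `html.parser`. This is not the Actor's implementation: the `SKIP_TAGS` list and the names `TextExtractor` and `strip_boilerplate` are assumptions chosen for illustration, and cookie banners or other UI elements that live in ordinary `div` tags would need extra rules.

```python
from html.parser import HTMLParser

# Assumed list of tags whose whole subtree is treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_boilerplate(html: str) -> str:
    """Return the text of `html` with boilerplate subtrees removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling `strip_boilerplate` on a page that wraps one paragraph in a `nav`, a `footer`, and a `script` returns only the paragraph text.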


✅ Noise Filtering

Rejects low-signal pages such as:
• thin pages
• index pages
• link directories
• code dumps
• navigation pages

Your vector database receives only high-signal content.
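
A rough sketch of how such a filter might decide, assuming simple length and link-ratio rules. The `min_chars` and `max_chars` defaults mirror the `minChars` and `maxChars` values in the example input further down; `max_link_ratio`, the reason strings, and the function name are hypothetical and not taken from the Actor.

```python
def noise_reasons(text: str, link_text_chars: int,
                  min_chars: int = 250, max_chars: int = 150_000,
                  max_link_ratio: float = 0.5) -> list[str]:
    """Return the reasons a page would be rejected (empty list = keep).

    link_text_chars is the number of characters that sit inside links;
    max_link_ratio is an assumed cut-off, not a documented setting.
    """
    reasons = []
    n = len(text)
    if n < min_chars:
        reasons.append("too_short")      # thin page
    if n > max_chars:
        reasons.append("too_long")
    if n and link_text_chars / n > max_link_ratio:
        reasons.append("mostly_links")   # index page or link directory
    return reasons
```

Returning a list of reasons rather than a boolean makes it easy to record why a page was dropped.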


✅ Near-Duplicate Suppression

Prevents mirrored or syndicated content from polluting your embeddings.

Uses:
• SimHash-64 fingerprinting
• Hamming distance comparison

Default threshold: Hamming distance ≤ 3

This removes:
• mirrored articles
• syndicated content
• minor text variations
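
For readers unfamiliar with SimHash, the sketch below shows one common way to build a 64-bit fingerprint and compare two of them by Hamming distance. The tokenization (lowercased word tokens), the use of SHA-256 as the per-token hash, and the function names are assumptions; the listing does not say how the Actor tokenizes or hashes.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """64-bit SimHash over lowercased word tokens (assumed tokenization)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # First 8 bytes of SHA-256 serve as a 64-bit token hash.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """True when the fingerprints differ in at most `threshold` bits."""
    return hamming(simhash64(a), simhash64(b)) <= threshold
```

Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, which is why a small threshold such as 3 catches mirrored pages and light edits.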


✅ Topic Relevance Filtering

Keep only content aligned with your target keywords.

Example keywords:
• RAG
• vector database
• embeddings
• machine learning
• AI agents

Irrelevant pages are automatically rejected.


✅ Quality Scoring Engine

Each document receives a normalized quality score (0–1).

Factors include:
• topic keyword density
• text density
• link ratio penalty
• length normalization
• duplicate penalty

Default acceptance threshold: score ≥ 0.55
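
The listing does not publish the scoring formula, so the following is only an illustrative sketch of how several of the listed factors could be combined into a 0–1 score. Every weight, the `target_chars` value, and the function name are assumptions, and the text-density factor is left out for brevity.

```python
def quality_score(text: str, keywords: list[str], link_chars: int = 0,
                  is_duplicate: bool = False, target_chars: int = 2000) -> float:
    """Toy quality score in [0, 1]; all weights here are assumed.

    keywords are expected in lowercase; link_chars is the number of
    characters of `text` that belong to links.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if any(k in w for k in keywords))
    keyword_density = min(1.0, hits / len(words) * 10)  # saturates at 10% matching words
    length_norm = min(1.0, len(text) / target_chars)    # short pages score lower
    link_ratio = min(1.0, link_chars / len(text))

    score = 0.5 * keyword_density + 0.5 * length_norm
    score *= 1.0 - link_ratio                           # link ratio penalty
    if is_duplicate:
        score *= 0.5                                    # duplicate penalty
    return round(max(0.0, min(1.0, score)), 3)
```

A document would then be kept when its score meets the configured minimum, for example `quality_score(text, keywords) >= 0.55`.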


✅ Smart Chunking

Documents are converted into retrieval-optimized chunks.

Chunking strategy:
• paragraph-first segmentation
• merge micro-paragraphs
• configurable chunk size
• optional overlap
• drop tiny fragments

Stable SHA-based chunk IDs give each chunk a deterministic identifier, so the same content maps to the same ID across runs.
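
The strategy above can be sketched as follows. This is a simplified illustration, not the Actor's code: the defaults mirror the `chunkingJson` values in the example input, the ID scheme (truncated SHA-256 of document ID, index, and text) is an assumption, and paragraphs longer than `max_chars` are kept whole rather than split.

```python
import hashlib


def chunk_document(doc_id: str, text: str, max_chars: int = 1100,
                   min_chars: int = 300, overlap_chars: int = 120) -> list[dict]:
    """Paragraph-first chunking with merging, overlap, and stable IDs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # merge small paragraphs into one chunk
            continue
        if current:
            chunks.append(current)
        # Carry the tail of the previous chunk forward as overlap. A chunk
        # may exceed max_chars by up to the overlap in this simple version.
        tail = current[-overlap_chars:] if overlap_chars and current else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)

    result = []
    for chunk_text in chunks:
        if len(chunk_text) < min_chars:
            continue  # drop tiny fragments
        index = len(result)
        digest = hashlib.sha256(f"{doc_id}:{index}:{chunk_text}".encode("utf-8"))
        result.append({
            "chunk_id": digest.hexdigest()[:16],  # assumed ID scheme
            "doc_id": doc_id,
            "chunk_index": index,
            "chunk_text": chunk_text,
        })
    return result
```

Because the ID depends only on the document ID, the chunk position, and the chunk text, re-running the function on unchanged content yields the same IDs.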


✅ Audit Report

Every processed page is recorded with a decision:
• kept
• rejected
• duplicate
• filtered

You can see exactly why pages were accepted or rejected.


⚙️ How the Pipeline Works

Websites / URLs
↓
HTML Extraction
↓
Content Cleaning
↓
Noise Filtering
↓
Duplicate Detection
↓
Quality Scoring
↓
Smart Chunking
↓
RAG-Ready Dataset


🎯 Typical Use Cases πŸ€– AI Knowledge Bases Build datasets for: β€’ RAG chatbots β€’ enterprise knowledge assistants β€’ documentation search


🧠 LLM Training Pipelines Create structured datasets for: β€’ domain-specific AI β€’ internal knowledge ingestion β€’ AI copilots


🔍 Market Intelligence

Collect structured knowledge for:
• competitor monitoring
• research automation
• industry knowledge graphs


🏒 Enterprise Data Pipelines Use for: β€’ internal documentation ingestion β€’ compliance monitoring β€’ research knowledge systems β€’ strategic intelligence


👨‍💻 Who Is This For?

Developers

Perfect for:
• LangChain pipelines
• LlamaIndex ingestion
• Pinecone / Weaviate / Qdrant
• AI agents
• RAG chatbots


AI Startups

Use this Actor for:
• product knowledge ingestion
• market intelligence pipelines
• domain-specific AI systems


Enterprise Teams

Production-grade ingestion for:
• internal knowledge bases
• compliance monitoring
• research automation


📦 Example Input

    {
      "startUrlsText": "https://docs.python.org/3/library/asyncio-task.html\nhttps://docs.python.org/3/library/urllib.parse.html\nhttps://www.iana.org/help/example-domains\nhttps://www.python-httpx.org/quickstart/\nhttps://docs.pydantic.dev/latest/concepts/models/\nhttps://en.wikipedia.org/wiki/Retrieval-augmented_generation\n",
      "maxPages": 20,
      "maxConcurrency": 4,
      "topicKeywordsText": "retrieval\nrag\nvector\nembedding\nchunk\nchunking\nsplit\nsimilarity\ndeduplicate\ndedup\nsimhash\nnoise\nboilerplate\nhttp\ncrawl\nasyncio\npydantic\nurllib\nmodels\nvalidation\nquickstart\nrfc\nurl\nuri\n",
      "noiseControlJson": "{\"enabled\":true,\"minChars\":250,\"maxChars\":150000,\"dropIfMostlyCode\":false,\"dropIfMostlyLinks\":true,\"dropIfNavigationLike\":false,\"requiredKeywordsAny\":[],\"blockedKeywordsAny\":[],\"dedupe\":{\"enabled\":true,\"simhashHammingThreshold\":3},\"quality\":{\"enabled\":true,\"minScore\":0.30}}",
      "chunkingJson": "{\"maxChars\":1100,\"minChars\":300,\"overlapChars\":120}",
      "outputJson": "{\"writeRunReport\":true}",
      "outputCsv": true,
      "outputXlsx": true,
      "outputBaseName": "rag_demo_rag_pack_v3",
      "debug": true
    }
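
Several input fields (`noiseControlJson`, `chunkingJson`, `outputJson`) are JSON documents stored as strings, which is easy to get wrong when written by hand. The sketch below assembles a subset of the same input in Python and lets `json.dumps` handle the quoting. The field names come from the example above; the helper name `build_run_input` is hypothetical, and how the result is submitted to Apify is left out.

```python
import json


def build_run_input(urls: list[str], keywords: list[str], max_pages: int = 20) -> dict:
    """Assemble an Actor input dict; nested settings are serialized to strings."""
    noise_control = {
        "enabled": True,
        "minChars": 250,
        "maxChars": 150000,
        "dropIfMostlyLinks": True,
        "dedupe": {"enabled": True, "simhashHammingThreshold": 3},
        "quality": {"enabled": True, "minScore": 0.30},
    }
    chunking = {"maxChars": 1100, "minChars": 300, "overlapChars": 120}
    return {
        "startUrlsText": "\n".join(urls),          # one URL per line
        "maxPages": max_pages,
        "topicKeywordsText": "\n".join(keywords),  # one keyword per line
        "noiseControlJson": json.dumps(noise_control),
        "chunkingJson": json.dumps(chunking),
        "outputJson": json.dumps({"writeRunReport": True}),
    }


run_input = build_run_input(
    ["https://en.wikipedia.org/wiki/Retrieval-augmented_generation"],
    ["rag", "retrieval", "embedding"],
)
print(json.dumps(run_input, indent=2))
```

Building the nested settings as dictionaries first keeps them readable and guarantees that the embedded strings are valid JSON.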


📤 Output Structure

Clean Documents Dataset

Fields include:
• doc_id
• source_url
• domain
• title
• clean_text
• language
• noise_score
• filtering_reasons
• collected_at


RAG Chunks Dataset

Fields include:
• chunk_id
• doc_id
• chunk_index
• chunk_text
• source_url
• source_title
• token_estimate


Run Audit Report

Optional run report containing:
• URL
• decision (kept / dropped)
• rejection reason
• quality score
• characters before/after cleaning


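🔌 Integration Example (LangChain)

    # Import paths follow the original listing. Recent LangChain releases move
    # these classes into langchain_community / langchain_openai, so adjust the
    # imports to match your installed version.
    from langchain.document_loaders import JSONLoader
    from langchain.vectorstores import Pinecone
    from langchain.embeddings import OpenAIEmbeddings

    # Load one document per chunk. If chunks_rag.json is a JSON array of chunk
    # objects, the jq schema would be ".[].chunk_text" instead.
    loader = JSONLoader(
        file_path="chunks_rag.json",
        jq_schema=".chunk_text",
    )
    docs = loader.load()

    # "rag-chunks" is a placeholder; point index_name at an existing Pinecone index.
    vectorstore = Pinecone.from_documents(
        docs,
        OpenAIEmbeddings(),
        index_name="rag-chunks",
    )

Works with:
• LangChain
• LlamaIndex
• Pinecone
• Weaviate
• Qdrant
• OpenAI embeddings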


📊 Comparison vs Generic Web Scrapers

Feature                Generic Scraper    This Actor
Raw HTML dump          ✅                 ❌
Boilerplate removal    ❌                 ✅
Duplicate detection    ❌                 ✅
Topic filtering        ❌                 ✅
Quality scoring        ❌                 ✅
RAG-ready chunks       ❌                 ✅
Audit report           ❌                 ✅


🧠 Why Noise Control Matters for RAG Garbage input increases: β€’ embedding costs β€’ retrieval latency β€’ irrelevant context β€’ hallucination risk This Actor protects your AI pipeline before embeddings are generated. Better ingestion β†’ better retrieval β†’ better AI answers.


🛡 Enterprise-Ready Design

Designed for production AI systems.

Features:
• deterministic filtering
• configurable thresholds
• structured JSON output
• reproducible chunk IDs
• scalable architecture
• dataset-based processing

Works standalone or as a post-processing layer after another scraper.


❓ FAQ Does it crawl websites? Yes. The Actor can crawl websites directly or process existing scraped datasets. Does it remove duplicate articles? Yes. Near-duplicate detection uses SimHash fingerprinting. Is it RAG compatible? Yes. Output chunks are optimized for embedding pipelines. Can I control chunk size? Yes. Chunk size and overlap are configurable. Can it work after another scraper? Yes. It can act as a post-processing ingestion layer. Is it suitable for enterprise AI systems? Yes. It was designed for production-grade RAG pipelines.

🔗 Related Actors by Adinfosys Labs

You may also find these useful:
• Website Contact Extractor
• Google Maps Lead Generator
• Salesforce AppExchange Intelligence Engine
• Business Directory Intelligence Engine

Together these tools form a complete data-collection and AI intelligence ecosystem.

🚀 Stop Feeding Garbage Into Your LLM

Better data → Better embeddings → Better answers.

Build cleaner AI knowledge pipelines today.
