RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases
Pricing
from $0.01 / 1,000 results
RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. Generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases
Last modified: 3 months ago

Collect clean, deduplicated, relevance-filtered web content and export it as RAG-ready chunks optimized for LLM pipelines and vector databases.

The RAG-Ready Web Scraper is an AI-focused crawler that collects, cleans, filters, deduplicates, scores, and chunks web content into knowledge-base datasets for Retrieval-Augmented Generation (RAG) systems.

Stop feeding garbage into your vector database.

Why This Actor Exists

Most web scrapers simply dump raw HTML. That leads to:
• Boilerplate (menus, navigation, cookie banners)
• Duplicate articles across domains
• Thin pages and link-heavy content
• Irrelevant pages
• Higher embedding costs
• Poor retrieval quality
• LLM hallucinations

This Actor fixes the ingestion layer of your AI pipeline. It doesn't just scrape; it performs a full AI ingestion workflow:

Fetch → Clean → Filter → Deduplicate → Score → Chunk → Export

The result is a clean dataset ready for vector databases and LLM retrieval.

What Makes This Actor Different

Unlike generic scrapers, this Actor is RAG-first.

Boilerplate Removal

Automatically removes:
• navigation menus
• footers
• cookie banners
• UI elements
• scripts and styles

Result: clean semantic text ready for embeddings.
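
As a rough illustration of tag-based boilerplate removal, the sketch below uses only Python's standard `html.parser`. The `SKIP_TAGS` set, the `TextExtractor` class, and `clean_html` are hypothetical names chosen for this example, not the Actor's internals. Cookie banners and other UI elements usually need class- or id-based rules, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "header", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def clean_html(html: str) -> str:
    """Return the text found outside SKIP_TAGS, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling `clean_html` on a page with a `<nav>` menu, a `<script>` block, and a `<footer>` returns only the text that sits outside those elements.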

Noise Filtering

Rejects low-signal pages such as:
• thin pages
• index pages
• link directories
• code dumps
• navigation pages

Your vector database receives only high-signal content.
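
A minimal sketch of how such page-level checks might look. The length limits mirror the `minChars` and `maxChars` values from the example input; the function name `noise_reasons`, the reason labels, and the 0.5 link-ratio cutoff are assumptions made for illustration, since the Actor's actual rules are not documented here.

```python
def noise_reasons(
    text: str,
    link_text_chars: int,
    min_chars: int = 250,
    max_chars: int = 150_000,
    max_link_ratio: float = 0.5,  # hypothetical cutoff, not a documented default
) -> list[str]:
    """Return the reasons a page would be rejected; an empty list means keep."""
    reasons = []
    n = len(text)
    if n < min_chars:
        reasons.append("too_short")
    if n > max_chars:
        reasons.append("too_long")
    # Share of the cleaned text that is anchor text.
    if n > 0 and link_text_chars / n > max_link_ratio:
        reasons.append("mostly_links")
    return reasons
```

Returning a list of reasons rather than a boolean keeps the decision explainable, which is what an audit report needs.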

Near-Duplicate Suppression

Prevents mirrored or syndicated content from polluting your embeddings.

Uses:
• SimHash-64 fingerprinting
• Hamming distance comparison

Default threshold: Hamming distance ≤ 3

This removes:
• mirrored articles
• syndicated content
• minor text variations
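
A minimal sketch of SimHash-64 with a Hamming-distance check, assuming word 3-gram shingles and BLAKE2b as the per-shingle hash. The Actor's actual tokenization and hash function are not documented here, so `simhash64`, `hamming_distance`, and `is_near_duplicate` are illustrative names only.

```python
import hashlib
import re


def simhash64(text: str, shingle_size: int = 3) -> int:
    """Compute a 64-bit SimHash fingerprint over word shingles."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0
    n_shingles = max(1, len(tokens) - shingle_size + 1)
    shingles = [" ".join(tokens[i:i + shingle_size]) for i in range(n_shingles)]

    weights = [0] * 64
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1

    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """True when the fingerprints differ in at most `threshold` bits."""
    return hamming_distance(simhash64(a), simhash64(b)) <= threshold
```

Because similar texts share most shingles, most bit votes agree and the fingerprints end up a few bits apart, which is why a small threshold such as 3 catches mirrored pages with minor edits.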

Topic Relevance Filtering

Keep only content aligned with your target keywords.

Example keywords:
• RAG
• vector database
• embeddings
• machine learning
• AI agents

Irrelevant pages are automatically rejected.

Quality Scoring Engine

Each document receives a normalized quality score (0 to 1).

Factors include:
• topic keyword density
• text density
• link ratio penalty
• length normalization
• duplicate penalty

Default acceptance threshold: score ≥ 0.55
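
To make the idea of a normalized score concrete, here is one way the listed factors could be combined. The weights, saturation points, and the 0.5 duplicate multiplier are invented for illustration; the Actor's real formula is not published in this document, and `quality_score` is a hypothetical name.

```python
import re


def quality_score(
    text: str,
    keywords: list[str],
    raw_html_chars: int,
    link_text_chars: int,
    is_duplicate: bool = False,
    target_chars: int = 2000,  # assumed length at which the length factor saturates
) -> float:
    """Combine the five listed factors into a score in [0, 1] (illustrative weights)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or raw_html_chars <= 0:
        return 0.0

    wanted = {k.lower() for k in keywords}
    hits = sum(1 for t in tokens if t in wanted)
    keyword_density = min(1.0, (hits / len(tokens)) * 20)  # saturates at 5% density
    text_density = min(1.0, len(text) / raw_html_chars)    # clean text vs raw HTML
    link_ratio = min(1.0, link_text_chars / len(text))     # share of anchor text
    length_norm = min(1.0, len(text) / target_chars)

    score = (
        0.35 * keyword_density
        + 0.25 * text_density
        + 0.25 * length_norm
        + 0.15 * (1.0 - link_ratio)  # link ratio acts as a penalty
    )
    if is_duplicate:
        score *= 0.5  # duplicate penalty
    return round(max(0.0, min(1.0, score)), 4)
```

A document would then be kept when `quality_score(...) >= 0.55`, or whatever `minScore` the run configures.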

Smart Chunking

Documents are converted into retrieval-optimized chunks.

Chunking strategy:
• paragraph-first segmentation
• merge micro-paragraphs
• configurable chunk size
• optional overlap
• drop tiny fragments

Stable SHA-based chunk IDs ensure deterministic embeddings.
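
The strategy above can be sketched roughly as follows. This is an approximation under stated assumptions: paragraphs are split on blank lines, oversized paragraphs are hard-split by character count, and IDs are a truncated SHA-256 of the document ID, chunk index, and text. The source only says the IDs are SHA-based, so the exact hash and inputs are guesses, and `chunk_document` is a hypothetical name. The defaults mirror the `chunkingJson` values from the example input.

```python
import hashlib
import re


def chunk_document(doc_id, text, max_chars=1100, min_chars=300, overlap_chars=120):
    """Split text into paragraph-aligned chunks with deterministic IDs."""
    budget = max_chars - overlap_chars - 2  # room for the overlap tail and separator
    if budget <= 0:
        raise ValueError("overlap_chars must be smaller than max_chars - 2")

    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split paragraphs that cannot fit in one chunk (may cut mid-word).
        for start in range(0, len(para), budget):
            piece = para[start:start + budget]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # merge small paragraphs into one chunk
            else:
                chunks.append(current)
                tail = current[-overlap_chars:] if overlap_chars else ""
                current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)

    kept = [c for c in chunks if len(c) >= min_chars]  # drop tiny fragments
    result = []
    for index, chunk in enumerate(kept):
        raw = f"{doc_id}:{index}:{chunk}".encode("utf-8")
        result.append({
            "chunk_id": hashlib.sha256(raw).hexdigest()[:16],
            "doc_id": doc_id,
            "chunk_index": index,
            "chunk_text": chunk,
        })
    return result
```

Because the ID depends only on the document ID, the chunk index, and the chunk text, re-running the same input yields the same IDs, so unchanged chunks can be upserted into a vector store without creating duplicates.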

Audit Report

Every processed page is recorded with a decision:
• kept
• rejected
• duplicate
• filtered

You can see exactly why pages were accepted or rejected.

How the Pipeline Works

Websites / URLs → HTML Extraction → Content Cleaning → Noise Filtering → Duplicate Detection → Quality Scoring → Smart Chunking → RAG-Ready Dataset

Typical Use Cases

AI Knowledge Bases

Build datasets for:
• RAG chatbots
• enterprise knowledge assistants
• documentation search

LLM Training Pipelines

Create structured datasets for:
• domain-specific AI
• internal knowledge ingestion
• AI copilots

Market Intelligence

Collect structured knowledge for:
• competitor monitoring
• research automation
• industry knowledge graphs

Enterprise Data Pipelines

Use for:
• internal documentation ingestion
• compliance monitoring
• research knowledge systems
• strategic intelligence

Who Is This For?

Developers

Perfect for:
• LangChain pipelines
• LlamaIndex ingestion
• Pinecone / Weaviate / Qdrant
• AI agents
• RAG chatbots

AI Startups

Use this Actor for:
• product knowledge ingestion
• market intelligence pipelines
• domain-specific AI systems

Enterprise Teams

Production-grade ingestion for:
• internal knowledge bases
• compliance monitoring
• research automation

Example Input

The fields ending in "Json" are strings that contain JSON, so their inner quotes must be escaped. This example lowers quality.minScore to 0.30, below the default acceptance threshold of 0.55.

    {
      "startUrlsText": "https://docs.python.org/3/library/asyncio-task.html\nhttps://docs.python.org/3/library/urllib.parse.html\nhttps://www.iana.org/help/example-domains\nhttps://www.python-httpx.org/quickstart/\nhttps://docs.pydantic.dev/latest/concepts/models/\nhttps://en.wikipedia.org/wiki/Retrieval-augmented_generation\n",
      "maxPages": 20,
      "maxConcurrency": 4,
      "topicKeywordsText": "retrieval\nrag\nvector\nembedding\nchunk\nchunking\nsplit\nsimilarity\ndeduplicate\ndedup\nsimhash\nnoise\nboilerplate\nhttp\ncrawl\nasyncio\npydantic\nurllib\nmodels\nvalidation\nquickstart\nrfc\nurl\nuri\n",
      "noiseControlJson": "{\"enabled\":true,\"minChars\":250,\"maxChars\":150000,\"dropIfMostlyCode\":false,\"dropIfMostlyLinks\":true,\"dropIfNavigationLike\":false,\"requiredKeywordsAny\":[],\"blockedKeywordsAny\":[],\"dedupe\":{\"enabled\":true,\"simhashHammingThreshold\":3},\"quality\":{\"enabled\":true,\"minScore\":0.30}}",
      "chunkingJson": "{\"maxChars\":1100,\"minChars\":300,\"overlapChars\":120}",
      "outputJson": "{\"writeRunReport\":true}",
      "outputCsv": true,
      "outputXlsx": true,
      "outputBaseName": "rag_demo_rag_pack_v3",
      "debug": true
    }
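
Hand-escaping JSON inside JSON is error-prone. One way around it, sketched below, is to build the nested settings as ordinary Python dictionaries and let `json.dumps` serialize them into the string fields. The field names come from the example input; the variable names, the shortened URL and keyword lists, and the reduced set of noise-control keys are choices made for this illustration.

```python
import json

# Nested settings written as plain dictionaries (a subset of the example input).
noise_control = {
    "enabled": True,
    "minChars": 250,
    "maxChars": 150000,
    "dropIfMostlyLinks": True,
    "dedupe": {"enabled": True, "simhashHammingThreshold": 3},
    "quality": {"enabled": True, "minScore": 0.30},
}
chunking = {"maxChars": 1100, "minChars": 300, "overlapChars": 120}

urls = [
    "https://docs.python.org/3/library/asyncio-task.html",
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
]
keywords = ["retrieval", "rag", "vector", "embedding"]

actor_input = {
    "startUrlsText": "\n".join(urls),
    "maxPages": 20,
    "maxConcurrency": 4,
    "topicKeywordsText": "\n".join(keywords),
    # json.dumps produces the inner JSON strings; the outer dumps escapes them.
    "noiseControlJson": json.dumps(noise_control),
    "chunkingJson": json.dumps(chunking),
    "outputJson": json.dumps({"writeRunReport": True}),
}

payload = json.dumps(actor_input, indent=2)
print(payload)
```

The printed payload is valid JSON that can be pasted into the Actor's input editor or sent through whichever client you use to start runs.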

Output Structure

Clean Documents Dataset

Fields include:
• doc_id
• source_url
• domain
• title
• clean_text
• language
• noise_score
• filtering_reasons
• collected_at

RAG Chunks Dataset

Fields include:
• chunk_id
• doc_id
• chunk_index
• chunk_text
• source_url
• source_title
• token_estimate

Run Audit Report

Optional run report containing:
• URL
• decision (kept / dropped)
• rejection reason
• quality score
• characters before/after cleaning
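
Integration Example (LangChain)

    # Import paths match older LangChain releases; newer releases moved these
    # classes to langchain_community, langchain_openai, and langchain_pinecone.
    from langchain.document_loaders import JSONLoader  # needs the `jq` package
    from langchain.vectorstores import Pinecone
    from langchain.embeddings import OpenAIEmbeddings

    # Assumes chunks_rag.json is a JSON array of chunk objects; if it holds a
    # single object, use jq_schema=".chunk_text" instead.
    loader = JSONLoader(
        file_path="chunks_rag.json",
        jq_schema=".[].chunk_text",
    )
    docs = loader.load()

    # index_name is a placeholder for an existing Pinecone index.
    vectorstore = Pinecone.from_documents(
        docs,
        OpenAIEmbeddings(),
        index_name="rag-chunks",
    )

Works with:
• LangChain
• LlamaIndex
• Pinecone
• Weaviate
• Qdrant
• OpenAI embeddings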

Comparison vs Generic Web Scrapers

Feature               Generic Scraper   This Actor
Raw HTML dump         Yes               No
Boilerplate removal   No                Yes
Duplicate detection   No                Yes
Topic filtering       No                Yes
Quality scoring       No                Yes
RAG-ready chunks      No                Yes
Audit report          No                Yes

Why Noise Control Matters for RAG

Garbage input increases:
• embedding costs
• retrieval latency
• irrelevant context
• hallucination risk

This Actor protects your AI pipeline before embeddings are generated.

Better ingestion → better retrieval → better AI answers.

Enterprise-Ready Design

Designed for production AI systems.

Features:
• deterministic filtering
• configurable thresholds
• structured JSON output
• reproducible chunk IDs
• scalable architecture
• dataset-based processing

Works standalone or as a post-processing layer after another scraper.

FAQ

Does it crawl websites?
Yes. The Actor can crawl websites directly or process existing scraped datasets.

Does it remove duplicate articles?
Yes. Near-duplicate detection uses SimHash fingerprinting.

Is it RAG compatible?
Yes. Output chunks are optimized for embedding pipelines.

Can I control chunk size?
Yes. Chunk size and overlap are configurable.

Can it work after another scraper?
Yes. It can act as a post-processing ingestion layer.

Is it suitable for enterprise AI systems?
Yes. It was designed for production-grade RAG pipelines.

Related Actors by Adinfosys Labs

You may also find these useful:
• Website Contact Extractor
• Google Maps Lead Generator
• Salesforce AppExchange Intelligence Engine
• Business Directory Intelligence Engine

Together these tools form a complete data-collection and AI intelligence ecosystem.

Stop Feeding Garbage Into Your LLM

Better data → Better embeddings → Better answers.

Build cleaner AI knowledge pipelines today.
