RAG Doctor: Audit & Repair Your AI Knowledge Base

Pricing

Pay per usage

Audit and repair the content you feed your AI. Finds contradictions, stale facts, duplicates, dead links, and broken chunks that quietly poison RAG, agents, and custom GPTs. Returns a scored report, a prioritized fix list, and a cleaned, ready-to-index knowledge base.

Rating

0.0

(0)

Developer

Sanya Kumari

Maintained by Community

Last modified

12 days ago

RAG Doctor — Knowledge Base Health Check & Repair for AI

Your AI is only as good as the content you feed it. RAG Doctor audits that content the way a linter audits code: it finds the contradictions, stale facts, duplicates, and broken chunks that quietly poison RAG pipelines, agents, and custom GPTs, then hands you a prioritized fix list and an optional cleaned-up version.

Most tools build a knowledge base for you. RAG Doctor fixes the one you already have.

Why this exists

Garbage in, confident garbage out. When two pages disagree, a RAG system retrieves one at random and the model states it as fact. When a chunk reads "as shown above," retrieval pulls it alone and the model fills the gap by guessing. These defects are invisible until a user gets a wrong answer. RAG Doctor surfaces them before your users do.

What it checks

Check	What it catches	Needs API key
Contradictions	Two pages stating facts that can't both be true (the #1 silent RAG killer)	Yes
Stale facts	Pages whose newest referenced date is past your freshness threshold	No
Duplicates	Near-identical pages that crowd out distinct facts at retrieval time	No
Chunk health	Chunks that lose meaning when retrieved alone (dangling references, orphan pronouns, too short)	No
Dead links	Cited URLs that 404 or time out	No
AI extractability	robots.txt blocking AI crawlers, missing sitemap, JavaScript-only content	No
Coverage gaps	Real user questions the knowledge base cannot answer	Yes
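
To make the chunk health row concrete, here is a minimal sketch of how such a check could work. It is not RAG Doctor's actual implementation: the regular expressions, the `MIN_WORDS` cut-off, and the issue labels are all assumptions chosen for illustration.

```python
import re

# Hypothetical heuristics; the actor's real rules are not documented here.
DANGLING_REFERENCE = re.compile(
    r"\b(?:as (?:shown|described|mentioned|noted) (?:above|below|earlier)"
    r"|see (?:above|below))\b",
    re.IGNORECASE,
)
ORPHAN_PRONOUN_START = re.compile(
    r"^\s*(?:it|this|that|these|those|they)\b", re.IGNORECASE
)
MIN_WORDS = 20  # assumed cut-off for "too short"


def chunk_issues(chunk: str) -> list:
    """Return reasons a chunk may lose meaning when retrieved on its own."""
    issues = []
    if len(chunk.split()) < MIN_WORDS:
        issues.append("too_short")
    if DANGLING_REFERENCE.search(chunk):
        issues.append("dangling_reference")
    if ORPHAN_PRONOUN_START.match(chunk):
        issues.append("orphan_pronoun")
    return issues
```

A chunk such as "As shown above, set the timeout to 30 seconds." would be flagged by this sketch both as too short and as containing a dangling reference, because the text it points to is not part of the chunk.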

Input

Point it at a site to crawl, or hand it a dataset you already extracted.

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "maxCrawlDepth": 2,
  "mode": "audit",
  "checks": ["staleness", "duplicates", "chunkHealth", "deadLinks", "extractability", "contradictions"],
  "stalenessThresholdDays": 540,
  "similarityThreshold": 0.85,
  "userQuestions": ["How do I rotate my API token?"],
  "anthropicApiKey": "sk-ant-...",
  "llmModel": "claude-haiku-4-5-20251001"
}
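
The two thresholds in the input above can be read as follows. This is a sketch under stated assumptions, not the actor's real logic: it assumes staleness is judged from ISO-formatted dates found in the page text, and that similarity is Jaccard overlap of word 3-grams. The function names `is_stale`, `shingles`, and `is_near_duplicate` are hypothetical.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")  # ISO dates only, for brevity


def is_stale(text: str, today: date, threshold_days: int = 540) -> bool:
    """True when the newest ISO date in the text is older than the threshold."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2024-13-45
    if not found:
        return False  # no dates mentioned, so nothing to judge staleness by
    return (today - max(found)).days > threshold_days


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed similarity measure: Jaccard overlap of word 3-grams."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold
```

Under this reading, a page whose newest date is 2022-01-10 counts as stale on 2024-06-01 with the 540-day threshold, while a page dated 2024-03-01 does not.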

Crawling and link checks run over the Apify datacenter proxy automatically; there is no proxy option to configure.

Audit content you already crawled (composes with apify/website-content-crawler):

{"datasetId":"YOUR_DATASET_ID","mode":"both"}

The LLM-backed checks (contradictions, coverage gaps) need an Anthropic API key. Without it, those two checks are skipped and every other check still runs.
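
The skip rule can be sketched as a small filter. The check names come from the example input, except `coverageGaps`, which is a guessed identifier for the coverage-gaps check and may differ in the real actor. The function `runnable_checks` is hypothetical.

```python
from typing import Optional

# "coverageGaps" is an assumed identifier; it does not appear in the example input.
LLM_CHECKS = {"contradictions", "coverageGaps"}


def runnable_checks(requested: list, anthropic_api_key: Optional[str]) -> list:
    """Keep the requested order; drop LLM-backed checks when no key is supplied."""
    if anthropic_api_key:
        return list(requested)
    return [check for check in requested if check not in LLM_CHECKS]
```

Without a key, a request for staleness, contradictions, and deadLinks would reduce to staleness and deadLinks in this sketch.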

Output

Dataset — one row per finding (severity, check, issue, detail, suggested fix, URL). Sorted most-severe first.
Key-value store
- REPORT — a shareable HTML report with the AI-readiness score and full fix list.
- SUMMARY / OUTPUT — the score, grade, and severity counts as JSON.
repaired-knowledge-base dataset (repair / both modes) — duplicates collapsed, thin pages dropped, stale pages flagged, content pre-chunked and ready for a vector DB or llms.txt.

The AI-readiness score (0-100) reflects defect density rather than raw defect count, so a large knowledge base isn't penalized just for having more pages.
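
The actual scoring formula is not published here. The sketch below shows one way a density-based score could behave; the severity weights and the `PENALTY_PER_UNIT` scale factor are invented for illustration.

```python
# Hypothetical severity weights and scale; not the actor's real values.
SEVERITY_WEIGHTS = {"high": 5, "medium": 2, "low": 1}
PENALTY_PER_UNIT = 20  # points lost per weighted defect per page


def readiness_score(severity_counts: dict, page_count: int) -> int:
    """Score 0-100 from weighted defects per page, not from the raw total."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    weighted = sum(
        SEVERITY_WEIGHTS[severity] * count
        for severity, count in severity_counts.items()
    )
    density = weighted / page_count
    return max(0, round(100 - PENALTY_PER_UNIT * density))
```

With these assumed numbers, 2 high-severity findings across 10 pages and 20 high-severity findings across 100 pages have the same density and therefore the same score, which is the property the paragraph above describes.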

Modes

audit — report and fix list only.
repair — also emit the cleaned corpus.
both — everything.

Local development

npm install
npm run build
apify run # or: npm start

Roadmap

Expose as an MCP server tool (audit_knowledge_base) so an agent can call it mid-workflow before answering.
Embedding-based duplicate and contradiction candidate selection for higher recall.
Incremental re-audits that only re-check what changed.

URL: https://apify.com/sanya_kumari/rag-doctor