URL: https://apify.com/actums/universal-rag-scraper

Universal Knowledge Base Scraper (RAG Ready) [DEPRECATED] · Apify


Universal Knowledge Base Scraper (RAG Ready)

Deprecated

Pricing

$49.00/month + usage

Turn any Help Center into LLM-ready Markdown. Supports Zendesk, Intercom, Docusaurus, and generic sites. Perfect for RAG and AI Agents.

Rating

0.0

(0)

Developer

Actums

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

5 months ago

Last modified

🧠 Universal Knowledge Base Scraper (RAG Ready)

Feed your AI Agents with clean, structured Markdown. Stop feeding them HTML garbage.


🚀 What is Universal RAG Scraper?

Universal RAG Scraper is an "ETL-in-a-Box" for AI Developers. It turns messy Help Centers (Zendesk, Intercom, Docusaurus, Notion) into pure, train-ready Markdown (.md) files.

If you are building RAG Pipelines (Retrieval-Augmented Generation) or AI Agents, you know that HTML noise (navbars, footers, cookie banners) ruins your vector embeddings. This Actor solves that problem instantly.

Why not just use a generic scraper?

Generic scrapers give you the page. We give you the content.

  • Auto-Detect: We identify the platform (e.g., Zendesk) and apply surgical clean-up rules.
  • Markdown Native: We don't just "strip tags"; we convert tables, lists, and code blocks into perfect Markdown.
  • Metadata Rich: We extract the Title, URL, and Last Updated Date for your Vector DB.

⚡ Enterprise-Grade Features

Built for scale and reliability:

  1. 🛡️ Zero-Config Proxies: Scrape protected Help Centers without being blocked by 403 responses. Request rotation is built in.
  2. ⏰ Auto-Sync Scheduling: Set it to run every Friday night. Keep your RAG Knowledge Base in sync with your product docs automatically.
  3. 💾 Infinite Storage: Scrape 10,000 pages or 10 million. All data is stored, indexed, and ready for export (JSON, CSV, Excel).
  4. 🔌 Native Integrations: Pipe the Markdown directly to Pinecone, LangChain, or Zapier. No glue code needed.

🎯 Supported Platforms (Auto-Detected)

Platform | Capability
Zendesk | Full support. Strips "Related Articles" & sidebars.
Intercom | Full support. Handles dynamic loading.
Docusaurus | Perfect for V2/V3 docs. Preserves code block languages.
Notion | Scrapes public Notion Knowledge Bases.
Generic | Smart Fallback: If we don't recognize the platform, we use advanced readability algorithms to extract the main content.
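
The listing does not say how auto-detection works. As a rough illustration only, platform detection with a generic fallback could look like the sketch below. The function name `detect_platform` and every marker it checks (the `/hc/` path, the `intercom.help` and `notion.site` hosts, the Docusaurus generator meta tag) are assumptions chosen for the example, not the Actor's actual rules.

```python
from urllib.parse import urlparse


def detect_platform(url: str, html: str) -> str:
    """Guess the help-center platform from the URL and raw HTML.

    All markers here are illustrative assumptions, not the Actor's real rules.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    page = html.lower()

    # Assumption: Docusaurus sites carry a "generator" meta tag naming Docusaurus.
    if 'name="generator"' in page and "docusaurus" in page:
        return "Docusaurus"
    # Assumption: Zendesk Help Center URLs live under /hc/ or on zendesk.com.
    if host.endswith("zendesk.com") or path.startswith("/hc/"):
        return "Zendesk"
    # Assumption: hosted Intercom help centers use the intercom.help domain.
    if host.endswith("intercom.help"):
        return "Intercom"
    # Assumption: public Notion pages are served from notion.site or notion.so.
    if host.endswith("notion.site") or host.endswith("notion.so"):
        return "Notion"
    # Nothing matched: fall back to generic main-content extraction.
    return "Generic"


print(detect_platform("https://support.zoom.us/hc/en-us", "<html></html>"))
```

Checking specific platforms first and returning "Generic" last mirrors the fallback row of the table: an unrecognized site still gets processed, just without platform-specific clean-up.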

📚 How to scrape a Knowledge Base in 3 steps

  1. Paste the URL: Go to the input tab and enter the URL of the Help Center home page (e.g., https://support.zoom.us/hc/en-us).
  2. Set Depth: Choose how many links to follow (default: 2 levels deep).
  3. Run: Click "Start". In minutes, you can download a JSON file containing all articles in Markdown.
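
To make the depth setting in step 2 concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over an in-memory link graph. The `SITE` graph and the `crawl_order` function are hypothetical, and it assumes the start page counts as depth 0, which the listing does not state.

```python
from collections import deque


def crawl_order(start, links, max_depth=2):
    """Return pages reachable from `start` within `max_depth` link hops (BFS)."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for nxt in links.get(page, []):
            if nxt not in seen:  # skip pages already queued or visited
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical link graph standing in for a real Help Center.
SITE = {
    "/": ["/billing", "/setup"],
    "/billing": ["/billing/refunds"],
    "/billing/refunds": ["/billing/refunds/eu"],
    "/setup": ["/"],
}

print(crawl_order("/", SITE, max_depth=2))
```

With a depth of 2, `/billing/refunds/eu` is not visited because it sits three hops from the start page. Raising the depth widens coverage at the cost of more requests.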

💰 Pricing & Usage

This is a Rental Actor.

  • Free Trial: You can test the scraper for a limited time to verify the Markdown quality.
  • Rental Plan: Access unlimited scale, high-frequency scheduling, and priority support.

Cost Estimation:

  • Scraping a typical Help Center (500 pages) takes ~5-10 minutes.
  • The output is "Vector Ready" - no post-processing costs.
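
For a rough sense of run time on larger sites, the 500-page figure above can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes throughput stays constant as the page count grows, which the listing does not claim, and `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(pages: int) -> tuple[float, float]:
    """Scale the quoted 5-10 minutes per 500 pages linearly to `pages`."""
    # Assumption: run time grows in proportion to the number of pages.
    low = pages * 5 / 500
    high = pages * 10 / 500
    return low, high


low, high = estimate_minutes(2000)
print(f"2000 pages: roughly {low:.0f}-{high:.0f} minutes")
```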

📤 Input & Output

Input Configuration

Simple, developer-friendly input:

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 10,
  "outputFormat": "markdown"
}
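
If you build this input programmatically, a small helper can catch mistakes before a run starts. The sketch below is an assumption-laden example: `build_run_input` is a hypothetical helper, the default depth of 2 is taken from the steps above, and the validation rules are illustrative rather than the Actor's own schema.

```python
import json


def build_run_input(start_url: str, max_depth: int = 2,
                    output_format: str = "markdown") -> dict:
    """Build the Actor input shown above, with basic sanity checks."""
    if not start_url.startswith(("http://", "https://")):
        raise ValueError(f"start_url must be an absolute URL: {start_url!r}")
    if max_depth < 0:
        raise ValueError("max_depth must be zero or greater")
    return {
        "startUrls": [{"url": start_url}],
        "maxDepth": max_depth,
        "outputFormat": output_format,
    }


run_input = build_run_input("https://docs.apify.com", max_depth=10)
print(json.dumps(run_input, indent=2))
```

With the Apify Python client, a dictionary like this is normally passed as `run_input` to `ApifyClient(token).actor(actor_id).call(...)`. Since this Actor is marked deprecated, treat that call as a general pattern, not something guaranteed to work here.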

Output (JSON/Dataset)

Each item in the dataset is one article:

{
  "url": "https://docs.apify.com/academy/web-scraping",
  "title": "Web Scraping Academy",
  "platform": "Docusaurus",
  "scrapedAt": "2023-10-27T10:00:00Z",
  "markdown": "# Web Scraping Academy\n\nLearn how to scrape..."
}
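
Downstream code is easier to maintain when each dataset item is loaded into a typed record. The following sketch assumes items have exactly the five fields shown above. The `Article` class and its `from_item` method are hypothetical names, not part of any Apify library.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    url: str
    title: str
    platform: str
    scraped_at: datetime
    markdown: str

    @classmethod
    def from_item(cls, item: dict) -> "Article":
        """Convert one dataset item (a dict) into an Article."""
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        timestamp = item["scrapedAt"].replace("Z", "+00:00")
        return cls(
            url=item["url"],
            title=item["title"],
            platform=item["platform"],
            scraped_at=datetime.fromisoformat(timestamp),
            markdown=item["markdown"],
        )


item = {
    "url": "https://docs.apify.com/academy/web-scraping",
    "title": "Web Scraping Academy",
    "platform": "Docusaurus",
    "scrapedAt": "2023-10-27T10:00:00Z",
    "markdown": "# Web Scraping Academy\n\nLearn how to scrape...",
}
article = Article.from_item(item)
print(article.title, article.scraped_at.isoformat())
```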

❓ FAQ

Can I scrape a custom-built Help Center?

Yes. The Actor uses a "Smart Fallback" (Readability algorithm). If it doesn't detect Zendesk/Intercom, it will still scan the page, identify the visual "main content" area, and extract it.

Does this handle dynamic JavaScript sites?

Yes. We use Playwright (headless browser) under the hood. We render the full page, execute JavaScript, and then scrape. This works even on React/Vue/Angular apps.

How do I feed this into my LLM?

  1. Run the Actor.
  2. Download the JSON output.
  3. Use the markdown field as the content in your LLM Prompt or Embedding request.
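
For embedding requests, long articles usually need to be split before they are sent. Here is a minimal sketch that splits each article's markdown field at headings and attaches the URL and title as metadata. The record layout and the names `split_by_heading` and `to_records` are assumptions for illustration, and the sample item is made up.

```python
import re

# Zero-width match at the start of any line that begins with 1-6 '#' and a space.
HEADING = re.compile(r"^(?=#{1,6} )", flags=re.MULTILINE)


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into sections, one per heading.

    Limitation: a '# comment' line inside a code block is treated as a heading.
    """
    return [part.strip() for part in HEADING.split(markdown) if part.strip()]


def to_records(items: list[dict]) -> list[dict]:
    """Turn dataset items into one record per section, ready for embedding."""
    records = []
    for item in items:
        for index, section in enumerate(split_by_heading(item["markdown"])):
            records.append({
                "id": f'{item["url"]}#{index}',  # hypothetical ID scheme
                "text": section,
                "title": item["title"],
                "url": item["url"],
            })
    return records


# Hypothetical item in the output shape shown earlier.
items = [{
    "url": "https://example.com/a",
    "title": "Guide",
    "markdown": "# Guide\n\nIntro\n\n## Install\n\nRun it.",
}]
records = to_records(items)
print(len(records), records[1]["id"])
```

Splitting at headings keeps each chunk topically coherent. Very long sections may still exceed an embedding model's input limit and would need a further size-based split.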

📞 Support & Feedback

Found a site we can't scrape? Missing a platform?

  • Report a Bug: Use the "Issues" tab.
  • Request a Feature: We add new platforms (e.g., GitBook, Read the Docs) based on user votes!

You might also like

Universal News Article Intelligence Agent

workhard3000/news-intelligence-rag-extractor

High-fidelity news normalization for AI & Agentic RAG. Extract clean Markdown, full-text, and metadata from premium domains (Bloomberg, Wall Street Journal, Financial Times, New York Times, Washington Post, etc.). Success-only billing, only pay when full-text is verified.

47

5.0

(11)

RAG Data Ingestion: Website to AI Knowledge Base

0xysn/rag-data-ingestion-website-to-ai-knowledge-base

Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.

Rag Architect

ai_solutionist/rag-architect

Transform any website into vector-store-ready knowledge chunks for Pinecone, Weaviate, LangChain, LlamaIndex, Supabase, n8n & more. AI-generated Q&A pairs, smart chunking, PII scrubbing. Build hallucination-free RAG chatbots in minutes.

Jason Pellerin

2

Quick Website Content Scraper ( Extract Text for RAG & LLMs )

automateitplease/ai-web-content-scraper-extract-text-for-rag-llms

Extract clean text from any website for AI/LLM applications. Supports both static and JavaScript-rendered sites (React, Vue, Angular). Perfect for RAG systems, chatbot training, and content analysis.

AutomateItPlease Workflow And Automaton Ops

49

Website Content Crawler Pro

datascoutapi/website-content-crawler-pro

Crawl websites and extract clean, structured content in Markdown, JSON, or plain text for AI models, LLMs, vector DBs, or RAG pipelines. Fast, reliable, and stealthy, with bulk processing, advanced metadata extraction, and seamless integration with LangChain, LlamaIndex, and AI workflows.

546

2.1

(3)

Universal Markdown Scraper for LLMs

botflowtech/universal-markdown-scraper-for-llms

Universal Markdown Scraper for LLMs

Meta Ad Library Scraper: Facebook & Instagram Ads

scrapeify/meta-ad-library-scraper

Extract Facebook and Meta Ad Library transparency data: search by keyword, Page ID, or full URL. Sort by total_impressions or most_recent. Returns structured creatives, spend/impression estimates, timing, distribution. Supports 100+ languages. No cookies. Built for competitive intel.

52

5.0

(1)