Universal Knowledge Base Scraper (RAG Ready)

Deprecated

Pricing

$49.00/month + usage

Turn any Help Center into LLM-ready Markdown. Supports Zendesk, Intercom, Docusaurus, and generic sites. Perfect for RAG and AI Agents.

Rating

0.0

(0)

Developer

Actums

Maintained by Community

Actor stats

Last modified: 5 months ago

🧠 Universal Knowledge Base Scraper (RAG Ready)

Feed your AI Agents with clean, structured Markdown. Stop feeding them HTML garbage.

🚀 What is Universal RAG Scraper?

Universal RAG Scraper is an "ETL-in-a-Box" for AI Developers. It turns messy Help Centers (Zendesk, Intercom, Docusaurus, Notion) into pure, train-ready Markdown (.md) files.

If you are building RAG Pipelines (Retrieval-Augmented Generation) or AI Agents, you know that HTML noise (navbars, footers, cookie banners) ruins your vector embeddings. This Actor solves that problem instantly.

Why not just use a generic scraper?

Generic scrapers give you the page. We give you the content.

Auto-Detect: We identify the platform (e.g., Zendesk) and apply surgical clean-up rules.
Markdown Native: We don't just "strip tags"; we convert tables, lists, and code blocks into perfect Markdown.
Metadata Rich: We extract the Title, URL, and Last Updated Date for your Vector DB.
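
As a rough illustration of the auto-detect idea, the sketch below guesses a platform from the page URL and HTML. The markers are illustrative assumptions, not the Actor's actual rule set: the `/hc/` path comes from the Zendesk example URL in this README, the Docusaurus check relies on its generator meta tag, and the Intercom and Notion checks are simple guesses. The function name `detect_platform` is hypothetical.

```python
import re
from urllib.parse import urlparse


def detect_platform(url: str, html: str) -> str:
    """Return a platform label for a page, or "Generic" if nothing matches.

    All rules here are illustrative assumptions, not the Actor's real logic.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path

    # Docusaurus sites emit a <meta name="generator" content="Docusaurus ..."> tag.
    if re.search(r'<meta[^>]+name="generator"[^>]+content="Docusaurus', html, re.I):
        return "Docusaurus"
    # Zendesk Help Center URLs live under /hc/ (e.g. /hc/en-us).
    if path.startswith("/hc/") or host.endswith(".zendesk.com"):
        return "Zendesk"
    # Assumption: Intercom-hosted help centers reference "intercom" in their markup.
    if "intercom" in html.lower():
        return "Intercom"
    # Public Notion pages are commonly served from *.notion.site.
    if host.endswith("notion.site"):
        return "Notion"
    # Nothing matched: hand over to the generic fallback extractor.
    return "Generic"
```

Checking the most specific signal first (the generator tag) and falling through to "Generic" keeps the fallback path explicit.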

⚡ Enterprise-Grade Features

Built for scale and reliability:

🛡️ Zero-Config Proxies: Scrape protected Help Centers without running into HTTP 403 (Forbidden) blocks. Request rotation is built in.
⏰ Auto-Sync Scheduling: Set it to run every Friday night. Keep your RAG Knowledge Base in sync with your product docs automatically.
💾 Infinite Storage: Scrape 10,000 pages or 10 million. All data is stored, indexed, and ready for export (JSON, CSV, Excel).
🔌 Native Integrations: Pipe the Markdown directly to Pinecone, LangChain, or Zapier. No glue code needed.

🎯 Supported Platforms (Auto-Detected)

Platform	Capability
Zendesk	Full support. Strips "Related Articles" & sidebars.
Intercom	Full support. Handles dynamic loading.
Docusaurus	Perfect for V2/V3 docs. Preserves code block languages.
Notion	Scrapes public Notion Knowledge Bases.
Generic	Smart Fallback: If we don't recognize the platform, we use advanced readability algorithms to extract the main content.

📚 How to scrape a Knowledge Base in 3 steps

1. Paste the URL: Go to the input tab and enter the URL of the Help Center home page (e.g., https://support.zoom.us/hc/en-us).
2. Set Depth: Choose how many links to follow (default: 2 levels deep).
3. Run: Click "Start". In minutes, you can download a JSON file containing all articles in Markdown.
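
To make the depth setting concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It assumes the start page counts as depth 0 and that "2 levels deep" means links up to two hops away are visited; how the Actor itself counts depth is not documented here. The `crawl` function and the toy `site` graph are hypothetical stand-ins for real page fetches.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_depth: int = 2) -> list[str]:
    """Breadth-first crawl that visits pages at most max_depth hops from start_url."""
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real HTTP fetches (hypothetical data).
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": [],
}
pages = crawl("/", lambda u: site.get(u, []), max_depth=2)
print(pages)
```

With `max_depth=2`, the page `/a/1/x` is never reached because it sits three hops from the start page.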

💰 Pricing & Usage

This is a Rental Actor.

Free Trial: You can test the scraper for a limited time to verify the Markdown quality.
Rental Plan: Access unlimited scale, high-frequency scheduling, and priority support.

Cost Estimation:

Scraping a typical Help Center (500 pages) takes ~5-10 minutes.
The output is "Vector Ready" - no post-processing costs.

📤 Input & Output

Input Configuration

Simple, developer-friendly input:

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 10,
  "outputFormat": "markdown"
}
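
For programmatic use, the input above could be sent through the Apify API. The sketch below only builds the HTTP request with the standard library and does not send it. It assumes the Actor ID `actums~universal-rag-scraper` (derived from the store URL) and the synchronous run endpoint that returns dataset items; since the Actor is marked Deprecated, the call may no longer succeed. `build_run_request` and the placeholder token are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Derived from the store URL; the "/" in "user/actor" becomes "~" in API paths.
ACTOR_ID = "actums~universal-rag-scraper"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxDepth": 10,
    "outputFormat": "markdown",
}
request = build_run_request("YOUR_APIFY_TOKEN", run_input)

# To actually run it (requires a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before spending any platform usage.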

Output (JSON/Dataset)

Each item in the dataset is one article:

{
  "url": "https://docs.apify.com/academy/web-scraping",
  "title": "Web Scraping Academy",
  "platform": "Docusaurus",
  "scrapedAt": "2023-10-27T10:00:00Z",
  "markdown": "# Web Scraping Academy\n\nLearn how to scrape..."
}
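
One way to consume this output is to load the exported items into typed records before indexing them. The sketch below assumes the export is a JSON array of items shaped like the example above; the `Article` class and `load_articles` helper are hypothetical names, and skipping items with empty Markdown is a design choice of this sketch rather than documented Actor behavior.

```python
import json
from dataclasses import dataclass


@dataclass
class Article:
    url: str
    title: str
    platform: str
    scraped_at: str
    markdown: str


def load_articles(raw_json: str) -> list[Article]:
    """Parse an exported dataset (a JSON array of items) into Article records."""
    articles = []
    for item in json.loads(raw_json):
        if not item.get("markdown"):
            continue  # skip items with no extracted content
        articles.append(Article(
            url=item["url"],
            title=item.get("title", ""),
            platform=item.get("platform", "Generic"),
            scraped_at=item.get("scrapedAt", ""),
            markdown=item["markdown"],
        ))
    return articles


# Sample data: the item from this README plus one empty item to show filtering.
sample = json.dumps([
    {
        "url": "https://docs.apify.com/academy/web-scraping",
        "title": "Web Scraping Academy",
        "platform": "Docusaurus",
        "scrapedAt": "2023-10-27T10:00:00Z",
        "markdown": "# Web Scraping Academy\n\nLearn how to scrape...",
    },
    {"url": "https://docs.apify.com/empty", "markdown": ""},
])
articles = load_articles(sample)
```

The `url`, `title`, and `scraped_at` fields can then be stored as metadata alongside each embedding.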

❓ FAQ

Can I scrape a custom-built Help Center?

Yes. The Actor uses a "Smart Fallback" (Readability algorithm). If it doesn't detect Zendesk/Intercom, it will still scan the page, identify the visual "main content" area, and extract it.
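
The Readability-style fallback itself is not shown in this README. As a much simpler stand-in that illustrates the goal, the sketch below drops text inside common boilerplate tags (`nav`, `header`, `footer`, `aside`, `script`, `style`) and keeps the rest. It is not a Readability implementation, and the class name `MainTextExtractor` is hypothetical.

```python
from html.parser import HTMLParser

# Tags whose text is treated as page chrome rather than article content.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any boilerplate tag."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # how many boilerplate tags we are currently inside
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the non-boilerplate text of a page, one block per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)


page = (
    "<nav>Home | Docs</nav>"
    "<main><h1>Reset your password</h1><p>Open Settings and choose Security.</p></main>"
    "<footer>Copyright Acme</footer>"
)
print(extract_main_text(page))
```

A real fallback would also score candidate containers by text density, which this sketch deliberately leaves out.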

Does this handle dynamic JavaScript sites?

Yes. We use Playwright (headless browser) under the hood. We render the full page, execute JavaScript, and then scrape. This works even on React/Vue/Angular apps.

How do I feed this into my LLM?

1. Run the Actor.
2. Download the JSON output.
3. Use the markdown field as the content in your LLM Prompt or Embedding request.
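
The last step can be sketched as follows: split an article's markdown field at headings so each chunk fits an embedding request, then assemble a prompt from the chunks. The helper names `split_by_heading` and `build_prompt` and the prompt wording are hypothetical. The splitter is naive and would also break on lines starting with `# ` inside code blocks.

```python
import re


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks that each start at a heading line."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A new heading closes the chunk collected so far.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a plain-text prompt with the chunks as context."""
    context = "\n\n---\n\n".join(chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


markdown = "# Web Scraping Academy\n\nLearn how to scrape...\n\n## Getting started\n\nInstall the tools."
chunks = split_by_heading(markdown)
prompt = build_prompt("How do I get started?", chunks)
```

Each chunk can be embedded on its own, with the article's url and title stored as metadata for citation.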

📞 Support & Feedback

Found a site we can't scrape? Missing a platform?

Report a Bug: Use the "Issues" tab.
Request a Feature: We add new platforms (e.g., GitBook, Read the Docs) based on user votes!

URL: https://apify.com/actums/universal-rag-scraper