RAG Knowledge Loader

Under maintenance

Pricing

$1.00 / 1,000 results

Scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON for RAG applications.

Rating

0.0

(0)

Developer

BotFlowTech

Maintained by Community

Last modified

2 months ago

Features

Crawls entire documentation sites recursively
Extracts clean, structured content
Removes navigation, headers, footers automatically
Outputs vector-ready JSON format
Supports GitBook, ReadTheDocs, Notion, and custom doc sites
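
The cleanup feature listed above can be pictured with a small sketch. This is not the Actor's implementation; it is a minimal illustration that assumes only bare tag names need removing (the class name BoilerplateStripper and the helper extract_clean_text are hypothetical) and uses only the Python standard library.

```python
from html.parser import HTMLParser

# Assumption: only bare tag names are handled here. Class or id selectors
# such as ".sidebar" are out of scope for this sketch.
REMOVE_TAGS = {"nav", "header", "footer", "script", "style", "iframe"}


class BoilerplateStripper(HTMLParser):
    """Collects page text, skipping anything nested inside a removed tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a removed element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every removed element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_clean_text(html):
    """Return the visible text of `html` with boilerplate elements dropped."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real crawler would typically rely on a CSS-selector engine rather than tag names alone; the sketch only shows the idea of dropping whole subtrees before text extraction.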

Use Cases

Build "Chat with Docs" chatbots
Feed LLMs with up-to-date documentation
Create knowledge bases for RAG pipelines
Automated documentation updates for vector databases

Input Parameters

Required

Start URLs (required): Array of documentation site URLs to scrape
- Example: https://docs.apify.com/, https://your-gitbook-site.com

Optional Configuration

Max pages to crawl (default: 1000): Maximum number of pages to scrape
- Minimum: 1
Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
- Example: ["**/api/**", "**/guides/**"]
Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for main content area
Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to remove, such as navigation and headers
Output Format (default: "vector-ready"):
- "vector-ready": Flat structure optimized for embeddings
- "hierarchical": Nested structure with full metadata
Crawler Type (default: "cheerio"):
- "cheerio": Fast HTTP crawler for static sites
- "playwright": Browser-based crawler for JavaScript-heavy sites
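
To make the include and exclude patterns more concrete, here is a rough sketch of how such a filter might behave. It rests on assumptions: exclusions are taken to win over inclusions, an empty include list is taken to allow every URL, and Python's fnmatch stands in for whatever glob library the Actor actually uses, so edge cases may differ. The function name should_crawl is hypothetical.

```python
from fnmatch import fnmatchcase

# Default exclude patterns as listed in the parameter description.
DEFAULT_EXCLUDES = ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]


def should_crawl(url, include_globs=(), exclude_globs=DEFAULT_EXCLUDES):
    """Decide whether a URL passes the include/exclude glob filters.

    Assumption: excludes are checked first, and an empty include list
    means "include everything".
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if not include_globs:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_globs)


if __name__ == "__main__":
    includes = ["**/api/**", "**/guides/**"]
    for candidate in (
        "https://docs.example.com/guides/intro",
        "https://docs.example.com/guides/manual.pdf",
        "https://docs.example.com/blog/post",
    ):
        print(candidate, should_crawl(candidate, includes))
```

Note that fnmatch lets `*` match across `/`, which is looser than most URL glob implementations; the sketch is meant to show the decision order, not exact matching rules.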

Example Input JSON

{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}

Minimal Input Example

{ "startUrls": [ { "url": "https://docs.example.com/" } ] }
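
When the input is generated from code, a small builder can catch mistakes before a run is started. The sketch below is illustrative only: the field names come from the example input, the allowed values come from the parameter list, and the function name build_run_input is hypothetical.

```python
import json

ALLOWED_OUTPUT_FORMATS = {"vector-ready", "hierarchical"}
ALLOWED_CRAWLER_TYPES = {"cheerio", "playwright"}


def build_run_input(start_urls, max_pages=1000,
                    output_format="vector-ready", crawler_type="cheerio"):
    """Return an input dict shaped like the example input JSON."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    if crawler_type not in ALLOWED_CRAWLER_TYPES:
        raise ValueError(f"unknown crawlerType: {crawler_type!r}")
    return {
        # Each start URL is wrapped in an object with a "url" key.
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "outputFormat": output_format,
        "crawlerType": crawler_type,
    }


if __name__ == "__main__":
    run_input = build_run_input(["https://docs.example.com/"], max_pages=500)
    print(json.dumps(run_input, indent=2))
```

The resulting dict can be serialized with json.dumps and pasted into the Actor's input editor or passed to an API client of your choice.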

Output Format

Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
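
Many vector stores take parallel lists of ids, texts, and metadata. As a hedged illustration of consuming the vector-ready output, the sketch below splits a JSON string of that shape into three such lists. The function name load_vector_ready is hypothetical, and skipping documents with empty text is an assumption made here, not documented Actor behavior.

```python
import json


def load_vector_ready(raw_json):
    """Split vector-ready output into parallel (ids, texts, metadatas) lists."""
    payload = json.loads(raw_json)
    ids, texts, metadatas = [], [], []
    for doc in payload["documents"]:
        text = doc.get("text", "").strip()
        if not text:
            # Assumption: a document without text is not worth embedding.
            continue
        ids.append(doc["id"])
        texts.append(text)
        metadatas.append(doc["metadata"])
    return ids, texts, metadatas
```

The three lists stay aligned by index, so ids[i], texts[i], and metadatas[i] always describe the same page.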

Hierarchical Format

Includes full document structure with headings and metadata:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
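
If a run was made in hierarchical mode but a flat record is needed later, the two example shapes suggest a straightforward mapping. The sketch below is an assumption based purely on the sample documents (both helper names are hypothetical); it also shows one way to turn the headings list into an indented outline.

```python
def hierarchical_to_vector_ready(doc):
    """Map one hierarchical document onto the flat vector-ready shape."""
    meta = doc.get("metadata", {})
    return {
        "id": doc["id"],
        "text": doc["content"],
        "metadata": {
            # In the sample output, "source" and "url" carry the same value.
            "source": doc["url"],
            "title": doc["title"],
            "url": doc["url"],
            "scrapedAt": meta.get("scrapedAt"),
            "wordCount": meta.get("wordCount"),
        },
    }


def heading_outline(doc):
    """Render the headings list as lines indented by heading level."""
    headings = doc.get("metadata", {}).get("headings", [])
    return ["  " * (h["level"] - 1) + h["text"] for h in headings]
```

Fields that exist only in the hierarchical form, such as description, keywords, and characterCount, are dropped by this mapping; keep them in the metadata if your retrieval layer can filter on them.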

Integration with Vector Databases

The output can be used with popular RAG frameworks and vector databases:

LangChain: Use JSONLoader to load documents
LlamaIndex: Import as Document objects
Pinecone/Weaviate: Batch upsert with metadata
Chroma: Add to collection with embeddings
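
Batch upserts usually mean sending documents in fixed-size groups. The following is a library-agnostic sketch of that batching step; it deliberately calls no vector database client, since the exact upsert call depends on the store you use. The names to_records and batched are hypothetical, and the batch size of 100 is an arbitrary assumption.

```python
from itertools import islice


def to_records(documents):
    """Yield one flat record per vector-ready document."""
    for doc in documents:
        yield {"id": doc["id"], "text": doc["text"], "metadata": doc["metadata"]}


def batched(items, size=100):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A typical loop would be `for batch in batched(to_records(documents)):` followed by an embedding call and the store-specific upsert for that batch.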

Tech Docs to LLM-Ready Markdown

hedelka/tech-docs-scraper

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

Dmitry Goncharov

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

Documentation Crawler for RAG

liquid_bark/docs-crawler-for-rag

Specialized crawler for developer documentation sites. Detects frameworks (Docusaurus, GitBook, ReadTheDocs, MkDocs, Sphinx), extracts clean content, and outputs semantically chunked Markdown optimized for RAG pipelines.

Izz

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Gabriel Antony Xaviour

RAG-Ready Documentation Scraper

alaricus/rag-docs-markdown-scraper

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Alaricus

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

adinfosys-labs/rag-ready-web-scraper-smart-chunker-for-ai-knowledge-bases

RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. Generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases.

Artashes Arakelyan

Universal Knowledge Base Scraper (RAG Ready)

actums/universal-rag-scraper

Turn any Help Center into LLM-ready Markdown. Supports Zendesk, Intercom, Docusaurus, and generic sites. Perfect for RAG and AI Agents.

Actums

Rag Embedding Generator

labrat011/rag-embedding-generator

Generate vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains with RAG Content Chunker for end-to-end RAG pipelines. Outputs raw vectors ready for any vector database.

mick_

Web Scraper RAG Ready

traorealexy/Web-Sraper-RAG-Ready

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Alexy Traore

URL: https://apify.com/botflowtech/rag-knowledge-loader