URL: https://apify.com/botflowtech/rag-knowledge-loader

RAG Knowledge Loader

Under maintenance

Pricing

$1.00 / 1,000 results

Scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON format for RAG applications.

Rating: 0.0 (0)

Developer: BotFlowTech (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 4
  • Monthly active users: 0
  • Last modified: 2 months ago

Features

  • Crawls entire documentation sites recursively
  • Extracts clean, structured content
  • Removes navigation, headers, and footers automatically
  • Outputs vector-ready JSON format
  • Supports GitBook, ReadTheDocs, Notion, and custom doc sites

Use Cases

  • Build "Chat with Docs" chatbots
  • Feed LLMs with up-to-date documentation
  • Create knowledge bases for RAG pipelines
  • Automate documentation updates for vector databases

Input Parameters

Required

  • Start URLs (startUrls, required): Array of documentation site URLs to scrape, each given as an object with a url field
    • Example: https://docs.apify.com/, https://your-gitbook-site.com

Optional Configuration

  • Max pages to crawl (maxPages, default: 1000): Maximum number of pages to scrape
    • Minimum: 1
  • Include URL patterns (includeUrlGlobs, default: []): Only crawl URLs matching these glob patterns
    • Example: ["**/api/**", "**/guides/**"]
  • Exclude URL patterns (excludeUrlGlobs, default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these glob patterns
  • Content CSS Selectors (contentSelectors, default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for the main content area
  • Remove CSS Selectors (removeSelectors, default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to strip, such as navigation and headers
  • Output Format (outputFormat, default: "vector-ready"):
    • "vector-ready": Flat structure optimized for embeddings
    • "hierarchical": Nested structure with full metadata
  • Crawler Type (crawlerType, default: "cheerio"):
    • "cheerio": Fast HTTP crawler for static sites
    • "playwright": Browser-based crawler for JavaScript-heavy sites
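
The include and exclude patterns can be read as a two-step URL filter. The sketch below is one possible interpretation, not the Actor's actual matcher: it assumes that "**" matches any run of characters (including "/"), that "*" stays within one path segment, that exclusions take priority over inclusions, and that an empty include list allows everything. The helper names glob_to_regex and should_crawl are hypothetical.

```python
import re

# Default exclude patterns as listed in the parameter table above.
DEFAULT_EXCLUDES = ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL glob into a regex (assumed semantics, see lead-in)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # assumed: '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # assumed: '*' stays inside one segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_crawl(url: str, include_globs, exclude_globs=DEFAULT_EXCLUDES) -> bool:
    """Return True if the URL passes the exclude filter and then the include filter."""
    if any(glob_to_regex(g).fullmatch(url) for g in exclude_globs):
        return False                 # assumed: exclusions win over inclusions
    if not include_globs:
        return True                  # assumed: empty include list allows all URLs
    return any(glob_to_regex(g).fullmatch(url) for g in include_globs)
```

Under these assumptions, a pattern such as "**/api/**" requires at least a trailing slash after "api", so a URL ending in "/api" would not match.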

Example Input JSON

{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}

Minimal Input Example

{ "startUrls": [ { "url": "https://docs.example.com/" } ] }
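
As a rough illustration of how such an input could be assembled and submitted from Python, the sketch below builds the input object with the field names from the examples above and starts a run through the apify-client package. The Actor ID is taken from the store URL; the helper names, the APIFY_TOKEN environment variable, and the assumption that results are written to the run's default dataset are not confirmed by this page.

```python
import os


def build_run_input(start_urls, max_pages=1000, **options):
    """Assemble an input object using the field names from the examples above."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
    }
    # Optional fields such as outputFormat or crawlerType pass through unchanged.
    run_input.update(options)
    return run_input


def run_loader(run_input):
    """Start the Actor and collect its items (needs: pip install apify-client)."""
    from apify_client import ApifyClient  # imported lazily so the helper above works without it

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env variable name
    run = client.actor("botflowtech/rag-knowledge-loader").call(run_input=run_input)
    # Assumption: the Actor stores its results in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Since the listing is marked as under maintenance, a run started this way may fail or return nothing.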

Output Format

Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
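
Each document's text field holds a whole page, which can exceed the input limit of an embedding model. The sketch below shows one way to split the vector-ready output into overlapping word windows while carrying the page metadata along. It is a post-processing step assumed for illustration, not something the Actor is stated to do; chunk_documents, the chunkIndex key, and the window sizes are hypothetical.

```python
def chunk_documents(output, chunk_words=200, overlap=40):
    """Split each document's text into overlapping word windows."""
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    step = chunk_words - overlap
    chunks = []
    for doc in output["documents"]:
        words = doc["text"].split()
        for n, start in enumerate(range(0, len(words), step)):
            window = words[start:start + chunk_words]
            chunks.append({
                "id": f"{doc['id']}-{n}",            # chunk id derived from the page id
                "text": " ".join(window),
                "metadata": {**doc["metadata"], "chunkIndex": n},
            })
            if start + chunk_words >= len(words):
                break                                # last window reached the end of the page
    return chunks
```

Splitting on words is only a stand-in for a tokenizer; a token-based limit would be more accurate for a specific embedding model.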

Hierarchical Format

Includes full document structure with headings and metadata:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
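
The two formats carry overlapping information, so a hierarchical document can be reduced to the flat shape when only embeddings are needed. The sketch below maps the fields by name, based solely on the two samples above, and also renders the headings list as an indented outline. Both helpers are hypothetical, and the real output may contain fields the samples do not show.

```python
def to_vector_ready(hier_doc):
    """Map one hierarchical document onto the flat vector-ready shape."""
    meta = hier_doc["metadata"]
    return {
        "id": hier_doc["id"],
        "text": hier_doc["content"],
        "metadata": {
            "source": hier_doc["url"],   # the flat sample repeats the URL in both fields
            "title": hier_doc["title"],
            "url": hier_doc["url"],
            "scrapedAt": meta["scrapedAt"],
            "wordCount": meta["wordCount"],
        },
    }


def heading_outline(hier_doc):
    """Render the headings list as text, indented two spaces per level."""
    headings = hier_doc["metadata"].get("headings", [])
    return "\n".join("  " * (h["level"] - 1) + h["text"] for h in headings)
```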

Integration with Vector Databases

The output is ready to use with popular RAG frameworks:

  • LangChain: Use JSONLoader to load documents
  • LlamaIndex: Import as Document objects
  • Pinecone/Weaviate: Batch upsert with metadata
  • Chroma: Add to collection with embeddings
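
As a rough sketch of the batch-upsert step, the code below groups vector-ready documents into fixed-size batches and builds the ids, documents, and metadatas lists that Chroma's collection.add accepts as keyword arguments. The batch size of 100 is arbitrary, the helper names are hypothetical, and the actual Chroma call appears only in a comment so the block runs without the library installed.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def chroma_payloads(output, batch_size=100):
    """Yield one keyword-argument dict per batch of documents."""
    for batch in batched(output["documents"], batch_size):
        yield {
            "ids": [d["id"] for d in batch],
            "documents": [d["text"] for d in batch],
            # Chroma expects flat scalar metadata values, so nested values are dropped.
            "metadatas": [
                {k: v for k, v in d["metadata"].items()
                 if isinstance(v, (str, int, float, bool))}
                for d in batch
            ],
        }


# With chromadb installed, each payload could be passed on like this:
#   collection.add(**payload)
```

The same ids, texts, and metadata can be regrouped for Pinecone or Weaviate, whose upsert calls expect their own record shapes and precomputed vectors.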
