VOOZH about

URL: https://apify.com/cspnair/rag-knowledge-graph-builder

⇱ RAG Knowledge Graph Builder | Website to AI-Ready Dataset Β· Apify


Pricing

from $0.01 / 1,000 results

Go to Apify Store

Rag Knowledge Graph Builder

Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.

Pricing

from $0.01 / 1,000 results

Rating

5.0

(7)

Developer

πŸ‘ csp

csp

Maintained by Community

Actor stats

28

Bookmarked

129

Total users

0

Monthly active users

85 days

Issues response

6 months ago

Last modified

Share

RAG-Ready Knowledge Graph Builder

🧠 Transform any website into a semantic dataset optimized for Retrieval-Augmented Generation (RAG)

The Problem

Traditional web scrapers produce giant walls of text (like llms-full.txt). For large sites, this approach has critical limitations:

  • Exceeds context windows of most LLMs
  • Models get "lost in the middle" of long documents
  • Raw text provides no semantic structure for retrieval
  • Poor retrieval accuracy in RAG pipelines

The Solution

This Actor creates a pre-indexed semantic dataset that AI agents can ingest instantly with high accuracy:

  1. Intelligent Crawling - Crawls websites following same-domain links
  2. Semantic Chunking - Uses recursive character splitting to create logical segments (500-1000 tokens)
  3. Hypothetical Question Generation - For every chunk, generates potential user questions using LLM
  4. RAG-Ready Output - Structured JSON where each object contains chunk text, source URL, and hypothetical questions

Why It's Better

Instead of raw data, you get pre-indexed data that skyrockets retrieval accuracy:

{
"chunkId":"abc123_0",
"chunkText":"Apify is a platform for web scraping and automation...",
"sourceUrl":"https://docs.apify.com/platform",
"hypotheticalQuestions":[
"What is Apify used for?",
"How does Apify help with web scraping?",
"What automation capabilities does Apify provide?"
],
"tokenCount":487,
"metadata":{
"pageTitle":"Apify Platform Overview",
"crawledAt":"2024-01-15T10:30:00Z"
}
}

Input Configuration

ParameterTypeDefaultDescription
startUrlsarrayrequiredURLs to start crawling from
maxCrawlPagesinteger50Maximum pages to crawl (0 = unlimited)
maxCrawlDepthinteger3Maximum link depth from start URLs
chunkSizeinteger750Target chunk size in tokens
chunkOverlapinteger100Overlapping tokens between chunks
questionsPerChunkinteger3Hypothetical questions per chunk
llmProviderstring"openai"LLM provider (openai/anthropic)
llmModelstring"gpt-4o-mini"Model for question generation
openaiApiKeystring-OpenAI API key (required for OpenAI)
anthropicApiKeystring-Anthropic API key (required for Anthropic)
excludeSelectorsarray[...]CSS selectors to exclude
urlPatternsarray[]URL patterns to include
excludeUrlPatternsarray[...]URL patterns to exclude

Output

Dataset (per chunk)

{
"chunkId":"unique_chunk_identifier",
"chunkIndex":0,
"chunkText":"The actual text content...",
"tokenCount":523,
"sourceUrl":"https://example.com/page",
"pageTitle":"Page Title",
"pageDescription":"Meta description",
"hypotheticalQuestions":[
"Question 1?",
"Question 2?",
"Question 3?"
],
"questionsCount":3,
"metadata":{
"crawledAt":"2024-01-15T10:30:00Z",
"chunkStart":0,
"chunkEnd":2100,
"totalChunksInPage":5
}
}

Key-Value Store

  • OUTPUT - Processing summary with statistics
  • rag-dataset.json - Complete dataset as single JSON file

Use Cases

1. Build a Documentation Chatbot

Crawl your docs site and create a knowledge base for a customer support bot.

2. Create a Research Assistant

Index academic papers or research sites for semantic search.

3. Power a Content Discovery Engine

Build a recommendation system based on semantic similarity.

4. Train Custom Embeddings

Use the chunks and questions to fine-tune embedding models.

LLM Cost Estimation

Using GPT-4o-mini (~$0.15/1M input tokens, ~$0.60/1M output tokens):

  • 100 pages Γ— 5 chunks/page Γ— 3 questions = ~$0.10-0.20

Using Claude 3 Haiku (~$0.25/1M input tokens, ~$1.25/1M output tokens):

  • 100 pages Γ— 5 chunks/page Γ— 3 questions = ~$0.15-0.30

Integration Examples

With LangChain

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
# Load the RAG dataset
chunks = load_apify_dataset("your-run-id")
# Create documents with questions as metadata
documents =[]
for chunk in chunks:
doc = Document(
page_content=chunk["chunkText"],
metadata={
"source": chunk["sourceUrl"],
"questions": chunk["hypotheticalQuestions"]
}
)
documents.append(doc)
# Create vector store
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

With LlamaIndex

from llama_index import Document, VectorStoreIndex
# Create documents from chunks
documents =[
Document(
text=chunk["chunkText"],
metadata={
"url": chunk["sourceUrl"],
"questions": chunk["hypotheticalQuestions"]
}
)
for chunk in chunks
]
# Build index
index = VectorStoreIndex.from_documents(documents)

Technical Details

Chunking Strategy

  • Recursive Character Splitter - Splits on semantic boundaries (paragraphs β†’ sentences β†’ words)
  • Token-based sizing - Uses tiktoken for accurate GPT-4 token counting
  • Overlap handling - Maintains context between chunks

Question Generation

  • Uses system prompts optimized for retrieval-focused questions
  • Generates diverse question types (what, how, why, when, etc.)
  • Questions are self-contained and specific to chunk content

License

ISC

Support

For issues or feature requests, please open an issue on the repository.

You might also like

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

πŸ‘ User avatar

Gabriel Antony Xaviour

9

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

adinfosys-labs/rag-ready-web-scraper-smart-chunker-for-ai-knowledge-bases

RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. Generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases

πŸ‘ User avatar

Artashes Arakelyan

7

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

AI Sitemap Content Extractor

enosgb/ai-sitemap-content-extractor

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries.

Knowledge Intelligence Engine β€” Website to Markdown for RAG

ryanclinton/website-content-to-markdown

Turn any website, documentation site or help centre into a retrieval-ready knowledge corpus for RAG and AI search. Clean Markdown plus chunks, change detection, deduplication, retrieval scoring, version awareness and a full corpus audit, in one run.

16

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summaryβ€”ready for embeddings or vector DBs without extra glue code.

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

πŸ‘ User avatar

Nguyα»…n Anh Duy

3

4.7