Rag Knowledge Graph Builder

Pricing

from $0.01 / 1,000 results

Rag Knowledge Graph Builder

Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.

Pricing

from $0.01 / 1,000 results

Rating

5.0

(7)

Developer

👁 csp

csp

Maintained by Community

Actor stats

Bookmarked

129

Total users

Monthly active users

85 days

Issues response

6 months ago

Last modified

RAG-Ready Knowledge Graph Builder

🧠 Transform any website into a semantic dataset optimized for Retrieval-Augmented Generation (RAG)

The Problem

Traditional web scrapers produce giant walls of text (like llms-full.txt). For large sites, this approach has critical limitations:

Exceeds context windows of most LLMs
Models get "lost in the middle" of long documents
Raw text provides no semantic structure for retrieval
Poor retrieval accuracy in RAG pipelines

The Solution

This Actor creates a pre-indexed semantic dataset that AI agents can ingest instantly with high accuracy:

Intelligent Crawling - Crawls websites following same-domain links
Semantic Chunking - Uses recursive character splitting to create logical segments (500-1000 tokens)
Hypothetical Question Generation - For every chunk, generates potential user questions using LLM
RAG-Ready Output - Structured JSON where each object contains chunk text, source URL, and hypothetical questions

Why It's Better

Instead of raw data, you get pre-indexed data that skyrockets retrieval accuracy:

{
"chunkId":"abc123_0",
"chunkText":"Apify is a platform for web scraping and automation...",
"sourceUrl":"https://docs.apify.com/platform",
"hypotheticalQuestions":[
"What is Apify used for?",
"How does Apify help with web scraping?",
"What automation capabilities does Apify provide?"
],
"tokenCount":487,
"metadata":{
"pageTitle":"Apify Platform Overview",
"crawledAt":"2024-01-15T10:30:00Z"
}
}

Input Configuration

Parameter	Type	Default	Description
`startUrls`	array	required	URLs to start crawling from
`maxCrawlPages`	integer	50	Maximum pages to crawl (0 = unlimited)
`maxCrawlDepth`	integer	3	Maximum link depth from start URLs
`chunkSize`	integer	750	Target chunk size in tokens
`chunkOverlap`	integer	100	Overlapping tokens between chunks
`questionsPerChunk`	integer	3	Hypothetical questions per chunk
`llmProvider`	string	"openai"	LLM provider (openai/anthropic)
`llmModel`	string	"gpt-4o-mini"	Model for question generation
`openaiApiKey`	string	-	OpenAI API key (required for OpenAI)
`anthropicApiKey`	string	-	Anthropic API key (required for Anthropic)
`excludeSelectors`	array	[...]	CSS selectors to exclude
`urlPatterns`	array	[]	URL patterns to include
`excludeUrlPatterns`	array	[...]	URL patterns to exclude

Output

Dataset (per chunk)

{
"chunkId":"unique_chunk_identifier",
"chunkIndex":0,
"chunkText":"The actual text content...",
"tokenCount":523,
"sourceUrl":"https://example.com/page",
"pageTitle":"Page Title",
"pageDescription":"Meta description",
"hypotheticalQuestions":[
"Question 1?",
"Question 2?",
"Question 3?"
],
"questionsCount":3,
"metadata":{
"crawledAt":"2024-01-15T10:30:00Z",
"chunkStart":0,
"chunkEnd":2100,
"totalChunksInPage":5
}
}

Key-Value Store

OUTPUT - Processing summary with statistics
rag-dataset.json - Complete dataset as single JSON file

Use Cases

1. Build a Documentation Chatbot

Crawl your docs site and create a knowledge base for a customer support bot.

2. Create a Research Assistant

Index academic papers or research sites for semantic search.

3. Power a Content Discovery Engine

Build a recommendation system based on semantic similarity.

4. Train Custom Embeddings

Use the chunks and questions to fine-tune embedding models.

LLM Cost Estimation

Using GPT-4o-mini (~$0.15/1M input tokens, ~$0.60/1M output tokens):

100 pages × 5 chunks/page × 3 questions = ~$0.10-0.20

Using Claude 3 Haiku (~$0.25/1M input tokens, ~$1.25/1M output tokens):

100 pages × 5 chunks/page × 3 questions = ~$0.15-0.30

Integration Examples

With LangChain

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
# Load the RAG dataset
chunks = load_apify_dataset("your-run-id")
# Create documents with questions as metadata
documents =[]
for chunk in chunks:
 doc = Document(
 page_content=chunk["chunkText"],
 metadata={
"source": chunk["sourceUrl"],
"questions": chunk["hypotheticalQuestions"]
}
)
 documents.append(doc)
# Create vector store
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

With LlamaIndex

from llama_index import Document, VectorStoreIndex
# Create documents from chunks
documents =[
 Document(
 text=chunk["chunkText"],
 metadata={
"url": chunk["sourceUrl"],
"questions": chunk["hypotheticalQuestions"]
}
)
for chunk in chunks
]
# Build index
index = VectorStoreIndex.from_documents(documents)

Technical Details

Chunking Strategy

Recursive Character Splitter - Splits on semantic boundaries (paragraphs → sentences → words)
Token-based sizing - Uses tiktoken for accurate GPT-4 token counting
Overlap handling - Maintains context between chunks

Question Generation

Uses system prompts optimized for retrieval-focused questions
Generates diverse question types (what, how, why, when, etc.)
Questions are self-contained and specific to chunk content

License

ISC

Support

For issues or feature requests, please open an issue on the repository.

👁 Docs To Rag avatar

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

👁 User avatar

Gabriel Antony Xaviour

👁 RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases avatar

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

adinfosys-labs/rag-ready-web-scraper-smart-chunker-for-ai-knowledge-bases

RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. Generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases

👁 User avatar

Artashes Arakelyan

👁 Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks avatar

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

👁 User avatar

Ken M

PDF → RAG Chunks (Token-Aware, Vector-Ready)

gochujang/pdf-rag-chunker

Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.

👁 User avatar

Hojun Lee

👁 AI Sitemap Content Extractor avatar

AI Sitemap Content Extractor

enosgb/ai-sitemap-content-extractor

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries.

👁 User avatar

Enos Melo

👁 Knowledge Intelligence Engine — Website to Markdown for RAG avatar

Knowledge Intelligence Engine — Website to Markdown for RAG

ryanclinton/website-content-to-markdown

Turn any website, documentation site or help centre into a retrieval-ready knowledge corpus for RAG and AI search. Clean Markdown plus chunks, change detection, deduplication, retrieval scoring, version awareness and a full corpus audit, in one run.

👁 User avatar

Ryan Clinton

👁 AI / RAG Web Crawler avatar

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

👁 User avatar

Group Oject

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. Json output includes: * text: Normalized page content * links: Extracted sub-URLs

👁 User avatar

wallnut.ai

179

👁 Rag Content Chunker avatar

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary—ready for embeddings or vector DBs without extra glue code.

👁 User avatar

mick_

👁 RAG-Ready Markdown Converter & Chunker avatar

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

👁 User avatar

Nguyễn Anh Duy

4.7

URL: https://apify.com/cspnair/rag-knowledge-graph-builder