URL: https://apify.com/gabrielaxy/docs-to-rag

DocsToRAG - Documentation to Vector Embeddings for AI · Apify


Pricing

Pay per usage

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Rating

0.0

(0)

Developer

Gabriel Antony Xaviour

Maintained by Community

Actor stats

0

Bookmarked

9

Total users

1

Monthly active users

6 months ago

Last modified

📚 DocsToRAG

Transform any documentation site into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Features • Quick Start • Configuration • Output • Vector DBs • Use Cases


What is DocsToRAG?

DocsToRAG crawls documentation websites and converts them into high-quality chunks optimized for Retrieval-Augmented Generation (RAG) systems. Unlike simple text splitters, it uses semantic chunking that preserves code blocks, lists, and document structure.

Key Benefits

| Feature | Description |
| --- | --- |
| 🧠 Semantic Chunking | Preserves code blocks with explanations, keeps lists intact, respects document hierarchy |
| ⭐ Quality Scoring | Automatically filters boilerplate, navigation, and low-value content |
| 🔗 Vector DB Integration | Push directly to Pinecone, Supabase, or Qdrant |
| 📊 Rich Metadata | Content classification, complexity levels, topic extraction, chunk relationships |
| ⚡ Embeddings Ready | Generate OpenAI embeddings in the same run |

Features

Semantic Chunking

Traditional chunking splits text by character count, breaking code blocks and sentences mid-way. DocsToRAG understands document structure:

  • Preserves code blocks with their surrounding explanations
  • Keeps lists and paragraphs intact as logical units
  • Respects document hierarchy with H1/H2/H3 sections
  • Smart overlap using complete semantic blocks, not raw tokens
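The idea behind structure-aware chunking can be sketched in a few lines of Python. This is an illustration only, not the Actor's implementation: the function names, the whitespace-based token estimate, and the greedy packing rule are all assumptions made for the example. It splits Markdown on blank lines without ever cutting inside a fenced code block, then packs whole blocks into chunks and repeats whole blocks as overlap.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid nesting fences


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_code = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # toggle on every opening/closing fence
        if line.strip() == "" and not in_code:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_chunks(blocks: list[str], max_tokens: int, overlap_blocks: int = 1) -> list[str]:
    """Greedily pack whole blocks into chunks; overlap repeats whole blocks."""
    chunks: list[str] = []
    current: list[str] = []
    for block in blocks:
        size = sum(approx_tokens(b) for b in current) + approx_tokens(block)
        if current and size > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the last block(s) forward so neighbouring chunks share context.
            current = current[-overlap_blocks:] if overlap_blocks else []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because blocks are never split, a single oversized code block produces a chunk larger than `max_tokens`. A production chunker would need a policy for that case.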

Quality Scoring

Every chunk receives a quality score (0-100) based on:

| Dimension | What It Measures |
| --- | --- |
| Information Density | Ratio of unique meaningful terms |
| Completeness | Proper sentence structure, punctuation |
| Code Quality | Language specified, meaningful length |
| Readability | Sentence length, clarity |

Quality Flags identify issues like boilerplate, low_content, navigation_text.
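To make the scoring dimensions concrete, here is a minimal sketch of how two of them might be computed and combined. The formulas, the stopword list, the five-word threshold, and the equal weighting are all hypothetical. The source does not document the Actor's actual scoring rules, so treat this as one plausible heuristic rather than a description of DocsToRAG.

```python
import re

# Hypothetical stopword list; a real scorer would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for", "on", "with"}


def information_density(text: str) -> float:
    """Unique non-stopword terms divided by total words, scaled to 0-100."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    meaningful = {w for w in words if w not in STOPWORDS}
    return 100.0 * len(meaningful) / len(words)


def completeness(text: str) -> float:
    """Share of sentences that start uppercase and end with . ! or ?, 0-100."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return 0.0
    complete = sum(1 for s in sentences if s[0].isupper() and s[-1] in ".!?")
    return 100.0 * complete / len(sentences)


def score_chunk(text: str) -> dict:
    """Combine the dimensions with equal weight (an assumption) and attach flags."""
    dimensions = {
        "informationDensity": round(information_density(text)),
        "completeness": round(completeness(text)),
    }
    flags = []
    if len(text.split()) < 5:  # hypothetical threshold for the low_content flag
        flags.append("low_content")
    overall = round(sum(dimensions.values()) / len(dimensions))
    return {"overall": overall, "dimensions": dimensions, "flags": flags}
```

A short navigation string such as "Home Docs API" scores high on density alone, since every word is unique. That is why a single dimension is not enough and why flags are tracked separately.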

Enhanced Metadata

Each chunk includes rich metadata for better retrieval:

{
  "contentType": "tutorial",
  "complexity": "beginner",
  "topics": ["CheerioCrawler", "RequestQueue"],
  "headingPath": "Quick Start > Installation",
  "prevChunkId": "chunk_abc123",
  "nextChunkId": "chunk_def456"
}
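One way a retrieval pipeline might use the `prevChunkId` and `nextChunkId` links is to widen a search hit with its neighbours before passing it to the model. The sketch below assumes chunks follow the structure shown under Output Schema (an `id`, a `text`, and the links inside `metadata`) and that they have been loaded into a dictionary keyed by `id`. The function name and the dictionary are illustrative assumptions, not part of the Actor.

```python
def expand_with_neighbors(chunk_id: str, chunks_by_id: dict[str, dict]) -> str:
    """Return the hit's text together with its previous and next chunk, in reading order."""
    meta = chunks_by_id[chunk_id].get("metadata", {})
    ordered_texts = []
    for neighbor_id in (meta.get("prevChunkId"), chunk_id, meta.get("nextChunkId")):
        # A link may be absent (first/last chunk) or point at a filtered-out chunk.
        neighbor = chunks_by_id.get(neighbor_id) if neighbor_id else None
        if neighbor is not None:
            ordered_texts.append(neighbor["text"])
    return "\n\n".join(ordered_texts)
```

Links that point at chunks removed by the quality filter are skipped, so the function degrades to returning the hit alone.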

Quick Start

Basic Usage

Crawl a documentation site and output semantic chunks:

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 50,
  "chunkingStrategy": "semantic",
  "outputFormat": "jsonl"
}

With Quality Filter

Only output chunks scoring above 50:

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 50
}

With Embeddings

Generate OpenAI embeddings for each chunk:

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "chunkingStrategy": "semantic",
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "embeddingModel": "text-embedding-3-small"
}

Full Pipeline (Crawl → Chunk → Embed → Store)

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 40,
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}

Input Configuration

Crawling Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Documentation URLs to crawl |
| maxDepth | integer | 10 | How many levels deep to crawl |
| maxPages | integer | 1000 | Maximum pages to process |
| includeGlobs | array | - | Only crawl URLs matching these patterns |
| excludeGlobs | array | - | Skip URLs matching these patterns |

Chunking Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunkingStrategy | string | semantic | semantic or simple |
| chunkSize | integer | 500 | Target chunk size in tokens |
| chunkOverlap | integer | 50 | Overlap between chunks |
| splitByHeaders | boolean | true | Create new chunks at H1/H2 (simple mode) |
| includeCodeBlocks | boolean | true | Include code snippets |

Quality Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enableQualityScoring | boolean | true | Enable quality scoring |
| minQualityScore | integer | 40 | Minimum quality score (0-100) |
| includeQualityInMetadata | boolean | true | Include scores in output |
| enrichMetadata | boolean | true | Add content classification |

Embedding Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generateEmbeddings | boolean | false | Generate OpenAI embeddings |
| openaiApiKey | string | - | Your OpenAI API key |
| embeddingModel | string | text-embedding-3-small | Model to use |

Vector Database Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vectorDbProvider | string | none | none, pinecone, supabase, or qdrant |
| vectorDbConfig | object | - | Provider-specific configuration |
| vectorDbNamespace | string | - | Namespace/collection for vectors |
| upsertBatchSize | integer | 100 | Batch size for upserts |

Output Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| outputFormat | string | jsonl | json, jsonl, or csv |

Output Schema

Chunk Structure

{
  "id": "chunk_a1b2c3d4",
  "text": "## Installation\n\nTo install the package, run:\n\n```bash\nnpm install crawlee\n```",
  "tokenCount": 24,
  "metadata": {
    "sourceUrl": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start",
    "section": "Installation",
    "breadcrumbs": ["Docs", "Quick Start"],
    "chunkIndex": 2,
    "totalChunks": 8,
    "hierarchy": ["Quick Start", "Installation"],
    "headingLevel": 2,
    "hasCode": true,
    "codeLanguages": ["bash"],
    "contentType": "mixed",
    "prevChunkId": "chunk_x1y2z3w4",
    "nextChunkId": "chunk_e5f6g7h8",
    "quality": {
      "overall": 78,
      "dimensions": {
        "informationDensity": 72,
        "completeness": 85,
        "codeQuality": 80,
        "readability": 75
      },
      "flags": []
    }
  },
  "embedding": [0.123, -0.456, ...]
}
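Downstream code will usually read the exported chunks and apply its own threshold. The sketch below assumes the `jsonl` output format writes one chunk object per line with the structure shown above. The function name and the choice to keep unscored chunks are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def load_chunks(lines: Iterable[str], min_score: int = 0) -> Iterator[dict]:
    """Yield chunk objects from JSONL lines, skipping those scored below min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        chunk = json.loads(line)
        quality = chunk.get("metadata", {}).get("quality")
        # Chunks without a quality object are kept: scoring may have been disabled.
        if quality is not None and quality.get("overall", 0) < min_score:
            continue
        yield chunk
```

Passing an open file handle works directly, for example `load_chunks(open("chunks.jsonl", encoding="utf-8"), min_score=50)`, where `chunks.jsonl` is a placeholder filename.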

Run Summary (OUTPUT)

{
  "summary": {
    "totalPages": 45,
    "totalChunks": 312,
    "uniqueChunks": 287,
    "avgChunkSize": 423,
    "embeddingsGenerated": true,
    "crawlDurationSec": 67,
    "vectorDb": {
      "provider": "pinecone",
      "namespace": "docs-v1",
      "upsertedCount": 287
    },
    "qualityStats": {
      "avgScore": 71,
      "scoreDistribution": {
        "excellent": 89,
        "good": 142,
        "fair": 56,
        "filtered": 25
      }
    }
  }
}

Vector Databases

Pinecone

{
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}

Supabase

{
  "vectorDbProvider": "supabase",
  "vectorDbConfig": {
    "url": "https://your-project.supabase.co",
    "anonKey": "your-anon-key",
    "tableName": "documents"
  }
}

Required Supabase table schema:

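-- The VECTOR type is provided by the pgvector extension.
-- 1536 matches the default dimension of text-embedding-3-small.
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  content TEXT,
  metadata JSONB,
  embedding VECTOR(1536)
);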

Qdrant

{
  "vectorDbProvider": "qdrant",
  "vectorDbConfig": {
    "url": "https://your-cluster.qdrant.io",
    "apiKey": "your-qdrant-api-key",
    "collectionName": "docs"
  }
}

Environment Variables

Store API keys securely in Actor environment variables instead of input:

| Variable | Description |
| --- | --- |
| OPENAI_API_KEY | OpenAI API key for embeddings |
| PINECONE_API_KEY | Pinecone API key |
| PINECONE_INDEX_NAME | Default Pinecone index |
| SUPABASE_URL | Supabase project URL |
| SUPABASE_ANON_KEY | Supabase anonymous key |
| QDRANT_URL | Qdrant cluster URL |
| QDRANT_API_KEY | Qdrant API key |

With environment variables set, input simplifies to:

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "generateEmbeddings": true,
  "vectorDbProvider": "pinecone"
}
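The fallback from input to environment variables could work roughly as sketched below for the Pinecone settings. The resolution order (input value first, then environment variable) is an assumption, as is the function name. The source only states that input can be shortened once the variables are set.

```python
import os
from typing import Mapping


def resolve_pinecone_config(actor_input: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Resolve Pinecone settings from input, falling back to environment variables."""
    config = actor_input.get("vectorDbConfig") or {}
    # Assumed precedence: an explicit input value wins over the environment.
    api_key = config.get("apiKey") or env.get("PINECONE_API_KEY")
    index_name = config.get("indexName") or env.get("PINECONE_INDEX_NAME")
    missing = [
        name
        for name, value in (("apiKey", api_key), ("indexName", index_name))
        if not value
    ]
    if missing:
        raise ValueError(f"Missing Pinecone settings: {', '.join(missing)}")
    return {"apiKey": api_key, "indexName": index_name}
```

Failing early with the list of missing settings avoids crawling and embedding a whole site only to discover at upsert time that no index was configured.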

Use Cases

| Use Case | Description |
| --- | --- |
| RAG Applications | Build knowledge bases for AI assistants and chatbots |
| Documentation Search | Create semantic search indexes for docs sites |
| Training Data | Prepare high-quality documentation for fine-tuning |
| Content Analysis | Analyze documentation quality across projects |
| Knowledge Graphs | Extract structured information from docs |

Cost Estimation

| Component | Cost |
| --- | --- |
| Crawling | ~$0.001 per page (Apify compute) |
| Embeddings | ~$0.02 per 1M tokens (text-embedding-3-small) |
| Vector DB | Varies by provider |

Example: 100 pages → ~500 chunks → ~50K tokens → ~$0.10 total
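The arithmetic behind that example can be written out as a small helper. The two rates are the approximate figures from the table above. The function name is an assumption, and vector database charges are left out because they vary by provider.

```python
CRAWL_COST_PER_PAGE = 0.001            # USD, approximate Apify compute figure from the table
EMBED_COST_PER_MILLION_TOKENS = 0.02   # USD, text-embedding-3-small figure from the table


def estimate_cost(pages: int, tokens: int) -> dict:
    """Estimate crawl and embedding cost in USD; vector DB cost is not included."""
    crawl = pages * CRAWL_COST_PER_PAGE
    embeddings = tokens / 1_000_000 * EMBED_COST_PER_MILLION_TOKENS
    return {"crawl": crawl, "embeddings": embeddings, "total": crawl + embeddings}
```

For 100 pages and 50,000 tokens, crawling accounts for about $0.10 and embeddings for about $0.001, so the total is dominated by compute rather than by the embedding model.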


FAQ

Q: What's the difference between semantic and simple chunking?

Simple chunking splits by character count with optional header breaks. Semantic chunking understands document structure: it keeps code blocks intact, preserves list items, and maintains paragraph coherence.

Q: How does quality scoring work?

Each chunk is scored 0-100 based on information density, completeness, code quality, and readability. Low-scoring content (boilerplate, navigation, cookie notices) is automatically filtered.

Q: Can I use my own embedding model?

Not yet. Only OpenAI embedding models are currently supported. The embeddingModel parameter accepts any OpenAI embedding model ID.

Q: How do I handle large documentation sites?

Use includeGlobs and excludeGlobs to target specific sections, and set maxPages to cap each run. Consider running the Actor several times, with a different namespace for each documentation section.
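The effect of combining include and exclude patterns can be previewed locally before a run. The sketch below uses Python's `fnmatch` as a stand-in. The Actor's own glob dialect is not documented here and may differ (for example in how `**` is treated), and the rule that exclusions win over inclusions is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def should_crawl(
    url: str,
    include_globs: Optional[Sequence[str]] = None,
    exclude_globs: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if the URL passes the include/exclude patterns."""
    # Assumed rule: an exclude match always wins.
    if exclude_globs and any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    # With no include patterns, every non-excluded URL is allowed.
    if include_globs:
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    return True
```

Running a list of known URLs through such a check is a cheap way to confirm that a pattern set selects the intended section before spending crawl budget on it.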


Support

  • Issues: Report bugs or request features on GitHub
  • Documentation: See the full README in the Actor source
  • API: Use the Apify API to run this Actor programmatically
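As a starting point for programmatic use, the sketch below builds the HTTP request that starts a run through the Apify REST API using only the standard library. The Actor ID is derived from the Store URL by replacing the slash with a tilde. The function name is an assumption, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "gabrielaxy~docs-to-rag"  # "username~actor-name" form of the Store path


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{ACTOR_ID}/runs",
        data=json.dumps(actor_input).encode("utf-8"),  # the request body is the Actor input
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. The official `apify-client` package wraps the same endpoint and also handles waiting for the run to finish and reading the resulting dataset.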

License

ISC
