RAG Pipeline

Pricing

from $5.00 / 1,000 results

One-click RAG pipeline: chunks text, generates embeddings, and stores vectors in Pinecone or Qdrant. Provide your content and API keys -- the orchestrator handles the rest.

Developer

mick_

Maintained by Community

Last modified

3 months ago

What It Does

This actor runs three sub-actors in sequence to build a complete RAG (Retrieval-Augmented Generation) pipeline. Feed it your content and it handles chunking, embedding, and vector storage automatically. It returns a single pipeline summary item that downstream orchestration steps or AI agents (via MCP) can consume.

Your content
-> RAG Content Chunker (chunk by paragraphs, sentences, or Markdown headers)
-> RAG Embedding Generator (OpenAI or Cohere embeddings)
-> RAG Vector Store Writer (upsert to Pinecone or Qdrant)

You provide your content, API keys, and vector DB config. The pipeline handles dataset handoff between steps automatically.
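The sequential handoff described above can be pictured with a small sketch. This is not the orchestrator's actual code: the `run_actor` callback, the `run_steps` helper, and the use of `source_dataset_id` as the handoff key between steps are assumptions made for illustration, and a real run would also pass API keys and vector DB settings to the relevant steps. Only the three actor IDs come from this page.

```python
from typing import Callable, Dict, List, Tuple

# Actor IDs as they appear in this page's output summary.
STEPS: List[Tuple[str, str]] = [
    ("chunker", "labrat011/rag-content-chunker"),
    ("embedder", "labrat011/rag-embedding-generator"),
    ("writer", "labrat011/rag-vector-store-writer"),
]

# Hypothetical runner signature: (actor_id, run_input) -> output dataset ID.
Runner = Callable[[str, Dict], str]


def run_steps(first_input: Dict, run_actor: Runner) -> Dict[str, str]:
    """Run the steps in order, feeding each step's output dataset to the next."""
    datasets: Dict[str, str] = {}
    step_input = dict(first_input)
    for name, actor_id in STEPS:
        dataset_id = run_actor(actor_id, step_input)
        datasets[name] = dataset_id
        # Assumption: the next step reads its input from the previous dataset.
        step_input = {"source_dataset_id": dataset_id}
    return datasets
```

Passing the runner in as a callback keeps the sketch testable without network access; a stub that returns made-up dataset IDs is enough to check the ordering.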

🆕 New Feature: Bulk File Upload

Already have a document? Upload it directly to Apify Storage and run the full pipeline against it — no crawling, no copy-pasting.

How to upload a file:

1. Go to Apify Console → Storage → Key-Value Stores.
2. Click + Create new store and give it a name.
3. Click + Add record and upload your .txt, .md, or .pdf file.
4. Find your file and click the 🔗 icon to copy the direct URL. Make sure the URL starts with api.apify.com, not console.apify.com: the console URL is the web page, not the file.
5. Paste the URL into the file_url field along with your API keys and run.
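The api.apify.com versus console.apify.com distinction is easy to check before starting a run. The sketch below is a minimal illustration; the helper names `record_url` and `is_direct_file_url` are hypothetical, and the URL layout is simply the pattern used in this page's examples.

```python
from urllib.parse import quote, urlparse

API_HOST = "api.apify.com"


def record_url(store_id: str, record_key: str) -> str:
    """Build a direct record URL following the pattern in this page's examples."""
    return (
        f"https://{API_HOST}/v2/key-value-stores/"
        f"{quote(store_id, safe='')}/records/{quote(record_key, safe='')}"
    )


def is_direct_file_url(url: str) -> bool:
    """True if the URL is HTTPS and points at the API host, not the Console page."""
    parts = urlparse(url)
    return parts.scheme == "https" and parts.hostname == API_HOST
```

This check only applies to files kept in Apify Storage; other public HTTPS hosts would need their own rule.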

Example — full pipeline from a file:

{
"file_url":"https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md",
"chunking_strategy":"markdown",
"chunk_size":512,
"chunk_overlap":64,
"embedding_api_key":"sk-...",
"embedding_provider":"openai",
"embedding_model":"text-embedding-3-small",
"vector_db_api_key":"your-qdrant-key",
"vector_db_provider":"qdrant",
"index_name":"my-rag-collection",
"qdrant_url":"https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"
}

Or skip storage entirely — paste text or Markdown directly into the text field:

{
"text":"# My Document\n\nThis is my content...",
"chunking_strategy":"markdown",
"chunk_size":512,
"embedding_api_key":"sk-...",
 ...
}

Supported file formats: .txt, .md, .markdown, .html, .pdf
Max file size: 5 MB
URL requirements: Must be a public HTTPS URL. Apify Storage, S3, Dropbox (shared public link), or GitHub raw URLs all work.

PDF note: Text-based PDFs are supported. Scanned/image-only PDFs have no text layer and will fail — convert them with OCR first. Office documents (.docx, .xlsx) are not yet supported.
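A pre-flight check along these lines can catch unsupported inputs before a paid run starts. This is a sketch under stated assumptions: it judges the format by the URL's file extension only (the actor may detect types differently), it reads "5 MB" as 5 × 1024 × 1024 bytes, and the function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List
from urllib.parse import urlparse

SUPPORTED_EXTENSIONS = {".txt", ".md", ".markdown", ".html", ".pdf"}
MAX_FILE_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB


def check_file_url(url: str) -> List[str]:
    """Return a list of problems found; an empty list means the URL looks usable."""
    problems: List[str] = []
    parts = urlparse(url)
    if parts.scheme != "https":
        problems.append("URL must use HTTPS")
    ext = PurePosixPath(parts.path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {ext or '(none)'}")
    return problems


def check_file_size(num_bytes: int) -> bool:
    """True if a non-empty file fits within the assumed size limit."""
    return 0 < num_bytes <= MAX_FILE_BYTES
```

Neither check can tell a text-based PDF from a scanned one; that still has to be verified by opening the file.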

Input

Content Source (choose one)

Option A — Direct text:

{"text":"# My Document\n\nContent..."}

Option B — Single file URL (Apify Storage):

{
"file_url":"https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc.txt",
"chunking_strategy":"markdown",
"chunk_size":512
}

Option C — Multiple file URLs, bulk (Apify Storage):

{
  "file_urls":[
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc1.txt",
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc2.md"
  ]
}

Option D — Dataset from crawler:

{"source_dataset_id":"your-crawler-dataset-id","source_dataset_field":"markdown"}

Priority order when multiple sources are provided: source_dataset_id > file_urls > file_url > text
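The priority order can be expressed as a short lookup. This sketch assumes that empty strings and empty lists count as "not provided", which the page does not state; `pick_source` is a hypothetical helper name.

```python
from typing import Dict, Optional

# Highest priority first, as listed above.
SOURCE_PRIORITY = ("source_dataset_id", "file_urls", "file_url", "text")


def pick_source(run_input: Dict) -> Optional[str]:
    """Return the name of the content source that would win, or None if none is set."""
    for key in SOURCE_PRIORITY:
        if run_input.get(key):  # assumption: empty values are treated as absent
            return key
    return None
```

For example, an input containing both `text` and `file_url` resolves to `file_url`, so the pasted text would be ignored.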

Parameter	Required	Default	Description
`text`	One of `text`, `file_url`, `file_urls`, or `source_dataset_id`	-	Plain text or Markdown to process
`file_url`	One of `text`, `file_url`, `file_urls`, or `source_dataset_id`	-	HTTPS URL to a single file in Apify Storage
`file_urls`	One of `text`, `file_url`, `file_urls`, or `source_dataset_id`	-	List of HTTPS URLs to files in Apify Storage (max 20, 10 MB per file). Contents are fetched and concatenated before chunking.
`source_dataset_id`	One of `text`, `file_url`, `file_urls`, or `source_dataset_id`	-	Apify dataset ID from a crawler
`source_dataset_field`	No	`text`	Field to read from source dataset items
`chunking_strategy`	No	`recursive`	`recursive`, `markdown`, or `sentence`
`chunk_size`	No	`512`	Target chunk size in tokens (64-8192)
`chunk_overlap`	No	`64`	Overlap between chunks in tokens (0-2048)
`embedding_api_key`	Yes	-	OpenAI or Cohere API key
`embedding_provider`	No	`openai`	`openai` or `cohere`
`embedding_model`	No	`text-embedding-3-small`	Embedding model name
`embedding_batch_size`	No	`128`	Texts per API request
`vector_db_api_key`	Yes	-	Pinecone or Qdrant API key
`vector_db_provider`	No	`pinecone`	`pinecone` or `qdrant`
`index_name`	Yes	-	Index (Pinecone) or collection (Qdrant) name
`qdrant_url`	If Qdrant	-	Qdrant Cloud cluster URL
`pinecone_namespace`	No	`""`	Pinecone namespace
`qdrant_distance_metric`	No	`Cosine`	`Cosine`, `Dot`, or `Euclid`
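The defaults and ranges in the table can be checked locally before submitting a run. The following is a minimal sketch that mirrors the table, not the actor's own validation logic; `with_defaults` and `validate` are hypothetical helper names, and the error messages are invented for illustration.

```python
from typing import Dict, List

# Defaults copied from the parameter table.
DEFAULTS = {
    "source_dataset_field": "text",
    "chunking_strategy": "recursive",
    "chunk_size": 512,
    "chunk_overlap": 64,
    "embedding_provider": "openai",
    "embedding_model": "text-embedding-3-small",
    "embedding_batch_size": 128,
    "vector_db_provider": "pinecone",
    "pinecone_namespace": "",
    "qdrant_distance_metric": "Cosine",
}
SOURCES = ("text", "file_url", "file_urls", "source_dataset_id")
REQUIRED = ("embedding_api_key", "vector_db_api_key", "index_name")


def with_defaults(run_input: Dict) -> Dict:
    """Fill in table defaults for any parameter the caller left out."""
    return {**DEFAULTS, **run_input}


def validate(run_input: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    cfg = with_defaults(run_input)
    errors: List[str] = []
    if not any(cfg.get(key) for key in SOURCES):
        errors.append("provide one of: " + ", ".join(SOURCES))
    for key in REQUIRED:
        if not cfg.get(key):
            errors.append(f"missing required field: {key}")
    if cfg["chunking_strategy"] not in ("recursive", "markdown", "sentence"):
        errors.append("chunking_strategy must be recursive, markdown, or sentence")
    if not 64 <= cfg["chunk_size"] <= 8192:
        errors.append("chunk_size must be between 64 and 8192")
    if not 0 <= cfg["chunk_overlap"] <= 2048:
        errors.append("chunk_overlap must be between 0 and 2048")
    if cfg["embedding_provider"] not in ("openai", "cohere"):
        errors.append("embedding_provider must be openai or cohere")
    if cfg["vector_db_provider"] not in ("pinecone", "qdrant"):
        errors.append("vector_db_provider must be pinecone or qdrant")
    if cfg["vector_db_provider"] == "qdrant" and not cfg.get("qdrant_url"):
        errors.append("qdrant_url is required when vector_db_provider is qdrant")
    return errors
```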

Output

A single summary item in the default dataset:

{
  "_summary":true,
  "pipeline":{
    "total_duration_seconds":12.345,
    "steps":{
      "chunker":{"actor":"labrat011/rag-content-chunker","status":"SUCCEEDED","duration_seconds":3.2},
      "embedder":{"actor":"labrat011/rag-embedding-generator","status":"SUCCEEDED","duration_seconds":5.1},
      "writer":{"actor":"labrat011/rag-vector-store-writer","status":"SUCCEEDED","duration_seconds":4.0}
    }
  },
  "result":{
    "total_upserted":42,
    "vector_db_provider":"qdrant",
    "index_name":"my-collection"
  }
}
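A consumer of this summary will usually want to confirm that every step succeeded before trusting `total_upserted`. The sketch below assumes that any status other than `SUCCEEDED` should be treated as a failure; the page only shows the success case, and both helper names are hypothetical.

```python
from typing import Dict, List


def find_summary(items: List[Dict]) -> Dict:
    """Return the first dataset item flagged as the pipeline summary."""
    for item in items:
        if item.get("_summary") is True:
            return item
    raise ValueError("no summary item found in dataset")


def failed_steps(summary: Dict) -> List[str]:
    """List step names whose status is anything other than SUCCEEDED."""
    steps = summary.get("pipeline", {}).get("steps", {})
    return [name for name, info in steps.items() if info.get("status") != "SUCCEEDED"]
```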

Pricing

The orchestrator charges $0.005 per pipeline run ($5.00 per 1,000 runs). Sub-actors charge separately:

Actor	Rate
RAG Content Chunker	$0.0005/chunk
RAG Embedding Generator	$0.0003/embedding
RAG Vector Store Writer	$0.0004/vector

You also pay the embedding provider (OpenAI/Cohere) and vector DB provider (Pinecone/Qdrant) at their standard rates.
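To make the rates concrete, here is a small cost estimate for the Apify-side charges. It assumes one embedding and one stored vector per chunk, which the page implies but does not state, and it excludes embedding provider and vector DB fees. `estimate_apify_cost` is a hypothetical helper name.

```python
from decimal import Decimal

# Rates from the pricing table above (USD).
ORCHESTRATOR_PER_RUN = Decimal("0.005")
CHUNKER_PER_CHUNK = Decimal("0.0005")
EMBEDDER_PER_EMBEDDING = Decimal("0.0003")
WRITER_PER_VECTOR = Decimal("0.0004")


def estimate_apify_cost(num_chunks: int) -> Decimal:
    """Estimate Apify charges for one run, assuming one embedding and vector per chunk."""
    per_chunk = CHUNKER_PER_CHUNK + EMBEDDER_PER_EMBEDDING + WRITER_PER_VECTOR
    return ORCHESTRATOR_PER_RUN + per_chunk * num_chunks


print(estimate_apify_cost(42))  # 0.0554
```

Under these assumptions a run that produces 42 chunks costs about $0.055 on the Apify side: $0.005 for the orchestrator plus 42 × $0.0012 for the three sub-actors.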

Example: Quick Start with Qdrant

Option A — direct text:

{
"text":"Your document content goes here...",
"chunking_strategy":"recursive",
"chunk_size":512,
"embedding_api_key":"sk-...",
"embedding_provider":"openai",
"embedding_model":"text-embedding-3-small",
"vector_db_api_key":"your-qdrant-key",
"vector_db_provider":"qdrant",
"index_name":"my-rag-collection",
"qdrant_url":"https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"
}

Option B — file upload (Apify Storage, S3, or any public HTTPS URL):

{
"file_url":"https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md",
"chunking_strategy":"markdown",
"chunk_size":512,
"embedding_api_key":"sk-...",
"embedding_provider":"openai",
"embedding_model":"text-embedding-3-small",
"vector_db_api_key":"your-qdrant-key",
"vector_db_provider":"qdrant",
"index_name":"my-rag-collection",
"qdrant_url":"https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"
}
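The same quick-start input can be submitted programmatically. The sketch below assumes the third-party `apify-client` Python package is installed and that `call()` returns a dict with a `defaultDatasetId` key, as in the 1.x client; verify both against the version you install. The helper names are hypothetical, and only the actor ID and input fields come from this page.

```python
from typing import Dict, List

ACTOR_ID = "labrat011/rag-pipeline"


def build_run_input(text: str, embedding_api_key: str, vector_db_api_key: str,
                    index_name: str, qdrant_url: str) -> Dict:
    """Assemble the Option A quick-start input shown above."""
    return {
        "text": text,
        "chunking_strategy": "recursive",
        "chunk_size": 512,
        "embedding_api_key": embedding_api_key,
        "embedding_provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "vector_db_api_key": vector_db_api_key,
        "vector_db_provider": "qdrant",
        "index_name": index_name,
        "qdrant_url": qdrant_url,
    }


def run_rag_pipeline(apify_token: str, run_input: Dict) -> List[Dict]:
    """Start the actor, wait for it to finish, and return the default dataset items."""
    # Imported here so the input builder works without the package installed.
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(apify_token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return client.dataset(run["defaultDatasetId"]).list_items().items
```

Reading the keys from environment variables rather than hard-coding them keeps secrets out of source control.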

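Sub-Actors

RAG Content Chunker (labrat011/rag-content-chunker)
RAG Embedding Generator (labrat011/rag-embedding-generator)
RAG Vector Store Writer (labrat011/rag-vector-store-writer)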

Security

API keys are validated for presence only and never logged
Qdrant URLs are validated against the cloud.qdrant.io pattern (SSRF prevention)
All string inputs are sanitized against control characters
Dataset IDs and field names are validated with strict regex patterns
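Two of the checks above can be illustrated with a short sketch. The regular expressions here are assumptions, not the actor's real patterns: the host rule accepts any HTTPS host ending in `.cloud.qdrant.io`, and the sanitizer strips ASCII control characters while keeping tab, newline, and carriage return.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for illustration; the actor's actual rules are not published here.
_QDRANT_HOST = re.compile(r"[a-z0-9.-]+\.cloud\.qdrant\.io")
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def is_allowed_qdrant_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host sits under cloud.qdrant.io."""
    try:
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False
    return parts.scheme == "https" and _QDRANT_HOST.fullmatch(host) is not None


def strip_control_chars(value: str) -> str:
    """Remove ASCII control characters, keeping tab, newline, and carriage return."""
    return _CONTROL_CHARS.sub("", value)
```

Matching on the parsed hostname rather than the raw string is what blocks lookalikes such as `cloud.qdrant.io.evil.com` or a `cloud.qdrant.io` fragment hidden in the query string.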

License

MIT

MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

Endpoint: https://mcp.apify.com?tools=labrat011/rag-pipeline
Auth: Authorization: Bearer <APIFY_TOKEN>
Transport: Streamable HTTP
Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers":{
    "rag-pipeline":{
      "url":"https://mcp.apify.com?tools=labrat011/rag-pipeline",
      "headers":{
        "Authorization":"Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to ingest text into a vector database, build RAG knowledge bases, and set up retrieval-augmented generation pipelines -- all as a single callable MCP tool.

URL: https://apify.com/labrat011/rag-pipeline
