AI Data Pipeline — Crawl, Chunk & Export to Vector DB

Pricing

from $4.99 / 1,000 results

Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.

Developer: Ozapp (maintained by Community)

Last modified: 3 months ago

AI Data Pipeline — Website to Vector DB (No-Code)

Crawl any website, clean and chunk content for RAG/LLM applications, score quality, detect language, classify content type, and optionally export directly to Pinecone or Qdrant. Zero coding required.

Pipeline Stages

URL --> Crawl --> Clean HTML --> Chunk Text --> Score Quality --> Export

Crawl — Spider pages within the same domain using Playwright (JS-rendered content supported)
Clean — Strip boilerplate (nav, footer, scripts, ads), convert code blocks to markdown fences, extract page title
Chunk — Split into semantic chunks by headings/paragraphs with configurable overlap. Code blocks are protected (never split mid-block)
Score — Rate each chunk 0-1 based on word count, structure, lexical diversity, code ratio, boilerplate detection
Classify — Detect language (7 languages) and content type (documentation, article, blog, product, FAQ)
Export — Push to Apify dataset (JSON) and/or Pinecone/Qdrant vector databases
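The Chunk stage can be pictured with a short sketch. This is not the actor's implementation: it assumes whitespace-separated words as a stand-in for tokens, treats markdown fences as the code-block boundary, and carries overlap as whole trailing blocks so that a code block is never cut. `split_blocks`, `count_tokens`, and `chunk_blocks` are hypothetical helper names, and heading-aware splitting is left out for brevity.

```python
FENCE = "`" * 3  # markdown code fence marker


def split_blocks(text):
    """Split text into paragraph blocks, keeping fenced code blocks whole."""
    blocks, current, in_code = [], [], False
    for line in text.splitlines():
        is_fence = line.strip().startswith(FENCE)
        if is_fence and not in_code:
            if current:  # flush the paragraph that precedes the code block
                blocks.append("\n".join(current))
                current = []
            in_code = True
            current.append(line)
        elif is_fence and in_code:
            current.append(line)  # closing fence ends the code block
            blocks.append("\n".join(current))
            current = []
            in_code = False
        elif in_code or line.strip():
            current.append(line)
        elif current:  # blank line outside code ends a paragraph
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def count_tokens(block):
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(block.split())


def chunk_blocks(blocks, chunk_size=500, overlap=50):
    """Greedily pack whole blocks into chunks of roughly chunk_size tokens."""
    chunks, current = [], []
    for block in blocks:
        size = sum(count_tokens(b) for b in current)
        if current and size + count_tokens(block) > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry over trailing whole blocks that fit in the overlap budget.
            carried, budget = [], overlap
            for prev in reversed(current):
                if count_tokens(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= count_tokens(prev)
            current = carried
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the unit of packing is a whole block, a single block larger than `chunk_size` becomes an oversized chunk instead of being split. That is the trade-off this sketch makes to keep code intact.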

Input

| Field | Type | Description | Default |
|---|---|---|---|
| `startUrls` | Array | URLs to start crawling (use the URL editor) | Required |
| `maxPages` | Number | Maximum pages to crawl (1-10000) | `50` |
| `chunkSize` | Number | Target tokens per chunk (100-4000) | `500` |
| `chunkOverlap` | Number | Overlap tokens between chunks (0-500) | `50` |
| `minQualityScore` | Number | Minimum quality score to include a chunk (0-1) | `0.3` |
| `exportTo` | String | Export target: `json`, `pinecone`, `qdrant` | `"json"` |
| `pineconeApiKey` | String | Pinecone API key (secret) | None |
| `pineconeIndexName` | String | Pinecone index name | None |
| `pineconeHost` | String | Pinecone index host URL | None |
| `qdrantUrl` | String | Qdrant instance URL | None |
| `qdrantApiKey` | String | Qdrant API key (secret) | None |
| `qdrantCollectionName` | String | Qdrant collection name | None |
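If you build the input programmatically, it can help to check values against the documented ranges before starting a run. The sketch below is a client-side convenience only. `build_input` is a hypothetical helper, and the actor may validate its input differently.

```python
# Ranges and defaults copied from the input table above.
LIMITS = {
    "maxPages": (1, 10000),
    "chunkSize": (100, 4000),
    "chunkOverlap": (0, 500),
    "minQualityScore": (0, 1),
}
DEFAULTS = {
    "maxPages": 50,
    "chunkSize": 500,
    "chunkOverlap": 50,
    "minQualityScore": 0.3,
    "exportTo": "json",
}
EXPORT_TARGETS = {"json", "pinecone", "qdrant"}


def build_input(start_urls, **overrides):
    """Merge overrides into the defaults and reject out-of-range values."""
    if not start_urls:
        raise ValueError("startUrls is required")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        **DEFAULTS,
        **overrides,
    }
    for field, (low, high) in LIMITS.items():
        if not low <= run_input[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    if run_input["exportTo"] not in EXPORT_TARGETS:
        raise ValueError(f"exportTo must be one of {sorted(EXPORT_TARGETS)}")
    return run_input
```

For example, `build_input(["https://docs.example.com"], maxPages=100)` returns a dictionary with the remaining fields set to their documented defaults.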

Example Input — JSON Dataset

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "chunkOverlap": 50,
  "minQualityScore": 0.3
}

Example Input — Export to Pinecone

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "exportTo": "pinecone",
  "pineconeApiKey": "your-api-key",
  "pineconeHost": "https://your-index.svc.pinecone.io",
  "pineconeIndexName": "docs"
}

Output Per Chunk

{
  "url": "https://docs.apify.com/academy/getting-started",
  "title": "Getting started | Academy | Apify Documentation",
  "sourceTitle": "Getting started",
  "chunk": "Getting started | Academy | Apify Documentation...",
  "chunkIndex": 0,
  "totalChunks": 1,
  "qualityScore": 0.95,
  "tokenCount": 258,
  "language": "en",
  "contentType": "documentation",
  "summary": "Getting started | Academy | Apify Documentation...",
  "metadata": {
    "headings": ["Getting started", "Getting to know the platform", "Next up"],
    "linkCount": 86,
    "imageCount": 3,
    "wordCount": 185
  },
  "lastProcessed": "2026-03-06T14:40:18.420Z"
}
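The `url`, `chunkIndex`, and `totalChunks` fields make it possible to put chunks back in page order after download. The sketch below assumes `chunkIndex` is zero-based, which matches the sample above (`chunkIndex` 0 with `totalChunks` 1). `group_by_page` and `missing_chunks` are hypothetical helpers, and whether `totalChunks` counts chunks before or after quality filtering is not stated here, so treat gaps as a hint and not as an error.

```python
from collections import defaultdict


def group_by_page(items):
    """Group chunk records by source URL, ordered by chunkIndex."""
    pages = defaultdict(list)
    for item in items:
        pages[item["url"]].append(item)
    for chunks in pages.values():
        chunks.sort(key=lambda chunk: chunk["chunkIndex"])
    return dict(pages)


def missing_chunks(chunks):
    """Return the chunk indexes of one page that are absent from the list."""
    expected = set(range(chunks[0]["totalChunks"]))
    present = {chunk["chunkIndex"] for chunk in chunks}
    return sorted(expected - present)
```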

Pipeline Summary

At the end of each run, the actor logs a summary:

=== Pipeline Summary ===
Total pages attempted: 2
Pages with content: 2
Total chunks created: 4
Chunks filtered (minScore): 0
Avg quality score: 0.87 (min: 0.80, max: 0.95)
Avg token count: 357
Languages detected: en
Content types: documentation
Export format: json
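Similar figures can be recomputed from the dataset items if you want them outside the run log. This is a rough sketch using the per-chunk field names documented above. `summarize` is a hypothetical helper, and page-level counts such as pages attempted cannot be derived from chunk records alone.

```python
def summarize(items):
    """Aggregate chunk records into summary statistics."""
    if not items:
        return {"chunks": 0}
    scores = [item["qualityScore"] for item in items]
    tokens = [item["tokenCount"] for item in items]
    return {
        "chunks": len(items),
        "avgScore": round(sum(scores) / len(scores), 2),
        "minScore": min(scores),
        "maxScore": max(scores),
        "avgTokens": round(sum(tokens) / len(tokens)),
        "languages": sorted({item["language"] for item in items}),
        "contentTypes": sorted({item["contentType"] for item in items}),
    }
```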

Use Cases

RAG Chatbots — Prepare knowledge bases for retrieval-augmented generation
Documentation Search — Index docs sites for semantic search
Knowledge Management — Convert websites into structured, searchable chunks
Content Analysis — Score and filter web content by quality
AI Fine-tuning — Prepare clean training data from websites

Quality Scoring

Each chunk is scored 0-1 based on:

Word count — Penalizes very short or very long chunks
Sentence structure — Proper sentences score higher
Heading presence — Structured content scores higher
Lexical diversity — Varied vocabulary scores higher
Code ratio — Moderate code is rewarded, excessive code is penalized
Boilerplate detection — Cookie notices, "all rights reserved", raw HTML tags lower the score
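To make the idea concrete, here is a toy heuristic that combines several of these signals. It is an illustration only: the weights, the 50-800 word window, and the phrase list are assumptions chosen for the example, not the actor's formula, and the code-ratio signal is omitted.

```python
import re

# Assumed boilerplate phrases, taken from the examples in the list above.
BOILERPLATE = ("cookie", "all rights reserved")


def quality_score(chunk):
    """Return an illustrative 0-1 score for a text chunk."""
    words = re.findall(r"[A-Za-z']+", chunk.lower())
    if not words:
        return 0.0
    score = 0.0
    # Word count: full credit inside an assumed 50-800 word window.
    score += 0.3 if 50 <= len(words) <= 800 else 0.1
    # Sentence structure: at least one sentence-ending punctuation mark.
    score += 0.2 if re.search(r"[.!?](\s|$)", chunk) else 0.0
    # Heading presence: a markdown heading line.
    score += 0.1 if re.search(r"^#{1,6} ", chunk, re.M) else 0.0
    # Lexical diversity: unique words divided by total words.
    score += 0.4 * (len(set(words)) / len(words))
    # Boilerplate phrases and raw HTML tags lower the score.
    has_phrase = any(phrase in chunk.lower() for phrase in BOILERPLATE)
    has_html = re.search(r"</?\w+[^>]*>", chunk) is not None
    if has_phrase or has_html:
        score -= 0.3
    return round(max(0.0, min(1.0, score)), 2)
```

With a scorer of this shape, `minQualityScore` acts as a simple threshold: chunks scoring below it are dropped before export.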

Language Detection

Supports 7 languages: English, French, German, Spanish, Dutch, Portuguese, Italian. Detection uses keyword matching with 20 marker words per language.
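Keyword matching of this kind can be sketched in a few lines. The marker lists below are short illustrative samples (five words per language, not the actor's 20), and returning `None` when nothing matches is an assumption. `detect_language` is a hypothetical helper.

```python
import re

# Illustrative marker words only; a real list would be longer and tuned.
MARKERS = {
    "en": {"the", "and", "is", "of", "with"},
    "fr": {"le", "la", "et", "est", "avec"},
    "de": {"der", "die", "und", "ist", "mit"},
    "es": {"el", "los", "y", "es", "con"},
    "nl": {"het", "een", "en", "is", "met"},
    "pt": {"o", "os", "e", "é", "com"},
    "it": {"il", "gli", "e", "è", "con"},
}


def detect_language(text):
    """Return the language whose marker words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    counts = {
        lang: sum(1 for word in words if word in markers)
        for lang, markers in MARKERS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Closely related languages share many short function words, so very short chunks can be misclassified by this approach.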

Content Type Classification

Automatically classifies each chunk as: documentation, article, blog, product, faq, or other.

Notes

Uses same-domain crawling strategy (won't follow external links)
Playwright-based for JavaScript-rendered content support
Code blocks are protected during chunking — never split mid-block
Vector DB export sends metadata alongside text (no embeddings — use your own model)
Chunks respect heading boundaries to maintain semantic coherence
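Since no embeddings are generated, one option is to take the JSON output and attach vectors from your own model before loading a vector database. The sketch below builds records in the `id` / `values` / `metadata` shape that Pinecone upserts use (Qdrant uses `id` / `vector` / `payload` instead). `toy_embed` is a placeholder with no semantic meaning and must be replaced by a real embedding model. `to_record` and the choice of metadata fields are assumptions for illustration.

```python
import hashlib


def toy_embed(text, dim=8):
    """Placeholder embedding (NOT semantically meaningful): swap in a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def to_record(item, embed=toy_embed):
    """Turn one chunk record into an upsert-ready vector record."""
    # Stable ID derived from the page URL and chunk position, so re-runs
    # overwrite the same record instead of creating duplicates.
    key = f'{item["url"]}#{item["chunkIndex"]}'
    return {
        "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
        "values": embed(item["chunk"]),
        "metadata": {
            "url": item["url"],
            "title": item["title"],
            "text": item["chunk"],
            "language": item["language"],
            "contentType": item["contentType"],
            "qualityScore": item["qualityScore"],
        },
    }
```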

API Integration

JavaScript

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('ozapp/ai-data-pipeline').call({
    startUrls: [{ url: 'https://docs.example.com' }],
    maxPages: 100,
    chunkSize: 500,
    minQualityScore: 0.3,
});

// Fetch the chunks from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} chunks ready for your vector DB`);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("ozapp/ai-data-pipeline").call(run_input={
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 100,
    "chunkSize": 500,
    "minQualityScore": 0.3,
})

# Fetch the chunks from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"{len(items)} chunks ready for your vector DB")

cURL

curl "https://api.apify.com/v2/acts/ozapp~ai-data-pipeline/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"startUrls":[{"url":"https://docs.example.com"}],"maxPages":100,"chunkSize":500}'

Pricing

$4.99 per 1,000 chunks — includes crawling, cleaning, scoring, and classification.
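For budgeting, the listed rate translates into a simple estimate. The sketch assumes strictly pro-rata billing at $4.99 per 1,000 chunks. The store listing says "from $4.99", so the actual charge may differ, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(chunks, price_per_1000=4.99):
    """Estimate the charge in USD for a number of output chunks."""
    return round(chunks * price_per_1000 / 1000, 2)
```

Under these assumptions, a run that produces 10,000 chunks would come to about $49.90.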

URL: https://apify.com/ozapp/ai-data-pipeline
