URL: https://apify.com/ozapp/ai-data-pipeline

AI Data Pipeline - Crawl, Chunk & Export to Pinecone/Qdrant · Apify


๐Ÿ‘ AI Data Pipeline โ€” Crawl, Chunk & Export to Vector DB avatar

AI Data Pipeline - Crawl, Chunk & Export to Vector DB

Pricing

from $4.99 / 1,000 results

Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.

Rating

0.0

(0)

Developer

๐Ÿ‘ Ozapp

Ozapp

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 months ago

AI Data Pipeline - Website to Vector DB (No-Code)

Crawl any website, clean and chunk content for RAG/LLM applications, score quality, detect language, classify content type, and optionally export directly to Pinecone or Qdrant. Zero coding required.

Pipeline Stages

URL --> Crawl --> Clean HTML --> Chunk Text --> Score Quality --> Export
  1. Crawl: Spider pages within the same domain using Playwright (JS-rendered content supported)
  2. Clean: Strip boilerplate (nav, footer, scripts, ads), convert code blocks to markdown fences, extract page title
  3. Chunk: Split into semantic chunks by headings/paragraphs with configurable overlap. Code blocks are protected (never split mid-block)
  4. Score: Rate each chunk 0-1 based on word count, structure, lexical diversity, code ratio, boilerplate detection
  5. Classify: Detect language (7 languages) and content type (documentation, article, blog, product, FAQ)
  6. Export: Push to Apify dataset (JSON) and/or Pinecone/Qdrant vector databases
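
The Chunk stage can be pictured with a minimal sketch like the one below. It is not the actor's implementation: the helper names (`split_blocks`, `count_tokens`, `chunk_text`) are hypothetical, whitespace-separated words stand in for real token counting, and the overlap is taken only from the last paragraph of the previous chunk. It shows the two documented behaviors: overlap between consecutive chunks, and fenced code blocks that are never split.

```python
FENCE = "`" * 3  # markdown code fence marker


def split_blocks(text: str) -> list[str]:
    """Split text on blank lines, keeping each fenced code block whole."""
    blocks, current, in_code = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
            current.append(line)
            continue
        if not line.strip() and not in_code:
            if current:
                blocks.append("\n".join(current))
                current = []
            continue
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def count_tokens(text: str) -> int:
    # Assumption: word count as a crude stand-in for a real tokenizer.
    return len(text.split())


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Pack blocks into chunks of roughly chunk_size tokens with overlap."""
    chunks, current, size = [], [], 0
    for block in split_blocks(text):
        n = count_tokens(block)
        if current and size + n > chunk_size:
            chunks.append("\n\n".join(current))
            tail = current[-1]
            if overlap > 0 and FENCE not in tail:
                # Carry the last words of the previous chunk forward.
                words = tail.split()[-overlap:]
                current, size = [" ".join(words)], len(words)
            else:
                # Never copy a partial code block into the next chunk.
                current, size = [], 0
        current.append(block)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A block larger than `chunk_size` (for example a long code listing) becomes its own oversized chunk rather than being cut.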

Input

| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | URLs to start crawling (use the URL editor) | Required |
| maxPages | Number | Maximum pages to crawl (1-10000) | 50 |
| chunkSize | Number | Target tokens per chunk (100-4000) | 500 |
| chunkOverlap | Number | Overlap tokens between chunks (0-500) | 50 |
| minQualityScore | Number | Minimum quality score to include a chunk (0-1) | 0.3 |
| exportTo | String | Export target: json, pinecone, qdrant | "json" |
| pineconeApiKey | String | Pinecone API key (secret) | None |
| pineconeIndexName | String | Pinecone index name | None |
| pineconeHost | String | Pinecone index host URL | None |
| qdrantUrl | String | Qdrant instance URL | None |
| qdrantApiKey | String | Qdrant API key (secret) | None |
| qdrantCollectionName | String | Qdrant collection name | None |
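
Before starting a run it can help to check an input object against the documented defaults and ranges. The sketch below is a hypothetical client-side helper, not part of the actor. The defaults and ranges come from the table above; the list of fields required per export target (`TARGET_FIELDS`) is an assumption, since the listing does not say which Pinecone or Qdrant fields are mandatory.

```python
# Defaults and ranges as documented in the input table.
DEFAULTS = {
    "maxPages": 50,
    "chunkSize": 500,
    "chunkOverlap": 50,
    "minQualityScore": 0.3,
    "exportTo": "json",
}
RANGES = {
    "maxPages": (1, 10000),
    "chunkSize": (100, 4000),
    "chunkOverlap": (0, 500),
    "minQualityScore": (0, 1),
}
# Assumption: which credentials each export target needs.
TARGET_FIELDS = {
    "json": [],
    "pinecone": ["pineconeApiKey", "pineconeIndexName", "pineconeHost"],
    "qdrant": ["qdrantUrl", "qdrantCollectionName"],
}


def validate_input(cfg: dict) -> dict:
    """Return cfg merged with defaults, or raise ValueError on bad values."""
    if not cfg.get("startUrls"):
        raise ValueError("startUrls is required")
    merged = {**DEFAULTS, **cfg}
    for field, (low, high) in RANGES.items():
        if not low <= merged[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    target = merged["exportTo"]
    if target not in TARGET_FIELDS:
        raise ValueError(f"exportTo must be one of {sorted(TARGET_FIELDS)}")
    missing = [f for f in TARGET_FIELDS[target] if not merged.get(f)]
    if missing:
        raise ValueError(f"{target} export is missing: {', '.join(missing)}")
    return merged
```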

Example Input - JSON Dataset

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "chunkSize": 500,
  "chunkOverlap": 50,
  "minQualityScore": 0.3
}

Example Input - Export to Pinecone

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "chunkSize": 500,
  "exportTo": "pinecone",
  "pineconeApiKey": "your-api-key",
  "pineconeHost": "https://your-index.svc.pinecone.io",
  "pineconeIndexName": "docs"
}

Output Per Chunk

{
  "url": "https://docs.apify.com/academy/getting-started",
  "title": "Getting started | Academy | Apify Documentation",
  "sourceTitle": "Getting started",
  "chunk": "Getting started | Academy | Apify Documentation...",
  "chunkIndex": 0,
  "totalChunks": 1,
  "qualityScore": 0.95,
  "tokenCount": 258,
  "language": "en",
  "contentType": "documentation",
  "summary": "Getting started | Academy | Apify Documentation...",
  "metadata": {
    "headings": ["Getting started", "Getting to know the platform", "Next up"],
    "linkCount": 86,
    "imageCount": 3,
    "wordCount": 185
  },
  "lastProcessed": "2026-03-06T14:40:18.420Z"
}
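
Once the dataset items are downloaded, the per-chunk fields can be used for a second filtering pass on the client side. The sketch below is illustrative only: `select_chunks` and `group_by_url` are hypothetical helpers, the 0.8 threshold is an arbitrary example value, and the sample records are invented to match the field names shown above.

```python
from collections import defaultdict


def select_chunks(items, min_score=0.8, language="en"):
    """Keep chunks at or above min_score in one language, in page order."""
    kept = [
        it for it in items
        if it.get("qualityScore", 0.0) >= min_score
        and it.get("language") == language
    ]
    return sorted(kept, key=lambda it: (it["url"], it["chunkIndex"]))


def group_by_url(items):
    """Collect chunk texts per source page, preserving the given order."""
    pages = defaultdict(list)
    for it in items:
        pages[it["url"]].append(it["chunk"])
    return dict(pages)


# Invented sample records using the documented field names.
sample = [
    {"url": "https://docs.example.com/a", "chunkIndex": 1,
     "chunk": "Second part.", "qualityScore": 0.9, "language": "en"},
    {"url": "https://docs.example.com/a", "chunkIndex": 0,
     "chunk": "First part.", "qualityScore": 0.95, "language": "en"},
    {"url": "https://docs.example.com/b", "chunkIndex": 0,
     "chunk": "Cookie notice.", "qualityScore": 0.35, "language": "en"},
]

pages = group_by_url(select_chunks(sample))
```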

Pipeline Summary

At the end of each run, the actor logs a summary:

=== Pipeline Summary ===
Total pages attempted: 2
Pages with content: 2
Total chunks created: 4
Chunks filtered (minScore): 0
Avg quality score: 0.87 (min: 0.80, max: 0.95)
Avg token count: 357
Languages detected: en
Content types: documentation
Export format: json
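
Similar figures can be recomputed from the exported dataset, which is useful when comparing runs. This is a rough sketch with a hypothetical `summarize` helper. It only sees chunks that passed the quality filter, so it cannot reproduce the "pages attempted" or "chunks filtered" counts from the log.

```python
from statistics import mean


def summarize(items):
    """Aggregate quality, token, language and content-type stats for chunks."""
    if not items:
        return {"chunks": 0}
    scores = [it["qualityScore"] for it in items]
    return {
        "chunks": len(items),
        "pages": len({it["url"] for it in items}),
        "avg_quality": round(mean(scores), 2),
        "min_quality": min(scores),
        "max_quality": max(scores),
        "avg_tokens": round(mean(it["tokenCount"] for it in items)),
        "languages": sorted({it["language"] for it in items}),
        "content_types": sorted({it["contentType"] for it in items}),
    }
```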

Use Cases

  • RAG Chatbots: Prepare knowledge bases for retrieval-augmented generation
  • Documentation Search: Index docs sites for semantic search
  • Knowledge Management: Convert websites into structured, searchable chunks
  • Content Analysis: Score and filter web content by quality
  • AI Fine-tuning: Prepare clean training data from websites

Quality Scoring

Each chunk is scored 0-1 based on:

  • Word count: Penalizes very short or very long chunks
  • Sentence structure: Proper sentences score higher
  • Heading presence: Structured content scores higher
  • Lexical diversity: Varied vocabulary scores higher
  • Code ratio: Moderate code is rewarded, excessive code is penalized
  • Boilerplate detection: Cookie notices, "all rights reserved", raw HTML tags lower the score
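
To make the idea concrete, here is a minimal sketch of a heuristic scorer that combines some of these signals. It is not the actor's formula: the weights, thresholds, and the `quality_score` name are all invented for illustration, and heading presence and code ratio are left out to keep it short.

```python
import re

# Assumption: a tiny marker list; the actor's real list is not documented.
BOILERPLATE_MARKERS = ("cookie", "all rights reserved")


def quality_score(chunk: str) -> float:
    """Return an illustrative 0-1 score from length, structure and diversity."""
    lowered = chunk.lower()
    words = re.findall(r"[a-z']+", lowered)
    if not words:
        return 0.0
    n = len(words)
    # Word count: full credit inside an arbitrary "reasonable" range.
    length = 1.0 if 50 <= n <= 800 else 0.4
    # Sentence structure: count sentence-ending punctuation, capped at 3.
    sentences = len(re.findall(r"[.!?](?:\s|$)", chunk))
    structure = min(sentences / 3, 1.0)
    # Lexical diversity: share of distinct words.
    diversity = len(set(words)) / n
    # Boilerplate: penalize known phrases and raw HTML tags.
    penalty = 0.3 * sum(marker in lowered for marker in BOILERPLATE_MARKERS)
    if re.search(r"</?[a-z]+[^>]*>", lowered):
        penalty += 0.2
    score = 0.4 * length + 0.3 * structure + 0.3 * diversity - penalty
    return round(max(0.0, min(1.0, score)), 2)
```

A chunk scoring below `minQualityScore` would then be dropped before export.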

Language Detection

Supports 7 languages: English, French, German, Spanish, Dutch, Portuguese, Italian. Detection uses keyword matching with 20 marker words per language.
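
Keyword-based detection of this kind can be sketched in a few lines. The example below is an approximation under stated assumptions: it uses six marker words per language instead of 20, the word lists are my own picks rather than the actor's, and the fallback to English when nothing matches is a guess.

```python
import re

# Assumption: illustrative marker words, not the actor's actual lists.
MARKERS = {
    "en": {"the", "and", "is", "of", "to", "with"},
    "fr": {"le", "la", "les", "et", "est", "des"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "es": {"el", "los", "y", "es", "una", "con"},
    "nl": {"het", "een", "van", "en", "niet", "zijn"},
    "pt": {"o", "os", "e", "uma", "não", "com"},
    "it": {"il", "gli", "che", "è", "di", "non"},
}


def detect_language(text: str, default: str = "en") -> str:
    """Pick the language whose marker words occur most often in text."""
    # Match runs of Unicode letters so accented words stay intact.
    words = re.findall(r"[^\W\d_]+", text.lower())
    best_lang, best_hits = default, 0
    for lang, markers in MARKERS.items():
        hits = sum(word in markers for word in words)
        if hits > best_hits:  # ties keep the earlier language
            best_lang, best_hits = lang, hits
    return best_lang
```

Short function words overlap between related languages, so this approach is only reliable on chunks with enough running text.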

Content Type Classification

Automatically classifies each chunk as: documentation, article, blog, product, faq, or other.

Notes

  • Uses same-domain crawling strategy (won't follow external links)
  • Playwright-based for JavaScript-rendered content support
  • Code blocks are protected during chunking: never split mid-block
  • Vector DB export sends metadata alongside text (no embeddings; use your own model)
  • Chunks respect heading boundaries to maintain semantic coherence
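
Because no embeddings are generated, a typical next step is to embed each chunk yourself and build vector records. The sketch below assumes the `id` / `values` / `metadata` record shape used by Pinecone upserts. `toy_embed` is a placeholder that only produces deterministic numbers from a hash; it is not a real embedding model and should be swapped for your own. The ID scheme (hash of URL plus chunk index) is also an assumption.

```python
import hashlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding: hash bytes scaled to 0-1. NOT a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def to_vector_records(items, embed=toy_embed):
    """Turn chunk items into id/values/metadata records for a vector DB."""
    records = []
    for it in items:
        # Stable ID so re-running the pipeline overwrites instead of duplicating.
        key = f"{it['url']}#{it['chunkIndex']}"
        records.append({
            "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
            "values": embed(it["chunk"]),
            "metadata": {
                "url": it["url"],
                "title": it.get("title", ""),
                "text": it["chunk"],
                "qualityScore": it.get("qualityScore"),
                "language": it.get("language"),
            },
        })
    return records
```

For Qdrant the same data maps to `id`, `vector` and `payload`, and point IDs need to be unsigned integers or UUIDs, so the hex digest above would have to be converted.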

API Integration

JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('ozapp/ai-data-pipeline').call({
    startUrls: [{ url: 'https://docs.example.com' }],
    maxPages: 100,
    chunkSize: 500,
    minQualityScore: 0.3,
});

// Read the chunks from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} chunks ready for your vector DB`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("ozapp/ai-data-pipeline").call(run_input={
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 100,
    "chunkSize": 500,
    "minQualityScore": 0.3,
})

# Read the chunks from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"{len(items)} chunks ready for your vector DB")

cURL

curl "https://api.apify.com/v2/acts/ozapp~ai-data-pipeline/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"startUrls":[{"url":"https://docs.example.com"}],"maxPages":100,"chunkSize":500}'

Pricing

$4.99 per 1,000 chunks, including crawling, cleaning, scoring, and classification.
