URL: https://apify.com/omarchydev/ai-training-data-curator

LLM Training Data Crawler & Curator · Apify


Pricing

from $0.01 / 1,000 results

Ai Training Data Curator

Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.

Rating

0.0

(0)

Developer

πŸ‘ Omarchy Dev

Omarchy Dev

Maintained by Community

Actor stats

0

Bookmarked

7

Total users

0

Monthly active users

4 months ago

Last modified

Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website OR process your own documents with automatic quality scoring, deduplication, and format conversion.

Features

  • Smart Content Extraction: Automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
  • Bring Your Own Data (BYOD): Process your own text documents without crawling - perfect for existing datasets
  • Quality Scoring: Scores each document based on vocabulary diversity, sentence structure, and content density
  • Deduplication: Uses MinHash/Jaccard similarity to remove near-duplicate content
  • Flexible Crawling: Single page, same domain, same subdomain, or follow all links
  • Document Chunking: Split long documents into training-ready chunks with configurable overlap
  • Multiple Output Formats: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format
  • Language Filtering: Filter content by language (ISO 639-1 codes)
  • Privacy Features: Optionally remove emails and URLs from extracted text

Use Cases

  • LLM Fine-tuning: Collect domain-specific training data for fine-tuning language models
  • RAG Systems: Build high-quality document collections for retrieval-augmented generation
  • Knowledge Bases: Create clean text corpora from documentation sites
  • Research: Gather datasets from academic or technical resources
  • Data Cleaning: Clean and deduplicate existing text datasets for ML training

Input Configuration

Mode Selection

The actor supports two modes. Provide either start_urls (to crawl) or documents (to process your own data, BYOD):

Field       Type   Default  Description
start_urls  array  -        URLs to start crawling from (Crawl mode)
documents   array  -        Your own documents to process (BYOD mode)
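
The mode choice above can be checked before a run is started. The following is a minimal sketch, not the actor's own validation: the helper name detect_mode is hypothetical, and rejecting an input that sets both fields is an assumption, since the README only says to provide one or the other.

```python
from typing import Any


def detect_mode(run_input: dict[str, Any]) -> str:
    """Return "crawl" or "byod" depending on which input field is populated.

    Assumption: supplying both fields (or neither) is treated as an error.
    """
    has_urls = bool(run_input.get("start_urls"))
    has_docs = bool(run_input.get("documents"))
    if has_urls and has_docs:
        raise ValueError("Provide either start_urls or documents, not both")
    if not has_urls and not has_docs:
        raise ValueError("Provide start_urls (crawl) or documents (BYOD)")
    return "crawl" if has_urls else "byod"


if __name__ == "__main__":
    print(detect_mode({"start_urls": [{"url": "https://example.com/blog/"}]}))
    print(detect_mode({"documents": ["Some text to clean..."]}))
```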

BYOD (Bring Your Own Data) Settings

Field               Type     Default  Description
documents           array    -        Array of text strings or objects with text field
byod_text_field     string   text     Field name containing text in document objects
max_byod_documents  integer  500      Maximum documents to process (hard limit)
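
Because documents may mix plain strings with objects, it can help to picture how such input is flattened into uniform records. The sketch below is illustrative only: normalize_documents is a hypothetical helper, and truncating at the limit (rather than failing the run) is an assumption about how the hard limit behaves.

```python
from typing import Any


def normalize_documents(
    documents: list[Any],
    text_field: str = "text",
    limit: int = 500,
) -> list[dict[str, Any]]:
    """Turn a mixed list of strings and objects into records with a "text" key."""
    records: list[dict[str, Any]] = []
    for item in documents[:limit]:  # assumption: extra documents are dropped
        if isinstance(item, str):
            record: dict[str, Any] = {"text": item}
        elif isinstance(item, dict) and isinstance(item.get(text_field), str):
            record = dict(item)  # copy so the caller's object is not mutated
            record["text"] = record.pop(text_field)
        else:
            continue  # skip items without usable text
        if record["text"].strip():
            records.append(record)
    return records


if __name__ == "__main__":
    docs = ["Plain text document", {"body": "Object document", "source_id": "doc_001"}]
    print(normalize_documents(docs, text_field="body"))
```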

Crawl Settings

Field       Type     Default      Description
start_urls  array    -            URLs to start crawling from
crawl_mode  string   same_domain  single_page, same_domain, same_subdomain, or all_links
max_pages   integer  100          Maximum pages to crawl
max_depth   integer  3            Maximum link depth from start URLs
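
One plausible reading of the four crawl_mode values is sketched below. This is an assumption, not documented behaviour: same_subdomain is taken to mean "exact same hostname", same_domain to mean "same site, any subdomain", and the site is approximated by the last two hostname labels, which is wrong for suffixes such as co.uk. The function names are hypothetical.

```python
from urllib.parse import urlparse


def naive_site(host: str) -> str:
    """Last two labels of a hostname; ignores multi-part suffixes like co.uk."""
    return ".".join(host.split(".")[-2:])


def in_scope(start_url: str, link: str, crawl_mode: str) -> bool:
    """Decide whether a discovered link should be followed (assumed semantics)."""
    start_host = urlparse(start_url).hostname or ""
    link_host = urlparse(link).hostname or ""
    if crawl_mode == "single_page":
        return False  # only the start URL itself is fetched
    if crawl_mode == "same_subdomain":
        return link_host == start_host
    if crawl_mode == "same_domain":
        return naive_site(link_host) == naive_site(start_host)
    if crawl_mode == "all_links":
        return True
    raise ValueError(f"Unknown crawl_mode: {crawl_mode}")


if __name__ == "__main__":
    start = "https://docs.python.org/3/tutorial/"
    for mode in ("single_page", "same_subdomain", "same_domain", "all_links"):
        print(mode, in_scope(start, "https://www.python.org/downloads/", mode))
```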

Content Extraction

Field              Type     Default                                  Description
content_selectors  array    ["article", "main", ".content"]          CSS selectors for main content
exclude_selectors  array    ["nav", "header", "footer", ".sidebar"]  CSS selectors to exclude
min_word_count     integer  100                                      Minimum words per document
max_word_count     integer  50000                                    Maximum words per document

Quality & Deduplication

Field              Type     Default  Description
deduplicate        boolean  true     Remove duplicate/near-duplicate content
dedup_threshold    number   0.85     Similarity threshold (0.5-1.0)
quality_filter     boolean  true     Filter low-quality content
min_quality_score  number   0.5      Minimum quality score (0.0-1.0)
language_filter    array    ["en"]   Languages to include (ISO codes)
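
To give a feel for what dedup_threshold controls, here is a small sketch of near-duplicate removal with Jaccard similarity over word shingles. It computes the exact Jaccard value for clarity; MinHash is a way to approximate that same value cheaply on large corpora. The shingle size of 3 words and the greedy "keep the first, drop later matches" policy are assumptions, and all function names are hypothetical.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word windows from the lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a text only if it is below the threshold against every kept text."""
    kept: list[tuple[str, set]] = []
    for text in texts:
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((text, sig))
    return [text for text, _ in kept]


if __name__ == "__main__":
    base = " ".join(f"word{i}" for i in range(40))
    print(len(deduplicate([base, base + " extra", "an unrelated short document"])))
```

A lower threshold removes more aggressively (looser matches count as duplicates); a threshold of 1.0 in this sketch only removes texts whose shingle sets are identical.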

Output Settings

Field             Type     Default  Description
output_format     string   jsonl    jsonl, json, parquet, csv, or huggingface
text_field_name   string   text     Name of the text field in output
include_metadata  boolean  true     Include URL, title, date metadata
include_raw_html  boolean  false    Also save original HTML

Chunking

Field            Type     Default  Description
chunk_documents  boolean  false    Split documents into chunks
chunk_size       integer  512      Target chunk size in tokens
chunk_overlap    integer  64       Overlap between chunks
Text Cleaning

Field                 Type     Default  Description
clean_html            boolean  true     Remove HTML tags
normalize_whitespace  boolean  true     Collapse multiple spaces/newlines
remove_urls           boolean  false    Strip embedded URLs
remove_emails         boolean  true     Strip email addresses
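
The four cleaning switches can be approximated with a few regular expressions. This is a rough sketch using the defaults from the table, not the actor's actual rules: the patterns are deliberately simple assumptions (for example, the tag pattern is not a real HTML parser), and this version collapses every whitespace run, including newlines, to a single space.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # naive HTML tag matcher
URL_RE = re.compile(r"https?://\S+")     # http(s) URLs up to the next space
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_text(
    text: str,
    clean_html: bool = True,
    normalize_whitespace: bool = True,
    remove_urls: bool = False,
    remove_emails: bool = True,
) -> str:
    """Apply the enabled cleaning steps in a fixed order."""
    if clean_html:
        text = TAG_RE.sub(" ", text)
    if remove_urls:
        text = URL_RE.sub("", text)
    if remove_emails:
        text = EMAIL_RE.sub("", text)
    if normalize_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text


if __name__ == "__main__":
    print(clean_text("<p>Mail  me at a.b@example.com\n now</p>"))
```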

Performance

Field               Type     Default  Description
use_proxies         boolean  false    Use residential proxies
max_concurrency     integer  10       Parallel requests
request_delay_ms    integer  500      Delay between requests
respect_robots_txt  boolean  true     Follow robots.txt rules
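
To preview which URLs a crawl with respect_robots_txt enabled would likely skip, a site's robots.txt can be evaluated locally with Python's standard urllib.robotparser. This is only a sketch: build_robots_checker is a hypothetical helper, and the user agent the actor actually sends is not documented, so "*" is assumed.

```python
from typing import Callable
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*") -> Callable[[str], bool]:
    """Parse robots.txt content and return a function that tests single URLs."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    allowed = build_robots_checker(rules)
    print(allowed("https://example.com/blog/post"))
    print(allowed("https://example.com/private/report"))
```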

Output Format

Each document in the output contains:

{
  "text": "The cleaned document text content...",
  "doc_id": "abc123def456",
  "source_url": "https://example.com/page",
  "word_count": 1523,
  "quality_score": 0.847,
  "language": "en",
  "title": "Page Title",
  "description": "Meta description",
  "content_type": "documentation",
  "scraped_at": "2024-01-15T10:30:00Z"
}

If chunking is enabled, additional fields are included:

{
  "chunk_index": 0,
  "total_chunks": 5,
  "parent_doc_id": "abc123def456"
}
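
When consuming chunked JSONL output, the three chunk fields are enough to put the pieces of each source document back in order. The sketch below assumes every chunk record also carries the text field under its default name; group_chunks is a hypothetical helper. The chunk texts are returned as ordered lists rather than joined, because overlapping chunks repeat some content.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_chunks(jsonl_lines: Iterable[str]) -> dict[str, list[str]]:
    """Map each parent_doc_id to its chunk texts, sorted by chunk_index."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        groups[record["parent_doc_id"]].append(record)
    return {
        parent: [r["text"] for r in sorted(records, key=lambda r: r["chunk_index"])]
        for parent, records in groups.items()
    }


if __name__ == "__main__":
    lines = [
        '{"text": "second part", "chunk_index": 1, "total_chunks": 2, "parent_doc_id": "abc123def456"}',
        '{"text": "first part", "chunk_index": 0, "total_chunks": 2, "parent_doc_id": "abc123def456"}',
    ]
    print(group_chunks(lines))
```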

Quality Metrics

The quality scorer evaluates documents based on:

  • Word count: Penalizes very short documents
  • Sentence length: Flags very short (fragments) or very long sentences
  • Vocabulary diversity: Ratio of unique words to total words
  • Boilerplate ratio: Detection of common web boilerplate patterns
  • Character composition: Penalizes excessive uppercase, digits, or special characters

Documents with scores below min_quality_score are automatically filtered out.
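
The README does not give the scoring formula, so the following is a purely illustrative sketch of how the five signals above could be combined into a value between 0.0 and 1.0. Every threshold, the equal weighting, and the boilerplate phrase list are assumptions made up for this example.

```python
import re

# Hypothetical boilerplate phrases; the real pattern list is not documented.
BOILERPLATE = ("cookie", "subscribe", "all rights reserved", "click here")


def quality_score(text: str) -> float:
    """Average of five heuristic signals, each in the range 0.0 to 1.0."""
    words = text.split()
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)

    length_score = min(len(words) / 100, 1.0)  # penalize very short documents
    sentence_score = 1.0 if 5 <= avg_sentence_len <= 40 else 0.5
    diversity_score = len({w.lower() for w in words}) / len(words)
    hits = sum(text.lower().count(phrase) for phrase in BOILERPLATE)
    boilerplate_score = max(0.0, 1.0 - 0.25 * hits)
    noisy = sum(
        1 for c in text
        if c.isupper() or c.isdigit() or not (c.isalnum() or c.isspace())
    )
    composition_score = 1.0 - noisy / len(text)

    signals = (length_score, sentence_score, diversity_score,
               boilerplate_score, composition_score)
    return round(sum(signals) / len(signals), 3)


if __name__ == "__main__":
    print(quality_score("The crawler extracts the main article body from each page."))
    print(quality_score("CLICK HERE!!! SUBSCRIBE NOW!!! 12345 $$$"))
```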

Example Input

Crawl Python Documentation

{
  "start_urls": [
    {"url": "https://docs.python.org/3/tutorial/"}
  ],
  "crawl_mode": "same_subdomain",
  "max_pages": 500,
  "content_selectors": [".document", ".body"],
  "exclude_selectors": [".sphinxsidebar", ".related", "footer"],
  "output_format": "jsonl",
  "chunk_documents": true,
  "chunk_size": 1024
}

Build Knowledge Base from Blog

{
  "start_urls": [
    {"url": "https://example.com/blog/"}
  ],
  "crawl_mode": "same_domain",
  "max_pages": 100,
  "content_selectors": ["article", ".post-content"],
  "quality_filter": true,
  "min_quality_score": 0.6,
  "deduplicate": true,
  "output_format": "parquet"
}

BYOD: Process Your Own Documents

{
  "documents": [
    "This is a plain text document that will be processed...",
    {
      "text": "This document has metadata attached to it...",
      "source_id": "doc_001",
      "metadata": {
        "title": "My Document",
        "author": "John Doe",
        "language": "en"
      }
    }
  ],
  "deduplicate": true,
  "quality_filter": true,
  "min_quality_score": 0.5,
  "output_format": "jsonl"
}

BYOD: Clean Existing Dataset

{
  "documents": [
    {"text": "First document from your dataset..."},
    {"text": "Second document from your dataset..."},
    {"text": "Third document from your dataset..."}
  ],
  "byod_text_field": "text",
  "deduplicate": true,
  "dedup_threshold": 0.85,
  "chunk_documents": true,
  "chunk_size": 512,
  "output_format": "jsonl"
}

Tips for Best Results

  1. Use specific content selectors: Precise CSS selectors for your target site give cleaner extraction
  2. Set appropriate word counts: Use min_word_count to filter out navigation pages and indexes
  3. Enable deduplication: Prevents training on repetitive content (common on content farms)
  4. Adjust quality threshold: Lower it for technical content, raise it for prose
  5. Use chunking for long documents: Choose a chunk_size that fits your model's context window
  6. Start small: Test with max_pages: 20 before large crawls

Pricing

  • $0.01 per document - charged for each cleaned document (both crawled and BYOD)

Additional costs:

  • Proxy: ~$0.001-0.005 per request (if enabled)
  • Storage: ~$0.0001 per document
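
As a back-of-the-envelope check, the figures in this section can be combined into a simple estimate. The sketch below uses the per-document and storage rates listed above; the default proxy rate of $0.003 per request is an assumed midpoint of the quoted range, and estimate_cost is a hypothetical helper, not part of the actor.

```python
def estimate_cost(documents: int, proxy_requests: int = 0, proxy_rate: float = 0.003) -> float:
    """Rough run cost in USD from the rates listed in the Pricing section."""
    per_document = 0.01            # charged per cleaned document
    storage_per_document = 0.0001  # approximate storage cost per document
    total = documents * (per_document + storage_per_document)
    total += proxy_requests * proxy_rate  # only relevant when use_proxies is on
    return round(total, 4)


if __name__ == "__main__":
    # 1,000 documents without proxies, then with one proxied request each
    print(estimate_cost(1000))
    print(estimate_cost(1000, proxy_requests=1000))
```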
