URL: https://apify.com/fiery_dream/ai-training-data-enricher

Pricing

from $0.01 / 1,000 results

Ai Training Data Enricher

Production-grade data enrichment and validation for LLM training datasets. Automatically clean, enrich, deduplicate, and validate your AI training data before fine-tuning.

Rating: 0.0 (0)

Developer: Cody Churchwell (Maintained by Community)

Actor stats: 1 bookmarked, 2 total users, 0 monthly active users

Last modified: 7 months ago

🤖 AI Training Data Enricher & Validator

🎯 Why This Actor?

Training high-quality LLMs requires clean, diverse, and well-structured data. Poor data quality leads to:

  • Overfitting from duplicates
  • Privacy violations from undetected PII
  • Biased models from unbalanced sentiment
  • Poor performance from low-quality text
  • GDPR non-compliance from personal data

This Actor solves all these problems in one automated pipeline.

✨ Key Features

🔍 Enrichment

  • Sentiment Analysis - AFINN lexicon-based scoring with positive/negative word extraction
  • Named Entity Recognition - Extract people, places, organizations, dates, and values
  • Keyword Extraction - TF-IDF weighted keyword extraction for topic modeling
  • Language Detection - Multi-language support with confidence scoring
  • Readability Metrics - Word count, sentence analysis, complexity scoring

✅ Validation

  • Duplicate Detection - Fuzzy string matching with configurable similarity thresholds (0.5-1.0)
  • PII Detection - GDPR-compliant detection of emails, phones, SSNs, credit cards
  • Schema Validation - JSON Schema validation with detailed error reporting
  • Length Filtering - Min/max character limits with configurable thresholds
  • Quality Flags - Flag-only mode to preserve all data with validation metadata

🔒 Privacy & Compliance

  • PII Redaction - Automatic placeholder replacement (e.g., [EMAIL_REDACTED]) for detected sensitive data
  • GDPR Ready - Identifies all personal data for compliance workflows
  • Audit Trail - Complete validation history for regulatory reporting

📊 Use Cases

| Use Case | Configuration |
| --- | --- |
| LLM Fine-Tuning | Enable all enrichment, strict duplicate detection (0.95), remove PII |
| Sentiment Dataset | Sentiment analysis, keyword extraction, balanced sampling |
| GDPR Compliance | PII detection, flag-only mode, audit logging |
| Quality Filtering | Min length 50 chars, readability metrics, schema validation |
| Deduplication | Duplicate detection at 0.85 threshold, remove invalid items |

🚀 Quick Start

1. Prepare Your Dataset

Your input dataset should contain items with at least a text field:

{
  "text": "This is my training sample",
  "label": "positive"
}

2. Configure the Actor

{
  "datasetId": "your-dataset-id",
  "textField": "text",
  "enrichmentOptions": {
    "sentiment": true,
    "entities": true,
    "keywords": true,
    "language": true,
    "readability": true
  },
  "validationOptions": {
    "detectDuplicates": true,
    "duplicateSimilarityThreshold": 0.85,
    "detectPII": true,
    "minTextLength": 10,
    "maxTextLength": 0
  },
  "outputOptions": {
    "includeOriginal": true,
    "flagOnly": false,
    "removePII": false
  }
}

3. Run and Export

The Actor outputs an enriched dataset with this structure:

{
  "id": 0,
  "originalText": "Apple Inc. released iPhone in 2007. Great product!",
  "enrichment": {
    "sentiment": {
      "score": 3,
      "comparative": 0.375,
      "positive": ["great"],
      "negative": []
    },
    "entities": {
      "people": [],
      "places": [],
      "organizations": ["Apple Inc."],
      "dates": ["2007"],
      "values": []
    },
    "keywords": ["apple", "iphone", "released", "product"],
    "language": "english",
    "readability": {
      "wordCount": 8,
      "sentenceCount": 2,
      "avgWordsPerSentence": 4.0,
      "avgWordLength": 5.1
    }
  },
  "validation": {
    "isValid": true,
    "isDuplicate": false,
    "hasPII": false,
    "lengthValid": true,
    "schemaValid": true
  }
}
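
For readers who prefer to start runs from code, the following is a minimal sketch of how the configuration above might be submitted with the official `apify-client` Python package. The helper names `build_run_input` and `run_enricher` are hypothetical, the Actor ID is assumed from this page's Store URL, and the run is assumed to finish successfully.

```python
from typing import Any, Dict, List


def build_run_input(dataset_id: str, text_field: str = "text") -> Dict[str, Any]:
    """Assemble the Actor input shown in the configuration example."""
    return {
        "datasetId": dataset_id,
        "textField": text_field,
        "enrichmentOptions": {
            "sentiment": True, "entities": True, "keywords": True,
            "language": True, "readability": True,
        },
        "validationOptions": {
            "detectDuplicates": True,
            "duplicateSimilarityThreshold": 0.85,
            "detectPII": True,
            "minTextLength": 10,
            "maxTextLength": 0,  # 0 disables the upper bound
        },
        "outputOptions": {"includeOriginal": True, "flagOnly": False, "removePII": False},
    }


def run_enricher(token: str, dataset_id: str) -> List[Dict[str, Any]]:
    """Start a run and collect the enriched items (needs `pip install apify-client`)."""
    # Imported lazily so build_run_input() works without the dependency installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # Actor ID assumed from the Store URL; adjust if the Actor is published elsewhere.
    run = client.actor("fiery_dream/ai-training-data-enricher").call(
        run_input=build_run_input(dataset_id)
    )
    # Assumes the run succeeded and exposes a default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_enricher("<APIFY_TOKEN>", "your-dataset-id")` would then return a list of items shaped like the output example above.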

🔧 Configuration Reference

Enrichment Options

sentiment (boolean, default: true)

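Adds sentiment analysis using the AFINN-111 lexicon, in which each listed word carries a valence from -5 (very negative) to +5 (very positive). The score field is the sum of the valences of the matched words; comparative divides that sum by the number of words.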

Technical Details:

  • Uses Porter Stemmer for word normalization
  • Comparative score normalizes by text length
  • Extracts individual positive and negative words for interpretability

entities (boolean, default: true)

Named Entity Recognition using Compromise.js natural language processing.

Extracted Entity Types:

  • People - Person names (e.g., "Steve Jobs")
  • Places - Locations, cities, countries (e.g., "California")
  • Organizations - Companies, institutions (e.g., "Apple Inc.")
  • Dates - Temporal expressions (e.g., "January 2024", "next week")
  • Values - Numbers, measurements (e.g., "$100", "5 kilometers")

keywords (boolean, default: true)

TF-IDF (Term Frequency-Inverse Document Frequency) weighted keyword extraction.

Algorithm:

  1. Tokenizes text into words
  2. Calculates term frequency within document
  3. Calculates inverse document frequency across corpus
  4. Returns top 10 highest-scoring terms
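
The four steps above can be sketched in plain Python. This is an illustrative version under stated assumptions: `top_keywords` is a hypothetical helper, and the smoothed IDF formula `1 + ln(N / df)` is one common variant, not necessarily the one the Actor's NLP library uses.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(doc_index: int, corpus: List[str], k: int = 10) -> List[str]:
    docs = [tokenize(d) for d in corpus]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs for term in set(tokens))
    # Term frequency within the target document.
    tf = Counter(docs[doc_index])
    n_docs = len(docs)
    # Smoothed IDF (assumed variant): terms found in every document keep weight 1.
    scores = {term: count * (1.0 + math.log(n_docs / df[term]))
              for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Terms that occur in most documents (such as "the") receive a low IDF and drop to the bottom of the ranking, which is why stopwords rarely show up among the extracted keywords.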

Best For: Topic modeling, search indexing, feature engineering

language (boolean, default: true)

Simple language detection using stopword analysis.

Supported Languages: English, Spanish, French, German, Portuguese

Note: For production multilingual detection, consider integrating with franc or fastText language identification models.
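
Stopword-based detection can be approximated by counting how many tokens fall into each language's stopword list. The sketch below assumes tiny hypothetical word lists and a simple share-of-hits confidence; `STOPWORDS` and `detect_language` are illustrative names, not the Actor's API.

```python
import re
from typing import Dict, Set

# Small hypothetical stopword samples; real lists per language are much longer.
STOPWORDS: Dict[str, Set[str]] = {
    "english": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "spanish": {"el", "la", "los", "y", "es", "de", "que", "en"},
    "french": {"le", "la", "les", "et", "est", "de", "que", "dans"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "mit", "ein"},
    "portuguese": {"o", "os", "e", "é", "de", "que", "em", "não"},
}


def detect_language(text: str) -> Dict[str, object]:
    # Keep letter-only tokens (Unicode aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        return {"language": "unknown", "confidence": 0.0}
    # Confidence: share of all stopword hits that belong to the winning language.
    return {"language": best, "confidence": hits[best] / sum(hits.values())}
```

Short texts and closely related languages share many stopwords ("de", "que", "la"), so confidence drops quickly in those cases, which is consistent with the note above about using a dedicated model for production work.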

readability (boolean, default: true)

Text complexity metrics for quality assessment.

Metrics:

  • Word Count - Total words (tokenized)
  • Sentence Count - Sentences split by .!?
  • Avg Words/Sentence - Indicates complexity (15-20 is ideal for general content)
  • Avg Word Length - Character count per word (3-5 typical for English)
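
A minimal sketch of how these four metrics could be computed, assuming a simple regex tokenizer and sentence splitting on `.`, `!`, and `?` as described above. The function name `readability` is illustrative.

```python
import re
from typing import Dict


def readability(text: str) -> Dict[str, float]:
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Split on runs of . ! ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "wordCount": word_count,
        "sentenceCount": sentence_count,
        "avgWordsPerSentence": word_count / sentence_count if sentence_count else 0.0,
        "avgWordLength": sum(len(w) for w in words) / word_count if word_count else 0.0,
    }
```

Note that naive punctuation splitting treats abbreviations such as "Inc." as sentence boundaries, so sentence counts from a sketch like this can be higher than a human reader would report.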

Validation Options

detectDuplicates (boolean, default: true)

Uses FuzzySet.js for approximate string matching to catch near-duplicates.

How It Works:

  1. Builds n-gram index of all texts
  2. For each text, finds closest matches
  3. Compares similarity scores against threshold
  4. Flags items above threshold as duplicates

Performance: O(n) per item after an O(n) index build, so checking a full dataset of n items is O(n^2) in the worst case

Threshold Guidance:

  • 0.95-1.0 - Very strict, catches only near-exact duplicates
  • 0.85-0.94 - Balanced (recommended), catches paraphrases
  • 0.70-0.84 - Loose, may flag similar but distinct content
  • 0.50-0.69 - Very loose, not recommended
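
To make the threshold concrete, here is a minimal sketch of n-gram similarity. It uses plain cosine similarity over character trigrams, which is a simplification and not FuzzySet.js's exact scoring; `trigrams`, `similarity`, and `flag_duplicates` are hypothetical names.

```python
import math
from collections import Counter
from typing import List


def trigrams(text: str) -> Counter:
    # Pad with a marker so the first and last characters get their own grams.
    padded = f"-{text.lower().strip()}-"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a: str, b: str) -> float:
    ga, gb = trigrams(a), trigrams(b)
    dot = sum(count * gb[gram] for gram, count in ga.items())
    norm = math.sqrt(sum(c * c for c in ga.values()) * sum(c * c for c in gb.values()))
    return dot / norm if norm else 0.0


def flag_duplicates(texts: List[str], threshold: float = 0.85) -> List[bool]:
    flags: List[bool] = []
    kept: List[str] = []
    for text in texts:
        # Compare against every item kept so far (linear work per item).
        is_dup = any(similarity(text, seen) >= threshold for seen in kept)
        flags.append(is_dup)
        if not is_dup:
            kept.append(text)
    return flags
```

In this sketch, two sentences that differ only in their final punctuation mark score well above 0.85 and are flagged, while unrelated sentences share few trigrams and pass through.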

duplicateSimilarityThreshold (number, 0.5-1.0, default: 0.85)

Controls duplicate detection strictness. See above for guidance.

detectPII (boolean, default: true)

GDPR-compliant detection of Personally Identifiable Information (PII).

Detected PII Types:

  • Email - Regex: [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
  • Phone - Regex: (\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} (US-style 10-digit numbers with an optional country code prefix)
  • SSN - Regex: \d{3}-\d{2}-\d{4} (US Social Security Numbers)
  • Credit Card - Regex: \d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4} (16-digit cards in 4-4-4-4 grouping; 15-digit numbers such as American Express do not match this pattern)

Privacy Note: Regex patterns provide high recall but may have false positives. For production GDPR compliance, consider integrating with Microsoft Presidio or AWS Comprehend PII detection.
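
The patterns listed above can be exercised with Python's `re` module. This is a sketch only: `PII_PATTERNS` and `detect_pii` are hypothetical names, and the email pattern is written with `[A-Za-z]` as the top-level-domain class.

```python
import re
from typing import Dict, List

PII_PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"(\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card": re.compile(r"\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}"),
}


def detect_pii(text: str) -> Dict[str, List[str]]:
    """Return every match per PII type; an empty dict means nothing was found."""
    found: Dict[str, List[str]] = {}
    for kind, pattern in PII_PATTERNS.items():
        # group(0) is the whole match, regardless of inner capture groups.
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            found[kind] = matches
    return found
```

Because the patterns are purely syntactic, any 10-digit or 16-digit run can trigger a match (order numbers, tracking IDs), which is where the false positives mentioned above come from.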

minTextLength / maxTextLength (integer, default: 10 / 0)

Filters texts by character count. Set maxTextLength to 0 to disable max length check.

Recommended Values:

  • Tweets/Short Form: min=10, max=280
  • General Training: min=50, max=5000
  • Long Form: min=500, max=50000

Schema Validation

Provide a JSON Schema object to validate the structure of your data:

{
  "schemaValidation": {
    "type": "object",
    "required": ["text", "label"],
    "properties": {
      "text": {"type": "string", "minLength": 10},
      "label": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    }
  }
}

Uses Zod for runtime validation with detailed error messages.
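
To show what the schema above accepts and rejects, here is a hand-rolled sketch that checks only the keywords used in that example (`type`, `required`, `properties`, `minLength`, `enum`). It is an illustration, not the Actor's Zod-based validator, and `validate` is a hypothetical helper.

```python
from typing import Any, Dict, List

TYPE_MAP = {"string": str, "object": dict, "number": (int, float), "boolean": bool}


def validate(item: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of error messages; an empty list means the item is valid."""
    errors: List[str] = []
    py_type = TYPE_MAP.get(schema.get("type", ""))
    if py_type is not None and not isinstance(item, py_type):
        return [f"{path}: expected {schema['type']}"]
    if isinstance(item, dict):
        for key in schema.get("required", []):
            if key not in item:
                errors.append(f"{path}.{key}: required field is missing")
        # Recurse into declared properties that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in item:
                errors.extend(validate(item[key], sub_schema, f"{path}.{key}"))
    if isinstance(item, str) and len(item) < schema.get("minLength", 0):
        errors.append(f"{path}: shorter than minLength {schema['minLength']}")
    if "enum" in schema and item not in schema["enum"]:
        errors.append(f"{path}: value not in enum {schema['enum']}")
    return errors
```

An item such as `{"text": "short", "label": "happy"}` produces two errors here, one for the text length and one for the label value, while the Quick Start sample item passes.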

Output Options

includeOriginal (boolean, default: true)

Preserves all original fields from input items in output. Disable to reduce output size.

flagOnly (boolean, default: false)

When enabled, invalid items are included in output but marked with validation flags. Use for audit workflows where you need to review rejected data.
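
The difference between filtering and flagging can be sketched as follows, using the length check as the only validation rule. `length_valid` and `apply_output_mode` are hypothetical helpers; the real pipeline combines more checks.

```python
from typing import Any, Dict, List


def length_valid(text: str, min_len: int = 10, max_len: int = 0) -> bool:
    """A max_len of 0 disables the upper bound, matching the maxTextLength convention."""
    return len(text) >= min_len and (max_len == 0 or len(text) <= max_len)


def apply_output_mode(items: List[Dict[str, Any]], flag_only: bool = False) -> List[Dict[str, Any]]:
    results = []
    for item in items:
        is_valid = length_valid(item["text"])
        # Default mode drops invalid items; flag-only mode keeps them, marked invalid.
        if is_valid or flag_only:
            results.append({**item, "validation": {"isValid": is_valid, "lengthValid": is_valid}})
    return results
```

With `flag_only=True` the output has the same number of items as the input, so rejected records can still be reviewed by filtering on `validation.isValid`.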

removePII (boolean, default: false)

Automatically redacts detected PII with placeholder text:

  • [EMAIL_REDACTED]
  • [PHONE_REDACTED]
  • [SSN_REDACTED]
  • [CC_REDACTED]

Important: Redaction is applied to the processedText field; originalText is always preserved for audit.
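
A minimal sketch of placeholder redaction with the four documented patterns. The names `REDACTIONS` and `redact` are hypothetical, and the order of substitution is an assumption chosen so that a 16-digit card number is replaced before the phone pattern can consume part of it.

```python
import re

# Order matters (assumption): cards are redacted before phones so that a
# 16-digit run is not partially matched by the 10-digit phone pattern.
REDACTIONS = [
    ("[EMAIL_REDACTED]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[SSN_REDACTED]", re.compile(r"\d{3}-\d{2}-\d{4}")),
    ("[CC_REDACTED]", re.compile(r"\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}")),
    ("[PHONE_REDACTED]", re.compile(r"(\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")),
]


def redact(item: dict, text_field: str = "text") -> dict:
    processed = item[text_field]
    for placeholder, pattern in REDACTIONS:
        processed = pattern.sub(placeholder, processed)
    # The original text is kept untouched next to the redacted copy.
    return {"originalText": item[text_field], "processedText": processed}
```

For example, "Contact jane@example.com or 555-123-4567." becomes "Contact [EMAIL_REDACTED] or [PHONE_REDACTED]." in `processedText`, while `originalText` keeps the unredacted string.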

📈 Performance & Scalability

  • Throughput: ~100-200 items/second on default Apify infrastructure
  • Memory: O(n) for duplicate detection fuzzy index
  • Concurrency: Single-threaded processing (natural language processing is CPU-bound)
  • Dataset Size: Tested up to 1M items, recommend batching for 10M+ datasets

🔬 Technical Architecture

NLP Pipeline

Input Dataset
↓
Text Extraction (configurable field)
↓
┌────────────────────────────────────┐
│ ENRICHMENT PHASE                   │
├────────────────────────────────────┤
│ 1. Sentiment Analysis (AFINN)      │
│ 2. NER (Compromise.js)             │
│ 3. TF-IDF Keyword Extraction       │
│ 4. Language Detection              │
│ 5. Readability Metrics             │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│ VALIDATION PHASE                   │
├────────────────────────────────────┤
│ 1. Length Validation               │
│ 2. Duplicate Detection (FuzzySet)  │
│ 3. PII Detection (Regex + ML)      │
│ 4. Schema Validation (Zod)         │
└────────────────────────────────────┘
↓
Filtering / Flagging Logic
↓
Output Dataset

Dependencies

  • natural - NLP toolkit for sentiment, tokenization, stemming, TF-IDF
  • compromise - Fast, client-side NER without external models
  • fuzzyset - Probabilistic fuzzy string matching using n-grams
  • zod - TypeScript-first schema validation
  • email-validator - RFC-compliant email validation
  • phone - International phone number parsing

🎓 Best Practices

1. Start with Quality Filtering

Before enrichment, remove obviously bad data:

{
  "validationOptions": {
    "minTextLength": 50,
    "maxTextLength": 5000
  }
}

2. Tune Duplicate Threshold Iteratively

Start at 0.95. Lower the threshold if near-duplicates still slip through; raise it if distinct items are being flagged as duplicates.

3. Always Check for PII

GDPR fines can reach EUR 20 million or 4% of global annual turnover, whichever is higher. Always run PII detection.

4. Use Schema Validation

Enforce structure early to catch bugs in scraping pipelines:

{
  "schemaValidation": {
    "required": ["text", "source_url"]
  }
}

5. Monitor Sentiment Distribution

Use sentiment enrichment to check for dataset bias. Balanced datasets should have near-zero average sentiment.
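
One way to check the distribution is to aggregate the `comparative` scores from the output dataset. The sketch below assumes every item carries `enrichment.sentiment.comparative` as in the output example; `sentiment_summary` is a hypothetical helper.

```python
from statistics import mean
from typing import Any, Dict, List


def sentiment_summary(items: List[Dict[str, Any]]) -> Dict[str, float]:
    scores = [item["enrichment"]["sentiment"]["comparative"] for item in items]
    if not scores:
        return {"mean": 0.0, "positiveShare": 0.0, "negativeShare": 0.0}
    return {
        "mean": mean(scores),
        # Share of items leaning positive or negative; the rest are neutral.
        "positiveShare": sum(s > 0 for s in scores) / len(scores),
        "negativeShare": sum(s < 0 for s in scores) / len(scores),
    }
```

A mean near zero with very different positive and negative shares can still indicate imbalance (a few strongly negative items offsetting many mildly positive ones), so it helps to look at both numbers.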

6. Batch Large Datasets

For datasets >1M items, split into smaller batches and run in parallel.

🐛 Troubleshooting

"Input dataset is empty"

  • Verify datasetId is correct
  • Check that dataset has items
  • Try using dataset ID from a previous Actor run

"Item missing text field 'xyz'"

  • Verify textField parameter matches your data structure
  • Check for null/undefined values in your dataset
  • Ensure text field contains strings, not objects

"Out of memory"

  • Reduce dataset size with maxItems parameter
  • Disable duplicate detection for very large datasets (1M+ items)
  • Use flag-only mode to avoid filtering large numbers of items

Slow Performance

  • Disable unused enrichment features
  • Reduce maxItems for testing
  • Consider upgrading Apify Actor memory allocation

🤝 Contributing

Found a bug? Have a feature request?

Please report issues or suggest improvements via GitHub Issues.

📄 License

MIT License - feel free to use in commercial projects.

🎖️ Credits

Built for the Apify $1M Challenge by a team passionate about data quality and AI safety.


Ready to clean your training data? Get started now →
