URL: https://apify.com/boztek-ltd/ai-dataset-converter

AI Dataset Converter - Website to Training Data · Apify

Pricing

from $0.008 / actor start

Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.

Rating: 0.0 (0)

Developer: Boztek LTD (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a month ago

AI Dataset Converter: Website to AI Training Data

Convert any website into AI-ready datasets for RAG pipelines, LLM fine-tuning, and Q&A training. Token-aware chunking, quality scoring, and content deduplication, all without external API calls.

What does AI Dataset Converter do?

AI Dataset Converter crawls websites and transforms their content into structured, token-aware datasets optimized for AI/ML workflows:

  • RAG Chunks: Embedding-ready JSON with configurable chunk size and overlap
  • Fine-tuning JSONL: OpenAI-compatible messages[] format
  • Q&A Pairs: Automatically extracted from FAQ pages and heading structures
  • Clean Markdown: Boilerplate-free content with full page metadata

Every chunk includes the cl100k_base (GPT-4 compatible) token count, a 0.0–1.0 quality score, source URL, language, and canonical URL, so it is ready to ingest into Pinecone, Qdrant, Weaviate, LangChain, LlamaIndex, or any vector store.

Why AI Dataset Converter?

Feature | Website Content Crawler | AI Dataset Converter
--- | --- | ---
Output | Raw Markdown / text | Structured AI-ready formats
Chunking | Manual | Token-aware, configurable
Token counting | - | cl100k_base (GPT-4)
Q&A extraction | - | 5 rule-based strategies
Quality scoring | - | 0.0–1.0 per page
Deduplication | URL-based | Content fingerprinting
Fine-tuning format | - | OpenAI JSONL
External LLM cost | None | None

How much does it cost?

AI Dataset Converter uses pay-per-event pricing at approximately $0.002 per output item (chunk, Q&A pair, or page). Platform compute units are included.

Use case | Pages | Output items | Estimated cost
--- | --- | --- | ---
Small docs site | 50 | ~250 chunks | ~$0.50
Medium blog | 500 | ~2,500 chunks | ~$5.00
Large docs + FAQ | 2,000 | ~12,000 items | ~$24.00

Apify's free plan provides $5 of platform credit per month, which is enough to test on small sites.
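
The estimates above follow from multiplying the item count by the per-item price. A minimal sketch of that arithmetic, assuming a flat rate of about $0.002 per output item as quoted above; the function name estimate_cost is hypothetical and an actual bill may differ:

```python
from decimal import Decimal

# Assumption: flat rate of about $0.002 per output item, as quoted in the text.
PRICE_PER_ITEM = Decimal("0.002")
# Monthly free-plan credit in USD, as stated in the text.
FREE_PLAN_CREDIT = Decimal("5")


def estimate_cost(output_items: int) -> Decimal:
    """Return the approximate cost in USD for a number of output items."""
    if output_items < 0:
        raise ValueError("output_items must be non-negative")
    return PRICE_PER_ITEM * output_items


scenarios = [("Small docs site", 250), ("Medium blog", 2500), ("Large docs + FAQ", 12000)]
for label, items in scenarios:
    cost = estimate_cost(items)
    within = "yes" if cost <= FREE_PLAN_CREDIT else "no"
    print(f"{label}: ~${cost:.2f} (within free credit: {within})")
```

Decimal is used instead of float so that small per-item prices do not accumulate rounding error.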

Output formats

1. RAG Chunks (rag-chunks)

One JSON item per chunk with embedding-ready text plus rich metadata:

{
  "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
  "source_url": "https://docs.example.com/getting-started",
  "canonical_url": "https://docs.example.com/getting-started",
  "text": "Getting started with Example SDK...",
  "markdown": "# Getting Started\n\nWelcome to...",
  "chunk_index": 0,
  "total_chunks": 3,
  "token_count": 487,
  "char_count": 1843,
  "page_title": "Getting Started",
  "page_description": "Quick start guide",
  "page_language": "en",
  "page_author": "Docs Team",
  "page_date": "2026-04-12T00:00:00.000Z",
  "quality_score": 0.85,
  "content_type": "documentation",
  "crawled_at": "2026-05-12T08:30:00.000Z",
  "actor_version": "1.0.0"
}
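
Because every chunk carries token_count and quality_score, it is easy to filter items before embedding them. A sketch, assuming the dataset has been loaded as a list of dictionaries shaped like the item above; the helper name select_chunks and the thresholds are illustrative, not part of the Actor:

```python
from typing import Any


def select_chunks(items: list[dict[str, Any]], min_quality: float = 0.5,
                  max_tokens: int = 512) -> list[dict[str, Any]]:
    """Keep chunks that meet a quality floor and fit a token budget,
    ordered by source URL and then by position within the page."""
    kept = [
        item for item in items
        if item["quality_score"] >= min_quality and item["token_count"] <= max_tokens
    ]
    return sorted(kept, key=lambda item: (item["source_url"], item["chunk_index"]))


# Hypothetical sample items using the field names from the example above.
items = [
    {"source_url": "https://docs.example.com/getting-started", "chunk_index": 1,
     "token_count": 300, "quality_score": 0.85, "text": "..."},
    {"source_url": "https://docs.example.com/getting-started", "chunk_index": 0,
     "token_count": 487, "quality_score": 0.85, "text": "..."},
    {"source_url": "https://docs.example.com/changelog", "chunk_index": 0,
     "token_count": 120, "quality_score": 0.35, "text": "..."},
]
selected = select_chunks(items)
print([(chunk["source_url"], chunk["chunk_index"]) for chunk in selected])
```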

2. Fine-tuning JSONL (fine-tuning-jsonl)

OpenAI-compatible messages[] format. Prompts are synthesized by rules, with no LLM involved:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that provides information about Example Documentation."},
    {"role": "user", "content": "What is the chunk size?"},
    {"role": "assistant", "content": "The chunk size is the target number of tokens per output chunk..."}
  ],
  "_metadata": {
    "source_url": "https://docs.example.com/chunking",
    "chunk_id": "...",
    "token_count": 412,
    "quality_score": 0.81
  }
}
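
Records in this shape can be written out as one JSON object per line. A sketch, assuming the records are already loaded as dictionaries; the function name to_training_jsonl is hypothetical. Whether a given training API accepts the extra _metadata key is not stated here, so the sketch drops it by default as a cautious assumption:

```python
import json


def to_training_jsonl(records: list[dict], keep_metadata: bool = False) -> str:
    """Serialize records to JSONL, one JSON object per line."""
    lines = []
    for record in records:
        row = {"messages": record["messages"]}
        # Assumption: _metadata is only useful for bookkeeping, so it is optional.
        if keep_metadata and "_metadata" in record:
            row["_metadata"] = record["_metadata"]
        lines.append(json.dumps(row, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Hypothetical record using the structure from the example above.
records = [{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the chunk size?"},
        {"role": "assistant", "content": "The target number of tokens per chunk."},
    ],
    "_metadata": {"source_url": "https://docs.example.com/chunking", "token_count": 412},
}]
jsonl = to_training_jsonl(records)
print(jsonl, end="")
```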

3. Q&A Pairs (qa-pairs)

Extracted from FAQ pages using five rule-based strategies:

{
  "question": "Can I cancel my subscription?",
  "answer": "Yes, you can cancel anytime from the billing settings page in your account.",
  "source_url": "https://example.com/help/faq",
  "extraction_method": "faq_html",
  "confidence": 0.95,
  "token_count": 28,
  "page_title": "FAQ"
}

Extraction strategies (in confidence order):

  1. faq_schema: JSON-LD FAQPage schema (confidence 1.0)
  2. faq_html: <details><summary> elements (0.95)
  3. dt_dd: Definition lists <dl>/<dt>/<dd> (0.90)
  4. accordion: aria-controls / data-toggle patterns (0.85)
  5. heading_paragraph: <h2>/<h3> + following content (0.5–0.9)
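
To illustrate the highest-confidence strategy, here is a sketch of how question/answer pairs might be read from a JSON-LD FAQPage block. It follows the schema.org layout (mainEntity, name, acceptedAnswer.text) and is not the Actor's actual code; the function name extract_faq_schema is hypothetical:

```python
import json


def extract_faq_schema(json_ld: str) -> list[dict]:
    """Extract question/answer pairs from a JSON-LD FAQPage document."""
    data = json.loads(json_ld)
    nodes = data if isinstance(data, list) else [data]
    pairs = []
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "FAQPage":
            continue
        entities = node.get("mainEntity", [])
        if isinstance(entities, dict):  # a single question may not be wrapped in a list
            entities = [entities]
        for entity in entities:
            accepted = entity.get("acceptedAnswer", {})
            if isinstance(accepted, list):
                accepted = accepted[0] if accepted else {}
            question = entity.get("name", "").strip()
            answer = accepted.get("text", "").strip()
            if question and answer:
                pairs.append({
                    "question": question,
                    "answer": answer,
                    "extraction_method": "faq_schema",
                    "confidence": 1.0,  # confidence listed for this strategy
                })
    return pairs


sample = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Can I cancel my subscription?",
        "acceptedAnswer": {"@type": "Answer", "text": "Yes, you can cancel anytime."},
    }],
})
pairs = extract_faq_schema(sample)
print(pairs)
```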

4. Clean Markdown (markdown)

Full-page Markdown with boilerplate removed and complete metadata.

Input options

Option | Type | Default | Description
--- | --- | --- | ---
startUrls | array | required | Initial URLs to crawl
maxPages | integer | 100 | Maximum number of pages (0 = unlimited)
maxDepth | integer | 5 | Link-follow depth from start URLs
crawlerType | string | adaptive | adaptive / cheerio / playwright
includeGlobs / excludeGlobs | array | [] | URL pattern filters
outputFormat | string | rag-chunks | rag-chunks / fine-tuning-jsonl / qa-pairs / markdown / all
chunkSize | integer | 512 | Target tokens per chunk
chunkOverlap | integer | 50 | Token overlap between chunks
extractQAPairs | boolean | true | Run Q&A extraction strategies
language | string | "" | Language filter (ISO 639-1 code)
minContentLength | integer | 100 | Skip pages shorter than this (chars)
minQualityScore | number | 0.3 | Skip pages below this score (0.0–1.0)
removeDuplicates | boolean | true | Content-fingerprint deduplication
removeBoilerplate | boolean | true | Strip nav/footer/cookie banners
proxyConfiguration | object | Apify Proxy | Proxy settings
maxConcurrency | integer | 10 | Parallel page processing
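
As an illustration of how these options combine, the sketch below builds a run input from the documented defaults and checks a few values before submission. The helper build_input is hypothetical. Two things are assumptions rather than facts from this page: that startUrls accepts plain URL strings, and that chunkOverlap must be smaller than chunkSize.

```python
# Defaults copied from the options table; startUrls has no default.
DEFAULTS = {
    "maxPages": 100, "maxDepth": 5, "crawlerType": "adaptive",
    "includeGlobs": [], "excludeGlobs": [], "outputFormat": "rag-chunks",
    "chunkSize": 512, "chunkOverlap": 50, "extractQAPairs": True,
    "language": "", "minContentLength": 100, "minQualityScore": 0.3,
    "removeDuplicates": True, "removeBoilerplate": True, "maxConcurrency": 10,
}
CRAWLER_TYPES = ("adaptive", "cheerio", "playwright")
OUTPUT_FORMATS = ("rag-chunks", "fine-tuning-jsonl", "qa-pairs", "markdown", "all")


def build_input(start_urls: list[str], **overrides) -> dict:
    """Merge overrides into the documented defaults and sanity-check them."""
    if not start_urls:
        raise ValueError("startUrls is required")
    # Assumption: startUrls is passed as a list of plain URL strings.
    run_input = {**DEFAULTS, **overrides, "startUrls": list(start_urls)}
    if run_input["crawlerType"] not in CRAWLER_TYPES:
        raise ValueError(f"unknown crawlerType: {run_input['crawlerType']}")
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {run_input['outputFormat']}")
    if not 0.0 <= run_input["minQualityScore"] <= 1.0:
        raise ValueError("minQualityScore must be between 0.0 and 1.0")
    # Assumption (not stated in the source): overlap must be below the chunk size.
    if not 0 <= run_input["chunkOverlap"] < run_input["chunkSize"]:
        raise ValueError("chunkOverlap must be >= 0 and smaller than chunkSize")
    return run_input


run_input = build_input(["https://docs.example.com"], outputFormat="all", chunkSize=256)
print(run_input["chunkSize"], run_input["outputFormat"], run_input["maxPages"])
```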

Use cases

  1. Build RAG chatbots: Crawl documentation → chunk → embed in Pinecone/Qdrant/Weaviate
  2. Fine-tune LLMs: Convert knowledge bases to OpenAI training format
  3. Create Q&A datasets: Extract FAQ data for customer-support AI
  4. Feed AI agents: Provide structured web knowledge to autonomous agents

Integrations

Output is plain JSON / JSONL and works with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, MongoDB Atlas, OpenAI fine-tuning, and any tool that accepts JSON.

Quality scoring (heuristic, no LLM)

Each page receives a 0.0–1.0 score computed from:

  • Content length (25%): Pages between 500 and 10,000 chars score highest
  • Text density (25%): Ratio of extracted text to original HTML
  • Paragraph count (15%): ≥3 paragraphs preferred
  • Heading presence (10%): At least one <h1>–<h6>
  • Link density (10%): Low anchor-text ratio preferred
  • Repetition (15%): Unique-sentence ratio

Pages scoring below minQualityScore are filtered out before they are tokenized and chunked.
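
The weights listed above sum to 100%, so the page score can be read as a weighted average. A minimal sketch, assuming each signal has already been normalized to the 0.0–1.0 range; how the Actor derives each signal is not described here, and the names quality_score and passes_filter are hypothetical:

```python
# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {
    "content_length": 0.25,
    "text_density": 0.25,
    "paragraph_count": 0.15,
    "heading_presence": 0.10,
    "link_density": 0.10,
    "repetition": 0.15,
}


def quality_score(signals: dict[str, float]) -> float:
    """Combine per-signal scores (each assumed to be in 0.0-1.0) into one score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, signals.get(name, 0.0)))  # clamp; missing counts as 0
        total += weight * value
    return round(total, 4)


def passes_filter(signals: dict[str, float], min_quality_score: float = 0.3) -> bool:
    """Mirror the minQualityScore option: keep pages at or above the threshold."""
    return quality_score(signals) >= min_quality_score


# Hypothetical page: strong on every signal except that it has no headings.
page = {name: 1.0 for name in WEIGHTS}
page["heading_presence"] = 0.0
print(quality_score(page), passes_filter(page))
```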

Token-aware chunking

Chunks are produced with a recursive splitter that respects natural boundaries:

  1. Split by paragraph (\n\n)
  2. If a paragraph exceeds chunkSize, split by sentence
  3. If a sentence exceeds chunkSize, split by token
  4. Apply chunkOverlap by prepending the last N tokens of the previous chunk

Token counts are computed with js-tiktoken using the cl100k_base encoding, the same encoding used by GPT-4 and the text-embedding-3-* models.
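
The four steps above can be sketched as follows. This is an illustration and not the Actor's implementation. It makes three assumptions: whitespace-separated words stand in for cl100k_base tokens, small units are greedily packed together, and the overlap prefix counts toward chunkSize.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace words, not cl100k_base tokens.
    return len(text.split())


def split_units(text: str, limit: int) -> list[str]:
    """Split into units of at most `limit` tokens: paragraphs, then sentences, then tokens."""
    units = []
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        if count_tokens(paragraph) <= limit:
            units.append(paragraph)
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            tokens = sentence.split()
            if not tokens:
                continue
            if len(tokens) <= limit:
                units.append(sentence)
            else:  # last resort: fixed-size token windows
                for i in range(0, len(tokens), limit):
                    units.append(" ".join(tokens[i:i + limit]))
    return units


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    budget = chunk_size - chunk_overlap  # leave room for the overlap prefix
    bodies, current = [], []
    for unit in split_units(text, budget):
        unit_tokens = unit.split()
        if current and len(current) + len(unit_tokens) > budget:
            bodies.append(current)
            current = []
        current.extend(unit_tokens)
    if current:
        bodies.append(current)
    chunks = []
    for i, tokens in enumerate(bodies):
        # Prepend the last N tokens of the previous chunk body.
        prefix = bodies[i - 1][-chunk_overlap:] if i > 0 and chunk_overlap else []
        chunks.append(" ".join(prefix + tokens))
    return chunks


text = ("one two three four five. six seven eight nine ten eleven twelve."
        "\n\nalpha beta gamma.")
chunks = chunk_text(text, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(count_tokens(chunk), chunk)
```

Joining tokens with single spaces discards paragraph breaks inside a chunk. A fuller version would keep the original separators.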

Limitations

  • No LLM-based extraction (by design, to keep cost predictable)
  • Q&A extraction works best on structured pages (FAQ, docs with headings)
  • Login-protected content not supported without cookie injection
  • JavaScript-heavy SPAs may need crawlerType: "playwright" for full rendering
