AI Dataset Converter - Website to Training Data
Pricing
from $0.008 / actor start
Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.
AI Dataset Converter - Website to AI Training Data
Convert any website into AI-ready datasets for RAG pipelines, LLM fine-tuning, and Q&A training. Token-aware chunking, quality scoring, and content deduplication, all without external API calls.
What does AI Dataset Converter do?
AI Dataset Converter crawls websites and transforms their content into structured, token-aware datasets optimized for AI/ML workflows:
- RAG Chunks: Embedding-ready JSON with configurable chunk size and overlap
- Fine-tuning JSONL: OpenAI-compatible `messages[]` format
- Q&A Pairs: Automatically extracted from FAQ pages and heading structures
- Clean Markdown: Boilerplate-free content with full page metadata
Every chunk includes the cl100k_base (GPT-4 compatible) token count, a quality score from 0.0 to 1.0, source URL, language, and canonical URL, so it is ready to ingest into Pinecone, Qdrant, Weaviate, LangChain, LlamaIndex, or any vector store.
Why AI Dataset Converter?
| Feature | Website Content Crawler | AI Dataset Converter |
|---|---|---|
| Output | Raw Markdown / text | Structured AI-ready formats |
| Chunking | Manual | Token-aware, configurable |
| Token counting | No | cl100k_base (GPT-4) |
| Q&A extraction | No | 5 rule-based strategies |
| Quality scoring | No | 0.0-1.0 per page |
| Deduplication | URL-based | Content fingerprinting |
| Fine-tuning format | No | OpenAI JSONL |
| External LLM cost | None | None |
How much does it cost?
AI Dataset Converter uses pay-per-event pricing at approximately $0.002 per output item (chunk, Q&A pair, or page). Platform compute units are included.
| Use case | Pages | Output items | Estimated cost |
|---|---|---|---|
| Small docs site | 50 | ~250 chunks | ~$0.50 |
| Medium blog | 500 | ~2,500 chunks | ~$5.00 |
| Large docs + FAQ | 2,000 | ~12,000 items | ~$24.00 |
Apify's free plan provides $5 of platform credit per month, which is enough to test on small sites.
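The estimates in the table follow from multiplying the item count by the per-item rate. A minimal sketch of that arithmetic, assuming a flat $0.002 per item and ignoring any per-start fee (the function name `estimate_cost` is hypothetical):

```python
# Assumed flat rate, taken from the "approximately $0.002 per output item" figure.
PRICE_PER_ITEM_USD = 0.002


def estimate_cost(output_items: int, price_per_item: float = PRICE_PER_ITEM_USD) -> float:
    """Return the estimated run cost in USD, rounded to cents."""
    if output_items < 0:
        raise ValueError("output_items must be non-negative")
    return round(output_items * price_per_item, 2)


# Item counts from the table above.
for label, items in [
    ("Small docs site", 250),
    ("Medium blog", 2500),
    ("Large docs + FAQ", 12000),
]:
    print(f"{label}: ~${estimate_cost(items):.2f}")
```

Actual charges depend on how many items a site really yields, so treat the result as a rough budget figure.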
Output formats
1. RAG Chunks (rag-chunks)
One JSON item per chunk with embedding-ready text plus rich metadata:
{"chunk_id":"550e8400-e29b-41d4-a716-446655440000","source_url":"https://docs.example.com/getting-started","canonical_url":"https://docs.example.com/getting-started","text":"Getting started with Example SDK...","markdown":"# Getting Started\n\nWelcome to...","chunk_index":0,"total_chunks":3,"token_count":487,"char_count":1843,"page_title":"Getting Started","page_description":"Quick start guide","page_language":"en","page_author":"Docs Team","page_date":"2026-04-12T00:00:00.000Z","quality_score":0.85,"content_type":"documentation","crawled_at":"2026-05-12T08:30:00.000Z","actor_version":"1.0.0"}
2. Fine-tuning JSONL (fine-tuning-jsonl)
OpenAI-compatible `messages[]` format. Prompts are synthesized by rules, not by an LLM:
{"messages":[{"role":"system","content":"You are a helpful assistant that provides information about Example Documentation."},{"role":"user","content":"What is the chunk size?"},{"role":"assistant","content":"The chunk size is the target number of tokens per output chunk..."}],"_metadata":{"source_url":"https://docs.example.com/chunking","chunk_id":"...","token_count":412,"quality_score":0.81}}
3. Q&A Pairs (qa-pairs)
Extracted from FAQ pages using five rule-based strategies:
{"question":"Can I cancel my subscription?","answer":"Yes, you can cancel anytime from the billing settings page in your account.","source_url":"https://example.com/help/faq","extraction_method":"faq_html","confidence":0.95,"token_count":28,"page_title":"FAQ"}
Extraction strategies (in confidence order):
- `faq_schema`: JSON-LD `FAQPage` schema (confidence 1.0)
- `faq_html`: `<details>`/`<summary>` elements (0.95)
- `dt_dd`: Definition lists `<dl>`/`<dt>`/`<dd>` (0.90)
- `accordion`: `aria-controls` / `data-toggle` patterns (0.85)
- `heading_paragraph`: `<h2>`/`<h3>` + following content (0.5-0.9)
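To illustrate the highest-confidence strategy, here is a minimal sketch of reading a schema.org `FAQPage` block from JSON-LD using only the standard library. It is not the actor's code: the class and function names are hypothetical, and the real implementation may handle more edge cases (for example `@graph` wrappers).

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def extract_faq_schema(html: str, source_url: str) -> list:
    """Return Q&A pairs found in FAQPage JSON-LD, shaped like the qa-pairs output."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # ignore malformed JSON-LD
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict) or node.get("@type") != "FAQPage":
                continue
            entities = node.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]
            for entity in entities:
                accepted = entity.get("acceptedAnswer")
                if not isinstance(accepted, dict):
                    continue
                question = str(entity.get("name", "")).strip()
                answer = str(accepted.get("text", "")).strip()
                if question and answer:
                    pairs.append({
                        "question": question,
                        "answer": answer,
                        "source_url": source_url,
                        "extraction_method": "faq_schema",
                        "confidence": 1.0,
                    })
    return pairs


page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": [
  {"@type": "Question", "name": "Can I cancel my subscription?",
   "acceptedAnswer": {"@type": "Answer", "text": "Yes, you can cancel anytime."}}
]}
</script></head><body></body></html>"""

faq_pairs = extract_faq_schema(page, "https://example.com/help/faq")
print(faq_pairs)
```

Structured data is the most reliable source because the site author has already labeled question and answer, which is consistent with its confidence of 1.0.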
4. Clean Markdown (markdown)
Full-page Markdown with boilerplate removed and complete metadata.
Input options
| Option | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Initial URLs to crawl |
| `maxPages` | integer | 100 | Maximum number of pages (0 = unlimited) |
| `maxDepth` | integer | 5 | Link-follow depth from start URLs |
| `crawlerType` | string | adaptive | adaptive / cheerio / playwright |
| `includeGlobs` / `excludeGlobs` | array | [] | URL pattern filters |
| `outputFormat` | string | rag-chunks | rag-chunks / fine-tuning-jsonl / qa-pairs / markdown / all |
| `chunkSize` | integer | 512 | Target tokens per chunk |
| `chunkOverlap` | integer | 50 | Token overlap between chunks |
| `extractQAPairs` | boolean | true | Run Q&A extraction strategies |
| `language` | string | "" | Language filter (ISO 639-1 code) |
| `minContentLength` | integer | 100 | Skip pages shorter than this (chars) |
| `minQualityScore` | number | 0.3 | Skip pages below this score (0.0-1.0) |
| `removeDuplicates` | boolean | true | Content-fingerprint deduplication |
| `removeBoilerplate` | boolean | true | Strip nav/footer/cookie banners |
| `proxyConfiguration` | object | Apify Proxy | Proxy settings |
| `maxConcurrency` | integer | 10 | Parallel page processing |
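As an illustration, a run input for a small documentation crawl might be assembled as below. The option names and allowed values come from the table; the `{"url": ...}` shape of `startUrls` entries, the glob pattern, and the `validate_input` helper (including its overlap-smaller-than-chunk-size rule) are assumptions made for this sketch.

```python
import json

ALLOWED_FORMATS = {"rag-chunks", "fine-tuning-jsonl", "qa-pairs", "markdown", "all"}
ALLOWED_CRAWLERS = {"adaptive", "cheerio", "playwright"}


def validate_input(run_input: dict) -> list:
    """Return a list of problems found in the input (empty when it looks fine)."""
    errors = []
    if not run_input.get("startUrls"):
        errors.append("startUrls is required")
    if run_input.get("outputFormat", "rag-chunks") not in ALLOWED_FORMATS:
        errors.append("outputFormat is not one of the documented values")
    if run_input.get("crawlerType", "adaptive") not in ALLOWED_CRAWLERS:
        errors.append("crawlerType is not one of the documented values")
    # Assumption: an overlap at or above the chunk size would never make progress.
    if run_input.get("chunkOverlap", 50) >= run_input.get("chunkSize", 512):
        errors.append("chunkOverlap should be smaller than chunkSize")
    if not 0.0 <= run_input.get("minQualityScore", 0.3) <= 1.0:
        errors.append("minQualityScore must be between 0.0 and 1.0")
    return errors


run_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],  # entry shape is assumed
    "maxPages": 50,
    "crawlerType": "cheerio",
    "includeGlobs": ["https://docs.example.com/**"],  # illustrative pattern
    "outputFormat": "rag-chunks",
    "chunkSize": 512,
    "chunkOverlap": 50,
    "minQualityScore": 0.3,
}

problems = validate_input(run_input)
print(json.dumps(run_input, indent=2) if not problems else problems)
```

Options that are left out fall back to the defaults listed in the table.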
Use cases
- Build RAG chatbots: Crawl documentation, chunk it, and embed it in Pinecone, Qdrant, or Weaviate
- Fine-tune LLMs: Convert knowledge bases to OpenAI training format
- Create Q&A datasets: Extract FAQ data for customer-support AI
- Feed AI agents: Provide structured web knowledge to autonomous agents
Integrations
Output is plain JSON / JSONL and works with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, MongoDB Atlas, OpenAI fine-tuning, and any tool that accepts JSON.
Quality scoring (heuristic, no LLM)
Each page receives a score from 0.0 to 1.0 computed from:
- Content length (25%): Pages between 500 and 10,000 chars score highest
- Text density (25%): Ratio of extracted text to original HTML
- Paragraph count (15%): ≥3 paragraphs preferred
- Heading presence (10%): At least one `<h1>` to `<h6>`
- Link density (10%): Low anchor-text ratio preferred
- Repetition (15%): Unique-sentence ratio
Pages scoring below minQualityScore are filtered out before token usage.
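The weights above add up to 100%, so the score can be read as a weighted average of six signals. A minimal sketch, assuming each signal has already been normalized to the 0.0-1.0 range; how the actor derives each signal is not specified here, and the signal names and `quality_score` function are hypothetical:

```python
# Weights from the list above.
WEIGHTS = {
    "content_length": 0.25,
    "text_density": 0.25,
    "paragraph_count": 0.15,
    "heading_presence": 0.10,
    "link_density": 0.10,
    "repetition": 0.15,
}


def quality_score(signals: dict) -> float:
    """Combine per-signal scores (each assumed to be in 0.0-1.0) into one score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = signals.get(name, 0.0)
        value = min(1.0, max(0.0, value))  # clamp out-of-range inputs
        total += weight * value
    return round(total, 4)


# Made-up signals for a well-structured page with moderate text density.
example_signals = {
    "content_length": 1.0,
    "text_density": 0.6,
    "paragraph_count": 1.0,
    "heading_presence": 1.0,
    "link_density": 0.8,
    "repetition": 0.9,
}

score = quality_score(example_signals)
print(score, "kept" if score >= 0.3 else "filtered out")
```

With the default `minQualityScore` of 0.3, a page has to do poorly on several signals at once before it is dropped.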
Token-aware chunking
Chunks are produced with a recursive splitter that respects natural boundaries:
- Split by paragraph (`\n\n`)
- If a paragraph exceeds `chunkSize`, split by sentence
- If a sentence exceeds `chunkSize`, split by token
- Apply `chunkOverlap` by prepending the last N tokens of the previous chunk
Token counts are computed with js-tiktoken using the cl100k_base encoding, the same encoding used by GPT-4 and text-embedding-3-*.
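The splitting steps above can be sketched as follows. This is an illustration, not the actor's code. Whitespace-separated words stand in for cl100k_base tokens so the example runs without dependencies. It also assumes that small units are packed together and that the overlap counts toward `chunkSize`, and it joins tokens with spaces, so paragraph breaks are not preserved.

```python
import re


def split_units(text: str, max_tokens: int) -> list:
    """Break text into units of at most max_tokens: paragraphs, then sentences, then tokens."""
    units = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_tokens:
            units.append(paragraph)
            continue
        # Paragraph too long: fall back to sentence boundaries.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            tokens = sentence.split()
            if not tokens:
                continue
            if len(tokens) <= max_tokens:
                units.append(sentence)
            else:
                # Sentence too long: hard split by token count.
                for start in range(0, len(tokens), max_tokens):
                    units.append(" ".join(tokens[start:start + max_tokens]))
    return units


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list:
    """Pack units into chunks, then prepend the tail of the previous chunk as overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    budget = chunk_size - chunk_overlap  # leave room for the prepended overlap
    chunks, current = [], []
    for unit in split_units(text, budget):
        unit_tokens = unit.split()
        if current and len(current) + len(unit_tokens) > budget:
            chunks.append(current)
            current = []
        current.extend(unit_tokens)
    if current:
        chunks.append(current)
    result = []
    for index, tokens in enumerate(chunks):
        prefix = chunks[index - 1][-chunk_overlap:] if index > 0 and chunk_overlap else []
        result.append(" ".join(prefix + tokens))
    return result


demo = "a b c d. e f g h.\n\ni j k l m n o p q r s t"
demo_chunks = chunk_text(demo, chunk_size=6, chunk_overlap=2)
for chunk in demo_chunks:
    print(repr(chunk))
```

Swapping the whitespace split for a real tokenizer would change the counts but not the overall structure of the splitter.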
Limitations
- No LLM-based extraction (by design, to keep cost predictable)
- Q&A extraction works best on structured pages (FAQ, docs with headings)
- Login-protected content is not supported without cookie injection
- JavaScript-heavy SPAs may need `crawlerType: "playwright"` for full rendering
