Q&A Dataset Extractor for LLM Fine-Tuning

Under maintenance

Pricing

from $3.00 / 1,000 q&a pairs

Crawl any website, documentation or FAQ and turn it into clean, deduplicated question-answer pairs in OpenAI / Alpaca / plain JSONL format - ready for fine-tuning and RAG.

Rating

0.0

(0)

Developer

Deniz Schlösser

Maintained by Community

Last modified

13 days ago

Why this Actor

🎯 Training-ready output, not raw text. Get {question, answer} pairs in OpenAI, Alpaca, or plain JSONL — drop them straight into a fine-tuning job.
🧹 Clean extraction. Mozilla Readability strips navigation, sidebars, cookie banners, and ads. The model never sees junk, so your dataset doesn't either.
🔒 Grounded answers, no hallucination. Every answer is constrained to the crawled content — the prompt forbids inventing facts.
♻️ Automatic deduplication. Near-identical questions across pages are collapsed, so you don't pay to train on the same thing twice.
💸 Bring your own Claude key (BYOK). You control model choice and token spend. Default claude-haiku-4-5 keeps costs at roughly $1 per 1,000 pairs in tokens.
🧪 Free dry run. Preview crawling and chunking with zero LLM cost before you spend anything.
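
The listing does not say how near-identical questions are detected. As a rough sketch of the idea only, the snippet below normalizes each question and drops any whose text is almost the same as one already kept. The `normalize` and `deduplicate` helpers, the `threshold` value, and the sample pairs are all hypothetical, and the Actor's real method may differ.

```python
import re
from difflib import SequenceMatcher


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(pairs: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first pair from each group of near-identical questions.

    `threshold` is an illustrative similarity cut-off, not an Actor setting.
    """
    kept: list[dict] = []
    kept_keys: list[str] = []
    for pair in pairs:
        key = normalize(pair["question"])
        is_duplicate = any(
            SequenceMatcher(None, key, seen).ratio() >= threshold
            for seen in kept_keys
        )
        if is_duplicate:
            continue
        kept.append(pair)
        kept_keys.append(key)
    return kept


# Hypothetical pairs in the "plain" shape.
pairs = [
    {"question": "What is web scraping?", "answer": "A"},
    {"question": "What is web scraping", "answer": "B"},
    {"question": "How do I set the page limit?", "answer": "C"},
]
unique = deduplicate(pairs)
```

The pairwise comparison is quadratic in the number of questions, which is fine for a sketch but would need a cheaper index for very large datasets.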

What it does

Start URL(s)
 │ 1. Crawl → follows same-domain links up to your page limit
 ▼
Clean Markdown → Mozilla Readability + Turndown (main content only)
 │ 2. Chunk → paragraph-aware, with overlap to preserve context
 ▼
Content chunks
 │ 3. Generate → Claude produces up to N grounded Q&A pairs per chunk
 ▼
Q&A pairs
 │ 4. Deduplicate → collapses repeated questions
 ▼
JSONL dataset → OpenAI / Alpaca / plain, with source_url + source_title
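
Step 2 can be pictured with a small sketch. It assumes paragraphs are separated by blank lines and that the overlap is carried forward as the tail of the previous chunk; the `chunk_markdown` function is hypothetical and is not the Actor's actual chunker. Sizes are in characters, and the overlap is capped at half the chunk size, as described for `chunkSize` / `chunkOverlap`.

```python
def chunk_markdown(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Group paragraphs into chunks of roughly `chunk_size` characters.

    Paragraphs are never split, so `chunk_size` is a soft target: a chunk can
    exceed it when a single paragraph (plus the carried tail) is long.
    """
    overlap = min(overlap, chunk_size // 2)  # cap the overlap at half the chunk size
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as context.
            tail = current[-overlap:] if overlap else ""
            current = f"{tail}\n\n{paragraph}" if tail else paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Tiny demo with artificial paragraphs; the requested overlap of 40 is capped to 25.
demo_text = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
demo_chunks = chunk_markdown(demo_text, chunk_size=50, overlap=40)
```

In the demo, each later chunk starts with the last 25 characters of the chunk before it, so a question generated from one chunk still sees a little of the preceding context.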

Example

Input:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
  "maxPagesToCrawl": 5,
  "maxQuestionsPerChunk": 3,
  "outputFormat": "openai",
  "model": "claude-haiku-4-5",
  "anthropicApiKey": "sk-ant-..."
}

Output (one dataset item, openai format):

{
  "messages": [
    {
      "role": "user",
      "content": "What is the main project you'll build in this JavaScript web scraping course?"
    },
    {
      "role": "assistant",
      "content": "In this course, you'll create an application for watching prices. It will be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more."
    }
  ],
  "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "source_title": "Web scraping basics for JavaScript devs | Academy | Apify Documentation"
}

Every item carries source_url and source_title so you can trace, filter, or cite each example.
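
As a minimal sketch of tracing and filtering by source, the snippet below groups downloaded JSONL lines by `source_url`. The sample items are made up for illustration (the `example.com` URL and the shortened contents are placeholders), and `group_by_source` is a hypothetical helper, not part of the Actor.

```python
import json
from collections import defaultdict

# Three hypothetical dataset items in the "openai" shape.
SAMPLE_ITEMS = [
    {
        "messages": [
            {"role": "user", "content": "Q1"},
            {"role": "assistant", "content": "A1"},
        ],
        "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "source_title": "Web scraping basics",
    },
    {
        "messages": [
            {"role": "user", "content": "Q2"},
            {"role": "assistant", "content": "A2"},
        ],
        "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "source_title": "Web scraping basics",
    },
    {
        "messages": [
            {"role": "user", "content": "Q3"},
            {"role": "assistant", "content": "A3"},
        ],
        "source_url": "https://example.com/docs/page",
        "source_title": "Placeholder page",
    },
]
SAMPLE_JSONL = "\n".join(json.dumps(item) for item in SAMPLE_ITEMS)


def group_by_source(jsonl_text: str) -> dict[str, list[dict]]:
    """Parse JSONL (one JSON object per line) and group items by source_url."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        item = json.loads(line)
        groups[item["source_url"]].append(item)
    return dict(groups)


groups = group_by_source(SAMPLE_JSONL)
```

Grouping like this makes it easy to drop every example that came from a page you later decide to exclude.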

Output formats

Pick the shape your training pipeline expects with the outputFormat setting:

Format	Shape	Use it for
`openai`	`{ "messages": [{ "role": "user", ... }, { "role": "assistant", ... }] }`	OpenAI fine-tuning, chat-format SFT
`alpaca`	`{ "instruction": "...", "input": "", "output": "..." }`	Llama / Mistral / open-model instruction tuning
`plain`	`{ "question": "...", "answer": "..." }`	RAG eval sets, custom pipelines, embeddings

The output format is independent of the model that generates the pairs — Claude produces pairs in whichever training shape you choose.
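
If you need to reshape pairs yourself after download, the three shapes in the table can be sketched as below. The function names and the `convert` dispatcher are hypothetical; only the field names come from the table above.

```python
def to_openai(question: str, answer: str) -> dict:
    """Chat-style shape: one user turn followed by one assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def to_alpaca(question: str, answer: str) -> dict:
    """Alpaca instruction shape; "input" stays empty for plain Q&A pairs."""
    return {"instruction": question, "input": "", "output": answer}


def to_plain(question: str, answer: str) -> dict:
    """Plain question/answer shape."""
    return {"question": question, "answer": answer}


CONVERTERS = {"openai": to_openai, "alpaca": to_alpaca, "plain": to_plain}


def convert(question: str, answer: str, output_format: str = "openai") -> dict:
    """Shape one pair according to the chosen outputFormat value."""
    if output_format not in CONVERTERS:
        raise ValueError(f"Unknown output format: {output_format!r}")
    return CONVERTERS[output_format](question, answer)


example = convert("What does dryRun do?", "It skips the LLM and outputs chunks.", "alpaca")
```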

Use cases

🤖 Build a support chatbot from your docs. Point it at your help center or product docs and generate a Q&A set to fine-tune or seed a RAG index — so the bot answers in your product's own words.

🎓 Fine-tune a domain expert model. Turn a knowledge base, wiki, or set of guides into thousands of instruction examples for a specialized model, without hand-writing them.

📚 Create RAG evaluation sets. Generate grounded question-answer pairs to benchmark retrieval quality — does your RAG system actually find the right answer?

🌍 Localize training data. Use customInstructions (e.g. "Write all questions and answers in German") to produce datasets in any language your source covers.

Input reference

Field	Description	Default
`startUrls`	Pages to crawl (follows same-domain links)	— (required)
`maxPagesToCrawl`	Hard limit on pages crawled	`10`
`maxQuestionsPerChunk`	Q&A pairs generated per content chunk	`3`
`chunkSize` / `chunkOverlap`	Chunk size and overlap in characters (overlap is capped automatically at half the chunk size)	`4000` / `200`
`outputFormat`	`openai` | `alpaca` | `plain`	`openai`
`model`	`claude-haiku-4-5` | `claude-sonnet-4-6` | `claude-opus-4-8`	`claude-haiku-4-5`
`anthropicApiKey`	Your Anthropic (Claude) API key — `sk-ant-...` (secret)	—
`customInstructions`	Extra guidance, e.g. "answer in German" or "focus on pricing"	—
`dryRun`	Skip the LLM and just output chunks — free preview	`false`

Choosing a model

Model	Token cost (your key)	Best for
Claude Haiku 4.5 (default)	~$1 / 1,000 pairs	High-volume datasets, the cost-conscious default
Claude Sonnet 4.6	~$3 / 1,000 pairs	Nuanced answers, technical or complex sources
Claude Opus 4.8	~$5 / 1,000 pairs	Highest quality where it matters most
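
For a rough budget, the approximate token costs above can be combined with the listed Actor price. The sketch assumes the two charges simply add up, treats the "from $3.00 / 1,000 q&a pairs" price as a flat rate, and maps each model ID to the matching table row; the `estimate_cost` helper is hypothetical and real bills will vary with page length and chunk settings.

```python
# Approximate token cost per 1,000 pairs on your own key, from the table above.
TOKEN_COST_PER_1000 = {
    "claude-haiku-4-5": 1.0,
    "claude-sonnet-4-6": 3.0,
    "claude-opus-4-8": 5.0,
}
# Listed Actor price ("from $3.00 / 1,000 q&a pairs"), assumed flat here.
ACTOR_PRICE_PER_1000 = 3.0


def estimate_cost(pairs: int, model: str = "claude-haiku-4-5") -> dict[str, float]:
    """Estimate the Actor charge, the token charge, and their sum in USD."""
    if model not in TOKEN_COST_PER_1000:
        raise ValueError(f"Unknown model: {model!r}")
    units = pairs / 1000
    actor_usd = units * ACTOR_PRICE_PER_1000
    token_usd = units * TOKEN_COST_PER_1000[model]
    return {
        "actor_usd": actor_usd,
        "token_usd": token_usd,
        "total_usd": actor_usd + token_usd,
    }


estimate = estimate_cost(5000, "claude-sonnet-4-6")
```

Under these assumptions, 5,000 pairs with Sonnet come to about $15 for the Actor plus about $15 in tokens.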

Quick start

1. Add one or more Start URLs (e.g. your docs site).
2. Paste your Anthropic (Claude) API key (sk-ant-...). Get one at console.anthropic.com.
3. (Optional) Set dryRun: true first to preview crawling and chunking for free.
4. Run, then download the dataset as JSONL and feed it to your fine-tuning job.

Notes & responsible use

Bring your own key. This Actor calls the Claude API with the key you provide; you are billed by Anthropic for token usage directly.
Answers are grounded. The prompt forbids inventing facts — answers are constrained to the crawled content.
Respect each site's rules. Only crawl content you are permitted to use. Honor the target site's Terms of Service and robots directives, and respect copyright and data-protection law (e.g. GDPR) for any personal data you process.

Questions or a source that doesn't extract cleanly? Open an issue on the Actor page — feedback shapes the roadmap.

AI Training Data Scraper - LLM and RAG-Ready

george.the.developer/ai-training-data-scraper

Extract web content formatted for LLM fine-tuning and RAG pipelines. Output in OpenAI JSONL, Claude JSONL, Markdown, or raw text.

George Kioko

AI-Ready Website Crawler

optimus-fulcria/ai-ready-website-crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Fulcria Labs

AI Dataset Converter - Website to Training Data

boztek-ltd/ai-dataset-converter

Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.

Boztek LTD

AI Training Data Curator

ryanclinton/ai-training-data-curator

Crawl any website and extract clean, structured text data ready for LLM fine-tuning, RAG pipelines, and AI model training.

Ryan Clinton

AI-Ready Documentation Crawler

funny_electrician/Korak1901

Scrapes developer docs and outputs perfectly formatted Markdown for LLM fine-tuning.

Milton Gardener

Biomedical & Legal Q&A Dataset Generator

resilient_meteor/biomedical-legal-qa-generator

Generates verified Q&A training pairs from PubMed Central (biomedical) or CourtListener (legal). Every answer has a verbatim evidence anchor from the source document. Output: JSONL ready for LLM fine-tuning.

tegar dave

Blog Post Scraper for LLM

extremescrapes/blog-post-scraper-for-llm

Extract blog posts as clean, image-free text optimized for AI/LLM training and fine-tuning. Filters by word count and outputs combined JSONL format ready for ML pipelines.

Extreme Scrapes

AI Training Dataset Builder: Articles, Blogs & Web Pages

turboextract/ai-training-dataset-builder

Turn any list of URLs into clean, structured training data for AI models, RAG systems, and LLM fine-tuning. Built for ML engineers and AI teams.

Moses Ndambuki

AI Training Data Scraper (Substack / Medium)

juryless_lens/ai-training-data-scraper

Extract clean, structured text data from Substack and Medium publications — formatted as Markdown or Plain Text — ready for LLM fine-tuning, RAG pipelines, and content analysis.

Brian

LLM-Ready Web Scraper

devoted_helix/llm-web-scraper

Convert web pages to clean, LLM-friendly text. Perfect for RAG pipelines, AI chatbot training, and fine-tuning datasets. Removes ads, menus, and clutter automatically.

batuhan senavci

URL: https://apify.com/deniz_schloesser/qa-dataset-extractor