
URL: https://apify.com/deniz_schloesser/qa-dataset-extractor

Q&A Dataset Extractor for LLM Fine-Tuning · Apify



Under maintenance

Pricing: from $3.00 / 1,000 Q&A pairs


Crawl any website, documentation or FAQ and turn it into clean, deduplicated question-answer pairs in OpenAI / Alpaca / plain JSONL format - ready for fine-tuning and RAG.


Rating: 0.0 (0)

Developer: Deniz Schlösser (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 13 days ago

Turn any website, documentation portal, or FAQ into clean, deduplicated question-answer pairs, ready to fine-tune a model or power a RAG / support chatbot.

Most scrapers stop at raw HTML or Markdown and leave the hard part to you: turning a pile of text into actual training examples. This Actor goes the whole way. It crawls your source, extracts the real content (no nav, ads, or boilerplate), splits it into context-preserving chunks, and uses Claude to generate grounded, self-contained Q&A pairs in the exact JSONL format your training pipeline expects.

You bring a source URL. You get back a fine-tuning-ready dataset.


Why this Actor

  • 🎯 Training-ready output, not raw text. Get {question, answer} pairs in OpenAI, Alpaca, or plain JSONL and drop them straight into a fine-tuning job.
  • 🧹 Clean extraction. Mozilla Readability strips navigation, sidebars, cookie banners, and ads. The model never sees junk, so your dataset doesn't either.
  • 🔒 Grounded answers, no hallucination. Every answer is constrained to the crawled content: the prompt forbids inventing facts.
  • ♻️ Automatic deduplication. Near-identical questions across pages are collapsed, so you don't pay to train on the same thing twice.
  • 💸 Bring your own Claude key (BYOK). You control model choice and token spend. The default, claude-haiku-4-5, keeps token costs at roughly $1 per 1,000 pairs.
  • 🧪 Free dry run. Preview crawling and chunking with zero LLM cost before you spend anything.
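
The listing does not say how near-identical questions are detected. As a rough illustration only, the sketch below assumes questions are normalized (lowercased, punctuation removed) and compared with a similarity ratio from Python's standard difflib module. The function names and the 0.9 threshold are hypothetical, not the Actor's actual implementation.

```python
import re
from difflib import SequenceMatcher


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(text.split())


def dedupe_pairs(pairs, threshold=0.9):
    """Keep the first pair from each group of near-identical questions.

    `threshold` is an assumed similarity cut-off, not a documented setting.
    """
    kept = []
    kept_norms = []
    for pair in pairs:
        norm = normalize(pair["question"])
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norms
        )
        if not is_duplicate:
            kept.append(pair)
            kept_norms.append(norm)
    return kept


pairs = [
    {"question": "What is web scraping?", "answer": "A"},
    {"question": "what is web scraping", "answer": "B"},
    {"question": "How do I set the page limit?", "answer": "C"},
]
unique = dedupe_pairs(pairs)
print(len(unique))
```

Pairwise comparison is quadratic in the number of questions, which is fine for a few thousand pairs. A larger dataset would probably want hashing or embeddings instead.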

What it does

Start URL(s)
   │  1. Crawl → follows same-domain links up to your page limit
   ▼
Clean Markdown → Mozilla Readability + Turndown (main content only)
   │  2. Chunk → paragraph-aware, with overlap to preserve context
   ▼
Content chunks
   │  3. Generate → Claude produces up to N grounded Q&A pairs per chunk
   ▼
   │  4. Deduplicate → collapses repeated questions
   ▼
JSONL dataset → OpenAI / Alpaca / plain, with source_url + source_title
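
Step 2 of the pipeline above can be approximated as follows. This is a minimal sketch of paragraph-aware chunking with character overlap, assuming paragraphs are separated by blank lines and that the overlap is capped at half the chunk size. The function name chunk_text and its exact behavior are illustrative assumptions, not the Actor's code.

```python
def chunk_text(text, chunk_size=4000, chunk_overlap=200):
    """Split text into chunks on paragraph boundaries, with character overlap.

    Assumptions of this sketch: paragraphs are separated by blank lines, and
    a chunk may exceed chunk_size when the overlap tail is prepended or when
    a single paragraph is longer than the limit.
    """
    overlap = min(chunk_overlap, chunk_size // 2)  # cap overlap at half
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= chunk_size:
            current = candidate
            continue
        # The paragraph does not fit: close the chunk and start a new one
        # that begins with the tail of the previous chunk for context.
        chunks.append(current)
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["A" * 50, "B" * 50, "C" * 50])
chunks = chunk_text(sample, chunk_size=110, chunk_overlap=20)
capped = chunk_text(sample, chunk_size=110, chunk_overlap=1000)
print(len(chunks))
```

Splitting only between paragraphs keeps sentences intact, and the overlap tail gives the model the end of the previous chunk as context for the next one.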

Example

Input:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
  "maxPagesToCrawl": 5,
  "maxQuestionsPerChunk": 3,
  "outputFormat": "openai",
  "model": "claude-haiku-4-5",
  "anthropicApiKey": "sk-ant-..."
}

Output (one dataset item, openai format):

{
  "messages": [
    {
      "role": "user",
      "content": "What is the main project you'll build in this JavaScript web scraping course?"
    },
    {
      "role": "assistant",
      "content": "In this course, you'll create an application for watching prices. It will be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more."
    }
  ],
  "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "source_title": "Web scraping basics for JavaScript devs | Academy | Apify Documentation"
}

Every item carries source_url and source_title so you can trace, filter, or cite each example.
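
One way to use the provenance fields after download is sketched below: load the JSONL, group items by source_url, and strip the two provenance keys before training in case your pipeline accepts only the training fields. The helper names and sample rows are made up for illustration. Only source_url, source_title, and the messages shape come from the example above.

```python
import json
from collections import defaultdict

PROVENANCE_KEYS = ("source_url", "source_title")


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_source(items):
    """Map each source_url to the list of items generated from it."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["source_url"]].append(item)
    return dict(grouped)


def strip_provenance(item):
    """Return a copy of the item without the provenance fields."""
    return {k: v for k, v in item.items() if k not in PROVENANCE_KEYS}


def make_row(question, answer, url, title):
    """Build one hypothetical openai-format row for the demo below."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        "source_url": url,
        "source_title": title,
    })


# Hypothetical rows standing in for a downloaded dataset file.
sample_lines = [
    make_row("Q1", "A1", "https://example.com/docs/a", "Page A"),
    make_row("Q2", "A2", "https://example.com/docs/a", "Page A"),
    make_row("Q3", "A3", "https://example.com/docs/b", "Page B"),
]

items = load_jsonl(sample_lines)
by_source = group_by_source(items)
training_rows = [strip_provenance(item) for item in items]
print({url: len(rows) for url, rows in by_source.items()})
```

For a real export, pass an open file handle instead of sample_lines, for example `with open("dataset.jsonl") as f: items = load_jsonl(f)`, where the file name is whatever you saved the download as.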


Output formats

Pick the shape your training pipeline expects with the outputFormat setting:

| Format | Shape | Use it for |
| --- | --- | --- |
| openai | { "messages": [{ "role": "user", ... }, { "role": "assistant", ... }] } | OpenAI fine-tuning, chat-format SFT |
| alpaca | { "instruction": "...", "input": "", "output": "..." } | Llama / Mistral / open-model instruction tuning |
| plain | { "question": "...", "answer": "..." } | RAG eval sets, custom pipelines, embeddings |

The output format is independent of the model that generates the pairs: Claude produces pairs in whichever training shape you choose.
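
The three shapes carry the same question and answer, so converting between them is mechanical. The sketch below builds each shape from a single pair, following the shapes listed above. The function names are illustrative and not part of the Actor.

```python
def to_openai(question, answer):
    """Chat-format record with one user turn and one assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def to_alpaca(question, answer):
    """Alpaca-style record; `input` stays empty for plain Q&A."""
    return {"instruction": question, "input": "", "output": answer}


def to_plain(question, answer):
    """Bare question/answer record."""
    return {"question": question, "answer": answer}


CONVERTERS = {"openai": to_openai, "alpaca": to_alpaca, "plain": to_plain}


def convert(question, answer, output_format="openai"):
    """Build a record in the requested shape; unknown formats raise ValueError."""
    try:
        return CONVERTERS[output_format](question, answer)
    except KeyError:
        raise ValueError(f"unknown output format: {output_format!r}") from None


record = convert("What does dryRun do?", "It skips the LLM step.", "alpaca")
print(record)
```

This can be handy when you have already generated a dataset in plain format and want to reuse it for a chat-format fine-tuning job without re-running the Actor.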


Use cases

🤖 Build a support chatbot from your docs. Point it at your help center or product docs and generate a Q&A set to fine-tune or seed a RAG index, so the bot answers in your product's own words.

🎓 Fine-tune a domain expert model. Turn a knowledge base, wiki, or set of guides into thousands of instruction examples for a specialized model, without hand-writing them.

📚 Create RAG evaluation sets. Generate grounded question-answer pairs to benchmark retrieval quality: does your RAG system actually find the right answer?

🌍 Localize training data. Use customInstructions (e.g. "Write all questions and answers in German") to produce datasets in any language your source covers.


Input reference

| Field | Description | Default |
| --- | --- | --- |
| startUrls | Pages to crawl (follows same-domain links) | (required) |
| maxPagesToCrawl | Hard limit on pages crawled | 10 |
| maxQuestionsPerChunk | Maximum Q&A pairs generated per content chunk | 3 |
| chunkSize / chunkOverlap | Chunk size and overlap, in characters (overlap is auto-capped at half the chunk size) | 4000 / 200 |
| outputFormat | openai, alpaca, or plain | openai |
| model | claude-haiku-4-5, claude-sonnet-4-6, or claude-opus-4-8 | claude-haiku-4-5 |
| anthropicApiKey | Your Anthropic (Claude) API key, sk-ant-... (secret) | none |
| customInstructions | Extra guidance, e.g. "answer in German" or "focus on pricing" | none |
| dryRun | Skip the LLM and just output chunks (free preview) | false |

Choosing a model

| Model | Token cost (your key) | Best for |
| --- | --- | --- |
| Claude Haiku 4.5 (default) | ~$1 / 1,000 pairs | High-volume datasets, the cost-conscious default |
| Claude Sonnet 4.6 | ~$3 / 1,000 pairs | Nuanced answers, technical or complex sources |
| Claude Opus 4.8 | ~$5 / 1,000 pairs | Highest quality where it matters most |
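
For budgeting, the figures on this page can be combined into a rough estimate. The sketch below assumes the listed Actor price ("from $3.00 / 1,000 Q&A pairs") and the approximate token costs in the table above are simply added together. Real costs depend on your source content and on current Anthropic pricing, so treat the output as a ballpark.

```python
# Figures taken from this page; both are approximate.
ACTOR_PRICE_PER_1K = 3.00  # "from $3.00 / 1,000 Q&A pairs"
TOKEN_COST_PER_1K = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def estimate_cost(pairs, model="claude-haiku-4-5"):
    """Rough cost in USD for generating `pairs` Q&A pairs with `model`."""
    thousands = pairs / 1000
    actor_fee = thousands * ACTOR_PRICE_PER_1K
    token_cost = thousands * TOKEN_COST_PER_1K[model]
    return {
        "actor_fee": actor_fee,
        "token_cost": token_cost,
        "total": actor_fee + token_cost,
    }


estimate = estimate_cost(10_000, "claude-sonnet-4-6")
print(estimate)
```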

Quick start

  1. Add one or more Start URLs (e.g. your docs site).
  2. Paste your Anthropic (Claude) API key (sk-ant-...). Get one at console.anthropic.com.
  3. (Optional) Set dryRun: true first to preview crawling and chunking for free.
  4. Run, then download the dataset as JSONL and feed it to your fine-tuning job.
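
The same steps can be scripted. The sketch below builds the run input from the fields in the input reference and, optionally, starts the Actor through the third-party apify-client package (`pip install apify-client`). The Actor ID is taken from this page's URL. The helper names, and the check that a key is present unless dryRun is set, are assumptions of this sketch.

```python
ACTOR_ID = "deniz_schloesser/qa-dataset-extractor"  # from the listing URL


def build_run_input(start_urls, anthropic_api_key=None, dry_run=False, **overrides):
    """Assemble the Actor input; extra fields (e.g. maxPagesToCrawl) go in overrides."""
    # Assumption: a key is only needed when the LLM step actually runs.
    if not dry_run and not anthropic_api_key:
        raise ValueError("anthropic_api_key is needed unless dry_run is True")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "dryRun": dry_run,
    }
    if anthropic_api_key:
        run_input["anthropicApiKey"] = anthropic_api_key
    run_input.update(overrides)
    return run_input


def run_actor(apify_token, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # imported lazily; third-party package

    client = ApifyClient(apify_token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Free preview first: crawl and chunk only, no LLM calls.
preview_input = build_run_input(
    ["https://docs.apify.com/academy/web-scraping-for-beginners"],
    dry_run=True,
    maxPagesToCrawl=5,
)
print(preview_input)
```

Calling `run_actor("<your Apify token>", preview_input)` would then perform the dry run. Once the chunks look right, build a second input with your Anthropic key and dry_run left at False.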

Notes & responsible use

  • Bring your own key. This Actor calls the Claude API with the key you provide; you are billed by Anthropic for token usage directly.
  • Answers are grounded. The prompt forbids inventing facts, so answers are constrained to the crawled content.
  • Respect each site's rules. Only crawl content you are permitted to use. Honor the target site's Terms of Service and robots directives, and respect copyright and data-protection law (e.g. GDPR) for any personal data you process.

Questions or a source that doesn't extract cleanly? Open an issue on the Actor page. Feedback shapes the roadmap.
