Blog Post Scraper for LLM

Pricing

from $50.00 / 1,000 articles scraped

Developer

Extreme Scrapes

Maintained by Community

Actor stats

Last modified: 7 days ago

Blog Post for AI Training Data

Extract blog posts as clean, image-free text optimized for AI/LLM training and fine-tuning. Filters by word count and outputs combined JSONL format ready for ML pipelines.

Features

Image-free output: strips all images, leaving pure text for training data
Word count filtering: skips short posts that fall below your minimum word count
JSONL output: one combined file ready for fine-tuning pipelines
Batch processing: extracts hundreds of blog posts in a single run
Dataset + KV store: results are written to both the Apify dataset and a single JSONL file

How It Works

Provide blog post URLs and set a minimum word count threshold.
The Actor fetches each post and strips all images.
Posts below the word count threshold are skipped.
Valid posts are stored in the dataset AND as a combined JSONL file in the Key-Value store.
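
The steps above can be approximated locally. The following is a minimal sketch, not the Actor's actual implementation: it assumes posts are already available as Markdown strings, that images appear in standard Markdown image syntax, and that words are counted by splitting on whitespace. The function names are hypothetical.

```python
import json
import re

# Matches Markdown image syntax such as ![alt](https://example.com/a.png).
# Assumption: the Actor's real image stripping may work differently.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_images(markdown: str) -> str:
    """Remove Markdown image tags, leaving the surrounding text intact."""
    return IMAGE_PATTERN.sub("", markdown)


def build_records(posts: dict[str, str], min_word_count: int) -> list[dict]:
    """Turn {url: markdown} into records, skipping posts below the threshold."""
    records = []
    for url, markdown in posts.items():
        text = strip_images(markdown)
        word_count = len(text.split())  # whitespace-based count (assumption)
        if word_count < min_word_count:
            continue  # below the threshold: skipped
        records.append({"url": url, "wordCount": word_count, "markdown": text})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line even when the Markdown body spans many paragraphs.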

Input

{
  "startUrls": [
    {"url": "https://lilianweng.github.io/posts/2023-06-23-agent/"},
    {"url": "https://blog.google/technology/ai/google-gemini-ai/"}
  ],
  "minWordCount": 200
}
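
One way to submit this input programmatically is through the `apify-client` Python package. This is a sketch under assumptions: the Actor ID `extremescrapes/blog-post-scraper-for-llm` is inferred from the listing URL, the helper names are hypothetical, and the token is your own Apify API token.

```python
from typing import Optional


def build_run_input(urls: list[str], min_word_count: int) -> dict:
    """Build the input object in the shape shown above."""
    return {
        "startUrls": [{"url": url} for url in urls],
        "minWordCount": min_word_count,
    }


def run_actor(token: str, run_input: dict) -> Optional[dict]:
    """Start the Actor and wait for it to finish; returns the run details."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    # Actor ID is an assumption taken from the listing URL.
    actor = client.actor("extremescrapes/blog-post-scraper-for-llm")
    return actor.call(run_input=run_input)
```

A call such as `run_actor(token, build_run_input(["https://lilianweng.github.io/posts/2023-06-23-agent/"], 200))` would start a run with a 200-word minimum.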

Output

Dataset record:

{
  "url": "https://lilianweng.github.io/posts/2023-06-23-agent/",
  "wordCount": 8542,
  "markdown": "# LLM Powered Autonomous Agents\n\nContent..."
}

Key-Value store: A single OUTPUT file in JSONL format containing all records.
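
Once the OUTPUT file has been downloaded, it can be read back into records line by line. The sketch below assumes each line carries the `url`, `wordCount`, and `markdown` fields shown in the dataset record; `parse_jsonl` is a hypothetical helper, and how you fetch the file is left to your own tooling.

```python
import json

# Fields taken from the dataset record example above.
REQUIRED_FIELDS = ("url", "wordCount", "markdown")


def parse_jsonl(payload) -> list[dict]:
    """Parse JSONL content (str or UTF-8 bytes) into a list of records."""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    records = []
    # Split on "\n" only, so unusual Unicode line separators inside
    # a JSON string cannot break a record apart.
    for line_number, line in enumerate(payload.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines and a trailing newline
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        records.append(record)
    return records
```

Validating the fields up front surfaces a truncated or malformed file before it reaches a training pipeline.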

Use Cases

Build fine-tuning datasets for LLMs
Create training corpora from technical blogs
Collect AI/ML research blog content
Generate evaluation datasets

Keywords

blog scraper, AI training data, LLM dataset, fine-tuning data, blog to jsonl, training corpus

Pricing

$50 per 1,000 blog post extractions.
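
At this rate each extraction costs $0.05. A rough budget estimate can be sketched as follows, assuming every counted post is billed at the flat listed rate; the listing does not say whether posts skipped by the word count filter are charged, so treat the result as an estimate.

```python
from decimal import Decimal

# Listed price in USD per 1,000 blog post extractions.
PRICE_PER_1000 = Decimal("50")


def estimate_cost(post_count: int) -> Decimal:
    """Estimate the cost in USD for a number of extractions, to the cent."""
    if post_count < 0:
        raise ValueError("post_count must be non-negative")
    return (PRICE_PER_1000 * post_count / 1000).quantize(Decimal("0.01"))
```

For example, `estimate_cost(250)` gives `Decimal("12.50")`.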


URL: https://apify.com/extremescrapes/blog-post-scraper-for-llm