URL: https://apify.com/extremescrapes/blog-post-scraper-for-llm

Blog Post Scraper for LLM · Apify


Pricing

from $50.00 / 1,000 scraped articles

Blog Post Scraper for LLM

Extract blog posts as clean, image-free text optimized for AI/LLM training and fine-tuning. The Actor filters posts by word count and outputs a combined JSONL file ready for ML pipelines.

Rating: 0.0 (0)

Developer: Extreme Scrapes (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 7 days ago

Blog Post for AI Training Data

Features

  • Image-free output: strips all images for pure text training data
  • Word count filtering: skip short posts that don't meet your quality threshold
  • JSONL output: combined output file ready for fine-tuning pipelines
  • Batch processing: extract hundreds of blog posts in a single run
  • Dataset + KV store: results in both the Apify dataset and as a single JSONL file

How It Works

  1. Provide blog post URLs and set a minimum word count threshold.
  2. The Actor fetches each post and strips all images.
  3. Posts below the word count threshold are skipped.
  4. Valid posts are stored in the dataset AND as a combined JSONL file in the Key-Value store.
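
The steps above can be approximated locally with a minimal sketch. It assumes that images appear as Markdown image syntax and that a "word" is a whitespace-separated token; the Actor's actual stripping and counting rules are not documented here, and all function names are hypothetical.

```python
import json
import re

# Assumption: images show up as Markdown image syntax, e.g. ![alt](src).
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_images(markdown: str) -> str:
    """Remove Markdown image tags, leaving the surrounding text untouched."""
    return IMAGE_PATTERN.sub("", markdown)


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def build_records(posts: dict[str, str], min_word_count: int) -> list[dict]:
    """Strip images from each post and keep those that meet the threshold."""
    records = []
    for url, markdown in posts.items():
        cleaned = strip_images(markdown)
        words = count_words(cleaned)
        if words < min_word_count:
            continue  # step 3: posts below the threshold are skipped
        records.append({"url": url, "wordCount": words, "markdown": cleaned})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line (step 4)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Here `posts` maps each URL to its already-fetched Markdown; fetching and HTML-to-Markdown conversion are left out of the sketch.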

Input

{
  "startUrls": [
    {"url": "https://lilianweng.github.io/posts/2023-06-23-agent/"},
    {"url": "https://blog.google/technology/ai/google-gemini-ai/"}
  ],
  "minWordCount": 200
}
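
One way to submit this input programmatically is through the `apify-client` Python package. The sketch below is an illustration, not the developer's documented usage: the Actor ID is inferred from the store URL, the helper names are hypothetical, and `run_actor` needs a valid API token and network access.

```python
def build_run_input(urls: list[str], min_word_count: int = 200) -> dict:
    """Build an input object in the shape shown above."""
    return {
        "startUrls": [{"url": u} for u in urls],
        "minWordCount": min_word_count,
    }


def run_actor(token: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    # Assumption: Actor ID taken from the store URL of this listing.
    run = client.actor("extremescrapes/blog-post-scraper-for-llm").call(
        run_input=run_input
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_actor(token, build_run_input(urls, 200))` would then return one dictionary per post that passed the word count filter.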

Output

Dataset record:

{
  "url": "https://lilianweng.github.io/posts/2023-06-23-agent/",
  "wordCount": 8542,
  "markdown": "# LLM Powered Autonomous Agents\n\nContent..."
}

Key-Value store: A single OUTPUT file in JSONL format containing all records.
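
Once the combined file is downloaded, it can be read back one record per line. The following is a minimal sketch that assumes each line is a JSON object with the `url`, `wordCount`, and `markdown` fields shown in the dataset record; the helper names are hypothetical.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of records, ignoring blank lines."""
    records = []
    # Split on "\n" only, so unusual Unicode line separators that may occur
    # inside JSON strings do not break a record in two.
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records


def total_words(records: list[dict]) -> int:
    """Sum the wordCount field across all records."""
    return sum(r["wordCount"] for r in records)
```

`total_words` gives a quick check of corpus size before the records are passed to a fine-tuning pipeline.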

Use Cases

  • Build fine-tuning datasets for LLMs
  • Create training corpora from technical blogs
  • Collect AI/ML research blog content
  • Generate evaluation datasets

Keywords

blog scraper, AI training data, LLM dataset, fine-tuning data, blog to jsonl, training corpus

Pricing

$50 per 1,000 blog post extractions.
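
As a rough budgeting aid, the listed rate works out to $0.05 per extraction. The sketch below assumes strictly linear pricing at that rate; the listing does not say whether posts skipped by the word count filter are billed, so the input here is the number of billed extractions, and the function name is hypothetical.

```python
PRICE_PER_1000_USD = 50.0  # listed rate: $50 per 1,000 extractions


def estimate_cost_usd(extractions: int) -> float:
    """Estimate the cost of a run, assuming linear pricing at the listed rate."""
    if extractions < 0:
        raise ValueError("extractions must be non-negative")
    return extractions * PRICE_PER_1000_USD / 1000
```

For example, 250 billed extractions would come to $12.50 under this assumption.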

You might also like

AI Training Data Scraper - LLM and RAG-Ready

george.the.developer/ai-training-data-scraper

Extract web content formatted for LLM fine-tuning and RAG pipelines. Output in OpenAI JSONL, Claude JSONL, Markdown, or raw text.

AI Training Data Curator

ryanclinton/ai-training-data-curator

Crawl any website and extract clean, structured text data ready for LLM fine-tuning, RAG pipelines, and AI model training.

AI Training Dataset Builder: Articles, Blogs & Web Pages

turboextract/ai-training-dataset-builder

Turn any list of URLs into clean, structured training data for AI models, RAG systems, and LLM fine-tuning. Built for ML engineers and AI teams.

Blog Scraper

naive_zing/blog-scraper

Company Blog Scraper, Blog Post Scraper, Corporate Blog Crawler, Automatic Blog Discovery, Blog Content Extractor, Article Metadata Scraper, Multi-Domain Blog Scraper, Competitor Blog Analysis, Content Marketing Scraper, Blog Post Metadata Extraction, Company Announcements Scraper.

AI Training Data Scraper (Substack / Medium)

juryless_lens/ai-training-data-scraper

Extract clean, structured text data from Substack and Medium publications, formatted as Markdown or Plain Text, ready for LLM fine-tuning, RAG pipelines, and content analysis.

Smart Article & Blog Extractor

lightkong/universal-blog-scraper

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.