URL: https://apify.com/juryless_lens/ai-training-data-scraper


AI Training Data Scraper (Substack / Medium)

Pricing

from $30.00 / 1,000 results

Extract clean, structured text data from Substack and Medium publications, formatted as Markdown or plain text, ready for LLM fine-tuning, RAG pipelines, and content analysis.

Rating

0.0

(0)

Developer

Brian

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 4 months ago

Why this scraper?

LLM training requires massive volumes of high-quality, long-form text. Substack and Medium are among the richest sources of expert-written articles on the internet, but scraping them manually is tedious, and the raw HTML is full of noise (ads, popups, subscribe widgets).

This Actor cleans the content, stripping out:

  • Subscribe popups & paywall gates
  • Navigation headers & footers
  • Author bio cards & social share buttons
  • Script/style tags and embedded widgets

What you get is clean content, which is exactly what AI training pipelines need.
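The kind of cleanup described above can be sketched with Python's standard-library HTML parser. This is only an illustration, not the Actor's actual implementation: the `NOISE_TAGS` and `BLOCK_TAGS` sets and the `strip_noise` helper are hypothetical names chosen for this example, and real Substack or Medium pages would likely need class-based or selector-based rules as well.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is dropped. This list is an assumption made for
# illustration; the Actor's real filtering rules are not documented here.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "iframe"}
# Tags that should produce a paragraph break in the extracted text.
BLOCK_TAGS = {"article", "section", "div", "p", "br", "li",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class NoiseStripper(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one noise tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse runs of whitespace and drop empty lines.
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return "\n\n".join(line for line in lines if line)


def strip_noise(html):
    """Return the visible text of `html` with noise subtrees removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Dropping whole subtrees by tag name is the simplest possible rule; it cannot tell an article's own header from a site header, so a production cleaner would need more targeted selectors.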

What does it extract?

Field      Description
title      Article headline
subtitle   Article subtitle (if present)
author     Author name
date       Publication date
content    Full article body in your chosen format
url        Original article URL

Input Parameters

Parameter                   Type      Description
publicationUrls             Array     Substack or Medium publication URLs
maxArticlesPerPublication   Integer   Max articles to scrape per publication (default: 10)
outputFormat                String    markdown or text (default: markdown)
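Before starting a run, it can help to check an input object against the parameters listed above. The sketch below applies the documented defaults (10 articles, markdown output). The `normalize_input` helper is hypothetical, and the stricter rules it enforces (a non-empty URL list, http(s) URLs only, a positive article count) are assumptions rather than documented behavior of the Actor.

```python
from urllib.parse import urlparse

DEFAULT_MAX_ARTICLES = 10              # documented default
ALLOWED_FORMATS = ("markdown", "text")  # documented outputFormat values


def normalize_input(raw):
    """Validate an input dict and fill in the documented defaults."""
    urls = raw.get("publicationUrls")
    # Assumption: at least one URL is required.
    if not isinstance(urls, list) or not urls:
        raise ValueError("publicationUrls must be a non-empty array of URLs")
    for url in urls:
        parts = urlparse(url) if isinstance(url, str) else None
        if parts is None or parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a valid http(s) URL: {url!r}")

    max_articles = raw.get("maxArticlesPerPublication", DEFAULT_MAX_ARTICLES)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_articles, bool) or not isinstance(max_articles, int) or max_articles < 1:
        raise ValueError("maxArticlesPerPublication must be a positive integer")

    output_format = raw.get("outputFormat", "markdown")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")

    return {
        "publicationUrls": list(urls),
        "maxArticlesPerPublication": max_articles,
        "outputFormat": output_format,
    }
```

Calling `normalize_input({"publicationUrls": ["https://newsletter.banklesshq.com"]})` returns the same URL list with `maxArticlesPerPublication` set to 10 and `outputFormat` set to `markdown`.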

Sample Input

{
  "publicationUrls": [
    "https://newsletter.banklesshq.com",
    "https://medium.com/@exampleauthor"
  ],
  "maxArticlesPerPublication": 5,
  "outputFormat": "markdown"
}

Sample Output

{
  "url": "https://newsletter.banklesshq.com/p/the-future-of-defi",
  "title": "The Future of DeFi",
  "subtitle": "Where decentralized finance is headed next",
  "author": "Bankless",
  "date": "2024-01-15",
  "content": "# The Future of DeFi\n\nDecentralized finance has come a long way..."
}
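Records shaped like the sample output can be turned into JSON Lines, a common input format for fine-tuning tools. The sketch below is one possible mapping, not something the Actor produces itself: the `to_jsonl` helper and the `text` / `meta` row layout are assumptions made for this example.

```python
import json

# Output fields carried over as metadata (taken from the field table).
META_FIELDS = ("url", "title", "subtitle", "author", "date")


def to_jsonl(records, min_chars=1):
    """Serialize scraped records as JSON Lines, one article per line.

    Records whose content is missing or shorter than `min_chars`
    (after trimming whitespace) are skipped.
    """
    lines = []
    for rec in records:
        content = (rec.get("content") or "").strip()
        if len(content) < min_chars:
            continue
        # Hypothetical row layout: the body under "text", the rest under "meta".
        row = {
            "text": content,
            "meta": {field: rec.get(field) for field in META_FIELDS},
        }
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")
```

Raising `min_chars` is a cheap way to filter out stubs and teaser posts before they reach a training set.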

Use Cases

  • LLM Fine-Tuning: Build domain-specific training datasets from expert writers
  • RAG Pipelines: Populate vector databases with high-quality knowledge bases
  • Content Analysis: Analyze publication trends, writing styles, and topic coverage
  • Research: Systematically collect articles on specific subjects
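For the RAG use case, the `content` field usually needs to be split into smaller passages before embedding. A minimal sketch, assuming Markdown output where paragraphs are separated by blank lines; the `chunk_paragraphs` helper and the 1,000-character default are illustrative choices, not part of the Actor.

```python
from typing import List


def chunk_paragraphs(content: str, max_chars: int = 1000) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk stays within `max_chars` where possible. A single paragraph
    longer than `max_chars` is kept whole rather than split mid-sentence.
    """
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow: flush and start a new chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps headings next to the text that follows them, which tends to give retrieval more useful context than fixed-width character windows.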
