URL: https://apify.com/naive_zing/blog-scraper

Blog Scraper · Apify


Pricing

from $33.00 / 1,000 standard-fetches

Company Blog Scraper, Blog Post Scraper, Corporate Blog Crawler, Automatic Blog Discovery, Blog Content Extractor, Article Metadata Scraper, Multi-Domain Blog Scraper, Competitor Blog Analysis, Content Marketing Scraper, Blog Post Metadata Extraction, Company Announcements Scraper.

Rating: 0.0 (0)

Developer: Wyald (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 29
  • Monthly active users: 2
  • Issues response: 56 days
  • Last modified: 6 months ago

A robust Apify Actor designed to scrape blog posts from company websites. Given a list of company domains and a maximum number of posts to fetch, this scraper automatically discovers blog sections, extracts blog posts, and collects comprehensive content and metadata.

Targeted Keywords

  • Primary: Blog Scraper, Content Extraction, Company Blog Crawler, Article Scraper
  • Secondary: Blog Post Metadata, Content Marketing Analysis, Blog Content Aggregation, Corporate Blog Mining

Features

✅ Automatic Blog Discovery: Intelligently finds blog sections on company websites
✅ Smart Content Extraction: Extracts comprehensive blog post data including:
  • Title
  • Author
  • Publication date
  • Full article content
  • Excerpt/summary
  • Tags
  • Category
  • URL
✅ Configurable Limits: Set maximum number of posts per domain (up to 50)
✅ Multiple Domain Support: Scrape from multiple company websites in a single run
✅ Structured Output: Returns clean JSON data with all metadata
✅ Fast & Lightweight: Uses crawlee with BeautifulSoup for efficient HTTP-based scraping (no headless browser overhead)

Input

Field | Type | Description | Required | Default
company_urls | Array | List of company domain URLs or homepage URLs to scrape (e.g., ["https://stripe.com", "shopify.com"]) | Yes | -
max_blogposts_to_fetch | Number | Maximum number of blog posts to fetch per domain (1-50) | No | 10
max_concurrency | Number | Number of concurrent requests | No | 2

Input Example

{
  "company_urls": [
    "https://www.stripe.com",
    "https://shopify.com",
    "https://ai-bees.io"
  ],
  "max_blogposts_to_fetch": 10,
  "max_concurrency": 2
}
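
The input above can also be assembled and checked in code before a run is started. The following is a minimal sketch, not the Actor's own validation: `build_run_input` is a hypothetical helper, the 1-50 range comes from the input table, and the lower bound on `max_concurrency` is an assumption.

```python
from typing import Any


def build_run_input(
    company_urls: list[str],
    max_blogposts_to_fetch: int = 10,
    max_concurrency: int = 2,
) -> dict[str, Any]:
    """Assemble an input payload, rejecting values outside the documented ranges."""
    if not company_urls:
        raise ValueError("company_urls is required and must not be empty")
    if not 1 <= max_blogposts_to_fetch <= 50:
        raise ValueError("max_blogposts_to_fetch must be between 1 and 50")
    if max_concurrency < 1:  # assumed lower bound; the page does not state one
        raise ValueError("max_concurrency must be at least 1")
    return {
        "company_urls": list(company_urls),
        "max_blogposts_to_fetch": max_blogposts_to_fetch,
        "max_concurrency": max_concurrency,
    }


run_input = build_run_input(
    ["https://www.stripe.com", "shopify.com"],
    max_blogposts_to_fetch=5,
)
# With the apify-client package, a payload like this would typically be passed
# as run_input when calling the Actor; that step needs an API token and network
# access, so it is left out of this sketch.
```

Checking the limits locally means an out-of-range value fails before any paid fetches are made.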

Output Example

{
  "url": "https://www.stripe.com/blog/example-post",
  "domain": "www.stripe.com",
  "post_title": "How we scaled our payment infrastructure",
  "author": "Jane Doe",
  "published_date": "2024-01-15",
  "content": "Full article content here...",
  "excerpt": "Learn how we scaled our payment infrastructure to handle millions of transactions...",
  "tags": ["engineering", "infrastructure", "scaling"],
  "category": "Engineering",
  "scraped_at": "2024-01-20T10:30:00.000Z"
}
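
Records in this shape are straightforward to post-process. The sketch below parses the two date fields and counts tags across items. It assumes `published_date` is always an ISO `YYYY-MM-DD` string and `scraped_at` always ends in `Z`, as in the example; real output may differ by site, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date, datetime

# Sample record copied from the output example above (content shortened).
record = {
    "url": "https://www.stripe.com/blog/example-post",
    "domain": "www.stripe.com",
    "post_title": "How we scaled our payment infrastructure",
    "published_date": "2024-01-15",
    "tags": ["engineering", "infrastructure", "scaling"],
    "category": "Engineering",
    "scraped_at": "2024-01-20T10:30:00.000Z",
}


def parse_scraped_at(value: str) -> datetime:
    """Parse the UTC timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def post_age_in_days(item: dict) -> int:
    """Days between publication and the moment the post was scraped."""
    published = date.fromisoformat(item["published_date"])
    return (parse_scraped_at(item["scraped_at"]).date() - published).days


def tag_counts(items: list[dict]) -> Counter:
    """Count how often each tag appears; posts without tags are skipped."""
    counts: Counter = Counter()
    for item in items:
        counts.update(item.get("tags") or [])
    return counts


age = post_age_in_days(record)
counts = tag_counts([record])
```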

How It Works

  1. Domain Analysis: The scraper starts by visiting each provided company domain
  2. Blog Detection: It automatically searches for blog sections using common patterns (/blog, /news, /articles, etc.)
  3. Post Discovery: Once in the blog section, it identifies individual blog post URLs
  4. Content Extraction: For each post, it extracts:
    • Structured metadata (title, author, date)
    • Full article content
    • Additional metadata (tags, categories)
  5. Limit Enforcement: Respects the max_blogposts_to_fetch limit per domain
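
Steps 2, 3, and 5 can be illustrated with a small offline sketch. This is not the Actor's implementation, which is not published on this page: it uses the standard-library `html.parser` instead of BeautifulSoup so it runs without dependencies, `BLOG_PATHS` contains only the three patterns named above, and `find_post_urls` is a hypothetical helper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Only the patterns named in the steps above; the Actor's full list is unknown.
BLOG_PATHS = ("/blog", "/news", "/articles")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_post_urls(html: str, base_url: str, limit: int) -> list[str]:
    """Return up to `limit` same-host URLs that sit below a blog-like path."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    found: list[str] = []
    for href in collector.hrefs:
        url = urljoin(base_url, href).split("#")[0]  # drop fragments
        parsed = urlparse(url)
        path = parsed.path.rstrip("/")
        # A post lives below a blog section, e.g. /blog/some-post, not /blog.
        is_post = any(path.startswith(prefix + "/") for prefix in BLOG_PATHS)
        if parsed.netloc == host and is_post and url not in found:
            found.append(url)
        if len(found) >= limit:  # limit enforcement per domain
            break
    return found


sample_html = (
    '<a href="/blog">Blog</a>'
    '<a href="/blog/post-a">A</a>'
    '<a href="/blog/post-a#comments">A again</a>'
    '<a href="https://other.example/blog/x">External</a>'
    '<a href="/news/launch/">Launch</a>'
    '<a href="/blog/post-b">B</a>'
)
posts = find_post_urls(sample_html, "https://example.com", limit=2)
```

Path prefixes alone are a coarse heuristic; as the Notes section says, extraction quality depends on how each site is structured.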

Usage Tips

  • URL Format: You can provide URLs with or without https:// - the scraper will normalize them
  • Rate Limiting: The scraper includes automatic delays to be respectful to target websites
  • Post Limits: Maximum 50 posts per domain to prevent excessive scraping
  • Concurrency: Adjust max_concurrency based on target website capacity (default: 2)
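
The URL normalization mentioned in the first tip might look roughly like the sketch below. The exact rules the scraper applies are not documented here, so defaulting to `https://`, lowercasing the host, and trimming a trailing slash are all assumptions, and `normalize_company_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def normalize_company_url(raw: str) -> str:
    """Return a URL with a scheme, a lowercase host, and no trailing slash."""
    candidate = raw.strip()
    if "://" not in candidate:
        # Assumed default: bare domains are treated as HTTPS.
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    if not parsed.netloc:
        raise ValueError(f"not a usable URL: {raw!r}")
    return f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path.rstrip('/')}"


normalized = [
    normalize_company_url(value)
    for value in ["shopify.com", "https://www.Stripe.com/", " ai-bees.io/ "]
]
```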

Use Cases

  • Content Marketing Analysis: Analyze competitor blog strategies
  • Content Aggregation: Collect blog content for research or analysis
  • Market Intelligence: Monitor company announcements and thought leadership
  • SEO Research: Study content patterns and topics from successful blogs
  • Training Data: Collect blog content for ML/AI model training

Notes

  • The scraper respects robots.txt and includes reasonable delays between requests
  • Blog structure varies by website - extraction quality depends on site structure
  • Some blogs may require authentication or have anti-scraping measures
  • Always ensure you have permission to scrape the target websites
