URL: https://apify.com/primeparse/esg-content-scraper

💎 ESG Scraper: Sustainability Reports & PDF Disclosures · Apify


Pricing

from $2.00 / 1,000 results

An ESG (Environmental, Social, and Governance) scraper that automatically extracts sustainability reports, PDF disclosures, articles, and other content from the websites you specify. It produces clean, AI-ready datasets with keyword filtering, metadata extraction, images, links, and full PDF support.

Rating: 5.0 (1)

Developer: PrimeParse (Maintained by Community)

Actor stats

  • Bookmarked: 1
  • Total users: 18
  • Monthly active users: 0
  • Last modified: 4 months ago

🌱 ESG Scraper: Sustainability Reports, Articles & PDF Disclosures Extractor

Enterprise-grade ESG web scraper that automatically extracts sustainability articles, corporate reports, climate news, and PDF disclosures as clean, structured data for investors, compliance teams, or AI training.

High-quality ESG & Sustainability Web Scraper for Investors, Analysts, and AI Teams

It collects ESG articles, sustainability reports, corporate disclosures, climate news, and PDF reports from the sites you specify and returns them clean, structured, and ready for analysis or AI pipelines.

Built for:

  • Sustainable investors & analysts
  • Compliance and risk teams
  • AI/ML engineers building ESG models
  • Researchers and NGOs tracking climate & governance trends

✅ Smart ESG keyword filtering
✅ Full clean article text extraction
✅ PDF sustainability reports parsing
✅ Rich metadata (date, author, description)
✅ ESG-relevant images and related links
✅ AI-ready dataset splitting (overview / full-text / images)

👉 Runs on Apify • No code required • Pay only for compute used


🚀 Why This Scraper

✔ Purpose-Built for ESG Data
Intelligently filters pages using custom ESG keywords (climate, emissions, governance, CSR, net zero, etc.).

✔ Excellent PDF Handling
Full text extraction from sustainability and ESG reports (PDF), with metadata where available.

✔ Clean & Noise-Free Output
Removes ads, navigation, and scripts so that only meaningful content remains.

✔ Rich Structured Data
Title, publication date, author, description, ESG keywords, internal links, relevant images.

✔ AI & ML Ready
Optional splitting into specialized datasets for RAG, LLM fine-tuning, or training.

✔ Fast & Efficient
Powered by Crawlee + Cheerio, which suits static and content-heavy sites (news, corporate pages, PDFs). For heavily JavaScript-rendered sites, results may vary.

✔ Safe & Controlled Crawling
Automatic domain restriction, depth limit (max 3 levels), request limits.
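
The crawl controls listed above (domain restriction, a depth cap of 3, and a request limit) can be illustrated with a small breadth-first traversal. This is only a sketch over an in-memory link graph, not the Actor's actual Crawlee code. The function `crawl_order`, the `links` map, and the convention that start URLs sit at depth 0 are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_urls, links, allowed_domains, max_depth=3, max_requests=500):
    """Return the URLs a bounded breadth-first crawl would visit.

    `links` is a hypothetical in-memory map of page URL -> outgoing URLs,
    standing in for real HTTP fetches.
    """
    def allowed(url):
        host = urlparse(url).hostname or ""
        # Accept the domain itself and any subdomain (e.g. www.example.com).
        return any(host == d or host.endswith("." + d) for d in allowed_domains)

    queue = deque()
    seen = set()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))  # assumption: start URLs are depth 0

    visited = []
    while queue and len(visited) < max_requests:
        url, depth = queue.popleft()
        if not allowed(url):
            continue  # domain restriction
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit: do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

With a chain of pages a → b → c → d → e on one domain, the default settings visit a through d and stop before e, and any link to another domain is skipped.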


💼 Use Cases

  • ESG portfolio screening and risk monitoring
  • Training ESG-focused LLMs or RAG systems
  • Regulatory compliance and disclosure tracking
  • Competitive intelligence on corporate sustainability
  • Academic research on climate and governance trends

📊 Supported Sources

  • ESG news sections (Reuters, Bloomberg, FT, Guardian, etc.)
  • Corporate sustainability / ESG pages
  • Annual sustainability reports (PDF)
  • Climate, emissions, governance disclosures

⚙️ How It Works

  1. Provide start URLs (news sections, corporate pages, PDF links)
  2. Set custom ESG keywords and limits
  3. Run the Actor
  4. Download clean, structured ESG datasets
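
For readers who would rather script these four steps than use the Apify UI, the sketch below prepares a call to Apify's synchronous run endpoint using only the Python standard library. It assumes the Actor ID `primeparse~esg-content-scraper` (derived from the page URL) and an API token of your own. `build_run_request` is a hypothetical helper, and the request is built but not sent.

```python
import json
import urllib.request

# Assumption: Actor ID derived from the store URL (username~actor-name).
ACTOR_ID = "primeparse~esg-content-scraper"
API_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"


def build_run_request(token, run_input):
    """Build (but do not send) a POST request that runs the Actor."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "startUrls": [{"url": "https://www.reuters.com/sustainability/"}],
    "esgKeywords": ["ESG", "climate", "net zero"],
    "maxRequestsPerCrawl": 50,
}
request = build_run_request("YOUR_APIFY_TOKEN", run_input)

# To execute the run (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

This endpoint returns items from the run's default dataset, so with useSeparateDatasets enabled it would likely return only the minimal preview records. Synchronous runs are also time-limited, so larger crawls are usually better started asynchronously and fetched afterwards.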

🧩 Input Configuration

Example JSON Input

{
  "startUrls": [
    {"url": "https://www.reuters.com/sustainability/"},
    {"url": "https://www.weforum.org/stories/technological-innovation/"}
  ],
  "allowedDomains": ["reuters.com"],
  "useApifyProxy": false,
  "maxRequestsPerCrawl": 500,
  "esgKeywords": [
    "ESG",
    "sustainability",
    "climate",
    "emissions",
    "net zero",
    "governance"
  ],
  "extractContent": true,
  "extractMetadata": true,
  "followLinks": true,
  "useSeparateDatasets": true,
  "cleanDefaultDataset": true,
  "proxyUrls": [
    {
      "url": "http://user:pass@host:port"
    }
  ]
}

Key Options

  • startUrls: one or more starting pages or direct PDF links (required)
  • allowedDomains: restricts crawling to specific domains. If empty, crawling is automatically limited to the domains of startUrls
  • maxRequestsPerCrawl: caps the number of requests to control cost and crawl size
  • esgKeywords: custom list for relevance filtering (the default includes common ESG terms)
  • extractContent / extractMetadata: toggle full-text or metadata extraction
  • followLinks: enables internal crawling (limited to depth 3 for safety)
  • useSeparateDatasets: recommended for large runs and AI workflows
  • cleanDefaultDataset: clears data from previous runs
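
The allowedDomains fallback described above can be pictured as follows. This sketch assumes the fallback uses each start URL's hostname with a leading `www.` stripped. The Actor's real normalization is not documented on this page, and `effective_allowed_domains` is a hypothetical name.

```python
from urllib.parse import urlparse


def effective_allowed_domains(run_input):
    """Return allowedDomains, falling back to the domains of startUrls when empty."""
    explicit = run_input.get("allowedDomains") or []
    if explicit:
        return list(explicit)
    domains = []
    for entry in run_input.get("startUrls", []):
        host = urlparse(entry["url"]).hostname or ""
        # Assumption: treat www.example.com and example.com as the same domain.
        if host.startswith("www."):
            host = host[4:]
        if host and host not in domains:
            domains.append(host)
    return domains
```

Under this reading, an input that lists start URLs on two sites but names only one of them in allowedDomains would confine the crawl to that one site, so it is worth checking that the two fields agree before a large run.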

📂 Output Datasets

When useSeparateDatasets: true (recommended):

  • esg-overview (primary): lightweight metadata for fast analysis
  • esg-full-content: long articles (>5000 characters)
  • esg-images: ESG-relevant images with context
  • Default dataset: minimal preview records (for Apify UI visibility)

When useSeparateDatasets: false:

  • Single dataset with full detailed records
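
One way to picture the split is the routing function below. It is a sketch, not the Actor's implementation. The 5,000-character threshold comes from the list above, while the choice of fields placed in each dataset, the `pageUrl` key, and the name `split_record` are assumptions.

```python
FULL_CONTENT_MIN_CHARS = 5000  # threshold stated for esg-full-content


def split_record(record):
    """Route one scraped record into per-dataset rows keyed by dataset name."""
    overview_fields = (
        "url", "title", "publishedDate", "author", "description", "esgKeywords",
    )
    rows = {
        # Assumption: the overview keeps lightweight metadata only.
        "esg-overview": [{k: record[k] for k in overview_fields if k in record}],
        "esg-full-content": [],
        "esg-images": [],
    }
    content = record.get("content", "")
    if len(content) > FULL_CONTENT_MIN_CHARS:
        rows["esg-full-content"].append({"url": record["url"], "content": content})
    for image in record.get("images", []):
        # Keep the source page URL so each image retains its context.
        rows["esg-images"].append({"pageUrl": record["url"], **image})
    return rows
```

Keeping the page URL on every row means the three datasets can be joined again later, for example to attach full text to an overview entry in a RAG pipeline.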

Example Output Record (Full Mode)

{
  "url": "https://www.reuters.com/sustainability/example",
  "title": "Companies strengthen climate commitments",
  "scrapedAt": "2025-12-15T10:30:45Z",
  "publishedDate": "2025-12-10",
  "author": "Jane Doe",
  "description": "Major firms enhance ESG targets...",
  "content": "Full clean article text...\n\nParagraphs preserved...",
  "esgKeywords": ["climate", "emissions", "sustainability"],
  "relatedLinks": [
    {
      "url": "https://www.reuters.com/sustainability/esg-guide",
      "text": "ESG Explained"
    }
  ],
  "images": [
    {
      "url": "https://reuters.com/chart-netzero.jpg",
      "alt": "Net zero emissions progress"
    }
  ]
}

PDF Example

{
  "url": "https://company.com/sustainability-2024.pdf",
  "title": "Annual Sustainability Report 2024",
  "content": "Full extracted report text...",
  "esgKeywords": ["sustainability", "carbon", "governance"],
  "type": "PDF",
  "author": "Corporate Sustainability Team",
  "publishedDate": "2024-03-15"
}
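
The esgKeywords field in the records above appears to list the configured terms that were found on a page. A minimal sketch of such matching follows, assuming case-insensitive whole-word or whole-phrase comparison. The Actor's actual matching rules may differ, and `match_keywords` is a hypothetical helper.

```python
import re


def match_keywords(text, keywords):
    """Return the keywords that occur in text as whole words or phrases, ignoring case."""
    found = []
    for keyword in keywords:
        # \b keeps "climate" from matching inside "acclimated".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found
```

The same helper can be run over downloaded records to re-check relevance against a stricter keyword list without crawling again.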

🏁 Getting Started

  1. Click "Try for free" on Apify
  2. Paste ESG/sustainability URLs or direct PDF links
  3. Customize keywords and limits
  4. Run and download your dataset

📧 Support

Tags: ESG, sustainability, web scraping, PDF extraction, climate data, corporate governance, RAG, LLM training, sustainable investing, compliance monitoring

Built with ❤️ on Apify
