Pricing
from $10.00 / 1,000 esg extractions
Company ESG & Sustainability Data Extractor
Extract ESG and sustainability metrics, carbon commitments, and net-zero targets from public company sustainability pages. Structured JSON output for finance, research, and procurement teams.
What this Actor does
Extract ESG and sustainability metrics, carbon commitments, and net-zero targets from public company sustainability and ESG report web pages that you supply.
It processes user-provided public URLs, reads schema.org Organization JSON-LD for the company name, scans visible page text for ESG keywords grouped by metric category (carbon, energy, water, waste, diversity, governance), pairs those keywords with nearby numeric values and units, and optionally captures net-zero and reduction-target commitment sentences. It normalizes useful fields, deduplicates rows, and saves structured records to the Apify dataset.
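The keyword-and-value pairing described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual implementation: the keyword library, the unit list, the 60-character window, and the function name extract_metrics are all assumptions made for the example.

```python
import re

# Hypothetical keyword library grouped by metric category (assumption).
KEYWORDS = {
    "carbon": ["scope 1 emissions", "scope 2 emissions", "scope 3 emissions"],
    "energy": ["renewable electricity", "energy consumption"],
    "water": ["water withdrawal"],
}

# A number with optional thousands separators or decimals, then an optional unit.
# The unit list is illustrative only.
VALUE_RE = re.compile(
    r"(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)\s*"
    r"(?P<unit>%|tCO2e|MWh|GWh|m3)?"
)


def extract_metrics(text, window=60):
    """Pair each keyword hit with the first number found shortly after it."""
    rows = []
    seen = set()
    for category, names in KEYWORDS.items():
        for name in names:
            for hit in re.finditer(re.escape(name), text, re.IGNORECASE):
                # Only look at a short span of text after the keyword.
                nearby = text[hit.end():hit.end() + window]
                value = VALUE_RE.search(nearby)
                if value is None:
                    continue
                number = float(value.group("value").replace(",", ""))
                key = (category, name, number, value.group("unit"))
                if key in seen:  # drop exact duplicate rows
                    continue
                seen.add(key)
                rows.append({
                    "metricCategory": category,
                    "metricName": name,
                    "metricValue": number,
                    "unit": value.group("unit"),
                })
    return rows
```

A heuristic like this can attach the wrong number to a keyword (for example a year or a footnote marker), which is why text-derived rows carry a lower confidence score.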
Why this Actor is useful
Sustainability analysts, investors, and procurement teams pay for this kind of extraction because it converts unstructured ESG narrative reports into clean, comparable datasets. It saves manual reading, creates repeatable monitoring, feeds spreadsheets, dashboards, or scoring models, and turns public ESG pages into API-ready data.
Who this is for
- ESG and sustainability analysts
- Investment and ESG research teams
- Corporate sustainability and procurement teams
- Data providers and ESG rating builders
- Journalists and NGOs tracking corporate climate claims
- B2B teams enriching company sustainability profiles
Common use cases
- Build comparable ESG metric datasets across many companies
- Track net-zero and carbon-neutral commitments and target years
- Monitor reported Scope 1/2/3 emissions over time
- Enrich company profiles with sustainability data points
- Feed ESG scoring or screening models
Input
- startUrls: Public URLs to extract from. Use only pages you are allowed to access without login or bypassing access controls.
- keywords: Optional additional ESG or sustainability terms to match on top of the built-in keyword library.
- includeCommitments: Capture net-zero, carbon-neutral, and reduction-target sentences as commitment rows with an extracted target year.
- maxItems: Maximum number of rows to save.
- maxConcurrency: Number of pages processed in parallel. The default is intentionally conservative.
- requestTimeoutSecs: Maximum time to spend on a single page.
- proxyConfiguration: Optional Apify proxy configuration where permitted by your source review.
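To illustrate what includeCommitments produces, here is a rough sketch of how a commitment sentence and its target year might be picked out of page text. The trigger phrases, the naive sentence splitter, and the rule of taking the latest year in the sentence as the target are assumptions for this example, not the Actor's documented behavior.

```python
import re

# Illustrative trigger phrases (assumption).
COMMITMENT_RE = re.compile(
    r"net[- ]zero|carbon[- ]neutral|reduction target", re.IGNORECASE
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_commitments(text):
    """Return one commitment row per sentence that contains a trigger phrase."""
    rows = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not COMMITMENT_RE.search(sentence):
            continue
        years = [int(y) for y in YEAR_RE.findall(sentence)]
        rows.append({
            "metricCategory": "commitment",
            "commitmentText": sentence.strip(),
            # Assumption: the latest year mentioned is the target,
            # earlier years are treated as baselines.
            "targetYear": max(years) if years else None,
        })
    return rows
```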
Output
- companyName: Company name when exposed in Organization structured data.
- sourceUrl: URL where the data was extracted.
- metricCategory: Category such as carbon, energy, water, waste, diversity, governance, commitment, or other.
- metricName: The matched metric label (for example, Scope 1 emissions).
- metricValue: The numeric value found near the metric keyword.
- unit: Detected unit such as %, tCO2e, MWh, or similar.
- reportingYear: Reporting year detected in the same sentence when available.
- targetYear: Target year detected for commitment rows.
- commitmentText: The captured net-zero or reduction-target sentence.
- framework: Reporting frameworks referenced on the page (GRI, SASB, TCFD, CDP, SDG).
- extractedAt: Timestamp when this Actor extracted the row.
- extractionMethod: structured_data, text_extraction, or commitment_text.
- confidenceScore: Heuristic confidence score (structured 0.9, text-derived 0.6-0.8).
- missingFields: Required fields (companyName, metricName, metricValue, reportingYear) not available from the source page.
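The missingFields column can be reproduced or double-checked downstream with a few lines. The required field names come from the description above. Treating an empty string as missing, and the helper name missing_fields, are assumptions for this sketch.

```python
# Required fields as listed in the output description.
REQUIRED_FIELDS = ("companyName", "metricName", "metricValue", "reportingYear")


def missing_fields(row):
    """List required fields that are absent, None, or an empty string."""
    # A value of 0 is a real value and is not reported as missing.
    return [name for name in REQUIRED_FIELDS if row.get(name) in (None, "")]
```

Recomputing the list this way is a cheap consistency check before rows are loaded into a scoring model.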
Sample input
{
  "startUrls": [{"url": "https://example.com/"}],
  "keywords": [],
  "includeCommitments": true,
  "maxItems": 50,
  "maxConcurrency": 3,
  "requestTimeoutSecs": 30
}
Sample output
{
  "companyName": "Example Manufacturing Group",
  "sourceUrl": "https://example.com/",
  "metricCategory": "carbon",
  "metricName": "Scope 1 emissions",
  "metricValue": 125000,
  "unit": "tCO2e",
  "reportingYear": 2024,
  "targetYear": null,
  "commitmentText": null,
  "framework": "GRI",
  "extractedAt": "2026-06-12T00:00:00.000Z",
  "extractionMethod": "structured_data",
  "confidenceScore": 0.9,
  "missingFields": []
}
How to use
Run this Actor on Apify with public URLs, export the dataset as JSON, CSV, Excel, or through the Apify API, then connect the output to Google Sheets, Make, Zapier, a webhook, your CRM, or an internal dashboard. For monitoring, save the input as an Apify task and schedule recurring runs.
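If you export the dataset as JSON and want a local CSV with a fixed column order, a small post-processing step like the one below may help. It uses only the Python standard library. The column order, the semicolon used to join missingFields, and the function name rows_to_csv are choices made for this sketch.

```python
import csv
import io

# Column order follows the output field list; adjust as needed.
FIELDS = [
    "companyName", "sourceUrl", "metricCategory", "metricName", "metricValue",
    "unit", "reportingYear", "targetYear", "commitmentText", "framework",
    "extractedAt", "extractionMethod", "confidenceScore", "missingFields",
]


def rows_to_csv(rows):
    """Serialize dataset rows (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # missingFields is a list; join it so it fits in one cell.
        flat["missingFields"] = ";".join(row.get("missingFields") or [])
        writer.writerow(flat)  # None values are written as empty cells
    return buffer.getvalue()
```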
Pricing
This Actor uses a pay-per-event model: $0.01 per extraction. You pay only for the structured rows the Actor produces, which keeps costs predictable and tied directly to delivered data.
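As a quick worked example of the arithmetic, the per-row price can be turned into a run estimate. The $0.01 figure comes from the text above. The helper name estimate_cost is hypothetical, and the sketch assumes every saved row is billed as exactly one extraction event and ignores any other platform charges.

```python
from decimal import Decimal

# Price per extraction event, taken from the pricing text.
PRICE_PER_EXTRACTION_USD = Decimal("0.01")


def estimate_cost(row_count, price=PRICE_PER_EXTRACTION_USD):
    """Estimate the extraction cost in USD for a given number of saved rows."""
    return price * row_count


# With maxItems set to 50, the extraction cost is capped at 50 rows.
print(estimate_cost(50))    # 0.50
print(estimate_cost(1000))  # 10.00
```

Because maxItems limits the number of saved rows, it also serves as an upper bound on the extraction charge for a single run under these assumptions.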
Best practices
- Start with a small set of reviewed public ESG and sustainability report URLs.
- Prefer the main sustainability or ESG data pages rather than PDF download links.
- Add domain-specific terms via keywords when a company uses non-standard metric names.
- Keep includeCommitments enabled to capture net-zero and target language.
- Keep maxConcurrency low for smaller websites.
- Review source website rules before scheduling recurring runs.
- Treat text-derived values as candidates for human review before downstream scoring.
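The last practice above can be sketched as a simple triage step that separates rows into an accepted set and a review queue. The 0.9 threshold matches the structured-data score mentioned in the output description. The rule itself and the function name split_for_review are assumptions for illustration.

```python
def split_for_review(rows, threshold=0.9):
    """Split rows into (accepted, needs_review) lists.

    A row goes to review when it was not read from structured data
    or when its confidence score is below the threshold.
    """
    accepted, needs_review = [], []
    for row in rows:
        text_derived = row.get("extractionMethod") != "structured_data"
        low_confidence = (row.get("confidenceScore") or 0) < threshold
        if text_derived or low_confidence:
            needs_review.append(row)
        else:
            accepted.append(row)
    return accepted, needs_review
```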
Compliance and responsible use
This Actor is for public data only. It must not be used to bypass logins, paywalls, CAPTCHAs, or security systems, collect private data, gather sensitive personal data, or support spam or abuse. You are responsible for following applicable laws and source website rules.
Limitations
- Output quality depends on the public ESG content available on the source pages.
- Text-derived extraction is heuristic. Numeric values and units are matched near keywords and may need human verification before use in scoring.
- The Actor reads HTML pages and does not parse PDF reports.
- Some fields may be empty when the source does not publish them, and they are reported in missingFields rather than inferred.
- The Actor does not claim support for any specific third-party ESG platform.
- Website markup and access policies can change.
Troubleshooting
- Empty output usually means the page has no recognizable ESG keywords paired with numeric values.
- Invalid URL errors mean one or more input URLs are malformed.
- Slow runs can usually be improved by lowering maxConcurrency.
- Missing fields are source-data limitations, not inferred values.
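To catch malformed URLs before starting a run, the input can be pre-checked locally. This is a minimal sketch using the Python standard library. The Actor's own validation rules are not documented here, so the checks (http or https scheme plus a host) and the function name invalid_urls are assumptions.

```python
from urllib.parse import urlparse


def invalid_urls(start_urls):
    """Return the URLs from a startUrls list that look malformed."""
    bad = []
    for entry in start_urls:
        url = entry.get("url", "")
        parsed = urlparse(url)
        # Require an http(s) scheme and a non-empty host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(url)
    return bad
```

An empty result means every entry passed this basic check. It does not guarantee that the page is reachable or contains ESG content.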
Changelog
- v0.2.0: Production-readiness pass with improved positioning, samples, schema descriptions, and responsible-use notes.
- v0.1.0: Initial dry-run, factory-generated MVP.
