AI Web Scraper β Structured Data From Any URL
Pricing
from $20.00 / 1,000 page processeds
AI Web Scraper β Structured Data From Any URL
Extract structured data from any website using an LLM and your own field schema β no CSS selectors. Give it URLs and the fields you want; get clean JSON rows back. Works on blogs, job boards, product pages, listings, and more.
Pricing
from $20.00 / 1,000 page processeds
Rating
0.0
(0)
Developer
Actor stats
0
Bookmarked
1
Total users
0
Monthly active users
11 days ago
Last modified
Categories
Share
Extract structured data from any website using an LLM and your own field schema β no CSS selectors, no per-site code. Give it URLs and the fields you want; get back clean JSON rows. Built for the messy long tail of sites that off-the-shelf scrapers don't cover: blogs, job boards, product pages, directories, listings, and more.
Export results, run via API, schedule and monitor runs, or integrate with other tools and AI agents.
How it works
- You provide one or more URLs and a list of fields (name + short description).
- The actor fetches each page, converts it to clean text, and asks an LLM to return JSON matching your fields.
- You get one row per record (or one row per repeating item in list mode).
No selectors to maintain. When a site changes its HTML, the LLM still finds your fields.
Input
| Field | Type | Description |
|---|---|---|
startUrls | array | The page URLs to extract from. |
fields | array | What to extract β [{ "name": "title", "description": "the product title", "type": "string" }]. |
listMode | boolean | ON = one row per repeating item on the page (grids, listings). OFF = one row per page. |
model | string | OpenRouter model slug (default openai/gpt-4o-mini). |
maxItems | integer | Cap on total output rows. |
maxCrawlPages | integer | Cap on pages fetched. |
maxContentChars | integer | How much page text to send to the model (cost control). |
proxyConfiguration | object | Apify proxy settings (datacenter by default). |
Example input
{"startUrls":[{"url":"https://quotes.toscrape.com"}],"fields":[{"name":"text","description":"the full quote text"},{"name":"author","description":"who said it"},{"name":"tags","description":"list of tag labels","type":"array"}],"listMode":true,"model":"openai/gpt-4o-mini"}
API key (required)
Extraction runs through OpenRouter β set a single environment variable on the actor (Console β Settings β Environment variables):
OPENROUTER_API_KEY= sk-or-...
Pick any model via the model input β cheap models like openai/gpt-4o-mini or google/gemini-2.5-flash handle most structured extraction well. You pay OpenRouter directly for model usage; the actor's PPE events cover the extraction layer.
Output
Every row contains source_url, scraped_at, error, plus your fields:
{"text":"The world as we have created it is a process of our thinking.","author":"Albert Einstein","tags":["change","deep-thoughts","thinking","world"],"source_url":"https://quotes.toscrape.com","scraped_at":"2026-06-07T12:00:00.000Z","error":null}
Pricing (Pay Per Event)
| Event | When |
|---|---|
actor-start | Once per run. |
page-processed | Each page successfully fetched and extracted (one LLM call). |
Failed pages (fetch error, model error, missing key) are not charged.
Use cases
- RAG / AI pipelines β turn arbitrary pages into clean structured records.
- Long-tail sites β scrape sites with no dedicated actor.
- Listings & directories β pull every item from a results page with
listMode. - Monitoring β schedule extraction of the same fields over time.
Tips
- Write clear field descriptions β they're the instructions the model follows.
- Use
listModefor pages with many repeating records; turn it off for single detail pages. - For JS-heavy sites where text is missing, increase
maxContentCharsor use a richer model.
