Pricing
from $0.00005 / actor start
AI Web Crawler
Deprecated
Extract structured data from any website using AI. No custom selectors needed.
AI Web Scraper: GPT-Powered Data Extraction
Extract structured data from any website using AI. No custom selectors needed: just a URL and natural-language instructions. Supports OpenAI, OpenRouter, LM Studio, Ollama, Groq, and any OpenAI-compatible API.
What It Does
AI Web Scraper uses GPT-4o-mini (or GPT-4o/GPT-4.1) to intelligently extract structured data from any webpage. Unlike traditional scrapers that require specific CSS selectors or XPath expressions, this Actor understands natural language instructions and adapts to any website structure.
Key Features
- Natural Language Extraction: Describe what you want in plain English; GPT does the rest
- Universal Compatibility: Works on any website without custom coding per site
- Structured JSON Output: Returns clean, parseable data pushed to the Apify Dataset
- Multi-Page Support: Automatic pagination handling (up to 50 pages)
- Fast Processing: Pages are processed in seconds with headless Playwright
- Anti-Detection: Blocks images and ads and uses a realistic user agent
- Multiple AI Models: gpt-4o-mini, gpt-4o, gpt-4.1 (or any OpenAI-compatible API)
Use Cases
| Industry | What to Extract |
|---|---|
| E-commerce | Product names, prices, ratings, descriptions, review counts |
| Real Estate | Listings, prices, locations, agent info, property details |
| Lead Generation | Company names, emails, phone numbers, social profiles |
| Job Boards | Job titles, salaries, companies, locations, requirements |
| Research | Articles, papers, reviews, social media content |
| SEO | Meta tags, headings, content structure, internal links |
Input Schema
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL to scrape |
| `prompt` | string | Yes | - | What data to extract (natural language) |
| `apiKey` | string | No | env `OPENAI_API_KEY` | OpenAI API key (sk-...) |
| `model` | string | No | `gpt-4o-mini` | AI model: gpt-4o-mini, gpt-4o, gpt-4.1 |
| `maxPages` | integer | No | 1 | Max pages to process (1-50) |
| `waitForSelector` | string | No | - | CSS selector to wait for before extracting |
Example Input

    {
      "url": "https://www.example.com/products",
      "prompt": "Extract all product names, prices, ratings, and review counts",
      "model": "gpt-4o-mini",
      "maxPages": 3
    }
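The constraints in the Input Schema table can be checked on the client side before a run is started. The following is a minimal sketch, not part of the Actor: `build_run_input` is a hypothetical helper, and the checks only mirror what the table states (required `url` and `prompt`, `maxPages` between 1 and 50, `gpt-4o-mini` as the default model).

```python
from urllib.parse import urlparse


def build_run_input(url, prompt, model="gpt-4o-mini", max_pages=1,
                    wait_for_selector=None, api_key=None):
    """Hypothetical helper: validate values against the Input Schema table
    and return a dict shaped like the Example Input."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if not 1 <= max_pages <= 50:
        raise ValueError("maxPages must be between 1 and 50")

    run_input = {
        "url": url,
        "prompt": prompt.strip(),
        "model": model,
        "maxPages": max_pages,
    }
    # Optional fields are only included when a value was provided.
    if wait_for_selector:
        run_input["waitForSelector"] = wait_for_selector
    if api_key:
        run_input["apiKey"] = api_key
    return run_input
```

The returned dict can be passed as `run_input` to the Python client or serialized as the JSON body of an API call.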
Output
Each extracted item is pushed to the Apify Dataset as a separate record with these standard fields:
| Field | Type | Description |
|---|---|---|
| `title` | string | Title or name of the extracted item |
| `description` | string | Description or summary |
| `price` | string | Price value if available |
| `url` | string | Source URL of the item |
| `image_url` | string | Image URL if available |
| `rating` | number | Rating score (0-5 scale) |
| `reviews_count` | integer | Number of reviews |
| `availability` | string | Availability status |
| `category` | string | Category or type |
| `source_page` | string | Page where item was found |
| `extracted_at` | datetime | ISO timestamp of extraction |

Note: Field names are dynamic; GPT determines them based on your prompt. The schema above covers common extraction patterns for products/listings.
Example Output

    [
      {
        "title": "Wireless Headphones Pro",
        "price": "$79.99",
        "rating": 4.5,
        "reviews_count": 1234,
        "url": "https://example.com/products/wireless-headphones-pro"
      },
      {
        "title": "Bluetooth Speaker",
        "price": "$49.99",
        "rating": 4.2,
        "reviews_count": 856,
        "url": "https://example.com/products/bluetooth-speaker"
      }
    ]
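Because `price` comes back as a string and fields may be missing when GPT chooses different names, downstream code usually needs a small normalization step. The sketch below is one possible approach, not something the Actor provides: `parse_price` and `normalize_item` are hypothetical helpers, and the price parsing assumes US-style formatting such as "$1,299.00".

```python
import re


def parse_price(value):
    """Return the numeric part of a price string such as "$79.99", or None.

    Assumes US-style formatting: "," as thousands separator, "." as decimal.
    """
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def normalize_item(item):
    """Copy a dataset record, add a numeric price, and fill absent fields."""
    normalized = dict(item)
    normalized["price_value"] = parse_price(item.get("price"))
    # Field names are dynamic, so these keys may be absent from a record.
    normalized.setdefault("rating", None)
    normalized.setdefault("reviews_count", None)
    return normalized
```

For the first record in the Example Output, `normalize_item` keeps the original fields and adds a numeric `price_value` next to the original `price` string.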
How to Use
Option 1: Run via Apify Console
- Go to Apify Console
- Find "AI Web Scraper" in the Store
- Click "Try for free" or "Run Actor"
- Enter your URL and extraction prompt
- Click "Run"; results appear in the Dataset
Option 2: Run via API

    curl -X POST "https://api.apify.com/v2/acts/gek0v~ai-web-scraper/runs" \
      -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com/products", "prompt": "Extract product names and prices", "model": "gpt-4o-mini"}'
Option 3: Python SDK

    from apify_client import ApifyClient

    client = ApifyClient("your-apify-token")

    # Start the Actor and wait for the run to finish
    run = client.actor("gek0v/ai-web-scraper").call(
        run_input={
            "url": "https://example.com",
            "prompt": "Extract all article titles and authors",
            "model": "gpt-4o-mini",
        }
    )

    # Read the extracted items from the run's default dataset
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item)
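Since field names vary with the prompt, exporting items to a flat file needs a header built from whatever keys actually appear. The following is a minimal sketch using only the standard library; `items_to_csv` is a hypothetical helper that takes a list of item dicts (for example, collected from `iterate_items()`), and it assumes the values are flat scalars rather than nested objects.

```python
import csv
import io


def items_to_csv(items):
    """Serialize item dicts to CSV text, using the union of all keys as header."""
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order

    buffer = io.StringIO()
    # restval fills the cells of fields that a given item does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

Items that lack a field get an empty cell, so records with different shapes can share one file.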
Pricing
| Component | Cost |
|---|---|
| Actor Compute (Actor Start) | ~$0.000002/run (based on memory allocation) |
| Dataset Storage | ~$0.002 per stored item |
| Platform Fee | 20% of compute + storage costs |
| OpenAI GPT API | Passed directly to user at model pricing |
Typical cost per run: most extractions cost under $0.01 with gpt-4o-mini, plus ~$0.002 per extracted item stored.
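As a rough illustration of how the components in the table might add up, the sketch below combines them for a single run. The rates are copied from the table; the way they combine (platform fee added on top of compute plus storage, model cost added as a separate pass-through amount) is an assumption, and `estimate_run_cost` is a hypothetical helper rather than an official calculator.

```python
# Approximate rates taken from the pricing table above.
COMPUTE_PER_RUN_USD = 0.000002
STORAGE_PER_ITEM_USD = 0.002
PLATFORM_FEE_RATE = 0.20  # 20% of compute + storage


def estimate_run_cost(items_stored, llm_cost_usd=0.0):
    """Estimate the cost of one run in USD.

    Assumption: the platform fee is charged in addition to compute and
    storage, and the model API cost is passed through unchanged.
    """
    compute_and_storage = COMPUTE_PER_RUN_USD + STORAGE_PER_ITEM_USD * items_stored
    platform_fee = PLATFORM_FEE_RATE * compute_and_storage
    return compute_and_storage + platform_fee + llm_cost_usd
```

Under these assumptions, storage dominates the platform-side cost as soon as a run stores more than a handful of items.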
Local Development

    # Clone
    git clone https://github.com/gek0v/ai-web-scraper.git
    cd ai-web-scraper

    # Install dependencies
    pip install -r requirements.txt

    # Run locally
    python src/main.py --input '{"url": "https://example.com", "prompt": "Extract all headings"}'
Tips for Best Results
- Be specific in your prompt: "Extract product name, price in USD, and star rating" works better than "extract product info"
- Test with gpt-4o-mini first: It's 10x cheaper and often good enough. Upgrade to gpt-4o for complex pages
- Use `waitForSelector`: For dynamic SPAs (React, Vue, Angular), wait for the content container
- Limit `maxPages`: Start with 1 page to test, then scale up
- Provide your API key: Set the `OPENAI_API_KEY` env var or pass it via input
Limitations
- Very large pages (>100K chars) are truncated to fit GPT's context window
- JavaScript-heavy SPAs may need `waitForSelector` for rendering
- Some anti-bot protections (Cloudflare, etc.) may block access
- GPT costs are passed through to the user (OpenAI/compatible API pricing applies)
- Requires an OpenAI-compatible API key (not included)
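The truncation limit above means that content near the end of a long page may never reach the model. The sketch below shows one way such a limit could work; it is an illustration only, since the Actor's actual truncation strategy is not documented here. `truncate_page_text` is a hypothetical function, and cutting at the last space before the limit is an assumption.

```python
MAX_CHARS = 100_000  # limit stated in the Limitations list


def truncate_page_text(text, max_chars=MAX_CHARS):
    """Return (text, was_truncated), cutting at a word boundary when possible."""
    if len(text) <= max_chars:
        return text, False
    # Look for the last space inside the allowed window to avoid splitting a word.
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:
        cut = max_chars  # no usable space found: hard cut at the limit
    return text[:cut], True
```

A `True` flag is a hint to narrow the page, for example by targeting a more specific URL, before rerunning the extraction.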
License
MIT License: free to use and modify.
Tags
web-scraping artificial-intelligence data-extraction playwright gpt automation developer-tools
