URL: https://apify.com/incredible_moment/llm-scraper

LLM Web Scraper (AI Extraction Tool) · Apify


Pricing

from $2.00 / 1,000 scraped pages

Turn any website into structured JSON using AI. Supports OpenAI GPT-4 and Anthropic Claude. Built in Rust to minimize compute costs while waiting for LLM responses. Extract data without selectors.

Rating: 0.0 (0)

Developer: Daniel Rosen (Maintained by Community)

Actor stats: 0 bookmarked, 8 total users, 1 monthly active user, last modified 4 months ago

Turn any website into structured JSON data using OpenAI (GPT) or Anthropic (Claude) models. This Actor fetches HTML, cleans it, and uses a Large Language Model (LLM) to extract specific data fields based on a schema you provide.

What This Does

Traditional scraping requires writing brittle CSS selectors or regular expressions for every field. This Actor instead relies on the LLM's semantic understanding to locate and extract data, which makes it more resilient to layout changes.

  1. Fetches: Downloads the webpage (supports User-Agent rotation).
  2. Cleans: Prunes scripts, styles, and ads to minimize token usage.
  3. Extracts: Sends the content to an LLM alongside your JSON schema.
  4. Validates: Returns structured JSON matching your definition.
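
Steps 2 and 3 can be illustrated with a minimal Python sketch. This is not the Actor's implementation (the Actor is written in Rust and its internals are not shown here). The names `SKIPPED_TAGS`, `clean_html`, and `build_prompt` are hypothetical, the tag list is borrowed from the "Clean Input" note under Optimization & Costs, and the prompt wording is an assumption.

```python
import json
from html.parser import HTMLParser

# Tags whose contents are dropped entirely. Assumption: the Actor's real
# pruning rules may cover more than these three tags.
SKIPPED_TAGS = {"script", "style", "nav"}


class TextCleaner(HTMLParser):
    """Collects visible text while skipping the contents of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_html(html):
    """Step 2 (Cleans): reduce HTML to the text worth sending to the LLM."""
    cleaner = TextCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "\n".join(cleaner.chunks)


def build_prompt(page_text, schema, instructions=""):
    """Step 3 (Extracts): combine the schema and cleaned text into one prompt.

    The wording is hypothetical; the Actor's actual prompt is not documented.
    """
    parts = [
        "Extract data from the page below as JSON matching this schema:",
        json.dumps(schema, indent=2),
    ]
    if instructions:
        parts.append("Additional instructions: " + instructions)
    parts.append("Page content:\n" + page_text)
    return "\n\n".join(parts)
```

Dropping whole elements before building the prompt is what keeps token usage down: the text inside a script, stylesheet, or navigation block never reaches the model.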

Use Cases

  • E-commerce: Extract product details (price, specs, availability) from diverse layouts.
  • News Aggregation: Normalize article content, authors, and dates into a standard format.
  • Lead Generation: Extract contact info and company details from "About Us" pages.
  • Financial Data: Parse unstructured tables and reports into usable JSON.

Input

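Provide a target URL and a schema describing the data you want. In the example below, the schema is a JSON template that maps each field name to a type name such as "string" or "number". You must also provide either an OpenAI or an Anthropic API key.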

{
  "url": "https://news.ycombinator.com",
  "schema": {
    "stories": [
      {
        "title": "string",
        "points": "number",
        "author": "string"
      }
    ]
  },
  "openaiApiKey": "sk-...",
  "model": "gpt-4o",
  "maxTokens": 2000,
  "selector": "table.itemlist"
}

Configuration Parameters

Field            Type     Required  Description
url              String   Yes       The target URL to scrape.
schema           JSON     Yes       The structure you want the AI to extract.
openaiApiKey     String   No*       OpenAI API Key (Required if using GPT models).
anthropicApiKey  String   No*       Anthropic API Key (Required if using Claude models).
model            String   No        Model selection (e.g., gpt-4o, claude-3-5-sonnet).
selector         String   No        CSS selector to limit the scope (e.g., main#content).
instructions     String   No        Specific guidance for the AI (e.g., "Exclude ads").
maxTokens        Integer  No        Limit response size (Default: 4096).

*One of the two API keys is required.
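
The parameter rules above can be checked before starting a run. The sketch below builds the input object and rejects a request that lacks both API keys. `build_run_input` and `run_actor` are hypothetical helper names. `run_actor` uses the third-party `apify-client` package, takes the Actor ID from this page's URL, and assumes results land in the run's default dataset, which this page does not confirm.

```python
def build_run_input(url, schema, openai_api_key=None, anthropic_api_key=None,
                    model=None, selector=None, instructions=None,
                    max_tokens=None):
    """Assemble the Actor input, enforcing the required fields."""
    if not url or not schema:
        raise ValueError("url and schema are required")
    if not (openai_api_key or anthropic_api_key):
        raise ValueError("provide openaiApiKey or anthropicApiKey")

    run_input = {"url": url, "schema": schema}
    optional = {
        "openaiApiKey": openai_api_key,
        "anthropicApiKey": anthropic_api_key,
        "model": model,
        "selector": selector,
        "instructions": instructions,
        "maxTokens": max_tokens,
    }
    # Leave out unset options so the Actor applies its own defaults.
    run_input.update({k: v for k, v in optional.items() if v is not None})
    return run_input


def run_actor(apify_token, run_input):
    """Start the Actor and return its dataset items (needs network access)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(apify_token)
    run = client.actor("incredible_moment/llm-scraper").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish")
    # Assumption: the extracted JSON is stored in the default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Validating locally avoids paying for a run that would fail on a missing key.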

Output

The Actor outputs a JSON object containing the extracted data and metadata about the run.

{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {
    "stories": [
      {"title": "Rust vs C++", "points": 156, "author": "dev_user"},
      {"title": "New AI Model", "points": 42, "author": "ai_researcher"}
    ]
  }
}
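
Because LLM output can drift from the requested shape, it may be worth re-checking the result on the consumer side. The following sketch compares the `data` field against the schema template from the Input section. The helper names `matches` and `check_result` are hypothetical, and the sketch assumes the only type names are "string" and "number", the two that appear in the example.

```python
def matches(template, value):
    """Return True if value has the shape described by the schema template."""
    if isinstance(template, dict):
        return isinstance(value, dict) and all(
            key in value and matches(sub, value[key])
            for key, sub in template.items()
        )
    if isinstance(template, list):
        if not isinstance(value, list):
            return False
        if not template:  # an empty template list accepts any list
            return True
        # The first element of the template describes every item.
        return all(matches(template[0], item) for item in value)
    if template == "string":
        return isinstance(value, str)
    if template == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False  # unknown type name


def check_result(result, schema):
    """Accept a run only if it succeeded and its data matches the schema."""
    return result.get("success") is True and matches(schema, result.get("data"))
```

Applied to the example output above with the schema from the Input section, `check_result` returns `True`. It returns `False` if, for instance, `points` comes back as the string "156".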

Optimization & Costs

LLM scraping involves two costs: the Apify run cost and your external LLM API usage. To minimize both:

  1. Use Selectors: Provide a CSS selector (e.g., div.product-details) whenever the page allows it. This discards headers, footers, and sidebars before the text is sent to the AI, which can significantly reduce your token bill.
  2. Choose the Right Model: gpt-4o and claude-3-5-sonnet offer the best balance of speed and intelligence. Smaller models are cheaper but may struggle with complex schemas.
  3. Clean Input: The Actor automatically removes <script>, <style>, and <nav> tags to ensure high-quality context for the AI.
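
The two cost components can be estimated with simple arithmetic. The sketch below uses the listed Apify price of $2.00 per 1,000 pages (a "from" price, so the actual rate may be higher). The LLM rate is a caller-supplied assumption, since provider pricing is not given here, and a single blended per-token rate is a simplification because providers usually price input and output tokens separately. `estimate_cost` is a hypothetical helper name.

```python
APIFY_USD_PER_1000_PAGES = 2.00  # listed "from" price on this page


def estimate_cost(pages, tokens_per_page, llm_usd_per_million_tokens):
    """Return (apify_usd, llm_usd) for a batch of scraped pages.

    llm_usd_per_million_tokens is an assumed blended rate; look up your
    provider's current pricing before relying on the result.
    """
    apify_usd = pages * APIFY_USD_PER_1000_PAGES / 1000
    llm_usd = pages * tokens_per_page * llm_usd_per_million_tokens / 1_000_000
    return apify_usd, llm_usd
```

For example, 1,000 pages at 1,450 tokens each (the `tokensUsed` value in the sample output) and a hypothetical rate of $5 per million tokens come to $2.00 for Apify plus $7.25 for the LLM. Halving the tokens per page with a tighter selector halves only the second figure.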
