
URL: https://apify.com/gastronomic_desk/structured-extract

Structured Extract · Apify


Pricing

$50.00 / 1,000 structured extractions


Only pay when it works. $0.05 per verified extraction, with nothing charged on failure or retries. Extract structured JSON from any webpage using your own schema. AJV-validated output guaranteed. Compatible with Groq, OpenAI, Together AI, and Ollama.


Rating: 0.0 (0)

Developer: Herbert Yeboah

Maintained by Community

Actor stats: 1 bookmarked, 3 total users, 1 monthly active user, last modified 4 months ago

Structured Data Extractor

Extract structured JSON from any webpage using a Groq-compatible LLM.

Provide a URL + a JSON Schema → get back validated, structured data. Works with Groq (free), OpenAI, Together AI, Fireworks AI, and Ollama.

๐Ÿ‘ Apify Actor
๐Ÿ‘ PPE Pricing


What It Does

  1. Fetches the page at your URL with CheerioCrawler, a fast HTTP crawler that parses the returned HTML without running JavaScript
  2. Strips all HTML, navigation, scripts, and boilerplate → clean plain text
  3. Prompts a Groq-compatible LLM to extract data matching your schema
  4. Validates the response with AJV (JSON Schema validator)
  5. Retries up to 3 times if the LLM returns invalid JSON, injecting the error back into the prompt
  6. Returns validated structured data in the Apify dataset

Charge: $0.05 per successful extraction. Nothing charged on failure.
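
Step 2 above can be pictured with a small, dependency-free sketch. This is not the Actor's code: the Actor presumably strips markup through Cheerio, whereas the hypothetical `htmlToPlainText` helper below approximates the same idea with regular expressions, which is good enough for illustration but not for adversarial HTML.

```typescript
// Hypothetical helper: a regex-based approximation of the "strip to plain text" step.
function htmlToPlainText(html: string): string {
  return html
    // Drop whole elements whose contents are noise for the LLM.
    .replace(/<(script|style|noscript|nav|header|footer)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Drop every remaining tag but keep its text content.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities (decode &amp; last to avoid double decoding).
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}

const samplePage =
  "<html><head><script>var x = 1;</script></head>" +
  "<body><nav>Home | About</nav><h1>Widget Pro</h1><p>Price: $29.99</p></body></html>";

console.log(htmlToPlainText(samplePage)); // prints "Widget Pro Price: $29.99"
```

Sending only the cleaned text keeps the prompt short, which matters for both token cost and the model's ability to find the fields named in the schema.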


Input Schema

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | ✅ | - | Page to scrape |
| output_schema | object | ✅ | - | JSON Schema defining the data to extract |
| groq_api_key | string | ✅ | - | API key (Groq, OpenAI, Together AI, etc.) |
| model | string | ❌ | llama-3.3-70b-versatile | Model name |
| base_url | string | ❌ | Groq endpoint | For OpenAI-compatible providers |
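
The input fields can be written down as a TypeScript type with the defaults applied. The sketch below is illustrative only: `ExtractInput` and `normalizeInput` are hypothetical names, and the value used for the "Groq endpoint" default is an assumption (Groq's OpenAI-compatible base URL), since the listing does not spell it out.

```typescript
// Field names mirror the input table; the type and function names are hypothetical.
interface ExtractInput {
  url: string;
  output_schema: Record<string, unknown>;
  groq_api_key: string;
  model?: string;
  base_url?: string;
}

// Assumption: "Groq endpoint" means Groq's OpenAI-compatible base URL.
const DEFAULT_BASE_URL = "https://api.groq.com/openai/v1";
const DEFAULT_MODEL = "llama-3.3-70b-versatile";

function normalizeInput(raw: Partial<ExtractInput>): Required<ExtractInput> {
  // Collect every missing required field so the error names them all at once.
  const missing = (["url", "output_schema", "groq_api_key"] as const).filter(
    (key) => raw[key] === undefined || raw[key] === "",
  );
  if (missing.length > 0) {
    throw new Error(`Missing required input field(s): ${missing.join(", ")}`);
  }
  // The URL constructor throws a TypeError if the value is not an absolute URL.
  new URL(raw.url as string);
  return {
    url: raw.url as string,
    output_schema: raw.output_schema as Record<string, unknown>,
    groq_api_key: raw.groq_api_key as string,
    model: raw.model ?? DEFAULT_MODEL,
    base_url: raw.base_url ?? DEFAULT_BASE_URL,
  };
}

const normalized = normalizeInput({
  url: "https://example.com/product/widget-pro",
  output_schema: { type: "object" },
  groq_api_key: "gsk_YOUR_GROQ_KEY_HERE",
});
console.log(normalized.model, normalized.base_url);
```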

Usage Examples

Example 1: Groq (default, free tier)

Get a free API key at console.groq.com.

{
  "url": "https://example.com/product/widget-pro",
  "groq_api_key": "gsk_YOUR_GROQ_KEY_HERE",
  "output_schema": {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
      "name": {"type": "string"},
      "price": {"type": "number"},
      "description": {"type": "string"},
      "in_stock": {"type": "boolean"}
    }
  }
}

Output:

{
  "url": "https://example.com/product/widget-pro",
  "extracted": {
    "name": "Widget Pro",
    "price": 29.99,
    "description": "The best widget on the market.",
    "in_stock": true
  },
  "model": "llama-3.3-70b-versatile",
  "attempts": 1
}

Example 2: OpenAI-compatible endpoint (Together AI, Fireworks AI)

Use any OpenAI-compatible provider by setting base_url:

{
  "url": "https://jobs.lever.co/anthropic/engineer",
  "groq_api_key": "YOUR_TOGETHER_AI_KEY",
  "base_url": "https://api.together.xyz/v1",
  "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  "output_schema": {
    "type": "object",
    "required": ["title", "company", "location", "salary_range"],
    "properties": {
      "title": {"type": "string"},
      "company": {"type": "string"},
      "location": {"type": "string"},
      "salary_range": {"type": "string"},
      "remote": {"type": "boolean"},
      "requirements": {
        "type": "array",
        "items": {"type": "string"}
      }
    }
  }
}

Other compatible endpoints:

  • Fireworks AI: https://api.fireworks.ai/inference/v1
  • OpenAI: https://api.openai.com/v1
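
To see why switching providers only takes a different base_url, it helps to look at the request shape that OpenAI-compatible endpoints share. The Actor uses the OpenAI SDK for this; the sketch below instead builds the equivalent raw HTTP request by hand, without sending it. `buildChatRequest` is a hypothetical helper, and `temperature: 0` is an assumption, not a documented setting of the Actor.

```typescript
interface ChatRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assembles a chat-completions request for any OpenAI-compatible provider.
function buildChatRequest(baseUrl: string, apiKey: string, model: string, prompt: string): ChatRequest {
  return {
    // Compatible providers serve chat completions at <base_url>/chat/completions.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        temperature: 0, // assumption: deterministic output suits extraction
      }),
    },
  };
}

const togetherRequest = buildChatRequest(
  "https://api.together.xyz/v1/",
  "YOUR_TOGETHER_AI_KEY",
  "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  "Extract the job title as JSON.",
);
console.log(togetherRequest.endpoint); // prints "https://api.together.xyz/v1/chat/completions"
// To actually send it: const res = await fetch(togetherRequest.endpoint, togetherRequest.init);
```

Only the base URL, key, and model name change between providers; the request body stays the same.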

Example 3: Ollama (local, completely free)

Run models locally at zero cost with Ollama:

# Start the Ollama server (it runs in the foreground)
ollama serve
# In a second terminal, download the model
ollama pull llama3.3

{
  "url": "https://news.ycombinator.com/item?id=12345",
  "groq_api_key": "ollama",
  "base_url": "http://localhost:11434/v1",
  "model": "llama3.3",
  "output_schema": {
    "type": "object",
    "required": ["title", "score", "comments_count"],
    "properties": {
      "title": {"type": "string"},
      "score": {"type": "integer"},
      "comments_count": {"type": "integer"},
      "author": {"type": "string"},
      "url": {"type": "string"}
    }
  }
}

Note: On Apify cloud, localhost refers to the Actor's own container, so Ollama must be reachable at a remote endpoint. For local testing, use apify run with localhost.


Common Use Cases

| Use Case | Schema Fields |
| --- | --- |
| Product extraction | name, price, description, in_stock, SKU |
| Job postings | title, company, location, salary, requirements |
| News articles | headline, author, published_date, summary, tags |
| Real estate listings | address, price, bedrooms, bathrooms, sqft |
| Restaurant menus | restaurant_name, items (name, price, description) |
| Resume parsing | name, email, skills, experience, education |
| Event listings | name, date, venue, ticket_price, organizer |

How Retry Logic Works

The actor uses the same retry-with-feedback pattern as constrained.py from the DagPipe core library:

  1. Attempt 1: Send text + schema → LLM responds → AJV validates
  2. On failure: Inject the exact AJV error message into the next prompt → retry
  3. Attempt 2: LLM receives error and corrects → validate again
  4. After 3 failures: Throw with a descriptive error message

This approach reliably extracts valid structured data even from smaller/cheaper models.
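
The retry-with-feedback loop described above can be sketched as follows. This is an illustration, not the Actor's source: the LLM call and the validator are passed in as functions so the loop can run without network access, the `Validator` type is a simplified stand-in for AJV (it returns an error string or null), and `maxAttempts = 3` assumes that "3" counts total attempts.

```typescript
type Llm = (prompt: string) => Promise<string>;
// Stand-in for AJV: returns null when the data is valid, otherwise an error message.
type Validator = (data: unknown) => string | null;

async function extractWithRetry(
  text: string,
  schema: object,
  callLlm: Llm,
  validate: Validator,
  maxAttempts = 3, // assumption: three attempts in total
): Promise<{ extracted: unknown; attempts: number }> {
  const basePrompt =
    "Extract data from the text below as JSON matching this schema:\n" +
    `${JSON.stringify(schema)}\n\nText:\n${text}`;
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // From the second attempt on, feed the previous error back to the model.
    const prompt = lastError
      ? `${basePrompt}\n\nYour previous answer was rejected: ${lastError}\nReturn corrected JSON only.`
      : basePrompt;
    const reply = await callLlm(prompt);
    try {
      const parsed: unknown = JSON.parse(reply);
      const error = validate(parsed);
      if (error === null) return { extracted: parsed, attempts: attempt };
      lastError = error;
    } catch (err) {
      lastError = `Invalid JSON: ${(err as Error).message}`;
    }
  }
  throw new Error(`Extraction failed after ${maxAttempts} attempts. Last error: ${lastError}`);
}

// Demo with a scripted fake LLM: the first reply gives "price" as a string.
const scriptedReplies = [
  '{"name":"Widget Pro","price":"29.99"}',
  '{"name":"Widget Pro","price":29.99}',
];
let demoCall = 0;
const demoLlm: Llm = async () => scriptedReplies[Math.min(demoCall++, scriptedReplies.length - 1)] ?? "";
const priceIsNumber: Validator = (data) =>
  typeof (data as { price?: unknown } | null)?.price === "number" ? null : "/price must be number";

extractWithRetry("Widget Pro costs $29.99", { type: "object" }, demoLlm, priceIsNumber)
  .then((result) => console.log(result.attempts, result.extracted));
```

In the demo the first reply fails validation, the error text is appended to the second prompt, and the corrected reply passes on attempt 2.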


Pricing

  • $0.05 per successful extraction (Pay-Per-Event)
  • Free if extraction fails: you're never charged for failed attempts
  • Groq's free tier provides 30 requests/minute at zero cost to you
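
As a rough worked example of the pricing above, the sketch below estimates the Apify charge and a lower bound on run time under the quoted free-tier rate limit. `estimateRun` is a hypothetical helper; the numbers come from the bullets above, and real throughput depends on page fetch times and how many retries each page needs.

```typescript
const PRICE_CENTS_PER_SUCCESS = 5;      // $0.05 per successful extraction
const GROQ_FREE_REQUESTS_PER_MIN = 30;  // free-tier rate limit quoted above

// Hypothetical helper: failed extractions cost nothing, but every LLM call counts against the rate limit.
function estimateRun(successfulExtractions: number, totalLlmCalls: number) {
  return {
    // Compute in cents first to avoid floating-point drift.
    costUsd: (successfulExtractions * PRICE_CENTS_PER_SUCCESS) / 100,
    minMinutesOnFreeTier: Math.ceil(totalLlmCalls / GROQ_FREE_REQUESTS_PER_MIN),
  };
}

// 1,000 successful extractions that needed 1,200 LLM calls including retries.
console.log(estimateRun(1000, 1200)); // prints { costUsd: 50, minMinutesOnFreeTier: 40 }
```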

Technical Details

  • Scraper: CheerioCrawler (zero-JS, fast, reliable)
  • Validation: AJV v8 + ajv-formats (JSON Schema Draft-07, 2019-09, and 2020-12 compatible)
  • LLM client: OpenAI SDK (works with any OpenAI-compatible endpoint)
  • Retry strategy: Error-feedback prompting (same pattern as DagPipe constrained.py)
  • Language: TypeScript, Node.js 20+
  • Tests: 9 vitest tests (100% passing)

Built With

DagPipe: zero-cost, crash-proof LLM pipeline orchestrator.

$ pip install dagpipe-core
