Web Content Extractor API: URL to JSON
Under maintenance
Extract structured JSON from any webpage: articles, products, recipes, jobs. Auto-detects content type. Returns metadata, headings, images, links. For AI agents and RAG.
Web Content Extractor API: URL to Structured JSON
One API call. Any URL. Clean structured JSON. Extract articles, products, recipes, job postings, and more, automatically detected and organized. Built for AI agents, RAG pipelines, and data workflows.
Architecture Overview
flowchart TB
    subgraph Input
        URL[/"URL: any webpage"/]
    end
    subgraph Processing["Extraction Pipeline"]
        FETCH["1. Fetch & Parse HTML"]
        DETECT["2. Auto-Detect Content Type"]
        SCORE["3. Score Content Blocks"]
        EXTRACT["4. Extract Structured Data"]
        ENRICH["5. Enrich with Metadata"]
    end
    subgraph Detection["Content Type Detection"]
        ART["Article"]
        PROD["Product"]
        REC["Recipe"]
        JOB["Job Posting"]
        EVT["Event"]
        WEB["Generic Webpage"]
    end
    subgraph Output["Structured JSON"]
        META["Metadata: title, author, date, image"]
        CONTENT["Content: text, headings, word count"]
        MEDIA["Media: images, links"]
        SCHEMA["JSON-LD Structured Data"]
        TYPED["Type-Specific: price, ingredients, salary..."]
    end
    URL --> FETCH --> DETECT --> SCORE --> EXTRACT --> ENRICH
    DETECT --> ART & PROD & REC & JOB & EVT & WEB
    ENRICH --> META & CONTENT & MEDIA & SCHEMA & TYPED
    style Input fill:#1a1a2e,color:#fff
    style Processing fill:#16213e,color:#fff
    style Detection fill:#0f3460,color:#fff
    style Output fill:#533483,color:#fff
What Makes This Different?
| Feature | This Actor | Typical Scrapers |
|---|---|---|
| Output format | Structured JSON | Raw HTML |
| Content detection | Auto-detects 6 types | Manual configuration |
| Setup time | Zero: just pass a URL | Hours of selector writing |
| AI-ready | Yes: clean text for LLMs | Needs post-processing |
| Batch support | Up to 25 URLs per call | One at a time |
| Response time | 1-3 seconds | 5-30 seconds |
Request Flow
sequenceDiagram
    participant Client as Your App
    participant API as Content Extractor
    participant Web as Target Website
    participant Cache as 30-min Cache
    Client->>API: GET /extract?url=example.com
    API->>Cache: Check cache
    alt Cache Hit
        Cache-->>API: Return cached result
        API-->>Client: JSON response (instant)
    else Cache Miss
        API->>Web: Fetch HTML
        Web-->>API: HTML content
        API->>API: Detect type + Extract + Score
        API->>Cache: Store result
        API-->>Client: Structured JSON (1-3s)
    end
    Note over Client,API: PPE charge: $0.003 per extraction
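The cache-hit and cache-miss branches in the diagram can be mirrored on the client side. The following is a minimal sketch, not the actor's implementation: `TTLCache`, `extract_with_cache`, and the injected `extract` callable are hypothetical names, and only the 30-minute window comes from the diagram.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 30 * 60  # the 30-minute window shown in the diagram


class TTLCache:
    """Tiny in-memory cache keyed by URL (illustrative only)."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, url: str) -> Optional[dict]:
        entry = self._store.get(url)
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[url]  # expired: treat as a miss
            return None
        return result

    def put(self, url: str, result: dict) -> None:
        self._store[url] = (self._clock(), result)


def extract_with_cache(url: str, cache: TTLCache,
                       extract: Callable[[str], dict]) -> Tuple[dict, bool]:
    """Return (result, was_cache_hit), following the two branches above."""
    cached = cache.get(url)
    if cached is not None:
        return cached, True
    result = extract(url)  # stand-in for fetch + detect + extract + score
    cache.put(url, result)
    return result, False
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting 30 minutes.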
API Endpoints
GET /extract: Extract from URL
GET /extract?url=https://techcrunch.com/2026/03/24/ai-news&format=full
| Parameter | Type | Required | Default | Options |
|---|---|---|---|---|
| url | string | Yes | (none) | Any valid URL |
| format | string | No | full | full, article, metadata |
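Because the target URL travels inside a query string, it should be percent-encoded. A small sketch of building the GET request URL from the two parameters in the table; `BASE_URL` is a placeholder assumption (substitute your own deployment's address) and `build_extract_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder, not a real endpoint: replace with your deployment's base URL.
BASE_URL = "https://extractor.example.com"
ALLOWED_FORMATS = ("full", "article", "metadata")  # options from the table


def build_extract_url(target_url: str, fmt: str = "full",
                      base_url: str = BASE_URL) -> str:
    """Build a GET /extract URL with the target URL safely encoded."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    query = urlencode({"url": target_url, "format": fmt})
    return f"{base_url}/extract?{query}"
```

Authentication is left out here; add whatever headers or tokens your deployment requires.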
POST /extract: Extract with JSON body
POST /extract
{
  "url": "https://techcrunch.com/2026/03/24/ai-news",
  "format": "article"
}
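The same request can be assembled with the Python standard library. This is a sketch under assumptions: `BASE_URL` is a placeholder, `build_post_request` and `send` are hypothetical helper names, and the `Content-Type` header is the conventional choice for a JSON body rather than something the documentation above specifies.

```python
import json
import urllib.request

# Placeholder, not a real endpoint: replace with your deployment's base URL.
BASE_URL = "https://extractor.example.com"


def build_post_request(target_url: str, fmt: str = "article",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) a POST /extract request with a JSON body."""
    body = json.dumps({"url": target_url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.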
POST /batch: Extract multiple URLs
POST /batch
{
  "urls": [
    "https://news.ycombinator.com",
    "https://techcrunch.com",
    "https://bbc.com/news"
  ],
  "format": "full"
}
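Since a batch call accepts up to 25 URLs, a longer list has to be split into several request bodies. A minimal sketch; `batch_payloads` is a hypothetical helper, and the payload shape simply follows the example above.

```python
from typing import Iterator, Sequence

MAX_BATCH_SIZE = 25  # "Up to 25 URLs per call"


def batch_payloads(urls: Sequence[str], fmt: str = "full",
                   size: int = MAX_BATCH_SIZE) -> Iterator[dict]:
    """Yield POST /batch bodies, each holding at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield {"urls": list(urls[start:start + size]), "format": fmt}
```

For example, a list of 60 URLs yields three bodies holding 25, 25, and 10 URLs.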
GET /: Health check
Returns API status, version, and endpoint documentation.
Content Type Detection
flowchart LR
    HTML["HTML Page"] --> CHECK{"Detect Signals"}
    CHECK -->|"og:type=article<br/>or article tag"| ART["**article**<br/>title, author, date,<br/>full text, headings"]
    CHECK -->|"Schema: Product<br/>or .product-price"| PROD["**product**<br/>name, price, rating,<br/>images, SKU, brand"]
    CHECK -->|"Schema: Recipe<br/>or .recipe"| REC["**recipe**<br/>ingredients, instructions,<br/>prep time, servings"]
    CHECK -->|"Schema: JobPosting<br/>or .job-title"| JOB["**job_posting**<br/>title, company, salary,<br/>location, type"]
    CHECK -->|"Schema: Event<br/>or .event-date"| EVT["**event**<br/>name, date, location,<br/>description"]
    CHECK -->|"No specific<br/>signals found"| WEB["**webpage**<br/>metadata, content,<br/>links, images"]
    style ART fill:#10b981,color:#fff
    style PROD fill:#f59e0b,color:#fff
    style REC fill:#ef4444,color:#fff
    style JOB fill:#3b82f6,color:#fff
    style EVT fill:#8b5cf6,color:#fff
    style WEB fill:#6b7280,color:#fff
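To make the signal checks concrete, here is a rough sketch of how such detection could work with the standard-library HTML parser. It is not the actor's actual code: the signal names come from the diagram, but `SignalCollector`, `detect_content_type`, and in particular the priority order (specific schema types before `article`) are assumptions.

```python
import json
from html.parser import HTMLParser
from typing import List, Set

SCHEMA_TYPES = {"Product": "product", "Recipe": "recipe",
                "JobPosting": "job_posting", "Event": "event"}
CLASS_SIGNALS = {"product-price": "product", "recipe": "recipe",
                 "job-title": "job_posting", "event-date": "event"}
PRIORITY = ("product", "recipe", "job_posting", "event", "article")  # assumed order


class SignalCollector(HTMLParser):
    """Collects content-type signals while the HTML is parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.signals: Set[str] = set()
        self._in_jsonld = False
        self._jsonld_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "article":
            self.signals.add("article")
        if (tag == "meta" and attr.get("property") == "og:type"
                and attr.get("content") == "article"):
            self.signals.add("article")
        for cls in (attr.get("class") or "").split():
            if cls in CLASS_SIGNALS:
                self.signals.add(CLASS_SIGNALS[cls])
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld, self._jsonld_parts = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self._record_jsonld("".join(self._jsonld_parts))

    def _record_jsonld(self, raw: str) -> None:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            return  # malformed JSON-LD is ignored
        for item in doc if isinstance(doc, list) else [doc]:
            if not isinstance(item, dict):
                continue
            declared = item.get("@type")
            for name in declared if isinstance(declared, list) else [declared]:
                if isinstance(name, str) and name in SCHEMA_TYPES:
                    self.signals.add(SCHEMA_TYPES[name])


def detect_content_type(html: str) -> str:
    """Return one of the six type labels from the diagram."""
    collector = SignalCollector()
    collector.feed(html)
    collector.close()
    for label in PRIORITY:
        if label in collector.signals:
            return label
    return "webpage"
```

Checking the specific types first means a product page that also wraps its description in an `article` tag is still labeled `product`; a different ordering is equally defensible.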
Output Examples
Article Extraction
{
  "url": "https://techcrunch.com/2026/03/24/ai-agents",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "description": "How autonomous AI agents are changing B2B SaaS",
    "author": "Sarah Perez",
    "date": "2026-03-24T10:00:00Z",
    "image": "https://techcrunch.com/hero.jpg",
    "siteName": "TechCrunch",
    "locale": "en-US",
    "canonical": "https://techcrunch.com/2026/03/24/ai-agents",
    "keywords": ["AI", "agents", "enterprise", "SaaS"]
  },
  "content": {
    "text": "The rise of AI agents represents a fundamental shift in how enterprise software operates. Unlike traditional chatbots...",
    "headings": [
      {"level": 2, "text": "What Are AI Agents?"},
      {"level": 2, "text": "The Enterprise Impact"},
      {"level": 3, "text": "Case Study: Salesforce"}
    ],
    "wordCount": 2847
  },
  "media": {
    "images": [
      {"src": "https://techcrunch.com/diagram.png", "alt": "AI agent architecture"}
    ],
    "links": [
      {"href": "https://openai.com/agents", "text": "OpenAI's agent framework"}
    ]
  },
  "structuredData": [
    {"@type": "NewsArticle", "headline": "..."}
  ],
  "extractedAt": "2026-03-24T12:34:56.789Z"
}
Product Extraction
{
  "url": "https://store.example.com/product/widget-pro",
  "type": "product",
  "metadata": {
    "title": "Widget Pro - Best Seller",
    "siteName": "Example Store"
  },
  "content": {
    "text": "The Widget Pro is our most popular...",
    "wordCount": 342
  },
  "product": {
    "name": "Widget Pro",
    "price": "$49.99",
    "currency": "USD",
    "availability": "InStock",
    "rating": "4.8",
    "reviewCount": "1,247",
    "brand": "WidgetCo",
    "sku": "WP-2026",
    "images": ["https://store.example.com/widget-pro-1.jpg"]
  }
}
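In the product example, `price`, `rating`, and `reviewCount` arrive as display strings, so downstream code will usually want to convert them to numbers. A minimal sketch, assuming US-style formatting (`$49.99`, `1,247`) and that every field shown in the sample is present; `parse_price` and `normalize_product` are hypothetical helpers.

```python
import re
from decimal import Decimal


def parse_price(price: str) -> Decimal:
    """Strip currency symbols and thousands separators: "$49.99" -> 49.99."""
    cleaned = re.sub(r"[^\d.]", "", price)
    return Decimal(cleaned)


def normalize_product(result: dict) -> dict:
    """Convert the string fields of a product extraction into typed values."""
    product = result["product"]
    return {
        "name": product["name"],
        "price": parse_price(product["price"]),
        "currency": product.get("currency"),
        "in_stock": product.get("availability") == "InStock",
        "rating": float(product["rating"]),
        "review_count": int(product["reviewCount"].replace(",", "")),
    }
```

`Decimal` is used for the price so that later arithmetic does not pick up binary floating-point rounding. Pages that format prices differently (for example `49,99 €`) would need locale-aware parsing.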
Use Case Workflows
RAG Pipeline Integration
flowchart LR
    URLs["URL List<br/>100+ sources"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> TEXT["Clean Text<br/>+ Metadata"]
    TEXT --> CHUNK["Text Chunking<br/>(LangChain)"]
    CHUNK --> EMBED["Embeddings<br/>(OpenAI)"]
    EMBED --> VECTOR["Vector DB<br/>(Pinecone)"]
    VECTOR --> RAG["RAG Query<br/>Engine"]
    RAG --> ANSWER["AI-Powered<br/>Answers"]
    style EXTRACT fill:#10b981,color:#fff
    style RAG fill:#3b82f6,color:#fff
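The chunking step can be illustrated without any framework. The sketch below swaps LangChain's splitters for a plain word-window splitter so that it stays dependency-free; `chunk_extraction` and its default sizes are assumptions, and the input is an extraction result shaped like the article example above.

```python
from typing import Dict, List


def chunk_extraction(result: dict, chunk_words: int = 200,
                     overlap: int = 40) -> List[Dict]:
    """Split content.text into overlapping word windows with source metadata."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be >= 0 and smaller than chunk_words")
    words = result["content"]["text"].split()
    step = chunk_words - overlap
    chunks: List[Dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append({
            "text": " ".join(window),
            "source": result["url"],
            "title": result.get("metadata", {}).get("title"),
            "start_word": start,
        })
        if start + chunk_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk carries its source URL and title, so an answer produced by the RAG engine can be traced back to the page it came from.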
Competitive Intelligence Pipeline
flowchart LR
    COMP["Competitor<br/>URLs"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> PROD["Product Data:<br/>prices, features"]
    EXTRACT --> NEWS["News & Blog:<br/>announcements"]
    PROD --> DASH["Analytics<br/>Dashboard"]
    NEWS --> ALERT["Email<br/>Alerts"]
    style EXTRACT fill:#10b981,color:#fff
Pricing
| Event | Price per call | Cost per 1,000 |
|---|---|---|
| Content extraction | $0.003 | $3.00 |
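For budgeting, the per-call price translates directly into a cost estimate. A small sketch: `extraction_cost` is a hypothetical helper, and it assumes every URL is billed as one extraction; whether cached responses or individual URLs inside a batch are billed the same way is not stated here.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # USD, from the pricing table


def extraction_cost(extractions: int) -> Decimal:
    """Estimated cost in USD, assuming one billed event per extracted URL."""
    if extractions < 0:
        raise ValueError("extractions must not be negative")
    return PRICE_PER_EXTRACTION * extractions
```

Under that assumption, 1,000 extractions come to $3.00 and a full batch of 25 URLs to $0.075.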
Cost Comparison
| Solution | Cost (per 1,000 URLs unless noted) | Setup Time |
|---|---|---|
| This Actor | $3.00 | 0 minutes |
| Diffbot | $299/month flat | Hours |
| Custom scraper | $50+ developer hours | Days |
| Manual copy-paste | 40+ hours labor | Forever |
Integrations
| Platform | How to Connect |
|---|---|
| LangChain | Use as Document Loader via HTTP |
| LlamaIndex | Custom reader pointing to /extract |
| Zapier | Webhook trigger -> GET /extract |
| Make (Integromat) | HTTP module -> POST /extract |
| n8n | HTTP Request node |
| Apify Orchestrator | Direct actor call or Standby URL |
FAQ
Q: How fast is extraction? A: 1-3 seconds for a single URL. A batch call processes up to 25 URLs in parallel.
Q: Does it handle paywalled content? A: It extracts whatever is publicly visible in the HTML. Paywalled content behind JavaScript auth won't be extracted.
Q: What about JavaScript-rendered pages (SPAs)? A: The current version works from the HTML returned by the server and does not render JavaScript. For JS-heavy pages, pair it with our Screenshot & PDF API.
Q: Is there a rate limit? A: No hard rate limit. Apify Standby handles concurrent requests automatically.
Q: What languages are supported? A: Any language. The extractor works with HTML structure, not language-specific parsing.
Related Actors
- WebSight API: Technical website analysis (SEO, tech stack, AI score)
- Screenshot & PDF API: Pixel-perfect webpage captures
- Website Contact Scraper: Extract emails, phones, social links
Built by George Kioko | 6,196+ data extraction jobs completed | 35+ production APIs
