URL: https://apify.com/legible_ship/web2json-agent

Web2json Agent · Apify


Pricing

Pay per usage


Rating

0.0

(0)

Developer

国强 杨

Maintained by Community

Actor stats

1

Bookmarked

3

Total users

0

Monthly active users

3 months ago

Last modified


📖 What is web2json-agent?

An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples, with no need to write XPath or CSS selectors by hand.


📋 Demo

https://github.com/user-attachments/assets/c82e8e13-fc42-4d1f-a81a-4cec6e3f434b


📊 SWDE Benchmark Results

The SWDE dataset covers 8 vertical fields, 80 websites, and 124,291 pages.


🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent
# 2. Initialize configuration
web2json setup

Install for Developers

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent
# 2. Install in editable mode
pip install -e .
# 3. Initialize configuration
web2json setup

📚 Complete User Guide

For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:

docs/Web2JsonAgent%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97.md

This guide includes:

  • Detailed installation steps
  • Configuration methods (interactive wizard, config file, environment variables)
  • Layout clustering for mixed HTML types
  • Complete API examples and use cases
  • FAQ and troubleshooting

🐍 API Usage

Web2JSON provides five APIs. Each one returns its results in memory, which makes it a good fit for writing to databases, serving API backends, and real-time processing.

API 1: extract_data - Complete Workflow

Extract structured data from HTML in one step (schema + parser + data).

⚠️ Important: The extract_data API assumes that all HTML files in the input directory share the same layout type. If your HTML files have different layouts (e.g., list pages vs. detail pages), use classify_html_dir first to group them by layout similarity. See ./demo.py for a complete example.

Auto Mode - Let AI automatically discover and extract fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory (default: "output")
)
result = extract_data(config)

# Results are always returned in memory
print(result.final_schema)    # Dict: extracted schema
print(result.parser_code)     # str: generated parser code
print(result.parsed_data[0])  # first record; parsed_data is List[Dict] of parsed JSON data

Predefined Mode - Extract only specific fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string",
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory
)
result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
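
When a predefined schema is used, it can help to check that each parsed record actually carries the requested fields. The following is a minimal sketch of such a check, not part of web2json: `check_record` and `TYPE_MAP` are hypothetical names, and only the "string" type name appears in the examples above; the other type names are assumptions added for illustration.

```python
from typing import Any, Dict, List

# Map schema type names to Python types. Only "string" is taken from the
# examples above; "number" and "boolean" are assumed for illustration.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}


def check_record(record: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one record; empty means it conforms."""
    problems: List[str] = []
    for field_name, type_name in schema.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
            continue
        expected = TYPE_MAP.get(type_name)
        value = record[field_name]
        # Unknown type names and None values are not type-checked.
        if expected is not None and value is not None and not isinstance(value, expected):
            problems.append(
                f"{field_name}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


schema = {"title": "string", "author": "string", "date": "string", "content": "string"}
good = {"title": "Hello", "author": "Ann", "date": "2024-01-01", "content": "Body"}
bad = {"title": "Hello", "author": 42, "content": "Body"}

print(check_record(good, schema))
print(check_record(bad, schema))
```

A check like this could be run over each record in `result.parsed_data` before the data is loaded anywhere else.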

API 2: extract_schema - Extract Schema Only

Generate a JSON schema describing the data structure in HTML.

from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],  # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)
result = extract_schema(config)

print(result.final_schema)          # Dict: final schema
print(result.intermediate_schemas)  # List[Dict]: iteration history

API 3: infer_code - Generate Parser Code

Generate parser code from a schema (Dict or from previous step).

from web2json import Web2JsonConfig, infer_code

# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string",
}
config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],  # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)
result = infer_code(config)

print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)       # Dict: schema used
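
To make the idea of "parser code for a schema" concrete, here is a small hand-written sketch of a parser that fills two schema fields from one HTML page. It is not output from web2json: the generated code is described above as BeautifulSoup-based, whereas this sketch uses only the standard library's `html.parser` so it runs without extra dependencies. The `ArticleParser` class, the `parse_html` function, and the assumption that the title lives in `<h1>` and the author in an element with class `author` are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class ArticleParser(HTMLParser):
    """Capture the first <h1> text as title and the first class="author" text as author."""

    def __init__(self) -> None:
        super().__init__()
        self.result: Dict[str, Optional[str]] = {"title": None, "author": None}
        self._current: Optional[str] = None  # schema field currently being captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._current = "title"
        elif "author" in classes:
            self._current = "author"

    def handle_data(self, data):
        text = data.strip()
        # Keep only the first non-empty text seen for each field.
        if self._current and text and self.result[self._current] is None:
            self.result[self._current] = text

    def handle_endtag(self, tag):
        # Stop capturing at any closing tag (nested markup is not handled).
        self._current = None


def parse_html(html: str) -> Dict[str, Optional[str]]:
    parser = ArticleParser()
    parser.feed(html)
    return parser.result


sample = '<html><body><h1>Hello</h1><span class="author">Ann</span><p>Body</p></body></html>'
print(parse_html(sample))
```

Writing and maintaining selectors like these by hand for every site layout is the work the agent is meant to automate.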

API 4: extract_data_with_code - Parse with Code

Use parser code to extract data from HTML files.

from web2json import Web2JsonConfig, extract_data_with_code

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)
result = extract_data_with_code(config)

print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
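
Because results come back in memory, they can be post-processed with ordinary Python. The sketch below writes items shaped like the ones iterated above (a `filename` key and a `data` key) to a JSON Lines file. The `write_jsonl` helper, the `source_file` key, and the assumption that each `data` value is a flat dict are illustrative, not part of web2json; the `parsed_data` list is a hand-made stand-in for `result.parsed_data`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl(items: List[Dict[str, Any]], out_path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        for item in items:
            # Flatten: keep the source file name next to the extracted fields.
            record = {"source_file": item["filename"], **item["data"]}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(items)


# Stand-in for result.parsed_data (assumed shape: filename + data dict).
parsed_data = [
    {"filename": "a.html", "data": {"title": "First"}},
    {"filename": "b.html", "data": {"title": "Second"}},
]

out_file = Path(tempfile.mkdtemp()) / "records.jsonl"
count = write_jsonl(parsed_data, out_file)
print(f"Wrote {count} records to {out_file}")
```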

API 5: classify_html_dir - Classify HTML by Layout

Group HTML files by layout similarity (for mixed-layout datasets).

from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],  # Save cluster report and copy files to subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)
result = classify_html_dir(config)

print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")
for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f" - {file}")
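
For intuition about what "layout similarity" can mean, the toy sketch below groups pages by the overlap of their tag and class names. This is a deliberately simplified stand-in, not the algorithm used by classify_html_dir, which is not described here. The functions `tag_signature`, `jaccard`, and `group_by_layout`, and the 0.6 threshold, are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set, Tuple


def tag_signature(html: str) -> Set[str]:
    """Return the set of 'tag' or 'tag.class' tokens found in opening tags."""
    tokens: Set[str] = set()
    for match in re.finditer(r"<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>", html):
        tag, attrs = match.group(1).lower(), match.group(2)
        cls = re.search(r'class="([^"]*)"', attrs)
        names = (cls.group(1).split() if cls else []) or [""]
        for name in names:
            tokens.add(f"{tag}.{name}" if name else tag)
    return tokens


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_layout(pages: Dict[str, str], threshold: float = 0.6) -> List[List[str]]:
    """Greedy grouping: a page joins the first group whose first member is similar enough."""
    groups: List[Tuple[Set[str], List[str]]] = []
    for name, html in pages.items():
        sig = tag_signature(html)
        for rep_sig, members in groups:
            if jaccard(sig, rep_sig) >= threshold:
                members.append(name)
                break
        else:
            groups.append((sig, [name]))
    return [members for _, members in groups]


pages = {
    "list_1.html": '<ul class="items"><li class="item"><a>A</a></li></ul>',
    "list_2.html": '<ul class="items"><li class="item"><a>B</a></li><li class="item"><a>C</a></li></ul>',
    "detail_1.html": '<article><h1>T</h1><div class="body"><p>x</p></div></article>',
}
print(group_by_layout(pages))
```

Once files are grouped, each group can be passed to extract_data separately, which satisfies its single-layout assumption.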

Configuration Reference

Web2JsonConfig Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | Required | Project name (for identification) |
| html_path | str | Required | HTML directory or file path |
| output_path | str | "output" | Output directory (used when save is specified) |
| iteration_rounds | int | 3 | Number of samples for learning |
| schema | Dict | None | Predefined schema (None = auto mode) |
| enable_schema_edit | bool | False | Enable manual schema editing |
| parser_code | str | None | Parser code (for extract_data_with_code) |
| save | List[str] | None | Items to save locally (e.g., ['schema', 'code', 'data']). None = memory only |

Standalone API Parameters:

| API | Parameters | Returns |
| --- | --- | --- |
| extract_data | config: Web2JsonConfig | ExtractDataResult |
| extract_schema | config: Web2JsonConfig | ExtractSchemaResult |
| infer_code | config: Web2JsonConfig | InferCodeResult |
| extract_data_with_code | config: Web2JsonConfig | ParseResult |
| classify_html_dir | config: Web2JsonConfig | ClusterResult |

All result objects provide:

  • Direct access to data via object attributes
  • .to_dict() method for serialization
  • .get_summary() method for quick stats
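
A `.to_dict()` method makes it straightforward to hand a result to `json.dumps` or a database client. The sketch below shows that pattern with a stand-in class, since the real result classes are not reproduced here: `FakeResult`, its two fields, and the shape returned by its `get_summary` are assumptions for illustration only and may differ from what web2json returns.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeResult:
    """Hypothetical stand-in for a web2json result object."""

    final_schema: Dict[str, str] = field(default_factory=dict)
    parsed_data: List[Dict[str, Any]] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def get_summary(self) -> Dict[str, int]:
        # Assumed summary shape; the real method may report other stats.
        return {"fields": len(self.final_schema), "records": len(self.parsed_data)}


result = FakeResult(
    final_schema={"title": "string"},
    parsed_data=[{"title": "Hello"}],
)

# Serialize the whole result for storage or transport.
payload = json.dumps(result.to_dict(), ensure_ascii=False, sort_keys=True)
print(payload)
print(result.get_summary())
```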

Which API Should I Use?

from web2json import (
    Web2JsonConfig, extract_data, extract_schema, infer_code,
    extract_data_with_code, classify_html_dir,
)

# Need data immediately? → extract_data
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)
# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema,
)
code_result = infer_code(config)
# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code,
)
data_result = extract_data_with_code(config)

# Have parser code, need to parse more files? → extract_data_with_code
# my_parser_code: parser code you already have (a code string or a path to a .py file)
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code,
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)

📄 License

Apache-2.0 License

