The webclaw server lets you extract, crawl, and analyze web content for AI agents, RAG pipelines, and developer workflows. Here's what you can do:
Scrape – Extract content from a single URL as markdown, LLM-optimized text, plain text, HTML, or JSON; supports CSS selector filtering, main-content extraction, and auto-fallback to cloud API when bot protection or JS rendering is detected.
Crawl – Breadth-first crawl of a website from a seed URL, following links up to a configurable depth and page limit, with optional sitemap seeding and concurrent requests.
Batch – Scrape multiple URLs concurrently and return extracted content for all of them at once.
Map – Discover all URLs from a website's sitemaps (via robots.txt + sitemap.xml) without fully extracting every page.
Extract – Extract structured data from a web page using an LLM, guided by a JSON schema or a natural language prompt.
Summarize – Generate a concise LLM-powered summary of a web page, with configurable sentence count.
Diff – Compare a URL's current content against a previously saved extraction snapshot to highlight what has changed.
Brand – Extract brand identity assets (colors, fonts, logo, favicon) from a website's HTML and CSS.
Search – Search the web for a query and return structured results (requires API key).
Research – Run a deep, multi-source research investigation on a topic or question, with an optional deep mode for more thorough results (requires API key).
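The Crawl entry above describes a breadth-first traversal bounded by depth and page count. The following is a minimal sketch of that idea, not webclaw's actual implementation: the `SITE` dictionary is a hypothetical in-memory stand-in for real HTTP fetches, and `crawl` and its parameters are illustrative names.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

# Hypothetical in-memory site: page URL -> links found on that page.
SITE = {
    "https://docs.example.com/": ["/guide", "/api", "https://other.example.org/"],
    "https://docs.example.com/guide": ["/guide/install", "/"],
    "https://docs.example.com/api": ["/api/v1"],
    "https://docs.example.com/guide/install": [],
    "https://docs.example.com/api/v1": [],
}


def crawl(seed, max_depth, max_pages, get_links=lambda url: SITE.get(url, [])):
    """Breadth-first crawl that stays on the seed's origin."""
    origin = tuple(urlsplit(seed)[:2])  # (scheme, host)
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if tuple(urlsplit(link)[:2]) == origin and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("https://docs.example.com/", max_depth=1, max_pages=50))
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, so a page limit cuts off the deepest pages first.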
Integrates with local Ollama instances for private, AI-powered structured data extraction and content summarization.
Utilizes OpenAI's API to perform schema-enforced extraction and content summarization on scraped data.
Provides specialized extraction of structured metadata and content details from YouTube video pages.
Most web scraping tools give your agent one of two bad outputs:
a blocked page, login wall, or empty app shell
raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate
webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.
webclaw turns a URL into clean content your tools can actually use.
webclaw https://example.com --format markdown
# Example Domain
This domain is for use in illustrative examples in documents.
You may use this domain in literature without prior coordination or asking for permission.
Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.
Install
Agent setup
The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:
npx create-webclaw
The installer detects supported clients and configures the MCP server for you.
Homebrew
brew tap 0xMassi/webclaw
brew install webclaw
Prebuilt binaries
Download macOS and Linux binaries from GitHub Releases.
Docker
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
Cargo
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp
If building from source fails because native build tools are missing, install the platform prerequisites:
OS | Command |
Debian / Ubuntu |
|
Fedora / RHEL |
|
Arch |
|
macOS |
|
Quick Start
Scrape one page
webclaw https://stripe.com --format markdown
Return LLM-optimized text
webclaw https://docs.anthropic.com --format llm
Keep only the main content
webclaw https://example.com/blog/post --only-main-content
Include or exclude selectors
webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
Crawl a documentation site
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
Workflow examples
Extract brand assets
webclaw https://github.com --brand
Compare a page over time
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
MCP Server
webclaw ships with an MCP server for AI agents.
npx create-webclaw
Manual config:
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
Then ask your agent things like:
Scrape these competitor pricing pages and summarize the differences.
Crawl this documentation site and prepare clean context for a RAG index.
Extract the brand colors, fonts, and logos from this company website.
Tools
Tool | What it does | Local |
scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
crawl | Follow same-origin links and extract discovered pages | Yes |
map | Discover URLs without extracting every page | Yes |
batch | Scrape multiple URLs in parallel | Yes |
extract | Convert page content into structured data | Yes, with local or configured LLM |
summarize | Summarize a page | Yes, with local or configured LLM |
diff | Compare page content snapshots | Yes |
brand | Extract colors, fonts, logos, and metadata | Yes |
search | Search the web and scrape results | Hosted API |
research | Multi-source research workflow | Hosted API |
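The Map capability described earlier discovers URLs through robots.txt and sitemap.xml instead of extracting each page. A rough sketch of that discovery step is shown here using only the Python standard library. The sample `ROBOTS_TXT` and `SITEMAP_XML` strings and both function names are hypothetical, and real sitemaps may also be nested index files, which this sketch does not follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs; in practice these would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>
"""


def sitemap_urls_from_robots(robots_txt):
    """Collect the values of every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def page_urls_from_sitemap(sitemap_xml):
    """Return the text of every <loc> element, ignoring the XML namespace."""
    root = ET.fromstring(sitemap_xml)
    locs = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


print(sitemap_urls_from_robots(ROBOTS_TXT))
print(page_urls_from_sitemap(SITEMAP_XML))
```

Only the small robots.txt and sitemap documents are read, which is why mapping a site is much cheaper than crawling it.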
SDKs
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
import { Webclaw } from "@webclaw/sdk";
const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });
const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});
console.log(page.markdown);
from webclaw import Webclaw
client = Webclaw(api_key="wc_your_key")
page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)
print(page.markdown)
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
Output Formats
Format | Use it when you need |
markdown | Clean page content with structure preserved |
llm | Compact context for agents and RAG pipelines |
text | Plain text with minimal formatting |
json | Structured metadata, links, images, and extracted fields |
html | Cleaned HTML for custom processing |
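When calling the hosted API without an SDK, the output format is requested through the `formats` field shown in the curl example. The sketch below builds the same request with Python's standard library and does not send it. The endpoint, headers, and body fields are copied from the curl example; the helper name `build_scrape_request` is hypothetical, and whether every CLI format name is accepted in `formats` is an assumption.

```python
import json
import os
import urllib.request

API_URL = "https://api.webclaw.io/v1/scrape"  # endpoint from the curl example


def build_scrape_request(url, fmt="markdown", only_main_content=True, api_key=None):
    """Build (but do not send) a POST request mirroring the curl example."""
    api_key = api_key or os.environ.get("WEBCLAW_API_KEY", "")
    body = json.dumps(
        {"url": url, "formats": [fmt], "only_main_content": only_main_content}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_scrape_request("https://example.com", api_key="wc_your_key")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Passing the result to `urllib.request.urlopen(req)` would perform the call; the response shape is not documented here, so parsing it is left out.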
Local First, Hosted When Needed
The CLI and MCP server work locally without an account for the core extraction path.
Use the hosted API at webclaw.io when you need:
protected-site access without managing infrastructure
JavaScript rendering
async crawl and research jobs
web search
watches and production usage tracking
SDKs for application code
export WEBCLAW_API_KEY=wc_your_key
webclaw https://example.com --cloud
What You Can Build
Use case | Example |
AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
Structured extraction | Turn messy pages into typed JSON for automations |
Research workflows | Search, scrape, summarize, and cite multiple sources |
Brand intelligence | Extract logos, colors, fonts, and social metadata |
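For the competitor monitoring use case, the core step is comparing a saved snapshot with the current extraction. The sketch below illustrates that comparison with Python's `difflib`. It is not how webclaw's diff is implemented, and it assumes each snapshot has already been reduced to plain text; `changed_lines` is a hypothetical helper.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the removed ('- ') and added ('+ ') lines between snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


old_snapshot = "Pricing\nPro: $20/month\nTeam: $50/month"
new_snapshot = "Pricing\nPro: $25/month\nTeam: $50/month"

for line in changed_lines(old_snapshot, new_snapshot):
    print(line)
```

Comparing extracted text rather than raw HTML keeps markup churn, such as rotated class names or ad slots, out of the diff.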
Architecture
webclaw/
  crates/
    webclaw-core    HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch   Fetching, crawling, batching, and mapping
    webclaw-llm     Local and hosted LLM provider support
    webclaw-pdf     PDF text extraction
    webclaw-mcp     MCP server for AI agents
    webclaw-cli     Command-line interface
webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.
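The separation described above, extraction as a pure function of HTML with fetching handled elsewhere, can be illustrated in a few lines. This is a toy sketch in Python rather than the Rust crate, and its tag-skipping rule is an assumption for illustration, not webclaw-core's algorithm.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this toy example.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Pure function: HTML string in, cleaned text out, no network access."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


PAGE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<main><h1>Example Domain</h1><p>This domain is for use in examples.</p></main>"
    "<script>var x = 1;</script><footer>Copyright</footer></body></html>"
)
print(extract_text(PAGE))
```

Because `extract_text` takes a string and returns a string, it can be tested against saved HTML fixtures without any fetching layer.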
Configuration
Variable | Description |
| Hosted API key |
| Ollama URL for local LLM features |
| OpenAI-compatible LLM provider key |
| OpenAI-compatible base URL |
| Anthropic-compatible LLM provider key |
| Anthropic-compatible base URL |
| Single proxy URL |
| Proxy pool file |
Contributing
The most useful contributions right now are practical and small:
add examples for real agent and RAG workflows
improve SDK snippets
report pages that extract poorly
add failing fixtures for messy HTML
improve docs for MCP clients and local setup
test the CLI on more Linux/macOS environments
Good first places to start:
If a page extracts badly, include:
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:
Please remove secrets, cookies, private tokens, and customer data from logs before posting.
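One way to scrub a log before pasting it into an issue is a small redaction pass. The sketch below is a starting point only: the patterns are assumptions that cover bearer tokens, cookie headers, and keys with the `wc_` prefix used in the examples above, and they will not catch every kind of secret, so the output should still be reviewed by hand.

```python
import re

# (pattern, replacement) pairs applied in order; all patterns are illustrative.
PATTERNS = [
    (re.compile(r"(Authorization:\s*Bearer\s+)[^\s\"']+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"(Cookie:\s*)[^\"'\n]+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"\bwc_[A-Za-z0-9_]+"), "[REDACTED]"),
]


def redact(log_text):
    """Replace likely secrets in a log with a fixed placeholder."""
    for pattern, replacement in PATTERNS:
        log_text = pattern.sub(replacement, log_text)
    return log_text


SAMPLE_LOG = (
    'curl -H "Authorization: Bearer wc_abc123" https://api.webclaw.io/v1/scrape\n'
    "Cookie: session=xyz; theme=dark\n"
    "export WEBCLAW_API_KEY=wc_live_456"
)
print(redact(SAMPLE_LOG))
```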
Community Plugins
Third-party plugins that integrate webclaw with AI agent platforms:
Plugin | Platform | What it does |
Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand | ||
Web search provider and 9 dedicated tools for the full v1 API surface. Install with |
Built a webclaw integration? Open a PR to add it here.
Contributors
Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.
License
