Pricing
Pay per usage
Crawl4ai
Extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.
Last modified: 3 months ago
Website Content Extractor
Apify Actor: extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.
Quick start
$ pip install -e ".[dev]"
$ crawl4ai-setup
$ python -m crawl4ai_actor.main
Input: startUrls (required), maxPages, maxDepth, waitUntil, waitForSelector, cssSelector, etc. Full schema: .actor/input_schema.json.
Output: dataset with url, success, content, title, content_length, links_internal_count, etc. The run summary is in Storage → Key-value store (runSummary) and includes failedUrls for retries.
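To illustrate how the output fields and failedUrls might be used together, here is a minimal Python sketch. It assumes dataset items are dicts with the success and content_length fields named above, and that failedUrls holds either plain URL strings or objects with a url key; the actual shape is not specified here. The function names summarize_items and build_retry_input are hypothetical helpers, not part of the Actor.

```python
from typing import Any, Optional


def summarize_items(items: list[dict[str, Any]]) -> dict[str, int]:
    """Count successes/failures and total extracted characters across dataset items."""
    ok = [item for item in items if item.get("success")]
    return {
        "total": len(items),
        "succeeded": len(ok),
        "failed": len(items) - len(ok),
        "content_chars": sum(int(item.get("content_length") or 0) for item in ok),
    }


def build_retry_input(
    run_summary: dict[str, Any], base_input: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return a copy of base_input with startUrls set to the failed URLs.

    Returns None when the summary lists no failed URLs.
    """
    urls: list[str] = []
    for entry in run_summary.get("failedUrls") or []:
        # Assumption: an entry is a URL string or a dict carrying a "url" key.
        url = entry.get("url") if isinstance(entry, dict) else entry
        if url and url not in urls:  # de-duplicate while keeping order
            urls.append(url)
    if not urls:
        return None
    return {**base_input, "startUrls": urls}
```

The returned dict can be passed as the input of a follow-up run, so only the failed pages are crawled again with the same options.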
Options (high level)
| Option | Purpose |
|---|---|
| crawlMode | full (default) or discover_only; discover_only returns URLs + links only, no content |
| includeLinkUrls | Include links_internal / links_external arrays in each item |
| waitUntil | domcontentloaded, load, or networkidle (SPA/slow sites) |
| pageLoadWaitSecs | Extra delay (seconds) before capture |
| waitForSelector | Wait for CSS selector (or css:/js: prefix) |
| cssSelector | Extract only this region (e.g. main, .article) |
| virtualScrollSelector | Infinite-scroll container to expand |
Example (SPA / slow site): { "startUrls": ["https://..."], "waitUntil": "networkidle", "pageLoadWaitSecs": 2 }
Example (discover links only): { "startUrls": ["https://..."], "crawlMode": "discover_only", "maxPages": 100 }
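Inputs like the two examples above can be assembled and sanity-checked before a run. The following is a sketch only: build_input is a hypothetical helper, and the allowed values are copied from the options list in this README, not from .actor/input_schema.json, which remains the authoritative definition.

```python
import json
from typing import Any

# Allowed values as listed in this README (assumed to match the input schema).
CRAWL_MODES = {"full", "discover_only"}
WAIT_UNTIL = {"domcontentloaded", "load", "networkidle"}


def build_input(start_urls: list[str], **options: Any) -> dict[str, Any]:
    """Build an input dict, rejecting values the README does not list."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    crawl_mode = options.get("crawlMode", "full")
    if crawl_mode not in CRAWL_MODES:
        raise ValueError(f"crawlMode must be one of {sorted(CRAWL_MODES)}")
    wait_until = options.get("waitUntil")
    if wait_until is not None and wait_until not in WAIT_UNTIL:
        raise ValueError(f"waitUntil must be one of {sorted(WAIT_UNTIL)}")
    return {"startUrls": list(start_urls), **options}


if __name__ == "__main__":
    # example.com is a placeholder URL used for illustration.
    run_input = build_input(
        ["https://example.com"], crawlMode="discover_only", maxPages=100
    )
    print(json.dumps(run_input, indent=2))
```

Options that are not checked here (maxDepth, cssSelector, and so on) pass through unchanged, so the helper stays usable as the schema grows.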
Run locally / Docker
$ docker build -t website-content-extractor .
Regression
$ UX_MATRIX_GROUP=core python scripts/ux_matrix.py
Reports: scripts/ux_matrix_output.json, scripts/ux_matrix_report.txt (gitignored).
