URL: https://apify.com/kael_odin/crawl4ai

Crawl4ai · Apify


Pricing

Pay per usage

Extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.

Rating

0.0

(0)

Developer

Kael Odin

Maintained by Community

Actor stats

1

Bookmarked

2

Total users

0

Monthly active users

3 months ago

Last modified

Website Content Extractor

Apify Actor: extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.

Quick start

pip install -e ".[dev]"
crawl4ai-setup
python -m crawl4ai_actor.main

Input: startUrls (required), maxPages, maxDepth, waitUntil, waitForSelector, cssSelector, etc. Full schema: .actor/input_schema.json.

Output: dataset with url, success, content, title, content_length, links_internal_count, etc. Run summary in Storage → Key-value store (runSummary), including failedUrls for retries.
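The output fields above can be post-processed with a few lines of Python. The following is a minimal sketch, not part of the Actor: `summarize` and `retry_input` are hypothetical helpers, the sample items are made up for illustration, and it assumes that `failedUrls` in runSummary is a plain list of URL strings.

```python
def summarize(items):
    """Aggregate dataset items using the documented output fields."""
    ok = [it for it in items if it.get("success")]
    return {
        "pages": len(items),
        "succeeded": len(ok),
        "total_content_length": sum(it.get("content_length", 0) for it in ok),
        "failed_urls": [it["url"] for it in items if not it.get("success")],
    }


def retry_input(run_summary):
    """Build a follow-up input from runSummary.

    Assumption: failedUrls is a list of URL strings.
    Returns None when there is nothing to retry.
    """
    failed = run_summary.get("failedUrls", [])
    return {"startUrls": list(failed)} if failed else None


# Illustrative items only; real items come from the run's dataset.
sample_items = [
    {"url": "https://example.com/", "success": True, "content_length": 1200},
    {"url": "https://example.com/a", "success": False, "content_length": 0},
]
summary = summarize(sample_items)
print(summary)
print(retry_input({"failedUrls": summary["failed_urls"]}))
```

Feeding the returned dict back as the next run's input re-crawls only the pages that failed.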

Options (high level)

Option                 Purpose
crawlMode              full (default) | discover_only; discover_only returns URLs and links only, no content
includeLinkUrls        Include links_internal / links_external arrays in each item
waitUntil              domcontentloaded | load | networkidle (SPA/slow sites)
pageLoadWaitSecs       Extra delay before capture
waitForSelector        Wait for CSS selector (or css:/js: prefix)
cssSelector            Extract only this region (e.g. main, .article)
virtualScrollSelector  Infinite-scroll container to expand

Example (SPA / slow site): { "startUrls": ["https://..."], "waitUntil": "networkidle", "pageLoadWaitSecs": 2 }
Example (discover links only): { "startUrls": ["https://..."], "crawlMode": "discover_only", "maxPages": 100 }
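Inputs like the two examples above can be assembled and sanity-checked in Python before a run. This is a minimal sketch: `build_run_input` is a hypothetical helper, not part of the Actor, and it assumes the allowed values for crawlMode and waitUntil are exactly those listed in the options table. The schema in .actor/input_schema.json remains authoritative.

```python
import json

# Allowed values as listed in the options table (assumed to be exhaustive).
CRAWL_MODES = {"full", "discover_only"}
WAIT_UNTIL = {"domcontentloaded", "load", "networkidle"}


def build_run_input(start_urls, crawl_mode="full", wait_until=None, **extra):
    """Hypothetical helper: assemble an input dict for the Actor."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if crawl_mode not in CRAWL_MODES:
        raise ValueError(f"unknown crawlMode: {crawl_mode!r}")
    if wait_until is not None and wait_until not in WAIT_UNTIL:
        raise ValueError(f"unknown waitUntil: {wait_until!r}")

    run_input = {"startUrls": list(start_urls), "crawlMode": crawl_mode}
    if wait_until is not None:
        run_input["waitUntil"] = wait_until
    run_input.update(extra)  # e.g. maxPages, pageLoadWaitSecs, cssSelector
    return run_input


# Mirrors the SPA / slow site example; the URL is a placeholder.
spa_input = build_run_input(
    ["https://example.com"], wait_until="networkidle", pageLoadWaitSecs=2
)
print(json.dumps(spa_input, indent=2))
```

Validating locally catches a mistyped mode or wait condition before any crawl time is spent.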

Run locally / Docker

$ docker build -t website-content-extractor .

Regression

$ UX_MATRIX_GROUP=core python scripts/ux_matrix.py

Reports: scripts/ux_matrix_output.json, scripts/ux_matrix_report.txt (gitignored).
