Website Content Crawler

Pricing

$10.00 / 1,000 results

A powerful web crawler that extracts text content from websites, optimized for AI models, Large Language Models (LLMs), vector databases, and Retrieval-Augmented Generation (RAG) pipelines.

Rating

0.0

(0)

Developer

Jason Giang

Maintained by Community

Actor stats

Last modified: 4 months ago

Web Content Crawler

Features

Multiple Crawling Engines: Choose between Playwright (Chrome/Firefox), Cheerio (fast HTTP client), or JSDOM based on your needs
Markdown Output: Automatically converts HTML content to clean Markdown format
Smart Content Extraction: Removes unwanted elements like cookie banners, navigation, ads, and more
Customizable Selectors: Keep or remove specific elements using CSS selectors
Deep Crawling: Recursively crawl websites with configurable depth limits
AI-Ready Output: Structured data perfect for feeding into AI models and vector databases
Proxy Support: Built-in proxy configuration for reliable crawling
Screenshot Capture: Optional screenshot capture for visual documentation (Playwright only)
File Downloads: Download and save linked files like PDFs and documents

Use Cases

Knowledge Base Extraction: Crawl documentation sites and help centers
Content Aggregation: Collect articles, blog posts, and web content at scale
AI Training Data: Extract clean text for training or fine-tuning language models
RAG Pipelines: Feed content into retrieval-augmented generation systems
Vector Database Population: Prepare text content for embedding and semantic search
Website Migration: Extract content from existing websites for migration
Competitive Analysis: Monitor and analyze competitor content

Input Parameters

Required

Start URLs (startUrls): Array of URLs where the crawler will begin. The crawler will only process pages under these URLs.
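
The page does not say how "under these URLs" is decided. One plausible reading is a same-host, path-prefix check, sketched below. The function name `is_in_scope` and the exact matching rule are assumptions for illustration, not the Actor's confirmed behavior.

```python
from urllib.parse import urlsplit


def is_in_scope(candidate: str, start_url: str) -> bool:
    """Return True if candidate is on the same host and under the start URL's path.

    Hypothetical rule: the Actor's real scoping logic is not documented here.
    """
    c, s = urlsplit(candidate), urlsplit(start_url)
    # Scheme and host must match (host compared case-insensitively).
    if (c.scheme, c.netloc.lower()) != (s.scheme, s.netloc.lower()):
        return False
    # Compare on a path-segment boundary so /docs does not match /docs-old.
    base = s.path if s.path.endswith("/") else s.path + "/"
    return c.path == s.path or c.path.startswith(base)


print(is_in_scope("https://example.com/docs/intro", "https://example.com/docs"))
print(is_in_scope("https://example.com/blog", "https://example.com/docs"))
```

Under this reading, a start URL of https://example.com/docs would include /docs/intro but exclude /blog and /docs-old.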

Crawler Configuration

Crawler Type (crawlerType): Select the crawling engine
- cheerio (default): Fast HTTP client, best for static websites
- playwright:chrome: Chrome browser with full JavaScript support
- playwright:firefox: Firefox browser, useful for sites with anti-bot measures
- jsdom: Experimental JavaScript-capable crawler
Max Crawling Depth (maxCrawlDepth): Maximum link depth from start URLs (default: 1)
- 0 = Only crawl start URLs
- 1 = Crawl start URLs and pages directly linked from them
- 2+ = Continue crawling to specified depth
Max Pages (maxCrawlPages): Maximum number of pages to crawl (default: 100)
Max Requests Per Minute (maxRequestsPerMinute): Rate limiting (default: 0 = unlimited)
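
To make the interaction between maxCrawlDepth and maxCrawlPages concrete, here is a minimal sketch that walks a small in-memory link graph. It assumes a breadth-first traversal, which this page does not confirm. The `SITE` graph and the `crawl_order` function are made up for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
SITE = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1"],
    "/docs/b": ["/docs/a"],
}


def crawl_order(links, start_urls, max_crawl_depth=1, max_crawl_pages=100):
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue and len(visited) < max_crawl_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_crawl_depth:
            continue  # depth limit reached: do not follow links from this page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


print(crawl_order(SITE, ["/docs"], max_crawl_depth=0))
print(crawl_order(SITE, ["/docs"], max_crawl_depth=2, max_crawl_pages=2))
```

With depth 0 only the start URL is visited. With depth 2 and a page limit of 2, the crawl stops after the second page even though more links are queued.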

Content Extraction

Readable Text Char Threshold (readableTextCharThreshold): Minimum characters required to save a page (default: 100)
Remove Cookie Warnings (removeCookieWarnings): Automatically remove cookie consent dialogs (default: true)
Click Elements CSS Selector (clickElementsCssSelector): CSS selector for elements to click before extraction (e.g., "Show more" buttons)
HTML Transformer (htmlTransformer): How to process HTML
- readableText (default): Remove scripts, styles, navigation
- none: Keep original HTML
Remove Elements CSS Selector (removeElementsCssSelector): CSS selector for elements to remove (e.g., nav, footer, .ads)
Keep Elements CSS Selector (keepElementsCssSelector): CSS selector for elements to keep (removes everything else)
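
The extraction options above can be pictured with a simplified sketch built on Python's standard `html.parser`. It is not the Actor's implementation: it matches plain tag names rather than full CSS selectors, assumes well-formed HTML, and the names `TextExtractor` and `extract_text` are hypothetical. It shows the general idea of dropping unwanted elements and then applying a minimum-length threshold like readableTextCharThreshold.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collect text while skipping everything inside the removed tags."""

    def __init__(self, remove_tags):
        super().__init__()
        self.remove_tags = set(remove_tags)
        self.skip_depth = 0  # greater than 0 while inside a removed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in self.remove_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, remove_tags=("script", "style", "nav", "footer"),
                 char_threshold=100):
    """Return cleaned text, or None if it is shorter than char_threshold."""
    parser = TextExtractor(remove_tags)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return text if len(text) >= char_threshold else None
```

A page whose cleaned text falls below the threshold yields None here, mirroring the idea that very short pages are not saved.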

Output Options

Save Markdown (saveMarkdown): Convert content to Markdown format (default: true)
Save HTML (saveHtml): Save raw HTML to key-value store (default: false)
Save Screenshots (saveScreenshots): Capture page screenshots (Playwright only, default: false)
Save Files (saveFiles): Download linked files like PDFs (default: false)

Advanced Options

Max Scroll Height (maxScrollHeightPixels): Maximum height, in pixels, to scroll down on pages with infinite scroll (default: 0 = disabled)
Proxy Configuration (proxyConfiguration): Proxy settings for the crawler
Max Request Retries (maxRequestRetries): Number of retry attempts for failed requests (default: 3)
Debug Mode (debugMode): Enable detailed logging (default: false)
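
As a rough illustration of what a retry limit such as maxRequestRetries means, the sketch below retries a failing fetch up to a fixed number of times before giving up. It assumes the retry count excludes the first attempt, which this page does not specify. The `fetch_with_retries` helper and its `base_delay` parameter are hypothetical.

```python
import time


def fetch_with_retries(fetch, url, max_request_retries=3, base_delay=0.0):
    """Call fetch(url), retrying on any exception up to max_request_retries times.

    Assumption: total attempts = 1 initial try + max_request_retries retries.
    """
    retries_used = 0
    while True:
        try:
            return fetch(url)
        except Exception:
            if retries_used >= max_request_retries:
                raise  # out of retries: surface the last error
            retries_used += 1
            # Linear backoff; base_delay=0.0 disables waiting.
            time.sleep(base_delay * retries_used)
```

Raising the limit trades longer runs for better resilience on flaky sites.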

Output Format

Each crawled page produces a dataset item with the following structure:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "canonicalUrl": "https://example.com/page",
  "text": "Extracted plain text content...",
  "markdown": "# Page Title\n\nExtracted content in Markdown...",
  "crawl": {
    "loadedUrl": "https://example.com/page",
    "depth": 1,
    "httpStatusCode": 200,
    "loadedAt": "2024-01-01T12:00:00.000Z"
  }
}
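
For RAG or vector-database use, a dataset item like the one above typically needs to be split into chunks before embedding. The following is a minimal sketch of one way to do that. The `to_records` helper, the paragraph-based splitting, and the record fields (`id`, `content`) are illustrative choices, not part of the Actor.

```python
# Sample item copied from the output structure shown above.
ITEM = {
    "url": "https://example.com/page",
    "title": "Page Title",
    "text": "Extracted plain text content...",
    "markdown": "# Page Title\n\nExtracted content in Markdown...",
    "crawl": {"loadedUrl": "https://example.com/page", "depth": 1},
}


def to_records(item, max_chars=200):
    """Split markdown (or text as a fallback) into paragraph-aligned chunks."""
    body = item.get("markdown") or item.get("text") or ""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return [
        {
            "id": f"{item['url']}#{i}",
            "url": item["url"],
            "title": item.get("title", ""),
            "depth": item.get("crawl", {}).get("depth"),
            "content": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]


for record in to_records(ITEM, max_chars=20):
    print(record["id"], repr(record["content"]))
```

Keeping the source URL and crawl depth on each record makes it possible to cite or filter results after retrieval.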

Example Usage

Basic Crawl

{
  "startUrls": [
    {"url": "https://example.com/docs"}
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50
}
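
The same input can be submitted programmatically. The sketch below assumes the third-party `apify-client` Python package and an Apify API token. The Actor ID `jasondev/website-content-crawler` is inferred from this page's URL, and the helper names `build_run_input` and `run_crawl` are hypothetical. Treat it as a starting point rather than a verified integration.

```python
def build_run_input(start_urls, crawler_type="cheerio", max_depth=2, max_pages=50):
    """Build an input object using the parameter names documented above."""
    return {
        "startUrls": [{"url": u} for u in start_urls],
        "crawlerType": crawler_type,
        "maxCrawlDepth": max_depth,
        "maxCrawlPages": max_pages,
    }


def run_crawl(token, run_input, actor_id="jasondev/website-content-crawler"):
    """Start the Actor, wait for it to finish, and return its dataset items.

    Requires: pip install apify-client. The actor_id default is an assumption.
    """
    from apify_client import ApifyClient  # imported lazily; third-party package

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Prints the input only; call run_crawl(...) with a real token to execute it.
    print(build_run_input(["https://example.com/docs"]))
```

Separating input construction from the network call keeps the configuration easy to inspect and test before any paid run is started.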

Advanced Configuration

{
  "startUrls": [
    {"url": "https://example.com/blog"}
  ],
  "crawlerType": "playwright:chrome",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 200,
  "removeElementsCssSelector": "nav, footer, .sidebar, .comments",
  "removeCookieWarnings": true,
  "saveMarkdown": true,
  "saveScreenshots": false,
  "maxRequestRetries": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Extract Specific Content

{
  "startUrls": [
    {"url": "https://example.com"}
  ],
  "keepElementsCssSelector": "article, .content, main",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 500,
  "saveMarkdown": true
}

How It Works

The crawler starts from your specified URLs and:

Fetches and processes each page using your selected crawling engine
Extracts and cleans the content by removing unwanted elements
Converts the content to your preferred format (Markdown, plain text, or HTML)
Follows links to discover and crawl additional pages (up to your depth limit)
Saves all extracted data to the dataset for easy access

URL: https://apify.com/jasondev/website-content-crawler
