URL: https://apify.com/jasondev/website-content-crawler

Website Content Crawler · Apify


Pricing

$10.00 / 1,000 results

Website Content Crawler

A powerful web crawler that extracts text content from websites, optimized for AI models, Large Language Models (LLMs), vector databases, and Retrieval-Augmented Generation (RAG) pipelines.

Rating

0.0

(0)

Developer

๐Ÿ‘ Jason Giang

Jason Giang

Maintained by Community

Actor stats

0

Bookmarked

43

Total users

6

Monthly active users

4 months ago

Last modified

Features

  • Multiple Crawling Engines: Choose between Playwright (Chrome/Firefox), Cheerio (fast HTTP client), or JSDOM based on your needs
  • Markdown Output: Automatically converts HTML content to clean Markdown format
  • Smart Content Extraction: Removes unwanted elements like cookie banners, navigation, ads, and more
  • Customizable Selectors: Keep or remove specific elements using CSS selectors
  • Deep Crawling: Recursively crawl websites with configurable depth limits
  • AI-Ready Output: Structured data perfect for feeding into AI models and vector databases
  • Proxy Support: Built-in proxy configuration for reliable crawling
  • Screenshot Capture: Optional screenshot capture for visual documentation (Playwright only)
  • File Downloads: Download and save linked files like PDFs and documents

Use Cases

  • Knowledge Base Extraction: Crawl documentation sites and help centers
  • Content Aggregation: Collect articles, blog posts, and web content at scale
  • AI Training Data: Extract clean text for training or fine-tuning language models
  • RAG Pipelines: Feed content into retrieval-augmented generation systems
  • Vector Database Population: Prepare text content for embedding and semantic search
  • Website Migration: Extract content from existing websites for migration
  • Competitive Analysis: Monitor and analyze competitor content

Input Parameters

Required

  • Start URLs (startUrls): Array of URLs where the crawler will begin. The crawler will only process pages under these URLs.

Crawler Configuration

  • Crawler Type (crawlerType): Select the crawling engine

    • cheerio (default): Fast HTTP client, best for static websites
    • playwright:chrome: Chrome browser with full JavaScript support
    • playwright:firefox: Firefox browser, useful for sites with anti-bot measures
    • jsdom: Experimental JavaScript-capable crawler
  • Max Crawling Depth (maxCrawlDepth): Maximum link depth from start URLs (default: 1)

    • 0 = Only crawl start URLs
    • 1 = Crawl start URLs and pages directly linked from them
    • 2+ = Continue crawling to specified depth
  • Max Pages (maxCrawlPages): Maximum number of pages to crawl (default: 100)

  • Max Requests Per Minute (maxRequestsPerMinute): Upper limit on the number of requests sent per minute (default: 0 = unlimited)
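The crawler configuration parameters above can be assembled and sanity-checked before a run. The following is a minimal sketch, not the Actor's own validation: the helper name `build_crawler_input` and the specific checks are assumptions for illustration, while the field names, allowed engine values, and defaults are taken from the list above.

```python
# Hypothetical helper: builds the crawler-configuration part of the input.
# Field names and defaults mirror the parameter list; the checks are assumptions.
ALLOWED_CRAWLER_TYPES = {"cheerio", "playwright:chrome", "playwright:firefox", "jsdom"}


def build_crawler_input(start_urls, crawler_type="cheerio", max_crawl_depth=1,
                        max_crawl_pages=100, max_requests_per_minute=0):
    """Return an input dict using the documented field names and defaults."""
    if not start_urls:
        raise ValueError("startUrls must contain at least one URL")
    if crawler_type not in ALLOWED_CRAWLER_TYPES:
        raise ValueError(f"unknown crawlerType: {crawler_type!r}")
    if max_crawl_depth < 0:
        raise ValueError("maxCrawlDepth must be 0 or greater")
    if max_crawl_pages < 1:
        raise ValueError("maxCrawlPages must be at least 1")
    if max_requests_per_minute < 0:
        raise ValueError("maxRequestsPerMinute must be 0 (unlimited) or greater")
    return {
        # startUrls is an array of objects with a "url" key, as in the examples.
        "startUrls": [{"url": url} for url in start_urls],
        "crawlerType": crawler_type,
        "maxCrawlDepth": max_crawl_depth,
        "maxCrawlPages": max_crawl_pages,
        "maxRequestsPerMinute": max_requests_per_minute,
    }
```

Calling `build_crawler_input(["https://example.com/docs"])` yields an input with every documented default filled in, which can then be extended with the extraction and output options.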

Content Extraction

  • Readable Text Char Threshold (readableTextCharThreshold): Minimum characters required to save a page (default: 100)

  • Remove Cookie Warnings (removeCookieWarnings): Automatically remove cookie consent dialogs (default: true)

  • Click Elements CSS Selector (clickElementsCssSelector): CSS selector for elements to click before extraction (e.g., "Show more" buttons)

  • HTML Transformer (htmlTransformer): How to process HTML

    • readableText (default): Remove scripts, styles, navigation
    • none: Keep original HTML
  • Remove Elements CSS Selector (removeElementsCssSelector): CSS selector for elements to remove (e.g., nav, footer, .ads)

  • Keep Elements CSS Selector (keepElementsCssSelector): CSS selector for elements to keep (removes everything else)
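To make the interplay of element removal and the character threshold concrete, here is a simplified sketch using only the Python standard library. It is an illustration under stated assumptions, not the Actor's implementation: it matches plain tag names rather than full CSS selectors, and the names `TagStripper`, `extract_text`, and `should_save` are hypothetical. How the Actor itself counts characters is not documented here.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text while skipping everything inside the given tag names."""

    def __init__(self, remove_tags):
        super().__init__()
        self.remove_tags = set(remove_tags)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html, remove_tags=("nav", "footer", "script", "style")):
    """Return whitespace-normalized text with the removed elements dropped."""
    parser = TagStripper(remove_tags)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(" ".join(parser.parts).split())


def should_save(text, threshold=100):
    """Assumed rule: keep a page only if its text has at least `threshold` characters."""
    return len(text) >= threshold
```

With this rule, a page whose cleaned text is only a few words long would be skipped at the default threshold of 100, which is the kind of near-empty page the threshold is meant to filter out.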

Output Options

  • Save Markdown (saveMarkdown): Convert content to Markdown format (default: true)

  • Save HTML (saveHtml): Save raw HTML to key-value store (default: false)

  • Save Screenshots (saveScreenshots): Capture page screenshots (Playwright only, default: false)

  • Save Files (saveFiles): Download linked files like PDFs (default: false)

Advanced Options

  • Max Scroll Height (maxScrollHeightPixels): Maximum height, in pixels, to scroll down on pages with infinite scroll (default: 0 = disabled)

  • Proxy Configuration (proxyConfiguration): Proxy settings for the crawler

  • Max Request Retries (maxRequestRetries): Number of retry attempts for failed requests (default: 3)

  • Debug Mode (debugMode): Enable detailed logging (default: false)

Output Format

Each crawled page produces a dataset item with the following structure:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "canonicalUrl": "https://example.com/page",
  "text": "Extracted plain text content...",
  "markdown": "# Page Title\n\nExtracted content in Markdown...",
  "crawl": {
    "loadedUrl": "https://example.com/page",
    "depth": 1,
    "httpStatusCode": 200,
    "loadedAt": "2024-01-01T12:00:00.000Z"
  }
}
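A dataset item of this shape can be turned into chunks for embedding in a vector database. The sketch below is one possible approach, assuming paragraph-based chunking with a character budget; the names `CrawledPage`, `parse_item`, and `chunk_text`, and the 500-character default, are illustrative choices and not part of the Actor.

```python
from dataclasses import dataclass


@dataclass
class CrawledPage:
    url: str
    title: str
    content: str  # Markdown if present, otherwise plain text
    depth: int
    status: int


def parse_item(item):
    """Map one dataset item (as a dict) onto a small typed record."""
    crawl = item.get("crawl", {})
    return CrawledPage(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("markdown") or item.get("text", ""),
        depth=crawl.get("depth", 0),
        status=crawl.get("httpStatusCode", 0),
    )


def chunk_text(text, max_chars=500):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be embedded together with the page `url` and `title` as metadata, so that search hits link back to the source page.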

Example Usage

Basic Crawl

{
  "startUrls": [
    {"url": "https://example.com/docs"}
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50
}
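An input like the one above can be submitted through the Apify REST API. The sketch below uses only the standard library and the API's synchronous "run Actor and get dataset items" endpoint; the Actor ID `jasondev/website-content-crawler` is inferred from this page's URL, the helper names are hypothetical, and the 300-second timeout is an assumption. Longer crawls may be better served by the asynchronous run endpoints.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, run_input, token):
    """Build (but do not send) a POST request that runs the Actor synchronously."""
    # The API expects "username~actor-name" rather than "username/actor-name".
    api_actor_id = actor_id.replace("/", "~")
    url = f"{API_BASE}/acts/{api_actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def run_crawl(actor_id, run_input, token, timeout=300):
    """Send the request and return the dataset items as a list of dicts."""
    request = build_run_request(actor_id, run_input, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `run_crawl("jasondev/website-content-crawler", basic_input, token)` would return the crawled pages, where `basic_input` is the Basic Crawl input as a dict and `token` is your own Apify API token.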

Advanced Configuration

{
  "startUrls": [
    {"url": "https://example.com/blog"}
  ],
  "crawlerType": "playwright:chrome",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 200,
  "removeElementsCssSelector": "nav, footer, .sidebar, .comments",
  "removeCookieWarnings": true,
  "saveMarkdown": true,
  "saveScreenshots": false,
  "maxRequestRetries": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Extract Specific Content

{
  "startUrls": [
    {"url": "https://example.com"}
  ],
  "keepElementsCssSelector": "article, .content, main",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 500,
  "saveMarkdown": true
}

How It Works

The crawler starts from your specified URLs and:

  1. Fetches and processes each page using your selected crawling engine
  2. Extracts and cleans the content by removing unwanted elements
  3. Converts the content to your preferred format (Markdown, plain text, or HTML)
  4. Follows links to discover and crawl additional pages (up to your depth limit)
  5. Saves all extracted data to the dataset for easy access
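The link-following step and its depth and page limits can be sketched as a breadth-first traversal over an in-memory link graph. This is an illustration only: the Actor's actual traversal order and URL scoping are not specified on this page, and the function name `crawl_order` and the `links` mapping are hypothetical stand-ins for real page fetching.

```python
from collections import deque


def crawl_order(links, start_urls, max_depth=1, max_pages=100):
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them.

    `links` maps each URL to the URLs it links to (a stand-in for fetching pages).
    """
    start = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(start)
    queue = deque((url, 0) for url in start)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With `max_depth=0` only the start URLs are returned, and with `max_depth=1` the pages they link to directly are added, which matches the depth semantics described under Max Crawling Depth.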
