
URL: https://apify.com/optimus-fulcria/ai-ready-website-crawler

Website to Markdown Crawler for AI & RAG · Apify


Pricing

Pay per usage

AI-Ready Website Crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Developer

πŸ‘ Fulcria Labs

Fulcria Labs

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 7
Monthly active users: 2
Last modified: 3 months ago

Crawls websites and converts pages to clean markdown suitable for AI/RAG knowledge bases, LLM fine-tuning, and document pipelines.

What it does

This actor takes a starting URL, crawls the website following same-domain links, and outputs each page as clean markdown with metadata. It strips out navigation, ads, scripts, and other non-content elements to produce AI-ready text.

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | required | Primary URL to start crawling |
| additionalUrls | string[] | [] | Extra URLs to include in the crawl |
| maxPages | integer | 50 | Maximum pages to crawl (1-10000) |
| maxDepth | integer | 3 | Maximum link depth from start URL |
| requestsPerSecond | number | 2 | Rate limit for politeness |
| respectRobotsTxt | boolean | true | Honor robots.txt rules |
| includeUrlPatterns | string[] | [] | Regex patterns - only crawl matching URLs |
| excludeUrlPatterns | string[] | see below | Regex patterns - skip matching URLs |
| removeSelectors | string[] | see below | CSS selectors for elements to remove |
| contentSelectors | string[] | [] | CSS selectors to isolate main content |
| requestTimeoutSecs | integer | 30 | Per-request timeout |
| userAgent | string | AIReadyWebsiteCrawler/1.0 | User-Agent header |

Default exclude patterns

\.(pdf|zip|tar|gz|mp4|mp3|...)$
/api/
/login,/logout,/signin,/signup,/auth/
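
The include and exclude patterns can be thought of as a simple regex gate on each discovered URL. The sketch below is not the actor's code: the pattern list is a shortened, illustrative subset of the defaults (the real list is elided above), the function name `should_crawl` is hypothetical, and it assumes unanchored `re.search` matching with exclude rules applied after include rules.

```python
import re

# Illustrative subset only; the actor's real default list is longer
# and partly elided in the documentation.
EXCLUDE_PATTERNS = [r"\.(pdf|zip|mp4)$", r"/api/", r"/login", r"/auth/"]


def should_crawl(url, include_patterns=(), exclude_patterns=EXCLUDE_PATTERNS):
    """Return True if url passes the include and exclude regex filters."""
    # An empty include list is treated as "no restriction" (assumption).
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    # Any exclude match rejects the URL.
    return not any(re.search(p, url) for p in exclude_patterns)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/docs/intro",
        "https://example.com/files/manual.pdf",
        "https://example.com/api/v1/users",
    ):
        print(candidate, should_crawl(candidate))
```

Because the patterns are unanchored in this sketch, a short pattern such as `/api/` matches anywhere in the URL, so it is worth checking custom include patterns against the exclude list before a large crawl.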

Default remove selectors

nav, footer, header, aside, .sidebar, .advertisement, .cookie-banner, script, style, noscript, iframe, svg, and more.

Output

Each crawled page produces a dataset item with:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "markdown": "---\ntitle: \"Getting Started\"\nurl: https://...\ncrawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome to...",
  "crawl_date": "2026-02-23T12:00:00+00:00",
  "depth": 1,
  "word_count": 342
}

The markdown field includes YAML frontmatter with title, URL, and crawl date, followed by the cleaned content.
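
When consuming dataset items downstream, it can be useful to separate the frontmatter from the body before chunking or embedding. The following is a minimal sketch, assuming the frontmatter is always a flat list of `key: value` lines between two `---` markers as in the sample above. `split_frontmatter` is a hypothetical helper, not part of the actor, and it is not a full YAML parser.

```python
def split_frontmatter(markdown):
    """Split a markdown string into (metadata dict, body text)."""
    opener, closer = "---\n", "\n---\n"
    if not markdown.startswith(opener):
        return {}, markdown
    # Search from the newline that ends the opening marker.
    end = markdown.find(closer, len(opener) - 1)
    if end == -1:
        return {}, markdown
    meta = {}
    for line in markdown[len(opener):end].splitlines():
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = markdown[end + len(closer):].lstrip("\n")
    return meta, body


if __name__ == "__main__":
    # Hypothetical item text, shaped like the sample output above.
    sample = (
        '---\ntitle: "Getting Started"\nurl: https://example.com/page\n'
        "crawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome."
    )
    meta, body = split_frontmatter(sample)
    print(meta["title"])
    print(body)
```

For frontmatter with nested values or multi-line strings, a real YAML library would be the safer choice.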

Example input

Crawl documentation site

{
  "startUrl": "https://docs.example.com",
  "maxPages": 100,
  "maxDepth": 5,
  "requestsPerSecond": 2
}

Crawl specific section only

{
  "startUrl": "https://example.com/docs/api",
  "maxPages": 50,
  "maxDepth": 3,
  "includeUrlPatterns": ["/docs/api/"],
  "contentSelectors": [".docs-content", "article"]
}

Crawl multiple sites

{
  "startUrl": "https://docs.example.com",
  "additionalUrls": [
    "https://blog.example.com",
    "https://wiki.example.com"
  ],
  "maxPages": 200
}

How the content cleaning works

  1. HTML fetching - Uses httpx with HTTP/2 support and configurable timeouts
  2. Element removal - Strips nav, footer, ads, scripts, styles via CSS selectors
  3. Content isolation - Auto-detects <main>, <article>, or content divs (or uses your custom selectors)
  4. Markdown conversion - Converts to markdown preserving headings, lists, tables, code blocks, and links
  5. Whitespace cleanup - Removes excessive blank lines and trailing whitespace
  6. Quality filter - Skips pages with fewer than 10 words of content
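
Steps 5 and 6 are simple enough to sketch with the standard library. This is an illustration of the idea rather than the actor's implementation: the function names are hypothetical, "excessive blank lines" is assumed to mean more than one consecutive blank line, and words are assumed to be counted by splitting on whitespace.

```python
import re

MIN_WORDS = 10  # threshold stated in step 6


def clean_whitespace(markdown):
    """Strip trailing whitespace and collapse runs of blank lines."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    # Three or more newlines in a row become a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def passes_quality_filter(markdown, min_words=MIN_WORDS):
    """Return True when the text has at least min_words words."""
    return len(markdown.split()) >= min_words


if __name__ == "__main__":
    raw = "# Title  \n\n\n\nBody text   \n"
    cleaned = clean_whitespace(raw)
    print(repr(cleaned))
    print(passes_quality_filter(cleaned))
```

Note that this simplified cleanup also collapses blank lines inside code blocks, which a production converter would likely want to leave untouched.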

Use cases

  • Build RAG knowledge bases from documentation sites
  • Create training datasets for LLM fine-tuning
  • Index product documentation for AI assistants
  • Archive website content in a portable format
  • Feed content into vector databases (Pinecone, Weaviate, etc.)

Technical details

  • Async crawling with httpx for fast performance
  • BFS traversal with configurable depth limits
  • URL deduplication with fragment removal and normalization
  • robots.txt compliance with per-domain caching
  • Token bucket rate limiting for polite crawling
  • Same-domain restriction prevents crawling external sites
  • lxml parser for fast, robust HTML parsing
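
To make the traversal bullets concrete, here is a minimal sketch of BFS with a depth limit, URL deduplication, and a same-domain check. It is written under several assumptions: `normalize_url` and `bfs_crawl` are hypothetical names, the exact normalization rules are not documented (this version drops fragments, lowercases the host, and trims trailing slashes), the start URL is treated as depth 0, and `get_links` stands in for the real fetch-and-parse step.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url):
    """Drop the fragment, lowercase scheme and host, trim a trailing slash."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def bfs_crawl(start_url, get_links, max_pages=50, max_depth=3):
    """Breadth-first traversal returning (url, depth) pairs in visit order.

    get_links is a hypothetical callable mapping a URL to the links on
    that page; a real crawler would fetch and parse the page here.
    """
    start = normalize_url(start_url)
    domain = urlsplit(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            link = normalize_url(link)
            # Skip external domains and URLs that were already queued.
            if urlsplit(link).netloc != domain or link in seen:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps the same page from being queued twice when several pages link to it.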
