Website Content Crawler

Pricing

Pay per usage

Universal website crawler that extracts clean text/markdown content, metadata, links, and images from any URL. Features sitemap parsing, robots.txt respect, and multi-page BFS crawling with depth control.

Rating

0.0

(0)

Developer

Ali haydar Karadaş

Maintained by Community

Actor stats

Last modified: 4 days ago

What does Website Content Crawler do?

This actor provides four endpoints that cover different content extraction needs. Crawl Page (crawl_page) scrapes a single URL and returns the page content, metadata, links, and images. Crawl Site (crawl_site) follows links from a starting URL and crawls multiple pages up to a configurable depth and page limit. Get Sitemap (get_sitemap) parses a site's sitemap.xml and returns all listed URLs with their last-modified dates, change frequencies, and priorities. Extract Content (extract_content) pulls just the main content from a page in either plain text or markdown format.
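The depth and page-limit behaviour of Crawl Site can be pictured as a breadth-first traversal. The sketch below is an illustration only, not the actor's code: it walks a hypothetical in-memory link graph (LINKS) instead of fetching pages, and the function name bfs_crawl and the choice to count the start page as depth 0 are assumptions.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def bfs_crawl(start_url, depth=1, limit=10, links=LINKS):
    """Visit pages breadth-first, stopping at `depth` hops or `limit` pages.

    Assumption: the start page counts as depth 0.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < limit:
        url, level = queue.popleft()
        visited.append(url)
        if level == depth:
            continue  # do not expand links beyond the depth limit
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited
```

Because the queue is first-in, first-out, every page at one depth is visited before any page at the next depth, so a low page limit favours pages close to the start URL.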

The crawler respects robots.txt by default (configurable), extracts Open Graph and meta tags, identifies internal vs. external links, and captures image alt text and dimensions. Output is clean and structured -- ready for AI training data, content analysis, SEO audits, or database storage.
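As a rough illustration of what respecting robots.txt involves, the sketch below uses Python's standard urllib.robotparser on an inline rules string. The rules, the user agent name ExampleBot, and the helper is_allowed are made up for the example; how the actor performs this check internally is not documented on this page.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, robots_txt=ROBOTS_TXT, user_agent="ExampleBot",
               respect_robots=True):
    """Return True if `url` may be fetched under the given robots.txt text."""
    if not respect_robots:
        return True  # mirrors switching the respect_robots input off
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```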

What data do you get?

Page content:

url, title, description
text_content -- extracted plain text
markdown_content -- content converted to markdown
author, published_date, language
word_count, char_count

Page metadata:

status_code, content_type, response_time_ms
canonical_url, og_tags, meta_tags

Links found on page:

url, text, is_internal, is_nofollow

Images found on page:

url, alt_text, width, height

Sitemap data:

url, lastmod, changefreq, priority

Crawl summary:

start_url, pages_crawled, total_links
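To show where the sitemap fields listed above come from, here is a minimal sketch that reads them out of a standard sitemap.xml with Python's xml.etree.ElementTree. The sample XML and the helper parse_sitemap are hypothetical, and returning missing fields as None and priority as a string are assumptions rather than the actor's documented behaviour.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap for illustration.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; absent optional fields become None."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "url": node.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": node.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": node.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": node.findtext("sm:priority", namespaces=SITEMAP_NS),
        })
    return entries


entries = parse_sitemap(SAMPLE_SITEMAP)
```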

Who is this for?

AI and ML engineers -- collect training data from websites in clean text or markdown format
SEO professionals -- audit site structure, meta tags, internal linking, and content quality
Content analysts -- extract and compare content across competitor websites
Researchers -- build text corpora from web sources for academic or commercial analysis
Developers -- integrate website content extraction into pipelines, chatbots, or knowledge bases

How to use it

Open the actor in Apify Console and select an endpoint (crawl_page, crawl_site, get_sitemap, or extract_content).
Enter the URL you want to crawl or extract content from.
For crawl_site, set the crawl depth and page limit.
Click "Start" to run the crawler.
Export results as JSON from the Dataset tab or use the Apify API.
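For the API route mentioned in the last step, a run can also be started programmatically. The sketch below only builds the HTTP request with Python's standard library and does not send it. It assumes Apify's run-sync-get-dataset-items endpoint and takes the actor ID novashieldai~website-content-crawler from this page's URL; the helper build_run_request and the token placeholder are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ACTOR_ID = "novashieldai~website-content-crawler"  # assumed from the page URL
API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(token, run_input):
    """Build (but do not send) a POST request that runs the actor synchronously."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_APIFY_TOKEN",  # placeholder, replace with a real token
    {"endpoint": "crawl_site", "url": "https://example.com", "depth": 2, "limit": 20},
)

# To actually run the actor (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```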

Input parameters

Parameter	Type	Default	Description
endpoint	string	crawl_page	API endpoint: crawl_page, crawl_site, get_sitemap, or extract_content
url	string	--	The URL to crawl or extract content from (required)
depth	integer	1	Maximum crawl depth, 1-5 (crawl_site only)
limit	integer	10	Maximum number of pages to crawl, 1-100 (crawl_site only)
output_format	string	text	Output format for extract_content: text or markdown
respect_robots	boolean	true	Whether to respect robots.txt rules
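The table above translates into a small input object. As a sketch, the hypothetical helper build_input below applies the defaults and ranges from the table before a run is started; the validation itself is an assumption about what is useful on the client side, not a description of how the actor validates input.

```python
ENDPOINTS = {"crawl_page", "crawl_site", "get_sitemap", "extract_content"}
OUTPUT_FORMATS = {"text", "markdown"}


def build_input(url, endpoint="crawl_page", depth=1, limit=10,
                output_format="text", respect_robots=True):
    """Return an input dict using the defaults and ranges from the table."""
    if not url:
        raise ValueError("url is required")
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("output_format must be 'text' or 'markdown'")
    return {
        "endpoint": endpoint,
        "url": url,
        "depth": depth,
        "limit": limit,
        "output_format": output_format,
        "respect_robots": respect_robots,
    }


crawl_input = build_input("https://example.com", endpoint="crawl_site",
                          depth=2, limit=50)
```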

Sample output

{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "title": "Introduction to Web Scraping",
    "description": "A beginner's guide to web scraping with Python",
    "text_content": "Web scraping is the process of extracting data from websites...",
    "markdown_content": "# Introduction to Web Scraping\n\nWeb scraping is the process...",
    "author": "Jane Smith",
    "published_date": "2026-05-10",
    "language": "en",
    "word_count": 1245,
    "char_count": 7830
  },
  "metadata": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "status_code": 200,
    "content_type": "text/html",
    "response_time_ms": 234.5,
    "canonical_url": "https://example.com/blog/intro-to-web-scraping",
    "og_tags": {
      "og:title": "Introduction to Web Scraping",
      "og:type": "article"
    },
    "meta_tags": {
      "description": "A beginner's guide to web scraping with Python"
    }
  },
  "links": [
    {
      "url": "https://example.com/blog/advanced-scraping",
      "text": "Advanced Scraping Techniques",
      "is_internal": true,
      "is_nofollow": false
    }
  ],
  "images": [
    {
      "url": "https://example.com/images/scraping-diagram.png",
      "alt_text": "Web scraping workflow diagram",
      "width": 800,
      "height": 450
    }
  ]
}
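A result shaped like the sample above can be condensed for an SEO-style check. The sketch below assumes each dataset item has the same nesting as the sample (content, metadata, links, images); the helper summarize_page and the trimmed item are hypothetical.

```python
def summarize_page(item):
    """Pull a few audit-friendly fields out of one crawl_page result."""
    content = item.get("content", {})
    metadata = item.get("metadata", {})
    links = item.get("links", [])
    images = item.get("images", [])
    return {
        "title": content.get("title"),
        "word_count": content.get("word_count"),
        "status_code": metadata.get("status_code"),
        "internal_links": [l["url"] for l in links if l.get("is_internal")],
        "external_links": [l["url"] for l in links if not l.get("is_internal")],
        "images_missing_alt": [i["url"] for i in images if not i.get("alt_text")],
    }


# Trimmed copy of the sample output, plus one image without alt text.
item = {
    "content": {"title": "Introduction to Web Scraping", "word_count": 1245},
    "metadata": {"status_code": 200},
    "links": [
        {"url": "https://example.com/blog/advanced-scraping", "is_internal": True},
    ],
    "images": [
        {"url": "https://example.com/images/scraping-diagram.png",
         "alt_text": "Web scraping workflow diagram"},
        {"url": "https://example.com/images/untitled.png", "alt_text": ""},
    ],
}

summary = summarize_page(item)
```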

How much does it cost?

Each result costs $0.002, so crawling 1,000 pages costs $2 and crawling 10,000 pages costs $20.

Apify gives every new user $5 in free monthly credits, so you can crawl about 2,500 pages for free.
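The arithmetic behind these figures can be checked with a few lines. The sketch assumes one result per crawled page and that the free credit is spent only on results from this actor; both helper names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.002")   # USD, from the pricing stated above
FREE_MONTHLY_CREDIT = Decimal("5")    # USD, from the pricing stated above


def estimate_cost(results):
    """Cost in USD for a given number of results."""
    return PRICE_PER_RESULT * results


def free_results(credit=FREE_MONTHLY_CREDIT):
    """How many results the credit covers, rounded down."""
    return int(credit // PRICE_PER_RESULT)
```

Decimal is used instead of float so that multiplying 0.002 by a page count does not pick up binary rounding error.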

Common questions

Can I get the content in markdown format? Yes. Use the extract_content endpoint and set output_format to "markdown". The crawl_page endpoint also returns markdown_content alongside plain text by default.

Does it follow links across different domains? The crawl_site endpoint only follows internal links (same domain). External links are captured in the output but not followed. This prevents the crawl from spiraling across the entire web.
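One way to picture the internal-versus-external distinction is a hostname comparison. The sketch below is an assumption about what "same domain" means: it treats only an exact hostname match as internal, so a subdomain such as blog.example.com counts as external here. The actor's real rule may differ, and the helper is_internal is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_internal(link_url, start_url):
    """Return True if `link_url` points at the same hostname as `start_url`.

    Relative links are resolved against `start_url` first.
    """
    absolute = urljoin(start_url, link_url)
    return urlparse(absolute).hostname == urlparse(start_url).hostname
```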

Does it handle JavaScript-rendered pages? The crawler works with server-rendered HTML. Pages that require JavaScript execution to load content may return incomplete results. For heavy SPA sites, consider using a browser-based crawler instead.

Contact & Custom Solutions

Need a custom scraper, higher volume, or a specific integration? We're here to help.

If anything isn't working right or you need support, don't hesitate to reach out.

Telegram: t.me/novashield_dev
Email: novashield.dev@gmail.com

URL: https://apify.com/novashieldai/website-content-crawler
