URL: https://apify.com/novashieldai/website-content-crawler

Website Content Crawler · Apify


Pricing

Pay per usage

Website Content Crawler

Universal website crawler that extracts clean text/markdown content, metadata, links, and images from any URL. Features sitemap parsing, robots.txt respect, and multi-page BFS crawling with depth control.

Rating: 0.0 (0)

Developer: Ali haydar Karadaş (Maintained by Community)

Actor stats:

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 days ago

Website Content Crawler extracts clean text and markdown content from any website, along with metadata, links, and images. Whether you need to scrape a single page, crawl an entire site, or parse a sitemap, this actor handles it with minimal setup.

What does Website Content Crawler do?

This actor provides four endpoints that cover different content extraction needs:

  • Crawl Page scrapes a single URL and returns the page content, metadata, links, and images.
  • Crawl Site follows links from a starting URL and crawls multiple pages up to a configurable depth and page limit.
  • Get Sitemap parses a site's sitemap.xml and returns all listed URLs with their last modified dates, change frequencies, and priorities.
  • Extract Content pulls just the main content from a page in either plain text or markdown format.

The crawler respects robots.txt by default (configurable), extracts Open Graph and meta tags, identifies internal vs. external links, and captures image alt text and dimensions. Output is clean and structured -- ready for AI training data, content analysis, SEO audits, or database storage.
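To illustrate what respecting robots.txt means in practice, here is a minimal sketch using Python's standard urllib.robotparser. It is not the actor's own code. The robots.txt body and the is_allowed helper are hypothetical and exist only for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/blog/post"))       # True
print(is_allowed(ROBOTS_TXT, "https://example.com/private/report"))  # False
```

With respect_robots set to false, a crawler would presumably skip a check of this kind and fetch both URLs.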

What data do you get?

Page content:

  • url, title, description
  • text_content -- extracted plain text
  • markdown_content -- content converted to markdown
  • author, published_date, language
  • word_count, char_count

Page metadata:

  • status_code, content_type, response_time_ms
  • canonical_url, og_tags, meta_tags

Links found on page:

  • url, text, is_internal, is_nofollow

Images found on page:

  • url, alt_text, width, height

Sitemap data:

  • url, lastmod, changefreq, priority

Crawl summary:

  • start_url, pages_crawled, total_links
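The sitemap fields listed above map directly onto the elements of a standard sitemap.xml file. The following sketch shows one way such a file could be turned into records with those field names, using only the Python standard library. The sample XML and the parse_sitemap helper are assumptions made for illustration, and the actor's real value types (for example whether priority is a number or a string) may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry; optional fields default to None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        priority = url_el.findtext("sm:priority", default=None, namespaces=SITEMAP_NS)
        entries.append({
            "url": url_el.findtext("sm:loc", default=None, namespaces=SITEMAP_NS),
            "lastmod": url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
            "changefreq": url_el.findtext("sm:changefreq", default=None, namespaces=SITEMAP_NS),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


for entry in parse_sitemap(SAMPLE_SITEMAP):
    print(entry)
```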

Who is this for?

  • AI and ML engineers -- collect training data from websites in clean text or markdown format
  • SEO professionals -- audit site structure, meta tags, internal linking, and content quality
  • Content analysts -- extract and compare content across competitor websites
  • Researchers -- build text corpora from web sources for academic or commercial analysis
  • Developers -- integrate website content extraction into pipelines, chatbots, or knowledge bases

How to use it

  1. Open the actor in Apify Console and select an endpoint (crawl_page, crawl_site, get_sitemap, or extract_content).
  2. Enter the URL you want to crawl or extract content from.
  3. For crawl_site, set the crawl depth and page limit.
  4. Click "Start" to run the crawler.
  5. Export results as JSON from the Dataset tab or use the Apify API.
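For the Apify API route in step 5, the sketch below builds a request against Apify's synchronous "run and get dataset items" endpoint using only the Python standard library. It assumes the actor ID is the store path with "/" replaced by "~", that your API token is available to the script, and that the run finishes within the synchronous time limit. The helper names build_request and run_crawler are hypothetical.

```python
import json
import urllib.request

# Assumption: the store path "novashieldai/website-content-crawler"
# is addressed in API URLs with "~" instead of "/".
ACTOR_ID = "novashieldai~website-content-crawler"
API_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"


def build_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request whose JSON body is the actor input."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_crawler(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the dataset items as parsed JSON."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (needs a real token and network access, so it is left commented out):
# items = run_crawler("YOUR_APIFY_TOKEN", {"endpoint": "crawl_page",
#                                          "url": "https://example.com"})
```

For long crawl_site runs, starting the run asynchronously and reading the dataset afterwards may be a better fit than a single blocking call.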

Input parameters

Parameter      | Type    | Default    | Description
endpoint       | string  | crawl_page | API endpoint: crawl_page, crawl_site, get_sitemap, or extract_content
url            | string  | --         | The URL to crawl or extract content from (required)
depth          | integer | 1          | Maximum crawl depth, 1-5 (crawl_site only)
limit          | integer | 10         | Maximum number of pages to crawl, 1-100 (crawl_site only)
output_format  | string  | text       | Output format for extract_content: text or markdown
respect_robots | boolean | true       | Whether to respect robots.txt rules
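The parameter table can be mirrored in a small client-side check that catches out-of-range values before a run starts. This is a sketch under the ranges and defaults listed above. The build_input helper is hypothetical, and sending depth, limit, and output_format only for the endpoints they apply to is a design choice of this example, not documented actor behavior.

```python
ENDPOINTS = ("crawl_page", "crawl_site", "get_sitemap", "extract_content")
OUTPUT_FORMATS = ("text", "markdown")


def build_input(url, endpoint="crawl_page", depth=1, limit=10,
                output_format="text", respect_robots=True):
    """Validate values against the documented ranges and return the input dict."""
    if not url:
        raise ValueError("url is required")
    if endpoint not in ENDPOINTS:
        raise ValueError(f"endpoint must be one of {ENDPOINTS}")

    run_input = {"endpoint": endpoint, "url": url, "respect_robots": respect_robots}

    if endpoint == "crawl_site":
        # depth and limit apply to crawl_site only.
        if not 1 <= depth <= 5:
            raise ValueError("depth must be between 1 and 5")
        if not 1 <= limit <= 100:
            raise ValueError("limit must be between 1 and 100")
        run_input["depth"] = depth
        run_input["limit"] = limit
    elif endpoint == "extract_content":
        # output_format applies to extract_content only.
        if output_format not in OUTPUT_FORMATS:
            raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")
        run_input["output_format"] = output_format

    return run_input


print(build_input("https://example.com", endpoint="crawl_site", depth=2, limit=50))
```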

Sample output

{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "title": "Introduction to Web Scraping",
    "description": "A beginner's guide to web scraping with Python",
    "text_content": "Web scraping is the process of extracting data from websites...",
    "markdown_content": "# Introduction to Web Scraping\n\nWeb scraping is the process...",
    "author": "Jane Smith",
    "published_date": "2026-05-10",
    "language": "en",
    "word_count": 1245,
    "char_count": 7830
  },
  "metadata": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "status_code": 200,
    "content_type": "text/html",
    "response_time_ms": 234.5,
    "canonical_url": "https://example.com/blog/intro-to-web-scraping",
    "og_tags": {
      "og:title": "Introduction to Web Scraping",
      "og:type": "article"
    },
    "meta_tags": {
      "description": "A beginner's guide to web scraping with Python"
    }
  },
  "links": [
    {
      "url": "https://example.com/blog/advanced-scraping",
      "text": "Advanced Scraping Techniques",
      "is_internal": true,
      "is_nofollow": false
    }
  ],
  "images": [
    {
      "url": "https://example.com/images/scraping-diagram.png",
      "alt_text": "Web scraping workflow diagram",
      "width": 800,
      "height": 450
    }
  ]
}
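Once a result like the one above is loaded as a Python dict, pulling out the fields that matter for an SEO audit or a follow-up crawl takes only a few lines. The sketch below uses a trimmed copy of the sample output. The helper names are hypothetical, and it assumes every result has the same shape as the sample.

```python
# Trimmed copy of the sample output above, as a Python dict.
SAMPLE_ITEM = {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "content": {"title": "Introduction to Web Scraping", "word_count": 1245},
    "metadata": {"status_code": 200, "response_time_ms": 234.5},
    "links": [
        {
            "url": "https://example.com/blog/advanced-scraping",
            "text": "Advanced Scraping Techniques",
            "is_internal": True,
            "is_nofollow": False,
        }
    ],
    "images": [
        {
            "url": "https://example.com/images/scraping-diagram.png",
            "alt_text": "Web scraping workflow diagram",
            "width": 800,
            "height": 450,
        }
    ],
}


def followable_internal_links(item: dict) -> list[str]:
    """Return URLs of links that are internal and not marked nofollow."""
    return [
        link["url"]
        for link in item.get("links", [])
        if link.get("is_internal") and not link.get("is_nofollow")
    ]


def summarize(item: dict) -> dict:
    """Collect a few audit-friendly fields from one result."""
    content = item.get("content", {})
    metadata = item.get("metadata", {})
    return {
        "url": item.get("url"),
        "title": content.get("title"),
        "status_code": metadata.get("status_code"),
        "word_count": content.get("word_count"),
        "internal_links": len(followable_internal_links(item)),
        "images_missing_alt": sum(
            1 for image in item.get("images", []) if not image.get("alt_text")
        ),
    }


print(summarize(SAMPLE_ITEM))
```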

How much does it cost?

Each result costs $0.002. Crawling 1,000 pages costs just $2, and 10,000 pages costs $20.

Apify gives every new user $5 in free monthly credits, so you can crawl about 2,500 pages for free.
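The pricing arithmetic above can be reproduced with a short calculation. This sketch assumes one billed result per crawled page, as the figures above imply, and uses Decimal so that the per-result price of $0.002 is not distorted by floating-point rounding. The function names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.002")   # USD per result, from the pricing above
FREE_MONTHLY_CREDIT = Decimal("5")    # USD of free monthly credit, from the pricing above


def cost_usd(results: int) -> Decimal:
    """Cost in USD for a given number of results."""
    return PRICE_PER_RESULT * results


def free_results(credit: Decimal = FREE_MONTHLY_CREDIT) -> int:
    """Number of whole results covered by the given credit."""
    return int(credit // PRICE_PER_RESULT)


print(cost_usd(1_000))    # 2.000
print(cost_usd(10_000))   # 20.000
print(free_results())     # 2500
```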

Common questions

Can I get the content in markdown format? Yes. Use the extract_content endpoint and set output_format to "markdown". The crawl_page endpoint also returns markdown_content alongside plain text by default.

Does it follow links across different domains? The crawl_site endpoint only follows internal links (same domain). External links are captured in the output but not followed. This prevents the crawl from spiraling across the entire web.
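The same-domain rule combines naturally with the depth and limit parameters in a breadth-first crawl. The sketch below is not the actor's implementation. It runs over a hypothetical in-memory link graph instead of real HTTP requests, treats the start page as depth 0, and treats "same domain" as an exact host match, all of which are assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real page fetches.
LINK_GRAPH = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def crawl_site(start_url, get_links, depth=1, limit=10):
    """Breadth-first crawl that only follows links on the start URL's host."""
    start_host = urlparse(start_url).netloc
    visited = []                      # pages in the order they were crawled
    seen = {start_url}                # URLs already queued, to avoid repeats
    queue = deque([(start_url, 0)])   # (url, depth at which it was found)

    while queue and len(visited) < limit:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue                  # do not expand links beyond the depth limit
        for link in get_links(url):
            if urlparse(link).netloc != start_host:
                continue              # external link: recorded elsewhere, never followed
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


def get_links(url):
    """Look up outgoing links in the hypothetical graph."""
    return LINK_GRAPH.get(url, [])


print(crawl_site("https://example.com/", get_links, depth=1))
print(crawl_site("https://example.com/", get_links, depth=2))
```

Because the queue is first-in, first-out, every page at one depth is visited before any page at the next depth, so the page limit cuts off the deepest pages first.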

Does it handle JavaScript-rendered pages? The crawler works with server-rendered HTML. Pages that require JavaScript execution to load content may return incomplete results. For heavy SPA sites, consider using a browser-based crawler instead.

Contact & Custom Solutions

Need a custom scraper, higher volume, or a specific integration? We're here to help.

If anything isn't working right or you need support, don't hesitate to reach out.
