URL: https://apify.com/mikolabs/website-extractor

Website Extractor · Apify


Pricing

$25.00/month + usage

Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code. Intended for research purposes.

Rating: 0.0 (0)

Developer: mikolabs (Maintained by Community)

Actor stats

  • Bookmarked: 1
  • Total users: 38
  • Monthly active users: 3
  • Last modified: 7 months ago

Scrape Any Website with Source Code

Features

✅ Complete Website Downloads - Downloads entire websites with all assets and source code
✅ ZIP Archive Output - Automatically creates compressed ZIP files with full source code
✅ Configurable Depth - Control how deep to follow links (1-10 levels)
✅ Rate Limiting - Respect servers with configurable download rates
✅ Domain Filtering - Stay on the same domain or follow external links
✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
✅ Robots.txt Support - Optionally respect the website's robots.txt
✅ Progress Tracking - Real-time logging of scraping progress
✅ Statistics - File counts, sizes, and compression ratios

Input Configuration

Required

  • Website URL (url) - The URL to scrape (must include http:// or https://)

Optional

Parameter      Type     Default  Description
depth          Integer  2        How many links deep to follow (1-10)
stayOnDomain   Boolean  true     Only download from the same domain
externalDepth  Integer  0        How deep to follow external links
connections    Integer  4        Number of simultaneous downloads
maxRate        Integer  0        Max download rate in KB/s (0 = unlimited)
maxSize        Integer  0        Max total size in MB (0 = unlimited)
maxTime        Integer  0        Max scraping time in seconds (0 = unlimited)
retries        Integer  2        Number of retry attempts on error
timeout        Integer  30       Connection timeout in seconds
getImages      Boolean  true     Download image files
getVideos      Boolean  true     Download video files
followRobots   Boolean  true     Respect robots.txt
outputName     String   null     Custom output name (auto-generated if empty)
cleanup        Boolean  true     Remove source files after creating ZIP
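
The defaults and constraints listed above can be applied before a run is started. The following is a minimal sketch, not the Actor's own code: `normalize_input` is a hypothetical helper, and only the two rules the listing states explicitly (the URL scheme and the 1-10 depth range) are checked.

```python
from urllib.parse import urlparse

# Defaults copied from the optional-parameter list in this README.
DEFAULTS = {
    "depth": 2,
    "stayOnDomain": True,
    "externalDepth": 0,
    "connections": 4,
    "maxRate": 0,
    "maxSize": 0,
    "maxTime": 0,
    "retries": 2,
    "timeout": 30,
    "getImages": True,
    "getVideos": True,
    "followRobots": True,
    "outputName": None,
    "cleanup": True,
}


def normalize_input(raw: dict) -> dict:
    """Hypothetical helper: merge user input with defaults and validate it."""
    parsed = urlparse(raw.get("url", ""))
    # The listing requires an explicit http:// or https:// scheme.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https://")
    config = {**DEFAULTS, **raw}
    # The listing documents depth as 1-10.
    if not 1 <= config["depth"] <= 10:
        raise ValueError("depth must be between 1 and 10")
    return config
```

Calling `normalize_input({"url": "https://example.com"})` returns the full configuration with every optional parameter set to its default.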

Output

The Actor provides two types of output:

1. Dataset

Statistics and metadata for each scrape:

{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "config": { ... },
  "status": "success"
}
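
The size fields in this record can be reproduced with Python's standard `zipfile` module. The sketch below is an illustration under assumptions, not the Actor's implementation: sizes are taken to be bytes, `compressionRatio` is taken to mean the percentage of space saved (the sample record cannot distinguish this from "percentage remaining", since both give 50.0), and the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def compression_ratio(total_size: int, zip_size: int) -> float:
    """Percent of space saved by compression (assumed definition)."""
    if total_size <= 0:
        return 0.0
    return round((1 - zip_size / total_size) * 100, 1)


def zip_with_stats(source_dir: str, zip_path: str) -> dict:
    """Zip every file under source_dir and return record-style statistics.

    zip_path should live outside source_dir so the archive does not
    end up listing an older copy of itself.
    """
    root = Path(source_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the site root, with forward slashes.
            archive.write(path, arcname=path.relative_to(root).as_posix())
    total_size = sum(p.stat().st_size for p in files)
    zip_size = Path(zip_path).stat().st_size
    return {
        "zipFile": Path(zip_path).name,
        "fileCount": len(files),
        "totalSize": total_size,
        "zipSize": zip_size,
        "compressionRatio": compression_ratio(total_size, zip_size),
    }


# Numbers from the sample record: 5242880 bytes shrunk to 2621440 bytes.
print(compression_ratio(5242880, 2621440))  # 50.0
```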

2. Key-Value Store

The complete website as a ZIP archive. Access it via:

  • Apify Console: Storage → Key-Value Store → [filename].zip
  • API: https://api.apify.com/v2/key-value-stores/{storeId}/records/{filename}.zip

Usage Examples

Example 1: Basic Website Backup

{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true
}

Downloads the website up to 2 levels deep, staying on the same domain.

Example 2: Deep Archive with External Links

{
  "url": "https://example.com",
  "depth": 5,
  "externalDepth": 1,
  "stayOnDomain": false
}

Downloads 5 levels deep and follows external links 1 level.

Example 3: Fast Scrape (HTML/CSS/JS Only)

{
  "url": "https://example.com",
  "depth": 3,
  "getImages": false,
  "getVideos": false,
  "connections": 8
}

Fast scraping without images or videos, using 8 parallel connections.

Example 4: Rate-Limited Polite Scrape

{
  "url": "https://example.com",
  "depth": 2,
  "maxRate": 500,
  "connections": 2,
  "followRobots": true
}

Polite scraping with rate limiting and respecting robots.txt.

Example 5: Time-Limited Scrape

{
  "url": "https://example.com",
  "depth": 10,
  "maxTime": 300,
  "maxSize": 100
}

Stops after 5 minutes or 100 MB, whichever comes first.

How It Works

  1. Input Validation - Validates the URL and configuration
  2. HTTrack Execution - Runs HTTrack with configured parameters to download website source code
  3. Progress Monitoring - Logs progress in real-time
  4. Pre-ZIP Cleanup - Removes HTTrack cache files and index files before archiving
  5. ZIP Creation - Creates a compressed archive of all website files and source code
  6. Storage - Saves ZIP to Key-Value Store and stats to Dataset
  7. Post-ZIP Cleanup - Optionally removes temporary files after ZIP creation
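
Step 2 can be pictured as translating the input into an HTTrack command line. The Actor's actual mapping is not published here, so the sketch below is a guess at one plausible wiring: `build_httrack_command` is a hypothetical name, the unit conversions (KB/s to bytes/s, MB to bytes) are assumptions, and only standard HTTrack options are used (`-O`, `-r`, `-%e`, `-c`, `-R`, `-T`, `-s`, `-A`, `-M`, `-E`).

```python
def build_httrack_command(config: dict, output_dir: str) -> list[str]:
    """Hypothetical mapping from Actor input to HTTrack arguments."""
    cmd = ["httrack", config["url"], "-O", output_dir]
    cmd.append(f"-r{config.get('depth', 2)}")            # mirror depth
    cmd.append(f"-%e{config.get('externalDepth', 0)}")   # external links depth
    cmd.append(f"-c{config.get('connections', 4)}")      # parallel connections
    cmd.append(f"-R{config.get('retries', 2)}")          # retries on error
    cmd.append(f"-T{config.get('timeout', 30)}")         # timeout in seconds
    # -s2 always follows robots.txt, -s0 never does.
    cmd.append("-s2" if config.get("followRobots", True) else "-s0")
    # A value of 0 means "unlimited", so the flag is simply left out.
    if config.get("maxRate", 0) > 0:
        cmd.append(f"-A{config['maxRate'] * 1000}")          # KB/s -> bytes/s (assumed)
    if config.get("maxSize", 0) > 0:
        cmd.append(f"-M{config['maxSize'] * 1024 * 1024}")   # MB -> bytes (assumed)
    if config.get("maxTime", 0) > 0:
        cmd.append(f"-E{config['maxTime']}")                 # seconds
    return cmd
```

Returning a list rather than a shell string lets the command be passed to `subprocess.run` without shell quoting, which matters because the URL comes from user input.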

Technical Details

Based On

  • HTTrack 3.49+ - Industry-standard website copier
  • Python 3.11 - Modern async Python runtime
  • Apify SDK 2.7+ - For Actor integration and storage

Limitations

  • Some JavaScript-heavy SPAs may not download completely
  • Websites with aggressive bot protection may block scraping
  • Dynamic content loaded after page load may be missed
  • Maximum recommended depth is 5-6 for most websites

Performance

  • Small websites (< 100 pages): 1-5 minutes
  • Medium websites (100-1000 pages): 5-30 minutes
  • Large websites (1000+ pages): 30+ minutes

Performance depends on:

  • Website size and structure
  • Number of connections
  • Network speed
  • Rate limiting settings

Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

  • ✅ Respect robots.txt files (enabled by default)
  • ✅ Don't overload servers (use rate limiting)
  • ✅ Check website Terms of Service
  • ✅ Don't scrape copyrighted content without permission
  • ✅ Use reasonable connection limits (2-8)
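
The configuration-related items in this checklist can be checked mechanically before a run. A small sketch, assuming the input field names used elsewhere in this README; `politeness_warnings` is a hypothetical helper, and the suggested 500 KB/s cap is only borrowed from Example 4.

```python
def politeness_warnings(config: dict) -> list[str]:
    """Return human-readable warnings for settings that may strain a server."""
    warnings = []
    if not config.get("followRobots", True):
        warnings.append("followRobots is disabled; robots.txt will be ignored")
    if config.get("maxRate", 0) == 0:
        warnings.append("maxRate is 0 (unlimited); consider a cap such as 500 KB/s")
    connections = config.get("connections", 4)
    if connections > 8:
        warnings.append(f"connections={connections} exceeds the suggested maximum of 8")
    return warnings
```

An empty list means none of the three checks fired; permission and Terms of Service still have to be reviewed by hand.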

Troubleshooting

Scraping Takes Too Long

  • Reduce depth to 1 or 2
  • Disable getVideos and getImages
  • Increase connections (but be respectful)
  • Set maxTime or maxSize limits

ZIP File Too Large

  • Reduce depth
  • Disable getVideos
  • Set maxSize limit
  • Use maxTime to stop early

Website Blocks Scraping

  • Enable followRobots
  • Reduce connections to 2-4
  • Add rate limiting with maxRate
  • Increase timeout if connections are slow

Missing Content

  • Increase depth
  • Set externalDepth above 0 (and stayOnDomain to false) if content is on other domains
  • Check whether the website relies heavily on JavaScript (such pages may not download completely)
  • Enable getImages and getVideos if needed

Development

Local Testing

# Install dependencies
pip install -r requirements.txt
# Run locally
apify run

Building

# Build Docker image
docker build -t httrack-scraper .
# Run container
docker run httrack-scraper

Support

For issues or questions:

  • Check Actor logs for detailed error messages
  • Review HTTrack documentation: https://www.httrack.com/
  • Contact Apify support through the platform

License

This Actor uses HTTrack, which is licensed under GPL v3.

Version History

  • 1.0 - Initial release with full HTTrack integration, source code download, and ZIP archive output
