Website Extractor

Pricing

$25.00/month + usage

Download complete websites as ZIP archives. Useful for creating offline backups, archiving websites, or downloading entire sites with all assets, including source code. Intended for research purposes.

Rating

0.0 (0)

Developer

mikolabs

Maintained by Community

Actor stats

Last modified

7 months ago

Scrape Any Website with Source Code

Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code.

Features

✅ Complete Website Downloads - Downloads entire websites with all assets and source code
✅ ZIP Archive Output - Automatically creates compressed ZIP files with full source code
✅ Configurable Depth - Control how deep to follow links (1-10 levels)
✅ Rate Limiting - Respect servers with configurable download rates
✅ Domain Filtering - Stay on same domain or follow external links
✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
✅ Robots.txt Support - Optionally respect website's robots.txt
✅ Progress Tracking - Real-time logging of scraping progress
✅ Statistics - File counts, sizes, and compression ratios

Input Configuration

Required

`url` (Website URL) - The URL to scrape (must include http:// or https://)
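
The scheme requirement can be checked before a run is started. The following is a minimal sketch using only the Python standard library; `is_valid_start_url` is a hypothetical helper name, and the Actor's own validation may be stricter or looser than this.

```python
from urllib.parse import urlparse


def is_valid_start_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host part."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_valid_start_url("https://example.com"))  # True
print(is_valid_start_url("example.com"))          # False (no scheme)
```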

Optional

Parameter	Type	Default	Description
`depth`	Integer	2	How many links deep to follow (1-10)
`stayOnDomain`	Boolean	true	Only download from the same domain
`externalDepth`	Integer	0	How deep to follow external links
`connections`	Integer	4	Number of simultaneous downloads
`maxRate`	Integer	0	Max download rate in KB/s (0 = unlimited)
`maxSize`	Integer	0	Max total size in MB (0 = unlimited)
`maxTime`	Integer	0	Max scraping time in seconds (0 = unlimited)
`retries`	Integer	2	Number of retry attempts on error
`timeout`	Integer	30	Connection timeout in seconds
`getImages`	Boolean	true	Download image files
`getVideos`	Boolean	true	Download video files
`followRobots`	Boolean	true	Respect robots.txt
`outputName`	String	null	Custom output name (auto-generated if empty)
`cleanup`	Boolean	true	Remove source files after creating ZIP
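
As a rough illustration of how these defaults combine with user input, the sketch below merges a partial input with the values from the table and checks the documented `depth` range. `DEFAULTS` and `resolve_input` are hypothetical names; how the Actor itself treats unknown keys or out-of-range values is not documented here, so the error handling is an assumption.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "depth": 2,
    "stayOnDomain": True,
    "externalDepth": 0,
    "connections": 4,
    "maxRate": 0,      # KB/s, 0 = unlimited
    "maxSize": 0,      # MB, 0 = unlimited
    "maxTime": 0,      # seconds, 0 = unlimited
    "retries": 2,
    "timeout": 30,     # seconds
    "getImages": True,
    "getVideos": True,
    "followRobots": True,
    "outputName": None,
    "cleanup": True,
}


def resolve_input(user_input: dict) -> dict:
    """Merge user input over the defaults and apply basic checks."""
    if "url" not in user_input:
        raise ValueError("'url' is required")
    unknown = set(user_input) - set(DEFAULTS) - {"url"}
    if unknown:
        # Assumption: unknown keys are rejected rather than ignored.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **user_input}
    if not 1 <= config["depth"] <= 10:
        raise ValueError("'depth' must be between 1 and 10")
    return config


config = resolve_input({"url": "https://example.com", "depth": 3})
print(config["depth"], config["connections"])  # 3 4
```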

Output

The Actor provides two types of output:

1. Dataset

Statistics and metadata for each scrape:

{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "config": { ... },
  "status": "success"
}
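
The size fields in the sample are consistent with sizes in bytes and a ratio expressed as the percentage of space saved: (1 - 2621440 / 5242880) × 100 = 50.0. That definition is an assumption, since at exactly 50% the alternative reading (ZIP size as a percentage of the original) gives the same number. Under that assumption, a minimal sketch of how such statistics could be produced while archiving a directory looks like this; `zip_directory` is a hypothetical helper, not the Actor's code.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir: Path, zip_path: Path) -> dict:
    """Archive src_dir into zip_path and return dataset-style statistics.

    zip_path should live outside src_dir so the archive does not
    end up containing an older copy of itself.
    """
    files = [p for p in sorted(src_dir.rglob("*")) if p.is_file()]
    total_size = sum(p.stat().st_size for p in files)

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # Store paths relative to the mirror root.
            zf.write(p, arcname=p.relative_to(src_dir).as_posix())

    zip_size = zip_path.stat().st_size
    # Assumed definition: percentage of space saved by compression.
    ratio = round((1 - zip_size / total_size) * 100, 1) if total_size else 0.0
    return {
        "fileCount": len(files),
        "totalSize": total_size,
        "zipSize": zip_size,
        "compressionRatio": ratio,
    }
```

Calling `zip_directory(Path("mirror"), Path("mirror.zip"))` on a downloaded site directory would return a dictionary with the same four keys as the dataset item above.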

2. Key-Value Store

The complete website as a ZIP archive. Access it via:

Apify Console: Storage → Key-Value Store → [filename].zip
API: https://api.apify.com/v2/key-value-stores/{storeId}/keys/{filename}.zip

Usage Examples

Example 1: Basic Website Backup

{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true
}

Downloads the website up to 2 levels deep, staying on the same domain.

Example 2: Deep Archive with External Links

{
  "url": "https://example.com",
  "depth": 5,
  "externalDepth": 1,
  "stayOnDomain": false
}

Downloads 5 levels deep and follows external links 1 level.

Example 3: Fast Scrape (HTML/CSS/JS Only)

{
  "url": "https://example.com",
  "depth": 3,
  "getImages": false,
  "getVideos": false,
  "connections": 8
}

Fast scraping without images or videos, using 8 parallel connections.

Example 4: Rate-Limited Polite Scrape

{
  "url": "https://example.com",
  "depth": 2,
  "maxRate": 500,
  "connections": 2,
  "followRobots": true
}

Polite scraping with rate limiting and respecting robots.txt.

Example 5: Time-Limited Scrape

{
  "url": "https://example.com",
  "depth": 10,
  "maxTime": 300,
  "maxSize": 100
}

Stops after 5 minutes or 100 MB, whichever comes first.
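
Any of the inputs above can also be submitted programmatically. The sketch below only builds the HTTP request for starting a run through the Apify API and does not send it. The Actor ID `mikolabs/website-extractor` is assumed from this page's URL, `build_run_request` is a hypothetical helper, and `MY_APIFY_TOKEN` is a placeholder.

```python
import json
from urllib.request import Request


def build_run_request(actor_id: str, run_input: dict, token: str) -> Request:
    """Build a POST request that starts an Actor run with the given input."""
    # In API paths, "username/actor-name" is written with a tilde.
    path_id = actor_id.replace("/", "~")
    return Request(
        f"https://api.apify.com/v2/acts/{path_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Example 5 from above: stop after 5 minutes or 100 MB.
run_input = {"url": "https://example.com", "depth": 10, "maxTime": 300, "maxSize": 100}
request = build_run_request("mikolabs/website-extractor", run_input, "MY_APIFY_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response describes the started run, including the storages where the dataset item and ZIP archive end up.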

How It Works

1. Input Validation - Validates the URL and configuration
2. HTTrack Execution - Runs HTTrack with configured parameters to download website source code
3. Progress Monitoring - Logs progress in real-time
4. Pre-ZIP Cleanup - Removes HTTrack cache files and index files before archiving
5. ZIP Creation - Creates a compressed archive of all website files and source code
6. Storage - Saves ZIP to Key-Value Store and stats to Dataset
7. Post-ZIP Cleanup - Optionally removes temporary files after ZIP creation
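
The HTTrack execution step could look roughly like the sketch below, which translates a resolved configuration into an `httrack` command line. The flags are standard HTTrack command-line options, but the mapping from Actor parameters to flags, the unit conversions (KB and MB taken as multiples of 1024 bytes), and the helper name `build_httrack_command` are assumptions; the Actor's real invocation is not documented here. Content filters for `getImages` and `getVideos` are left out.

```python
def build_httrack_command(config: dict, output_dir: str) -> list[str]:
    """Translate Actor-style settings into an httrack argument list."""
    cmd = ["httrack", config["url"], "-O", output_dir, "-q"]  # -q: no prompts
    cmd.append(f"-r{config['depth']}")           # mirror depth
    cmd.append(f"-%e{config['externalDepth']}")  # external links depth
    cmd.append(f"-c{config['connections']}")     # simultaneous connections
    cmd.append(f"-R{config['retries']}")         # retries on error
    cmd.append(f"-T{config['timeout']}")         # timeout in seconds
    cmd.append("-s2" if config["followRobots"] else "-s0")  # robots.txt rules
    if config["stayOnDomain"]:
        cmd.append("-a")                         # stay on the same address
    if config["maxRate"] > 0:
        cmd.append(f"-A{config['maxRate'] * 1024}")         # bytes per second
    if config["maxSize"] > 0:
        cmd.append(f"-M{config['maxSize'] * 1024 * 1024}")  # total bytes
    if config["maxTime"] > 0:
        cmd.append(f"-E{config['maxTime']}")                # seconds
    return cmd


# Settings matching the rate-limited example, with table defaults filled in.
config = {
    "url": "https://example.com", "depth": 2, "externalDepth": 0,
    "connections": 2, "retries": 2, "timeout": 30, "followRobots": True,
    "stayOnDomain": True, "maxRate": 500, "maxSize": 0, "maxTime": 0,
}
print(" ".join(build_httrack_command(config, "/tmp/mirror")))
```

The resulting list can be passed to `subprocess.run` or `asyncio.create_subprocess_exec` on a machine where HTTrack is installed; reading the process output line by line would then cover the progress monitoring step.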

Technical Details

Based On

HTTrack 3.49+ - Industry-standard website copier
Python 3.11 - Modern async Python runtime
Apify SDK 2.7+ - For Actor integration and storage

Limitations

Some JavaScript-heavy SPAs may not download completely
Websites with aggressive bot protection may block scraping
Dynamic content loaded after page load may be missed
Maximum recommended depth is 5-6 for most websites

Performance

Small websites (< 100 pages): 1-5 minutes
Medium websites (100-1000 pages): 5-30 minutes
Large websites (1000+ pages): 30+ minutes

Performance depends on:

Website size and structure
Number of connections
Network speed
Rate limiting settings

Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

✅ Respect robots.txt files (enabled by default)
✅ Don't overload servers (use rate limiting)
✅ Check website Terms of Service
✅ Don't scrape copyrighted content without permission
✅ Use reasonable connection limits (2-8)
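
To see what a site's robots.txt permits before starting a run, the rules can be evaluated with Python's standard `urllib.robotparser`. This is an illustrative sketch that parses rules supplied as text; `allowed_by_robots` is a hypothetical helper and is independent of how the Actor applies `followRobots`.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines: list[str], url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


sample_rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(sample_rules, "https://example.com/index.html"))  # True
print(allowed_by_robots(sample_rules, "https://example.com/private/a"))   # False
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the file directly.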

Troubleshooting

Scraping Takes Too Long

Reduce `depth` to 1 or 2
Disable `getVideos` and `getImages`
Increase `connections` (but be respectful)
Set `maxTime` or `maxSize` limits

ZIP File Too Large

Reduce `depth`
Disable `getVideos`
Set a `maxSize` limit
Use `maxTime` to stop early

Website Blocks Scraping

Enable `followRobots`
Reduce `connections` to 2-4
Add rate limiting with `maxRate`
Increase `timeout` if connections are slow

Missing Content

Increase `depth`
Set `externalDepth` to 1 or more and `stayOnDomain` to false if content is on other domains
Check whether the website relies heavily on JavaScript (such pages may not download completely)
Enable `getImages` and `getVideos` if needed

Development

Local Testing

# Install dependencies
pip install -r requirements.txt
# Run locally
apify run

Building

# Build Docker image
docker build -t httrack-scraper .
# Run container
docker run httrack-scraper

Support

For issues or questions:

Check Actor logs for detailed error messages
Review HTTrack documentation: https://www.httrack.com/
Contact Apify support through the platform

License

This Actor uses HTTrack, which is licensed under GPL v3.

Version History

1.0 - Initial release with full HTTrack integration, source code download, and ZIP archive output

URL: https://apify.com/mikolabs/website-extractor