Quick Website Content Scraper ( Extract Text for RAG & LLMs )
Extract clean text from any website for AI/LLM applications. Supports both static and JavaScript-rendered sites (React, Vue, Angular). Perfect for RAG systems, chatbot training, and content analysis.
- Pricing: Pay per usage
- Developer: AutomateItPlease Workflow And Automaton Ops (maintained by Community)
- Last modified: 5 months ago
AI Web Content Scraper
Extract clean, structured text from any website - perfect for feeding into AI models, LLMs, and RAG systems.
Features
- Universal Compatibility: Works with both static HTML and JavaScript-rendered websites (React, Vue, Angular, Next.js)
- AI-Optimized Output: Clean text with line breaks, ready for LLM consumption
- Smart Detection: Automatically detects and switches to browser mode for JS-heavy sites
- Blazing Fast: Uses plain HTTP for static sites and launches a browser only when needed
- Batch Processing: Scrape multiple URLs in one run
- Zero Configuration: Just provide URLs and go
Use Cases
- RAG Systems: Feed website content into vector databases for AI retrieval
- LLM Training: Collect clean text data for fine-tuning language models
- Content Analysis: Extract text for sentiment analysis, summarization, or classification
- Knowledge Bases: Build AI-powered chatbots with website content
- Research: Gather structured data from multiple sources
Input
{"startUrls":[{"url":"https://example.com"},{"url":"https://another-site.com"}],"maxPages":100}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrls | array | Yes | - | List of URLs to scrape |
| maxPages | integer | No | 100 | Maximum number of pages to process |
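As a rough illustration of the parameter table, the sketch below builds an input object in the documented shape from a plain list of URLs. The helper name `build_input` and its validation rules (non-empty list, absolute http(s) URLs, positive `maxPages`) are assumptions made for this example, not checks the Actor is known to perform.

```python
from urllib.parse import urlparse


def build_input(urls, max_pages=100):
    """Build an input dict in the documented shape (hypothetical helper)."""
    # Assumption: a required startUrls list should also be non-empty.
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")
    start_urls = []
    for url in urls:
        parts = urlparse(url)
        # Assumption: only absolute http(s) URLs are meaningful to scrape.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        start_urls.append({"url": url})
    return {"startUrls": start_urls, "maxPages": max_pages}
```

Calling `build_input(["https://example.com"])` yields a dict with one `startUrls` entry and the default `maxPages` of 100, which can then be serialized with `json.dumps`.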
Output
Each scraped page produces:
{"url":"https://example.com","title":"Page Title","text":"All extracted text content...","wordCount":1250,"scrapedAt":"2026-01-19T21:18:43Z"}
Output Fields
- url: Original URL scraped
- title: Page title from the `<title>` tag
- text: Complete text content with line breaks preserved
- wordCount: Total number of words extracted
- scrapedAt: ISO timestamp of when the page was scraped
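To make the field definitions concrete, here is a minimal sketch of how such a record could be assembled. The function name `make_record` is hypothetical, and counting words by splitting on whitespace is an assumption; the Actor may count words differently.

```python
from datetime import datetime, timezone


def make_record(url, title, text, now=None):
    """Assemble one output item with the documented fields (hypothetical helper)."""
    # Assumption: a "word" is any whitespace-separated token.
    word_count = len(text.split())
    now = now or datetime.now(timezone.utc)
    # Second-precision UTC timestamp with a trailing "Z", as in the sample output.
    scraped_at = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "url": url,
        "title": title,
        "text": text,
        "wordCount": word_count,
        "scrapedAt": scraped_at,
    }
```

The optional `now` argument exists only so the timestamp can be pinned in tests.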
How It Works
- Fetch: Makes an HTTP request to each URL
- Detect: Checks whether the page is JavaScript-rendered
- Extract: Uses fast HTTP mode for static sites, or switches to a Playwright browser for JS-rendered sites
- Clean: Removes scripts, styles, and navigation, keeping only the main content
- Store: Saves the structured data to the dataset
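The Detect and Clean steps above can be sketched on HTML that has already been fetched. This is not the Actor's code: it swaps in the standard-library `html.parser` for Beautiful Soup so it runs without dependencies, and it leaves out the fetch and Playwright steps. The detection rule (script tags present but few visible words) and the `min_words` threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose content is dropped, and tags that force a line break.
SKIP_TAGS = {"script", "style", "nav", "noscript"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles, and navigation."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return (title, text) with one non-empty line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    lines = (" ".join(line.split()) for line in raw_lines)
    return parser.title, "\n".join(line for line in lines if line)


def looks_js_rendered(html, min_words=20):
    """Heuristic (assumption): scripts present but almost no visible text."""
    _, text = extract_text(html)
    return "<script" in html.lower() and len(text.split()) < min_words
```

A page flagged by `looks_js_rendered` would be the one handed to a browser for rendering before extraction; a real detector would likely use more signals than word count alone.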
Performance
- Static Sites: ~0.5-2 seconds per page
- JS-Rendered Sites: ~3-5 seconds per page (includes browser rendering)
- Throughput: 100 pages per run by default (configurable via maxPages)
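The per-page figures above can be turned into a back-of-the-envelope run-time estimate. The sketch assumes pages are processed one at a time, which the description does not state, so any concurrency in the Actor would make real runs shorter. The function name `estimate_runtime` is made up for this example.

```python
# Per-page ranges in seconds, taken from the figures quoted above.
STATIC_RANGE = (0.5, 2.0)
JS_RANGE = (3.0, 5.0)


def estimate_runtime(static_pages, js_pages):
    """Return (low, high) seconds, assuming strictly sequential processing."""
    low = static_pages * STATIC_RANGE[0] + js_pages * JS_RANGE[0]
    high = static_pages * STATIC_RANGE[1] + js_pages * JS_RANGE[1]
    return low, high
```

For a run of 80 static and 20 JS-rendered pages, this gives a range of 100 to 260 seconds.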
Technology
- Python 3.14
- Apify SDK: Actor framework and storage
- Playwright: Browser automation for JS-rendered sites
- Beautiful Soup: HTML parsing and text extraction
- HTTPX: Fast async HTTP client
Examples
Example 1: RAG System Data Collection
{"startUrls":[{"url":"https://docs.python.org/3/"},{"url":"https://docs.apify.com/"},{"url":"https://playwright.dev/"}],"maxPages":50}
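Once a run like this has finished, the `text` field of each record usually has to be split into smaller passages before it is embedded into a vector database. The following is one possible sketch of that downstream step, not part of the Actor: the function name `chunk_words` and the chunk and overlap sizes are illustrative assumptions, and splitting on whitespace discards the original line breaks.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk shares its first `overlap` words with the end of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.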
Example 2: Single Page Extraction
{"startUrls":[{"url":"https://blog.example.com/article"}],"maxPages":1}
Privacy & Compliance
- Respects standard web scraping practices
- No personal data collection
- Works only with publicly accessible content
- Users are responsible for complying with each site's terms of service
Support
For issues or questions:
- Check the Apify documentation
- Open an issue in the Actor's GitHub repository
- Contact support through Apify Console
License
This Actor is available for use on the Apify platform.
Made with ❤️ for the AI community
