Pricing
from $5.00 / 1,000 results
AI-Powered Smart Web Scraper
Intelligent content extraction from any website using Crawlee and AI. It auto-detects page structure, adapts to layout changes, and handles JavaScript rendering, with no custom code needed. Extract articles, products, and listings from thousands of pages.
AI Web Scraper
Extract AI-ready content from any website: clean Markdown output, smart chunking for RAG and embeddings, and structured metadata, all optimized for LLM data pipelines.
Features
- Clean Markdown Output: Automatically removes navigation, ads, footers, sidebars, and cookie banners, and extracts only the main content.
- Smart Chunking: Paragraph-aware text splitting with configurable chunk size and overlap, suited to vector databases and embedding models.
- Token Estimation: Each chunk includes an estimated token count, an approximation meant to guide sizing for OpenAI, Cohere, and other tokenizers.
- Structured Metadata: Extracts title, description, language, author, publish date, OG images, headings, links, and images.
- Multi-page Crawling: Follows links within the same domain up to a configurable depth, so you can process entire documentation sites or blogs.
- Multiple Output Formats: Markdown (default), plain text, or raw HTML.
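The smart chunking and token estimation features above can be illustrated with a small sketch. This is not the actor's actual implementation: the 4-characters-per-token heuristic, the function names, and the overlap strategy (repeating whole trailing paragraphs) are assumptions chosen for illustration. The output fields mirror the `index`, `text`, `tokenEstimate`, and `charCount` fields of the actor's chunks.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def make_chunk(index: int, paragraphs: list[str]) -> dict:
    """Join paragraphs into one chunk record."""
    body = "\n\n".join(paragraphs)
    return {
        "index": index,
        "text": body,
        "tokenEstimate": estimate_tokens(body),
        "charCount": len(body),
    }


def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Group whole paragraphs into chunks of roughly chunk_size tokens.

    Trailing paragraphs worth up to `overlap` tokens are repeated at the
    start of the next chunk. Paragraphs are never split mid-text.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > chunk_size:
            chunks.append(make_chunk(len(chunks), current))
            # Carry trailing paragraphs forward until the overlap budget is used.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = estimate_tokens(prev)
                if carried_tokens + prev_tokens > overlap:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append(make_chunk(len(chunks), current))
    return chunks
```

Splitting on paragraph boundaries keeps each chunk readable on its own, at the cost of chunk sizes that only approximate the target.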
Use Cases
- RAG Pipelines: Feed clean, chunked content into retrieval-augmented generation systems
- Vector Database Ingestion: Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
- LLM Fine-tuning Data: Extract structured training data from web sources
- Knowledge Base Building: Crawl documentation sites and create searchable knowledge bases
- Content Analysis: Extract and analyze web content at scale
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl |
| outputFormat | string | "markdown" | Output format: "markdown", "text", or "html" |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap between chunks in tokens |
| excludeSelectors | string[] | [] | Additional CSS selectors to exclude |
| includeLinks | boolean | true | Include extracted links in metadata |
| includeImages | boolean | true | Include extracted images in metadata |
| maxDepth | integer | 0 | Crawl depth (0 = provided URLs only) |
| respectRobotsTxt | boolean | true | Respect robots.txt rules |
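As a convenience, the parameters above can be assembled and sanity-checked on the client side before starting a run. The helper below is a hypothetical sketch: `build_run_input` is not part of the actor or the Apify client, and the check that `chunkOverlap` is smaller than `chunkSize` is an assumption about sensible values, not a documented rule.

```python
import copy

# Defaults copied from the input table.
DEFAULTS = {
    "maxPages": 10,
    "outputFormat": "markdown",
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "excludeSelectors": [],
    "includeLinks": True,
    "includeImages": True,
    "maxDepth": 0,
    "respectRobotsTxt": True,
}
ALLOWED_FORMATS = {"markdown", "text", "html"}


def build_run_input(urls, **overrides):
    """Merge overrides onto the documented defaults and validate the result."""
    if not urls:
        raise ValueError("urls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    run_input = {"urls": list(urls), **copy.deepcopy(DEFAULTS), **overrides}

    if run_input["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    # Assumed sanity check: an overlap as large as the chunk makes no progress.
    if run_input["chunkOverlap"] >= run_input["chunkSize"]:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    return run_input


run_input = build_run_input(["https://docs.example.com"], maxDepth=2, chunkSize=512)
```

The resulting dictionary can be passed as `run_input` when calling the actor.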
Output
Each page produces a dataset item with:
    {
      "url": "https://example.com/page",
      "metadata": {
        "title": "Page Title",
        "description": "Meta description",
        "language": "en",
        "author": "Author Name",
        "publishedDate": "2025-01-15",
        "ogImage": "https://example.com/image.jpg",
        "headings": [{"level": 1, "text": "Main Heading"}],
        "links": [{"text": "Link Text", "href": "https://..."}],
        "images": [{"alt": "Image description", "src": "https://..."}]
      },
      "content": "# Main Heading\n\nClean markdown content...",
      "chunks": [
        {
          "index": 0,
          "text": "First chunk of content...",
          "tokenEstimate": 245,
          "charCount": 980
        }
      ],
      "totalTokenEstimate": 1520,
      "scrapedAt": "2025-01-15T10:30:00.000Z"
    }
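For downstream ingestion it is often convenient to flatten each dataset item into one record per chunk. The sketch below assumes the item shape shown above; `chunk_records` is a hypothetical helper, and the `url_index` ID scheme is just one possible convention.

```python
def chunk_records(item: dict) -> list[dict]:
    """Flatten one dataset item into a list of per-chunk records."""
    metadata = item.get("metadata", {})
    return [
        {
            # ID built from the page URL and the chunk index.
            "id": f"{item['url']}_{chunk['index']}",
            "text": chunk["text"],
            "tokenEstimate": chunk["tokenEstimate"],
            "url": item["url"],
            "title": metadata.get("title"),
        }
        for chunk in item.get("chunks", [])
    ]


# Trimmed-down version of the sample item above.
sample_item = {
    "url": "https://example.com/page",
    "metadata": {"title": "Page Title"},
    "chunks": [
        {"index": 0, "text": "First chunk of content...", "tokenEstimate": 245, "charCount": 980}
    ],
}

for record in chunk_records(sample_item):
    print(record["id"], record["tokenEstimate"])
```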
Integration Examples
Pinecone / Vector DB
    from apify_client import ApifyClient

    client = ApifyClient("YOUR_API_TOKEN")

    # Start the actor and wait for the run to finish.
    run = client.actor("your-username/ai-web-scraper").call(
        run_input={
            "urls": ["https://docs.example.com"],
            "maxDepth": 2,
            "chunkSize": 512,
        }
    )

    # `embed` and `index` are placeholders for your own embedding function
    # and vector index client; define them before running this loop.
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        for chunk in item["chunks"]:
            # Embed and upsert to your vector database
            embedding = embed(chunk["text"])
            index.upsert([
                (
                    f"{item['url']}_{chunk['index']}",
                    embedding,
                    {
                        "text": chunk["text"],
                        "url": item["url"],
                        "title": item["metadata"]["title"],
                    },
                )
            ])
LangChain
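    # Newer LangChain releases ship the loader in langchain_community;
    # older ones expose it as langchain.document_loaders.
    from langchain_community.document_loaders import ApifyDatasetLoader
    from langchain_core.documents import Document

    # `run` is the actor run returned by the previous example.
    # The loader reads the API token from the APIFY_API_TOKEN environment variable.
    loader = ApifyDatasetLoader(
        dataset_id=run["defaultDatasetId"],
        # Called once per dataset item; returns one Document per chunk.
        dataset_mapping_function=lambda item: [
            Document(
                page_content=chunk["text"],
                metadata={"source": item["url"], "chunk_index": chunk["index"]},
            )
            for chunk in item["chunks"]
        ],
    )

    # The mapping returns a list per page, so flatten into one list of Documents.
    docs = [doc for group in loader.load() for doc in group]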
Chunk Size Recommendations
| Embedding Model | Recommended Chunk Size (tokens) |
|---|---|
| OpenAI text-embedding-3-small | 500-1000 |
| OpenAI text-embedding-3-large | 1000-2000 |
| Cohere embed-v3 | 256-512 |
| Sentence Transformers | 256-512 |
| Google Gecko | 500-1000 |
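One way to turn these recommendations into actor input is a small lookup. The sketch below is illustrative only: picking the midpoint of each range and a 10% overlap (matching the ratio of the default `chunkOverlap` to the default `chunkSize`) are assumptions, and `chunking_params` is a hypothetical helper.

```python
# Ranges copied from the recommendations table, in tokens.
RECOMMENDED_CHUNK_SIZES = {
    "OpenAI text-embedding-3-small": (500, 1000),
    "OpenAI text-embedding-3-large": (1000, 2000),
    "Cohere embed-v3": (256, 512),
    "Sentence Transformers": (256, 512),
    "Google Gecko": (500, 1000),
}


def chunking_params(model: str, overlap_ratio: float = 0.1) -> dict:
    """Return chunkSize and chunkOverlap values for the given embedding model."""
    low, high = RECOMMENDED_CHUNK_SIZES[model]
    chunk_size = (low + high) // 2  # midpoint of the recommended range
    return {
        "chunkSize": chunk_size,
        "chunkOverlap": round(chunk_size * overlap_ratio),
    }


params = chunking_params("Cohere embed-v3")
```

The returned dictionary can be merged into the actor's run input alongside `urls`.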
Pricing
This actor uses pay-per-event pricing at approximately $0.005 per page processed.
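As a worked example of the arithmetic, the sketch below estimates the cost of a crawl from the approximate per-page rate quoted above. Actual billing may differ, and `estimate_cost` is a hypothetical helper rather than part of any Apify API.

```python
from decimal import Decimal

# Approximate rate quoted above, in US dollars per page processed.
PRICE_PER_PAGE_USD = Decimal("0.005")


def estimate_cost(pages: int) -> Decimal:
    """Estimate the cost in USD of processing the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE_USD * pages


# 1,000 pages at $0.005 each comes to $5.00, matching the listed price.
print(f"${estimate_cost(1000):.2f}")
```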
License
MIT
