URL: https://apify.com/cloud9_ai/ai-web-scraper

AI Web Scraper for LLM, RAG and Vector DBs · Apify


Pricing

from $5.00 / 1,000 results

AI-Powered Smart Web Scraper

Intelligent content extraction from any website using Crawlee + AI. Auto-detects structure, adapts to layout changes, and handles JavaScript rendering. No custom code needed. Extract articles, products, and listings from thousands of pages.

Rating: 0.0 (0)

Developer: cloud9 (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 31
  • Monthly active users: 3
  • Last modified: 2 months ago

AI Web Scraper

Extract AI-ready content from any website. Clean Markdown output, smart chunking for RAG/embeddings, and structured metadata, optimized for LLM data pipelines.

Features

  • Clean Markdown Output: Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
  • Smart Chunking: Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models.
  • Token Estimation: Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
  • Structured Metadata: Extracts title, description, language, author, publish date, OG images, headings, links, and images.
  • Multi-page Crawling: Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
  • Multiple Output Formats: Markdown (default), plain text, or raw HTML.
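
The Smart Chunking and Token Estimation features can be illustrated with a minimal sketch. This is not the actor's actual implementation: the function names `estimate_tokens` and `chunk_paragraphs`, the rule of roughly 4 characters per token, and the way overlap is carried over are all assumptions made for illustration. The output keys mirror the `chunks` entries shown in the Output section.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (assumed heuristic)."""
    return max(1, len(text) // 4) if text else 0


def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[Dict]:
    """Pack blank-line-separated paragraphs into chunks of about `chunk_size`
    estimated tokens. Each new chunk starts with trailing paragraphs of the
    previous chunk, up to about `overlap` estimated tokens. A single paragraph
    larger than `chunk_size` is kept whole rather than split."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > chunk_size:
            groups.append(current)
            # Carry over trailing paragraphs that fit within the overlap budget.
            carried: List[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = estimate_tokens(prev)
                if carried_tokens + prev_tokens > overlap:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens
    if current:
        groups.append(current)

    chunks = []
    for index, parts in enumerate(groups):
        body = "\n\n".join(parts)
        chunks.append({
            "index": index,
            "text": body,
            "tokenEstimate": estimate_tokens(body),
            "charCount": len(body),
        })
    return chunks
```

Splitting only at paragraph boundaries keeps sentences intact, at the cost of chunk sizes that only approximate the target.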

Use Cases

  • RAG Pipelines: Feed clean, chunked content into retrieval-augmented generation systems
  • Vector Database Ingestion: Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
  • LLM Fine-tuning Data: Extract structured training data from web sources
  • Knowledge Base Building: Crawl documentation sites and create searchable knowledge bases
  • Content Analysis: Extract and analyze web content at scale

Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl |
| outputFormat | string | "markdown" | Output format: "markdown", "text", or "html" |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap between chunks in tokens |
| excludeSelectors | string[] | [] | Additional CSS selectors to exclude |
| includeLinks | boolean | true | Include extracted links in metadata |
| includeImages | boolean | true | Include extracted images in metadata |
| maxDepth | integer | 0 | Crawl depth (0 = provided URLs only) |
| respectRobotsTxt | boolean | true | Respect robots.txt rules |
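
As a sketch of how these parameters might be assembled on the client side, the helper below merges overrides into the documented defaults and runs a few sanity checks. `build_run_input` is a hypothetical helper, not part of the actor or the Apify client, and the check that `chunkOverlap` stays below `chunkSize` is an assumption rather than a documented rule.

```python
import copy
from typing import Any, Dict, List

# Defaults copied from the Input table.
DEFAULTS: Dict[str, Any] = {
    "maxPages": 10,
    "outputFormat": "markdown",
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "excludeSelectors": [],
    "includeLinks": True,
    "includeImages": True,
    "maxDepth": 0,
    "respectRobotsTxt": True,
}
ALLOWED_FORMATS = {"markdown", "text", "html"}


def build_run_input(urls: List[str], **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Deep copy so callers never share the default list objects.
    run_input = {"urls": list(urls), **copy.deepcopy(DEFAULTS), **overrides}
    if run_input["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    # Assumed sanity check: overlap should be smaller than the chunk itself.
    if run_input["chunkOverlap"] >= run_input["chunkSize"]:
        raise ValueError("chunkOverlap should be smaller than chunkSize")
    return run_input


run_input = build_run_input(["https://docs.example.com"], maxDepth=2, chunkSize=512)
```

The resulting dictionary can be passed as `run_input` when calling the actor.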

Output

Each page produces a dataset item with:

{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{"level": 1, "text": "Main Heading"}],
    "links": [{"text": "Link Text", "href": "https://..."}],
    "images": [{"alt": "Image description", "src": "https://..."}]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
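
A dataset item in this shape can be flattened into one record per chunk before embedding. The sketch below assumes the field names shown in the sample item; `flatten_item` and the `url_index` ID scheme are illustrative choices, not something the actor provides.

```python
from typing import Any, Dict, List


def flatten_item(item: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one dataset item into one flat record per chunk."""
    metadata = item.get("metadata", {})
    return [
        {
            "id": f"{item['url']}_{chunk['index']}",  # assumed ID scheme
            "text": chunk["text"],
            "url": item["url"],
            "title": metadata.get("title"),
            "tokenEstimate": chunk["tokenEstimate"],
        }
        for chunk in item.get("chunks", [])
    ]


# Abbreviated copy of the sample item.
sample = {
    "url": "https://example.com/page",
    "metadata": {"title": "Page Title"},
    "chunks": [
        {"index": 0, "text": "First chunk of content...",
         "tokenEstimate": 245, "charCount": 980}
    ],
}
records = flatten_item(sample)
```

Keeping the URL and title on every record means each chunk can be traced back to its source page after retrieval.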

Integration Examples

Pinecone / Vector DB

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Run the actor and wait for it to finish.
# The actor ID is taken from this listing's URL.
run = client.actor("cloud9_ai/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

# embed() and index are placeholders for your embedding function and
# vector database index; they are not defined here.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # Embed and upsert to your vector database
        embedding = embed(chunk["text"])
        index.upsert([(f"{item['url']}_{chunk['index']}", embedding, {
            "text": chunk["text"],
            "url": item["url"],
            "title": item["metadata"]["title"],
        })])

LangChain

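from itertools import chain

# In recent LangChain releases these classes live in langchain_community
# and langchain_core rather than the top-level langchain package.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

# `run` is the result of the actor call in the previous example.
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    # Returns one list of Documents per dataset item (one Document per chunk).
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
# The loader applies the mapping once per item, so load() is expected to
# return a list of lists here; flatten it into a single list of Documents.
docs = list(chain.from_iterable(loader.load()))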

Chunk Size Recommendations

| Embedding Model | Recommended Chunk Size (tokens) |
|---|---|
| OpenAI text-embedding-3-small | 500–1000 |
| OpenAI text-embedding-3-large | 1000–2000 |
| Cohere embed-v3 | 256–512 |
| Sentence Transformers | 256–512 |
| Google Gecko | 500–1000 |
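
These recommendations can be encoded as a small lookup when setting `chunkSize` programmatically. In the sketch below, `pick_chunk_size`, the short model keys, and the choice of the range midpoint are all assumptions made for illustration; only the ranges themselves come from the table.

```python
from typing import Dict, Tuple

# Ranges copied from the recommendations table (in tokens).
# The dictionary keys are shorthand chosen for this sketch.
RECOMMENDED: Dict[str, Tuple[int, int]] = {
    "text-embedding-3-small": (500, 1000),
    "text-embedding-3-large": (1000, 2000),
    "embed-v3": (256, 512),
    "sentence-transformers": (256, 512),
    "gecko": (500, 1000),
}


def pick_chunk_size(model: str, default: int = 1000) -> int:
    """Hypothetical helper: midpoint of the recommended range for `model`,
    or `default` (the actor's documented chunkSize default) if unknown."""
    low, high = RECOMMENDED.get(model, (default, default))
    return (low + high) // 2


chunk_size = pick_chunk_size("embed-v3")
```

The midpoint is an arbitrary starting point; the best value depends on the content and the retrieval task.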

Pricing

This actor uses pay-per-event pricing at approximately $0.005 per page processed.
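
For budgeting, the approximate per-page rate can be turned into a quick estimate. This is a rough sketch: `estimate_cost` is a hypothetical helper, and it assumes a flat rate with no other platform charges, which the listing does not confirm.

```python
from decimal import Decimal

# Approximate rate stated in the Pricing section.
PRICE_PER_PAGE = Decimal("0.005")


def estimate_cost(pages: int) -> Decimal:
    """Estimated cost in USD for `pages` processed pages at the flat rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages


cost = estimate_cost(1000)
```

At this rate, 1,000 pages come to about $5.00, which matches the "from $5.00 / 1,000 results" figure at the top of the listing.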

License

MIT
