URL: https://apify.com/wheat_tourist/ai-context-scraper

AI Context Scraper · Apify


Pricing

from $70.00 / 1,000 results

AI Context Scraper

AI Context Scraper is a production-grade Apify Actor that gathers high-quality coding context from the Web, GitHub, and StackOverflow for AI agents and RAG systems. It uses NVIDIA Nemotron 3 Super to synthesize documents, code snippets, and patterns into actionable implementation guidance.

Rating

0.0

(0)

Developer

Varun Chopra

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 3

Monthly active users: 0

Last modified: 12 days ago

AI Context Scraper: Production-Grade Developer Knowledge Engine

Overview

AI Context Scraper is a production-grade Apify Actor that compiles high-quality coding context for AI agents, developer copilots, and engineering RAG systems. It transforms a coding task description into structured, LLM-optimized context: documentation, code examples, implementation patterns, and best practices.

Powered by NVIDIA Nemotron 3 Super (120B) via OpenRouter for LLM synthesis.

🚀 Quick Start

# Run via Apify CLI
apify actors call wheat_tourist/ai-context-scraper \
  -f input.json -t 300 -o

# input.json
{
  "task": "Build a Python FastAPI endpoint for file upload to S3",
  "max_sources": 10,
  "enable_llm_synthesis": true
}

Or call via the Apify Console.
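
The same run can also be started from Python. The sketch below assumes the `apify-client` package is installed and that the Actor writes its results to the run's default dataset (the per-result pricing suggests this, but the README does not state it). The `run_actor` helper and its names are illustrative, not part of the Actor.

```python
from typing import Any

ACTOR_ID = "wheat_tourist/ai-context-scraper"

# Same fields as the input.json shown above.
RUN_INPUT: dict[str, Any] = {
    "task": "Build a Python FastAPI endpoint for file upload to S3",
    "max_sources": 10,
    "enable_llm_synthesis": True,
}


def run_actor(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items.

    Hypothetical helper: assumes results land in the default dataset.
    """
    # Imported lazily so this module loads even where apify-client is absent.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_actor("<APIFY_TOKEN>", RUN_INPUT)` performs a billable run, so it is left out of the snippet itself.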

✅ Production Verified

The Actor has been exercised with real-world queries. The metrics below come from a single live run:

| Metric | Value |
| --- | --- |
| Search queries executed | 12 |
| Sources discovered | 5 |
| Pages crawled | 4 |
| Code snippets extracted | 92 |
| Critical context chunks | 11 |
| StackOverflow answers | 4 |
| LLM tokens used | 8,471 |
| Total execution time | 112s |
| Errors | 0 |

🔑 Key Features

Multi-Source Knowledge Mining

  • Web Search: DuckDuckGo (via ddgs) with documentation prioritization and rate-limited timeouts
  • GitHub Intelligence: Repository and code search with star-based ranking
  • StackOverflow Q&A: High-score accepted answers from the developer community
  • Documentation Priority: Boosted ranking for official docs (Python, AWS, FastAPI, etc.)

LLM RAG Synthesis

  • Model: nvidia/nemotron-3-super-120b-a12b:free (configurable)
  • Actionable Guidance: Synthesizes gathered context into implementation-ready code with open questions
  • Automatic Prompting: Builds token-optimized context prompts with code snippets, patterns, and SO answers
  • Graceful Degradation: If the LLM call fails, the pipeline safely returns raw structured context
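
The graceful-degradation behaviour listed above can be pictured roughly as follows. This is a minimal sketch, not the Actor's `llm_synthesizer.py`: the function name, the `call_llm` callable, and the shape of the returned dictionary are all assumptions made for illustration.

```python
from typing import Any, Callable


def synthesize_or_fallback(
    context: dict[str, Any],
    call_llm: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Attach LLM guidance when the call succeeds; otherwise keep raw context."""
    result: dict[str, Any] = {"context": context, "llm_guidance": None, "errors": []}
    try:
        result["llm_guidance"] = call_llm(context)
    except Exception as exc:  # broad on purpose: an LLM failure must not crash the run
        result["errors"].append(f"llm_synthesis_failed: {exc}")
    return result


def failing_llm(_context: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for an LLM call that times out (hypothetical)."""
    raise TimeoutError("upstream timed out")


degraded = synthesize_or_fallback({"code_snippets": []}, failing_llm)
```

Here `degraded` still carries the structured context, with `llm_guidance` left as `None` and the failure recorded under `errors`.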

Advanced Intelligence

  • Semantic Relevance Filtering: sentence-transformers/all-MiniLM-L6-v2 embeddings for precision ranking
  • Relevance Bucketization: Context classified as Critical / Helpful / Noise (noise dropped)
  • Implementation Pattern Detection: Automatically identifies auth, caching, async, database patterns
  • Content Deduplication: MinHash/shingling-based near-duplicate removal
  • Code Quality Scoring: Ranks snippets by completeness, relevance, and documentation
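
To make the relevance filtering and bucketization above concrete, here is a small sketch that scores chunks by cosine similarity against a task vector and drops the noise bucket. The thresholds (0.60 and 0.35) and all function names are hypothetical; the README does not publish the Actor's cut-offs. Real vectors would come from the `all-MiniLM-L6-v2` embedding model, while the example uses toy two-dimensional vectors.

```python
import math
from typing import Sequence

CRITICAL_MIN = 0.60  # hypothetical threshold, not taken from the Actor
HELPFUL_MIN = 0.35   # hypothetical threshold, not taken from the Actor


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def bucket_for(score: float) -> str:
    """Map a relevance score to critical / helpful / noise."""
    if score >= CRITICAL_MIN:
        return "critical"
    if score >= HELPFUL_MIN:
        return "helpful"
    return "noise"


def bucketize(task_vec: Sequence[float], chunks: list[tuple[str, Sequence[float]]]) -> list[dict]:
    """Score each (text, vector) chunk, drop noise, and sort by relevance descending."""
    kept = []
    for text, vec in chunks:
        score = cosine_similarity(task_vec, vec)
        bucket = bucket_for(score)
        if bucket != "noise":
            kept.append({"text": text, "relevance_score": round(score, 3), "bucket": bucket})
    return sorted(kept, key=lambda c: c["relevance_score"], reverse=True)


ranked = bucketize(
    [1.0, 0.0],
    [("on-topic", [1.0, 0.0]), ("related", [1.0, 2.0]), ("off-topic", [0.0, 1.0])],
)
```

With these toy vectors the off-topic chunk scores 0.0 and is dropped, leaving one critical and one helpful chunk.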

Enterprise & Security

  • SSRF Protection: CIDR-based private IP blocking (RFC 1918 + IPv6 link-local)
  • Input Sanitization: Pattern-based injection detection (script tags, eval, data URIs)
  • Secret Redaction: Automatic redaction of tokens/keys in logs
  • Caching Layer: Apify KV store with monotonic-clock TTL (immune to container clock skew)
  • Observability: Per-phase timing, relevance buckets, cache hit rates, error tracking
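
The monotonic-clock TTL idea mentioned in the caching bullet can be sketched with an in-memory cache. This is only an illustration: the class name and interface are hypothetical, and because `time.monotonic()` readings are comparable only within a single process, the sketch says nothing about how the Actor persists expiry data in the Apify key-value store.

```python
import time
from typing import Any, Callable, Optional


class MonotonicTTLCache:
    """In-memory TTL cache whose expiry ignores wall-clock adjustments."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        """Store a value together with its monotonic expiry deadline."""
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict lazily on read
            return None
        return value
```

Injecting the clock keeps the expiry logic testable without sleeping: a test can pass a function that returns a controlled value and advance it by hand.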

📊 Output Structure

{
  "task": "Build a FastAPI endpoint for S3 uploads",
  "relevant_context": [
    {
      "source": "https://fastapi.tiangolo.com/tutorial/request-files/",
      "bucket": "critical",
      "relevance_score": 0.66,
      "why_it_matters": "High-relevance documentation chunk",
      "key_detail": "# Request Files\nYou can define files to be uploaded..."
    }
  ],
  "context": {
    "concepts": [...],
    "code_snippets": [...],
    "api_references": [...],
    "best_practices": [...],
    "implementation_patterns": [...],
    "stackoverflow_answers": [...]
  },
  "llm_guidance": {
    "content": "## Task\n- Build a FastAPI endpoint...\n\n## Implementation\n```python\n...\n```",
    "model": "nvidia/nemotron-3-super-120b-a12b:free",
    "tokens_used": 8471,
    "finish_reason": "stop"
  },
  "metrics": {
    "timing": {"total_seconds": 112.24, "search_seconds": 8.51, ... },
    "counts": {"queries": 12, "sources_found": 5, "pages_scraped": 4, ... },
    "relevance_buckets": {"critical_chunks": 11, "helpful_chunks": 5, ... },
    "quality": {"avg_chunk_relevance": 0.462, "avg_snippet_relevance": 0.723},
    "errors": []
  }
}
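
A downstream consumer might read this structure along the following lines. The helper names are hypothetical, and treating `llm_guidance` as possibly missing or null (for example when synthesis is disabled or fails) is an assumption rather than documented behaviour. The second entry in `sample` is made up for the example.

```python
from typing import Any, Optional


def critical_chunks(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Return chunks from the "critical" bucket, highest relevance first."""
    chunks = result.get("relevant_context", [])
    critical = [c for c in chunks if c.get("bucket") == "critical"]
    return sorted(critical, key=lambda c: c.get("relevance_score", 0.0), reverse=True)


def guidance_text(result: dict[str, Any]) -> Optional[str]:
    """Return the synthesized guidance, or None when it is absent."""
    guidance = result.get("llm_guidance") or {}
    return guidance.get("content")


sample = {
    "relevant_context": [
        {
            "source": "https://fastapi.tiangolo.com/tutorial/request-files/",
            "bucket": "critical",
            "relevance_score": 0.66,
        },
        # Hypothetical entry, added only to show the filter at work.
        {"source": "https://example.com/blog", "bucket": "helpful", "relevance_score": 0.41},
    ],
    "llm_guidance": None,
}

top_sources = [chunk["source"] for chunk in critical_chunks(sample)]
```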

⚙️ Configuration

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task | string | (required) | Coding task description |
| max_sources | integer | 10 | Maximum sources to scrape (3–50) |
| allowed_domains | array | [] | Domain whitelist (empty = all) |
| include_github | boolean | true | Enable GitHub repository mining |
| include_github_code_search | boolean | true | Enable authenticated GitHub code search |
| github_token | string | null | GitHub token (or GITHUB_TOKEN env var) |
| github_code_languages | array | [] | Target languages for code search |
| include_stackoverflow | boolean | true | Enable StackOverflow Q&A mining |
| max_code_snippets | integer | 20 | Maximum code snippets to return (1–100) |
| enable_cache | boolean | true | Enable caching for faster repeated runs |
| chunk_size | integer | 500 | Token limit per LLM chunk (100–2000) |
| enable_llm_synthesis | boolean | true | Enable LLM-powered context synthesis |
| openrouter_api_key | string | null | OpenRouter API key (or OPENROUTER_API_KEY env var) |
| openrouter_model | string | nvidia/nemotron-3-super-120b-a12b:free | Model ID for LLM synthesis |
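
Callers who want to catch out-of-range values before starting a run could mirror the documented defaults and ranges locally. The sketch below uses a standard-library dataclass as a stand-in for the Actor's Pydantic schema, which is not shown in this README; the class and helper names are hypothetical, and only a subset of the parameters is covered.

```python
from dataclasses import dataclass, field

DEFAULT_MODEL = "nvidia/nemotron-3-super-120b-a12b:free"


def _check_range(name: str, value: int, low: int, high: int) -> None:
    """Raise ValueError when value falls outside the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass
class ActorInput:
    """Subset of the documented input fields with their documented defaults."""

    task: str
    max_sources: int = 10
    max_code_snippets: int = 20
    chunk_size: int = 500
    allowed_domains: list[str] = field(default_factory=list)
    enable_llm_synthesis: bool = True
    openrouter_model: str = DEFAULT_MODEL

    def __post_init__(self) -> None:
        if not self.task.strip():
            raise ValueError("task is required")
        _check_range("max_sources", self.max_sources, 3, 50)
        _check_range("max_code_snippets", self.max_code_snippets, 1, 100)
        _check_range("chunk_size", self.chunk_size, 100, 2000)


params = ActorInput(task="Build a Python FastAPI endpoint for file upload to S3")
```

Passing `max_sources=2` or an empty `task` raises `ValueError` before any run is started.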

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENROUTER_API_KEY | Yes (for LLM) | OpenRouter API key. Can also be passed as input. |
| GITHUB_TOKEN | No | GitHub personal access token for code search. |

Set via Apify secrets:

apify secrets add openrouter_api_key "sk-or-v1-..."
apify secrets add github_token "ghp_..."
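
Since each credential can arrive either as an input field or as an environment variable, a resolver along these lines is one way to combine the two. The precedence shown (input value first, then environment) is an assumption, as the README does not say which source wins, and `resolve_secret` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_secret(
    input_value: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the explicit input value; otherwise fall back to the environment.

    Assumed precedence, not confirmed by the Actor's documentation.
    """
    if input_value and input_value.strip():
        return input_value.strip()
    value = env.get(env_name, "").strip()
    return value or None
```

For example, `resolve_secret(None, "OPENROUTER_API_KEY")` reads the environment, while a non-empty input value is returned as-is after trimming whitespace.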

πŸ—οΈ Architecture

Module Structure

src/
├── __main__.py             # Entry point with Pydantic input validation
├── orchestrator.py         # Pipeline coordinator with metrics & error recovery
├── search.py               # DDGS search with query expansion & timeouts
├── github_miner.py         # GitHub repo + code search
├── stackoverflow_miner.py  # StackOverflow Q&A mining
├── crawler.py              # Async HTTP crawler with robots.txt, retry, rate-limit
├── extractor.py            # Content + code extraction (readability + BeautifulSoup)
├── pattern_detector.py     # Implementation pattern detection
├── relevance.py            # Semantic ranking with embeddings + bucketization
├── chunker.py              # LLM-optimized text chunking (tiktoken)
├── deduplicator.py         # Near-duplicate content removal (MinHash)
├── llm_synthesizer.py      # OpenRouter LLM RAG synthesis
├── cache_manager.py        # Apify KV store caching (monotonic TTL)
├── security.py             # Input validation, SSRF protection, secret redaction
├── metrics.py              # Observability and telemetry (dict-based phase tracking)
├── exceptions.py           # Custom exception hierarchy
└── formatter.py            # Final output formatting

Pipeline Flow

Task Input
↓
Task Understanding & Query Expansion (12+ search queries)
↓
Multi-Source Discovery (Web + GitHub + StackOverflow)
↓
Async Crawling (semaphore-limited, robots.txt-aware)
↓
Content Extraction (readability + BeautifulSoup)
↓
Code & Pattern Extraction
↓
Content Deduplication (MinHash shingling)
↓
Semantic Relevance Ranking (sentence-transformers)
↓
Relevance Bucketization (Critical / Helpful / Noise)
↓
LLM Context Synthesis (NVIDIA Nemotron 3 Super via OpenRouter)
↓
Structured Context Output + LLM Guidance + Metrics
↓
Caching for Future Runs
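
The deduplication stage in the flow above can be illustrated with word shingles and Jaccard similarity. Note the substitution: this sketch computes exact Jaccard similarity over shingle sets, whereas MinHash (which the Actor is described as using) estimates the same quantity from compact signatures. The function names, the shingle width `k=3`, and the `0.8` threshold are assumptions.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8, k: int = 3) -> list[str]:
    """Keep the first occurrence of each group of near-duplicate texts."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for text in texts:
        current = shingles(text, k)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of something already kept
        kept.append(text)
        kept_shingles.append(current)
    return kept


docs = [
    "Upload a file to S3 with boto3 and FastAPI",
    "upload a file to s3 with boto3 and fastapi",
    "Configure CORS for a FastAPI application",
]
unique_docs = deduplicate(docs)
```

The pairwise comparison is quadratic in the number of kept texts, which is one reason a production pipeline would reach for MinHash signatures instead.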

🧪 Testing

# Unit tests (534 tests across 10 modules, ~90s)
pytest -v
# Unit tests only (skip live network tests)
pytest -m "not live" -v
# Live regression tests (6 tests, hits real network)
pytest -m live --no-cov -v
# With coverage report
pytest --cov=src --cov-report=html

Test Coverage

| Module | Coverage |
| --- | --- |
| deduplicator.py | 100% |
| metrics.py | 100% |
| exceptions.py | 100% |
| pattern_detector.py | 100% |
| llm_synthesizer.py | 99% |
| chunker.py | 99% |
| security.py | 99% |
| cache_manager.py | 99% |
| extractor.py | 97% |
| formatter.py | 96% |
| orchestrator.py | 83% |
| Total | 77% |

Test Modules

| File | Tests | Covers |
| --- | --- | --- |
| test_security.py | 79 | SSRF protection, injection detection, secret redaction |
| test_metrics.py | 55 | Phase timing, counters, finalize, reset |
| test_cache_manager.py | 54 | TTL expiry, disabled mode, KVS error handling |
| test_orchestrator.py | 36+ | Pipeline run, error fallbacks, static helpers |
| test_actor_input.py | 27 | Pydantic schema, all boundary values |
| test_llm_synthesizer.py | 34 | Retry logic, prompt building, structured output |
| test_extractor.py | 65 | HTML extraction, snippet detection, edge cases |
| test_formatter.py | 66 | Output structure, dedup, relevance buckets |
| test_deduplicator.py | 45 | Shingling, Jaccard similarity, determinism |
| test_chunker.py | 27 | Token-aware chunking, code block handling |
| test_pattern_detector.py | 25 | Auth, async, caching, DB pattern detection |
| test_search.py | 35 | Query expansion, rate limiting, mocked DDGS |

🛠️ Deployment

Build & Push

# Initialize (first time only)
git init && git add -A && git commit -m "Initial commit"
# Deploy to Apify (builds Docker image remotely)
apify push

Docker Build (local)

docker build -t ai-context-scraper .
docker run -e OPENROUTER_API_KEY=sk-or-v1-... ai-context-scraper

The Dockerfile pre-downloads the sentence-transformers model during build so cold starts are fast.

📈 Performance

  • Async Architecture: Concurrent crawling with semaphore limits
  • Smart Caching: Task-level caching with monotonic-clock TTL (immune to container clock drift)
  • Batch Processing: Embeddings computed in batches for efficiency
  • Rate Limiting: Configurable requests/second with sleep-outside-semaphore optimization
  • Timeout Protection: asyncio.wait_for() on all external calls
  • Broad Error Recovery: Catches 8 exception types without crashing the pipeline
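
The concurrency points above (semaphore limits, sleeping outside the semaphore, and `asyncio.wait_for()` timeouts) can be combined as in the sketch below. It is an interpretation, not the Actor's `crawler.py`: `crawl_all`, its parameters, and the staggered-start form of rate limiting are assumptions, and `fake_fetch` stands in for a real HTTP request.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def crawl_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
    delay_seconds: float = 0.0,
    timeout_seconds: float = 10.0,
) -> dict[str, Optional[str]]:
    """Fetch every URL concurrently; failed or timed-out URLs map to None."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: dict[str, Optional[str]] = {}

    async def worker(index: int, url: str) -> None:
        # Rate-limit delay happens BEFORE taking a slot, so a sleeping
        # task never occupies one of the concurrency slots.
        await asyncio.sleep(index * delay_seconds)
        async with semaphore:
            try:
                results[url] = await asyncio.wait_for(fetch(url), timeout=timeout_seconds)
            except (asyncio.TimeoutError, OSError):
                results[url] = None  # degrade per URL instead of failing the crawl

    await asyncio.gather(*(worker(i, u) for i, u in enumerate(urls)))
    return results


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; URLs containing "slow" exceed the timeout."""
    if "slow" in url:
        await asyncio.sleep(1.0)
    return f"<html>{url}</html>"


pages = asyncio.run(
    crawl_all(
        ["https://a.example", "https://slow.example"],
        fake_fetch,
        max_concurrency=2,
        timeout_seconds=0.05,
    )
)
```

The slow URL is cancelled by `asyncio.wait_for()` and recorded as `None`, while the fast one completes normally.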

🔒 Security

  • SSRF Protection: ipaddress module CIDR checks against all RFC 1918, loopback, link-local, and IPv6 private ranges
  • Input Validation: Pydantic models with strict typing + regex-based injection detection
  • Secret Redaction: Automatic redaction of ghp_*, sk-or-v1-*, Bearer * tokens in logs
  • Content Sanitization: readability-lxml for safe HTML parsing
  • SEO Spam Filtering: Multi-keyword detection (sponsored, affiliate, promo, etc.)
  • Domain Whitelisting: Optional domain restrictions
  • API Key Validation: Format checks with suspicious pattern detection
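
Two of the controls above, CIDR-based SSRF blocking and log redaction, are sketched below using only the standard library. The network list, regular expressions, and function names are illustrative assumptions and not the contents of `security.py`. The IP check handles IP literals only; a real guard would also resolve hostnames and re-check every resolved address.

```python
import ipaddress
import re

# Illustrative block list: RFC 1918, loopback, link-local, IPv6 unique-local.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
        "fc00::/7",                                        # IPv6 unique local
    )
]

# Assumed token shapes based on the prefixes named in the list above.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]+"),
    re.compile(r"sk-or-v1-[A-Za-z0-9]+"),
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),
]


def is_blocked_ip(host: str) -> bool:
    """True if host is an IP literal inside a blocked network.

    Hostnames return False here; they must be resolved and re-checked.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)


def redact(message: str) -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message
```

Membership tests between an IPv4 address and an IPv6 network (or the reverse) simply evaluate to false, so one mixed list can serve both address families.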

📦 Dependencies

| Package | Purpose |
| --- | --- |
| apify | Actor runtime |
| httpx | Async HTTP client |
| beautifulsoup4 | HTML parsing |
| readability-lxml | Content extraction |
| ddgs | DuckDuckGo search |
| markdownify | HTML → Markdown |
| sentence-transformers | Semantic embeddings |
| tiktoken | Token counting |
| pydantic | Input validation |
| rapidfuzz | Lexical similarity fallback |

🎯 Use Cases

  • AI Coding Agents: Power your coding agent with real-time context about libraries, patterns, and best practices
  • Developer Copilots: Provide your IDE extension with rich, structured coding context
  • RAG Systems: Build retrieval-augmented generation pipelines with curated developer knowledge
  • Engineering Onboarding: Generate comprehensive learning materials for new team members
  • Code Review Assistance: Fetch implementation patterns and best practices to guide reviews

📝 License

MIT License. See the LICENSE file for details.

🀝 Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


Built for production use by AI infrastructure teams. Actor ID: 2OBJzyOtx1FyGGt2f | Latest Build: 0.1.24 | Model: NVIDIA Nemotron 3 Super | Tests: 534 passing | Coverage: 77%
