URL: https://apify.com/benthepythondev/github-repository-intelligence

GitHub Repository Intelligence - API-Based Data Scraper

Pricing

from $20.00 / 1,000 results

Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.

Rating: 0.0 (0)

Developer: ben (Maintained by Community)

Actor stats: 1 bookmark, 51 total users, 11 monthly active users; last modified 5 hours ago

GitHub Repository Intelligence - API-Based Data & Documentation Scraper

Extract comprehensive repository data from GitHub using the official REST API.

Fetch repository metadata, README content, documentation, topics, language statistics, and more. Perfect for AI/LLM training data, developer research, competitive analysis, and tech stack discovery. Legal, stable, and fast API-based extraction.

Features

✅ Dual Scraping Modes

  • Search Mode: Find repositories by keywords, language, stars
  • Direct Mode: Fetch specific repositories by URL

✅ Comprehensive Data Extraction

  • Repository metadata (stars, forks, watchers, issues)
  • README content (perfect for LLM training)
  • Programming language statistics
  • Repository topics/tags
  • License information
  • Creation/update timestamps
  • Owner information

✅ Official GitHub API

  • Uses GitHub REST API v3 (100% legal)
  • No browser automation required
  • Stable and reliable
  • Optional authentication for higher rate limits
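As an illustration of what an API-based README fetch with optional authentication can look like, here is a minimal sketch using only the Python standard library. It is not the actor's own code. The helper names `build_readme_request` and `decode_readme` and the `User-Agent` value are hypothetical; the `/repos/{owner}/{repo}/readme` endpoint, the `Authorization: Bearer` header, and the base64-encoded `content` field are part of GitHub's REST API.

```python
import base64
import urllib.request

API_ROOT = "https://api.github.com"


def build_readme_request(owner, repo, token=""):
    """Build (but do not send) a request for a repository's README."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "readme-sketch",  # hypothetical client name
    }
    if token:
        # A token is optional; it only raises the hourly request limit.
        headers["Authorization"] = f"Bearer {token}"
    url = f"{API_ROOT}/repos/{owner}/{repo}/readme"
    return urllib.request.Request(url, headers=headers)


def decode_readme(payload):
    """Decode the base64 'content' field of a README API response."""
    if payload.get("encoding") != "base64":
        raise ValueError("unexpected README encoding")
    return base64.b64decode(payload["content"]).decode("utf-8")
```

Passing the request to `urllib.request.urlopen` and parsing the JSON body would give the payload that `decode_readme` expects.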

✅ Built for AI & Research

  • README extraction for LLM training
  • Structured JSON output
  • Rich metadata for analysis
  • Topic and language classification
  • Dataset export (CSV, JSON, Excel)

Use Cases

🤖 AI/LLM Training Data

  • Extract README files for AI model training
  • Gather documentation for vector databases
  • Build RAG (Retrieval-Augmented Generation) pipelines
  • Create code-to-text datasets

🔍 Developer Research

  • Discover trending repositories
  • Analyze tech stacks and tools
  • Monitor open-source ecosystem
  • Track language adoption trends

💼 Business Intelligence

  • Competitive analysis
  • Technology trend spotting
  • Developer tool discovery
  • Market research for dev tools

📊 Academic Research

  • Software engineering studies
  • Open-source collaboration analysis
  • Programming language evolution
  • Developer ecosystem research

Input

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 50,
  "includeReadme": true,
  "includeTopics": true,
  "includeLanguages": true,
  "githubToken": "ghp_xxxxxxxxxxxx"
}

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | search | Scraping mode: search (find repos by query) or direct (specific URLs) |
| searchQuery | string | "" | Search query (e.g., language:python stars:>1000). Uses GitHub search syntax. |
| repositoryUrls | string | "" | Repository URLs (one per line). Format: https://github.com/owner/repo |
| sortBy | string | stars | Sort search results by: stars, forks, updated, help-wanted-issues |
| maxResults | integer | 30 | Maximum repositories to fetch (search mode, 1-1000) |
| includeReadme | boolean | true | Extract README content (recommended for AI/LLM) |
| includeTopics | boolean | true | Fetch repository topics/tags |
| includeLanguages | boolean | true | Fetch programming language statistics |
| githubToken | string | "" | Optional GitHub Personal Access Token (5,000 vs 60 requests/hour) |
| debugMode | boolean | false | Enable verbose logging |
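The constraints in the parameter table can be checked before a run is started. The following is a minimal sketch, assuming the documented defaults and ranges; the function name `build_input` is hypothetical, and the rule that each mode needs its own non-empty source field is an assumption rather than documented actor behavior.

```python
ALLOWED_MODES = {"search", "direct"}
ALLOWED_SORTS = {"stars", "forks", "updated", "help-wanted-issues"}


def build_input(mode="search", search_query="", repository_urls="",
                sort_by="stars", max_results=30, github_token=""):
    """Return an actor input dict, raising ValueError on invalid values."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    # Assumption: search mode needs a query, direct mode needs URLs.
    if mode == "search" and not search_query.strip():
        raise ValueError("searchQuery is empty")
    if mode == "direct" and not repository_urls.strip():
        raise ValueError("repositoryUrls is empty")
    return {
        "mode": mode,
        "searchQuery": search_query,
        "repositoryUrls": repository_urls,
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeReadme": True,
        "includeTopics": True,
        "includeLanguages": True,
        "githubToken": github_token,
        "debugMode": False,
    }
```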

Output

Each repository returns comprehensive data:

{
  "name": "tensorflow",
  "full_name": "tensorflow/tensorflow",
  "owner": {
    "login": "tensorflow",
    "type": "Organization",
    "url": "https://github.com/tensorflow"
  },
  "description": "An Open Source Machine Learning Framework for Everyone",
  "url": "https://github.com/tensorflow/tensorflow",
  "homepage": "https://www.tensorflow.org",
  "language": "C++",
  "stars": 185000,
  "forks": 74000,
  "watchers": 185000,
  "open_issues": 1850,
  "size": 285000,
  "topics": ["machine-learning", "deep-learning", "tensorflow", "python"],
  "license": "Apache License 2.0",
  "created_at": "2015-11-07T01:19:20Z",
  "updated_at": "2025-01-12T10:30:45Z",
  "pushed_at": "2025-01-12T09:15:22Z",
  "is_fork": false,
  "is_archived": false,
  "is_private": false,
  "default_branch": "master",
  "readme": {
    "name": "README.md",
    "path": "README.md",
    "content": "# TensorFlow...",
    "size": 12584,
    "html_url": "https://github.com/tensorflow/tensorflow/blob/master/README.md",
    "download_url": "https://raw.githubusercontent.com/tensorflow/tensorflow/master/README.md"
  },
  "languages": {
    "C++": 125847623,
    "Python": 45123456,
    "Java": 12345678
  },
  "scraped_at": "2025-01-12T15:30:00.000Z",
  "index": 1
}
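The `languages` object holds byte counts per language, as returned by GitHub's languages endpoint, so a percentage breakdown has to be derived. A minimal sketch, assuming records shaped like the example above (the function name `language_shares` is hypothetical):

```python
def language_shares(record):
    """Convert the per-language byte counts of a record into percentages."""
    languages = record.get("languages") or {}
    total = sum(languages.values())
    if total == 0:
        # No language data (for example includeLanguages was false).
        return {}
    return {
        name: round(100 * size / total, 1)
        for name, size in languages.items()
    }
```

Records fetched with `includeLanguages` disabled simply yield an empty dict.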

Example Usage

Search for Python Repositories

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 100
}

Search for AI/ML Projects

{
  "mode": "search",
  "searchQuery": "machine learning stars:>5000",
  "sortBy": "updated",
  "maxResults": 50,
  "includeReadme": true
}

Fetch Specific Repositories

{
  "mode": "direct",
  "repositoryUrls": "https://github.com/facebook/react\nhttps://github.com/tensorflow/tensorflow\nhttps://github.com/microsoft/vscode",
  "includeReadme": true,
  "includeTopics": true
}
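Because `repositoryUrls` is a single string with one URL per line, it can help to validate it before starting a run. A minimal sketch of such a check, assuming the https://github.com/owner/repo format from the parameter table (the helper name `parse_repository_urls` is hypothetical and says nothing about how the actor parses its input):

```python
from urllib.parse import urlparse


def parse_repository_urls(text):
    """Split a newline-separated URL string into (owner, repo) pairs."""
    repos = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parsed = urlparse(line)
        parts = [part for part in parsed.path.split("/") if part]
        if parsed.netloc.lower() != "github.com" or len(parts) < 2:
            raise ValueError(f"not a repository URL: {line!r}")
        owner, repo = parts[0], parts[1]
        if repo.endswith(".git"):
            repo = repo[:-4]  # drop a clone-style suffix
        repos.append((owner, repo))
    return repos
```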

With GitHub Token (Higher Rate Limits)

{
  "mode": "search",
  "searchQuery": "language:rust stars:>500",
  "maxResults": 200,
  "githubToken": "ghp_yourtoken",
  "includeReadme": true
}
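Any of the inputs above can also be submitted programmatically. The sketch below assumes the `apify-client` Python package is installed (`pip install apify-client`) and that the actor ID matches the store URL; the wrapper name `run_actor` is hypothetical.

```python
def run_actor(apify_token, run_input,
              actor_id="benthepythondev/github-repository-intelligence"):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so the module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    # Each dataset item is one repository record.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_actor(token, {"mode": "search", "searchQuery": "language:rust stars:>500"})` with a valid Apify API token should return a list of repository records like the one shown in the Output section.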

GitHub Search Query Syntax

Search by Language

language:python
language:javascript
language:rust

Search by Stars/Forks

stars:>1000
stars:1000..5000
forks:>500

Search by Topics

topic:machine-learning
topic:web-development

Search by Organization

org:google
org:microsoft
user:torvalds

Combine Multiple Criteria

language:python stars:>1000 topic:machine-learning
language:go stars:>500 forks:>100

Full Documentation: GitHub Search Syntax
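Qualifiers are plain `name:value` tokens separated by spaces, so a query string can be assembled mechanically. A minimal sketch (the helper `build_query` is hypothetical, not part of the actor):

```python
def build_query(*keywords, **qualifiers):
    """Join free-text keywords and name:value qualifiers with spaces."""
    parts = list(keywords)
    for name, value in qualifiers.items():
        # Allow one value or several values for the same qualifier.
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.extend(f"{name}:{item}" for item in values)
    return " ".join(parts)


query = build_query(language="python", stars=">1000", topic="machine-learning")
print(query)  # language:python stars:>1000 topic:machine-learning
```

The result can be used directly as the `searchQuery` input.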

Rate Limits & Authentication

Without GitHub Token

  • 60 requests per hour
  • Good for: Testing, small batches (<30 repos)
  • Unauthenticated access

With GitHub Token (Recommended)

  • 5,000 requests per hour
  • Good for: Production, large batches (100s of repos)
  • Required for: Frequent usage

Creating a GitHub Token

  1. Go to GitHub Settings → Tokens
  2. Click "Generate new token (classic)"
  3. Select scopes: public_repo (limits the token to public repositories; a token with no scopes also gives read-only access to public data)
  4. Copy the token and pass it in the githubToken parameter

Note: Tokens are optional but highly recommended for production use.
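To judge whether a job fits into one hour of quota, the two limits above can be turned into a rough estimate. This is a sketch only: the number of API requests the actor makes per repository is not documented here, so `requests_per_repo` is an assumption the caller must supply, and the function name `hours_needed` is hypothetical.

```python
import math

LIMIT_WITH_TOKEN = 5000     # requests per hour, authenticated
LIMIT_WITHOUT_TOKEN = 60    # requests per hour, unauthenticated


def hours_needed(repos, requests_per_repo, has_token):
    """Estimate how many hourly rate-limit windows a job would span."""
    limit = LIMIT_WITH_TOKEN if has_token else LIMIT_WITHOUT_TOKEN
    total_requests = repos * requests_per_repo
    return math.ceil(total_requests / limit)
```

For example, under the assumption of two requests per repository, 100 repositories without a token would span four hourly windows, while the same job with a token fits in one.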

Pricing (Pay-Per-Result)

$0.015 per repository ($15 per 1,000 repositories)

Example Cost Calculation:

Fetching 1,000 repositories:

  • Repository metadata: 1,000 × $0.015 = $15.00

💡 No browser costs, no proxy costs - just lightweight API calls!
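The calculation above generalizes to any batch size. A minimal sketch using `Decimal` to avoid floating-point rounding; the function name `estimate_cost` is hypothetical, and the rate is a parameter because the listing header quotes "from $20.00 / 1,000 results" while this section uses $0.015 per repository.

```python
from decimal import Decimal

PRICE_PER_REPOSITORY = Decimal("0.015")  # figure quoted in this section


def estimate_cost(repositories, price=PRICE_PER_REPOSITORY):
    """Return the pay-per-result cost in USD for a number of repositories."""
    if repositories < 0:
        raise ValueError("repositories must be non-negative")
    return price * repositories


print(estimate_cost(1000))  # 15.000
```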

Best Practices

Search Optimization

  1. Use Specific Queries: language:python stars:>1000 works better than a bare python
  2. Filter by Activity: pushed:>2024-01-01 for active projects
  3. Combine Criteria: Use stars, language, topics together
  4. Sort Strategically: stars for popular, updated for active

README Extraction for AI/LLM

  1. Enable README Fetching: Always set includeReadme: true
  2. Filter Quality: Focus on repos with stars:>100
  3. Language Filtering: Target specific tech stacks
  4. Prefer Documentation-Rich Repos: Search for topic:documentation

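Once records are exported, the README text usually needs to be reshaped before it goes into a training set or vector database. The following sketch assumes records shaped like the example in the Output section; the function names, the output field names (`id`, `text`, `metadata`), and the JSONL target format are illustrative choices, not something the actor produces.

```python
import json


def readme_documents(records, min_stars=100):
    """Yield one text document per record with a non-empty README."""
    for record in records:
        readme = record.get("readme") or {}
        content = (readme.get("content") or "").strip()
        # Mirror the stars:>100 quality filter on the client side.
        if not content or record.get("stars", 0) <= min_stars:
            continue
        yield {
            "id": record["full_name"],
            "text": content,
            "metadata": {
                "language": record.get("language"),
                "topics": record.get("topics", []),
                "stars": record.get("stars", 0),
            },
        }


def to_jsonl(documents):
    """Serialize documents as JSON Lines, one document per line."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in documents)
```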
Rate Limit Management

  1. Use Authentication: Get a GitHub token for 5,000 requests/hour
  2. Batch Requests: Plan your searches to minimize API calls
  3. Monitor Limits: Check rate limit in actor logs
  4. Schedule Runs: Spread large jobs across hours
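For anyone calling the GitHub API directly alongside the actor, the remaining quota is reported in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. A minimal sketch of computing how long to pause once the quota is used up (the function name `seconds_until_reset` is hypothetical, and this does not describe the actor's internal handling):

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request is allowed."""
    # HTTP header names are case-insensitive, so normalize the keys.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered["x-ratelimit-reset"])  # UTC epoch seconds
    current = time.time() if now is None else now
    return max(0.0, float(reset_at - current))
```

A caller would sleep for the returned number of seconds before retrying.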

FAQ

Q: Is this legal? A: Yes! Uses GitHub's official REST API with proper permissions.

Q: Do I need a GitHub account? A: No for basic usage (60 requests/hour). Yes for higher limits (5,000 requests/hour with token).

Q: What's the rate limit without a token? A: 60 requests per hour (unauthenticated). 5,000 with a token.

Q: Can I extract private repositories? A: Only public repositories. Private repos require different permissions.

Q: How do I get README content for AI training? A: Set includeReadme: true and use search mode to find relevant repositories.

Q: Can I search by multiple languages? A: Use language:python OR language:javascript in search query.

Q: What happens if the rate limit is exceeded? A: The actor logs a warning. Add a GitHub token to increase limits.

Why Use This Actor?

| Feature | This Actor (GitHub API) | Web Scraping |
| --- | --- | --- |
| Legal | ✅ Official API | ❌ Violates ToS |
| Stable | ✅ API rarely changes | ❌ HTML breaks often |
| Fast | ✅ Direct API calls | ❌ Browser overhead |
| Cost | ✅ $15 per 1k repos | ❌ $30+ per 1k |
| Authentication | ✅ Optional (higher limits) | ❌ Complex login |
| README Access | ✅ Direct API endpoint | ❌ Requires parsing |
| Maintenance | ✅ Minimal | ❌ Constant updates |

Output Use Cases

AI/LLM Training

  • Feed README content into vector databases
  • Build code documentation datasets
  • Create programming Q&A pairs
  • Extract technical writing samples

Developer Tools

  • Tech stack analysis
  • Framework popularity tracking
  • Library comparison
  • Documentation aggregation

Business Intelligence

  • Competitor monitoring
  • Technology trend analysis
  • Open-source landscape mapping
  • Developer ecosystem research

Legal & Ethics

✅ Legal Compliance:

  1. Official API: Uses GitHub REST API v3 with proper authentication
  2. Public Data Only: Accesses only publicly available repositories
  3. Rate Limits: Respects GitHub's rate limiting
  4. Terms of Service: Complies with GitHub's API ToS
  5. No Scraping: No HTML parsing or browser automation

This actor is 100% legal and ethical - uses official GitHub API with proper permissions.

Support

Need help? Have questions?


Built with ❤️ using Apify and GitHub REST API

Perfect for:

  • 🤖 AI/LLM training data collection
  • 📊 Developer research and analytics
  • 💼 Competitive intelligence
  • 🔍 Technology trend analysis
  • 📚 Documentation aggregation
