Pricing
$10.00/month + usage
GitHub Repository Scraper
Scrape and extract GitHub repository data, metadata, statistics, stars, forks, issues, and project information from multiple repositories at once.
Actor stats
- Bookmarked: 1
- Total users: 15
- Monthly active users: 0
- Last modified: 7 months ago
GitHub Repository Scraper - Extract Repository Data at Scale
Overview
The GitHub Repository Scraper is an Apify Actor that extracts data from GitHub repositories in bulk. It suits competitive analysis, market research, developer insights, and building repository databases, and it returns detailed information about each repository, its statistics, and its project metadata.
Bulk URL processing | Comprehensive repository data | Statistics extraction | Metadata analysis | Concurrent processing
Complete Repository Data Extraction
- Basic Information: Repository name, description, owner, creation date
- Statistics: Stars, forks, watchers, usage metrics
- Technical Details: Programming languages, file counts, commit information
- Project Metadata: Topics, license information, default branch
- Enhanced Repository Data: GitHub IDs, clone URLs, file listings, branch info
- Owner Information: Detailed owner profiles with avatars and organization status
- Repository Structure: File counts, directory listings, README information
- Access URLs: Multiple clone formats (HTTPS, SSH, GitHub CLI), download links
Key Features
- Bulk Processing: Process multiple GitHub repository URLs in one run
- Smart URL Parsing: Automatically extracts repository paths from full GitHub URLs
- Proxy Support: Built-in Apify proxy integration for reliable scraping
- Error Handling: Robust error handling with detailed status reporting
- Clean JSON Output: Structured, ready-to-use data format
- Concurrent Processing: Configurable concurrency for optimal performance
- Format Flexibility: Accepts various URL formats and automatically normalizes them
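The URL parsing and normalization described above can be illustrated with a minimal sketch. The Actor's own parsing rules are not documented here, so the regular expression and the helper names `normalize_repo_url` and `filter_repo_urls` are assumptions made for illustration only.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern: optional scheme and "www.", then github.com/<owner>/<repo>,
# with an optional ".git" suffix and optional trailing path, query, or fragment.
_REPO_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/"
    r"([A-Za-z0-9-]+)/([A-Za-z0-9._-]+?)(?:\.git)?(?:[/?#].*)?$",
    re.IGNORECASE,
)


def normalize_repo_url(raw: str) -> Optional[str]:
    """Return 'owner/repo' for a recognizable GitHub repository URL, else None."""
    match = _REPO_RE.match(raw.strip())
    if match is None:
        return None
    owner, repo = match.groups()
    return f"{owner}/{repo}"


def filter_repo_urls(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Split inputs into normalized repository paths and rejected strings."""
    valid: List[str] = []
    rejected: List[str] = []
    for url in urls:
        path = normalize_repo_url(url)
        if path is None:
            rejected.append(url)
        else:
            valid.append(path)
    return valid, rejected
```

Running a check like this locally before a run shows which entries would likely be rejected, although the Actor may accept or reject slightly different forms.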
Input Configuration
Submit an array of GitHub repository URLs via the input schema:
{
  "urls": [
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react",
    "https://github.com/nodejs/node",
    "https://github.com/torvalds/linux"
  ],
  "maxConcurrency": 5,
  "includeNotFound": false,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
Input Parameters
- URLs (required):
  - Array of GitHub repository URLs to scrape
  - Supported formats: https://github.com/owner/repo, github.com/owner/repo
  - Invalid URLs will be automatically filtered out with warnings
- Max Concurrency (optional):
  - Number of concurrent requests for scraping (1-20)
  - Default: 5
  - Higher values = faster processing but may increase chance of rate limiting
- Include Not Found (optional):
  - Whether to include repositories that return 404 (not found) in the results
  - Default: false
  - When enabled, includes error information for non-existent repositories
- Proxy Configuration (recommended):
  - Configure Apify proxy settings to avoid rate limiting
  - Recommended for bulk scraping operations
  - Format: "proxyConfiguration": {"useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"]}
  - Available proxy groups: RESIDENTIAL, DATACENTER, GOOGLE_SERP
  - Use RESIDENTIAL for best reliability when scraping GitHub
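As a rough sketch of how the parameters above fit together, the helper below assembles an input object and applies the documented range and defaults on the client side. The function name `build_actor_input` and its argument names are hypothetical; only the JSON keys come from the input schema shown earlier.

```python
from typing import Any, Dict, List, Optional


def build_actor_input(
    urls: List[str],
    max_concurrency: int = 5,          # documented default
    include_not_found: bool = False,   # documented default
    proxy_groups: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble the Actor input, checking the documented constraints locally."""
    if not urls:
        raise ValueError("urls must contain at least one repository URL")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")

    run_input: Dict[str, Any] = {
        "urls": list(urls),
        "maxConcurrency": max_concurrency,
        "includeNotFound": include_not_found,
    }
    # Omitting proxy_groups leaves proxyConfiguration out entirely (no proxy).
    if proxy_groups:
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": list(proxy_groups),
        }
    return run_input
```

For example, `build_actor_input(["https://github.com/facebook/react"], proxy_groups=["RESIDENTIAL"])` yields an object with the same shape as the sample input.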
Proxy Configuration Examples
For small-scale scraping (< 100 repositories):
"proxyConfiguration":{"useApifyProxy":true,"apifyProxyGroups":["DATACENTER"]}
For large-scale or production scraping (recommended):
"proxyConfiguration":{"useApifyProxy":true,"apifyProxyGroups":["RESIDENTIAL"]}
No proxy (not recommended for bulk operations):
// Omit proxyConfiguration entirely - may result in rate limiting
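The guidance above can be turned into a small selection rule. This is only a sketch: the helper name `choose_proxy_configuration` is hypothetical, and treating exactly 100 or more repositories as "large-scale" is an assumption, since the text only gives "< 100" for the small-scale case.

```python
from typing import Any, Dict


def choose_proxy_configuration(repo_count: int) -> Dict[str, Any]:
    """Pick a proxy group from the number of repositories in a run."""
    # Assumption: fewer than 100 repositories counts as small-scale.
    group = "DATACENTER" if repo_count < 100 else "RESIDENTIAL"
    return {"useApifyProxy": True, "apifyProxyGroups": [group]}
```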
Output Format
Each GitHub repository returns comprehensive structured data including enhanced metadata extracted from GitHub's embedded data:
{
  "url": "https://github.com/microsoft/vscode",
  "repoPath": "microsoft/vscode",
  "success": true,
  "data": {
    "url": "https://github.com/microsoft/vscode",
    "type": "repo",
    "description": "Visual Studio Code",
    "website": "https://code.visualstudio.com",
    "forkedfrom": null,
    "tags": ["editor", "typescript", "electron", "ide"],
    "usedby": 250000,
    "watchers": 3200,
    "stars": 162000,
    "forks": 28500,
    "langs": [
      {"name": "TypeScript", "perc": "93.2%"},
      {"name": "JavaScript", "perc": "4.1%"},
      {"name": "CSS", "perc": "1.5%"}
    ],
    // Enhanced data from GitHub's embedded JSON
    "id": 41881900,
    "name": "vscode",
    "full_name": "microsoft/vscode",
    "owner": "microsoft",
    "default_branch": "main",
    "is_fork": false,
    "is_empty": false,
    "is_private": false,
    "is_org_owned": true,
    "created_at": "2015-09-03T20:23:30.000Z",
    "clone_url": "https://github.com/microsoft/vscode.git",
    "ssh_url": "git@github.com:microsoft/vscode.git",
    "api_url": "https://api.github.com/repos/microsoft/vscode",
    // Owner information
    "owner_info": {
      "login": "microsoft",
      "type": "Organization",
      "url": "https://github.com/microsoft",
      "avatar_url": "https://avatars.githubusercontent.com/u/6154722?v=4"
    },
    // File and repository structure
    "file_count": 15420,
    "files": [
      {"name": "README.md", "path": "README.md", "type": "file"},
      {"name": "package.json", "path": "package.json", "type": "file"},
      {"name": "src", "path": "src", "type": "directory"}
    ],
    // Clone and download URLs
    "clone_urls": {
      "https": "https://github.com/microsoft/vscode.git",
      "ssh": "git@github.com:microsoft/vscode.git",
      "github_cli": "gh repo clone microsoft/vscode"
    },
    "download_url": "/microsoft/vscode/archive/refs/heads/main.zip",
    // Branch and commit information
    "ref_info": {
      "name": "main",
      "type": "branch",
      "current_oid": "585acf48f88e399989d54f001029424b2b7c358a",
      "can_edit": false
    },
    "commit_count": "185,234",
    // README information
    "readme_info": {
      "displayName": "README.md",
      "repoName": "vscode",
      "refName": "main",
      "path": "README.md",
      "loaded": true
    },
    // Metadata
    "enriched_at": "2024-12-29T15:30:45.123Z",
    "data_source": "github_scraper_enhanced"
  }
}
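In the sample item, language shares arrive as strings such as "93.2%" and the commit count as a comma-grouped string such as "185,234", so numeric analysis needs a conversion step. The sketch below shows one way to do that. The helper names are hypothetical, and it assumes real items follow the same shape as the sample.

```python
from typing import Any, Dict, Optional, Union


def parse_percentage(text: str) -> float:
    """Convert a string like '93.2%' to the float 93.2."""
    return float(text.strip().rstrip("%"))


def parse_count(value: Union[int, str]) -> int:
    """Convert a count such as '185,234' (or an int) to an int."""
    if isinstance(value, int):
        return value
    return int(value.replace(",", ""))


def summarize_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one successful dataset item to a few numeric fields."""
    data = item["data"]
    shares = {
        lang["name"]: parse_percentage(lang["perc"])
        for lang in data.get("langs", [])
    }
    top_language: Optional[str] = max(shares, key=shares.get) if shares else None
    commits = data.get("commit_count")
    return {
        "repo": item["repoPath"],
        "stars": data.get("stars"),
        "forks": data.get("forks"),
        "commits": parse_count(commits) if commits is not None else None,
        "top_language": top_language,
    }
```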
Error Handling
Failed repositories return structured error information:
{"url":"https://github.com/invalid/repo","repoPath":"invalid/repo","success":false,"error":"Repository not found or private"}
When includeNotFound is enabled, 404 repositories return structured data:
{"url":"https://github.com/nonexistent/repo","repoPath":"nonexistent/repo","success":true,"data":{"exists":false,"error":"Repository not found","statusCode":404}}
Common Error Cases:
- Repository not found or private: Repository doesn't exist or is private
- Network error: Connection issues or scraping errors
- Invalid URLs are filtered out before processing with warning logs
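Because a 404 item reports "success": true when includeNotFound is enabled, checking `success` alone is not enough to tell real data from a missing repository. The sketch below sorts dataset items into three buckets based on the fields shown in the examples above. The bucket names and function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def classify_item(item: Dict[str, Any]) -> str:
    """Return 'failed', 'not_found', or 'ok' for one dataset item."""
    if not item.get("success"):
        return "failed"
    data = item.get("data") or {}
    # With includeNotFound enabled, missing repositories carry exists=false.
    if data.get("exists") is False:
        return "not_found"
    return "ok"


def partition_items(items: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group dataset items by their classification."""
    buckets: Dict[str, List[Dict[str, Any]]] = {"ok": [], "not_found": [], "failed": []}
    for item in items:
        buckets[classify_item(item)].append(item)
    return buckets
```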
Common Use Cases
Competitive Analysis & Market Research
- Analyze competitor repositories and project activity
- Track technology trends through repository statistics
- Research popular libraries and frameworks in specific domains
- Monitor open source project adoption rates
Developer & Technology Research
- Study programming language usage patterns
- Analyze repository structures and best practices
- Research active open source projects in specific technologies
- Track development activity and contribution patterns
Portfolio & Investment Analysis
- Research technology companies and their open source contributions
- Analyze developer productivity and project health metrics
- Track repository growth and community engagement
- Identify trending projects and technologies
Academic & Educational Research
- Study software development patterns and practices
- Analyze open source community dynamics
- Research programming language evolution
- Track educational resource repositories
Output & Export Options
Dataset Storage
- All extracted data stored in Apify dataset
- Each repository becomes one dataset item
- Status tracking for successful and failed extractions
Export Formats
- JSON: Raw structured data for API integration
- CSV: Spreadsheet-compatible format for analysis
- Excel: Formatted spreadsheet with repository data
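The built-in exports cover the whole item. When only a few columns are needed, a downloaded JSON export can be flattened locally. The following is a minimal sketch using Python's standard `csv` module. The column selection, the semicolon-joined `tags` column, and the function name `items_to_csv` are illustrative choices, not part of the Actor.

```python
import csv
import io
from typing import Any, Dict, Iterable, Sequence


def items_to_csv(
    items: Iterable[Dict[str, Any]],
    columns: Sequence[str] = ("stars", "forks", "watchers"),
) -> str:
    """Write selected fields of successful items to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["repo", *columns, "tags"])
    for item in items:
        data = item.get("data") or {}
        # Skip failed items and 404 placeholders; they carry no statistics.
        if not item.get("success") or data.get("exists") is False:
            continue
        row = [item.get("repoPath", "")]
        row += [data.get(column, "") for column in columns]
        row.append(";".join(data.get("tags") or []))
        writer.writerow(row)
    return buffer.getvalue()
```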
Data Processing
- Clean, validated URLs
- Structured error reporting
- Comprehensive logging for troubleshooting
Quick Start Guide
- Configure Input:
  - Add GitHub repository URLs to the urls array
  - Set desired maxConcurrency (recommended: 5-10)
  - Configure proxyConfiguration with useApifyProxy: true and appropriate proxy groups for reliable scraping
- Run the Actor:
  - Execute through Apify Console or API
  - Monitor progress through real-time logs
  - Review extracted data in the dataset
- Export Results:
  - Download data in your preferred format
  - Integrate with your existing tools and workflows
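For the API route mentioned in the steps above, the sketch below builds a request against the Apify API v2 endpoint that runs an Actor synchronously and returns its dataset items. It uses only the standard library. The Actor ID is not given on this page, so `actor_id` is a placeholder you must supply (in the `username~actor-name` form), and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_BASE = "https://api.apify.com/v2"


def build_run_request(
    actor_id: str, token: str, run_input: Dict[str, Any]
) -> urllib.request.Request:
    """Prepare a POST request that runs the Actor and returns dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(
    actor_id: str, token: str, run_input: Dict[str, Any], timeout: float = 300.0
) -> List[Dict[str, Any]]:
    """Send the request and decode the JSON array of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The synchronous endpoint suits short runs. For large URL lists, starting the run asynchronously and reading the dataset afterwards is likely the safer choice.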
Support & Feedback
For questions, feature requests, or technical support:
- Visit the Apify Community Forum
- Contact us through the Apify platform
- Submit issues for improvements and bug reports
Explore More Actors
Need more scraping solutions? Discover additional Actors on Apify for web automation and data extraction.
For inquiries or custom development, reach out at apify@vulnv.com.
