URL: https://apify.com/vulnv/github-repository-scraper

GitHub Repository Scraper · Apify


Pricing

$10.00/month + usage

GitHub Repository Scraper

Scrape and extract GitHub repository data, metadata, statistics, stars, forks, issues, and project information from multiple repositories at once.

Rating

0.0 (0 reviews)

Developer

πŸ‘ VulnV

VulnV

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 15
  • Monthly active users: 0
  • Last modified: 7 months ago

GitHub Repository Scraper - Extract Repository Data at Scale

Overview

The GitHub Repository Scraper is an Apify Actor that extracts data from many GitHub repositories in a single run. It suits competitive analysis, market research, developer insights, and building repository databases, and it returns repository details, statistics, and project metadata.

✅ Bulk URL processing | ✅ Comprehensive repository data | ✅ Statistics extraction | ✅ Metadata analysis | ✅ Concurrent processing


Complete Repository Data Extraction

  • Basic Information: Repository name, description, owner, creation date
  • Statistics: Stars, forks, watchers, usage metrics
  • Technical Details: Programming languages, file counts, commit information
  • Project Metadata: Topics, license information, default branch
  • Enhanced Repository Data: GitHub IDs, clone URLs, file listings, branch info
  • Owner Information: Detailed owner profiles with avatars and organization status
  • Repository Structure: File counts, directory listings, README information
  • Access URLs: Multiple clone formats (HTTPS, SSH, GitHub CLI), download links

Key Features

  • Bulk Processing: Process multiple GitHub repository URLs in one run
  • Smart URL Parsing: Automatically extracts repository paths from full GitHub URLs
  • Proxy Support: Built-in Apify proxy integration for reliable scraping
  • Error Handling: Robust error handling with detailed status reporting
  • Clean JSON Output: Structured, ready-to-use data format
  • Concurrent Processing: Configurable concurrency for optimal performance
  • Format Flexibility: Accepts various URL formats and automatically normalizes them
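
The URL handling described under Smart URL Parsing and Format Flexibility could look roughly like the sketch below. It is not the Actor's actual code: the regular expression, the function names `extract_repo_path` and `normalize_urls`, and the choice to accept a trailing `.git` or a deeper path are assumptions made for illustration.

```python
import logging
import re
from typing import Iterable, List, Optional

# Assumed pattern: optional scheme and "www.", then owner/repo, optionally
# followed by ".git" or a deeper path, query string, or fragment.
_REPO_URL = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/"
    r"(?P<owner>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9._-]+?)"
    r"(?:\.git)?(?:[/?#].*)?$"
)


def extract_repo_path(url: str) -> Optional[str]:
    """Return "owner/repo" for a GitHub repository URL, or None if invalid."""
    match = _REPO_URL.match(url.strip())
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}"


def normalize_urls(urls: Iterable[str]) -> List[str]:
    """Keep valid repository paths and log a warning for everything else."""
    paths = []
    for url in urls:
        path = extract_repo_path(url)
        if path is None:
            logging.warning("Skipping invalid GitHub URL: %s", url)
        else:
            paths.append(path)
    return paths
```

Both documented formats, `https://github.com/owner/repo` and `github.com/owner/repo`, map to the same `owner/repo` path, which matches the `repoPath` field shown in the output.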

🧾 Input Configuration

Submit an array of GitHub repository URLs via the input schema:

{
  "urls": [
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react",
    "https://github.com/nodejs/node",
    "https://github.com/torvalds/linux"
  ],
  "maxConcurrency": 5,
  "includeNotFound": false,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Input Parameters

  1. URLs (required):

    • Array of GitHub repository URLs to scrape
    • Supported formats: https://github.com/owner/repo, github.com/owner/repo
    • Invalid URLs will be automatically filtered out with warnings
  2. Max Concurrency (optional):

    • Number of concurrent requests for scraping (1-20)
    • Default: 5
    • Higher values speed up processing but increase the chance of rate limiting
  3. Include Not Found (optional):

    • Whether to include repositories that return 404 (not found) in the results
    • Default: false
    • When enabled, includes error information for non-existent repositories
  4. Proxy Configuration (recommended):

    • Configure Apify proxy settings to avoid rate limiting
    • Recommended for bulk scraping operations
    • Format:
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    • Available proxy groups: RESIDENTIAL, DATACENTER, GOOGLE_SERP
    • Use RESIDENTIAL for best reliability when scraping GitHub
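
As a rough illustration of the parameters above, the following sketch assembles an input object and enforces the documented constraints (a required `urls` array, `maxConcurrency` between 1 and 20 with a default of 5, `includeNotFound` defaulting to false). The helper name `build_input` and its Python argument names are hypothetical; only the JSON keys come from the input schema described here.

```python
from typing import Any, Dict, List, Optional


def build_input(
    urls: List[str],
    max_concurrency: int = 5,
    include_not_found: bool = False,
    proxy_groups: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build an Actor input dict, validating the documented ranges."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")

    run_input: Dict[str, Any] = {
        "urls": list(urls),
        "maxConcurrency": max_concurrency,
        "includeNotFound": include_not_found,
    }
    # proxyConfiguration is optional; add it only when groups are given.
    if proxy_groups:
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": list(proxy_groups),
        }
    return run_input
```

Validating locally before starting a run is a design choice of this sketch, so that an out-of-range value fails fast on the caller's side.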

Proxy Configuration Examples

For small-scale scraping (< 100 repositories):

"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["DATACENTER"]
}

For large-scale or production scraping (recommended):

"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}

No proxy (not recommended for bulk operations):

// Omit proxyConfiguration entirely - may result in rate limiting
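
The guidance above can be expressed as a small helper that picks a proxy group from the number of repositories in a run. This is only a sketch: the function name `choose_proxy_configuration` is hypothetical, and the cutoff of 100 simply mirrors the "< 100 repositories" wording above rather than any limit enforced by the Actor.

```python
from typing import Any, Dict


def choose_proxy_configuration(repo_count: int) -> Dict[str, Any]:
    """Pick DATACENTER for small runs and RESIDENTIAL for larger ones."""
    group = "DATACENTER" if repo_count < 100 else "RESIDENTIAL"
    return {"useApifyProxy": True, "apifyProxyGroups": [group]}
```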

📤 Output Format

Each repository produces one structured result, including enhanced metadata extracted from the JSON embedded in GitHub's pages. The // comments in the sample are annotations and are not part of the actual output:

{
  "url": "https://github.com/microsoft/vscode",
  "repoPath": "microsoft/vscode",
  "success": true,
  "data": {
    "url": "https://github.com/microsoft/vscode",
    "type": "repo",
    "description": "Visual Studio Code",
    "website": "https://code.visualstudio.com",
    "forkedfrom": null,
    "tags": ["editor", "typescript", "electron", "ide"],
    "usedby": 250000,
    "watchers": 3200,
    "stars": 162000,
    "forks": 28500,
    "langs": [
      {"name": "TypeScript", "perc": "93.2%"},
      {"name": "JavaScript", "perc": "4.1%"},
      {"name": "CSS", "perc": "1.5%"}
    ],
    // Enhanced data from GitHub's embedded JSON
    "id": 41881900,
    "name": "vscode",
    "full_name": "microsoft/vscode",
    "owner": "microsoft",
    "default_branch": "main",
    "is_fork": false,
    "is_empty": false,
    "is_private": false,
    "is_org_owned": true,
    "created_at": "2015-09-03T20:23:30.000Z",
    "clone_url": "https://github.com/microsoft/vscode.git",
    "ssh_url": "git@github.com:microsoft/vscode.git",
    "api_url": "https://api.github.com/repos/microsoft/vscode",
    // Owner information
    "owner_info": {
      "login": "microsoft",
      "type": "Organization",
      "url": "https://github.com/microsoft",
      "avatar_url": "https://avatars.githubusercontent.com/u/6154722?v=4"
    },
    // File and repository structure
    "file_count": 15420,
    "files": [
      {"name": "README.md", "path": "README.md", "type": "file"},
      {"name": "package.json", "path": "package.json", "type": "file"},
      {"name": "src", "path": "src", "type": "directory"}
    ],
    // Clone and download URLs
    "clone_urls": {
      "https": "https://github.com/microsoft/vscode.git",
      "ssh": "git@github.com:microsoft/vscode.git",
      "github_cli": "gh repo clone microsoft/vscode"
    },
    "download_url": "/microsoft/vscode/archive/refs/heads/main.zip",
    // Branch and commit information
    "ref_info": {
      "name": "main",
      "type": "branch",
      "current_oid": "585acf48f88e399989d54f001029424b2b7c358a",
      "can_edit": false
    },
    "commit_count": "185,234",
    // README information
    "readme_info": {
      "displayName": "README.md",
      "repoName": "vscode",
      "refName": "main",
      "path": "README.md",
      "loaded": true
    },
    // Metadata
    "enriched_at": "2024-12-29T15:30:45.123Z",
    "data_source": "github_scraper_enhanced"
  }
}
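
In the sample above, some values are strings rather than numbers (`commit_count` is "185,234", language shares are "93.2%") and `download_url` is relative to github.com. A consumer might normalize them along these lines. This is a minimal sketch based only on the sample shown; the function name `summarize_repo` and the selection of fields are hypothetical, and real items may omit fields that the sample includes.

```python
from typing import Any, Dict
from urllib.parse import urljoin

GITHUB_BASE = "https://github.com"


def summarize_repo(item: Dict[str, Any]) -> Dict[str, Any]:
    """Convert string-typed fields of one successful result into numbers."""
    data = item["data"]
    return {
        "full_name": data["full_name"],
        "stars": data["stars"],
        # "185,234" -> 185234
        "commit_count": int(data["commit_count"].replace(",", "")),
        # [{"name": "TypeScript", "perc": "93.2%"}] -> {"TypeScript": 93.2}
        "languages": {
            lang["name"]: float(lang["perc"].rstrip("%"))
            for lang in data["langs"]
        },
        # The sample download_url is a path, so resolve it against github.com.
        "download_url": urljoin(GITHUB_BASE, data["download_url"]),
    }
```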

Error Handling

Failed repositories return structured error information:

{
  "url": "https://github.com/invalid/repo",
  "repoPath": "invalid/repo",
  "success": false,
  "error": "Repository not found or private"
}

When includeNotFound is enabled, 404 repositories return structured data:

{
  "url": "https://github.com/nonexistent/repo",
  "repoPath": "nonexistent/repo",
  "success": true,
  "data": {
    "exists": false,
    "error": "Repository not found",
    "statusCode": 404
  }
}

Common Error Cases:

  • Repository not found or private: The repository doesn't exist or is private
  • Network error: Connection issues or scraping errors
  • Invalid URLs are filtered out before processing with warning logs
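
Because a 404 result reports `success: true` when includeNotFound is enabled, checking `success` alone is not enough to tell scraped repositories from missing ones. The sketch below separates the three result shapes shown above. The labels "ok", "not_found", and "failed" and the function names are hypothetical, not part of the Actor's output.

```python
from typing import Any, Dict, Iterable, List


def classify(item: Dict[str, Any]) -> str:
    """Label one dataset item as "ok", "not_found", or "failed"."""
    if not item.get("success"):
        return "failed"
    # With includeNotFound enabled, missing repositories carry exists=false.
    if item.get("data", {}).get("exists") is False:
        return "not_found"
    return "ok"


def group_results(items: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket dataset items by their classification."""
    groups: Dict[str, List[Dict[str, Any]]] = {"ok": [], "not_found": [], "failed": []}
    for item in items:
        groups[classify(item)].append(item)
    return groups
```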

💼 Common Use Cases

Competitive Analysis & Market Research

  • Analyze competitor repositories and project activity
  • Track technology trends through repository statistics
  • Research popular libraries and frameworks in specific domains
  • Monitor open source project adoption rates

Developer & Technology Research

  • Study programming language usage patterns
  • Analyze repository structures and best practices
  • Research active open source projects in specific technologies
  • Track development activity and contribution patterns

Portfolio & Investment Analysis

  • Research technology companies and their open source contributions
  • Analyze developer productivity and project health metrics
  • Track repository growth and community engagement
  • Identify trending projects and technologies

Academic & Educational Research

  • Study software development patterns and practices
  • Analyze open source community dynamics
  • Research programming language evolution
  • Track educational resource repositories

📊 Output & Export Options

Dataset Storage

  • All extracted data is stored in an Apify dataset
  • Each repository becomes one dataset item
  • Status tracking for successful and failed extractions

Export Formats

  • JSON: Raw structured data for API integration
  • CSV: Spreadsheet-compatible format for analysis
  • Excel: Formatted spreadsheet with repository data
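
The platform produces these exports itself. If you already hold the JSON items and want a flat CSV of selected fields, a local conversion could look like the sketch below. The column selection and the function name `items_to_csv` are assumptions; nested fields such as `langs` or `files` are deliberately left out because they do not map to a single cell.

```python
import csv
import io
from typing import Any, Dict, Iterable, Sequence

# Hypothetical column choice, using field names from the sample output.
COLUMNS = ("full_name", "stars", "forks", "watchers", "default_branch")


def items_to_csv(items: Iterable[Dict[str, Any]], columns: Sequence[str] = COLUMNS) -> str:
    """Write the chosen fields of successfully scraped items as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops every data field that is not a column.
    writer = csv.DictWriter(buffer, fieldnames=list(columns), extrasaction="ignore")
    writer.writeheader()
    for item in items:
        data = item.get("data", {})
        # Skip failed items and 404 placeholders (exists=false).
        if item.get("success") and data.get("exists", True):
            writer.writerow(data)
    return buffer.getvalue()
```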

Data Processing

  • Clean, validated URLs
  • Structured error reporting
  • Comprehensive logging for troubleshooting

⚡ Quick Start Guide

  1. Configure Input:

    • Add GitHub repository URLs to the urls array
    • Set desired maxConcurrency (recommended: 5-10)
    • Configure proxyConfiguration with useApifyProxy: true and appropriate proxy groups for reliable scraping
  2. Run the Actor:

    • Execute through Apify Console or API
    • Monitor progress through real-time logs
    • Review extracted data in the dataset
  3. Export Results:

    • Download data in your preferred format
    • Integrate with your existing tools and workflows
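
For the API route mentioned in step 2, the steps above can be sketched in Python. The Actor ID is taken from this page's store URL. The `client` argument is assumed to behave like `ApifyClient` from the `apify-client` package, where `actor(...).call(run_input=...)` waits for the run and returns a dict containing `defaultDatasetId`; check that behaviour against the client version you install. The function name `run_scraper` is hypothetical.

```python
from typing import Any, Dict, List

# Actor ID taken from the store URL of this page.
ACTOR_ID = "vulnv/github-repository-scraper"


def run_scraper(client: Any, urls: List[str], max_concurrency: int = 5) -> List[Dict[str, Any]]:
    """Run the Actor with the given URLs and return all dataset items."""
    run_input = {
        "urls": urls,
        "maxConcurrency": max_concurrency,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }
    # Start the Actor and block until the run finishes.
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumption: the returned run is a dict exposing "defaultDatasetId".
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())
```

With `apify-client` installed, `client` would typically be created as `ApifyClient("<YOUR_API_TOKEN>")`, where the token placeholder stands for your own Apify API token. Passing the client in as an argument keeps the function testable without network access.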

🆘 Support & Feedback

For questions, feature requests, or technical support:

  • Visit the Apify Community Forum
  • Contact us through the Apify platform
  • Submit issues for improvements and bug reports

🌟 Explore More Actors

✨ Need more scraping solutions? Explore additional Actors on Apify for web automation and data extraction.

📧 For inquiries or custom development, reach out at apify@vulnv.com.
