URL: https://apify.com/himanshi1rana/github-docs-intelligence

GitHub Documentation Extractor - Extract Docs from Any Repo · Apify


GitHub Documentation Extractor (Agentic)

Under maintenance

Pricing

Pay per usage

An agentic AI actor that automatically extracts and analyzes documentation from GitHub repositories to help developers understand projects faster.

Rating

0.0

(0)

Developer

Himanshi Rana

Maintained by Community

Actor stats

0

Bookmarked

3

Total users

1

Monthly active users

6 months ago

Last modified

🤖 GitHub Documentation Intelligence

AI-powered documentation extraction and analysis for GitHub repositories

Extract, structure, and analyze documentation from any GitHub repository in seconds. Perfect for building RAG systems, onboarding developers, and auditing documentation quality.


🎯 What It Does

Automatically extracts and structures:

  • ✅ README files - Main project documentation
  • ✅ Documentation folders - All markdown files from docs/, documentation/, etc.
  • ✅ Code documentation - Docstrings from Python, JavaScript, TypeScript files
  • ✅ Metadata - Repository info, stars, language, topics
  • ✅ Statistics - Word counts, file counts, documentation coverage

🚀 Quick Start

Input Example

{
  "url": "https://github.com/pallets/flask",
  "maxFiles": 20,
  "extractCodeDocs": true
}
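One way to start a run with this input from Python is the official `apify-client` package. The sketch below is illustrative only: it assumes the Actor ID `himanshi1rana/github-docs-intelligence` (taken from the store URL), an Apify API token in the `APIFY_TOKEN` environment variable, and that results are written to the run's default dataset. `build_run_input` and `run_extractor` are hypothetical helper names, not part of the Actor.

```python
import os
from typing import Any


def build_run_input(url: str, max_files: int = 100, extract_code_docs: bool = True) -> dict[str, Any]:
    """Build the Actor input shown above, with light validation."""
    if not url.startswith("https://github.com/"):
        raise ValueError("url must point to a GitHub repository")
    if max_files < 1:
        raise ValueError("maxFiles must be a positive integer")
    return {"url": url, "maxFiles": max_files, "extractCodeDocs": extract_code_docs}


def run_extractor(run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run and return its dataset items (needs `pip install apify-client`)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # Actor ID assumed from the store URL; adjust if it differs.
    run = client.actor("himanshi1rana/github-docs-intelligence").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    print(build_run_input("https://github.com/pallets/flask", max_files=20))
```

Keeping input construction separate from the network call makes the validation easy to test without an Apify account.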

Output Example

{
  "status": "success",
  "metadata": {
    "name": "flask",
    "description": "The Python micro framework",
    "language": "Python",
    "stars": 65000,
    "url": "https://github.com/pallets/flask"
  },
  "readme": {
    "filename": "README.md",
    "content": "...",
    "sections": [...],
    "word_count": 450
  },
  "documentation_files": [...],
  "code_documentation": [...],
  "combined_markdown": "...",
  "statistics": {
    "has_readme": true,
    "documentation_files_count": 23,
    "code_files_with_docs": 15,
    "total_words": 12500,
    "total_docstrings": 87
  }
}
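As a rough illustration of consuming this output, the sketch below condenses one result into a single line. It relies only on the `metadata` and `statistics` keys shown in the example; `summarize` is a hypothetical helper name, and the sample dictionary is trimmed to the fields it reads.

```python
from typing import Any


def summarize(result: dict[str, Any]) -> str:
    """Return a one-line summary of an extraction result."""
    meta = result.get("metadata", {})
    stats = result.get("statistics", {})
    return (
        f"{meta.get('name', '?')} ({meta.get('language', '?')}): "
        f"{stats.get('documentation_files_count', 0)} doc files, "
        f"{stats.get('total_docstrings', 0)} docstrings, "
        f"{stats.get('total_words', 0)} words"
    )


# Sample values copied from the output example.
sample = {
    "metadata": {"name": "flask", "language": "Python"},
    "statistics": {
        "documentation_files_count": 23,
        "total_docstrings": 87,
        "total_words": 12500,
    },
}

print(summarize(sample))  # flask (Python): 23 doc files, 87 docstrings, 12500 words
```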

⭐ Key Features

📊 Comprehensive Extraction

  • Extracts README, docs folders, and code docstrings
  • Supports Python, JavaScript, TypeScript
  • Handles nested documentation structures
  • Preserves markdown formatting and sections

🎯 Structured Output

  • Clean JSON format ready for processing
  • Pydantic models for type safety
  • Combined markdown for easy reading
  • Detailed statistics and metadata

🛡️ Robust & Reliable

  • Proper error handling
  • Rate limit management
  • Partial success handling
  • Detailed logging

⚡ Fast & Efficient

  • Async operations
  • Smart file filtering
  • Configurable limits
  • Optimized API usage

💡 Use Cases

🤖 RAG Systems

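Extract clean documentation to feed an embedding and retrieval pipeline:

# `result` is one item of the Actor output; `create_embeddings` is a
# placeholder for your own chunking and embedding function.
docs = result['combined_markdown']
chunks = create_embeddings(docs)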
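Before embedding, the combined markdown usually has to be split into smaller pieces. The following is a minimal sketch of one common approach, splitting at markdown headings and capping chunk length; it is not the Actor's own logic, and `chunk_markdown` and the `max_chars` default are assumptions chosen for illustration.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at headings, then cap each chunk at max_chars characters."""
    # Split before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        text = section.strip()
        # Oversized sections are cut into fixed-size windows; empty ones are dropped.
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    for chunk in chunk_markdown("# Install\npip install flask\n## Usage\nRun the app."):
        print(repr(chunk))
```

Splitting at headings keeps each chunk topically coherent, which tends to help retrieval; the fixed-size fallback only applies to sections longer than `max_chars`.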

👨‍💻 Developer Onboarding

Generate comprehensive repo overviews:

  • Understand project structure
  • Find key documentation
  • Identify important files

📈 Documentation Audits

Analyze documentation quality:

  • Check completeness
  • Identify gaps
  • Track improvements
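A simple audit along these lines can be sketched from the `statistics` object alone. The checks and the `min_words` threshold below are illustrative assumptions, not rules the Actor applies, and `audit` is a hypothetical helper name.

```python
from typing import Any


def audit(stats: dict[str, Any], min_words: int = 500) -> list[str]:
    """Return a list of documentation gaps found in a `statistics` object."""
    findings: list[str] = []
    if not stats.get("has_readme", False):
        findings.append("README is missing")
    if stats.get("documentation_files_count", 0) == 0:
        findings.append("no documentation files found")
    if stats.get("total_words", 0) < min_words:
        findings.append(f"fewer than {min_words} words of documentation")
    if stats.get("code_files_with_docs", 0) == 0:
        findings.append("no code files with docstrings")
    return findings


if __name__ == "__main__":
    print(audit({"has_readme": True, "documentation_files_count": 0, "total_words": 120}))
```

Running the same checks on successive extractions gives a basic way to track improvements over time.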

🔍 Code Search

Enable semantic search over codebases:

  • Search through docstrings
  • Find relevant code examples
  • Understand APIs
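As a starting point, docstring search can be prototyped with plain keyword matching before moving to embeddings. This sketch is a stand-in for semantic search, not an implementation of it, and the record shape (a `docstring` field per entry) is an assumption, since the exact structure of `code_documentation` is not shown in this README.

```python
from typing import Any


def search_docstrings(records: list[dict[str, Any]], query: str) -> list[dict[str, Any]]:
    """Rank records by how often the query terms occur in their docstring."""
    terms = query.lower().split()
    hits: list[tuple[int, dict[str, Any]]] = []
    for record in records:
        # "docstring" is an assumed field name for illustration.
        text = str(record.get("docstring", "")).lower()
        score = sum(text.count(term) for term in terms)
        if score:
            hits.append((score, record))
    # Highest score first; ties keep their original order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]


if __name__ == "__main__":
    docs = [
        {"name": "route", "docstring": "Register a URL route."},
        {"name": "run", "docstring": "Run the development server."},
    ]
    print(search_docstrings(docs, "route"))
```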

🔧 Configuration

GitHub Token (Recommended)

For private repos and higher rate limits (5,000 vs 60 requests/hour):

  1. Go to https://github.com/settings/tokens
  2. Generate new token (classic)
  3. Select scopes: repo or public_repo
  4. Add to input: "githubToken": "ghp_your_token"
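To avoid hard-coding the token, step 4 can be done programmatically. The sketch below assumes the token is stored in a `GITHUB_TOKEN` environment variable (the variable name is an assumption) and adds it as the top-level `githubToken` key described above; `add_github_token` and `mask` are hypothetical helpers.

```python
import os
from typing import Any, Mapping, Optional


def add_github_token(
    run_input: Mapping[str, Any], env: Optional[Mapping[str, str]] = None
) -> dict[str, Any]:
    """Return a copy of run_input with githubToken set when a token is available."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    merged = dict(run_input)
    if token:
        merged["githubToken"] = token
    return merged


def mask(token: str) -> str:
    """Hide all but the first four characters so the token can be logged safely."""
    return token[:4] + "*" * max(len(token) - 4, 0)


if __name__ == "__main__":
    run_input = add_github_token({"url": "https://github.com/pallets/flask"})
    print({k: mask(v) if k == "githubToken" else v for k, v in run_input.items()})
```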

Options

Option           Type     Default  Description
maxFiles         integer  100      Maximum files to process
extractCodeDocs  boolean  true     Extract code docstrings

📊 Statistics Provided

  • has_readme: Whether README exists
  • documentation_files_count: Number of doc files found
  • code_files_with_docs: Number of code files with docstrings
  • total_words: Total documentation words
  • total_lines: Total documentation lines
  • total_docstrings: Total docstrings extracted

🛠️ Development

Local Testing

# Install dependencies
pip install -r requirements.txt
# Run locally
apify run

Project Structure

.
├── src/
│   ├── main.py            # Actor entry point
│   ├── extractor.py       # Extraction logic
│   ├── models.py          # Data models
│   └── utils.py           # Helper functions
├── .actor/
│   ├── actor.json         # Actor configuration
│   └── input_schema.json  # Input schema
├── requirements.txt       # Dependencies
└── Dockerfile             # Container config

🤝 Contributing

Issues and pull requests welcome! This is an active project participating in the Apify $1M Challenge.


📝 License

Apache 2.0


💬 Support

  • Questions? Join Apify Discord
  • Issues? Open a GitHub issue
  • Need help? Check Apify documentation

🎯 Coming Soon

  • 🔜 Documentation quality scoring (A-F grades)
  • 🔜 MCP server for AI agents
  • 🔜 Change detection and tracking
  • 🔜 Multi-repo comparison
  • 🔜 PDF documentation support
  • 🔜 Website documentation scraping

FAQs

Q: Why did extraction fail?
A: Common reasons:

  1. Repository doesn't exist (check URL)
  2. Repository is private (add GitHub token)
  3. Rate limit exceeded (add token for 5,000 requests/hour)
  4. Repository is too large (reduce maxFiles)

Q: What if I hit rate limits?
A:

  • Without token: 60 requests/hour
  • With token: 5,000 requests/hour
  • Get token: https://github.com/settings/tokens

Q: Can I extract from private repos?
A: Yes! Add your GitHub token in the input:

{
  "source": {
    "url": "...",
    "githubToken": "ghp_your_token"
  }
}

Q: What's the maximum repository size?
A:

  1. Max 500 files per run
  2. Max 5MB per file
  3. Max 50MB total data
  4. Adjust maxFiles if needed

Q: Why are some files skipped?
A: Files are skipped if they:

  1. Are too large (>5MB)
  2. Can't be decoded (binary files)
  3. Cause encoding errors

Q: How long does extraction take?
A:

  1. Small repos (<100 files): 2-5 seconds
  2. Medium repos (100-500 files): 10-30 seconds
  3. Large repos (500+ files): 30-60 seconds
  4. Max timeout: 4 minutes
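The file-skipping rules from the FAQ can be approximated as follows. This is a sketch under stated assumptions rather than the Actor's code: it treats 5MB as 5 × 1024 × 1024 bytes, uses a NUL-byte check as a binary-file heuristic, and assumes UTF-8 decoding. `skip_reason` and `MAX_FILE_BYTES` are hypothetical names.

```python
from typing import Optional

# The 5MB per-file limit from the FAQ, assumed to mean mebibytes.
MAX_FILE_BYTES = 5 * 1024 * 1024


def skip_reason(data: bytes) -> Optional[str]:
    """Return why a file would be skipped, or None if it can be processed."""
    if len(data) > MAX_FILE_BYTES:
        return "too large"
    # Heuristic: text files do not normally contain NUL bytes.
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "encoding error"
    return None


if __name__ == "__main__":
    for payload in (b"# Title", b"\x00\x01\x02", b"\xff\xfe"):
        print(payload, "->", skip_reason(payload))
```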


Built with ❤️ for the Apify $1M Challenge

⭐ If you find this useful, please star the Actor!
