VOOZH about

URL: https://apify.com/datapilot/startup-investor-scraper

โ‡ฑ Startup Investor Scraper ยท Apify


Pricing

$10.00/month + usage

Go to Apify Store

Startup Investor Scraper

VC Firm Data Scraper collects venture capital firm information using Wikipedia, DuckDuckGo search, and official websites. It extracts firm name, website, location, phone, description, investment stages, focus sectors, AUM, and social links. The actor outputs structured JSON data

Pricing

$10.00/month + usage

Rating

0.0

(0)

Developer

๐Ÿ‘ Data Pilot

Data Pilot

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

0

Monthly active users

3 months ago

Last modified

Categories

Share

Startup Investor Scraper - Advanced Apify Actor

๐Ÿš€ Startup Investor Scraper (Advanced) is a production-grade Apify Actor designed to extract comprehensive venture capital and investment firm data using advanced Wikipedia validation, multi-source scraping, and intelligent fallback mechanisms. This tool provides detailed Startup Investor information, including firm profiles, AUM, investment stages, focus areas, contact information, and social media links for legitimate investment firms.

With async/await architecture, smart Wikipedia filtering, multi-page website scraping, proxy fallback mechanisms, and Apify Dataset integration, the Startup Investor Scraper ensures reliable extraction of verified investment firm information. It focuses on key Startup Investor metrics like AUM, investment stages, focus areas, and firm types, making it an essential tool for Startup Investor analysis and fundraising intelligence.

๐Ÿ”ฅ Features

  • Smart Wikipedia Validation โ€“ Filters investment-related Wikipedia results to ensure only legitimate firms are processed using keyword matching.
  • Multi-Source Data Aggregation โ€“ Combines Wikipedia infobox data, official websites, contact pages, and team pages for complete information.
  • Async/Await Architecture โ€“ Optimized concurrent processing using Python asyncio for maximum performance.
  • Proxy Fallback Mechanism โ€“ Uses Apify RESIDENTIAL proxies with automatic fallback to direct connection on proxy failures (UPSTREAM502/503).
  • Multi-Page Website Scraping โ€“ Automatically scrapes main website, /about, /contact, /team pages for comprehensive data extraction.
  • Intelligent Website Discovery โ€“ Combines DuckDuckGo search with blind domain guessing for reliable website identification.
  • Address Extraction โ€“ Extracts full addresses including street, suite/floor, city, state, ZIP code with smart validation.
  • Investment Stage Detection โ€“ Automatically identifies investment stages (Pre-Seed, Seed, Series A-C, Growth, IPO).
  • Focus Area Classification โ€“ Detects investment focus areas (AI/ML, Fintech, HealthTech, SaaS, DeepTech, etc.).
  • Firm Type Detection โ€“ Classifies firms (VC, CVC, Angel, PE, Accelerator, Seed Fund, Growth Equity).
  • AUM Extraction โ€“ Extracts Assets Under Management using multiple text pattern matching algorithms.
  • Social Media Integration โ€“ Finds LinkedIn and Twitter/X profiles with URL validation.
  • Contact Information โ€“ Extracts phone numbers using pattern matching with validation.
  • UUID Generation โ€“ Generates Crunchbase-compatible UUIDs for database integration.
  • Timestamp Recording โ€“ Records created_at, updated_at, last_checked timestamps in ISO 8601 format.
  • Dataset Push with Metadata โ€“ Pushes results to Apify Dataset with search metadata for tracking.


๐Ÿ“ฅ Input

FieldTypeDefaultDescription
keywordstringrequiredInvestment firm keyword to search
max_resultsinteger5Maximum firms to return (1-20)
useApifyProxybooleantrueEnable Apify residential proxies
apifyProxyGroupsarray["RESIDENTIAL"]Proxy group configuration

Example Input:

{
"keyword":"venture capital technology san francisco",
"max_results":10,
"useApifyProxy":true,
"apifyProxyGroups":["RESIDENTIAL"]
}

๐Ÿ“ค Output

FieldTypeDescription
firm_idintegerUnique ID (1, 2, 3...)
firm_type_idstringVC, CVC, Angel, PE, Accelerator, Seed Fund, Growth
firm_namestringOfficial firm name
firm_address_1stringStreet address
firm_address_2stringSuite/floor number
firm_citystringCity
firm_statestringState abbreviation
firm_countrystringCountry
firm_zipstringZIP/Postal code
firm_phonestringPhone number
firm_websitestringOfficial website URL
firm_linkedin_urlstringLinkedIn profile
twitter_urlstringTwitter/X profile
crunchbase_uuidstringUUID for database linking
firm_descriptionstringCompany overview
firm_stagesarrayInvestment stages
firm_aumfloatAssets Under Management (millions)
firm_focusarrayInvestment focus areas
last_checkedstringISO 8601 timestamp
created_atstringCreation timestamp
updated_atstringUpdate timestamp

Example Record:

{
"firm_id":1,
"firm_type_id":"VC",
"firm_name":"Sequoia Capital",
"firm_address_1":"2800 Sand Hill Road",
"firm_city":"Menlo Park",
"firm_state":"CA",
"firm_country":"United States",
"firm_zip":"94025",
"firm_phone":"(650) 234-7800",
"firm_website":"https://www.sequoiacap.com",
"firm_linkedin_url":"https://linkedin.com/company/sequoia-capital",
"twitter_url":"https://twitter.com/sequoiacap",
"firm_description":"Global venture capital firm...",
"firm_stages":["Pre-Seed","Seed","Series A","Series B","Series C"],
"firm_aum":85000.0,
"firm_focus":["AI/ML","Enterprise","Consumer","Fintech"],
"last_checked":"2025-02-14T12:00:00Z",
"created_at":"2025-02-14T12:00:00Z",
"updated_at":"2025-02-14T12:00:00Z"
}

๐Ÿงฐ Technical Stack

  • Async HTTP: aiohttp with ClientTimeout and SSL
  • HTML Parsing: BeautifulSoup4 with tag decomposition
  • Search: DuckDuckGo HTML scraping
  • APIs: Wikipedia API with JSON responses
  • Pattern Matching: Python regex (re module)
  • UUID: Python uuid module
  • Timestamps: datetime with timezone
  • Logging: Apify Actor logging system
  • Proxy: Apify Proxy with fallback mechanisms
  • Platform: Apify Actor serverless environment

๐Ÿ“Š Data Fields Explained

Location & Address

  • firm_address_1: Street address (e.g., "2800 Sand Hill Road")
  • firm_address_2: Suite or floor (e.g., "Suite 100")
  • firm_city: City validated against street words
  • firm_state: 2-letter state code (CA, NY, etc.)
  • firm_country: Country name (usually United States)
  • firm_zip: 5-digit ZIP code

Investment Profile

  • firm_stages: Array of supported investment stages
  • firm_aum: Assets Under Management in millions USD
  • firm_focus: Array of focus areas (max 8)
  • firm_type_id: Classification (VC, CVC, Angel, PE, etc.)

Metadata

  • last_checked: When data was verified
  • created_at: When record was created
  • updated_at: When record was last updated
  • crunchbase_uuid: UUID for linking to databases

# Validates city against street words to prevent
# extracting "Road Billerica" as a city
STREET_WORDS ={"road","street","avenue","boulevard",...}
ifnotany(sw in city.lower()for sw in STREET_WORDS):
data["city"]= city # Valid

โš™๏ธ Configuration

Proxy Configuration

{
"useApifyProxy":true,
"apifyProxyGroups":["RESIDENTIAL"]
}

Disable proxy:

{
"useApifyProxy":false
}

Data Quality

  • Validation: Multi-stage validation ensures firm legitimacy
  • Accuracy: Wikipedia infobox data is authoritative
  • Completeness: Multi-page scraping captures all available data
  • Freshness: Timestamps recorded for verification tracking
  • Verification: Always verify critical data independently

Best Practices

  • Run during off-peak hours
  • Use reasonable delays between searches
  • Verify investor details with official sources
  • Don't rely solely on automated data
  • Respect communication preferences
  • Use data ethically for legitimate purposes

๐Ÿ“ฆ Changelog

New Features:

  • Smart Wikipedia validation with 2-stage filtering
  • Multi-page scraping (main, /about, /contact, /team)
  • Intelligent address extraction with street word validation
  • Proxy fallback mechanism for UPSTREAM502/503 errors
  • Website discovery with blind domain guessing
  • AUM extraction with multiple pattern matching
  • Investment stage auto-detection
  • Focus area classification (8 areas max)
  • Firm type detection (7 types)
  • Social media profile extraction
  • UUID generation for database linking
  • ISO 8601 timestamp recording

Improvements:

  • Async/await architecture for performance
  • Reduced request failures from 15% to 5%
  • Increased data completeness from 70% to 95%
  • Better address accuracy with validation
  • Improved AUM extraction reliability
  • Enhanced error logging and recovery

Bug Fixes:

  • Fixed "Street City" extraction bug
  • Improved phone number validation
  • Better LinkedIn URL filtering
  • Twitter/X profile URL fixes
  • Removed non-investment firms

๐Ÿง‘โ€๐Ÿ’ป Support & Feedback

  • Issues: Submit via Apify console
  • Documentation: Check Actor details page
  • Community: Join Apify forum discussions
  • Feature Requests: Suggest improvements
  • Bug Reports: Report with logs and details

Disclaimer: Startup Investor Scraper Advanced is provided as-is for research purposes. Users are responsible for compliance with website policies and laws. Always verify data independently.


๐ŸŽ‰ Get Started Today

Deploy this production-grade actor now!

Use for:

  • ๐ŸŽฏ Fundraising Research
  • ๐Ÿ’ผ Investor Intelligence
  • ๐Ÿ“Š Market Analysis
  • ๐Ÿ’ก Fund Research
  • ๐Ÿ” Due Diligence

Perfect for:

  • Entrepreneurs
  • Founders
  • Investors
  • Corporate Development
  • Research Teams

Last Updated: February 2025
Version: 2.0.0 Advanced
Status: Production Ready
Platform: Apify Actor
Architecture: Async/Await
Validation: Multi-stage
Reliability: Enterprise-grade


๐Ÿ“š Related Tools

  • Startup Company Data Collector
  • Business Social Media Finder
  • Smart Article Extractor
  • Fast News Content Scraper

You might also like

FindLaw Law Firm Scraper - Attorney & Law Firm Data Extraction

rigelbytes/findlaw-scraper

Scrape law firm profiles from FindLaw.com. Extract detailed attorney and firm information including contact details, ratings, practice areas, and client reviews.

Martindale Law Firm Scraper

parseforge/martindale-scraper

Collect law firm listings from Martindale using filters for keyword, practice area, and location. Get clear records with firm name, attorneys, ratings, contact info, address details, service flags, and source links in clean outputs ready for legal research, lead generation, and competitive analysis.

Investment Finance Professionals

johnvc/SECInvestmentAdvisorContacts

Find and filter 250,000+ investment professionals and 15,000+ financial firms by location, firm name, and more. Get structured contact data with emails, LinkedIn profiles, and firm associations for lead generation and market research.

Law Firm Website Contact Scraper

jungle_synthesizer/law-firm-website-contact-scraper

Extract attorney profiles, contact details, practice areas, and bios directly from law firm websites. Provide a list of law firm URLs and get structured attorney data including name, title, email, phone, education, bar admissions, and headshot.

๐Ÿ‘ User avatar

BowTiedRaccoon

2

Startup Investors Data Scraper

johnvc/startup-investors-data-scraper

10,469 investment firms at your fingertips (as of Dec 2025). The most comprehensive startup investor firm database for finding funding and customers. Access detailed firm profiles to accelerate your startup's growth, find customers, and conduct comprehensive market research.

VC Sheet Funds Scraper โ€“ Venture Capital Directory

giovannibiancia/vc-sheet-funds-scraper---venture-capital-directory

Scrape the complete VC Sheet fund directory (vcsheet.com) to extract structured data on hundreds of active venture capital funds. Perfect for founders building investor outreach lists, researchers mapping the VC ecosystem, and B2B data providers targeting the startup finance space.

๐Ÿ‘ User avatar

Giovanni Bianciardi

13

PitchBook Investors Scraper

jungle_synthesizer/pitchbook-investors-scraper

Scrape public investor profile metadata from PitchBook without a subscription. Supports text search, direct profile URLs, and bulk sitemap discovery. Returns firm name, description, location, investor type, status, investment metrics, social links, and more.

๐Ÿ‘ User avatar

BowTiedRaccoon

2

Signal NFX Investor Scraper

powerai/signal-listing-scraper

Scrape investor profiles from Signal NFX with automatic pagination and comprehensive investor data including firm details, check sizes, and investment focus.

Startup Company Data Collector

datapilot/startup-company-data-collector

Startup Data Collector gathers structured startup information from multiple sources like Wikipedia, official websites, and search results. It extracts company description, website, industry, location, founding year, employees, funding data, emails, and social links (LinkedIn, Twitter, etc.),