URL: https://apify.com/jungle_synthesizer/openalex-works-crawler

OpenAlex Scraper - Scholarly Works, Authors & Citations Graph

Pricing

Pay per event

Scrape OpenAlex, the open scholarly graph with 250M+ works, 100M+ authors, and 120K+ institutions. Extract titles, abstracts, authors, ORCIDs, institutions, concepts, citations, open-access flags, and grants.

Rating: 0.0 (0)

Developer: BowTiedRaccoon (maintained by Community)

Actor stats: 0 bookmarked, 3 total users, 1 monthly active user, last modified 11 days ago

OpenAlex API Scraper: Scholarly Works, Authors, Citations & DOI Crawler

Extract structured scholarly data from the OpenAlex API (api.openalex.org), the open successor to Microsoft Academic Graph and a free alternative to Scopus and Web of Science. This OpenAlex scraper pulls research papers, author profiles, and citation data from an index of 309M+ works, 100M+ authors, 120K+ institutions, and 65K+ concepts: titles, DOIs, authors with ORCIDs, citation counts, open-access status, grants, and full abstracts reconstructed from the inverted index most scrapers quietly skip.

OpenAlex Crawler Features

  • Queries six entity types in one actor: Works, Authors, Institutions, Concepts, Sources (journals), and Publishers
  • Filters works by publication year range, work type, concept ID, institution country, venue ISSN, and open-access status; combine any of them
  • Reconstructs full-text abstracts from OpenAlex's inverted word-position index, a step most competing actors leave out
  • Fetches 200 records per API call (the OpenAlex maximum) with cursor pagination, so pulling 10,000 records takes 50 requests rather than the 400 needed at the default page size of 25
  • Extracts 30+ fields per work including authors with ORCIDs, all contributing institutions, top concepts with scores, grants, referenced works, and SDG tags
  • Qualifies for the OpenAlex polite pool via the politeEmail input, which brings higher rate limits and better latency
  • Pure JSON API, no HTML parsing, no proxy, no authentication required

Who Uses This OpenAlex Research Paper & Citation Data?

  • AI/ML teams training research assistants: bulk-fetch works with abstracts to build citation recommenders, literature review agents, and domain-specific retrieval corpora
  • Bibliometric analysts: map citation networks by concept, institution, or country and track publication output over time
  • Research intelligence products: feed competitive analysis dashboards that track which labs publish what and who funds it
  • Academic librarians: build institutional publication lists filtered by open-access status and funding source
  • Science policy researchers: measure research alignment with the UN Sustainable Development Goals across countries and years

How OpenAlex Crawler Works

  1. Pick an entity type. Works is the default, but Authors, Institutions, Concepts, Sources, and Publishers all work the same way.
  2. Add filters. For Works that means year range, work type, concept, country, venue ISSN, and the open-access flag. Other entities accept the filters that apply to them.
  3. The crawler calls the OpenAlex API with cursor pagination, pulling up to 200 records per request until it reaches your maxItems limit or exhausts the query.
  4. Each record is flattened into a consistent schema: arrays stay as arrays of primitives, concepts and grants are formatted into readable strings, and abstracts are rebuilt from the inverted index when you want them.
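
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the actor's own code: it assumes the public OpenAlex query parameters `cursor` (starting at `*`) and `per-page`, and a `meta.next_cursor` field in each response. The names `iter_records` and `fetch_json` are hypothetical, and the `fetch` argument exists only so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_BASE = "https://api.openalex.org"
PER_PAGE = 200  # page size quoted in the listing


def fetch_json(url: str) -> dict:
    """Fetch one page from the API and decode it as JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_records(
    entity: str,
    params: dict,
    max_items: int,
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[dict]:
    """Yield records page by page until max_items is reached or the query is exhausted.

    max_items == 0 is treated as "no limit", mirroring the maxItems input.
    """
    cursor = "*"  # assumed OpenAlex convention for the first page
    yielded = 0
    while cursor:
        query = urllib.parse.urlencode(
            {**params, "per-page": PER_PAGE, "cursor": cursor}
        )
        page = fetch(f"{API_BASE}/{entity}?{query}")
        results = page.get("results", [])
        if not results:
            return
        for record in results:
            yield record
            yielded += 1
            if max_items and yielded >= max_items:
                return
        # A missing or null next_cursor ends the loop.
        cursor = page.get("meta", {}).get("next_cursor")
```

Because the function is a generator, a caller can stop early without fetching further pages, which is one way a maxItems cap keeps request counts low.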

Input

Basic: recent machine learning papers

{
  "entityType": "works",
  "query": "machine learning",
  "yearFrom": 2023,
  "yearTo": 2024,
  "maxItems": 500
}

Open-access climate science, concept-filtered

{
  "entityType": "works",
  "concept": "C132651083",
  "openAccessOnly": true,
  "institutionCountry": "US",
  "maxItems": 1000,
  "politeEmail": "you@example.com"
}

Authors search

{
  "entityType": "authors",
  "query": "yoshua bengio",
  "maxItems": 20
}

Institutions in a country

{
  "entityType": "institutions",
  "institutionCountry": "DE",
  "maxItems": 100
}

Input Parameters

Field | Type | Default | Description
entityType | string | works | One of works, authors, institutions, concepts, sources, publishers.
query | string | machine learning | Full-text search across titles and abstracts (works) or display names (other entities). Leave empty to browse all.
yearFrom | integer | 0 | Earliest publication year (Works only). 0 means no lower bound.
yearTo | integer | 0 | Latest publication year (Works only). 0 means no upper bound.
openAccessOnly | boolean | false | Restrict Works to open-access publications.
workType | string | "" | Filter by work type: article, preprint, book, book-chapter, dataset, dissertation, review, report, standard, other.
concept | string | "" | OpenAlex concept ID (e.g. C41008148 for Computer Science). Applies to Works and Authors.
institutionCountry | string | "" | Two-letter ISO country code. Applies to Works (via authorships), Authors, and Institutions.
venueIssn | string | "" | Filter Works by host venue ISSN (e.g. 0028-0836 for Nature).
reconstructAbstract | boolean | true | Rebuild full abstracts from the OpenAlex inverted index. Adds a small amount of per-record work.
politeEmail | string | "" | Your email to qualify for the OpenAlex polite pool. Recommended for any real run.
maxItems | integer | 100 | Maximum records to return. Set to 0 for unlimited, which requires at least one filter or search query.
proxyConfiguration | object | disabled | Proxy settings. OpenAlex does not require proxies.
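
To make the Works parameters above more concrete, here is one plausible way they could translate into an OpenAlex request. The mapping is an assumption based on the public OpenAlex filter syntax (comma-separated `key:value` pairs that are ANDed together, plus `search` and `mailto` query parameters); the listing does not document how the actor builds its requests, and `build_works_params` is a hypothetical helper.

```python
def build_works_params(run_input: dict) -> dict:
    """Translate actor-style Works input into OpenAlex query parameters.

    The filter keys below are assumptions taken from the public OpenAlex API,
    not from the actor's source.
    """
    filters = []
    if run_input.get("yearFrom"):
        filters.append(f"from_publication_date:{run_input['yearFrom']}-01-01")
    if run_input.get("yearTo"):
        filters.append(f"to_publication_date:{run_input['yearTo']}-12-31")
    if run_input.get("openAccessOnly"):
        filters.append("open_access.is_oa:true")
    if run_input.get("workType"):
        filters.append(f"type:{run_input['workType']}")
    if run_input.get("concept"):
        filters.append(f"concepts.id:{run_input['concept']}")
    if run_input.get("institutionCountry"):
        country = run_input["institutionCountry"]
        filters.append(f"authorships.institutions.country_code:{country}")
    if run_input.get("venueIssn"):
        filters.append(f"primary_location.source.issn:{run_input['venueIssn']}")

    params = {}
    if run_input.get("query"):
        params["search"] = run_input["query"]
    if filters:
        params["filter"] = ",".join(filters)  # a comma means AND in OpenAlex filters
    if run_input.get("politeEmail"):
        params["mailto"] = run_input["politeEmail"]  # polite-pool identification

    # Mirror the documented rule: unlimited runs need a filter or a search query.
    if run_input.get("maxItems", 100) == 0 and not params.get("filter") and not params.get("search"):
        raise ValueError("maxItems=0 requires at least one filter or search query")
    return params
```

Passing the result through `urllib.parse.urlencode` gives the query string for a `/works` request. Zero and empty-string values are skipped, which matches the table's "0 means no bound" and empty-string defaults.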

OpenAlex Crawler Output Fields

All entity types share a common output schema. Fields that don't apply to a given entity type are left empty. The examples below show Works and Authors; Institutions, Concepts, Sources, and Publishers use the same table.

Works output example

{
  "openalex_id": "W2101234009",
  "doi": "https://doi.org/10.5555/1953048.2078195",
  "title": "Scikit-learn: Machine Learning in Python",
  "abstract": "Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems...",
  "entity_type": "work",
  "publication_year": 2011,
  "publication_date": "2011-10-01",
  "work_type": "article",
  "language": "en",
  "open_access_is_oa": true,
  "open_access_status": "green",
  "open_access_oa_url": "https://hal.inria.fr/hal-00650905v2/document",
  "venue_name": "Journal of Machine Learning Research",
  "venue_issn": "1532-4435, 1533-7928",
  "venue_publisher": "JMLR",
  "author_names": ["Fabian Pedregosa", "Gaël Varoquaux", "Alexandre Gramfort"],
  "author_ids": ["A5014316393", "A5047509574", "A5001829085"],
  "author_orcids": ["https://orcid.org/0000-0003-4025-383X"],
  "corresponding_author_name": "Fabian Pedregosa",
  "first_institution_name": "Inria",
  "first_institution_country": "FR",
  "institution_names": ["Inria", "CEA", "ENS Paris"],
  "institution_ids": ["I1294671590", "I4210121523"],
  "concepts": [
    "Computer science (L0, 0.820)",
    "Machine learning (L1, 0.750)",
    "Python (programming language) (L3, 0.610)"
  ],
  "cited_by_count": 58421,
  "referenced_works_count": 18,
  "referenced_works": ["W2109394935", "W2131462250"],
  "related_works": ["W2963773651"],
  "counts_by_year": ["2023: 8421", "2022: 7310", "2021: 6122"],
  "grants": ["French National Research Agency — ANR-12-BS01-0009"],
  "sustainable_development_goals": [],
  "openalex_url": "https://openalex.org/W2101234009"
}

Authors output example

{
  "openalex_id": "A5083138872",
  "title": "Albert Einstein",
  "entity_type": "author",
  "author_orcids": [],
  "institution_names": ["Princeton University", "Institute for Advanced Study"],
  "concepts": [
    "Physics (L0, 0.910)",
    "Quantum mechanics (L1, 0.720)"
  ],
  "cited_by_count": 184521,
  "works_count": 412,
  "counts_by_year": ["2024: 1502", "2023: 1610"],
  "openalex_url": "https://openalex.org/A5083138872"
}

Field | Type | Description
openalex_id | string | Short OpenAlex identifier (W*, A*, I*, C*, S*, P*).
entity_type | string | work, author, institution, concept, source, or publisher.
title | string | Work title or entity display name.
doi | string | DOI URL (Works only).
abstract | string | Full abstract reconstructed from OpenAlex's inverted index (Works only, when reconstructAbstract is enabled).
publication_year | number | Publication year (Works only).
publication_date | string | Publication date in YYYY-MM-DD (Works only).
work_type | string | Work type: article, preprint, book, dataset, etc. (Works only).
language | string | Language code (Works only).
open_access_is_oa | boolean | Open-access flag (Works and Sources).
open_access_status | string | gold, green, hybrid, bronze, diamond, closed (Works only).
open_access_oa_url | string | URL to the open-access version (Works only).
venue_name | string | Host venue or journal name (Works only).
venue_issn | string | Host venue ISSN, comma-separated (Works and Sources).
venue_publisher | string | Host venue publisher name (Works and Sources).
author_names | string[] | Author display names.
author_ids | string[] | OpenAlex author IDs.
author_orcids | string[] | Author ORCID URLs when available.
corresponding_author_name | string | Name of the corresponding author (Works only).
first_institution_name | string | Primary institution of the first author (Works only).
first_institution_country | string | Country code of the first author's institution (Works only).
institution_names | string[] | All institutions across all authors (Works, Authors).
institution_ids | string[] | All OpenAlex institution IDs (Works only).
concepts | string[] | Top 5 OpenAlex concepts, formatted as Name (Level, Score).
cited_by_count | number | Total citation count.
referenced_works_count | number | Number of works this work cites (Works only).
referenced_works | string[] | OpenAlex IDs of referenced works (Works only).
related_works | string[] | OpenAlex IDs of related works (Works only).
counts_by_year | string[] | Citations per year, formatted as YYYY: N.
grants | string[] | Funders and award IDs, formatted as Funder — Award.
sustainable_development_goals | string[] | UN SDG names matched to the work.
works_count | number | Number of works (Authors, Institutions, Concepts, Sources, Publishers).
country_code | string | Country code (Institutions, Publishers).
homepage_url | string | Homepage URL (Institutions, Publishers).
concept_level | number | Concept hierarchy level, 0 = most general (Concepts only).
concept_description | string | Concept description text (Concepts only).
openalex_url | string | Full OpenAlex URL for the entity.
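
Because concepts and counts_by_year arrive as formatted strings, downstream code that needs numbers has to parse them back. The sketch below assumes the exact formats shown in the examples, `Name (L<level>, <score>)` and `YYYY: N`; the helper names are hypothetical and not part of the actor.

```python
import re

# Matches "Name (L<level>, <score>)"; the name itself may contain parentheses.
CONCEPT_RE = re.compile(r"^(?P<name>.+) \(L(?P<level>\d+), (?P<score>\d+(?:\.\d+)?)\)$")


def parse_concept(text: str) -> dict:
    """Split a formatted concept string into name, level, and score."""
    match = CONCEPT_RE.match(text)
    if match is None:
        raise ValueError(f"unexpected concept format: {text!r}")
    return {
        "name": match["name"],
        "level": int(match["level"]),
        "score": float(match["score"]),
    }


def parse_counts_by_year(items: list) -> dict:
    """Turn ["2023: 8421", ...] into {2023: 8421, ...}."""
    counts = {}
    for item in items:
        year, _, count = item.partition(":")
        counts[int(year)] = int(count)
    return counts


concept = parse_concept("Python (programming language) (L3, 0.610)")
yearly = parse_counts_by_year(["2023: 8421", "2022: 7310", "2021: 6122"])
```

The greedy name group matters here: it lets a concept such as "Python (programming language)" keep its own parentheses while the trailing level and score are still split off.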

FAQ

How many records does OpenAlex Crawler cover? OpenAlex Crawler reads the full OpenAlex index: 309M+ works, 100M+ authors, 120K+ institutions, and 65K+ concepts. If a paper, researcher, or organization is in OpenAlex, the crawler can reach it.

Do I need an API key or proxies? OpenAlex Crawler runs without either. OpenAlex is free and open. Setting politeEmail is optional but recommended: it qualifies runs for the polite pool, which has higher rate limits and more predictable latency than the shared anonymous pool.

What is abstract reconstruction and why does it matter? OpenAlex stores abstracts as an inverted index (a map of words to their positions) to comply with publisher terms. The raw field is not human-readable. OpenAlex Crawler rebuilds the original text from that index, so you get a normal paragraph instead of a JSON object. Most competing actors skip this step.
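
As an illustration of what that rebuild involves, here is a minimal sketch. It assumes the inverted index has the shape OpenAlex publishes in its `abstract_inverted_index` field, a mapping from each word to the list of positions where it occurs; `reconstruct_abstract` is a hypothetical name, and the actor's own implementation may differ.

```python
from typing import Dict, List, Optional


def reconstruct_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    if not inverted_index:
        return ""  # works without an abstract carry a null or empty index
    # Expand every (position, word) pair, then sort by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


example = reconstruct_abstract(
    {"the": [0, 3], "cat": [1], "chased": [2], "dog": [4]}
)
```

A word that appears several times is listed once with multiple positions, which is why the pairs are expanded before sorting.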

Can I run a bulk export without filters? Not with maxItems set to 0. Pulling the entire 309M-work index through the API would take a while and would not be what most people actually want. Provide at least one filter or a search query when running unlimited. With filters, unlimited runs are fine because the cursor pagination scales.

How do I filter by a specific concept? OpenAlex Crawler accepts any OpenAlex concept ID in the concept field. Find IDs by browsing openalex.org/concepts or querying the Concepts endpoint in this actor. Common examples: C41008148 (Computer Science), C86803240 (Biology), C185592680 (Chemistry), C121332964 (Physics).

How current is the data? OpenAlex Crawler reads the live OpenAlex API. OpenAlex ingests from Crossref, PubMed, institutional repositories, and other sources on a continuous basis; individual records include publication_date and update timestamps on the underlying records.

Is this a free alternative to Scopus or Web of Science for citation data? For most bibliometric and citation-graph work, yes. OpenAlex is an open dataset with no paywall, and this scraper exports its works, authors, citation counts, references, and DOIs as flat records. Coverage and metadata fields differ from Scopus and Web of Science, so confirm OpenAlex carries the specific venues or fields you need before swapping a workflow over.

Need More Features?

Need extra fields, a different filter, or a scheduled run? Get in touch.

Why Use OpenAlex Crawler?

  • Full scholarly graph, one actor: Works, Authors, Institutions, Concepts, Sources, and Publishers all share the same output schema, so you can build pipelines across entity types without juggling separate tools
  • Reconstructed abstracts: OpenAlex returns abstracts as an inverted word-position index, which is not text anyone can use directly; this crawler rebuilds them into readable paragraphs, which is usually what the data is for
  • Clean, flat output: 30+ fields per work, arrays of primitives rather than nested blobs, so downstream CSV exports, pandas DataFrames, and database loads work without preprocessing
