URL: https://apify.com/crawlerbros/arxiv-papers-scraper

arXiv Papers Scraper · Apify


Pricing

from $1.00 / 1,000 results

arXiv Papers Scraper

Scrape academic preprints from arXiv.org by keyword, author, or category. Returns clean records with title, authors, abstract, categories, PDF URL, DOI. HTTP-only via the public arXiv API. No login, no proxy.

Rating: 0.0 (0)

Developer

πŸ‘ Crawler Bros

Crawler Bros

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 8 days ago

Search arXiv.org, the world's largest open-access archive of scientific preprints (2.5M+ papers across CS, math, physics, biology, finance, economics), and return clean structured records for every match. HTTP-only via the public arXiv API. No login, no cookies, no proxy.

What this actor does

  • Queries the arXiv API (https://export.arxiv.org/api/query) by keyword, author, and/or category
  • Parses the Atom XML response into one structured JSON record per paper
  • Filters by date range, DOI presence, abstract length, abstract keyword
  • Sorts by relevance, submitted-date, or last-updated-date
  • Walks paginated results until maxItems is reached
  • Respects arXiv's 1-request-per-3-seconds rate limit
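The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual code: the function names `build_query_url` and `parse_feed` are hypothetical, and the way inputs map onto arXiv's `all:`, `au:`, and `cat:` search prefixes is an assumption. Stripping the version suffix from the ID is also an assumption, chosen to match the `2401.12345` example format.

```python
import re
import urllib.parse
import xml.etree.ElementTree as ET

API_URL = "https://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"


def build_query_url(search_query="", categories=(), author="", start=0,
                    page_size=100, sort_by="submittedDate",
                    sort_order="descending"):
    """Build an arXiv API URL. The input-to-field mapping is an assumption."""
    terms = []
    if search_query:
        terms.append(f'all:"{search_query}"')
    if author:
        terms.append(f'au:"{author}"')
    if categories:
        terms.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    params = {
        "search_query": " AND ".join(terms),
        "start": start,
        "max_results": page_size,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def _text(entry, tag):
    """Return the whitespace-normalized text of a child element, or None."""
    value = entry.findtext(tag)
    return " ".join(value.split()) if value else None


def parse_feed(xml_text):
    """Turn an Atom feed from the arXiv API into a list of plain dicts."""
    papers = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        id_url = (entry.findtext(ATOM + "id") or "").strip()
        primary = entry.find(ARXIV + "primary_category")
        pdf_links = [link.get("href") for link in entry.findall(ATOM + "link")
                     if link.get("title") == "pdf"]
        record = {
            # Strip the version suffix (v1, v2, ...) to get a bare arXiv ID.
            "arxivId": re.sub(r"v\d+$", "", id_url.rsplit("/abs/", 1)[-1]),
            "title": _text(entry, ATOM + "title"),
            "abstract": _text(entry, ATOM + "summary"),
            "authors": [_text(a, ATOM + "name")
                        for a in entry.findall(ATOM + "author")],
            "categories": [c.get("term")
                           for c in entry.findall(ATOM + "category")],
            "primaryCategory": primary.get("term") if primary is not None else None,
            "submittedAt": _text(entry, ATOM + "published"),
            "updatedAt": _text(entry, ATOM + "updated"),
            "doi": _text(entry, ARXIV + "doi"),
            "journalRef": _text(entry, ARXIV + "journal_ref"),
            "comment": _text(entry, ARXIV + "comment"),
            "pdfUrl": pdf_links[0] if pdf_links else None,
            "htmlUrl": id_url or None,
        }
        # Drop empty values so the record carries no nulls.
        papers.append({k: v for k, v in record.items() if v not in (None, "", [])})
    return papers
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_feed` would complete the loop; the network call is left out here so the sketch stays testable offline.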

Output per paper

  • arxivId: e.g. 2401.12345
  • title, abstract, abstractWordCount
  • authors[], authorCount, affiliations[]
  • categories[], primaryCategory: e.g. cs.LG
  • submittedAt, updatedAt: ISO-8601 UTC
  • doi: when published in a journal
  • journalRef: full citation
  • comment: author's note (e.g. "15 pages, 5 figures")
  • pdfUrl: direct PDF download link
  • htmlUrl: abstract page on arXiv.org
  • recordType: "paper", scrapedAt

Empty fields are omitted (no nulls).
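As a rough illustration of the "no nulls" rule and the derived count fields, the sketch below finalizes a record before it is emitted. The helper name `build_record` is hypothetical, and counting words by splitting on whitespace is an assumption; the actor may compute `abstractWordCount` differently.

```python
from datetime import datetime, timezone


def build_record(raw):
    """Add derived fields, then drop empty values so no nulls are emitted."""
    record = dict(raw)
    record["recordType"] = "paper"
    record["scrapedAt"] = datetime.now(timezone.utc).isoformat()
    if record.get("abstract"):
        # Assumption: words are whitespace-separated tokens.
        record["abstractWordCount"] = len(record["abstract"].split())
    if record.get("authors"):
        record["authorCount"] = len(record["authors"])
    # Omit None, empty strings, and empty containers instead of emitting them.
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}
```

A consumer can therefore test for a field with `"doi" in record` rather than checking for a null value.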

Input

| Field | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | "large language models" | Free-text query against title + abstract + authors |
| categories | array | [] | arXiv subject codes (e.g. cs.LG, stat.ML). 50+ choices in the dropdown |
| authorContains | string | – | Filter by author name substring |
| sortBy | enum | submittedDate | relevance / submittedDate / lastUpdatedDate |
| sortOrder | enum | descending | descending (newest first) / ascending |
| dateRangeFrom | string | – | Drop papers submitted before this ISO date |
| dateRangeTo | string | – | Drop papers submitted after this ISO date |
| maxItems | int | 50 | Hard cap on emitted papers (1–5000) |
| includeDoiOnly | bool | false | Drop papers without a DOI (typically pre-publication) |
| minAbstractLength | int | – | Drop papers with abstracts shorter than N characters |
| abstractContains | string | – | Only emit papers whose abstract contains this substring |
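If you build inputs programmatically, it can help to validate them before starting a run. The sketch below mirrors the fields, defaults, and the 1–5000 range listed above; the `ActorInput` class is a hypothetical client-side helper, not something the actor ships, and the actor's own validation may differ.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

SORT_BY = {"relevance", "submittedDate", "lastUpdatedDate"}
SORT_ORDER = {"descending", "ascending"}


@dataclass
class ActorInput:
    """Client-side mirror of the input fields, with the documented defaults."""
    searchQuery: str = "large language models"
    categories: List[str] = field(default_factory=list)
    authorContains: Optional[str] = None
    sortBy: str = "submittedDate"
    sortOrder: str = "descending"
    dateRangeFrom: Optional[str] = None
    dateRangeTo: Optional[str] = None
    maxItems: int = 50
    includeDoiOnly: bool = False
    minAbstractLength: Optional[int] = None
    abstractContains: Optional[str] = None

    def __post_init__(self):
        if self.sortBy not in SORT_BY:
            raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
        if self.sortOrder not in SORT_ORDER:
            raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
        if not 1 <= self.maxItems <= 5000:
            raise ValueError("maxItems must be between 1 and 5000")
        for value in (self.dateRangeFrom, self.dateRangeTo):
            if value is not None:
                date.fromisoformat(value)  # raises ValueError on a malformed date
```

For example, `ActorInput(searchQuery="transformer", categories=["cs.LG"], maxItems=100)` passes, while `ActorInput(maxItems=6000)` raises `ValueError`.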

Example: latest LLM papers

{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.LG"],
  "sortBy": "submittedDate",
  "maxItems": 100
}

Example: papers by a specific author

{
  "authorContains": "Yann LeCun",
  "sortBy": "submittedDate",
  "maxItems": 50
}

Example: published papers (DOI required)

{
  "searchQuery": "transformer",
  "categories": ["cs.LG"],
  "includeDoiOnly": true,
  "minAbstractLength": 200,
  "dateRangeFrom": "2024-01-01"
}

Example: niche query

{
  "searchQuery": "diffusion model",
  "categories": ["cs.CV"],
  "abstractContains": "image generation",
  "sortBy": "relevance",
  "maxItems": 25
}

Use cases

  • AI/ML research tracking: daily run on cs.LG + cs.AI to surface new methods
  • Literature review automation: feed every paper matching your query into your RAG index
  • Author following: watch a specific researcher's new submissions
  • Trend analysis: count papers per topic over time to chart research interest
  • Citation database: pair with Crossref/DOI lookup for full bibliographic records
  • Academic content marketing: find papers citing techniques your tool implements

FAQ

Does it require a login or cookies? No. arXiv's API is fully public.

Is a proxy needed? No. arXiv accepts requests from any IP. The actor honors arXiv's 3-seconds-between-requests rate limit by default.
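One way to combine pagination with that 3-second spacing is sketched below. It is an illustration under stated assumptions, not the actor's implementation: `paginate` and its `fetch_page(start, page_size)` callback are hypothetical names, and the `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting.

```python
import time


def paginate(fetch_page, max_items, page_size=100, min_interval=3.0,
             sleep=time.sleep, clock=time.monotonic):
    """Yield up to max_items records, keeping requests min_interval seconds apart."""
    emitted = 0
    start = 0
    last_request = None
    while emitted < max_items:
        if last_request is not None:
            # Wait out whatever remains of the interval since the last request.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        page = fetch_page(start, page_size)
        if not page:
            return  # no more results
        for record in page:
            yield record
            emitted += 1
            if emitted >= max_items:
                return
        start += page_size
```

Here `fetch_page` would wrap a single request to the arXiv API with the given `start` offset and return the parsed records for that page.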

How fresh is the data? Real-time. arXiv typically posts new papers within hours of submission.

Can I get the full PDF? The actor returns pdfUrl β€” a direct link to the PDF. Download it with any HTTP client.

Why is doi missing on some papers? arXiv preprints don't always have a DOI assigned at the time of upload. Set includeDoiOnly=true to keep only papers that report a DOI, which typically indicates a journal-published version.

What's the difference between searchQuery and abstractContains? searchQuery is sent to arXiv's server-side search (ranks by relevance). abstractContains is a client-side substring filter applied AFTER fetching. Use searchQuery for relevance, abstractContains for narrow keyword filtering on top of that.
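A client-side filter of this kind can be pictured as a simple predicate applied to each fetched record. The sketch below is hypothetical: the function name `passes_filters` is invented, case-insensitive matching is an assumption, and treating `minAbstractLength` and `includeDoiOnly` as post-fetch filters alongside `abstractContains` is also an assumption.

```python
def passes_filters(paper, abstract_contains=None, min_abstract_length=None,
                   include_doi_only=False):
    """Return True if a fetched record survives the post-fetch filters."""
    abstract = paper.get("abstract", "")
    if include_doi_only and not paper.get("doi"):
        return False
    if min_abstract_length is not None and len(abstract) < min_abstract_length:
        return False
    # Assumption: substring matching ignores case.
    if abstract_contains and abstract_contains.lower() not in abstract.lower():
        return False
    return True
```

Because filtering happens after fetching, a narrow `abstractContains` value can discard most of each page, so reaching `maxItems` may take more API requests than the raw result count suggests.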

Why limit to 5000 items? arXiv's API allows up to 30k results per query but pagination beyond a few thousand becomes very slow due to the 3-second rate limit. For larger crawls, run multiple actor runs with different dateRangeFrom/dateRangeTo windows.
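Splitting a large crawl into date windows can be scripted. The sketch below generates one `dateRangeFrom`/`dateRangeTo` pair per run; the helper name `date_windows` is hypothetical, and it assumes both bounds are inclusive, which the input descriptions suggest but do not state outright.

```python
from datetime import date, timedelta


def date_windows(start, end, days=30):
    """Split [start, end] into consecutive windows of at most `days` days each."""
    windows = []
    current = start
    while current <= end:
        stop = min(current + timedelta(days=days - 1), end)
        windows.append({
            "dateRangeFrom": current.isoformat(),
            "dateRangeTo": stop.isoformat(),
        })
        current = stop + timedelta(days=1)  # next window starts the day after
    return windows
```

Each dictionary can be merged into an otherwise identical actor input, giving one run per window with no overlap between windows.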

Can I scrape the PDF text content? Not directly; this actor returns metadata only. Pair it with a downstream PDF-extraction actor if you need the full text.

How are categories specified? Use arXiv's official codes (e.g. cs.LG for ML, stat.ML for stats ML, cs.CL for NLP, q-bio.QM for quantitative biology). The dropdown lists 50+ common codes; the full taxonomy is at arxiv.org/category_taxonomy.
