Pricing
from $4.99 / 1,000 results
arXiv Research Paper Scraper
Extract comprehensive research paper data from arXiv search results including titles, authors, abstracts, categories, and more.
Extract comprehensive research paper metadata from arXiv, the premier open-access preprint server for physics, mathematics, computer science, and more.
Features
- Full paper metadata: arXiv ID, title, authors, abstract, categories, dates
- PDF & abstract links: direct links to papers
- Pagination: automatically iterates through pages to reach `maxItems`
- Deduplication: no duplicate papers across pages
- Flexible search: search by all fields, title, author, abstract, category, etc.
- Sorting: sort by relevance, submission date, or last updated date
- No anti-bot issues: arXiv is an open academic resource
Input Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | (required) | Search query (e.g. "large language models", "quantum computing") |
| searchType | string | "all" | Search field: all, ti (title), au (author), abs (abstract), cat (category) |
| sortBy | string | "relevance" | Sort by: relevance, lastUpdatedDate, submittedDate |
| sortOrder | string | "descending" | Sort order: descending, ascending |
| maxItems | integer | 50 | Maximum number of papers to extract (1 to 1000) |
| proxyConfiguration | object | - | Apify proxy config |
Example INPUT.json
{"query":"large language models","searchType":"all","sortBy":"submittedDate","sortOrder":"descending","maxItems":50}
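The parameter table above can be checked before a run is started. The following is a minimal sketch, assuming the allowed values and the 1 to 1000 range are exactly as listed in the table; `build_input` is a hypothetical helper for illustration and is not part of the actor.

```python
# Allowed values copied from the Input Parameters table.
SEARCH_TYPES = {"all", "ti", "au", "abs", "cat"}
SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
SORT_ORDER = {"descending", "ascending"}


def build_input(query, search_type="all", sort_by="relevance",
                sort_order="descending", max_items=50):
    """Build an input dict for the actor, rejecting values outside the table.

    Hypothetical helper: the actor may perform its own validation as well.
    """
    if not query or not query.strip():
        raise ValueError("query is required")
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown searchType: {search_type!r}")
    if sort_by not in SORT_BY:
        raise ValueError(f"unknown sortBy: {sort_by!r}")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"unknown sortOrder: {sort_order!r}")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    return {
        "query": query,
        "searchType": search_type,
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "maxItems": max_items,
    }


if __name__ == "__main__":
    import json
    # Reproduces the example INPUT.json shown above.
    print(json.dumps(build_input("large language models", sort_by="submittedDate")))
```

The returned dict can be serialized with `json.dumps` and saved as INPUT.json or passed as the run input.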
Output Fields
| Field | Type | Description |
|---|---|---|
| position | integer | Rank in results (1-based) |
| arxivId | string | arXiv paper ID (e.g. 2401.12345) |
| title | string | Full paper title |
| authors | array | List of author names |
| abstract | string | Full paper abstract |
| primaryCategory | string | Primary subject category (e.g. cs.AI) |
| categories | array | All subject categories |
| submittedDate | string | Original submission date |
| updatedDate | string | Last updated date |
| abstractUrl | string | URL to the abstract page |
| pdfUrl | string | Direct link to the PDF |
| comments | string | Author comments (e.g. "20 pages, 5 figures") |
| journalRef | string | Journal reference if published |
| doi | string | DOI if available |
| reportNumber | string | Report number if available |
| searchQuery | string | Query used for this result |
| scrapedAt | string | ISO 8601 timestamp |
Example Output
{"position":1,"arxivId":"2501.12345","title":"Scaling Laws for Neural Language Models","authors":["Jared Kaplan","Sam McCandlish"],"abstract":"We study empirical scaling laws for language model performance...","primaryCategory":"cs.LG","categories":["cs.LG","cs.CL","stat.ML"],"submittedDate":"15 January, 2025","updatedDate":null,"abstractUrl":"https://arxiv.org/abs/2501.12345","pdfUrl":"https://arxiv.org/pdf/2501.12345","comments":"35 pages, 14 figures","journalRef":null,"doi":null,"searchQuery":"large language models","scrapedAt":"2025-05-01T12:00:00.000Z"}
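Downstream code often needs `submittedDate` and `updatedDate` as real dates rather than strings. Here is a small sketch, assuming every date follows the "15 January, 2025" pattern seen in the example output and that missing dates arrive as `null`; `parse_arxiv_date` is a hypothetical helper name, and other date formats would need their own handling.

```python
from datetime import date, datetime
from typing import Optional


def parse_arxiv_date(value: Optional[str]) -> Optional[date]:
    """Convert a date string such as "15 January, 2025" into a date object.

    Assumes the day-month-year format shown in the example output.
    Month names are parsed using the current locale (English by default).
    """
    if value is None:
        return None
    return datetime.strptime(value, "%d %B, %Y").date()


if __name__ == "__main__":
    record = {"submittedDate": "15 January, 2025", "updatedDate": None}
    print(parse_arxiv_date(record["submittedDate"]))  # 2025-01-15
    print(parse_arxiv_date(record["updatedDate"]))    # None
```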
Pagination
arXiv returns 25 results per page. The scraper automatically navigates through pages using the `start` offset parameter until `maxItems` is reached or no more results are available.
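The loop described above, combined with the deduplication listed under Features, can be sketched roughly as follows. This is an illustration and not the actor's actual code: `fetch_page` is a hypothetical callable that takes a start offset and returns that page's list of paper dicts, and duplicates are assumed to be detected by `arxivId`.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 25  # results per page, as stated above


def collect(fetch_page: Callable[[int], List[Dict]], max_items: int) -> List[Dict]:
    """Walk pages by start offset until max_items is reached or a page is empty."""
    seen = set()       # arXiv IDs already collected
    papers: List[Dict] = []
    start = 0
    while len(papers) < max_items:
        page = fetch_page(start)
        if not page:   # no more results available
            break
        for paper in page:
            if paper["arxivId"] in seen:
                continue  # skip duplicates across pages
            seen.add(paper["arxivId"])
            papers.append(paper)
            if len(papers) == max_items:
                break
        start += PAGE_SIZE
    return papers


if __name__ == "__main__":
    # Fake pages keyed by start offset; "b" appears on both pages.
    fake_pages = {
        0: [{"arxivId": "a"}, {"arxivId": "b"}],
        25: [{"arxivId": "b"}, {"arxivId": "c"}],
    }
    result = collect(lambda start: fake_pages.get(start, []), max_items=10)
    print([p["arxivId"] for p in result])
```

A production version would also want a cap on the number of pages requested, so that a source repeating the same page cannot keep the loop running.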
Use Cases
- Academic research monitoring: track new papers in your field
- Trend analysis: identify emerging topics and research directions
- Author profiling: collect all papers by specific authors
- Citation database: build reference datasets for research tools
- Competitive intelligence: monitor publications from research groups
- AI/ML dataset creation: collect paper abstracts for NLP training
Notes
- arXiv is a free, open-access resource; no authentication is needed
- Results may vary slightly based on arXiv's real-time indexing
- The `abstract` field contains the full abstract text
- Use `searchType: "au"` to search by author name (e.g. `"Hinton, Geoffrey"`)
- Use `searchType: "cat"` with category codes like `"cs.AI"`, `"math.CO"`, `"hep-th"`
