URL: https://apify.com/technicaldost/arxiv-paper-scraper

Arxiv Paper Scraper · Apify


Pricing

$1.00 / 1,000 papers

Rating

0.0

(0)

Developer

Technical Dost Solutions

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 11

Monthly active users: 4

Last modified: 4 months ago

Scrape single-page in JavaScript template

A template for scraping data from a single web page in JavaScript (Node.js). The URL of the web page is passed in via input, which is defined by the input schema. The template uses the Axios client to get the HTML of the page and the Cheerio library to parse the data from it. The data are then stored in a dataset where you can easily access them.

This template scrapes page headings, but you can edit the code to scrape anything else from the page.
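
As one illustration of such an edit, the sketch below swaps the heading selector for links. The `extractLinks` name and the output fields (`text`, `href`) are hypothetical and not part of the template; it assumes only that `$` is the function returned by `cheerio.load(html)`.

```javascript
// Hypothetical variation on the template: collect link text and targets
// instead of headings. `$` is the function returned by cheerio.load(html).
function extractLinks($) {
    const links = [];
    $('a[href]').each((_i, element) => {
        links.push({
            text: $(element).text().trim(),  // visible link text
            href: $(element).attr('href'),   // raw href attribute, not resolved
        });
    });
    return links;
}
```

The returned array can be passed to `Actor.pushData()` in place of the headings.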

Included features

  • Apify SDK - toolkit for building Actors
  • Input schema - define and easily validate a schema for your Actor's input
  • Dataset - store structured data where each object stored has the same attributes
  • Axios client - promise-based HTTP Client for Node.js and the browser
  • Cheerio - library for parsing and manipulating HTML and XML

How it works

  1. Actor.getInput() gets the input where the page URL is defined

  2. axios.get(url) fetches the page

  3. cheerio.load(response.data) loads the page data and enables parsing the headings

  4. The following call iterates over the headings on the page; edit the selector and callback to parse whatever you need from the page

    $("h1, h2, h3, h4, h5, h6").each((_i, element) => {...});
  5. Actor.pushData(headings) stores the headings in the dataset
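
The five steps above can be put together roughly as follows. This is a minimal sketch under stated assumptions, not the Actor's actual source: the `url` input field name, the `extractHeadings` helper, and the shape of the stored objects (`level`, `text`) are chosen for illustration. The libraries are loaded inside `main()` so that the helper can be reused on its own.

```javascript
// Hypothetical helper: turns every heading found by a Cheerio-style `$`
// function into a plain object.
function extractHeadings($) {
    const headings = [];
    $('h1, h2, h3, h4, h5, h6').each((_i, element) => {
        headings.push({
            level: $(element).prop('tagName').toLowerCase(), // e.g. "h2"
            text: $(element).text().trim(),
        });
    });
    return headings;
}

async function main() {
    // Loaded lazily so extractHeadings() has no hard dependency on them.
    const { Actor } = await import('apify');
    const { default: axios } = await import('axios');
    const cheerio = await import('cheerio');

    await Actor.init();

    // 1. Read the input; the `url` field name is an assumption.
    const input = await Actor.getInput();
    const url = input?.url;
    if (!url) throw new Error('Input is missing the "url" field.');

    // 2. Fetch the page HTML.
    const response = await axios.get(url);

    // 3. Load the HTML into Cheerio.
    const $ = cheerio.load(response.data);

    // 4. Parse the headings.
    const headings = extractHeadings($);

    // 5. Store the results in the default dataset.
    await Actor.pushData(headings);

    await Actor.exit();
}
```

Call `main()` from the Actor's entry file to run the whole flow.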

Getting started

For complete information see this article. To run the Actor use the following command:

$ apify run

Deploy to Apify

Connect Git repository to Apify

If you've created a Git repository for the project, you can connect it to Apify:

  1. Go to the Actor creation page
  2. Click the Link Git Repository button

Push project on your local machine to Apify

You can also deploy the project from your local machine to Apify without a Git repository.

  1. Log in to Apify. You will need to provide your Apify API Token to complete this action.

    $ apify login
  2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.

    $ apify push

You might also like

arXiv Paper-to-JSON scraper

funny_electrician/Korak1904

arXiv Paper-to-JSON scraper: Extracts equations, tables, and text from new AI research papers.

Milton Gardener

2

ArXiv Research Paper Scraper

datapilot/arxiv-research-paper-scraper

arXiv Research Paper Scraper retrieves academic paper metadata from the arXiv API based on a keyword. It extracts titles, abstracts, authors with affiliations, DOI, categories, submission dates, and PDF links. Supports proxy usage and outputs structured JSON results for research and data analysis.

arXiv Paper Scraper

plantane/arxiv-scraper

Scrape research papers from arXiv by search query or category. Get titles, abstracts, authors, categories, and PDF links via the public arXiv API.

arXiv Search Scraper 📚

easyapi/arxiv-search-scraper

Extract comprehensive research paper data from arXiv search results. Get detailed metadata including titles, authors, abstracts, categories and more. Perfect for academic research monitoring, trend analysis and building paper databases. 🎓📚

arXiv Research Paper Scraper

crawlerbros/arxiv-research-paper-scraper

Scrape research papers from arXiv.org - search by query, category, or author; lookup by arXiv ID. Returns title, authors, abstract, PDF URL, DOI, categories, and more. Uses the public arXiv Atom API. No login or proxy required.

arXiv Scraper - Search & Export Paper Metadata

devilscrapes/arxiv-papers-scraper

Search arXiv by query, category, or author and export structured paper metadata (title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps) to JSON or CSV. An arXiv API wrapper that handles pagination, retries, and rate-limit pacing for your pipeline.