
URL: https://blog.logrocket.com/best-node-js-web-scrapers-use-case/

The best Node.js web scrapers for your use case - LogRocket Blog


2024-10-17
#node
Juan Cruz Martinez


Editor’s note: This article was last updated on 17 October 2024.


In this article, we’ll explore a few of the best Node.js web scraping libraries and techniques. You’ll also learn how they differ and when each is the right fit for your project’s needs.


The best Node.js web scraping libraries

Whether you want to build your own search engine, monitor a website to alert you when tickets for your favorite concert are available, or you need essential information for your company, there are many Node.js web scraper libraries that have you covered.

Axios

If you’re familiar with Axios, it might not sound like the most appealing option for scraping the web. Be that as it may, it is a simple solution that can help you get the job done, and it offers the added benefit of being a library you likely already know quite well.

Axios is a promise-based HTTP client for Node.js that became super popular among JavaScript projects for its simplicity and adaptability. Although Axios is typically used in the context of calling REST APIs, it can fetch websites’ HTML as well.

Because Axios will only get the response from the server, it will be up to you to parse and work with the result. Therefore, I recommend using this library when working with JSON responses or for simple scraping needs.

You can install Axios using your favorite package manager as follows:

npm install axios

Below is an example of using Axios to list all the article headlines from the LogRocket Blog’s homepage:

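const axios = require('axios');

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Match the link text inside each <h2 class="card-title"> heading
    const reTitles = /(?<=\<h2 class="card-title"><a\shref=.*?\>).*?(?=\<\/a\>)/g;
    // Each match is an array; index 0 holds the matched title text
    [...response.data.matchAll(reTitles)].forEach(match => console.log(`- ${match[0]}`));
  })
  .catch(function (error) {
    console.error("Error fetching the page:", error.message);
  });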

In the example above, you can see how Axios is great for HTTP requests. However, extracting data from complex HTML means writing elaborate rules or regular expressions, even for simple tasks.
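To make that fragility concrete, the sketch below runs a simplified title pattern against two hand-written HTML snippets. Both snippets are made-up samples, not the real LogRocket markup, and `extractTitles` is a hypothetical helper. The pattern finds the title in the first snippet but misses it in the second, where the only change is an extra class on the heading.

```javascript
// Simplified pattern: capture the link text inside <h2 class="card-title">.
const TITLE_PATTERN = /<h2 class="card-title"><a\s[^>]*>(.*?)<\/a>/g;

// Hypothetical helper: returns every captured title in the given HTML string.
function extractTitles(html) {
  return [...html.matchAll(TITLE_PATTERN)].map((match) => match[1]);
}

// Made-up sample markup, for illustration only.
const original = '<h2 class="card-title"><a href="/post-1">First post</a></h2>';
const restyled = '<h2 class="card-title featured"><a href="/post-1">First post</a></h2>';

console.log(extractTitles(original)); // [ 'First post' ]
console.log(extractTitles(restyled)); // []
```

A DOM-based selector such as `.card-title a` would still match the second snippet, which is the main argument for parsing the HTML instead of pattern-matching it.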

So, if regular expressions aren’t your thing and you prefer a more DOM-based approach, you could transform the HTML into a DOM-like object with libraries like jsdom or Cheerio. Let’s explore the same example from above using jsdom instead:

const axios = require('axios');
const { JSDOM } = require("jsdom");

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Parse the raw HTML string into a DOM we can query with CSS selectors
    const dom = new JSDOM(response.data);
    [...dom.window.document.querySelectorAll('.card-title a')].forEach(el => console.log(`- ${el.textContent}`));
  })
  .catch(function (error) {
    console.error("Error fetching the page:", error.message);
  });

This kind of solution would soon encounter its limitations. For example, you’ll only get the raw response from the server — what if elements on the page you want to access are loaded asynchronously?

What about single-page applications (SPAs), where the HTML simply loads JavaScript libraries that do all the rendering work on the client? Or what if you encounter one of the limitations imposed by such libraries? After all, they implement only a subset of the HTML and DOM behavior found in a real browser.

In scenarios like these, or for complex websites, the best choice may be a completely different approach using other libraries.

Puppeteer

Puppeteer is a high-level Node.js API to control Chrome or Chromium with code. So, what does it mean for us in terms of web scraping?

With Puppeteer, you access the power of a full-fledged browser like Chromium, running in the background in headless mode, to navigate websites and fully render styles, scripts, and asynchronous information.

To use Puppeteer in your project, you can install it like any other JavaScript package:

npm install puppeteer

Now, let’s see an example of Puppeteer in action:

const puppeteer = require("puppeteer");

async function parseLogRocketBlogHome() {
  // Launch the browser
  const browser = await puppeteer.launch();

  // Open a new tab
  const page = await browser.newPage();

  // Visit the page and wait until network connections are completed
  await page.goto('https://logrocket.com/blog', { waitUntil: 'networkidle2' });

  // Interact with the DOM to retrieve the titles
  const titles = await page.evaluate(() => {
    // Select every link inside an element with the card-title class
    return [...document.querySelectorAll('.card-title a')].map(el => el.textContent);
  });

  // Don't forget to close the browser instance to clean up the memory
  await browser.close();

  // Print the results
  titles.forEach(title => console.log(`- ${title}`));
}

parseLogRocketBlogHome();

While Puppeteer is a fantastic solution, it is more complex to work with, especially for simple projects. It is also much more demanding in terms of resources: you are, after all, running a full Chromium browser, and we know how memory-hungry those can be.
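One way to keep that overhead in check, sketched here under assumptions, is to launch the browser once and reuse it for several URLs instead of launching it for every page. `scrapeTitles` is a hypothetical helper that takes an already-launched browser (for example, the result of `puppeteer.launch()`); the `.card-title a` selector is carried over from the example above.

```javascript
// Hypothetical helper: visit each URL in turn using one shared browser.
// `browser` is expected to be the object returned by puppeteer.launch().
async function scrapeTitles(browser, urls, selector = '.card-title a') {
  const results = {};
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'networkidle2' });
      // $$eval runs the callback in the page against all matching elements
      results[url] = await page.$$eval(selector, (els) =>
        els.map((el) => el.textContent.trim())
      );
    } finally {
      // Close the tab even if navigation or extraction throws
      await page.close();
    }
  }
  return results;
}

// Possible usage (inside an async function):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const titles = await scrapeTitles(browser, ['https://logrocket.com/blog']);
//   await browser.close();
```

Pages are visited one at a time here, which keeps memory use predictable at the cost of speed.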

X-Ray

X-Ray is a Node.js library created for scraping the web, so it’s no surprise that its API is heavily focused on that task. As such, it abstracts most of the complexity we encounter in Puppeteer and Axios.

To install X-Ray, you can run the following command:

npm install x-ray

Now, let’s build our example using X-Ray:

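const Xray = require('x-ray');
const x = Xray();

x('https://logrocket.com/blog', {
  titles: ['.card-title a']
})((err, result) => {
  // Stop early if the request or parsing failed
  if (err) {
    console.error("Error while scraping:", err);
    return;
  }
  result.titles.forEach(title => console.log(`- ${title}`));
});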

X-Ray is a great option if your use case involves scraping a large number of webpages. It supports concurrency and pagination out of the box, so you don’t need to worry about those details.
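As a rough sketch of what pagination can look like, the hypothetical `scrapeAllPages` helper below chains X-Ray's `paginate` and `limit` methods and wraps the callback in a promise. It receives the X-Ray instance as a parameter. The `.card` scope and the `.next@href` "next page" selector are assumptions for illustration and would need to match the real markup of the target site.

```javascript
// Hypothetical helper: follow "next page" links and collect titles.
// `x` is expected to be an X-Ray instance, i.e. the result of Xray().
function scrapeAllPages(x, startUrl, maxPages = 3) {
  return new Promise((resolve, reject) => {
    const query = x(startUrl, '.card', [{ title: '.card-title a' }])
      .paginate('.next@href') // assumed selector for the "next page" link
      .limit(maxPages); // stop after this many pages

    // Calling the query with a callback starts the crawl
    query((err, results) => (err ? reject(err) : resolve(results)));
  });
}

// Possible usage:
//   const Xray = require('x-ray');
//   scrapeAllPages(Xray(), 'https://logrocket.com/blog').then(console.log);
```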

Osmosis

Osmosis is very similar to X-Ray, designed explicitly for scraping webpages and extracting data from HTML, XML, and JSON documents.

To install Osmosis, run the following command:

npm install osmosis

Below is the sample code:

const osmosis = require('osmosis');

osmosis
  .get('https://logrocket.com/blog')
  .set({
    titles: ['.card-title a']
  })
  .data(function (result) {
    result.titles.forEach(title => console.log(`- ${title}`));
  })
  // Report request and parsing errors instead of failing silently
  .error(console.error);

As you can see, Osmosis is similar to X-Ray in terms of syntax and style used to retrieve and work with data.

Superagent

Superagent is a lightweight, progressive HTTP request library that works in both the browser and Node.js. Due to its simplicity and ease of use, it is commonly used for web scraping.

Just like Axios, Superagent is also limited to only getting the response from the server; it will be up to you to parse and work with the result. Depending on your scraping needs, you can retrieve HTML pages, JSON data, or other types of content using Superagent.

To use Superagent in your project, you can install it like any other JavaScript package:

npm install superagent

When scraping HTML pages, you must parse the HTML content to extract the desired data. For this, you can use libraries like Cheerio or jsdom.

To use Cheerio in your project, you can install it like any other JavaScript package:

npm install cheerio

Let’s review an example of web scraping with Superagent and Cheerio in action:

const superagent = require("superagent");
const cheerio = require("cheerio");

const url = "https://blog.logrocket.com";

superagent.get(url).end((err, res) => {
  if (err) {
    console.error("Error fetching the website:", err);
    return;
  }
  const $ = cheerio.load(res.text);
  // Replace the following selectors with the actual HTML elements you want to scrape
  const titles = $(".card-title a")
    .map((i, el) => $(el).text())
    .get();
  const descriptions = $("p.description")
    .map((i, el) => $(el).text())
    .get();
  // Display the scraped data
  console.log("Titles:", titles);
  console.log("Descriptions:", descriptions);
});

The script will make an HTTP GET request to the specified URL using Superagent, fetch the HTML content of the page, and then use Cheerio to extract the data from the specified selectors.



While Superagent is a great solution, it only sees the HTML the server returns, so extraction can be incomplete or inaccurate depending on the complexity of the website’s structure and the parsing methods used.

Playwright

Playwright is a powerful tool for web scraping and browser automation, especially when dealing with modern web applications with dynamic content and complex interactions. Its multibrowser support, automation capabilities, and performance make it an excellent choice for developers looking to perform advanced web scraping tasks in Node.js applications.

Playwright is a relatively new open source library developed by Microsoft. It provides complete control over the browser’s state, cookies, network requests, and browser events, making it ideal for complex scraping scenarios.

To use Playwright in your project, you can install it like so:

npm install playwright

Let’s look at an example of web scraping with Playwright:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  const url = "https://blog.logrocket.com"; // Replace with the URL of the website you want to scrape
  try {
    await page.goto(url);
    // Replace the following selectors with the actual HTML elements you want to scrape.
    // page.$ resolves to null when nothing matches, hence the optional chaining below.
    const titleElement = await page.$("h1");
    const descriptionElement = await page.$("p.description");
    const inputElement = await page.$('input[type="text"]');
    const title = await titleElement?.textContent();
    const description = await descriptionElement?.textContent();
    const value = await inputElement?.inputValue();

    console.log("Input value:", value);
    console.log("Title:", title);
    console.log("Description:", description);
  } catch (error) {
    console.error("Error while scraping:", error);
  } finally {
    await browser.close();
  }
})();

The script will launch a Chromium browser, navigate to the specified URL, and use Playwright’s methods to interact with the website and extract data from the specified selectors.

Playwright is a robust scraping library, but when compared to lightweight HTTP-based scraping libraries, it incurs more resource overhead because it uses headless browsers to perform scraping tasks. This can have an impact on performance and memory usage, especially if you’re scraping multiple pages or performing a large number of scraping tasks.
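One common way to trim that overhead, sketched here as an assumption rather than something the script above does, is to skip downloads the scraper doesn't need. The hypothetical `blockHeavyResources` helper uses Playwright's `page.route` to abort image, font, and media requests. Which resource types are safe to drop depends on the target site.

```javascript
// Resource types we assume the scraper can live without.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Hypothetical helper: call once per page, before page.goto().
// `page` is expected to be a Playwright Page.
async function blockHeavyResources(page) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    // Abort unwanted requests; let everything else through untouched
    return BLOCKED_TYPES.has(type) ? route.abort() : route.continue();
  });
}

// Possible usage (inside the async function from the example above):
//   const page = await context.newPage();
//   await blockHeavyResources(page);
//   await page.goto(url);
```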

Things to know about scraping the web

Although scraping publicly available information is generally legal, you should be aware that many sites put limitations in place as part of their terms of service. Some may even include rate limits to prevent you from slowing down their services. But why is that?

When you scrape information from a site, you use its resources.

If you request too many pages too quickly, you may degrade the site’s general performance for its users. So, when scraping the web, you must get consent or permission from the owner and be mindful of the strain you are putting on their site.
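A minimal sketch of keeping that strain low is to request pages one at a time with a pause in between. The `fetchPolitely` helper and its one-second default delay are assumptions for illustration; `fetchPage` stands for any async function that takes a URL, such as a thin wrapper around `axios.get`.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: fetch pages sequentially with a pause between requests.
async function fetchPolitely(urls, fetchPage, delayMs = 1000) {
  const pages = [];
  for (const [index, url] of urls.entries()) {
    if (index > 0) await sleep(delayMs); // no pause before the first request
    pages.push(await fetchPage(url));
  }
  return pages;
}

// Possible usage (inside an async function):
//   const axios = require('axios');
//   const pages = await fetchPolitely(urls, (url) => axios.get(url).then((r) => r.data));
```

A fixed delay is the simplest option. If the site publishes a rate limit, that figure is a better guide than any default.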

Lastly, web scraping requires a considerable effort for development and, in many cases, maintenance. Changes in the structure of the target site may break your scraping code and require you to update your script to adjust to the new formats.
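Because a changed page layout usually shows up as an empty result rather than an error, one inexpensive safeguard is to validate what the scraper returns. The `assertScraped` helper below is a hypothetical sketch: it trims the scraped strings, drops blanks, and throws if nothing is left.

```javascript
// Hypothetical helper: throw if a scrape returned nothing usable, so a
// changed page layout surfaces as an error instead of silently empty data.
function assertScraped(titles, selector) {
  const cleaned = titles.map((title) => title.trim()).filter(Boolean);
  if (cleaned.length === 0) {
    throw new Error(
      `No results for selector "${selector}"; the page structure may have changed`
    );
  }
  return cleaned;
}

// Possible usage with the titles array from any of the scrapers above:
//   const titles = assertScraped(rawTitles, '.card-title a');
```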

For this reason, I prefer consuming an API when possible and scraping the web only as a last resort.

Which is the best Node.js scraper?

Ultimately, the best Node.js scraper is the one that best fits your project needs. In this article, we covered some factors to help inform your decision.

For most tasks, any of these options will suffice, so choose the one you feel most comfortable with. In my professional life, I’ve had the opportunity to build multiple projects with information-gathering requirements from publicly available information and internal systems.

Because the requirements were diverse, each of these projects used different approaches and libraries, ranging from Axios to X-Ray, and ultimately turning to Puppeteer for the most complex situations.

Finally, you should always respect the website’s terms and conditions regardless of what scraper you choose. Scraping data can be a powerful tool, but with that comes great responsibility. Thanks for reading!
