node.js

How To Use Node.js, Request, and Cheerio to Set Up Simple Web Scraping on Ubuntu 24.04 or Newer

Web scraping is a powerful technique that allows you to extract information from websites. Whether you need to gather data for research, monitor prices, or aggregate content, web scraping can be a useful tool in your arsenal. In this blog post, we will walk you through the process of setting up a simple web scraping script using Node.js, the Request library, and Cheerio on Ubuntu 24.04 or newer.

Prerequisites

Before we dive into the code, ensure that you have the following installed on your Ubuntu machine:

  1. Node.js: You can install it from the default Ubuntu repositories, or from the NodeSource repository if you need a newer release.
  2. npm: This is the package manager for Node.js and is included with Node.js installations.

Installing Node.js and npm

To install Node.js and npm, open your terminal and run the following commands:

# Update your package index
sudo apt update

# Install Node.js and npm from the default Ubuntu repositories
sudo apt install nodejs npm
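
The default Ubuntu 24.04 repositories ship an older Node.js release. If you want a more recent version, you can install from the NodeSource repository instead. The commands below are a sketch based on NodeSource's documented setup script; the 20.x URL is one example, so check NodeSource's documentation for currently supported versions:

# Optional: install a newer Node.js from NodeSource instead
sudo apt install -y curl
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs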

You can verify the installation by checking the versions:

node -v
npm -v
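
Each command prints a version string. On a fresh Ubuntu 24.04 install from the default repositories, you should see something like the following (exact versions will vary):

v18.19.1
9.2.0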

Setting Up Your Project

  1. Create a new directory for your project:
mkdir web-scraper
cd web-scraper
  2. Initialize a new Node.js project:
npm init -y

This command will create a package.json file in your project directory.
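
The generated file holds default project metadata that you can edit later. It should look roughly like this (the exact fields npm fills in can vary slightly with your npm version):

{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}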

  3. Install the necessary libraries:

We will need request to make HTTP requests and cheerio to parse the HTML. Note that request was deprecated in 2020; it still installs and works, and it remains a simple way to learn the basics (a deprecation-free alternative is sketched later in this post). Install both by running:

npm install request cheerio

Writing the Web Scraping Script

Now that we have everything set up, it’s time to write our web scraping script. Create a new file named scraper.js:

touch scraper.js

Open scraper.js in your favorite text editor and add the following code:

const request = require('request');
const cheerio = require('cheerio');

// URL to scrape
const url = 'https://example.com'; // Replace with the website you want to scrape

// Make a request to the URL
request(url, (error, response, body) => {
    if (!error && response.statusCode === 200) {
        // Load the body into cheerio
        const $ = cheerio.load(body);

        // Use jQuery-like syntax to select elements
        $('h1').each((i, element) => {
            console.log($(element).text()); // Print the text of each <h1>
        });

        // Add more scraping logic as needed
    } else {
        // error is null on a non-200 response, so report the status code as well
        console.error('Failed to fetch the page:', error || `HTTP status ${response.statusCode}`);
    }
});

Explanation of the Code

  • We import request and cheerio.
  • We define the URL we want to scrape (replace 'https://example.com' with your target website).
  • We make a GET request to the URL using the request library.
  • If the request is successful, we load the HTML body into Cheerio.
  • We use Cheerio’s jQuery-like syntax to select and manipulate the HTML. In this example, we’re scraping all <h1> elements and printing their text to the console.
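
As noted when we installed it, the request package is deprecated. It still works, but if you would rather avoid it, Node.js 18 and newer ship a built-in global fetch that can replace it with only minor changes. Here is a minimal sketch of the same scraper using fetch; cheerio is used exactly as before:

const cheerio = require('cheerio');

// URL to scrape
const url = 'https://example.com'; // Replace with the website you want to scrape

fetch(url)
    .then((response) => {
        if (!response.ok) {
            throw new Error(`HTTP status ${response.status}`);
        }
        return response.text(); // Read the response body as a string
    })
    .then((body) => {
        // Load the body into cheerio and select elements as before
        const $ = cheerio.load(body);
        $('h1').each((i, element) => {
            console.log($(element).text());
        });
    })
    .catch((err) => console.error('Failed to fetch the page:', err));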

Running the Scraper

To run your scraping script, execute the following command in your terminal:

node scraper.js

If everything is set up correctly, you should see the text of all the <h1> elements from the specified webpage printed in your terminal.
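
For instance, if you leave the URL as https://example.com, which at the time of writing contains a single <h1>, the output is just:

Example Domain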

Important Considerations

  1. Respect robots.txt: Always check the site’s robots.txt file (for example, https://example.com/robots.txt) to see which paths you are allowed to crawl.
  2. Rate Limiting: Avoid bombarding the server with requests. Use delays between requests if scraping multiple pages (see the sketch after this list).
  3. Legal Compliance: Ensure you comply with the website’s terms and conditions regarding data usage.
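
To illustrate the second point, here is a minimal sketch of one way to pause between requests when scraping several pages. The pages array and the two-second delay are arbitrary example values, not recommendations:

const request = require('request');

const pages = ['https://example.com/a', 'https://example.com/b']; // Example URLs
const delayMs = 2000; // Arbitrary 2-second pause between requests

function scrapePage(index) {
    if (index >= pages.length) {
        return; // All pages processed
    }
    request(pages[index], (error, response, body) => {
        if (!error && response.statusCode === 200) {
            console.log(`Fetched ${pages[index]} (${body.length} bytes)`);
            // Parse body with cheerio here, as in scraper.js
        } else {
            console.error('Failed to fetch', pages[index], error || `HTTP status ${response.statusCode}`);
        }
        // Wait before moving on to the next page
        setTimeout(() => scrapePage(index + 1), delayMs);
    });
}

scrapePage(0);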

Conclusion

In this blog post, we’ve shown you how to set up a simple web scraping tool using Node.js, Request, and Cheerio on Ubuntu 24.04 or newer. With this foundation, you can expand your scraper to extract various types of data from websites.