How to Scrape a Real Website with Node.js

A simple yet powerful tool for any web developer!

So you need some data. You searched for it and found it online, and now you want to store it somewhere (maybe in a database or in JSON format).

Copy-pasting it manually is a boring job, though. You are a smart guy and love to automate things.

Well, say no more! The tool you are looking for is scraping.

What is scraping anyway?

According to Zyte,

Web scraping is the process of collecting structured web data in an automated fashion.

Yeah, that’s right. Today we will extract data from the FAQ section of the Hyde Park Winter Wonderland website.

Requirements

The concept of scraping can be broken down into the following steps.

  • Get the HTML file from the website (we will use axios)
  • Parse the HTML to pull out the data (we will use cheerio)

And that’s it! So simple, right?

Initialize your project

Let’s first initialize an empty Node.js project. Go to your terminal and run the following commands

mkdir web-scraper
cd web-scraper
npm init -y

This will create a package.json file for you.

Then create our root file

touch index.js

So we have a skeleton Node.js project now. Let’s use it!

Install the dependencies

Run the following command to get the required dependencies

npm i axios cheerio

Open up your index.js file and import them

const axios = require("axios");
const cheerio = require("cheerio");

Get data from the website

As we discussed earlier, the first thing to do is to get the raw data from the website, which means fetching the full HTML of the page. We can do it like this:

axios.get("https://hydeparkwinterwonderland.com/faqs/").then(
  (response) => {
    if (response.status === 200) {
      const html = response.data;
      const $ = cheerio.load(html);
    }
  },
  (error) => console.log(error)
);

Notice line 5, the cheerio.load(html) call. It loads the HTML into cheerio. Because cheerio closely resembles jQuery, we use $ as the variable name, but you can use something else.
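The same request can also be written with async/await, which some readers find easier to follow than chained callbacks. The sketch below is only one way to do it. `fetchHtml` is a hypothetical helper name, and it takes the HTTP client as a parameter (assumed to expose an axios-style `get(url)` that resolves to an object with `status` and `data`), so the snippet can be tried with a stand-in client instead of hitting the network.

```javascript
// Hypothetical helper: fetch a page and return its HTML as a string.
// `httpClient` is assumed to behave like axios: get(url) resolves to
// an object with `status` and `data`.
async function fetchHtml(httpClient, url) {
  const response = await httpClient.get(url);
  if (response.status !== 200) {
    throw new Error(`Unexpected status ${response.status} for ${url}`);
  }
  return response.data;
}

// Stand-in client for illustration only; it always returns a fixed page.
const fakeClient = {
  get: async (url) => ({
    status: 200,
    data: "<html><body>Hello</body></html>",
  }),
};

fetchHtml(fakeClient, "https://example.com/faqs/")
  .then((html) => console.log("fetched", html.length, "characters"))
  .catch((error) => console.log(error.message));
```

In the real script you would pass the imported axios object instead of the stand-in, for example fetchHtml(axios, "https://hydeparkwinterwonderland.com/faqs/"), and hand the returned string to cheerio.load.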

Okay so now we have the data in the HTML format.

But how do we parse it?

To understand that, open the target website in your browser, open the inspect tool, and study the structure of the HTML there.

Cheerio can select elements based on class name or element type (div, button, etc.). We will try to find the place in the HTML where the questions live.

Getting the questions

If we look closely, the questions are inside a button, which lives inside a div with the class name "row".

That means if we get all the divs with the class name "row", we will get all the FAQs, and from them we can extract the questions.

Let’s do that!

const $ = cheerio.load(html); // from the previous step

const individualBlock = $("div.row");
individualBlock.each(function (idx, el) {
  const question = $(el).children("div").children("button");

  console.log("question => ", $(question).text());
});

After this, go to your terminal and run

node index.js

and look at the output in your console.

So you have scraped all the questions from the website. Now use them any way you like. Try to figure out how to get the answers yourself.

If you can’t figure out the logic for finding the answers, check the complete code in the following repo.
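One way to use the result, for example, is to save the questions as JSON, as mentioned at the start. The sketch below assumes you have pushed each question string into an array inside the .each loop. The `questions` values, the `saveAsJson` helper, and the `faqs.json` file name are all made up for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Placeholder data; in the real script these come from the .each loop.
const questions = ["  What time do you open? ", "Is entry free?", ""];

// Hypothetical helper: tidy the strings and write them to a JSON file.
function saveAsJson(items, filePath) {
  const cleaned = items
    .map((text) => text.trim()) // drop stray whitespace from the markup
    .filter((text) => text.length > 0); // skip empty entries
  fs.writeFileSync(filePath, JSON.stringify(cleaned, null, 2));
  return cleaned;
}

// Written to the temp directory here so the example leaves no clutter.
const outputPath = path.join(os.tmpdir(), "faqs.json");
const saved = saveAsJson(questions, outputPath);
console.log(`saved ${saved.length} questions to ${outputPath}`);
```

Trimming and filtering before writing is a design choice, not a requirement: text pulled out of HTML often carries leading or trailing whitespace, and cleaning it once at save time keeps the stored data predictable.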

GitHub Repository

https://github.com/Mohammad-Faisal/web-scraper


Who I am

Hi, I am Mohammad Faisal, a full-stack software engineer @Cruise, working remotely from a small but beautiful country named Bangladesh.

I am most experienced in ReactJS, NodeJS, and AWS.
