Web Scraping in Node.js using Puppeteer

Table of contents


  • Introduction

  • What is Web Scraping?

  • What is a Web Crawler?

  • Why should we use Web Scraping with Node.js?

  • Web Scraping in Node.js using Puppeteer

  • Use Cases of Web Scraping

  • Pros & Cons of Web Scraping

  • Key Takeaways


Introduction

In this digital era, data plays a vital role in every process. From sales to machine learning, almost every workflow is data-centric. Depending on the process, different types of data are used: text, images, audio, and video. Gathering these types of data from various sources for a particular process is called data collection. Nowadays, data collection happens over the internet, either manually or programmatically. For a small process, manual collection may be enough, but a large process such as training an AI bot requires a huge amount of data, and collecting it by hand would be slow and error-prone. To avoid these hurdles, we use programmatic data collection, known as web scraping, to gather large amounts of data from the World Wide Web with programming languages such as Node.js.


Let's learn some fundamentals before diving into the topic.


What is Web Scraping?

Web scraping is an automated technique for extracting data such as text, audio, images, and video from websites and web pages using scripts, typically to feed some data-centric process. It is also known as web harvesting or web data extraction.


For example, the following kinds of data are commonly collected from the internet with the help of web scraping.

  • Email addresses collected for marketing and sales processes.

  • Retrieving product details such as price, features, etc., from e-commerce portals. 

  • Collecting audio and text files of human languages for Natural Language Processing in AI.


Web scraping is generally deployed when the target websites do not expose an API for data collection. Web scraping software may access the World Wide Web directly via the Hypertext Transfer Protocol (HTTP) or through a web browser.
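
For instance, a static page can often be retrieved with a single HTTP request, while JavaScript-heavy pages need a real browser such as the one Puppeteer drives. Below is a minimal sketch of the plain-HTTP approach, assuming Node.js 18+ (which ships a built-in fetch) and a placeholder URL:

// Minimal sketch (assumes Node.js 18+, which ships a built-in fetch)
fetch('https://example.com')                    // placeholder URL
  .then(response => response.text())            // download the raw HTML as a string
  .then(html => console.log(html.length));      // inspect or parse it from here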


What is a Web Crawler?

A web crawler, also known as a robot or spider, is an internet bot that automatically browses and scans websites. Search engines (like Google or Bing) use crawlers to gather data from websites and index it.


Web crawlers assist you in collecting data from public websites, finding information, and indexing online pages. Moreover, crawlers examine the links between URLs on a website to determine the structure of how these pages are related to one another.


Crawling - deployed when we want to search for information on the internet.

Scraping - deployed when we want to extract that information from the internet.

Why should we use web scraping with Node.js?

  • Node.js can be used effectively for web scraping, even though other languages and frameworks are more popular for it.


  • Node.js is a preferred solution for data-intensive, real-time IoT devices and applications because it is quick and scalable. Thanks to its non-blocking architecture, Node.js works well for encoding and broadcasting video and audio, uploading numerous files, and data streaming.


  • For projects that need intensive data processing and analysis, Node.js is a good bet. Thanks to its asynchronous nature, it can manage many concurrent connections without lagging or freezing. Hence, Node.js is worth considering if you're searching for a solid platform for data-heavy projects.


  • Node.js lets developers use JavaScript on both the server and the client, so the two sides can share code and exchange and synchronise data easily, saving time and effort.


  • For typical I/O-heavy data processing, Node.js often takes less time than comparable solutions in other languages.


  • Node.js can act as a proxy server. This is useful when the services you depend on have varying response times, or when you need to collect data from multiple sources and do not have dedicated proxy infrastructure. The server-side software may manage third-party resources, compile data from various sources, or store videos and photos.


  • An example makes this clearer: imagine a server-side programme that communicates with external resources, compiling information from several sources and saving photographs, videos, or both. Node.js can be used as a proxy here if the organisation does not already have a proxy infrastructure in place or needs to build this solution locally. A rough sketch of this aggregation pattern follows below.
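
A minimal sketch of that aggregation pattern, assuming Node.js 18+ (for the built-in fetch) and placeholder endpoint URLs:

// Hypothetical sketch: compile data from several sources concurrently.
// The endpoint URLs are placeholders, not real services.
async function aggregate() {
    const sources = [
        'https://api.example.com/products',
        'https://api.example.com/reviews'
    ];
    // Fire all requests at once; Node's event loop handles them concurrently
    const responses = await Promise.all(sources.map(url => fetch(url)));
    return Promise.all(responses.map(res => res.json()));
}

aggregate().then(data => console.log(data));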


Let's dive deeper into the topic.


Web Scraping in Node.js using Puppeteer

There are many JS libraries available in Node.js for web scraping. In this article, we will discuss Puppeteer, one of the most widely used and feature-rich of them.


Puppeteer

Puppeteer is a simple and popular JS module in Node.js for web scraping. It offers many methods that simplify web scraping and web automation.


The Puppeteer Node library offers a high-level API for controlling Chromium or Chrome over the DevTools Protocol. It runs headless by default, but can also be configured to run a full (non-headless) browser. It was developed by Google.

Features of Puppeteer

  • It is used to retrieve the text content of the scraped elements.

  • It can communicate with web pages by filling out forms, clicking on buttons or running searches inside a search bar.

  • Using Puppeteer, users can scrape and download data from the web.

  • With Puppeteer, you may crawl a Single Page Application and produce pre-rendered content.

  • It could also be very useful for a variety of other activities that are not related to web crawling, such as UI testing and performance improvement.

  • We can create PDFs from web pages and take screenshots using Puppeteer.

  • It supports headless mode. A headless browser navigates the web without a graphical user interface, so the program runs entirely in the background; disabling headless mode instead opens a visible browser window, which lets you watch the scraping in progress (see the sketch after this list).
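
A small sketch tying a few of these features together; the URL and file names below are placeholders. Note that PDF generation works in headless mode, while passing { headless: false } to launch opens a visible window for watching a scrape:

const puppeteer = require('puppeteer');

(async () => {
    // Default launch is headless; pass { headless: false } instead to watch the browser work
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');          // placeholder URL
    await page.screenshot({ path: 'example.png' });  // capture the page as an image
    await page.pdf({ path: 'example.pdf' });         // save the page as a PDF (headless mode)
    await browser.close();
})();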


Let’s have a look at some playground examples for web scraping in Node.js using the Puppeteer module.


This example will show you how to develop a web scraper that extracts a product's details. Let's develop a programme using Puppeteer that pulls the product title and price from Amazon.


With NPM, we can install Puppeteer and use it by importing it into the server file.


Configuration

Let's first set up the necessary files and folder structure and install the modules before we begin writing the code.

  1. Node.js

To ensure Node.js is installed on your system, open a terminal and enter the following command.

node -v

If it returns a version, Node.js is already installed; otherwise, install it from https://nodejs.org/en/.

  2. NPM

The Node Package Manager, or NPM, allows us to install any Node.js module.

NPM will be set up for our project using this command.

npm init -y

  3. Set up Puppeteer

With this command, the puppeteer module will be installed inside our project.

npm install puppeteer



  4. Source-code editor

Open the project in a code editor. VS Code is used in this post, but you may also use other editors like Atom, Sublime Text, etc.

  5. Creating the project folder

Make the "Amazon Web Scraper" project folder, and inside it, create the file app.js where you will put the code for the scraping.


Using Node.js to Build an Amazon Price Scraper

Let’s create an Amazon web scraper with the above configuration.

  1. Import Puppeteer

Open the app.js file and import the puppeteer module using the following command.

const puppeteer = require('puppeteer');

  2. Make a Function That Returns the Full Date

Let's write a function that returns the current year, month, day, and time.


function getData() {
    let date = new Date();
    // getMonth() is zero-based, so add 1 to get the calendar month
    let fullDate = date.getFullYear() + "-" + (date.getMonth() + 1) + "-" + date.getDate() + " " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
    return fullDate;
}
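
As a side note, if the exact format above isn't important, Date.prototype.toISOString() produces a standard timestamp in a single call:

// Alternative: an ISO 8601 timestamp such as "2022-09-06T14:39:37.000Z"
let isoDate = new Date().toISOString();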

  3. The Web Scraper Function

Let's write a web scraper function that takes a url argument, which identifies the address of the website from which we wish to extract data. Since we will use await inside it, the function must be declared async.


async function webScraper(url) {

     

};

  4. Launching the browser

Create a browser object by calling the Puppeteer module's launch function to start the browser.


const browser = await puppeteer.launch({})


Using the newPage method of the browser object, create a page object that opens a new page.


const page = await browser.newPage()


Using the goto method of the page object, navigate to the provided URL.

await page.goto(url)
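
As an aside, goto also accepts options; for example, waitUntil controls which load event Puppeteer waits for before continuing (the default, 'load', is fine for this example):

// Optional: continue as soon as the initial HTML has been parsed
await page.goto(url, { waitUntil: 'domcontentloaded' })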

  5. Extracting data from Amazon

Amazon employs an a-price-whole class for the price and a productTitle ID for the title of its products.

To select the element, let's create a product variable and assign it the result of the waitForSelector method. Then we'll extract the element's text content into another variable so we can print it, using the following commands.


var product = await page.waitForSelector("#productTitle")

var productText = await page.evaluate(product => product.textContent, product)


Apply the same logic to the product's selling price.


var price = await page.waitForSelector(".a-price-whole")

var priceText = await page.evaluate(price => price.textContent, price)
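
A related shortcut worth knowing is page.$eval, which selects an element and runs a function on it in a single call; unlike waitForSelector, it does not wait for the element to appear, so it only suits pages that have already finished loading:

// Equivalent one-liner (does not wait for the selector to appear)
var priceText = await page.$eval(".a-price-whole", price => price.textContent)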


Finally, log productText and priceText to the console along with the current timestamp from the getData function we built earlier.

console.log("Date: " + getData() + "Product: " + productText + "Price: " + priceText)


Then use the following command to close the browser we opened earlier; since close() returns a promise, await it.


await browser.close()
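
One optional hardening, sketched below: wrapping the scraping steps in try/finally guarantees that the browser shuts down even when a selector is never found (for example, if Amazon changes its markup):

async function webScraper(url) {
    const browser = await puppeteer.launch({})
    try {
        const page = await browser.newPage()
        await page.goto(url)
        // ... waitForSelector / evaluate steps as above ...
    } finally {
        await browser.close()   // runs even if a step above throws
    }
}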


Thereafter, call the webScraper method while passing the URL of the Amazon product you want to scrape.


webScraper('https://www.amazon.in/dp/B09W9MBS1G/ref=cm_sw_r_apa_i_NWPQ1TXATPCD3XBZ0P7W_0');


Entire code for data extraction from Amazon

const puppeteer = require('puppeteer');

// Returns the current date and time as a "YYYY-M-D H:M:S" string
function getData() {
    let date = new Date();
    // getMonth() is zero-based, so add 1 to get the calendar month
    let fullDate = date.getFullYear() + "-" + (date.getMonth() + 1) + "-" + date.getDate() + " " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
    return fullDate;
}

async function webScraper(url) {
    // Launch a headless browser and open a new page
    const browser = await puppeteer.launch({})
    const page = await browser.newPage()

    // Navigate to the product page
    await page.goto(url)

    // Wait for the title element and read its text content
    var product = await page.waitForSelector("#productTitle")
    var productText = await page.evaluate(product => product.textContent, product)

    // Wait for the price element and read its text content
    var price = await page.waitForSelector(".a-price-whole")
    var priceText = await page.evaluate(price => price.textContent, price)

    console.log("Date: " + getData() + "Product: " + productText + "Price: " + priceText)

    await browser.close()
};

webScraper('https://www.amazon.in/dp/B09W9MBS1G/ref=cm_sw_r_apa_i_NWPQ1TXATPCD3XBZ0P7W_0');


Output:

Date: 2022-9-6 20:9:37    

Product: ASUS Vivobook 15, 15.6-inch (39.62 cms) FHD, AMD Ryzen 7 3700U, Thin and Light Laptop (16GB/512GB SSD/Integrated Graphics/Windows 11/Office 2021/Silver/1.8 kg), M515DA-BQ722WS       

Price: 50,799.


Use cases of web scraping

  • Web scraping has beneficial uses across many industries. As of 2021, over half of all web scraping supported e-commerce processes.

  • Web scraping has evolved into a crucial tool for marketing organisations wishing to monitor their industry without conducting time-consuming manual research. For instance: how are your clients behaving? How are your leads doing? How do your prices compare to your competitors'? Do you have the insight needed to build an effective inbound or content marketing campaign?

  • The advantages of web scraping for market research also extend to business automation in many cases.

  • Web scraping can be used to produce enough user data to build organised lead lists. It's more convenient (and more promising) than building lead lists by hand, though results may vary.

  • One of the most popular uses for web scraping is to extract prices, commonly known as price scraping.

  • RSS feeds and other simple interfaces are already offered by some news websites and blogs, although they aren't necessarily the standard and aren't as popular as they once were. Because of this, gathering the precise news and content you need frequently calls for web scraping.

  • Being up to date without having to go through multiple news sources and articles is achievable with web scraping.

  • Web scraping can also be used to monitor the minimum advertised price (MAP) for a brand's goods or services. Although this is a type of price scraping, it provides essential information that helps firms decide whether their pricing is in line with what customers expect.

  • Many real estate portals use web scraping to simplify the process of compiling listings into a single database. Realtors can also employ scraping software to track typical rent and sale prices, the kinds of properties being sold, and other essential indicators.


Pros and Cons of Web Scraping


Pros

  • Web scrapers may automatically extract data from multiple websites at once, saving time and gathering more pertinent data than a single person could manually.

  • Web scraping allows you to retrieve and manage data in databases or spreadsheets on your local computer.

  • It is possible to schedule web scrapers to run regularly and export data in the preferred format (see the sketch after this list).

  • Each data-driven firm needs accurate data to function. Web scraping is the solution if you're seeking a data extraction technique that is accurate, human-free, and hassle-free.
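
A minimal sketch of the scheduling and export point from the list above; it assumes the webScraper function from earlier has been changed to return an object such as { product, price } instead of only logging, and the file name and interval are placeholders:

const fs = require('fs');

const productUrl = 'https://www.amazon.in/dp/B09W9MBS1G';         // placeholder product URL

async function scrapeAndSave() {
    const result = await webScraper(productUrl);                   // hypothetical: returns { product, price }
    const record = { date: new Date().toISOString(), ...result };
    fs.appendFileSync('prices.jsonl', JSON.stringify(record) + '\n');  // append one JSON line per run
}

setInterval(scrapeAndSave, 60 * 60 * 1000);                         // run once an hour (placeholder interval)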


Cons

  • Scrapers may break because the HTML structure of web pages is continually changing. You should undertake routine maintenance to keep your scrapers clean and functional, whether you use web scraping tools or develop your own code.

  • Several requests must be made to scrape huge websites. Some websites may ban IP addresses that often make requests.

  • Proxy servers are often required since many websites block access from specific countries. Free or inexpensive proxies typically don't help in these circumstances either, because they are heavily used and their IPs are already blacklisted (a Puppeteer proxy sketch follows after this list).

  • A large number of contemporary websites render content while the page loads in the browser. Any attempt to view the page source or retrieve it with a simple HTTP request will only show a message such as "You need to allow JavaScript to run this application". Headless browsers are necessary to scrape such dynamically generated websites, and rendering many pages requires more hardware resources and time.
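
On the proxy point above: Puppeteer can route its traffic through a proxy via Chromium's --proxy-server flag. A minimal sketch (to be placed inside an async function), with a placeholder proxy address and credentials, using page.authenticate for proxies that require login:

// Hypothetical sketch: route all browser traffic through a proxy (address and credentials are placeholders)
const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:3128']
});
const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' });  // only needed for authenticated proxies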


Key takeaways

Web scraping is a technique for gathering a variety of publicly accessible data from the internet, including prices, text, photos, contact details, and much more. When trying to gather data that would otherwise take a lot of effort to gather and organise manually, this can be helpful.


Scraping has a lot of benefits on its own. You can monitor suppliers, automate chores, search for mentions of your brand or other companies, and keep an eye on pricing comparisons. You can stay up to date with new technology and analyse large amounts of data. Using web scraping, you can automate data collection and turn it into insightful information for your company.


A "crawler" is what a basic web scraping software uses to access the internet, browse the web, and collect data from predetermined pages.


Node.js and its JS modules work well for every part of the scraping process. Using Node.js not only helps solve scraping-related problems but also keeps data extraction secure and dependable, and headless browsers can simulate real user behaviour.


We can conclude that Puppeteer is a potent library for automation, web scraping, taking screenshots, saving PDFs, and debugging, and it even supports non-headless operation.









