Web scraping with Axios and Cheerio using Node.js
Many businesses look for a way to automate web browser activities such as logging in to a website to download files or copying data from a website into a spreadsheet. When faced with such a challenge, their initial inclination is usually to build a custom script, a solution that is often neither dependable nor affordable.
Website automation is the better solution: common web actions such as filling out forms, clicking buttons, and downloading files are handed over to software bots. These automation tasks often need large amounts of data to drive them, and collecting that data manually costs considerable time and effort. Instead of manual collection, we can use web scraping to feed these web automation tasks.
In this article, we will learn web scraping using Node.js with the JavaScript libraries Axios and Cheerio.
Web scraping is the process of programmatically retrieving large amounts of data from the World Wide Web, typically to feed automation processes. For instance,
Retrieving contact details of consumers for lead generation in marketing.
Retrieving Policies from Government Websites.
How does Web Scraping Work?
Web scraping is a straightforward concept that relies on just two components: a web crawler and a web scraper. The crawler finds the pages that contain the information you want and points the scraper at the right data so the scraper can extract it.
Web crawlers - sometimes known as "spiders" - are automated programs that navigate the internet much as you do, following links and looking for keywords. Crawling a site to find the relevant URLs is the first step in almost every web scraping project; the crawler then hands those URLs to the scraper.
Web Scraper
A web scraper is a highly specialised tool that varies in complexity depending on the requirements of the task. It is created to rapidly retrieve data from a particular web page. A web scraper's data locator or data selector is an essential component since it locates the data you want to extract, generally using CSS selectors, regex, XPath, or a mix of those.
Web scraping has two parts, namely:
Making an HTTP request to fetch the page - e.g. Axios
Parsing the returned HTML to extract the important data - e.g. Cheerio
Types of Web Scrapers
Different web scrapers offer different capabilities, each suited to particular circumstances. Broadly, they can be grouped in the following four ways.
Self-built vs Pre-built
With advanced coding skills, users can develop their own customised web scrapers, whose functionality can vary widely depending on the user's needs. Alternatively, numerous pre-built web scrapers are available for gathering the relevant data. These ready-made web scrapers can export data to Google Sheets, JSON, and many other formats.
User Interface
The user interfaces of web scrapers vary. Some expose only a minimal interface or a command line, while others offer a full graphical interface with a wide range of capabilities from which users can pick. The latter option is advantageous for users with less technical skill.
Browser extension vs Software
There are typically two categories of web scrapers: the first is a browser extension, and the other is standalone software. Users can add scraping extensions to their browsers just as they would ad blockers or themes, and these extensions are simple to install and use. Their tight integration with the browser is also a disadvantage, though: features that go beyond what the browser allows, such as IP rotation, are not possible.
Web scraping software is the other type. It is easy to download and install on a user's PC, and these scrapers offer advanced features that go beyond what browser extensions can do.
Local vs Cloud
Local web scrapers run on the user's machine, consuming its internet connection, CPU, RAM, and other resources, and can slow down or block other activity while gathering data. In addition, scrapers working through lengthy jobs or long lists of URLs may run into ISP data caps.
Cloud-based web scrapers, by contrast, consume none of the user's computing power. They run on an off-site server, do their work independently, and simply notify the user when the data is ready to export. They can also offer advanced capabilities such as IP rotation, which helps avoid being blocked by websites.
Axios and its advantages
Axios is a simple-to-use library for making HTTP requests from JavaScript. It is one of the most widely used JavaScript HTTP clients and works in both Node.js and the browser.
Axios is a lightweight HTTP client: in the browser it is built on XMLHttpRequest, while in Node.js it uses the native http module. Its API is quite similar to the Fetch API.
Axios lets you send requests to websites and servers, much as a browser does. But rather than rendering the response visually, Axios hands it to your code, which is exactly what you want for web scraping.
To obtain a website's HTML, we can utilise axios:
import axios from 'axios';

const response = await axios.get('https://www.realtor.com/news/real-estate-news/');
response.data will contain the requested URL's raw HTML. (Top-level await like this requires an ES module.)
Advantages:
It accommodates legacy browsers
It has the means to specify a response timeout and a way to cancel a request.
It has interceptors that can change a request.
It is promise-based.
Excellent error handling.
It features integrated CSRF protection, allows upload progress, and automatically transforms JSON data.
Cheerio and its advantages
Cheerio is a small package that offers jQuery-like APIs for examining HTML and XML documents. Using Cheerio, you can parse HTML documents, select certain HTML components, and get data out of them. In other words, Cheerio provides a sophisticated API for site scraping.
Cheerio's ability to perform web scraping quickly and efficiently is one of its greatest advantages. Because Cheerio is lightweight and performance-oriented, it can handle huge web pages rapidly and without using a lot of RAM. Cheerio's adaptability is another benefit.
Cheerio implements the essential parts of jQuery quickly and simply. It works both in the browser and on the server, and aids DOM traversal through a familiar, friendly API. It does not run any JavaScript inside the document and loads no external resources; it merely parses the HTML or XML.
Cheerio essentially provides you with jQuery-like queries on the DOM structure of the HTML you load. It is remarkable and enables actions like these:
const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')
const titleText = $('h2.title').text(); // 'Hello world'
Advantages:
Parsing and extracting data becomes incredibly simple.
Preconfigured methods are available out of the box.
The API is fast.
Web Scraping using Axios and Cheerio in Node.js
It is quite easy to build our Node.js scraper with Axios and Cheerio: axios fetches the URL, and the resulting HTML is loaded into cheerio. We can then query the DOM to get the data we need.
import axios from 'axios';
import cheerio from 'cheerio';

export async function scrapeRealtor() {
  // Fetch the page's HTML
  const html = await axios.get('https://www.realtor.com/news/real-estate-news/');
  // cheerio.load() is synchronous, so no await is needed here
  const $ = cheerio.load(html.data);

  let data = [];
  // Collect the first four articles on the page
  $('.site-main article').each((i, elem) => {
    if (i <= 3) {
      data.push({
        image: $(elem).find('img.wp-post-image').attr('src'),
        title: $(elem).find('h2.entry-title').text(),
        excerpt: $(elem).find('p.hide_xxs').text().trim(),
        link: $(elem).find('h2.entry-title a').attr('href')
      });
    }
  });
  console.log(data);
}
Output
[ { image:
'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/iStock-172488314-832x468.jpg',
title:
'One-Third of Mortgage Borrowers Are Missing This Opportunity to Save $2,000',
excerpt:
'Consumer advocates have an important recommendation for first-time buyers to take advantage of an opportunity to save on housing costs.',
link:
'https://www.realtor.com/news/real-estate-news/one-third-of-mortgage-borrowers-are-missing-this-opportunity-to-save-2000/' },
{ image:
'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/iStock-165493611-832x468.jpg',
title:
'Trump Administration Reducing the Size of Loans People Can Get Through FHA Cash-Out Refinancing',
excerpt:
'Cash-out refinances have grown in popularity in recent years in tandem with ballooning home values across much of the country.',
link:
'https://www.realtor.com/news/real-estate-news/trump-administration-reducing-the-size-of-loans-people-can-get-through-fha-cash-out-refinancing/' },
{ image:
'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/GettyImages-450777069-832x468.jpg',
title: 'Mortgage Rates Steady as Fed Weighs Further Cuts',
excerpt:
'Mortgage rates stayed steady a day after the Federal Reserve made its first interest-rate reduction in a decade, and as it considers more.',
link:
'https://www.realtor.com/news/real-estate-news/mortgage-rates-steady-as-fed-weighs-further-cuts/' },
{ image:
'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/07/GettyImages-474822391-832x468.jpg',
title: 'Mortgage Rates Were Falling Before Fed Signaled Rate Cut',
excerpt:
'The Federal Reserve is prepared to cut interest rates this week for the first time since 2008, but the biggest source of debt for U.S. consumers—mortgages—has been getting cheaper since late last year.',
link:
'https://www.realtor.com/news/real-estate-news/mortgage-rates-were-falling-before-fed-signaled-rate-cut/' } ]
Final Thoughts
Web scraping is the technique of extracting information and content from a website using bots. It is useful for automating tasks such as extracting the country codes of all the countries in a drop-down list, and it saves time over the laborious process of manually cataloguing a website's content. It also makes it simpler for data scientists to collect data and arrange it in tables for accurate analysis.
With the enormous growth in the amount of data on the Internet, this method is becoming more and more useful for obtaining information from websites and putting it to use in a variety of contexts. Web data extraction typically entails sending a request to the specified web page, gaining access to its HTML code, and parsing that code to extract the data. Whenever a user asks a server for a website, the server sends back an HTML document; web scraping essentially processes this HTML to obtain the necessary information.
The ability to manipulate the DOM (Document Object Model) inside a web browser with JavaScript makes it possible to create incredibly flexible data extraction scripts in Node.js.
NodeJS is a very adaptable and user-friendly data scraping tool that enables you to quickly and effectively scrape data from multiple websites by employing two free npm modules:
Axios is a promise-based HTTP client for the browser and Node.js.
Cheerio is jQuery for Node.js and makes it simple to select, edit, and view DOM elements.
As developers, we might be asked to retrieve data from a website without an API. While some websites don't place any limitations on the "Web Scraping" procedure, others do. The site's legal policy must be read, understood, and followed in both scenarios.
Before we finish, let's discuss some advice for effective web scraping with Axios and Cheerio:
Keep in mind that, as mentioned above, not all websites permit web scraping. Before you begin scraping a website, make sure you're not violating its terms.
Apply caching and throttling - Use throttling to cap the rate of your requests so you don't overwhelm a website with them. Caching lets you save the results of earlier requests, avoiding the need to make the same request again.