How To Build A Web Crawler With Node
Brian Neville-O'Neill
Introduction
A web crawler, often shortened to crawler or sometimes called a spider-bot, is a bot that
systematically browses the internet, typically for the purpose of web indexing. Search
engines can use these bots to improve the quality of search results for users. In addition
to indexing the world wide web, crawling can also be used to gather data (a practice
known as web scraping).
The process of web scraping can be quite taxing on the CPU, depending on the site’s
structure and the complexity of the data being extracted. To optimize and speed up this
process, we will make use of Node workers (threads), which are useful for CPU-intensive
operations.
In this article, we will learn how to build a web crawler that scrapes a website and stores
the data in a database. This crawler bot will perform both operations using Node
workers.
Prerequisites
1. Basic knowledge of Node.js
2. Yarn or NPM (we’ll be using Yarn)
3. A system configured to run Node code (preferably version 10.5.0 or later)
Installation
Launch a terminal and create a new directory for this tutorial:
$ mkdir worker-tutorial
$ cd worker-tutorial
We need the following packages to build the crawler:
Axios — a promise-based HTTP client for the browser and Node.js
Cheerio — a lightweight implementation of jQuery which gives us access to the
DOM on the server
Firebase database — a cloud-hosted NoSQL database. If you’re not familiar with
setting up a Firebase database, check out the documentation and follow steps 1-3
to get started
Let’s install the packages listed above with the following command:
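Assuming the Yarn workflow from the prerequisites, and assuming firebase-admin as the Firebase package (it matches the admin.initializeApp and admin.firestore calls used later in dbWorker.js ):

$ yarn add axios cheerio firebase-admin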
Hello workers
Before we start building the crawler using workers, let’s go over some basics. You can
create a test file hello.js in the root of the project to run the following snippets.
Registering a worker
A worker can be initialized (registered) by importing the worker class from the
worker_threads module like this:
// hello.js
const { Worker } = require("worker_threads");

new Worker("./worker.js");
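For the registration above to do anything, the worker file itself has to exist. A minimal, hypothetical worker.js is enough to try it out:

// worker.js
console.log("Worker thread started");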
Hello world
Printing out Hello World with workers is as simple as running the snippet below:
// hello.js
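// A minimal sketch consistent with the explanation that follows:
// the main thread registers a worker from this same file, and the
// worker branch prints the greeting.
const { Worker, isMainThread } = require("worker_threads");

if (isMainThread) {
  // we are in the main thread: spawn a worker running this file
  new Worker(__filename);
} else {
  // we are in the worker thread
  console.log("Hello World");
}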
This snippet pulls in the worker class and the isMainThread object from the
worker_threads module:
isMainThread helps us know when we are either running inside the main thread or
a worker thread
new Worker(__filename) registers a new worker with the __filename variable
which, in this case, is hello.js
To pass messages between the main thread and a worker, each side calls a postMessage() method and listens for 'message' events; inside the worker, the parentPort object (also imported from worker_threads) represents the channel back to the main thread:
// hello.js
const { Worker, isMainThread, parentPort } = require("worker_threads");
if (isMainThread) {
  const worker = new Worker(__filename);
  worker.once('message', (message) => {
    console.log(message); // prints 'Worker thread: Hello!'
  });
  worker.postMessage('Main Thread: Hi!');
} else {
  parentPort.once('message', (message) => {
    console.log(message); // prints 'Main Thread: Hi!'
    parentPort.postMessage("Worker thread: Hello!");
  });
}
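Running hello.js now prints 'Main Thread: Hi!' (logged by the worker) followed by 'Worker thread: Hello!' (logged back by the main thread).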
The source code for this tutorial is available here on GitHub. Feel free to clone it, fork it
or submit an issue.
Main thread (main.js)
In the main thread, we will crawl the IBAN website for its current currency exchange
rates, using axios to fetch the page. We will also use cheerio to traverse the DOM and
extract data from the table element. To know the exact elements to extract, we will open
the IBAN website in our browser and load dev tools:
In dev tools, we can see the table element with the classes table table-bordered
table-hover downloads . This will be a great starting point, and we can feed that
into our cheerio root element selector:
// main.js
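// A minimal sketch of the crawling step, assuming the IBAN exchange-rates
// page as the target URL and a fetchData helper around axios; both names
// are illustrative.
const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://www.iban.com/exchange-rates"; // assumed target page

async function fetchData(url) {
  console.log("Crawling data...");
  // make the HTTP call to the page we want to scrape
  const response = await axios(url).catch((err) => console.log(err));
  if (!response || response.status !== 200) {
    console.log("Error occurred while fetching data");
    return;
  }
  return response;
}

fetchData(url).then((res) => {
  if (!res) return; // bail out if the request failed
  const html = res.data;
  const $ = cheerio.load(html);
  // select every row of the table using the classes found in dev tools
  const statsTable = $(".table.table-bordered.table-hover.downloads > tbody > tr");
  statsTable.each(function () {
    // log the raw text content of each row
    console.log($(this).find("td").text());
  });
});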
Running the code above with Node will print the raw text of each table row to the
console.
Going forward, we will update the main.js file so that we can properly format our
output and send it to our worker thread.
// main.js
[...]
let workDir = __dirname + "/dbWorker.js";

// mainFunc fetches the page and formats the rows into dataObj
// (the formatting body is elided here)
const mainFunc = async () => {
  [...]
  return dataObj;
};
mainFunc().then((res) => {
  // start worker
  const worker = new Worker(workDir);
  console.log("Sending crawled data to dbWorker...");
  // send formatted data to worker thread
  worker.postMessage(res);
  // listen to message from worker thread
  worker.on("message", (message) => {
    console.log(message);
  });
});
[...]
In the snippet above, we are doing more than data formatting; after the mainFunc() has
been resolved, we pass the formatted data to the worker thread for storage.
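Since the formatting body is elided in the snippet above, here is one possible sketch of what mainFunc could look like, reusing the url , fetchData , and cheerio bindings from the earlier snippet; the column positions for the currency code and rate are assumptions:

// a sketch of the elided formatting step in main.js
const mainFunc = async () => {
  const res = await fetchData(url);
  if (!res) return {};
  const $ = cheerio.load(res.data);
  const rows = $(".table.table-bordered.table-hover.downloads > tbody > tr");
  let dataObj = {};
  rows.each(function () {
    const cells = $(this).find("td");
    const code = $(cells[0]).text().trim(); // e.g. "USD" (assumed column)
    const rate = $(cells[2]).text().trim(); // exchange rate (assumed column)
    if (code) dataObj[code] = rate;
  });
  return dataObj;
};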
Worker thread (dbWorker.js)
In this worker thread, we will initialize Firebase and listen for the crawled data from the
main thread. When the data arrives, we will store it in the database and send a message
back to the main thread to confirm that data storage was successful.
The snippet that takes care of the aforementioned operations can be seen below:
// dbWorker.js
const { parentPort } = require("worker_threads");
const admin = require("firebase-admin");

// firebase credentials
let firebaseConfig = {
  apiKey: "XXXXXXXXXXXX-XXX-XXX",
  authDomain: "XXXXXXXXXXXX-XXX-XXX",
  databaseURL: "XXXXXXXXXXXX-XXX-XXX",
  projectId: "XXXXXXXXXXXX-XXX-XXX",
  storageBucket: "XXXXXXXXXXXX-XXX-XXX",
  messagingSenderId: "XXXXXXXXXXXX-XXX-XXX",
  appId: "XXXXXXXXXXXX-XXX-XXX"
};

// Initialize Firebase
admin.initializeApp(firebaseConfig);
let db = admin.firestore();

// get the current date in DD-MM-YYYY format (getMonth() is zero-based, hence the +1)
let date = new Date();
let currDate = `${date.getDate()}-${date.getMonth() + 1}-${date.getFullYear()}`;

// receive crawled data from main thread
parentPort.once("message", (message) => {
  console.log("Received data from mainWorker...");
  // store data gotten from main thread in database
  db.collection("Rates").doc(currDate).set({
    rates: JSON.stringify(message)
  }).then(() => {
    // send data back to main thread if operation was successful
    parentPort.postMessage("Data saved successfully");
  })
  .catch((err) => console.log(err));
});
Note: To set up a database on Firebase, please visit the Firebase documentation and follow
steps 1-3 to get started.
Running main.js (which spawns dbWorker.js ) with Node logs the crawling progress
and ends with the confirmation message from the worker thread, Data saved successfully.
You can now check your Firebase database and see the crawled data stored in the Rates collection.
Final notes
Although web crawling can be fun, it can also be against the law if you use the data to
commit copyright infringement. It is generally advised that you read the terms and
conditions of the site you intend to crawl beforehand, to know its data crawling policy.
You can learn more in the Crawling Policy section of this page.
The use of worker threads does not guarantee that your application will be faster, but it
can appear that way when used efficiently, because it frees up the main thread by
moving CPU-intensive tasks off of it.
Conclusion
In this tutorial, we learned how to build a web crawler that scrapes currency exchange
rates and saves them to a database. We also learned how to use worker threads to run
these operations.
The source code for each of the snippets in this tutorial is available on GitHub. Feel free
to clone it, fork it, or submit an issue.