Crawler
Approach to Building the Crawler
● We are going to use the following Python libraries to achieve the task:
1. requests to fetch the pages.
2. beautifulsoup4 to parse the HTML content of the responses we receive.
3. pymongo to connect to MongoDB, where we are going to store the data.
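For reference, here is a minimal setup sketch; it assumes the three packages are installed with pip and shows the imports used throughout the rest of these slides.

# install the dependencies once from your shell:
#   pip install requests beautifulsoup4 pymongo

import requests
from bs4 import BeautifulSoup
import pymongo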
Basic Structure of Crawler
glugle/crawler/crawler.py
class Crawler:
    # here we are going to make the connection to our MongoDB instance

    def start_crawl(self):
        pass

    def crawl(self):
        pass
How to make a connection to the database?
import pymongo

client = pymongo.MongoClient(connection_url_to_database)
# in our case the connection URL is "mongodb://127.0.0.1:27017/"
db = client.name_of_database
# this refers to the database by name; MongoDB creates it once data is inserted
The start_crawl Function
The flow of the start_crawl function
● It sends a GET request to the robots.txt page of the provided URL using the requests library.

import urllib.parse
import requests

complete_url = urllib.parse.urljoin(url, '/robots.txt')
robots = requests.get(complete_url)
● Then it parses the content it got from requests using another library, BeautifulSoup.
● It then checks for links in the <p> tags in the HTML content and stores them in a list named disallowed_links.
● This list is then passed to the crawl function, which does the work of crawling, collecting the data and storing it, as in the sketch below.
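Putting these steps together, a minimal sketch of start_crawl could look like this. The method signature and the exact way the disallowed links are pulled out of the parsed robots.txt are assumptions based on the description above, not the definitive implementation.

import urllib.parse
import requests
from bs4 import BeautifulSoup

# inside the Crawler class shown earlier
def start_crawl(self, url, depth):
    # fetch the robots.txt page of the provided url
    complete_url = urllib.parse.urljoin(url, '/robots.txt')
    robots = requests.get(complete_url)

    # parse the response content with BeautifulSoup
    soup = BeautifulSoup(robots.text, 'html.parser')

    # collect the text found in <p> tags as the disallowed links
    disallowed_links = [p.get_text() for p in soup.find_all('p')]

    # hand everything over to crawl, which crawls, collects and stores the data
    self.crawl(url, depth, disallowed_links)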
How is our Crawler going to work?
The crawl function
This is the function where most of the work is done. First we define it with the parameters url, depth and disallowed_links. Then, inside the function, the following takes place.
● It tries to connect to the provided URL using the requests library.
● If the request returns a response, it parses the returned content using the BeautifulSoup library and looks for the <title> tag and the <p> tags, which it saves in the title and description variables respectively, as sketched below.
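Under the same assumptions, the first part of crawl might be sketched like this; the error handling and the exact way the description is built from the <p> tags are illustrative choices rather than the definitive code.

import requests
from bs4 import BeautifulSoup

# inside the Crawler class shown earlier
def crawl(self, url, depth, disallowed_links):
    # try to fetch the provided url; give up quietly if the request fails
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        return

    # parse the returned content and pull out the title and the <p> tags
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.string if soup.title else ''
    description = ' '.join(p.get_text() for p in soup.find_all('p'))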
● After completing all the above steps, it creates a dictionary named query with the url, title and description in it, which will be saved in the database.

query = {
    'url': url,
    'title': title,
    'description': description
}
How to create a collection and insert data in MongoDB?
# making a collection in the database
collection = db.name_of_collection
# inserting the query document into our database
collection.insert_one(query)
# creating a text index over the documents so they can be searched later
collection.create_index([
    ('url', pymongo.TEXT),
    ('title', pymongo.TEXT),
    ('description', pymongo.TEXT),
], name='search_results', default_language='english')
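The text index named search_results is what later lets the search side of the project run keyword queries over the stored pages. A hypothetical lookup (the search term and the score projection are purely illustrative) could look like:

# search the indexed fields for a keyword and rank results by text score
results = collection.find(
    {'$text': {'$search': 'python'}},
    {'score': {'$meta': 'textScore'}},
).sort([('score', {'$meta': 'textScore'})])

for document in results:
    print(document['url'], document['title'])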
The crawl function
● Next it checks whether depth is equal to zero. If it is zero, the function stops; otherwise it collects all the links present in the page using BeautifulSoup and stores them in a list named links.
● It then loops through the links and, for each link, calls crawl with the depth decremented by one, as in the line below and the sketch that follows:
self.crawl(link, depth - 1, disallowed_links)
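A sketch of this last part of crawl, assuming the soup object from the parsing step above; extracting links from the href attribute of <a> tags is an assumption, since the slides do not specify how the links are collected.

# stop recursing once the crawl depth is exhausted
if depth == 0:
    return

# collect all links present in the page
links = [a.get('href') for a in soup.find_all('a', href=True)]

# crawl each link with the depth decremented by one
for link in links:
    self.crawl(link, depth - 1, disallowed_links)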
● Finally, it closes the connection it made to the database using the .close() method.
self.client.close()
Task
● Create your own crawler, make it crawl some popular domains, and store that data in your database.
● You can always start by crawling popular domains like https://wikipedia.org, https://www.rottentomatoes.com/, https://python.org, https://stackoverflow.com/ and https://www.geeksforgeeks.org/.
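A hypothetical driver script for this task, assuming your class lives in crawler.py and that start_crawl takes a starting URL and a depth as sketched earlier (the depth of 2 is just an example):

from crawler import Crawler

seed_urls = [
    'https://wikipedia.org',
    'https://www.rottentomatoes.com/',
    'https://python.org',
    'https://stackoverflow.com/',
    'https://www.geeksforgeeks.org/',
]

crawler = Crawler()
for url in seed_urls:
    # crawl each seed domain a couple of levels deep and store the results
    crawler.start_crawl(url, depth=2)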