Crawler


The wheels of our search engine.


WEB CRAWLER

A web crawler, also known as a spider or search engine bot, downloads
and indexes content from all over the Internet. It is called a web
crawler because crawling is the technical term for automatically
accessing a website and obtaining data via a program.

The crawler goes from page to page and stores the data
fetched from each page in the database, so that the information can be
retrieved when it is needed.

2
Approach to build the Crawler

We are going to use the following Python libraries to achieve the
task:
1. requests to fetch the pages.
2. beautifulsoup4 to parse the HTML content of the response
object.
3. pymongo to connect to MongoDB, where we are going to store the
data.
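
If these packages are not already installed, they can usually be added
with pip; note that the 'lxml' parser used in the later snippets comes
from the separate lxml package. A quick, hypothetical import check:

# install first if needed: pip install requests beautifulsoup4 pymongo lxml
import requests                # fetches pages over HTTP
from bs4 import BeautifulSoup  # parses the HTML in the responses
import pymongo                 # connects to MongoDB, where the data is stored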

3
Basic Structure of Crawler
glugle/crawler/crawler.py


class Crawler:
    # part where we are going to make the connection to our MongoDB instance

    def start_crawl(self):
        pass

    def crawl(self):
        pass

4
How to make a connection to the database?
import pymongo

client = pymongo.MongoClient(connection_url_to_database)
# in our case the connection url is "mongodb://127.0.0.1:27017/"
db = client.name_of_database
# this selects the database with the given name; MongoDB creates it
# the first time data is written to it

5
The start_crawl Function

This function starts the process of crawling. It performs the task
of collecting the disallowed links from robots.txt and storing them
in a list named "disallowed_links".

6
The flow of start_crawl function

It sends a GET request to the robots.txt page of the provided URL
using the requests library.

import urllib.parse
import requests

complete_url = urllib.parse.urljoin(url, '/robots.txt')
robots = requests.get(complete_url)

7
The flow of start_crawl function
Then it parses the content it received from requests using another
library, BeautifulSoup.

from bs4 import BeautifulSoup

soup = BeautifulSoup(robots.text, 'lxml')
our_robots = soup.find('p').text

8
The flow of start_crawl function

It then checks for links in the <p> tag of the parsed content and
stores them in the list named "disallowed_links".
This list is then passed to the "crawl" function, which does the
task of crawling, collecting data and storing it. A sketch of how
these steps fit together is shown below.
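
A minimal sketch of start_crawl, assuming it is a method of the Crawler
class that takes a url and a depth, and that the disallowed paths are
read line by line from the robots.txt text; the exact signature and
parsing are not spelled out in the slides.

import urllib.parse

import requests
from bs4 import BeautifulSoup


class Crawler:
    def start_crawl(self, url, depth):
        # fetch and parse robots.txt of the given site
        robots = requests.get(urllib.parse.urljoin(url, '/robots.txt'))
        soup = BeautifulSoup(robots.text, 'lxml')
        our_robots = soup.find('p').text

        # collect the paths listed after "Disallow:" into disallowed_links
        disallowed_links = []
        for line in our_robots.splitlines():
            if line.strip().lower().startswith('disallow:'):
                path = line.split(':', 1)[1].strip()
                disallowed_links.append(urllib.parse.urljoin(url, path))

        # hand the list over to crawl, which collects and stores the data
        self.crawl(url, depth, disallowed_links)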

9
How is our Crawler going to work?

10
The crawl function
This is the function where most of the work is done. First we
define it with the parameters url, depth and disallowed_links.
Then, inside the function, the following takes place.

It tries to connect to the provided url using the requests library.

If the request returns a response, it parses the returned content
using the BeautifulSoup library and looks for the <title> and <p>
tags, which it saves in the title and description variables
respectively. A sketch of this step is shown below.
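
A minimal sketch of this step, assuming the 'lxml' parser and that the
first <p> tag on the page serves as the description (the slides do not
specify which one); the url here is only a placeholder.

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # hypothetical page being crawled
response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')
title_tag = soup.find('title')
p_tag = soup.find('p')

title = title_tag.text if title_tag else ''
description = p_tag.text if p_tag else ''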

11
The crawl function

After completing all the above steps, it creates a dictionary named
"query" with the url, title and description in it, which will be saved
in the database.

query = {
    'url': url,
    'title': title,
    'description': description
}

12
How to create a collection and insert data in
MongoDB?
# making a collection in the database
collection = db.name_of_collection
# inserting the query dictionary into our database
collection.insert_one(query)
# creating a text index on the collection so the stored pages can be searched
collection.create_index([
    ('url', pymongo.TEXT),
    ('title', pymongo.TEXT),
    ('description', pymongo.TEXT),
], name='search_results', default_language='english')
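
For context, the text index created above is what later lets the stored
pages be searched by keyword. A small, hypothetical lookup (not part of
the original slides) might look like this:

# hypothetical search against the 'search_results' text index
results = collection.find({'$text': {'$search': 'web crawler'}})
for doc in results:
    print(doc['url'], doc['title'])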

13
The crawl function

Next it checks whether depth is equal to zero. If it is zero, the
function stops; otherwise it collects all the links present on the page
using BeautifulSoup and stores them in a list named links.

It then loops through all the links and, for each link, calls
crawl with the depth variable decremented by one, like this:
self.crawl(link, depth - 1)

14
The crawl function

At last, it closes the connection it made with the database
using the .close() method.
self.client.close()
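
Putting the last few slides together, here is a minimal sketch of how the
depth check and the recursion might sit inside crawl. Passing
disallowed_links through the recursive call and extracting the links from
the href of each <a> tag are assumptions; the slides only describe these
steps in prose.

import requests
from bs4 import BeautifulSoup


class Crawler:
    def crawl(self, url, depth, disallowed_links):
        # fetch and parse the page (the title/description handling and the
        # database insert are shown on the previous slides)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')

        # stop when the depth budget is used up
        if depth == 0:
            return

        # collect every link on the page into a list named links
        links = [a.get('href') for a in soup.find_all('a', href=True)]

        # crawl each link one level shallower, keeping the disallowed list
        for link in links:
            self.crawl(link, depth - 1, disallowed_links)

One design note: because crawl calls itself recursively, closing the
connection with self.client.close() is typically done once, after the
top-level call returns, so the connection is not closed in the middle of
the recursion.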

15
Task

Create your own crawler, make it crawl some popular domains, and
store that data in your database.


You can always start by crawling popular domains like
https://wikipedia.org, https://www.rottentomatoes.com/,
https://python.org, https://stackoverflow.com/ and
https://www.geeksforgeeks.org/.
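
As a starting point, a small driver script might look like the sketch
below; the constructor call, the start_crawl signature and the depth
value are assumptions based on the earlier slides, not a fixed API.

# hypothetical driver script; adjust the names to match your own Crawler
from crawler import Crawler

seeds = [
    'https://wikipedia.org',
    'https://python.org',
    'https://stackoverflow.com/',
]

if __name__ == '__main__':
    spider = Crawler()
    for seed in seeds:
        # crawl each seed site a couple of levels deep
        spider.start_crawl(seed, 2)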

16
