Crawler
Approach to Building the Crawler
● We are going to use the following Python libraries to achieve the task:
1. requests to fetch the pages.
2. beautifulsoup4 to parse the HTML content of the responses we receive.
3. pymongo to connect to MongoDB, where we are going to store the data.
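For reference, here is a minimal setup sketch; it assumes the three packages are installed with pip and shows the imports used throughout the rest of these slides.

# install the dependencies once from your shell:
#   pip install requests beautifulsoup4 pymongo

import requests
from bs4 import BeautifulSoup
import pymongo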
Basic Structure of Crawler
glugle/crawler/crawler.py
class Crawler:
    # here we are going to make the connection to our MongoDB instance

    def start_crawl(self):
        pass

    def crawl(self):
        pass
How to make a connection to the database?
import pymongo

client = pymongo.MongoClient(connection_url_to_database)
# in our case the connection URL is "mongodb://127.0.0.1:27017/"
db = client.name_of_database
# this refers to the database by name; MongoDB creates it once data is inserted
The start_crawl Function
The flow of the start_crawl function
● It sends a GET request to the robots.txt page of the provided URL using the requests library.

import urllib.parse
import requests

complete_url = urllib.parse.urljoin(url, '/robots.txt')
robots = requests.get(complete_url)
● Then it parses the content it got from requests using another library, BeautifulSoup.
● It then checks for links in the <p> tags in the HTML content and stores them in a list named disallowed_links.
● This list is then passed to the crawl function, which does the work of crawling, collecting the data and storing it, as in the sketch below.
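Putting these steps together, a minimal sketch of start_crawl could look like this. The method signature and the exact way the disallowed links are pulled out of the parsed robots.txt are assumptions based on the description above, not the definitive implementation.

import urllib.parse
import requests
from bs4 import BeautifulSoup

# inside the Crawler class shown earlier
def start_crawl(self, url, depth):
    # fetch the robots.txt page of the provided url
    complete_url = urllib.parse.urljoin(url, '/robots.txt')
    robots = requests.get(complete_url)

    # parse the response content with BeautifulSoup
    soup = BeautifulSoup(robots.text, 'html.parser')

    # collect the text found in <p> tags as the disallowed links
    disallowed_links = [p.get_text() for p in soup.find_all('p')]

    # hand everything over to crawl, which crawls, collects and stores the data
    self.crawl(url, depth, disallowed_links)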
How is our Crawler going to work?
The crawl function
This is the function where most of the work is done. First we define it with the parameters url, depth and disallowed_links. Then, inside the function, the following takes place.
● It tries to connect to the provided URL using the requests library.
● If the request returns a response, it parses the returned content using the BeautifulSoup library and looks for the <title> tag and the <p> tags, which it saves in the title and description variables respectively, as sketched below.
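Under the same assumptions, the first part of crawl might be sketched like this; the error handling and the exact way the description is built from the <p> tags are illustrative choices rather than the definitive code.

import requests
from bs4 import BeautifulSoup

# inside the Crawler class shown earlier
def crawl(self, url, depth, disallowed_links):
    # try to fetch the provided url; give up quietly if the request fails
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        return

    # parse the returned content and pull out the title and the <p> tags
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.string if soup.title else ''
    description = ' '.join(p.get_text() for p in soup.find_all('p'))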
● After completing all the above steps, it creates a dictionary named query with the url, title and description in it, which will be saved in the database.

query = {
    'url': url,
    'title': title,
    'description': description
}
How to create a collection and insert data in MongoDB?
# making a collection in the database
collection = db.name_of_collection
# inserting the query document into our database
collection.insert_one(query)
# creating a text index over the documents so they can be searched later
collection.create_index([
    ('url', pymongo.TEXT),
    ('title', pymongo.TEXT),
    ('description', pymongo.TEXT),
], name='search_results', default_language='english')
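The text index named search_results is what later lets the search side of the project run keyword queries over the stored pages. A hypothetical lookup (the search term and the score projection are purely illustrative) could look like:

# search the indexed fields for a keyword and rank results by text score
results = collection.find(
    {'$text': {'$search': 'python'}},
    {'score': {'$meta': 'textScore'}},
).sort([('score', {'$meta': 'textScore'})])

for document in results:
    print(document['url'], document['title'])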
The crawl function
● Next it checks whether depth is equal to zero. If it is zero, the function stops; otherwise it collects all the links present in the page using BeautifulSoup and stores them in a list named links.
● It then loops through the links and, for each link, calls crawl with the depth decremented by one, as in the line below and the sketch that follows:
self.crawl(link, depth - 1, disallowed_links)
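A sketch of this last part of crawl, assuming the soup object from the parsing step above; extracting links from the href attribute of <a> tags is an assumption, since the slides do not specify how the links are collected.

# stop recursing once the crawl depth is exhausted
if depth == 0:
    return

# collect all links present in the page
links = [a.get('href') for a in soup.find_all('a', href=True)]

# crawl each link with the depth decremented by one
for link in links:
    self.crawl(link, depth - 1, disallowed_links)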
● Finally, it closes the connection it made to the database using the .close() method.
self.client.close()
Task
● Create your own crawler, make it crawl some popular domains, and store that data in your database.
● You can always start by crawling popular domains like https://wikipedia.org, https://www.rottentomatoes.com/, https://python.org, https://stackoverflow.com/ and https://www.geeksforgeeks.org/.
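A hypothetical driver script for this task, assuming your class lives in crawler.py and that start_crawl takes a starting URL and a depth as sketched earlier (the depth of 2 is just an example):

from crawler import Crawler

seed_urls = [
    'https://wikipedia.org',
    'https://www.rottentomatoes.com/',
    'https://python.org',
    'https://stackoverflow.com/',
    'https://www.geeksforgeeks.org/',
]

crawler = Crawler()
for url in seed_urls:
    # crawl each seed domain a couple of levels deep and store the results
    crawler.start_crawl(url, depth=2)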