Multithreading Crawler Project OS
Submitted by:
Table of Contents
1. Introduction
2. Problem Statement
3. Solution
4. Scope
5. Algorithm
6. Flow Chart
7. Code
8. Screen Shot
9. References
Introduction:
The Internet is a huge storehouse of data. While this is a great thing, its main issue lies in its size. Whenever we need information from the Internet, we turn to search engines. These search engines in turn depend on web crawlers, which run in the background and constantly update their databases in order to fetch the best results. At its most basic, a web crawler is a program that visits a webpage, then visits the webpages linked from that page, and so on, either until a set number of pages have been crawled and their information retrieved, or until a query is matched against the information contained in a webpage. To do this, the crawler must download the data from each webpage and scrape it, which presents certain challenges. A web crawler should visit the maximum number of pages in the minimum amount of time, and it should also return results that are relevant to the topic being queried.

To tackle this, we can develop a web crawler that runs in parallel via multithreading, with each crawling task able to run on a different logical core. To further improve our results, we can also use natural language processing to ensure that the results are relevant to the query being processed. This gives us a crawler that is both optimal in performance and accurate in its results. It will be able to crawl all the web pages of a particular website, report back any 2XX and 4XX links, take the domain name from the command line, and avoid cyclic traversal of links. Web crawlers can run multiple threads and/or distribute themselves over multiple machines to increase the speed and efficiency of their crawling. In our design, multiple worker threads crawl the web, with each worker thread repeatedly running through a loop of downloading and processing a webpage: a multithreaded process that uses synchronous I/O.
Problem statement:
Crawling the web sequentially is slow: downloading a page is a blocking I/O operation, so a single-threaded crawler spends most of its time waiting on the network while the CPU sits idle. The problem addressed in this project is to build a crawler that visits the maximum number of pages of a given website in the minimum amount of time, takes the domain name from the command line, avoids cyclic traversal of links, and reports back 2XX and 4XX links, by overlapping the network waits using multiple threads.
Solution:
The most important thing to notice here is that the routine which fetches a webpage and returns the list of all URLs found on it is a blocking call: it performs an HTTP request and only returns when that request has finished. A blocking method like this should constitute a task that runs in the background while we process the URLs we have already obtained. For every new URL we get, we should create a task (i.e., create a callable), submit it to a thread pool, and receive a future. Since we will be getting a lot of new URLs and need to create a task for every one of them, we should keep all the futures in a queue and process the newly found URLs as the tasks in the queue are completed.
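As an illustration only, a minimal sketch of this pattern is shown below. The helper get_links is hypothetical: it stands for the blocking "download a page and return the URLs found on it" call described above, and the crawl function and max_pages limit are likewise illustrative, not part of the project code.

from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def crawl(seed_url, get_links, max_pages=50):
    # get_links(url) is the blocking call: it downloads the page and returns its URLs
    pool = ThreadPoolExecutor(max_workers=5)
    futures = Queue()                                  # queue that keeps all the futures
    seen = {seed_url}
    futures.put(pool.submit(get_links, seed_url))      # one task (callable) per URL
    while not futures.empty() and len(seen) < max_pages:
        for url in futures.get().result():             # process URLs as tasks complete
            if url not in seen:
                seen.add(url)
                futures.put(pool.submit(get_links, url))
    pool.shutdown(wait=False)
    return seen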
Scope:
The scope of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Search engines use these indexes, together with web graphs and an appropriate ranking algorithm (such as PageRank), to rank the pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time.
Algorithm
1. Load the HTML of the URL into an HTML parsing object such as bs4 or lxml.
2. Get all the URLs on the page (in Python bs4: soup.find_all('a', href=True)).
3. For each URL found, if the URL is internal, append it to internal_urls; otherwise append it to external_urls.
This algorithm leaves you with two filled lists, internal_urls and external_urls.
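As a sketch of these steps: the function name classify_links and the assumption that the page HTML and its base URL are already downloaded are illustrative choices, not part of the original algorithm. A link is treated as internal when it shares the site's network location.

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def classify_links(html, base_url, internal_urls, external_urls):
    soup = BeautifulSoup(html, 'html.parser')           # step 1: parse the HTML
    base_netloc = urlparse(base_url).netloc
    for anchor in soup.find_all('a', href=True):        # step 2: all URLs on the page
        url = urljoin(base_url, anchor['href'])         # resolve relative links
        if urlparse(url).netloc == base_netloc:         # step 3: internal vs. external
            internal_urls.append(url)
        else:
            external_urls.append(url)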
Flow Chart
Code:
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests

class MultiThreadedCrawler:

    def run_web_crawler(self):
        while True:
            try:
                print("\n Name of the current executing process: ",
                      multiprocessing.current_process().name, '\n')
                target_url = self.crawl_queue.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.current_scraping_url = "{}".format(target_url)
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

    def info(self):
        print('\n Seed URL is: ', self.seed_url, '\n')
        print('Scraped pages are: ', self.scraped_pages, '\n')

if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()
Screen Shot
Create a class named MultiThreadedCrawler and initialize all the variables in the constructor: assign the base URL to the instance variable named seed_url, then format the base URL into an absolute root URL using its scheme (e.g. HTTPS) and net location.

To execute the crawl-frontier tasks concurrently, use multithreading in Python: create an object of the ThreadPoolExecutor class and set max_workers to 5, i.e. execute at most 5 threads at a time. To avoid duplicate visits to web pages and to maintain a history of visited pages, create a set data structure.

Create a queue to store all the URLs of the crawl frontier and put the seed URL into it as the first item.
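The constructor itself is not part of the listing in the Code section; a minimal sketch consistent with this description (and with the attribute names used in run_web_crawler) could look like this:

from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from urllib.parse import urlparse

class MultiThreadedCrawler:                              # constructor sketch only
    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Format the base URL into an absolute root URL from its scheme and net location
        parts = urlparse(seed_url)
        self.root_url = '{}://{}'.format(parts.scheme, parts.netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)    # execute at most 5 threads at a time
        self.scraped_pages = set()                       # history of visited pages
        self.crawl_queue = Queue()                       # crawl frontier
        self.crawl_queue.put(self.seed_url)              # seed URL is the first item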
Create a method named run_web_crawler() that keeps adding links to the frontier and extracting their information: use an infinite while loop and display the name of the currently executing process.

Get a URL from the crawl frontier (with a lookup timeout of 60 seconds) and check whether it has already been visited. If it has not, format the current URL, add it to the scraped_pages set to record it in the history of visited pages, then take a thread from the pool and pass it the scrape_page method and the target URL.
Place the HTTP request with a connect timeout of 3 seconds and a maximum (read) timeout of 30 seconds; once the request is successful, return the result.
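A sketch of the scrape_page method that run_web_crawler submits to the pool, assuming the requests (connect, read) timeout tuple corresponds to the times mentioned above:

import requests

class MultiThreadedCrawler:                              # scrape_page sketch only
    def scrape_page(self, url):
        try:
            # (connect timeout, read timeout) = (3 s, 30 s)
            return requests.get(url, timeout=(3, 30))
        except requests.RequestException:
            return None                                  # treat network errors as "no result"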
Create a method named scrape_info() and pass the webpage data into BeautifulSoup, which helps us organize and format the messy web data by fixing bad HTML and presenting it in an easily traversable structure. Using the BeautifulSoup object, extract all the text present in the HTML document.
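One possible form of scrape_info(), sketched from this description (the exact output format is not specified in the report):

from bs4 import BeautifulSoup

class MultiThreadedCrawler:                              # scrape_info sketch only
    def scrape_info(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Extract all the text present in the HTML document
        text = soup.get_text(separator=' ', strip=True)
        print('\n<--- Text present in the webpage --->\n', text, '\n')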
Create a method named parse_links(); using BeautifulSoup, extract all the anchor tags present in the HTML document. soup.find_all('a', href=True) returns a list containing all the anchor tags present in the webpage. Store the tags in a list named anchor_tags. For each anchor tag in anchor_tags, retrieve the value associated with href using link['href'], and check whether the retrieved URL is an absolute URL or a relative URL.
Relative URL: a URL without the root URL and protocol name.
Absolute URL: a URL with the protocol name, root URL and document name.
If it is a relative URL, convert it into an absolute URL with the urljoin method, using the base URL and the relative URL. Then check whether the current URL has already been visited; if it has not, put it in the crawl queue.
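A sketch of parse_links() along these lines, using the attribute names from the constructor sketch above:

from bs4 import BeautifulSoup
from urllib.parse import urljoin

class MultiThreadedCrawler:                              # parse_links sketch only
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        anchor_tags = soup.find_all('a', href=True)      # all anchor tags on the page
        for link in anchor_tags:
            url = link['href']                           # value associated with href
            # urljoin turns a relative URL into an absolute one; absolute URLs pass through unchanged
            url = urljoin(self.root_url, url)
            # Avoid cyclic traversal: only queue URLs that have not been visited yet
            if url not in self.scraped_pages:
                self.crawl_queue.put(url)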
For extracting the links, call the method named parse_links() and pass it the result; for extracting the content, call the method named scrape_info() and pass it the result.
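These two calls belong in the post_scrape_callback that run_web_crawler attaches to each job; a sketch:

class MultiThreadedCrawler:                              # post_scrape_callback sketch only
    def post_scrape_callback(self, res):
        result = res.result()                            # the response returned by scrape_page
        if result and result.status_code == 200:
            self.parse_links(result.text)                # extract the links
            self.scrape_info(result.text)                # extract the content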
References
Multithreaded crawler in Python - GeeksforGeeks