IR - Assignment - RK


Society for Computer Technology and Research’s

PUNE INSTITUTE OF COMPUTER TECHNOLOGY


S.No.-27, Pune Satara Road, Dhankawadi, Pune-411043

A.Y. 2023-24

Department of Computer Engineering

Information Retrieval
Class: BE 3
(SEMESTER-VII)

A REPORT ON

“Parallel Information Retrieval and Web Search”

Submitted by
Rutuja Kamble (41360)

ABSTRACT

This project report provides a comprehensive exploration of two critical domains in the realm
of information retrieval and web search: "Parallel Information Retrieval" and "Web Search." It
investigates the intricacies of Parallel Information Retrieval, including Parallel Query Processing and
the utilization of the MapReduce framework. The project also delves into Web Search, covering topics
such as the structure of the web, user queries, ranking algorithms (both static and dynamic), evaluation
methods, and web crawling techniques using Python libraries such as Scrapy and Beautiful Soup.

The report outlines the importance of these topics in today's digital age, emphasizing the need
for efficient information retrieval and web search capabilities. It begins with an introduction to set the
context and importance of these subjects in the broader field of computer science and information
technology. Parallel Information Retrieval is a central focus, with discussions on its essential
components, including parallel query processing, distributed databases, and query optimization. The
integration of MapReduce into this framework is explored, showcasing its application in improving
efficiency and scalability.

INDEX

1. Introduction
2. Parallel Information Retrieval
    2.1 Introduction
    2.2 Query Processing
        2.2.1 Document Partitioning
        2.2.2 Term Partitioning
    2.3 MapReduce
3. Web Search
    3.1 Introduction
    3.2 Structure of the Web
    3.3 Queries and Users
    3.4 Static and Dynamic Ranking
    3.5 Evaluation of Web Search
    3.6 Web Crawlers
    3.7 Web Crawler Libraries
    3.8 Python Scrapy
    3.9 Beautiful Soup
4. Conclusion

1. INTRODUCTION

In the digital age, the ability to swiftly and effectively access relevant information is paramount.
Whether you're searching the vast expanse of the internet or querying large databases, the art of
Information Retrieval is at the core of modern life. This project embarks on a comprehensive
exploration of two crucial facets of this discipline: "Parallel Information Retrieval" and "Web
Search."

Parallel Information Retrieval is the art of harnessing parallel and distributed computing
techniques to enhance the efficiency of information retrieval from expansive datasets. With the
ever-growing volume of data, traditional retrieval methods are increasingly inadequate. Parallel
Information Retrieval offers a solution by allowing the distribution of retrieval tasks across multiple
processors and machines, making it possible to handle data of unprecedented scale. The heart of
Parallel Information Retrieval lies in "Parallel Query Processing." This technique involves breaking
queries into smaller, independent subqueries that can be processed concurrently. Query
decomposition, execution, and coordination are key components of this process. Additionally,
Parallel Information Retrieval leverages "MapReduce," a programming model and framework that
divides data into chunks and processes them in parallel, facilitating tasks such as indexing and query
processing in distributed environments.

Complementing this exploration is a deep dive into "Web Search." The internet is an
interconnected maze of information, and efficiently navigating it is an intricate art. This section
investigates the architecture of the web, the dynamics of user queries, and the complex algorithms
behind search engine ranking systems. It dissects the concepts of "Static Ranking" algorithms,
which determine the initial order of search results, and "Dynamic Ranking" algorithms, which adapt
to user behaviour and feedback.

Evaluation is a critical aspect of web search, and this report elucidates the metrics and
methodologies used to assess the effectiveness of search engines. Moreover, web crawling is an
essential process in acquiring data from the web. Python libraries, such as Scrapy and Beautiful
Soup, are presented as tools for web scraping and crawling.

Throughout the project, case studies and practical applications illustrate the real-world
significance of these concepts, spanning from e-commerce recommendation systems to the
scientific frontier. Challenges, including scalability and data privacy concerns, are highlighted,
along with a glimpse into the future of Information Retrieval in the era of big data, cloud computing,
and distributed machine learning.

2. PARALLEL INFORMATION RETRIEVAL

2.1 Introduction
Information retrieval systems often have to deal with very large amounts of data. They must be
able to process many gigabytes or even terabytes of text, and to build and maintain an index for
millions of documents. A single computer simply does not have the computational power or the
storage capabilities required for indexing even a small fraction of the World Wide Web. In this
chapter we examine various ways of making information retrieval systems scale to very large text
collections such as the Web. The first part is concerned with parallel query processing, where the
search engine’s service rate is increased by having multiple index servers process incoming
queries in parallel. It also discusses redundancy and fault tolerance issues in distributed search
engines. In the second part we shift our attention to the parallel execution of off-line tasks, such as
index construction and statistical analysis of a corpus of text. We explain the basics of MapReduce,
a framework designed for massively parallel computations carried out on large amounts of data.

2.2 Query Processing

There are many ways in which parallelism can help a search engine process queries faster. The
two most popular approaches are index partitioning and replication. Suppose we have a total of n
index servers. Following the standard terminology, we refer to these servers as nodes. By creating
n replicas of the index and assigning each replica to a separate node, we can realize an n-fold
increase of the search engine’s service rate (its theoretical throughput) without affecting the time
required to process a single query. This type of parallelism is referred to as inter-query parallelism,
because multiple queries can be processed in parallel but each individual query is processed
sequentially.

Alternatively, we could split the index into n parts and have each node work only on its own
small part of the index. This approach is referred to as intra-query parallelism, because each query
is processed by multiple servers in parallel. It improves the engine’s service rate as well as the
average time per query. In this section we focus primarily on methods for intra-query parallelism.
We study index partitioning schemes that divide the index into independent parts so that each node
is responsible for a small piece of the overall index. The two predominant index partitioning
schemes are document partitioning and term partitioning (visualized in the figure below). In a
document-partitioned index, each node holds an index for a subset of the documents in the
collection. For instance, in the figure’s document-partitioned example, the index maintained by
node 2 contains the following doc id lists: L1 = ⟨4, 6⟩, L2 = ⟨5⟩, L4 = ⟨4⟩, L5 = ⟨6⟩, L7 = ⟨4, 6⟩. In a
term-partitioned index, each node is responsible for a subset of the terms in the collection. The
index stored in node 1 of the term-partitioned example contains the following lists:
L1 = ⟨1, 3, 4, 6, 9⟩, L2 = ⟨2, 5⟩, L3 = ⟨2, 3, 8⟩. The two partitioning strategies differ greatly in the
way queries are processed by the system. In a document-partitioned search engine, each of the n
nodes is involved in processing all queries received by the engine. In a term-partitioned
configuration, a query is seen by a given node only if the node’s index contains at least one of the
query terms.

Figure: The two prevalent index partitioning schemes: document partitioning and term
partitioning (shown for a hypothetical index containing 8 terms and 9 documents).
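To make the two schemes concrete, here is a minimal Python sketch (a toy illustration only; the index data and the modulo/hash assignment rules are assumptions made for the example, not the assignment used by any particular engine) that splits a small inverted index by document and by term:

    # Toy inverted index: term -> sorted list of document IDs (invented data).
    index = {
        "L1": [1, 3, 4, 6, 9],
        "L2": [2, 5],
        "L3": [2, 3, 8],
        "L4": [4, 7],
    }

    def document_partition(index, n):
        """Each node keeps postings only for the documents assigned to it
        (assignment rule assumed here: doc_id mod n)."""
        nodes = [dict() for _ in range(n)]
        for term, postings in index.items():
            for doc_id in postings:
                nodes[doc_id % n].setdefault(term, []).append(doc_id)
        return nodes

    def term_partition(index, n):
        """Each node keeps the complete posting lists of the terms assigned to it
        (assignment rule assumed here: hash(term) mod n)."""
        nodes = [dict() for _ in range(n)]
        for term, postings in index.items():
            nodes[hash(term) % n][term] = list(postings)
        return nodes

    print(document_partition(index, 3))
    print(term_partition(index, 3))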

2.2.1 Document partitioning

In a document-partitioned index, each index server is responsible for a subset of the
documents in the collection. Each incoming user query is received by a frontend server, the
receptionist, which forwards it to all n index nodes, waits for them to process the query,
merges the search results received from the index nodes, and sends the final list of results to
the user.

Figure: Document-partitioned query processing with one receptionist and several index
servers. Each incoming query is forwarded to all index servers.
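A minimal sketch of this receptionist pattern follows, assuming each index node object exposes a search(query, k) method that returns (doc_id, score) pairs for its own document partition (that interface is an assumption made for the example):

    import heapq

    def receptionist_search(query, index_nodes, k=10):
        """Forward the query to every index node and merge the partial
        results into a single global top-k list."""
        partial_results = []
        for node in index_nodes:          # in a real engine these calls run in parallel
            partial_results.extend(node.search(query, k))
        # Keep the k results with the highest scores across all partitions.
        return heapq.nlargest(k, partial_results, key=lambda pair: pair[1])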

2.2.2 Term Partitioning

In a term-partitioned index, each index server holds the complete posting lists for a subset of
the terms in the collection. An incoming query is therefore seen only by the nodes that are
responsible for at least one of its terms: the receptionist passes the query, together with any
intermediate results accumulated so far, from index server to index server, each node
contributing the posting lists of its own terms, until the final result list can be assembled and
returned to the user. Because only a few nodes participate in any given query, term partitioning
can reduce per-query communication, but it makes load balancing more difficult, since nodes
that hold very frequent terms receive a disproportionate share of the work.

Figure: Term-partitioned query processing with one receptionist and four index servers. Each
incoming query is passed from index server to index server, depending on the terms found in
the query (shown for a query containing three terms).
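Under term partitioning, only the nodes that hold at least one of the query's terms need to see the query. A toy routing sketch (assuming, as in the earlier sketch, that terms are assigned to nodes by hash(term) mod n):

    def route_query(query_terms, n_nodes):
        """Group the query terms by the node responsible for them; only these
        nodes participate in processing the query."""
        plan = {}
        for term in query_terms:
            node = hash(term) % n_nodes
            plan.setdefault(node, []).append(term)
        return plan

    # Example: which of the 4 nodes must handle a three-term query.
    print(route_query(["parallel", "information", "retrieval"], 4))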

2.3 MapReduce
Apart from processing search queries, there are many other data-intensive tasks that need to be
carried out by a large-scale search engine. Such tasks include building and updating the index;
identifying duplicate documents in the corpus; and analyzing the link structure of the document
collection (e.g., PageRank). MapReduce is a framework developed at Google
that is designed for massively parallel computations (thousands of machines) on very large
amounts of data (many terabytes), and that can accomplish all of the tasks listed above. MapReduce
was first presented by Dean and Ghemawat (2004). In addition to a high-level overview of the
framework, their paper includes information about many interesting implementation details and
performance optimizations.

A MapReduce job is highly parallelizable, because both map and reduce can be executed in parallel
on many different machines. Suppose we have a total of n = m + r machines, where m is the number
of map workers and r is the number of reduce workers. The input of the MapReduce is broken into
small pieces called map shards. Each shard typically holds between 16 and 64 MB of data. The
shards are treated independently, and each shard is assigned to one of the m map workers. In a
large MapReduce, it is common to have dozens or hundreds of map shards assigned to each map
worker. A worker usually works on only 1 shard at a time, so all its shards have to be processed
sequentially. However, if a worker has more than 1 CPU, it may improve performance to have it
work on multiple shards in parallel.
In a similar fashion, the output is broken into separate reduce shards, where the number of reduce
shards is often the same as r, the number of reduce workers. Each key/value pair generated by the
map function is sent to one of the r reduce shards. Typically, the shard that a given key/value pair
is sent to depends only on the key. For instance, if we have r reduce shards, the target shard for
each pair could be chosen according to
shard(key, value) = hash(key) mod r,
where hash is an arbitrary hash function. Assigning the map output to different reduce shards in
this manner guarantees that all values for the same key end up in the same reduce shard. Within
each reduce shard, incoming key/value pairs are sorted by their key (this is the shuffle phase), and
are eventually fed into the reduce function to produce the final output of the MapReduce.
The figure at the end of this section illustrates this data flow for three small text
fragments from the Shakespeare corpus. Each fragment represents a separate map shard. The
key/value pairs emitted by the map workers are partitioned onto the three reduce shards based on
the hash of the respective key. For the purpose of the example, we assume hash(“heart”) mod 3 =
0, hash(“soul”) mod 3 = 1, and so forth.
The map phase may overlap with the shuffle phase, and the shuffle phase may overlap with the
reduce phase. However, the map phase may never overlap with the reduce phase. The reason for
this is that the reduce function can only be called after all values for a given key are available.
Since, in general, it is impossible to predict what keys the map workers will emit, the reduce phase
cannot commence before the map phase is finished.
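The following single-process Python sketch illustrates the model described above for a simple term-counting task: map emits (term, 1) pairs, each pair is routed to a reduce shard by hash(key) mod r, pairs inside each reduce shard are sorted by key (the shuffle), and reduce sums the values. It is an illustration of the programming model only, not of Google's distributed implementation.

    from collections import defaultdict

    def map_fn(shard_text):
        """Map: emit a (term, 1) pair for every token in the map shard."""
        for term in shard_text.lower().split():
            yield term, 1

    def reduce_fn(key, values):
        """Reduce: combine all values collected for one key."""
        return key, sum(values)

    def run_mapreduce(map_shards, r):
        # Map phase: each shard is processed independently (in parallel in a real system).
        emitted = [pair for shard in map_shards for pair in map_fn(shard)]
        # Partitioning: shard(key, value) = hash(key) mod r.
        reduce_shards = defaultdict(list)
        for key, value in emitted:
            reduce_shards[hash(key) % r].append((key, value))
        # Shuffle + reduce: sort each reduce shard by key, group, and reduce.
        output = []
        for pairs in reduce_shards.values():
            grouped = defaultdict(list)
            for key, value in sorted(pairs):
                grouped[key].append(value)
            output.extend(reduce_fn(k, v) for k, v in grouped.items())
        return output

    fragments = ["my heart and soul", "heart of gold", "soul of the age"]
    print(run_mapreduce(fragments, r=3))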

Figure: Data flow for the MapReduce example

3. WEB SEARCH

3.1 Introduction

Web search is an integral part of our digital landscape, revolutionizing the way we access and
interact with information on the internet. In an era defined by the vast and intricate web of
interconnected data, understanding the principles and techniques behind web search is crucial. It
encompasses the architecture of the World Wide Web, the dynamics of user queries, the intricacies
of ranking algorithms, and the methods of evaluating search performance. This introduction sets
the stage for an exploration of the multifaceted world of web search, shedding light on the
mechanisms that empower us to navigate and retrieve relevant information from the vast digital
expanse.

3.2 Structure of the Web

The World Wide Web, often simply referred to as the web, is a vast and interconnected network
of digital information, spanning billions of websites, web pages, and resources. At its core, the web
is structured around a decentralized model, where content is organized through the use of Uniform
Resource Locators (URLs) and hyperlinks. Web pages are linked to one another, creating a vast and
intricate web of interconnected information. The web's structure includes various components, such
as web servers, web browsers, and search engines, all working together to enable users to navigate
this information-rich environment.

The web is organized into a hierarchical structure, with the top-level domain names (e.g., .com,
.org, .edu) serving as the highest level of categorization. Beneath these top-level domains are
subdomains, individual websites, and web pages. Search engines play a critical role in helping users
navigate this structure, as they index and rank web pages based on relevance, ensuring that users can
access the information they seek quickly and efficiently. Understanding the structure of the web is
fundamental to effective information retrieval and web navigation, as it provides the foundation for
how users access and interact with digital content.

3.3 Queries and users


Queries and users are at the heart of web search, representing the dynamic interaction between
individuals seeking information and the search engines designed to fulfill those information needs.
Queries are the user-generated search terms or questions entered into a search engine's interface,
acting as the bridge between the user's intent and the vast repository of web content. These queries
vary widely in complexity, ranging from simple keyword searches to more elaborate natural
language questions, reflecting the diverse ways in which users express their information needs.
Understanding user queries is pivotal in search engine design, as it drives the process of information
retrieval and determines how effectively the search engine can match user intent with relevant web
content.

Users, on the other hand, are the individuals or entities employing search engines to access
information on the web. Their needs, preferences, and behaviors play a central role in shaping the
effectiveness of search engines. The study of user behavior and search intent is a key focus of
research in web search, with the aim of providing more personalized and relevant search results.
Search engines aim to understand and anticipate user intent, adapting their algorithms to deliver
more tailored search results. As technology and user expectations continue to evolve, understanding
the dynamic interplay between users and their queries is essential in creating search engines that
provide efficient and satisfactory information retrieval experiences.

3.4 Static and Dynamic Ranking


Static ranking, also known as static relevancy ranking, is a fundamental concept in web search that
focuses on the initial ordering of search results when a user submits a query. This ranking approach
assesses the relevance of web pages based on various factors, such as keyword matching, metadata,
and link analysis. One of the most notable algorithms used in static ranking is the PageRank
algorithm, developed by Larry Page and Sergey Brin, which evaluates the importance and authority
of web pages based on the structure of the web and the links between pages. Other static ranking
factors include on-page content, the presence of keywords in titles and headings, and the quality of
the content. Static ranking determines the order in which web pages appear on a search engine's
results page when a user initiates a query, with the goal of providing a good starting point for users
to find the most relevant information.
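As an illustration of a static, link-based score, the sketch below runs a few power-iteration steps of PageRank over a tiny hand-made link graph (the graph, damping factor, and iteration count are assumptions made for the example, not real web data):

    def pagerank(links, damping=0.85, iterations=50):
        """links: page -> list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
                else:                     # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
            rank = new_rank
        return rank

    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
    print(pagerank(toy_graph))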

Dynamic ranking, on the other hand, focuses on real-time adaptability and personalization of search
results based on user behavior and contextual factors. It acknowledges that user intent may change
during a search session, and it aims to deliver results that are more relevant to the user's evolving
needs. Dynamic ranking algorithms take into account user interactions, such as click-through rates
and dwell time on search results, to adjust the order of results. By considering the user's browsing
history, location, and device, dynamic ranking strives to provide more personalized and contextually
relevant results. This approach aims to enhance the user's search experience by presenting content
that matches their evolving intent and preferences, thus making search engines more responsive to
individual user needs and delivering improved search results over time.
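A toy illustration of this idea, not a production algorithm: re-rank results by blending a static relevance score with the observed click-through rate (the field names and the weight alpha are assumptions made for the sketch):

    def dynamic_rerank(results, alpha=0.7):
        """results: dicts with 'static_score' (0..1), 'clicks' and 'impressions'.
        Blend the static score with the click-through rate, then sort."""
        for r in results:
            ctr = r["clicks"] / r["impressions"] if r["impressions"] else 0.0
            r["combined"] = alpha * r["static_score"] + (1 - alpha) * ctr
        return sorted(results, key=lambda r: r["combined"], reverse=True)

    results = [
        {"url": "a.example", "static_score": 0.9, "clicks": 2, "impressions": 100},
        {"url": "b.example", "static_score": 0.7, "clicks": 60, "impressions": 100},
    ]
    print([r["url"] for r in dynamic_rerank(results)])

In this invented example, the heavily clicked result overtakes the one with the higher static score, which is exactly the behaviour dynamic ranking aims for.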

3.5 Evaluation of web search
Evaluation of web search is a critical aspect of ensuring that search engines provide users with
relevant and high-quality search results. It involves the assessment of how well a search engine
performs in terms of retrieving information that matches user queries. Evaluation is essential for
improving the algorithms and techniques used in web search and ensuring that users receive
satisfactory and accurate search results. Here are key components and methods related to the
evaluation of web search:

1. Evaluation Metrics: Various metrics are used to measure the performance of a search engine.
These metrics include precision, recall, F1 score, mean average precision (MAP), normalized
discounted cumulative gain (nDCG), and click-through rate (CTR). Precision is the fraction of
retrieved results that are relevant, while recall is the fraction of all relevant documents that are
retrieved. The F1 score combines precision and recall into a single measure of performance. MAP
and nDCG assess the quality of the ranking, taking the order of results into account. CTR reflects
user engagement by measuring the percentage of impressions that lead to a click on a search result.
(A small computation of several of these metrics is sketched after this list.)

2. Test Collections: To evaluate search engine performance, test collections are created, consisting
of a set of queries, a relevance judgment (indicating which results are relevant for each query), and
a set of documents. These test collections enable researchers and practitioners to compare different
search engines and retrieval algorithms using standardized benchmarks. Prominent test collections
include TREC (Text Retrieval Conference) and CLEF (Conference and Labs of the Evaluation
Forum).

3. Relevance Judgments: Relevance judgments play a central role in the evaluation process. They
are human assessments of the relevance of search results to specific queries. Human assessors review
search results and rate them as either relevant or non-relevant based on predefined criteria.
Relevance judgments are used to calculate metrics like precision and recall.

4. User Studies: In addition to automated evaluation metrics, user studies are conducted to gather
direct feedback from users. User feedback is valuable in understanding user satisfaction, information
needs, and search behavior. User studies can involve surveys, questionnaires, and usability testing
to gain insights into how well search engines meet user expectations.

5. Online Experiments: Some evaluations are carried out as online A/B tests, where different
algorithms or features are compared in real-time by randomly assigning users to different groups.
Online experiments provide valuable insights into user behavior and the impact of algorithmic
changes.

6. Relevance Models: Relevance models aim to capture the relationship between user queries and
relevant documents. These models may be based on statistical techniques, machine learning, or
natural language processing. By modeling relevance, search engines can improve their retrieval and
ranking algorithms.
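A minimal sketch of how several of these metrics can be computed for a single query, given a ranked result list and a set of relevance judgments (the documents and relevance grades below are invented for the example):

    import math

    def precision_recall_f1(retrieved, relevant):
        hits = sum(1 for d in retrieved if d in relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def ndcg(ranked_gains, k):
        """ranked_gains: graded relevance of each returned result, in rank order."""
        dcg = sum(g / math.log2(i + 2) for i, g in enumerate(ranked_gains[:k]))
        ideal = sorted(ranked_gains, reverse=True)
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
        return dcg / idcg if idcg else 0.0

    retrieved = ["d3", "d7", "d1", "d9"]          # ranked results for one query
    relevant = {"d1", "d3", "d5"}                 # binary relevance judgments
    print(precision_recall_f1(retrieved, relevant))   # (0.5, 0.667, 0.571)
    print(ndcg([3, 0, 2, 1], k=4))                    # graded gains in rank order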

3.6 Web Crawlers
A web crawler, also known as a web spider or web robot, is a program or script designed to
systematically browse the World Wide Web and collect information from web pages. The primary
purpose of a web crawler is to index and catalog web content for search engines, enabling users to
search for and retrieve relevant information efficiently. Web crawlers navigate through web pages
by following hyperlinks, and they typically employ various algorithms and strategies to discover and
index web pages. They play a crucial role in maintaining up-to-date and comprehensive search
engine indexes.
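A minimal breadth-first crawler sketch using only the Python standard library (the seed URL, page limit, and delay are illustrative assumptions; a real crawler must also honour robots.txt, handle non-HTML content, and distribute its work):

    import time
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10, delay=1.0):
        frontier, seen, fetched = deque([seed_url]), {seed_url}, []
        while frontier and len(fetched) < max_pages:
            url = frontier.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue                          # skip unreachable or non-text pages
            fetched.append(url)
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)     # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            time.sleep(delay)                     # politeness delay between requests
        return fetched

    # Example call (placeholder seed URL): crawl("https://example.com/", max_pages=5)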

3.7 Web Crawler Libraries


Web crawler libraries are pre-built tools or frameworks that provide a set of functions and
methods for creating and managing web crawlers. These libraries streamline the development of
web crawling applications by offering features such as HTTP request handling, URL parsing, and
link extraction. They also help manage the storage of crawled data. Popular web crawler libraries
include Scrapy, Beautiful Soup, and Apache Nutch. These libraries are particularly useful for
developers looking to build customized web crawlers or data scraping applications efficiently.

3.8 Python Scrapy


Scrapy is an open-source and powerful web crawling and web scraping framework specifically
designed for Python. It offers a wide range of features and tools for developers to create web crawlers
and data extraction tasks. Scrapy provides a framework for defining how to follow links, handle
requests and responses, parse HTML content, and store the crawled data. It also includes an
interactive shell for testing and debugging, support for handling user sessions and cookies, and
asynchronous processing, which allows it to retrieve data from multiple websites concurrently.
Scrapy is well-documented and widely used for a variety of web crawling and scraping applications.
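A minimal Scrapy spider sketch; the seed URL and the extracted fields are placeholders, while the calls shown (scrapy.Spider, start_urls, parse(), response.css(), response.follow()) are the framework's standard interface:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example_spider"
        start_urls = ["https://example.com/"]     # placeholder seed URL

        def parse(self, response):
            # Extract the page title and all outgoing link targets as one item.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
                "links": response.css("a::attr(href)").getall(),
            }
            # Follow links so that Scrapy crawls further pages (requests are de-duplicated).
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Such a spider can be run from the command line, for example with scrapy runspider spider.py -o items.json, which writes the collected items to a JSON file.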

3.9 Beautiful Soup


Beautiful Soup is a Python library that specializes in parsing and navigating HTML and XML
documents. While it is not a web crawling framework like Scrapy, Beautiful Soup complements web
crawlers by providing a powerful tool for parsing and extracting data from web pages. It simplifies
the process of navigating the Document Object Model (DOM) of web pages, searching for specific
elements, and extracting content. Beautiful Soup is often used in conjunction with other libraries or
tools to process and extract structured data from web pages. It is particularly useful for data
extraction, web scraping, and web page parsing tasks.
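A short sketch of typical Beautiful Soup usage; the HTML snippet is invented for the example, while the bs4 calls shown are the library's standard ones:

    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1>Sample page</h1>
      <a href="/page1">First link</a>
      <a href="/page2">Second link</a>
      <p class="summary">A short paragraph of text.</p>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    print(soup.h1.get_text())                            # heading text
    print([a["href"] for a in soup.find_all("a")])       # all link targets
    print(soup.find("p", class_="summary").get_text())   # element by tag and class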

4. CONCLUSION

This project has undertaken a comprehensive exploration of two critical domains in the field of
information retrieval and web search: Parallel Information Retrieval and Web Search. These areas
are fundamental in our digital age, where vast amounts of data and information are continuously
generated and accessed. In the realm of Parallel Information Retrieval, we delved into the importance
of parallel and distributed computing techniques to efficiently extract information from large and
complex datasets. Understanding the principles of parallel query processing, distributed databases,
and query optimization is crucial for the development of high-performance information retrieval
systems. The integration of the MapReduce framework further enhances the scalability and
efficiency of these systems, making them capable of handling massive volumes of data.

Web Search, on the other hand, forms the backbone of how we access and interact with digital
information on the World Wide Web. We examined the architecture of the web, user queries, and
the ranking algorithms that determine how search results are presented to users. The concepts of
static and dynamic ranking, along with user personalization and feedback, have been explored in
depth. The project also addressed the critical topic of evaluation in web search, highlighting the
various metrics, test collections, and relevance judgments used to assess the performance of search
engines. Additionally, we explored the significance of web crawlers, web crawler libraries like
Scrapy, and parsing tools like Beautiful Soup in the data retrieval process.
