Introduction To IR 2021


By Abdo Ababor

2021
Outline
• Define Information Retrieval Systems
• Examples of IR systems
• General Goal of Information Retrieval
• Info Retrieval vs. Data Retrieval
• Why is IR so hard?
• Logical View of Documents
• Structure of an IR System (IR black box)

2
Information Hierarchy
From the bottom of the pyramid (raw) to the top (more refined and abstract):
• Data: the raw material of information
• Information: data organized and presented in a particular manner
• Knowledge: justified true belief; information that can be acted upon
• Wisdom: distilled and integrated knowledge; demonstrative of high-level “understanding”

3
What types of information?
• Text (documents)
• XML and structured documents
• Images
• Audio
• Video
• Source code
• Applications/Web services

4
The 3 most used terms in this course
• Information
  - News or facts about something
  - Things told, items of knowledge, news
  - Knowledge communicated or received concerning a particular fact or circumstance; news
• Retrieval
  - “Fetch something” that has been stored
  - Search through stored messages to find some messages relevant to the task at hand
• Systems
  - How do computer systems fit into the human information-seeking process?
5
Information Retrieval
• Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• Information is organized into (a large number of) documents.
• Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.

6
What is Information Retrieval?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999):
  “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”
• The definition incorporates all the important features of a good information retrieval system:
  - Representation
  - Storage
  - Organization
  - Access
• The focus is mainly on the user's information need


7
Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords.
• Multimedia (IBM's QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
• Cross-language information retrieval
• Music retrieval
8
Statistical point of view
• Roughly half of the world's population, or 3.8 billion people, use the internet every day.
• By the end of 2010, the number of internet users had reached 1.9 billion.
Query: How many Google searches were made per day on average in 2021?
Details: Google handles over 2 trillion search requests per year; Internet Live Stats puts the figure at around 5.5 billion searches per day (5.5 billion × 365 ≈ 2 trillion), or over 63,000 search queries per second.

9
Motivation
• Nowadays businesses, organizations, and individuals all around the world actively use websites on a daily basis.
  - Webpages
  - Website
  - Search engine
• There are more than 1 billion websites (1,864,860,691 websites online right now)
  - More than 85% are not active (e.g. have a similar function)
  - Only about 10%-15% are active
• Google has indexed trillions of webpages
• Google is currently the most visited website (2021)
10
General Goal of Information Retrieval
• To help users find useful information based on their information needs (with a minimum of effort) despite:
  - Increasing complexity of information
  - Changing needs of users
• To provide immediate random access to the document collection.
• Retrieval systems such as Google and Yahoo are developed with this aim.
• The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible.

11
Info Retrieval vs. Data Retrieval
The emphasis of IR is on the retrieval of information, rather than on the retrieval of data.
• Data retrieval
  - Consists mainly of determining which documents contain a set of keywords in the user query
  - Aims at retrieving all objects that satisfy well-defined semantics
  - A single erroneous object among a thousand retrieved objects implies failure
  - Mainly designed for structured databases
• Information retrieval
  - Is concerned with retrieving information about a subject or topic rather than retrieving data which satisfies a given query
  - Semantics are frequently loose: the retrieved objects might be inaccurate
  - Small errors are tolerated
12
12
Info Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database

                    | Data Retrieval                       | Info Retrieval
Data organization   | Structured                           | Unstructured
Fields              | Clear semantics (ID, Name, age)      | No fields (other than text, images, etc.)
Query language      | Artificial (defined, SQL)            | Free text (“natural language”), Boolean
Matching            | Exact (results are always “correct”) | Partial match, best match
Query specification | Complete                             | Incomplete
Items wanted        | Matching                             | Relevant
Accuracy            | 100%                                 | < 50%
Error response      | Sensitive                            | Insensitive
13
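The "Matching" row of the comparison can be illustrated with a small sketch. The records, documents, and function names below are hypothetical, and the word-overlap score is only a stand-in for a real best-match function.

```python
# Hypothetical toy data, used only for illustration.
RECORDS = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]
DOCS = {
    "d1": "information retrieval finds relevant documents",
    "d2": "a relational database stores structured data",
    "d3": "search engines rank documents by relevance",
}


def data_retrieval(records, field, value):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r for r in records if r[field] == value]


def info_retrieval(docs, query):
    """Best match: score each document by how many query words it shares."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Calling `info_retrieval(DOCS, "relevant documents about search")` returns partially matching documents in ranked order even though no document contains every query word, whereas `data_retrieval` returns only records that satisfy the condition exactly.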
IR System [Search Engines]
• A search engine is a system built around a huge database of internet resources such as web pages, programs, images, etc.
• A search engine performs 3 main tasks:
1. Crawling
2. Indexing
3. Ranking

14
Search Engines: Crawling
• Crawling: gathering information (webpages)
• A web crawler (also called a web spider or web bot) is software for downloading pages from the World Wide Web for the purpose of indexing.
  - E.g. Scrapy, HTTrack, etc.
• It copies information such as content, headers, page titles, keywords, images, links, etc.
• Crawling the entire Web would take a very large amount of time.
• Search engines typically cover only a part of the Web, not all of it.

15
Search Engines: Crawling
• Crawling is done by multiple processes on multiple machines, running in parallel
  - The set of links to be crawled is stored in a database
  - New links found in crawled pages are added to this set, to be crawled later
• What is the main problem of focused crawling?
  - Predicting the relevance of a page before downloading it
• What are the basic rules for Web crawler operation?
  - It must obey the robots exclusion protocol (robots.txt)
  - It must keep bandwidth usage low on any given Web site.
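The set of links to be crawled and the robots.txt rule can be sketched in a single process as follows. The in-memory `FAKE_WEB`, the `ROBOTS_TXT` text, and the `demo-bot` user agent are assumptions made for illustration; no network access takes place, and the bandwidth (politeness) rule is not modelled.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/private/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": [],
    "http://example.org/private/x": ["http://example.org/secret"],
}

# Hypothetical robots.txt content for the site above.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def crawl(seed, max_pages=10):
    """Breadth-first crawl that skips URLs disallowed by robots.txt."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    frontier = deque([seed])   # links still to be crawled
    seen = {seed}              # links already queued, to avoid repeats
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if not robots.can_fetch("demo-bot", url):
            continue           # obey the robots exclusion protocol
        fetched.append(url)
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

A production crawler would replace the dictionary lookup with real HTTP requests, keep the frontier in a shared database, and add a delay between requests to the same site.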

16
Search Engines: Indexing
• Indexing is the process of collecting, parsing, and storing data for use by the search engine. It creates a new copy of the index instead of modifying the old index
  - The old index is used to answer queries
  - After a crawl is “completed”, the new index becomes the “old” index and is stored in the database.
• An index is a critical data structure
  - It allows fast searching over large volumes of data
• The indexing process runs on multiple machines
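A minimal sketch of the "build a new copy, then swap" idea, assuming a single-process service and a simple word-to-pages inverted index. The class and method names are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page ids containing it (an inverted index)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return dict(index)


class SearchService:
    """Answers queries from the live index while a new one is built separately."""

    def __init__(self, pages):
        self.live_index = build_index(pages)

    def lookup(self, word):
        # A dictionary lookup avoids scanning every page for every query.
        return sorted(self.live_index.get(word.lower(), set()))

    def finish_crawl(self, new_pages):
        new_index = build_index(new_pages)  # built without touching the live index
        self.live_index = new_index         # one reference swap makes it live
```

Queries that arrive while `build_index` is running still see the old index; only the final assignment changes what `lookup` reads.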


17
Search Engines: Ranking
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text.
• Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated.
• The query is then processed to obtain the retrieved documents.
• Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.
• Multiple machines are used to answer queries, for load balancing.

18
Why is IR so hard?
• Traditional information retrieval (IR) systems attempt to find relevant documents to respond to a user’s request.
• Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
• The real problem boils down to matching the language of the query to the language of the document.
• Simply matching on words is a very brittle (inelastic) approach. One word can have different meanings. Consider “take”:
  - “take a place at the table”
  - “take money to the bank”
  - “take a picture”
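The brittleness of word matching can be seen in a few lines. The sentence list below reuses the three examples above and adds one hypothetical sentence ("she took a photograph") to show the opposite failure.

```python
SENTENCES = [
    "take a place at the table",
    "take money to the bank",
    "take a picture",
    "she took a photograph",   # hypothetical extra sentence, same idea as "take a picture"
]


def word_match(query_word, sentences):
    """Return every sentence that contains the exact query word."""
    return [s for s in sentences if query_word in s.split()]
```

`word_match("take", SENTENCES)` returns all three senses of "take" without telling them apart, while `word_match("photograph", SENTENCES)` misses "take a picture" although it expresses the same concept.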
19
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents
The User Task:
There are two user tasks: retrieval and browsing.
[Figure: the USER interacts with the system through two tasks, Retrieval and Browsing]

20
The User Task (1): Retrieval
• Retrieval is the process of retrieving information whereby the main objective is clearly defined from the onset of the searching process.
• The user of a retrieval system has to translate his information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English-language statement:
  - I want a book by J. K. Rowling titled The Chamber of Secrets
21
The User Task (2): Browsing
• Browsing is the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since his interest may lie simply in glancing around.
22
Logical View of Documents
• Documents in a collection are frequently represented by a set of index terms or keywords
  - Such keywords are mostly extracted directly from the text of the document
  - These representative keywords provide a logical view of the document
[Figure: Docs → Tokenization → Stop words → Stemming → Indexing; the logical view moves from full text to index terms]
• Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms
23
Logical view of documents
• If full text is used:
  - Each word in the text is a keyword
  - This is the most complex form
  - It is expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
• Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms
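The text operations described above can be sketched as a small pipeline. The stop-word list is a tiny illustrative sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer; both are assumptions, not part of the course material.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text -> tokens -> non-stop-words -> stems -> sorted set of index terms."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({naive_stem(t) for t in tokens})
```

For the input "Indexing reduces documents to indexed terms", six words shrink to four index terms, because "indexing" and "indexed" collapse to one stem and "to" is dropped. Stems such as "reduc" need not be dictionary words.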

24
Structure of an IR System
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User ↔ Black box ↔ Documents]
The black box is the information retrieval system.


25
Structure of an IR System:
The IR Black Box
[Figure: Query and Documents → Black box → Hits]

26
Structure of an IR System:
Inside The IR Black Box
[Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Query Representation and Index → Comparison Function → Hits]
27
Structure of an IR System:
The Central Problem in IR
[Figure: Information Seeker → Concepts → Query Terms; Authors → Concepts → Document Terms]
Do these represent the same concepts?


28
Why is IR hard? Because language is hard!
Building the IR Black Box
• Text representation [chapter 2]
  - How do we capture the meaning of documents?
  - Is meaning just the sum of all terms?
  - What makes a “good” representation?
  - How is a representation generated from text?
  - What are retrievable objects and how are they organized?
• Indexing [chapter 3]
  - How do we actually store all those words?
  - How do we access indexed terms quickly?


Building the IR Black Box
• Different models of information retrieval [chapter 4]
  - Boolean model
  - Vector space model
  - Language models
• Information needs representation [chapter 7]
  - What is an appropriate query language?
  - How can interactive query formulation and refinement be supported?
• Evaluating effectiveness of retrieval [chapter 5]
  - What are good metrics/measurements?
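Of the retrieval models listed on this slide, the Boolean model is the simplest to sketch: each term maps to a set of documents, and the operators AND, OR, and NOT become set operations. The index contents and function names below are hypothetical.

```python
# Hypothetical inverted index: term -> set of document ids containing it.
INDEX = {
    "information": {1, 2, 4},
    "retrieval": {1, 4},
    "database": {2, 3},
}
ALL_DOCS = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term (empty set for unknown terms)."""
    return INDEX.get(term, set())


def boolean_and(term_a, term_b):
    return postings(term_a) & postings(term_b)


def boolean_or(term_a, term_b):
    return postings(term_a) | postings(term_b)


def boolean_not(term):
    return ALL_DOCS - postings(term)
```

A query such as "information AND NOT database" is then `postings("information") & boolean_not("database")`. The result is an unranked set: a document either matches or it does not, which is the main limitation that the vector space and language models address.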

30
Building the IR Black Box
• Relevance Feedback [chapter 6]
  - How do humans (and machines) modify queries based on retrieved results?
  - The user examines the set of ranked documents in search of useful information, then modifies the query
  - Hopefully, this modified query is a better representation of the real user need.
• User Interaction
  - How do we present search results to users in an effective manner?
  - What tools can systems provide to aid the user in information seeking?
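One simple way a system could modify a query from feedback is to add the most frequent words of the documents the user marked as relevant. The slides do not prescribe a method; this term-frequency expansion is an assumption chosen for brevity, and `expand_query` is a hypothetical name.

```python
from collections import Counter


def expand_query(query_terms, relevant_docs, extra_terms=2):
    """Append the most frequent new words from user-approved documents."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(w for w in text.lower().split() if w not in query_terms)
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]
```

For example, `expand_query(["car"], ["car racing track", "formula racing car track"])` yields a query extended with "racing" and "track", which may represent the user's real need better than "car" alone.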
31
Focus in IR System Design
Our focus during IR system design is:
• Improving the effectiveness of the system
  - Effectiveness is measured in terms of precision, recall, etc.
• Improving the efficiency of the system
  - The concern here is storage space usage, access time, searching time, data transfer time, etc.
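The two effectiveness measures named above have standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a system returns d1 to d4 and the relevant documents are d1, d3, and d5, two of the four results are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).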

32
Structure of an IR System
• To be effective in its attempt to satisfy the information needs of users, the IR system must ‘interpret’ the contents of documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the centre of IR

33
Structure of an IR System
Typical IR Task
• Given:
  - A corpus of textual natural-language documents.
  - A user query in the form of a textual string.
• Find:
  - A ranked set of documents that are relevant to the query.
[Figure: a document corpus and a query string go in; ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...) come out]
34
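Detail view of the Retrieval Process
[Figure: retrieval process. Components: User Interface, Text Operations, Query Language & Operations, Indexing, DB Manager Module, Searching, Index (inverted file), Text Database, Ranking. The user need and the text both pass through Text Operations to produce a logical view; the query, refined by user feedback, is run by Searching against the Index; the retrieved docs pass through Ranking and return to the user as ranked docs.]
35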
The 2 subsystems of an IR system
• The two subsystems of an IR system:
  - Searching: an online process of finding relevant documents in the index list as per the user's query
  - Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
  - You cannot search what was not first indexed
  - Documents or objects are indexed in order to be searchable
• To index, one needs an indexing language
  - There are many indexing languages
  - Even taking every word in a document is an indexing language
37
Indexing Subsystem
[Figure: Documents → Assign document identifier (document IDs) → text → Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → terms with weights → Index]
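The indexing pipeline in the figure can be sketched end to end. This version assumes TF-IDF as the term weighting scheme (the slide only says "term weighting"), uses a tiny illustrative stop list, and omits stemming to stay short; all names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

STOP_LIST = {"a", "the", "of", "is", "and", "to", "in"}  # illustrative only


def terms_of(text):
    """Tokenize and drop stop-list words (stemming omitted in this sketch)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_LIST]


def build_weighted_index(documents):
    """documents: list of strings. Returns {term: {doc_id: tf-idf weight}}."""
    # Assign document identifiers 1, 2, 3, ... and count terms per document.
    doc_terms = {
        doc_id: Counter(terms_of(text))
        for doc_id, text in enumerate(documents, start=1)
    }
    n_docs = len(doc_terms)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in doc_terms.values():
        doc_freq.update(counts.keys())

    index = defaultdict(dict)
    for doc_id, counts in doc_terms.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)
```

With this weighting, a term that occurs in every document gets weight 0, reflecting that it does not help distinguish one document from another.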

38
Searching Subsystem
[Figure: query → parse query → query tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → query terms → Similarity Measure (against index terms from the Index) → relevant document set → ranking → ranked document set]
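The searching pipeline can be sketched in the same spirit: the query goes through the same text operations as the documents, a similarity measure compares the two representations, and the results are ranked. Cosine similarity over raw term counts is assumed here as the similarity measure; the slide does not name one, and all identifiers are hypothetical.

```python
import math
import re
from collections import Counter

STOP_LIST = {"a", "the", "of", "is", "and", "to", "in"}  # illustrative only


def to_vector(text):
    """Tokenize, drop stop-list words, and count the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_LIST)


def cosine(vec_a, vec_b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, documents):
    """documents: {doc_id: text}. Returns matching doc ids, best match first."""
    query_vec = to_vector(query)
    scored = [
        (cosine(query_vec, to_vector(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda p: (-p[0], p[1]))
    return [doc_id for _, doc_id in ranked]
```

A real searching subsystem would look up query terms in the prebuilt index rather than re-vectorizing every document per query; the loop over all documents here only keeps the sketch self-contained.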

39
Check your knowledge
1. What are the two user tasks in IR?
2. What are the two subsystems of an IR system? Describe them.
3. Define the logical view of a document and list the steps that produce it.
4. What are the four terms that define information retrieval?
5. Describe the steps in the overview of the retrieval process.
6. What is the IR black box?
7. What is the goal of IR?
8. Why is IR so hard?

40
