Introduction To IR 2021
Outline
Define Information Retrieval Systems
Examples of IR systems
General Goal of Information Retrieval
Info Retrieval vs. Data Retrieval
Why is IR so hard?
Logical View of Documents
Structure of an IR System (IR black box)
Information Hierarchy
More refined and abstract
Distilled and integrated knowledge
Demonstrative of high-level “understanding”
What types of information?
Text (documents)
XML and structured documents
Images
Audio
Video
Source code
Applications/Web services
The three most-used terms in this
course
Information
news or facts about something
thing told, items of knowledge, news
What is Information Retrieval?
A good formal definition of information retrieval is given in
Baeza-Yates & Ribeiro-Neto (1999)
“Information retrieval deals with representation, storage,
organization of, and access to information items. The organization
and access of information items should provide the user with easy
access to the information in which he is interested”
The definition incorporates all important features of a good
information retrieval system
Representation
Storage
Organization
Access
Search by keywords.
Motivation
Nowadays businesses, organizations,
and individuals all around the world
actively use websites on a daily basis.
Webpages
Website
Search engine
More than 1 billion websites
More than 85% are not active (e.g. have a
similar function)
Only about 10%-15% are active
Google has indexed trillions of webpages
1,864,860,691 websites online right now
Google is currently the most visited website (2021)
General Goal of Information Retrieval
To help users find useful information based on their
information needs (with a minimum effort) despite:
Increasing complexity of Information
Info Retrieval vs. Data Retrieval
Emphasis of IR is on the retrieval of information, rather than
on the retrieval of data
Data retrieval
Consists mainly of determining which documents contain a set of
keywords in the user query
Aims at retrieving all objects that satisfy well-defined semantics
A single erroneous object among a thousand retrieved objects implies
failure
Mainly designed for structured databases
Information retrieval
Is concerned with retrieving information about a subject or topic rather than
retrieving data that satisfies a given query
Semantics are frequently loose: the retrieved objects might be inaccurate
Small errors are tolerated
Info Retrieval vs. Data Retrieval
An example of a data retrieval system is a relational database

                     Data Retrieval                         Info Retrieval
Data organization    Structured                             Unstructured
Fields               Clear semantics (ID, name, age, ...)   No fields (other than text, images, etc.)
Query language       Artificial (defined, e.g. SQL)         Free text ("natural language"), Boolean
Matching             Exact (results are always "correct")   Partial match, best match
Query specification  Complete                               Incomplete
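The "Matching" row can be made concrete with a small sketch. This is only an illustration under simple assumptions: documents are plain strings, terms are whitespace-separated words, and the function names exact_match and best_match are hypothetical, not part of any particular system.

```python
def exact_match(query_terms, docs):
    """Data-retrieval style: return ids of docs that contain EVERY query term."""
    wanted = set(query_terms)
    return [i for i, text in enumerate(docs)
            if wanted <= set(text.lower().split())]


def best_match(query_terms, docs):
    """IR style: rank docs by how many query terms they share (partial match)."""
    wanted = set(query_terms)
    scored = []
    for i, text in enumerate(docs):
        overlap = len(wanted & set(text.lower().split()))
        if overlap:  # keep only docs sharing at least one term
            scored.append((overlap, i))
    # Highest overlap first; ties broken by document id.
    return [i for overlap, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = [
    "information retrieval systems",
    "data retrieval in relational databases",
    "image search",
]
query = ["information", "retrieval"]
print(exact_match(query, docs))  # only the doc containing both terms
print(best_match(query, docs))   # partial matches are ranked, not rejected
```

Exact matching either accepts or rejects a document, while best matching tolerates an incomplete match and orders the results instead.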
Search Engines: Crawling
Crawling: gathering information (webpages)
A web crawler (also called a web spider or web bot) is software for
downloading pages from the World Wide Web for the
purpose of indexing.
E.g. Scrapy, HTTrack, etc.
Search Engines: Crawling
Crawling is done by multiple processes on multiple machines,
running in parallel
The set of links to be crawled is stored in a database;
links found along the way are added to it and crawled later
What is the main problem of focused crawling?
To predict the relevance of a page before downloading the page
What are the basic rules for web crawler operation?
It must obey the robots exclusion protocol (robots.txt)
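A minimal single-process sketch of the ideas above: a queue of links still to be crawled, plus a robots.txt check before each fetch. To stay runnable without network access it assumes a hypothetical in-memory link graph (FAKE_WEB) and a hard-coded robots.txt; a real crawler would download both, and would run many such workers in parallel.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a",
                             "https://example.com/private/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
    "https://example.com/private/x": ["https://example.com/secret"],
}

# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


def crawl(seed, agent="example-crawler"):
    """Breadth-first crawl that skips URLs disallowed by robots.txt."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)

    frontier = deque([seed])  # links still to be crawled
    seen = {seed}             # links already queued, to avoid repeats
    fetched = []
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(agent, url):
            continue  # obey the robots exclusion protocol
        fetched.append(url)  # stands in for "download the page"
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # crawled later
    return fetched


print(crawl("https://example.com/"))
```

The disallowed page is never fetched, so the links that appear only on it are never discovered.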
Search Engines: Indexing
Indexing is the process of collecting, parsing, and storing data for use by
the search engine. The indexer creates a new copy of the index instead of
modifying the old index
The old index is still used to answer queries while the new copy is built
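One common way to store parsed data for fast lookup is an inverted index, which maps each term to the documents containing it. The slides do not say which structure is used, so the following is a sketch under that assumption; build_index and search are hypothetical names, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing ALL query terms (Boolean AND)."""
    term_postings = [set(index.get(t, [])) for t in query.lower().split()]
    if not term_postings:
        return []
    return sorted(set.intersection(*term_postings))


docs = {
    1: "web crawler downloads pages",
    2: "search engine indexes pages",
    3: "web search engine",
}
index = build_index(docs)
print(index["pages"])
print(search(index, "search engine"))
```

Because build_index returns a fresh dictionary, a new index can be built from a new crawl while the previous one keeps serving queries, then swapped in.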
Why is IR so hard?
A traditional information retrieval (IR) system attempts to find
documents relevant to a user’s request.
Information retrieval problem: locating relevant
documents based on user input, such as keywords or example
documents
The real problem boils down to matching the language of
the query to the language of the document.
Simply matching on words is a very brittle (no elasticity)
approach. One word can have different semantic
meanings. Consider the word “take”:
“take a place at the table”
“take a picture”
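The brittleness can be seen in a tiny sketch of pure word matching. The function word_match is a hypothetical helper, and the second query word is an added illustration: matching on "take" cannot tell the two senses apart, and a query using a different word for the same idea matches nothing.

```python
def word_match(query, docs):
    """Return the docs that share at least one word with the query."""
    query_words = set(query.lower().split())
    return [d for d in docs if query_words & set(d.lower().split())]


DOCS = ["take a place at the table", "take a picture"]

# Both senses of "take" are returned; word matching cannot separate them.
print(word_match("take", DOCS))

# A related word with no literal overlap retrieves nothing.
print(word_match("photograph", DOCS))
```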
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents
The User Task:
Two user tasks: retrieval and browsing
Retrieval
Browsing
USER
The User Task (1): Retrieval
It is the process of retrieving information whereby the
main objective is clearly defined from the onset of
the search process.
The user of a retrieval system has to translate his
information need into a query in the language provided by
the system.
In this context (i.e. by specifying a set of words), the user
searches for useful information by executing a retrieval task.
English language statement:
I want a book by J. K. Rowling titled The Chamber of
Secrets
The User Task (2): Browsing
It is the process of retrieving information, whereby the
main objective is not clearly defined from the beginning and
whose purpose might change during the interaction with the
system.
E.g. a user might search for documents about ‘car racing’.
Meanwhile he might find interesting documents about ‘car
manufacturers’. While reading about car manufacturers in
Addis, he might turn his attention to a document providing
‘directions to Addis’, and from this to documents which
cover ‘Tourism in Ethiopia’.
In this context, the user is said to be browsing the collection
rather than searching, since the user may simply be interested in
glancing around
Logical View of Documents
Documents in a collection are frequently represented by a set
of index terms or keywords
Such keywords are mostly extracted directly from the text of the
document
These representative keywords provide a logical view of the
document
Docs -> Tokenization -> Stop-word removal -> Stemming -> Indexing
Expensive
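The text operations that produce a logical view can be sketched as a small pipeline. Everything here is a simplifying assumption: the stop-word list is a tiny illustrative sample, and naive_stem is a crude suffix stripper, not the Porter stemmer or any other standard algorithm.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def logical_view(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(logical_view("The crawlers are indexing pages of the Web."))
# -> ['crawler', 'index', 'page', 'web']
```

The resulting list of index terms is the logical view of the document; each extra text operation shrinks the representation at some cost in processing.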
Structure of an IR System
An Information Retrieval System serves as a bridge between the
world of authors and the world of readers/users.
[Figure: the User and the Documents feed into the IR black box, which returns Hits]
Structure of an IR System:
Inside The IR Black Box
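[Figure: the Query and the Documents each pass through a Representation Function; the document representations are stored in an Index; a Comparison Function matches the query representation against the Index and returns Hits]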
Structure of an IR System:
The Central Problem in IR
Information Seeker Authors
Concepts Concepts
Indexing [chapter 3]
How do we actually store all those words?
Language models
Building the IR Black Box
Relevance Feedback [chapter 6]
How do humans (and machines) modify queries based on retrieved
results?
The user then examines the set of ranked documents in
search of useful information, then modifies the query.
Hopefully, this modified query is a better representation
of the real user need.
User Interaction
How do we present search results to users in an effective
manner?
What tools can systems provide to aid the user in information
seeking?
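The relevance feedback loop described above can be sketched as simple query expansion: terms that occur often in documents the user judged useful are appended to the query. This is one illustrative strategy chosen for brevity, not necessarily the method covered in chapter 6; expand_query and its parameters are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, relevant_docs, extra_terms=2):
    """Append the most frequent new terms from user-approved documents."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Most frequent first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]


# The user searched for "car" and marked two results as useful.
feedback = ["car racing results", "formula racing car"]
print(expand_query(["car"], feedback, extra_terms=1))
```

The expanded query is then run again, giving the system a second chance to match the user's real need.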
Focus in IR System Design
Our focus during IR system design is
on improving the effectiveness of the system
Effectiveness of the system is measured in terms of
Structure of an IR System
To be effective in its attempt to satisfy the information needs
of users, the IR system must ‘interpret’ the contents of
documents in a collection and rank them according to
their degree of relevance to the user query.
Thus the notion of relevance is at the centre of IR.
Structure of an IR System
Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to the query.
[Figure: Document corpus and Query string -> Ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
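The task can be sketched end to end with one possible scoring scheme. The slides do not commit to a ranking method here, so TF-IDF weights with cosine similarity are an assumption made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} vector per document, plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return ids of docs with non-zero similarity, most similar first."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    ordered = sorted(scored, key=lambda p: (-p[0], p[1]))
    return [i for score, i in ordered if score > 0]


corpus = ["car racing news", "car manufacturers in Addis",
          "tourism in Ethiopia"]
print(rank("car racing", corpus))
```

The document sharing both query terms is ranked first, the one sharing only "car" second, and the unrelated document is left out.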
Detail view of the Retrieval Process
[Figure: retrieval process. The user need (through the User Interface) and the text both pass through Text Operations to produce a logical view, which feeds Searching and the Index]
Searching Subsystem
[Figure: query -> parse query -> query tokens -> stop list -> non-stoplist tokens -> stemming & normalize -> stemmed terms -> term weighting -> query terms -> similarity measure (against index terms from the Index) -> relevant document set -> ranking -> ranked document set]
Check your knowledge
1. What are the two user tasks in IR?
2. What are the two subsystems in IR? Describe them.
3. Define the logical view of a document and list the steps used to
produce it.
4. Define the four terms that define information retrieval.
5. Describe the steps in the overview of the retrieval process.
6. What is the IR black box?
7. What is the goal of IR?
8. Why is IR so hard?