What Is Information Retrieval (IR)?
What is information retrieval
• Gathering information from one or more sources based on
an information need, usually expressed as a query
– Major assumption - that the information need can be
specified
– Broad definition of information
• Sources of information
– Other people
– Archived information (libraries, maps, etc.)
– Radio, TV, etc.
– Web
– Nature
Data, information, knowledge
• Data - Facts, observations, or perceptions.
• Information - Subset of data, only including those data that
possess context, relevance, and purpose.
• Knowledge - A more simplistic view considers knowledge as
being at the highest level in a hierarchy with data (at the lowest
level) and information (at the middle level).
How much information is there?
• Soon almost everything will be recorded and indexed
• Most bytes will never be seen by humans
• Data summarization, trend detection, and anomaly detection are key technologies
• See Mike Lesk, "How much information is there": http://www.lesk.com/mlesk/ksg97/ksg.html
• See Lyman & Varian, "How much information": http://www.sims.berkeley.edu/research/projects/how-much-info/
Source: Gray, Microsoft
[Figure: scale of data volumes running from Kilo through Mega, Giga, Tera, Peta, Exa, and Zetta up to Yotta, with markers for a book, a photo, a movie, all books (words), all books (multimedia), and everything recorded]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Ideal Information Retrieval
• The answer should be:
What is relevance?
• One or more answers that fit your need.
How is IR accomplished?
• Ask someone
• Search
– Search for someone to ask
– Search for needed information
– Use a search engine
• Process of IR - queries or questions
Information to be retrieved
• Tacit vs explicit information
– Tacit: in someone’s mind
– Explicit: written down
• Permanent vs Impermanent information
– Conversation
– Documents (in a general sense)
• Text
• Video
• Files
• Pictures
• Data
• Both
• Assumption: it exists!
The information acquisition process
• Know what you want and where it is, then go get it
• Ask information sources questions (queries) as needed,
a manifestation of SEARCH, and let them suggest
(rank) answers
• Have information sent to you on a regular basis
based on some predetermined information need
• Push/pull models (RSS)
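The difference between the pull and push models above can be sketched as follows. This is only an illustration: the item stream, the `pull` function, and the `Subscription` class are hypothetical names introduced here, and matching is reduced to a single keyword test.

```python
# Hypothetical stream of newly published items.
NEW_ITEMS = [
    "conference on information retrieval announced",
    "local weather report",
    "new retrieval benchmark released",
]

def pull(items, query):
    """Pull: the user asks now and gets whatever currently matches."""
    return [item for item in items if query in item.split()]

class Subscription:
    """Push: a stored information need is checked against each new item."""

    def __init__(self, keyword):
        self.keyword = keyword
        self.inbox = []

    def notify(self, item):
        # Deliver the item only if it matches the predetermined need.
        if self.keyword in item.split():
            self.inbox.append(item)

sub = Subscription("retrieval")
for item in NEW_ITEMS:  # the source offers every new item to the subscriber
    sub.notify(item)

print(pull(NEW_ITEMS, "weather"))
print(sub.inbox)
```

In the pull case the user initiates each request; in the push case the information need is stated once and matching items arrive without further queries.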
What is SEARCH?
DEFINITIONS FROM THE WEB
What IR is usually not about
• Not about structured data (databases)
– Why?
– Growth of structured data?
• Retrieval from databases is usually not considered IR
– Database querying assumes that the data is in a
standardized format
– Transforming all information (news articles, web sites)
into a database format is difficult for large data
collections
• INTEGRATED IR
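A small contrast may help show why database querying and IR are treated separately. The sketch below is illustrative only: the `books` table, its rows, and the two article strings are made-up data, and the word-matching line stands in for a real retrieval method.

```python
import sqlite3

# Structured data: a fixed schema lets the query name exact fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Modern Retrieval", 1999), ("Web Search Basics", 2008)],
)
rows = conn.execute("SELECT title FROM books WHERE year > 2000").fetchall()
print(rows)

# Unstructured text: no schema, so we fall back to matching words in free text.
articles = [
    "A new search engine was released in 2008.",
    "Libraries archived maps long before the web.",
]
matches = [a for a in articles if "search" in a.lower().split()]
print(matches)
```

The SQL query gives an exact answer because the year is a typed column. The text match can only guess at relevance from the words present, which is the situation IR systems are built to handle.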
What an IR system should do
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Stay current
• Future list
– Understand the user’s queries
– Understand the user’s need
– Act as an assistant
What is relevance?
• In IR relevance is everything
• Relevant information is information suited to your
information need.
• Dependent on
– User
– Space/time
– Group
– Context
• Examples?
How good is the IR system
Measures of performance based on what the system
returns:
• Relevance
• Coverage
• Recency
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
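One common way to put numbers on the relevance and coverage measures above is precision and recall. The slides do not name these two metrics, so treat the sketch below as an illustrative assumption; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the system returned
    relevant:  document ids judged relevant to the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: what fraction of the returned documents are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents returned, 2 of the 3 relevant ones found.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7])
print(p, r)
```

Precision speaks to how relevant the returned results are, and recall to how much of the relevant material the system covers.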
How IR systems work
Algorithms implemented in software
• Gathering of information
• Storage of information
• Indexing
• Interaction
• Evaluation
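The stages listed above can be sketched in a few lines. This is a minimal illustration rather than a production design: the three documents, the `tokenize` helper, and the scoring by count of matching query terms are all assumptions introduced for the example.

```python
from collections import defaultdict

# Gathering and storage: a tiny in-memory collection (made-up documents).
DOCS = {
    1: "information retrieval finds relevant documents",
    2: "databases store structured data",
    3: "search engines rank documents for a query",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Interaction: rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

INDEX = build_index(DOCS)
print(search(INDEX, "rank documents"))
```

Evaluation would then compare the ranked ids against human relevance judgments for the same query.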
IR is an Iterative Process
[Figure: iterative cycle connecting Goals, Workspace, and Repositories]
[Figure: IR system diagram, built up over several slides. Labels: User's Information Need, text input, Parse, Query, Collections, Pre-process, Index, Rank or Match, Query Reformulation]