Chapter 1: Boolean Retrieval

● The meaning of the term information retrieval can be very broad. As an academic field, IR is
finding material of an unstructured nature that satisfies an information need from within
large collections.
● The term unstructured data refers to data that does not have a clear, semantically overt,
easy-for-a-computer structure.
● IR systems can be distinguished by the scale at which they operate:
○ Web search
○ Personal: search over the files on your own machine, which include a variety of document types.
○ Enterprise, institutional & domain-specific:​ Here documents are typically
stored on centralized file systems.

1.1 An example information retrieval problem

● Grepping through text means doing some kind of linear scan through the documents. It
allows useful wildcard pattern matching through the use of regular expressions. However,
it falls short when we need to:
○ Process large amounts of data quickly.
○ Allow more flexible matching operations, such as NEAR (proximity).
○ Allow ranked retrieval.
● The way to avoid linearly scanning the texts for each query is to index the documents in
advance.
For each term (terms are the indexed units; they are usually words) we record a binary
vector showing which documents contain it. Stacking these vectors for every term in the
collection gives a term-document incidence matrix for the collection.
Once we have the matrix, we can answer Boolean queries by combining the relevant
term vectors bitwise. Ex.

Brutus AND Caesar AND NOT Calpurnia


110100 AND 110111 AND 101111 = 100100
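
A minimal Python sketch of answering this query with incidence vectors; the play names and
0/1 vectors follow the textbook-style Shakespeare example and are only illustrative:

# Boolean retrieval over a term-document incidence matrix.
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

# incidence[term][i] == 1 if the term occurs in docs[i]
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def brutus_and_caesar_and_not_calpurnia():
    # Combine the three incidence vectors bitwise: AND, AND, NOT.
    return [b & c & (1 - p) for b, c, p in
            zip(incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]

answer = brutus_and_caesar_and_not_calpurnia()
print([docs[i] for i, hit in enumerate(answer) if hit])
# -> ['Antony and Cleopatra', 'Hamlet']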

● Documents refer to whatever units we have decided to build a retrieval system over. The
collection is the set of all documents over which we perform retrieval; it is sometimes also
referred to as a corpus.
● Ad hoc retrieval: ​Most standard IR task. Here, we try to provide documents from the
collection that are relevant to an arbitrary user information need, communicated as a
query.
● An ​information need​ is a topic about which the user desires to know more, and is
differentiated from a ​query​, which is what the user conveys to the computer in an
attempt to communicate the information need.
● A document is ​relevant​ if it is one that the user perceives as containing information of
value with respect to their personal information need.
● Effectiveness​ of a system is measured by: ​Precision ​( What fraction of the returned
results are relevant to the information need ?) and ​Recall ​( What fraction of the relevant
documents in the collection were returned by the system? )
● However, for a realistically sized collection such a matrix is far too large and far too
sparse to build in full, which leads to the idea of the inverted index.
○ We keep a dictionary of terms (sometimes also referred to as vocabulary or
lexicon); then, for each term, we have a list that records which documents the
term occurs in. Each item in the list - which records that a term appeared in a
document - is called a posting, and the whole list is a postings list.

1.2 Building an inverted index


● The major steps of making an inverted index are - collect the documents to be indexed,
tokenize the text (turning each text into a list of tokens), linguistic preprocessing, index
the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
● Each doc has a document identifier (docID). The core indexing step is sorting the list of
(term, docID) pairs produced by tokenization so that the terms are alphabetical, i.e. the
vocabulary is alphabetical. Multiple occurrences of a term in the same document are then
merged. The result is finally split into a dictionary and postings.
● Document frequency is the number of documents in the collection in which a term
appears. It is not vital for a Boolean search, but it helps to increase the efficiency of the
search engine at query time, so it is stored in the dictionary. The postings are secondarily
sorted by docID.
● Postings lists are generally stored on disk due to their large size. For an in-memory
postings list, two good alternatives are:
○ Singly linked lists: cheap insertion of new postings.
○ Variable-length arrays (vectors): less storage required and faster traversal
thanks to contiguous memory.
When stored on disk, they are stored as a continuous run of postings without explicit
pointers, so as to minimize the size of the postings list and the number of disk seeks to
read a posting list into memory.
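
A small sketch of the indexing steps above (tokenize, collect (term, docID) pairs, sort, and
split into dictionary and postings); the two sample documents are made up:

from collections import defaultdict

raw_docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
}

def tokenize(text):
    # Crude tokenization: lowercase and split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    index = defaultdict(set)          # term -> set of docIDs
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    # Alphabetical dictionary; postings sorted by docID, with the document
    # frequency stored alongside each term.
    return {term: (len(ids), sorted(ids)) for term, ids in sorted(index.items())}

index = build_inverted_index(raw_docs)
print(index["home"])   # -> (2, [1, 2]): document frequency 2, postings [1, 2]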

1.3 Processing boolean queries


● Query optimization is the process of selecting how to organize the work of answering a
query so that the least total amount of work needs to be done by the system; for an AND
query, a standard heuristic is to process the terms in order of increasing document
frequency (a merge sketch appears after this list).
● PS: Query Optimization Examples
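
A sketch of the standard postings-list merge, plus the query-optimization heuristic of
intersecting the shortest (rarest) lists first; the docIDs roughly follow the textbook's
Brutus/Calpurnia/Caesar example:

def intersect(p1, p2):
    # Merge two docID-sorted postings lists in O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    # Process the rarest list first so intermediate results stay small.
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result

print(intersect_many([[1, 2, 4, 11, 31, 45, 173, 174],   # Brutus
                      [2, 31, 54, 101],                   # Calpurnia
                      [1, 2, 4, 5, 6, 16, 57, 132]]))     # Caesar
# -> [2]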

1.4 The extended Boolean model versus ranked retrieval


● In the ranked retrieval model, users generally use free text queries, that is, they just type
one or more words rather than using a precise language with operators for building up
query expressions.
● A strict Boolean expression over terms with an unordered result set is too limited for many
of the information needs that people have, so these systems implemented extended
Boolean retrieval models by incorporating additional operators such as term proximity
operators. A ​proximity operator​ is a way of specifying that two terms in a query must
occur close to each other in a document, where closeness may be measured by limiting
the allowed number of intervening words or by reference to a structural unit such as a
sentence or paragraph.
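
A toy sketch of the closeness test behind a proximity operator; how the word positions are
obtained and stored is deliberately ignored here, and the numbers are made up:

def near(positions_a, positions_b, k=3):
    # True if some occurrence of term A lies within k word positions of
    # some occurrence of term B in the same document.
    return any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b)

print(near([4, 37, 52], [7, 190], k=3))   # -> True (positions 4 and 7)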

-------------------------------------------------------------------------------------------------------------------------------

2. The term vocabulary and postings lists

2.1 Document delineation and character sequence decoding


● Obtaining the character sequence in a document
○ Text comes in a variety of character encodings and document formats, which need
to be decoded before we can work on the data.
○ The idea that text is a linear sequence of characters is also called into question
by some writing systems, such as Arabic.
● Choosing a document unit ( for indexing )
○ For very large documents, the issue of indexing ​granularity​ arises.
○ If the units get too small, we are likely to miss important passages because terms
were distributed over several mini-documents, while if units are too large we tend
to get spurious matches and the relevant information is hard for the user to find.

2.2 Determining the vocabulary of terms


● Tokenization: ​Given a character sequence and a defined document unit, tokenization is
the task of chopping it up into pieces, called tokens.
○ Token ​is an instance of a sequence of characters in some particular document
that are grouped together as a useful semantic unit for processing.
○ Type ​is the class of all tokens containing the same character sequence.
○ Term​ is a type that is included in the IR system’s dictionary.
○ The set of index terms could be entirely distinct from the tokens, for instance,
they could be semantic identifiers in a taxonomy, but in practice in modern IR
systems they are strongly related to the tokens in the document.
○ We always want to apply the same type of tokenization on the document and the
query.
○ Issues of tokenization are language specific. ​Language identification​ based on
classifiers that use short character subsequences as features is highly effective.
○ Each new language has its own issues. One approach here is to perform ​word
segmentation​ as prior linguistic processing.
● Stop word removal:
○ Some extremely common words, which appear to be of little value in helping select
documents matching a user need, are excluded from the vocabulary entirely. These
words are called stop words.
○ To determine stop words, we sort the terms by collection frequency and then take
the most frequent ones, often hand-filtered into a stop list.
○ For most modern IR systems, the additional cost of including stop words is not
that big - neither in terms of index size nor in terms of query processing time.
● Normalization ( equivalence classing of terms ) => TBDL
● Stemming and Lemmatization: both aim to reduce inflectional and derivationally related
forms of a word to a common base form.
○ Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of achieving this goal correctly most of the time, and often
includes the removal of derivational affixes.
○ Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma. It works well on inflectional changes in the word.
○ Stemming increases recall while harming precision.
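
A toy end-to-end sketch of the steps above - tokenize, build a stop list from collection
frequency, then apply a deliberately crude suffix-stripping "stemmer" (not Porter's
algorithm; the documents and the suffix list are made up):

from collections import Counter

docs = ["the new home has been sold", "the old home was listed",
        "the homes are selling fast"]

tokens = [tok for d in docs for tok in d.lower().split()]

# Stop list: the terms with the highest collection frequency (normally hand-filtered).
collection_freq = Counter(tokens)
stop_words = {term for term, _ in collection_freq.most_common(1)}

def crude_stem(token):
    # Chop off common endings and hope the result is right most of the time.
    for suffix in ("ing", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

terms = [crude_stem(t) for t in tokens if t not in stop_words]
print(stop_words)   # -> {'the'}
print(terms)        # note "homes" -> "hom": crude stemming makes such mistakes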

2.3 Faster postings list intersection via skip pointers


● Skip pointers let us jump from one posting to a posting some distance further down the
list, skipping over entries that cannot appear in the intersection. As a simple heuristic, for
a postings list of length P, use √P evenly spaced skip pointers.
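
A sketch of intersection with skip pointers, here simulated as evenly spaced jumps of
length √P over ordinary Python lists (a real implementation would store the pointers in
the postings list itself):

import math

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Take the skip on p1 only if its target is still <= p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))   # -> [2, 8]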

2.4 Positional postings and phrase queries


● Many queries are phrase queries, where a sequence of words is to be matched as a
phrase rather than as individual words.
● Biword indexes
○ One approach to handling phrases is to treat every pair of consecutive terms in a
document as a phrase (a biword) and make it a term in the dictionary.
○ Related nouns can often be divided from each other by various function words. To
handle this, we first tokenize the text and perform part-of-speech tagging; terms
are then grouped as nouns or function words, and a noun followed by function
words and another noun is treated as an extended biword.
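
A minimal biword-index sketch (plain consecutive word pairs only; the POS-tagging step for
extended biwords is omitted, and the documents are made up):

from collections import defaultdict

def build_biword_index(docs):
    # Index every pair of consecutive tokens as a single dictionary term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index

docs = {1: "friends romans countrymen", 2: "romans and countrymen"}
index = build_biword_index(docs)
print(index["romans countrymen"])   # -> {1}: the phrase occurs only in doc 1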

-------------------------------------------------------------------------------------------------------------------------------

3. Dictionaries and tolerant retrieval

3.1 Search structures for dictionaries


● When given a query, we need to determine whether each query term exists in the
vocabulary; for this we use the dictionary (implemented using a search tree or hashing).
● It should be noted that, unlike hashing, search trees demand that the characters used in
the document collection have a prescribed ordering; in return they support ordered
operations such as prefix lookups. A small sketch contrasting the two follows.
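
A toy comparison of the two dictionary implementations: hashing (a Python dict) for exact
lookups versus a sorted vocabulary searched with binary search, which behaves like a search
tree and also supports ordered operations such as prefix scans (vocabulary and postings are
made up):

import bisect

vocab = sorted(["brutus", "caesar", "calpurnia", "capitol", "cleopatra"])
postings = {"brutus": [1, 2], "caesar": [1, 2, 4], "calpurnia": [2],
            "capitol": [1], "cleopatra": [1]}

def lookup_tree_like(term):
    # Binary search over the sorted vocabulary (search-tree-like behaviour).
    i = bisect.bisect_left(vocab, term)
    return postings[vocab[i]] if i < len(vocab) and vocab[i] == term else None

print(lookup_tree_like("caesar"))      # -> [1, 2, 4]
print(postings.get("caesar"))          # hashing: same answer, O(1) on average

# An ordered scan over all terms starting with "ca" - this needs the ordering
# that search trees give and hashing does not.
lo, hi = bisect.bisect_left(vocab, "ca"), bisect.bisect_left(vocab, "cb")
print(vocab[lo:hi])                    # -> ['caesar', 'calpurnia', 'capitol']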

3.2 Wildcard Queries


● Used when - the user is uncertain of the spelling of a query term, the user is aware of
multiple variants of spelling a term and seeks documents containing any of the variants.
● TBDL

-------------------------------------------------------------------------------------------------------------------------------

4. Index Construction

● The process of building an inverted index for a collection is called index construction or
indexing; the process or machine that performs it is called the indexer.

-------------------------------------------------------------------------------------------------------------------------------

5. Index Compression

● Why compression ?
○ Increased usage of caching is possible.
○ Faster transfer of data from disk to main memory.

5.1 Statistical properties of terms in information retrieval

● The compression techniques we describe in the remainder of this chapter are ​lossless​,
that is, all information is preserved. Better compression ratios can be achieved with
lossy compression (case folding, stemming, stop word removal), which discards some
information.
● Heaps’ law: estimating the number of terms
○ A better way of getting a handle on M (the number of distinct terms, i.e. the
vocabulary size) is Heaps’ law:
M = k * (T^b)
where T is the number of tokens in the collection, and typical parameter values are
30 ≤ k ≤ 100 and b ≈ 0.5 (a small numeric sketch follows this list).
○ The parameter k is quite variable because vocabulary growth depends a lot on
the nature of the collection and how it is processed. Case-folding and stemming
reduce the growth rate of the vocabulary, whereas including numbers and
spelling errors increase it.
● Zipf’s law: modeling the distribution of terms
○ A model of the distribution of terms in a collection is Zipf’s law.
○ It states that, if t1 is the most common term in the collection, t2 is the next most
common, and so on, then the collection frequency cf(i) of the i-th most common
term is proportional to 1/i, i.e. cf(i) * i ≈ constant.
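
A small numeric sketch of both laws; the Heaps parameters k = 44 and b = 0.49 are the values
the book fits to Reuters-RCV1 (collection-dependent, not universal), and the Zipf constant c
is made up:

def heaps_vocabulary_size(num_tokens, k=44, b=0.49):
    # Heaps' law: M = k * T^b, an estimate of the number of distinct terms.
    return k * num_tokens ** b

def zipf_collection_frequency(rank, c=1_000_000):
    # Zipf's law: cf(i) ~ c / i, so cf(i) * i stays roughly constant.
    return c / rank

print(round(heaps_vocabulary_size(1_000_000)))        # ~38,000 distinct terms
for rank in (1, 2, 10, 100):
    print(rank, round(zipf_collection_frequency(rank)))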

-------------------------------------------------------------------------------------------------------------------------------

6. Scoring, term weighting and the vector space model

6.2 Term frequency and weighting


● A free text query views the query as simply a set of words. A plausible scoring
mechanism then is to compute a score that is the sum, over the query terms, of the
match scores between each query term and the document.
● Each term in a document is assigned a ​weight​ for that term, that depends on the
number of occurrences of the term in the document. The simplest approach is to assign
the weight to be equal to the number of occurrences of term t in document d. This
weighting scheme is referred to as ​term frequency.
● For a document d, the set of weights determined by the tf weights may be viewed as a
quantitative digest of the document. This view is often known as the bag of words model,
as the exact ordering of the terms is not considered.
● Inverse document frequency
○ Raw term frequency as above suffers from a critical problem: all terms are
considered equally important when it comes to assessing relevancy on a query.
○ For example, if the collection is about the auto industry, the term ‘auto’ appears
in almost all the documents and hence it plays little role in determining the
relevance of a document.
○ Thus, an immediate idea is to scale down the term weights of terms with high
collection frequency, defined to be the total number of occurrences of a term in
the collection. The idea would be to reduce the tf weight of a term by a factor that
grows with its collection frequency.
○ However, it is more commonplace to use for this purpose the ​document
frequency ​( defined to be the number of documents in the collection that contain
a term t )
idf(t) = log( N / df(t) ) ​where N is the total number of documents in a
collection
● Tf-idf weighting
○ The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idf(t,d) = tf(t,d) × idf(t)
○ Highest when t occurs many times within a small number of docs; lower when the
term occurs fewer times in a doc, or occurs in many docs; lowest when the term
occurs in virtually all documents.
○ We can view each document as a vector with one component corresponding to
each term in the dictionary, together with a wt for each component that is given
by the scheme.
○ Overlap score measure: the score of a document d is the sum, over all query
terms, of the match scores between each query term and d. With tf-idf weights,
Score(q, d) = Σ (t in q) tf-idf(t, d)
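
A short sketch of tf-idf weighting and the summed score above, using raw counts as tf and
log base 10 for idf; the documents and the query are made up:

import math
from collections import Counter

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
N = len(docs)
df = Counter(term for counts in tf.values() for term in counts)

def idf(term):
    # idf(t) = log(N / df(t)); terms appearing in every document get idf 0.
    return math.log10(N / df[term]) if term in df else 0.0

def score(query, doc_id):
    # Score(q, d) = sum over query terms t of tf(t, d) * idf(t)
    return sum(tf[doc_id][t] * idf(t) for t in query.lower().split())

for d in docs:
    print(d, round(score("july sales", d), 3))   # docs 2 and 3 outscore doc 1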

6.3 Vector Space Model


● The representation of a set of docs as vectors in a common vector space is known as
the ​vector space model​.
● Dot Products
○ We denote by V(d) the vector derived from document d, with one component in
the vector for each dictionary term.
○ This representation loses the relative ordering of the terms in the document.
○ How to quantify the similarity between two documents:
■ Compare the magnitude of the vector difference between the two doc
vectors. Drawback: two documents with similar content but very different
lengths will be regarded as dissimilar.
■ Cosine similarity: this compensates for the document-length drawback of
the previous measure. It can be computed as the dot product of the
normalised versions of the two document vectors.
○ Viewing a collection of N docs as a collection of vectors leads to a natural view of
the collection as a term-document matrix.
● Queries as vectors : ​We can also view our query as a vector.
● Computing vector scores ( See later )
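
A minimal cosine-similarity sketch over made-up tf-idf vectors; the second vector is a
scaled copy of the first, so the magnitudes differ but the cosine similarity is still 1:

import math

def cosine_similarity(u, v):
    # Dot product of the two length-normalised vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [1.0, 0.5, 0.0, 2.0]     # V(d1) over a 4-term dictionary
d2 = [2.0, 1.0, 0.0, 4.0]     # "d1 appended to itself": longer, same direction
print(cosine_similarity(d1, d2))   # -> 1.0 (up to floating-point rounding)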

-------------------------------------------------------------------------------------------------------------------------------

8. Evaluation in information retrieval

8.1 Information retrieval system evaluation


● The standard approach to information retrieval system evaluation revolves around the
notion of relevant and nonrelevant documents.
● Relevance is assessed relative to an information need, not a query.
● A document is relevant if it addresses the stated information need, not because it just
happens to contain all the words in the query.
● Many systems have tunable parameters; reporting the best results obtained by tuning
those parameters on the test collection is not a fair measure, because it overfits to that
particular dataset.

8.3 Evaluation of unranked retrieval results


● The two frequent and basic measures used are ​precision ​and ​recall.
● Precision : ​fraction of retrieved docs that are relevant.
● Recall : ​fraction of relevant docs that are retrieved.
● Apart from the above, accuracy is another possible measure, but it is not an appropriate
one for IR, because in almost all circumstances the data is extremely skewed (the vast
majority of documents are nonrelevant).
● An advantage of having two numbers, precision and recall, is that one is often more
important than the other in particular circumstances.
● Recall​ is a ​non-decreasing​ function of the number of documents retrieved. On the other
hand, in a good system, ​precision​ usually ​decreases​ as the number of documents
retrieved is increased.
● A single measure that trades off precision versus recall is the F measure, which is the
weighted harmonic mean of precision and recall.
● But why the harmonic mean? When the values of two numbers differ greatly, the harmonic
mean is closer to their minimum than their arithmetic mean is.
● Ex: we can always get 100% recall by returning every document; with near-zero precision
the arithmetic mean would still be about 50%, even though the system performed badly.
The harmonic mean stays close to zero in this case, which is why the AM
is not a good measure.
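
A small sketch of the three unranked measures for one query; the retrieved and relevant
document sets are made up:

def precision_recall_f(retrieved, relevant, beta=1.0):
    # F is the weighted harmonic mean of precision and recall; beta=1 gives F1.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
    return p, r, f

# 4 documents returned, 6 relevant documents in the collection, 3 in common.
print(precision_recall_f({1, 2, 3, 9}, {2, 3, 9, 14, 25, 31}))
# -> (0.75, 0.5, 0.6)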

8.4 Evaluation of ranked retrieval results


● In a ranked retrieval context, appropriate sets of retrieved documents are naturally given
by the top k retrieved documents.
● Precision-recall curves have a distinctive saw-tooth shape: if the (k + 1) th document
retrieved is nonrelevant then recall is the same as for the top k documents, but precision
has dropped. If it is relevant, then both precision and recall increase, and the curve jags
up and to the right.
● It is often useful to remove these jiggles, and the standard way to do this is with
interpolated precision: the interpolated precision at a certain recall level r is defined
as the highest precision found for any recall level r' >= r:
P_interp(r) = max over r' >= r of p(r')

● But why? Because anyone would be prepared to look at a few more documents if it
would increase the percentage of the viewed set that were relevant i.e. precision of the
larger set is higher.
● The ​traditional way​ of doing this is the ​11-point interpolated average precision. ​Here
for each info need, the interpolated precision is measured at 11 recall levels ​0,0.1,0.2,...,
0.9, 1.0.
● Mean Average Precision (MAP) provides a single-figure measure of quality across
recall levels.
○ For a single information need, Average Precision is the average of the precision
value obtained for the set of top k documents existing after each relevant
document is retrieved, and this value is then averaged over information needs.
○ When a relevant document is not retrieved at all, the precision value in the
above equation is taken to be 0.
○ For a single information need, the average precision approximates the area
under the uninterpolated precision-recall curve, and so the MAP is roughly the
average area under the precision-recall curve for a set of queries.
○ Calculated MAP scores normally vary widely across information needs. This means
that a set of test information needs must be large and diverse enough to be
representative of system effectiveness across different queries.
● Precision at k
○ It has the advantage of not requiring any estimate of the size of the set of
relevant documents, but the disadvantages that it is the least stable of the
commonly used evaluation measures and that it does not average well.
● MRR (Mean Reciprocal Rank):
○ Here, over a set of queries, we take the mean of the reciprocal rank RR(q) = 1/r(q),
where r(q) is the rank of the first relevant document for query q.
● R precision:
○ It requires having a set of known relevant documents Rel, from which we
calculate the precision of the top |Rel| documents returned.
○ A perfect system can score 1 on this metric for every query, whereas with
precision at k (say k = 20) even a perfect system can only achieve a precision of
0.4 if the collection contains just 8 relevant documents.
○ If there are |Rel| relevant documents for a query, we examine the top |Rel| results
of a system, and find that r are relevant, then by definition, not only is the
precision (and hence R-precision) r/|Rel|, but the recall of this result set is also
r/|Rel|.
● Discounted cumulative gain
○ A measure seeing increasing adoption, especially when employed with machine
learning approaches to ranking, is cumulative gain, and in particular normalized
discounted cumulative gain (NDCG).
○ NDCG is designed for situations with non-binary notions of relevance. Like
precision at k, it is evaluated over some number k of top search results.
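
A sketch of the ranked measures above for a single information need (MAP and mean NDCG are
simply the averages of these per-query values over a set of queries); the relevance
judgments are made up, and the DCG formula below is one common variant:

import math

def average_precision(ranked_rel, num_relevant):
    # ranked_rel[i] = 1 if the document at rank i+1 is relevant; relevant
    # documents that are never retrieved contribute 0 via num_relevant.
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0

def reciprocal_rank(ranked_rel):
    # 1 / rank of the first relevant document (0 if none was retrieved).
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(gains, ideal_gains, k):
    # DCG@k = sum of gain_i / log2(i + 1), normalised by the ideal ordering's DCG.
    def dcg(g):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(g[:k], start=1))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

ranking = [1, 0, 1, 0, 0, 1]           # relevance of the top 6 results
print(average_precision(ranking, num_relevant=3))   # (1/1 + 2/3 + 3/6) / 3
print(reciprocal_rank(ranking))                     # -> 1.0
print(ndcg_at_k([3, 0, 2, 0, 0, 1], [3, 2, 1, 0, 0, 0], k=6))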

8.5 Assessing relevance

● Using random combinations of query terms as an information need is generally not a
good idea, because typically they will not resemble the actual distribution of information
needs.
● TO BE CONTINUED.
