Inferring Probability of Relevance Using The Method of Logistic Regression
Fredric C. Gey
UC Data Archive and Technical Assistance, University of California, Berkeley, USA
e-mail: [email protected]

Abstract
This research evaluates a logistic regression model for probabilistic information retrieval. The model utilizes the logarithm of the odds of relevance as its dependent variable and term-frequency/inverse-document-frequency measures as statistical clues, logged to obtain desirable properties. By standardizing the clues (to mean μ = 0 and standard deviation σ = 1), coefficients fitted on one collection may be applied directly to other test collections. The logistic inference model was applied to three test collections and compared to the well-known tfidf/cosine vector space model. In the comparison, the logistic model performed significantly better (in two collections) or equally well (in the third collection). The performance differences were subjected to statistical tests to determine whether the differences were statistically significant or could have occurred by chance.
1. Introduction
We can consider the central task of Information Retrieval to be that of providing intellectual access to the contents or meanings of a collection of texts or documents [1] [2] [3]. The heart of this task is a concise representation of both the queries and the documents in the collection. One well-known model, the vector space model developed by Gerard Salton and associates at Cornell University, represents queries and documents as vectors of index terms drawn from a vocabulary of nontrivial terms of size m. Geometrically, the distance between a query vector and a document vector in the m-dimensional vector space provides a measure of the degree to which query and document treat the same subject. The model is generalized by adding weights to the terms in either the query or document vector. The weight most commonly used combines two attributes of a term tj: the term frequency, the number of times tj occurs in either the query or the document text, and the inverse document frequency, derived from the number of documents of the collection in which tj occurs; both are usually logged. Such term-frequency/inverse-document-frequency (tf.idf) weighting schemes have been used to index a number of retrieval test collections [4].

Performance measurements over several test collections have accounted for which of these weighting schemes perform best and worst [5]. By contrast with such ad hoc weighting processes, probabilistic retrieval attempts to place the computation of a document's retrieval value upon a sound theoretical footing: given a query representing an information need, the probability of relevance of each document to that need is to be computed.
This leads to the so-called probabilistic approach to retrieval. The probability ranking principle [6] states that optimal retrieval will be achieved when documents are returned to the user ranked in order of their probability of relevance. Unfortunately, statistical methods which attempt to predict relevance, such as the Binary Independence model with its linked dependence assumptions [7], have been less successful in retrieval experiments than the principle promises, and inconsistencies in such models have been pointed out by Cooper [8]. Even when half the collection is used as a training set, improved estimation of this probability has been hindered by insufficient evidence [9].
2. The logistic inference model

If no assumptions are made about the character of the probabilities of relevance, then we may allow that a text retrieval system associates a finite set of r attributes (v1, ..., vr) with each query-document-term triple — attributes such as the number of occurrences of the term in the document, or that count divided by the total number of terms in the document (i.e., the relative frequency of the term) — for each term t which exists in both a query qi and a document dj.
By fitting a logistic regression over a sample of query-document-term triples for which relevance is known, a model of the logarithm of the odds of relevance can be obtained. For a term tk which is present in both document dj and query qi, the log odds of relevance is given by the formula:

    log O(R | qi, dj, tk) = c0 + c1 v1 + ... + cr vr

The composite log odds of relevance of document dj to the i-th query qi is then obtained by summing the evidence contributed by each matching term:

    log O(R | qi, dj) = log O(R) + Σ (k = 1 to M) [ log O(R | qi, dj, tk) − log O(R) ]

where M is the number of terms common to the query and the document.
Here O(R), the prior odds, is the known odds that a document chosen at random from the collection will be relevant to query qi. The coefficients of the model are derived by using logistic regression [10], a statistical technique which fits an equation predicting a dichotomous dependent variable from a set of (possibly continuous) independent variables, and whose statistical properties are well understood. Since the log odds of relevance is computed from a simple linear formula, the inverse logistic transformation may be applied to directly obtain the probability of relevance of a document to a query:

    P(R | qi, dj) = 1 / (1 + e^(−log O(R | qi, dj)))

Once the coefficients of the equation have been derived from a random sample of query-document-term triples, they may be used for retrieval within a particular collection (and to predict document relevance in the future). Regression approaches which approximate relevance from multiple clues, such as term frequency and authorship, were first used by Ed Fox [11], who applied linear regression. More recently Fuhr [12] and Fuhr and Buckley [13] have developed polynomial and probabilistic learning approaches. There are two major problems with approximation by ordinary regression: the outcome (dependent) variable is dichotomous rather than continuous, and hence the error distribution is binomial rather than normal, as is usually assumed by ordinary regression. An alternative approach is to use Bayesian inference networks as developed by Turtle [15] [16] and others [14].
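As a concrete illustration, the per-term linear log-odds formula and the inverse logistic transformation can be sketched as follows. The coefficients and attribute values here are hypothetical, not the fitted ones reported later in this paper:

```python
import math

def term_log_odds(coefficients, attributes):
    """Linear log-odds formula: c0 + c1*v1 + ... + cr*vr."""
    c0, rest = coefficients[0], coefficients[1:]
    return c0 + sum(c * v for c, v in zip(rest, attributes))

def probability_of_relevance(log_odds):
    """Inverse logistic transformation: P = 1 / (1 + e^(-log O))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical fitted coefficients (c0, c1, c2) and two clue values for one term.
coeffs = (-3.0, 0.8, 1.2)
attrs = (1.5, 0.9)
z = term_log_odds(coeffs, attrs)
p = probability_of_relevance(z)
```

Because the transformation is monotonic, ranking by log odds and ranking by probability give the same document ordering.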
3. Sampling
One of the problems in using logistic regression is the computational size necessary to compute the logistic coefficients. Even supposing that only half the document collection is used as a training set, the number of query-document-term triples to be used for the regression computation becomes, for all but the smallest collections, astronomical. Naturally, a sampling approach is necessary. Since relevant documents are naturally a very small fraction of the total collection, a simple random sample would contain few relevant ones; to assure adequate representation of relevant documents in the regression sample, one samples, for each query, all of the relevant documents and a much smaller proportion (say, one-tenth to one-thirtieth) of the non-relevant ones, and compensates by applying an adjusting factor in the regression computation.

We fit the model upon the Cranfield collection, which has the following reasonable properties for such experiments: the queries are in natural-language form, so query statistics can be collected as well as document statistics; the documents are a random sample from some larger collection in order to achieve statistical significance or, if this is not possible, a carefully crafted sample; there is a sufficiently large number of queries; documents have both titles and abstracts present (abstracts are missing from some of the CACM collection); and relevance judgments are known for all query-document pairs in the collection (as they cannot be for a larger collection). The other test collections used are the CACM collection (52 queries, 3204 documents) and the CISI collection (76 queries, 1460 documents).
4. Logistic regression

Our particular logistic inference formula uses six elementary statistical clues: the absolute frequency of the term in the query (QAF), the relative frequency of the term in the query (QRF), the absolute frequency of the term in the document (DAF), the relative frequency of the term in the document (DRF), the inverse document frequency (IDF), and the relative frequency of the term in all documents (RFAD). The complete formula for the log odds of relevance, given the presence of term tj, is then:

    Z_tj = log O(R | qi, dj, tj) = c0 + c1 log(QAF) + c2 log(QRF) + c3 log(DAF) + c4 log(DRF) + c5 log(IDF) + c6 log(RFAD)

where IDF is as previously defined. Query relative frequency (QRF) is equal to QAF divided by the total number of terms in the query; document relative frequency (DRF) is equal to DAF divided by the total number of terms in the document; and relative frequency in all documents (RFAD) is the total number of occurrences of the term in all documents of the collection divided by the total number of occurrences of all terms in the collection. All frequencies are logged, both because sums of logs behave better than products of raw frequencies and because logging dampens large values (in comparing documents, a term with 50 occurrences is not fifty times as important as a term with one occurrence).
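Under the definitions above, the six logged clues for a single term of a query-document pair can be computed as in the following sketch. The parameter names are assumptions about how the corpus statistics are supplied, and IDF is taken here in the common form N/n_t:

```python
import math

def term_clues(qaf, query_len, daf, doc_len, n_docs_with_term, n_docs,
               term_occ_in_collection, total_occ_in_collection):
    """Return the logged six-clue vector (QAF, QRF, DAF, DRF, IDF, RFAD)
    for one term of a query-document pair."""
    qrf = qaf / query_len                 # query relative frequency
    drf = daf / doc_len                   # document relative frequency
    idf = n_docs / n_docs_with_term       # inverse document frequency (N / n_t)
    rfad = term_occ_in_collection / total_occ_in_collection
    return [math.log(v) for v in (qaf, qrf, daf, drf, idf, rfad)]

# Illustrative counts for one term in one query-document pair.
clues = term_clues(qaf=2, query_len=10, daf=5, doc_len=200,
                   n_docs_with_term=40, n_docs=1400,
                   term_occ_in_collection=90,
                   total_occ_in_collection=150000)
```

The logged clue vector is what enters the linear formula Z_t above.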
The fit for this extended set of clues is done against a sample of 20,246 query-document-term triples taken from the Cranfield collection. The following coefficients are then derived from this fit:

    Z_t = c0 − 0.20361 log(QAF) + 0.19143 log(QRF) + … + 0.75033 log(RFAD)

The prior probability of relevance for this collection is estimated by dividing the number of query-document pairs which have been assigned a positive relevance judgment by the total number of query-document pairs (for 225 queries times 1400 documents). Hence:

    Prior = 1838 / (225 × 1400) = 0.005835

becomes the prior probability of relevance, and

    logprior = log(Prior / (1 − Prior)) = −5.138

The final formula for ranking is then:

    log O(R | qi, dj) = −5.138 + Σ (j = 1 to M) (Z_tj + 5.138)

where the sum is over the M terms which the query and the document have in common.
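The final ranking formula can be sketched directly: the document score is the log prior odds plus the sum of each matching term's deviation from the prior. The per-term Z values below are illustrative, not the fitted Cranfield values:

```python
def document_log_odds(term_z_values, log_prior=-5.138):
    """log O(R|q,d) = logprior + sum over matching terms of (Z_t - logprior)."""
    return log_prior + sum(z - log_prior for z in term_z_values)

def rank_documents(doc_term_z, log_prior=-5.138):
    """Rank documents by descending composite log odds of relevance."""
    scored = {d: document_log_odds(zs, log_prior) for d, zs in doc_term_z.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Illustrative per-document lists of per-term Z values:
docs = {"d1": [-4.0, -4.5], "d2": [-5.0], "d3": [-3.2, -6.0, -4.9]}
ranking = rank_documents(docs)
```

Note that a term whose Z value exceeds the prior log odds raises the document's score, while a term below the prior lowers it.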
5. Performance evaluation: statistical significance tests

A key part of this paper is to compare the performances of two major retrieval methods: our logistic regression probabilistic model, which ranks documents according to probability of relevance, against the vector space model, which ranks documents according to the cosine of the angle between the query vector and the document vector. Performance evaluation is described in terms of recall and precision: recall is the fraction of the relevant documents which have been retrieved at a certain point in the retrieval, while precision is the fraction of the retrieved documents which are relevant. Precision at standard recall points, averaged over all queries, is the usual summary measure of performance; however, differences in such averages should be subjected to a statistical test to determine whether the difference might have occurred by chance. Another reason to test for statistical significance is that the better performance of one retrieval method over another may depend upon the particular queries posed [17]; had the queries been chosen differently, the randomly chosen queries might have favored the other method. The null hypothesis of such a test would therefore be: there is no difference between the performances of the two retrieval methods (logistic inference versus tfidf/cosine vector space).

In our approach, for each query and each method we compute a unique number, the average precision (the mean of the precision values found at each of the eleven standard levels of recall). Thus, in applying the T-test, the per-query differences in average precision between the two methods are compared, with sample size equal to the number of queries in the test collection. Under the null hypothesis that the methods perform identically, the mean of these differences should be zero; a mean difference large relative to its standard deviation indicates that the methods perform significantly differently. [18] provides a summary of possible statistical tests which might be used to evaluate retrieval experiments.
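The paired test on per-query average precision can be sketched without a statistics library; it returns the t statistic and degrees of freedom (the P value would then come from the t distribution). The precision figures are illustrative:

```python
import math

def paired_t_statistic(precision_a, precision_b):
    """Two-tailed paired t-test statistic on per-query average-precision
    differences; degrees of freedom are n - 1."""
    diffs = [a - b for a, b in zip(precision_a, precision_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                              # standard error of mean
    return mean / se, n - 1

# Illustrative per-query average precision for two methods:
method_a = [0.46, 0.51, 0.38, 0.44, 0.49]
method_b = [0.41, 0.47, 0.36, 0.40, 0.43]
t, df = paired_t_statistic(method_a, method_b)
```

In the experiments below, n is the number of queries in each test collection.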
6. Logistic inference model performance on the Cranfield collection

For the Cranfield collection the logistic inference model compares to the vector space model as follows:

Logistic Inference versus Vector Space Performance
Cranfield Collection: Averages over 225 Queries

    Recall      Logistic    Vector Space
                Precision   Precision
    0.00        0.8330      0.7787
    0.10        0.8116      0.7440
    0.20        0.7129      0.6434
    0.30        0.6021      0.5301
    0.40        0.5161      0.4380
    0.50        0.4503      0.3814
    0.60        0.3698      0.2994
    0.70        0.2859      0.2267
    0.80        0.2280      0.1882
    0.90        0.1640      0.1379
    1.00        0.1464      0.1251
    11-pt Avg:  0.4655      0.4084
    % Change:               -12.3

If we compute the difference in average precision between the two methods (vector space and logistic) for each query, and test whether the differences are statistically significant, we find:

Cranfield Hypothesis Test: 2-tail T on Precision Differences between Logistic Inference and Tfidf/cosine
N 225, null mean 0.0000, df 224, test stat -8.31, P value .0000

Thus we can see that by applying logistic regression on the set of six term properties (clues), we obtain a statistically significant improvement over the tfidf/cosine vector space model.
7. A criticism of the Cranfield results

The above results obtained for the logistic inference model have a fundamental difficulty: the entire set of queries, documents and terms of the Cranfield collection was used as the training sample from which the logistic coefficients were fitted, and the resulting model was then used to rank the documents of that same collection. A similar difficulty arises whenever the training set is included in the test set [19]. Since a large number of relevance judgments is necessary for fitting probabilistic models, the next step, in order to test the results properly, is to apply the model fitted on the Cranfield collection, used as a training set, to other collections — in particular CACM and CISI.
8. Logistic inference for CACM

If we apply the logistic regression equation fitted for the Cranfield sample to the CACM collection, we obtain the following result:

Cranfield Logistic versus Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries
% Change: 5.2 in favor of the logistic model

While it may appear that the logistic model performs 5.2 percent better than the tfidf/cosine vector space model, if we apply the statistical test:

CACM Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine
N 52, null mean 0.0000, sample mean -0.020528, df 51.0, test stat -1.29, P value .2016

we find again that this difference is not statistically significant, and hence we cannot reject the null hypothesis that, in fact, given another set of queries, the tfidf/cosine vector space model might perform as well as or better than the logistic model.
9. Cranfield model for CISI

Applying the logistic regression equation fitted for the Cranfield collection to the CISI collection, we find an 11-point average precision of 0.2088 for the Cranfield logistic model. While the average precision and recall measures seem to indicate better vector space performance, when we apply the statistical test for the 76 CISI queries:

CISI Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine
N 76, null mean 0.0000, sample mean 0.0056973, sample SE 0.0087791, df 75.0, test stat 0.65

we find that there can be no statistically significant difference found between the extended logistic model, with coefficients derived from the Cranfield collection, and the vector space model on these queries.

While we can find no statistically significant difference, the direct application to one collection of coefficients derived from the statistical distribution of another collection leaves something to be desired. In the next section, we apply some thought to the transformation of coefficients to adapt the clues to the statistical distribution of the new collection.
10. Standardized variables

The direct application of coefficients derived from a query-document-term sample of one collection to another collection makes a statistical assumption: it assumes that the clues being used as indicators of probability of relevance in the new collection come from the same underlying probability distribution as the clues of the original collection on which the fitting has been done. Since the documents and sets of queries differ from one collection to another, it seems highly unlikely that this is so. It is well known, however, that one can take a probability distribution with known mean and standard deviation (even though the functional form of the distribution itself is unknown) and transform it into a standardized distribution with mean 0 and standard deviation 1. If the logistic regression is fitted on standardized clues, then we can also apply the resulting coefficients to the standardized clue distributions of yet another collection. The assumption of such an adaptation is that the underlying standardized probability distribution of each clue remains identical from collection to collection, changing only in its mean and standard deviation, i.e., only the mean and standard deviation of the particular collection-dependent distribution will change while the underlying standardized probability distribution remains constant over all collections.
One might well question this assumption on the following grounds:
Suppose the mean μi and standard deviation σi for clue i vary meaningfully from collection to collection, and that these differences have predictive power with respect to relevance. Then the standardization of the distributions will remove this predictive power.
One way to test the assumption would be to fit a separate regression for each collection and compare the resulting standardized coefficients. For a first test of the technique, however, we assume the assumption holds and compute, by the following formulae, the coefficients to apply to a new collection. Suppose the regression for the first collection has been fitted on two standardized clues X and Y:

    log O(R | x1, y1) = c0 + c1 (x1 − μX1)/σX1 + c2 (y1 − μY1)/σY1

If we then look to apply these coefficients to a second collection, assuming its clues follow the same standardized distribution, we have:

    log O(R | x2, y2) = c0 + c1 (x2 − μX2)/σX2 + c2 (y2 − μY2)/σY2

and we wish to derive ordinary (unstandardized) coefficients for the second collection. If we multiply out the equation we obtain:

    log O(R | x2, y2) = [c0 − c1 μX2/σX2 − c2 μY2/σY2] + (c1/σX2) x2 + (c2/σY2) y2

that is, the new coefficients are:

    c0' = c0 − c1 μX2/σX2 − c2 μY2/σY2
    c1' = c1/σX2
    c2' = c2/σY2

Thus to find the logistic coefficients for any clue v, we need only determine its mean μv and standard deviation σv in the new collection and apply them directly to the fitted coefficients.
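The coefficient transformation is mechanical, and can be sketched as follows. The clue means and standard deviations here are illustrative:

```python
def unstandardize(c0, clue_coeffs, means, stds):
    """Convert coefficients fitted on standardized clues (v - mu)/sigma into
    ordinary coefficients for a new collection with the given clue means and
    standard deviations: c0' = c0 - sum(ci*mui/sigmai), ci' = ci/sigmai."""
    new_c0 = c0 - sum(c * m / s for c, m, s in zip(clue_coeffs, means, stds))
    new_coeffs = [c / s for c, s in zip(clue_coeffs, stds)]
    return new_c0, new_coeffs

# Illustrative standardized fit adapted to a new collection's clue statistics:
c0_new, coeffs_new = unstandardize(
    c0=-2.0, clue_coeffs=[0.5, -0.3],
    means=[0.30, 0.18], stds=[0.52, 0.33])
```

By construction, scoring raw clue values with the transformed coefficients gives exactly the same result as standardizing the clues and scoring with the original coefficients.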
This methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text retrieval conference (TREC-1) [21]. This paper provides a more complete analysis of the method.
11. Applying standardization to CACM and CISI

In order to obtain means and standard deviations for all clues for a new collection, we must obtain a new sample of query-document-term triples (of 1 in 125) from each of those collections. From such a small sample from each collection, with all associated clue values, we can compute the clue means and standard deviations for CACM and CISI; for example, for the clue log QAF:

    Coll.   mean     std dev
    CISI    0.3031   0.5153
    CACM    0.1763   0.3347

We then use these means and standard deviations to derive, by the formulae of the preceding section, new logistic coefficients for each collection. The constant terms become:

    Coll.   const.
    CRAN    -4.125
    CACM    -1.341
    CISI    -0.934
Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

    Recall      Standardized Logistic   Vector Space
                Precision               Precision
    0.00        0.7452                  0.6985
    0.10        0.6040                  0.5683
    0.20        0.5338                  0.4830
    0.30        0.4392                  0.3887
    0.40        0.3683                  0.3210
    0.50        0.3128                  0.2761
    0.60        0.2682                  0.2306
    0.70        0.1846                  0.1770
    0.80        0.1388                  0.1441
    0.90        0.0961                  0.1018
    1.00        0.0694                  0.0735
    11-pt Avg:  0.3419                  0.3148
    % Change:                           -7.9
Note that the average precision has increased approximately another 3 percent over the tfidf/cosine vector space method of ranking documents. The recall-precision table makes it quite clear that, except for the tail of recall-precision, the standardized extended logistic model performs substantially better than the tfidf/cosine vector space model for the CACM collection. Indeed, if we once again go through our process of computing average precision from all points of recall for each of the 52 queries for CACM, and then do a statistical test of the difference between standardized logistic and tfidf/cosine vector space average precision, we find this improvement is now statistically significant at the 2 percent level.
CACM Hypothesis Test: 2-tail T on Precision Differences between Standardized Logistic and Tfidf/cosine
N 52, null mean 0.0000, sample mean -0.030180, sample SE 0.012335, df 51.0, test stat -2.45, P value .0179

That is to say, we would expect, in less than two percent of all samples of queries applied to the CACM collection, for the vector space model to perform better than the standardized logistic model. In this way, we can reject the null hypothesis that there is no difference in the performance in the CACM collection and accept the clear indication that standardized logistic regression performs better than the vector space model for the CACM collection.
On the other hand, if we apply this correction factor and standardization to the CISI collection, we find that there is a slight worsening in the performance results.

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

                Standardized Logistic   Vector Space
                Precision               Precision
    11-pt Avg:  0.2042                  0.2137
    % Change:                           4.7
When comparing this result with the non-standardized extended logistic model for CISI, we see that the precision changes occur only in the third decimal place. We can conclude that the difference between the two is certainly not statistically significant, nor is that between CISI standardized and tfidf/cosine. Among the reasons for this failure to achieve performance improvement on the CISI collection is the difference in prior probability of relevance between the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of prior probability of relevance.
12. Conclusions
In this research we have investigated a new probabilistic text and document search method based upon logistic regression. This logistic inference method estimates probability of relevance for documents with respect to a query which represents the user's information need. Documents are then ranked in descending order of their estimated probability of relevance to the query.

1. The logistic inference method has been subjected to detailed performance tests comparing it (in terms of recall and precision averages) to the traditional tfidf/cosine vector space method, using the same retrieval and evaluation software (the Cornell SMART system) for both methods. In this way the test results remain free of bias which might be introduced from different software implementations of evaluation methods. In terms of recall and precision averages, the logistic inference method outperforms the tfidf/cosine vector space method on the Cranfield and CACM test collections. The methods seem to perform equally well on the CISI test collection (for an appropriately estimated prior probability).

2. Statistical tests have been applied to ascertain whether these performance differences between the two methods are statistically significant. The performance improvement of the logistic inference method over the tfidf/cosine vector space method for the Cranfield and CACM collections is statistically significant at the five percent level. Performance differences for the CISI collection are not statistically significant, for the most plausible estimate of prior probability.

3. The use of standardized variables (statistical clues standardized to mean μ = 0 and standard deviation σ = 1) seems to enable the training and fitting of logistic regression coefficients to take place on the queries and documents of one collection and to be applied directly to the queries and documents of other collections.
13. Acknowledgments
The work described was part of the author's dissertation research at the School of Library and Information Studies at the University of California, Berkeley. Many of the ideas contained herein were jointly formulated with my dissertation advisor, Professor William S. Cooper, whose clarity of thinking and relentless persistence made this research both possible and doable. My outside committee member, Professor of Biostatistics Steve Selvin, provided much helpful advice on statistical techniques.
References
1. Salton G et al. The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, NJ, 1971
2. Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill, New York, 1983
3. Salton G. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, MA, 1989
4. Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972; 28:11-21
5. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Information Processing and Management 1988; 24:513-523
6. Robertson S. The probability ranking principle in IR. Journal of Documentation 1977; 33:294-304
7. Robertson S, Sparck Jones K. Relevance weighting of search terms. Journal of the American Society for Information Science 1976; 27:129-145
8. Cooper W. Inconsistencies and misnomers in probabilistic IR. In: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, Oct 13-16, 1991, pp 57-61
9. Fuhr N, Huther H. Optimum probability estimation from empirical distributions. Information Processing and Management 1989; 25:493-507
10. Hosmer D, Lemeshow S. Applied logistic regression. John Wiley, New York, 1989
11. Fox E. Extending the Boolean and vector space models of information retrieval with P-norm queries and multiple concept types. PhD dissertation, Department of Computer Science, Cornell University, 1983
12. Fuhr N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 1989; 7:183-204
13. Fuhr N, Buckley C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 1991; 9:223-248
14. Haines D, Croft B. Relevance feedback and inference networks. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 2-12
15. Turtle H. Inference networks for document retrieval. PhD dissertation, University of Massachusetts; COINS Technical Report 90-92, February 1991
16. Fung R, Crawford S, Appelbaum L, Tong R. An architecture for probabilistic concept-based information retrieval. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium, September 5-7, 1990, pp 455-467
17. Swanson D. Information retrieval as a trial-and-error process. Library Quarterly 1977; 47:128-148
18. Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 329-338
19. Yu C, Buckley C, Lam H, Salton G. A generalized term dependence model in information retrieval. Information Technology: Research and Development 1983; 2:129-154
20. Cooper W, Gey F, Chen A. Information retrieval from the TIPSTER collection: an application of staged logistic regression. In: Proceedings of the First NIST Text Retrieval Conference (TREC-1), Washington, DC, November 4-6, 1992. NIST Special Publication 500-207, March 1993, pp 73-88
21. Harman D. Overview of the first TREC conference. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 36-47
22. Gey F. … and logistic inference in information retrieval. PhD dissertation, University of California, Berkeley, 1993