4. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:
   (a) draw x_i^(d) from Bernoulli ψ_{w_{i-1}^(d)};
   (b) draw z_i^(d) from Discrete θ^(d); and
   (c) draw w_i^(d) from Discrete σ_{w_{i-1}^(d)} if x_i^(d) = 1; else draw w_i^(d) from Discrete φ_{z_i^(d)}.

Note that in the LDA Collocation model, bigrams do not have topics, as the second term of a bigram is generated from a distribution σ_v conditioned on the previous word v only.

[Figure 1: Graphical model representations of (a) the bigram topic model, (b) the LDA Collocation model, and (c) the topical n-gram model.]
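To make Step 4 above concrete, the following minimal Python sketch draws one token under the LDA Collocation model, assuming the distributions θ (per-document topic proportions), ψ (one Bernoulli parameter per previous word), σ (one next-word distribution per previous word), and φ (one word distribution per topic) have already been sampled from their priors; the function and variable names are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_token_ldacol(prev_word, theta_d, psi, sigma, phi):
    """One draw of (x_i, z_i, w_i) following Step 4 of the LDA Collocation model.

    prev_word : index of the previous word w_{i-1}
    theta_d   : (T,)   topic proportions theta^(d) for document d
    psi       : (W,)   Bernoulli parameters, one per previous word
    sigma     : (W, W) next-word distributions, one row per previous word
    phi       : (T, W) word distributions, one row per topic
    """
    x = rng.binomial(1, psi[prev_word])        # (a) bigram status from Bernoulli psi_{w_{i-1}}
    z = rng.choice(len(theta_d), p=theta_d)    # (b) topic from Discrete theta^(d)
    if x == 1:                                 # (c) second term of a bigram, from sigma_{w_{i-1}}
        w = rng.choice(sigma.shape[1], p=sigma[prev_word])
    else:                                      #     otherwise a unigram drawn from phi_{z_i}
        w = rng.choice(phi.shape[1], p=phi[z])
    return x, z, w
```

Note that when x = 1 the sampled topic z plays no role in generating w, which is exactly why bigrams carry no topics in this model.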
2.3 Topical N-gram Model (TNG)
The topical n-gram model (TNG) is not a pure addition of the bigram topic model and the LDA Collocation model. It can solve the problem associated with the “neural network” example, as the bigram topic model does, and it can automatically determine whether a composition of two terms is indeed a bigram, as in the LDA Collocation model. However, like the other collocation discovery methods discussed in Section 3, a discovered bigram is always a bigram in the LDA Collocation model, no matter what the context is.

One of the key contributions of our model is to make it possible to decide whether to form a bigram for the same two consecutive word tokens depending on their nearby context (i.e., co-occurrences). Thus, additionally, our model is a perfect solution for the “white house” example in Section 1. As in the LDA Collocation model, we may assume some x’s are observed, for the same reason as discussed in Section 2.2. The graphical model presentation of this model is shown in Figure 1(c). Its generative process can be described as follows:

1. draw Discrete distributions φ_z from a Dirichlet prior β for each topic z;
2. draw Bernoulli distributions ψ_zw from a Beta prior γ for each topic z and each word w;
3. draw Discrete distributions σ_zw from a Dirichlet prior δ for each topic z and each word w;
4. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:
   (a) draw x_i^(d) from Bernoulli ψ_{z_{i-1}^(d) w_{i-1}^(d)};
   (b) draw z_i^(d) from Discrete θ^(d); and
   (c) draw w_i^(d) from Discrete σ_{z_i^(d) w_{i-1}^(d)} if x_i^(d) = 1; else draw w_i^(d) from Discrete φ_{z_i^(d)}.
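For comparison with the sketch in Section 2.2, here is a corresponding hypothetical per-token draw for TNG; the only structural changes are that ψ is indexed by the previous word’s topic as well as the previous word, and σ is indexed by the current topic and the previous word (names and array shapes are again our own).

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_token_tng(prev_word, prev_topic, theta_d, psi, sigma, phi):
    """One draw of (x_i, z_i, w_i) following Step 4 of the topical n-gram model.

    psi   : (T, W)    Bernoulli parameters indexed by (previous topic, previous word)
    sigma : (T, W, W) next-word distributions indexed by (current topic, previous word)
    phi   : (T, W)    per-topic word distributions
    """
    x = rng.binomial(1, psi[prev_topic, prev_word])  # (a) status depends on z_{i-1} and w_{i-1}
    z = rng.choice(len(theta_d), p=theta_d)          # (b) topic from Discrete theta^(d)
    if x == 1:                                       # (c) bigram term from sigma_{z_i, w_{i-1}}
        w = rng.choice(sigma.shape[2], p=sigma[z, prev_word])
    else:                                            #     unigram from phi_{z_i}
        w = rng.choice(phi.shape[1], p=phi[z])
    return x, z, w
```

Under the alternative Step 4(b) discussed next, one would instead set z = prev_topic whenever x = 1, so that the two terms of a bigram always share a topic.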
As shown in the above generative process, the topic assignments for the two terms in a bigram are not required to be identical. We can take the topic of the first or last word token, or the most common topic in the phrase, as the topic of the phrase. In this paper, we will use the topic of the last term as the topic of the phrase for simplicity, since long noun phrases do truly sometimes have components indicative of different topics, and the last noun is usually the “head noun”. Alternatively, we could enforce consistency in the model with ease, by simply adding two more sets of arrows (z_{i-1} → z_i and x_i → z_i). Accordingly, we could substitute Step 4(b) in the above generative process with “draw z_i^(d) from Discrete θ^(d) if x_i^(d) = 0; else let z_i^(d) = z_{i-1}^(d)”. In this way, a word has the option to inherit a topic assignment from its previous word if they form a bigram phrase. However, from our experimental results, the first choice yields better performance. From now on, we will focus on the model shown in Figure 1(c).

Finally, we want to point out that the topical n-gram model is not only a new framework for distilling n-gram phrases depending on nearby context, but also a more sensible topic model than the ones using word co-occurrences alone.

In state-of-the-art hierarchical Bayesian models such as latent Dirichlet allocation, exact inference over hidden topic variables is typically intractable due to the large number of latent variables and parameters in the models. Approximate inference techniques such as variational methods [12], Gibbs sampling [1] and expectation propagation [17] have been developed to address this issue. We use Gibbs sampling to conduct approximate inference in this paper. To reduce the uncertainty introduced by θ, φ, ψ, and σ, we could integrate them out with no trouble because of the conjugate prior setting in our model. Starting from the joint distribution P(w, z, x | α, β, γ, δ), we can work out the conditional probabilities P(z_i^(d), x_i^(d) | z_{-i}^(d), x_{-i}^(d), w, α, β, γ, δ) needed for Gibbs sampling, where n_{zw} represents how many times word w is assigned to topic z as a unigram, m_{zwv} represents how many times word v is assigned to topic z as the second term of a bigram given the previous word w, p_{zwk} denotes how many times the status variable x equals k (0 or 1) given the previous word w and the previous word’s topic z, and q_{dz} represents how many times a word in document d is assigned to topic z. Note that all counts here do include the assignment of the token being visited. Details of the Gibbs sampling derivation, including the full conditional, are provided in Appendix A. (As shown there, one could further calculate P(z_i^(d) | ···) and P(x_i^(d) | ···) as in a traditional Gibbs sampling procedure; for an observed x_i^(d), only z_i^(d) needs to be drawn.)

Simple manipulations give us the posterior estimates of θ, φ, ψ, and σ as follows:

\hat{\theta}^{(d)}_z = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T} (\alpha_t + q_{dt})}, \qquad
\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})}, \qquad
\hat{\psi}_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k'=0}^{1} (\gamma_{k'} + p_{zwk'})}, \qquad
\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v'=1}^{W} (\delta_{v'} + m_{zwv'})}    (1)

As discussed for the bigram topic model [22], one could certainly infer the values of the hyperparameters in TNG using a Gibbs EM algorithm [1]. For many applications, topic models are sensitive to hyperparameters, and it is important to get their values right. In the particular experiments discussed in this paper, however, we find that sensitivity to hyperparameters is not a big concern. For simplicity and feasibility in our Gigabyte TREC retrieval tasks, we skip the inference of hyperparameters, and instead use some reported empirical values for them to show salient results.
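As a small illustration of Equation 1, the sketch below turns the count arrays n, m, p, q defined above, together with symmetric scalar hyperparameters, into the posterior estimates; the array layouts are our own convention and are not prescribed by the paper.

```python
import numpy as np

def posterior_estimates(q, n, p, m, alpha, beta, gamma, delta):
    """Posterior estimates of theta, phi, psi, and sigma (Equation 1).

    q : (D, T)    q[d, z]    = times a word in document d is assigned to topic z
    n : (T, W)    n[z, w]    = times word w is assigned to topic z as a unigram
    p : (T, W, 2) p[z, w, k] = times x = k is drawn after previous word w with topic z
    m : (T, W, W) m[z, w, v] = times word v is the second bigram term after word w under topic z
    alpha, beta, gamma, delta : symmetric hyperparameters (scalars broadcast over the arrays)
    """
    theta = (alpha + q) / (alpha + q).sum(axis=1, keepdims=True)
    phi   = (beta + n)  / (beta + n).sum(axis=1, keepdims=True)
    psi   = (gamma + p) / (gamma + p).sum(axis=2, keepdims=True)
    sigma = (delta + m) / (delta + m).sum(axis=2, keepdims=True)
    return theta, phi, psi, sigma
```

With the symmetric priors used later in the paper (e.g., α = 1, β = 0.01, γ = 0.1, δ = 0.01), the scalar hyperparameters simply broadcast over the count arrays.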
3 Related Work

Collocation has long been studied by lexicographers and linguists in various ways. Traditional collocation discovery methods range from frequency to variance, to hypothesis testing, to mutual information. The simplest method is counting. A small amount of linguistic knowledge (a part-of-speech filter) has been combined with frequency [13] to discover surprisingly meaningful phrases. Variance-based collocation discovery [19] considers collocations in a more flexible way than fixed phrases. However, high frequency and low variance can be accidental. Hypothesis testing can be used to assess whether or not two words occur together more often than chance. Many statistical tests have been explored, for example, the t-test [5], the χ² test [4], and the likelihood ratio test [7]. More recently, an information-theoretically motivated method for collocation discovery is utilizing mutual information [6, 11].

The hierarchical Dirichlet language model [14] is closely related to the bigram topic model [22]. The probabilistic view of smoothing in language models shows how to take advantage of a bigram model in a Bayesian way.

The main stream of topic modeling has gradually gained a probabilistic flavor as well in the past decade. One of the most popular topic models, latent Dirichlet allocation (LDA) [2], which makes the bag-of-words assumption, has made a big impact in the fields of natural language processing, statistical machine learning and text mining. The three models we discussed in Section 2 all contain an LDA component that is responsible for the topic part.

From our point of view, the HMMLDA model [10] is the first attack on word dependency in the topic modeling framework. The authors present HMMLDA as a generative composite model that takes care of both short-range syntactic dependencies and long-range semantic dependencies between words; its syntactic part is a hidden Markov model and its semantic component is a topic model (LDA). Interesting results based on this model are shown on tasks such as part-of-speech tagging and document classification.

4 Experimental Results

We apply the topical n-gram model to the NIPS proceedings dataset, which consists of the full text of the 13 years of proceedings of the Neural Information Processing Systems (NIPS) Conferences from 1987 to 1999. In addition to downcasing and removing stopwords and numbers, we also removed words appearing fewer than five times in the corpus—many of them produced by OCR errors. Two-letter words (primarily coming from equations) were removed, except for “ML”, “AI”, “KL”, “BP”, “EM” and “IR”. The dataset contains 1,740 research papers, 13,649 unique words, and 2,301,375 word tokens in total. Topics found from a 50-topic run on the NIPS dataset (10,000 Gibbs sampling iterations, with symmetric priors α = 1, β = 0.01, γ = 0.1, and δ = 0.01) of the topical n-gram model are shown in Table 2 as anecdotal evidence, with comparison to the corresponding closest (by KL divergence) topics found by LDA.

The “Reinforcement Learning” topic provides an extremely salient summary of the corresponding research area. The LDA topic assembles many common words used in reinforcement learning, but its word list contains quite a few generic words (such as “function”, “dynamic”, “decision”) that are common and highly probable in many other topics as well. In TNG, we find that these generic words are associated with other words to form n-gram phrases (such as “markov decision process”, etc.) that are only highly probable in reinforcement learning. More importantly, by forming n-gram phrases, the unigram word list produced by TNG is also cleaner. For example, because of the prevalence of generic words in LDA, highly related words (such as “q-learning” and “goal”) are not ranked high enough to be shown in the top-20 word list. On the contrary, they are ranked very high in TNG’s unigram word list.

In the other three topics (Table 2), we find similar phenomena as well. For example, in “Human Receptive System”, some generic words (such as “field”, “receptive”) are actually components of the popular phrases in this area, as shown in the TNG model. “system” is ranked high in LDA, but is almost meaningless; on the other hand, it does not appear in the top word lists of TNG. Some extremely related words (such as “spatial”), ranked very high in TNG, are absent from LDA’s top word list. In “Speech Recognition”, the dominating generic words (such as “context”, “based”, “set”, “probabilities”, “database”) make the LDA topic less understandable than even TNG’s unigram word list alone.

In many situations, a crucially related word might not be mentioned enough to be clearly captured in LDA; on the other hand, it can become very salient as a phrase due to the relatively stronger co-occurrence pattern in the extremely sparse setting for phrases. The “Support Vector Machines” topic provides such an example. We can imagine that “kkt” will be mentioned no more than a few times in a typical NIPS paper, and it probably appears only as part of the phrase “kkt conditions”. TNG successfully captures it as a highly probable phrase in the SVM topic.

As we discussed before, higher-order n-grams (n > 2) can be approximately modeled by concatenating consecutive bigrams in the TNG model, as shown in Table 2 (such as “markov decision process”, “hidden markov model” and “support vector machines”, etc.).

To numerically evaluate the topical n-gram model, we could have used some standard measures such as perplexity and document classification accuracy. However, to convincingly illustrate the power of the TNG model at a larger, more realistic scale, we apply it here to a much larger standard text mining task: we employ the TNG model within the language modeling framework to conduct ad-hoc retrieval on Gigabyte TREC collections.
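The corpus preparation described at the start of this section (downcasing; removing stopwords, numbers, rare words, and most two-letter words) can be sketched roughly as follows; the stopword list, tokenizer, and exact filtering order here are illustrative stand-ins rather than the paper’s actual pipeline.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "we", "that"}  # placeholder list
KEEP_TWO_LETTER = {"ml", "ai", "kl", "bp", "em", "ir"}
MIN_COUNT = 5  # words appearing fewer than five times are dropped

def tokenize(text):
    # downcase and keep purely alphabetic tokens, which also discards numbers
    return re.findall(r"[a-z]+", text.lower())

def preprocess(documents):
    tokenized = [tokenize(doc) for doc in documents]
    counts = Counter(tok for doc in tokenized for tok in doc)

    def keep(tok):
        if tok in STOPWORDS or counts[tok] < MIN_COUNT:
            return False
        if len(tok) == 2 and tok not in KEEP_TWO_LETTER:
            return False  # two-letter words are removed except for a small whitelist
        return True

    return [[tok for tok in doc if keep(tok)] for doc in tokenized]
```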
Reinforcement Learning
LDA | n-gram (2+) | n-gram (1)
state | reinforcement learning | action
learning | optimal policy | policy
policy | dynamic programming | reinforcement
action | optimal control | states
reinforcement | function approximator | actions
states | prioritized sweeping | function
time | finite-state controller | optimal
optimal | learning system | learning
actions | reinforcement learning rl | reward
function | function approximators | control
algorithm | markov decision problems | agent
reward | markov decision processes | q-learning
step | local search | goal
dynamic | state-action pair | space
control | markov decision process | step
sutton | belief states | environment
rl | stochastic policy | system
decision | action selection | problem
algorithms | upright position | steps
agent | reinforcement learning methods | transition

Human Receptive System
LDA | n-gram (2+) | n-gram (1)
motion | receptive field | motion
visual | spatial frequency | spatial
field | temporal frequency | visual
position | visual motion | receptive
figure | motion energy | response
direction | tuning curves | direction
fields | horizontal cells | cells
eye | motion detection | figure
location | preferred direction | stimulus
retina | visual processing | velocity
receptive | area mt | contrast
velocity | visual cortex | tuning
vision | light intensity | moving
moving | directional selectivity | model
system | high contrast | temporal
flow | motion detectors | responses
edge | spatial phase | orientation
center | moving stimuli | light
light | decision strategy | stimuli
local | visual stimuli | cell

Speech Recognition
LDA | n-gram (2+) | n-gram (1)
recognition | speech recognition | speech
system | training data | word
word | neural network | training
face | error rates | system
context | neural net | recognition
character | hidden markov model | hmm
hmm | feature vectors | speaker
based | continuous speech | performance
frame | training procedure | phoneme
segmentation | continuous speech recognition | acoustic
training | gamma filter | words
characters | hidden control | context
set | speech production | systems
probabilities | neural nets | frame
features | input representation | trained
faces | output layers | sequence
words | training algorithm | phonetic
frames | test set | speakers
database | speech frames | mlp
mlp | speaker dependent | hybrid

Support Vector Machines
LDA | n-gram (2+) | n-gram (1)
kernel | support vectors | kernel
linear | test error | training
vector | support vector machines | support
support | training error | margin
set | feature space | svm
nonlinear | training examples | solution
data | decision function | kernels
algorithm | cost functions | regularization
space | test inputs | adaboost
pca | kkt conditions | test
function | leave-one-out procedure | data
problem | soft margin | generalization
margin | bayesian transduction | examples
vectors | training patterns | cost
solution | training points | convex
training | maximum margin | algorithm
svm | strictly convex | working
kernels | regularization operators | feature
matrix | base classifiers | sv
machines | convex optimization | functions
Table 2. The four topics from a 50-topic run of TNG on 13 years of NIPS research papers, with their closest counterparts from LDA. The title above the word lists of each topic is our own summary of the topic. To better illustrate the difference between TNG and LDA, we list the n-grams (n > 1) and unigrams separately for TNG. Each topic is shown with the 20 sorted highest-probability words. The TNG model produces a clearer word list for each topic by associating many generic words (such as “set”, “field”, “function”, etc.) with other words to form n-gram phrases.
4.1 Ad-hoc Retrieval

Traditional information retrieval (IR) models usually represent text with bags-of-words, assuming that words occur independently, which is not exactly appropriate for natural language. To address this problem, researchers have been working on capturing word dependencies. There are mainly two types of dependencies being studied and shown to be effective: 1) topical (semantic) dependency, which is also called long-distance dependency. Two words are considered dependent when their meanings are related and they co-occur often, such as “fruit” and “apple”. Among models capturing semantic dependency, the LDA-based document models [23] are state-of-the-art. For IR applications, a major advantage of topic models (document expansion), compared to online query expansion in pseudo relevance feedback, is that they can be trained offline, and thus are more efficient in handling a new query; 2) phrase dependency, also called short-distance dependency. As reported in the literature, retrieval performance can be boosted if the similarity between a user query and a document is calculated by common phrases instead of common words [9, 8, 21, 18]. Most research on phrases in information retrieval has employed an independent collocation discovery module, e.g., using the methods described in Section 3. In this way, a phrase can be indexed exactly as an ordinary word.

The topical n-gram model automatically and simultaneously takes care of both semantic co-occurrences and phrases. Also, it does not need a separate module for phrase discovery, and everything can be seamlessly integrated into the language modeling framework, which is one of the most popular statistically principled approaches to IR. In this section, we illustrate the difference between the TNG and LDA models in IR experiments, and compare the IR performance of all three models shown in Figure 1 on a TREC collection introduced below.

The SJMN dataset, taken from TREC with standard queries 51-150 that are taken from the title field of TREC topics, covers materials from the San Jose Mercury News in 1991. All text is downcased and only alphabetic characters are kept. Stop words in both the queries and documents are removed, according to a common stop word list in the Bow toolkit [16]. If any two consecutive tokens were originally separated by a stopword, no bigram is allowed to be formed. In total, the SJMN dataset we use contains 90,257 documents, 150,714 unique words, and 21,156,378 tokens, which is an order of magnitude larger than the NIPS dataset. Relevance judgments are taken from the judged pool of the top retrieved documents by various participating retrieval systems from previous TREC conferences.

The number of topics is set to 100 for all models, with 10,000 Gibbs sampling iterations, and the same hyperparameter setting as for the NIPS dataset is used (symmetric priors α = 1, β = 0.01, γ = 0.1, and δ = 0.01). Here, we aim to beat the state-of-the-art model [23] instead of the state-of-the-art results in TREC retrieval, which need significant, non-modeling effort to achieve (such as stemming).

4.2 Difference between Topical N-grams and LDA in IR Applications

From both LDA and TNG, a word distribution for each document can be calculated, which can thus be viewed as a document model. With these distributions, the likelihood of generating a query can be computed to rank documents, which is the basic idea of the query likelihood (QL) model in IR. When the two models are directly applied to ad-hoc retrieval, the TNG model performs significantly better than the LDA model under the Wilcoxon test at the 95% level. Among the 4881 relevant documents for all queries, LDA retrieves 2257 of them but TNG gets 2450, 8.55% more. The average precision for TNG is 0.0709, 61.96% higher than its LDA counterpart (0.0438). Although these results are not state-of-the-art IR performance, we claim that, if used alone, TNG represents a document better than LDA. The average precisions for both models are very low, because corpus-level topics may be too coarse to be used as the only representation in IR [3, 23]. Significant improvements in IR can be achieved through a combination with the basic query likelihood model.

In the query likelihood model, each document is scored by the likelihood of its model generating a query Q, P_LM(Q|d). Let the query Q = (q_1, q_2, ..., q_{L_Q}). Under the bag-of-words assumption, P_LM(Q|d) = \prod_{i=1}^{L_Q} P(q_i|d), which is often specified by the document model with Dirichlet smoothing [24],

P_{LM}(q \mid d) = \frac{N_d}{N_d + \mu} P_{ML}(q \mid d) + \left(1 - \frac{N_d}{N_d + \mu}\right) P_{ML}(q \mid \text{coll}),

where N_d is the length of document d, P_ML(q|d) and P_ML(q|coll) are the maximum likelihood (ML) estimates of a query term q generated in document d and in the entire collection, respectively, and µ is the Dirichlet smoothing prior (in our reported experiments we used a fixed value µ = 1000, as in [23]).

To calculate the query likelihood from the TNG model within the language modeling framework, we need to sum over the topic variable and bigram status variable for each token in the query token sequence. Given the posterior estimates θ̂, φ̂, ψ̂, and σ̂ (Equation 1), the query likelihood of query Q given document d, P_TNG(Q|d), can be calculated as

P_{TNG}(Q \mid d) = \prod_{i=1}^{L_Q} P_{TNG}(q_i \mid q_{i-1}, d),

where a dummy q_0 is assumed at the beginning of every query, for the convenience of mathematical presentation.
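The Dirichlet-smoothed query likelihood above is simple to compute from term counts; the sketch below shows the scoring loop with illustrative names. The comment marks where a topic-model document probability such as P_TNG(q_i | q_{i-1}, d)—obtained by summing over the topic and bigram-status variables with the estimates of Equation 1—would enter; the paper’s exact combination with the basic QL model (cf. the λ values in Table 4) is not reproduced here.

```python
import math
from collections import Counter

MU = 1000  # Dirichlet smoothing prior, the fixed value used in the experiments above

def log_query_likelihood(query_terms, doc_terms, collection_counts, collection_len):
    """log P_LM(Q|d) under the bag-of-words query likelihood model with Dirichlet smoothing."""
    doc_counts = Counter(doc_terms)
    n_d = len(doc_terms)
    lam = n_d / (n_d + MU)
    score = 0.0
    for q in query_terms:
        p_ml_doc = doc_counts[q] / n_d if n_d else 0.0
        p_ml_coll = collection_counts.get(q, 0) / collection_len
        # P_LM(q|d) = N_d/(N_d+mu) * P_ML(q|d) + (1 - N_d/(N_d+mu)) * P_ML(q|coll)
        p = lam * p_ml_doc + (1.0 - lam) * p_ml_coll
        # A topic-model term such as P_TNG(q | q_prev, d) could be mixed in here,
        # as the paper does when combining TNG with this basic QL model.
        score += math.log(p) if p > 0.0 else float("-inf")
    return score
```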
No. Query LDA TNG Change
053 Leveraged Buyouts 0.2141 0.3665 71.20%
097 Fiber Optics Applications 0.1376 0.2321 68.64%
108 Japanese Protectionist Measures 0.1163 0.1686 44.94%
111 Nuclear Proliferation 0.2353 0.4952 110.48%
064 Hostage-Taking 0.4265 0.4458 4.52%
125 Anti-smoking Actions by Government 0.3118 0.4535 45.47%
145 Influence of the “Pro-Israel Lobby” 0.2900 0.2753 -5.07%
148 Conflict in the Horn of Africa 0.1990 0.2788 40.12%
Table 3. Comparison of LDA and TNG on TREC retrieval performance (average precision) of eight
queries. The top four queries obviously contain phrase(s), and thus TNG achieves much better per-
formance. On the other hand, the bottom four queries do not contain common phrase(s) after pre-
processing (stopping and punctuation removal). Surprisingly, TNG still outperforms LDA on some
of these queries.
Table 4. Comparison of the bigram topic model (λ = 0.7), the LDA collocation model (λ = 0.9) and the topical n-gram model (λ = 0.8) on TREC retrieval performance (average precision). * indicates statistically significant differences in performance with 95% confidence according to the Wilcoxon test. TNG performs significantly better than the other two models overall.
… and the behaviors are more possibly due to randomness.

5 Conclusions

In this paper, we have presented the topical n-gram model. The TNG model automatically determines whether or not to form an n-gram (and further assign a topic to it), based on its nearby context.

Appendix A. Gibbs Sampling Derivation for the Topical N-grams Model

We begin with the joint distribution P(w, x, z | α, β, γ, δ). We can take advantage of conjugate priors to simplify the integrals. All symbols are defined in Section 2.

P(\mathbf{w}, \mathbf{z}, \mathbf{x} \mid \alpha, \beta, \gamma, \delta)
= \iiiint \prod_{d=1}^{D} \prod_{i=1}^{N_d} P(w_i^{(d)} \mid x_i^{(d)}, \phi_{z_i^{(d)}}, \sigma_{z_i^{(d)} w_{i-1}^{(d)}})\, P(z_i^{(d)} \mid \theta^{(d)})\, P(x_i^{(d)} \mid \psi_{z_{i-1}^{(d)} w_{i-1}^{(d)}})\; p(\theta \mid \alpha)\, p(\phi \mid \beta)\, p(\psi \mid \gamma)\, p(\sigma \mid \delta)\; d\theta\, d\phi\, d\psi\, d\sigma

Using the chain rule and Γ(α) = (α − 1)Γ(α − 1), we can obtain the conditional probability conveniently,

P(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, \mathbf{z}_{-i}^{(d)}, \mathbf{x}_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)
= \frac{P(w_i^{(d)}, z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}_{-i}^{(d)}, \mathbf{z}_{-i}^{(d)}, \mathbf{x}_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)}{P(w_i^{(d)} \mid \mathbf{w}_{-i}^{(d)}, \mathbf{z}_{-i}^{(d)}, \mathbf{x}_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)}
\propto (\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1)\,(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1) \times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0,\\[1.5ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1.
\end{cases}

Or equivalently,

P(z_i^{(d)} \mid \mathbf{w}, \mathbf{z}_{-i}^{(d)}, \mathbf{x}, \alpha, \beta, \gamma, \delta)
\propto (\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1) \times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0,\\[1.5ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1.
\end{cases}

And,

P(x_i^{(d)} \mid \mathbf{w}, \mathbf{z}, \mathbf{x}_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)
\propto (\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1) \times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0,\\[1.5ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1.
\end{cases}
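To show how the conditional above drives a collapsed Gibbs sampler, here is a hedged sketch of the joint update of (z_i, x_i) for a single token, using the count arrays n, m, p, q from Section 2.3 and symmetric scalar hyperparameters. Because the counts in the derivation include the token being visited, the sketch removes the token’s current assignment first, which makes the “−1” terms implicit; bookkeeping details (for example, the next token’s status count, which also involves z_i) are simplified, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_token(d, w, w_prev, z_prev, z_old, x_old,
                   q, n, p, m, alpha, beta, gamma, delta):
    """One collapsed Gibbs update of (z_i, x_i) for word w in document d."""
    T, W = n.shape

    # remove the token's current assignment from the counts
    q[d, z_old] -= 1
    p[z_prev, w_prev, x_old] -= 1
    if x_old == 1:
        m[z_old, w_prev, w] -= 1
    else:
        n[z_old, w] -= 1

    # unnormalized P(z, x | rest), following the conditional derived above
    probs = np.empty((T, 2))
    for z in range(T):
        doc_term = alpha + q[d, z]
        probs[z, 0] = doc_term * (gamma + p[z_prev, w_prev, 0]) * \
                      (beta + n[z, w]) / (W * beta + n[z].sum())
        probs[z, 1] = doc_term * (gamma + p[z_prev, w_prev, 1]) * \
                      (delta + m[z, w_prev, w]) / (W * delta + m[z, w_prev].sum())
    probs /= probs.sum()

    choice = rng.choice(2 * T, p=probs.ravel())
    z_new, x_new = divmod(choice, 2)

    # add the new assignment back into the counts
    q[d, z_new] += 1
    p[z_prev, w_prev, x_new] += 1
    if x_new == 1:
        m[z_new, w_prev, w] += 1
    else:
        n[z_new, w] += 1
    return z_new, x_new
```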
References

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems 19, pages 241–248, 2007.
[4] K. Church and W. Gale. Concordances for parallel text. In Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text Research, pages 40–62, 1991.
[5] K. Church and P. Hanks. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pages 76–83, 1989.
[6] K. W. Church, W. Gale, P. Hanks, and D. Hindle. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pages 115–164. Lawrence Erlbaum, 1991.
[7] T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.
[8] D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Lefferts, and I. A. Monarch. Automatic indexing using selective NLP and first-order thesauri. In Proceedings of Intelligent Multimedia Information Retrieval Systems and Management (RIAO '91), pages 624–643, 1991.
[9] J. Fagan. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115–139, 1989.
[10] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, 2005.
[11] J. Hodges, S. Yie, R. Reighart, and L. Boggess. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2(2):137–160, 1996.
[12] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, pages 105–161, 1998.
[13] J. S. Justeson and S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27, 1995.
[14] D. J. C. MacKay and L. Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1–19, 1994.
[15] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[16] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[17] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002.
[18] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proceedings of RIAO-97, 5th International Conference, pages 200–214, Montreal, CA, 1997.
[19] F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143–177, 1993.
[20] M. Steyvers and T. Griffiths. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
[21] T. Strzalkowski. Natural language information retrieval. Information Processing and Management, 31(3):397–417, 1995.
[22] H. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[23] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
[24] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.