
Interval Semi-supervised LDA: Classifying Needles in a Haystack

2013, Advances in Artificial Intelligence and Its Applications


Interval Semi-Supervised LDA: Classifying Needles in a Haystack

Svetlana Bodrunova, Sergei Koltsov, Olessia Koltsova, Sergey Nikolenko, and Anastasia Shimorina

Laboratory for Internet Studies (LINIS), National Research University Higher School of Economics, ul. Soyuza Pechatnikov, d. 16, 190008 St. Petersburg, Russia

Abstract. An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.

Keywords: topic modeling, latent Dirichlet allocation, text mining

1 Introduction

Many applications in social sciences are related to text mining. Researchers often aim to understand how a certain large body of text behaves: what topics interest the authors of this body, how these topics develop and interact, what key words define these topics in the discourse, and so on. Topic modeling approaches, usually based on some version of the LDA (latent Dirichlet allocation) model [8], are very important in this regard. Often, the actually interesting part of the dataset is relatively small, although it is still too large to be processed by hand; moreover, it is unclear how to separate the interesting part from the rest of the dataset. Such an "interesting" part may be, for instance, represented by certain topics that are defined by, but not limited to, certain relevant keywords (so that a simple search for these keywords would yield only a subset of the interesting part). In this short paper, we propose a method for identifying documents relevant to a specific set of topics that also extracts their topical structure, based on a semi-supervised version of the LDA model.

The paper is organized as follows. In Section 2, we briefly review the basic LDA model and survey related work concerning various extensions of the LDA model. In Section 3, we introduce two extensions: semi-supervised LDA, which sets a single topic for each predefined set of key words, and interval semi-supervised LDA, which maps a set of keywords to an interval of topics. In Section 4, we present a case study of mining ethnic discourse from a dataset of Russian LiveJournal blogs and show the advantages of the proposed approach; Section 5 concludes the paper.

2 The LDA Model and Extensions

2.1 LDA

The basic latent Dirichlet allocation (LDA) model [8, 12] is depicted in Fig. 1a. In this model, a collection of D documents is assumed to contain T topics expressed with W different words. Each document d ∈ D is modeled as a discrete distribution θ^(d) over the set of topics: p(z_w = j) = θ_j^(d), where z is a discrete variable that defines the topic of each word w ∈ d. Each topic, in turn, corresponds to a multinomial distribution over the words, p(w | z_w = j) = φ_w^(j). The model also introduces Dirichlet priors α for the distribution over documents (topic vectors) θ, θ ∼ Dir(α), and β for the distribution over the topical word distributions, φ ∼ Dir(β).
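As a concrete illustration of this generative process, the following is a minimal sketch in Python/NumPy (not part of the original paper; all parameter values are illustrative) that samples a toy corpus from the model just described.

```python
import numpy as np

def sample_lda_corpus(D=100, T=10, W=500, N_d=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative model:
    phi ~ Dir(beta), theta ~ Dir(alpha), z ~ Discrete(theta), w ~ Discrete(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(W, beta), size=T)   # phi[j]: word distribution of topic j
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(T, alpha))    # per-document topic distribution
        z = rng.choice(T, size=N_d, p=theta)        # topic assignment for each token
        words = np.array([rng.choice(W, p=phi[t]) for t in z])  # word drawn from its topic
        docs.append(words)
    return docs, phi

docs, phi = sample_lda_corpus()
```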
The inference problem in LDA is to find the hidden topic variables z, a vector spanning all instances of all words in the dataset. There are two main approaches to inference in the LDA model: variational approximations and MCMC sampling, which in this case is convenient to frame as Gibbs sampling. In this work, we use Gibbs sampling because it generalizes easily to the semi-supervised LDA considered below. In the LDA model, Gibbs sampling after simple transformations [12] reduces to the so-called collapsed Gibbs sampling, where the z_w are iteratively resampled with distributions

$$p(z_w = t \mid z_{-w}, \mathbf{w}, \alpha, \beta) \propto q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta) = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\bigl(n^{(d)}_{-w,t'} + \alpha\bigr)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\bigl(n^{(w')}_{-w,t} + \beta\bigr)},$$

where n^(d)_{-w,t} is the number of times topic t occurs in document d and n^(w)_{-w,t} is the number of times word w is generated by topic t, not counting the current value z_w.

2.2 Related work: LDA extensions

Over recent years, the basic LDA model has been subject to many extensions, each of them presenting either a variational or a Gibbs sampling algorithm for a model that builds upon LDA to incorporate some additional information or additional presumed dependencies. Among the most important extensions we can list the following:

– correlated topic models (CTM) improve upon the fact that in the base LDA model, topic distributions are independent and uncorrelated, while, of course, some topics are closer to each other and share words with each other; CTM uses the logistic normal distribution instead of the Dirichlet to model correlations between topics [5];
– Markov topic models use Markov random fields to model the interactions between topics in different parts of the dataset (different text corpora), connecting a number of different hyperparameters β_i in a Markov random field that lets one subject these hyperparameters to a wide class of prior constraints [16];
– relational topic models construct a hierarchical model that reflects the structure of a document network as a graph [9];
– the Topics over Time model applies when documents have timestamps of their creation (e.g., news articles); it represents the time when topics arise in continuous time with a beta distribution [27];
– dynamic topic models represent the temporal evolution of topics through the evolution of their hyperparameters α and β, either with a state-based discrete model [6] or with a Brownian motion in continuous time [26];
– supervised LDA assigns to each document an additional response variable that can be observed; this variable depends on the distribution of topics in the document and can represent, e.g., user response in a recommender system [7];
– DiscLDA assumes that each document is assigned a categorical label and attempts to utilize LDA for mining topic classes related to this classification problem [15];
– the Author-Topic model incorporates information about the author of a document, assuming that texts from the same author will be more likely to concentrate on the same topics and will be more likely to share common words [18, 19];
– finally, a lot of work has been done on nonparametric LDA variants based on Dirichlet processes, which we will not go into in this paper; for the most important nonparametric approaches to LDA see [4, 10, 20, 21, 28] and references therein.

The extension that appears to be closest to the one proposed in this work is the Topic-in-Set knowledge model and its extension with Dirichlet forest priors [1, 2]. In [1], words are assigned "z-labels"; a z-label represents the topic this specific word should fall into; in this work, we build upon and extend this model.
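Before turning to the semi-supervised modifications, it may help to see the collapsed Gibbs update of Section 2.1 in code. The following is a minimal sketch assuming standard NumPy count arrays (not the authors' implementation); the models introduced in the next section only change the sampling distribution inside the inner loop.

```python
import numpy as np

def gibbs_sweep(docs, z, n_dt, n_wt, n_t, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for plain LDA.

    docs[d] is a list of word ids; z[d][i] is the current topic of token i in doc d;
    n_dt[d, t], n_wt[w, t], n_t[t] are document-topic, word-topic and topic counts.
    """
    T, W = n_dt.shape[1], n_wt.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # exclude the current assignment from the counts (the "-w" subscript)
            n_dt[d, t_old] -= 1; n_wt[w, t_old] -= 1; n_t[t_old] -= 1
            # unnormalized conditional q(z_w = t | rest); the document-side
            # denominator does not depend on t and can be dropped
            q = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)
            t_new = rng.choice(T, p=q / q.sum())
            z[d][i] = t_new
            n_dt[d, t_new] += 1; n_wt[w, t_new] += 1; n_t[t_new] += 1
```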
3 Semi-Supervised LDA and Interval Semi-Supervised LDA

3.1 Semi-Supervised LDA

In real-life text mining applications, it often happens that the entire dataset D deals with a large number of different unrelated topics, while the researcher is actually interested only in a small subset of these topics. In this case, a direct application of the LDA model has important disadvantages. Relevant topics may have too small a presence in the dataset to be detected directly, and one would need a very large number of topics to capture them in an unsupervised fashion.

[Fig. 1. Probabilistic models: (a) LDA; (b) semi-supervised LDA; (c) interval semi-supervised LDA.]

For a large number of topics, however, the LDA model often has too many local maxima, giving unstable results with many degenerate topics. To find relevant subsets of topics in the dataset, we propose to use a semi-supervised approach to LDA, fixing the values of z for certain key words related to the topics in question; similar approaches have been considered in [1, 2]. The resulting graphical model is shown in Fig. 1b. For words w from a predefined set W_sup, the values of z are known and remain fixed to z̃_w throughout the Gibbs sampling process:

$$p(z_w = t \mid z_{-w}, \mathbf{w}, \alpha, \beta) \propto \begin{cases} [t = \tilde{z}_w], & w \in W_{\mathrm{sup}}, \\ q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta), & \text{otherwise.} \end{cases}$$

Otherwise, the Gibbs sampler works as in the basic LDA model; this yields an efficient inference algorithm that does not incur additional computational costs.

3.2 Interval Semi-Supervised LDA

One disadvantage of the semi-supervised LDA approach is that it assigns only a single topic to each set of keywords, while in fact there may be more than one topic associated with them. For instance, in our case study (see Section 4) there are several topics related to Ukraine and Ukrainians in the Russian blogosphere, and artificially drawing them all together with the semi-supervised LDA model would have undesirable consequences: some "Ukrainian" topics would be cut off from the supervised topic and left without Ukrainian keywords, because it is more likely for the model to cut off a few words even if they fit well than to bring together two very different sets of words under a single topic. Therefore, we propose to map each set of key words to several topics; it is convenient to choose a contiguous interval, hence interval semi-supervised LDA (ISLDA). Each key word w ∈ W_sup is thus mapped to an interval [z_l^w, z_r^w], and the probability distribution is restricted to that interval; the graphical model is shown in Fig. 1c, where I[z_l^w, z_r^w] denotes the indicator function: I[z_l^w, z_r^w](z) = 1 iff z ∈ [z_l^w, z_r^w]. In the Gibbs sampling algorithm, we simply need to set the probabilities of all topics outside [z_l^w, z_r^w] to zero and renormalize the distribution inside:

$$p(z_w = t \mid z_{-w}, \mathbf{w}, \alpha, \beta) \propto \begin{cases} I[z^w_l, z^w_r](t)\,\dfrac{q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta)}{\sum_{z^w_l \le t' \le z^w_r} q(z_w, t', z_{-w}, \mathbf{w}, \alpha, \beta)}, & w \in W_{\mathrm{sup}}, \\ q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta), & \text{otherwise.} \end{cases}$$
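In code, this amounts to masking the unnormalized LDA weights before sampling. The following is a minimal sketch reusing the count arrays from the plain LDA sweep above; `intervals` is a hypothetical dict mapping supervised word ids to their topic interval (both the name and the data layout are assumptions, not the authors' implementation).

```python
import numpy as np

def islda_topic_distribution(w, d, n_dt, n_wt, n_t, alpha, beta, intervals):
    """Conditional topic distribution for token w in document d under ISLDA.
    Counts are assumed to already exclude the current assignment of this token."""
    W = n_wt.shape[0]
    q = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)  # plain LDA weights
    if w in intervals:                       # supervised keyword: restrict to [z_l, z_r]
        z_l, z_r = intervals[w]
        mask = np.zeros_like(q)
        mask[z_l:z_r + 1] = 1.0
        q = q * mask                         # topics outside the interval get probability 0
    return q / q.sum()                       # renormalize inside the allowed topics
```

For a word outside W_sup the interval lookup simply fails and the distribution is the usual collapsed Gibbs conditional, so the sampler runs at essentially the same cost as plain LDA.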
Note that in other applications it may be desirable to assign intersecting subsets of topics to different words, say in a context where some words are more general or have homonyms with other meanings; this is easy to do in the proposed model by assigning a specific subset of topics Z^w to each key word, not necessarily a contiguous interval. The Gibbs sampling algorithm does not change:

$$p(z_w = t \mid z_{-w}, \mathbf{w}, \alpha, \beta) \propto \begin{cases} I[Z^w](t)\,\dfrac{q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta)}{\sum_{t' \in Z^w} q(z_w, t', z_{-w}, \mathbf{w}, \alpha, \beta)}, & w \in W_{\mathrm{sup}}, \\ q(z_w, t, z_{-w}, \mathbf{w}, \alpha, \beta), & \text{otherwise.} \end{cases}$$

4 Mining Ethnic Discourse

4.1 Case study: project description

We have applied the method outlined above to a sociological project intended to study ethnic discourse in the Russian blogosphere. The project aims to analyze ethnically marked discourse and, in particular, find: (1) topics of discussion connected to ethnicity and their qualitative discursive interpretation; (2) ethnically marked social milieus or spaces; (3) ethnically marked social problems; (4) "just dangerous" ethnicities that would be surrounded by pejorative / stereotyped / fear-marked discourse without any particular reason for it evident from data mining. This study stems from constructivist research on inequalities in socio-economic development vs. ethnic diversity, ethnicities as social borders, and ethnicities as sources of moral panic [3, 13, 14]; our project is in line with current research on mediated representations of ethnicity and ethnically marked discourse [11, 17, 22, 23]. The major issues in the mediated representation of ethnicity may be conceptualized as criminalization of ethnicity, sensibilization of cultural difference and enhancement of cultural stereotypes, problematization of immigration, reinforcement of negativism in the image of ethnicities, unequal coverage of ethnic groups, labeling and boundary marking, and flawed connections of ethnicity with other major areas of social cleavage, e.g., religion.

The approach needs to be both quantitative and qualitative: we need to be able to automatically mine existing topics from a large dataset and then qualitatively interpret the results. This led us to topic modeling, and ultimately to developing ISLDA for this project. Thus, the second aim of the project is methodological, as we realize that ethnic vocabulary may not show up among the most important words in the topics, and the discursive significance of less frequent ethnonyms (like Tajik or Vietnamese) will be very low. As for the most frequent ethnonyms (in the case of the Russian blogosphere, they include American, Ukrainian, or German), our hypothesis was that they may give rise to varying numbers of discussion topics in the blogosphere, from 0 (no clear topics evident) up to 4 or 5 major topics, which are not always spotted by regular LDA.

4.2 Case study results and discussion

In this section, we present qualitative results of the case study itself and compare LDA with ISLDA. In this case study, the dataset consisted of four months of LiveJournal posts written by 2000 top bloggers. In total, there were 235,407 documents in the dataset, and the dictionary, after cleaning stopwords and low-frequency words, contained 192,614 words with about 53.5 million total instances of these words. We have performed experiments with different numbers of topics (50, 100, 200, and 400) for both regular LDA and Interval Semi-Supervised LDA; a hypothetical example of how keyword intervals might be configured for such a run is sketched below.
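For illustration only, a keyword-to-interval configuration for a run of this kind might look like the following sketch; the keyword stems, interval positions, and interval sizes shown here are hypothetical and do not reproduce the exact settings used in the study.

```python
# Hypothetical ISLDA configuration for a 400-topic run: each ethnonym stem
# (expanded to all of its word forms at preprocessing time) is pinned to a
# small block of topic indices; all remaining topics stay unsupervised.
supervised_intervals = {
    "ukrainian": (0, 3),    # 4 pre-assigned topics
    "french":    (4, 7),    # 4 pre-assigned topics
    "tajik":     (8, 9),    # 2 pre-assigned topics
    "georgian":  (10, 11),  # 2 pre-assigned topics
}
# Topics 12..399 remain free to model the rest of the corpus.
```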
Comparing regular LDA results for 100 and 400 topics, it is clear that ethnic topics need to be dug up at 400 rather than 100 topics. The share of ethnic topics was approximately the same: 9 out of 100 (9%) and 34 out of 400 (8.5%), but in terms of quality, the 100-topic run gives "too thick" topics like the Great Patriotic War, Muslims, CEE countries, the "big chess play" (great world powers and their roles in local conflicts), Russian vs. Western values, US/UK celebrities, and the East in travel (Japan, India, China, and Korea). This does not provide us with any particular hints on how various ethnicities are treated in the blogosphere. The 400-topic LDA run looks much more informative, providing topics of three kinds: event-oriented (e.g., the death of Kim Jong-il or the boycott of the Russian TV channel NTV in Lithuania), current-affairs-oriented (e.g., armed conflicts in Libya and Syria or protests in the Kazakh city of Zhanaozen), and long-term topics. The latter may be divided into "neutral" descriptions of country/historic realities (Japan, China, British Commonwealth countries, ancient Indians, etc.), long-term conflict topics (e.g., the Arab-Israeli conflict, the Serb-Albanian conflict and the Kosovo problem), and two types of "problematized" topics: internal problems of a given country/nation (e.g., the U.S.) and "Russia vs. another country/region" topics (Poland, Chechnya, Ukraine). There are several topics of particular interest for the ethnic case study: a topic on Tajiks, two opposing topics on Russian nationalism ("patriotic" and "negative"), and a Tatar topic. Several ethnicities, e.g., Americans, Germans, Russians, and Arabs, were the subject of more than one topic.

In the ISLDA results, the 100-topic model covered the same ethnic topics as regular LDA, but the Ukrainian ethnonyms produced a new result discussed below. The 400-topic ISLDA gave results much better than regular LDA. For ex-Soviet ethnicities (Tajik and Georgian), one of the two pre-assigned topics clearly showed a problematized context. For Tajiks, it was illegal migration: the word collection also showed writers from opposing opinion camps (Belkovsky, Kholmogorov, Krylov) and vocabulary characteristic of opinion media texts. For Georgians, the context of the Georgian-Ossetian conflict of 2008 clearly showed up, enriched by current events like election issues in South Ossetia. French and Ukrainian, both assigned 4 topics, showed good results. The French topics were all more or less clearly connected to distinct themes: a Mediterranean topic, Patriotic wars in Russia (with France and Germany), the current conflict in Libya, and the general history of Europe. Here we see that topics related to current affairs are easily de-aligned from long-term topics.

In general, we have found that ISLDA results have significant advantages over regular LDA. Most importantly, ISLDA finds new important topics related to the chosen semi-supervised subjects. As an example, Table 1 shows topics from our runs with 100 and 400 topics related to Ukraine. In every case, there is a strong topic related to Ukrainian politics, but then differences begin. In the 100-topic case (Table 1(a) and (c)), ISLDA distinguishes a Ukrainian nationalist topic (very important for our study) that was lost in regular LDA. With 400 topics (Table 1(b) and (d)),
LDA finds virtually the same topics, while ISLDA finds three new important topics: scandals related to Russian natural gas transmitted through Ukraine, a topic devoted to Crimea, and again the nationalist topic (this time with a Western Ukrainian spin). The same pattern appears for other ethnic subjects in the dataset: ISLDA produces more informative topics on the specified subjects.

As for the numerical evaluation of the modeling results, we have computed the held-out perplexity on two test sets of 1000 documents each; i.e., we estimated the value of

$$p(\mathbf{w} \mid D) = \int p(\mathbf{w} \mid \Phi, \alpha m)\, p(\Phi, \alpha m \mid D)\, d\alpha\, d\Phi$$

for each held-out document w and then normalized the result as

$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left(-\frac{\sum_{\mathbf{w} \in D_{\mathrm{test}}} \log p(\mathbf{w})}{\sum_{\mathbf{w} \in D_{\mathrm{test}}} N_d}\right).$$

To compute p(w | D), we used the left-to-right algorithm proposed and recommended in [24, 25]. The test sets were separate datasets of blog posts from the same set of authors and from around the same time as the main dataset; the first test set D_test contained general posts, while the second, D_test^key, was comprised of posts that contain at least one of the key words used in ISLDA. Perplexity results are shown in Table 2; it is clear that perplexity virtually does not suffer in ISLDA, and there is no difference in perplexity between the keyword-containing test set and the general test set. This indicates that ISLDA merely brings the relevant topics to the surface of the model and does not in general interfere with the model's predictive power.

For further sociological studies directed at specific issues, we recommend using ISLDA with the number of preassigned topics (interval sizes) chosen a priori larger than the expected number of relevant topics: in our experiments, we saw that extra slots are simply filled up with unrelated topics and do not deteriorate the quality of the relevant topics. However, the results begin to deteriorate if more than about 10% of all topics (e.g., 40 out of 400) are assigned to the semi-supervised part; one always needs sufficient "free space" to fill with other topics. This creates a certain tension that may be resolved with further study (see below).
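Given per-document log-likelihoods log p(w | D), estimated, e.g., with the left-to-right algorithm, the perplexity defined above reduces to a single normalization step. The following is a minimal sketch (not the authors' code) of that step.

```python
import numpy as np

def held_out_perplexity(log_likelihoods, doc_lengths):
    """perplexity(D_test) = exp(-sum_d log p(w_d) / sum_d N_d)."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))

# Toy usage with made-up numbers:
# held_out_perplexity([-3500.2, -4100.7], [300, 350])
```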
[Table 1. A comparison of LDA topics related to Ukraine: (a) LDA, 100 topics; (b) LDA, 400 topics; (c) ISLDA, 100 topics; (d) ISLDA, 400 topics. Each panel lists the top words of the Ukraine-related topics with their probabilities (e.g., Ukraine, Ukrainian, Timoshenko, Yanukovich, Kiev, gas, Europe, Crimea, Sevastopol).]

Table 2. Held-out perplexity results.

  # of topics | LDA, D_test | LDA, D_test^key | ISLDA, D_test | ISLDA, D_test^key
  100         | 12.7483     | 12.7483         | 12.7542       | 12.7542
  200         | 12.7457     | 12.7457         | 12.7485       | 12.7486
  400         | 12.6171     | 12.6172         | 12.6216       | 12.6216

5 Conclusion

In this work, we have introduced the Interval Semi-Supervised LDA model (ISLDA) as a tool for a more detailed analysis of a specific set of topics inside a larger dataset and have presented an inference algorithm for this model based on collapsed Gibbs sampling. With this tool, we have described a case study in ethnic discourse analysis on a dataset comprised of Russian LiveJournal blogs. We show that topics relevant to the subject of study do indeed improve in the ISLDA analysis, and we recommend ISLDA for further use in sociological studies of the blogosphere.

For further work, note that the approach outlined above requires the user to specify how many topics are assigned to each keyword. We have mentioned that there is a tradeoff between possibly losing interesting topics and breaking the model up by assigning too many topics to the semi-supervised part; in the current model, we can only advise to experiment until a suitable number of semi-supervised topics is found. Therefore, we propose an interesting open problem: develop a nonparametric model that chooses the number of topics in each semi-supervised cluster of topics separately and also chooses separately the rest of the topics in the model.

Acknowledgements. This work was done at the Laboratory for Internet Studies, National Research University Higher School of Economics (NRU HSE), Russia, and partially supported by the Basic Research Program of NRU HSE. The work of Sergey Nikolenko was also supported by the Russian Foundation for Basic Research grant 12-01-00450-a and the Russian Presidential Grant Programme for Young Ph.D.'s, grant no. MK-6628.2012.1.

References
1. Andrzejewski, D., Zhu, X.: Latent Dirichlet allocation with topic-in-set knowledge. In: Proc. NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, pp. 43–48. SemiSupLearn '09, Association for Computational Linguistics, Stroudsburg, PA, USA (2009)
2. Andrzejewski, D., Zhu, X., Craven, M.: Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In: Proc. 26th Annual International Conference on Machine Learning, pp. 25–32. ICML '09, ACM, New York, NY, USA (2009)
3. Barth, F.: Introduction. In: Barth, F. (ed.) Ethnic Groups and Boundaries: The social organization of culture difference, pp. 9–38. London: George Allen and Unwin (1969)
4. Blei, D.M., Jordan, M.I., Griffiths, T.L., Tenenbaum, J.B.: Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 13 (2004)
5. Blei, D.M., Lafferty, J.D.: Correlated topic models. Advances in Neural Information Processing Systems 18 (2006)
6. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. ACM, New York, NY, USA (2006), http://doi.acm.org/10.1145/1143844.1143859
7. Blei, D.M., McAuliffe, J.D.: Supervised topic models. Advances in Neural Information Processing Systems 22 (2007)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(4–5), 993–1022 (2003)
9. Chang, J., Blei, D.M.: Hierarchical relational models for document networks. Annals of Applied Statistics 4(1), 124–150 (2010)
10. Chen, X., Zhou, M., Carin, L.: The contextual focused topic model. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 96–104. ACM, New York, NY, USA (2012), http://doi.acm.org/10.1145/2339530.2339549
11. Downing, J.D.H., Husbands, C.: Representing Race: Racisms, Ethnicity and the Media. London: Sage (2005)
12. Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101 (Suppl. 1), 5228–5235 (2004)
13. Hall, S.: Ethnicity: Identity and difference. Radical America 23(4), 9–22 (1991)
14. Hechter, M.: Internal colonialism: the Celtic fringe in British national development, 1536–1966. London: Routledge & Kegan Paul (1975)
15. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 20 (2008)
16. Li, S.Z.: Markov Random Field Modeling in Image Analysis. Advances in Pattern Recognition, Springer (2009)
17. Nyamnjoh, F.B.: Africa's Media, Democracy and the Politics of Belonging. London: Zed Books (2005)
18. Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., Steyvers, M.: Learning author-topic models from text corpora. ACM Trans. Inf. Syst. 28(1), 1–38 (Jan 2010)
19. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494. AUAI Press, Arlington, Virginia, United States (2004), http://dl.acm.org/citation.cfm?id=1036843.1036902
20. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2004)
21. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related groups: Hierarchical Dirichlet processes. Advances in Neural Information Processing Systems 17, 1385–1392 (2005)
22. Voltmer, K.: The Media in Transitional Democracies. Cambridge: Polity (2013)
23. ter Wal, J. (ed.): Racism and cultural diversity in the mass media: An overview of research and examples of good practice in the EU member states, 1995–2000. Vienna: European Monitoring Centre on Racism and Xenophobia (2002)
24. Wallach, H.M.: Structured topic models for language. Ph.D. thesis, University of Cambridge (2008)
25. Wallach, H.M., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for topic models. In: Proceedings of the 26th International Conference on Machine Learning, pp. 1105–1112. ACM, New York, NY, USA (2009), http://doi.acm.org/10.1145/1553374.1553515
26. Wang, C., Blei, D.M., Heckerman, D.: Continuous time dynamic topic models. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (2008), http://uai2008.cs.helsinki.fi/UAI_camera_ready/wang.pdf
27. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 424–433. ACM, New York, NY, USA (2006), http://doi.acm.org/10.1145/1150402.1150450
28. Williamson, S., Wang, C., Heller, K.A., Blei, D.M.: The IBP compound Dirichlet process and its application to focused topic modeling. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1151–1158 (2010)