This book constitutes the proceedings of the 34th European Conference on IR Research, ECIR 2012, held in Barcelona, Spain, in April 2012. The 37 full papers, 28 poster papers and 7 demonstrations presented in this volume were carefully reviewed and selected from 167 submissions. The contributions are organized in sections named: query representation; blogs and online-community search; semi-structured retrieval; evaluation; applications; retrieval models; image and video retrieval; text and content classification, categorisation, ...
Retrieving entities instead of just documents has become an important task for search engines. In this paper we study entity retrieval for news applications, and in particular the importance of the news trail history (i.e., past related articles) in determining the relevant entities in current articles. This is an important problem in applications that display retrieved entities to the user, together with the news article.
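As an illustration of the kind of scoring this setting calls for, the sketch below combines an entity's salience in the current article with its frequency in the news trail. It is a minimal, hypothetical example, not the paper's model; the weighting parameter `alpha` and the function names are assumptions.

```python
# A minimal sketch (not the paper's actual model) of scoring entities in a
# news article by combining in-article salience with the news trail history,
# i.e. how often each entity appeared in past related articles.
from collections import Counter

def score_entities(article_entities, trail_articles, alpha=0.5):
    """article_entities: entity mentions in the current article.
    trail_articles: list of entity lists, one per past related article.
    alpha: hypothetical weight balancing the present article vs. history."""
    present = Counter(article_entities)
    history = Counter(e for art in trail_articles for e in art)
    n_trail = max(len(trail_articles), 1)
    scores = {}
    for e, tf in present.items():
        # Normalised in-article frequency plus normalised trail frequency.
        scores[e] = (alpha * tf / len(article_entities)
                     + (1 - alpha) * history[e] / n_trail)
    return sorted(scores.items(), key=lambda kv: -kv[1])

article = ["Obama", "Merkel", "Obama"]
trail = [["Obama", "EU"], ["Obama"]]
print(score_entities(article, trail))  # Obama ranks first
```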
The amount of textual information available in electronic form is growing very rapidly, creating a new need for tools capable of exploiting this information. In this work we explore an alternative approach to processing textual information, through the application of dynamic machine learning models. These models allow us to address many text analysis tasks within a single formalism. We first introduce the different disciplines of processing ...
In this paper we report the experiments for the CLEF 2009 Robust-WSD task, both for the monolingual (English) and the bilingual (Spanish to English) subtasks. Our main experimentation strategy consisted of expanding and translating the documents based on their related concepts. For that purpose we applied a state-of-the-art semantic relatedness method based on WordNet. The relatedness measure was used with and without WSD information. Even though we obtained positive results on our training and development datasets, we did not manage to improve over the baseline in the monolingual case. The improvement over the baseline in the bilingual case is marginal. We plan to work further on this technique, which has attained positive results in the passage retrieval for question answering task at CLEF (ResPubliQA).
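For illustration, the sketch below performs a crude WordNet-based term expansion in the spirit described above. The paper uses a state-of-the-art relatedness method; here plain synset and hypernym lemmas from NLTK stand in, purely as an assumed simplification.

```python
# A rough sketch of WordNet-based document expansion: for each term, collect
# lemmas of its synsets and their direct hypernyms as expansion candidates.
# This is an illustrative stand-in, not the relatedness method of the paper.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_terms(terms, max_per_term=3):
    expansions = {}
    for term in terms:
        related = set()
        for syn in wn.synsets(term):
            # The synset itself plus its direct hypernyms supply candidates.
            for s in [syn] + syn.hypernyms():
                related.update(l.name().replace('_', ' ') for l in s.lemmas())
        related.discard(term)
        expansions[term] = sorted(related)[:max_per_term]
    return expansions

print(expand_terms(["retrieval"]))  # e.g. {'retrieval': ['access', ...]}
```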
Abstract: The growth of the Web is currently prompting several computer science communities to work on information access, and in particular on access to textual information. For some years the machine learning community has been interested in the analysis of textual information, with a view to automating processing for a range of tasks from information retrieval to information extraction. In this text we present, on the one hand, the work carried out in information retrieval and extraction in order to introduce ...
In this paper we describe the Time Explorer, an application designed for analyzing how news changes over time. We extend current time-based systems in several important ways. First, Time Explorer is designed to help users discover how entities such as people and locations associated with a query change over time. Second, by searching on time expressions extracted automatically from text, the application allows the user to explore not only how topics evolved in the past, but also how they will continue to evolve in the future. Finally, Time Explorer is designed around an intuitive interface that allows users to interact with time and entities in a powerful way. While aspects of these features can be found in other systems, they are combined in Time Explorer in a way that allows searching through time in no time at all.
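The core indexing idea can be pictured with a toy sketch: extract explicit time expressions from text and index documents under the dates they mention, so that both past and future references become searchable. Real systems use full temporal taggers; the year regex below is a deliberate simplification.

```python
# A toy sketch of indexing documents by the dates they mention, so that
# retrieval by time (including future references) becomes possible.
# A simple year regex stands in for a real temporal tagger.
import re
from collections import defaultdict

YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

def build_time_index(docs):
    """docs: {doc_id: text}. Returns {year: set of doc_ids mentioning it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for year in YEAR.findall(text):
            index[int(year)].add(doc_id)
    return index

docs = {1: "The treaty signed in 1998 will be reviewed in 2030."}
idx = build_time_index(docs)
assert idx[2030] == {1}  # the future reference is searchable too
```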
Proceedings of the 19th international conference on World wide web - WWW '10, 2010
A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naïve approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.
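One simple selective-invalidation strategy (a baseline sketch, not the paper's predictor) is to drop only those cached entries whose stored results contain a changed document:

```python
# A minimal sketch of selective cache invalidation: instead of flushing the
# whole result cache on every index update, invalidate only queries whose
# cached results contain a document that changed.
def invalidate(cache, updated_doc_ids):
    """cache: {query: list of result doc ids}. Mutates cache in place and
    returns the queries that were invalidated."""
    stale = [q for q, results in cache.items()
             if updated_doc_ids.intersection(results)]
    for q in stale:
        del cache[q]
    return stale

cache = {"ecir 2012": [3, 7], "bm25 tuning": [5, 9]}
print(invalidate(cache, {7}))  # ['ecir 2012'] -- only one entry dropped
```

Note that this heuristic misses queries whose results would change because a new or updated document should now enter them; predicting those cases accurately is precisely the hard part of the problem.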
In this paper, we propose a method to create aggregated representations of the information needs of Web users when searching for particular types of objects. We suggest this method as a way to investigate the gap between what Web search users expect to find and the kind of information provided by Semantic Web datasets formatted according to a particular ontology. We evaluate our method quantitatively by measuring its power as a query completion mechanism. Last, we perform a qualitative evaluation comparing the information Web users search for with the information available in DBpedia, the structured data representation of Wikipedia.
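The aggregation step can be pictured with a small sketch: strip instance names of one object type out of log queries and count the residual words, yielding a profile of what users ask about that type. The log format and names here are illustrative assumptions, not the paper's.

```python
# A small sketch of aggregating the terms users attach to instances of one
# object type in a query log; the resulting profile can drive completion.
from collections import Counter

def type_profile(query_log, instances):
    """query_log: iterable of query strings; instances: names of objects of
    one type (e.g. film titles). Counts the residual words around them."""
    profile = Counter()
    for q in query_log:
        for name in instances:
            if name in q:
                residual = q.replace(name, "").split()
                profile.update(residual)
    return profile

log = ["casablanca cast", "casablanca review", "vertigo cast"]
print(type_profile(log, ["casablanca", "vertigo"]).most_common(2))
# [('cast', 2), ('review', 1)] -- 'cast' is what users most ask of films
```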
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006
Optimising the parameters of ranking functions with respect to standard IR rank-dependent cost functions has eluded satisfactory analytical treatment. We build on recent advances in alternative differentiable pairwise cost functions, and show that these techniques can be successfully applied to tuning the parameters of an existing family of IR scoring functions (BM25), in the sense that we cannot do better using sensible search heuristics that directly optimize the rank-based cost function NDCG. We also demonstrate how the size of the training set affects the number of parameters we can hope to tune this way.
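A compact sketch of the approach: tune BM25's free parameters (k1 and b) by descending a differentiable pairwise logistic loss rather than the non-smooth NDCG itself. The toy data layout, fixed constants, and finite-difference gradients below are illustrative assumptions.

```python
# Tune BM25's (k1, b) by gradient descent on a differentiable pairwise
# logistic loss, a stand-in for the family of costs discussed above.
import math

def bm25(tf, dl, avgdl, idf, k1, b):
    denom = tf + k1 * (1 - b + b * dl / avgdl)
    return idf * tf * (k1 + 1) / denom

def pairwise_loss(params, pairs, avgdl=100.0, idf=2.0):
    # pairs: ((tf, dl) of a relevant doc, (tf, dl) of a non-relevant doc);
    # avgdl and idf are fixed constants purely for illustration.
    k1, b = params
    loss = 0.0
    for (tf_r, dl_r), (tf_n, dl_n) in pairs:
        margin = (bm25(tf_r, dl_r, avgdl, idf, k1, b)
                  - bm25(tf_n, dl_n, avgdl, idf, k1, b))
        loss += math.log1p(math.exp(-margin))  # logistic pairwise cost
    return loss

def tune(pairs, k1=1.2, b=0.75, lr=0.1, steps=200, eps=1e-4):
    for _ in range(steps):
        # Finite-difference gradients keep the sketch dependency-free.
        g_k1 = (pairwise_loss((k1 + eps, b), pairs)
                - pairwise_loss((k1 - eps, b), pairs)) / (2 * eps)
        g_b = (pairwise_loss((k1, b + eps), pairs)
               - pairwise_loss((k1, b - eps), pairs)) / (2 * eps)
        k1 = k1 - lr * g_k1
        b = min(max(b - lr * g_b, 0.0), 1.0)  # keep b in its valid range
    return k1, b

pairs = [((5, 90), (1, 120)), ((3, 80), (2, 200))]  # toy (tf, dl) pairs
print(tune(pairs))  # k1, b drift from their defaults to fit the pairs
```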
All our submissions from the Microsoft Research Cambridge (MSRC) team this year continue to explore issues in IR from a perspective very close to that of the original Okapi team, working first at City University of London, and then at MSRC.
The following report summarizes the highlights of the first workshop on exploiting semantic annotations in information retrieval (ESAIR'08). The workshop format included paper and demo presentations as well as breakout sessions and a panel discussion.
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM '07, 2007
We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches to this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.
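Approach i) can be sketched as score propagation over the containment graph: each entity accumulates the retrieval scores of the passages that contain it. A hypothetical minimal version:

```python
# A bare-bones sketch of scoring entities by propagating passage-retrieval
# scores over the entity containment graph (an edge links a passage to each
# entity it contains). Purely illustrative.
from collections import defaultdict

def rank_entities(passage_scores, containment):
    """passage_scores: {passage_id: retrieval score for the query}.
    containment: {passage_id: set of entities occurring in it}."""
    entity_score = defaultdict(float)
    for pid, score in passage_scores.items():
        for entity in containment.get(pid, ()):
            entity_score[entity] += score  # evidence accumulates per entity
    return sorted(entity_score.items(), key=lambda kv: -kv[1])

scores = {"p1": 2.1, "p2": 0.7}
graph = {"p1": {"Barcelona", "Gaudí"}, "p2": {"Barcelona"}}
print(rank_entities(scores, graph))  # Barcelona outranks Gaudí
```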
Neural Networks are commonly used in classification and decision tasks. In this paper, we focus on the problem of the local confidence of their results. We review some notions from statistical decision theory that offer an insight on the determination and use of confidence measures for classification with Neural Networks. We then present an overview of the existing confidence measures and finally propose a simple measure which combines the benefits of the probabilistic interpretation of network outputs and the estimation of the quality of the model by bootstrap error estimation. We discuss empirical results on a real-world application and an artificial problem and show that the simplest measure often behaves better than more sophisticated ones, but may be dangerous in certain situations.
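The proposed combination can be pictured schematically: take the network's probabilistic output as the base confidence and discount it by the disagreement among bootstrap replicas of the model. The additive penalty form below is an assumption for illustration, not the paper's exact formula.

```python
# A schematic sketch of a combined confidence measure: the mean probabilistic
# output of bootstrap replicas, discounted by their disagreement.
import statistics

def confidence(x, bootstrap_models, penalty=1.0):
    """bootstrap_models: callables returning P(class|x), each trained on a
    different bootstrap resample of the data. penalty is an assumed weight."""
    probs = [m(x) for m in bootstrap_models]
    mean_p = statistics.fmean(probs)
    spread = statistics.pstdev(probs)  # model-uncertainty estimate
    return mean_p - penalty * spread   # confident only if replicas agree

models = [lambda x: 0.91, lambda x: 0.88, lambda x: 0.90]
print(round(confidence(None, models), 3))  # high agreement -> high score
```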
With the rapid increase of textual information available electronically, there is an acute need for automatic textual analysis tools. Two communities have dealt with the problem of automatic textual analysis: information retrieval (IR) and information extraction (IE). Information retrieval has been very successful at the “document level”: locating, categorizing and filtering entire documents from large corpora, etc. Unfortunately, it is very difficult to extend the information retrieval paradigm so as to realize more complex tasks such as ...
Slide excerpt (only the concluding claim is recoverable): “… without decreasing NDCG or Average Precision. Therefore, these measures must share local maxima!” [Robertson S., Zaragoza H., MSR-TR-2006-61]
We describe a large-scale real-world application of neural networks for modelling heat radiation emitted by a source and observed through the atmosphere. For this problem, thousands of regressors need to be trained and incorporated into a single model of the process. In such large-scale applications, standard techniques for the control of complexity are impossible to implement. We investigate the benefits of i) integrating several regressors into a single neural network, and ii) refining the learned functions by optimizing all regressors simultaneously over a global function. The two approaches described offer a solution to these problems, and were crucial for the development of a fast and accurate model of radiation intensity.
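Idea i) can be sketched as a shared-trunk network with one output head per regressor, which also makes the joint refinement of ii) a single backward pass over a global objective. Sizes, layer choices, and the use of PyTorch below are assumptions for illustration.

```python
# A schematic sketch: fold many related regressors into one network with a
# shared trunk and one output per regressor, so all of them can be refined
# jointly against a global objective.
import torch
import torch.nn as nn

class MultiHeadRegressor(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_heads):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Tanh())
        self.heads = nn.Linear(n_hidden, n_heads)  # one output per regressor

    def forward(self, x):
        return self.heads(self.trunk(x))

model = MultiHeadRegressor(n_inputs=4, n_hidden=32, n_heads=1000)
x = torch.randn(8, 4)
y = model(x)            # (8, 1000): all regressors evaluated in one pass
loss = y.pow(2).mean()  # stand-in for a global objective over all heads
loss.backward()         # joint refinement of every regressor at once
```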