Gareth Jones

Papers by Gareth Jones

Enhanced Information Retrieval by Exploiting Recommender Techniques in Cluster-Based Link Analysis

Proceedings of the 2013 Conference on the Theory of Information Retrieval

Inspired by the use of PageRank algorithms in document ranking, we develop and evaluate a cluster-based PageRank algorithm to re-rank information retrieval (IR) output with the objective of improving ad hoc search effectiveness. Unlike existing work, our methods exploit recommender techniques to extract the correlation between documents and apply detected correlations in a cluster-based PageRank algorithm to compute the importance of each document in a dataset. In this study two popular recommender techniques are examined in four proposed PageRank models to investigate the effectiveness of our approach. Comparison of our methods with strong baselines demonstrates the solid performance of our approach. Experimental results are reported on an extended version of the FIRE 2011 personal information retrieval (PIR) data collection which includes topically related queries with click-through data and relevance assessment data collected from the query creators. The search logs of the query creators are categorized based on their different topical interests. The experimental results show the significant improvement of our approach compared to results using standard IR and cluster-based PageRank methods.

Evaluating Professional Search: A German Construction Law Use Case

Forum for Information Retrieval Evaluation, 2020

We present a real-world case study for the evaluation of professional search focusing on German construction law. Reliable identification of relevant previous cases is an important part of many legal disputes, and currently relies on domain expertise acquired over a lengthy professional career. We describe our experiences from the development of a Cranfield-type test collection for a German construction law dataset to enable research into the development of search technologies for new tools which are less dependent on expert knowledge. We describe examination of the search needs of lawyers, the development of a set of search queries created by lawyers, and our experiences in collecting expert relevance data for the completion of a test collection for legal search. Important findings of this latter process are the need for individuals with expert legal training to assess relevance, and the identification of context dependence in determining relevance. While the cost of the development of this test collection was found to be very high, we demonstrate its value in terms of identifying the effectiveness of legal search methods and in identifying research directions for legal case search.

Word-Node2Vec: Improving Word Embedding with Document-Level Non-Local Word Co-occurrences

Proceedings of the 2019 Conference of the North, 2019

A standard word embedding algorithm, such as word2vec or GloVe, makes a strong assumption that words are likely to be semantically related only if they co-occur locally within a window of fixed size. However, this strong assumption may not capture the semantic association between words that co-occur frequently but non-locally within documents. In this paper, we propose a graph-based word embedding method, named ‘word-node2vec’. By relaxing the strong constraint of locality, our method is able to capture both local and non-local co-occurrences. Word-node2vec constructs a graph where every node represents a word and an edge between two nodes represents a combination of both local (e.g. word2vec) and document-level co-occurrences. Our experiments show that word-node2vec outperforms word2vec and GloVe on a range of different tasks, such as predicting word-pair similarity, word analogy and concept categorization.

Overview of the CLEF 2019 Personalised Information Retrieval Lab (PIR-CLEF 2019)

Lecture Notes in Computer Science, 2019

At CLEF 2018, the Personalised Information Retrieval Lab (PIR-CLEF 2018) was conceived as an initiative aimed at both providing and critically analysing a new approach to the evaluation of personalization in Information Retrieval (PIR). PIR-CLEF 2018 is the first edition of this Lab after the successful Pilot lab organised at CLEF 2017. PIR-CLEF 2018 has provided registered participants with the data sets originally developed for the PIR-CLEF 2017 Pilot task; the data collected are related to real search sessions over a subset of the ClueWeb12 collection, undertaken by 10 users by using a novel methodology. The data were gathered during the search sessions undertaken by 10 volunteer searchers. Activities during these search sessions included relevance assessment of retrieved documents by the searchers. 16 groups registered to participate at PIR-CLEF 2018 and were provided with the data set to allow them to work on PIR related tasks and to provide feedback about our proposed PIR evaluation methodology with the aim of creating an effective evaluation task.

Query Expansion for Sentence Retrieval Using Pseudo Relevance Feedback and Word Embedding

Lecture Notes in Computer Science, 2017

Investigating segment-based query expansion for user-generated spoken content retrieval

2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Jun 1, 2016

The very rapid growth in user-generated social multimedia content on online platforms is creating new challenges for search technologies. A significant issue for search of this type of content is its highly variable form and quality. This is compounded by the standard information retrieval (IR) problem of mismatch between search queries and target items. Query Expansion (QE) has been shown to be an effective technique to improve IR effectiveness for multiple search tasks. In QE, words from a number of relevant or assumed relevant top ranked documents from an initial search are added to the initial search query to enrich it before carrying out a further search operation. In this work, we investigate the application of QE methods for searching social multimedia content. In particular we focus on social multimedia content where the information is primarily in the audio stream. To address the challenge of content variability, we introduce three speech segment-based methods for QE using: Semantic segmentation, Discourse segmentation and Window-Based. Our experimental investigation illustrates the superiority of these segment-based methods in comparison to a standard full document QE method for a version of the MediaEval 2012 Search task newly extended as an ad hoc search task.
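
The general QE step described in this abstract (adding words from top ranked documents to the initial query) can be sketched as follows. This is a minimal illustration only: the function name `expand_query`, the whitespace tokenisation and the choice of raw term frequency for selecting expansion terms are all assumptions, and the segment-based methods of the paper are not reproduced here.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=3, num_terms=2):
    """Append the most frequent new terms from the top-ranked documents.

    Hypothetical helper: term selection by raw frequency is an assumption,
    not the selection method used in the paper.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only terms that are not already part of the query.
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    candidates = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in candidates[:num_terms]]


docs = ["jazz piano solo", "jazz piano trio", "jazz guitar"]
expanded = expand_query(["jazz"], docs, top_docs=2, num_terms=1)
```

The expanded query would then be submitted for a second retrieval pass.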

An analysis of evaluation campaigns in ad-hoc medical information retrieval: CLEF eHealth 2013 and 2014

Information Retrieval Journal, 2018

Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction

Information Processing & Management, 2019

The study of query performance prediction (QPP) in information retrieval (IR) aims to predict retrieval effectiveness. The specificity of the underlying information need of a query often determines how effectively a search engine can retrieve relevant documents at top ranks. The presence of ambiguous terms makes a query less specific to the sought information need, which in turn may degrade IR effectiveness. In this paper, we propose a novel word embedding based pre-retrieval feature which measures the ambiguity of each query term by estimating how many 'senses' each word is associated with. Assuming each sense roughly corresponds to a Gaussian mixture component, our proposed generative model first estimates a Gaussian mixture model (GMM) from the word vectors that are most similar to the given query terms. We then use the posterior probabilities of generating the query terms themselves from this estimated GMM in order to quantify the ambiguity of the query. Previous studies have shown that post-retrieval QPP approaches often outperform pre-retrieval ones because they use additional information from the top ranked documents. To achieve the best of both worlds, we formalize a linear combination of our proposed GMM based pre-retrieval predictor with NQC, a state-of-the-art post-retrieval QPP. Our experiments on the TREC benchmark news and web collections demonstrate that our proposed hybrid QPP approach (in linear combination with NQC) significantly outperforms a range of other existing pre-retrieval approaches in combination with NQC used as baselines.

Proceedings of the Forum for Information Retrieval Evaluation on - FIRE '14, 2015

We describe the participation of Dublin City University (DCU) in the FIRE-2014 transliteration search task (TST). The TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document in the collection is either written in the native Devanagari script or transliterated in Roman script or a combination of both. The queries can be in mixed script as well. The task is challenging primarily because of the vocabulary mismatch which may arise due to the multiple transliteration alternatives. We attempt to address the vocabulary mismatch problem both during the indexing and retrieval stages. During indexing, we apply a rule-based normalization on some character sequences of the transliterated words in order to have a single representation in the index for the multiple transliteration alternatives. During the retrieval phase, we make use of prefix matched fuzzy query terms to account for the morphological variations of the transliterated words. The results show significant improvement over a standard baseline query likelihood language modelling (LM) approach. Additionally, we also apply statistical machine transliteration to train a transliteration model in order to predict the transliteration of out-of-vocabulary words. Surprisingly, even with satisfactory transliteration accuracy, we found that automatic transliteration of query terms degraded retrieval effectiveness.

External Query Reformulation for Text-Based Image Retrieval

Lecture Notes in Computer Science, 2011

In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo relevance feedback (PRF) which updates user queries by adding feedback terms selected automatically from top ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval since images often only have short metadata annotations leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in the user query ("definition documents") are first identified by title matching between the query and titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores rated by different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods.
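
As a rough illustration of the Jaccard coefficient mentioned in this abstract, the sketch below scores feedback documents by their term overlap with a single definition document. The helper names `jaccard` and `reweight`, and the representation of documents as term lists, are assumptions made for this example; how the paper combines these weights with other indicator scores is not shown.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Two empty sets have no defined overlap; return 0.0 by convention here.
    return len(set_a & set_b) / len(union) if union else 0.0


def reweight(feedback_docs, definition_terms):
    """Map each feedback document id to its Jaccard overlap with a definition document.

    Hypothetical helper: feedback_docs is a dict of doc id -> list of terms.
    """
    return {doc_id: jaccard(terms, definition_terms)
            for doc_id, terms in feedback_docs.items()}


weights = reweight(
    {"d1": ["red", "bridge", "river"], "d2": ["blue", "car"]},
    ["bridge", "river", "city"],
)
```

Here `d1` shares two of four distinct terms with the definition document, while `d2` shares none.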

Enhanced Information Retrieval Using Domain-Specific Recommender Models

Lecture Notes in Computer Science, 2011

The objective of an information retrieval (IR) system is to retrieve relevant items which meet a user information need. There is currently significant interest in personalized IR which seeks to improve IR effectiveness by incorporating a model of the user's interests. However, in some situations there may be no opportunity to learn about the interests of a specific user on a certain topic. In our work, we propose an IR approach which combines a recommender algorithm with IR methods to improve retrieval for domains where the system has no opportunity to learn prior information about the user's knowledge of a domain for which they have not previously entered a query. We use search data from other previous users interested in the same topic to build a recommender model for this topic. When a user enters a query on a topic, new to this user, an appropriate recommender model is selected and used to predict a ranking which the user may find interesting based on the behaviour of previous users with similar queries. The recommender output is integrated with a standard IR method in a weighted linear combination to provide a final result for the user. Experiments using the INEX 2009 data collection with a simulated recommender training set show that our approach can improve on a baseline IR system.
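
A weighted linear combination of the kind mentioned in this abstract could look like the sketch below. The function name `combine_scores`, the weight `alpha`, and the assumption that both score sets are already normalised to comparable ranges are illustrative choices, not details taken from the paper.

```python
def combine_scores(ir_scores, rec_scores, alpha=0.5):
    """Rank documents by alpha * IR score + (1 - alpha) * recommender score.

    Hypothetical helper: a document missing from one source scores 0.0 there,
    and both inputs are assumed to be normalised to the same range.
    """
    docs = set(ir_scores) | set(rec_scores)
    combined = {
        d: alpha * ir_scores.get(d, 0.0) + (1 - alpha) * rec_scores.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first; document id breaks ties deterministically.
    return sorted(combined, key=lambda d: (-combined[d], d))


ranking = combine_scores({"d1": 0.9, "d2": 0.4}, {"d2": 1.0, "d3": 0.5})
```

With `alpha=0.5`, `d2` moves ahead of `d1` because the recommender score lifts it above the IR-only ranking.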

An Affect-Based Video Retrieval System with Open Vocabulary Querying

Lecture Notes in Computer Science, 2011

Content-based video retrieval systems (CBVR) are creating new search and browse capabilities using metadata describing significant features of the data. An often overlooked aspect of human interpretation of multimedia data is the affective dimension. Incorporating affective information into multimedia metadata can potentially enable search using this alternative interpretation of multimedia content. Recent work has described methods to automatically assign affective labels to multimedia data using various approaches. However, the subjective and imprecise nature of affective labels makes it difficult to bridge the semantic gap between system-detected labels and user expression of information requirements in multimedia retrieval. We present a novel affect-based video retrieval system incorporating an open-vocabulary query stage based on WordNet enabling search using an unrestricted query vocabulary. The system performs automatic annotation of video data with labels of well-defined affective terms. In retrieval, annotated documents are ranked using the standard Okapi retrieval model based on open-vocabulary text queries. We present experimental results examining the behaviour of the system for retrieval of a collection of automatically annotated feature films of different genres. Our results indicate that affective annotation can potentially provide useful augmentation to more traditional objective content description in multimedia retrieval.

A study on query expansion methods for patent retrieval

Proceedings of the 4th workshop on Patent information retrieval, 2011

Patent retrieval is a recall-oriented search task where the objective is to find all possible relevant documents. Queries in patent retrieval are typically very long since they take the form of a patent claim or even a full patent application in the case of prior-art patent search. Nevertheless, there is generally a significant mismatch between the query and the relevant documents, often leading to low retrieval effectiveness. Some previous work has tried to address this mismatch through the application of query expansion (QE) techniques which have generally shown effectiveness for many other retrieval tasks. However, results of QE on patent search have been found to be very disappointing. We present a review of previous investigations of QE in patent retrieval, and explore some of these techniques on a prior-art patent search task. In addition, a novel method for QE using automatically generated synonym sets is presented. While previous QE techniques fail to improve over baseline retrieval, our new approach shows statistically better retrieval precision over the baseline, although not for recall. In addition, it proves to be significantly more efficient than existing techniques. An extensive analysis of the results is presented which seeks to better understand situations where these QE techniques succeed or fail.

Applying Query Formulation and Fusion Techniques For Cross Language News Story Search

Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation, 2013

Cross Language News Story Search (CLNSS) is concerned with finding documents describing the same events in documents in different languages. As well as supporting information retrieval (IR), CLNSS has other applications in mining parallel and comparable data across different languages. In this paper, we present an overview of the work carried out for our participation in the Cross Language !ndian News Story Search (CL!NSS) task at FIRE 2013. In the CL!NSS task we explored the problem of cross language news search for the English-Hindi language pair. English news stories are used as queries to seek similar news documents from Hindi news articles. Hindi, being a resource-scarce language, offers many challenges for retrieving relevant news articles. We investigate and contrast translation of input queries from English to Hindi using the Google and Bing translation services. To support translation of out-of-vocabulary words we use the Google transliteration service. A key challenge of the CL!NSS task is the formation of search queries from the English news articles, since they are much longer than the queries typically used in IR applications. To address this problem, we explore the use of summarization to extract a query from the input news documents, and use these summarized queries as the input to the cross language IR system. We explore the use of query expansion using pseudo relevance feedback (PRF) in the IR process, since this has been shown to be effective for cross language IR in many previous investigations. We also explore in detail the use of data fusion techniques over different sets of retrieved results obtained using diverse query formulation techniques. For the CL!NSS task our team submitted 3 main runs. Our best run was ranked first among official submissions based on NDCG@5 and NDCG@10 values and second for NDCG@1 values. For the 25 test queries the results of our best main run were NDCG@1 0.7400, NDCG@5 0.6809 and NDCG@10 0.7268. We present our methodology, official results and results of a number of post-task experiments that were conducted to further examine the cross language search problem. Our experiments reveal that query formu

An Evaluation and Analysis of Incorporating Term Dependency for Ad-Hoc Retrieval

Lecture Notes in Computer Science

Dublin City University at CLEF 2006: Experiments for the ImageCLEF Photo Collection Standard Ad Hoc Task

Lecture Notes in Computer Science, 2007

For the CLEF 2006 Cross Language Image Retrieval (ImageCLEF) Photo Collection Standard Ad Hoc task, DCU performed monolingual and cross language retrieval using photo annotations with and without feedback, and also a combined visual and text retrieval approach. Topics are translated into English using the Babelfish online machine translation system. Text runs used the BM25 algorithm, while the visual approach used simple low-level features with matching based on the Jeffrey Divergence measure. Our results consistently indicate that the fusion of text and visual features is best for this task, and that performing feedback for text consistently improves on the baseline non-feedback BM25 text runs for all language pairs.

Cross-Lingual Topical Relevance Models

Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, to directly estimate a model of relevant documents in the target language starting with a query in the source language. However, CLRLM involves integrating a translation model either on the document side if a parallel corpus is available, or on the query side if a bilingual dictionary is available. For low resourced language pairs, large parallel corpora do not exist and the vocabulary coverage of dictionaries is small, as a result of which RLM-based CLIR fails to obtain satisfactory results. Despite the lack of parallel resources for a majority of language pairs, the availability of comparable corpora for many languages has grown considerably in recent years. Existing CLIR techniques, such as cross-lingual relevance models, cannot effectively utilize these comparable corpora, since they do not use information from documents in the source language. We overcome this limitation by using information from retrieved documents in the source language to improve the retrieval quality of the target language documents. More precisely, our model involves a two step approach of first retrieving documents both in the source language and the target language (using query translation), and then improving on the retrieval quality of target language documents by expanding the query with translations of words extracted from the top ranked documents retrieved in the source language which are thematically related (i.e. share the same concept) to the words in the top ranked target language documents. Our key hypothesis is that the query in the source language and its equivalent target language translation retrieve documents which share topics. The overlapping topics of these top ranked documents in both languages are then used to improve the ranking of the target language documents. Since the model relies on the alignment of topics between language pairs, we call it the cross-lingual topical relevance model (CLTRLM). Experimental results show that the CLTRLM significantly outperforms the standard CLRLM by up to 37% on English-Bengali CLIR, achieving mean average precision (MAP) of up to 60.27% of the Bengali monolingual IR MAP.

Multilingual Adaptive Search for Digital Libraries

Lecture Notes in Computer Science, 2011

We describe a framework for Adaptive Multilingual Information Retrieval (AMIR) which allows multilingual resource discovery and delivery using on-the-fly machine translation of documents and queries. Result documents are presented to the user in a contextualised manner. Challenges and affordances of both adaptive and multilingual IR, with a particular focus on digital libraries, are detailed. The framework components are motivated by a series of results from experiments on query logs and documents from The European Library. We conclude that factoring adaptivity and multilinguality aspects into the search process can enhance the user's experience with online digital libraries.

An LDA-smoothed relevance model for document expansion

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013

Document expansion (DE) in information retrieval (IR) involves modifying each document in the collection by introducing additional terms into the document. It is particularly useful to improve retrieval of short and noisy documents where the additional terms can improve the description of the document content. Existing approaches to DE assume that documents to be expanded are from a single topic. In the case of multi-topic documents this can lead to a topic bias in terms selected for DE and hence may result in poor retrieval quality due to the lack of coverage of the original document topics in the expanded document. This paper proposes a new DE technique providing a more uniform selection and weighting of DE terms from all constituent topics. We show that our proposed method significantly outperforms the most recently reported relevance model based DE method on a spoken document retrieval task for both manual and automatic speech recognition transcripts.

Overview of the Personalized and Collaborative Information Retrieval (PIR) Track at FIRE-2011

Lecture Notes in Computer Science, 2013

Enhanced Information Retrieval by Exploiting Recommender Techniques in Cluster-Based Link Analysis

Proceedings of the 2013 Conference on the Theory of Information Retrieval

Inspired by the use of PageRank algorithms in document ranking, we develop and evaluate a cluster... more Inspired by the use of PageRank algorithms in document ranking, we develop and evaluate a cluster-based PageRank algorithm to re-rank information retrieval (IR) output with the objective of improving ad hoc search effectiveness. Unlike existing work, our methods exploit recommender techniques to extract the correlation between documents and apply detected correlations in a cluster-based PageRank algorithm to compute the importance of each document in a dataset. In this study two popular recommender techniques are examined in four proposed PageRank models to investigate the effectiveness of our approach. Comparison of our methods with strong baselines demonstrates the solid performance of our approach. Experimental results are reported on an extended version of the FIRE 2011 personal information retrieval (PIR) data collection which includes topically related queries with click-through data and relevance assessment data collected from the query creators. The search logs of the query creators are categorized based on their different topical interests. The experimental results show the significant improvement of our approach compared to results using standard IR and cluster-based PageRank methods.

Evaluating Professional Search: A German Construction Law Use Case

Forum for Information Retrieval Evaluation, 2020

We present a real world case study for the evaluation of professional search focusing on German c... more We present a real world case study for the evaluation of professional search focusing on German construction law. Reliable identification of relevant previous cases is an important part of many legal disputes, and currently relies on domain expertise acquired over a lengthy professional career. We describe our experiences from the development of a Cranfield type test collection for a German construction law dataset to enable research into the development of search technologies for new tools which are less dependent on expert knowledge. We describe examination of the search needs of lawyers, the development of a set of search queries created by lawyers, and our experiences in collecting expert relevance data for the completion of a test collection for legal search. Important findings of this latter process are the need for individuals with expert legal training to assess relevance, and the identification of context dependence in determining relevance. While the cost of the development of this test collection was found to be very high, we demonstrate its value in terms of identifying the effectiveness of legal search methods and in identifying research directions for legal case search.

Word-Node2Vec: Improving Word Embedding with Document-Level Non-Local Word Co-occurrences

Proceedings of the 2019 Conference of the North, 2019

A standard word embedding algorithm, such as word2vec and glove, makes a strong assumption that w... more A standard word embedding algorithm, such as word2vec and glove, makes a strong assumption that words are likely to be semantically related only if they co-occur locally within a window of fixed size. However, this strong assumption may not capture the semantic association between words that co-occur frequently but non-locally within documents. In this paper, we propose a graph-based word embedding method, named ‘word-node2vec’. By relaxing the strong constraint of locality, our method is able to capture both the local and non-local co-occurrences. Word-node2vec constructs a graph where every node represents a word and an edge between two nodes represents a combination of both local (e.g. word2vec) and document-level co-occurrences. Our experiments show that word-node2vec outperforms word2vec and glove on a range of different tasks, such as predicting word-pair similarity, word analogy and concept categorization.

Overview of the CLEF 2019 Personalised Information Retrieval Lab (PIR-CLEF 2019)

Lecture Notes in Computer Science, 2019

At CLEF 2018, the Personalised Information Retrieval Lab (PIR-CLEF 2018) has been conceived to pr... more At CLEF 2018, the Personalised Information Retrieval Lab (PIR-CLEF 2018) has been conceived to provide an initiative aimed at both providing and critically analysing a new approach to the evaluation of personalization in Information Retrieval (PIR). PIR-CLEF 2018 is the first edition of this Lab after the successful Pilot lab organised at CLEF 2017. PIR CLEF 2018 has provided registered participants with the data sets originally developed for the PIR-CLEF 2017 Pilot task; the data collected are related to real search sessions over a subset of the ClueWeb12 collection, undertaken by 10 users by using a novel methodology. The data were gathered during the search sessions undertaken by 10 volunteer searchers. Activities during these search sessions included relevance assessment of a retrieved documents by the searchers. 16 groups registered to participate at PIR-CLEF 2018 and were provided with the data set to allow them to work on PIR related tasks and to provide feedback about our proposed PIR evaluation methodology with the aim to create an effective evaluation task.

Query Expansion for Sentence Retrieval Using Pseudo Relevance Feedback and Word Embedding

Lecture Notes in Computer Science, 2017

Investigating segment-based query expansion for user-generated spoken content retrieval

2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Jun 1, 2016

The very rapid growth in user-generated social multimedia content on online platforms is creating... more The very rapid growth in user-generated social multimedia content on online platforms is creating new challenges for search technologies. A significant issue for search of this type of content is its highly variable form and quality. This is compounded by the standard information retrieval (IR) problem of mismatch between search queries and target items. Query Expansion (QE) has been shown to be an effect technique to improve IR effectiveness for multiple search tasks. In QE, words from a number of relevant or assumed relevant top ranked documents from an initial search are added to the initial search query to enrich it before carrying out a further search operation. In this work, we investigate the application of QE methods for searching social multimedia content. In particular we focus on social multimedia content where the information is primarily in the audio stream. To address the challenge of content variability, we introduce three speech segment-based methods for QE using: Semantic segmentation, Discourse segmentation and Window-Based. Our experimental investigation illustrates the superiority of these segment-based methods in comparison to a standard full document QE method for a version of the MediaEval 2012 Search task newly extended as an adhoc search task.

An analysis of evaluation campaigns in ad-hoc medical information retrieval: CLEF eHealth 2013 and 2014

Information Retrieval Journal, 2018

Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction

Information Processing & Management, 2019

The study of query performance prediction (QPP) in information retrieval (IR) aims to predict retrieval effectiveness. The specificity of the underlying information need of a query often determines how effectively a search engine can retrieve relevant documents at top ranks. The presence of ambiguous terms makes a query less specific to the sought information need, which in turn may degrade IR effectiveness. In this paper, we propose a novel word embedding based pre-retrieval feature which measures the ambiguity of each query term by estimating how many 'senses' each word is associated with. Assuming each sense roughly corresponds to a Gaussian mixture component, our proposed generative model first estimates a Gaussian mixture model (GMM) from the word vectors that are most similar to the given query terms. We then use the posterior probabilities of generating the query terms themselves from this estimated GMM in order to quantify the ambiguity of the query. Previous studies have shown that post-retrieval QPP approaches often outperform pre-retrieval ones because they use additional information from the top ranked documents. To achieve the best of both worlds, we formalize a linear combination of our proposed GMM based pre-retrieval predictor with NQC, a state-of-the-art post-retrieval QPP. Our experiments on the TREC benchmark news and web collections demonstrate that our proposed hybrid QPP approach (in linear combination with NQC) significantly outperforms a range of other existing pre-retrieval approaches in combination with NQC used as baselines.

Proceedings of the Forum for Information Retrieval Evaluation on - FIRE '14, 2015

We describe the participation of Dublin City University (DCU) in the FIRE-2014 transliteration search task (TST). The TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document in the collection is either written in the native Devanagari script or transliterated in Roman script or a combination of both. The queries can be in mixed script as well. The task is challenging primarily because of the vocabulary mismatch which may arise due to the multiple transliteration alternatives. We attempt to address the vocabulary mismatch problem both during the indexing and retrieval stages. During indexing, we apply a rule-based normalization on some character sequences of the transliterated words in order to have a single representation in the index for the multiple transliteration alternatives. During the retrieval phase, we make use of prefix matched fuzzy query terms to account for the morphological variations of the transliterated words. The results show significant improvement over a standard baseline query likelihood language modelling (LM) approach. Additionally, we also apply statistical machine transliteration to train a transliteration model in order to predict the transliteration of out-of-vocabulary words. Surprisingly, even with satisfactory transliteration accuracy, we found that automatic transliteration of query terms degraded retrieval effectiveness.

External Query Reformulation for Text-Based Image Retrieval

Lecture Notes in Computer Science, 2011

In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo relevance feedback (PRF) which updates user queries by adding feedback terms selected automatically from top ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval since images often only have short metadata annotations leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in the user query ("definition documents") are first identified by title matching between the query and titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores rated by different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods.
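As a rough illustration of the re-weighting step mentioned in this abstract, the sketch below computes the Jaccard coefficient between term sets. The abstract does not say how documents are represented, so treating each document as a set of tokens is an assumption, and the names `jaccard` and `reweight` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| between two collections of terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def reweight(feedback_docs, definition_doc):
    """Weight each feedback document by its overlap with one definition document.

    Assumption: a single definition document and set-of-terms representation.
    """
    return {
        doc_id: jaccard(terms, definition_doc)
        for doc_id, terms in feedback_docs.items()
    }


definition_doc = ["bridge", "river", "structure", "span"]
feedback_docs = {
    "f1": ["bridge", "river", "city"],
    "f2": ["card", "game", "bridge"],
}
weights = reweight(feedback_docs, definition_doc)
```

Here `f1` shares two of five distinct terms with the definition document and `f2` shares one of six, so `f1` receives the larger weight and would contribute more to expansion-term selection.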

Enhanced Information Retrieval Using Domain-Specific Recommender Models

Lecture Notes in Computer Science, 2011

The objective of an information retrieval (IR) system is to retrieve relevant items which meet a user information need. There is currently significant interest in personalized IR which seeks to improve IR effectiveness by incorporating a model of the user's interests. However, in some situations there may be no opportunity to learn about the interests of a specific user on a certain topic. In our work, we propose an IR approach which combines a recommender algorithm with IR methods to improve retrieval for domains where the system has no opportunity to learn prior information about the user's knowledge of a domain for which they have not previously entered a query. We use search data from other previous users interested in the same topic to build a recommender model for this topic. When a user enters a query on a topic, new to this user, an appropriate recommender model is selected and used to predict a ranking which the user may find interesting based on the behaviour of previous users with similar queries. The recommender output is integrated with a standard IR method in a weighted linear combination to provide a final result for the user. Experiments using the INEX 2009 data collection with a simulated recommender training set show that our approach can improve on a baseline IR system.
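The weighted linear combination of recommender and IR output mentioned in this abstract can be sketched as follows. The weight `alpha`, the function name `combine_scores`, and the assumption that both score sets are already normalised to a comparable range are illustrative choices, not details taken from the paper.

```python
def combine_scores(ir_scores, rec_scores, alpha=0.5):
    """Rank documents by alpha * IR score + (1 - alpha) * recommender score.

    Assumptions: both inputs map document ids to scores on a comparable scale,
    and a document missing from one source contributes 0 from that source.
    """
    doc_ids = set(ir_scores) | set(rec_scores)
    combined = {
        doc_id: alpha * ir_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * rec_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; the document id breaks ties deterministically.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


ir_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
rec_scores = {"d2": 1.0, "d3": 0.2}
ranking = combine_scores(ir_scores, rec_scores, alpha=0.5)
```

With equal weights, `d2` moves above `d1` because the recommender scores it highly, even though the IR system alone ranked `d1` first. Setting `alpha=1.0` recovers the plain IR ranking.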

An Affect-Based Video Retrieval System with Open Vocabulary Querying

Lecture Notes in Computer Science, 2011

Content-based video retrieval systems (CBVR) are creating new search and browse capabilities using metadata describing significant features of the data. An often overlooked aspect of human interpretation of multimedia data is the affective dimension. Incorporating affective information into multimedia metadata can potentially enable search using this alternative interpretation of multimedia content. Recent work has described methods to automatically assign affective labels to multimedia data using various approaches. However, the subjective and imprecise nature of affective labels makes it difficult to bridge the semantic gap between system-detected labels and user expression of information requirements in multimedia retrieval. We present a novel affect-based video retrieval system incorporating an open-vocabulary query stage based on WordNet enabling search using an unrestricted query vocabulary. The system performs automatic annotation of video data with labels of well-defined affective terms. In retrieval, annotated documents are ranked using the standard Okapi retrieval model based on open-vocabulary text queries. We present experimental results examining the behaviour of the system for retrieval of a collection of automatically annotated feature films of different genres. Our results indicate that affective annotation can potentially provide useful augmentation to more traditional objective content description in multimedia retrieval.

A study on query expansion methods for patent retrieval

Proceedings of the 4th workshop on Patent information retrieval, 2011

Patent retrieval is a recall-oriented search task where the objective is to find all possible relevant documents. Queries in patent retrieval are typically very long since they take the form of a patent claim or even a full patent application in the case of prior-art patent search. Nevertheless, there is generally a significant mismatch between the query and the relevant documents, often leading to low retrieval effectiveness. Some previous work has tried to address this mismatch through the application of query expansion (QE) techniques which have generally been shown to be effective for many other retrieval tasks. However, results of QE on patent search have been found to be very disappointing. We present a review of previous investigations of QE in patent retrieval, and explore some of these techniques on a prior-art patent search task. In addition, a novel method for QE using automatically generated synonym sets is presented. While previous QE techniques fail to improve over baseline retrieval, our new approach shows statistically better retrieval precision over the baseline, although not for recall. In addition, it proves to be significantly more efficient than existing techniques. An extensive analysis of the results is presented which seeks to better understand situations where these QE techniques succeed or fail.

Applying Query Formulation and Fusion Techniques For Cross Language News Story Search

Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation, 2013

Cross Language News story search (CLNSS) is concerned with finding documents describing the same events in documents in different languages. As well as supporting information retrieval (IR), CLNSS has other applications in mining parallel and comparable data across different languages. In this paper, we present an overview of the work carried out for our participation in the Cross Language !ndian News Story Search (CL!NSS) task at FIRE 2013. In the CL!NSS task we explored the problem of cross language news search for the English-Hindi language pair. English news stories are used as queries to seek similar news documents from Hindi news articles. Hindi being a resource-scarce language offers many challenges towards retrieving relevant news articles. We investigate and contrast translation of input queries from English to Hindi using the Google and Bing translation services. To support translation of out-of-vocabulary words we use the Google transliteration service. A key challenge of the CL!NSS task is formation of search queries from the English news articles, since they are much longer than the queries typically used in IR applications. To address this problem, we explore the use of summarization to extract a query from the input news documents, and use these summarized queries as the input to the cross language IR system. We explore the use of query expansion using pseudo relevance feedback (PRF) in the IR process, since this has been shown to be effective for cross language IR in many previous investigations. We also explore in detail the use of data fusion techniques over different sets of retrieved results obtained using diverse query formulation techniques. For the CL!NSS task our team submitted 3 main runs. The results of our best run were ranked first among official submissions based on NDCG@5 and NDCG@10 values and second for NDCG@1 values. For the 25 test queries the results of our best main run were NDCG@1 0.7400, NDCG@5 0.6809 and NDCG@10 0.7268. We present our methodology, official results and results of a number of post-task experiments that were conducted to further examine the cross language search problem. Our experiments reveal that query formu

An Evaluation and Analysis of Incorporating Term Dependency for Ad-Hoc Retrieval

Lecture Notes in Computer Science

Dublin City University at CLEF 2006: Experiments for the ImageCLEF Photo Collection Standard Ad Hoc Task

Lecture Notes in Computer Science, 2007

For the CLEF 2006 Cross Language Image Retrieval (ImageCLEF) Photo Collection Standard Ad Hoc task, DCU performed monolingual and cross language retrieval using photo annotations with and without feedback, and also a combined visual and text retrieval approach. Topics are translated into English using the Babelfish online machine translation system. Text runs used the BM25 algorithm, while the visual approach used simple low-level features with matching based on the Jeffrey Divergence measure. Our results consistently indicate that the fusion of text and visual features is best for this task, and that performing feedback for text consistently improves on the baseline non-feedback BM25 text runs for all language pairs.
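For readers unfamiliar with the Jeffrey Divergence measure named in this abstract, the sketch below uses one definition that is common in the image-retrieval literature: each histogram is compared against the bin-wise mean of the two. Whether the paper used exactly this form, and which visual features the histograms were built from, are not stated here, so both are assumptions.

```python
import math


def jeffrey_divergence(h, k):
    """Jeffrey divergence between two histograms with the same number of bins.

    Uses sum_i [h_i * log(h_i / m_i) + k_i * log(k_i / m_i)] with
    m_i = (h_i + k_i) / 2; empty bins contribute nothing to the sum.
    """
    if len(h) != len(k):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for h_i, k_i in zip(h, k):
        m_i = (h_i + k_i) / 2.0
        if h_i > 0:
            total += h_i * math.log(h_i / m_i)
        if k_i > 0:
            total += k_i * math.log(k_i / m_i)
    return total


# Toy normalised colour histograms for a query image and two candidates.
query = [0.5, 0.3, 0.2]
near = [0.4, 0.4, 0.2]
far = [0.0, 0.1, 0.9]
d_near = jeffrey_divergence(query, near)
d_far = jeffrey_divergence(query, far)
```

Smaller values indicate more similar histograms, so candidate images would be ranked by ascending divergence from the query. The measure is symmetric and stays finite when a bin is empty in one histogram, which plain Kullback-Leibler divergence does not.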

Cross-Lingual Topical Relevance Models

Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, to directly estimate a model of relevant documents in the target language starting with a query in the source language. However, CLRLM involves integrating a translation model either on the document side if a parallel corpus is available, or on the query side if a bilingual dictionary is available. For low resourced language pairs, large parallel corpora do not exist and the vocabulary coverage of dictionaries is small, as a result of which RLM-based CLIR fails to obtain satisfactory results. Despite the lack of parallel resources for a majority of language pairs, the availability of comparable corpora for many languages has grown considerably in recent years. Existing CLIR techniques, such as cross-lingual relevance models, cannot effectively utilize these comparable corpora, since they do not use information from documents in the source language. We overcome this limitation by using information from retrieved documents in the source language to improve the retrieval quality of the target language documents. More precisely, our model involves a two-step approach of first retrieving documents both in the source language and the target language (using query translation), and then improving on the retrieval quality of target language documents by expanding the query with translations of words extracted from the top ranked documents retrieved in the source language which are thematically related (i.e. share the same concept) to the words in the top ranked target language documents. Our key hypothesis is that the query in the source language and its equivalent target language translation retrieve documents which share topics. The overlapping topics of these top ranked documents in both languages are then used to improve the ranking of the target language documents. Since the model relies on the alignment of topics between language pairs, we call it the cross-lingual topical relevance model (CLTRLM). Experimental results show that the CLTRLM significantly outperforms the standard CLRLM by up to 37% on English-Bengali CLIR, achieving mean average precision (MAP) of up to 60.27% of the Bengali monolingual IR MAP.

Multilingual Adaptive Search for Digital Libraries

Lecture Notes in Computer Science, 2011

We describe a framework for Adaptive Multilingual Information Retrieval (AMIR) which allows multilingual resource discovery and delivery using on-the-fly machine translation of documents and queries. Result documents are presented to the user in a contextualised manner. Challenges and affordances of both adaptive and multilingual IR, with a particular focus on digital libraries, are detailed. The framework components are motivated by a series of results from experiments on query logs and documents from The European Library. We conclude that factoring adaptivity and multilinguality aspects into the search process can enhance the user's experience with online digital libraries.

An LDA-smoothed relevance model for document expansion

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013

Document expansion (DE) in information retrieval (IR) involves modifying each document in the collection by introducing additional terms into the document. It is particularly useful to improve retrieval of short and noisy documents where the additional terms can improve the description of the document content. Existing approaches to DE assume that documents to be expanded are from a single topic. In the case of multi-topic documents this can lead to a topic bias in terms selected for DE and hence may result in poor retrieval quality due to the lack of coverage of the original document topics in the expanded document. This paper proposes a new DE technique providing a more uniform selection and weighting of DE terms from all constituent topics. We show that our proposed method significantly outperforms the most recently reported relevance model based DE method on a spoken document retrieval task for both manual and automatic speech recognition transcripts.

Overview of the Personalized and Collaborative Information Retrieval (PIR) Track at FIRE-2011

Lecture Notes in Computer Science, 2013