Skip to main content

Yannis Korkontzelos

Followers

2

Following

0

Public Views

Interests

Uploads

Papers by Yannis Korkontzelos

SemEval-2013 Task 5: Evaluating Phrasal Semantics

This paper describes the SemEval-2013 Task 5: "Evaluating Phrasal Semantics". Its first... more This paper describes the SemEval-2013 Task 5: "Evaluating Phrasal Semantics". Its first subtask is about computing the semantic similarity of words and compositional phrases of minimal length. The second one addresses deciding the compositionality of phrases in a given context. The paper discusses the importance and background of these subtasks and their structure. In succession, it introduces the systems that participated and discusses evaluation results.

Identifying Content Types of Messages Related to Open Source Software Projects

Assessing the suitability of an Open Source Software project for adoption requires not only an an... more Assessing the suitability of an Open Source Software project for adoption requires not only an analysis of aspects related to the code, such as code quality, frequency of updates and new version releases, but also an evaluation of the quality of support offered in related online forums and issue trackers. Understanding the content types of forum messages and issue trackers can provide information about the extent to which requests are being addressed and issues are being resolved, the percentage of issues that are not being fixed, the cases where the user acknowledged that the issue was successfully resolved, etc. These indicators can provide potential adopters of the OSS with estimates about the level of available support. We present a detailed hierarchy of content types of online forum messages and issue tracker comments and a corpus of messages annotated accordingly. We discuss our experiments to classify forum messages and issue tracker comments into content-related classes, i.e...

Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts

Journal of biomedical informatics, Aug 27, 2016

The abundance of text available in social media and health related forums along with the rich exp... more The abundance of text available in social media and health related forums along with the rich expression of public opinion have recently attracted the interest of the public health community to use these sources for pharmacovigilance. Based on the intuition that patients post about Adverse Drug Reactions (ADRs) expressing negative sentiments, we investigate the effect of sentiment analysis features in locating ADR mentions. We enrich the feature space of a state-of-the-art ADR identification method with sentiment analysis features. Using a corpus of posts from the DailyStrength forum and tweets annotated for ADR and indication mentions, we evaluate the extent to which sentiment analysis features help in locating ADR mentions and distinguishing them from indication mentions. Evaluation results show that sentiment analysis features marginally improve ADR identification in tweets and health related forum posts. Adding sentiment analysis features achieved a statistically significant F-m...

Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014

Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a ch... more Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem, yet with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages and (b) a term and its translation tend to appear in similar lexical context. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, for learning a string similarity measure of term translations. According to the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degree of comparability and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and we show substantial improvements over the two translation signals.

OSSMETER: a software measurement platform for automatically analysing open source software projects

Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015

Deciding whether an open source software (OSS) project meets the required standards for adoption ... more Deciding whether an open source software (OSS) project meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process as it involves exploring various sources of information. Such sources include OSS source code repositories, communication channels such as newsgroups, forums, and mailing lists, as well as issue tracking systems. OSSMETER is an extensible and scalable platform that can monitor and incrementally analyse a large number of OSS projects. The results of this analysis can be used to assess various aspects of OSS projects, and to directly compare different OSS projects with each other.

OSSMETER: Automated Measurement and Analysis of Open Source Software

Deciding whether an open source software (OSS) meets the required standards for adoption in terms... more Deciding whether an open source software (OSS) meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process. It involves analysing various sources of information, including the project's source code repositories, communication channels, and bug tracking systems. OSSMETER extends state-of-the-art techniques in the field of automated analysis and measurement of open-source software (OSS), and develops a platform that supports decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of opensource software. To achieve this, OSSMETER computes trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of OSS projects.

Detecting compositionality in multi-word expressions

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009

Identifying whether a multi-word expression (MWE) is compositional or not is important for numero... more Identifying whether a multi-word expression (MWE) is compositional or not is important for numerous NLP applications. Sense induction can partition the context of MWEs into semantic uses and therefore aid in deciding compositionality. We propose an unsupervised system to explore this hypothesis on compound nominals, proper names and adjective-noun constructions, and evaluate the contribution of sense induction. The evaluation set is derived from WordNet in a semisupervised way. Graph connectivity measures are employed for unsupervised parameter tuning. agony aunt, black maria, dead end, dutch oven, fish finger, fool's paradise, goat's rue, green light, high jump, joint chiefs, lip service, living rock, monkey puzzle, motor pool, prince Albert, stocking stuffer, sweet bay, teddy boy, think tank Compositional MWEs box white oak, cartridge brass, common iguana, closed chain, eastern pipistrel, field mushroom, hard candy, king snake, labor camp, lemon tree, life form, parenthesis-free notation, parking brake, petit juror, relational adjective, taxonomic category, telephone service, tea table, upland cotton

Boosting drug named entity recognition using an aggregate classifier

Artificial Intelligence in Medicine, 2015

Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP task... more Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Results: Aggregating drug NER methods, based on gold-standard annotations, dictionary knowledge and patterns, improved the performance on models trained on gold-standard annotations, only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns are shown to achieve comparable performance to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models build on dictionary knowledge can achieve similar or comparable classification performance with that of the best performing model trained on gold-standard annotations.

Graph connectivity measures for unsupervised parameter tuning of graph-based sense induction systems

Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics - UMSLLS '09, 2009

In this paper, a framework for acquiring common sense knowledge from the Web is presented. Common... more In this paper, a framework for acquiring common sense knowledge from the Web is presented. Common sense knowledge includes information about the world that humans use in their everyday lives. To acquire this knowledge, relationships between nouns are retrieved by using search phrases with automatically filled constituents. Through empirical analysis of the acquired nouns over Word-Net, probabilities are produced for relationships between a concept and a word rather than between two words. A specific goal of our acquisition method is to acquire knowledge that can be successfully applied to NLP problems. We test the validity of the acquired knowledge by means of an application to the problem of word sense disambiguation. Results show that the knowledge can be used to improve the accuracy of a state of the art unsupervised disambiguation system. Concept Analysis Noun Acquisition web search parse and match

Descriptive document clustering via discriminant learning in a co‐embedded space of multilevel similarities

Journal of the Association for Information Science and Technology, 2014

Descriptive document clustering aims at discovering clusters of semantically interrelated documen... more Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co‐embedded space that preserves higher‐order, neighbor‐based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are...

Locating Requests among Open Source Software Communication Messages

As a first step towards assessing the quality of support offered online for Open Source Software ... more As a first step towards assessing the quality of support offered online for Open Source Software (OSS), we address the task of locating requests, i.e., messages that raise an issue to be addressed by the OSS community, as opposed to any other message. We present a corpus of online communication messages randomly sampled from newsgroups and bug trackers, manually annotated as requests or non-requests. We identify several linguistically shallow, content-based heuristics that correlate with the classification and investigate the extent to which they can serve as independent classification criteria. Then, we train machine-learning classifiers on these heuristics. We experiment with a wide range of settings, such as different learners, excluding some heuristics and adding unigram features of various parts-of-speech and frequency. We conclude that some heuristics can perform well, while their accuracy can be improved further using machine learning, at the cost of obtaining manual annotations.

Using a Random Forest Classifier to recognise translations of biomedical terms across languages

We present a novel method to recognise semantic equivalents of biomedical terms in language pairs... more We present a novel method to recognise semantic equivalents of biomedical terms in language pairs. We hypothesise that biomedical term are formed by semantically similar textual units across languages. Based on this hypothesis, we employ a Random Forest (RF) classifier that is able to automatically mine higher order associations between textual units of the source and target language when trained on a corpus of both positive and negative examples. We apply our method on two language pairs: one that uses the same character set and another with a different script, English-French and English-Chinese, respectively. We show that English-French pairs of terms are highly transliterated in contrast to the English-Chinese pairs. Nonetheless, our method performs robustly on both cases. We evaluate RF against a state-of-the-art alignment method, GIZA++, and we report a statistically significant improvement. Finally, we compare RF against Support Vector Machines and analyse our results.

Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

U-Compare is a UIMA-based workflow construction platform for building natural language processing... more U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the META-NET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with META-NET's aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

Towards a better understanding of discourse: integrating multiple discourse annotation perspectives using UIMA

A method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data

BMC Medical Informatics and Decision Making, 2013

Background: We consider the user task of designing clinical trial protocols and propose a method ... more Background: We consider the user task of designing clinical trial protocols and propose a method that discovers and outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol which itself contains a set of eligibility criteria. Given a small set of sample documents D , |D | |D|, a user has initially identified as relevant e.g., via a user query interface, our scoring method automatically suggests eligibility criteria from D, D ⊃ D', by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. The appropriateness is measured by the degree to which they are consistent with the user-supplied sample documents D'. Method: We propose a novel three-step method called LDALR which views documents as a mixture of latent topics. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA). Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value, the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score. Results: Our experiments have shown that LDALR is 8 and 9 times better (resp., for inclusion and exclusion criteria) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments using LDALR, we were able to automatically construct eligibility criteria that are on the average 75% and 70% (resp., for inclusion and exclusion criteria) similar to the correct eligibility criteria.

Unsupervised Learning of Multiword Expressions

Multiword expressions are expressions consisting of two or more words that correspond to some con... more Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schutze 1999). Due to the idiomatic nature of many of them and their high frequency of occurence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require ...

Uoy: Graphs of unambiguous vertices for word sense induction and disambiguation

… of the 5th International Workshop on …, Jul 15, 2010

This paper presents an unsupervised graph-based method for automatic word sense induction and dis... more This paper presents an unsupervised graph-based method for automatic word sense induction and disambiguation. The innovative part of our method is the assignment of either a word or a word pair to each vertex of the constructed graph. Word senses are induced by clustering the constructed graph. In the disambiguation stage, each induced cluster is scored according to the number of its vertices found in the context of the target word. Our system participated in SemEval-2010 word sense induction and disambiguation ...

Can recognising multiword expressions improve shallow parsing?

… : The 2010 Annual Conference of the …, Jun 2, 2010

There is significant evidence in the literature that integrating knowledge about multiword expres... more There is significant evidence in the literature that integrating knowledge about multiword expressions can improve shallow parsing accuracy. We present an experimental study to quantify this improvement, focusing on compound nominals, proper names and adjective-noun constructions. The evaluation set of multiword expressions is derived from Word-Net and the textual data are downloaded from the web. We use a classification method to aid human annotation of output parses. This method allows us to conduct ...

ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials

BMC Medical Informatics and Decision Making, 2012

Clinical trials are mandatory protocols describing medical research on humans and among the most ... more Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.

Simmelian Ties on Twitter: Empirical Analysis and Prediction

2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2019

SemEval-2013 Task 5: Evaluating Phrasal Semantics

This paper describes the SemEval-2013 Task 5: "Evaluating Phrasal Semantics". Its first... more This paper describes the SemEval-2013 Task 5: "Evaluating Phrasal Semantics". Its first subtask is about computing the semantic similarity of words and compositional phrases of minimal length. The second one addresses deciding the compositionality of phrases in a given context. The paper discusses the importance and background of these subtasks and their structure. In succession, it introduces the systems that participated and discusses evaluation results.

Identifying Content Types of Messages Related to Open Source Software Projects

Assessing the suitability of an Open Source Software project for adoption requires not only an an... more Assessing the suitability of an Open Source Software project for adoption requires not only an analysis of aspects related to the code, such as code quality, frequency of updates and new version releases, but also an evaluation of the quality of support offered in related online forums and issue trackers. Understanding the content types of forum messages and issue trackers can provide information about the extent to which requests are being addressed and issues are being resolved, the percentage of issues that are not being fixed, the cases where the user acknowledged that the issue was successfully resolved, etc. These indicators can provide potential adopters of the OSS with estimates about the level of available support. We present a detailed hierarchy of content types of online forum messages and issue tracker comments and a corpus of messages annotated accordingly. We discuss our experiments to classify forum messages and issue tracker comments into content-related classes, i.e...

Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts

Journal of biomedical informatics, Aug 27, 2016

The abundance of text available in social media and health related forums along with the rich exp... more The abundance of text available in social media and health related forums along with the rich expression of public opinion have recently attracted the interest of the public health community to use these sources for pharmacovigilance. Based on the intuition that patients post about Adverse Drug Reactions (ADRs) expressing negative sentiments, we investigate the effect of sentiment analysis features in locating ADR mentions. We enrich the feature space of a state-of-the-art ADR identification method with sentiment analysis features. Using a corpus of posts from the DailyStrength forum and tweets annotated for ADR and indication mentions, we evaluate the extent to which sentiment analysis features help in locating ADR mentions and distinguishing them from indication mentions. Evaluation results show that sentiment analysis features marginally improve ADR identification in tweets and health related forum posts. Adding sentiment analysis features achieved a statistically significant F-m...

Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014

Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a ch... more Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem, yet with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages and (b) a term and its translation tend to appear in similar lexical context. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, for learning a string similarity measure of term translations. According to the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degree of comparability and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and we show substantial improvements over the two translation signals.

OSSMETER: a software measurement platform for automatically analysing open source software projects

Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015

Deciding whether an open source software (OSS) project meets the required standards for adoption ... more Deciding whether an open source software (OSS) project meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process as it involves exploring various sources of information. Such sources include OSS source code repositories, communication channels such as newsgroups, forums, and mailing lists, as well as issue tracking systems. OSSMETER is an extensible and scalable platform that can monitor and incrementally analyse a large number of OSS projects. The results of this analysis can be used to assess various aspects of OSS projects, and to directly compare different OSS projects with each other.

OSSMETER: Automated Measurement and Analysis of Open Source Software

Deciding whether an open source software (OSS) meets the required standards for adoption in terms... more Deciding whether an open source software (OSS) meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process. It involves analysing various sources of information, including the project's source code repositories, communication channels, and bug tracking systems. OSSMETER extends state-of-the-art techniques in the field of automated analysis and measurement of open-source software (OSS), and develops a platform that supports decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of opensource software. To achieve this, OSSMETER computes trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of OSS projects.

Detecting compositionality in multi-word expressions

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009

Identifying whether a multi-word expression (MWE) is compositional or not is important for numero... more Identifying whether a multi-word expression (MWE) is compositional or not is important for numerous NLP applications. Sense induction can partition the context of MWEs into semantic uses and therefore aid in deciding compositionality. We propose an unsupervised system to explore this hypothesis on compound nominals, proper names and adjective-noun constructions, and evaluate the contribution of sense induction. The evaluation set is derived from WordNet in a semisupervised way. Graph connectivity measures are employed for unsupervised parameter tuning. agony aunt, black maria, dead end, dutch oven, fish finger, fool's paradise, goat's rue, green light, high jump, joint chiefs, lip service, living rock, monkey puzzle, motor pool, prince Albert, stocking stuffer, sweet bay, teddy boy, think tank Compositional MWEs box white oak, cartridge brass, common iguana, closed chain, eastern pipistrel, field mushroom, hard candy, king snake, labor camp, lemon tree, life form, parenthesis-free notation, parking brake, petit juror, relational adjective, taxonomic category, telephone service, tea table, upland cotton

Boosting drug named entity recognition using an aggregate classifier

Artificial Intelligence in Medicine, 2015

Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP task... more Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Results: Aggregating drug NER methods, based on gold-standard annotations, dictionary knowledge and patterns, improved the performance on models trained on gold-standard annotations, only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns are shown to achieve comparable performance to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models build on dictionary knowledge can achieve similar or comparable classification performance with that of the best performing model trained on gold-standard annotations.

Graph connectivity measures for unsupervised parameter tuning of graph-based sense induction systems

Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics - UMSLLS '09, 2009

In this paper, a framework for acquiring common sense knowledge from the Web is presented. Common... more In this paper, a framework for acquiring common sense knowledge from the Web is presented. Common sense knowledge includes information about the world that humans use in their everyday lives. To acquire this knowledge, relationships between nouns are retrieved by using search phrases with automatically filled constituents. Through empirical analysis of the acquired nouns over Word-Net, probabilities are produced for relationships between a concept and a word rather than between two words. A specific goal of our acquisition method is to acquire knowledge that can be successfully applied to NLP problems. We test the validity of the acquired knowledge by means of an application to the problem of word sense disambiguation. Results show that the knowledge can be used to improve the accuracy of a state of the art unsupervised disambiguation system. Concept Analysis Noun Acquisition web search parse and match

Descriptive document clustering via discriminant learning in a co‐embedded space of multilevel similarities

Journal of the Association for Information Science and Technology, 2014

Descriptive document clustering aims at discovering clusters of semantically interrelated documen... more Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co‐embedded space that preserves higher‐order, neighbor‐based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are...

Locating Requests among Open Source Software Communication Messages

As a first step towards assessing the quality of support offered online for Open Source Software ... more As a first step towards assessing the quality of support offered online for Open Source Software (OSS), we address the task of locating requests, i.e., messages that raise an issue to be addressed by the OSS community, as opposed to any other message. We present a corpus of online communication messages randomly sampled from newsgroups and bug trackers, manually annotated as requests or non-requests. We identify several linguistically shallow, content-based heuristics that correlate with the classification and investigate the extent to which they can serve as independent classification criteria. Then, we train machine-learning classifiers on these heuristics. We experiment with a wide range of settings, such as different learners, excluding some heuristics and adding unigram features of various parts-of-speech and frequency. We conclude that some heuristics can perform well, while their accuracy can be improved further using machine learning, at the cost of obtaining manual annotations.

Using a Random Forest Classifier to recognise translations of biomedical terms across languages

We present a novel method to recognise semantic equivalents of biomedical terms in language pairs... more We present a novel method to recognise semantic equivalents of biomedical terms in language pairs. We hypothesise that biomedical term are formed by semantically similar textual units across languages. Based on this hypothesis, we employ a Random Forest (RF) classifier that is able to automatically mine higher order associations between textual units of the source and target language when trained on a corpus of both positive and negative examples. We apply our method on two language pairs: one that uses the same character set and another with a different script, English-French and English-Chinese, respectively. We show that English-French pairs of terms are highly transliterated in contrast to the English-Chinese pairs. Nonetheless, our method performs robustly on both cases. We evaluate RF against a state-of-the-art alignment method, GIZA++, and we report a statistically significant improvement. Finally, we compare RF against Support Vector Machines and analyse our results.

Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

U-Compare is a UIMA-based workflow construction platform for building natural language processing... more U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the META-NET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with META-NET's aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

Towards a better understanding of discourse: integrating multiple discourse annotation perspectives using UIMA

A method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data

BMC Medical Informatics and Decision Making, 2013

Background: We consider the user task of designing clinical trial protocols and propose a method ... more Background: We consider the user task of designing clinical trial protocols and propose a method that discovers and outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol which itself contains a set of eligibility criteria. Given a small set of sample documents D , |D | |D|, a user has initially identified as relevant e.g., via a user query interface, our scoring method automatically suggests eligibility criteria from D, D ⊃ D', by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. The appropriateness is measured by the degree to which they are consistent with the user-supplied sample documents D'. Method: We propose a novel three-step method called LDALR which views documents as a mixture of latent topics. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA). Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value, the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score. Results: Our experiments have shown that LDALR is 8 and 9 times better (resp., for inclusion and exclusion criteria) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments using LDALR, we were able to automatically construct eligibility criteria that are on the average 75% and 70% (resp., for inclusion and exclusion criteria) similar to the correct eligibility criteria.

Unsupervised Learning of Multiword Expressions

Multiword expressions are expressions consisting of two or more words that correspond to some con... more Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schutze 1999). Due to the idiomatic nature of many of them and their high frequency of occurence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require ...

Uoy: Graphs of unambiguous vertices for word sense induction and disambiguation

… of the 5th International Workshop on …, Jul 15, 2010

This paper presents an unsupervised graph-based method for automatic word sense induction and dis... more This paper presents an unsupervised graph-based method for automatic word sense induction and disambiguation. The innovative part of our method is the assignment of either a word or a word pair to each vertex of the constructed graph. Word senses are induced by clustering the constructed graph. In the disambiguation stage, each induced cluster is scored according to the number of its vertices found in the context of the target word. Our system participated in SemEval-2010 word sense induction and disambiguation ...

Can recognising multiword expressions improve shallow parsing?

… : The 2010 Annual Conference of the …, Jun 2, 2010

There is significant evidence in the literature that integrating knowledge about multiword expres... more There is significant evidence in the literature that integrating knowledge about multiword expressions can improve shallow parsing accuracy. We present an experimental study to quantify this improvement, focusing on compound nominals, proper names and adjective-noun constructions. The evaluation set of multiword expressions is derived from Word-Net and the textual data are downloaded from the web. We use a classification method to aid human annotation of output parses. This method allows us to conduct ...

ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials

BMC Medical Informatics and Decision Making, 2012

Clinical trials are mandatory protocols describing medical research on humans and among the most ... more Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.

Simmelian Ties on Twitter: Empirical Analysis and Prediction

2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2019