I try to upload full-text papers to my website whenever possible. Please find them here: http://nlp.lab.arizona.edu/content/publications
Supervisors: Hsinchun Chen
Named Entity Recognition (NER) is an important task in biomedical NLP that identifies and categorizes entities in biomedical text. We currently focus on a rule-based approach for NER to identify the diagnostic criteria of valley fever in the free text of electronic health records (EHRs), since no training data exist for machine learning. To aid the manual pattern-definition process of the rule-based approach, we propose a graph-based lexicon expansion method. We used different word embedding models to create a lexicon graph and expanded the lexicons with different graph search methods.
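A minimal sketch of the expansion idea, assuming a pre-trained gensim embedding file ("embeddings.kv"), placeholder seed terms, and an illustrative similarity threshold and search depth; this is not the paper's exact configuration. Terms become graph nodes, high-similarity pairs become edges, and a bounded breadth-first search from each seed collects expansion candidates.

```python
# Sketch: graph-based lexicon expansion from word embeddings (assumed: gensim, networkx).
import networkx as nx
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("embeddings.kv")           # hypothetical pre-trained embeddings
seeds = ["cough", "fatigue", "erythema"]                # hypothetical seed lexicon terms

# Build a similarity graph: nodes are terms, edges link highly similar terms.
# (For a real vocabulary, restrict this loop to a candidate term list for speed.)
graph = nx.Graph()
for term in vectors.key_to_index:
    for neighbor, score in vectors.most_similar(term, topn=10):
        if score >= 0.6:                                # illustrative similarity cutoff
            graph.add_edge(term, neighbor, weight=score)

# Expand the lexicon with a breadth-first search of bounded depth from each seed.
expanded = set(seeds)
for seed in seeds:
    if seed in graph:
        expanded.update(nx.single_source_shortest_path_length(graph, seed, cutoff=2))
print(sorted(expanded))
```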
Automated lay summary generation can improve the accessibility of health information, but is challenging because of the need to provide background information absent in source documents.
IEEE Journal of Biomedical and Health Informatics, Sep 1, 2019
Our goal is data-driven discovery of features for text simplification. In this work, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with 1) a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text and 2) a classification task (11,000 sentences) to determine the usefulness of features at the sentence level for simplification. For the corpus statistics study we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and the number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of crossing chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naïve Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers, with the best classifier achieving an accuracy of ~90% (compared to 78% for bag-of-words). Overall, we find that several lexical chain features provide specific information useful for identifying difficult sentences, beyond what is available from standard lexical features.
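A minimal sketch of exact-match lexical chains and two of the document-level features (average chain length and average chain span), under simplified tokenization; the synonymous and semantic chain types, which would need WordNet or embeddings, are omitted.

```python
# Sketch: exact-match lexical chains and simple document-level chain features.
from collections import defaultdict

def exact_chains(tokens):
    """Group positions of each repeated word into a chain (word -> list of positions)."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok.lower()].append(i)
    # A chain needs at least two links.
    return {w: p for w, p in positions.items() if len(p) > 1}

def chain_features(tokens):
    chains = exact_chains(tokens)
    if not chains:
        return {"num_chains": 0, "avg_length": 0.0, "avg_span": 0.0}
    lengths = [len(p) for p in chains.values()]
    spans = [p[-1] - p[0] for p in chains.values()]
    return {
        "num_chains": len(chains),
        "avg_length": sum(lengths) / len(lengths),
        "avg_span": sum(spans) / len(spans),
    }

text = "The heart pumps blood. Blood carries oxygen to the heart and the body."
print(chain_features(text.lower().replace(".", "").split()))
```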
Text continues to be an important medium for communicating health-related information. We have built a text simplification tool that gives concrete suggestions on how to simplify health and medical texts. An important component of the tool identifies difficult words and suggests simpler synonyms based on pre-existing resources (WordNet and UMLS). These candidate substitutions are not always appropriate in all contexts. In this paper, we introduce a filtering algorithm that utilizes semantic similarity based on word embeddings to determine if the candidate substitution is appropriate in the context of the text. We provide an analysis of our approach on a new dataset of 788 labeled substitution examples. The filtering algorithm is particularly helpful at removing obvious examples and can improve the precision by 3% at a recall level of 95%.
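A minimal sketch of the filtering idea, assuming a pre-trained embedding file ("embeddings.kv") and an illustrative similarity threshold: a candidate substitution is kept only if its vector is close enough to the averaged vector of the surrounding context.

```python
# Sketch: context-based filtering of synonym substitutions with word embeddings.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("embeddings.kv")            # hypothetical embedding file

def context_vector(words):
    vecs = [vectors[w] for w in words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def keep_substitution(candidate, context_words, threshold=0.4):
    """Keep a candidate only if it is semantically close to the surrounding context."""
    ctx = context_vector(context_words)
    if ctx is None or candidate not in vectors:
        return False
    sim = np.dot(vectors[candidate], ctx) / (
        np.linalg.norm(vectors[candidate]) * np.linalg.norm(ctx))
    return sim >= threshold

# Example: should "growth" replace "tumor" in this sentence?
print(keep_substitution("growth", ["the", "patient", "has", "a", "malignant", "tumor"]))
```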
Using electronic health records of children evaluated for Autism Spectrum Disorders, we are developing a decision support system for automated diagnostic criteria extraction and case classification. We manually created 92 lexicons which we tested as features for classification and compared with features created automatically using word embedding. The expert annotations used for manual lexicon creation provided seed terms that were expanded with the 15 most similar terms (Word2Vec). The resulting 2,200 terms were clustered in 92 clusters parallel to the manually created lexicons. We compared both sets of features to classify case status with an FF/BP neural network (NN) and a C5.0 decision tree. For manually created lexicons, classification accuracy was 76.92% for the NN and 84.60% for C5.0. For the automatically created lexicons, accuracy was 79.78% for the NN and 86.81% for C5.0. Automated lexicon creation required a much shorter development time and produced similarly high-quality outcomes.
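A minimal sketch of the automated lexicon pipeline with gensim and scikit-learn; the embedding file and seed terms are placeholders, while the 15 nearest neighbors and 92 clusters follow the abstract.

```python
# Sketch: seed-term expansion (Word2Vec nearest neighbors) followed by clustering.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

vectors = KeyedVectors.load("ehr_word2vec.kv")         # hypothetical model trained on EHR text
seeds = ["echolalia", "stimming", "nonverbal"]          # hypothetical annotation-derived seeds

# Expand each seed with its 15 nearest neighbors in embedding space.
expanded = set()
for seed in seeds:
    if seed in vectors:
        expanded.add(seed)
        expanded.update(term for term, _ in vectors.most_similar(seed, topn=15))

# Cluster the expanded terms; 92 clusters as in the study, capped by the number of terms.
terms = sorted(expanded)
X = np.vstack([vectors[t] for t in terms])
labels = KMeans(n_clusters=min(92, len(terms)), n_init=10, random_state=0).fit_predict(X)

clusters = {}
for term, label in zip(terms, labels):
    clusters.setdefault(int(label), []).append(term)
print(clusters)
```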
To help increase health literacy, we are developing a text simplification tool that creates more accessible patient education materials. Tool development is guided by data-driven feature analysis comparing simple and difficult text. In the present study, we focus on the common advice to split long noun phrases. Our previous corpus analysis showed that easier texts contained shorter noun phrases. Subsequently, we conduct a user study to measure the difficulty of sentences containing noun phrases of different lengths (2-gram, 3-gram and 4-gram), conditions (split or not) and, to simulate unknown terms, use of pseudowords (present or not). We gathered 35 evaluations for 30 sentences in each condition (3×2×2 conditions) on Amazon's Mechanical Turk (N=12,600). We conducted a three-way ANOVA for perceived and actual difficulty. Splitting noun phrases had a positive effect on perceived difficulty but a negative effect on actual difficulty. The presence of pseudowords increased perceived and actual difficulty. Without pseudowords, longer noun phrases led to increased perceived and actual difficulty. A follow-up study using the phrases (N = 1,350) showed that measuring awkwardness may indicate when to split noun phrases. We conclude that splitting noun phrases benefits perceived difficulty, but hurts actual difficulty when the phrasing becomes less natural.
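A minimal sketch of the three-way ANOVA using the statsmodels formula API; the CSV file and column names (phrase_length, split, pseudoword, difficulty) are illustrative stand-ins for the study's factors and response.

```python
# Sketch: three-way ANOVA over the study factors (statsmodels formula API).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# ratings.csv is assumed to hold one row per judgment:
# phrase_length (2, 3, 4), split (yes/no), pseudoword (yes/no), difficulty (rating or score).
df = pd.read_csv("ratings.csv")

model = ols("difficulty ~ C(phrase_length) * C(split) * C(pseudoword)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```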
Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A naïve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.
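A minimal sketch of the best-performing condition, a naïve Bayes classifier over the UMLS semantic types found in each sentence, evaluated with 10-fold cross-validation; the semantic-type codes and sense labels shown are placeholders rather than real UMLS output.

```python
# Sketch: naive Bayes word sense disambiguation over UMLS semantic-type features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

def semantic_type_features(sentence_types):
    """Bag of UMLS semantic types present in the sentence (e.g. 'T047' = Disease or Syndrome)."""
    return {t: 1 for t in sentence_types}

# Each example: the semantic types found in one sentence containing the ambiguous word,
# labeled with the correct sense. Values here are illustrative only.
examples = [["T047", "T121"], ["T047", "T023"], ["T061", "T121"], ["T061", "T023"]] * 25
senses = ["sense1", "sense1", "sense2", "sense2"] * 25

X = [semantic_type_features(e) for e in examples]
clf = make_pipeline(DictVectorizer(sparse=False), BernoulliNB())
scores = cross_val_score(clf, X, senses, cv=10)         # 10-fold cross-validation, as in the study
print(scores.mean())
```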
We are developing algorithms for semi-automated simplification of medical text. Based on lexical and grammatical corpus analysis, we identified a new metric, term familiarity, to help estimate text difficulty. We developed an algorithm that uses term familiarity to identify difficult text and select easier alternatives from lexical resources such as WordNet, UMLS and Wiktionary. Twelve sentences were simplified to measure perceived difficulty using a 5-point Likert scale. Two documents were simplified to measure actual difficulty by posing questions with and without the text present (information understanding and retention). We conducted a user study by inviting participants (N=84) via Amazon Mechanical Turk. There was a significant effect of simplification on perceived difficulty (p<.001). We also saw slightly improved understanding with better question-answering for simplified documents but the effect was not significant (p=.097). Our results show how term familiarity is a valuable component in simplifying text in an efficient and scalable manner.
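A minimal sketch of term-familiarity-driven substitution, using a word-frequency score (the wordfreq package) as a stand-in for term familiarity and WordNet (NLTK) as the synonym source; the threshold is illustrative and the UMLS and Wiktionary lookups are omitted.

```python
# Sketch: flag unfamiliar words and propose more familiar WordNet synonyms.
from nltk.corpus import wordnet as wn      # requires: nltk.download("wordnet")
from wordfreq import zipf_frequency        # frequency used here as a stand-in for term familiarity

def familiarity(word):
    return zipf_frequency(word, "en")

def simpler_synonym(word, min_gain=0.5):
    """Return a more familiar synonym if WordNet offers one, else the original word."""
    best, best_score = word, familiarity(word)
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            candidate = lemma.replace("_", " ")
            if familiarity(candidate) > best_score + min_gain:
                best, best_score = candidate, familiarity(candidate)
    return best

print(simpler_synonym("physician"))        # e.g. may suggest "doctor"
```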
This dissertation aims to discover synergistic combinations of top-down (ontologies), interactive (relevance feedback), and bottom-up (machine learning) knowledge encoding techniques for text mining. The strength of machine learning techniques lies in their coverage and efficiency because they can discover new knowledge without human intervention. The output, however, is often imprecise and irrelevant. Human knowledge, top-down or interactively encoded, may remedy this. The research question addressed is whether knowledge discovery can become more precise and relevant with hybrid systems. Three different combinations are evaluated. The first study investigates an ontology, the Unified Medical Language System (UMLS), combined with an automatically created thesaurus to dynamically adjust the thesaurus' output. The augmented thesaurus was added to a medical, meta-search portal as a keyword suggester and compared with the unmodified thesaurus and UMLS. Users preferred the hybrid approach. Thus, the combination of the ontology with the thesaurus was better than the components separately. The second study investigates implicit relevance feedback combined with genetic algorithms designed to adjust user queries for online searching. These were compared with pure relevance feedback algorithms. Users were divided into groups based on their overall performance. The genetic algorithm significantly helped low achievers, but hindered high achievers. Thus, the interactively elicited knowledge from relevance feedback was judged insufficient to guide machine learning for all users. The final study investigates ontologies combined with two natural language processing techniques: a shallow parser and an automatically created thesaurus. Both capture relations between phrases in biomedical text. Qualified researchers found all terms to be precise; however, terms that belonged to ontologies were more relevant. Parser relations were all precise. Thesaurus relations were less precise, but precision improved for relations that had their terms represented in ontologies. Thus, this integration of ontologies with natural language processing provided good results. In general, it was concluded that top-down encoded knowledge could be effectively integrated with bottom-up encoded knowledge for knowledge discovery in text. This is particularly relevant to business fields, which are text and knowledge intensive. In the future, it will be worthwhile to extend the parser and also to test similar hybrid approaches for data mining.
Transition words add important information and are useful for increasing text comprehension for readers. Our goal is to automatically detect transition words in the medical domain. We introduce a new dataset for identifying transition words categorized into 16 different types with occurrences in adjacent sentence pairs in medical texts from English and Spanish Wikipedia (70K and 27K examples, respectively). We provide classification results using a feedforward neural network with word embedding features. Overall, we detect the need for a transition word with 78% accuracy in English and 84% in Spanish. For individual transition word categories, performance varies widely and is not related to either the number of training examples or the number of transition words in the category. The best accuracy in English was for Exemplification words (82%) and in Spanish for Contrast words (96%).
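A minimal sketch of the detection step as binary classification over sentence-pair embedding features with a feedforward network (scikit-learn's MLPClassifier); the embedding file, example pairs, and network size are illustrative, not the paper's settings.

```python
# Sketch: detect whether a sentence pair needs a transition word, using averaged embeddings.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.neural_network import MLPClassifier

vectors = KeyedVectors.load("embeddings.kv")            # hypothetical pre-trained embeddings

def pair_features(sent_a, sent_b):
    """Average the word vectors of each sentence and concatenate the two averages."""
    def avg(sent):
        vecs = [vectors[w] for w in sent.lower().split() if w in vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)
    return np.concatenate([avg(sent_a), avg(sent_b)])

# pairs: (first sentence, second sentence); label 1 if the second sentence should start
# with a transition word, else 0 (placeholder data only).
pairs = [("Smoking damages the lungs.", "It raises cancer risk."),
         ("Take the tablet with food.", "Do not exceed two per day.")]
labels = [1, 0]

X = np.vstack([pair_features(a, b) for a, b in pairs])
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500).fit(X, labels)
```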
International Journal of Medical Informatics, Aug 1, 2005
Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguity and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples for each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network) and gold standards (per expert), but the results did not significantly differ. However, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators. We conclude that neither algorithm nor individual human behavior cause these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
Journal of the Association for Information Science and Technology, 2008
Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did.
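A minimal sketch contrasting a vocabulary-based naïve Bayes classifier with a readability formula (textstat's Flesch-Kincaid grade); the two training snippets and labels are placeholders for the study's three-level training corpus.

```python
# Sketch: vocabulary-based difficulty classifier versus a readability formula.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import textstat                                          # assumed readability-formula package

train_texts = ["The heart pumps blood through the body.",
               "Myocardial contraction propels blood through the systemic circulation."]
train_labels = ["easy", "difficult"]                     # three levels in the study; two shown here

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train_texts, train_labels)

new_page = "Hypertension increases cardiovascular risk."
print("classifier:", clf.predict([new_page])[0])
print("grade level:", textstat.flesch_kincaid_grade(new_page))
```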
This tutorial teaches how to conduct evaluations that fit within the design science paradigm, i.e., evaluations of algorithms and entire systems. Design science is of increasing importance in IS, with many of the main and top journals recognizing it as an important research approach in our field. However, many business schools and i-schools focus on behavioral or econometric methods when teaching evaluation. This tutorial provides the complement: ANOVA and t-tests for the evaluation of artifacts under different conditions.
Human and machine generated knowledge have different strengths and weaknesses. Human knowledge is precise but often has limited coverage. Machine generated knowledge is less precise but can cover more ground efficiently. Knowledge based systems should tightly integrate both. Unfortunately, only a few ways of combining human and machine generated knowledge will be practical and efficient. Three such knowledge integration approaches are developed and tested: human judgments to guide probabilistic and evolutionary information retrieval techniques, ontologies to provide the semantic context for an automatically created thesaurus, and ontologies to augment natural language processing. Human generated, precise, domain-specific ontologies were found to be well suited for integration with both machine learning algorithms and natural language processing for knowledge discovery. These conclusions are applied in the development of GeneScene, a knowledge based system being developed for biomedicine. GeneScene relies on a novel natural language processing technique: a 'function word'-based parser. This parser is integrated with medical ontologies to extract biomedical pathway information from text.
With an increasing number of anonymous crime tips and reports being filed and digitized, it is generally difficult for crime analysts to process and analyze crime reports efficiently. We are developing a decision support system (DSS) combining Natural Language Processing (NLP) techniques, a document similarity measure, and machine learning, i.e., a naïve Bayes classifier, to support crime analysis and classify which crime reports discuss the same or different crimes. The DSS is developed with text mining techniques and evaluated with an active crime analyst. We report here on an experiment that includes two datasets with 40 and 60 crime reports and 16 different types of crimes for each dataset. The results show that our system achieved the highest classification accuracy (94.82%), while the crime analyst's classification accuracy (93.74%) was slightly lower.
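A minimal sketch of the document-similarity component using TF-IDF vectors and cosine similarity to flag reports that may describe the same crime; the reports and threshold are invented for illustration, and the naïve Bayes classification step is omitted.

```python
# Sketch: TF-IDF cosine similarity between crime reports as a "same crime" signal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "Suspect broke the rear window of a red sedan on 4th Street and took a laptop.",
    "A red car on 4th Street had its back window smashed; a laptop bag was stolen.",
    "Graffiti was sprayed on the east wall of the public library overnight.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(reports)
similarity = cosine_similarity(tfidf)

# Pairs above an illustrative threshold are candidates for "same crime".
for i in range(len(reports)):
    for j in range(i + 1, len(reports)):
        if similarity[i, j] > 0.2:
            print(f"reports {i} and {j} may describe the same crime ({similarity[i, j]:.2f})")
```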
Simplifying medical texts facilitates readability and comprehension. While most simplification work focuses on English, we investigate whether features important for simplifying English text are similarly helpful for simplifying Spanish text. We conducted a user study on 15 Spanish medical texts using Amazon Mechanical Turk and measured perceived and actual difficulty. Using the median of the difficulty scores, we split the texts into easy and difficult groups and extracted 10 surface, 2 semantic and 4 grammatical features. Using t-tests, we identified those features that significantly distinguish easy text from difficult text in Spanish and compare with prior work in English. We found that easy Spanish texts use more repeated words and adverbs, fewer negations and more familiar words, similar to English. Also like English, difficult Spanish texts use more nouns and adjectives. However, in contrast to English, easier Spanish texts contained longer sentences and used grammatical structures that were more varied.
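A minimal sketch of the feature comparison with independent-samples t-tests (SciPy); the CSV file, group labels, and feature columns are illustrative names rather than the study's actual data.

```python
# Sketch: t-tests comparing feature values between easy and difficult text groups.
import pandas as pd
from scipy.stats import ttest_ind

# features.csv is assumed to hold one row per text with a 'group' column
# ("easy"/"difficult") and one column per feature (e.g. sentence_length, num_negations).
df = pd.read_csv("features.csv")
easy, difficult = df[df.group == "easy"], df[df.group == "difficult"]

for feature in [c for c in df.columns if c != "group"]:
    t, p = ttest_ind(easy[feature], difficult[feature], equal_var=False)
    if p < 0.05:
        print(f"{feature}: t={t:.2f}, p={p:.3f}")
```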
Objective: Simplifying healthcare text to improve understanding is difficult but critical to improve health literacy. Unfortunately, few tools exist that have been shown objectively to improve text and understanding. We developed an online editor that integrates simplification algorithms that suggest concrete simplifications, all of which have been shown individually to affect text difficulty. Materials and Methods: The editor was used by a health educator at a local community health center to simplify 4 texts. A controlled experiment was conducted with community center members to measure perceived and actual difficulty of the original and simplified texts. Perceived difficulty was measured using a Likert scale; actual difficulty with multiple-choice questions and with free recall of information evaluated by the educator and 2 sets of automated metrics. Results: The results show that perceived difficulty improved with simplification. Several multiple-choice questions, measuring actual difficulty, were answered more correctly with the simplified text. Free recall of information showed no improvement based on the educator evaluation but was better for simplified texts when measured with automated metrics. Two follow-up analyses showed that self-reported education level and the amount of English spoken at home positively correlated with question accuracy for original texts and the effect disappears with simplified text. Discussion: Simplifying text is difficult and the results are subtle. However, using a variety of different metrics helps quantify the effects of changes. Conclusion: Text simplification can be supported by algorithmic tools. Without requiring tool training or linguistic knowledge, our simplification editor helped simplify healthcare related texts.
Autism spectrum disorder has become one of the most prevalent developmental disorders, and one of the main impairments is difficulty with communication. One method of augmentative and alternative communication is the use of the Picture Exchange Communication System (PECS) to create messages using a series of images printed on cards and organized in binders. We are developing a digital alternative based on an image library that is displayed on a personal digital assistant (PDA). We conducted an initial user acceptance study that compared the effectiveness and usability of both systems. The study showed that the PDA system was able to communicate messages to adult recipients as effectively as PECS. However, the PDA was perceived to be more current, of higher quality, easier, and more normal looking than the PECS binder.
I. INTRODUCTION
Autism spectrum disorder is a serious developmental disorder that afflicts more than 500,000 children in the United States; it is more common than childhood cancer or Down's syndrome [1]. One of the primary impairments is difficulty with communication. Between one-third and one-half of people diagnosed with autism do not have functional verbal communication skills [2]. Research is being done in neuroscience, psychiatry, medicine, psychology, and many additional fields to determine causes and, eventually, a plan for prevention. The thousands of children who struggle with communication need immediate help. Frustration runs high
Limited health literacy is a barrier to understanding health information. Simplifying text can reduce this barrier and possibly other known disparities in health. Unfortunately, few tools exist to simplify text with demonstrated impact on comprehension. By leveraging modern data sources integrated with natural language processing algorithms, we are developing the first semi-automated text simplification tool. We present two main contributions. First, we introduce our evidence-based development strategy for designing effective text simplification software and summarize initial, promising results. Second, we present a new study examining existing readability formulas, which are the most commonly used tools for text simplification in healthcare. We compare syllable count, the proxy for word difficulty used by most readability formulas, with our new metric 'term familiarity' and find that syllable count measures how difficult words 'appear' to be, but not their actual difficulty. In contrast, term familiarity can be used to measure actual difficulty.
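A minimal sketch contrasting syllable count with a frequency-based stand-in for term familiarity, using the textstat and wordfreq packages; the word list is illustrative.

```python
# Sketch: syllable count (the usual readability proxy) versus a frequency-based familiarity score.
import textstat
from wordfreq import zipf_frequency

# Words with similar syllable counts can differ widely in familiarity: "emesis" looks short
# and simple by syllable count but is rarely seen outside clinical text.
for word in ["vomiting", "emesis", "doctor", "physician"]:
    print(f"{word}: syllables={textstat.syllable_count(word)}, "
          f"familiarity={zipf_frequency(word, 'en'):.2f}")
```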