Papers by Krishnaprasad Thirunarayan
2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
Heart failure occurs when the heart is not able to pump blood and oxygen to support other organs in the body as it should. Treatments include medications and sometimes hospitalization. Patients with heart failure can have both cardiovascular and non-cardiovascular comorbidities. Clinical notes of patients with heart failure can be analyzed to gain insight into the topics discussed in these notes and the major comorbidities in these patients. In this regard, we apply machine learning techniques, such as topic modeling, to identify the major themes found in the clinical notes specific to the procedures performed on 1,200 patients admitted for heart failure at the University of Illinois Hospital and Health Sciences System (UI Health). Topic modeling revealed five hidden themes in these clinical notes, including one related to heart disease comorbidities.
SemMOB enables dynamic registration of sensors via mobile devices, search, and near real-time inference over sensor observations in ad-hoc mobile environments (e.g., fire fighting). We demonstrate SemMOB in the context of an emergency response use case that requires automatic and dynamic registration of sensor devices and annotation of sensor observations, decoding of latitude-longitude information in terms of human-sensible names, fusion and abstraction of sensor values using background knowledge, and their visualization using a mash-up.
In this paper, we spot topically anomalous tweets in Twitter streams by analyzing the content of the documents pointed to by the URLs in the tweets in preference to their textual content. Existing approaches to anomaly detection ignore such URLs, thereby missing opportunities to detect off-topic tweets. Specifically, we determine the divergence between the claimed topic of a tweet, as reflected by its hashtags, and the actual topic, as reflected by the referenced document content. Our approach avoids the need for labeled samples by selecting documents from reliable sources gleaned from the URLs present in the tweets. These documents are compared against documents associated with unknown URLs in incoming tweets, improving reliability, scalability, and adaptability to rapidly changing topics. We evaluate our approach on three events and show that it can find topical inconsistencies not detectable by existing approaches.
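The abstract above does not say which divergence measure is used. As a minimal sketch of the idea only, the following assumes Jensen-Shannon divergence over bag-of-words distributions; the function names, the whitespace tokenizer, and the 0.8 threshold are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def word_distribution(text):
    """Turn text into a word -> probability mapping (naive whitespace tokens)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mix(dist):
        return sum(prob * math.log2(prob / mix[w]) for w, prob in dist.items() if prob > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def is_off_topic(reference_text, linked_text, threshold=0.8):
    """Flag a tweet when its linked document diverges from the reference text.

    reference_text stands in for documents from reliable sources for the
    hashtag's topic; threshold is an illustrative assumption.
    """
    return js_divergence(word_distribution(reference_text),
                         word_distribution(linked_text)) > threshold
```

For example, `is_off_topic("flood rescue shelter", "discount shoes sale")` flags the tweet because the two texts share no vocabulary, while identical texts give a divergence of zero.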
arXiv (Cornell University), Aug 10, 2017
Extracting location names from informal and unstructured social media data requires the identification of referent boundaries and partitioning compound names. Variability, particularly systematic variability in location names (Carroll, 1983), challenges the identification task. Some of this variability can be anticipated as operations within a statistical language model, in this case drawn from gazetteers such as OpenStreetMap (OSM), Geonames, and DBpedia. This permits evaluation of an observed n-gram in Twitter targeted text as a legitimate location name variant from the same location-context. Using n-gram statistics and location-related dictionaries, our Location Name Extraction tool (LNEx) handles abbreviations and automatically filters and augments the location names in gazetteers (handling name contractions and auxiliary contents) to help detect the boundaries of multi-word location names and thereby delimit them in texts. We evaluated our approach on 4,500 event-specific tweets from three targeted streams to compare the performance of LNEx against that of ten state-of-the-art taggers that rely on standard semantic, syntactic and/or orthographic features. LNEx improved the average F-Score by 33-179%, outperforming all taggers. Further, LNEx is capable of stream processing.
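To illustrate what delimiting multi-word location names against a gazetteer can look like, here is a minimal sketch of greedy longest-match lookup. It is not the LNEx algorithm, which additionally uses n-gram statistics and gazetteer filtering and augmentation; the function name, the `max_n` parameter, and the toy gazetteer are assumptions made for illustration.

```python
def extract_locations(tokens, gazetteer, max_n=4):
    """Greedy longest-match of token n-grams against a set of lowercase names.

    Returns (start, end, name) spans with end exclusive. Preferring the
    longest n-gram keeps "new york avenue" from being split into "new york".
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, then shrink.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in gazetteer:
                match = (i, i + n, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


# Toy gazetteer; a real one would be drawn from OSM, Geonames, or DBpedia.
gazetteer = {"new york", "new york avenue", "york"}
tokens = "Flooding near New York Avenue and York".split()
spans = extract_locations(tokens, gazetteer)
```

Here `spans` holds two entries: the three-token span for "new york avenue" and the single-token span for "york".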
arXiv (Cornell University), Aug 5, 2018
The ever-growing datasets published on Linked Open Data mainly contain encyclopedic information. However, there is a lack of quality structured and semantically annotated datasets extracted from unstructured real-time sources. In this paper, we present principles for developing a knowledge graph of interlinked events using the case study of news headlines published on Twitter, which is a real-time and eventful source of fresh information. We present the essential pipeline containing the required tasks, ranging from choosing a background data model and event annotation (i.e., event recognition and classification) to entity annotation and eventually interlinking events. The state of the art is limited to domain-specific scenarios for recognizing and classifying events, whereas this paper plays the role of a domain-agnostic road-map for developing a knowledge graph of interlinked events.
PLOS ONE
With the increasing legalization of medical and recreational use of cannabis, more research is needed to understand the association between depression and consumer behavior related to cannabis consumption. Big social media data has the potential to provide deeper insights about these associations to public health analysts. In this interdisciplinary study, we demonstrate the value of incorporating domain-specific knowledge in the learning process to identify the relationships between cannabis use and depression. We develop an end-to-end knowledge-infused deep learning framework (Gated-K-BERT) that leverages the pre-trained BERT language representation model and a domain-specific declarative knowledge source (Drug Abuse Ontology) to jointly extract entities and their relationship using a gated fusion sharing mechanism. Our model is further tailored to provide more focus to the entities mentioned in the sentence through an entity-position-aware attention layer, where ontology is used to locate the ...
Proceedings of the International Conference on Web Intelligence
This paper exploits a large number of self-labeled emotion tweets as the training data from the source domain to improve emotion identification in target domains (i.e., blogs and fairy tales), where there is a short supply of labeled data. Due to the noisy and ambiguous nature of self-labeled emotion training data, the existing domain adaptation methods that typically depend on high-quality labeled source-domain data do not work satisfactorily. This paper describes an adaptive source-domain training instance selection method to address the problem of noisy source-domain training data. The proposed approach can effectively identify the most informative training examples based on three carefully designed measures: consistency, diversity, and similarity. It uses an iterative method that consists of the following steps in each iteration: selecting informative samples from the source domain with the informativeness measures, merging with the target-domain training data, evaluating the performance of the learned classifier for the target domain, and updating the informativeness measures for the next iteration. It stops when no new training instance is selected or after a designated number of iterations. Experiments show that our approach performs effectively for cross-domain emotion identification and consistently outperforms baseline approaches across four domains. CCS CONCEPTS • Information systems → Sentiment analysis; • Computing methodologies → Natural language processing;
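The iterative procedure described above can be sketched as a generic loop. This is an illustration of the control flow only, not the authors' implementation: the abstract does not define how consistency, diversity, and similarity are computed or combined, so the scoring function, the classifier evaluation, and the update rule are passed in as hypothetical callables, and `batch_size` and `max_iters` are assumed parameters.

```python
def select_source_instances(source, target_train, score, train_and_eval,
                            update_score, batch_size=2, max_iters=10):
    """Iteratively pick informative source-domain instances.

    score(x)            -> informativeness of a source instance (assumed > 0 to qualify)
    train_and_eval(d)   -> target-domain performance of a classifier trained on d
    update_score(s, p)  -> new scoring function given the latest performance p
    Stops when no instance qualifies or after max_iters iterations.
    """
    selected, history = [], []
    remaining = list(source)
    for _ in range(max_iters):
        # Step 1: select the most informative remaining source instances.
        ranked = sorted(remaining, key=score, reverse=True)
        batch = [x for x in ranked[:batch_size] if score(x) > 0]
        if not batch:
            break  # no new training instance selected
        selected.extend(batch)
        remaining = [x for x in remaining if x not in batch]
        # Steps 2 and 3: merge with target-domain data and evaluate.
        performance = train_and_eval(target_train + selected)
        history.append(performance)
        # Step 4: update the informativeness measure for the next iteration.
        score = update_score(score, performance)
    return selected, history


# Toy usage with stand-in callables (all hypothetical).
source = [("joy tweet", 0.9), ("noisy tweet", -0.5), ("sad tweet", 0.1)]
target_train = [("blog sentence", 1.0)]
selected, history = select_source_instances(
    source, target_train,
    score=lambda x: x[1],
    train_and_eval=len,                # placeholder for a real train/evaluate step
    update_score=lambda s, p: s,       # placeholder: keep the same scores
)
```

In this toy run the two positively scored tweets are selected in the first iteration, and the loop ends in the second because the only remaining instance does not qualify.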
JMIR Public Health and Surveillance
Background: Web-based resources and social media platforms play an increasingly important role in health-related knowledge and experience sharing. There is a growing interest in the use of these novel data sources for epidemiological surveillance of substance use behaviors and trends. Objective: The key aims were to describe the development and application of the drug abuse ontology (DAO) as a framework for analyzing web-based and social media data to inform public health and substance use research in the following areas: determining user knowledge, attitudes, and behaviors related to nonmedical use of buprenorphine and illicitly manufactured opioids through the analysis of web forum data (Prescription Drug Abuse Online Surveillance); analyzing patterns and trends of cannabis product use in the context of evolving cannabis legalization policies in the United States through analysis of Twitter and web forum data (eDrugTrends); assessing trends in the availability of novel synthetic opioi...
2019 IEEE 13th International Conference on Semantic Computing (ICSC), 2019
While the general analysis of named entities has received substantial research attention on unstructured as well as structured data, the analysis of relations among named entities has received limited focus. In fact, a review of the literature revealed a deficiency in research on the abstract conceptualization required to organize relations. We believe that such an abstract conceptualization can benefit various communities and applications such as natural language processing, information extraction, machine learning, and ontology engineering. In this paper, we present the Comprehensive EVent Ontology (CEVO), built on Levin's conceptual hierarchy of English verbs that categorizes verbs with shared meaning and syntactic behavior. We present the fundamental concepts and requirements for this ontology. Furthermore, we present three use cases employing the CEVO ontology on annotation tasks: (i) annotating relations in plain text, (ii) annotating ontological properties, and (iii) linking textual relations to ontological properties. These use cases demonstrate the benefits of using CEVO for annotation: (i) annotating English verbs from an abstract conceptualization, (ii) playing the role of an upper ontology for organizing ontological properties, and (iii) facilitating the annotation of text relations using any underlying vocabulary. This resource is available at https://shekarpour.github.io/cevo.io/ using the https://w3id.org/cevo namespace.
Proceedings of the 1st ACM SIGSPATIAL Workshop on Advances on Resilient and Intelligent Cities, 2018
We employ multi-modal data (i.e., unstructured text, gazetteers, and imagery) for location-centric demand/request matching in the context of disaster relief. After classifying the Need expressed in a tweet (the WHAT), we leverage OpenStreetMap to geolocate that Need on a computationally accessible map of the local terrain (the WHERE) populated with location features such as hospitals and housing. Further, our novel use of flood mapping based on satellite images of the affected area supports the elimination of candidate resources that are not accessible by road transportation. The resulting map-based visualization combines disaster-related tweets, imagery and pre-existing knowledge-base resources (gazetteers) to reduce decision-making latency and enhance resiliency by assisting individual decision-makers and first responders for relief effort coordination.
Lecture Notes in Social Networks, 2018
Social media provides a virtual platform for users to share and discuss their daily life, activities, opinions, health, feelings, etc. Such personal accounts readily generate Big Data marked by velocity, volume, value, variety, and veracity challenges. This type of Big Data analytics already supports useful investigations ranging from research into data mining and developing public policy to actions targeting an individual in a variety of domains such as branding and marketing, crime and law enforcement, crisis monitoring and management, as well as public and personalized health management. However, using social media to solve domain-specific problems is challenging due to the complexity of the domain, lack of context, the colloquial nature of language, and changing topic relevance in temporally dynamic domains. In this article, we discuss the need to go beyond data-driven machine learning and natural language processing, and incorporate deep domain knowledge as well as knowledge of how experts and decision makers explore and perform contextual interpretation. Four use cases are used to demonstrate the role of domain knowledge in addressing each challenge.
ArXiv, 2018
This work addresses challenges arising from extracting entities from textual data, including the high cost of data annotation, model accuracy, selecting appropriate evaluation criteria, and the overall quality of annotation. We present a framework that integrates Entity Set Expansion (ESE) and Active Learning (AL) to reduce the annotation cost of sparse data and provide an online evaluation method as feedback. This incremental and interactive learning framework allows for rapid annotation and subsequent extraction of sparse data while maintaining high accuracy. We evaluate our framework on three publicly available datasets and show that it drastically reduces the cost of sparse entity annotation by an average of 85% and 45% to reach 0.9 and 1.0 F-Scores, respectively. Moreover, the method exhibited robust performance across all datasets.
CEUR workshop proceedings, 2018
Our current health applications do not adequately take into account contextual and personalized knowledge about patients. In order to design "Personalized Coach for Healthcare" applications to manage chronic diseases, there is a need to create a Personalized Healthcare Knowledge Graph (PHKG) that takes into consideration a patient's health condition (personalized knowledge) and enriches that with contextualized knowledge from environmental sensors and Web of Data (e.g., symptoms and treatments for diseases). To develop PHKG, aggregating knowledge from various heterogeneous sources such as the Internet of Things (IoT) devices, clinical notes, and Electronic Medical Records (EMRs) is necessary. In this paper, we explain the challenges of collecting, managing, analyzing, and integrating patients' health data from various sources in order to synthesize and deduce meaningful information embodying the vision of the Data, Information, Knowledge, and Wisdom (DIKW) pyramid....
ArXiv, 2018
Having a quality annotated corpus is essential, especially for applied research. Despite the recent focus of the Web science community on researching cyberbullying, the community still does not have standard benchmarks. In this paper, we publish, first, a quality annotated corpus and, second, an offensive-word lexicon capturing different types of harassment: (i) sexual harassment, (ii) racial harassment, (iii) appearance-related harassment, (iv) intellectual harassment, and (v) political harassment. We crawled data from Twitter using our offensive lexicon. We then relied on human judges to annotate the collected tweets w.r.t. the contextual types, because using offensive words is not sufficient to reliably detect harassment. Our corpus consists of 25,000 annotated tweets in five contextual types. We are pleased to share this novel annotated corpus and the lexicon with the research community. The instructions for acquiring the corpus have been published on the Git repository.
ArXiv, 2020
Suicide is the 10th leading cause of death in the US and the 2nd leading cause of death among teenagers. Clinical and psychosocial factors contribute to suicide risk (SRFs), although documentation and self-expression of such factors in EHRs and social networks vary. This study investigates the degree of variance across EHRs and social networks. We performed subjective analysis of SRFs, such as self-harm, bullying, impulsivity, and family violence/discord, using >13.8 million clinical notes on 123,703 patients with mental health conditions. We clustered clinical notes using semantic embeddings under a set of SRFs. Likewise, we clustered 2,180 suicidal users on r/SuicideWatch (∼30,000 posts) and performed comparative analysis. The top-3 SRFs documented in EHRs were depressive feelings (24.3%), psychological disorders (21.1%), and drug abuse (18.2%). In r/SuicideWatch, gun ownership (17.3%), self-harm (14.6%), and bullying (13.2%) were the top-3 SRFs. Mentions of Family violence, racial discrimination, and ...
350 million people are suffering from clinical depression worldwide. 27 million Americans are diagnosed with clinical depression, which is responsible for more than 30,000 suicides each year. Over 90% of people who commit suicide have been diagnosed with clinical depression or another diagnosable mental illness. According to the World Mental Health Survey conducted in 17 countries, about 5% of people reported having an episode of depression. Depression remains an undiagnosed, untreated, or under-treated phenomenon due to various reasons such as the denial of illness or the social stigma associated with it. Early recognition of depression symptoms and their treatment through timely intervention can prevent the onset of major depression. A common global effort to manage depression involves detecting depression through survey-based methods via phone or online questionnaires. However, these studies suffer from under-representation, sampling biases and incomplete information. A...
Proceedings of the 28th International Conference on Computational Linguistics, 2020
Existing studies on using social media for deriving the mental health status of users focus on the depression detection task. However, for case management and referral to psychiatrists, healthcare workers require a practical and scalable depressive disorder screening and triage system. This study aims to design and evaluate a decision support system (DSS) to reliably determine the depressive triage level by capturing fine-grained depressive symptoms expressed in user tweets through the emulation of the Patient Health Questionnaire-9 (PHQ-9) that is routinely used in clinical practice. The reliable detection of depressive symptoms from tweets is challenging because the 280-character limit on tweets incentivizes the use of creative artifacts in the utterances, and figurative usage contributes to effective expression. We propose a novel BERT-based robust multi-task learning framework to accurately identify the depressive symptoms using the auxiliary task of figurative usage detection. Specifically, our proposed novel task sharing mechanism, co-task aware attention, enables automatic selection of optimal information across the BERT layers and tasks by soft-sharing of parameters. Our results show that modeling figurative usage can demonstrably improve the model's robustness and reliability for distinguishing the depression symptoms.
2019 IEEE 13th International Conference on Semantic Computing (ICSC), 2019
Mental Health America designed ten questionnaires that are used to determine the risk of mental disorders. They are also commonly used by Mental Health Professionals (MHPs) to assess suicidality. Specifically, the Columbia Suicide Severity Rating Scale (C-SSRS), a widely used suicide assessment questionnaire, helps MHPs determine the severity of suicide risk and offer an appropriate treatment. A major challenge in suicide treatment is the social stigma wherein the patient is reluctant to discuss his/her conditions with an MHP, which leads to inaccurate assessment and treatment of patients. On the other hand, the same patient is comfortable freely discussing his/her mental health condition on social media due to the anonymity of platforms such as Reddit, and the ability to control what, when and how to share. The popular "SuicideWatch" subreddit has been widely used among individuals who experience suicidal thoughts, and provides significant cues for suicidality. The timeliness in sharing thoughts, the flexibility in describing feelings, and the interoperability in using medical terminologies make Reddit an important platform to be utilized as a complementary tool to the conventional healthcare system. As MHPs develop an implicit weighting scheme over the questionnaire (i.e., C-SSRS) to assess suicide risk severity, creating a relative weighting scheme for answers to be automatically generated to the questions in the questionnaire poses a key challenge. In this interdisciplinary study, we position our approach towards a solution for an automated suicide risk-elicitation framework through a novel question answering mechanism. Our twofold approach benefits from using: 1) semantic clustering, and 2) sequence-to-sequence (Seq2Seq) models. We also generate a gold standard dataset of suicide posts with their risk levels. This work forms a basis for the next step of building conversational agents that elicit suicide-related natural conversation based on questions.