Einat Minkov

University of Haifa, Information Systems, Faculty Member

Papers by Einat Minkov

Speech Recognition

We present a method to expand the number of languages covered by simple speech recognizers. Enabling speech recognition in users' primary languages greatly extends the types of mobile-phone-based applications available to people in developing regions. We describe how we expand language corpora through user-supplied speech contributions, how we quickly evaluate each contribution, and how we pay contributors for their work.

Crowd translator

ACM SIGOPS Operating Systems Review, 2010

We present a method to expand the number of languages covered by simple speech recognizers. Enabling speech recognition in users' primary languages greatly extends the types of mobile-phone-based applications available to people in developing regions. We describe how we expand language corpora through user-supplied speech contributions, how we quickly evaluate each contribution, and how we pay contributors for their work.

Transport Policy: Social Media and User-Generated Content in a Changing Information Paradigm

Springer eBooks, 2015

Rapid and recent developments in social media networks are providing a vision amongst transport suppliers, governments and academia of ‘next-generation’ information channels. This chapter identifies the main requirements for a social media information harvesting methodology in the transport context and highlights the challenges involved. Three questions are addressed concerning (1) The ways in which social media data can be used alongside or potentially instead of current transport data sources, (2) The technical challenges in text mining social media that create difficulties in generating high quality data for the transport sector and finally, (3) Whether there are wider institutional barriers in harnessing the potential of social media data for the transport sector. The chapter demonstrates that information harvested from social media can complement, enrich (or even replace) traditional data collection. Whilst further research is needed to develop automatic or semi-automatic methodologies for harvesting and analysing transport-related social media information, new skills are also needed in the sector to maximise the benefits of this new information source.

What are they up to? Distilling the Twitter Stream of Subpopulations

Social network researchers have been tackling community detection / community search for over a decade. Detecting communities – small groups of people who know each other and interact with each other – has numerous applications, starting from marketing and computational advertisement, all the way to the homeland security domain. By now, the problem can be considered mostly solved, in either its unsupervised form (community detection) or semi-supervised form (community search). In our quest to answer general – and very exciting – questions What are people up to? What do they care about? What are they discussing?, we move beyond detecting communities to circumscribing subpopulations – large groups of people who share some common characteristics, for example activists, students, engineers, New Yorkers, football fans etc. We want to know what are < · · · > talking about on Twitter, where < · · · > is any subpopulation. Initially, the subpopulation is characterized by a few ...

Identification of topical subpopulations on social media

Information Sciences, 2020

We tackle a major challenge of information filtering on social media (SM): rather than address the general question "what are people talking about on SM?", we consider a finer question, "what are ... talking about on SM?", where ... stands for some subpopulation of SM users of interest. We take a set expansion approach, where a seed of example members of the target subpopulation is initially defined, and additional SM users who belong to that subpopulation are identified, thus enabling the effective tracking of relevant information that pertains to that subpopulation on SM. Specifically, the Personalized PageRank (PPR) random walk measure is iteratively applied to detect additional members of the subpopulation based on their structural similarity to the seed set within the social media graph. There are several main contributions of this work. We outline Splash PPR, an efficient distributed computation of PPR adapted for potentially large seed sets and very large SM graphs. Using Splash PPR, we examine and tune graph representations towards the retrieval of two subpopulations from Twitter, namely human rights activists and Machine Learning practitioners. We believe this work is the first to introduce and evaluate a generic framework for subpopulation identification at scale.
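The set expansion idea described in this abstract can be illustrated with a small, single-machine sketch. It is not the Splash PPR computation named above, which is distributed and whose details are not given here; it is plain power-iteration Personalized PageRank with the seed set as the restart distribution, followed by one round of picking the top-scoring non-seed nodes. The function names, the restart probability `alpha`, the iteration count, and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict

def personalized_pagerank(edges, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR over a directed graph.

    edges: iterable of (source, target) pairs, e.g. follower -> followee.
    seeds: known members of the target subpopulation (the restart set).
    alpha: restart probability (an assumed value, not taken from the paper).
    """
    seeds = set(seeds)
    out = defaultdict(list)
    nodes = set(seeds)
    for u, v in edges:
        out[u].append(v)
        nodes.update((u, v))
    # The walk restarts uniformly over the seed set.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            targets = out.get(u)
            if targets:
                share = (1 - alpha) * score[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the seed set.
                for s in seeds:
                    nxt[s] += (1 - alpha) * score[u] / len(seeds)
        score = nxt
    return score

def expand_seed_set(edges, seeds, k=1, **kwargs):
    """Return the k highest-scoring non-seed nodes as candidate members."""
    seeds = set(seeds)
    score = personalized_pagerank(edges, seeds, **kwargs)
    candidates = [n for n in score if n not in seeds]
    return sorted(candidates, key=lambda n: (-score[n], str(n)))[:k]

# Toy graph (hypothetical accounts): "c" is reachable from the seeds, "d" is not.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("d", "a")]
print(expand_seed_set(toy_edges, {"a", "b"}, k=1))
```

An iterative variant would add the returned candidates to the seed set and repeat, which is one plausible reading of "iteratively applied" in the abstract.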

Quantifying the web browser ecosystem

PLoS ONE, 2017

Contrary to the assumption that web browsers are designed to support the user, an examination of 900,000 distinct PCs shows that web browsers comprise a complex ecosystem with millions of addons collaborating and competing with each other. It is possible for addons to "sneak in" through third party installations or to get "kicked out" by their competitors without user involvement. This study examines that ecosystem quantitatively by constructing a large-scale graph with nodes corresponding to users, addons, and words (terms) that describe addon functionality. Analyzing addon interactions at user level using the Personalized PageRank (PPR) random walk measure shows that the graph demonstrates ecological resilience. Adapting the PPR model to analyzing the browser ecosystem at the level of addon manufacturer, the study shows that some addon companies are in symbiosis and others clash with each other as shown by analyzing the behavior of 18 prominent addon manufact...

SocialVec: Social Entity Embeddings

Cornell University - arXiv, Nov 5, 2021

This paper introduces SocialVec, a general framework for eliciting social world knowledge from social networks, and applies this framework to Twitter. SocialVec learns low-dimensional embeddings of popular accounts, which represent entities of general interest, based on their co-occurrence patterns within the accounts followed by individual users, thus modeling entity similarity in socio-demographic terms. Similar to word embeddings, which facilitate tasks that involve text processing, we expect social entity embeddings to benefit tasks of social flavor. We have learned social embeddings for roughly 200,000 popular accounts from a sample of the Twitter network that includes more than 1.3 million users and the accounts that they follow, and evaluate the resulting embeddings on two different tasks. The first task involves the automatic inference of personal traits of users from their social media profiles. In another study, we exploit SocialVec embeddings for gauging the political bias of news sources in Twitter. In both cases, we show SocialVec embeddings to be advantageous compared with existing entity embedding schemes. We will make the SocialVec entity embeddings publicly available to support further exploration of social world knowledge as reflected in Twitter.

Character-level HyperNetworks for Hate Speech Detection

Expert Systems with Applications

The massive spread of hate speech, hateful content targeted at specific subpopulations, is a problem of critical social importance. Automated methods for hate speech detection typically employ state-of-the-art deep learning (DL)-based text classifiers, that is, very large pre-trained neural language models of over 100 million parameters, adapting these models to the task of hate speech detection using relevant labeled datasets. Unfortunately, there are only a few labeled datasets of limited size that are available for this purpose. We make several contributions with high potential for advancing this state of affairs. We present HyperNetworks for hate speech detection, a special class of DL networks whose weights are regulated by a small-scale auxiliary network. These architectures operate at character-level, as opposed to word-level, and are several orders of magnitude smaller than the popular DL classifiers. We further show that training hate detection classifiers using large amounts of automatically generated examples, a procedure known as data augmentation, is beneficial in general, yet this practice especially boosts the performance of the proposed HyperNetworks. In fact, using this approach, we achieve performance that is comparable to or better than that of state-of-the-art language models, which are pre-trained and orders of magnitude larger, as evaluated using five public hate speech datasets.

Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech

Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

Automatic hate speech detection is hampered by the scarcity of labeled datasets, leading to poor generalization. We employ pretrained language models (LMs) to alleviate this data bottleneck. We utilize the GPT LM for generating large amounts of synthetic hate speech sequences from available labeled examples, and leverage the generated data in fine-tuning large pretrained LMs on hate detection. An empirical study using the models of BERT, RoBERTa and ALBERT shows that this approach improves generalization significantly and consistently within and across data distributions. In fact, we find that generating relevant labeled hate speech sequences is preferable to using out-of-domain, and sometimes also within-domain, human-labeled examples.

Towards Hate Speech Detection at Large via Deep Generative Modeling

IEEE Internet Computing, 2021

Hate speech detection is a critical problem for social media platforms, which are often accused of enabling the spread of hatred and igniting physical violence. Hate speech detection requires overwhelming resources, including high-performance computing for monitoring online posts and tweets as well as thousands of human experts for daily screening of suspected posts or tweets. Recently, Deep Learning (DL)-based solutions have been proposed for automatic detection of hate speech, using modest-sized training datasets of a few thousand hate speech sequences. While these methods perform well on the specific datasets, their ability to detect new hate speech sequences is limited and has not been investigated. It is well known that DL, being a data-driven approach, surpasses other methods whenever a scale-up in training dataset size and diversity is achieved. Therefore, we first present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model. We further utilize the generated dataset to train a well-studied DL-based hate speech detector, and demonstrate consistent and significant performance improvements across five public hate speech datasets. Therefore, the proposed solution enables high sensitivity detection of a very large variety of hate speech sequences, paving the way to a fully automatic solution.

SocialVec: Social Entity Embeddings

ArXiv, 2021

This paper introduces SocialVec, a general framework for eliciting social world knowledge from social networks, and applies this framework to Twitter. SocialVec learns low-dimensional embeddings of popular accounts, which represent entities of general interest, based on their co-occurrence patterns within the accounts followed by individual users, thus modeling entity similarity in socio-demographic terms. Similar to word embeddings, which facilitate tasks that involve text processing, we expect social entity embeddings to benefit tasks of social flavor. We have learned social embeddings for roughly 200,000 popular accounts from a sample of the Twitter network that includes more than 1.3 million users and the accounts that they follow, and evaluate the resulting embeddings on two different tasks. The first task involves the automatic inference of personal traits of users from their social media profiles. In another study, we exploit SocialVec embeddings for gauging the political bia...

Multi-source named entity typing for social media

by Reuth Vexler and Einat Minkov

Proceedings of the Sixth Named Entity Workshop, 2016

An Email and Meeting Assistant Using Graph Walks

CEAS, 2006

We describe a framework for representing email as well as meeting information as a joint graph. In the graph, documents and meeting descriptions are connected via other nontextual objects representing the underlying structure-rich data. This framework integrates content, social networks and a timeline in a structural graph. Extended similarity metrics for objects embedded in the graph can be derived using a lazy graph walk paradigm. In this paper we evaluate this general framework for two meeting and email related tasks. A novel task considered is finding email-addresses of relevant attendees for a given meeting. Another task we define and evaluate is finding the full set of email-address aliases for a person, given the corresponding name string. The experimental results show promise of this approach over other possible methods.

Activity-centred Search in Email

CEAS, 2008

We consider activity-centered tasks in email, including the novel task of predicting future involvement of persons from an enterprise in an ongoing activity represented by a folder, and a novel task where we identify email messages that are related to a to-do item. We also evaluate the task of email tagging to folders, where multiple choice is allowed, and an inverse task, of finding messages relevant to a folder associated with an ongoing project. Empirical evaluation using real world email data, applying a graph based link analysis method and a vector-space model, shows potential utility to facilitate activity management in email.

Improving Graph-Walk-Based Similarity with Reranking: Case Studies for Personal Information Management

ACM Transactions on Information Systems, 2010

Relational or semi-structured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous in the sense that they describe different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading and person name disambiguation can all be addressed as search queries in this framework, where the graph-walk based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested tuning edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical run times, suggests that the described framework is a useful search system in the PIM domain, as well as in other semi-structured domains.
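To make the notion of a finite random walk over a typed graph more concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation, and the reranking step is not shown. The walk starts with all probability mass on a query node and runs for a fixed number of steps; at each step a fraction `stay` of the mass remains in place (a "lazy" walk) and the rest is spread over outgoing edges in proportion to per-edge-type weights. The function name, the parameters `steps`, `stay` and `type_weights`, and every node and edge label other than sent-to and has-term are hypothetical.

```python
from collections import defaultdict

def graph_walk(edges, query, steps=3, stay=0.5, type_weights=None):
    """Finite lazy random walk over a typed directed graph.

    edges: iterable of (source, edge_type, target) triples.
    query: node on which the walk starts with probability 1.
    stay: probability of remaining at the current node at each step (assumed).
    type_weights: optional {edge_type: weight}; unlisted types default to 1.0.
    Returns {node: probability} after `steps` steps, usable as a similarity score.
    """
    type_weights = type_weights or {}
    out = defaultdict(list)
    for u, t, v in edges:
        out[u].append((v, type_weights.get(t, 1.0)))
    dist = {query: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for u, p in dist.items():
            neighbors = out.get(u, [])
            total = sum(w for _, w in neighbors)
            if total == 0:
                nxt[u] += p  # no usable outgoing edges: all mass stays put
                continue
            nxt[u] += stay * p
            for v, w in neighbors:
                nxt[v] += (1 - stay) * p * w / total
        dist = dict(nxt)
    return dist

# Toy personal-information graph (hypothetical data).
toy_graph = [
    ("msg1", "sent-to", "alice"),
    ("msg1", "has-term", "budget"),
    ("budget", "in-message", "msg2"),
    ("alice", "received", "msg1"),
]
ranking = graph_walk(toy_graph, "msg1", steps=2)
print(sorted(ranking.items(), key=lambda kv: -kv[1]))
```

Ranking nodes of a chosen type by their final probability turns the walk into a search query; tuning `type_weights` per task corresponds to the edge-weight tuning the abstract contrasts with reranking.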

Learning to understand web site update requests

Proceedings of the 19th International Joint Conference on Artificial Intelligence, Jul 30, 2005

We experimentally evaluate components of a system that learns to analyze natural-language requests to update information on a database-backed website. Our long-term goal is to develop a system which can adapt to changes in the distribution of requests as the underlying database schema changes; in short, a system that performs deep analysis of text in a domain of discourse that is limited, but which grows over time. We describe a scheme for decomposing request-understanding into a sequence of entity recognition and text classification tasks, each of which can be solved using standard learning methods. We then present experimental results that quantify how well these tasks can be learned.

Graph based similarity measures for synonym extraction from parsed text

Workshop Proceedings of Textgraphs 7 on Graph Based Methods For Natural Language Processing, Jul 13, 2012


Learning to Walk Structured Text Networks

We propose representing a text corpus as a labeled directed graph, where nodes represent words and weighted edges represent the syntactic relations between them, as derived by dependency parsing. Given this graph, we adopt a graph-based similarity measure based on random walks to derive a similarity measure between words, and also use supervised learning to improve the derived similarity measure for a particular task. Empirical evaluation of the approach on the task of coordinate term extraction shows that the suggested framework improves on a state-of-the-art distributional similarity measure.

Speech Recognition

We present a method to expand the number of languages covered by simple speech recognizers. Enabl... more We present a method to expand the number of languages covered by simple speech recognizers. Enabling speech recognition in users ’ primary languages greatly extends the types of mobile-phone-based applications available to people in developing regions. We describe how we expand language corpora through user-supplied speech contributions, how we quickly evaluate each contribution, and how we pay contributors for their work. Index Terms:

Crowd translator

ACM SIGOPS Operating Systems Review, 2010

We present a method to expand the number of languages covered by simple speech recognizers. Enabl... more We present a method to expand the number of languages covered by simple speech recognizers. Enabling speech recognition in users' primary languages greatly extends the types of mobile-phone-based applications available to people in developing regions. We describe how we expand language corpora through user-supplied speech contributions, how we quickly evaluate each contribution, and how we pay contributors for their work.

Transport Policy: Social Media and User-Generated Content in a Changing Information Paradigm

Springer eBooks, 2015

Rapid and recent developments in social media networks are providing a vision amongst transport s... more Rapid and recent developments in social media networks are providing a vision amongst transport suppliers, governments and academia of ‘next-generation’ information channels. This chapter identifies the main requirements for a social media information harvesting methodology in the transport context and highlights the challenges involved. Three questions are addressed concerning (1) The ways in which social media data can be used alongside or potentially instead of current transport data sources, (2) The technical challenges in text mining social media that create difficulties in generating high quality data for the transport sector and finally, (3) Whether there are wider institutional barriers in harnessing the potential of social media data for the transport sector. The chapter demonstrates that information harvested from social media can complement, enrich (or even replace) traditional data collection. Whilst further research is needed to develop automatic or semi-automatic methodologies for harvesting and analysing transport-related social media information, new skills are also needed in the sector to maximise the benefits of this new information source.

What are they up to ? Distilling the Twitter Stream of Subpopulations

Social network researchers have been tackling community detection / community search for over a d... more Social network researchers have been tackling community detection / community search for over a decade. Detecting communities – small groups of people who know each other and interact with each other – have numerous applications, starting from marketing and computational advertisement, all the way to the homeland security domain. By now, the problem can be considered mostly solved, in either its unsupervised form (community detection) or semi-supervised form (community search). In our quest to answer general – and very exciting – questions What are people up to? What do they care about? What are they discussing?, we move beyond detecting communities to circumscribing subpopulations – large groups of people who share some common characteristics, for example activists, students, engineers, New Yorkers, football fans etc. We want to know what are < · · · > talking about on Twitter, where < · · · > is any subpopulation. Initially, the subpopulation is characterized by a few ...

Identification of topical subpopulations on social media

Information Sciences, 2020

We tackle a major challenge of information filtering on social media (SM): rather than address th... more We tackle a major challenge of information filtering on social media (SM): rather than address the general question "what are people talking about on SM?", we consider a finer question, "what are ... talking about on SM?", where ... stands for some subpopulation of SM users of interest. We take a set expansion approach, where a seed of example members of the target subpopulation is initially defined, and additional SM users who belong to that subpopulation are identified, thus enabling the effective tracking of relevant information that pertains to that subpopulation on SM. Specifically, the Personalized PageRank (PPR) random walk measure is iteratively applied to detect additional members of the subpopulation based on their structural similarity to the seed set within the social media graph. There are several main contributions of this work. We outline Splash PPR , an efficient distributed computation of PPR adapted for potentially large seed sets and very large SM graphs. Using Splash PPR, we examine and tune graph representations towards the retrieval of two subpopulations from Twitter, namely human rights Activists, and Machine Learning practitioners. We believe this work is first to introduce and evaluate a generic framework for subpopulation identification at scale.

Quantifying the web browser ecosystem

PloS one, 2017

Contrary to the assumption that web browsers are designed to support the user, an examination of ... more Contrary to the assumption that web browsers are designed to support the user, an examination of a 900,000 distinct PCs shows that web browsers comprise a complex ecosystem with millions of addons collaborating and competing with each other. It is possible for addons to "sneak in" through third party installations or to get "kicked out" by their competitors without user involvement. This study examines that ecosystem quantitatively by constructing a large-scale graph with nodes corresponding to users, addons, and words (terms) that describe addon functionality. Analyzing addon interactions at user level using the Personalized PageRank (PPR) random walk measure shows that the graph demonstrates ecological resilience. Adapting the PPR model to analyzing the browser ecosystem at the level of addon manufacturer, the study shows that some addon companies are in symbiosis and others clash with each other as shown by analyzing the behavior of 18 prominent addon manufact...

SocialVec: Social Entity Embeddings

Cornell University - arXiv, Nov 5, 2021

This paper introduces SocialVec, a general framework for eliciting social world knowledge from so... more This paper introduces SocialVec, a general framework for eliciting social world knowledge from social networks, and applies this framework to Twitter. SocialVec learns lowdimensional embeddings of popular accounts, which represent entities of general interest, based on their co-occurrences patterns within the accounts followed by individual users, thus modeling entity similarity in socio-demographic terms. Similar to word embeddings, which facilitate tasks that involve text processing, we expect social entity embeddings to benefit tasks of social flavor. We have learned social embeddings for roughly 200,000 popular accounts from a sample of the Twitter network that includes more than 1.3 million users and the accounts that they follow, and evaluate the resulting embeddings on two different tasks. The first task involves the automatic inference of personal traits of users from their social media profiles. In another study, we exploit SocialVec embeddings for gauging the political bias of news sources in Twitter. In both cases, we prove SocialVec embeddings to be advantageous compared with existing entity embedding schemes. We will make the SocialVec entity embeddings publicly available to support further exploration of social world knowledge as reflected in Twitter.

Character-level HyperNetworks for Hate Speech Detection

Expert Systems with Applications

The massive spread of hate speech, hateful content targeted at specific subpopulations, is a prob... more The massive spread of hate speech, hateful content targeted at specific subpopulations, is a problem of critical social importance. Automated methods for hate speech detection typically employ state-of-the-art deep learning (DL)-based text classifiers-very large pre-trained neural language models of over 100 million parameters, adapting these models to the task of hate speech detection using relevant labeled datasets. Unfortunately, there are only numerous labeled datasets of limited size that are available for this purpose. We make several contributions with high potential for advancing this state of affairs. We present HyperNetworks for hate speech detection, a special class of DL networks whose weights are regulated by a small-scale auxiliary network. These architectures operate at character-level, as opposed to word-level, and are several magnitudes of order smaller compared to the popular DL classifiers. We further show that training hate detection classifiers using large amounts of automatically generated examples in a procedure named as data augmentation is beneficial in general, yet this practice especially boosts the performance of the proposed HyperNetworks. In fact, we achieve performance that is comparable or better than state-of-the-art language models, which are pre-trained and orders of magnitude larger, using this approach, as evaluated using five public hate speech datasets.

Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech

Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

Automatic hate speech detection is hampered by the scarcity of labeled datasetd, leading to poor ... more Automatic hate speech detection is hampered by the scarcity of labeled datasetd, leading to poor generalization. We employ pretrained language models (LMs) to alleviate this data bottleneck. We utilize the GPT LM for generating large amounts of synthetic hate speech sequences from available labeled examples, and leverage the generated data in fine-tuning large pretrained LMs on hate detection. An empirical study using the models of BERT, RoBERTa and ALBERT, shows that this approach improves generalization significantly and consistently within and across data distributions. In fact, we find that generating relevant labeled hate speech sequences is preferable to using out-of-domain, and sometimes also within-domain, human-labeled examples.

Towards Hate Speech Detection at Large via Deep Generative Modeling

IEEE Internet Computing, 2021

Hate speech detection is a critical problem in social media platforms, being often accused for en... more Hate speech detection is a critical problem in social media platforms, being often accused for enabling the spread of hatred and igniting physical violence. Hate speech detection requires overwhelming resources including high-performance computing for online posts and tweets monitoring as well as thousands of human experts for daily screening of suspected posts or tweets. Recently, Deep Learning (DL)-based solutions have been proposed for automatic detection of hate speech, using modest-sized training datasets of few thousands of hate speech sequences. While these methods perform well on the specific datasets, their ability to detect new hate speech sequences is limited and has not been investigated. Being a data-driven approach, it is well known that DL surpasses other methods whenever a scale-up in train dataset size and diversity is achieved. Therefore, we first present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model. We further utilize the generated dataset to train a well-studied DL-based hate speech detector, and demonstrate consistent and significant performance improvements across five public hate speech datasets. Therefore, the proposed solution enables high sensitivity detection of a very large variety of hate speech sequences, paving the way to a fully automatic solution.

SocialVec: Social Entity Embeddings

ArXiv, 2021

This paper introduces SocialVec, a general framework for eliciting social world knowledge from so... more This paper introduces SocialVec, a general framework for eliciting social world knowledge from social networks, and applies this framework to Twitter. SocialVec learns lowdimensional embeddings of popular accounts, which represent entities of general interest, based on their co-occurrences patterns within the accounts followed by individual users, thus modeling entity similarity in socio-demographic terms. Similar to word embeddings, which facilitate tasks that involve text processing, we expect social entity embeddings to benefit tasks of social flavor. We have learned social embeddings for roughly 200,000 popular accounts from a sample of the Twitter network that includes more than 1.3 million users and the accounts that they follow, and evaluate the resulting embeddings on two different tasks. The first task involves the automatic inference of personal traits of users from their social media profiles. In another study, we exploit SocialVec embeddings for gauging the political bia...

Multi-source named entity typing for social media

by Reuth Vexler and Einat Minkov

Proceedings of the Sixth Named Entity Workshop, 2016

An Email and Meeting Assistant Using Graph Walks

CEAS, 2006

We describe a framework for representing email as well as meeting information as a joint graph. In the graph, documents and meeting descriptions are connected via other nontextual objects representing the underlying structure-rich data. This framework integrates content, social networks and a timeline in a structural graph. Extended similarity metrics for objects embedded in the graph can be derived using a lazy graph walk paradigm. In this paper we evaluate this general framework for two meeting- and email-related tasks. A novel task considered is finding email addresses of relevant attendees for a given meeting. Another task we define and evaluate is finding the full set of email-address aliases for a person, given the corresponding name string. The experimental results show the promise of this approach over other possible methods.

Activity-centred Search in Email

CEAS, 2008

We consider activity-centered tasks in email, including the novel task of predicting future involvement of persons from an enterprise in an ongoing activity represented by a folder, and a novel task where we identify email messages that are related to a to-do item. We also evaluate the task of email tagging to folders, where multiple choice is allowed, and an inverse task of finding messages relevant to a folder associated with an ongoing project. Empirical evaluation using real-world email data, applying a graph-based link analysis method and a vector-space model, shows potential utility to facilitate activity management in email.

Improving Graph-Walk-Based Similarity with Reranking: Case Studies for Personal Information Management

ACM Transactions on Information Systems, 2010

Relational or semi-structured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous in the sense that they describe different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading and person name disambiguation can all be addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested tuning edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical run times, suggests that the described framework is a useful search system in the PIM domain, as well as in other semi-structured domains.
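
As a rough illustration of the finite graph walk idea in the abstract above, the following minimal sketch propagates probability mass from a query node over a small typed graph for a fixed number of steps and ranks nodes by the mass they receive. It is not the paper's implementation: the function name `graph_walk`, the `stay_prob` parameter, the uniform edge weights, the inverse relation names and the toy data are all assumptions made for this example, and the learned edge weights and reranking step are omitted.

```python
from collections import defaultdict


def graph_walk(edges, start, steps=3, stay_prob=0.5):
    """Finite lazy random walk; returns the probability mass per node.

    edges: iterable of (source, relation, target) triples.
    Assumption: every outgoing edge of a node gets equal weight,
    regardless of its relation type.
    """
    out = defaultdict(list)
    for src, _relation, dst in edges:
        out[src].append(dst)

    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in dist.items():
            neighbours = out.get(node)
            if not neighbours:
                nxt[node] += mass  # dead end: all mass stays in place
                continue
            nxt[node] += stay_prob * mass  # "lazy" part of the walk
            share = (1.0 - stay_prob) * mass / len(neighbours)
            for neighbour in neighbours:
                nxt[neighbour] += share
        dist = dict(nxt)
    return dist


# Hypothetical toy graph; "sent-to" and "has-term" come from the abstract,
# the inverse relations are invented for illustration.
edges = [
    ("msg1", "sent-to", "alice"),
    ("msg1", "has-term", "budget"),
    ("alice", "received", "msg1"),
    ("budget", "term-of", "msg2"),
    ("msg2", "sent-to", "bob"),
]

scores = graph_walk(edges, "msg1", steps=2)
ranked = sorted(scores.items(), key=lambda item: -item[1])
```

Here `ranked` lists the nodes reachable from `msg1` within two steps, ordered by walk probability, which is the sense in which such a walk can serve as a search over the graph.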

Learning to understand web site update requests

Proceedings of the 19th International Joint Conference on Artificial Intelligence, Jul 30, 2005

We experimentally evaluate components of a system that learns to analyze natural-language requests to update information on a database-backed website. Our long-term goal is to develop a system which can adapt to changes in the distribution of requests as the underlying database schema changes; in short, a system that performs deep analysis of text in a domain of discourse that is limited, but which grows over time. We describe a scheme for decomposing request-understanding into a sequence of entity recognition and text classification tasks, each of which can be solved using standard learning methods. We then present experimental results that quantify how well these tasks can be learned.

Graph based similarity measures for synonym extraction from parsed text

Workshop Proceedings of TextGraphs-7 on Graph-based Methods for Natural Language Processing, Jul 13, 2012

... Foundations of Statistical Natural Language Processing. MIT Press. Einat Minkov and William ...

Learning to Walk Structured Text Networks

We propose representing a text corpus as a labeled directed graph, where nodes represent words and weighted edges represent the syntactic relations between them, as derived by dependency parsing. Given this graph, we adopt a graph-based similarity measure based on random walks to derive a similarity measure between words, and also use supervised learning to improve the derived similarity measure for a particular task. Empirical evaluation of the approach on the task of coordinate term extraction shows that the suggested framework improves on a state-of-the-art distributional similarity measure.