The emergence of social media has made it more difficult to recognize and analyze misinformation efforts. The popular messaging application Telegram (Durov, 2013) has developed into a medium for disseminating political messages and misinformation, particularly in light of the conflict in Ukraine (Wikipedia contributors, 2023). In this paper, we introduce a sizable corpus of Telegram posts containing pro-Russian propaganda and benign political texts. We evaluate the corpus by applying natural language processing (NLP) techniques to the task of text classification. Our findings indicate that our method can successfully identify and categorize pro-Russian propaganda posts, with an overall accuracy of over 96% on sources confirmed as propagandist or oppositional and 92% on unconfirmed sources. We highlight the implications of our research for understanding political communication and propaganda on social media.
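The abstract above does not detail the classifier, so the following is only an illustrative sketch of the propaganda-vs-benign text classification task, using a toy bag-of-words Naive Bayes model; all example posts and labels are invented, not taken from the Telegram corpus.

```python
# Toy bag-of-words Naive Bayes classifier: a sketch of the task shape only,
# not the authors' actual pipeline. All training posts below are invented.
from collections import Counter, defaultdict
import math

train = [
    ("propaganda", "our glorious forces are winning on every front"),
    ("propaganda", "the enemy spreads lies about our peaceful operation"),
    ("benign", "city council approves new budget for road repairs"),
    ("benign", "local elections scheduled for next spring"),
]

word_counts = defaultdict(Counter)  # class -> word frequencies
class_counts = Counter()
for label, text in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text):
    """Score each class by log prior + add-one-smoothed log likelihoods."""
    words = text.lower().split()
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("road repairs will start after the elections"))  # benign
```

A real system would of course replace the hand-built counts with a trained model over the full corpus; the sketch only shows how lexical evidence separates the two classes.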
The issue of factual consistency in abstractive summarization has received extensive attention in recent years, and evaluating the factual consistency between a summary and its source document has become an important and urgent task. Most current metrics are adopted from the question answering (QA) or natural language inference (NLI) tasks. However, QA-based metrics are extremely time-consuming in practice, while NLI-based metrics lack interpretability. In this paper, we propose a cloze-based evaluation framework called ClozE and show its great potential. It inherits strong interpretability from QA while maintaining NLI-level reasoning speed. Through experiments on six human-annotated datasets and the meta-evaluation benchmark GO FIGURE [1], we demonstrate that ClozE reduces evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance. Finally, we discuss three important facets of ClozE in practice, which further show the better overall performance of ClozE compared to other metrics. The code and models are released at https://github.com/Mr-KenLee/ClozE.
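A highly simplified sketch of the cloze-style intuition: mask candidate factual spans in the summary and score consistency by how many can be recovered from the source document. This is not the released ClozE implementation (which uses a neural cloze model); it is a degenerate string-matching stand-in, with invented example texts, meant only to convey the idea.

```python
# Degenerate cloze-style consistency check: NOT the ClozE model, just a
# string-matching sketch of the idea. Capitalized words and numbers stand
# in for the factual spans a real system would mask and fill.
import re

def toy_cloze_score(document: str, summary: str) -> float:
    doc_tokens = set(re.findall(r"\w+", document.lower()))
    # Treat capitalized words and numbers as candidate factual spans.
    spans = re.findall(r"\b(?:[A-Z][a-z]+|\d+)\b", summary)
    if not spans:
        return 1.0  # nothing factual to verify
    recovered = sum(1 for s in spans if s.lower() in doc_tokens)
    return recovered / len(spans)

doc = "Alice joined Acme in 2019 and leads its research group in Berlin."
good = "Alice leads research at Acme in Berlin."
bad = "Bob leads research at Acme in Paris."
print(toy_cloze_score(doc, good))  # 1.0
print(toy_cloze_score(doc, bad))   # about 0.33: Bob and Paris are unsupported
```

The real framework replaces the membership test with a model that fills each masked span conditioned on the document, which is what gives it QA-like interpretability at NLI-like speed.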
With millions of documented recoveries from COVID-19 worldwide, various long-term sequelae have been observed in a large group of survivors. This paper systematically analyzes user-generated conversations on Twitter related to long-term COVID symptoms for a better understanding of the Long COVID health consequences. Using an interactive information extraction tool built especially for this purpose, we extracted key information from the relevant tweets and analyzed the user-reported Long COVID symptoms with respect to their demographic and geographical characteristics. The results of our analysis are expected to raise public awareness of long-term COVID-19 sequelae and provide important insights to public health authorities.
Proceedings of the 2nd Workshop on Deriving Insights from User-Generated Text
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources associated with RANLP 2019, 2019
Automatic headline generation is a subtask of one-line summarization with many reported applications. Evaluation of headline-generation systems is a very challenging and underdeveloped area. We introduce the Headline Evaluation and Analysis System (HEvAS), which performs automatic evaluation of systems in terms of the quality of the generated headlines. HEvAS provides two types of metrics: one that measures the informativeness of a headline, and another that measures its readability. The evaluation results can be compared to those of the baseline methods implemented in HEvAS. The system also performs statistical analysis of the evaluation results and provides various visualization charts. This paper describes all the evaluation metrics, baselines, analyses, and architecture utilized by our system.
We are given a temporal customer-oriented dataset, where each transaction consists of a set of events associated with a customer id and a timestamp. For each customer there are several transactions with different timestamps, and one of the events is defined as the target event. We introduce the problem of mining target-event rules based on the discovery of continuous sequential patterns over such databases. We propose two algorithms to solve this problem and evaluate their performance using real-life data. The presented algorithms, CTSPD and CSPADE, achieve comparable results in our evaluation, but CTSPD, as expected, turns out to be faster.
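The abstract does not specify how CTSPD or CSPADE work internally, so the following is only a toy illustration of the target-event rule idea: measure how often a candidate pattern of events is immediately followed by the target event within each customer's time-ordered history. The event names and data are invented.

```python
# Toy target-event rule confidence (NOT CTSPD or CSPADE): for a candidate
# contiguous event pattern, compute the fraction of its occurrences that are
# immediately followed by the target event in a customer's ordered history.
from collections import defaultdict

def rule_confidence(transactions, pattern, target):
    """transactions: list of (customer_id, timestamp, event) triples."""
    histories = defaultdict(list)
    for cid, ts, event in sorted(transactions, key=lambda t: (t[0], t[1])):
        histories[cid].append(event)
    matches = followed = 0
    n = len(pattern)
    for events in histories.values():
        for i in range(len(events) - n + 1):
            if events[i:i + n] == list(pattern):
                matches += 1
                if i + n < len(events) and events[i + n] == target:
                    followed += 1
    return followed / matches if matches else 0.0

data = [  # hypothetical clickstream events, one triple per event
    (1, 1, "browse"), (1, 2, "add_to_cart"), (1, 3, "purchase"),
    (2, 1, "browse"), (2, 2, "add_to_cart"), (2, 3, "leave"),
    (3, 1, "browse"), (3, 2, "leave"),
]
print(rule_confidence(data, ("browse", "add_to_cart"), "purchase"))  # 0.5
```

The actual algorithms mine such patterns efficiently over large databases rather than scanning every candidate, but the confidence quantity being estimated is of this general shape.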
The HHD_gender dataset contains 819 handwritten forms written by volunteers of different educational backgrounds and ages (from as young as 11 years old to late 60s), both native and non-native Hebrew speakers. There are 50 variations of the forms; each form contains a text paragraph with 62 words on average. For the experiments, the HHD_gender dataset was randomly subdivided into training (80%), validation (10%), and test (10%) sets.

This database may be used for non-commercial research purposes only. If you publish material based on this database, we ask that you include a reference to the following papers:
[1] I. Rabaev, B. Kurar Barakat, A. Churkin and J. El-Sana. The HHD Dataset. The 17th International Conference on Frontiers in Handwriting Recognition, pp. 228-233, 2020, DOI: 10.1109/ICFHR2020.2020.00050
[2] I. Rabaev, M. Litvak, S. Asulin and O.H. Tabibi. Aut...
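The reported 80/10/10 random split can be sketched as follows; the integer ids stand in for the 819 forms, and the seed is an arbitrary choice for reproducibility, not one taken from the papers.

```python
# Sketch of an 80/10/10 random split over the 819 HHD_gender forms.
# Form ids and the seed are illustrative, not from the original experiments.
import random

def split_dataset(items, seed=42):
    items = list(items)
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(items)
    n = len(items)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

forms = list(range(819))  # one id per handwritten form
train, val, test = split_dataset(forms)
print(len(train), len(val), len(test))  # 655 81 83
```

Note that with integer truncation the test split absorbs the rounding remainder, so the sizes sum exactly to 819.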
The first edition of the Implicit Author Characterization from Texts for Search and Retrieval workshop (IACT'23) aims at bringing to the forefront the challenges involved in identifying and extracting implicit information about authors (e.g., human or AI) from texts and using it in IR tasks. The IACT workshop provides a common forum to consolidate multidisciplinary efforts and foster discussions that identify the wide-ranging issues related to extracting implicit author-related information from textual content, including novel tasks and datasets. We will also discuss the ethical implications of implicit information extraction. In addition, we announce a shared task focused on automatically determining the literary epochs of written books. CCS CONCEPTS: • Information systems → Information retrieval; • Computing methodologies → Machine learning; Information extraction; • Applied computing → Document management and text processing; Law, social and behavioral sciences; • Social and professional topics → User characteristics.
Extractive text summarization aims at selecting a small subset of sentences such that the contents and meaning of the original document are best preserved. In this paper we describe an unsupervised approach to extractive summarization that combines hierarchical topic modeling (TM) with the Minimum Description Length (MDL) principle and applies them to the Chinese language. Our summarizer strives to extract the information that provides the best description of the text's topics in terms of MDL. The model is applied to the NLPCC 2015 Shared Task on Weibo-Oriented Chinese News Summarization [1], in which Chinese news articles were summarized into short, meaningful messages for Sina Weibo, a Chinese microblogging website and one of the most popular sites in China [2]. The experimental results demonstrate the superiority of our approach over the other summarizers from the NLPCC 2015 competition.
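The coverage intuition behind MDL-style extractive selection can be gestured at with a toy greedy summarizer: repeatedly pick the sentence that covers the most not-yet-covered frequent terms. This is only a sketch of the selection principle, not the paper's hierarchical topic model with MDL, and the example text is invented.

```python
# Toy greedy extractive summarizer: choose sentences that best cover the
# document's frequent terms. A sketch of the coverage intuition only; the
# paper's actual model uses hierarchical topic modeling with MDL.
from collections import Counter
import re

def summarize(text, k=1):
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freqs = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    covered, chosen = set(), []
    for _ in range(min(k, len(sentences))):
        def gain(s):
            # Total frequency mass of terms this sentence would newly cover.
            return sum(freqs[w]
                       for w in set(re.findall(r"\w+", s.lower())) - covered)
        best = max((s for s in sentences if s not in chosen), key=gain)
        chosen.append(best)
        covered |= set(re.findall(r"\w+", best.lower()))
    return chosen

text = ("Topic models describe documents. Topic models compress documents well. "
        "The weather was nice.")
print(summarize(text, k=1))  # picks the sentence with the most frequent-term mass
```

An MDL formulation would replace the raw frequency gain with the reduction in description length of the document given the chosen sentences, but the greedy selection loop has the same structure.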
Papers by Marina Litvak