COVID-19 Named Entity Recognition for Vietnamese

Đạt Nguyễn

2021, arXiv (Cornell University)



Thinh Hung Truong, Mai Hoang Dao and Dat Quoc Nguyen
VinAI Research, Hanoi, Vietnam
{v.thinhth88, v.maidh3, v.datnq9}@vinai.io

arXiv:2104.03879v1 [cs.CL] 8 Apr 2021

Abstract

The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually annotated COVID-19 domain-specific dataset for Vietnamese. In particular, our dataset is annotated for the named entity recognition (NER) task with newly defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We conduct experiments using strong baselines on our dataset and find that (i) automatic Vietnamese word segmentation helps improve the NER results, and (ii) the highest performance is obtained by fine-tuning pre-trained language models, where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020). We publicly release our dataset at: https://github.com/VinAIResearch/PhoNER_COVID19.

1 Introduction

As of early November 2020, the total number of COVID-19 cases worldwide has surpassed 50M.¹ The world is once again hit by a new wave of COVID-19 infection, with record-breaking numbers of new cases reported every day. Along with the outbreak of the pandemic, information about COVID-19 is aggregated rapidly through different types of texts in different languages (Aizawa et al., 2020).
¹ https://www.worldometers.info/coronavirus/worldwidegraphs/#total-cases

In Vietnam in particular, text reports containing official government information about COVID-19 cases are presented in great detail, including de-identified personal information, travel history, and information about people who come into contact with the cases. The reports are frequently kept up to date at reputable online news sources, playing a significant role in helping the country combat the pandemic. It is thus essential to build systems that retrieve and condense information from those official sources so that the people and organizations concerned can promptly grasp the key information for epidemic prevention tasks. Such systems should also be able to adapt quickly to epidemics that take place in the future.

One of the first steps in developing such systems is to recognize relevant named entities mentioned in the texts, which is known as the NER task. Compared to other languages, data resources for the Vietnamese NER task are limited, including only two public datasets from the VLSP 2016 and 2018 NER shared tasks (Huyen and Luong, 2016; Nguyen et al., 2018b). Here, the VLSP-2018 NER dataset is an extension of the VLSP-2016 NER dataset with more data. These two datasets focus only on recognizing generic entities (person names, organizations, and locations) in online news articles, which makes them difficult to adapt to extracting key entity information related to COVID-19 patients. This leads to the main goals of our work: (i) to develop an NER task in the COVID-19 specific domain that can potentially impact research and downstream applications, and (ii) to provide the research community with a new dataset for recognizing COVID-19 related named entities in Vietnamese.

In this paper, we present a named entity annotated dataset with newly defined entity types that can be applied to future epidemics.
The dataset contains informative sentences related to COVID-19, extracted from articles crawled from reputable Vietnamese online news sites. Here, we do not consider other types of popular social media in Vietnam such as Facebook, as they contain much noisy information and are not as reliable as official news sources. We then empirically evaluate strong baseline models on our dataset.

Label              Definition
PATIENT_ID         Unique identifier of a COVID-19 patient in Vietnam. A PATIENT_ID annotation over “X” refers to the Xth patient having COVID-19 in Vietnam.
PERSON_NAME        Name of a patient or person who comes into contact with a patient.
AGE                Age of a patient or person who comes into contact with a patient.
GENDER             Gender of a patient or person who comes into contact with a patient.
OCCUPATION         Job of a patient or person who comes into contact with a patient.
LOCATION           Locations/places that a patient was present at.
ORGANIZATION       Organizations related to a patient, e.g. company, government organization, and the like, with structures and their own functions.
SYMPTOM&DISEASE    Symptoms that a patient experiences, and diseases that a patient had prior to COVID-19 or complications that usually appear in death reports.
TRANSPORTATION     Means of transportation that a patient used. Here, we only tag the specific identifier of vehicles, e.g. flight numbers and bus/car plates.
DATE               Any date that appears in the sentence.

Table 1: Definitions of entity types in our annotation guidelines. We do not annotate nested entities.

Our contributions are summarized as follows:

• We introduce the first manually annotated Vietnamese dataset in the COVID-19 domain. Our dataset is annotated with 10 different named entity types related to COVID-19 patients in Vietnam. Compared to the VLSP-2016 and VLSP-2018 Vietnamese NER datasets, our dataset has the largest number of entities, consisting of 35K entities over 10K sentences.
• We empirically investigate strong baselines on our dataset, including BiLSTM-CNN-CRF (Ma and Hovy, 2016) and the pre-trained language models XLM-R (Conneau et al., 2020) and PhoBERT (Nguyen and Nguyen, 2020). We find that: (i) automatic Vietnamese word segmentation helps improve the NER results, and (ii) the highest results are obtained by fine-tuning the pre-trained language models, where PhoBERT does better than XLM-R.

• We publicly release our dataset for research or educational purposes. We hope that our dataset can serve as a starting point for future COVID-19 related Vietnamese NLP research and applications.

2 Related work

Most COVID-19 related datasets are constructed from two types of sources. The first is scientific publications, including the datasets CORD-19 (Wang et al., 2020) and LitCovid (Chen et al., 2020), which help facilitate many types of research, such as building search engines to retrieve relevant information from scholarly articles (Esteva et al., 2020; Zhang et al., 2020; Verspoor et al., 2020), and question answering and summarization (Lee et al., 2020; Su et al., 2020). Recently, Colic et al. (2020) fine-tune a BERT-based NER model on the CRAFT corpus (Verspoor et al., 2012) to recognize and then normalize biomedical ontology and terminology entities in LitCovid.

The second type is social media data, particularly Tweets. COVID-19 related Tweet datasets are built for many analytic tasks, such as identification of informative Tweets (Nguyen et al., 2020b), and disinformation detection and fact-checking (Shahi and Nandini, 2020; Alam et al., 2020; Alsudias and Rayson, 2020). The work most relevant to ours is that of Zong et al. (2020), which aims to extract COVID-19 events reporting test results, death cases, cures and prevention from English Tweets. As Twitter is rarely used by Vietnamese people, we could not use it for data collection.
3 Our dataset

3.1 Entity types

We define 10 entity types with the aim of extracting key information related to COVID-19 patients, which is especially useful in downstream applications. In general, these entity types can be used not only in the context of the COVID-19 pandemic but also in other future epidemics. Each entity type is briefly described in Table 1. See the Appendix for entity examples as well as some notes on the entity types.

3.2 COVID-19 related data collection

We first crawl articles tagged with the "COVID-19" or "COVID" keywords from reputable Vietnamese online news sites, including VnExpress,² ZingNews,³ BaoMoi⁴ and ThanhNien.⁵ These articles are dated between February 2020 and August 2020. We then segment the primary text content of the crawled news articles into sentences using RDRSegmenter (Nguyen et al., 2018a) from VnCoreNLP (Vu et al., 2018). To retrieve informative sentences about COVID-19 patients, we employ BM25Plus (Trotman et al., 2014) with search queries of common keywords appearing in sentences that report confirmed, suspected, recovered, or death cases, as well as the travel history or location of the cases. From the top 15K sentences ranked by BM25Plus, we manually filter out sentences that do not contain information related to patients in Vietnam, resulting in a dataset of 10027 raw sentences.

3.3 Annotation process

We develop an initial version of our annotation guidelines and then randomly sample a pilot set of 1K sentences from the dataset of 10027 raw sentences for the first phase of annotation. Two of the guideline developers annotate the pilot set independently. Following Brandsen et al. (2020), we use the F1 score to measure the inter-annotator agreement between the two annotators at the entity span level, resulting in an F1 score of 0.88. We then host a discussion session to resolve annotation conflicts, identify complex cases, and refine the guidelines.
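The span-level agreement measure described above can be sketched as follows. This is a minimal illustration and not the authors' script: the tuple layout `(sentence_id, start, end, label)`, the function name `span_f1`, and the choice to count a span as agreed only when both boundaries and label match are assumptions made here. The example annotations are invented.

```python
def span_f1(spans_a, spans_b):
    """F1 agreement between two annotators' entity spans.

    Each span is a hypothetical (sentence_id, start, end, label) tuple.
    A span counts as agreed only if both annotators produced the identical
    tuple (an assumption; the paper only says "entity span level").
    """
    set_a, set_b = set(spans_a), set(spans_b)
    if not set_a and not set_b:
        return 1.0  # nothing annotated by either side: trivially in agreement
    matched = len(set_a & set_b)
    # Treat annotator A as the reference; F1 is symmetric, so the choice
    # of reference does not change the result.
    precision = matched / len(set_b) if set_b else 0.0
    recall = matched / len(set_a) if set_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented annotations for one sentence: the annotators agree on two spans
# and disagree on the label of the third.
annotator_1 = [(0, 3, 4, "PATIENT_ID"), (0, 6, 8, "OCCUPATION"), (0, 11, 17, "LOCATION")]
annotator_2 = [(0, 3, 4, "PATIENT_ID"), (0, 6, 8, "OCCUPATION"), (0, 11, 17, "ORGANIZATION")]
agreement = span_f1(annotator_1, annotator_2)
```

Because F1 reduces to 2 × matched / (|A| + |B|), it needs no notion of true negatives, which is convenient for span annotation where the set of non-entities is not enumerable.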
In the second annotation phase, we divide the whole dataset of 10027 sentences into 10 non-overlapping and equal subsets. Each subset contains 100 sentences from the pilot set of the first annotation phase. For this second phase, we employ 10 annotators who are undergraduate students with strong linguistic abilities (each annotator annotates one subset and is paid 0.05 USD per sentence). The annotation quality of each annotator is measured by F1 calculated over the 100 sentences that already have gold annotations from the pilot set. All annotators are asked to revise their annotations until they achieve an F1 of at least 0.92. Finally, we revisit each annotated sentence to make further corrections if needed, resulting in a final gold dataset of 10027 annotated sentences.

² https://vnexpress.net
³ https://zingnews.vn
⁴ https://baomoi.com
⁵ https://thanhnien.vn

Entity Type             Train   Valid.    Test     All
PATIENT_ID               3240     1276    2005    6521
PERSON_NAME               349      188     318     855
AGE                       682      361     582    1625
GENDER                    542      277     462    1281
OCCUPATION                205      132     173     510
LOCATION                 5398     2737    4441   12576
ORGANIZATION             1137      551     771    2459
SYMPTOM&DISEASE          1439      766    1136    3341
TRANSPORTATION            226       87     193     506
DATE                     2549     1103    1654    5306
# Entities in total     15767     7478   11735   34984
# Sentences in total     5027     2000    3000   10027

Table 2: Statistics of our dataset.

Note that in written Vietnamese, white space is used not only to mark word boundaries but also to separate the syllables that constitute words. Therefore, for convenience, the annotation process is performed on syllable-level text. To obtain a word-level variant of the dataset, we apply RDRSegmenter to perform automatic Vietnamese word segmentation, e.g. the 4-syllable written text “bệnh viện Đà Nẵng” (Da Nang hospital) is word-segmented into the 2-word text “bệnh_viện Đà_Nẵng” (glossed as “hospital” and “Da_Nang”). Here, automatic Vietnamese word segmentation outputs do not affect the gold boundaries of entity mentions.
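To illustrate how syllable-level annotations can carry over to a word-segmented variant, the sketch below projects syllable-level BIO tags onto words whose syllables are joined by "_". It is an assumption-laden example, not the released conversion code: the function name `project_tags_to_words`, the rule that a word takes the tag of its first syllable, and the choice to raise an error when a word straddles an entity boundary are all made up for illustration, and the LOCATION tags on the example phrase are hypothetical.

```python
def project_tags_to_words(syllable_tags, words):
    """Map syllable-level BIO tags onto word-segmented tokens.

    `words` are tokens whose syllables are joined by "_", as in the
    example "bệnh_viện Đà_Nẵng". Each word takes the tag of its first
    syllable. A word whose syllables do not all belong to the same
    entity (or are not all "O") is rejected, since this sketch has no
    rule for resolving such conflicts.
    """
    word_tags = []
    position = 0
    for word in words:
        size = len(word.split("_"))
        chunk = syllable_tags[position:position + size]
        if len(chunk) != size:
            raise ValueError("words cover more syllables than there are tags")
        first = chunk[0]
        # Every later syllable of the word must continue the same entity.
        expected_rest = "O" if first == "O" else "I-" + first[2:]
        if any(tag != expected_rest for tag in chunk[1:]):
            raise ValueError(f"word {word!r} crosses an entity boundary")
        word_tags.append(first)
        position += size
    if position != len(syllable_tags):
        raise ValueError("words cover fewer syllables than there are tags")
    return word_tags


# Hypothetical tags for the 4-syllable phrase "bệnh viện Đà Nẵng".
syllable_tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]
words = ["bệnh_viện", "Đà_Nẵng"]
word_tags = project_tags_to_words(syllable_tags, words)
```

Here the four syllable tags collapse to two word tags while the entity still starts and ends at the same place in the text.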
3.4 Data partitions

We randomly split the gold annotated dataset of 10027 sentences into training/validation/test sets with a ratio of 5/2/3, ensuring comparable distributions of entity types across these three sets. Statistics of our dataset are presented in Table 2.

4 Experiments

4.1 Experimental setup

We formulate the COVID-19 NER task for Vietnamese as a sequence labeling problem with the BIO tagging scheme. We conduct experiments on our dataset using strong baselines to investigate: (i) the influence of automatic Vietnamese word segmentation (here, an input sentence can be represented at either the syllable or the word level), and (ii) the usefulness of pre-trained language models. The baselines include BiLSTM-CNN-CRF (Ma and Hovy, 2016) and the pre-trained language models XLM-R (Conneau et al., 2020) and PhoBERT (Nguyen and Nguyen, 2020). XLM-R is a multilingual variant of RoBERTa (Liu et al., 2019), pre-trained on a 2.5TB multilingual dataset that contains 137GB of syllable-level Vietnamese texts. PhoBERT is a monolingual variant of RoBERTa, pre-trained on a 20GB word-level Vietnamese dataset.

We employ the BiLSTM-CNN-CRF implementation from AllenNLP (Gardner et al., 2018). Training BiLSTM-CNN-CRF requires pre-trained syllable- and word-level input embeddings for the syllable- and word-level settings, respectively. Thus we employ the pre-trained Word2Vec syllable and word embeddings for Vietnamese from Nguyen et al. (2020a). These embeddings are fixed during training. The optimal hyper-parameters that we grid-searched for BiLSTM-CNN-CRF are presented in Table 3.

Hyper-parameter                     Value
Optimizer                           Adam
Learning rate                       0.001
Mini-batch size                     36
LSTM hidden state size              200
Number of BiLSTM layers             2
Dropout                             [0.25, 0.25]
Character embedding size            50
Filter length, i.e. window size     3
Number of filters                   30

Table 3: Hyper-parameters for BiLSTM-CNN-CRF.

We utilize the transformers library (Wolf et al., 2020) to fine-tune XLM-R and PhoBERT for the syllable- and word-level settings, respectively, using Adam (Kingma and Ba, 2014) with a fixed learning rate of 5e-5 and a batch size of 32 (Liu et al., 2019). The baselines are trained/fine-tuned for 30 epochs. We evaluate the Micro-average F1 score after each epoch on the validation set (here, we apply early stopping if we find no performance improvement after 5 consecutive epochs). We then choose the best model checkpoint to report the final score on the test set. Note that each F1 score reported is an average over 5 runs with different random seeds.

4.2 Main results

Table 4 shows the final entity-level NER results of the baselines on the test set. In addition to the standard Micro-average F1 score, we also report the Macro-average F1 score. We categorize the results under two comparable settings: using the syllable-level dataset, and using its automatically segmented word-level variant, for training and evaluation.

Setting   Model          PAT.   PER.   AGE    GEN.   OCC.   LOC.   ORG.   SYM.   TRA.   DAT.   Mic-F1  Mac-F1
Syllable  BiL-CRF        0.953  0.855  0.943  0.947  0.588  0.915  0.808  0.801  0.794  0.976  0.906   0.858
Syllable  XLM-R-base     0.978  0.902  0.957  0.842  0.560  0.941  0.842  0.858  0.924  0.982  0.925   0.879
Syllable  XLM-R-large    0.982  0.933  0.962  0.958  0.692  0.943  0.853  0.854  0.943  0.987  0.938   0.911
Word      BiL-CRF        0.953  0.874  0.950  0.947  0.605  0.911  0.831  0.799  0.902  0.976  0.910   0.875
Word      PhoBERT-base   0.981  0.903  0.962  0.954  0.749  0.943  0.870  0.883  0.966  0.987  0.942   0.920
Word      PhoBERT-large  0.980  0.944  0.967  0.968  0.791  0.940  0.876  0.885  0.967  0.989  0.945   0.931

Table 4: Strict F1 score for each entity type (denoted by its first 3 characters), and Micro- and Macro-average F1 scores (denoted by Mic-F1 and Mac-F1, respectively). BiL-CRF abbreviates the baseline BiLSTM-CNN-CRF. Syllable and Word denote results obtained when using the syllable- and word-level dataset settings, respectively.
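The difference between the two averages reported in Table 4 can be made concrete with a short sketch. The function names and the per-type counts below are invented for illustration (they are not taken from the paper's experiments); only the definitions of Micro- and Macro-average F1 are standard.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(counts):
    """Micro- and Macro-average F1 over entity types.

    `counts` maps an entity type to a (tp, fp, fn) tuple.
    Micro pools the counts of all types before computing F1, so frequent
    types dominate; Macro averages the per-type F1 scores, so every type
    weighs the same regardless of frequency.
    """
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Invented counts: a frequent, easy type and a rare, hard type.
example_counts = {
    "LOCATION": (90, 10, 10),   # per-type F1 = 0.9
    "OCCUPATION": (5, 5, 5),    # per-type F1 = 0.5
}
micro, macro = micro_macro_f1(example_counts)
```

With these made-up counts the Macro-average (0.7) falls well below the Micro-average (about 0.864), which mirrors the pattern in Table 4, where the rare and difficult OCCUPATION type pulls Mac-F1 below Mic-F1 for every model.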
We find that word-level models perform better than their syllable-level counterparts, showing that automatic Vietnamese word segmentation helps improve NER, e.g. BiLSTM-CNN-CRF improves from 0.906 to 0.910 Micro-F1 and from 0.858 to 0.875 Macro-F1. We also find that fine-tuning the pre-trained language models XLM-R and PhoBERT produces better performance than BiLSTM-CNN-CRF. Here, PhoBERT outperforms XLM-R (Micro-F1: 0.945 vs. 0.938; Macro-F1: 0.931 vs. 0.911), thus reconfirming the effectiveness of pre-trained monolingual language models on language-specific downstream tasks (Nguyen and Nguyen, 2020).

4.3 Error analysis

We perform an error analysis using the best performing model, PhoBERT-large, which produces 353 incorrect predictions in total on the validation set.

The first error group consists of 69/353 instances with correct entity boundaries (i.e. exact spans) and incorrect entity labels. This is largely because the model could not differentiate between LOCATION and ORGANIZATION entities. This is not surprising given the ambiguity between these two entity types, in which the same entity mention may act as either LOCATION or ORGANIZATION depending on the sentence context. Also, in terms of contact tracing, it would be more useful to label an organization-like entity mention as LOCATION if we can infer that a patient was present at that organization; however, such inference requires additional world knowledge about the entity. In addition, in this error group, the model also struggles to recognize OCCUPATION entities correctly. Recall that an OCCUPATION entity mention must represent the job of a particular person labeled with PERSON_NAME or PATIENT_ID. Deciding whether an occupation is linked to a specific person within a single sentence context may therefore confuse the model.
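A tally of the first error group (exact span, wrong label) could be produced along the following lines. This is a hypothetical sketch rather than the analysis script used in the paper: the `(start, end, label)` span layout, the function name `label_confusions`, and the example spans are assumptions made here.

```python
from collections import Counter


def label_confusions(gold_spans, pred_spans):
    """Count (gold_label, predicted_label) pairs for one sentence.

    Only predictions whose boundaries exactly match a gold span but
    whose label differs are counted. Spans are hypothetical
    (start, end, label) tuples with an exclusive end index.
    """
    gold_by_boundaries = {(start, end): label for start, end, label in gold_spans}
    confusions = Counter()
    for start, end, label in pred_spans:
        gold_label = gold_by_boundaries.get((start, end))
        if gold_label is not None and gold_label != label:
            confusions[(gold_label, label)] += 1
    return confusions


# Invented example: the hospital mention is gold ORGANIZATION but is
# predicted as LOCATION; the date is predicted correctly.
gold = [(0, 4, "ORGANIZATION"), (6, 7, "DATE")]
pred = [(0, 4, "LOCATION"), (6, 7, "DATE")]
confusions = label_confusions(gold, pred)
```

Summing such counters over all validation sentences would show which label pairs, such as LOCATION versus ORGANIZATION, account for most of this group.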
The second error group contains 65/353 instances with inexact spans that overlap gold spans and have correct entity labels. These errors generally happen with multi-word ORGANIZATION entity mentions, where (i) an ORGANIZATION entity contains a nested location inside its span, e.g. “Bệnh viện Lao và Bệnh phổi Cần Thơ” (Can Tho hospital for Tuberculosis and Lung disease; here, “Can Tho” is a province in Vietnam), or (ii) an organization is a subdivision of a larger organization, e.g. “Khoa tim mạch - Bệnh viện Bạch Mai” (Department of Cardiology - Bach Mai Hospital).⁶

The third group of 8/353 errors, with overlapping inexact spans and incorrect entity labels, does not provide us with any useful insight.

The final group, the remaining 211/353 errors, consists of predicted entities that correspond to gold O labels. This happens particularly with LOCATION, where generic mentions such as “Bệnh viện tỉnh” (province hospital), “Trạm y tế xã” (commune medical station) and “chung cư” (apartment) are recognized as entities when in fact they are not.

⁶ Word segmentation is not shown for simplification.

5 Conclusion

In this paper, we have presented the first manually annotated Vietnamese dataset in the COVID-19 domain, focusing on the named entity recognition task. We conduct experiments on our dataset to compare strong baselines and find that both the input representations and the pre-trained language models influence this COVID-19 related NER task. We hope that our dataset can serve as a starting point for further Vietnamese NLP research and applications in fighting COVID-19 and other future epidemics.
References

Akiko Aizawa, Frederic Bergeron, Junjie Chen, Fei Cheng, Katsuhiko Hayashi, Kentaro Inui, Hiroyoshi Ito, Daisuke Kawahara, Masaru Kitsuregawa, Hirokazu Kiyomaru, Masaki Kobayashi, Takashi Kodama, Sadao Kurohashi, Qianying Liu, Masaki Matsubara, Yusuke Miyao, Atsuyuki Morishima, Yugo Murawaki, Kazumasa Omura, Haiyue Song, Eiichiro Sumita, Shinji Suzuki, Ribeka Tanaka, Yu Tanaka, Masashi Toyoda, Nobuhiro Ueda, Honai Ueoka, Masao Utiyama, and Ying Zhong. 2020. A System for Worldwide COVID-19 Information Aggregation. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.
Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, and Preslav Nakov. 2020. Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society. arXiv preprint, arXiv:2005.00033.
Lama Alsudias and Paul Rayson. 2020. COVID-19 and Arabic Twitter: How can Arab world governments and public health organizations learn from social media? In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.
Alex Brandsen, Suzan Verberne, Milco Wansleeben, and Karsten Lambers. 2020. Creating a Dataset for Named Entity Recognition in the Archaeology Domain. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4573–4577.
Qingyu Chen, Alexis Allot, and Zhiyong Lu. 2020. Keep up with the latest coronavirus research. Nature, 579:193.
Nico Colic, Lenz Furrer, and Fabio Rinaldi. 2020. Annotating the Pandemic: Named Entity Recognition and Normalisation in COVID-19 Literature. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization. arXiv preprint, arXiv:2006.09595.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of Workshop for NLP Open Source Software, pages 1–6.
Nguyen Thi Minh Huyen and Vu Xuan Luong. 2016. VLSP 2016 shared task: Named entity recognition. In Proceedings of Vietnamese Speech and Language Processing.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint, arXiv:1412.6980.
Jinhyuk Lee, Sean S. Yi, Minbyul Jeong, Mujeen Sung, WonJin Yoon, Yonghwa Choi, Miyoung Ko, and Jaewoo Kang. 2020. Answering Questions on COVID-19 in Real-Time. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint, arXiv:1907.11692.
Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
Anh Tuan Nguyen, Mai Hoang Dao, and Dat Quoc Nguyen. 2020a. A pilot study of text-to-SQL semantic parsing for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4079–4085.
Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042.
Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018a. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 2582–2587.
Dat Quoc Nguyen, Thanh Vu, Afshin Rahimi, Mai Hoang Dao, Linh The Nguyen, and Long Doan. 2020b. WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. In Proceedings of the 6th Workshop on Noisy User-generated Text, pages 314–318.
Huyen TM Nguyen, Quyen T Ngo, Luong X Vu, Vu M Tran, and Hien TT Nguyen. 2018b. VLSP shared task: Named entity recognition. Journal of Computer Science and Cybernetics, 34(4):283–294.
Gautam Kishore Shahi and Durgesh Nandini. 2020. FakeCovid – A Multilingual Cross-domain Fact Check News Dataset for COVID-19. In Proceedings of the International Workshop on Cyber Social Threats.
Dan Su, Yan Xu, Tiezheng Yu, Farhad Bin Siddique, Elham Barezi, and Pascale Fung. 2020. CAiRE-COVID: A Question Answering and Query-focused Multi-Document Summarization System for COVID-19 Scholarly Information Management. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.
Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. In Proceedings of the 2014 Australasian Document Computing Symposium, pages 58–65.
Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder, Jinho D. Choi, Christopher S. Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner Jr., Michael Bada, Martha Palmer, and Lawrence E. Hunter. 2012. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinform., 13:207.
Karin Verspoor, Simon Šuster, Yulia Otmakhova, Shevon Mendis, Zenan Zhai, Biaoyan Fang, Jey Han Lau, Timothy Baldwin, Antonio Jimeno Yepes, and David Martinez. 2020. COVID-SEE: Scientific Evidence Explorer for COVID-19 related research. arXiv preprint, arXiv:2008.07880.
Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 56–60.
Lucy Lu Wang, Kyle Lo, et al. 2020. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.
Thomas Wolf, Lysandre Debut, et al. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, et al. 2020. Covidex: Neural ranking models and keyword search infrastructure for the covid-19 open research dataset. arXiv preprint, arXiv:2007.07846.
Shi Zong, Ashutosh Baheti, Wei Xu, and Alan Ritter. 2020. Extracting COVID-19 Events from Twitter. arXiv preprint, arXiv:2006.02567.

Appendix

Annotation examples

Example 1:
Bệnh nhân " [PAT 669] " là [OCC bác sĩ] làm việc tại [LOC Bệnh viện Đa khoa Đồng Nai]
Patient " [PAT 669] " is a [OCC doctor] working at [LOC Dong Nai General Hospital]

Example 2:
[ORG Bệnh viện Bệnh Nhiệt đới TP HCM] xét nghiệm dương tính lần một đêm [DAT 12/3] .
[ORG Ho Chi Minh City Hospital for Tropical Diseases] returns a positive test result in the evening of [DAT 12/3] .

Example 3:
Hai nữ điều dưỡng [LOC Bệnh viện Bạch Mai] lây từ bên ngoài và lây nhiễm cho nhau.
Two nurses of [LOC Bach Mai Hospital] got infected from an external source and then infected each other.
Example 4:
Bệnh nhân tử vong tại [LOC Bệnh viện Phổi Đà Nẵng] với chẩn đoán [SYM viêm phổi nặng] , [SYM suy đa tạng không hồi phục] , trên bệnh nhân [SYM suy thận mạn giai đoạn cuối] .
The patient died at [LOC Da Nang Lung Hospital] , diagnosed with [SYM severe pneumonia] with history of [SYM unrecoverable multiorgan dysfunction syndrome] , [SYM terminal chronic kidney failure] .

Here, PAT, OCC, LOC, ORG, DAT and SYM abbreviate PATIENT_ID, OCCUPATION, LOCATION, ORGANIZATION, DATE and SYMPTOM&DISEASE, respectively. Recall that a PATIENT_ID annotation over “X” refers to the Xth patient having COVID-19 in Vietnam (e.g. in Example 1, "669" refers to the 669th patient).

Notes on entity types

We have two principles for selecting the ten entity types: (i) entities should contain key information related to COVID-19 patients (here, the information should be helpful in the context of contact tracing and monitoring the growth of the pandemic); and (ii) the availability of entity types in the text, i.e. how frequently each entity type appears. This is decided based on manual observation of news articles.

In the context of contact tracing, it is more useful to broaden the scope of location. For example, when a patient is present at an organization, we refer to that organization as a location if we can infer its specific location on the map. In Example 1, we label the entity mention “Bệnh viện Đa khoa Đồng Nai” (Dong Nai General Hospital) with LOCATION, as it provides information about a place where a patient has been. On the other hand, in Example 2, the entity mention “Bệnh viện Bệnh Nhiệt đới TP HCM” (Ho Chi Minh City Hospital for Tropical Diseases) is labeled as ORGANIZATION because it acts as the subject executing a specific action (i.e. reporting a test result).

For OCCUPATION, AGE and GENDER entities, we only tag them if we can link the corresponding entity mentions to a specific entity with a PERSON_NAME or PATIENT_ID label within the same sentence.
In Example 1, “bác sĩ” (doctor) is the occupation of patient “669”, thus we label this mention as an entity of type OCCUPATION. However, in Example 3, we do not label “điều dưỡng” (nurses) as OCCUPATION, as we cannot link this mention to any specific person.

For SYMPTOM&DISEASE entities, we prefer the entities to be as detailed as possible. For instance, in Example 4 we consider words denoting levels of severity to be part of diseases, such as “nặng” (severe), “không hồi phục” (unrecoverable), “giai đoạn cuối” (terminal) and “mạn” (chronic).
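To connect the annotation examples with the BIO tagging scheme used in the experiments, the sketch below encodes Example 1 at the syllable level. It is illustrative only: the function name `spans_to_bio`, the token indices, and the assumption that the PATIENT_ID span covers the number alone (without the surrounding quotation marks) are choices made here, not a description of the released data format.

```python
def spans_to_bio(num_tokens, spans):
    """Convert non-nested entity spans to a BIO tag sequence.

    `spans` holds hypothetical (start, end, label) tuples with an
    exclusive end index. The first token of an entity gets "B-<label>",
    the remaining tokens get "I-<label>", and all other tokens get "O".
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for index in range(start + 1, end):
            tags[index] = "I-" + label
    return tags


# Example 1, split on white space into 17 syllable-level tokens.
tokens = ['Bệnh', 'nhân', '"', '669', '"', 'là', 'bác', 'sĩ', 'làm', 'việc',
          'tại', 'Bệnh', 'viện', 'Đa', 'khoa', 'Đồng', 'Nai']
spans = [
    (3, 4, "PATIENT_ID"),   # 669
    (6, 8, "OCCUPATION"),   # bác sĩ
    (11, 17, "LOCATION"),   # Bệnh viện Đa khoa Đồng Nai
]
tags = spans_to_bio(len(tokens), spans)

# Print one "token<TAB>tag" pair per line, a common layout for
# sequence labeling data.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Since nested entities are not annotated, each token receives exactly one tag, which is what makes this flat encoding sufficient.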
