CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Ningyu Zhang∗ Zhen Bi∗ Xiaozhuan Liang∗ Lei Li∗ Xiang Chen∗ Shumin Deng∗
Luoqiu Li Xin Xie Hongbin Ye Xin Shang Kangping Yin Chuanqi Tan
Jian Xu Mosha Chen Fei Huang Luo Si Yuan Ni Guotong Xie Zhifang Sui
Baobao Chang Hui Zong Zheng Yuan Linfeng Li Jun Yan Hongying Zan
CBLUE Team ‡
Abstract
Artificial Intelligence (AI), along with the recent progress in biomedical language
understanding, is gradually changing medical practice. With the development of
biomedical language understanding benchmarks, AI applications are widely used in
the medical field. However, most benchmarks are limited to English, which makes
it challenging to replicate many of the successes in English for other languages.
To facilitate research in this direction, we collect real-world biomedical data and
present the first Chinese Biomedical Language Understanding Evaluation (CBLUE)
benchmark: a collection of natural language understanding tasks including named
entity recognition, information extraction, clinical diagnosis normalization, single-
sentence/sentence-pair classification, and an associated online platform for model
evaluation, comparison, and analysis. To establish baselines on these tasks,
we report empirical results for 11 current Chinese pre-trained language models, and
the experimental results show that state-of-the-art neural models still perform far worse
than the human ceiling. Our benchmark is released at https://tianchi.aliyun.
com/dataset/dataDetail?dataId=95414&lang=en-us.
1 Introduction
Artificial intelligence is gradually changing the landscape of healthcare and biomedical research [35].
With the fast advancement of biomedical datasets, biomedical natural language processing (BioNLP)
has facilitated a broad range of applications such as biomedical text mining, which leverages textual
data in Electronic Health Records (EHRs). For example, BioNLP methods can be employed to provide
recommendations for specialized healthcare to those most at risk during pandemics (COVID-19)
using the text and information in EHRs.
∗ Equal contribution.
† Corresponding author.
‡ Author contributions are listed in the appendix.
2 Related Work
Several benchmarks have been developed to evaluate general language understanding over the past
few years. GLUE [29] is one of the first frameworks developed as a formal challenge affording
straightforward comparison between task-agnostic transfer learning techniques. SuperGLUE [28],
styled after GLUE, introduces a new set of more difficult language understanding tasks, a software
toolkit, and a public leaderboard. Other similarly motivated benchmarks include DecaNLP [22],
which recasts a set of target tasks into a general question-answering format and prohibits task-specific
parameters, and SentEval [2], which explicitly evaluates fixed-size sentence embeddings. Non-
English benchmarks include RussianSuperGLUE [25] and CLUE [34], which is a community-driven
benchmark with nine Chinese natural language understanding tasks. These benchmarks in the general
domain provide a north star goal for researchers and are part of the reason we can confidently say we
have made great strides in our field.
For BioNLP, many datasets and benchmarks have been proposed [30, 18, 33] which promote
biomedical language understanding [1, 17, 16]. Tsatsaronis et al. [27] propose biomedical language
understanding datasets as well as a competition on large-scale biomedical semantic indexing and
question answering. Jin et al. [13] propose PubMedQA, a novel biomedical question answering
dataset collected from PubMed abstracts. Pappas et al. [23] propose BioRead, which is a publicly
available cloze-style biomedical machine reading comprehension (MRC) dataset. Gu et al. [10]
create a leaderboard featuring the Biomedical Language Understanding & Reasoning Benchmark
(BLURB). Unlike a general domain corpus, the annotation of a biomedical corpus needs expert
intervention and is labor-intensive and time-consuming. Moreover, most of these benchmarks are based
on English; ignoring other languages means that potentially valuable information, which could be
helpful for generalization, may be lost.
In this study, we focus on Chinese and aim to develop the first Chinese biomedical language
understanding benchmark. Note that Chinese is linguistically different from English and other
Indo-European languages, necessitating an evaluation BioNLP benchmark designed explicitly for
Chinese.
Dataset Task Train Dev Test Metrics
CMeEE NER 15,000 5,000 3,000 Micro F1
CMeIE Information Extraction 14,339 3,585 4,482 Micro F1
CHIP-CDN Diagnosis Normalization 6,000 2,000 10,192 Micro F1
CHIP-STS Sentence Similarity 16,000 4,000 10,000 Macro F1
CHIP-CTC Sentence Classification 22,962 7,682 10,000 Macro F1
KUAKE-QIC Intent Classification 6,931 1,955 1,994 Accuracy
KUAKE-QTR Query-Document Relevance 24,174 2,913 5,465 Accuracy
KUAKE-QQR Query-Query Relevance 15,000 1,600 1,596 Accuracy
Table 1: Task descriptions and statistics in CBLUE. CMeEE and CMeIE are sequence labeling tasks.
Others are single sentence or sentence pair classification tasks.
3 CBLUE Overview
CBLUE consists of 8 biomedical language understanding tasks in Chinese. We will introduce the
task definitions, detailed data collection procedures, and characteristics of CBLUE below.
3.1 Tasks
CMeEE Chinese Medical Named Entity Recognition, a dataset first released in CHIP20204, is
used for the CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the
given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical
equipment, medical procedures, body, medical examinations, microorganisms, and department.
CMeIE Chinese Medical Information Extraction, a dataset that is also released in CHIP2020
[11], is used for the CMeIE task. The task aims at identifying both entities and relations in a
sentence following the schema constraints. There are 53 relations defined in the dataset, including 10
synonymous sub-relationships and 43 other sub-relationships.
CHIP-CDN CHIP Clinical Diagnosis Normalization, a dataset that aims to standardize the terms
from the final diagnoses of Chinese electronic medical records, is used for the CHIP-CDN task.
Given the original phrase, the task requires normalizing it to standard terminology based on the
International Classification of Diseases (ICD-10) standard for Beijing Clinical Edition v601.
CHIP-CTC CHIP Clinical Trial Classification, a dataset aimed at classifying clinical trial eligibility
criteria, which are the fundamental guidelines of clinical trials used to decide whether a subject is
eligible for a clinical trial or not [38], is used for the CHIP-CTC task. All text data are collected from the
website of the Chinese Clinical Trial Registry (ChiCTR)5, and a total of 44 categories are defined.
The task is essentially text classification; although it is not a new task, studies and corpora for Chinese
clinical trial criteria are still limited, and we hope to promote future research for social benefit.
CHIP-STS CHIP Semantic Textual Similarity, a dataset for sentence similarity in a non-i.i.d.
(non-independent and identically distributed) setting, is used for the CHIP-STS task. Specifically, the
task targets transfer learning across disease types on Chinese disease question-and-answer data.
Given question pairs related to 5 different diseases (the disease types in the training and testing sets
are different), the task is to determine whether the semantics of the two sentences are similar.
KUAKE-QIC KUAKE Query Intent Classification, a dataset for intent classification, is used for the
KUAKE-QIC task. Given the queries of search engines, the task requires classifying each of them into
one of 11 medical intent categories defined in KUAKE-QIC, including diagnosis, etiology analysis,
treatment plan, medical advice, test result analysis, disease description, consequence prediction,
precautions, intended effects, treatment fees, and others.
4 http://cips-chip.org.cn/
5 http://chictr.org.cn/
KUAKE-QTR KUAKE Query Title Relevance, a dataset used to estimate the relevance between a
query and a document title, is used for the KUAKE-QTR task. Given a query (e.g., “Symptoms of vitamin
B deficiency”), the task aims to find the relevant title (e.g., “The main manifestations of vitamin B
deficiency”).
KUAKE-QQR KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the
content expressed in two queries, is used for the KUAKE-QQR task. Similar to KUAKE-QTR, the
task aims to estimate query-query relevance, which is an essential and challenging task in real-world
search engines.
3.2 Data Collection

Since machine learning models are mostly data-driven, data plays a critical role, and it often takes
the form of a static dataset [8]. We collect data for different tasks from diverse sources,
including clinical trials, EHRs, medical books, and search logs from real-world search engines. As
biomedical data may contain private information such as the patient’s name, age, and gender, all
collected datasets are anonymized and reviewed by the ethics committee to preserve privacy.
We introduce the data collection details below.
We collect clinical trial eligibility criteria text from ChiCTR, a non-profit organization that provides
registration for clinical trial information. Eligibility criteria text is organized as a paragraph in the
inclusion criteria and exclusion criteria in each trial registry file. We exclude meaningless text and
annotate the remaining text to generate the CHIP-CTC dataset.
We obtain the final diagnoses of the medical records from several Class A tertiary hospitals and
sample a few diagnosis items from different medical departments to construct the CHIP-CDN dataset
for research purposes. No privacy information is involved in the final diagnoses.
Due to the COVID-19 pandemic, online consultation has become increasingly popular on the Internet.
To promote data diversity, we select the online questions by patients. Note that most of the questions
are chief complaints. To ensure the authority and practicability of the corpus, we also select medical
textbooks of Pediatrics [31], Clinical Pediatrics [26] and Clinical Practice6 . We collect data from
these sources to construct the CMeIE and CMeEE datasets.
We also collect search logs from real-world search engines like the Alibaba KUAKE Search Engine7 .
First, we filter the search queries in the raw search logs by the medical tag to obtain candidate
medical texts. Then, we sample the documents for each query with non-zero relevance scores (i.e., to
determine if the document is relevant to the query). Specifically, we divide all the documents into
three categories, namely high, middle, and tail documents, and then uniformly sample the data to
guarantee diversity. We leverage the data from search logs to construct the KUAKE-QIC, KUAKE-QTR,
and KUAKE-QQR datasets.
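The bucketed sampling described above can be made concrete with a short sketch. It is purely illustrative: the record fields (is_medical, relevance, traffic_bucket) are hypothetical names, not the actual log schema.

```python
import random

def sample_query_doc_pairs(search_logs, per_bucket=100, seed=0):
    """Illustrative sampling: keep medical queries with non-zero relevance,
    bucket documents into high/middle/tail, and sample each bucket uniformly."""
    random.seed(seed)
    buckets = {"high": [], "middle": [], "tail": []}
    for record in search_logs:
        if not record.get("is_medical"):          # filter by the medical tag
            continue
        if record.get("relevance", 0) == 0:       # keep non-zero relevance only
            continue
        buckets[record["traffic_bucket"]].append(record)
    sampled = []
    for items in buckets.values():                # uniform sample per bucket
        sampled.extend(random.sample(items, min(per_bucket, len(items))))
    return sampled
```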
3.3 Annotation
Each sample is annotated by three to five crowd workers, and the annotation with the majority of
votes is taken to estimate human performance. During the annotation phase, we add control questions
to prevent dishonest behaviors by the crowd workers. Consequently, we reject any annotations made
by crowd workers who fail in the training phase and do not adopt the results of those who achieve
low performance on the control tasks. We maintain strict and high criteria for approval and review at
least 10 random samples from each worker to decide whether to approve or reject all their HITs. We
also calculate the average inter-rater agreement between annotators using Fleiss' Kappa scores [7],
finding that five out of six annotations show good agreement (κ = 0.9).

6 http://www.nhc.gov.cn/
7 https://www.myquark.cn/

Figure 1: Analysis of the named entity recognition and information extraction datasets. (a) illustrates
the entity (coarse-grained) distribution in CMeEE, and (b) shows the relation hierarchy in CMeIE.
3.4 Characteristics
Real-world Distribution To promote the generalization of models, all the data in our CBLUE
benchmark follow the real-world distribution without up/downsampling. As shown in Figure 1(a), the
entity distribution is long-tailed, consistent with Zipf's law. Further, some datasets, such as CMeIE,
have a label hierarchy with both coarse-grained and fine-grained relation labels, as shown in
Figure 1(b).
Diverse Task Settings Our CBLUE benchmark covers eight diverse tasks, including named
entity recognition, relation extraction, and single-sentence/sentence-pair classification. Besides the
standard i.i.d. scenarios, our CBLUE benchmark also contains a specific transfer learning
scenario supported by the CHIP-STS dataset, in which the testing set has a different distribution from
the training set.
3.5 Leaderboard
We provide a leaderboard for users to submit their own results on CBLUE. The evaluation system
will give final scores for each task when users submit their prediction results. The platform offers 60
free GPU hours from Aliyun8 to help researchers develop and train their models.
Our CBLUE benchmark was released online on April 1, 2021. Up to now, more than 300 researchers
have applied the dataset, and over 80 teams have submitted their model predictions to our platform,
including medical institutions (Peking Union Medical College Hospital, etc.), universities (Tsinghua
University, Zhejiang University, etc.), and companies (Baidu, JD, etc.). We will continue to maintain
the benchmark by attending to new requests and adding new tasks.
8 https://tianchi.aliyun.com/notebook-ai/
Model CMeEE CMeIE CDN CTC STS QIC QTR QQR Avg.
BERT-base 62.1 54.0 55.4 69.2 83.0 84.3 60.0 84.7 69.1
BERT-wwm-ext-base 61.7 54.0 55.4 70.1 83.9 84.5 60.9 84.4 69.4
RoBERTa-large 62.1 54.4 56.5 70.9 84.7 84.2 60.9 82.9 69.6
RoBERTa-wwm-ext-base 62.4 53.7 56.4 69.4 83.7 85.5 60.3 82.7 69.3
RoBERTa-wwm-ext-large 61.8 55.9 55.7 69.0 85.2 85.3 62.8 84.4 70.0
ALBERT-tiny 50.5 35.9 50.2 61.0 79.7 75.8 55.5 79.8 61.1
ALBERT-xxlarge 61.8 47.6 37.5 66.9 84.8 84.8 62.2 83.1 66.1
ZEN 61.0 50.1 57.8 68.6 83.5 83.2 60.3 83.0 68.4
MacBERT-base 60.7 53.2 57.7 67.7 84.4 84.9 59.7 84.0 69.0
MacBERT-large 62.4 51.6 59.3 68.6 85.6 82.7 62.9 83.5 69.6
PCL-MedBERT 60.6 49.1 55.8 67.8 83.8 84.3 59.3 82.5 67.9
Human 67.0 66.0 65.0 78.0 93.0 88.0 71.0 89.0 77.1
Table 2: Performance of baseline models on CBLUE benchmark.
3.7 Reproducibility
To make it easier to use the CBLUE benchmark, we also offer a toolkit implemented in PyTorch [24]
for reproducibility. Our toolkit supports mainstream pre-training models and a wide range of target
tasks. Different from existing pre-training model toolkits [37], our toolkit aims at quickly validating
model performance on the CBLUE benchmark.
4 Experiments
Baselines We conduct experiments with baselines based on different Chinese pre-trained language
models. We add an additional output layer (e.g., MLP) for each CBLUE task and fine-tune the pre-
trained models. Code for reproducibility is available in https://github.com/CBLUEbenchmark/
CBLUE.
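As a minimal illustration of this setup (not our released code, which is at the repository above), the sketch below fine-tunes a generic Chinese BERT with a sequence-classification head using the Hugging Face transformers library; the toy training pair and the number of labels are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=11)  # e.g., 11 intent classes for KUAKE-QIC

train_pairs = [("最近早上起来浑身无力是怎么回事?", 0)]  # hypothetical (text, label) data
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

model.train()
for epoch in range(3):
    for text, label in train_pairs:
        enc = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
        loss = model(**enc, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```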
Models We evaluate CBLUE on the following publicly available Chinese pre-trained models:
• BERT-base [5]. We use the base model with 12 layers, a hidden size of 768, 12 heads, and 110
million parameters.
• BERT-wwm-ext-base [4]. A Chinese pre-trained BERT model with whole word masking.
• RoBERTa-large [21]. Compared with BERT, RoBERTa removes the next sentence prediction
objective and dynamically changes the masking pattern applied to the training data.
• RoBERTa-wwm-ext-base/large. RoBERTa-wwm-ext is an efficient pre-trained model which
integrates the advantages of RoBERTa and BERT-wwm.
• ALBERT-tiny/xxlarge [14]. ALBERT is a pre-trained model with two objectives: Masked
Language Modeling (MLM) and Sentence Ordering Prediction (SOP), which shares weights
across different layers in the transformer.
• ZEN [6]. A BERT-based Chinese text encoder enhanced by N-gram representations, where
different combinations of characters are considered during training.
• Mac-BERT-base/large [3]. Mac-BERT is an improved BERT with a novel MLM-as-correction
pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
• PCL-MedBERT9 . A pre-trained medical language model proposed by the Intelligent Medical
Research Group at the Peng Cheng Laboratory, with excellent performance in medical
question matching and named entity recognition.
We implement all baselines with PyTorch [24]. Note that BERT-base, ALBERT-tiny/xxlarge, and
RoBERTa-large are representatives of pre-trained language models. BERT-wwm-ext-base, RoBERTa-
wwm-ext-base/large, ZEN, Mac-BERT-base/large utilize the specific characteristics (e.g., words and
phrases) of the Chinese language. PCL-MedBERT further utilizes domain-adaptive pre-training [12],
9 https://code.ihub.org.cn/projects/1775
CMeEE CMeIE CDN CTC STS QIC QTR QQR
Trained annotator 1 69.0 62.0 60.0 73.0 94.0 87.0 75.0 80.0
Trained annotator 2 62.0 65.0 69.0 75.0 93.0 91.0 62.0 88.0
Trained annotator 3 69.0 67.0 62.0 80.0 88.0 83.0 71.0 90.0
avg 66.7 64.7 63.7 76.0 91.7 87.0 69.3 86.0
majority 67.0 66.0 65.0 78.0 93.0 88.0 71.0 89.0
best model 62.4 55.9 59.3 70.9 85.6 85.5 62.9 84.7
Table 3: Human performance from the two-stage evaluation, compared with the best-performing model.
“avg” refers to the mean score of the three annotators. “majority” indicates the performance of the
majority vote. Bold text denotes the best result among human and model predictions.
Figure 2: Error analysis on the CMeEE and KUAKE-QIC datasets. For CMeEE, we divide error
cases into 6 categories: ambiguity, need domain knowledge, overlap entity, wrong entity
boundary, annotation error, and others (long sequence, rare words, etc.). For KUAKE-QIC, we
divide error cases into 7 categories: multiple triggers, colloquialism, ambiguity, rare words,
annotation error, irrelevant description, and need domain knowledge.
which can consistently improve performance on tasks in the biomedical domain. We tune all the
hyper-parameters based on the performance of each model on the development set. We run
each experiment five times and report the average performance. All the training details can be
found in the appendix.
We report the results of our baseline models on the CBLUE benchmark in Table 2. We notice that
larger pre-trained language models obtain better performance. We also observe that
models which use whole word masking do not always yield better performance than others on some
tasks, such as CTC, QIC, QTR, and QQR, indicating that the tasks in our benchmark are challenging and
that more sophisticated techniques should be developed. Further, we find that ALBERT-tiny achieves
performance comparable to base models on CDN, STS, QTR, and QQR, illustrating that
smaller models may also be effective for specific tasks. Finally, we notice that PCL-MedBERT, which
tends to be state-of-the-art on Chinese biomedical text processing tasks, does not perform as
well as we expected. This further demonstrates the difficulty of our benchmark: contemporary
models may find it difficult to quickly achieve outstanding performance.
For all of the tasks in CBLUE, we ask human annotators to label instances from the testing set and
compute the annotators’ majority vote against the gold label. Similar to SuperGLUE [28], we first
need to train the annotators before they work on the testing data. Annotators are asked to annotate
some data from the development set; then, their annotations are validated against the gold standard.
Annotators need to correct their annotation mistakes repeatedly so that they can master the specific
tasks. Finally, they annotate instances from the testing data, and these annotations are used to compute
the final human scores. The results are shown in Table 3 and the last row of Table 2. In most tasks,
humans tend to perform better than machine learning models. We analyze the human performance
in detail in the next section.
Sentence | Word | Label | RO | MB
血液生化分析的结果显示维生素B缺乏率约为12%~19%。(The results of blood biochemical analysis show that the vitamin B deficiency rate is about 12% to 19%.) | 血液生化分析 (blood biochemical analysis) | Ite | Pro | Pro
皮疹可因宿主产生特异性的抗毒素抗体而减少。(The rash can be reduced by the host producing specific anti-toxin antibodies.) | 抗毒素抗体 (anti-toxin antibodies) | Bod | O | Bod
根据遗传物质的结构和功能改变的不同,可将遗传病分为五类:1.染色体病指染色体数目异常,或者染色体结构异常,包括缺失、易位、倒位等 (According to the different changes in the structure and function of genetic material, genetic diseases are divided into five categories: 1. Chromosomal diseases refer to an abnormal number of chromosomes or abnormal chromosome structure, including deletions, translocations, inversions, etc.) | 缺失, 易位, 倒位 (deletions, translocations, inversions) | Sym, Sym, Sym | O | Sym, Sym, Sym
Table 4: Case studies in CMeEE. We evaluate roberta-wwm-ext and PCL-MedBERT on 3 sampled
sentences, with their gold labels and model predictions. Ite (medical examination items), Pro (medical
procedure), Bod (body), and Sym (clinical symptoms) are the labels of the medical named entities. O means
that the model fails to extract the entity from the sentence. RO = roberta-wwm-ext, MB = PCL-MedBERT.
We choose two datasets, CMeEE and KUAKE-QIC, a sequence labeling and a classification task,
respectively, to conduct case studies. As shown in Figure 2, we report the proportions
of the various types of error cases10. For CMeEE, we notice that overlap entity, ambiguity, need domain
knowledge, and annotation error are the major reasons for prediction failures. Furthermore,
there exist many instances with overlapping entities, which may lead to confusion in the named entity
recognition task. In the analysis for KUAKE-QIC, almost half of the bad cases are due to multiple
triggers and colloquialism. Colloquialism is natural in search queries, which means that some
descriptions in Chinese medical text are too simplified, colloquial, or inaccurate.
We show some cases on CMeEE in Table 4. In the second row, we notice that given the instance
“皮疹可因宿主产生特异性的抗毒素抗体而减少 (Rash can be reduced by the host producing
specific anti-toxin antibodies.)”, RoBERTa and PCL-MedBERT obtain different predictions. The
reason is that there exist medical terminologies such as “抗毒素抗体 (anti-toxin antibodies)”.
RoBERTa cannot identify those tokens correctly, but PCL-MedBERT, pre-trained on a medical
corpus, can identify them successfully. Moreover, PCL-MedBERT can accurately extract the entities “缺失,易
位,倒位 (deletions, translocations, inversions)” from long sentences, which is challenging for
other models.
We further show some cases on KUAKE-QIC in Table 5. In the first case, we notice that both BERT
and BERT-ext fail to obtain the intent label of the query “请问淋巴细胞比率偏高、中性细胞比率
偏低有事吗? (Does it matter if the ratio of lymphocytes is high and the ratio of neutrophils is low?)”,
while MedBERT obtains the correct prediction. Since “淋巴细胞比率 (ratio of lymphocytes)”
and “中性细胞比率 (ratio of neutrophils)” are biomedical terminologies, a general pre-trained
language model needs domain knowledge to understand those phrases. Moreover, we
observe that all models obtain incorrect predictions for the query “咨询:请问小孩一般什么时候出
水痘 (Consultation: When do children usually get chickenpox?)” in the second case. Note that there
is a lot of colloquial text in search queries (colloquialism), which follows a different distribution and thus
misleads the model predictions.
10 See definitions of errors in the appendix.
Query | BERT | BERT-ext | MedBERT | Gold
请问淋巴细胞比率偏高、中性细胞比率偏低有事吗?(Does it matter if the ratio of lymphocytes is high and the ratio of neutrophils is low?) | 病情诊断 (Diagnosis) | 病情诊断 (Diagnosis) | 指标解读 (Test results analysis) | 指标解读 (Test results analysis)
咨询:请问小孩一般什么时候出水痘?(Consultation: When do children usually get chickenpox?) | 其他 (Other) | 其他 (Other) | 其他 (Other) | 疾病表述 (Disease description)
老人收缩压160,舒张压只有40多,是什么原因?怎么治疗?(The systolic blood pressure of the elderly person is 160, and the diastolic blood pressure is only about 40. What is the reason? How to treat it?) | 病情诊断 (Diagnosis) | 病情诊断 (Diagnosis) | 病情诊断 (Diagnosis) | 治疗方案 (Treatment)
Table 5: Case studies in KUAKE-QIC. We evaluate the performance of baselines with 3 sampled
instances. BERT = BERT-base, BERT-ext = BERT-wwm-ext-base, MedBERT = PCL-MedBERT.
In summary, we conclude that the tasks in CBLUE are not easy to solve since the Chinese language
has unique characteristics, and more robust models that fully understand the semantics of Chinese,
especially its informal and formal usages in the medical domain, should be developed.
4.4 Limitations
Although our CBLUE offers diverse settings, there are still some tasks not covered by the benchmark,
such as medical dialogue generation [20, 19, 36] or medical diagnosis [32]. We encourage researchers
in both academia and industry to contribute new datasets. Besides, our benchmark is static; thus,
models may still achieve outstanding performance on the tasks but fail on simple challenge examples and
falter in real-world scenarios. We leave it as future work to construct a platform covering dataset
creation, model development, and assessment, leading to more robust and informative benchmarks.
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both
because of the severe health effects of COVID-19 and the public health measures implemented to
slow its spread. Many difficulties experienced during the outbreak are fundamentally caused by a lack of
information; attempts to address these needs have caused an information overload for both researchers and the
public. Biomedical natural language processing—the branch of artificial intelligence that interprets
human language—can be applied to address many of the information needs made urgent by the
COVID-19 pandemic. Unfortunately, most language benchmarks are in English, and no biomedical
benchmark currently exists in Chinese. Our benchmark CBLUE, as the first Chinese biomedical
language understanding benchmark, can serve as an open testbed for model evaluations to promote
the advancement of this technology.
References
[1] Iz Beltagy, Kyle Lo, and Arman Cohan. Scibert: A pretrained language model for scientific text.
In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China,
November 3-7, 2019, pages 3613–3618. Association for Computational Linguistics, 2019.
[2] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence
representations. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck,
Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo,
Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings
of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018,
Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA), 2018.
[3] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting
pre-trained models for chinese natural language processing. arXiv preprint arXiv:2004.13922,
2020.
[4] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu.
Pre-training with whole word masking for chinese bert. arXiv preprint arXiv:1906.08101, 2019.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. In NAACL-HLT, 2018.
[6] Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. Zen: pre-training chinese
text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720, 2019.
[7] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin,
76(5):378, 1971.
[8] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M.
Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. CoRR, abs/1803.09010,
2018.
[9] Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin
Vanschoren. An open source automl benchmark. arXiv preprint arXiv:1907.00909, 2019.
[10] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan
Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for
biomedical natural language processing. CoRR, abs/2007.15779, 2020.
[11] T. Guan, H. Zan, X. Zhou, H. Xu, and K Zhang. CMeIE: Construction and Evaluation of
Chinese Medical Information Extraction Dataset. Natural Language Processing and Chinese
Computing, 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October
14–18, 2020, Proceedings, Part I, 2020.
[12] Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey,
and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In
Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July
5-10, 2020, pages 8342–8360. Association for Computational Linguistics, 2020.
[13] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pubmedqa:
A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent
Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2567–2577.
Association for Computational Linguistics, 2019.
[14] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv
preprint arXiv:1909.11942, 2019.
[15] Hyukki Lee, Soohyung Kim, Jong Wook Kim, and Yon Dohn Chung. Utility-preserving
anonymization for health data publishing. BMC Medical Informatics Decis. Mak., 17(1):104:1–
104:12, 2017.
[16] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and
Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical
text mining. Bioinformatics, 36(4):1234–1240, 2020.
[17] Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov. Pretrained language models for
biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proceedings
of the 3rd Clinical Natural Language Processing Workshop, pages 146–157, Online, November
2020. Association for Computational Linguistics.
[18] J. Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, A. P.
Davis, C. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. Biocreative v cdr task corpus:
a resource for chemical disease relation extraction. Database: The Journal of Biological
Databases and Curation, 2016, 2016.
[19] Shuai Lin, Pan Zhou, Xiaodan Liang, Jianheng Tang, Ruihui Zhao, Ziliang Chen, and Liang
Lin. Graph-evolving meta-learning for low-resource medical dialogue generation. CoRR,
abs/2012.11988, 2020.
[20] Wenge Liu, Jianheng Tang, Jinghui Qin, Lin Xu, Zhen Li, and Xiaodan Liang. Meddg: A large-
scale medical consultation dataset for building medical dialogue system. CoRR, abs/2010.07497,
2020.
[21] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
[22] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural
language decathlon: Multitask learning as question answering. CoRR, abs/1806.08730, 2018.
[23] Dimitris Pappas, Ion Androutsopoulos, and Haris Papageorgiou. Bioread: A new dataset for
biomedical reading comprehension. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri,
Thierry Declerck, Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani,
Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors,
Proceedings of the Eleventh International Conference on Language Resources and Evaluation,
LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association
(ELRA), 2018.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative
style, high-performance deep learning library. In Advances in Neural Information Processing
Systems, pages 8024–8035, 2019.
[25] Tatiana Shavrina, Alena Fenogenova, Anton A. Emelyanov, Denis Shevelev, Ekaterina Arte-
mova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, and Andrey
Evlampiev. Russiansuperglue: A russian language understanding evaluation benchmark. In
Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
16-20, 2020, pages 4717–4726. Association for Computational Linguistics, 2020.
[26] Xiaoming Shen and Yonghao Gui. Clinical Pediatrics 2nd edn. People’s Medical Publishing
House, 2013.
[27] George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias
Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris
Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari,
Thierry Artières, Axel-Cyrille Ngonga Ngomo, Norman Heino, Éric Gaussier, Liliana Barrio-
Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. An overview of the
BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC
Bioinform., 16:138:1–138:28, 2015.
[28] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-
purpose language understanding systems. In Hanna M. Wallach, Hugo Larochelle, Alina
Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances
in Neural Information Processing Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages
3261–3275, 2019.
[29] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. In
7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA,
May 6-9, 2019. OpenReview.net, 2019.
[30] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide,
Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick,
Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang,
Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian
Kohlmeier. CORD-19: the covid-19 open research dataset. CoRR, abs/2004.10706, 2020.
[31] Weiping Wang, Kun Song, and Liwen Chang. Pediatrics 9th edn. People’s Medical Publishing
House, 2018.
[32] Zhongyu Wei, Qianlong Liu, Baolin Peng, Huaixiao Tou, Ting Chen, Xuanjing Huang, Kam-
Fai Wong, and Xiangying Dai. Task-oriented dialogue system for automatic diagnosis. In
Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018,
Volume 2: Short Papers, pages 201–207. Association for Computational Linguistics, 2018.
[33] Y. Wu, Ruibang Luo, H. Leung, H. Ting, and T. Lam. Renet: A deep learning approach for
extracting gene-disease associations from literature. In RECOMB, 2019.
[34] Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun,
Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun
Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang,
He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang,
Kyle Richardson, and Zhenzhong Lan. CLUE: A chinese language understanding evaluation
benchmark. In Donia Scott, Núria Bel, and Chengqing Zong, editors, Proceedings of the 28th
International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain
(Online), December 8-13, 2020, pages 4762–4772. International Committee on Computational
Linguistics, 2020.
[35] Kun-Hsing Yu, Andrew L Beam, and Isaac S Kohane. Artificial intelligence in healthcare.
Nature biomedical engineering, 2(10):719–731, 2018.
[36] Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng
Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, and
Pengtao Xie. Meddialog: Large-scale medical dialogue datasets. In Bonnie Webber, Trevor
Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages
9241–9250. Association for Computational Linguistics, 2020.
[37] Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao, Tao Liu, Wei Lu, Xi Chen, Haotang Deng, Qi Ju,
and Xiaoyong Du. UER: an open-source toolkit for pre-training models. In Sebastian Padó and
Ruihong Huang, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Process-
ing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019 - System Demonstrations,
pages 241–246. Association for Computational Linguistics, 2019.
[38] Hui Zong, Jinxuan Yang, Zeyu Zhang, Zuofeng Li, and Xiaoyan Zhang. Semantic categorization
of chinese eligibility criteria in clinical trials using machine learning methods. BMC Medical
Informatics Decis. Mak., 21(1):128, 2021.
A CBLUE Background
Standard datasets and shared tasks have played essential roles in promoting the development of
AI technology. Taking the Chinese BioNLP community as an example, the CHIP (China Health
Information Processing) conference releases biomedical-related shared tasks every year, which has
extensively advanced Chinese biomedical NLP technology. However, some datasets are no longer
available after the end of shared tasks, which has raised issues in the data acquisition and future
research of the datasets.
In recent years, state-of-the-art performance on many downstream tasks has been obtained with the help
of pre-trained language models. A significant trend is the emergence of multi-task leaderboards,
such as GLUE (General Language Understanding Evaluation) and CLUE (Chinese Language Un-
derstanding Evaluation). These leaderboards provide a fair benchmark that attracts the attention of
many researchers and further promotes the development of language model technology. For example,
Microsoft released BLURB (the Biomedical Language Understanding & Reasoning Benchmark)
in the medical field at the end of 2020. Recently, the Tianchi platform has launched the CBLUE
(Chinese Biomedical Language Understanding Evaluation) public benchmark under the guidance of
the CHIP Society. We believe that the release of the CBLUE will further attract researchers’ attention
to the medical AI field and promote the development of the community.
CBLUE 1.0 comprises the previous shared tasks of the CHIP conference and the datasets from the
Alibaba KUAKE Search Engine, covering named entity recognition, information extraction, clinical
diagnosis normalization, and single-sentence/sentence-pair classification.
B.1 Chinese Medical Named Entity Recognition Dataset (CMeEE)

Task Description Given a pre-defined schema and an input sentence, the task is to identify medical
entities and classify them into 9 categories: disease (dis), clinical symptoms (sym), drugs (dru),
medical equipment (equ), medical procedures (pro), body (bod), medical examination items (ite),
microorganisms (mic), and department (dep). Examples are shown in Table 6.
Dataset Statistic This task has 15,000 training, 5,000 validation, and 3,000 test instances.
The corpus contains 938 files and 47,194 sentences. The average number of words
per file is 2,355. The dataset contains 504 common pediatric diseases, 7,085 body parts, 12,907
clinical symptoms, and 4,354 medical procedures in total.
Entity type | Entity subtype | Label | Example
疾病 (disease) | 疾病或综合症 (disease or syndrome); 中毒或受伤 (poisoning or injury); 器官或细胞受损 (damage to organs or cells) | dis | 尿潴留者易继发泌尿系感染 (Patients with urinary retention are prone to secondary infections of the urinary system.)
临床表现 (clinical manifestations) | 症状 (symptom); 体征 (physical sign) | sym | 逐渐出现呼吸困难、阵发性喘憋,发作时呼吸快而浅,并伴有呼气性喘鸣,明显鼻扇及三凹征 (Then dyspnea and paroxysmal asthma may occur, along with shortness of breath, expiratory stridor, obvious flaring nares, and the three-concave sign.)
医疗程序 (medical procedure) | 检查程序 (check procedure); 治疗或预防程序 (treatment or preventive procedure) | pro | 用免疫学方法检测黑种病原体的特异抗原很有诊断价值,因其简单快速,常常用于早期诊断,诊断意义常较抗体检测更为可靠 (It is of great diagnostic value to detect the specific antigen of a certain pathogen with immunoassay, a simple and quick assay that is intended for early diagnosis and proves more reliable than the antibody assay.)
Table 6: Examples of entity types and subtypes in CMeEE.
B.2 Chinese Medical Information Extraction Dataset (CMeIE)

Task Background Entity and relation extraction is an essential information extraction task in
natural language processing and knowledge graph (KG) construction, used to detect pairs of entities and
their relations from unstructured text. The technology can be applied to the medical field: for example,
with entity and relation extraction, medical knowledge graphs can be constructed from unstructured and
semi-structured medical texts and then serve many downstream tasks.
Task Description The schema defines the relation (Predicate) and its related Subject and Object
types, such as (“subject_type”: “疾病”, “predicate”: “药物治疗”, “object_type”: “药物”). Given the
schema and a sentence, the task requires the model to automatically analyze the sentence and extract
all the triples [(S1, P1, O1), (S2, P2, O2), ...] in it. Table 7 shows examples from the dataset; the 53
schemas include 10 kinds of genus relations and 43 other sub-relations. The details are in the
53_schema.json file.
Evaluation Metrics The SPO results given by the participants need to be accurately matched. The
strict Micro-F1 is used for evaluation.
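A sketch of the strict Micro-F1 over (subject, predicate, object) triples is shown below; it only illustrates the metric and is not the official evaluation script.

```python
def spo_micro_f1(gold_per_sentence, pred_per_sentence):
    """Each element is a list of (subject, predicate, object) triples; a
    predicted triple counts only if it exactly matches a gold triple."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_per_sentence, pred_per_sentence):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```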
Dataset Statistic This task has 14,339 training, 3,585 validation, and 4,482 test instances.
The dataset is drawn from a pediatric corpus and a common disease corpus. The pediatric corpus
originates from 518 pediatric diseases, and the common disease corpus is derived from 109 common
diseases. The dataset contains nearly 75,000 triples, 28,000 disease sentences, and 53 schemas.
Relation type | Relation subtype | Example
疾病_其他 (disease_other) | 预防 (prophylaxis) | {'predicate': '预防-prevention', 'subject': '麻风病-Leprosy', 'subject_type': '疾病-disease', 'object': '利福-rifampicin', 'object_type': '其他-others'}
疾病_其他 (disease_other) | 阶段 (phase) | {'predicate': '阶段-phase', 'subject': '肿瘤-tumor', 'subject_type': '疾病-disease', 'object': 'I期-phase_I', 'object_type': '其他-others'}
疾病_其他 (disease_other) | 就诊科室 (treatment department) | {'predicate': '就诊科室-treatment_department', 'subject': '腹主动脉瘤-abdominal_aortic_aneurysm', 'subject_type': '疾病-disease', 'object': '初级医疗保健医处-primary_medical_care_clinic', 'object_type': '其他-others'}
疾病_其他治疗 (disease_other treatment) | 辅助治疗 (adjuvant therapy) | {'predicate': '辅助治疗-adjuvant_therapy', 'subject': '皮肤鳞状细胞癌-cutaneous_squamous_cell_carcinoma', 'subject_type': '疾病-disease', 'object': '非手术破坏-non_surgical_destruction', 'object_type': '其他治疗-other_treatment'}
疾病_其他治疗 (disease_other treatment) | 化疗 (chemotherapy) | {'predicate': '化疗-chemotherapy', 'subject': '肿瘤-tumour', 'subject_type': '皮肤鳞状细胞癌-cutaneous_squamous_cell_carcinoma', 'object': '局部化疗-local_chemotherapy', 'object_type': '其他治疗-other_treatment'}
疾病_其他治疗 (disease_other treatment) | 放射治疗 (radiotherapy) | {'predicate': '放射治疗-radiation_therapy', 'subject': '非肿瘤性疼痛-non_cancer_pain', 'subject_type': '疾病-disease', 'object': '外照射-external_irradiation', 'object_type': '其他治疗-other_treatment'}
疾病_手术治疗 (disease_surgical treatment) | 手术治疗 (surgical treatment) | {'predicate': '手术治疗-surgical_treatment', 'subject': '皮肤鳞状细胞癌-cutaneous_squamous_cell_carcinoma', 'subject_type': '疾病-disease', 'object': '传统手术切除-surgical_resection(traditional_therapy)', 'object_type': '手术治疗-surgical_treatment'}
Table 7: Examples of relation types and subtypes in CMeIE.
B.3 CHIP Clinical Diagnosis Normalization Dataset (CHIP-CDN)

Task Background Clinical term normalization is a crucial task for both research and industrial use.
Clinically, there might be up to hundreds of different synonyms for the same diagnosis, symptom, or
procedure; for example, “heart attack” and “MI” both stand for the standard terminology “myocardial
infarction”. The goal of this task is to find the standard phrases (i.e., ICD codes) for a given
clinical term. The standard codes can ease the burden on researchers performing statistical analysis of
clinical trials, and they are also helpful for insurance companies in DRG- or DIP-related applications.
This task is proposed for this purpose, and the original shared task was released at the CHIP2020
conference.
Task Description The task aims to standardize the terms from the final diagnoses of Chinese
electronic medical records. No privacy information is involved in the final diagnoses. Given the
original terms, it is required to predict its corresponding standard phrase from the standard vocabulary
of “International Classification of Diseases (ICD-10) for Beijing Clinical Edition v601.” Examples
are shown in Table 8.
Evaluation Metrics The F1 score is calculated over (original diagnosis term, standard phrase)
pairs. If the test set has m gold pairs and the prediction contains n pairs, of which k pairs are
predicted correctly, then:
$$P = k/n, \quad R = k/m, \quad F_1 = \frac{2PR}{P + R}. \qquad (1)$$
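Eq. (1) can be read as a set comparison over (term, standard phrase) pairs. The sketch below assumes predictions and gold answers are given as "##"-separated phrase strings per original term (the format used in Table 27); it is illustrative, not the official scorer.

```python
def cdn_f1(gold, pred):
    """gold, pred: dict mapping an original term to a '##'-separated string
    of standard phrases."""
    gold_pairs = {(t, p) for t, v in gold.items() for p in v.split("##")}
    pred_pairs = {(t, p) for t, v in pred.items() for p in v.split("##")}
    k = len(gold_pairs & pred_pairs)
    precision = k / len(pred_pairs) if pred_pairs else 0.0
    recall = k / len(gold_pairs) if gold_pairs else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```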
Dataset Statistic 8,000 training instances and 10,000 testing instances are provided. We split the
original training set into 6,000 and 2,000 for the training and validation set, respectively.
B.4 CHIP Clinical Trial Classification Dataset (CHIP-CTC)

Task Background Clinical trials are scientific studies conducted with human volunteers to
determine the efficacy, safety, and side effects of a drug or a treatment method. They play a crucial role
in promoting the development of medicine and improving human health. Depending on the purpose
of the experiment, the subjects may be patients or healthy volunteers. The goal of this task is to
predict whether a subject meets the eligibility criteria of a clinical trial. Recruitment of subjects for
clinical trials is generally done through manual comparison of medical records and clinical trial
screening criteria, which is time-consuming, laborious, and inefficient. In recent years, methods based
on natural language processing have been successful in many biomedical applications. This task is
proposed with the purpose of automatically classifying clinical trial eligibility criteria for the Chinese
language, and the original task was released at the CHIP2019 conference. All the data come from real
clinical trials collected from the website of the Chinese Clinical Trial Registry (ChiCTR)12, a non-profit
organization providing registration for clinical trial information.
Task Description A total of 44 pre-defined semantic categories are defined for this task, and the
goal is to predict a given text to the correct category. Examples of labeled data are shown in Table 9.
12 http://chictr.org.cn/
ID | Clinical trial sentence | Category
S1 | 年龄>80岁 (Age: > 80) | Age
S2 | 近期颅内或椎管内手术史 (Recent intracranial or intraspinal surgery) | Therapy or Surgery
S3 | 血糖<2.7mmol/L (Blood glucose < 2.7 mmol/L) | Laboratory Examinations
Table 9: Examples of labeled eligibility criteria sentences in CHIP-CTC.
Evaluation Metrics The evaluation of this task uses Macro-F1. Suppose we have n categories
C1, ..., Ci, ..., Cn. The precision Pi is the number of records correctly predicted as class Ci divided by
the number of records predicted as class Ci. The recall Ri is the number of records correctly
predicted as class Ci divided by the number of records that truly belong to class Ci.
$$\text{Macro-}F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2 P_i R_i}{P_i + R_i} \qquad (2)$$
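For a quick sanity check, the same Macro-F1 can be computed with scikit-learn; the category ids below are hypothetical.

```python
from sklearn.metrics import f1_score

y_true = [0, 3, 3, 7]   # hypothetical gold category ids (out of the 44 classes)
y_pred = [0, 3, 1, 7]   # hypothetical predicted category ids
macro_f1 = f1_score(y_true, y_pred, average="macro")
```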
Dataset Statistic This task has 22,962 training, 7,682 validation, and 10,000 test instances.
Dataset Provider The dataset is provided by the School of Life Sciences and Technology, Tongji
University.
B.5 CHIP Semantic Textual Similarity Dataset (CHIP-STS)

Task Background The CHIP-STS task aims at transferring semantic similarity knowledge across
disease types based on Chinese online medical questions. Specifically, given question pairs from 5
different diseases, it is required to determine whether the semantics of the two sentences are similar
or not. The original shared task was released at the CHIP2019 conference.
Task Description The category represents the name of the disease type, including diabetes, hyper-
tension, hepatitis, AIDS, and breast cancer. The label indicates whether the semantics of the questions
are the same. If they are the same, they are marked as 1, and if they are not the same, they are marked
as 0. Examples of labeling are shown in Table 10.
Dataset Statistic This task has 16,000 training, 4,000 validation, and 10,000 test instances.
B.6 KUAKE-Query Intent Classification Dataset (KUAKE-QIC)
Task Background In medical search scenarios, the understanding of query intent can significantly
improve the relevance of search results. In particular, medical knowledge is highly specialized, and
classifying query intentions can also help integrate medical knowledge to enhance the performance
of search results. This task is proposed for this purpose.
Task Description There are 11 categories of medical intent labels, including diagnosis, etiology
analysis, treatment plan, medical advice, test result analysis, disease description, consequence
prediction, precautions, intended effects, treatment fees, and others. Examples are shown in Table 11.
Intent | Sentences
病情诊断 (disease diagnosis) | 最近早上起来浑身无力是怎么回事?(Why do I always feel weak after I get up in the morning?); 我家宝宝快五个月了,为什么偶尔会吐清水带?(Why does my 5-month-old baby occasionally vomit clear liquid?)
注意事项 (precautions) | 哮喘应该注意些什么 (What should patients with asthma pay attention to?); 孕妇能不能吃榴莲 (Can a pregnant woman eat durians?); 柿子不能和什么一起吃 (Which food cannot be eaten together with persimmons?); 糖尿病人饮食注意什么啊?(What should patients with diabetes pay attention to about their diet?)
就医建议 (medical advice) | 糖尿病该做什么检查?(What examination should patients with diabetes receive?); 肚子疼去什么科室?(Which department should patients with stomachache visit?)
Table 11: Example queries and intent labels in KUAKE-QIC.
Dataset Statistic This task has 6,931 training, 1,955 validation, and 1,994 test instances.
B.7 KUAKE-Query Title Relevance Dataset (KUAKE-QTR)

Task Background KUAKE Query Title Relevance is a dataset for query-document (title) relevance
estimation. For example, given the query “Symptoms of vitamin B deficiency”, a relevant title would
be “The main manifestations of vitamin B deficiency”.

Task Description The correlation between Query and Title is divided into 4 levels (0-3), where 0 is
the worst and 3 stands for the best match. Examples are shown in Table 12.
Evaluation Metrics Same as the KUAKE-QIC task, accuracy is used for the evaluation of this
task.
Dataset Statistic This task has 24,174 training, 2,913 validation, and 5,465 test instances.
Query | Title | Level
缺维生素b的症状 (Symptoms of Vitamin B deficiency) | 维生素b缺乏症的主要表现 (What are the major symptoms of Vitamin B deficiency?) | 3
大腿软组织损伤怎么办 (How can I treat a soft tissue injury in the thigh?) | 腿部软组织损伤怎么办 (What's the treatment for a soft tissue injury in the leg?) | 2
小腿抽筋是什么原因引起的 (What causes lower leg cramps?) | 小腿抽筋后一直疼怎么办 (How can I treat pains caused by lower leg cramps?) | 1
挑食是什么原因造成的 (What is the cause of picky eating?) | 挑食是什么原因造成的 (What is the cause of picky eating?) | 0
Table 12: Examples of query-title pairs and relevance levels in KUAKE-QTR.
B.8 KUAKE-Query Query Relevance Dataset (KUAKE-QQR)

Task Background KUAKE Query-Query Relevance is a dataset that evaluates the relevance
between two given queries, aiming to address the long-tail challenges of search engines. Similar to
KUAKE-QTR, query-query relevance is an essential and challenging task in real-world search engines.

Task Description The correlation between the two queries is divided into 3 levels (0-2), where 0 is the
worst and 2 stands for the best correlation. Examples are shown in Table 13.
Evaluation Metrics Same with the KUAKE-QIC and KUAKE-QTR tasks, accuracy is used for
the evaluation metrics.
Dataset Statistic This task has 15,000 training, 1,600 validation, and 1,596 test instances.
C Experiment Details
This section details the training procedures and hyper-parameters for each of the datasets. We utilize
PyTorch to conduct the experiments, and all running hyper-parameters are shown in the following tables.
There are two stages in CMeIE, namely entity recognition (CMeIE-ER) and relation classification
(CMeIE-RE), so we detail the hyper-parameters of CMeIE-ER and CMeIE-RE separately.
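As a rough sketch of how such a two-stage pipeline can be composed (the recognize_entities and classify_relation callables stand in for the two fine-tuned models and are not actual toolkit functions):

```python
from itertools import permutations

def extract_triples(sentence, recognize_entities, classify_relation):
    spans = recognize_entities(sentence)            # stage 1: CMeIE-ER
    triples = []
    for subj, obj in permutations(spans, 2):        # stage 2: CMeIE-RE
        predicate = classify_relation(sentence, subj, obj)
        if predicate is not None:                   # None stands for "no relation"
            triples.append((subj, predicate, obj))
    return triples
```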
Requirements
• python3
Method Value
warmup_proportion 0.1
weight_decay 0.01
adam_epsilon 1e-8
max_grad_norm 1.0
Table 15: Hyper-parameters for the training of pre-trained models with a token classification head
on top for named entity recognition of the CMeEE task.
• pytorch 1.7
• transformers 4.5.1
• jieba
• gensim
Hyper-parameters for each specific task are shown in Tables 14-25.
Table 16: Hyper-parameters for the training of pre-trained models with a token-level classifier for
subject and object recognition of the CMeIE task.
Model epoch batch_size max_length learning_rate
bert-base 8 32 128 5e-5
bert-wwm-ext 8 32 128 5e-5
roberta-wwm-ext 8 32 128 4e-5
roberta-wwm-ext-large 8 16 80 4e-5
roberta-large 8 16 80 2e-5
albert-tiny 10 32 128 4e-5
albert-xxlarge 8 16 80 1e-5
zen 8 20 128 4e-5
macbert-base 8 32 128 4e-5
macbert-large 8 20 80 2e-5
PCL-MedBERT 8 32 128 4e-5
Table 17: Hyper-parameters for the training of pre-trained models with a classifier for the entity
pairs relation prediction of the CMeIE task.
Table 18: Hyper-parameters for the training of pre-trained models with a sequence classification
head on top for screening criteria classification of the CHIP-CTC task.
Param Value
recall_k 200
num_negative_sample 10
Table 19: Hyper-parameters for the CHIP-CDN task. We model the CHIP-CDN task with two stages:
a recall stage and a ranking stage. num_negative_sample sets the number of negative samples used
for training the ranking model during the ranking stage. recall_k sets the number of candidates
recalled in the recall stage.
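The two stages fit together roughly as sketched below; recall_fn and rank_score_fn are placeholder callables (e.g., a string-similarity retriever and the sequence-pair classifier of Table 20), not functions of the released toolkit.

```python
def normalize_term(term, icd_vocabulary, recall_fn, rank_score_fn,
                   recall_k=200, top_n=1):
    # Stage 1: recall — retrieve candidate standard phrases from the ICD vocabulary.
    candidates = recall_fn(term, icd_vocabulary, k=recall_k)
    # Stage 2: ranking — score each (original term, candidate phrase) pair.
    scored = [(cand, rank_score_fn(term, cand)) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored[:top_n]]
```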
Model epoch batch_size max_length learning_rate
bert-base 3 32 128 4e-5
bert-wwm-ext 3 32 128 5e-5
roberta-wwm-ext 3 32 128 4e-5
roberta-wwm-ext-large 3 32 40 4e-5
roberta-large 3 32 40 4e-5
albert-tiny 3 32 128 4e-5
albert-xxlarge 3 32 40 1e-5
zen 3 20 128 4e-5
macbert-base 3 32 128 4e-5
macbert-large 3 32 40 2e-5
PCL-MedBERT 3 32 128 4e-5
Table 20: Hyper-parameters for the training of pre-trained models with a sequence classifier for the
ranking model of the CHIP-CDN task. We encode the pairs of the original term and standard phrase
from candidates recalled during the recall stage and then pass the pooled output to the classifier,
which predicts the relevance between the original term and standard phrase.
Table 21: Hyper-parameters for the training of pre-trained models with a sequence classifier for the
prediction of the number of standard phrases corresponding to the original term in the CHIP-CDN
task.
Table 22: Hyper-parameters for the training of pre-trained models with a sequence classifier for
sentence similarity prediction of the CHIP-STS task.
Model epoch batch_size max_length learning_rate
bert-base 3 16 50 2e-5
bert-wwm-ext 3 16 50 2e-5
roberta-wwm-ext 3 16 50 2e-5
roberta-wwm-ext-large 3 16 50 2e-5
roberta-large 3 16 50 3e-5
albert-tiny 3 16 50 5e-5
albert-xxlarge 3 16 50 1e-5
zen 3 16 50 2e-5
macbert-base 3 16 50 3e-5
macbert-large 3 16 50 2e-5
PCL-MedBERT 3 16 50 2e-5
Table 23: Hyper-parameters for the training of pre-trained models with a sequence classifier for
query intention prediction of the KUAKE-QIC task.
Table 24: Hyper-parameters of training the sequence classifier for the KUAKE-QTR task.
Table 25: Hyper-parameters of training the sequence classifier for the KUAKE-QQR task.
Sentence: 另一项研究显示,减荷鞋对内侧膝骨关节炎也没有效。(Another study showed that load-reducing shoes were also not effective for medial knee osteoarthritis.)
Gold: 内侧膝骨关节炎 | 辅助治疗 | 减荷鞋 (medial knee osteoarthritis | adjuvant therapy | load-reducing shoes)
RO: 膝骨关节炎 | 辅助治疗 | 减荷鞋 (knee osteoarthritis | adjuvant therapy | load-reducing shoes)
MB: 膝骨关节炎 | 辅助治疗 | 减荷鞋 (knee osteoarthritis | adjuvant therapy | load-reducing shoes)

Sentence: 精神疾病:焦虑和抑郁与失眠症高度相关。(Mental illness: anxiety and depression are highly related to insomnia.)
Gold: 焦虑 | 相关(导致) | 失眠症 (anxiety | related (cause) | insomnia)
RO: 无 | 无 | 无 (None | None | None)
MB: 焦虑 | 相关(导致) | 失眠症 (anxiety | related (cause) | insomnia)

Sentence: 在狂犬病感染晚期,患者常出现昏迷。(In the late stage of rabies infection, patients often become comatose.)
Gold: 狂犬病 | 相关(转化) | 昏迷 (rabies | related (transform) | coma)
RO: 无 | 无 | 无 (None | None | None)
MB: 无 | 无 | 无 (None | None | None)

Table 26: Error cases in CMeIE. We evaluate roberta-wwm-ext and PCL-MedBERT on 3 sampled
sentences, with their gold labels and model predictions. Each label consists of subject | predicate |
object. None means that the model fails to predict. RO = roberta-wwm-ext, MB = PCL-MedBERT.
Sentence: 右第一足趾创伤性足趾切断 (Traumatic amputation of the right first toe)
Label: 单趾切断 (Single toe amputation)
RO: 足趾损伤 (Toe injury)
MB: 单趾切断 (Single toe amputation)

Sentence: C3-4脊髓损伤 (C3-4 spinal cord injury)
Label: 颈部脊髓损伤 (Cervical spinal cord injury)
RO: 脊髓损伤 (Spinal cord injury)
MB: 脊髓损伤 (Spinal cord injury)

Sentence: 肿瘤骨转移胃炎 (Tumor bone metastasis with gastritis)
Label: 骨继发恶性肿瘤##转移性肿瘤##胃炎 (Secondary malignant tumor of bone##Metastatic tumor##Gastritis)
RO: 反流性胃炎##转移性肿瘤##胃炎 (Reflux gastritis##Metastatic tumor##Gastritis)
MB: 骨盆部肿瘤##转移性肿瘤##胃炎 (Pelvic tumor##Metastatic tumor##Gastritis)

Table 27: Error cases in CHIP-CDN. We evaluate roberta-wwm-ext and PCL-MedBERT on 3
sampled sentences, with their gold labels and model predictions. There may be multiple predicted
values, separated by "##". RO = roberta-wwm-ext, MB = PCL-MedBERT.
Need syntactic knowledge indicates that the instance contains a complex syntactic structure and the
model fails to capture the correct meaning.
Overlap entity indicates that there are multiple overlapping entities in the instance.
Long sequence indicates that the input instance is very long.
Annotation error indicates that the annotated label is wrong.
Wrong entity boundary indicates that the model predicts an incorrect entity boundary for the instance.
Rare words indicates that the instance contains low-frequency words.
Multiple triggers indicates that the instance contains multiple indicative words that mislead the prediction.
Colloquialism (very common in search queries) indicates that the instance differs markedly from
written language (e.g., it contains many abbreviations), which challenges the prediction model.
Irrelevant description indicates that the instance contains a lot of irrelevant information, which
misleads the prediction.
Sentence: 既往多次行剖腹手术或腹腔广泛粘连者 (Patients with multiple previous laparotomies or extensive abdominal adhesions)
Label: 含有多类别的语句 (Multiple)
RO: 治疗或手术 (Therapy or Surgery)
MB: 治疗或手术 (Therapy or Surgery)

Sentence: 术前认知发育筛查(DST)发现发育迟缓 (Preoperative cognitive development screening (DST) finds developmental delay)
Label: 诊断 (Diagnostic)
RO: 疾病 (Disease)
MB: 诊断 (Diagnostic)

Sentence: 已知发生中枢神经系统转移的患者 (Patients with known central nervous system metastases)
Label: 肿瘤进展 (Neoplasm Status)
RO: 疾病 (Disease)
MB: 疾病 (Disease)

Table 28: Error cases in CHIP-CTC. We evaluate roberta-wwm-ext and PCL-MedBERT on 3
sampled sentences, with their gold labels and model predictions. RO = roberta-wwm-ext, MB =
PCL-MedBERT.
Query-A: 汗液能传播乙肝病毒吗?(Can sweat spread the hepatitis B virus?)
Query-B: 乙肝的传播途径?(How is hepatitis B transmitted?)
BE: 0, BE+: 0, MB: 0, Gold: 1

Query-A: 哪种类型糖尿病?(What type of diabetes?)
Query-B: 我是什么类型的糖尿病?(What type of diabetes am I?)
BE: 1, BE+: 1, MB: 1, Gold: 0

Query-A: 如何防治艾滋病?(How to prevent AIDS?)
Query-B: 艾滋病防治条例。(AIDS Prevention and Control Regulations.)
BE: 1, BE+: 0, MB: 0, Gold: 1

Table 29: Error cases in CHIP-STS. We evaluate the performance of the baselines on 3 sampled
instances. The similarity between queries is divided into 2 levels (0-1), meaning 'unrelated' and
'related'. BE = BERT-base, BE+ = BERT-wwm-ext-base, MB = PCL-MedBERT.
Query-A: 吃药能吃螃蟹吗?(Can I eat crabs while taking medicine?)
Query-B: 你好,吃完螃蟹后,可不可以吃药呢 (Hello, can I take medicine after eating crabs?)
BE: 3, BE+: 3, MB: 3, Gold: 0

Query-A: 一颗蛋白卡路里。(Calories of one egg white.)
Query-B: 一个鸡蛋白的热量。(The calories of one egg white.)
BE: 1, BE+: 1, MB: 0, Gold: 3

Query-A: 氨基酸用法用量。(Amino acid usage and dosage.)
Query-B: 氨基酸的功效及用法用量。(Efficacy, usage, and dosage of amino acids.)
BE: 2, BE+: 2, MB: 2, Gold: 1

Table 30: Error cases in KUAKE-QTR. We evaluate the performance of the baselines on 3 sampled
instances. The correlation between Query and Title is divided into 4 levels (0-3), meaning 'unrelated',
'poorly related', 'related', and 'strongly related'. BE = BERT-base, BE+ = BERT-wwm-ext-base,
MB = PCL-MedBERT.
Query-A: 益生菌是饭前喝还是饭后喝。(Should probiotics be drunk before or after meals.)
Query-B: 益生菌是饭前喝还是饭后喝比较好。(Is it better to drink probiotics before or after meals?)
BE: 1, ZEN: 2, MB: 1, Gold: 2

Query-A: 糖尿病能吃肉吗?(Can diabetics eat meat?)
Query-B: 高血糖能吃肉吗?(Can hyperglycemic patients eat meat?)
BE: 1, ZEN: 1, MB: 1, Gold: 0

Query-A: 神经衰弱吃什么药去根?(What drug can cure neurasthenia for good?)
Query-B: 神经衰弱吃什么药有效?(What drug is effective for neurasthenia?)
BE: 0, ZEN: 0, MB: 2, Gold: 2

Table 31: Error cases in KUAKE-QQR. We evaluate the performance of the baselines on 3 sampled
instances. The correlation between queries is divided into 3 levels (0-2), meaning 'poorly related or
unrelated', 'related', and 'strongly related'. BE = BERT-base, ZEN = ZEN, MB = PCL-MedBERT.
Contributions
Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Lei Li, Xiang Chen, Shumin Deng from Zhejiang
University, AZFT Joint Lab for Knowledge Engine, Hangzhou Innovation Center wrote the paper.
Luoqiu Li, Xin Xie, Hongbin Ye from Zhejiang University, AZFT Joint Lab for Knowledge Engine,
Hangzhou Innovation Center implemented an early version of the Figures and Tables in the paper.
Kunli Zhang from School of Information Engineering, Zhengzhou University, Peng Cheng Labo-
ratory, China and Baobao Chang from Key Laboratory of Computational Linguistics, Ministry of
Education, Peking University, Peng Cheng Laboratory, China contributed the dataset of CMeEE.
Hongying Zan from School of Information Engineering, Zhengzhou University, Peng Cheng Lab-
oratory, China and Zhifang Sui from Key Laboratory of Computational Linguistics, Ministry of
Education, Peking University, Peng Cheng Laboratory, China contributed the dataset of CMeIE.
Linfeng Li, Jun Yan from Yidu Cloud Technology Inc., Beijing, China contributed the dataset of
CHIP-CDN.
Hui Zong from School of Life Sciences and Technology, Tongji University contributed the dataset of
CHIP-CTC.
Yuan Ni from Pingan Health Technology, Shanghai, China and Guotong Xie from Pingan Health
Technology, China, Ping An Health Cloud Company Limited, China, Ping An International Smart
City Technology Co., Ltd, China contributed the dataset of CHIP-STS.
Kangping Yin, Jian Xu from Alibaba Group and Xin Shang from School of Mathematical Science,
Zhejiang University contributed the datasets of KUAKE-QIC, KUAKE-QTR, and KUAKE-QQR.
Chuanqi Tan, Mosha Chen, Fei Huang, Luo Si from Alibaba Group and Zheng Yuan from the
Center for Statistical Science, Tsinghua University contributed the CBLUE benchmark leaderboard
and converted the eight datasets from their self-defined data formats into a unified JSON format.
Huajun Chen from Zhejiang University, AZFT Joint Lab for Knowledge Engine, Hangzhou Innova-
tion Center, Buzhou Tang, Qingcai Chen from Harbin Institute of Technology (Shenzhen), Peng
Cheng Laboratory, China advised the project, suggested tasks, and led the research.