Research article
Keywords: Natural language processing; Electronic health records; Information extraction; Generalizability

A B S T R A C T

Objective: Transformer-based language models are prevailing in the clinical domain due to their excellent performance on clinical NLP tasks. However, the generalizability of those models is usually ignored during model development. This study evaluated the generalizability of CancerBERT, a Transformer-based clinical NLP model, along with classic machine learning models, i.e., conditional random field (CRF) and bidirectional long short-term memory CRF (BiLSTM-CRF), across different clinical institutes through a breast cancer phenotype extraction task.

Materials and methods: Two clinical corpora of breast cancer patients were collected from the electronic health records of the University of Minnesota (UMN) and Mayo Clinic (MC) and annotated following the same guideline. We developed three types of NLP models (i.e., CRF, BiLSTM-CRF, and CancerBERT) to extract cancer phenotypes from clinical texts. We evaluated the generalizability of the models on different test sets with different learning strategies (model transfer vs. locally trained). The entity coverage score was assessed for its association with model performance.

Results: We manually annotated 200 and 161 clinical documents at UMN and MC, respectively. The corpora of the two institutes were found to have higher similarity between the target entities than between the overall corpora. The CancerBERT models obtained the best performances on the independent test sets from the two clinical institutes and on the permutation test set. The CancerBERT model developed in one institute and further fine-tuned in the other achieved reasonable performance compared with the model developed on local data (micro-F1: 0.925 vs. 0.932).

Conclusions: The results indicate that the CancerBERT model has superior learning ability and generalizability among the three types of clinical NLP models for our named entity recognition task. It has the advantage of recognizing complex entities, e.g., entities with different labels.
Abbreviations: EHR, electronic health records; NLP, natural language processing; UMN, University of Minnesota; BlueBERT, Biomedical Language Understanding Evaluation Bidirectional Encoder Representations from Transformers; MC, Mayo Clinic; CRF, conditional random field; BiLSTM-CRF, bidirectional long short-term memory CRF; BERT, bidirectional encoder representations from transformers; IRB, Institutional Review Board; ECR, entity coverage ratio; IAA, inter-annotator agreement; TF-IDF, term frequency-inverse document frequency; NER, named entity recognition; GloVe, Global Vectors for Word Representation.

1. Introduction

With the widespread adoption of electronic health records (EHRs), natural language processing (NLP) methods have gained momentum in leveraging clinical texts for clinical decision support and research purposes [1]. The number of articles published on PubMed about "clinical NLP" and "EHR" increased from 24 in 2017 to 70 in 2022. NLP methods have been widely adopted in applications such as real-time cancer case identification [2], medical prescription classification [3], and automatic extraction of heart failure information from EHRs [4]. NLP also plays an important
role in clinical data management, including clinical data cleaning [5,6], data interoperability improvement [7,8], and EHR data capture [9,10]. These NLP methods can be mainly classified into either symbolic or statistical approaches [11]. Symbolic approaches dominated the clinical domain in the early years because they could meet the basic information needs of many clinical applications without requiring a large amount of annotated data, which demands intensive human labor [12]. The interpretability of their results was a further advantage [12]. However, symbolic approaches suffer from a notable limitation in terms of portability [1]. In recent years, benefiting from the development of large language models (Transformer-based models [13]) and the increase of annotated data in the clinical domain, statistical approaches have developed rapidly and achieved remarkable performance on various clinical NLP tasks. Nonetheless, a substantial gap exists between the performance of these large language models and the understanding of their generalizability [14]. Successfully deployed clinical NLP systems are often developed on data from a single healthcare institute, and their performance can vary when applied in different healthcare institutes [15]. Variations in EHR platforms, clinical documentation rules, and conventions across healthcare institutions can lead to divergent clinical texts even when documenting highly similar medical events. These variations encompass both syntactic and semantic differences [16], and they tend to accumulate across the entire clinical corpus. Consequently, the impact of these variations on the generalizability of NLP models developed within a single healthcare institution remains an important research gap.

Limited research has been conducted to investigate the generalizability of clinical NLP models. For instance, Sohn et al. examined the portability of a rule-based NLP system designed to identify asthma patients within a clinical cohort [1]. The NLP system was developed on data from a single hospital and subsequently evaluated on an external cohort, revealing a significant drop in performance due to variations in clinical documentation [1]. Fan et al. externally evaluated the cTAKES tool on a part-of-speech tagging task and found that performance dropped by 5% when the tool was tested on data from another clinical institute [15]. In another study, a rule-based NLP system was developed to identify patients with a family history of pancreatic cancer and tested on data from two different clinical institutions [17]. The findings indicated that, for highly specific NLP tasks, the rule-based system demonstrated portability as long as the rules remained simple and could be updated using new data. Liu et al. [18] evaluated the performance of an NLP system for detecting smoking status across multiple institutions. The system incorporated both machine learning and rule-based components. The results suggested that moderate effort was required to make the NLP system portable, including annotating additional data to further train the machine learning module and incorporating new rules based on the new data. In a recent study, a rule-based NLP model was developed to extract social determinants from clinical notes. Although clinical notes were mostly consistent in describing social determinants at geographically distinct institutions, accuracy dropped by around 6% when the model was evaluated externally, and institution-specific modification of rules was necessary to maintain the generalizability of the model [19].

These studies primarily focused on assessing the generalizability of rule-based and traditional machine learning methods. There is currently a dearth of studies investigating the generalizability of Transformer-based models in the clinical domain [14]. Khambete et al. explored the generalization of the SciBERT model on a clinical sentiment classification task. SciBERT was trained on data from one medical specialty in MIMIC-III, and when tested on data from other medical specialties, the AUC dropped by about 8%. However, the MIMIC-III data cannot reflect the data heterogeneity among different clinical institutes [20].

Transformer-based models have demonstrated outstanding performance on various clinical NLP tasks. However, the development of these models necessitates substantial computing resources and human effort for data annotation. Furthermore, privacy concerns often prevent the sharing of annotated clinical data, posing challenges to the development of generalizable models. Therefore, comprehending the generalizability of Transformer-based models holds utmost importance in the clinical NLP domain, as significant savings in labor and computational resources can be achieved by avoiding the training of similar models if they are portable and generalizable across different clinical institutions. Recently, we introduced CancerBERT, a cancer-specific language model designed to identify eight breast cancer phenotypes in clinical texts from breast cancer patients at the University of Minnesota's M Health Fairview (UMN) [21]. CancerBERT was developed based on the Biomedical Language Understanding Evaluation Bidirectional Encoder Representations from Transformers (BlueBERT) language model [22] and exhibited exceptional performance on the breast cancer phenotype extraction task. However, the CancerBERT models were exclusively developed and evaluated using the EHR corpus from UMN. The objective of this study is to assess the generalizability of the CancerBERT models by evaluating their performance on a separate corpus collected from Mayo Clinic (MC), another healthcare institution. Additionally, we conducted evaluations and comparisons with other benchmark models, including conditional random field (CRF) and bidirectional long short-term memory CRF (BiLSTM-CRF) models. To evaluate the generalizability of the models, we examined a clinical information extraction task from various perspectives. The contributions of this study include:

1. We assessed the impact of corpus heterogeneity on the generalizability of Bidirectional Encoder Representations from Transformers (BERT) based NLP models.
2. We constructed a permutation dataset to analyze the robustness of the models.
3. We compared two strategies for transferring models between clinical institutions: i) direct transfer and ii) continuous fine-tuning.

The remainder of this paper is structured as follows. Section 2 provides an in-depth discussion of the methodology, covering the data preparation and experimental setup, including the details of the datasets used and the evaluation metrics for each step. The results of the experiments are summarized in Section 3, following the same order as Section 2. In Section 4, we interpret the results, draw connections with previous studies, and explore the implications of our findings; we also discuss the limitations of the current study and propose directions for future research.

2. Methods and materials

2.1. Overview of the study

This study was approved by the institutional review boards (IRBs) of UMN and MC. Fig. 1 shows the pipeline of the study. We collected the clinical texts of breast cancer patients from the EHRs of the two institutes (UMN and MC). The MC team annotated their corpus following the same annotation guideline previously defined by the UMN team [21]. The CancerBERTUMN models, along with the benchmark CRFUMN and BiLSTM-CRFUMN models developed on the UMN corpus, were originally designed to extract eight types of breast cancer phenotypes (i.e., Hormone receptor type, Hormone receptor status, Tumor size, Tumor site, Cancer grade, Histological type, Cancer laterality, and Cancer stage) from clinical texts. We externally evaluated the performance of the models trained at the UMN site using data from the MC site. Additionally, we compared two transfer learning strategies for the NLP models: (1) models continuously refined on MC data (MC refined models) and (2) models trained solely on local MC data (MC models). Furthermore, we conducted two experiments to explore the advantages of BERT-based models over the traditional BiLSTM-CRF and CRF models, which served as baselines, in terms of model robustness.
Fig. 1. The pipeline of the study. Data were collected and annotated from the UMN and MC. The UMN models were externally evaluated on MC data. UMN models
were further refined on MC data and evaluated as comparisons. Permutation dataset evaluation and Entity coverage ratio analysis were conducted to explore the
model generalizability.
2.2. Data sources

The data used in this study were obtained from the EHRs of two clinical institutes, i.e., the UMN Clinical Data Repository and MC. The EHR data from UMN contain the health records of 17,970 breast cancer patients from the years 2001–2018. The EHR data from MC contain 54,050 breast cancer patients from 2000 to 2020.

2.3. Manual annotation and corpora comparison

The annotation schema was introduced in our previous study [21]. We used the same schema to annotate the clinical texts extracted from the MC EHR. We randomly sampled one hundred breast cancer patients, and for each patient we randomly sampled one clinical note and one pathology report written within 90 days after the diagnosis date for annotation. This resulted in 81 clinical notes and 80 pathology reports. Two annotators worked on 10% of the texts independently; the inter-annotator agreement (IAA) was evaluated as a Cohen's Kappa of 0.80. One annotator then completed the remaining annotation tasks.
To better understand the impact of corpus differences across sites on the performance of NLP methods, we compared the annotated clinical texts from the two sites. Several basic corpus statistics, such as the number of sentences, the average length of the sentences, and the number of unique tokens, were summarized and compared. Additionally, we measured the similarity between the two corpora by representing each corpus as a normalized term frequency-inverse document frequency (TF-IDF) vector and calculating the cosine similarity [1]. Each corpus vector contains the average TF-IDF values for the unique terms in the corpus. The average TF-IDF value of term t is defined as Eq. (1), where N is the total number of documents in the corpus, TF_i(t) is the frequency of token t in the i-th document divided by the total number of tokens in the i-th document, and IDF(t) is log(N) divided by the number of documents containing token t.

\[ \mathrm{TF\text{-}IDF}(t) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TF}_i(t) \cdot \mathrm{IDF}(t) \tag{1} \]
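The corpus-level comparison in Eq. (1) can be sketched as follows. This is an illustrative implementation, not the authors' code; the function names and the assumption that each corpus arrives as a list of tokenized documents are ours:

```python
import math
from collections import Counter

def corpus_tfidf_vector(docs):
    """Average TF-IDF vector of a corpus, following Eq. (1).

    docs: list of documents, each given as a list of tokens.
    """
    n_docs = len(docs)
    # IDF(t): log(N) divided by the number of documents containing t.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n_docs) / df for t, df in doc_freq.items()}
    vector = Counter()
    for doc in docs:
        for t, count in Counter(doc).items():
            # TF_i(t): frequency of t divided by the document's token count.
            vector[t] += (count / len(doc)) * idf[t] / n_docs
    return vector

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

# Corpus-level similarity; the same routine applies per phenotype category.
# similarity = cosine_similarity(corpus_tfidf_vector(umn_docs),
#                                corpus_tfidf_vector(mc_docs))
```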
Besides the corpus-level similarity, we also calculated the similarity of each breast cancer phenotype category (phenotype similarity) between the two corpora. Similarly, each breast cancer phenotype was represented by a normalized TF-IDF vector containing the average TF-IDF values of the distinct terms under that phenotype category. Cosine similarity was calculated for the same breast cancer phenotype category from the two corpora. The phenotype similarity was used to investigate the relationship between the performance changes resulting from model transfer and the similarity between the corpora. Specifically, we calculated the Pearson correlation coefficients [23] between the phenotype similarities and the performance drops of each model. The Pearson correlation coefficient is defined as Eq. (2), where X_i refers to the phenotype similarity between the two corpora, \bar{X} is the mean of the similarity scores of all phenotypes, Y_i is the performance drop (change of F1 score) for the corresponding phenotype for a specific model, and \bar{Y} is the mean of the performance drops over all phenotypes.

\[ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}} \tag{2} \]

The performance drops were quantified as the differences between the performances of the UMN models evaluated on the UMN test set [21] and the performances of the UMN models directly evaluated on the MC test set.
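The correlation analysis of Eq. (2) reduces to a single scipy call; a minimal sketch with hypothetical per-phenotype values (eight categories, in a fixed order; the numbers are placeholders, not reported results):

```python
from scipy.stats import pearsonr

# Hypothetical values, one per breast cancer phenotype category.
phenotype_similarity = [0.95, 0.91, 0.88, 0.92, 0.90, 0.93, 0.89, 0.94]
f1_drop = [0.01, 0.05, 0.20, 0.30, 0.10, 0.04, 0.12, 0.06]

r, p_value = pearsonr(phenotype_similarity, f1_drop)  # implements Eq. (2)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```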
2.4. Portability evaluation for breast cancer phenotype extraction models

We evaluated the generalizability of three types of machine learning models in this study, i.e., CRF, BiLSTM-CRF, and CancerBERT [21] models. The CRF and BiLSTM-CRF are conventional machine learning models that have demonstrated efficacy in named entity recognition (NER) tasks within the clinical domain [24–26]. For the CRF and BiLSTM-CRF baseline models, we experimented with both GloVe (trained on Wikipedia) [27] and Word2vec (trained on Google News) [28] embeddings as input. The CancerBERT models were developed based on the BERT language model and showed superior performance compared to the CRF and BiLSTM-CRF models on our breast cancer phenotype extraction task [21].

The MC annotated data were divided into a training set (60%), a development set (10%), and a test set (30%). The data splitting, model training, and fine-tuning procedures were consistent with our previous study [21]. For all CancerBERT variants, the number of parameters remains the same (about 336 million) during the pre-training stage, and a softmax layer was added for NER prediction during the fine-tuning stage. The fine-tuned CancerBERT models use the parameter sharing inherent in the BERT architecture; the weights in the transformer layers
are shared across all input tokens. In this study, we evaluated the performance of the following three sets of models on the MC corpus test set.

1. Models trained only on the UMN corpus [21]. These UMN models (CRFUMN, BiLSTM-CRFUMN, CancerBERTUMN_397, CancerBERTUMN_997) were evaluated on the MC test set to test their portability through the task of extracting breast cancer phenotypes from clinical texts.
2. UMN models continuously fine-tuned on the MC corpus, including CRFMC_refined, BiLSTM-CRFMC_refined, and CancerBERTMC_refined. The CancerBERTUMN_397 model was further fine-tuned on MC annotated data to obtain the MC refined CancerBERTMC_refined model. To refine the UMN models, we fine-tuned all the locally trained UMN models on the training and development sets of the MC annotated data to obtain the corresponding MC refined models (a fine-tuning sketch follows this list).
3. A CancerBERT model trained only on the MC corpus (without UMN data). Following the same steps [21], we developed another CancerBERTMC_397 model, which was pre-trained only on the MC corpus (containing about 5 million clinical documents) and fine-tuned on the training and development sets of the MC annotated data as a benchmark model.
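As an illustration of the continuous fine-tuning strategy in item 2, the following sketch continues training a UMN NER checkpoint on the MC splits. It uses the Hugging Face transformers API rather than the authors' TensorFlow 1.15 pipeline, and the checkpoint path, label set, training settings, and dataset variables are all illustrative assumptions:

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

# BIO tag set for the eight phenotype categories (abbreviated for illustration).
LABELS = ["O", "B-HormoneReceptorStatus", "I-HormoneReceptorStatus",
          "B-TumorSize", "I-TumorSize"]

def refine_on_mc(checkpoint, mc_train_dataset, mc_dev_dataset):
    """Continue fine-tuning a UMN-trained NER checkpoint on MC annotated data."""
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS))  # checkpoint path is hypothetical
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="cancerbert_mc_refined",
                               num_train_epochs=3),  # illustrative setting
        train_dataset=mc_train_dataset,  # tokenized MC training split (60%)
        eval_dataset=mc_dev_dataset,     # tokenized MC development split (10%)
    )
    trainer.train()
    return model
```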
The breast cancer phenotype extraction can be framed as an NER task, and phenotype-level (named entity) evaluation was applied. We used the micro-average (assigning equal weight to each sample) F1 score and macro-average (assigning equal weight to each category) F1 score as metrics, and both exact match and lenient match were used following the i2b2 standard [29]. The F1 score was calculated as the harmonic mean of the precision and recall scores and provides a better assessment of model performance when data are imbalanced. All experiments were conducted 10 times, and one-way ANOVA tests were conducted to ascertain whether there are significant differences between the mean performances of the different models at a 95% confidence level. When ANOVA reveals statistically significant differences, it does not specify which groups differ from each other. Thus, we further conducted pairwise t-tests with Bonferroni correction, which adjusts the significance level for multiple comparisons and ensures that the probability of making a Type I error remains controlled and does not inflate due to multiple testing [30,31]. The model with the highest performance is significantly better than the others if all pairwise t-tests are significant.
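The testing procedure can be sketched as below, with hypothetical 10-run F1 scores (the arrays and model names are placeholders, not reported results):

```python
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

# Hypothetical F1 scores from 10 runs per model.
runs = {
    "CRF":        [0.80, 0.81, 0.79, 0.80, 0.82, 0.80, 0.81, 0.79, 0.80, 0.81],
    "BiLSTM-CRF": [0.78, 0.79, 0.77, 0.78, 0.78, 0.79, 0.78, 0.77, 0.79, 0.78],
    "CancerBERT": [0.92, 0.93, 0.92, 0.93, 0.92, 0.92, 0.93, 0.92, 0.93, 0.92],
}

# One-way ANOVA across all models (95% confidence level).
f_stat, p_anova = f_oneway(*runs.values())
if p_anova < 0.05:
    pairs = list(combinations(runs, 2))
    alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance level
    for a, b in pairs:
        _, p = ttest_ind(runs[a], runs[b])
        verdict = "significant" if p < alpha else "not significant"
        print(f"{a} vs {b}: p = {p:.2e} ({verdict})")
```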
2.5. Evaluation of model robustness with a permutation test set

The utilization of a permutation dataset allows for the simulation of data variations, thereby providing insights into the generalizability of models when confronted with new data [14,32]. In this study, we employed a manually created permutation test set to investigate the impact of data variances on model performance. Specifically, the BiLSTM-CRFUMN, CRFUMN, and CancerBERTUMN models were evaluated using this permutation test set. Entity permutation was used since entities are the core of the NER task. To create the permutation test set, we first identified all distinct entities in the MC annotated data that are not in the UMN annotated data; we then replaced the entities in the UMN test set with entities randomly sampled from the MC-unique entities of the same category. This approach resulted in a permutation dataset consisting of novel combinations of target entities and their corresponding contextual information. Evaluating the models on this permutation test set enabled us to assess their capacity to learn the contextual information of entities rather than relying solely on entity memorization [32]. We evaluated all models developed using the UMN data on the permutation test set. The evaluation schema is the same as for the portability evaluation of the models introduced in the previous section.
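A minimal sketch of the entity permutation step described above, assuming each test annotation carries its entity string, category, and surrounding context (the data layout and helper name are our own simplification, not the authors' code):

```python
import random

def build_permutation_set(umn_test, umn_entities, mc_entities):
    """Swap UMN test entities for MC-only entities of the same category.

    umn_test: list of (entity, category, context) annotations.
    umn_entities / mc_entities: dict mapping category -> set of entity strings.
    """
    # Distinct MC entities that never occur in the UMN annotated data.
    mc_only = {cat: sorted(mc_entities[cat] - umn_entities.get(cat, set()))
               for cat in mc_entities}
    permuted = []
    for entity, category, context in umn_test:
        candidates = mc_only.get(category)
        if candidates:
            # The context is preserved, so the model faces a novel
            # entity/context combination rather than a memorized entity.
            replacement = random.choice(candidates)
            context = context.replace(entity, replacement)
            entity = replacement
        permuted.append((entity, category, context))
    return permuted
```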
2.6. Evaluation of model generalizability with entity coverage ratio (ECR)

The assessment of machine learning model portability often involves quantifying the changes in standard metrics such as F1 score, precision, and recall when applied to different test sets. However, this approach offers only a broad understanding of portability and lacks fine-grained analysis [33]. To analyze model portability from a different perspective, we adapted the ECR proposed in a previous study for our NER task [33]. The ECR measures to what degree a target entity in the test set has been seen in the training set with the same category. It is defined as Eq. (3):

\[ \mathrm{ECR}(e_i) = \begin{cases} 0, & C^{tr} = 0 \\ \displaystyle\sum_{k=1}^{K} \frac{\#\big(e_i^{tr,k}\big)}{C^{tr}} \cdot \frac{\#\big(e_i^{te,k}\big)}{C^{te}}, & \text{otherwise} \end{cases} \tag{3} \]

where \( C^{tr} = \sum_{k=1}^{K} \#\big(e_i^{tr,k}\big) \) and \( C^{te} = \sum_{k=1}^{K} \#\big(e_i^{te,k}\big) \); \( e_i \) refers to a test entity, \( \#\big(e_i^{tr,k}\big) \) is the number of occurrences of entity \( e_i \) in the training set with label k, and \( \#\big(e_i^{te,k}\big) \) is the number of occurrences of entity \( e_i \) in the test set with label k. The ECR scores range from 0 to 1, indicating the difficulty of predicting an entity (from difficult to easy) [33].

In our study, we calculated the ECR scores for the entities in the test set of the UMN annotated data. We further divided the entities in the test set into different groups based on their ECR values, i.e., 0 ≤ ECR < 0.33, 0.33 ≤ ECR < 0.67, 0.67 ≤ ECR < 1, and ECR = 1. We evaluated the UMN models (i.e., the CancerBERTUMN_397, BiLSTM-CRFUMN, and CRFUMN models developed solely on UMN data) on the UMN test set and investigated the relationships between the ECR values and the models' performance on the corresponding entities.
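A sketch of the ECR computation of Eq. (3) and the grouping used in the analysis, assuming entity mentions have been tallied into (entity, label) -> count dictionaries (the helper names are illustrative):

```python
from collections import Counter

def ecr(entity, train_counts, test_counts):
    """Entity coverage ratio per Eq. (3), for an entity seen in the test set.

    train_counts / test_counts: Counter mapping (entity, label) -> count.
    """
    labels = {lbl for (e, lbl) in set(train_counts) | set(test_counts)
              if e == entity}
    c_tr = sum(train_counts[(entity, lbl)] for lbl in labels)
    c_te = sum(test_counts[(entity, lbl)] for lbl in labels)
    if c_tr == 0:
        return 0.0  # entity never seen in the training set
    return sum((train_counts[(entity, lbl)] / c_tr) *
               (test_counts[(entity, lbl)] / c_te) for lbl in labels)

def ecr_group(score):
    """Bin an ECR score into the four groups used in the analysis."""
    if score < 0.33:
        return 1
    if score < 0.67:
        return 2
    if score < 1.0:
        return 3
    return 4
```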
2.7. Experiment environments at UMN and MC

The corpora comparison and portability evaluation were conducted at MC, where the models were fine-tuned and evaluated on a CentOS server with one Nvidia Tesla V100 GPU. The permutation data evaluation and ECR analysis were conducted at UMN on a Windows server with one Nvidia Tesla V100 GPU. We used the TensorFlow 1.15 framework in all settings.

3. Results

3.1. Comparison of UMN vs MC corpora

We compared the corpora of UMN and MC from various perspectives (Table 1). Overall, the UMN corpus has more clinical documents per patient (190.5 vs 140.8), while the average length of each clinical document was shorter compared to MC (259 tokens vs 361 tokens). Table 1 also displays the number of breast cancer phenotypes annotated from the clinical texts of both sites, with the distinct number of terms for each phenotype entity indicated in parentheses. More entities related to breast cancer phenotypes were annotated from the UMN data. The IAA scores for annotations were 0.91 and 0.80 at the UMN and MC sites, respectively. The similarities of phenotypes between the two corpora were calculated using TF-IDF. The results revealed that breast cancer phenotypes exhibited higher similarity scores compared to the overall corpus (phenotype similarity average: 0.9088 vs overall: 0.5411).

3.2. Portability evaluations of machine learning-based models and BERT-based models

The portability evaluation results (strict match (lenient match) F1 scores) for the two classic NER models, i.e., CRF and BiLSTM-CRF, are shown in Table 2. We used two word embeddings as input for the models (GloVe Wikipedia 6B [27] and Word2Vec Google News [28]), and
evaluated the models using the MC test set. The results show the performances of the models directly obtained from the UMN site (CRFUMN, BiLSTM-CRFUMN) and the models further fine-tuned on MC data (CRFMC_refined, BiLSTM-CRFMC_refined).

Table 3 shows the portability evaluation results (strict match (lenient match) F1 scores) for the BERT-based models. We evaluated the CancerBERT models developed at UMN (CancerBERTUMN), along with two benchmark BERT-based models (the original BERT-large [34] and BlueBERT [22]). The CancerBERTUMN model has two variants: one with 997 frequency-based customized words in its vocabulary (CancerBERTUMN_997) and one with 397 knowledge-based customized words (CancerBERTUMN_397). Each UMN model (Table 3, column UMN) was directly evaluated on the MC test set. These models were then further fine-tuned on MC data to obtain MC refined models (Table 3, sub-column MC refined) and evaluated again. In addition, we evaluated the CancerBERTMC_397 model trained solely on the MC corpus for comparison.
Table 2
Portability evaluation results (strict match (lenient match) F1 scores) for CRF and BiLSTM-CRF models. Note that models with subscript “UMN” indicate models trained
only on UMN corpus and models with subscript “MC_refined” are UMN models with continuous fine tuning on MC corpus.
Word embeddings GloVe Wikipedia 6B (first four columns) | Word2Vec Google News (last four columns)
Models CRFUMN CRFMC_refined BiLSTM-CRFUMN BiLSTM-CRFMC_refined | CRFUMN CRFMC_refined BiLSTM-CRFUMN BiLSTM-CRFMC_refined
Hormone Receptor type 0.939 0.944 0.925 0.943 0.945 0.948 0.939 0.918
(0.941) (0.954) (0.925) (0.954) (0.948) (0.951) (0.939) (0.929)
Hormone Receptor status 0.527 0.542 0.794 0.876 * 0.529 0.491 0.867 0.837
(0.527) (0.542) (0.794) (0.876 *) (0.529) (0.491) (0.867) (0.837)
Tumor size 0.224 0.694 0.392 0.711 * 0.244 0.663 0.367 0.592
(0.294) (0.749) (0.509) (0.813 *) (0.350) (0.724) (0.421) (0.738)
Tumor site 0.053 0.303 0.266 0.472 0.044 0.321 0.205 0.361 *
(0.136) (0.600) (0.400) (0.758) (0.128) (0.596) (0.323) (0.668 *)
Tumor grade 0.647 0.881 0.903 0.916 * 0.647 0.861 0.849 0.881
(0.681) (0.925 *) (0.903) (0.916) (0.685) (0.913) (0.849) (0.881)
Tumor laterality 0.846 0.935 0.882 0.952 * 0.853 0.934 0.874 0.930
(0.846) (0.935) (0.882) (0.952 *) (0.853) (0.934) (0.874) (0.930)
Cancer stage 0.773 0.868 0.578 0.891 * 0.632 0.873 0.593 0.838
(0.773) (0.868) (0.578) (0.891 *) (0.632) (0.873) (0.593) (0.838)
Histological type 0.829 0.937 0.847 0.934 0.845 0.934 0.839 0.899
(0.896) (0.964) (0.917) (0.959) (0.907) (0.965) (0.921) (0.917)
F1 Macro average 0.605 0.763 0.698 0.837 * 0.593 0.753 0.691 0.782
(0.637) (0.817) (0.738) (0.889 *) (0.629) (0.806) (0.723) (0.842)
F1 Micro average 0.803 0.883 0.785 0.893 * 0.804 0.880 0.777 0.853
(0.822) (0.905) (0.814) (0.922 *) (0.825) (0.903) (0.802) (0.885)
Note: The scores are averages over 10 runs; bold text indicates the highest performance; * indicates the score is statistically higher than the other methods (confidence level: 95%).
Table 3
Portability evaluation results (strict match (lenient match) F1 scores) for BERT-based models on MC test data set. The column of “UMN” includes models only trained
on the UMN corpus, and the column “MC refined” contains models with fine-tuning on MC corpus. The column “MC only” is the model trained only on MC corpus.
Entity type BERT-large origin BlueBERT (PubMed+MIMIC III) CancerBERTUMN_997 CancerBERTUMN_397 CancerBERTMC_397
 UMN MC refined | UMN MC refined | UMN MC refined | UMN MC refined | MC only
Hormone Receptor type 0.926 0.967 0.897 0.977 0.923 0.984 0.942 0.975 0.993 *
(0.935) (0.969) (0.911) (0.977) (0.963) (0.988) (0.947) (0.981) (0.993 *)
Hormone Receptor status 0.816 0.910 0.897 0.932 0.842 0.901 0.819 0.926 0.943 *
(0.816) (0.910) (0.897) (0.932) (0.842) (0.901) (0.819) (0.926) (0.943 *)
Tumor size 0.633 0.837 0.440 0.839 0.595 0.765 0.745 0.864 0.862
(0.737) (0.903) (0.648) (0.915) (0.664) (0.813) (0.781) (0.928 *) (0.907)
Tumor site 0.666 0.609 0.308 0.590 0.186 0.733 * 0.709 0.601 0.661
(0.739) (0.760) (0.671) (0.790) (0.742) (0.792) (0.759) (0.786) (0.832 *)
Tumor grade 0.827 0.909 0.886 0.859 0.869 0.891 0.846 0.927 * 0.863
(0.827) (0.922) (0.886) (0.939) (0.869) (0.891) (0.846) (0.943 *) (0.933)
Tumor laterality 0.896 0.928 0.936 0.954 0.928 0.939 0.903 0.959 0.962
(0.896) (0.928) (0.936) (0.954) (0.928) (0.939) (0.903) (0.959) (0.962)
Cancer stage 0.774 0.934 0.799 0.953 0.806 0.870 0.829 0.949 0.950
(0.774) (0.934) (0.799) (0.953) (0.806) (0.870) (0.829) (0.949) (0.950)
Histological type 0.793 0.926 0.794 0.931 0.828 0.849 0.815 0.934 0.950 *
(0.888) (0.950) (0.902) (0.958) (0.923) (0.922) (0.914) (0.965) (0.981 *)
Macro average 0.724 0.874 0.744 0.879 0.747 0.867 0.828 0.892 0.898 *
(0.829) (0.908) (0.831) (0.927) (0.842) (0.889) (0.849) (0.930) (0.932)
Micro average 0.843 0.905 0.817 0.917 0.829 0.903 0.864 0.925 0.932 *
(0.877) (0.922) (0.876) (0.943) (0.886) (0.925) (0.906) (0.947) (0.952 *)
Note: The scores are averages over 10 runs; bold text indicates the highest performance; * indicates the score is statistically higher than the other methods (confidence level: 95%).
Table 4
Evaluation results (strict match (lenient match) F1 scores) for CancerBERTUMN_397, BiLSTM-CRFUMN, and CRFUMN models on permutation dataset. Changed F1 score
columns show the differences between F1 scores on the permutation set compared to the F1 scores on the normal test set.
Entities CRFUMN Changed F1 score BiLSTM-CRFUMN Changed F1 score CancerBERTUMN_397 Changed F1 score
Note: Bold text indicates the smallest change in F1 score. The results are averages over 10 runs; * indicates the change in F1 score is statistically lower than for the other methods (confidence level: 95%).
3.4. Evaluation of model generalizability with ECR

Fig. 3 shows the performances (strict F1 scores) of the different types of models for entities in the different ECR groups: 1) 0 ≤ ECR < 0.33, 2) 0.33 ≤ ECR < 0.67, 3) 0.67 ≤ ECR < 1, and 4) ECR = 1. All three types of models achieved relatively high performances for groups 3 and 4. The CancerBERTUMN_397 model obtained significantly better performances in groups 1 and 2 compared to the other two models.

4. Discussions

A significant proportion of pertinent information resides in unstructured formats within EHR data. Before the era of NLP, information extraction from EHR data mainly relied on the use of structured data (e.g., medical codings) and key term searches of unstructured clinical texts, yielding suboptimal performance [35–37]. Owing to the robust capabilities and rapid development of NLP, it has emerged as a pivotal tool for extracting this invaluable information, thereby enhancing clinical decision-making, administrative reporting, and academic research endeavors [11]. Clinical texts found in EHRs differ significantly from general language due to their inclusion of professional terminology, medical jargon, acronyms, and abbreviations. The semantic and contextual information conveyed within clinical texts extracted from EHRs can vary considerably across different healthcare institutions. These variations should be carefully considered during the development of NLP algorithms tailored for clinical applications. We compared the corpora of UMN and MC from various perspectives. UMN's annotated data revealed a greater number of unique phenotypes (673 vs 592), while the MC corpus demonstrated a higher density of breast cancer phenotypes within the clinical texts (3.7 vs 2.1 phenotypes/100 tokens). Remarkably, the similarity between breast cancer phenotypes was significantly higher than the overall token similarity between the two corpora (0.9088 vs 0.5411), which corroborates findings from a prior investigation [1]. This indicates that clinicians may use consistent clinical language when describing specific medical concepts in their clinical texts, thereby establishing a foundation for the transferability of NLP models among diverse clinical institutes. For instance, clinicians commonly employ standardized medical language such as "HER-2/neu positive" to describe receptor status and "Grade 2" to denote tumor grade.
Fig. 2. The performances (strict F1 scores) of the CRFUMN, BiLSTM-CRFUMN, and CancerBERTUMN_397 models on different test sets. The original test set is the UMN test set, and the portability test set is the MC test set. All models were UMN models trained solely on UMN data.
In this study, we mainly explored the generalizability of the BERT-based model (the CancerBERT models), along with two other classic machine learning models, i.e., CRF and BiLSTM-CRF. To assess the generalizability of these models, we evaluated their performance on the cancer phenotype extraction task. The results show that when the UMN models were directly evaluated on the MC test data, only the CancerBERT models achieved reasonable performances, indicating their advantage in portability compared to the BiLSTM-CRF and CRF models. Furthermore, after refining the models using MC data, the CancerBERT models consistently achieved the best overall performance and demonstrated greater stability: the average drop of micro-F1 score (CancerBERTUMN vs CancerBERTMC_refined) was 0.067 for the CancerBERT models, while for CRF and BiLSTM-CRF the drops were 0.078 (CRFUMN vs CRFMC_refined) and 0.092 (BiLSTM-CRFUMN vs BiLSTM-CRFMC_refined), respectively. Although the CancerBERTMC_397 model trained from scratch using MC data obtained the best overall performance, the performance of the CancerBERTMC_refined model remained comparable, with an F1 score only 0.007 lower than that of the CancerBERTMC_397 model. These results indicate that BERT-based models can be effectively transferred to other clinical institutes while maintaining a relatively high level of performance, and with minimal effort, as only a small amount of annotated data is needed to fine-tune the models rather than training them from scratch on a new corpus. We calculated the correlation between the similarities of the cancer phenotypes of the two corpora and the performance drops (UMN models evaluated on the UMN test set versus UMN models directly evaluated on the MC test set). The Pearson correlation scores were –0.678, –0.345, and –0.712 for the CRFUMN, BiLSTM-CRFUMN, and CancerBERTUMN_397 models, respectively, indicating medium (BiLSTM-CRFUMN) to strong (CRFUMN, CancerBERTUMN_397) negative correlations. Our findings suggest that the portability of these NLP models is positively influenced by the similarity of the targeted entities, such as cancer phenotypes, across corpora. When the entities exhibit higher similarity scores between distinct corpora, the model has a better capability to retain its original performance when transferred between the corpora.

The assessment of models on the permutation test set serves as an indicator of whether a model is effectively capturing the underlying patterns in the text or merely memorizing the phenotypes present in the training set. Our findings reveal a significant advantage of the CancerBERT model over the CRF and BiLSTM-CRF models in the permutation set evaluation. For instance, Table 4 shows that the exact match (lenient match) macro-average F1 scores dropped only 0.270 (0.169) for the CancerBERTUMN_397 model, while for the CRFUMN and BiLSTM-CRFUMN models the corresponding macro-average F1 scores dropped by 0.440 (0.355) and 0.524 (0.438), respectively. Practically, the CancerBERTUMN_397 model identified 9.24 fewer phenotypes per patient on the permutation test set compared to the normal test set; in comparison, the CRFUMN and BiLSTM-CRFUMN models identified 22.83 and 24.61 fewer phenotypes per patient, respectively. These results show that among the three types of models, the CancerBERT model exhibits superior robustness and a greater ability to capture the contextual information of entities, enabling it to effectively handle novel entity variants that were previously unseen.

The ECR was employed to conduct an in-depth analysis of the portability evaluation results. Notably, the CancerBERTUMN_397 model achieved significantly better performances across the groups, with particularly pronounced advantages for phenotypes in groups 1 and 2 (ECR < 0.67). These groups contain the target test phenotypes that either appeared in the training sets with different labels or were absent from the training sets, indicating the relatively challenging nature of extracting phenotypes within these two groups. The results indicate that the CancerBERTUMN_397 model has an enhanced ability to learn the target phenotypes and their contexts compared to the BiLSTM-CRFUMN and CRFUMN models.

The study has several limitations. We evaluated the generalizability of machine learning models through an NLP task extracting breast cancer phenotypes from clinical texts within two clinical institutes. It is a common NER task in the clinical domain; however, there are many other NLP tasks, for instance, relation extraction and text classification, that warrant additional investigation. Our investigation also primarily focused on the performances of NLP models in the NER task, while the downstream implications of divergent NLP performance on clinical applications warrant further exploration in future studies.
Fig. 3. The performances (strict F1 scores) of the CRFUMN, BiLSTM-CRFUMN, and CancerBERTUMN_397 models for the identification of entities in different ECR groups. Group 1: 0 ≤ ECR < 0.33, Group 2: 0.33 ≤ ECR < 0.67, Group 3: 0.67 ≤ ECR < 1, and Group 4: ECR = 1.
Data Availability

The data underlying this article cannot be shared publicly due to the privacy of patient health information.

Acknowledgement

NA.

References

[1] Sohn S, Wang Y, Wi CI, Krusemark EA, Ryu E, Ali MH, Juhn YJ, Liu H. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. J Am Med Inform Assoc 2018;25(3):353–9.
[2] Xie F, Lee J, Munoz-Plaza CE, Hahn EE, Chen W. Application of text information extraction system for real-time cancer case identification in an integrated healthcare organization. J Pathol Inform 2017;8(1):48.
[3] Carchiolo V, Longheu A, Reitano G, Zagarella L. Medical prescription classification: a NLP-based approach. 2019 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE; 2019. p. 605–9.
[4] Vijayakrishnan R, Steinhubl SR, Ng K, Sun J, Byrd RJ, Daar Z, Williams BA, Defilippi C, Ebadollahi S, Stewart WF. Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record. J Card Fail 2014;20(7):459–64.
[5] Mavrogiorgos K, Mavrogiorgou A, Kiourtis A, Zafeiropoulos N, Kleftakis S, Kyriazis D. Automated rule-based data cleaning using NLP. 2022 32nd Conference of Open Innovations Association (FRUCT). IEEE; 2022. p. 162–8.
[6] Valmianski I, Frost N, Sood N, Wang Y, Liu B, Zhu JJ, Karumuri S, Finn IM, Zisook DS. SmartTriage: a system for personalized patient data capture, documentation generation, and decision support. Machine Learning for Health 2021. PMLR; 2021. p. 75–96.
[7] Manias G, Mavrogiorgou A, Kiourtis A, Kyriazis D. SemAI: a novel approach for achieving enhanced semantic interoperability in public policies. Artificial Intelligence Applications and Innovations: 17th IFIP WG 12.5 International Conference, AIAI 2021, Hersonissos, Crete, Greece, June 25–27, 2021, Proceedings 17. Springer International Publishing; 2021. p. 687–99.
[8] Digan W, Névéol A, Neuraz A, Wack M, Baudoin D, Burgun A, Rance B. Can reproducibility be improved in clinical natural language processing? A study of 7 clinical NLP suites. J Am Med Inform Assoc 2021;28(3):504–15.
[9] Kaufman DR, Sheehan B, Stetson P, Bhatt AR, Field AI, Patel C, Maisel JM. Natural language processing–enabled and conventional data capture methods for input to electronic health records: a comparative usability study. JMIR Med Inform 2016;4(4):e5544.
[10] Devine EB, Van Eaton E, Zadworny ME, Symons R, Devlin A, Yanez D, Yetisgen M, Keyloun KR, Capurro D, Alfonso-Cristancho R, Flum DR. Automating electronic clinical data capture for quality improvement and research: the CERTAIN validation project of real world evidence. eGEMs 2018;6:1.
[11] Wen A, Fu S, Moon S, El Wazir M, Rosenbaum A, Kaggal VC, Liu S, Sohn S, Liu H, Fan J. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. NPJ Digit Med 2019;2(1):1–7.
[12] Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, Liu S, Zeng Y, Mehrabi S, Sohn S, Liu H. Clinical information extraction applications: a literature review. J Biomed Inform 2018;77:34–49.
[13] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst 2017:30.
[14] Li L, Chen X, Ye H, Bi Z, Deng S, Zhang N, Chen H. On robustness and bias analysis of BERT-based relation extraction. China Conference on Knowledge Graph and Semantic Computing. Singapore: Springer; 2021. p. 43–59.
[15] Fan JW, Prasad R, Yabut RM, Loomis RM, Zisook DS, Mattison JE, Huang Y. Part-of-speech tagging for clinical text: wall or bridge between institutions? AMIA Annual Symposium Proceedings, vol. 2011. American Medical Informatics Association; 2011. p. 382.
[16] Friedman C, Kra P, Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform 2002;35(4):222–35.
[17] Mehrabi S, Krishnan A, Roch AM, Schmidt H, Li D, Kesterson J, Beesley C, Dexter P, Schmidt M, Palakal M, Liu H. Identification of patients with family history of pancreatic cancer: investigation of an NLP system portability. Stud Health Technol Inform 2015;216:604.
[18] Liu M, Shah A, Jiang M, Peterson NB, Dai Q, Aldrich MC, Chen Q, Bowton EA, Liu H, Denny JC, Xu H. A study of transportability of an existing smoking status detection module across institutions. AMIA Annual Symposium Proceedings, vol. 2012. American Medical Informatics Association; 2012. p. 577.
[19] Magoc T, Allen KS, McDonnell C, Russo JP, Cummins J, Vest JR, Harle CA. Generalizability and portability of natural language processing system to extract individual social risk factors. Int J Med Inform 2023:105115.
[20] Khambete MP, Su W, Garcia JC, Badgeley MA. Quantification of BERT diagnosis generalizability across medical specialties using semantic dataset distance. AMIA Summits Transl Sci Proc 2021;2021:345.
[21] Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc 2022.
[22] Peng Y, Chen Q, Lu Z. An empirical study of multi-task learning on BERT for biomedical text mining. arXiv preprint arXiv:2005.02799; 2020.
[23] Cohen I, Huang Y, Chen J, Benesty J. Pearson correlation coefficient. Noise Reduct Speech Process 2009:1–4.
[24] Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform 2017;75:S34–42.
[25] Chapman AB, Peterson KS, Alba PR, DuVall SL, Patterson OV. Detecting adverse drug events with rapidly trained classification models. Drug Saf 2019;42(1):147–56.
[26] Unanue IJ, Borzeshi EZ, Piccardi M. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. J Biomed Inform 2017;76:102–9.
[27] Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–43.
[28] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 2013:3111–9.
[29] Yang X, Bian J, Hogan WR, Wu Y. Clinical concept extraction using transformers. J Am Med Inform Assoc 2020;27(12):1935–42.
[30] Kim HY. Analysis of variance (ANOVA) comparing means of more than two groups. Restor Dent Endod 2014;39(1):74–7.
[31] Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt 2014;34(5):502–8.
[32] Schutte D, Vasilakes J, Bompelli A, Zhou Y, Fiszman M, Xu H, Kilicoglu H, Bishop JR, Adam T, Zhang R. Discovering novel drug-supplement interactions using SuppKG generated from the biomedical literature. J Biomed Inform 2022;131:104120.
[33] Fu J, Liu P, Zhang Q. Rethinking generalization of neural models: a named entity recognition case study. Proc AAAI Conf Artif Intell 2020;34(05):7732–9.
[34] Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805; 2018.
[35] Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma 2010;2010:1.
[36] Coquet J, Bozkurt S, Kan KM, Ferrari MK, Blayney DW, Brooks JD, Hernandez-Boussard T. Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients. J Biomed Inform 2019;94:103184.
[37] Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc 2016;23(4):731–40.