Breast Cancer Diagnosis Using Machine Learning
ABSTRACT
Machine learning is a subdomain of artificial intelligence that has proved its performance in the medical field, especially in the classification of diseases. In previous research we tried to classify breast cancer into its two categories using several machine learning algorithms; some algorithms proved their performance, but others produced a weak accuracy. In this study, we try to improve the accuracy of the weak machine learning algorithms using normalization/standardization and ensemble methods (voting, stacking, bagging and boosting) for the classification of breast cancer, using the large SEER database and python libraries. The goal of this paper is not only the improvement of the classifiers' accuracy, but also the proposition of a new architecture for breast cancer diagnosis based on SEER database features, for predicting breast cancer at an earlier stage and in the right way. All the examined techniques proved their performance in improving the accuracy of breast cancer classification, especially the voting technique, which obtained the highest accuracy except in the case of voting all the classifiers, where it was enhanced by the normalization/standardization of the features.
Keywords: SEER, Machine learning, Ensemble methods, Breast cancer, Diagnosis.
Journal of Theoretical and Applied Information Technology
15th February 2021. Vol.99. No 3
© 2021 Little Lion Scientific
classification and performance evaluation using the Jupyter Notebook.

The rest of this paper is structured as follows. In part two we cite some research that has used the SEER dataset; in part three we explain our methodology and the materials used; in part four we execute some machine learning algorithms for the classification of breast cancer and try to improve their performance with normalization/standardization techniques; in part five we try to improve their accuracy with ensemble methods; a comparison of our results with existing work is given in part six, followed by the conclusion.

2. RELATED WORKS

Up to now, several studies have been carried out with the SEER dataset and machine learning techniques, not only for the diagnosis of cancer but also for the prediction of cancer recurrence and the duration of survivability of patients, and all of them showed the performance of machine learning techniques in the domain of cancer prediction. J48 and priority-based decision tree algorithms were applied for breast cancer classification; the results show that the priority-based decision tree algorithm gives a higher accuracy of 98.51% [8]. The classification of breast cancer into the two categories "Carcinoma in situ" and "Malignant potential" was made with C4.5; the accuracy obtained was 94% in the training phase and 93% in the testing phase [9]. Three machine learning techniques, Decision Tree, Support Vector Machine and Random Forest, were examined for the early diagnosis and prevention of breast cancer: the original dataset was divided into 10 groups and the three algorithms were applied to all of them; the higher accuracy was obtained by Random Forest in all groups [10]. Three machine learning techniques, Decision Tree (DT), Support Vector Machine (SVM) and Artificial Neural Network (ANN), were used to predict breast cancer recurrence for cancer patients; the highest accuracy was given by Decision Tree with 94.15%, followed by Support Vector Machine with 91.95%, then Artificial Neural Network with 90.86% [11]. The same three techniques, Decision Tree (C4.5), Support Vector Machine (SVM) and Artificial Neural Network (ANN), were trained for predicting breast cancer recurrence with higher accuracies: 93.6% for Decision Tree, 94.7% for Artificial Neural Network and 95.7% for Support Vector Machine [12].

A comparative review of the machine learning approaches employed in the modeling of cancer progression with different input features presents the performance of machine learning techniques in the prediction of both cancer recurrence and survival [13]. An ensemble of machine learning techniques, logistic regression (LR), support vector machines (SVM), random forest (RF) and deep learning (DL), was examined to predict the survival of pancreatic neuroendocrine tumors (PNETs); all algorithms gave an accuracy of more than 80%, better than the AJCC staging system for the PNETs cases in the SEER database [14]. To predict 10-year breast cancer patient survival, machine learning algorithms such as Logistic Regression (LR), Naive Bayes and C4.5 Decision Tree were trained; the obtained accuracies were 76.29% for Logistic Regression, 59.71% for Naive Bayes and 77.43% for C4.5 Decision Tree, so C4.5 Decision Tree proved to be the most accurate predictor of ten-year patient survival in that research [15]. Several supervised machine learning algorithms were applied to predict lung cancer patient survival, among them linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM) and a custom ensemble [16]. To predict 2-year colorectal cancer survivability, several machine learning algorithms were used, such as logistic regression, random forest, AdaBoost and a neural network; the importance of ethnicity for model performance was investigated, and the models performed better on single-ethnicity populations than on mixed-ethnicity populations [17].

3. MATERIAL AND METHODOLOGY

In this section, we present the SEER database and the methodology followed for breast cancer diagnosis using the proposed techniques.

3.1 Seer Database

The massive SEER (Surveillance, Epidemiology, and End Results) database is provided by the National Cancer Institute; it collects data on cancer incidence, diagnosis, treatment, survival and mortality for all types of cancers from population-based cancer registries and covers 34.6% of the population of the United States. The last submission of the SEER database is the 2019 submission; it contains data from 1975 to 2017 and covers more than 10,985,942 cases. We will work with the 2018 submission and extract only the data of breast cancer from 2008 to 2016 to execute
the proposed techniques, a period in which the number of missing data is reduced. The selected features of the SEER database contain both numeric (continuous) and categorical (discrete) attributes.
Feature | Values | Description
CS Reg nodes Eval (2004-2015) | Number | The number of regional nodes evaluated.
Tumor Size Summary (2016+) | Number | In (2016+) all the variables of tumor evaluation are summarized in this variable.
Primary Site | Code | This variable identifies the site in which the primary tumor originated.
First malignant primary indicator | Yes / No | Variable identifying if there is a first malignant primary indicator.
Total number of In Situ/malignant tumors for patient | 00-98: valid values; 99: unknown | Counts the total number of cancers that patients have.
Behavior recode for analysis | In situ / Malignant | Type of tumor.

3.2.2 Data preprocessing

This step is divided into four tasks: eliminating missing data, transforming categorical data to integers, dividing the data into groups of years, and finally data normalization and standardization to improve the accuracy of classification.

a. Eliminating missing data

The first step in data preprocessing is eliminating missing data, using the pandas python library; the missing values of categorical data are identified by "Unknown" and those of continuous data by 999 or 99. The total number of records after eliminating missing values is 593004 (97548 in situ and 495456 malignant): 522483 records for 2008-2015 and 70521 records for 2016.

b. Transformation of categorical data

This step consists of transforming categorical data to integer format, also using the pandas python library, so that our predictive models can better understand them. For example, Behavior recode for analysis has two possible values (In situ or Malignant), which are transformed to 0 and 1; the same applies to the other categorical data.

c. Dividing data into groups

Due to the large amount of data extracted, we divided the data, by year of diagnosis, into 9 groups to evaluate the performance of the executed algorithms in the classification of large breast cancer data. The total number of records in each group is the following: 60992 for 2008, 62894 for 2009, 62002 for 2010, 63735 for 2011, 65377 for 2012, 67723 for 2013, 68741 for 2014, 71019 for 2015 and 70521 for 2016.

d. Data normalization/standardization

Normalization and standardization are two data preprocessing techniques which bring the features to the same scale, so that no feature has more influence than the others on classification. The difference between them is that normalization scales features between minimum and maximum values, while standardization rescales data to have a mean of 0 and a standard deviation of 1.

3.2.3 Classification

The final step is classification, in which we apply the selected algorithms using scikit-learn; this library provides many classification algorithms and facilitates their use. This step is divided into two tasks: first, we evaluate the performance of the selected machine learning algorithms in the classification of the large breast cancer dataset and show the impact of normalization/standardization on the improvement of the classification accuracy; second, we test the capacity of ensemble methods to improve the accuracy of the machine learning algorithms that get a low accuracy.
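The preprocessing steps (a)-(d) above can be sketched with pandas and scikit-learn. The column names and values below are illustrative stand-ins for the SEER variables, not the exact field names:

```python
# Sketch of the preprocessing pipeline: drop missing values, encode the
# categorical target, and scale the numeric features three ways.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

df = pd.DataFrame({
    "regional_nodes": [3, 99, 12, 5],   # 99 marks a missing numeric value
    "behavior": ["In situ", "Malignant", "Unknown", "Malignant"],
})

# a. Eliminating missing data ("Unknown" for categorical, 99/999 for numeric)
df = df[(df["regional_nodes"] != 99) & (df["behavior"] != "Unknown")].copy()

# b. Transformation of categorical data (In situ -> 0, Malignant -> 1)
df["behavior"] = df["behavior"].map({"In situ": 0, "Malignant": 1})

# d. Normalization / standardization of the numeric features
X = df[["regional_nodes"]].to_numpy(dtype=float)
X_minmax = MinMaxScaler().fit_transform(X)   # rescales to [min, max] -> [0, 1]
X_std = StandardScaler().fit_transform(X)    # mean 0, standard deviation 1
X_norm = Normalizer().fit_transform(X)       # rescales each row to unit norm
```

Step (c), splitting by year of diagnosis, would be a plain `groupby` on the year-of-diagnosis column.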
Figure 3: Behavior recode for analysis class distribution (counts of In situ vs. Malignant records).

4. CLASSIFICATION AND MACHINE LEARNING ALGORITHMS

Classification is a supervised learning process that categorizes data into classes using machine learning classifiers. In this paper we try to classify breast cancer into its two categories using machine learning algorithms; the
data is divided into training and testing sets. We executed the classifiers on the training data, then examined their performance on the testing data. Some classifiers proved their performance, but for the others we try to improve it with two techniques: first through normalization/standardization, then with ensemble methods.

4.1 K-Nearest Neighbors

K-Nearest Neighbors (KNN) is one of the top 10 machine learning algorithms [18], from the category of lazy learning, and can be used for classification as well as regression. KNN tries to classify an unknown sample of the testing data based on the known classification of its neighbors in the training data by calculating the distance between them [19]: KNN searches the training data for the k samples closest to the unknown test sample, then the classification of the test sample is defined based on those closest samples.

Figure 4: Accuracy of KNN (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).

Figure 4 shows the results of the examination of k-nearest neighbors (KNN) on the breast cancer database. The KNN algorithm shows a low accuracy without any normalization/standardization of the data. All three techniques, Normalizer, MinMaxScaler and StandardScaler, improved the classification accuracy of KNN, but the highest improvement was obtained with the Normalizer for the years 2008-2015 except 2011; for the year 2008, for example, the improvement was more than 12%. For 2016 the highest improvement was obtained with the StandardScaler, with an improvement of 13.97%, followed by the MinMaxScaler; the Normalizer did not give a big improvement.

4.2 Naive Bayes

The Naive Bayes classifier assumes strong independence between the features. The naive Bayes model is easy to construct and can be used on huge sets of data [18]. The Naive Bayesian classifier assumes that the existence of a feature in a class is independent of the existence of the other features.

Figure 5: Accuracy of NB (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).

After examining the Naive Bayes performance on the testing data, it also gives a low accuracy, which varies between 83.16% and 83.68% depending on the year. The highest improvement was given by both MinMaxScaler and StandardScaler. The Normalizer does not give any improvement, and in some cases it even decreased the classification accuracy, especially in the years 2011, 2012 and 2013.

4.3 Decision Tree

Decision Tree is a learning method used in both classification and regression. It is similar to a flowchart [20], where the internal nodes represent tests on the attributes, the branches represent the results of the tests, and the leaves contain the prediction results. There are two ways of building a decision tree: from top to bottom and from bottom to top. The most popular decision tree algorithms are ID3, C4.5, C5, J48 and CART.

Figure 6: Accuracy of DT (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).
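The per-classifier experiments in this section follow one pattern: fit a classifier with and without a scaler in front of it and compare the test accuracy. A minimal sketch with scikit-learn pipelines, using synthetic data in place of the SEER features:

```python
# Compare one classifier (KNN here) under the four scaling settings used
# in this section. Synthetic data stands in for the SEER features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X, y = make_classification(n_samples=1500, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # one feature on a much larger scale, as with raw values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scalers = {"Without": None, "Normalizer": Normalizer(),
           "MinMaxScaler": MinMaxScaler(), "StandardScaler": StandardScaler()}
results = {}
for name, scaler in scalers.items():
    steps = [scaler] if scaler is not None else []
    model = make_pipeline(*steps, KNeighborsClassifier(n_neighbors=5))
    results[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
```

Because KNN is distance-based, the unscaled run lets the large-scale feature dominate the distances, which is the effect the figures in this section illustrate.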
Figure 7: Accuracy of RF (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).

Decision Tree and Random Forest both show their performance without needing any preprocessing of the data, as presented in figures 6 and 7; moreover, the three normalization/standardization techniques further improved their classification accuracy.

4.5 Logistic Regression

Logistic regression [21] is one of the generalized linear models much used in machine learning. Logistic regression predicts the probability of an outcome that can take two values from a set of predictor variables. It is mainly used for prediction and also to calculate the probability of success.

Figure 8: Accuracy of LR (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).

Figure 9: Accuracy of MLP (per year, 2008-2016; series: Without, Normalizer, MinMaxScaler, StandardScaler).

In this section we examined the six machine learning algorithms: k-nearest neighbors, Naive Bayes, Decision Tree, Random Forest, Logistic Regression and Multi-layer Perceptron. Some of them, like Decision Tree and Random Forest, perform well without needing any normalization of the data or any ensemble method; the others give a lower accuracy, so we applied the normalization techniques to them. For most of them, MinMaxScaler and StandardScaler gave the more interesting results; the Normalizer brought an improvement, but not as large as the others, except for k-nearest neighbors, where it worked well. We conclude that normalization/standardization has a good impact on the performance of machine learning classifiers. In the next section, we try to improve the accuracy of the weak algorithms using ensemble methods.

5. ENSEMBLE METHODS

Ensemble methods are a set of techniques that aim to produce better prediction performance using multiple models, by combining
them. So, we use those techniques to improve the classification of breast cancer.

5.1 Voting

The voting algorithm is a technique that combines an ensemble of classifiers to improve the classification accuracy. The principle of the voting technique is that each machine learning technique gives a classification as output, then the vote of those outputs is taken as the final classification. Voting all the classifiers does not give a big difference, so we tried to improve the performance of this ensemble by normalization/standardization of the data.

Figure: Accuracy of the Voting ensembles (per year, 2008-2016; series: RF+KNN+DT and RF+NB+DT, with Without, Normalizer, MinMaxScaler and StandardScaler variants).
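The voting scheme described above can be sketched with scikit-learn's VotingClassifier; the data is synthetic, and the RF+KNN+DT combination mirrors one of the ensembles examined here:

```python
# Hard voting: each base classifier predicts a class and the majority
# class across the three predictions is returned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

voting = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard",
)
voting.fit(X_tr, y_tr)
acc = voting.score(X_te, y_te)
```

Switching to `voting="soft"` would average the predicted class probabilities instead of counting votes, which is another common variant of the same idea.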
Figure: Accuracy of Bagging NB (per year, 2008-2016; series: NB, Bagging NB).

Figure 17: Accuracy of Bagging MLP (per year, 2008-2016; series: MLP, Bagging MLP).
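The bagging and boosting comparisons shown in these figures (a single weak classifier versus its ensembled version) can be sketched with scikit-learn. AdaBoost is used here as a stand-in for the boosting method, and the data is synthetic:

```python
# Bagging trains the base classifier on bootstrap samples of the training
# data and aggregates the predictions; boosting trains the base classifier
# sequentially, reweighting the samples it got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

nb_single = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
nb_bagged = BaggingClassifier(GaussianNB(), n_estimators=50,
                              random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
lr_boosted = AdaBoostClassifier(LogisticRegression(max_iter=1000),
                                n_estimators=50,
                                random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

The same pattern applies to the MLP variant in Figure 17: the base estimator passed to BaggingClassifier is simply swapped for an MLPClassifier.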
Figure: Accuracy of Boosting LR (per year, 2008-2016; series: LR, Boosting LR).

Figure 21: Comparison between Bagging and Boosting of NB (per year, 2008-2016).

Figure 22: Comparison between Bagging and Boosting of LR (per year, 2008-2016; series: LR, Bagging LR, Boosting LR).

5.4 Stacking

The stacking algorithm has a different paradigm from bagging and boosting: as already said, stacking combines multiple classifiers with a meta-classifier to improve the classification accuracy. The meta-classifiers taken in this step are DT and RF, thanks to their performance in the classification of breast cancer.

Figure: Accuracy of Stacking (per year, 2008-2016; series: KNN+MLP with RF, KNN+MLP with DT, LR+NB with RF, LR+NB with DT).

The good results were given by stacking LR+NB with RF and LR+NB with DT, except for the year 2016, in which KNN+MLP with RF and KNN+MLP with DT worked better. All the ensemble methods improved the classification accuracy of the weak algorithms: for some algorithms the voting technique was better at improving their accuracy, like KNN and NB; Bagging was better for NB and MLP in some years; Boosting for LR; and Stacking for LR and NB.
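The stacking setup described above (LR and NB as base classifiers with RF as the meta-classifier) can be sketched with scikit-learn's StackingClassifier on synthetic data:

```python
# Stacking: the base classifiers' predictions become the input features
# of the final (meta) classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    final_estimator=RandomForestClassifier(random_state=0),
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

Replacing `final_estimator` with a DecisionTreeClassifier gives the LR+NB with DT variant compared in the stacking figure.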
6. COMPARISON WITH EXISTING WORK

Our proposed methods are compared with other researches: some of them used the same SEER database with different features and different algorithms [8], [9], [10] and [22]; others worked with ensemble methods but on other breast cancer databases [23] and [24]. Table 2 shows that our proposed method, Voting Naive Bayes, Decision Tree and Random Forest with the selected features, gives better results compared with the other researches.

Table 2: Comparative study with the existing work of breast cancer classification

Work | Proposed Method | Accuracy
Ours | Voting Naive Bayes, Decision Tree and Random Forest | 99.99%
Assiri et al. [23] | Majority-based voting mechanism | 99.42%
Mathew et al. [24] | Stacking Naive Bayes with Logistic Regression and SMO | 97.8%
Farooqui et al. [10] | Random Forest | 73%
Wang et al. [22] | Weighted Area Under the Receiver Operating Characteristic Curve Ensemble (WAUCE) | 97.10% (WBC dataset) and 76.42% (SEER database)
Hamsagayathri et al. [8] | Priority Based Decision Tree | 98.51%
Rajesh et al. [9] | C4.5 | 93%

7. CONCLUSION

To conclude, in this paper we tried to examine and improve the performance of machine learning techniques in the classification of a massive breast cancer database, the SEER database, using the python library scikit-learn, which facilitates the use and implementation of the executed algorithms. First, we tested the performance of several machine learning techniques in the classification of the large SEER breast cancer database; the KNN, NB, LR and MLP techniques showed a low accuracy, in contrast to DT and RF, which proved their performance. Then we tried to improve the performance of those weak algorithms by normalization/standardization techniques and by ensemble methods: Voting, Stacking, Bagging and Boosting. The improvement by normalization/standardization techniques goes up to 16% for MinMaxScaler and StandardScaler and 12% for Normalizer. When using ensemble methods, the improvement goes up to 15.97% for Bagging, 16.23% for Stacking, 16.77% for Boosting and 16.83% for Voting; so the highest improvement and the highest accuracy were given by the voting technique. The results show that normalization/standardization and ensemble methods have a big impact on the improvement of the classification accuracy of the weak algorithms.

There are some limitations in this work. First, the proposed methods were not examined on other breast cancer datasets. Second, Bagging and Boosting in some cases did not give a good improvement.

In future work, the proposed models can be tested on other breast cancer datasets, or on other cancer datasets. Also, feature selection techniques can be used to select the relevant features, and other combinations of machine learning algorithms can be tried. Finally, this research can be a good start for classifying breast cancer from medical images.

ACKNOWLEDGEMENTS:

H. Saoud acknowledges financial support for this research from the "Centre National pour la Recherche Scientifique et Technique" (CNRST), Morocco. We also acknowledge the National Cancer Institute (NCI) for providing us access to the SEER cancer database.

REFERENCES:

[1] 'U.S. Breast Cancer Statistics', Breastcancer.org, Jan. 27, 2020. https://www.breastcancer.org/symptoms/understand_bc/statistics (accessed Apr. 15, 2020).
[2] 'Ascent of machine learning in medicine | Nature Materials'. https://www.nature.com/articles/s41563-019-0360-1 (accessed Sep. 14, 2019).
[3] 'Surveillance, Epidemiology, and End Results Program'. https://seer.cancer.gov/ (accessed Sep. 14, 2019).
[4] H. Saoud, A. Ghadi, M. Ghailani, and B. A. Abdelhakim, 'Application of Data Mining Classification Algorithms for Breast Cancer Diagnosis', in Proceedings of the 3rd International Conference on Smart City Applications - SCA '18, Tetouan, Morocco, 2018, pp. 1–7, doi: 10.1145/3286606.3286861.
[5] H. Saoud, A. Ghadi, M. Ghailani, and B. A. Abdelhakim, 'Using Feature Selection Techniques to Improve the Accuracy of Breast Cancer Classification', in Innovations in Smart Cities Applications Edition 2, M. Ben Ahmed, A. A. Boudhir, and A. Younes, Eds. Cham: Springer International Publishing, 2019, pp. 307–315.
[6] 'Proposed approach for breast cancer diagnosis using machine learning | Proceedings of the 4th International Conference on Smart City Applications'. https://dl.acm.org/doi/abs/10.1145/3368756.3369089 (accessed Apr. 15, 2020).
[7] H. Saoud, A. Ghadi, and M. Ghailani, 'Hybrid Method for Breast Cancer Diagnosis Using Voting Technique and Three Classifiers', in Innovations in Smart Cities Applications Edition 3, Cham, 2020, pp. 470–482, doi: 10.1007/978-3-030-37629-1_34.
[8] P. Hamsagayathri and P. Sampath, 'Priority based decision tree classifier for breast cancer detection', in 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, Jan. 2017, pp. 1–6, doi: 10.1109/ICACCS.2017.8014598.
[9] K. Rajesh and D. S. Anand, 'Analysis of SEER Dataset for Breast Cancer Diagnosis using C4.5 Classification Algorithm', vol. 1, no. 2, p. 6.
[10] N. A. Farooqui, 'A Study on Early Prevention and Detection of Breast Cancer Using Three Machine Learning Techniques', Int. J. Adv. Res. Comput. Sci., p. 7, 2018.
[11] P. H. Abreu, M. S. Santos, M. H. Abreu, B. Andrade, and D. C. Silva, 'Predicting Breast Cancer Recurrence Using Machine Learning Techniques: A Systematic Review', ACM Comput. Surv., vol. 49, no. 3, pp. 1–40, Oct. 2016, doi: 10.1145/2988544.
[12] A. Lg and E. At, 'Using Three Machine Learning Techniques for Predicting Breast Cancer Recurrence', J. Health Med. Inform., vol. 04, no. 02, 2013, doi: 10.4172/2157-7420.1000124.
[13] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis, 'Machine learning applications in cancer prognosis and prediction', Comput. Struct. Biotechnol. J., vol. 13, pp. 8–17, 2015, doi: 10.1016/j.csbj.2014.11.005.
[14] Y. Song, S. Gao, W. Tan, Z. Qiu, H. Zhou, and Y. Zhao, 'Multiple Machine Learnings Revealed Similar Predictive Accuracy for Prognosis of PNETs from the Surveillance, Epidemiology, and End Result Database', J. Cancer, vol. 9, no. 21, pp. 3971–3978, 2018, doi: 10.7150/jca.26649.
[15] D. Solti and H. Zhai, 'Predicting Breast Cancer Patient Survival Using Machine Learning', in Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics - BCB'13, Washington DC, USA, 2013, pp. 704–705, doi: 10.1145/2506583.2512376.
[16] C. M. Lynch et al., 'Prediction of lung cancer patient survival via supervised machine learning classification techniques', Int. J. Med. Inf., vol. 108, pp. 1–8, Dec. 2017, doi: 10.1016/j.ijmedinf.2017.09.013.
[17] S. Li and T. Razzaghi, 'Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods', arXiv:1901.03896 [cs, stat], Jan. 2019. Available: http://arxiv.org/abs/1901.03896.
[18] X. Wu et al., 'Top 10 algorithms in data mining', Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, Jan. 2008, doi: 10.1007/s10115-007-0114-2.
[19] A. Mucherino, P. J. Papajorgji, and P. M. Pardalos, 'k-Nearest Neighbor Classification', in Data Mining in Agriculture, vol. 34, New York, NY: Springer New York, 2009, pp. 83–106.
[20] J. Han and M. Kamber, 'Data Mining: Concepts and Techniques', p. 772.
[21] H. Yusuff, N. Mohamad, U. K. Ngah, and A. S. Yahaya, 'Breast Cancer Analysis Using Logistic Regression', p. 9, 2012.
[22] H. Wang, B. Zheng, S. W. Yoon, and H. S. Ko, 'A support vector machine-based ensemble algorithm for breast cancer diagnosis', Eur. J. Oper. Res., vol. 267, no. 2, pp. 687–699, Jun. 2018, doi: 10.1016/j.ejor.2017.12.001.
[23] A. S. Assiri, S. Nazir, and S. A. Velastin, 'Breast Tumor Classification Using an Ensemble Machine Learning Method', J. Imaging, vol. 6, no. 6, p. 39, May 2020, doi: 10.3390/jimaging6060039.
[24] T. E. Mathew, K. S. A. Kumar, and K. S. Kumar, 'Breast Cancer Diagnosis using Stacking and Voting Ensemble models with Bayesian Methods as Base Classifiers', p. 14, 2020.