International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
SURVEY OF DATA MINING TECHNIQUES USED IN
HEALTHCARE DOMAIN
Sheenal Patel and Hardik Patel
Department of Computer Science and Applications, Charotar University of Science &
Technology, Changa, Gujarat, India
ABSTRACT
Health care industry produces enormous quantity of data that clutches complex information relating to
patients and their medical conditions. Data mining is gaining popularity in different research arenas due to
its infinite applications and methodologies to mine the information in correct manner. Data mining
techniques have the capabilities to discover hidden patterns or relationships among the objects in the
medical data. In last decade, there has been increase in usage of data mining techniques on medical data
for determining useful trends or patterns that are used in analysis and decision making. Data mining has
an infinite potential to utilize healthcare data more efficiently and effectually to predict different kind of
disease. This paper features various Data Mining techniques such as classification, clustering, association
and also highlights related work to analyse and predict human disease.
KEYWORDS
Data Mining, Health Care, Classification, Clustering, Association
1. INTRODUCTION
Data mining is an assortment of algorithmic techniques to extract instructive patterns from raw
data. Healthcare industry today produces huge amounts of multifarious data about hospitals,
resources, disease diagnosis, electronic patient records, etc. The large amount of data is crucial to
be processed and scrutinized for knowledge extraction that empowers support for understanding
the prevailing circumstances in healthcare industry. Data mining processes include framing a
hypothesis, gathering data, performing pre-processing, estimating the model, and understanding
the model and draw the conclusions [2]. Before studying how data mining algorithms are being
applied on medical data, let us understand what types of algorithms exists in data mining and how
they are functioning.
It came into existence somewhere in the middle of 1990’s and appeared as a strong tool that
extracts needful information from a bulk of data. In common, Knowledge Discovery (KDD) and
Data Mining are related terms and are used interchangeably but several researchers assume that
both terms are dissimilar as Data Mining is one of the most vital stages of the KDD process.
According to Fayyad et al., the Knowledge Discovery in database is systematized in various
stages whereas the first stage is selection of data in which data is gathered from different sources,
the second stage is pre-processing the selected data, the third stage is transforming the data into
suitable format so that it can be processed further, the fourth stage consist of Data Mining where
suitable Data Mining technique is applied on the transformed data for extracting valuable
information and evaluation is the last stage as shown in Figure 1 [28].
DOI : 10.5121/ijist.2016.6206
53
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
Figure 1. Stages of Knowledge Discovery Process
Knowledge Discovery in databases is the process of retrieving high-level knowledge from lowlevel data. It is an iterative process that comprises steps like Selection of Data, Pre-processing the
selected data, Transformation of data into appropriate form, Data mining to extract necessary
information and Interpretation/Evaluation of data [20].
Selection step collects the heterogeneous data from varied sources for processing. Real life
medical data may be incomplete, complex, noisy, inconsistent, and/or irrelevant which requires a
selection process that gathers the important data from which knowledge is to be extracted.
Pre-processing step performs basic operations of eliminating the noisy data, try to find the
missing data or to develop a strategy for handling missing data, detect or remove outliers and
resolve inconsistencies among the data.
Transformation step transforms the data into forms which is suitable for mining by performing
task like aggregation, smoothing, normalization, generalization, and discretization. Data
reduction task shrinks the data and represents the same data in less volume, but produces the
similar analytical outcomes.
Data mining is a main component in KDD process. Data mining includes choosing the data
mining algorithm(s) and using the algorithms to generate previously unknown and hypothetically
beneficial information from the data stored in the database. This comprises deciding which
models/algorithms and parameters may be suitable and matching a specific data mining method
with the general standards of the KDD process. Data mining methods includes classification,
summarization, clustering, regression, etc. [20]
Interpretation/ Evaluation step includes presentation of mined patterns in understandable form.
Various types of information need different type of representation, in this step the mined patterns
are interpreted. Evaluation of the outcomes is prepared with statistical justification and
significance testing.
54
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
Knowledge discovery: integrating the extracted knowledge into another system for further action,
or merely documenting the same and broadcasting it to interested parties. This step also
comprises checking and resolving possible conflicts with previously extracted knowledge. [29]
KDD can be effective at working with bulky data to define significant pattern and to develop
strategic results. A health care organization can implement Knowledge Discovery in databases
(KDD) by the help of experienced employee who has good understanding in health care domain
[5].
Generally data mining algorithms are classified in two categories: descriptive model (or
unsupervised learning) and predictive model (or supervised learning). Descriptive data-mining
model is to discover patterns in the data and identifies the associations between attributes
represented by the data. In contrast, the purpose of Predictive mining model is largely to predict
the future outcome than existing behaviour [19].
2. DATA MINING TECHNIQUES
Data mining techniques such as association, classification and clustering are used by healthcare
organization to increase their capability for building appropriate conclusions regarding patient
health from raw facts and figures [24].
2.1. Classification
Classification comprises of two footsteps: - 1) Training and 2) Testing. Training builds a
classification model on the basis of training data collected for generating classification rules. The
IF-THEN prediction rule is highly popular in data mining; they signify facts at a high level of
abstraction. The accuracy of classification model hinge on the degree to which classifying rules
are true which is estimated by test data [9]. In health care domain classification can be made
useful
as
“if
DiabeticFamilyHistory=yes
AND
HighSugerIntake=yes
THEN
DiabetesPossiblity=High”. Hatice et al., to analyse skin diseases by using weighted KNN
classifier [1].
2.2. Clustering
Clustering is different from classification; it does not have predefined classes. A large database is
divided into number of small subgroups called clusters. It divides the data based on similarities it
have. Clustering algorithms discovers collections of the data such that objects in the same cluster
are more identical to each other than other groups [13]. Tapia et al. examined the gene expression
data with support of hierarchical clustering approach by using genetic algorithm [11].
2.3. Association
Association also has great impact in the health care industry to discover the relationships between
diseases, state of human health and the symptoms of disease. Ji et al., used association in order to
learn uncommon casual relationships in Electronic health databases [12]. An integrated approach
of using Association and Classification techniques also improved the capabilities of Data Mining.
Soni et al., have used this integrated approach of association and classification for studying health
care data. This integrated approach is useful for determining rules in the database and then by
using these rules, an effective classifier is raised. The study made experiment on the data of heart
patients and generate rules by weighted associative classifier [26]. Thus, Association also has an
ample influence in the healthcare field to identify the relationships among various diseases, state
of human health and the symptoms of disease.
55
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
3. APPLICATION OF DATA MINING TECHNIQUES IN HEALTH CARE
The different classification algorithms mentioned below in figure 1 are used to predict or to
analyse various diseases.
Figure 2. Different techniques in Healthcare domain
Summary of Techniques for Medical data mining.
In terms of prediction and decision making, Data mining techniques have substantial expansion in
medical industry with respect to various diseases like diabetes, heart disease, liver diseases,
cancer and others. Table 1 summarizes the medical data mining, its techniques used and for the
related disease.
56
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
Table 1. Summary of medical data mining techniques
Disease
Technique Used
Extracting patterns & detecting trends using Neural Networks
Conventional Pathology Data [3].
Prediction models using Decision Tree Algorithms such as
ID3,
Coronary heart disease
C4.5, C5, and CART [3] [32].
Lymphoma Disease and
Lung
Distinguish disease subtypes using Ensemble approach [4] [6].
Cancer
Predicate the probability of a psychiatric patient on the basis
Psychiatric Diseases
detected symptoms using BBN Bayesian networks [7].
Identify frequency of diseases in particular
geographical
area
Fre quent Disease
using Apriori algorithm [8].
Liver diseases
Classification using Bayesian Ying Yang (BYY) [10].
Categorization of skin disease using integrated
decision
tree
Skin Disease
model with neural network classification methods [14].
Diabetes
Classification of Medical Data using Genetic Algorithm [15].
Functional Magnetic
Integration of Clustering and Classification of biomedical
Resonance Imaging (fMRI) databases [16].
Constructed a model using Artificial Neural Network
(ANN)
Chest Disease
[17].
Classification of Disease using k-Nearest Neighbour
Diabetes, Cancer
[18].
Improving classification accuracy using Naive
Bayesian
[21]
Coronary Heart Disease
[30].
Chronic Disease
Prediction of Diseases Using Apriori Algorithm [22].
Disease classification using Support Vector Machine
Diabetes
[23]
Accurate Classification of medical data using Kmeans,
SelfBreast Cancer
Organizing Map (SOM) and Naïve Bayes [25].
Diagnose Cardio Vascular Disease using Classification
algorithm
Cardio Vascular Diseases
[27].
Familiarized an adaptive Fuzzy K-NN approach for diagnosing
Parkinson Disease
the disease [31]
57
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
4. CONCLUSION
With the recent rapid rise in the quantity of biomedical data that is gathered by electronic means
in critical care and the rampant availability of inexpensive and dependable computing equipment,
many researchers has started, or are eager to start, exploring these data. In this paper we observe
some data mining techniques that has been employed for medical data. As there is voluminous
records in this industry and because of this, it has become requisite to use data mining techniques
to help in decision support and prediction in the field of Healthcare to identify the kind of disease.
The medical data mining produces business intelligence which is useful for diagnosing of the
disease. This paper throws light into data mining techniques that is used for medical data for
various diseases which are identified and diagnosed for human health.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
C. Hattice & K. Metin, “A DIAGNOSTIC SOFTWARE TOOL FOR SKIN DISEASES WITH
BASIC AND WEIGHTED K-NN”, Innovations in Intelligent Systems and Applications (INISTA),
2012.
Dhanya P Varghese & Tintu P B, “A SURVEY ON HEALTH DATA USING DATA MINING
TECHNIQUES”, International Research Journal of Engineering and Technology (IRJET), Volume:
02 Issue: 07, Oct-2015.
Doron Shalvi & Nicholas DeClaris, “AN UNSUPERVISED NEURAL NETWORK APPROACH TO
MEDICAL DATA MINING TECHNIQUES”, IEEE, 1998.
Gustavo Santos-Garcia & Gonzalo Varela & Nuria Novoa & Marcelo F. Jimenez, “PREDICTION OF
POSTOPERATIVE MORBIDITY AFTER LUNG RESECTION USING AN ARTIFICIAL
NEURAL NETWORK ENSEMBLE”, Artificial Intelligence in Medicine 30:61–69, 2004.
Harleen Kaur & Siri Krishan Wasan, “EMPIRICAL STUDY ON APPLICATIONS OF DATA
MINING TECHNIQUES IN HEALTHCARE”, Journal of Computer Science 2 (2): 194-200, 2006.
Hojin Moon & Hongshik Ahn & Ralph Kodell & Songjoon Baek & Chien- Ju Lin & James Chen,
“ENSEMBLE METHODS FOR CLASSIFICATION OF PATIENTS FOR PERSONALIZED
MEDICINE WITH HIGH-DIMENSIONAL DATA”. Artificial Intelligence in Medicine 41:197–207,
2007.
I. Curiac & G. Vasile & O. Banias & C. Volosencu & A. Albu, “BAYESIAN NETWORK MODEL
FOR DIAGNOSIS OF PSYCHIATRIC DISEASES”, Proceedings of the ITI 2009 31st Int. Conf. on
Information Technology Interfaces, Cavtat, Croatia, 22-25 June-2009.
Ilayaraja & T. Meyyappan, “MINING MEDICAL DATA TO IDENTIFY FREQUENT DISEASES
USING APRIORI ALGORITHM”, Proceedings of the 2013 International Conference on Pattern
Recognition, Informatics and Mobile Engineering, 21-22 February-2013.
Illhoi Yoo & Patricia Alafaireet & Miroslav Marinov & Keila Pena-Hernandez & Rajitha Gopidi &
Jia-Fu Chang & Lei Hua, “DATA MINING IN HEALTHCARE AND BIOMEDICINE: A SURVEY
OF THE LITERATURE”, Springer, May-2011.
Jeong-Yon Shim & Lei Xu, “MEDICAL DATA MINING MODEL FOR ORIENTAL MEDICINE
VIA BYY BINARY INDEPENDENT FACTOR ANALYSIS”, IEEE.P1-4, 2003.
J.J.Tapia & E. Morett & E. E. Vallejo, “A CLUSTERING GENETIC ALGORITHM FOR
GENOMIC DATA MINING”, Foundations of Computational Intelligence, Studies in Computational
Intelligence, Volume:204, 2009.
J.Yanqing & H.Ying & J.Tran & P.Dews & A.Mansour & R.Michael Massanari, “MINING
INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES”, 11th
IEEE International Conference on Data Mining Workshops, 2011.
K.Sharmila & Dr.S.A.Vethamanickam, “SURVEY ON DATA MINING ALGORITHM AND ITS
APPLICATION IN HEALTHCARE SECTOR USING HADOOP PLATFORM”, International
Journal of Emerging Technology and Advanced Engineering ISSN 2250-2459, Volume: 05, Issue: 01,
January-2015.
58
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
[14] L.Chang & C.H.Chen, “APPLYING DECISION TREE AND NEURAL NETWORK TO INCREASE
QUALITY OF DERMATOLOGIC DIAGNOSIS”, Expert Systems with Applications- Elsevier,
Volume: 36, pp. 4035-4041, 2009.
[15] Markus Brameier & Wolfgang Banzhaf, “A COMPARISON OF LINEAR GENETIC
PROGRAMMING AND NEURAL NETWORKS IN MEDICAL DATA MINING”, IEEE.p1-10,
2001.
[16] Michael Barnathan & Jingjing Zhang & Vasileios, “A WEB-ACCESSIBLE FRAMEWORK FOR
THE AUTOMATED STORAGE AND TEXTURE ANALYSIS OF BIOMEDICAL IMAGES”,
IEEE. P1-3. 2008.
[17] O.Er & N. Yumusakc & F. Temurtas, “CHEST DISEASES DIAGNOSIS USING ARTIFICIAL
NEURAL NETWORKS”, Expert Systems with Applications- Elsevier, Volume: 37, pp. 76487655,
2010.
[18] Ping-Hung Tang &, Ming-Hseng Tseng, “MEDICAL DATA MINING USING BGA AND RGA
FOR WEIGHTING OF FEATURES IN FUZZY K-NN CLASSIFICATION”, IEEE.P1-6, July-2009.
[19] Pradnya P. Sondwale, “OVERVIEW OF PREDICTIVE AND DESCRIPTIVE DATA MINING
TECHNIQUES”, International Journal of Advanced Research in Computer Science and Software
Engineering, Volume: 05 Issue: 04, April-2015.
[20] Prakash Mahindrakar & Dr. M. Hanumanthappa, “DATA MINING IN HEALTHCARE: A SURVEY
OF TECHNIQUES AND ALGORITHMS WITH ITS LIMITATIONS AND CHALLENGES”,
Prakash Mahindrakar et al Int. Journal of Engineering Research and Applications: 2248-9622, pp.937941, Volume: 03 Issue 06, Nov-Dec 2013.
[21] Ranjit Abraham & Jay B.Simha &Iyengar, “A COMPARATIVE ANALYSIS OF
DISCRETIZATION METHODS FOR MEDICAL DATAMINING WITH NAÏVE BAYESIAN
CLASSIFIER”, IEEE. P1-2, 2006.
[22] R.Karthiyayini & J.Jayaprakash, “ASSOCIATION TECHNIQUE ON PREDICTION OF CHRONIC
DISEASES USING APRIORI ALGORITHM”, International Journal of Innovative Research in
Science, Engineering and Technology, Volume: 04, Special Issue 06, May 2015.
[23] Sarojini Balakrishnan & Ramaraj Narayanaswamy, “FEATURE SELECTION USING FCBF IN
TYPE II DIABETES DATABASES”, Special Issue of the International Journal of the Computer, the
Internet and Management, Volume: 17 No. SP1, March-2009.
[24] Sheetal L. Patil, “SURVEY OF DATA MINING TECHNIQUES IN HEALTHCARE”, International
Research Journal of Innovative Engineering, Volume: 01 Issue: 09, September-2015.
[25] Syed Zahid Hassan & Brijesh Verma, “A HYBRID DATA MINING APPROACH FOR
KNOWLEDGE EXTRACTION AND CLASSIFICATION IN MEDICAL DATABASES”. IEEE.
P1-6, 2007.
[26] S. Soni & O. P. Vyas, “USING ASSOCIATIVE CLASSIFIERS FOR PREDICTIVE ANALYSIS IN
HEALTH CARE DATA MINING”, International Journal of Computer Applications, Volume: 04,
No: 05, July-2010.
[27] Tsang-Hsiang Cheng & Chih-Ping Wei & Vincent S. Tseng, “FEATURE SELECTION FOR
MEDICAL DATA MINING: COMPARISONS OF EXPERT JUDGMENT AND AUTOMATIC
APPROACHES”, IEEE. P1-6, 2006.
[28] U.Fayyad, G.Piatetsky-Shapiro and P.Smyth, “THE KDD PROCESS OF EXTRACTING USEFUL
KNOWLEDGE FORM VOLUMES OF DATA”, Communications of the ACM, pp. 27-34 Volume:
39, No: 11, November-1996.
[29] Usama Fayyad & Gregory Piatetsky & Padhraic Smyth, “Knowledge Discovery and Data Mining:
Towards a Unifying Framework” KDD-96 Proceedings, 1996.
[30] Weimin Xue & Yanan Sun & Yuchang Lu, “RESEARCH AND APPLICATION OF DATA MINING
IN TRADITIONALCHINESE MEDICAL CLINIC DIAGNOSIS”, IEEE.p1-4, 2006.
[31] W.L.Zuoa & Z.Y.Wanga & T.Liua & H.L.Chenc, “EFFECTIVE DETECTION OF PARKINSON’S
DISEASE USING AN ADAPTIVE FUZZY K-NEAREST NEIGHBOR APPROACH”, Biomedical
Signal Processing and Control, Elsevier, pp. 364373, 2013.
[32] Yanwei Xing & Jie Wang & Zhihong Zhao & Yonghong Gao, “COMBINATION DATA MINING
METHODS WITH NEW MEDICAL DATA TO PREDICTING OUTCOME OF CORONARY
HEART DISEASE”, International Conference on Convergence Information Technology, 2007.
59
International Journal of Information Sciences and Techniques (IJIST) Vol.6, No.1/2, March 2016
AUTHORS
Sheenal Patel received her B.C.A. and M.C.A. degree from Dharmsinh Desai University,
Nadiad, Gujarat, India in 2012 and 2014 respectively. She is presently working as an
Assistant Professor at Smt. Chandaben Mohanbhai Patel Institute of Computer
Applications, Charusat University, Changa, Gujarat, India since 2014. Her research area
include knowledge processing, data mining.
Hardik Patel has received M.C.A. degree from Smt. Chandaben Mohanbhai Patel
Institute of Computer Applications, Charusat University, Changa, Gujarat, India in 2014
and B.C.A degree from M.B.Patel Science College, Anand, Gujarat, India. Now he is an
Assistant Professor at Charotar Institute of Computer Applications – Changa, India. His
research areas include data mining, cloud computing, content-based image and video
analysis.
60