
International Journal of Computer Science and Information Security (IJCSIS),

Vol. 15, No. 12, December 2017

Knowledge Discovery in Databases (KDD): An Overview


Nwagu, Chikezie Kenneth, Omankwu, Obinnaya Chinecherem, and Inyiama, Hycient
1 Computer Science Department, Nnamdi Azikiwe University, Awka, Anambra State, Nigeria, [email protected]
2 Computer Science Department, Michael Okpara University of Agriculture, Umudike, Umuahia, Abia State, Nigeria, [email protected]
3 Electronics & Computer Engineering Department, Nnamdi Azikiwe University, Awka, Anambra State, Nigeria

ABSTRACT

Knowledge Discovery in Databases is the process of searching for hidden knowledge in the massive amounts of data that we are technically capable of generating and storing. Data, in its raw form, is simply a collection of elements from which little knowledge can be gleaned. With the development of data discovery techniques, the value of the data is significantly improved. A variety of methods are available to assist in extracting patterns that, when interpreted, provide valuable, possibly previously unknown, insight into the stored data. This information can be predictive or descriptive in nature. Data mining, the pattern extraction phase of KDD, can take on many forms, the choice dependent on the desired results. KDD is a multi-step process that facilitates the conversion of data to useful information. Our increased ability to gain information from stored data raises the ethical dilemma of how the information should be treated and safeguarded.

Keywords
Knowledge Discovery Databases, Data Mining, Knowledge Mining

1. INTRODUCTION

The desire and need for information have led to the development of systems and equipment that can generate and collect massive amounts of data. Many fields, especially those involved in decision making, are participants in the information acquisition game. Examples include finance, banking, retail sales, manufacturing, monitoring and diagnosis, health care, marketing and science data acquisition. Advances in storage capacity and digital data gathering equipment, such as scanners, have made it possible to generate massive datasets, sometimes called data warehouses, that measure in terabytes. For example, NASA's Earth Observing System is expected to return data at rates of several gigabytes per hour by the end of the century (Way, 1991). Modern scanning equipment records millions of transactions from common daily activities such as supermarket or department store checkout-register sales. The explosion in the number of resources available on the World Wide Web is another challenge for indexing and searching through a continually changing and growing "database."

Our ability to wade through the data and turn it into meaningful information is hampered by the size and complexity of the stored information base. In fact, the sheer size of the data makes human analysis untenable in many instances, negating the effort spent in collecting the data. There are several viable options currently being used to assist in weeding out usable information. The information retrieval process using these various tools is referred to as Knowledge Discovery in Databases (KDD).

"The basic task of KDD is to extract knowledge (or information) from lower level data (databases)" (Fayyad et al, 1995). There are several formal definitions of KDD; all agree that the intent is to harvest information by recognizing patterns in raw data. Let us examine the definition proposed by Fayyad, Piatetsky-Shapiro and Smyth: "Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al, 1995). The goal is to distinguish, from unprocessed data, something that may not be obvious but is valuable or enlightening in its discovery. Extraction of knowledge from raw data is accomplished by applying data mining methods. KDD has a much broader scope, of which data mining is one step in a multidimensional process.

Knowledge Discovery in Databases Process

Steps in the KDD process are depicted in the following diagram. It is important to note that KDD is not accomplished without human interaction. The selection of a data set and subset requires an understanding of the domain from which the data is to be extracted. For example, a database may contain customer addresses that would not be pertinent to discovering patterns in the selection of food items at a grocery store. Deleting non-related data elements from the dataset reduces the search space during the data mining phase of KDD. If the dataset can be analyzed using a sampling of the

13 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
data, the sample size and composition are determined during this stage.

Fig. I: Steps in KDD Process.

Databases are notoriously "noisy," containing inaccurate or missing data. During the preprocessing stage the data is cleaned. This involves the removal of "outliers" if appropriate, deciding strategies for handling missing data fields, accounting for time sequence information, and applicable normalization of data (Fayyad, 1996).

The transformation phase attempts to limit or reduce the number of data elements that are evaluated while maintaining the validity of the data. During this stage data is organized, converted from one type to another (e.g. changing nominal to numeric) and new or "derived" attributes are defined.

At this point the data is subjected to one or several data mining methods such as classification, regression, or clustering. The data mining component of KDD often involves repeated iterative application of particular data mining methods. "For example, to develop an accurate, symbolic classification model that predicts whether magazine subscribers will renew their subscriptions, a circulation manager might need to first use clustering to segment the subscriber database, and then apply rule induction to automatically create a classification for each desired cluster" (Simoudis, 1996). Various data mining methods are discussed in more detail in the following sections.

The final step is the interpretation and documentation of the results from the previous steps. Actions at this stage could consist of returning to a previous step in the KDD process to further refine the acquired knowledge, or translating the knowledge into a form understandable to the user. A commonly used interpretive technique is visualization of the extracted patterns. The results should be critically reviewed and conflicts with previously believed or extracted knowledge resolved.

Understanding and committing to all phases of the data mining process is crucial to its success.

Data Mining Models

A few of the many model functions being incorporated in KDD include:

Classification: mapping or classifying data into one of several predefined classes (Hand, 1981). For example, a bank may establish two classes based on debt-to-income ratio. The classification algorithm determines into which of the two classes an applicant falls and generates a loan decision based on the result.

Figure 2: Regression Analysis

Regression: "a learning function which maps a data item to a real-valued prediction variable" (Hand, 1981). Comparing a particular instance of an electric bill to a predetermined norm for that same time period and observing deviations from that norm is an example of regression analysis.

Clustering: "maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data, unlike classification in which the classes are predefined. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models" (Fayyad et al, 1996). An example of this technique would be grouping patients based on symptoms exhibited. The clusters need not be mutually exclusive.

Summarization: generating a concise description of the data. Routine examples of these techniques include the mean and standard deviation of specific data elements within the dataset.

Dependency modeling: developing a model that shows how variables are interrelated. An example would be a model showing that electrical usage is highly correlated with the ambient temperature.
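As a minimal sketch of how the steps and model functions described above fit together, the following Python example runs selection, preprocessing, summarization and dependency modeling over a small, invented dataset. The records, field names and helper functions are illustrative assumptions and are not taken from any system discussed in this paper; dropping incomplete rows is only one possible strategy for handling missing data.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical records: (customer_address, ambient_temp_c, usage_kwh).
# None marks a missing value. All values are invented for illustration.
RAW = [
    ("12 Elm St", 18.0, 20.5),
    ("9 Oak Ave", 24.0, 26.0),
    ("3 Pine Rd", None, 22.0),
    ("7 Ash Ln", 30.0, 33.5),
    ("5 Fir Ct", 35.0, 39.0),
]


def select(records):
    """Selection: keep only fields relevant to the question (drop the address)."""
    return [(temp, usage) for _, temp, usage in records]


def preprocess(rows):
    """Preprocessing: drop rows that contain a missing value."""
    return [row for row in rows if None not in row]


def summarize(values):
    """Summarization: mean and sample standard deviation of one attribute."""
    return mean(values), stdev(values)


def pearson(xs, ys):
    """Dependency modeling: Pearson correlation between two attributes."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rows = preprocess(select(RAW))
temps = [t for t, _ in rows]
usage = [u for _, u in rows]
usage_mean, usage_sd = summarize(usage)
r = pearson(temps, usage)
print(f"rows kept: {len(rows)}, mean usage: {usage_mean:.2f} kWh, "
      f"sd: {usage_sd:.2f}, correlation: {r:.3f}")
```

A coefficient close to 1 on this toy data corresponds to the electrical usage example above. On real data, the interpretation step would still require domain review before such a pattern is treated as knowledge.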

Choosing a Data Mining Model

There are no established guidelines to assist in choosing the correct algorithm to apply to a dataset. Typically, the more complex models may fit the data better but may also be more difficult to understand and to fit reliably (Fayyad et al, 1995). Successful applications often use simpler models due to their ease of translation. Each technique tends to lend itself to a particular type of problem. Understanding the domain will assist in determining what kind of information is needed from the discovery process, thereby narrowing the field of choice. Results can be broken into two general categories: prediction and description. Prediction, as the name implies, attempts to forecast the possible future values of data elements. Prediction is being applied extensively in the area of finance in an attempt to forecast movement in the stock market. Description seeks to discover interpretable patterns in the data. Fraud detection is an application that uses description to identify characteristics of potentially fraudulent transactions.

Classification, clustering, summarization and dependency modeling are descriptive models, while regression is predictive.

Current Applications of KDD

Several knowledge discovery applications have been successfully implemented. "SKICAT, a system which automatically detects and classifies sky objects image data resulting from a major astronomical sky survey. SKICAT can outperform astronomers in accurately classifying faint sky objects" (Fayyad et al, 1995). KDD is being used to flag suspicious activities on two fronts: Falcon alerts banks of possible fraudulent credit card transactions, and the FAIS system employed by the Financial Crimes Enforcement Network detects financial transactions that may indicate money laundering (Simoudis, 1996). Market Basket Analysis (MBA) has incorporated discovery driven data mining techniques to gain insights about customer behavior. Other applications are being used in molecular biology, global climate change modeling and other fields where the volume of data exceeds our ability to decipher its meaning.

Privacy Concerns and Knowledge Discovery

Although not unique to Knowledge Discovery, sensitive information is being collected and stored in these huge data warehouses. Concerns have been raised about what information should be protected from KDD-type access. The ethical and moral issues of invasion of privacy are intrinsically connected to pattern recognition. Safeguards are being discussed to prevent misuses of the technology.

Summary

Knowledge Discovery in Databases is answering a need to make use of the mountains of data that are accumulating daily. KDD enlists the power of computers to assist in recognizing patterns in data, a task that exceeds human ability as the size of data warehouses increases. New methods of analysis and pattern extraction are being developed and adapted to KDD. Which method is used depends on the domain and results expected. The accuracy of the recorded data must not be overlooked during the KDD process. Domain specific knowledge assists with the subjective analysis of KDD results. Much attention has been given to the data mining phase of KDD, but earlier steps, such as data cleaning, play a significant role in the validity of the results.

The potential benefits of discovery driven data mining techniques in extracting valuable information from large complex databases are unlimited. Successful applications are surfacing in industries and areas where data retrieval is outpacing man's ability to effectively analyze its content. Users must be aware of the potential moral conflicts in using sensitive information.

REFERENCES

(1) Way, J.; Smith, E.A. "The Evolution of Synthetic Aperture Radar Systems and Their Progression to the EOS SAR." IEEE Trans. Geoscience and Remote Sensing, Vol. 29, No. 6, 1991, pp. 962-985.

(2) Fayyad, U.; Simoudis, E. "Knowledge Discovery and Data Mining Tutorial MA1." Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), July 27, 1995. www-aig.jpl.nasa.gov/public/kdd95/tutorials/IJCAI95-tutorial.html

(3) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "From Data Mining to Knowledge Discovery: An Overview." In Advances in Knowledge Discovery and Data Mining, Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. (eds.). MIT Press, Cambridge, Mass., 1996, pp. 1-36.

(4) Fayyad, U. "Data Mining and Knowledge Discovery: Making Sense Out of Data." IEEE Expert, October 1996, pp. 20-25.

(5) Simoudis, E. "Reality Check for Data Mining." IEEE Expert, October 1996, pp. 26-33.

(6) Hand, D. J. Discrimination and Classification. John Wiley and Sons, Chichester, U.K., 1981.

(7) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. "The KDD Process for Extracting Useful Knowledge from Volumes of Data." Communications of the ACM, Vol. 39, No. 11, November 1996, pp. 27-34.
