Paper Ljupce Markusheski PHD
Ljupce Markusheski, Ph.D.1
Igor Zdravkoski, Ph.D.2
Miroslav Andonovski, Ph.D.3
Aleksandra Jovanoska, M.Sc.4
Abstract
Data Mining is a powerful tool that allows companies to extract the most important information from their data warehouses. Data Mining tools make it possible to predict future trends and behaviors and to plan activities on the basis of specific knowledge. The volume of information we have to handle grows every day, coming from business transactions, scientific data, sensor data, pictures, videos, etc. We therefore need systems capable of extracting the essence of the available information and of automatically generating reports, views or summaries of the data for better decision-making.
Data mining is used in business to make better managerial decisions by:
Automatic summarization of data,
Extracting the essence of stored information,
Discovering patterns in raw data.
Data Mining, also known as Knowledge Discovery in Databases, refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data stored in
databases.
Keywords: Data Mining, tools, databases, data warehouse, knowledge.
1 Faculty of Economics – Prilep, Republic of North Macedonia, e-mail: [email protected].
2 Faculty of Economics – Prilep, Republic of North Macedonia, e-mail: [email protected].
3 Faculty of Economics – Prilep, Republic of North Macedonia, e-mail: [email protected].
4 Faculty of Economics – Prilep, Republic of North Macedonia, e-mail: [email protected].
Introduction
Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and
modeling of large data repositories. KDD is the organized process of identifying valid, novel,
useful, and understandable patterns from large and complex data sets. Data Mining (DM) is the
core of the KDD process, involving the inference of algorithms that explore the data, develop the model, and discover previously unknown patterns. The model is used for understanding phenomena from the data, for analysis, and for prediction. The accessibility and abundance of data today
makes knowledge discovery and Data Mining a matter of considerable importance and necessity.
Given the recent growth of the field, it is not surprising that a wide variety of methods is now
available to the researchers and practitioners. No one method is superior to others for all cases.
The handbook of Data Mining and Knowledge Discovery from Data aims to organize all significant methods developed in the field into a coherent and unified catalog, to present performance evaluation approaches and techniques, and to explain, with cases and software tools, the use of the different methods.
The goals of this introductory chapter are to explain the KDD process, and to position
DM within the information technology tiers. Research and development challenges for the next
generation of the science of KDD and DM are also defined. The rationale, reasoning and
organization of the handbook are presented in this chapter. In this chapter there are six sections
followed by a brief reference primer list containing leading papers, books, conferences and
journals in the field:
1. The KDD Process
2. Taxonomy of Data Mining Methods
3. Data Mining within the Complete Decision Support System
4. KDD & DM Research Opportunities and Challenges
5. KDD & DM Trends
6. The Organization of the Handbook
A special recent aspect of data availability that promotes the rapid development of KDD and DM is the electronic readiness of data (though of different types and reliability). The fast development of the Internet and intranets in particular promotes data accessibility.
Methods that were developed before the
Internet revolution considered smaller amounts of data with less variability in data types and
reliability [2]. Since the information age, the accumulation of data has become easier and storing
it inexpensive. It has been estimated that the amount of stored information doubles every twenty
months. Unfortunately, as the amount of electronically stored information increases, the ability
to understand and make use of it does not keep pace with its growth. Data Mining is a term
coined to describe the process of sifting through large databases for interesting patterns and
relationships. The studies today aim at evidence-based modeling and analysis, as is the leading
practice in medicine, finance and many other fields. The data availability is increasing
exponentially, while the human processing level is almost constant. Thus, the gap increases
exponentially. This gap is the opportunity for the KDD/DM field, which therefore becomes
increasingly important and necessary.
1. The KDD Process
The process starts with determining the KDD goals, and “ends” with the implementation
of the discovered knowledge. Then the loop is closed – the Active Data Mining part starts (which
is beyond the scope of this book and the process defined here). As a result, changes would have
to be made in the application domain (such as offering different features to mobile phone users in
order to reduce churning). This closes the loop, and the effects are then measured on the new
data repositories, and the KDD process is launched again.
Following is a brief description of the nine-step KDD process, starting with
a managerial step:
1. Developing an understanding of the application domain. This is the initial preparatory step. It prepares the scene for understanding what should be done with the many decisions (about transformation, algorithms, representation, etc.). The people who are in charge of a KDD project need to understand and define the goals of the end-user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge). As the KDD process proceeds, there may even be a revision of this step. Having understood the KDD goals, the preprocessing of the data starts, as defined in the next three steps (note that some of the methods here are similar to Data Mining algorithms, but are used in the preprocessing context).
2. Selecting and creating a data set on which discovery will be performed. Having defined the goals, the data that will be used for the knowledge discovery is determined: finding out what data is available, obtaining additional data if needed, and integrating all the data into one data set for the discovery process.
3. Preprocessing and cleansing. In this stage, data reliability is enhanced. It includes data cleansing, such as handling missing values and removing noise or outliers. There are many methods explained in the handbook, ranging from doing nothing to very involved procedures that become the major part (in terms of time consumed) of a KDD project. It may involve complex statistical methods or the use of a Data Mining algorithm in this context. For example, if one suspects that a certain attribute is of insufficient reliability or has many missing values, then this attribute could become the goal of a supervised Data Mining algorithm: a prediction model for this attribute is developed, and the missing values can then be predicted. A minimal sketch of this idea is given below. The extent to which one pays attention to this level depends on many factors. In any case, studying these aspects is important and often revealing in itself regarding enterprise information systems.
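For illustration only, the following minimal Python sketch (assuming the pandas and scikit-learn libraries, with purely hypothetical attribute names) shows how an unreliable attribute with missing values can be treated as the target of a supervised model and then imputed from its predictions.

# Sketch: impute a low-reliability attribute with a supervised model.
# The data and column names are synthetic and purely illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, 200),
    "tenure": rng.integers(1, 30, 200),
    "income": rng.normal(40_000, 10_000, 200),
})
df.loc[rng.choice(200, 40, replace=False), "income"] = np.nan  # unreliable attribute

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Build a prediction model for the unreliable attribute from the complete rows ...
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[["age", "tenure"]], known["income"])

# ... and predict the missing entries from the other attributes.
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age", "tenure"]])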
4. Data transformation. In this stage, better data for the Data Mining step are generated and prepared. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformations). This step can be crucial for the success of the entire KDD project, and it is usually very project-specific. For example, in medical examinations the quotient of two attributes may often be the most important factor, rather than each one by itself. In marketing, we may need to consider effects beyond our control as well as efforts and temporal issues (such as studying the effect of advertising accumulation). However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus, the KDD process reflects upon itself and leads to an understanding of the transformation needed; a minimal sketch of two typical transformations is given below. Having completed the above four steps, the following four steps are related to the Data Mining part, where the focus is on the algorithmic aspects employed for each project.
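The following minimal Python sketch (again assuming scikit-learn, and using synthetic data) illustrates two such transformations: discretization of numerical attributes and filter-based feature selection for dimension reduction.

# Sketch: discretization and feature selection as transformation steps
# (synthetic data; the "true" informative attributes are indices 0 and 3).
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                   # ten candidate attributes
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)    # class depends on two of them

# Discretize each numerical attribute into five equal-frequency bins.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

# Keep only the attributes most associated with the target (dimension reduction).
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X_binned, y)
print("selected attribute indices:", selector.get_support(indices=True))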
5. Choosing the appropriate Data Mining task. We are now ready to decide on which type of
Data Mining to use, for example, classification, regression, or clustering. This mostly depends on
the KDD goals, and also on the previous steps. There are two major goals in Data Mining:
prediction and description. Prediction is often referred to as supervised Data Mining, while
descriptive Data Mining includes the unsupervised and
visualization aspects of Data Mining. Most data mining techniques are based on inductive
learning, where a model is constructed explicitly or implicitly by generalizing from a sufficient
number of training examples. The underlying assumption of the inductive approach is that the
trained model is applicable to future cases. The strategy also takes into account the level of meta-
learning for the particular set of available data.
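As an illustrative sketch of this choice (assuming scikit-learn and its bundled Iris data set), the same data can be approached either as a prediction task with a known target attribute or as a description task that only groups the instances.

# Sketch: the same data treated as a predictive (supervised) task
# and as a descriptive (unsupervised) task.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Prediction: learn a mapping from the input attributes to a known target.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("supervised accuracy:", clf.score(X_te, y_te))

# Description: group the instances without any prespecified target attribute.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])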
6. Choosing the Data Mining algorithm. Having the strategy, we now decide on the tactics.
This stage includes selecting the specific method to be used for searching patterns (including
multiple inducers). For example, in considering precision versus understandability, the former is
better with neural networks, while the latter is better with decision trees. For each strategy of
meta-learning there are several possibilities of how it can be accomplished. Meta-learning
focuses on explaining what causes a Data Mining algorithm to be successful or not in a particular
problem. Thus, this approach attempts to understand the conditions under which a Data Mining
algorithm is most appropriate. Each algorithm has parameters and tactics of learning (such as
ten-fold cross-validation or another division for training and testing).
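A minimal sketch of such a tactical comparison (assuming scikit-learn and one of its bundled data sets) is ten-fold cross-validation of an understandable decision tree against a more precise but opaque neural network.

# Sketch: comparing two candidate algorithms with ten-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
net = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))

for name, model in [("decision tree", tree), ("neural network", net)]:
    scores = cross_val_score(model, X, y, cv=10)          # ten-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")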
7. Employing the Data Mining algorithm. Finally, the implementation of the Data Mining
algorithm is reached. In this step we might need to employ the algorithm several times until a
satisfactory result is obtained, for instance by tuning the algorithm’s control parameters, such as the
minimum number of instances in a single leaf of a decision tree [4].
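For example, the following illustrative sketch (assuming scikit-learn) re-runs a decision tree while tuning exactly this control parameter, the minimum number of instances in a single leaf, and keeps the best-performing setting.

# Sketch: employing the algorithm repeatedly while tuning a control parameter.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},  # instances per leaf
    cv=10,
)
search.fit(X, y)
print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])
print("cross-validated accuracy:", round(search.best_score_, 3))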
8. Evaluation. In this stage we evaluate and interpret the mined patterns (rules, reliability etc.),
with respect to the goals defined in the first step. Here we consider the preprocessing steps with
respect to their effect on the Data Mining algorithm results (for example, adding features in Step
4, and repeating from there). This step focuses on the comprehensibility and usefulness of the
induced model. In this step the discovered knowledge is also documented for further usage. The
last step is the usage and overall feedback on the patterns and discovery results obtained by the
Data Mining:
9. Using the discovered knowledge [5]. We are now ready to incorporate the knowledge into
another system for further action. The knowledge becomes active in the sense that we may make
changes to the system and measure the effects. Actually, the success of this step determines the
effectiveness of the entire KDD process. There are many challenges in this step, such as losing
the “laboratory conditions” under which we have operated. For instance, the knowledge was
discovered from a certain static snapshot (usually sample) of the data, but now the data becomes
dynamic. Data structures may change (certain attributes become unavailable), and the data
domain may be modified (for example, an attribute may have a value that was not assumed before).
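As a small illustration of this challenge (assuming the pandas library, with purely hypothetical attribute names), a simple guard can check whether the new, dynamic data still matches the attribute domains of the snapshot from which the knowledge was discovered, before the model is applied.

# Sketch: checking new data against the training snapshot's schema and
# value domains before applying discovered knowledge (illustrative names).
import pandas as pd

train_snapshot = pd.DataFrame({"plan": ["basic", "premium"], "minutes": [120, 480]})
new_data = pd.DataFrame({"plan": ["basic", "family"], "minutes": [90, 300]})

missing_cols = set(train_snapshot.columns) - set(new_data.columns)
unseen_values = set(new_data["plan"]) - set(train_snapshot["plan"])

if missing_cols or unseen_values:
    print("domain changed:", missing_cols, unseen_values)
else:
    print("new data matches the training snapshot; safe to score")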
2. Taxonomy of Data Mining Methods
There are many methods of Data Mining used for different purposes and goals. A taxonomy is called for to help in understanding the variety of methods, their interrelations and groupings. It is useful to distinguish between two main types of Data Mining: verification-oriented (the system verifies the user’s hypothesis) and discovery-oriented (the system finds new rules and patterns autonomously). Discovery methods are those that automatically identify patterns in the data. The discovery branch consists of prediction methods versus description methods. Descriptive methods are oriented towards data interpretation, which focuses on understanding (by visualization, for example) the way the underlying data relate to their parts.
Prediction-oriented methods aim to build a behavioral model, which obtains new and unseen
samples and is able to predict values of one or more variables related to the sample. It also
develops patterns which form the discovered knowledge in a way which is understandable and
easy to operate upon. Some prediction-oriented methods can also help provide understanding of
the data. Most of the discovery-oriented Data Mining techniques (quantitative in particular) are
based on inductive learning, where a model is constructed, explicitly or implicitly, by
generalizing from a sufficient number of training examples. The underlying assumption of the
inductive approach is that the trained model is applicable to future unseen examples.
Verification methods, on the other hand, deal with the evaluation of a hypothesis proposed by an external source (such as an expert). These methods include the most common methods of traditional statistics, like the goodness-of-fit test, tests of hypotheses (e.g., the t-test of means), and analysis of variance (ANOVA). These methods are less associated with Data Mining than their discovery-oriented counterparts, because most Data Mining problems are concerned with discovering a hypothesis (out of a large set of hypotheses), rather than testing a
known one. Much of the focus of traditional statistical methods is on model estimation as
opposed to one of the main objectives of Data Mining: model identification and construction,
which is evidence based (though overlap occurs).
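For illustration, the following short Python sketch (assuming the SciPy library, with synthetic samples) performs two such verification-oriented tests, a t-test of means and a one-way ANOVA.

# Sketch: verification-oriented analysis of a user-supplied hypothesis
# (synthetic samples standing in for, e.g., spending of customer segments).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, 50)
group_b = rng.normal(108, 15, 50)
group_c = rng.normal(95, 15, 50)

t_stat, p_two = stats.ttest_ind(group_a, group_b)        # t-test of means
print(f"t-test of means: t={t_stat:.2f}, p={p_two:.3f}")

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)  # one-way ANOVA
print(f"one-way ANOVA:   F={f_stat:.2f}, p={p_anova:.3f}")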
Another common terminology, used by the machine-learning community, refers to the
prediction methods as supervised learning, as opposed to unsupervised learning. Unsupervised
learning refers to modeling the distribution of instances in a typical, high-dimensional input
space. Unsupervised learning refers mostly to techniques that group instances without a
prespecified, dependent attribute. Thus, the term “unsupervised learning” covers only a portion of the description methods. For instance, it covers clustering methods but
not visualization methods. Supervised methods are methods that attempt to discover the
relationship between input attributes (sometimes called independent variables) and a target
attribute (sometimes referred to as a dependent variable). The relationship discovered is
represented in a structure referred to as a model. Usually models describe and explain
phenomena, which are hidden in the data set and can be used for predicting the value of the
target attribute knowing the values of the input attributes. The supervised methods can be
implemented on a variety of domains, such as marketing, finance and manufacturing. It is useful
to distinguish between two main supervised models: classification models and regression
models. The latter map the input space into a real-valued domain. For instance, a regressor can
predict the demand for a certain product given its characteristics. On the other hand, classifiers
map the input space into predefined classes. For example, classifiers can be used to classify
mortgage consumers as good (fully pay back the mortgage on time) or bad (delayed payback), or into as many target classes as needed. There are many alternatives for representing classifiers. Typical examples include support vector machines, decision trees, probabilistic summaries, and algebraic functions.
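As an illustration of the two model types (assuming scikit-learn, with synthetic and purely illustrative attributes), a classifier can be trained to label mortgage consumers as good or bad payers, while a regressor predicts a real-valued demand.

# Sketch: a classifier (predefined classes) versus a regressor (real-valued target).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR

rng = np.random.default_rng(3)

# Classification: label applicants as good (1) or bad (0) payers.
X_cls = rng.normal(size=(500, 4))                    # e.g. income, debt, age, tenure
y_cls = (X_cls[:, 0] - X_cls[:, 1] > 0).astype(int)
classifier = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)

# Regression: predict real-valued demand for a product from its characteristics.
X_reg = rng.normal(size=(500, 3))                    # e.g. price, promotion, season
y_reg = 50 - 4 * X_reg[:, 0] + 2 * X_reg[:, 1] + rng.normal(0, 1, 500)
regressor = SVR().fit(X_reg, y_reg)

print("predicted class :", classifier.predict(X_cls[:1]))
print("predicted demand:", regressor.predict(X_reg[:1]))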
3. Data Mining within the Complete Decision Support System
Data Mining methods are becoming part of integrated Information Technology (IT) software packages. The decision support aspect of IT can be viewed as three tiers.
Starting from the data sources (such as operational databases, semi- and non-structured
data and reports, Internet sites etc.), the first tier is the data warehouse, followed by OLAP (On
Line Analytical Processing) servers and concluding with analysis tools, where Data Mining tools
are the most advanced.
The main advantage of the integrated approach is that the preprocessing steps are much
easier and more convenient. Since this part is the major burden for the KDD process (and often
consumes most of the KDD project time), this industry trend is very important for expanding the
use and utilization of Data Mining [6]. However, the risk of the integrated IT approach comes
from the fact that those DM techniques are much more complex and intricate than OLAP, for
example, so the users need to be trained appropriately. This handbook shows the variety of
strategies, techniques and evaluation measurements. We can naively distinguish among three
levels of analysis. The simplest one is achieved by report generators (for example, presenting all claims that occurred last year because of a certain cause, such as car theft). We then proceed to OLAP multi-level analysis (for example, presenting the ten towns where there was the highest increase of vehicle theft in the last month compared with the month before). Finally, a complex analysis is carried out by discovering the patterns that predict car thefts in these cities and what might occur if antitheft devices were installed. The latter is based on modeling the phenomenon, whereas the first two levels are ways of data aggregation and fast manipulation; a toy sketch of these three levels is given below.
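The following toy Python sketch (assuming pandas and scikit-learn, with an invented claims table) contrasts the three levels: a plain report, an OLAP-style aggregation, and a simple predictive model of the phenomenon.

# Sketch: three levels of analysis on a toy claims table (invented data).
import pandas as pd
from sklearn.linear_model import LogisticRegression

claims = pd.DataFrame({
    "town":  ["A", "A", "B", "B", "C", "C"],
    "month": [1, 2, 1, 2, 1, 2],
    "cause": ["theft", "fire", "theft", "theft", "fire", "theft"],
    "antitheft_device": [0, 1, 0, 0, 1, 1],
    "theft": [1, 0, 1, 1, 0, 0],
})

# Level 1 - report generator: all claims caused by car theft.
report = claims[claims["cause"] == "theft"]

# Level 2 - OLAP-style aggregation: thefts per town and month.
cube = claims.pivot_table(index="town", columns="month", values="theft", aggfunc="sum")

# Level 3 - Data Mining: a (toy) model of the phenomenon.
model = LogisticRegression().fit(claims[["antitheft_device"]], claims["theft"])
print(model.predict_proba(pd.DataFrame({"antitheft_device": [1]}))[0, 1])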
4. KDD & DM Research Opportunities and Challenges
This handbook covers the current state of the art of Data Mining. The field is still
in its early stages in the sense that further basic methods are being developed. The art expands
but so does the understanding and the automation of the nine steps and their interrelation. For
this to happen we need a better characterization of the KDD problem spectrum and its definition.
The terms KDD and DM are not well-defined in terms of what methods they contain, what types of problems are best solved by these methods, and what results to expect. How do KDD/DM compare to statistics, machine learning, operations research, etc.? Are they a subset or a superset of these fields? An extension or adaptation of them? Or a separate field by itself? In addition to the methods, which are the most promising fields of application, and what vision does KDD/DM bring to these fields? Certainly, we already see great results and achievements of KDD/DM, but we cannot yet estimate them with respect to the potential of this field. All these basic questions have to be studied.
5. KDD & DM Trends
We see several trends for future research and implementation, including [8]:
Active DM – closing the loop, as in control theory, where changes to the system are made
according to the KDD results and the full cycle starts again. Stability and controllability,
which will be significantly different in this type of system, need to be well-defined.
Full taxonomy – for all the nine steps of the KDD process. We have shown a taxonomy
for the DM methods, but a taxonomy is needed for each of the nine steps. Such a
taxonomy will contain methods appropriate for each step (even the first one), and for the
whole process as well.
Meta-algorithms – algorithms that examine the characteristics of the data in order to
determine the best methods, and parameters (including decompositions).
Benefit analysis – to understand the effect of the potential KDD/DM results on the
enterprise.
Problem characteristics – analysis of the problem itself for its suitability to the KDD
process.
Expanding the database for Data Mining inference to include also data from pictures,
voice, video, audio, etc. This will require adapting and developing new methods (for
example, for comparing pictures using clustering and compression analysis).
Distributed Data Mining – The ability to seamlessly and effectively employ Data Mining
methods on databases that are located in various sites.
This problem is especially challenging when the data structures are heterogeneous rather
than homogeneous.
Expanding the knowledge base for the KDD process, including not only data but also
extraction from known facts to principles (for example, extracting from a machine its
principle, and thus being able to apply it in other situations).
Expanding Data Mining reasoning to include creative solutions, not just the ones that
appear in the data, but being able to combine solutions and generate new approaches.
The last two trends are beyond the scope of the KDD/DM definition used here, which leads to the final point: defining KDD/DM for the next phase of this science.
6. The Organization of the Handbook
This handbook is organized in eight parts. Starting with the KDD process, through to part
six, the book presents a comprehensive but concise description of different methods used
throughout the KDD process. Each part describes the classic methods as well as the extensions
and novel methods developed recently. Along with the algorithmic description of each method,
the reader is provided with an explanation of the circumstances in which this method is
applicable and the consequences and the trade-offs of using the method including references for
further readings. Part seven presents real-world case studies and how they can be solved. The last
part surveys some software and tools available today.
The first part is about preprocessing methods, starting with data cleansing, followed by
the handling of missing attributes. Next, issues in feature extraction, selection and dimension reduction are discussed. These chapters are followed by discretization methods and outlier detection. This covers the preprocessing methods (Steps 3 and 4 of the KDD process).
The Data Mining methods start in the second part with an introduction to the very
often-used decision tree method, followed by other classical methods, such as Bayesian
networks, regression (in the Data Mining framework), support vector machines and rule
induction.
The third part of the handbook considers the unsupervised methods, starting with
visualization (suited for high-dimensional databases). Then the important methods of clustering, association rules and frequent set mining are treated. Finally, two more topics are presented in this part: constraint-based Data Mining and link analysis.
The fourth part is about methods termed soft computing, which include fuzzy logic,
evolutionary algorithms, reinforcement learning and neural networks, and it ends with granular computing and rough sets.
Having established the foundation, we now proceed with supporting methods needed for Data
Mining in the fifth part, starting with statistical methods for Data Mining, followed by logic, wavelets and fractals.
Having covered the basics, we proceed with advanced methods in the sixth part, which
covers topics like meta-learning, bias versus variance, and rare cases. Additional topics include
mining high dimensional data, text mining and information extraction, spatial methods,
imbalanced data sets, relational Data Mining, web mining, causality, ensemble and
decomposition methods, information fusion, parallel and grid-based, collaborative and
organizational Data Mining.
With all the methods described so far, the next section, the seventh, is concerned with
applications for medicine, biology, manufacturing, design, telecommunication and finance. The
next topic is about intrusion detection with Data Mining methods, followed by software testing,
CRM application and target marketing.
The last and final part of this handbook deals with software tools. This part is not a
complete survey of the software available, but rather a selected representative from different
types of software packages that exist in today’s market. This section begins with the public-domain, open-source research software Weka, followed by two integrated tools (Data Mining tools
integrated with database, data warehouse and the entire support software environment)
represented by Oracle and Microsoft. These software systems employ various Data Mining
methods discussed in detail previously in the book.
7. Summary
The potential benefits of discovery-driven Data Mining techniques in extracting valuable information from large, complex databases are unlimited. Successful applications are surfacing in industries and areas where data retrieval is outpacing our ability to effectively analyze its content. Users must also be aware of the potential ethical conflicts of using sensitive information.
References
[1] Hey, T., Tansley, S., and Tolle, K. (Eds.) (2009), Jim Gray on e-Science: A Transformed Scientific Method. Microsoft Research.
[2] Liu, H., Wu, Z. H., Zhang, X., and Hsu, D. F. (2013), A skeleton pruning algorithm based on
information fusion. Pattern Recognition Letters 34(10), pages 1138-1145.
[3] Tan, P., Steinbach, M., and Kumar, V. (2006), Introduction to Data Mining, Addison Wesley.
[4] Dietterich, T., (2000), An Empirical Comparison of Three Methods for Constructing
Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, 40(2),
pages 139–157.
[5] Calders, T., Goethals, B., (2002), Mining all non-derivable frequent item sets. In Proceedings
of the 6th European Conference on Principles and Practice of Knowledge Discovery in
Databases (PKDD’02) Lecture Notes in Artificial Intelligence, volume 2431 of LNCS, pages 74–
85. Springer-Verlag.
[6] Domingos, P., Hulten, G., (2000), Mining High-Speed Data Streams. Proceedings of the
Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining,
Boston, MA, pages 71–80.
[7] Kumar, V., Kumar, N., and Pandey, K. M., Data Mining and Business Intelligence: Concept and Component. IJARSE, 4(5), pages 188-191.
[8] Zaki, M., (2000), Scalable algorithms for association mining. IEEE Transactions on
Knowledge and Data Engineering, 12(3), pages 372–390.
[9] Rokach, L., Maimon, O., (2001), Theory and Application of Attribute Decomposition,
Proceedings of the First IEEE International Conference on Data Mining, IEEE Computer Society
Press.