Data Mining Techniques


Data Mining Techniques 2016

White Paper

Data Mining Techniques


Prepared by

Mehmet BEYAZ

TTG International, L.T.D.

www.ttgint.com

30/06/2016

Words of Wisdom

You will see it as you like to see.

- Mevlana Jalaluddin Rumi-


Introduction

Everyone knows that the Internet and smartphones have changed how businesses operate,
governments function, and society lives and communicates. Recently, a new technological trend has
proved just as transformative: big data. Big data starts with the fact that there is far more
information floating around these days than ever before, and it is being put to extraordinary new
uses. Big data is about more than just communication. We now live in the world of big data, and the
idea is that we can learn from a large body of information things that we could not comprehend when
we used only smaller amounts.

DATA MINING

We live in a world that holds a vast amount of digital data, known as big data. The world is also
becoming more and more connected via the Internet of Things (IoT), which has been a major influence
on the big data landscape. These data are collected deliberately from many different sources, at
intervals ranging from five minutes to hourly and daily. The analysis of such big data takes
business competition to the next level of innovation and productivity. Therefore, the extraction and
interpretation of hidden patterns in data sets is of great importance. Data mining is a modern tool
that aims to discover meaningful knowledge in large data sets and to predict trends. Data mining
offers not only a retrospective view of a business process, but also enables people to develop a
successful market strategy.

Origins

Data mining originated in the 1980s, when it was introduced and used within the research
community. Data mining is also known as KDD (Knowledge Discovery in Databases) and is sometimes
referred to as data analytics. Strictly, data mining is defined as the component of the KDD process
that deals with the examination of inner patterns in databases, whereas KDD is also concerned with
the evaluation and interpretation of the discovered patterns. Although the exact meanings of the
terms KDD and data mining differ, they are often used interchangeably. In this paper I use KDD and
data mining as synonyms unless otherwise specified. Data mining is the analysis of large
observational data sets to find unknown relationships within the data set and to
summarize the data in novel ways that are both understandable and useful to the data owner. The
computational methods of data mining lie at the intersection of classical statistics, artificial
intelligence, and machine learning. As a whole knowledge discovery process, data mining also
involves many disciplines, such as databases, data cleaning, visualization, exploratory data analysis,
and performance and KPI evaluation.

Methods

Data mining techniques are categorized into supervised, semi-supervised, and unsupervised
methods. In a supervised method you have input variables (X) and an output variable (Y),
and you use an algorithm to learn the mapping function from the input to the output.


Y = f(X)

The goal is to approximate the mapping function so well that, when you have new input data
(X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process.
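
As a minimal sketch of learning the mapping Y = f(X), the Python example below uses a
one-nearest-neighbour rule. That rule is chosen here only for illustration and is not named in
this paper; the function name fit_nearest_neighbour and the animal measurements are hypothetical.

```python
import math


def fit_nearest_neighbour(train_x, train_y):
    """Return a function f that approximates Y = f(X) from labelled examples."""

    def f(x):
        # Predict the label of the closest training input (Euclidean distance).
        nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
        return train_y[nearest]

    return f


# Hypothetical training data: (height in cm, weight in kg) -> label.
train_x = [(30.0, 4.0), (35.0, 5.0), (60.0, 25.0), (65.0, 30.0)]
train_y = ["cat", "cat", "dog", "dog"]

f = fit_nearest_neighbour(train_x, train_y)
print(f((33.0, 4.5)))   # prediction for a new, unseen input
print(f((62.0, 27.0)))
```

The returned function f plays the role of the learned mapping: it is built only from the
training pairs and is then applied to inputs it has never seen.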

Unlike the supervised approach, the goal of the unsupervised technique is to model the underlying
structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because, unlike supervised learning above, there are no
correct answers and there is no teacher. Algorithms are left to their own devices to discover
and present the interesting structure in the data.

Data scientists also identify semi-supervised learning, which is similar to supervised learning.
Problems where you have a large amount of input data (X) and only some of the data is
labelled (Y) are called semi-supervised learning problems.

These problems sit between supervised and unsupervised learning.

A good example is a photo archive where only some of the images are labelled (e.g. dog, cat,
cow, person) and the majority are unlabelled.

Many real-world machine learning problems fall into this area. This is because it can be
expensive or time-consuming to label data, as it may require access to domain experts,
whereas unlabelled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input
variables.
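
One way to combine a few labels with many unlabelled records can be sketched as follows. The
example uses self-training with a nearest-neighbour rule, which is an assumption made for
illustration and not a method prescribed in this paper; the function name self_train and the
sample points are hypothetical.

```python
import math


def self_train(labelled, unlabelled):
    """Pseudo-label the unlabelled points one at a time, closest first."""
    known = list(labelled)      # list of (point, label) pairs
    pending = list(unlabelled)  # list of points without a label
    while pending:
        # Pick the unlabelled point that lies closest to any already labelled point.
        _, point, label = min(
            ((math.dist(p, q), p, lab) for p in pending for q, lab in known),
            key=lambda t: t[0],
        )
        known.append((point, label))  # from now on treat the pseudo-label as given
        pending.remove(point)
    return known


# Two labelled records and five unlabelled ones (hypothetical values).
labelled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabelled = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (9.0, 9.0), (8.0, 8.0)]

result = dict(self_train(labelled, unlabelled))
print(result)
```

Because each newly labelled point can pass its label on to its neighbours, the labels spread
along the structure of the unlabelled data rather than depending on the two original labels alone.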

In Summary

In this paper you learned the difference between supervised, unsupervised, and semi-supervised
learning. You now know that:

Supervised: All data is labelled and the algorithms learn to predict the output from the input
data.
Unsupervised: All data is unlabelled and the algorithms learn the inherent structure from the
input data.
Semi-supervised: Some data is labelled but most of it is unlabelled, and a mixture of
supervised and unsupervised techniques can be used.


The aims of data mining can be divided into different process categories. Discovery focuses on
searching a database for hidden patterns without a predefined hypothesis about the nature of the
pattern, and on deriving a model of the causal generator of the data. Data mining tasks usually
fall into two main categories, predictive and descriptive; see Figure 1.

Predictive:

Classification aims to categorize unseen input data records into known classes. The
assignment model, or classifier, learns from the training data set, where the relationship
between records and classes is provided.
Time series forecasting predicts the future value of a target function based on previously
observed measurements.
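
To illustrate forecasting from previously observed measurements, the sketch below predicts the
next value as the mean of the most recent observations. A moving average is only one of many
possible forecasting methods and is an assumption made here; the name moving_average_forecast,
the window size, and the traffic figures are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical hourly measurements, oldest first.
hourly_traffic = [120.0, 130.0, 125.0, 135.0, 140.0]

forecast = moving_average_forecast(hourly_traffic)
print(forecast)
```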

Figure 1. Data mining techniques.

Both predictive and descriptive data mining need data in which to find patterns, and each
category is further divided into specific tasks. The predictive tasks also include the following.
Regression aims to predict numerical values for input data records. The mapping function
is learned from the training data set, where the relationship between records and their values is
known.
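
A minimal sketch of regression is shown below. It assumes a single input variable and a
straight-line mapping fitted by ordinary least squares; both choices are assumptions made for
illustration, and the name fit_line and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training records with known numerical values (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # value predicted for a new record x = 5
print(slope, intercept, prediction)
```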


Anomaly detection extracts points, or outliers, that differ considerably from the rest
of the data points.
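
One simple way to make "considerably different" concrete is a z-score rule, sketched below. The
rule, the threshold of 2.0 standard deviations, the name find_anomalies, and the readings are
all assumptions made for illustration.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical measurements with one suspicious reading.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 50.0]

anomalies = find_anomalies(readings)
print(anomalies)
```

A single extreme value inflates both the mean and the standard deviation, so in practice the
threshold has to be tuned to the data or replaced by a more robust measure.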

Descriptive:

Clustering identifies groups of points, called clusters, with similar properties or behaviours.
Association analysis discovers relationships between records within the same data set.
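
Clustering can be sketched with the k-means algorithm, which is used here as an assumed example
because this paper does not name a specific clustering method. The function name k_means, the
naive initialisation from the first k points, and the sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning them to the nearest centroid."""
    centroids = list(points[:k])  # naive, deterministic initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, clusters


# Hypothetical two-dimensional records forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.0), (9.0, 10.0)]

centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied at any point; the groups emerge only from the distances between records,
which is what distinguishes this descriptive task from classification.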

Knowledge Discovery in Databases Process

KDD is the automatic, exploratory analysis and modelling of large data sources. It is
the organized process of identifying valid, novel, useful, and human-understandable patterns
in large and complex data sets. Data mining is the core of the KDD process, involving
algorithms that explore the data, develop the model, and discover previously unknown
patterns. The KDD process is iterative, interactive, and consists of nine steps.

Figure 2

The unifying goal of the KDD process is to extract useful information from data in the context of large
databases. Data mining refers to the set of computational methods that extract valuable patterns
from the original data. Additionally, the KDD process is concerned with the handling of massive data,
scaling algorithms for better performance, proper interpretation of the retrieved information, and
human interaction with the overall process. The KDD process is a sequential analysis that includes the
following steps (see Figure 2):

selection,
pre-processing,
transformation,
data mining,
and interpretation/evaluation.

However, this sequential knowledge extraction approach may involve iterations, because at any
point the data analyst can change settings and repeat previous steps. The process starts with
determining the KDD goals and ends with the implementation of the discovered knowledge. Thus
the basic KDD sequence may include closed loops: the effects are measured on the new
data repositories, and the KDD process is launched again.

The knowledge exploration process starts with the development of the necessary theoretical and
practical background in the application domain. Understanding the relevant knowledge is
important for achieving the customer's goals. The following briefly describes the main stages of
the nine-step KDD process:

Selection

This step selects the target data set based on the goals. It determines what data will be used for
knowledge discovery: finding out what data is available, obtaining any additional data that is
needed, and integrating all of it into one data set. This step is very
important because data mining learns and discovers from the available data.

Pre-processing

The quality of the selected data is often inadequate for further analysis, for multiple
reasons. Outliers, missing values, or a high level of noise in the measurements require special
data-handling strategies. Data reliability is therefore enhanced in this stage.
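
A minimal pre-processing sketch is given below. It assumes that missing values are marked as
None, that they are filled with the median of the observed values, and that plausible lower and
upper limits are known in advance; all three are assumptions made for illustration, as are the
name clean and the sample values.

```python
import statistics


def clean(measurements, lower, upper):
    """Fill missing values with the median and clamp outliers into [lower, upper]."""
    observed = [m for m in measurements if m is not None]
    median = statistics.median(observed)
    cleaned = []
    for m in measurements:
        value = median if m is None else m     # impute a missing value
        value = min(max(value, lower), upper)  # pull an outlier back to the nearest limit
        cleaned.append(value)
    return cleaned


# Hypothetical temperature readings with one gap and one faulty value.
raw = [21.0, None, 22.5, 400.0, 20.5]

cleaned = clean(raw, lower=0.0, upper=50.0)
print(cleaned)
```

The median is used rather than the mean so that the faulty reading does not distort the value
chosen to fill the gap.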

Transformation

This step can be crucial for the success of the entire KDD project, and it is usually very project
specific. Transformation projects the original data into a low-dimensional embedded space
(dimension reduction) and includes both linear and nonlinear methods. The reduced set of embedded
features allows visual inspection and facilitates the further mining of knowledge.
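
As an example of a linear method, the sketch below reduces two-dimensional records to one
dimension with principal component analysis (PCA). PCA is an assumed choice, since this paper
does not name a specific method; the name pca_2d_to_1d and the sample points are hypothetical,
and the closed-form angle used here applies to the two-dimensional case only.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n
    # Angle of the leading eigenvector of that matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # One coordinate per record: its position along the principal direction.
    projected = [x * direction[0] + y * direction[1] for x, y in centred]
    return direction, projected


# Hypothetical records that lie exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

direction, projected = pca_2d_to_1d(points)
print(direction)
print(projected)
```

For data that lies on a straight line, as here, the single projected coordinate keeps all of the
variation; for real data some variation is lost, and how much is acceptable depends on the project.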

Data mining

The core element of the KDD process is the data mining phase, which includes several steps.
Depending on the customer's goal, a specific data mining task is chosen, such as classification,
anomaly detection, regression, or clustering. There are two major goals in data mining: prediction
and description. The chosen data mining algorithm is then executed to search for underlying patterns
and valuable knowledge.

Interpretation/Evaluation

The final step of the KDD process is the interpretation and evaluation of the retrieved information with
respect to the goals defined in the first step. This step involves techniques for visual analysis and a
number of performance metrics. The correct interpretation of results is important, because it allows
checking assumptions and tuning the parameters of previous KDD components.
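
The performance metrics are not listed in this paper. As an assumed example for a classification
or anomaly detection task, the sketch below computes accuracy, precision, and recall; the name
evaluate and the label values are hypothetical.

```python
def evaluate(actual, predicted, positive):
    """Compute accuracy, precision, and recall for one class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical ground truth and model output.
actual = ["anomaly", "normal", "normal", "anomaly", "normal"]
predicted = ["anomaly", "normal", "anomaly", "normal", "normal"]

metrics = evaluate(actual, predicted, positive="anomaly")
print(metrics)
```

Reading several metrics together matters when one class is rare: a model that never reports an
anomaly can still reach a high accuracy while its recall is zero.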

Finally, the discovered knowledge and the designed KDD algorithm may be incorporated into an existing
business model. The possible usage scenarios encompass reporting and prediction, and the optimization
and automation of business processes.


