
MACHINE LEARNING AND ITS IMPLEMENTATION TECHNIQUES

1. INTRODUCTION

In the past decade there has been a rapid paradigm shift in the field of computer science due to landmark achievements in artificial intelligence. Machine learning, a subfield of artificial intelligence, has pushed the capability of imparting intelligence across various disciplines beyond the horizon. In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed" (Samuel 1959). Machine learning algorithms work on the premise that learning happens persistently from training data or past experience, and that performance can be enhanced by synthesizing the underlying relationships among the data and the given problem without human intervention. In contrast with classical optimization problems, machine learning algorithms generally encompass a well-defined function that is optimized through learning. This optimization of decision-making processes based on learning has led to a rapid rise in automation in innumerable areas such as healthcare, finance, retail and e-governance. Although machine learning has been considered a giant step forward in the AI revolution, the development of neural networks has taken AI to a completely new level. Deep learning, a subset of machine learning that incorporates neural networks as its building blocks, has made remarkable advances in natural language and image processing.

With the big data landscape able to store the massive amount of data generated every day by various businesses and users, machine learning algorithms can harvest this exponentially growing data to derive accurate predictions. The complexity of maintaining a large computational on-premises infrastructure to ensure successful learning has been efficiently addressed through cloud computing, which eliminates the need to maintain expensive computing hardware, software and dedicated space. Businesses have started adopting Machine Learning as a Service (MLaaS) into their technology stacks since, as the name suggests, these offerings provide machine learning as part of a cloud service. The major attraction is that these services offer data modeling APIs, machine learning algorithms, data transformations and predictive analytics without having to install software or provision servers, just like any other cloud service. Moreover, MLaaS can help manage big data better by collecting huge amounts of data, correlating it, crunching the numbers and understanding its patterns to help businesses make quick decisions. As data sources proliferate along with the computing power to process them, going straight to the data is one of the most straightforward ways to quickly gain insights and make predictions. The combination of these two mainstream technologies yields beneficial outcomes for organizations. Machine learning is heavily recommended for problems that involve complex learning. However, it is essential to remember that machine learning is not always the optimal solution for every type of problem; there are certain problems where robust solutions can be developed without using machine-learning techniques.

This chapter explores the end-to-end process of investigating data through a machine-learning lens: how to extract and identify useful features from the data, some of the most commonly used machine-learning algorithms, and how to identify and evaluate the performance of machine learning algorithms. Section 2 introduces the steps for developing a suitable machine learning model and the various paradigms of machine learning, such as supervised, unsupervised and reinforcement learning. Section 3 discusses applications of machine learning in various fields and then concludes the chapter with research insights.
2. DEVELOPING A MACHINE LEARNING MODEL

As discussed, machine learning is the field where an agent is said to learn from experience with respect to some class of tasks and a performance measure P. The task could be answering exams in a particular subject, or it could be diagnosing patients with a specific illness. As shown in figure 1 below, artificial intelligence (AI) is the broader field of building systems that react to given stimuli, machine learning is the subset of AI that uses statistical techniques for knowledge discovery, and deep learning is the subset of machine learning that uses artificial neural networks for the learning process.

Figure 1: Taxonomy of Knowledge Discovery


Further, machine learning can be categorized into supervised, unsupervised and reinforcement learning. Whatever the task, machine learning involves three components. The first component is defining the set of tasks on which learning will take place, and the second is setting up a performance measure P. Whether or not learning is happening, defining some performance criterion P is mandatory in machine learning tasks. Consider the example of answering questions in an exam: the performance criterion would be the number of marks obtained. Similarly, for diagnosing patients with a specific illness, the performance measure could be the number of patients who did not have an adverse reaction to the given drugs. There are therefore different ways of defining performance metrics depending on what you are looking for within a given domain. The last important component of machine learning is experience. For example, experience in the case of writing exams could be writing more exams, meaning the more you write, the better you get; in the case of diagnosing illnesses it could be the number of patients examined, i.e. the more patients you look at, the better you become at diagnosing illness. Hence, these are the three components involved in learning: a class of tasks, a performance measure and well-defined experience. This kind of learning, where you learn to improve your performance based on experience, is known as inductive learning. There are various machine-learning paradigms, as shown in figure 2.

Figure 2: Categorization of Various Machine Learning Algorithms


The first paradigm is supervised learning, where one learns a map from inputs to outputs. For example, in diagnosing patients the input could be a description of the patient who comes to the clinic and the output would be whether the patient has a certain disease or not. Similarly, in the example of writing an exam, the input could be some kind of equation and the output would be the answer to the question, or it could be a true-or-false question, i.e. given a description of the question you have to state whether it is true or false. The essential part of supervised learning is therefore mapping from the given input to the required output. If the output is categorical, such as whether the patient has a disease or not, or whether the answer is true or false, the task is called classification. If the output is a continuous value, such as how long a product will last before it fails or what the expected rainfall is tomorrow, the task is called regression. Thus, classification and regression are the two classes of supervised learning.
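As a minimal illustration of these two supervised settings, the following sketch, assuming scikit-learn is available and using its bundled toy datasets, fits a classifier to categorical labels and a regressor to continuous targets.

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: categorical output (flower species)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: continuous output (disease progression score)
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```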
The second paradigm is unsupervised learning, where an input-to-output mapping is not required. The main goal is not to produce an output in response to a given input, but to discover patterns in the data. In unsupervised learning there is therefore no desired output that we are looking for; instead the aim is to find closely related patterns in the data. Clustering is one such task, where we try to find cohesive groups among the given input patterns. For example, one might look at the customers who come to a shop and want to figure out whether they fall into different categories, such as college students or IT professionals. The other popular unsupervised learning task is association rule mining, or frequent pattern mining, where one is interested in finding frequent co-occurrences of items in the data, e.g. whenever customer A comes to the shop, customer B also comes to the shop. One can thus learn these kinds of relationships through associations in the data.
The third form of learning is reinforcement learning, which is neither supervised nor unsupervised in nature. In reinforcement learning there is an agent acting in an environment, and the goal is to figure out what action the agent must take at every step; the action the agent takes is based on the rewards or penalties it receives in different states.
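To make the reward-driven loop concrete, here is a minimal tabular Q-learning sketch on a hypothetical five-state corridor; the environment, the state count and the reward values are invented for illustration only.

```python
import random

N_STATES, ACTIONS = 5, [-1, +1]               # corridor of 5 states; move left or right
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 500
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(episodes):
    s = 0                                     # start at the left end
    while s != N_STATES - 1:                  # episode ends at the right end
        if random.random() < epsilon:
            a = random.choice(ACTIONS)                        # explore
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])     # exploit current estimates
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0                # reward only at the goal state
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Greedy action learned for each non-terminal state (should be "move right", i.e. +1)
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})
```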
Apart from these three types of learning, a fourth is also possible: semi-supervised learning. It is a combination of supervised and unsupervised learning, i.e. you have some labeled training data and a larger amount of unlabeled training data, and you try to learn from both so that the approach works even when labeled training data is limited.
Irrespective of the domain and the type of learning, every task needs some kind of performance measure. In classification, the performance measure would be the classification error, i.e. the number of misclassified instances relative to the total number of instances. Similarly, the prediction error is the usual performance measure in regression: if the forecast is 23 millimeters of rain and it ends up raining 49 centimeters, the large difference between the actual and predicted value is the prediction error. In the case of clustering it is a little harder to define performance measures, since we do not know what a good clustering is or how to measure the quality of clusters. There are therefore different kinds of clustering measures; one of them is the scatter, or spread, of a cluster, which essentially tells you how spread out the points belonging to a single group are. Good clustering algorithms should minimize intra-cluster distance and maximize inter-cluster distance. Association rule mining uses measures called support and confidence, whereas reinforcement learning tries to minimize the cost accrued while controlling the system. Several challenges exist when trying to build a machine learning solution to a given problem, and a few of these are given below.
The first issue is how good a model is and what type of performance measure is used. Most of the measures discussed above can prove insufficient, and other practical considerations such as user skills and experience come into play when selecting a model and its measures. The second issue is, of course, the presence of noisy and missing data, which leads to errors in the predicted values. Suppose a value in medical data is recorded as 225 with no unit: 225 days would be a reasonable number, 22.5 years or 22.5 months would also be reasonable, but 225 years is not a reasonable number, so there is something wrong in the data. Finally, the biggest challenge is the size of the dataset, since most algorithms perform well when data is large, but not all of them. The following are the basic steps to be followed while developing any kind of machine-learning application:
A. Formulating the Problem / Defining Your Machine Learning Problem
B. Collecting Labeled Data / Gathering Data
C. Preparing the Data / Analyzing Your Data
D. Feature Selection / Feature Processing
E. Splitting the Data into Training and Evaluation Data
F. Choosing a Model
G. Training
H. Evaluation
I. Parameter Tuning
J. Prediction / Generating and Interpreting Predictions
The following subsections give a detailed scheme for developing a suitable machine-learning model for a given problem.

A. Describing the Problem

The first step in developing a model is to clearly define and describe the problem that needs to be addressed with machine learning. In other words, formulating the core of the problem will help in deciding what the model has to predict. The formulation can be done in different ways, such as understanding the problem through a sentence description or deriving the problem from similar problems solved in the past. How to define the problem varies depending on the use case or business need. It is very important to avoid over-complicating the problem and to frame the simplest solution that meets the requirement. The motivation for solving the problem should be evaluated against how the solution would benefit the business. Some of the common ways of describing a problem are:

o Similar problems

After detailed discussions with stakeholders and identification of the pain points, the most common and affordable strategy is to derive the problem from previous similar experiences. Other problems can inform details about the current problem by highlighting limitations such as time dimensions and concept drift, and can point to algorithms and data transformations that could be adapted to spot-check performance.
o Informal description
The other simplest way is to describe the problem informally by highlighting its basic aspects in a sentence, to form an initial understanding of the possible solution. However, this step should be used only for initial problem formation and supplemented with another approach for detailed problem formulation.

o Using assumptions
Create a list of assumptions about the problem, such as domain-specific information, that will lead to a viable solution that can be tested against real data. It can also be useful to highlight areas of the problem specification that may need to be challenged, relaxed or tightened.

o Formalism
The most structured approach is Tom Mitchell's machine learning formalism: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Use this formalism to define the T, P, and E for your problem:
 Task (T)
 Experience (E)
 Performance (P)
B. Data Collection

This step is the most expensive and time-consuming aspect of any machine learning project, because the quality and quantity of the data gathered will directly determine the success of the project. The paper published by Yuji Roh et al. (2018) discusses in detail the high-level research landscape of data collection for machine learning, covering data acquisition, data labeling and improving the labeling of existing data. Machine learning problems require a lot of data for better prediction. With the rapid adoption of standard IoT solutions, enormous volumes of sensor data can be collected from industry for machine learning problems; other sources, such as social media and third-party data providers, can also supply enough data for better predictions. Labeled data is a group of samples that have been tagged with one or more labels: labeling typically takes a set of unlabeled data and augments each piece of it with some sort of meaningful "tag", "label" or "class" that is informative or desirable to know. In supervised machine learning, the algorithm teaches itself to learn from these labeled examples. Often, data is not readily available in a labeled form, and collecting and preparing the variables and the target are among the most important steps in solving a problem. The example data should be representative of the data the model will use to make predictions. Unsupervised learning is the opposite of supervised learning, where unlabeled data is used because a labeled training set does not exist. Semi-supervised learning aims to integrate unlabeled and labeled data to build better and more accurate models.
C. Data Preparation
Data preparation is the process of combining, structuring and organizing data so it
can be analyzed through machine learning applications. Good enough visualizations of the
data will help in finding any relevant relationships between the different variables and to
find any data imbalances present. The Collected data is spilt into two parts. The first part
that is used in training the model will be the majority of the dataset and the second will be
used for the evaluation of the trained model’s performance.
The data might need a lot of cleaning and preprocessing before it is feed into the
machine learning system. The process of cleaning involves various processes such as
getting rid of errors & noise and removal of redundancies to avoid ambiguities that arise
out of the data. The preprocessing involves renaming categorical values to numbers,
rescaling (normalization), abstraction, and aggregation.

The appropriate usage of attributes can lead to unexpected improvements in model accuracy. Deriving new attributes from the training data during modeling can boost model performance, and removal of redundant or duplicate attributes can likewise increase performance. Transformations of the training data can reduce the skewness of the data as well as the prominence of outliers. Outliers are extreme values that fall a long way outside the other observations. Outliers in the input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models and ultimately poorer results. They can also skew the summary distribution of attribute values in descriptive statistics, such as the mean and standard deviation, and in plots such as histograms and scatterplots, compressing the body of the data. Charu C. Aggarwal, in his book "Outlier Analysis", suggests methods such as:
o Extreme Value Analysis: Determine the statistical tails of the underlying
distribution of the data. For example, statistical methods like the z-scores on
univariate data.
o Probabilistic and Statistical Models: Determine unlikely instances from a
probabilistic model of the data.
o Linear Models: Projection methods that model the data into lower dimensions
using linear correlations.
o Proximity-based Models: Data instances that are isolated from the mass of the
data as determined by cluster, density or nearest neighbor analysis.
o Information Theoretic Models: Outliers are detected as data instances that
increase the complexity (minimum code length) of the dataset.
o High-Dimensional Outlier Detection: Methods that search subspaces for outliers, addressing the breakdown of distance-based measures in higher dimensions (the curse of dimensionality).
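As a small illustration of the extreme value analysis mentioned above, the sketch below flags univariate outliers with a z-score threshold; the synthetic data and the cut-off of 3 are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=23.0, scale=1.5, size=50), 225.0)  # one extreme entry

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]      # conventional |z| > 3 cut-off
print("flagged as outliers:", outliers)
```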

D. Feature Engineering & Feature Selection

Feature engineering is about creating new input features from existing ones. It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. It is an iterative process that interplays with data selection and model evaluation, repeated until better prediction is achieved. The process involves the following steps:

 Brainstorm features: examine the problem and data closely, and study feature engineering on other problems to extract similar patterns.
 Devise features: decide on automatic feature extraction, manual feature construction, or a mixture of the two.
 Select features: use different feature importance scorings and feature selection methods to prepare different views of the model.
 Evaluate models: estimate model accuracy on unseen data using the chosen features.

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in constructing a machine learning model; it is also called variable selection or attribute selection. The data features used to train machine learning models have a huge influence on performance, so choosing irrelevant or partially relevant features can negatively influence model performance. A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets and an evaluation measure that scores the different subsets. In real-world applications, models often choke on the very high dimensionality of the data, along with the exponential increase in training time and the risk of overfitting as the number of features grows. Identifying better features provides flexibility: even a slightly wrong choice of algorithm can still end up giving good results. The three general classes of feature selection algorithms are filter methods, wrapper methods and embedded methods.

o Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either kept or removed from the dataset. Some examples of filter methods include Pearson's correlation, linear discriminant analysis, ANOVA and chi-square.
o Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. Some examples of wrapper methods include recursive feature elimination, forward feature selection and backward feature elimination.
o Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods and are implemented by algorithms that have their own built-in feature selection. Some examples of embedded methods include regularization algorithms (LASSO, Elastic Net and Ridge Regression), the memetic algorithm, and random multinomial logit.
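The following sketch, assuming scikit-learn and its bundled iris dataset, contrasts a filter method (ANOVA scoring via SelectKBest) with a wrapper method (recursive feature elimination around a logistic regression); the choice of k=2 features is arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: rank features by ANOVA F-score and keep the top 2
filter_sel = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("filter keeps features:", filter_sel.get_support(indices=True))

# Wrapper method: recursively eliminate features using a model's coefficients
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("wrapper keeps features:", wrapper_sel.get_support(indices=True))
```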

E. Splitting up the Data


The crux of any machine-learning model is to generalize beyond the instances used to train it; in other words, a model should be judged on its ability to predict new, unseen data. Evaluating (testing) the model with the same data used for training will therefore generally result in overfitting. It should be noted that the model should neither overfit nor underfit the data. The common strategy is to take all the available labeled data and split it into training and evaluation (testing) subsets. The training set is used to fit and tune the model, while the test set is put aside as unseen data to evaluate the model. The split is usually done with a ratio of 70-80 percent for training and 20-30 percent for evaluation, depending on the nature of the problem and the model that is adopted. One best practice is to make this split before starting the training process in order to get reliable estimates of the model's performance. In addition, the test data should be kept aside until the model has been trained well enough to handle unseen data. Comparing performance on the test dataset against the training dataset gives a clear picture of whether the model overfits the data. Some of the common ways of splitting up the labeled data are:
o Sequential Split
A sequential split is the simplest way to guarantee distinct training and evaluation subsets. This method is convenient when the data has a date or time range, since it retains the order of the data records.
o Random Split
A random split is the most commonly adopted approach since it is easy to implement. However, for more complex models random selection can result in high variance.
o Cross-validation
Cross-validation is a method for getting a reliable estimate of model performance using only the training data. There are several ways to cross-validate; the most commonly used is k-fold cross-validation. In k-fold cross-validation the data is split into k equally sized folds, k models are trained, and each fold in turn is used as the holdout set while the model is trained on all remaining folds.
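Here is a minimal sketch of both strategies with scikit-learn; the 80/20 ratio and k=5 are just common choices in line with the discussion above, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold-out split: 80% training, 20% kept aside for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation on the training portion only
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("cross-validation accuracy per fold:", scores)
```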

F. Choosing an appropriate Model

Choosing a suitable machine-learning model can be confusing because it depends on a number of factors:
o Nature of problem
The nature of the problem can be a significant factor in deciding which machine-learning model works best among the possible models.
o Volume of the training set
The volume of the training set can help in selecting machine-learning models based on their bias and variance characteristics.
o Accuracy
Deciding on the required level of accuracy can sometimes guide the choice of a suitable model. If the project does not require the most accurate results, approximate methods can be adopted; approximation reduces processing time and usually helps avoid overfitting.
o Training time
The training time depends heavily on the required accuracy and the volume of the dataset. If the training time is limited, it can be a considerable factor in picking a model, particularly when the dataset is large.
o Number of parameters
The number of parameters can alter the model's behavior in various ways, such as error tolerance or the number of iterations. Moreover, algorithms with large numbers of parameters require the most trial and error to find a good combination.
o Number of features
The number of features in a dataset can be very large compared to the number of data points. A huge number of features can pull down the efficiency of some learning models, making training time very long.
The following two-step process can guide the choice of model.

Step 1: Categorize the problem

The problem can be categorized by the input fed into the machine-learning model or by the output expected from it. If the input data is labeled, a supervised learning model is a good choice; if the input data is unlabeled, an unsupervised learning model can be adopted. Reinforcement learning models can be used to optimize an objective function by interacting with the environment. Similarly, if the output of the model is a number, regression models suit best, whereas classification models are an ideal solution if the output of the model is a class. Clustering models are most appropriate when the output is a set of input groups.

Step 2: Find the available algorithms

Once the categorization of the problem is complete, the appropriate model can be pinpointed with ease. Some of the commonly used algorithms are discussed below for better understanding. Machine learning algorithms are classified as follows:
o Supervised learning
o Unsupervised learning
o Semi-supervised learning
o Reinforcement learning

G. Training the model


The process of training works by finding a relationship between a label and its features. The training dataset is used to prepare the model, i.e. to train it. Each sample in the selected training data defines how each feature affects the label, and this data is used to incrementally improve the model's ability to predict. The process is repeated, updating the model to fit the data as well as possible; a single iteration of this process is called one training step. In general, a trained model is not exposed to the test dataset during training, and any predictions made on that dataset are designed to be indicative of the model's performance. Model training is the crux of machine learning and is done by fitting a model to the data; in other words, training a model with existing data fits the model parameters. Parameters are key to machine learning models since they are the part of the model that is learned from historical training data. There are two types of parameters used in machine learning models.

o Model parameters

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. These parameters are learnt during training by the classifier or other machine-learning model, and they capture the estimate of what has been learned. Example: the support vectors in a support vector machine.

o Hyperparameters

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. Hyperparameters are usually fixed before the actual training process begins; an example is the number of clusters in k-means clustering. In essence, a hyperparameter is a parameter whose value is set before the learning process begins, while the values of other parameters are derived via training. Model hyperparameters are common across similar models and cannot be learnt during training but are set beforehand. The key distinction is that model parameters can be learned directly from the training data while hyperparameters cannot. Other training parameters are:

o Learning Rate

The amount of change applied to the model during each step of the search process, or the step size, is called the learning rate; it is perhaps the most important hyperparameter to tune for a neural network in order to achieve good performance. The learning rate is an important parameter in gradient descent.

o Model Size

Model size depends on the product being used and what is included in the model. It can vary from implementation to implementation, with the type of problem (classification, regression), the algorithm (SVM, neural network, etc.), the data type (image, text, etc.), the feature size and so on. Large models have practical implications, such as requiring more RAM to hold the model while training and when generating predictions. Model size can be reduced by using regularization or by explicitly restricting the model to a maximum size.

o Regularization

Generalization refers to how well the concepts learned by a machine-learning model apply to examples not seen by the model while it was learning. Overfitting refers to a model that fits the training data too well and performs poorly on unseen data. Regularization helps prevent models from overfitting the training examples by penalizing extreme weight values. Some common regularization techniques are L2 and L1 regularization, dropout, data augmentation and early stopping.
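To tie the learning rate and regularization together, here is a minimal NumPy sketch of gradient descent on ridge (L2-regularized) linear regression; the synthetic data, the learning rate of 0.1 and the penalty of 0.1 are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
learning_rate, l2_penalty = 0.1, 0.1

for step in range(200):
    error = X @ w - y                                # prediction error on the training data
    grad = X.T @ error / len(y) + l2_penalty * w     # gradient of MSE/2 + (l2/2)*||w||^2
    w -= learning_rate * grad                        # step size controlled by the learning rate

print("learned weights:", np.round(w, 2))            # shrunk slightly toward zero by the L2 penalty
```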
Figure 3 gives the most popular and widely used algorithms in machine learning applications. The following subsections discuss these techniques in detail.

Figure 3: Popular Machine Learning Algorithms

 Supervised Learning

As shown in figure 4, supervised learning is categorized into two types:

o Classification
o Regression

These are machine learning techniques used for data analysis, for example to construct classification models and to predict future trends.
i. Classification
Classification models are used to predict categorical class labels, whereas prediction (regression) models predict continuous values.
For example, a classification model for a bank is used to classify loan applications as either safe or risky, while a prediction model is used to predict the potential customers who will buy computer equipment given their income and occupation.
Some other examples of classification tasks in data analysis are given below.
 A bank loan officer wants to analyze the data in order to predict which loan applicants are risky and which are safe.
 A marketing manager at a company needs to analyze which customers with a given profile will buy a new computer, as shown in figure 3.
In both cases, a model is constructed to predict categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

Figure 4: Supervised Learning Algorithms

Classification is the task of building a model that describes and distinguishes the data classes of objects. It is used to predict the class label for an object whose class label is not available (Jiawei et al. 2006), and it is an example of learning from samples. The first phase, called model construction or the training phase, builds a model based on the features present in the training data. This model is then used to predict class labels for the testing data, where class label information is not available. A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to construct the model and the test set used to validate it.
Decision trees are commonly used to represent classification models. A decision tree is a flowchart-like structure where every internal node represents a test on an attribute value, branches denote test outcomes, and leaves represent the actual classes. Other standard representation techniques include k-nearest neighbor, Bayesian classification algorithms, if-then rules and neural networks (Jiawei et al. 2006). The effectiveness of prediction depends on the training dataset used to train the model.
Classification is a two-step process:
o Phase I: Building the Classifier (Training Phase)
o Phase II: Using the Classifier (Testing Phase)

Phase I: Training Phase

This is the first step in classification; here a classification algorithm is used to construct the classifier model shown in figure 6. The model is built from the training dataset, which contains tuples (records) with their associated class labels, and each tuple in the training dataset belongs to a category or class.
Consider a training dataset for a bank_loan schema containing values for the following attributes:
<Name, Age, Income, Loan_decision>
The class label here is Loan_decision and the possible class labels are risky, safe and low_risky. If, for example, the classification algorithm is ID3, then the classification model is the decision tree shown in figure 5 below. A decision tree is a tree that includes a root node, branches and leaf nodes: each internal node denotes a test on an attribute, each branch denotes the outcome of the test, and each leaf node holds a class label. The node without a parent is the root node; nodes without children are called leaf nodes and represent the outcomes.
Once the decision tree is built, IF-THEN rules over the nodes of the tree are used to find the class label of each tuple in the testing dataset. The following six rules can be derived from the tree.
1. If Age=young and Income=low then Loan_decision= risky
2. If Age=Senior and Income=low then Loan_decision= risky
3. If Age=Middle_Aged and Income=low then Loan_decision= risky
4. If Age=young and Income=High then Loan_decision= Safe
5. If Age=Middle_Aged and Income=High then Loan_decision=Safe
6. If Age=Senior and Income=High then Loan_decision= Low_risky

Figure 5. Decision Tree


Once the model is built, the next step is testing the classifier using a sample testing dataset, as shown in figure 7. The testing dataset is used to measure the accuracy of the classification model. Two different metrics, precision and recall, are commonly used for measuring the accuracy of a classification model. Figure 8 gives a sample decision tree constructed with the ID3 algorithm.

Figure 6. Training Process of Classification


Phase II: Testing Phase

Figure 7. Testing Process of Classification

Figure 8: Sample Decision tree using ID3 Algorithm
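As a rough sketch of the two phases, the code below trains a decision tree on a tiny, made-up loan table and evaluates it on a held-out portion. Note that scikit-learn's DecisionTreeClassifier implements an optimized CART-style algorithm rather than ID3, and the data, encodings and split are invented for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Toy bank_loan data: Age (0=young, 1=middle_aged, 2=senior), Income (0=low, 1=high)
X = [[0, 0], [2, 0], [1, 0], [0, 1], [1, 1], [2, 1], [0, 0], [2, 1], [1, 1], [0, 1]]
y = ["risky", "risky", "risky", "safe", "safe", "low_risky",
     "risky", "low_risky", "safe", "safe"]

# Phase I: training
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
tree = DecisionTreeClassifier().fit(X_tr, y_tr)
print(export_text(tree, feature_names=["Age", "Income"]))   # IF-THEN style rules

# Phase II: testing
y_pred = tree.predict(X_te)
print("precision:", precision_score(y_te, y_pred, average="macro", zero_division=0))
print("recall:", recall_score(y_te, y_pred, average="macro", zero_division=0))
```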

ii. Regression
Prediction is an analytic process designed to explore data for consistent patterns or systematic relationships among variables, and then to validate the findings by applying the detected patterns to new subsets of data. (Jiawei et al. 2006) observe that predictive data mining is the most common type of data mining and has the most direct business applications.
The process of a predictive data mining task consists of three stages:
 Data exploration
 Model building
 Deployment
Data exploration usually starts with data preparation, which may involve data cleaning, data transformations, selecting subsets of records and feature selection. Feature selection is one of the important operations in the exploration process: when a dataset has a large number of variables, it reduces that number to a manageable range through some preliminary selection operations. Then a simple choice of straightforward predictors for a regression model is used to elaborate the exploratory analyses; exploratory data analysis with graphical and statistical methods is the most widely used approach. Model building and validation involve considering various models and choosing the best one based on predictive performance. Deployment is the final step, which takes the best model from the previous step and applies it to new data in order to generate predictions or estimates of the expected outcome.
Both classification and prediction are used for data analysis, but there are some issues in preparing the data for analysis. Data preparation involves the following activities:
o Data Cleaning − Data cleaning involves removing noisy, incomplete and inconsistent data and handling missing attribute values. Noisy data is removed by applying smoothing techniques such as binning, and the problem of missing values is handled by replacing a missing value with the most commonly occurring value for that attribute, with the mean value of that attribute, with a global constant, and so on.
o Relevance Analysis − Datasets may also have some irrelevant attributes, so correlation analysis is performed to determine whether any two given attributes are related; irrelevant attributes are removed.
o Normalization − Normalization involves scaling all values of a given attribute so that they fall within a small specified range, e.g. Min-Max normalization (see the short sketch after this list).
o Generalization − Data generalization maps data at low levels to higher-level concepts, thereby reducing the number of values of an attribute; concept hierarchies can be used for this purpose. An example is shown in figure 8 below.
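A minimal sketch of the Min-Max normalization mentioned above, using scikit-learn's MinMaxScaler; the column of income values is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

incomes = np.array([[18000.0], [25000.0], [42000.0], [90000.0]])

# x_scaled = (x - min) / (max - min), mapping every value into [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(incomes).ravel())
```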
 Unsupervised Learning
Figure 9: Unsupervised Learning Categorization

Unsupervised learning is the process of grouping objects based on the similarity present in the data. It can be classified into three types, as shown in figure 9.
o Clustering
o Association Rule Mining
o Dimensionality Reduction
i. Clustering
Clustering is the process of grouping objects into classes of similar objects based on some similarity measure between them (Sathiyamoorthi & Murali Baskaran 2011b). It is an unsupervised learning method. Each cluster can be represented as one group: while performing cluster analysis, the objects are first partitioned into groups based on the similarity between them and then class labels are assigned to those groups. The main difference between clustering and classification is that clustering is adaptable to changes and helps select useful features that distinguish objects into different groups. Some of the popular algorithms are shown in figure 10.

Figure 10: Popular Algorithms in Unsupervised Learning

All these algorithms belong to one of the following methods, as shown in figure 11.

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Figure 11: Types of Clustering

o Partitioning Method
Given a database of 'n' objects, a partitioning algorithm groups the objects into 'k' partitions, where k ≤ n. The groups should satisfy the following criteria.
 Each group contains at least one object.
 Each object must belong to exactly one group.
 Objects within a cluster are highly similar, and objects in different clusters are highly dissimilar.
K-means is the most popular algorithm in this category. It works as follows (see the sketch after these points).
 For a given number of partitions K, K-means creates an initial partitioning representing K clusters using some distance measure.
 It then uses an iterative technique to improve the partitioning by moving objects from one group to another. The limitation of K-means is that K (the number of partitions) is fixed before clustering begins and does not change.
 Another algorithm is K-medoids, an improvement over K-means that provides better performance.
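A minimal K-means sketch with scikit-learn on synthetic 2-D data; the three blobs and the choice of K=3 are assumptions made only for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Three synthetic, well-separated groups of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K is fixed up front, as noted above; the algorithm then iteratively refines the partition
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```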

o Hierarchical Clustering
This method creates a hierarchical decomposition of the given objects into various groups. Two approaches are used for the decomposition.
 Agglomerative approach
 Divisive approach
In the agglomerative approach, clustering starts with each object forming its own group. It then keeps merging the objects or groups that are close to one another, repeating until all of the groups are merged into one or until the termination condition holds. It is also known as the bottom-up approach.
In the divisive approach, clustering starts with all the objects in a single cluster that forms the root. In each iteration the cluster is split into smaller clusters of objects that are close to one another, proceeding downward until each object is in its own cluster or the termination condition holds. This method is inflexible in the sense that once a merge or split is done it cannot be undone. It is also known as the top-down approach.

o Density-based Clustering
This approach is based on the notion of density: each cluster should have a minimum number of data objects within the cluster radius, and a cluster keeps growing as long as the density in its neighborhood exceeds some threshold.

o Grid-based Clustering
In this approach, the object space is quantized into a finite number of cells that form a grid structure. The main advantage of this approach is that it produces clusters faster and takes less processing time.

o Model-based Clustering
In this approach, a model is hypothesized for each cluster and the best fit of the data objects to the given clusters is found. This method locates clusters using a density function that reflects the spatial distribution of the data objects. It determines the number of clusters based on statistics, taking outliers or noise into account, and thus yields robust clustering.

o Constraint-based Clustering
In this approach, clustering is performed by incorporating user or application constraints or requirements. A constraint is a user expectation or a property of the desired clustering results. This approach is interactive, since constraints provide an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application.

ii. Association Rule Mining

As defined by (Jiawei et al. 2006), an association rule identifies a collection of data attributes that are statistically related to one another. The association rule mining problem can be defined as follows: given a database of related transactions, a minimal support and a minimal confidence value, find all association rules whose confidence and support are above the given thresholds. In general, it produces dependency rules that predict the occurrence of an object based on the occurrences of other objects.
An association rule is of the form X -> Y, where X is called the antecedent and Y the consequent. Two measures assist in identifying frequent items and generating rules from them: confidence, which is the conditional probability of Y given X, Pr(Y|X), and support, which is the probability that X and Y occur together, Pr(X and Y) (Jiawei et al. 2006). Rules can be classified as single-dimensional or multidimensional association rules based on the number of predicates they contain (Jiawei et al. 2006), and the technique can be extended to fit application domains such as genetic analysis and electronic commerce. Apriori, FP-Growth and the vertical data format algorithm are some of the standard algorithms used to identify the frequent items present in a large dataset (Jiawei et al. 2006). Association rule mining metrics are shown in figure 12.

Figure 12: Evaluating Association Rules

The algorithms used for association rule mining are given below; a small worked example of support and confidence follows this list.

o Apriori algorithm
o FP-Growth
o Vertical data format algorithm
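A minimal, plain-Python sketch of computing support and confidence for the rule {A} -> {B} over a handful of made-up transactions; the algorithms listed above search all itemsets efficiently, whereas this sketch only evaluates a single rule.

```python
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

n = len(transactions)
count_A = sum(1 for t in transactions if "A" in t)
count_AB = sum(1 for t in transactions if {"A", "B"} <= t)

support_AB = count_AB / n                 # Pr(A and B)
confidence_A_to_B = count_AB / count_A    # Pr(B | A)

print(f"support({{A,B}}) = {support_AB:.2f}")            # 3/5 = 0.60
print(f"confidence(A -> B) = {confidence_A_to_B:.2f}")   # 3/4 = 0.75
```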

Issues Related to Clustering


 Scalability: Clustering algorithms should be scalable and able to handle large databases.
 Ability to deal with different kinds of attributes: Clustering algorithms should be capable of handling different kinds of data, such as numerical, categorical and binary data.
 Discovery of clusters with arbitrary shape: Clustering algorithms should be capable of producing clusters of arbitrary shape using different measures.
 High dimensionality: Clustering algorithms should be designed to handle both low- and high-dimensional data.
 Ability to deal with noisy data: Data sources may contain noisy, missing or erroneous data, whose presence may lead to poor quality clusters. Clustering algorithms should therefore be designed to handle noisy, missing and erroneous data and still produce high quality clusters.
 Interpretability: The results of clustering should be readable, interpretable and comprehensible, and useful to the end users.

iii. Dimensionality Reduction

In statistics, machine learning and information theory, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Principal Component Analysis (PCA) is one of the most popular linear dimension reduction techniques. It is sometimes used alone and sometimes as a starting point for other dimension reduction methods. PCA is a projection-based method that transforms the data by projecting it onto a set of orthogonal axes.
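A minimal PCA sketch with scikit-learn, projecting the four-dimensional iris data onto its first two orthogonal components; the choice of two components is arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # project 4-D measurements onto 2 principal axes

print("reduced shape:", X_2d.shape)
print("variance explained by each component:", pca.explained_variance_ratio_)
```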
H. Evaluating the model
Choosing the right evaluation metrics for a machine learning model helps the model learn patterns that generalize well to unseen data instead of just memorizing the training data. Evaluation shows how well the model performs and helps in selecting better parameters. However, using all the different metrics available at once creates chaos because of the different outputs they generate. Different machine learning tasks have different performance metrics, so the choice of metric depends entirely on the type of model and its implementation plan. There are different metrics for classification, regression, ranking, clustering, topic modeling, and so on.

Underfitting vs. Overfitting

Figure 13: Model Evaluation based on Overfitting and Underfitting

o Underfitting

A statistical model or machine-learning algorithm is said to underfit when it cannot capture the underlying trend of the data; in other words, the model does not fit the data well enough. Underfitting occurs when the model or algorithm shows low variance but high bias.

o Overfitting
A statistical model is said to have overfitted when it fits the training data too closely but cannot capture the underlying trend and so performs poorly on unseen data. Overfitting occurs when the model or algorithm shows low bias but high variance. Both overfitting and underfitting lead to poor predictions on new data sets. Some of the commonly used evaluation techniques and metrics are hold-out validation, bootstrapping, hyperparameter tuning, classification accuracy, logarithmic loss, the confusion matrix, area under the curve, F1 score, mean absolute error and mean squared error.
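A small sketch computing a few of these metrics with scikit-learn, on made-up true and predicted labels and values.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics on hypothetical true vs. predicted class labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Regression metrics on hypothetical true vs. predicted continuous values
r_true = [23.0, 30.5, 12.0, 45.0]
r_pred = [25.0, 28.0, 15.0, 40.0]
print("MAE:", mean_absolute_error(r_true, r_pred))
print("MSE:", mean_squared_error(r_true, r_pred))
```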

I. Tuning the parameter


Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem, even though the term "parameter" may be unfamiliar if you are new to applied machine learning. Some of the parameters to tune when optimizing neural networks (NNs) include:

o learning rate
o momentum
o regularization
o dropout probability
o batch normalization

Once evaluation is over, further improvement of the training may be possible by tuning the parameters. A few parameters were implicitly assumed when the training was done, and this is a good time to go back, test those assumptions and try other values. One example is how many times the training dataset is run through during training: showing the model the full dataset multiple times, rather than just once, can sometimes lead to higher accuracy. Another parameter is the learning rate, which defines how far the line is shifted during each step, based on the information from the previous training step. These values all play a role in how accurate the model can become and in how long the training takes.
For more complex models, initial conditions can play a significant role in determining the outcome of training. Differences can be seen depending on whether a model starts training with values initialized to zeros or drawn from some distribution of values, which then leads to the question of which distribution to use. Since there are many considerations at this phase of training, it is important to define what makes a model "good enough", otherwise you may find yourself tweaking parameters for a very long time. These parameters are referred to as hyperparameters. The adjustment, or tuning, of these hyperparameters remains a bit of an art; it is an experimental process that depends heavily on the specifics of the dataset, the model and the training process. Once you are happy with your training and hyperparameters, guided by the evaluation step, it is time to use the model to do something useful.

J. Generating Predictions
Machine learning is about using data to answer questions. The final step is where the model predicts the outcome and establishes a learning path. Predictions can be generated in two ways: real-time predictions and batch predictions.
o Real-time predictions

Real-time predictions are used when predictions are needed at low latency. A typical real-time prediction API accepts a single input observation serialized as a JSON string and synchronously returns the prediction and associated metadata as part of the API response. The API can be invoked more than once simultaneously to obtain synchronous predictions in parallel.

o Batch predictions

Asynchronous predictions, or batch predictions, are used when there are a number of observations for which predictions should be obtained all at once. The process uses a data source as input and outputs the predictions into a .csv file stored in an S3 bucket of your choice; you need to wait until the batch prediction process completes before you can access the prediction results.

3. APPLICATIONS OF DATA MINING


Data mining applications include (Sathiyamoorthi, 2016):
 Market basket analysis and management
o Helps in determining customer purchase patterns, i.e. what kind of consumer is going to buy what kind of products.
o Helps in finding the best products for different consumers; here prediction is the data mining technique used to find user interests based on the available data.
o Performs correlation analysis between products and sales.
o Helps in finding clusters of consumers who share the same purchase characteristics, such as user interests, regular habits, monthly income and so on.
o Provides multidimensional analysis of user data and supports various summary reports.
 Corporate analysis and risk management in industries
o Performs cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
o Summarizes and compares resource utilization, i.e. how many resources are allocated and how many are currently available, which helps in production planning and control.
o Performs current trend analysis, monitoring competitors and predicting future market directions.

 Fraud detection or outlier detection

o Also known as outlier analysis, it is used in fields such as credit card analysis and approval and in the telecommunications industry to detect fraudulent users.
o In the telecommunications domain, it helps in finding the destination of a fraudulent call, its duration, the time at which the user made the call, the day or week of the call, and so on.
o It helps in analyzing patterns that deviate from normal behavior, called outliers.

 Spatial and time series data analysis

o Predicting stock market trends and bond analysis
o Identifying areas that share similar characteristics

 Image retrieval and analysis


o Image segmentation and classification
o Face recognition and detection
 Web mining
o Web content mining
o Web structure mining
o Web log mining
