1. INTRODUCTION
In the past decade there has been a rapid paradigm shift in computer science due to breakthrough achievements in artificial intelligence. Machine learning, a subfield of artificial intelligence, has extended the capability of imparting intelligence across various disciplines. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed" (Samuel 1959). Machine learning algorithms learn persistently from training data or past experience and enhance their performance by synthesizing the underlying relationships among the data and the given problem without human intervention. In contrast with optimization problems, machine learning algorithms generally encompass a well-defined function that can be optimized through learning. This optimization of decision-making processes based on learning has led to a rapid rise in automation across innumerable areas such as healthcare, finance, retail and e-governance. Although machine learning has been considered a giant step forward in the AI revolution, developments in neural networks have taken AI to a completely new level. Deep learning, a subset of machine learning that incorporates neural networks as its building blocks, has made remarkable advances in natural language and image processing.
With the big data landscape able to store the massive amount of data generated every day by various businesses and users, machine learning algorithms can harvest this exponentially growing data to derive accurate predictions. The complexity of maintaining a large on-premises computational infrastructure to ensure successful learning has been efficiently addressed through cloud computing, which eliminates the need to maintain expensive computing hardware, software and dedicated space. Businesses have started adopting Machine Learning as a Service (MLaaS) into their technology stacks, since these offerings provide machine learning as part of a cloud service. The major attraction is that they offer data modeling APIs, machine learning algorithms, data transformations and predictive analytics without the need to install software or provision servers, just like any other cloud service. Moreover, MLaaS can help manage big data better by collecting huge amounts of data, correlating it, crunching numbers and understanding patterns so that businesses can take quick decisions. As data sources proliferate along with the computing power to process them, going straight to the data is one of the most straightforward ways to quickly gain insights and make predictions. The combination of these two mainstream technologies yields beneficial outcomes for organizations. Machine learning is heavily recommended for problems that involve complex learning. However, it is essential to remember that machine learning is not always the optimal solution for every type of problem; there are problems where robust solutions can be developed without machine learning techniques.
This chapter explores the end-to-end process of investigating data through a machine-learning lens: how to extract and identify useful features from the data, some of the most commonly used machine-learning algorithms, and how to identify and evaluate the performance of machine learning algorithms. Section 2 introduces the steps for developing a suitable machine learning model and the various paradigms of machine learning, such as supervised, unsupervised and reinforcement learning. Section 3 discusses applications of machine learning in various fields and then concludes the chapter with research insights.
2. DEVELOPING A MACHINE LEARNING MODEL
As discussed, machine learning is the field where an agent is said to learn from experience with respect to some class of tasks and a performance measure P. The task could be answering exams in a particular subject or diagnosing patients with a specific illness. As shown in figure 1 below, machine learning is a subset of artificial intelligence (AI), which builds systems that react intelligently to given stimuli; machine learning uses statistical techniques for knowledge discovery. Deep learning is a subset of machine learning that uses artificial neural networks for the learning process.
F. Choosing a model
G. Training
H. Evaluation
I. Parameter Tuning
J. Prediction / Generating and Interpreting Predictions
The following subsections give a detailed scheme for developing a suitable machine-learning model for a given problem.
A. Defining the Problem
The first step in developing a model is to clearly define and describe the problem that needs to be addressed with machine learning. In other words, formulating the core of the problem will help in deciding what the model has to predict. The formulation can be done in different ways, such as understanding the problem through a sentence description or deriving the problem from similar problems solved in the past. How to define the problem varies depending upon the use case or business need. It is very important to avoid over-complicating the problem and to frame the simplest solution that meets the requirement. The motivation for solving the problem should be evaluated against how the solution would benefit the business. Some of the common ways of describing a problem are:
o Similar problems
After detailed discussions with stakeholders and identification of the pain points, the most common and affordable strategy is to derive the problem from previous similar experiences. Other problems can inform details about the current problem by highlighting limitations such as time dimensions and concept drift, and can point to algorithms and data transformations that could be adapted to spot-check performance.
o Informal description
The other simple way is to describe the problem informally, highlighting the basic aspects of the problem in a sentence to gain an initial understanding of the possible solution. However, this step should be used only for initial problem formation and substituted with another approach for detailed problem formulation.
o Using Assumptions
Creating a list of assumptions about the problem, such as domain-specific information, will lead to a viable solution that can be tested against real data. It can also be useful to highlight areas of the problem specification that may need to be challenged, relaxed or tightened.
o Formalism
The most structured approach is Tom Mitchell’s machine learning formalism. A
computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
Use this formalism to define the T, P, and E for your problem.
Task (T):
Experience (E):
Performance (P):
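As a hypothetical illustration for an email spam-filtering problem, the formalism could be filled in as follows: the task T is classifying incoming emails as spam or not spam; the experience E is a corpus of emails already labeled by users; and the performance measure P is the percentage of emails classified correctly.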
B. Data Collection
This step is the most expensive and time-consuming aspect of any machine learning project, because the quality and quantity of data gathered directly determine the success of the project. Yuji Roh et al. (2018) discuss in detail the high-level research landscape of data collection for machine learning, covering data acquisition, data labeling and improving the labeling of existing data. Machine learning problems require large amounts of data for better prediction. With the rapid adoption of standard IoT solutions, enormous volumes of sensor data can be collected from industries for machine learning problems; other sources such as social media and third-party data providers can also supply enough data for better predictions. Labeled data is a group of samples that have been tagged with one or more labels: labeling typically takes a set of unlabeled data and augments each piece of it with a meaningful "tag," "label," or "class" that is informative or desirable to know. In supervised machine learning, the algorithm teaches itself to learn from these labeled examples. Often, data is not readily available in a labeled form, and collecting and preparing the variables and the target are among the most important steps in solving a problem. The example data should be representative of the data the model will use to make predictions. Unsupervised learning is the opposite of supervised learning: unlabeled data is used because a training set does not exist. Semi-supervised learning aims to integrate unlabeled and labeled data to build better and more accurate models.
C. Data Preparation
Data preparation is the process of combining, structuring and organizing data so it can be analyzed by machine learning applications. Good visualizations of the data help in finding relevant relationships between the different variables and any data imbalances. The collected data is split into two parts: the first part, used to train the model, will be the majority of the dataset, and the second will be used to evaluate the trained model's performance.
The data might need a lot of cleaning and preprocessing before it is fed into the machine learning system. Cleaning involves processes such as removing errors and noise and eliminating redundancies to avoid ambiguities that arise from the data. Preprocessing involves recoding categorical values as numbers, rescaling (normalization), abstraction, and aggregation.
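As a minimal sketch of these preparation steps, assuming Python with pandas and scikit-learn and hypothetical column names (occupation, income, label), the following snippet illustrates missing-value handling, recoding of categorical values, rescaling, and the train/evaluation split; it is illustrative only, not a prescribed pipeline.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset with a categorical feature, a numeric feature and a label
df = pd.DataFrame({
    "occupation": ["engineer", "teacher", None, "engineer"],
    "income": [72000, 48000, 51000, None],
    "label": [1, 0, 0, 1],
})

# Cleaning: fill missing values (mode for categorical, mean for numeric)
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])
df["income"] = df["income"].fillna(df["income"].mean())

# Preprocessing: recode categorical values to numbers
df["occupation"] = df["occupation"].astype("category").cat.codes

# Rescaling (normalization) of numeric features to [0, 1]
df[["income"]] = MinMaxScaler().fit_transform(df[["income"]])

# Split into a training part and an evaluation part
X_train, X_test, y_train, y_test = train_test_split(
    df[["occupation", "income"]], df["label"], test_size=0.2, random_state=42)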
Feature engineering is about creating new input features from existing ones. It is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. It is an iterative process that interplays with data selection and model evaluation, repeated until better prediction is achieved. The process involves the following steps:
Brainstorm features: Examine the problem and data closely, and study feature engineering on other problems to extract similar patterns.
Devise features: Decide on automatic feature extraction, manual feature construction, or a mixture of the two.
Select features: Use different feature importance scorings and feature selection methods to prepare different views of the model.
Evaluate models: Estimate model accuracy on unseen data using the chosen features.
o Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and either kept or removed from the dataset. Some examples of filter methods include Pearson's correlation, linear discriminant analysis, ANOVA and chi-square.
o Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. Some examples of wrapper methods include recursive feature elimination, forward feature selection, and backward feature elimination.
o Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods. Some examples of embedded methods include regularization algorithms (LASSO, Elastic Net and Ridge Regression), the memetic algorithm, and random multinomial logit.
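To make the filter and wrapper approaches concrete, the sketch below assumes scikit-learn and uses the built-in iris data: a chi-square filter scores and keeps the best two features, and recursive feature elimination wraps a logistic regression model. The estimator and the number of features kept are illustrative assumptions, not recommendations.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: rank features by a chi-square statistic and keep the best two
filter_selector = SelectKBest(score_func=chi2, k=2)
X_filtered = filter_selector.fit_transform(X, y)
print("Chi-square scores:", filter_selector.scores_)

# Wrapper method: recursive feature elimination around a logistic regression model
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
wrapper_selector.fit(X, y)
print("Features kept by RFE:", wrapper_selector.support_)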
o Model parameters
o Hyper parameters
o Learning Rate
The amount of change to the model during each step of this search process, or the step size, is called the "learning rate." It is perhaps the most important hyperparameter to tune for a neural network in order to achieve good performance on a problem, and it is an important parameter in gradient descent.
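A minimal sketch of how the learning rate controls the step size in gradient descent is given below; it fits a single parameter w to made-up data by minimizing a squared error, and the value 0.1 for the learning rate is purely an assumption for illustration.

# Gradient descent on f(w) = mean((w * x - y)^2) for made-up data
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]          # true relationship is y = 2x
w = 0.0                       # initial parameter value
learning_rate = 0.1           # step size: too large diverges, too small is slow

for step in range(50):
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    w = w - learning_rate * grad   # move against the gradient, scaled by the learning rate

print(round(w, 3))                 # approaches 2.0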
o Model Size
Model size depends on the product being used and what is included in the model. It can vary with the implementation, the type of problem (classification, regression), the algorithm (SVM, neural network, etc.), the data type (image, text, etc.), the feature size, and so on. Large models have practical implications, such as requiring more RAM to hold the model while training and when generating predictions. The model size can be reduced by using regularization or by explicitly specifying a maximum size.
o Regularization
Supervised Learning
Supervised learning is one of the machine learning techniques used for data analysis and to construct classification models from labeled training data. It is used to predict future trends.
i. Classification
Classification models are used to predict categorical class labels, whereas prediction models predict continuous values.
For example, a classification model for a bank is used to classify loan applications as either safe or risky. A prediction model is used to predict the potential customers who will buy computer equipment, given their income and occupation.
Some other examples of data analysis task of classification are given below.
A bank loan officer wants to analyze the data in order to predict which loan applicants are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer, as shown in figure 3.
In both the cases, a model is constructed to predict the categorical labels. These labels are
risky or safe for loan application and yes or no for marketing data.
Classification is the task of building a model that describes and distinguishes the data classes of objects. The model is used to predict the class label of an object when class label information is not available (Jiawei et al. 2006). It is an example of learning from samples. The first phase, called model construction, is also referred to as the training phase, where a model is built based on the features present in the training data. This model is then used to predict class labels for the testing data, where class label information is not available. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to construct the model and the test set used to validate it.
Decision trees are commonly used to represent classification models. A decision tree is a flowchart-like structure where every internal node represents a test on an attribute value, branches denote test outcomes, and leaves represent the actual classes. Other standard representation techniques include K-nearest neighbor, Bayesian classification, if-then rules and neural networks (Jiawei et al. 2006). Classification is a supervised learning process, and the effectiveness of prediction depends on the training dataset used to train the model.
Classification is a two-step process:
o Phase I: Building the Classifier(Training Phase)
o Phase II: Using Classifier (Testing Phase)
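The two phases can be sketched with a decision tree classifier in scikit-learn, as below; the iris dataset and the 70/30 split are assumptions made only to illustrate the training phase followed by the testing phase.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Phase I: build the classifier on the training set (training phase)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Phase II: use the classifier on the test set (testing phase)
predictions = model.predict(X_test)
print("Accuracy on unseen data:", accuracy_score(y_test, predictions))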
ii. Regression
Prediction is an analytic process designed to explore data for consistent patterns or systematic relationships among variables and then to validate the findings by applying the detected patterns to new subsets of data. Jiawei et al. (2006) note that predictive data mining is the most common type of data mining and that it has the most direct business applications.
The process of predictive data mining task consists of three stages:
Data exploration
Model building
Deployment
Data exploration usually starts with data preparation, which may involve data cleaning, data transformations, selecting subsets of records and feature selection. Feature selection is one of the important operations in the exploration process: when a dataset has a large number of variables, preliminary feature selection operations reduce the number of variables to a manageable range. A simple choice of straightforward predictors for a regression model is then used for more elaborate exploratory analyses, with exploratory data analysis being the most widely used graphical and statistical method. Model building and validation involve considering various models and choosing the best one based on their predictive performance. Deployment is the final step, which involves taking the best model from the previous step and applying it to new data in order to generate predictions or estimates of the expected outcome.
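A minimal sketch of the model-building and deployment stages for prediction, assuming a simple linear regression on made-up income data, is shown below; the feature values and the new observation are hypothetical.

from sklearn.linear_model import LinearRegression

# Hypothetical training data: income (in thousands) vs. amount spent on equipment
incomes = [[30], [45], [60], [75], [90]]
spend = [1.0, 1.6, 2.1, 2.8, 3.3]

# Model building: fit a regression model to the explored data
model = LinearRegression()
model.fit(incomes, spend)

# Deployment: apply the model to new data to estimate the expected outcome
print(model.predict([[55]]))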
Both classification and prediction are used for data analysis, but there are some issues in preparing the data for analysis. Data preparation involves the following activities:
o Data Cleaning − Data cleaning involves removing noisy, incomplete and inconsistent data and handling missing values of an attribute. Noisy data is removed by applying smoothing techniques such as binning, and missing values are handled by replacing a missing value with the most commonly occurring value for that attribute, with the mean value of that attribute, with a global constant, and so on.
o Relevance Analysis − Datasets may also have some irrelevant attributes and
hence correlation analysis is performed to know whether any two given
attributes are related or not. All irrelevant attributes are removed.
o Normalization: Normalization involves scaling all values of a given attribute so that they fall within a small specified range, e.g. min-max normalization; a small worked sketch is given after this list.
o Generalization – This is a data generalization method where data at low levels is mapped to some higher level, thereby reducing the number of values of an attribute. Concept hierarchies can be used for this purpose; an example is shown in figure 8 below.
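As a small worked sketch of min-max normalization on made-up attribute values, each value x is rescaled to (x − min) / (max − min) so that all values fall in the range [0, 1]:

# Min-max normalization of a hypothetical 'age' attribute to the range [0, 1]
ages = [18, 25, 40, 60, 72]
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]
print(normalized)   # 18 -> 0.0, 72 -> 1.0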
Unsupervised Learning
Figure 9: Unsupervised Learning Categorization
Unsupervised learning is the process of grouping objects based on the similarity present in them. It can be classified into three types, as shown in figure 9:
o Clustering
o Association Rule
o Dimensionality Reduction
i. Clustering
Clustering is the process of grouping objects into classes of similar objects based on some similarity measure between them (Sathiyamoorthi & Murali Baskaran 2011b). It is an unsupervised learning method. Each cluster can be represented as one group; when performing cluster analysis, the objects are first partitioned into groups based on the similarity between them, and then class labels are assigned to those groups. The main difference between clustering and classification is that clustering is adaptable to changes and helps select useful features that distinguish objects into different groups. Some of the popular algorithms are shown in figure 10.
All of these algorithms belong to one of the following methods, as shown in figure 11.
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
o Partitioning Method
Given a database of ‘n’ objects, a partitioning algorithm groups the objects into ‘k’ partitions, where k ≤ n, such that the following criteria are satisfied:
Each group contains at least one object.
Each object belongs to exactly one group.
Objects within a cluster are highly similar, and objects in different clusters are highly dissimilar.
K-means is the most popular algorithm in this category. It works as follows: for a given number of partitions (say K), K-means creates an initial partitioning representing K clusters using some distance measure. It then uses an iterative technique to improve the partitioning by moving objects from one group to another. The problem with K-means is that K (the number of partitions) is fixed before clustering begins and does not change. K-medoids is an improvement of K-means that provides better performance.
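A minimal K-means sketch with scikit-learn, assuming a small set of made-up two-dimensional points and K fixed at 2 beforehand, is given below.

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical two-dimensional objects to be grouped
points = np.array([[1, 1], [1.5, 2], [1, 1.8],
                   [8, 8], [8.5, 9], [9, 8]])

# K is fixed before clustering; the algorithm then iteratively refines the partitions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster assignment of each point
print(kmeans.cluster_centers_)  # final cluster centers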
o Hierarchical Clustering
In this method, a hierarchical decomposition of the given objects into various groups is created. Two approaches are used for the decomposition:
Agglomerative Approach
Divisive Approach
In the agglomerative approach, clustering starts with each object forming a separate group. It then keeps merging the objects or groups that are close to one another until all of the groups are merged into one or a termination condition holds. It is also known as the bottom-up approach.
In the divisive approach, clustering starts with all the objects representing a single cluster as a root. In each iteration, it splits a cluster into smaller clusters of objects that are close to one another. It proceeds downwards, splitting clusters until each object is in its own cluster or a termination condition holds. Hierarchical methods are inflexible in that once a merge or split is done it cannot be undone. The divisive approach is also known as the top-down approach.
o Density-based Clustering
Density-based clustering is based on the concept of density: each cluster should have a minimum number of data objects within the cluster radius. A cluster keeps growing as long as the density in its neighborhood exceeds some threshold.
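A minimal density-based sketch using scikit-learn's DBSCAN is given below; the neighborhood radius (eps) and the minimum number of objects (min_samples) are illustrative assumptions.

from sklearn.cluster import DBSCAN
import numpy as np

points = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
                   [8, 8], [8.1, 8.2], [25, 25]])   # the last point is an outlier

# A cluster keeps growing while at least min_samples points lie within radius eps
db = DBSCAN(eps=1.0, min_samples=2)
labels = db.fit_predict(points)
print(labels)   # noise points are labeled -1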
o Grid-based Clustering
In grid-based clustering, the object space is quantized into a finite number of cells that together form a grid structure. The main advantage of this approach is that it produces clusters faster and takes less processing time.
o Model-based Clustering
In this approach, a model is hypothesized for each cluster and the best fit of the data objects to the given model is found. This method locates clusters using a density function that reflects the spatial distribution of the data objects. It determines the number of clusters based on statistics, taking outliers or noise into account, and therefore yields robust clustering algorithms.
o Constraint-based Clustering
In this approach, clustering is performed by incorporating user or application constraints or requirements. A constraint here is a user expectation or a property of the desired clustering results. The approach is highly interactive, since constraints provide an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application.
o Underfitting
o Overfitting
A statistical model is said to be overfitted when it learns the training data too well, capturing noise along with the underlying trend, so that it shows low bias but high variance and fails to generalize to new data. Underfitting occurs when the model cannot capture the underlying trend of the data and shows low variance but high bias. Both overfitting and underfitting lead to poor predictions on new data sets. Evaluation commonly relies on techniques such as hold-out validation, bootstrapping and hyperparameter tuning, and on metrics such as classification accuracy, logarithmic loss, the confusion matrix, area under the curve, F1 score, mean absolute error, and mean squared error.
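A minimal sketch of computing a few of these evaluation measures with scikit-learn, assuming hypothetical true and predicted labels, is shown below.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, mean_absolute_error, mean_squared_error)

# Hypothetical ground truth and model output for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:        ", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("F1 score:        ", f1_score(y_true, y_pred))

# Regression-style errors for hypothetical continuous predictions
print("MAE:", mean_absolute_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))
print("MSE:", mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))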
o learning rate
o momentum
o regularization
o dropout probability
o batch normalization
Once the evaluation is over, further improvement in training may be possible by tuning the parameters. A few parameters were implicitly assumed when the training was done, and this is a good time to go back, test those assumptions, and try other values. One example is how many times the training dataset is run through during training: the model can be shown the full dataset multiple times rather than just once, which can sometimes lead to higher accuracies. Another parameter is the learning rate, which defines how far the line is shifted during each step, based on the information from the previous training step. These values all play a role in how accurate the model can become and how long the training takes.
For more complex models, initial conditions can play a significant role in determining the outcome of training. Differences can be seen depending on whether a model starts training with values initialized to zeroes or to some distribution of values, which then leads to the question of which distribution to use. Since there are many considerations at this phase of training, it is important to define what makes a model "good enough"; otherwise, one can end up tweaking parameters for a very long time. These parameters are referred to as hyperparameters. Their adjustment or tuning remains something of an art and is an experimental process that depends heavily on the specifics of the dataset, model, and training process. Once you are satisfied with the training and hyperparameters, guided by the evaluation step, it is time to use the model to do something useful.
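Hyperparameter tuning of this kind is often automated; as a sketch under assumed settings, the snippet below uses scikit-learn's GridSearchCV over a small illustrative grid of values for a support vector machine, purely to show the tuning loop.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try (an assumed, illustrative grid)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Each combination is trained and evaluated with cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)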
J. Generating Predictions
Machine learning is about using data to answer questions. In this final step, the model predicts the outcome and establishes a learning path. Predictions can be generated in two ways: real-time predictions and batch predictions.
o Real-time predictions
Real-time predictions are used when predictions are needed at low latency. A real-time prediction API accepts a single input observation serialized as a JSON string and synchronously returns the prediction and associated metadata as part of the API response. The API can be invoked more than once simultaneously to obtain synchronous predictions in parallel.
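As an illustration only (the endpoint URL and payload fields below are hypothetical, not a specific vendor's API), a real-time prediction request might look like this in Python:

import json
import requests  # third-party HTTP client

# A single observation serialized as a JSON string (hypothetical feature names)
observation = json.dumps({"income": 52000, "occupation": "engineer"})

# Hypothetical real-time prediction endpoint; a real service would also require auth
response = requests.post(
    "https://example.com/v1/predict",
    data=observation,
    headers={"Content-Type": "application/json"},
    timeout=5,
)

result = response.json()
print(result.get("prediction"), result.get("metadata"))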
o Batch predictions