ML Lect1
Introduction
• Data mining is the process of finding patterns in large data sets (or Big Data) using
techniques from artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract useful information
from a data set and transform it into an understandable structure for further use.
• Data mining is the analysis step of the “Knowledge Discovery in Databases”
process, known as KDD.
• The objective is to discover patterns rather than the data itself.
We are data rich but information poor
Why Data Mining is required?
• The volume of information we have to handle increases every day:
business transactions, scientific data, sensor data, pictures,
videos, etc. We therefore need a system capable of extracting
the essence of the available information and automatically generating
reports, views, or summaries of the data for better decision-making.
Data Mining?
• The actual data mining task is the automatic or semi-automatic analysis of the
Why Data Mining
• The fast-growing, tremendous amount of data, collected and stored in large and
numerous data repositories, has far exceeded our human ability to understand it without
powerful tools.
Steps of KDD
1. Data Cleaning: the removal of noisy
and irrelevant data from the collection.
   a. Cleaning in the case of missing values.
   b. Cleaning noisy data, where noise is random error or variance.
   c. Cleaning with data discrepancy detection and data
   transformation tools.
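The first two cleaning cases can be sketched in a few lines of Python. This is only an illustration under assumptions the slide does not state: missing values are filled with the mean of the observed values, and noise is reduced by smoothing with bin means. The function names and the sample data are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Reduce noise: sort the values, split them into bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = [25, None, 31, 29, None, 35]            # hypothetical column with missing entries
cleaned = fill_missing(ages)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]    # hypothetical noisy measurements
smoothed = smooth_by_bin_means(prices, bin_size=3)

print(cleaned)
print(smoothed)
```

Other choices (dropping incomplete records, filling with the median, smoothing by regression) are equally valid; which one fits depends on the data.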
Why do we need Machine Learning?
• Some tasks cannot be defined well, except by examples (e.g.
recognition of faces or people).
• Large amounts of data may have hidden relationships and correlations.
Only automated approaches may be able to detect these.
• The amount of knowledge about a certain problem / task may be too
large for explicit encoding by humans (e.g. in medical diagnostics)
• Environments change over time, and new knowledge is constantly
being discovered. A continuous redesign of the systems “by hand” may
be difficult.
Machine Learning Concept
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
The Machine Learning Approach
[Figure: Input Data (e.g. Gene Expression Profiles, …) → Machine Learning → ML Classifier → Prediction: Yes / No]
Machine Learning
• Learning Task:
– What do we want to learn or predict?
• Data and assumptions:
– What data do we have available?
– What is their quality?
– What can we assume about the given problem?
• Representation:
– What is a suitable representation of the examples to be classified?
• Method and Estimation:
– Are there possible hypotheses?
– Can we adjust our predictions based on the given results?
• Evaluation:
– How well does the method perform?
– Might another approach/model perform better?
Classification
• Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data (i.e., data
objects whose class label is known).
Classification Example
Classification Algorithms
• LDA (Linear Discriminant Analysis)
• QDA (Quadratic Discriminant Analysis)
• k-NN (k-Nearest Neighbors)
• SVM (Support Vector Machine)
• Decision Tree
• Random Forest
• … and many more
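As an illustration of one of the listed algorithms, here is a minimal k-NN sketch in plain Python. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy training data are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: objects whose class label is known
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

# Objects whose class label is unknown
print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (4.0, 4.0)))
```

k-NN has no explicit training phase: the "model" is the stored training set, and all work happens at prediction time.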
Classification Accuracy
Evaluating a Classifier:
n-fold Cross Validation
• Suppose m labeled instances
– Divide into n subsets (“folds”) of equal size
• Run classifier n times, with each of the subsets as the test set
– The rest (n-1) for training
– Each run gives an accuracy result
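The procedure above can be sketched as follows. This is a minimal version that assumes m is divisible by n and that the instances are already shuffled. The threshold classifier is a hypothetical stand-in for any real classifier, and averaging the per-fold accuracies at the end is a common summary that the slide does not state.

```python
def n_fold_cross_validation(instances, labels, n, train_fn, predict_fn):
    """Run n train/test rounds, each using a different fold as the test set."""
    fold_size = len(instances) // n          # assumes m is divisible by n
    accuracies = []
    for i in range(n):
        start, end = i * fold_size, (i + 1) * fold_size
        test_x, test_y = instances[start:end], labels[start:end]
        train_x = instances[:start] + instances[end:]   # the other n-1 folds
        train_y = labels[:start] + labels[end:]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_x))
    return accuracies

# Hypothetical toy classifier: threshold halfway between the two class means
def train_threshold(xs, ys):
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

instances = [1.0, 5.0, 1.5, 5.5, 2.0, 6.0, 2.5, 6.5]   # m = 8 labeled instances
labels = [0, 1, 0, 1, 0, 1, 0, 1]

accuracies = n_fold_cross_validation(instances, labels, 4, train_threshold, predict_threshold)
print(accuracies)                          # one accuracy result per fold
print(sum(accuracies) / len(accuracies))   # mean accuracy over the folds
```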
Evaluating a Classifier:
Confusion Matrix
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
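A small worked example may help, assuming the usual reading of the abbreviations: TP = true positives, FP = false positives, FN = false negatives, TN = true negatives. The function name and the label vectors below are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted positive, actually positive
        elif p == positive:
            fp += 1      # predicted positive, actually negative
        elif a == positive:
            fn += 1      # predicted negative, actually positive
        else:
            tn += 1      # predicted negative, actually negative
    return tp, fp, fn, tn

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]      # hypothetical true labels
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # hypothetical classifier output

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)
print(precision, recall)
```

Here TP = 3, FP = 1, FN = 1, TN = 5, so both precision and recall are 3 / 4 = 0.75.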
What is Regression?
• Regression analysis is defined in Wikipedia as:
• In statistical modeling, regression analysis is a set of statistical
processes for estimating the relationships between a dependent
variable (often called the ‘outcome variable’) and one or
more independent variables (often called ‘predictors’, ‘covariates’,
or ‘features’).
Regression
Curve Fitting
Example: curve fitting
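The figure for this slide did not survive, so the following is only a stand-in sketch: it fits a straight line y = slope * x + intercept to a handful of points by ordinary least squares. The choice of a linear model, the function name `fit_line`, and the data points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]    # hypothetical independent variable
ys = [1.1, 2.9, 5.2, 6.8, 9.0]    # hypothetical dependent variable

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept    # predicted outcome for a new input x = 5
print(slope, intercept, prediction)
```

For this data the fitted line is roughly y = 1.97x + 1.06. Fitting a higher-degree polynomial follows the same idea with more coefficients.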
Agent
Data Mining/Machine learning Techniques
• Regression and classification are supervised learning approaches that map
an input to an output based on example input-output pairs, while clustering
is an unsupervised learning approach.
• Regression: It predicts a continuous-valued output. Regression analysis
is a statistical technique used to predict numeric data instead of
labels. It can also identify distribution trends based on the available
or historical data. Predicting a person’s income from their age and education is
an example of a regression task.
• Classification: It predicts one of a discrete set of values. In
classification, the data is categorized under different labels according to
some parameters, and the labels are then predicted for new data. Classifying
emails as either spam or not spam is an example of a classification problem.
• Clustering: Clustering is the task of partitioning the dataset into groups,
called clusters. The goal is to split up the data in such a way that points
within a single cluster are very similar and points in different clusters are
different. It determines groupings among unlabeled data.
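To make the clustering idea concrete, here is a minimal one-dimensional k-means sketch. k-means is one well-known clustering algorithm and is not named in the slide; the initial centers and the data are hypothetical, and the number of iterations is fixed for simplicity.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; keep the old center if the cluster is empty
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]    # unlabeled data
centers, clusters = k_means_1d(points, centers=[0.0, 10.0])
print(centers)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between the points.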
Classification vs Clustering
▪In general, in classification you have a set of predefined classes and want to
know which class a new object belongs to.
▪Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
▪In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.
Classification vs Regression
▪Classification and regression are two major prediction problems that
are commonly dealt with in data mining.
▪Predictive modeling is the technique of developing a model or function
from historical data to predict new data.
▪The significant difference between classification and regression is that
classification maps the input data object to discrete labels.
▪Regression, on the other hand, maps the input data object to continuous
real values.
Clustering vs Association rule
▪By definition, clustering is grouping a set of objects in such a manner that
objects in the same group are more similar to each other than to objects belonging to
other groups.
▪Association rule mining, by contrast, is about finding associations among items within
large commercial databases.
What is Deep Learning?
▪‘Deep Learning’ means using a neural network with several layers
of nodes between input and output
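A rough sketch of what "several layers of nodes between input and output" means in code: each layer computes weighted sums of the previous layer's outputs, and a non-linearity is applied between layers. The weights below are hypothetical fixed numbers chosen for illustration (a real network learns them from data), and ReLU is an assumed choice of activation function.

```python
def relu(values):
    """Non-linearity applied to the hidden layers."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each node is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 2 hidden nodes -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),    # hidden layer 1
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.5]),    # hidden layer 2
    ([[1.0, -1.0]], [0.0]),                     # output layer
]

def forward(inputs):
    """Pass the inputs through every layer in turn."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:             # no ReLU on the output layer
            activations = relu(activations)
    return activations

output = forward([2.0, 1.0])
print(output)
```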
Machine Learning vs Deep Learning
Deep Learning and ML
Applications
• Image Classification
• Speech Recognition
• Language Translation
• Stock Exchange Prediction
• Biomedical and diagnostic systems
• Vehicular Communication
• Face detection
• Video Surveillance