ML Lect1


Introduction to Machine Learning

Dr. Amit M. Joshi


Assistant Professor,
Dept. of Electronics & Communication
Malaviya National Institute of Technology
Jaipur (Rajasthan)

Contents

❖ Introduction to Data Mining


❖ Need of Data Mining
❖ Knowledge Discovery Process
❖ Machine Learning Introduction
❖ Types of Machine Learning
❖ Applications

Introduction
• Data mining is the process of finding patterns in large data sets (or Big Data) using
techniques from artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract useful information
from a data set and transform it into an understandable structure for further use.
• Data mining is the analysis step of the “Knowledge Discovery in Databases”
process, known as KDD.
• The objective is to discover patterns rather than to extract the data itself.

We are data rich but information poor
Why is Data Mining Required?
• The volume of information from business transactions, scientific data,
sensor data, pictures, videos, etc. is growing every day beyond what we can
handle manually. We therefore need a system capable of extracting the
essence of the available information and automatically generating
reports, views, or summaries of the data for better decision-making.
• Automatic summarization of data.
• Extracting the essence of the stored information.
• Discovering patterns in raw data.
• Data mining, also known as Knowledge Discovery in Databases,
refers to the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data stored in databases.

Data Mining?
• The actual data mining task is the automatic or semi-automatic analysis of
large quantities of data.
• The main purpose is to extract previously unknown, interesting patterns such as
groups of records (cluster analysis), unusual records (anomaly detection), and
dependencies (association rule mining).
• These patterns can then be used to form new hypotheses to test
against the larger data population.
• It helps to infer new information from already collected data.

Data mining: searching for knowledge in your data.

Why Data Mining
• The fast-growing, tremendous amount of data collected and stored in many large
data repositories has far exceeded our human ability to understand it without
powerful tools.
• As a result, data collected in large data repositories become “data tombs”:
data archives that are seldom visited.
• Data are collected and stored at enormous speeds (GB/hour)
➢ remote sensing on satellites
• Data mining may help scientists
➢ in classifying and segmenting the data
➢ in hypothesis formation
Basics of Data Mining
• Data analysis is more in line with standard statistical software (e.g. web statistics).
Such tools usually present information about subsets and relations within the
recorded data set (search engine usage, average visit time).
• Data mining implies that the software applies more intelligence than simple
grouping and partitioning of the data in order to infer new information.
• Data mining is the non-trivial process of identifying patterns that are
➢ valid
➢ novel
➢ implicit
➢ previously unknown
➢ potentially useful
➢ and ultimately understandable
Steps for KDD

Steps of KDD
1. Data Cleaning: the removal of noisy and irrelevant data from the collection.
   1. Cleaning in case of missing values.
   2. Cleaning noisy data, where noise is a random error or variance.
   3. Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: heterogeneous data from multiple sources are combined
in a common source (data warehouse).
   1. Data integration using data migration tools.
   2. Data integration using data synchronization tools.
   3. Data integration using the ETL (Extract-Transform-Load) process.
Steps of KDD
3. Data Selection: the process in which data relevant to the analysis are
identified and retrieved from the data collection.
   1. Data selection using neural networks.
   2. Data selection using decision trees.
   3. Data selection using Naive Bayes.
   4. Data selection using clustering, regression, etc.
4. Data Transformation: the process of transforming data into the form
required by the mining procedure. Data transformation is a two-step process:
   1. Data mapping: assigning elements from the source base to the
   destination to capture transformations.
   2. Code generation: creating the actual transformation program.
Steps of KDD
5. Data Mining: the application of clever techniques to extract
potentially useful patterns.
   1. Transforms task-relevant data into patterns.
   2. Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: identifying the truly interesting patterns that
represent knowledge, based on given measures.
   1. Finds the interestingness score of each pattern.
   2. Uses summarization and visualization to make the data understandable
   to the user.
7. Knowledge Representation: the use of visualization tools to present
data mining results.
   1. Generate reports.
   2. Generate tables.
   3. Generate discriminant rules, classification rules, characterization
   rules, etc.
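The preprocessing steps of KDD (cleaning, integration, selection, transformation) can be sketched on toy records. This is only a minimal illustration in plain Python, not a production pipeline: the record fields (`id`, `age`, `income`, `city`), the sample values, and the helper function names are all hypothetical.

```python
# Toy KDD preprocessing sketch. All field names and values are made up.

def clean(records):
    """Data cleaning: drop records that have a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(source_a, source_b, key="id"):
    """Data integration: merge two sources on a shared key."""
    by_key = {r[key]: dict(r) for r in source_a}
    for r in source_b:
        if r[key] in by_key:
            by_key[r[key]].update(r)
    return list(by_key.values())

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

customers = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},   # missing value, removed by clean()
    {"id": 3, "age": 45},
]
incomes = [
    {"id": 1, "income": 30000, "city": "Jaipur"},
    {"id": 3, "income": 50000, "city": "Delhi"},
]

merged = integrate(clean(customers), incomes)
prepared = transform(select(merged, ["age", "income"]), "income")
print(prepared)
```

The prepared records would then be handed to the data mining step; real projects typically rely on dedicated ETL or data-frame tools rather than hand-written helpers like these.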
Primary Data Mining Tasks

In general, data mining tasks can be classified into two categories:
descriptive and predictive.

• Predictive methods use some variables to predict unknown or
future values of other variables.
Ex: Classification, Regression

• Descriptive methods characterize the general properties of the
data in the database.
Ex: Association Rule Discovery, Clustering
Moving Towards Machine Learning
• Machine learning is programming computers to optimize a
performance criterion using example data or past experience.
• Learning is used when:
– Human expertise does not exist
– Humans are unable to explain their expertise
– Solution changes in time
– Solution needs to be adapted to particular cases

Why do we need Machine Learning?
• Some tasks cannot be defined well, except by examples (e.g.
recognition of faces or people).
• Large amounts of data may have hidden relationships and correlations.
Only automated approaches may be able to detect these.
• The amount of knowledge about a certain problem / task may be too
large for explicit encoding by humans (e.g. in medical diagnostics)
• Environments change over time, and new knowledge is constantly
being discovered. A continuous redesign of the systems “by hand” may
be difficult.
Machine Learning Concept

Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program
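The contrast between the two diagrams can be made concrete with a small sketch. In the first function a person writes the rule; in the second, the rule (here just a threshold) is derived from data together with the desired outputs. The temperature data, the hand-picked threshold of 30, and both function names are hypothetical and chosen only for illustration.

```python
# Traditional programming: a human writes the rule (the "program").
def is_hot_rule(temp_c):
    return temp_c >= 30  # threshold chosen by hand (hypothetical)

# Machine learning: the rule is derived from data plus desired outputs.
def learn_threshold(temps, labels):
    """Pick the candidate threshold that classifies most examples correctly."""
    best_t, best_correct = None, -1
    for t in sorted(temps):
        correct = sum((x >= t) == y for x, y in zip(temps, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

temps = [18, 22, 27, 31, 35, 40]                  # data
labels = [False, False, False, True, True, True]  # desired output
threshold = learn_threshold(temps, labels)        # the learned "program"
print(threshold)
```

The learned threshold plays the role of the "Program" box on the output side of the machine learning diagram.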
The Machine Learning Approach

Input Data (e.g. Gene Expression Profiles, …) → ML Classifier (Machine Learning) → Prediction: Yes / No
Machine Learning
• Learning Task:
– What do we want to learn or predict?
• Data and assumptions:
– What data do we have available?
– What is their quality?
– What can we assume about the given problem?
• Representation:
– What is a suitable representation of the examples to be classified?
• Method and Estimation:
– Are there possible hypotheses?
– Can we adjust our predictions based on the given results?
• Evaluation:
– How well does the method perform?
– Might another approach/model perform better?
Classification
• Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label is unknown.

• The derived model is based on the analysis of a set of training data (i.e., data
objects whose class label is known).
Classification Example
Classification Algorithms
• LDA (Linear Discriminant Analysis)
• QDA (Quadratic Discriminant Analysis )
• K-NN (k- Nearest Neighbor)
• SVM (Support Vector Machine)
• Decision Tree
• Random Forest
• ... and many more
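As one illustration of the algorithms listed above, here is a minimal k-nearest-neighbour classifier in plain Python. It is a sketch under simple assumptions (Euclidean distance, majority vote, tiny made-up training set); the function name `knn_predict` and the data points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training data: two features per object, class labels known.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.4, 6.2)))
```

The training set is the "model" here: k-NN stores the labeled examples and defers all computation to prediction time.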
Classification Accuracy

• Accuracy: percentage of correct classifications

$$\text{Accuracy} = \frac{\text{Total test instances classified correctly}}{\text{Total number of test instances}}$$
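The accuracy definition translates directly into a few lines of Python. The sketch below assumes labels are given as two equally long lists; the function name and the spam/ham example labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of test instances whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham"]
print(accuracy(y_true, y_pred))  # 4 of 5 correct
```

Multiply the result by 100 to express it as a percentage.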

Evaluating a Classifier: n-fold Cross Validation
• Suppose m labeled instances
– Divide into n subsets (“folds”) of equal size
• Run classifier n times, with each of the subsets as the test set
– The rest (n-1) for training
– Each run gives an accuracy result
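The n-fold procedure can be sketched as follows. To keep the example self-contained, the "classifier" is a deliberately trivial stand-in that always predicts the most common training label; in practice any of the classification algorithms would take its place. The function names and data are hypothetical, and the sketch assumes m is divisible by n.

```python
from collections import Counter

def majority_classifier(train_labels):
    """Stand-in classifier: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance: most_common

def cross_validate(instances, labels, n_folds):
    """Return one accuracy per fold; each fold serves as the test set once."""
    m = len(instances)
    fold_size = m // n_folds  # assumes m is divisible by n_folds
    accuracies = []
    for i in range(n_folds):
        start, stop = i * fold_size, (i + 1) * fold_size
        test_x, test_y = instances[start:stop], labels[start:stop]
        train_y = labels[:start] + labels[stop:]   # the other n-1 folds
        predict = majority_classifier(train_y)
        correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_x))
    return accuracies

instances = list(range(9))   # m = 9 placeholder instances
labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no"]
scores = cross_validate(instances, labels, n_folds=3)
print(scores, sum(scores) / len(scores))
```

The n per-fold accuracies are usually averaged into a single estimate, as in the last line.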

Evaluating a Classifier: Confusion Matrix

                  Classified positive   Classified negative
Actual positive   True positive         False negative
Actual negative   False positive        True negative

TP: number of positive examples classified correctly
FN: number of positive examples classified incorrectly
FP: number of negative examples classified incorrectly
TN: number of negative examples classified correctly
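The four cells of the confusion matrix can be counted with a single pass over the test set. This is a minimal sketch for the binary case; the function name `confusion_counts` and the example labels are made up.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FN, FP, TN for a binary task with the given positive label."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive example classified correctly
            else:
                fn += 1   # positive example classified incorrectly
        else:
            if predicted == positive:
                fp += 1   # negative example classified incorrectly
            else:
                tn += 1   # negative example classified correctly
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred, positive=1)
print(counts)
```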
Evaluating a Classifier: Precision and Recall

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Note that the focus is on the positive class.
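Both measures follow directly from the TP, FP, and FN counts. The sketch below uses hypothetical counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions prescribe.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was classified positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
print(p, r)
```

Neither function uses TN, which is why both measures focus on the positive class.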


Evaluating a Classifier:
What Affects the Performance

• Complexity of the task
– Large number of features (high dimensionality)
– Features that appear very few times (sparse data)
• Few instances for a complex classification task
• Missing feature values for instances
• Errors in attribute values for instances
• Errors in the labels of training instances
• Uneven availability of instances in classes

What is Regression?
• Regression analysis is defined in Wikipedia as follows:
• In statistical modeling, regression analysis is a set of statistical
processes for estimating the relationships between a dependent
variable (often called the ‘outcome variable’) and one or
more independent variables (often called ‘predictors’, ‘covariates’,
or ‘features’).
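For the simplest case, one predictor and a straight-line relationship, the estimate can be computed in closed form with ordinary least squares. This is a minimal sketch; the function name `fit_line` and the data (constructed to lie exactly on y = 2x + 1) are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Here `xs` is the independent variable (predictor) and `ys` the dependent (outcome) variable.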

Regression

Curve Fitting
Example: curve fitting

Lecture 1 8/25/11 CS 194-10 Fall 2011, Stuart Russell


Types of Regression
• Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression
• Logistic Regression (despite its name, typically used for classification)
Clustering
• Clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are very dissimilar to objects in other
clusters. Each cluster that is formed can be viewed as a class of objects, from
which rules can be derived.
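One common way to form such clusters is k-means, used here purely as an illustration since the slide does not name a specific algorithm. The sketch assumes Euclidean distance, hand-picked initial centroids, and a fixed number of iterations; the data points are made up.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid unchanged
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Initial centroids are fixed by hand so the run is reproducible.
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No class labels are used anywhere: the grouping comes only from the similarity between points.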
Reinforcement Learning
• What’s Reinforcement Learning?
Environment → {Observation, Reward} → Agent → {Actions} → Environment

• The agent interacts with an environment and learns by maximizing a scalar
reward signal.
• No labels or any other supervision signal.
• Earlier approaches suffered from hand-crafted states or representations.
Reinforcement Learning
• Learning a policy: A sequence of outputs
• No supervised output but delayed reward
• Credit assignment problem
• Game playing
• Robot in a maze
• Multiple agents, partial observability, ...
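A tiny "robot in a corridor" example shows how a policy can be learned from a delayed reward. The sketch uses tabular Q-learning, one common reinforcement learning algorithm chosen here for illustration (the slides do not name one). The corridor layout, the reward of 1 at the last cell, and the hyperparameter values are all assumptions.

```python
import random

N_STATES, GOAL = 5, 4          # corridor cells 0..4, reward only at cell 4
ACTIONS = (-1, +1)             # move left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative hyperparameters

def step(state, action):
    """Environment: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)                        # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # The delayed reward is propagated backwards through best_next.
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

Only the final step into the goal cell is rewarded, so earlier moves receive credit indirectly through the discounted value of the next state, which is one simple answer to the credit assignment problem.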

Data Mining/Machine learning Techniques
• Regression and classification are supervised learning approaches that map
an input to an output based on example input-output pairs, while clustering
is an unsupervised learning approach.
• Regression: predicts a continuous-valued output. Regression analysis
is a statistical model used to predict numeric data instead of
labels. It can also identify distribution trends based on the available
or historic data. Predicting a person’s income from their age and education is
an example of a regression task.
• Classification: predicts one of a discrete number of values. In
classification the data are categorized under different labels according to
some parameters, and then the labels are predicted for new data. Classifying
emails as either spam or not spam is an example of a classification problem.
• Clustering: the task of partitioning the data set into groups,
called clusters. The goal is to split up the data in such a way that points
within a single cluster are very similar and points in different clusters are
different. It determines groupings among unlabeled data.
Classification vs Clustering
▪ In general, in classification you have a set of predefined classes and want to
know which class a new object belongs to.
▪ Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
▪ In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.

Classification vs Regression
▪ Classification and regression are two major prediction problems that
are usually dealt with in data mining.
▪ Predictive modeling is the technique of developing a model or function
from historic data in order to predict new data.
▪ The significant difference between classification and regression is that
classification maps the input data object to discrete labels,
▪ whereas regression maps the input data object to continuous
real values.

Clustering vs Association rule
▪ By definition, clustering is grouping a set of objects in such a manner that
objects in the same group are more similar to each other than to objects
belonging to other groups.
▪ Association rule mining, in contrast, is about finding associations amongst
items within large commercial databases.
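Associations between items are usually scored with support and confidence, two standard measures that the slide does not spell out and that are introduced here as an assumption. The sketch computes both for a made-up set of market-basket transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Made-up market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Note that `confidence` divides by the antecedent's support, so it is only defined for antecedents that occur in at least one transaction.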

What is Deep Learning?
▪ ‘Deep Learning’ means using a neural network with several layers
of nodes between input and output.

▪ The series of layers between input and output do feature
identification and processing in a series of stages, just as our brains
seem to.
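The idea of "several layers of nodes between input and output" can be sketched as a forward pass through a small fully connected network. Everything specific here is an assumption made for illustration: the 2-3-2-1 layer sizes, the sigmoid activation, and the arbitrary, untrained weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through each layer in turn (a series of stages)."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical 2-3-2-1 network with arbitrary, untrained weights.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer 1
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),        # hidden layer 2
    ([[1.0, -1.0]], [0.0]),                                        # output layer
]
output = forward([1.0, 2.0], network)
print(output)
```

Training, which adjusts the weights from data, is omitted; the sketch only shows how each layer transforms the output of the previous one.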

Machine Learning vs Deep Learning

Deep Learning and ML

Applications

• Image Classification
• Speech Recognition
• Language translation
• Stock Exchange Prediction
• Biomedical and diagnosis system
• Vehicular Communication
• Face detection
• Video Surveillance