UNIT III ML
EVALUATING HYPOTHESES
1. Motivation
2. Estimating hypothesis accuracy
3. Basics of sampling theory
4. A general approach for deriving confidence intervals
5. Difference in error of two hypotheses
6. Comparing learning algorithms
INTRODUCTION:
Evaluating the accuracy of hypotheses is fundamental to machine learning.
This chapter presents an introduction to statistical methods for estimating hypothesis
accuracy, focusing on three questions. First, given the observed accuracy of a hypothesis over
a limited sample of data, how well does this estimate its accuracy over additional examples?
Second, given that one hypothesis outperforms another over some sample of data, how
probable is it that this hypothesis is more accurate in general?
Third, when data is limited, what is the best way to use this data to both learn a hypothesis and
estimate its accuracy? Because limited samples of data might misrepresent the general
distribution of data, estimating true accuracy from such samples can be misleading.
Statistical methods, together with assumptions about the underlying distributions of
data, allow one to bound the difference between observed accuracy over the sample of
available data and the true accuracy over the entire distribution of data.
1. MOTIVATION
2. ESTIMATING HYPOTHESIS ACCURACY
3. BASICS OF SAMPLING THEORY
This section introduces basic notions from statistics and sampling theory, including probability
distributions, expected value, variance, the Binomial and Normal distributions, and two-sided and
one-sided confidence intervals.
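These ideas come together in the standard interval estimate for a hypothesis's true error. The sketch below is a minimal illustration, assuming the usual Normal approximation to the Binomial distribution; the numbers in the example (12 errors over 40 test examples) are made up for illustration.

```python
import math

def error_confidence_interval(sample_error, n, z=1.96):
    """Approximate two-sided confidence interval for the true error of a
    hypothesis, given its observed sample error over n test examples.
    Relies on the Normal approximation to the Binomial distribution,
    which is reasonable when n * error * (1 - error) >= 5.
    z = 1.96 corresponds to a 95% two-sided interval."""
    std_err = math.sqrt(sample_error * (1.0 - sample_error) / n)
    return (sample_error - z * std_err, sample_error + z * std_err)

# A hypothesis that misclassifies 12 of 40 independently drawn test examples:
low, high = error_confidence_interval(12 / 40, 40)
print(round(low, 3), round(high, 3))  # 0.158 0.442
```

With 95% confidence, the true error lies roughly between 0.16 and 0.44, illustrating how wide the interval remains with only 40 examples.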
4. A GENERAL APPROACH FOR DERIVING CONFIDENCE INTERVALS
5. DIFFERENCE IN ERROR OF TWO HYPOTHESES
6. COMPARING LEARNING ALGORITHMS
SUPPORT VECTOR MACHINES & DIMENSIONALITY REDUCTION TECHNIQUES
7. SEPARATING DATA WITH THE MAXIMUM MARGIN
In scikit-learn, the SVM classes (such as SVC) provide a way to find the Maximum Margin
Separating Hyperplane (MMSH). The SVM model is a supervised learning algorithm that can be used
for both classification and regression tasks. When used for classification, the SVM finds the MMSH
that separates different classes of data points. The SVM algorithm works by implicitly mapping the
data points to a higher-dimensional space, where a linear boundary can be found to separate the
classes. The SVM then finds the optimal hyperplane in this higher-dimensional space; that hyperplane
corresponds to a decision boundary in the original space, which may be non-linear.
In scikit-learn, the SVM classes accept several kernel functions, which are used to map the data
points to a higher-dimensional space. The most commonly used kernel functions are the linear
kernel, the polynomial kernel, and the radial basis function (RBF) kernel. The linear kernel is
appropriate when the data is linearly separable; the polynomial and RBF kernels handle data that is
not linearly separable, with the RBF kernel a common default because it can model flexible, locally
varying decision boundaries.
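The effect of the kernel choice can be seen on a dataset that is not linearly separable. The following is a minimal sketch, assuming scikit-learn's SVC with default hyperparameters on a synthetic concentric-circles dataset; exact scores depend on the random seed.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Concentric circles: the two classes cannot be split by any straight line
# in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit an SVM with each kernel and compare test accuracy.
scores = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, round(scores[kernel], 2))
```

The linear kernel performs near chance on this data, while the RBF kernel recovers the circular boundary almost perfectly.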
8. FINDING THE MAXIMUM MARGIN
Maximum Margin Separating Hyperplane (MMSH) is a concept in machine learning that refers to
a line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that separates different classes
of data points with the largest possible margin. The margin is the distance between the hyperplane and
the closest data points from each class, and the goal of MMSH is to find the hyperplane that
maximizes this distance.
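For a linear SVM with hyperplane w·x + b = 0, the margin width works out to 2 / ||w||, so the margin can be read off a fitted model. A minimal sketch, assuming scikit-learn's SVC on a toy 2-D dataset invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated point clouds in 2-D (toy data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Margin width of the hyperplane w.x + b = 0 is 2 / ||w||.
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("margin width:", round(margin, 3))
```

The support vectors (`clf.support_vectors_`) are exactly the points that sit on the edges of this margin.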
Example 1
The LinearSVC class also has a number of hyperparameters that you can adjust to control the
behavior of the model. For example, you can use the C hyperparameter to control the regularization
strength: larger values of C weaken the regularization, allowing the model to fit (and potentially
overfit) the training data more closely.
Python coding
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def load_data():
    # Generate a synthetic classification dataset
    X, y = make_classification(n_samples=1000,
                               n_features=4, random_state=42)
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

# Load the data and split it into training and test sets
X_train, X_test, y_train, y_test = load_data()

# Fit a linear SVM; C controls the regularization strength
clf = LinearSVC(C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))