
Unit – 6 Classification and Regression

1. Describe the working principles of the K-Nearest Neighbors algorithm. Discuss the impact of the value of k on the model's performance and demonstrate its use with a suitable numerical example.
2. Explain the concept of Naïve Bayes classification. Given the
following dataset, calculate the probability of a test instance
belonging to a specific class using the Naïve Bayes classifier:

Feature 1 Feature 2 Class

High Yes A

Low No B

Medium Yes A

High No B

Test Instance: Feature 1 = High, Feature 2 = Yes.

3. Discuss the working of Support Vector Machines with a focus on the kernel trick. Explain how it can be used for both classification and regression tasks.
4. Draw and explain a decision tree for the following data:

Weather Temperature Play

Sunny Hot No

Overcast Cool Yes

Rainy Mild Yes

Use the entropy and information gain methods to determine the root node.

5. What is cross-validation, and why is it important in model training and evaluation? Describe k-fold cross-validation and its role in avoiding overfitting.
6. Explain precision, recall, F1-score, accuracy, and the area under
the curve (AUC). Given the following confusion matrix, calculate
all these metrics:

Predicted Positive Predicted Negative

Actual Pos 50 10

Actual Neg 5 35

7. Describe the concept of multi-variable regression. Given the following data, calculate the coefficients of the regression equation y = b0 + b1x1 + b2x2 using least squares regression:

x1 x2 y

1 2 5

2 4 10

3 6 15

8. What is regularization in regression? Explain the difference between Ridge regression and LASSO regression with examples. Why does LASSO perform feature selection?
9. Given the cost function J(θ) = (1/2m) Σ(hθ(x) − y)² + λΣ|θ|, calculate the optimized values of θ for a simple linear regression problem with LASSO regularization when λ = 0.1.
10. Regression is widely used in various fields. Discuss at least
three real-world applications of regression, emphasizing the type
of regression used and why it is suitable for the task.
Q1 Describe the working principles of the K-Nearest Neighbors algorithm. Discuss the impact of the value of k on the model's performance and demonstrate its use with a suitable numerical example.
Answer – The K-Nearest Neighbors (KNN) algorithm is an instance-based (lazy) learner: it stores the training data, computes the distance (commonly Euclidean) from a query point to every stored point, selects the k closest points, and predicts the majority class (classification) or the average value (regression) among them.
Impact of the value of k:
1. Small k:
o A smaller k value makes the algorithm sensitive to noise in the
training data, as it only considers a few neighbors.
o This can lead to overfitting, where the model performs well on
training data but poorly on unseen data.
2. Large k:
o A larger k value smoothens the decision boundary, making the
model less sensitive to noise.
o However, it may lead to underfitting, as it considers more
neighbors, which can include irrelevant or distant points from
other classes.
3. Optimal k:
o The optimal k value is usually determined using cross-
validation. It strikes a balance between bias and variance.
Numerical Example
Dataset:

Data Point Feature 1 (x) Feature 2 (y) Class


A 1 2 Red
B 2 3 Red
C 3 2 Blue
D 6 5 Blue
E 7 8 Blue
Query Point:
New data point Q with coordinates (4,4)
Step 1: Calculate Distances

Using Euclidean distance: d = √((x₂ − x₁)² + (y₂ − y₁)²)

d(Q, A) = √((4 − 1)² + (4 − 2)²) = √(9 + 4) = √13 ≈ 3.6
d(Q, B) = √((4 − 2)² + (4 − 3)²) = √(4 + 1) = √5 ≈ 2.2
d(Q, C) = √((4 − 3)² + (4 − 2)²) = √(1 + 4) = √5 ≈ 2.2
d(Q, D) = √((4 − 6)² + (4 − 5)²) = √(4 + 1) = √5 ≈ 2.2
d(Q, E) = √((4 − 7)² + (4 − 8)²) = √(9 + 16) = √25 = 5.0


Step 2: Select k-Nearest Neighbors
• For k=3, the nearest neighbors are B,C,D.
Step 3: Predict the Class
• Among B,C,D:
o Red=1, Blue=2.
o Predicted class: Blue.
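
The worked example above can be reproduced with a short NumPy sketch (the coordinates, labels, and k = 3 are taken directly from the table and steps above):

```python
import numpy as np
from collections import Counter

# Training points and labels from the table above
X_train = np.array([[1, 2], [2, 3], [3, 2], [6, 5], [7, 8]], dtype=float)
y_train = ["Red", "Red", "Blue", "Blue", "Blue"]
query = np.array([4, 4], dtype=float)

# Step 1: Euclidean distance from the query point to every training point
distances = np.linalg.norm(X_train - query, axis=1)

# Step 2: indices of the k nearest neighbours (k = 3, as in the worked example)
k = 3
nearest = np.argsort(distances)[:k]

# Step 3: majority vote among the k nearest neighbours
votes = Counter(y_train[i] for i in nearest)
print(distances.round(1))                              # [3.6 2.2 2.2 2.2 5. ]
print("Predicted class:", votes.most_common(1)[0][0])  # Blue
```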

Q2 Explain the concept of Naïve Bayes classification. Given the following dataset, calculate the probability of a test instance belonging to a specific class using the Naïve Bayes classifier:

Feature 1 Feature 2 Class

High Yes A

Low No B

Medium Yes A

High No B

Test Instance: Feature 1 = High, Feature 2 = Yes.


Answer –
Concept of Naïve Bayes Classification
Naïve Bayes is a probabilistic classifier based on Bayes' Theorem,
assuming independence among predictors. It calculates the posterior
probability of a class based on the prior probability and the likelihood of the
predictors. Bayes' Theorem is given by:
P(C∣X) = P(X∣C) P(C) / P(X)

Where:
• P(C∣X): Posterior probability of class C given predictors X.
• P(X∣C): Likelihood of predictors X given class C.
• P(C): Prior probability of class C.
• P(X): Evidence (constant across all classes).
In Naïve Bayes, the independence assumption simplifies P(X∣C) as a
product of individual probabilities of each feature:
P(X∣C)=P(X1∣C)×P(X2∣C)×⋯×P(Xn∣C)

Given Dataset

Feature 1 Feature 2 Class

High Yes A

Low No B

Medium Yes A

High No B

Test Instance
Feature 1 = High, Feature 2 = Yes
We calculate the posterior probability for each class A and B, and classify
the test instance to the class with the highest posterior probability.

Step 1: Calculate Priors P(A) and P(B)

From the dataset:
• Total instances: 4
• Class A instances: 2 → P(A) = 2/4 = 0.5
• Class B instances: 2 → P(B) = 2/4 = 0.5

Step 2: Calculate Likelihoods

For Class A:
• P(Feature 1 = High ∣ A) = 1/2 = 0.5
• P(Feature 2 = Yes ∣ A) = 2/2 = 1.0
For Class B:
• P(Feature 1 = High ∣ B) = 1/2 = 0.5
• P(Feature 2 = Yes ∣ B) = 0/2 = 0

Step 3: Calculate Posteriors
Ignoring the common denominator P(X):
• P(A∣X) ∝ P(A) × P(High∣A) × P(Yes∣A) = 0.5 × 0.5 × 1.0 = 0.25
• P(B∣X) ∝ P(B) × P(High∣B) × P(Yes∣B) = 0.5 × 0.5 × 0 = 0.0

Step 4: Classification
Since P(A∣X) = 0.25 and P(B∣X) = 0.0, the test instance is classified as Class A.

Conclusion
Using Naïve Bayes classification, the test instance with Feature 1 = High
and Feature 2 = Yes belongs to Class A.
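
The same counting-based calculation can be sketched in a few lines of Python (no smoothing is applied, matching the manual steps above):

```python
from collections import Counter

# Training data: (feature1, feature2, class) from the table above
data = [("High", "Yes", "A"), ("Low", "No", "B"),
        ("Medium", "Yes", "A"), ("High", "No", "B")]
test = ("High", "Yes")

classes = Counter(c for _, _, c in data)   # class counts: A=2, B=2
total = len(data)

scores = {}
for c, count in classes.items():
    prior = count / total
    # Likelihood of each test feature given the class (simple frequency counts)
    p_f1 = sum(1 for f1, _, cc in data if cc == c and f1 == test[0]) / count
    p_f2 = sum(1 for _, f2, cc in data if cc == c and f2 == test[1]) / count
    scores[c] = prior * p_f1 * p_f2

print(scores)                                            # {'A': 0.25, 'B': 0.0}
print("Predicted class:", max(scores, key=scores.get))   # A
```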

Q3 Discuss the working of Support Vector Machines with a focus on the kernel trick. Explain how it can be used for both classification and regression tasks.
Answer- Support Vector Machines (SVM): Working, Kernel Trick, and Applications
in Classification and Regression

1. Working of SVM
Support Vector Machines (SVM) are supervised learning algorithms used
for both classification and regression tasks. The goal of SVM is to find the
optimal hyperplane that best separates data points belonging to different
classes in the feature space.
• Hyperplane: A hyperplane is a decision boundary that divides the
feature space into regions, each corresponding to a class label. In a
2D space, it is a line; in 3D, it becomes a plane, and in higher
dimensions, it generalizes to a hyperplane.
• Support Vectors: These are the data points closest to the
hyperplane. They have the largest influence on the position and
orientation of the hyperplane.
• Margin: The margin is the distance between the hyperplane and the
closest support vectors. SVM aims to maximize this margin, ensuring
better generalization.
2. Challenges with Linearly Non-Separable Data
When the data points are not linearly separable, SVM introduces:
• Slack Variables: These allow some points to be misclassified,
introducing flexibility in the model.
• Regularization Parameter (C): Controls the trade-off between maximizing the margin and minimizing classification errors.
3. The Kernel Trick
The kernel trick is a mathematical technique used to transform data into a
higher-dimensional space where a hyperplane can separate it linearly.
Instead of explicitly mapping the data, kernels calculate the dot product of
data points in the transformed space, avoiding computational complexity.
Common Kernels:
• Linear Kernel: Used when data is linearly separable.
• Polynomial Kernel: Maps data to polynomial features.
• Radial Basis Function (RBF) Kernel (Gaussian Kernel): Handles
complex decision boundaries by mapping data to an infinite-
dimensional space.
• Sigmoid Kernel: Similar to a neural network activation function,
used in specific cases.
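
A small numerical sketch of the kernel trick: for a degree-2 polynomial kernel, K(x, z) = (xᵀz)² computed in the original 2-D space equals the dot product of the explicitly mapped features φ(x) = (x1², √2·x1x2, x2²), so the higher-dimensional mapping never has to be constructed.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed directly in the original 2-D space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(poly_kernel(x, z))        # 121.0  -> (1*3 + 2*4)^2
print(np.dot(phi(x), phi(z)))   # 121.0  -> same value via the explicit feature map
```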
4. SVM for Classification
• SVM finds the hyperplane that best separates classes.
• For multi-class classification, strategies like One-vs-One or One-vs-
Rest are used.
• Example: Classifying emails as spam or not spam.
5. SVM for Regression (SVR)
• In regression tasks, SVM tries to find a hyperplane that predicts a
continuous output value.
• Instead of maximizing margin, SVR introduces an epsilon-tube
within which prediction errors are ignored.
• It minimizes the deviation of predicted values outside this tube.
6. Advantages of SVM
• Effective in high-dimensional spaces.
• Works well for both linearly separable and non-linearly separable
data.
• Robust against overfitting, especially with proper regularization.
7. Limitations of SVM
• Computationally intensive for large datasets.
• Choice of kernel and hyperparameters (C) significantly affects
performance.
8. Applications
• Classification: Face recognition, text categorization, and
bioinformatics.
• Regression: Predicting house prices, stock trends, or weather data.
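
As a hedged illustration (assuming scikit-learn is available; the tiny datasets below are made up purely for demonstration), SVC with an RBF kernel covers the classification case from section 4 and SVR with an epsilon-tube covers the regression case from section 5:

```python
import numpy as np
from sklearn.svm import SVC, SVR

# Toy classification data (illustrative only): two well-separated groups
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [3, 4], [4, 3]])
y = [0, 0, 0, 0, 1, 1, 1, 1]

# RBF kernel: the kernel trick evaluates similarities in a high-dimensional space
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # one test point near each group

# Support Vector Regression: errors inside the epsilon-tube are ignored
X_r = np.array([[1], [2], [3], [4], [5]])
y_r = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X_r, y_r)
print(reg.predict([[2.5]]))                    # prediction near 2.5
```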
Q4 Draw and explain a decision tree for the following data:

Weather Temperature Play

Sunny Hot No

Overcast Cool Yes

Rainy Mild Yes

Use the entropy and information gain methods to determine the root node.

Answer-Step 1: Understanding the Data


The dataset consists of two input features and one target variable:
• Weather: Sunny, Overcast, Rainy
• Temperature: Hot, Cool, Mild
• Play: Target variable (Yes/No)

Step 2: Formula for Entropy

Entropy measures the impurity of the dataset and is calculated as:
Entropy(S) = −Σ pi log2(pi)
where pi is the proportion of class i in the dataset S.


Step 3: Calculate the Entropy of the Entire Dataset
The target variable Play has:
• Yes = 2
• No = 1
Total = 3
Entropy(S) = −(2/3)log2(2/3) − (1/3)log2(1/3) ≈ 0.918

Step 4: Information Gain Formula

Information Gain (IG) measures the reduction in entropy after splitting on a feature A:
IG(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where Sv is the subset of S in which feature A takes the value v.

Step 5: Calculate Entropy for Each Feature

(a) Weather
The feature Weather has three categories: Sunny, Overcast, Rainy. Each category contains exactly one example (Sunny → No, Overcast → Yes, Rainy → Yes), so every subset is pure and has entropy 0.
Weighted entropy = (1/3)(0) + (1/3)(0) + (1/3)(0) = 0
IG(Weather) = 0.918 − 0 = 0.918
(b) Temperature
The feature Temperature has three categories: Hot, Cool, Mild. Each category also contains exactly one example (Hot → No, Cool → Yes, Mild → Yes), so every subset is pure.
Weighted entropy = 0
IG(Temperature) = 0.918 − 0 = 0.918

Step 6: Choosing the Root Node


Since all features have the same information gain, we can select either
Weather or Temperature as the root node.
Step 7: Construct the Decision Tree
Using Weather as the root (the sketch below verifies the numbers):
Weather
├── Sunny → Play = No
├── Overcast → Play = Yes
└── Rainy → Play = Yes
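
The entropy and information-gain values above can be checked with a short Python sketch using the three rows of the table:

```python
import math
from collections import Counter

# Rows: (Weather, Temperature, Play) from the table above
rows = [("Sunny", "Hot", "No"), ("Overcast", "Cool", "Yes"), ("Rainy", "Mild", "Yes")]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature_index):
    base = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value, count in Counter(r[feature_index] for r in rows).items():
        subset_labels = [r[-1] for r in rows if r[feature_index] == value]
        weighted += (count / len(rows)) * entropy(subset_labels)
    return base - weighted

print(round(entropy([r[-1] for r in rows]), 3))   # 0.918 (dataset entropy)
print(round(information_gain(rows, 0), 3))        # 0.918 (Weather)
print(round(information_gain(rows, 1), 3))        # 0.918 (Temperature)
```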

Q5 What is cross-validation, and why is it important in model training and evaluation? Describe k-fold cross-validation and its role in avoiding overfitting.
Answer- Cross-Validation and Its Importance in Model Training and Evaluation
Cross-validation is a statistical method used to evaluate the performance
and generalizability of machine learning models. It involves dividing the
available dataset into subsets to train and test the model iteratively,
ensuring that the model performs well on unseen data.

Importance of Cross-Validation
1. Assessing Model Performance:
Cross-validation provides a reliable estimate of how well a model will
perform on unseen data by simulating its behavior on various
portions of the dataset.
2. Avoiding Overfitting:
By testing the model on data it has not seen during training, cross-
validation ensures that the model does not memorize the training
data, helping detect overfitting.
3. Optimizing Model Parameters:
It helps in hyperparameter tuning by identifying the best settings for a
model that maximize its predictive power.
4. Comparing Models:
Cross-validation provides a fair comparison of different models by
using the same validation strategy.
5. Effective Use of Data:
By utilizing the entire dataset for both training and validation at
different iterations, cross-validation ensures efficient use of limited
data.

k-Fold Cross-Validation
Definition:
In k-fold cross-validation, the dataset is randomly divided into k equal-
sized folds or subsets. The process is as follows:
1. Use k−1 folds for training the model.
2. Use the remaining 1 fold for testing (validation).
3. Repeat the process k times, with each fold being used exactly once
as the validation set.
4. Calculate the average performance metric (e.g., accuracy, precision)
across all k iterations to evaluate the model.
Steps in k-Fold Cross-Validation:
• Divide the dataset into k subsets.
• Train the model on k−1 folds and test on the remaining fold.
• Repeat for each fold.
• Aggregate the performance results.
Formula for Evaluating Performance:
Overall performance = (1/k) Σ Pi, where Pi is the performance metric (e.g., accuracy) on the i-th fold.

Role in Avoiding Overfitting


• Prevents Data Leakage:
Since each fold is tested on data not seen during training, the
model's performance reflects its ability to generalize.
• Balances Training and Testing:
Ensures that every data point is used for training and validation,
reducing bias and variance in model evaluation.
• Improves Robustness:
Provides a more accurate estimate of model performance compared
to a single train-test split, reducing the likelihood of overfitting to
specific data.
Example (5-Fold Cross-Validation):
• Dataset size: 100 samples.
• Divide into 5 folds of 20 samples each.
• Train on 4 folds (80 samples) and validate on the 1 remaining fold (20
samples).
• Repeat 5 times, using a different fold for validation each time.
• Calculate the mean validation score for overall evaluation.
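
A minimal sketch of 5-fold cross-validation with scikit-learn (assuming it is installed; the synthetic dataset and the choice of logistic regression are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset of 100 samples, as in the 5-fold example above
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
# cv=5 -> train on 4 folds (80 samples), validate on the remaining fold (20 samples)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)          # one accuracy value per fold
print(scores.mean())   # mean validation score for overall evaluation
```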

Advantages of k-Fold Cross-Validation


1. Efficient Use of Data:
Maximizes the utilization of the dataset for both training and
validation.
2. Reduced Variability:
Produces a more stable and reliable evaluation score.
3. Versatility:
Works for various machine learning models and datasets.

Q6 Explain precision, recall, F1-score, accuracy, and the area under the
curve (AUC). Given the following confusion matrix, calculate all these
metrics:

Predicted Positive Predicted Negative

Actual Pos 50 10

Actual Neg 5 35

Answer - 1. Precision
Precision measures the proportion of correctly predicted positive
observations out of all predicted positives.

Indicates: How many of the positive predictions were actually
correct.

2. Recall (Sensitivity/True Positive Rate)


Recall measures the proportion of correctly predicted positive
observations out of all actual positives.


Indicates: How many of the actual positives were identified
correctly.

3. F1-Score
The F1-score is the harmonic mean of precision and recall, providing a
single metric to evaluate the balance between them.

• Indicates: A balance between precision and recall, especially useful


when dealing with imbalanced datasets.

4. Accuracy
Accuracy measures the proportion of correctly predicted observations out
of all observations.

Indicates: The overall correctness of the model.

5. Area Under the Curve (AUC)


AUC evaluates the ability of the model to differentiate between positive and
negative classes. It is derived from the Receiver Operating Characteristic
(ROC) curve, which plots the true positive rate against the false positive
rate.
• Indicates: Higher AUC means better model performance.

Confusion Matrix
From the given matrix: TP = 50, FN = 10, FP = 5, TN = 35 (total = 100).

Calculations

1. Precision
Precision = TP / (TP + FP) = 50 / (50 + 5) = 50 / 55 ≈ 0.909 (90.9%)

2. Recall
Recall = TP / (TP + FN) = 50 / (50 + 10) = 50 / 60 ≈ 0.833 (83.3%)

3. F1-Score
F1 = 2 × Precision × Recall / (Precision + Recall) = 2 × 0.909 × 0.833 / (0.909 + 0.833) ≈ 0.869 (86.9%)

4. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 35) / 100 = 0.85 (85%)

5. AUC
To calculate the AUC:
TPR (Recall) = 0.833 and FPR = FP / (FP + TN) = 5 / 40 = 0.125.
AUC is typically derived from the full ROC curve; a single threshold gives only one ROC point, and with TPR = 0.833 and FPR = 0.125 a high AUC is expected for this model.

Summary of Metrics
Metric Value

Precision 90.9%

Recall 83.3%

F1-Score 86.9%

Accuracy 85%

AUC High (based on ROC)


This comprehensive analysis highlights the performance and effectiveness
of the model in distinguishing between positive and negative classes.
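
The calculations above can be reproduced directly from the four confusion-matrix counts; note that the AUC line is only a single-threshold approximation (the area under the ROC polygon through the one available point), not the full ROC-curve AUC:

```python
# Confusion-matrix counts from the table above
TP, FN = 50, 10
FP, TN = 5, 35

precision = TP / (TP + FP)                                  # 0.909
recall = TP / (TP + FN)                                     # 0.833
f1 = 2 * precision * recall / (precision + recall)          # 0.869
accuracy = (TP + TN) / (TP + FN + FP + TN)                  # 0.85
fpr = FP / (FP + TN)                                        # 0.125
auc_single_point = (recall + (1 - fpr)) / 2                 # ~0.854, rough approximation

print(f"Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f} "
      f"Accuracy={accuracy:.3f} AUC~{auc_single_point:.3f}")
```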

Q7 Describe the concept of multi-variable regression. Given the following data, calculate the coefficients of the regression equation y = b0 + b1x1 + b2x2 using least squares regression:

x1 x2 y

1 2 5

2 4 10

3 6 15

Answer- Concept of Multi-Variable Regression (7 Marks)


Multi-variable regression is an extension of simple linear regression where
the relationship between one dependent variable (y) and two or more
independent variables (x1,x2,…,xn) is modeled. The regression equation is
given by:
y=b0+b1x1+b2x2+⋯+bnxn+ϵ
Here:
• y: Dependent variable (output)
• x1,x2,…,xn: Independent variables (inputs)
• b0,b1,b2,…,bn: Regression coefficients
• ϵ: Error term
The regression coefficients (b0,b1,b2,…,bn) are determined using the least
squares method, which minimizes the sum of the squared differences
between the actual and predicted values of y.

Data:
x1 = (1, 2, 3), x2 = (2, 4, 6), y = (5, 10, 15), n = 3

1. Set Up the Regression Equation
y = b0 + b1x1 + b2x2

2. Write the Normal Equations
Using the least squares method, the normal equations for solving b0, b1, b2 are:
n·b0 + b1Σx1 + b2Σx2 = Σy
b0Σx1 + b1Σx1² + b2Σx1x2 = Σx1y
b0Σx2 + b1Σx1x2 + b2Σx2² = Σx2y

3. Calculate the Required Summations
Σx1 = 6, Σx2 = 12, Σy = 30, Σx1² = 14, Σx2² = 56, Σx1x2 = 28, Σx1y = 70, Σx2y = 140

4. Form the Normal Equations
3b0 + 6b1 + 12b2 = 30
6b0 + 14b1 + 28b2 = 70
12b0 + 28b1 + 56b2 = 140

5. Solve the Equations (Matrix Method)
In matrix form these are XᵀXb = Xᵀy. Because x2 = 2x1 in this dataset, the third equation is exactly twice the second, so the system is rank-deficient and has infinitely many exact solutions (any b0 = 0 with b1 + 2b2 = 5 fits the data perfectly). Taking the minimum-norm solution:
b0 = 0, b1 = 1, b2 = 2

6. Final Regression Equation
y = x1 + 2x2

Conclusion
The regression equation y=x1+2x2 effectively models the given data using
the least squares method.
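
A quick check with NumPy: np.linalg.lstsq returns the minimum-norm least-squares solution, which for this rank-deficient (collinear) dataset is exactly b0 = 0, b1 = 1, b2 = 2.

```python
import numpy as np

# Design matrix with an intercept column, data from the table above
X = np.array([[1, 1, 2],
              [1, 2, 4],
              [1, 3, 6]], dtype=float)
y = np.array([5, 10, 15], dtype=float)

# Since x2 = 2*x1, X is rank-deficient; lstsq returns the minimum-norm solution
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(6))   # [0. 1. 2.]  ->  y = x1 + 2*x2
print(rank)            # 2 (the collinearity is detected)
```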

Q8 What is regularization in regression? Explain the difference between Ridge regression and LASSO regression with examples. Why does LASSO perform feature selection?
Answer- Concept of Multi-Variable Regression
Multi-variable regression, also known as multiple linear regression, is a
statistical technique used to model the relationship between a dependent
variable (y) and two or more independent variables (x1,x2,.....). The
regression equation is expressed as:
y=b0+b1x1+b2x2+⋯+bnxn
Here:
• b0 is the intercept.
• b1, b2, …, bn are the regression coefficients representing the impact of
each independent variable on y.
The coefficients are calculated using the least squares method, which
minimizes the sum of squared residuals (the difference between observed
and predicted y).

Calculation Using Least Squares Regression


Given data:

x1 x2 y

1 2 5

2 4 10

3 6 15

We need to calculate b0, b1, and b2.


Final Regression Equation:
The coefficients are:
b0 = 0, b1 = 1, b2 = 2
The regression equation is:
y = x1 + 2x2
(The same solution as in Q7; because x2 = 2x1 in this data, any b0 = 0 with b1 + 2b2 = 5 also fits exactly.)

Conclusion:
The least squares method provides an optimized model to predict y based
on x1 and x2. The derived equation can be used for forecasting or analyzing
relationships between variables.
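
As a hedged sketch of the Ridge/LASSO contrast the question asks about (scikit-learn assumed available; the synthetic data and alpha values are illustrative only): the L1 penalty in LASSO drives coefficients of irrelevant features exactly to zero, which is why it performs feature selection, while the L2 penalty in Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
# Synthetic data: only the first 2 of 5 features actually influence y
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: zeroes out irrelevant coefficients

print("Ridge coefficients:", ridge.coef_.round(3))  # small but non-zero everywhere
print("LASSO coefficients:", lasso.coef_.round(3))  # typically exact zeros on noise features
```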

Q9 Given the cost function J(θ) = (1/2m) Σ(hθ(x) − y)² + λΣ|θ|, calculate the optimized values of θ for a simple linear regression problem with LASSO regularization when λ = 0.1.
Answer- To solve the given problem, we calculate the optimized values of θ
for a simple linear regression problem using LASSO (Least Absolute
Shrinkage and Selection Operator) regularization. The cost function is
provided as:

J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))² + λ Σ |θ|

where:
• m is the number of data points,
• hθ(x(i)) is the hypothesis hθ(x) = θ0 + θ1x evaluated at x(i) for linear regression,
• λ is the regularization parameter, and
• ∣θ∣ is the LASSO regularization term penalizing the absolute values of
θ.

Steps to Solve
1. Hypothesis Function
The hypothesis for linear regression is:
hθ(x) = θ0 + θ1x
For m training examples (x(i), y(i)), the squared error term becomes:
(1/2m) Σ (hθ(x(i)) − y(i))²

2. LASSO Regularization
LASSO adds the term λ∑∣θ∣ to the cost function. This term induces sparsity
by shrinking some θj values to zero.
3. Derivative and Update Rule
To minimize J(θ), take partial derivatives (a subgradient for the L1 term) with respect to θ0 and θ1:
∂J/∂θ0 = (1/m) Σ (hθ(x(i)) − y(i))
∂J/∂θ1 = (1/m) Σ (hθ(x(i)) − y(i)) x(i) + λ·sign(θ1)
Here, sign(θ1) adjusts the penalty based on the direction of θ1.


4. Numerical Example
Assume m=3, λ=0.1, and the dataset is:

x y
1 2
2 4
3 6

Step 1: Initial Guess


Start with initial values θ0=0 and θ1=0. Set the learning rate α for gradient
descent.
Step 2: Calculate Gradients
At each iteration, compute the gradients over all training examples:
g0 = (1/m) Σ (hθ(x(i)) − y(i))
g1 = (1/m) Σ (hθ(x(i)) − y(i)) x(i) + λ·sign(θ1)
Step 3: Update Parameters
Using the gradient descent update rule:
θ0 := θ0 − α·g0
θ1 := θ1 − α·g1
Step 4: Convergence
Repeat the updates until the cost function converges to a minimum. The
final values of θ0 and θ1 represent the optimized parameters.

Result
After applying LASSO regularization, λ = 0.1 shrinks the magnitude of θ1, and a sufficiently large λ can set the coefficients of irrelevant features exactly to zero, leaving a sparse model in which only significant predictors remain. For the given data the unregularized least-squares fit is θ0 = 0, θ1 = 2 (since y = 2x exactly); with the L1 penalty applied to θ1 as above, the iterations converge to approximately θ0 ≈ 0.3 and θ1 ≈ 1.85.
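
A minimal sketch of the iterative procedure described above, under two assumptions that follow the answer's wording: only θ1 carries the L1 penalty, and a hypothetical learning rate α = 0.1 is used. On this dataset the loop converges to roughly θ0 ≈ 0.3 and θ1 ≈ 1.85, slightly shrunk from the unregularized fit θ0 = 0, θ1 = 2.

```python
import numpy as np

# Data from the table above: y = 2x exactly
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)
lam, alpha = 0.1, 0.1          # regularization strength and (hypothetical) learning rate

theta0, theta1 = 0.0, 0.0      # Step 1: initial guess
for _ in range(5000):
    pred = theta0 + theta1 * x
    # Step 2: gradients of the squared-error term, plus the L1 subgradient on theta1
    g0 = (1 / m) * np.sum(pred - y)
    g1 = (1 / m) * np.sum((pred - y) * x) + lam * np.sign(theta1)
    # Step 3: parameter update
    theta0 -= alpha * g0
    theta1 -= alpha * g1

print(round(theta0, 3), round(theta1, 3))   # roughly 0.3 and 1.85
```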

Q10 Regression is widely used in various fields. Discuss at least three real-world applications of regression, emphasizing the type of regression used and why it is suitable for the task.
Answer- Regression analysis is a powerful statistical tool widely used in
various fields for prediction and decision-making. Below are three real-
world applications of regression, with an explanation of the type of
regression used and its suitability for the task.

1. Predicting Housing Prices


• Application: Regression is extensively used in real estate to predict
housing prices based on features such as the size of the house,
location, number of bedrooms, proximity to schools, and amenities.
• Type of Regression: Multiple Linear Regression is commonly used
for this purpose because it considers multiple independent variables
to predict a continuous dependent variable (price).
• Suitability: Housing prices depend on various factors
simultaneously, and multiple linear regression effectively quantifies
the impact of each feature. By analyzing the coefficients, real estate
agents can determine which factors most influence prices, allowing
for more accurate and data-driven pricing strategies.

2. Medical Prognosis and Disease Risk Prediction


• Application: Regression models are used in healthcare to predict
disease risks, such as the likelihood of developing diabetes or heart
disease based on variables like age, body mass index (BMI),
cholesterol levels, and blood pressure.
• Type of Regression: Logistic Regression is typically employed here
since the outcome is binary (e.g., the presence or absence of
disease).
• Suitability: Logistic regression is suitable because it calculates
probabilities and maps them to binary outcomes using the sigmoid
function. This makes it ideal for applications where decisions need to
be made based on a threshold probability, such as whether a patient
should undergo further medical testing or intervention.

3. Marketing and Sales Forecasting


• Application: Companies use regression to forecast sales based on
factors like advertising expenditure, economic indicators, seasonal
trends, and consumer behavior.
• Type of Regression: Time Series Regression and Polynomial
Regression are often used to model non-linear relationships or
account for time-dependent factors.
• Suitability: Regression helps businesses understand how advertising
or market conditions influence sales, enabling them to optimize
marketing budgets and anticipate future performance. By capturing
trends and seasonality, regression models provide actionable
insights for strategic planning.
