01 Machine Learning


N.G. ACHARYA & D.K. MARATHE COLLEGE OF
ARTS, SCIENCE & COMMERCE.
(Affiliated to the University of Mumbai)

PRACTICAL JOURNAL

PSCSP512
Machine Learning
SUBMITTED BY
KAMBLE YASH RAJESH
SEAT NO :

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS


FOR QUALIFYING M.Sc. (CS) PART-I (SEMESTER – II) EXAMINATION.

2023-2024

DEPARTMENT OF COMPUTER SCIENCE


SHREE N.G. ACHARYA MARG, CHEMBUR
MUMBAI-400 071
N.G. ACHARYA & D.K. MARATHE COLLEGE OF
ARTS, SCIENCE & COMMERCE.
(Affiliated to the University of Mumbai)

CERTIFICATE

This is to certify that Mr. Kamble Yash Rajesh, Seat No. , studying in
Master of Science in Computer Science Part I Semester II, has
satisfactorily completed the Practicals of PSCSP512 Machine Learning
as prescribed by the University of Mumbai during the academic year
2023-24.

Signature Signature Signature

Internal Guide External Examiner Head Of Department

College Seal Date:


INDEX

Practical No.   Practical                                                              Signature

1   Implement Linear Regression (Diabetes Dataset)
2   Implement Logistic Regression (Iris Dataset)
3   Implement Multinomial Logistic Regression (Iris Dataset)
4   Implement SVM Classifier (Iris Dataset)
5   Train and fine-tune a Decision Tree for the Moons Dataset
6   Train an SVM Regressor on the California Housing Dataset
7   Implement Batch Gradient Descent with Early Stopping for Softmax Regression
8   Implement MLP for Classification of Handwritten Digits (MNIST Dataset)
Practical No:- 01
AIM :- Implement Linear Regression (Diabetes Dataset).
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

diabetes = datasets.load_diabetes()
diabetes

print(diabetes.DESCR)
# columns
diabetes.feature_names

# Now we will split the data into the independent and dependent variables
X = diabetes.data
Y = diabetes.target
X.shape, Y.shape

# We will split the data into training and testing data


from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)

train_x.shape, train_y.shape

# Linear Regression
from sklearn.linear_model import LinearRegression

le = LinearRegression()
le.fit(train_x,train_y)

y_pred = le.predict(test_x)
y_pred
result = pd.DataFrame({'Actual': test_y, 'Predict' : y_pred})
result

# we will check the accuracy


print('coefficient', le.coef_)
print('intercept', le.intercept_)
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

#Variance_Score
explained_variance_score(test_y, y_pred)

>> 0.47737703777354545
# mean_squared_error
mean_squared_error(test_y,y_pred)

>> 3157.972848565651

# r2 score
r2_score(test_y,y_pred)

Inference:-
The model explains about 47.7% (0.477377) of the variance in the target with respect to the features.
The mean squared error of the model is 3157.97.
The R² score of the model is about 0.45.
Below are the coefficients and intercept of the regression equation as calculated by the
model.

# train_x is a NumPy array, so use the dataset's feature names for the index
coeff = pd.Series(le.coef_, index=diabetes.feature_names)


intercept = le.intercept_

print("Coefficients:\n")
print(coeff)
print("\n")
print("Intercept:\n")
print(intercept)
print("\n")

Coefficients:

age 54.820535
sex -260.930304
bmi 458.001802
bp 303.502332
s1 -995.584889
s2 698.811401
s3 183.095229
s4 185.698494
s5 838.503887
s6 96.441048
dtype: float64

Intercept:

154.42752615353518

The regression equation would be:

Diabetes Progression = Intercept + coeff(1) × age + coeff(2) × sex + ..... + coeff(10) × s6
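As a quick check, the equation can be evaluated by hand for one test sample and compared with the model's prediction (a minimal sketch; the choice of sample index 0 is arbitrary):

# Manually apply the regression equation to the first test sample
manual_pred = le.intercept_ + np.dot(le.coef_, test_x[0])
print(manual_pred, le.predict(test_x[:1])[0])  # the two values should match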


Practical No 2
Aim :- Implement Logistic Regression (Iris Dataset)
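The training steps for this practical are not reproduced in the listing below. A minimal sketch that produces the Y_test and Y_predict used in the accuracy check (assuming the full Iris feature set and a 75/25 train-test split; the exact parameters used originally are not shown):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Iris data and hold out 25% for testing (assumed split)
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=0)

# Fit a logistic regression model and predict on the test set
model = LogisticRegression(max_iter=200)
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)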
#Check the accuracy of model
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, Y_predict))
0.973684
Inference:-
The logistic regression model's accuracy is 97.36%.
An accuracy of 97.36% indicates a very good model, and the confusion matrix shows that
only one sample is misclassified.
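The confusion matrix referred to above is not shown in the listing; it can be computed as below (a sketch, using the same Y_test and Y_predict):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test, Y_predict))  # off-diagonal entries count the misclassified samples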
Practical No 3
Aim:- Implement Multinomial Logistic Regression (Iris Dataset)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing Sklearn module and classes


from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import datasets
from sklearn.model_selection import train_test_split

#Data Loading – IRIS dataset


iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2], [4.9, 3. , 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4], ...

X = iris.data[:, [0, 2]]


Y = iris.target

#Create Training / Test Data


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1,
stratify=Y)

X_train
X_train.shape

(105, 2)

X_test.shape

(45, 2)

#Perform Feature Scaling


# in order to make sure the features are on a comparable scale (zero mean, unit variance) irrespective of their original values / units

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
X_train
X_train_std
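As a quick check (a sketch), StandardScaler transforms each feature to z = (x - mean) / std, with the mean and standard deviation computed on the training data, so the scaled training features should have mean close to 0 and standard deviation close to 1:

# Verify the standardization of the training features
print(X_train_std.mean(axis=0))  # each value should be close to 0
print(X_train_std.std(axis=0))   # each value should be close to 1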

#Train a Logistic Regression Model


# Create an instance of LogisticRegression classifier
lr = LogisticRegression(C=100.0, solver='lbfgs', multi_class='multinomial')

# Fit the model

lr.fit(X_train_std, Y_train)

#Measure model performance


# Create the predictions

Y_predict = lr.predict(X_test_std)
Y_predict

array([2, 0, 0, 1, 1, 1, 2, 1, 2, 0, 0, 2, 0, 1, 0, 1, 2, 1, 1, 2, 2, 0, 1, 1, 1, 1, 1, 2, 0, 2, 0, 0, 1, 1, 2,
2, 0, 0, 0, 1, 2, 2, 1, 0, 0])

# Use metrics.accuracy_score to measure the score


print("multinomial LogisticRegression Accuracy %.3f" %metrics.accuracy_score(Y_test,
Y_predict))

Output

multinomial LogisticRegression Accuracy 0.956

Inference - The accuracy score of the model is 95.6%.


Practical No 4
Aim:- Implement SVM classifier (Iris Dataset)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Define the col names


colnames=["sepal_length_in_cm",
"sepal_width_in_cm","petal_length_in_cm","petal_width_in_cm", "class"]

#Read the dataset


dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None, names=colnames)

#Data
dataset.head()

#Encoding the categorical column


dataset = dataset.replace({"class": {"Iris-setosa":1,"Iris-versicolor":2, "Iris-virginica":3}})
#Visualize the new dataset
dataset.head()

# Now we’re going to analyze our data


plt.figure(1)
sns.heatmap(dataset.corr())
plt.title('Correlation On iris Classes')

# Splitting the data


X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#Create the SVM model


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
#Fit the model for the data

classifier.fit(X_train, y_train)

#Make the prediction


y_pred = classifier.predict(X_test)

# And finally, check the accuracy of the model


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

from sklearn.model_selection import cross_val_score


accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Output
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
Accuracy: 98.18 %
Standard Deviation: 3.64 %

Inference - The accuracy score of the model is 98.18% and the standard deviation is 3.64%.

An accuracy of 98% indicates a very good model, and the confusion matrix shows that only
one sample is misclassified.
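In addition to cross-validation on the training data, the accuracy on the held-out test set can be checked directly (a sketch, using the classifier and the 25% test split defined above):

# Accuracy of the fitted SVM classifier on the test split
from sklearn.metrics import accuracy_score
print("Test accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred) * 100))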
Practical No 5
Aim:- Train and fine-tune a Decision Tree for the Moons Dataset

import numpy as np
import matplotlib.pyplot as plt
def plot_dataset(X, y, axes):
    # Scatter plot of the two moons, one marker style per class
    plt.figure(figsize=(10, 6))
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", alpha=0.5)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^", alpha=0.2)
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=21)


plot_dataset(X, y, [-3, 5, -3, 3])

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =train_test_split(X,y, test_size = 0.2)

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()

from sklearn.model_selection import GridSearchCV

parameter = {
    'criterion': ["gini", "entropy"],
    'max_leaf_nodes': list(range(2, 50)),
    'min_samples_split': [2, 3, 4]
}

clf = GridSearchCV(tree_clf, parameter, cv=5, scoring="accuracy", return_train_score=True, n_jobs=-1)

clf.fit(X_train, y_train)

clf.best_params_

{'criterion': 'gini', 'max_leaf_nodes': 37, 'min_samples_split': 2}

cvres = clf.cv_results_
for mean_score, params in zip(cvres["mean_train_score"], cvres["params"]):
    print(mean_score, params)

0.77859375 {'criterion': 'gini', 'max_leaf_nodes': 2, 'min_samples_split': 2}


0.77859375 {'criterion': 'gini', 'max_leaf_nodes': 2, 'min_samples_split': 3}
0.77859375 {'criterion': 'gini', 'max_leaf_nodes': 2, 'min_samples_split': 4}
0.8201562500000001 {'criterion': 'gini', 'max_leaf_nodes': 3, 'min_samples_split': 2}
0.8201562500000001 {'criterion': 'gini', 'max_leaf_nodes': 3, 'min_samples_split': 3}
0.8201562500000001 {'criterion': 'gini', 'max_leaf_nodes': 3, 'min_samples_split': 4}
0.8596875 {'criterion': 'gini', 'max_leaf_nodes': 4, 'min_samples_split': 2}

#Getting the training score:


clf.score(X_train, y_train)
0.875125

We have an accuracy of approximately 87%, but accuracy alone is sometimes not a good
measure, so let's look at the confusion matrix.

from sklearn.metrics import confusion_matrix


pred = clf.predict(X_train)
confusion_matrix(y_train,pred)

array([[3547, 481],
[ 518, 3454]])
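As a quick check, the training accuracy reported above can be recovered from the confusion matrix: the diagonal holds the correctly classified samples, so (3547 + 3454) / 8000 = 0.875125, which matches clf.score(X_train, y_train).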
Now from the confusion matrix let's get our precision and recall, which are better
metrics.
from sklearn.metrics import precision_score, recall_score

pre = precision_score(y_train, pred)


re = recall_score(y_train, pred)
print(f"Precision: {pre} Recall:{re}")

Precision: 0.8777636594663278 Recall:0.8695871097683786

Not bad: precision is slightly higher than recall, but let's combine the two metrics
into the F1 score.

from sklearn.metrics import f1_score


f1_score(y_train, pred)

0.8736562539521943
Our F1 score and accuracy are almost the same.
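As a quick check (a sketch), the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values obtained above:

# Recompute the F1 score from precision and recall
f1_manual = 2 * pre * re / (pre + re)
print(f1_manual)  # should match the value returned by f1_score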

Getting the testing score:


clf.score(X_test, y_test)

0.8585

Inference:-
We have an accuracy of approximately 85% on the testing set.
Practical No 6
Aim:- Train an SVM regressor on the California Housing dataset
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
descr = housing_data['DESCR']
feature_names = housing_data['feature_names']
data = housing_data['data']
target = housing_data['target']
df1 = pd.DataFrame(data=data)
df1.rename(columns={0: feature_names[0], 1: feature_names[1], 2: feature_names[2], 3: feature_names[3],
                    4: feature_names[4], 5: feature_names[5], 6: feature_names[6], 7: feature_names[7]},
           inplace=True)
df2 = pd.DataFrame(data=target)
df2.rename(columns={0: 'Target'}, inplace=True)
housing = pd.concat([df1, df2], axis=1)
print(housing.columns)
housing.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422
print("dimension of housing data: {}".format(housing.shape))

dimension of housing data: (20640, 9)

housing.info()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing.loc[:, housing.columns != 'Target'],
housing['Target'], random_state=66)
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train, y_train)
s1 = svr.score(X_train, y_train)
s2 = svr.score(X_test, y_test)
print("R² of Support Vector Regressor on training set: {:.3f}".format(s1))
print("R² of Support Vector Regressor on test set: {:.3f}".format(s2))
O/P

R² of Support Vector Regressor on training set: -0.023

R² of Support Vector Regressor on test set: -0.033

Inference:-
The model underperforms quite substantially, with a negative R² on both the training set
and the test set.
SVM requires all the features to vary on a similar scale, so we need to rescale the data so
that all the features are approximately on the same scale:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use the scaler fitted on the training data
svr1 = SVR()
svr1.fit(X_train_scaled, y_train)
s3 = svr1.score(X_train_scaled, y_train)
s4 = svr1.score(X_test_scaled, y_test)
print("R² of Support Vector Regressor on training set: {:.3f}".format(s3))
print("R² of Support Vector Regressor on test set: {:.3f}".format(s4))

R² of Support Vector Regressor on training set: 0.659

R² of Support Vector Regressor on test set: 0.663

Inference:-
Scaling the data made a huge difference! Now the model is actually underfitting: training
and test set performance are quite similar, but both are still far from an R² of 1.0. From here,
we can try increasing either gamma or C to fit a more complex model.
svr2 = SVR(gamma=10)
svr2.fit(X_train_scaled, y_train)
s5 = svr2.score(X_train_scaled, y_train)
s6 = svr2.score(X_test_scaled, y_test)
print("R² of Support Vector Regressor on training set: {:.3f}".format(s5))
print("R² of Support Vector Regressor on test set: {:.3f}".format(s6))

R² of Support Vector Regressor on training set: 0.702

R² of Support Vector Regressor on test set: 0.697

Inference:-
Here, increasing gamma improves the model, resulting in a test-set R² of about 0.70.
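To tune C and gamma more systematically, a small grid search could be run on the scaled data (a sketch; the parameter grid below is an assumption, not part of the original practical):

# Hypothetical grid search over C and gamma on the scaled features
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
grid = GridSearchCV(SVR(), param_grid, cv=3, scoring='r2', n_jobs=-1)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print("R² of best SVR on test set: {:.3f}".format(grid.best_estimator_.score(X_test_scaled, y_test)))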
Practical No 7
Aim:- Implement Batch Gradient Descent with early stopping for Softmax
Regression.

import numpy as np
import math

class SoftmaxClassifier:
    def __init__(self, learning_rate=0.1, max_iter=1000):
        self.__learning_rate = learning_rate
        self.__max_iter = max_iter

    def __calculate_score(self, k, x):
        # Linear score of class k for sample x
        weight = self.__weights[k]
        return x.dot(weight)

    def train(self, x, y):
        self.__x = x
        self.__y = y
        self.__class_count = len(self.__y[0])
        self.__weights = np.random.rand(self.__class_count, x.shape[1])
        # Batch gradient descent: every iteration uses the full training set
        for i in range(self.__max_iter):
            for j in range(self.__class_count):
                self.__weights[j] = self.__calculate_new_weights(j)

    def __calculate_softmax(self, k, x):
        # softmax_k(x) = exp(score_k) / sum_i exp(score_i);
        # the max score is subtracted so that exp() does not overflow
        scores = [self.__calculate_score(i, x) for i in range(self.__class_count)]
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        return exps[k] / sum(exps)

    def __calculate_cross_entropy_gradient(self, k):
        # Gradient of the cross-entropy loss with respect to the weights of class k
        grad = 0
        for i in range(len(self.__x)):
            grad += (self.__calculate_softmax(k, self.__x[i]) - self.__y[i][k]) * self.__x[i]
        return grad

    def __calculate_new_weights(self, k):
        step_size = self.__calculate_cross_entropy_gradient(k) * self.__learning_rate
        return self.__weights[k] - step_size

    def predict(self, x):
        # One-hot predictions: mark the class with the highest softmax probability
        y = np.zeros((len(x), self.__class_count))
        for i in range(len(x)):
            max_score_index = 0
            max_score = 0
            for j in range(self.__class_count):
                score = self.__calculate_softmax(j, x[i])
                if score > max_score:
                    max_score = score
                    max_score_index = j
            y[i][max_score_index] = 1
        return y

from sklearn import datasets


from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import numpy as np

def convert_to_one_hot(labels):
    # Turn integer class labels into one-hot encoded rows
    class_count = len(set(labels))
    one_hot = np.zeros((len(labels), class_count))
    for i in range(len(labels)):
        one_hot[i][labels[i]] = 1
    return one_hot

def main():
    iris = datasets.load_iris()
    data = iris['data']
    labels_one_hot = convert_to_one_hot(iris['target'])
    rand = np.random.permutation(len(data))
    x_train, x_test, y_train, y_test = train_test_split(data[rand], labels_one_hot[rand], test_size=0.33)
    soft_clf = SoftmaxClassifier()
    soft_clf.train(x_train, y_train)
    y_pred = soft_clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)

if __name__ == "__main__":
    main()

O/P Accuracy=0.8
Inference:-
The accuracy of the model is 80%.
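The class above runs batch gradient descent for a fixed number of iterations and does not yet stop early. A minimal sketch of early stopping (an assumption, not part of the original listing): hold out a validation set, track the validation cross-entropy after each full-batch update, keep the weights from the best iteration, and stop once the loss has not improved for a given number of updates (the patience):

import numpy as np

def softmax(scores):
    # Row-wise softmax, subtracting the max score for numerical stability
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_early_stopping(x_train, y_train, x_val, y_val,
                                 learning_rate=0.1, max_iter=1000, patience=20):
    # y_train / y_val are one-hot encoded; weights have shape (n_features, n_classes)
    weights = np.random.rand(x_train.shape[1], y_train.shape[1])
    best_loss, best_weights, wait = np.inf, weights.copy(), 0
    for _ in range(max_iter):
        probs = softmax(x_train.dot(weights))
        grad = x_train.T.dot(probs - y_train) / len(x_train)  # batch gradient over the full training set
        weights -= learning_rate * grad
        # Cross-entropy loss on the held-out validation set
        val_loss = -np.mean(np.sum(y_val * np.log(softmax(x_val.dot(weights)) + 1e-12), axis=1))
        if val_loss < best_loss:
            best_loss, best_weights, wait = val_loss, weights.copy(), 0
        else:
            wait += 1
            if wait >= patience:  # stop once the validation loss stops improving
                break
    return best_weights

Predictions can then be made with np.argmax(softmax(x.dot(best_weights)), axis=1).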
Practical No 8
Aim:- Implement MLP for classification of handwritten digits (MNIST
Dataset).

import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

import matplotlib.pyplot as plt

image_index = 7777 # You may select anything up to 60,000


print(y_train[image_index]) # The label is 8
plt.imshow(x_train[image_index], cmap='Greys')

8
<matplotlib.image.AxesImage at 0x7fa7325add50>

x_train.shape
(60000, 28, 28)

# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
# Making sure that the values are float so that we can get decimal points after division
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the pixel values by dividing by the maximum value (255)
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])
x_train shape: (60000, 28, 28, 1)
Number of images in x_train 60000
Number of images in x_test 10000

# Importing the required Keras modules containing model and layers


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
# Creating a Sequential Model and adding the layers
model = Sequential()
model.add(Conv2D(28, kernel_size=(3,3), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
model.add(Dense(128, activation=tf.nn.relu))
model.add(Dropout(0.2))
model.add(Dense(10,activation=tf.nn.softmax))
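Note that the model above includes a convolutional layer (Conv2D followed by MaxPooling2D), so it is a small CNN. A plain MLP variant matching the practical's title, as a sketch (the layer sizes are an assumption), would flatten the images and use only dense layers; the remaining steps below continue with the convolutional model defined above.

# Sketch of a pure MLP alternative for the same task (hypothetical layer sizes)
mlp = Sequential()
mlp.add(Flatten(input_shape=input_shape))   # 28x28x1 pixels -> 784 inputs
mlp.add(Dense(128, activation=tf.nn.relu))
mlp.add(Dropout(0.2))
mlp.add(Dense(10, activation=tf.nn.softmax))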

model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x=x_train,y=y_train, epochs=10)
Epoch 1/10
1875/1875 [==============================] - 43s 22ms/step - loss: 0.2121 -
accuracy: 0.9357
Epoch 2/10
1875/1875 [==============================] - 42s 22ms/step - loss: 0.0878 -
accuracy: 0.9727
Epoch 3/10
1875/1875 [==============================] - 45s 24ms/step - loss: 0.0588 -
accuracy: 0.9815
Epoch 4/10
1875/1875 [==============================] - 43s 23ms/step - loss: 0.0447 -
accuracy: 0.9855
Epoch 5/10
1875/1875 [==============================] - 44s 23ms/step - loss: 0.0359 -
accuracy: 0.9886
Epoch 6/10
1875/1875 [==============================] - 42s 22ms/step - loss: 0.0307 -
accuracy: 0.9897
Epoch 7/10
1875/1875 [==============================] - 43s 23ms/step - loss: 0.0259 -
accuracy: 0.9911
Epoch 8/10
1875/1875 [==============================] - 45s 24ms/step - loss: 0.0230 -
accuracy: 0.9921
Epoch 9/10
1875/1875 [==============================] - 43s 23ms/step - loss: 0.0197 -
accuracy: 0.9932
Epoch 10/10
1875/1875 [==============================] - 42s 22ms/step - loss: 0.0183 -
accuracy: 0.9943
<keras.callbacks.History at 0x7fa72f2c22f0>

model.evaluate(x_test, y_test)

313/313 [==============================] - 4s 13ms/step - loss: 0.0644 - accuracy:


0.9849
[0.06441233307123184, 0.9848999977111816]

Inference:-
The model's accuracy on the test set is approximately 98.5%.

Use the model to predict the image at index position 4444:

image_index = 4444
plt.imshow(x_test[image_index].reshape(28, 28),cmap='Greys')
pred = model.predict(x_test[image_index].reshape(1, 28, 28, 1))
print(pred.argmax())

1/1 [==============================] - 0s 115ms/step


9

Inference:- The model predicts the correct digit (9).
