Machine Learning Project
Great Learning
Regards,
Akshay Pankar
Problem 1:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict
which party a voter will vote for on the basis of the given information, to create an exit poll that will
help in predicting overall win and seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference
on it.
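A minimal sketch of this step (the file name 'Election_Data.xlsx' and variable names are assumptions, not taken from the report):

import pandas as pd

# Read the dataset; the file name here is an assumption.
df = pd.read_excel('Election_Data.xlsx')

# Descriptive statistics for all columns.
print(df.describe(include='all'))

# Null value condition check: count of missing values per column.
print(df.isnull().sum())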
Figure 2: Dataset info output
Inference drawn: From this we can conclude that no outliers are detected in the data.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Figure 9: Political knowledge distribution
From this we can conclude that the density is higher for voters with less political knowledge, and that the ages of the voters range from about 20 to 80.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the
data into train and test (70:30)
Columns output:
Train-test split:
Our model will use all the variables, with 'vote_Labour' as the target variable. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets, as sketched below.
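A minimal sketch of the 70:30 split, assuming the encoded DataFrame is named df and the target column is 'vote_Labour' (names are illustrative):

from sklearn.model_selection import train_test_split

X = df.drop('vote_Labour', axis=1)  # all predictor variables
y = df['vote_Labour']               # target variable

# 70:30 split; stratify keeps the class ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)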
Why scaling?
• The dataset contains features that vary widely in magnitude, units and range, for example between the 'age' column and the other columns.
• Most machine learning algorithms use the Euclidean distance between two data points in their computations, so unscaled features with large ranges can dominate the result.
• In this case, we have a lot of encoded, ordinal, categorical and continuous variables, so we may use the min-max scaler technique to scale the data, as sketched below.
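A minimal sketch of min-max scaling, assuming X_train and X_test come from the split above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to the [0, 1] range

# Fit on the training data only, then apply the same transform to the
# test data so that no information from the test set leaks into scaling.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)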
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
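A hedged sketch of fitting both models (parameter choices are assumptions; the report's exact settings are not shown):

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

# Fit Logistic Regression and LDA on the training data.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Accuracy, precision, recall and F1-score for train and test sets.
print(classification_report(y_train, lr.predict(X_train)))
print(classification_report(y_test, lr.predict(X_test)))
print(classification_report(y_test, lda.predict(X_test)))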
Figure 15
• Accuracy: 84%
• Precision: 87%
• Recall: 91%
• F1-Score: 89%
Performance of the model
Figure 16
Classification report - test data:
Figure 17
• Accuracy: 84%
• Precision: 91%
• Recall: 87%
• F1-Score: 89%
Test data:
• Accuracy: 81%
• Precision: 89%
• Recall: 85%
• F1-Score: 87%
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
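A minimal sketch of both models (k=5 is an assumption; the scaled features are used for KNN since it is distance-based):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# KNN works on distances, so it is trained on the scaled features.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
print('KNN train accuracy:', knn.score(X_train_scaled, y_train))
print('KNN test accuracy:', knn.score(X_test_scaled, y_test))

# Gaussian Naive Bayes assumes conditional independence of features.
nb = GaussianNB().fit(X_train, y_train)
print('NB train accuracy:', nb.score(X_train, y_train))
print('NB test accuracy:', nb.score(X_test, y_test))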
Figure 18
• Accuracy: 86%
• Precision: 92%
• Recall: 88%
• F1-Score: 90%
• For the training data, the accuracy of the KNN model is 0.86, which means that the model is able to
correctly classify 86% of the samples in the training set.
• The precision and recall values for the two classes (0 and 1) in the training set are also quite high.
This suggests that the model is able to accurately identify both classes, without many false positives
or false negatives. The F1-score, which is the harmonic mean of precision and recall, is also high for
both classes, indicating good performance of the model.
• For the test data, the accuracy is 0.77, which is lower than the accuracy for the training data. This is
not surprising, as models tend to perform better on the data they were trained on. However, an
accuracy of 0.77 is still relatively high.
• The precision, recall, and F1-score values for the two classes in the test set are lower than those in
the training set. This suggests that the model may be overfitting to the training data and not
generalizing well to new data. However, the precision, recall, and F1-score values are still
reasonable, so the model may still be useful for classification tasks.
• Overall, the KNN model appears to perform reasonably well, with good performance on the training
data and reasonable performance on the test data.
Figure 19
Figure 20
Train data:
• Accuracy: 84%
• Precision: 89%
• Recall: 88%
• F1-Score: 88%
Test data:
• Accuracy: 81%
• Precision: 88%
• Recall: 86%
• F1-Score: 87%
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
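A hedged sketch of the tuning step (the parameter grids below are assumptions, not the grids used in the report):

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest (bagging) with a small illustrative grid.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {'n_estimators': [100, 300], 'max_depth': [5, 7, None]},
    cv=5).fit(X_train, y_train)
print('RF best:', rf_grid.best_params_, rf_grid.best_score_)

# AdaBoost (boosting) with a small illustrative grid.
ada_grid = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    {'n_estimators': [50, 100], 'learning_rate': [0.1, 1.0]},
    cv=5).fit(X_train, y_train)
print('AdaBoost best:', ada_grid.best_params_, ada_grid.best_score_)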
Insights:
• Based on the results obtained, it appears that the AdaBoost algorithm outperforms the KNN and Random Forest algorithms. The best accuracy achieved by the AdaBoost algorithm is 0.8426, which is higher than the best accuracies of KNN (0.8350) and Random Forest (0.8369). However, all three algorithms perform reasonably well, with accuracies ranging from 0.81 to 0.84.
• Regarding the feature importance, Random Forest provides a feature importance score. Based on the score, it appears that 'age', 'education', 'economic.cond.national', and 'satisfaction.financial' are the most important features in predicting the voting behavior of the individuals.
• When looking at the classification reports for both the train and test datasets, it is evident that all three models perform well on the training data. However, the AdaBoost algorithm slightly outperforms the KNN and Random Forest algorithms on the test dataset.
• Overall, the AdaBoost algorithm seems to be the most suitable for this particular dataset
based on the results obtained.
Logistic Regression Model Tuning:
Logistic Regression best parameters: {'C': 10, 'penalty': 'l2'}
Logistic Regression best accuracy: 0.8416216927734632
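A minimal sketch of wrapping the tuned logistic regression in bagging and boosting (ensemble sizes are assumptions; the estimator= keyword needs scikit-learn >= 1.2, older versions use base_estimator=):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

base_lr = LogisticRegression(C=10, penalty='l2', max_iter=1000)

# Bagging: many logistic regressions on bootstrap samples, majority vote.
bag_lr = BaggingClassifier(estimator=base_lr, n_estimators=50,
                           random_state=1).fit(X_train, y_train)
print('Bagging LR test accuracy:', bag_lr.score(X_test, y_test))

# Boosting: sequential logistic regressions re-weighting hard examples.
ada_lr = AdaBoostClassifier(estimator=base_lr, n_estimators=50,
                            random_state=1).fit(X_train, y_train)
print('Boosting LR test accuracy:', ada_lr.score(X_test, y_test))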
Insights:
• We can see that the best parameters for logistic regression were C=10 and penalty='l2', which achieved an accuracy of 84.16%. We then used bagging and boosting to try to improve the accuracy of the logistic regression model.
• The Bagging LR model achieved a test accuracy of 81%, which is lower than the logistic
regression model's accuracy. However, we can see that the precision, recall, and f1-score of
the Bagging LR model are relatively stable compared to the logistic regression model.
• The Boosting LR model achieved the same test accuracy as the Bagging LR model, but the
precision, recall, and f1-score are higher than the Bagging LR model. It means that the
Boosting LR model is better at identifying both the positive and negative classes, and hence
it has a better f1-score.
• Overall, the logistic regression model with C=10 and penalty='l2' performed well, and the Bagging LR and Boosting LR models did not improve the accuracy by a large margin.
Insights:
• The Naive Bayes algorithm achieved a best train accuracy of 0.8388. When bagging was applied to the algorithm, the test accuracy of the model improved to 0.82 (compared to 0.8144 for the plain model). This suggests that the bagging algorithm was successful in reducing overfitting and improving the overall accuracy of the model.
• On the other hand, when boosting was applied to the Naive Bayes algorithm, the accuracy
of the model decreased to 0.8. This indicates that boosting was not as effective as bagging in
improving the performance of the model.
• In terms of the classification report, both bagging and boosting NB models were able to
achieve high precision and recall for the positive class (1) with F1-scores of 0.87 and 0.86,
respectively. However, the bagging model outperformed the boosting model in terms of the
F1-score for the negative class (0), with a score of 0.68 compared to 0.65 for the boosting
model.
• Overall, the results suggest that bagging may be a better approach for improving the
performance of the Naive Bayes algorithm on this dataset compared to boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
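A minimal sketch of these metrics for one model (shown here for the logistic regression object lr from section 1.4; the same pattern applies to each model):

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Confusion matrix and ROC_AUC on the test set.
y_prob = lr.predict_proba(X_test)[:, 1]      # positive-class probabilities
print(confusion_matrix(y_test, lr.predict(X_test)))
print('Test ROC_AUC:', roc_auc_score(y_test, y_prob))

# ROC curve with the diagonal chance line for reference.
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()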
Classification report – Train and Test:
Confusion Matrix:
Insights:
• The Logistic Regression model has achieved an accuracy of 0.84 on the training set and 0.82
on the test set, indicating that the model is performing well on both the training and test
data. The precision, recall and f1-score for both the classes are also high on both training
and test sets, suggesting that the model is good at classifying both the classes.
• The ROC curve of the Logistic Regression model also shows a good performance, with an
AUC of 0.89 on the training set and 0.87 on the test set. This indicates that the model is able
to differentiate between the two classes well and is able to classify the samples accurately.
• Overall, the Logistic Regression model appears to be a good model for the given
classification problem.
Insights:
• The Naive Bayes model has achieved a train accuracy of 0.8388 and a test accuracy of
0.8144. The train and test classification reports show that the model has performed well in
terms of precision, recall, and F1-score for both classes, with a weighted F1-score of 0.84 for
train and 0.81 for test.
• The ROC curve and AUC score show that the model is performing well in distinguishing
between the positive and negative classes, with a train AUC of 0.89 and a test AUC of 0.87.
• Overall, the Naive Bayes model has performed reasonably well on the given dataset, with
high accuracy, good precision, recall, and F1-scores, and a good ability to distinguish
between the positive and negative classes.
KNN Model:
Train Accuracy: 0.9990627928772259
Test Accuracy: 0.8056768558951966
Insights:
The model is overfitting on the training data. The training accuracy is almost perfect at 99.9%, but the test accuracy is relatively lower at 80.6%. This suggests that the model is not generalizing well to new data.
• Looking at the classification report, the precision, recall, and f1-score for class 0 in the test
set are relatively lower than those for class 1. This indicates that the model is better at
predicting class 1 than class 0.
• The ROC AUC score for the test set is 0.87, indicating that the model's ability to distinguish between the positive and negative classes is fairly good. However, the train ROC AUC score is almost perfect at 0.999, further indicating overfitting.
• Overall, the model's high accuracy on the training set but lower accuracy on the test set,
along with the differences in the classification report and ROC AUC scores, suggest that the
model is overfitting on the training data and is not able to generalize well to new data.
Conclusion:
• Based on the comparison, it appears that the logistic regression model is the best-performing model for the given problem.
• The logistic regression model has the highest accuracy on both the training and testing sets,
as well as the highest area under the ROC curve on both sets. Its precision and recall scores
are also higher than those of the naive Bayes and KNN models.
• Overall, the logistic regression model's performance is consistent across both the training
and testing sets, indicating that it is not overfitting the training data. Therefore, I would
recommend using the logistic regression model for this problem.
• Most people have given a score of 3 or 4 for the national economic condition, and the average score is 3.245221.
• Most people have given a score of 3 or 4 for the household economic condition, and the average score is 3.137772.
• Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.
• The average score for Blair is 3.335531 and the average score for Hague is 2.749506, so we can see that Blair has a better score.
• On a scale of 0 to 3, about 30% of the total population has zero knowledge about
politics/parties.
• People who gave a low score of 1 to a certain party still decided to vote for that same party instead of voting for the other party. This may be because of a lack of political knowledge among the people.
• People with higher Eurosceptic sentiment voted for the Conservative party; the lower the Eurosceptic sentiment, the higher the votes for the Labour party.
• Out of 454 people who gave a score of 0 for political knowledge, 360 people have voted
for the labour party and 94 people have voted for the conservative party.
• All models performed well on the training data set as well as the test data set. The tuned models have performed better than the regular models.
• There is no over-fitting in any model except Random Forest and Bagging regular models.
• The tuned Gradient Boosting model is the best/optimized model.
Problem 2:
In this particular project, we are going to work on the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
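A minimal sketch using the nltk inaugural corpus (the fileids follow nltk's 'YEAR-Name.txt' naming convention):

import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')
nltk.download('punkt')  # needed for sentence tokenization

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    text = inaugural.raw(fileid)
    print(fileid,
          'characters:', len(text),               # including spaces
          'words:', len(inaugural.words(fileid)),
          'sentences:', len(inaugural.sents(fileid)))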
Findings:
Number of Characters:
• President Franklin D. Roosevelt's speech has 7571 characters (including spaces).
• President John F. Kennedy's speech has 7618 characters (including spaces).
• President Richard Nixon's speech has 9991 characters (including spaces).
Number of Words:
Number of Sentences:
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
Franklin D. Roosevelt:
• Nation - 12
• Know - 10
• Spirit - 9
John F. Kennedy:
• Let - 16
• us - 12
• world - 8
Richard Nixon:
• America - 21
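A minimal sketch of the frequency count after stopword removal (lower-casing and keeping only alphabetic tokens are preprocessing assumptions; exact counts depend on such choices):

import nltk
from nltk.corpus import inaugural, stopwords
from nltk.probability import FreqDist
nltk.download('inaugural')
nltk.download('stopwords')

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    words = [w.lower() for w in inaugural.words(fileid)
             if w.isalpha() and w.lower() not in stop]
    print(fileid, FreqDist(words).most_common(3))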
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)
Franklin D. Roosevelt:
John F. Kennedy:
Richard Nixon:
Figure 28: Richard Nixon word cloud
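A minimal sketch of generating the word clouds (assumes the third-party wordcloud package; its built-in STOPWORDS list is used here in place of nltk's):

import nltk
import matplotlib.pyplot as plt
from nltk.corpus import inaugural
from wordcloud import WordCloud, STOPWORDS
nltk.download('inaugural')

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    wc = WordCloud(stopwords=STOPWORDS, background_color='white',
                   width=800, height=400).generate(inaugural.raw(fileid))
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(fileid)
plt.show()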