Machine Learning Project
Great Learning
Regards,
Akshay Pankar
Problem 1:
You are hired by one of the leading news channels CNBE who wants to analyze recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict
which party a voter will vote for on the basis of the given information, to create an exit poll that will
help in predicting overall win and seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference
on it.
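A minimal sketch of this step (the file name 'Election_Data.xlsx' and variable names are assumptions, not taken from the report):

import pandas as pd

# Read the dataset; the file name here is an assumption.
df = pd.read_excel('Election_Data.xlsx')

# Descriptive statistics for all columns.
print(df.describe(include='all'))

# Null value condition check: count of missing values per column.
print(df.isnull().sum())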
Figure 2: Dataset info output
Inference drawn: From this we can conclude that no outliers are detected in the data.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Figure 9: Political knowledge distribution
From this we can conclude that the density is higher for voters with less political knowledge, and that the ages of the voters range from about 20 to 80.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the
data into train and test (70:30)
Columns output:
Train-test split:
Our model will use all the variables, with 'vote_Labour' as the target variable. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets, as sketched below.
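A minimal sketch of the 70:30 split, assuming the encoded DataFrame is named df and the target column is 'vote_Labour' (names are illustrative):

from sklearn.model_selection import train_test_split

X = df.drop('vote_Labour', axis=1)  # all predictor variables
y = df['vote_Labour']               # target variable

# 70:30 split; stratify keeps the class ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)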
Why scaling?
• The dataset contains features that vary widely in magnitude, units and range, for example between the 'age' column and the other columns.
• Most machine learning algorithms use the Euclidean distance between two data points in their computations, so unscaled features with large ranges can dominate the result.
• In this case, we have a lot of encoded, ordinal, categorical and continuous variables, so we may use the min-max scaler technique to scale the data, as sketched below.
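A minimal sketch of min-max scaling, assuming X_train and X_test come from the split above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to the [0, 1] range

# Fit on the training data only, then apply the same transform to the
# test data so that no information from the test set leaks into scaling.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)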
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
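A hedged sketch of fitting both models (parameter choices are assumptions; the report's exact settings are not shown):

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

# Fit Logistic Regression and LDA on the training data.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Accuracy, precision, recall and F1-score for train and test sets.
print(classification_report(y_train, lr.predict(X_train)))
print(classification_report(y_test, lr.predict(X_test)))
print(classification_report(y_test, lda.predict(X_test)))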
Figure 15
• Accuracy: 84%
• Precision: 87%
• Recall: 91%
• F1-Score: 89%
Performance of the model
Figure 16
Classification report - test data:
Figure 17
• Accuracy: 84%
• Precision: 91%
• Recall: 87%
• F1-Score: 89%
Test data:
• Accuracy: 81%
• Precision: 89%
• Recall: 85%
• F1-Score: 87%
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
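A minimal sketch of both models (k=5 is an assumption; the scaled features are used for KNN since it is distance-based):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# KNN works on distances, so it is trained on the scaled features.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
print('KNN train accuracy:', knn.score(X_train_scaled, y_train))
print('KNN test accuracy:', knn.score(X_test_scaled, y_test))

# Gaussian Naive Bayes assumes conditional independence of features.
nb = GaussianNB().fit(X_train, y_train)
print('NB train accuracy:', nb.score(X_train, y_train))
print('NB test accuracy:', nb.score(X_test, y_test))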
Figure 18
• Accuracy: 86%
• Precision: 92%
• Recall: 88%
• F1-Score: 90%
• For the training data, the accuracy of the KNN model is 0.86, which means that the model is able to
correctly classify 86% of the samples in the training set.
• The precision and recall values for the two classes (0 and 1) in the training set are also quite high.
This suggests that the model is able to accurately identify both classes, without many false positives
or false negatives. The F1-score, which is the harmonic mean of precision and recall, is also high for
both classes, indicating good performance of the model.
• For the test data, the accuracy is 0.77, which is lower than the accuracy for the training data. This is
not surprising, as models tend to perform better on the data they were trained on. However, an
accuracy of 0.77 is still relatively high.
• The precision, recall, and F1-score values for the two classes in the test set are lower than those in
the training set. This suggests that the model may be overfitting to the training data and not
generalizing well to new data. However, the precision, recall, and F1-score values are still
reasonable, so the model may still be useful for classification tasks.
• Overall, the KNN model appears to perform reasonably well, with good performance on the training
data and reasonable performance on the test data.
Figure 19
Figure 20
Train data:
• Accuracy: 84%
• Precision: 89%
• Recall: 88%
• F1-Score: 88%
Test data:
• Accuracy: 81%
• Precision: 88%
• Recall: 86%
• F1-Score: 87%
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
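A hedged sketch of the tuning step (the parameter grids below are assumptions, not the grids used in the report):

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest (bagging) with a small illustrative grid.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {'n_estimators': [100, 300], 'max_depth': [5, 7, None]},
    cv=5).fit(X_train, y_train)
print('RF best:', rf_grid.best_params_, rf_grid.best_score_)

# AdaBoost (boosting) with a small illustrative grid.
ada_grid = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    {'n_estimators': [50, 100], 'learning_rate': [0.1, 1.0]},
    cv=5).fit(X_train, y_train)
print('AdaBoost best:', ada_grid.best_params_, ada_grid.best_score_)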
Insights:
• Based on the results obtained, it appears that the AdaBoost algorithm outperforms the KNN and Random Forest algorithms. The best accuracy achieved by the AdaBoost algorithm is 0.8426, which is higher than the best accuracies of KNN (0.8350) and Random Forest (0.8369). However, all three algorithms perform reasonably well, with accuracies ranging from 0.81 to 0.84.
• Regarding the feature importance, Random Forest provides a feature importance score. Based on the score, it appears that 'age', 'education', 'economic.cond.national', and 'satisfaction.financial' are the most important features in predicting the voting behavior of the individuals.
• When looking at the classification reports for both the train and test datasets, it is evident that all three models perform well on the training data. However, the AdaBoost algorithm slightly outperforms the KNN and Random Forest algorithms on the test dataset.
• Overall, the AdaBoost algorithm seems to be the most suitable for this particular dataset
based on the results obtained.
Logistic Regression Model Tuning:
Logistic Regression best parameters: {'C': 10, 'penalty': 'l2'}
Logistic Regression best accuracy: 0.8416216927734632
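A minimal sketch of wrapping the tuned logistic regression in bagging and boosting (ensemble sizes are assumptions; the estimator= keyword needs scikit-learn >= 1.2, older versions use base_estimator=):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

base_lr = LogisticRegression(C=10, penalty='l2', max_iter=1000)

# Bagging: many logistic regressions on bootstrap samples, majority vote.
bag_lr = BaggingClassifier(estimator=base_lr, n_estimators=50,
                           random_state=1).fit(X_train, y_train)
print('Bagging LR test accuracy:', bag_lr.score(X_test, y_test))

# Boosting: sequential logistic regressions re-weighting hard examples.
ada_lr = AdaBoostClassifier(estimator=base_lr, n_estimators=50,
                            random_state=1).fit(X_train, y_train)
print('Boosting LR test accuracy:', ada_lr.score(X_test, y_test))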
Insights:
• We can see that the best parameters for logistic regression were C=10 and penalty='l2', which achieved an accuracy of 84.16%. We then used bagging and boosting to try to improve the accuracy of the logistic regression model.
• The Bagging LR model achieved a test accuracy of 81%, which is lower than the logistic
regression model's accuracy. However, we can see that the precision, recall, and f1-score of
the Bagging LR model are relatively stable compared to the logistic regression model.
• The Boosting LR model achieved the same test accuracy as the Bagging LR model, but the
precision, recall, and f1-score are higher than the Bagging LR model. It means that the
Boosting LR model is better at identifying both the positive and negative classes, and hence
it has a better f1-score.
• Overall, the logistic regression model with C=10 and penalty='l2' performed well, and the Bagging LR and Boosting LR models did not improve the accuracy by a large margin.
Insights:
• The Naive Bayes algorithm achieved a best train accuracy of 0.8388. When bagging was applied to the algorithm, the test accuracy of the model improved to 0.82 (compared to 0.8144 for the plain model). This suggests that the bagging algorithm was successful in reducing overfitting and improving the overall accuracy of the model.
• On the other hand, when boosting was applied to the Naive Bayes algorithm, the accuracy
of the model decreased to 0.8. This indicates that boosting was not as effective as bagging in
improving the performance of the model.
• In terms of the classification report, both bagging and boosting NB models were able to
achieve high precision and recall for the positive class (1) with F1-scores of 0.87 and 0.86,
respectively. However, the bagging model outperformed the boosting model in terms of the
F1-score for the negative class (0), with a score of 0.68 compared to 0.65 for the boosting
model.
• Overall, the results suggest that bagging may be a better approach for improving the
performance of the Naive Bayes algorithm on this dataset compared to boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
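A minimal sketch of these metrics for one model (shown here for the logistic regression object lr from section 1.4; the same pattern applies to each model):

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Confusion matrix and ROC_AUC on the test set.
y_prob = lr.predict_proba(X_test)[:, 1]      # positive-class probabilities
print(confusion_matrix(y_test, lr.predict(X_test)))
print('Test ROC_AUC:', roc_auc_score(y_test, y_prob))

# ROC curve with the diagonal chance line for reference.
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()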
Classification report – Train and Test:
Confusion Matrix:
Insights:
• The Logistic Regression model has achieved an accuracy of 0.84 on the training set and 0.82
on the test set, indicating that the model is performing well on both the training and test
data. The precision, recall and f1-score for both the classes are also high on both training
and test sets, suggesting that the model is good at classifying both the classes.
• The ROC curve of the Logistic Regression model also shows a good performance, with an
AUC of 0.89 on the training set and 0.87 on the test set. This indicates that the model is able
to differentiate between the two classes well and is able to classify the samples accurately.
• Overall, the Logistic Regression model appears to be a good model for the given
classification problem.
Insights:
• The Naive Bayes model has achieved a train accuracy of 0.8388 and a test accuracy of
0.8144. The train and test classification reports show that the model has performed well in
terms of precision, recall, and F1-score for both classes, with a weighted F1-score of 0.84 for
train and 0.81 for test.
• The ROC curve and AUC score show that the model is performing well in distinguishing
between the positive and negative classes, with a train AUC of 0.89 and a test AUC of 0.87.
• Overall, the Naive Bayes model has performed reasonably well on the given dataset, with
high accuracy, good precision, recall, and F1-scores, and a good ability to distinguish
between the positive and negative classes.
KNN Model:
Train Accuracy: 0.9990627928772259
Test Accuracy: 0.8056768558951966
Insights:
The model is overfitting on the training data. The training accuracy is almost perfect at 99.9%, but the test accuracy is relatively lower at 80.6%. This suggests that the model is not generalizing well to new data.
• Looking at the classification report, the precision, recall, and f1-score for class 0 in the test
set are relatively lower than those for class 1. This indicates that the model is better at
predicting class 1 than class 0.
• The ROC AUC score for the test set is 0.87, indicating that the model's ability to distinguish between the positive and negative classes is fairly good. However, the train ROC AUC score is almost perfect at 0.999, further indicating overfitting.
• Overall, the model's high accuracy on the training set but lower accuracy on the test set,
along with the differences in the classification report and ROC AUC scores, suggest that the
model is overfitting on the training data and is not able to generalize well to new data.
Conclusion:
• Based on the comparison, it appears that the logistic regression model is the best-performing model for the given problem.
• The logistic regression model has the highest accuracy on both the training and testing sets,
as well as the highest area under the ROC curve on both sets. Its precision and recall scores
are also higher than those of the naive Bayes and KNN models.
• Overall, the logistic regression model's performance is consistent across both the training
and testing sets, indicating that it is not overfitting the training data. Therefore, I would
recommend using the logistic regression model for this problem.
• Most people have given a score of 3 or 4 for the national economic condition, and the average score is 3.245221.
• Most people have given a score of 3 or 4 for the household economic condition, and the average score is 3.137772.
• Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.
• The average score for Blair is 3.335531 and the average score for Hague is 2.749506, so we can see that Blair has a better score.
• On a scale of 0 to 3, about 30% of the total population has zero knowledge about
politics/parties.
• People who gave a low score of 1 to a certain party still decided to vote for that same party instead of voting for the other party. This may be because of a lack of political knowledge among the people.
• People with higher Eurosceptic sentiment voted for the Conservative party; the lower the Eurosceptic sentiment, the higher the votes for the Labour party.
• Out of 454 people who gave a score of 0 for political knowledge, 360 people have voted
for the labour party and 94 people have voted for the conservative party.
• All models performed well on the training data set as well as the test data set. The tuned models have performed better than the regular models.
• There is no over-fitting in any model except Random Forest and Bagging regular models.
• The tuned Gradient Boosting model is the best/optimized model.
Problem 2:
In this particular project, we are going to work on the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
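A minimal sketch using the nltk inaugural corpus (the fileids follow nltk's 'YEAR-Name.txt' naming convention):

import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')
nltk.download('punkt')  # needed for sentence tokenization

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    text = inaugural.raw(fileid)
    print(fileid,
          'characters:', len(text),               # including spaces
          'words:', len(inaugural.words(fileid)),
          'sentences:', len(inaugural.sents(fileid)))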
Findings:
Number of Characters:
• President Franklin D. Roosevelt's speech has 7571 characters (including spaces).
• President John F. Kennedy's speech has 7618 characters (including spaces).
• President Richard Nixon's speech has 9991 characters (including spaces).
Number of Words:
Number of Sentences:
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
Franklin D. Roosevelt:
• Nation - 12
• Know - 10
• Spirit - 9
John F. Kennedy:
• Let - 16
• us - 12
• world - 8
Richard Nixon:
• America - 21
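A minimal sketch of the frequency count after stopword removal (lower-casing and keeping only alphabetic tokens are preprocessing assumptions; exact counts depend on such choices):

import nltk
from nltk.corpus import inaugural, stopwords
from nltk.probability import FreqDist
nltk.download('inaugural')
nltk.download('stopwords')

stop = set(stopwords.words('english'))
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    words = [w.lower() for w in inaugural.words(fileid)
             if w.isalpha() and w.lower() not in stop]
    print(fileid, FreqDist(words).most_common(3))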
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords)
Franklin D. Roosevelt:
John F. Kennedy:
Richard Nixon:
Figure 28: Richard Nixon word cloud
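A minimal sketch of generating the word clouds (assumes the third-party wordcloud package; its built-in STOPWORDS list is used here in place of nltk's):

import nltk
import matplotlib.pyplot as plt
from nltk.corpus import inaugural
from wordcloud import WordCloud, STOPWORDS
nltk.download('inaugural')

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    wc = WordCloud(stopwords=STOPWORDS, background_color='white',
                   width=800, height=400).generate(inaugural.raw(fileid))
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(fileid)
plt.show()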