Rahulsharma - 03 12 23
- RAHUL SHARMA
CONTENTS
A. CLUSTERING
1.1 Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
1.2 Treat missing values in CPC, CTR and CPM using the formula given.
B. PCA
2.1 Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.
2.3 We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
2.4 Scale the data using the z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
2.5 Perform all the required steps for PCA (use sklearn only): create the covariance matrix and get the eigenvalues and eigenvectors.
1.1 Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate
values, etc.
Answer:-
Top 5 rows:-
Last 5 rows:-
1.2 Treat missing values in CPC, CTR and CPM using the formula
given.
Answer:-
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (spend) / Number of Clicks
CTR = Total Measured Clicks / Total Measured Ad Impressions * 100
Excluding the NaN values, the distribution looks approximately normal for all three features.
Imputing the null values with the median would keep the distributions symmetric.
However, as the formulas for all three parameters are given, we use them
to fill the null values.
The remaining null values are caused by null values in the input parameters
(impressions, clicks and sales). We remove these rows from the
dataset before further analysis.
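The fill-by-formula step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names `Spend`, `Impressions`, `Clicks`, `CPM`, `CPC` and `CTR` and the tiny demo frame are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def fill_ad_metrics(df):
    """Fill missing CPM, CPC and CTR using the three formulas quoted above.

    Column names are hypothetical; adjust them to the real dataset.
    """
    out = df.copy()
    cpm = out["Spend"] / out["Impressions"] * 1000
    cpc = out["Spend"] / out["Clicks"]
    ctr = out["Clicks"] / out["Impressions"] * 100
    # Only the missing entries are replaced; existing values are kept.
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    out["CTR"] = out["CTR"].fillna(ctr)
    # A zero denominator yields inf; treat it as missing as well.
    out = out.replace([np.inf, -np.inf], np.nan)
    # Rows that cannot be computed (missing inputs) are dropped.
    return out.dropna(subset=["CPM", "CPC", "CTR"])

# Made-up demo data: the third row has no Spend, so CPM and CPC stay null.
demo = pd.DataFrame({
    "Spend": [10.0, 20.0, np.nan],
    "Impressions": [1000, 4000, 500],
    "Clicks": [5, 10, 2],
    "CPM": [np.nan, 5.0, np.nan],
    "CPC": [2.0, np.nan, np.nan],
    "CTR": [np.nan, np.nan, np.nan],
})
filled = fill_ad_metrics(demo)
print(filled[["CPM", "CPC", "CTR"]])
```

In this sketch a row is dropped only when a metric is still null after the formulas have been applied, which mirrors the removal of the remaining null rows described above.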
Answer:- Method 1-
Method 2-
OBS (outliers): From the above set of box plots, it is evident that outliers are present in all
numeric features except Ad-length and Ad-width.
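The box-plot observation can also be checked numerically. The sketch below counts, per numeric column, the points that fall outside the usual 1.5 × IQR whisker rule; the helper name `iqr_outlier_counts` and the demo values are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

def iqr_outlier_counts(df):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    mask = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
    return mask.sum()

# Made-up demo data: column "a" has one extreme value, column "b" has none.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [5, 5, 6, 6, 7]})
counts = iqr_outlier_counts(demo)
print(counts)
```

A column with a count of zero corresponds to a box plot with no points beyond the whiskers.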
The data does not display completely here; please refer to my Jupyter notebook file.
OUTLIER TREATMENT
Method 1-
Method 2-
1.4 Perform z-score scaling and discuss how it affects the speed of the
algorithm.
Answer:-
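A minimal sketch of z-score scaling, assuming scikit-learn's `StandardScaler`; the small array is made-up data standing in for the numeric ad features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

# z = (x - mean) / std, computed column by column.
scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

With every feature on a common scale, no single large-valued column dominates the Euclidean distances that k-means relies on, which in practice often helps the algorithm converge in fewer iterations.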
Answer:-
1.6 Make Elbow plot (up to n=10) and identify optimum number of
clusters for k-means algorithm.
When we move from k=1 to k=2, there is a significant drop in the plotted
value, and there are further significant drops from k=2 to k=3 and from
k=3 to k=4. From k=4 to k=5 and from k=5 to k=6, however, the drop
becomes much smaller.
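A sketch of how such an elbow curve can be computed, assuming the plotted value is the k-means within-cluster sum of squares (scikit-learn's `inertia_`). Synthetic `make_blobs` data stands in for the scaled ad dataset, which is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the ad dataset).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = list(range(1, 11))
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

# Drop in WSS between consecutive values of k; the elbow is where it flattens.
drops = -np.diff(wss)
for k, d in zip(ks[1:], drops):
    print(f"k={k - 1} -> k={k}: drop = {d:.1f}")
```

Plotting `ks` against `wss` (for example with matplotlib) gives the elbow plot itself; the printed drops make the flattening easier to read off.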
Answer:-
It is clear from the above plot that the maximum average silhouette
score is achieved for k = 8, which is therefore taken as the
optimum number of clusters for this data.
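The selection rule used here (pick the k with the highest average silhouette score) can be sketched as below, assuming scikit-learn's `silhouette_score`. The data is synthetic, so the k it selects says nothing about the k = 8 result reported for the ad data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the ad dataset).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

# The k with the largest average silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```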
Answer:-
Answer:-
2.1 Read the data and perform basic checks: shape, data types, and
statistical summary.
Answer:-
Data type:-
Statistical Summary:-
Answer:-
(ii) Which state has the highest & lowest gender ratio?
Answer:-
2.3 & 2.4:- We choose not to treat outliers for this case. Do you think
that treating outliers for this case is necessary? Scale the data using
the z-score method. Does scaling have any impact on outliers?
Compare boxplots before and after scaling and comment.
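One way to reason about the scaling question: z-score scaling is a linear transformation with a positive scale factor, so it shifts and rescales a feature without changing the order or relative spacing of its values. The sketch below, on made-up numbers, checks that the points flagged by the 1.5 × IQR box-plot rule are the same before and after scaling.

```python
import numpy as np

def iqr_flags(x):
    """Boolean mask of points outside the 1.5*IQR box-plot whiskers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Made-up feature with one extreme value.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# z-score scaling: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# The same points are flagged on both scales.
same = np.array_equal(iqr_flags(x), iqr_flags(z))
print(same, int(iqr_flags(x).sum()))
```

Under this rule the box plots before and after scaling would look the same apart from the axis units.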
2.5:- Perform all the required steps for PCA (use sklearn only): create
the covariance matrix and get the eigenvalues and eigenvectors.
Answer:-
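A hedged sketch of these steps with scikit-learn: after fitting `PCA` on z-scored data, `get_covariance()` returns the covariance matrix, `explained_variance_` holds its eigenvalues and the rows of `components_` are the corresponding eigenvectors. The random data below is an assumption standing in for the project dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data with some correlation between the first two columns.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 4))
raw[:, 1] += 0.8 * raw[:, 0]

X = StandardScaler().fit_transform(raw)  # z-score scaling first

pca = PCA().fit(X)                      # keep all components
cov_matrix = pca.get_covariance()       # covariance matrix of X
eigenvalues = pca.explained_variance_   # sorted from largest to smallest
eigenvectors = pca.components_          # one eigenvector per row

print(np.round(cov_matrix, 3))
print(np.round(eigenvalues, 3))
```

Each pair satisfies the eigenvalue equation: multiplying the covariance matrix by an eigenvector gives the same vector scaled by its eigenvalue.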
2.6:- Identify the optimum number of PCs (for this project, take at
least 90% explained variance). Show the scree plot.
Answer:-
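The 90% rule can be sketched as follows: take the cumulative sum of `explained_variance_ratio_` and pick the smallest number of components that reaches 0.90. The synthetic data (two latent factors plus noise) is an assumption for illustration, so the number it prints is not the project's result.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: 6 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
raw = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(300, 6))

X = StandardScaler().fit_transform(raw)
pca = PCA().fit(X)

# Cumulative share of variance explained by the first 1, 2, ... components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share is at least 90%.
n_pcs = int(np.searchsorted(cum_var, 0.90) + 1)
print(np.round(cum_var, 3), n_pcs)
```

Plotting `pca.explained_variance_` against the component number gives the scree plot; marking `n_pcs` on it shows where the 90% threshold is reached.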
Answer:-
Answer:-