Rahulsharma - 03 12 23



DATA MINING PROJECT ON CLUSTER & PCA

- RAHUL SHARMA

CONTENTS

A. CLUSTERING

1.1 Read the data and perform basic analysis such as printing a few rows
    (head and tail), info, data summary, null values, duplicate values, etc.
1.2 Treat missing values in CPC, CTR and CPM using the formula given.
1.3 Check and treat if there are any outliers.
1.4 Perform z-score scaling and discuss how it affects the speed of the
    algorithm.
1.5 Perform Hierarchical clustering by constructing a Dendrogram using
    WARD and Euclidean distance.
1.6 Make Elbow plot (up to n=10) and identify optimum number of clusters
    for k-means algorithm.
1.7 Print silhouette scores and identify optimum number of clusters.
1.8 Profile the ads based on optimum number of clusters using silhouette
    score and your domain understanding [Hint: Group the data by clusters
    and take sum or mean to identify trends in Clicks, spend, revenue,
    CPM, CTR, & CPC based on Device Type. Make bar plots]
1.9 Conclude the project by providing summary of your learning.

B. PCA

2.1 Read the data and perform basic checks like checking head, info,
    summary, nulls, and duplicates, etc.
2.2 Perform detailed Exploratory analysis by creating certain questions
    like (i) Which state has highest gender ratio and which has the
    lowest? (ii) Which district has the highest & lowest gender ratio?
    (Example Questions). Pick 5 variables out of the given 24 variables.
2.3 We choose not to treat outliers for this case. Do you think that
    treating outliers for this case is necessary?
2.4 Scale the Data using z-score method. Does scaling have any impact on
    outliers? Compare boxplots before and after scaling and comment.
2.5 Perform all the required steps for PCA (use sklearn only). Create the
    covariance Matrix. Get eigen values and eigen vectors.
2.6 Identify the optimum number of PCs (for this project, take at least
    90% explained variance). Show Scree plot.
2.7 Compare PCs with Actual Columns and identify which is explaining most
    variance. Write inferences about all the Principal components in
    terms of actual variables.
2.8 Write linear equation for first PC.

PART A:- CLUSTERING

Digital Ads Data:


ads24x7 is a Digital Marketing company which has now received seed
funding of $10 Million. They are expanding their wings in Marketing
Analytics. They collected data from their Marketing Intelligence team and
now want you (their newly appointed data analyst) to segment types of
ads based on the features provided. Use a Clustering procedure to
segment the ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000.
Note that the Total Campaign Spend refers to the 'Spend' Column in the
dataset and the Number of Impressions refers to the 'Impressions'
Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total
Cost (spend) refers to the 'Spend' Column in the dataset and the
Number of Clicks refers to the 'Clicks' Column in the dataset.

CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100.
Note that the Total Measured Clicks refers to the 'Clicks' Column in
the dataset and the Total Measured Ad Impressions refers to the
'Impressions' Column in the dataset.

1.1 Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate
values, etc.

Answer:-

Top 5 rows:-

Last 5 rows:-

Shape of the dataset:-

Info of the dataset:-



Duplicates of the dataset:- zero.

Changing Datatype of Timestamp from Object to datetime64:-
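
As an illustration of the checks listed in 1.1, here is a minimal sketch using pandas on a tiny made-up sample. The column names, the timestamp format and the values are assumptions for illustration only, not the project's actual file.

```python
import io
import pandas as pd

# Tiny stand-in for the ads data (hypothetical values and columns).
csv_text = """Timestamp,Device Type,Impressions,Clicks,Spend,CTR
2020-09-02 17:00:00,Mobile,1806,1,0.00,
2020-09-02 10:00:00,Desktop,1780,2,0.35,0.11
2020-09-03 09:00:00,Mobile,2727,3,0.50,0.11
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # top rows
print(df.tail())      # last rows
print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts
print(df.describe())  # statistical summary of numeric columns

null_counts = df.isnull().sum()            # nulls per column
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows

# Convert Timestamp from object to datetime64.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
```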

1.2 Treat missing values in CPC, CTR and CPM using the formula
given.

Answer:-
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (spend) / Number of Clicks
CTR = Total Measured Clicks / Total Measured Ad Impressions * 100

Excluding the NaN values, the distribution looks approximately normal for
all 3 features. Imputing the null values with the median would keep the
data symmetric. However, as the computation method for all 3 parameters
is given, we use those formulas to fill the null values.

After imputation the missing values are reduced to: CTR 0.8% (219 NaN),
CPM 0.8% (219 NaN) and CPC 10% (2586 NaN).

The remaining null values are due to null values in the underlying
parameters (impressions, clicks and sales). We remove these rows from the
dataset for further analysis.
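
The formula-based imputation described above can be sketched as follows. This is a minimal example on made-up numbers. The helper name `fill_from_formulas` is hypothetical and is not necessarily the user-defined function in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample: some CPM/CPC/CTR values are missing, and the last row
# also has a missing input (Impressions).
df = pd.DataFrame({
    "Impressions": [1000, 2000, 500, np.nan],
    "Clicks": [10, 40, 5, 3],
    "Spend": [2.0, 8.0, 1.0, 0.6],
    "CPM": [np.nan, 4.0, np.nan, np.nan],
    "CPC": [0.2, np.nan, 0.2, np.nan],
    "CTR": [1.0, np.nan, 1.0, np.nan],
})

def fill_from_formulas(data):
    """Fill missing CPM, CPC and CTR from Spend, Impressions and Clicks."""
    out = data.copy()
    out["CPM"] = out["CPM"].fillna(out["Spend"] / out["Impressions"] * 1000)
    out["CPC"] = out["CPC"].fillna(out["Spend"] / out["Clicks"])
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / out["Impressions"] * 100)
    return out

filled = fill_from_formulas(df)

# Values that are still missing because an input was missing.
remaining = filled[["CPM", "CPC", "CTR"]].isnull().sum()

# Drop the rows that could not be filled.
cleaned = filled.dropna(subset=["CPM", "CPC", "CTR"])
```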

1.3 Check and treat if there are any outliers.

Answer:- Method1-

Method 2-

OBS (outliers): From the above set of box plots, it is evident that outliers are present in all
numeric features except Ad-length and Ad-width.

The data does not display completely here; please go through my Jupyter notebook file.

OUTLIER TREATMENT

Method 1-

Method 2-
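
The outputs of the two treatment methods are not shown here. One common approach, assumed purely for illustration, is to flag values beyond 1.5 times the IQR from the quartiles and cap them at those limits. The helper names and sample values below are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the lower and upper whisker limits used by a box plot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(series, k=1.5):
    """Clip values lying outside the IQR limits to those limits."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)

# Made-up Spend values with one extreme observation.
spend = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0], name="Spend")

lower, upper = iqr_bounds(spend)
n_outliers = int(((spend < lower) | (spend > upper)).sum())
capped = cap_outliers(spend)
```

Capping keeps every row in the dataset, which matters for clustering because each row is an ad that still needs a cluster label.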

1.4 Perform z-score scaling and discuss how it affects the speed of the
algorithm.

Answer:-
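
Z-score scaling puts every feature on a comparable scale (mean 0, standard deviation 1). Distance-based algorithms such as k-means are then not dominated by large-valued columns and typically converge in fewer iterations. A minimal sketch with sklearn's `StandardScaler`, on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sample: two columns on very different scales
# (for example, Impressions and Clicks).
X = np.array([
    [1000.0, 10.0],
    [2000.0, 40.0],
    [500.0, 5.0],
    [1500.0, 25.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z = (x - mean) / std, per column

col_means = X_scaled.mean(axis=0)   # approximately 0 for each column
col_stds = X_scaled.std(axis=0)     # approximately 1 for each column
```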

1.5 Perform Hierarchical clustering by constructing a Dendrogram using
WARD and Euclidean distance.

Answer:-

DENDROGRAM USING EUCLIDEAN DISTANCES
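
A minimal sketch of Ward linkage with Euclidean distance using scipy, run on synthetic data that stands in for the scaled ads features. The data and the choice of 2 clusters are assumptions for illustration. `no_plot=True` only computes the dendrogram layout; leaving it out draws the plot with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled data: two well-separated groups.
rng = np.random.default_rng(0)
X_scaled = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(10, 2)),
])

# Ward linkage on Euclidean distances; each row of Z records one merge.
Z = linkage(X_scaled, method="ward", metric="euclidean")

# Dendrogram layout truncated to the last 10 merged clusters.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=10)

# Cut the tree into a chosen number of clusters (2 here, for the toy data).
labels = fcluster(Z, t=2, criterion="maxclust")
```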

1.6 Make Elbow plot (up to n=10) and identify optimum number of
clusters for k-means algorithm.

Answer:- k-means inertia = 63944.29253879197

When we move from k=1 to k=2, we see that there is a significant drop in
the value. There is also a significant drop when we move from k=2 to k=3
and from k=3 to k=4. But from k=4 to k=5 and from k=5 to k=6, the drop
in values reduces significantly.
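
The inertia values behind an elbow plot can be computed as sketched below. The data is synthetic (three made-up groups), so the numbers differ from the project's. Plotting `range(1, 11)` against `inertias` gives the elbow curve.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three groups along a diagonal.
rng = np.random.default_rng(1)
X_scaled = np.vstack(
    [rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (0.0, 4.0, 8.0)]
)

# Within-cluster sum of squares (inertia) for k = 1 to 10.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

# Drop in inertia between consecutive k; the elbow is where this flattens.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
```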

1.7 Print silhouette scores and identify optimum number of clusters.

Answer:-

It is clear from above plot that the maximum value of average silhouette
score is achieved for k = 8, which, therefore, is considered to be the
optimum number of clusters for this data.
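
A minimal sketch of how average silhouette scores can be compared across k, again on synthetic data with three made-up groups rather than the ads data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three well-separated groups.
rng = np.random.default_rng(2)
X_scaled = np.vstack(
    [rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (0.0, 4.0, 8.0)]
)

# The silhouette score needs at least 2 clusters, so k starts at 2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# k with the highest average silhouette score.
best_k = max(scores, key=scores.get)
```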

1.8 Profile the ads based on optimum number of clusters using
silhouette score and your domain understanding [Hint: Group the data
by clusters and take sum or mean to identify trends in Clicks, spend,
revenue, CPM, CTR, & CPC based on Device Type. Make bar plots]

Answer:-
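
Following the hint in the question, the profiling step can be sketched as a group-by on the cluster label and Device Type. The column names (`cluster`, `Revenue`) and all values below are assumptions for illustration.

```python
import pandas as pd

# Made-up sample with a cluster label already assigned to each ad.
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop", "Mobile", "Desktop"],
    "Clicks": [10, 20, 30, 40, 50, 60],
    "Spend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Revenue": [2.0, 3.0, 5.0, 6.0, 8.0, 9.0],
    "cluster": [0, 0, 1, 1, 1, 0],
})

# Mean of each metric per cluster and device type.
profile = (
    df.groupby(["cluster", "Device Type"])[["Clicks", "Spend", "Revenue"]]
    .mean()
    .reset_index()
)

# Number of ads in each cluster.
cluster_sizes = df["cluster"].value_counts().sort_index()
```

A bar plot per metric can then be drawn from a pivot of `profile`, for example `profile.pivot(index="cluster", columns="Device Type", values="Clicks").plot(kind="bar")`.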

1.9 Conclude the project by providing summary of your learning

Answer:-

• The dataset has 25857 rows and 19 columns.
• The missing values in CPC, CTR and CPM are treated by writing a
user-defined function based on the given formulae and calling it.
• We check for outliers and find that there are outliers in the variables.
• The dendrogram is the visualization; linkage computes the distances
and merges the clusters from n down to 1.
• The output of linkage is visualized by the dendrogram.
• We create the linkage using Ward's method, running the linkage function
on the usable columns of the data.
• The linkage stores the distances at which the n clusters are
sequentially merged into a single cluster.
• Using the fit_transform function and viewing the output, the dataframe
is now stored in an array.
• Using this array we can now perform k-means.
• The one requirement before we run the k-means algorithm is to know how
many clusters we require as output.
• From the plot we have the following observations:
• When we move from k=1 to k=2, we see that there is a significant drop
in the value. There is also a significant drop from k=2 to k=3 and from
k=3 to k=4.
• But from k=4 to k=5 and from k=5 to k=6, the drop in values reduces
significantly.
• So 4 is the optimal number of clusters.

PART B:- PCA


PCA FH (FT): Primary census abstract for female headed households
excluding institutional households (India & States/UTs - District Level),
Scheduled tribes - 2011 PCA for Female Headed Household Excluding
Institutional Household. The Indian Census has the reputation of being one of
the best in the world. The first Census in India was conducted in the year
1872. This was conducted at different points of time in different parts of the
country. In 1881 a Census was taken for the entire country simultaneously.
Since then, the Census has been conducted every ten years, without a break.
Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third
millennium and twenty-first century. The census has continued uninterrupted
despite several adversities like wars, epidemics, natural calamities,
political unrest, etc. The Census of India is conducted under the
provisions of the Census Act 1948 and the Census Rules, 1990. The Primary
Census Abstract, which is an important publication of the 2011 Census, gives
basic information on Area, Total Number of Households, Total Population,
Scheduled Castes, Scheduled Tribes Population, Population in the age group
0-6, Literates, Main Workers and Marginal Workers classified by the four
broad industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers,
(iii) Household Industry Workers, and (iv) Other Workers and also Non-Workers.
The characteristics of the Total Population include Scheduled
Castes, Scheduled Tribes, Institutional and House-less Population and are
presented by sex and rural-urban residence. Census 2011 covered 35
States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and
6,40,867 Villages.

2.1 Read the data and perform basic checks shape, data types,
statistical summary.

Answer:-

Shape of the dataset:- (5*61)

Data type:-

Statistical Summary:-

The information shown here is incomplete; please go through my ipynb file.

2.2 Perform detailed Exploratory analysis by creating certain
questions like (i) Which state has highest gender ratio and which
has the lowest? (ii) Which district has the highest & lowest gender
ratio? (Example Questions). Pick 5 variables out of the given 24
variables

(i) Which state has the highest & lowest population?

Answer:-

Maharashtra has the highest population and Daman & Diu has the lowest.



(ii) Which state has the highest & lowest gender ratio?

Answer:-
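
A minimal sketch of how such a state-level question can be answered with a group-by. The `State` column name, the sample values and the definition of gender ratio as females per male are assumptions for illustration; `TOT_M` and `TOT_F` are the male and female population columns.

```python
import pandas as pd

# Made-up district-level rows for three hypothetical states.
df = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C"],
    "TOT_M": [100, 300, 200, 200, 50],
    "TOT_F": [150, 250, 100, 140, 100],
})

# Sum districts up to state level, then compute females per male.
by_state = df.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["gender_ratio"] = by_state["TOT_F"] / by_state["TOT_M"]

highest = by_state["gender_ratio"].idxmax()
lowest = by_state["gender_ratio"].idxmin()
```

Summing before dividing weights each district by its population. Averaging district-level ratios instead would give small districts the same weight as large ones.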

For EDA - Variables considered:

No_HH: No of Household
TOT_M: Total population Male
TOT_F: Total population Female
TOT_WORK_M: Total Worker Population Male
TOT_WORK_F: Total Worker Population Female

Univariate Analysis: Plotting histograms and boxplots for the above variables:-

Bivariate Analysis:-



2.3 & 2.4:- We choose not to treat outliers for this case. Do you think
that treating outliers for this case is necessary? Scale the Data using
z-score method. Does scaling have any impact on outliers?
Compare boxplots before and after scaling and comment.
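
Z-score scaling is a linear transformation (subtract the mean, divide by the standard deviation), so it changes the axis of a boxplot but not which points fall outside the whiskers. A minimal sketch on made-up values; the helper name `count_iqr_outliers` is hypothetical:

```python
import numpy as np
from scipy.stats import zscore

def count_iqr_outliers(values):
    """Count points outside the 1.5 * IQR whiskers of a boxplot."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return int(np.sum((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)))

# Made-up household counts with one extreme district.
no_hh = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

scaled = zscore(no_hh)               # (x - mean) / std
before = count_iqr_outliers(no_hh)   # outliers on the original scale
after = count_iqr_outliers(scaled)   # outliers after scaling
```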

2.5:- Perform all the required steps for PCA (use sklearn only). Create
the covariance Matrix. Get eigen values and eigen vectors.

Answer:-
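
A minimal sketch of these steps with sklearn, on synthetic data standing in for the scaled census columns. With all components retained, `get_covariance()` returns the covariance matrix of the scaled data, `explained_variance_` holds its eigenvalues, and the rows of `components_` are the eigenvectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four columns, two of them strongly correlated.
rng = np.random.default_rng(3)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * base[:, 1],
    base[:, 1],
    rng.normal(size=200),
])

X_scaled = StandardScaler().fit_transform(X)

pca = PCA()  # keep all components
pca.fit(X_scaled)

cov_matrix = pca.get_covariance()      # covariance matrix of X_scaled
eigenvalues = pca.explained_variance_  # sorted from largest to smallest
eigenvectors = pca.components_         # one eigenvector per row
```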

2.6:- Identify the optimum number of PCs (for this project, take at
least 90% explained variance). Show Scree plot.

Answer:-
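
The 90% rule can be applied to the cumulative explained variance ratio, as sketched below on synthetic data (six made-up columns driven by two latent factors). Plotting `pca.explained_variance_ratio_` against the component number gives the scree plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six columns built from two latent factors plus noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_components_90 = int(np.argmax(cumulative >= 0.90) + 1)
```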

2.7:- Compare PCs with Actual Columns and identify which is
explaining most variance. Write inferences about all the Principal
components in terms of actual variables.

Answer:-

2.8:- Write linear equation for first PC.

Answer:-

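PC1 = a1x1 + a2x2 + a3x3 + a4x4 + ... + a57x57

where x1, ..., x57 are the scaled variables and a1, ..., a57 are the
corresponding loadings of the first principal component.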
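
The coefficients in this equation are the entries of the first row of `components_` in sklearn. A minimal sketch that builds the equation string on a synthetic three-column example; the values are made up, and the real data has 57 numeric columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three of the census column names.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["No_HH", "TOT_M", "TOT_F"])
df["TOT_F"] = df["TOT_M"] + 0.2 * df["TOT_F"]  # make two columns correlated

X_scaled = StandardScaler().fit_transform(df)
pca = PCA().fit(X_scaled)

# Loadings of the first principal component, one per original column.
loadings = pd.Series(pca.components_[0], index=df.columns)

# Text form of the linear equation for PC1.
equation = "PC1 = " + " + ".join(
    f"({a:.3f} * {name})" for name, a in loadings.items()
)
print(equation)

# The equation reproduces sklearn's first component score for each row.
pc1_manual = X_scaled @ loadings.to_numpy()
pc1_sklearn = pca.transform(X_scaled)[:, 0]
```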
