Data Analyzing by Using Z-Score Method and PCA: W.M.Safras Sc/2018/10464


Outline
• Z-score
• PCA
Z-score

• A Z-score is a standardized value that specifies the exact location of an X value within a distribution by describing its distance from the mean in standard deviation units.

• Z-scores describe the exact location of a score within a distribution:
  • Sign: whether the score is above (+) or below (-) the mean
  • Number: the distance between the score and the mean, in standard deviation units
Mean and Standard Deviation
• Mean
The mean is the sum of all the data values under consideration, divided by the number of values.
There are two types of arithmetic mean:
I. Population mean
The population mean is the mean of all the values in the population.

II. Sample mean
The sample mean is the mean of the sample values collected. If the sample is random and the sample size is large, the sample mean is a good estimate of the population mean.

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $\mu$ = the population mean, $x_i$ = each value from the population, $N$ = size of the population

• Standard Deviation
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wide range.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where $\sigma$ = population standard deviation, $\mu$ = population mean, $N$ = size of the population
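As a rough illustration of the mean and standard deviation formulas, the following Python sketch computes both with the standard library. The values are an assumption chosen for illustration (they match the Mathematics marks in the PCA example later in these slides).

```python
from statistics import fmean, pstdev

# Illustrative values (assumed; same as the Mathematics marks in the later example)
values = [90, 90, 60, 60, 30]

mu = fmean(values)      # population mean: sum of the values divided by N
sigma = pstdev(values)  # population standard deviation: divides by N, not N - 1

print(f"mean = {mu:.2f}, standard deviation = {sigma:.2f}")
# prints: mean = 66.00, standard deviation = 22.45
```

Note that `pstdev` is the population version; `statistics.stdev` would divide by N - 1 instead and give a slightly larger value.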
Explanation through an example
• Suppose we are given a data set as follows:
$x_{ij}$ = the $i$-th student's marks in the $j$-th subject; $i = 1, 2, 3, \dots, N$; $j = 1, 2, 3$
We need to find the Z-scores in each subject.

First, we calculate the mean and standard deviation of each subject:
1. $\mu_j = \frac{\sum_{i=1}^{N} x_{ij}}{N}$ gives the population mean of each subject, so we have three mean values $\mu_1$, $\mu_2$ and $\mu_3$.

2. $\sigma_j = \sqrt{\frac{\sum_{i=1}^{N} (x_{ij} - \mu_j)^2}{N}}$ gives the three standard deviations $\sigma_1$, $\sigma_2$ and $\sigma_3$.

Now we have three mean values and three standard deviations.


The next step is to calculate the Z-score of each student in each subject.
• Let $z_{ij}$ = the $i$-th student's Z-score in the $j$-th subject.

• $z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$ gives each student's Z-score in each subject.
• Since there are three subjects ($j = 1, 2, 3$), each student has three Z-score values.

• Finally, we need the average Z-score of each student, which we can use to rank the students:
$$Z_i = \frac{\sum_{j=1}^{3} z_{ij}}{3}$$
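The ranking procedure can be sketched in a few lines of Python. This is only a minimal sketch: the marks are borrowed from the PCA example later in these slides, and the variable names (`marks`, `avg_z`, `ranking`) are illustrative assumptions.

```python
from statistics import fmean, pstdev

# Rows are students 1..5; columns are Mathematics, Chemistry, Physics
# (marks assumed from the table in the PCA example)
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
]

# Population mean and standard deviation of each subject (column)
columns = list(zip(*marks))
mu = [fmean(col) for col in columns]
sigma = [pstdev(col) for col in columns]

# z[i][j] = (x_ij - mu_j) / sigma_j
z = [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in marks]

# Average Z-score of each student, then rank from highest to lowest
avg_z = [fmean(row) for row in z]
ranking = sorted(range(1, len(marks) + 1), key=lambda i: avg_z[i - 1], reverse=True)

for student, score in enumerate(avg_z, start=1):
    print(f"student {student}: average Z = {score:.3f}")
print("ranking:", ranking)
```

Because every subject is put on the same scale before averaging, a subject with widely spread marks does not dominate the ranking.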

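Derive the expected value and variance of $z_{i1}$.

Let $X$ be the random variable for the marks $x_{i1}$ in subject 1, so that $E(X) = \mu_1$ and $\mathrm{var}(X) = \sigma_1^2$.

$$E(z_{i1}) = E\left(\frac{X - \mu_1}{\sigma_1}\right) = \frac{E(X) - \mu_1}{\sigma_1} = \frac{\mu_1 - \mu_1}{\sigma_1} = 0$$

$$\mathrm{var}(z_{i1}) = \mathrm{var}\left(\frac{X - \mu_1}{\sigma_1}\right) = \frac{1}{\sigma_1^2}\,\mathrm{var}(X - \mu_1) = \frac{1}{\sigma_1^2}\,\mathrm{var}(X) = 1, \quad \text{since } \mathrm{var}(X) = \sigma_1^2$$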
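The result of this derivation (mean 0, variance 1) can be checked numerically. The sketch below assumes one subject's marks as example data; any non-constant list of numbers should behave the same way.

```python
from statistics import fmean, pvariance

x = [90, 90, 60, 60, 30]     # assumed example marks for one subject
mu = fmean(x)
sigma = pvariance(x) ** 0.5  # population standard deviation

# Standardize every value
z = [(xi - mu) / sigma for xi in x]

mean_z = fmean(z)            # expected to be 0 up to floating-point rounding
var_z = pvariance(z)         # expected to be 1 up to floating-point rounding
print(mean_z, var_z)
```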
Principal Component Analysis (PCA)

• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets.
Step by step explanation of PCA:
1. Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each one of them
contributes equally to the analysis.
This can be done by subtracting the mean and dividing by the standard deviation for each value of each
variable.

2. Covariance Matrix computation


The aim of this step is to understand how the variables of the input data set are varying from the mean with
respect to each other.
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.
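The three steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `pca` and the sample data are hypothetical, and the population (divide by N) convention is assumed throughout.

```python
import numpy as np

def pca(data):
    """Eigenvalues and eigenvectors of the covariance matrix of standardized data."""
    data = np.asarray(data, dtype=float)
    # 1. Standardization: subtract the mean and divide by the standard deviation per variable
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    # 2. Covariance matrix of the standardized variables (bias=True divides by N)
    cov = np.cov(z, rowvar=False, bias=True)
    # 3. Eigen-decomposition; eigh is intended for symmetric matrices
    #    and returns the eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    return eigenvalues[order], eigenvectors[:, order]

# Hypothetical data: 5 observations (rows) of 3 variables (columns)
data = [[90, 60, 90], [90, 90, 30], [60, 60, 60], [60, 60, 90], [30, 30, 30]]
eigenvalues, eigenvectors = pca(data)
print(eigenvalues)         # variance captured by each principal component
print(eigenvectors[:, 0])  # direction of the first principal component
```

With standardized input every variable has variance 1, so the eigenvalues add up to the number of variables.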
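Covariance Matrix

• What is the covariance matrix?

Covariance is a measure of how much two random variables change together.

A covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. The main diagonal contains the variances, i.e. the covariance of each element with itself.
• Formula
$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

For a random vector $(X_1, \dots, X_n)$ the covariance matrix is
$$C = \begin{pmatrix} \mathrm{cov}(X_1, X_1) & \cdots & \mathrm{cov}(X_1, X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(X_n, X_1) & \cdots & \mathrm{cov}(X_n, X_n) \end{pmatrix}$$

The entries of the covariance matrix are symmetric with respect to the main diagonal.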
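To make the covariance formula concrete, here is a small pure-Python sketch. The function names and the two sample variables are hypothetical and chosen only for illustration.

```python
from statistics import fmean

def cov(x, y):
    """Population covariance: average product of the deviations from the means."""
    mu_x, mu_y = fmean(x), fmean(y)
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / len(x)

def covariance_matrix(variables):
    """Square matrix whose (a, b) entry is cov(variables[a], variables[b])."""
    return [[cov(a, b) for b in variables] for a in variables]

# Hypothetical observations of two variables
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]

C = covariance_matrix([x, y])
print(C)  # [[1.25, 2.875], [2.875, 6.6875]]
```

The diagonal entries are the variances of `x` and `y`, and the two off-diagonal entries are equal, which is the symmetry mentioned on this slide.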
Explanation about PCA through an example

Suppose we are given data as follows:

Student no   Mathematics   Chemistry   Physics
1            90            60          90
2            90            90          30
3            60            60          60
4            60            60          90
5            30            30          30

We can standardize the above data as follows.

First we calculate the mean, standard deviation and Z-scores:
$\mu_1 = 66$, $\mu_2 = 60$, $\mu_3 = 60$
$\sigma_1 = \sqrt{504} \approx 22.45$, $\sigma_2 = \sqrt{360} \approx 18.97$, $\sigma_3 = \sqrt{720} \approx 26.83$

Let $\bar{A} = (66, 60, 60)$ be the vector of mean values of the subjects.


Our next step is to find the covariance matrix.
• Equation of the covariance: $\mathrm{cov}(x, y) = \frac{1}{N}\sum (x_i - \mu_x)(y_i - \mu_y)$

$$C = \begin{pmatrix} 504 & 360 & 180 \\ 360 & 360 & 0 \\ 180 & 0 & 720 \end{pmatrix}$$

where $C$ is the covariance matrix.
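The matrix C can be reproduced from the marks table with NumPy. In this sketch the array name `marks` is an assumption; `bias=True` selects the divide-by-N form of the covariance used in these slides.

```python
import numpy as np

# Rows are students; columns are Mathematics, Chemistry, Physics
marks = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
])

# rowvar=False: each column is a variable; bias=True: normalize by N
C = np.cov(marks, rowvar=False, bias=True)
print(np.round(C, 6))
```

Without `bias=True`, `np.cov` divides by N - 1 and every entry would be larger by a factor of 5/4.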


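• Then we have to calculate the eigenvalues of the matrix $C$:
$\det(C - \lambda I) = 0$, where $I$ is the identity matrix,
$$\det \begin{pmatrix} 504 - \lambda & 360 & 180 \\ 360 & 360 - \lambda & 0 \\ 180 & 0 & 720 - \lambda \end{pmatrix} = 0$$

$\lambda_1 = 44.81966\ldots$, $\lambda_2 = 629.11039\ldots$, $\lambda_3 = 910.06995\ldots$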
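Solving the cubic by hand is tedious, so the eigenvalues can be checked numerically. A minimal sketch with NumPy, assuming the covariance matrix C from this example:

```python
import numpy as np

C = np.array([[504, 360, 180],
              [360, 360,   0],
              [180,   0, 720]], dtype=float)

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues = np.linalg.eigvalsh(C)
print(eigenvalues)

# Sanity check: the eigenvalues of a matrix add up to its trace (504 + 360 + 720)
print(eigenvalues.sum(), np.trace(C))
```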


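• Then calculate the eigenvectors. For example, for $\lambda_1 \approx 44.8$:
$$\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 504 - 44.8 & 360 & 180 \\ 360 & 360 - 44.8 & 0 \\ 180 & 0 & 720 - 44.8 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}$$
After calculating the eigenvectors of the corresponding eigenvalues we have (one eigenvector per column, in the order $\lambda_1$, $\lambda_2$, $\lambda_3$):

$$\begin{pmatrix} -3.75 & -0.50 & 1.05 \\ 4.28 & -0.67 & 0.69 \\ 1 & 1 & 1 \end{pmatrix}$$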
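The eigenvectors can be checked the same way. NumPy returns unit-length eigenvectors, so this sketch rescales each one so that its last component is 1, which is the scaling assumed in the matrix above.

```python
import numpy as np

C = np.array([[504, 360, 180],
              [360, 360,   0],
              [180,   0, 720]], dtype=float)

# Columns of `eigenvectors` are unit-length eigenvectors, ordered by ascending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Divide every column by its last entry so that the third component becomes 1
scaled = eigenvectors / eigenvectors[2, :]
print(np.round(scaled, 2))

# Each column v should satisfy C v = lambda v
for lam, v in zip(eigenvalues, scaled.T):
    assert np.allclose(C @ v, lam * v)
```

The printed values should agree with the matrix above to about two decimals; the slides appear to truncate rather than round the last digit.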

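• Now we select the eigenvector corresponding to the highest eigenvalue.
• This is the first principal component.
• With it we can find our new Z-scores:
$$Z = x_1 \frac{s_{1i} - \mu_1}{\sigma_1} + x_2 \frac{s_{2i} - \mu_2}{\sigma_2} + x_3 \frac{s_{3i} - \mu_3}{\sigma_3}$$
where $x_1, x_2, x_3$ are the components of the selected eigenvector and $s_{ji}$ is the $i$-th student's mark in subject $j$.
From this equation we can calculate the Z-scores of the students. In terms of the per-subject Z-scores $z_1, z_2, z_3$, the equation becomes
$$Z = x_1 z_1 + x_2 z_2 + x_3 z_3$$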
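As a worked illustration of this last formula, the sketch below computes Z for the five students in the example. It assumes the eigenvector (1.05, 0.69, 1) quoted in the slides for the largest eigenvalue; the names `weights`, `Z` and `ranking` are hypothetical.

```python
from statistics import fmean, pstdev

# Rows are students 1..5; columns are Mathematics, Chemistry, Physics
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
]

# Eigenvector of the largest eigenvalue, as quoted (to two decimals) in the slides
weights = [1.05, 0.69, 1.0]

columns = list(zip(*marks))
mu = [fmean(col) for col in columns]
sigma = [pstdev(col) for col in columns]

# Z = x1*z1 + x2*z2 + x3*z3 for each student
Z = [
    sum(w * (s - m) / sd for w, s, m, sd in zip(weights, row, mu, sigma))
    for row in marks
]
ranking = sorted(range(1, len(Z) + 1), key=lambda i: Z[i - 1], reverse=True)

for student, score in enumerate(Z, start=1):
    print(f"student {student}: Z = {score:.2f}")
print("ranking:", ranking)
```

Unlike the plain average of the three Z-scores, this score weights each subject by its component in the first principal component.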
