Notes
Inferential Statistics
It is the method of drawing conclusions about population parameters from a sample. It involves point
estimation, interval estimation and hypothesis testing.
Sampling
Sampling is a key task. It involves understanding the characteristics of the population and then choosing
sample observations that represent the population as effectively as possible.
Cluster Sampling
Margin of Error
Interval Estimation – Technique to find, from the sample set, the interval in which most of the population lies.
Point Estimation – Finding a single value from the sample distribution that closely represents the
population parameter.
Hypothesis Testing – Take an assumption (hypothesis), then try to prove or disprove it using sample data.
Qualitative – Describes the qualitative characteristics of the set, like gender or preference; typically textual data.
Quantitative – Describes a definite quantity of the set, in numbers; it can be discrete or continuous.
a. Discrete – Takes fixed, countable (integer-like) values, e.g. family size, number of hotel rooms,
shoe size, results of dice rolling. Best represented in a graph or table.
b. Continuous – Can take any value in a range and varies from sample to sample, e.g. waiting time,
temperature in a room, weight, height. It can be in decimals. It is best represented
as a graph.
Measurement Scales

Data Type    | Scale    | Arithmetic Operations     | Examples                             | Remarks
Qualitative  | Nominal  | = ≠                       | Gender, Eye Color                    |
Qualitative  | Ordinal  | = ≠ > < >= <=             | Olympic Medal, Rating                |
Quantitative | Interval | = ≠ > < >= <= + -         | IQ Score, Temperature, Calendar Date | Looking for / interested in the gap between the data
Quantitative | Ratio    | = ≠ > < >= <= + - * /     | Cost of an item                      | Best to have, as it provides an exact count
Nominal – Everything is equal; no category is better or worse than another.
Ordinal – Has an order to it, where one value is better than another.
Interval – Data measured on a scale without a true zero point; differences are meaningful, but taking ratios
must be avoided on interval data.
Ratio – Has an absolute (true) zero. Money and length are examples of this.
Product – Nominal
Age – Ratio
Gender – Nominal
Education – Ratio
Marital Status – Nominal
Usage – Ratio
Fitness – Ordinal
Income – Ratio
Miles - Ratio
Descriptive Stats
1. Central Tendency –
a. Arithmetic mean (average) is the most commonly used measure of central tendency. The mean is
affected very strongly by extreme values (outliers).
b. Median is effective for handling outliers in the data set.
A wide gap between mean and median indicates that there are many outliers in the data set.
c. Mode – the most frequent (most often occurring) value in the distribution.
Best suited for qualitative data, like brand.
Central tendency measures are used to understand the shape of the data set.
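The three measures above can be computed with Python's standard library; a minimal sketch on made-up data with one outlier:

```python
# Central tendency with the standard library (hypothetical sample).
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 40]  # 40 is an outlier

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # robust to the outlier
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 8.375 4.5 5
```

Note the wide gap between mean (8.375) and median (4.5), which signals the outlier.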
2. Dispersion(Variation in data) –
a. Range
Finding variation in the distribution: less variation means more consistent, and more variation means
less consistent.
Q1 = (n+1)/4 th observation
Q2 = (n+1)/2 th observation
Q3 = 3(n+1)/4 th observation
Q4 = (n+1) th observation
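These (n+1)-based positions are exactly what Python's statistics.quantiles uses with its default "exclusive" method; a sketch on made-up data:

```python
# Quartiles via statistics.quantiles (default method="exclusive"
# matches the (n+1)/4, (n+1)/2, 3(n+1)/4 positional formulas).
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # n = 7, so Q1 is the (7+1)/4 = 2nd observation

q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)  # 2.0 4.0 6.0
```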
Normal Distribution
Relates to the Six Sigma (within specification) quality measure in the manufacturing domain.
The first step while analyzing any data set is to find the mean, median and shape of the data to understand
the distribution of the dataset.
Aim to achieve a normal distribution. This also signifies that the process is in control and outliers have been
treated.
Outliers
Natural – The value actually occurred; treatment can be done using transformation, to reduce the variation.
Artificial – A wrong data value; treatment can be done by replacing it with the mean, median, etc.
If the mean is lower than the median, the data is left (negatively) skewed, and if the mean is higher than the
median, the data is right (positively) skewed.
Quartiles
Distplot is used for plotting the distribution of the data. It is part of the seaborn package and shows the
density curve on the histogram (in recent seaborn versions it is replaced by histplot/displot).
Countplot is used to represent nominal and ordinal data. It is part of the seaborn package.
Box plot is used to represent a dataset via its IQR. It is part of the seaborn package.
Standard deviation is, roughly, the average distance of all the data points from the mean.
The coefficient of variation (CV = SD / mean) is used to normalize comparisons between data sets with different
units. The lower the CV, the more consistent the data.
The 1.5 * IQR rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers.
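The 1.5 * IQR fence can be applied directly in Python; a sketch on made-up data:

```python
# Flagging outliers with the 1.5 * IQR rule (hypothetical data).
import statistics

data = [2, 3, 4, 5, 6, 7, 50]  # 50 is the suspect value

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [50]
```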
Correlation – Done to predict the occurrence of one variable on the basis of another; it identifies dependent
variables. It is only useful for linear relationships, and it can only be computed on numerical data,
not text data.
If the correlation between two variables is a large positive number, the two variables are
moving in the same direction.
Coefficient of correlation -> Measures the relative strength of the linear relationship between two numerical
variables. It brings the value between -1 and +1, putting the two variables on the same
evaluable scale so the strength of their linear relationship can be tested.
r = COV(X, Y) / (Sx * Sy)
A value close to +1 or -1 means a strong linear relationship, while a value close to 0 means a scattered (weak) relationship.
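The coefficient can be computed directly from the formula r = COV(X, Y) / (Sx * Sy); a sketch on made-up, perfectly linear data (statistics.correlation does the same in Python 3.10+):

```python
# Pearson's correlation coefficient from first principles.
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x, a perfect positive linear relationship

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)  # sample covariance
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))         # sample SD of x
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))         # sample SD of y

r = cov / (sx * sy)
print(round(r, 4))  # 1.0
```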
Inferential Statistics
Probability
Assessing Probability –
P(A) is always >= 0 and always <= 1. P(A) is a pure (dimensionless) number.
The sample space is the set containing all possible outcomes.
Mutually Exclusive Events – Come from one sample space; the occurrence of one event prevents the occurrence
of the other. Mutually exclusive events (with nonzero probability) can't be independent.
* P(A ∩ B) is the probability of A and B together.
Difference between mutually exclusive and independent events – Mutually exclusive events are part of the same
sample space and cannot occur together, while independent events typically come from different experiments and
the occurrence of one does not change the probability of the other:
P(A ∩ B) = P(A) * P(B) when A and B are independent.
Marginal Probability -> The probability of an event irrespective of the outcome of another variable.
Joint Probability -> The probability of two things happening jointly together.
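Both can be read off a contingency table; a sketch using a hypothetical table of 100 people by gender and drink preference:

```python
# Joint and marginal probabilities from a contingency table
# (made-up counts of 100 people).
counts = {
    ("M", "Tea"): 20, ("M", "Coffee"): 30,
    ("F", "Tea"): 35, ("F", "Coffee"): 15,
}
total = sum(counts.values())  # 100

# Joint probability: P(Male and Tea)
p_m_and_tea = counts[("M", "Tea")] / total

# Marginal probability: P(Tea), irrespective of gender
p_tea = sum(v for (g, d), v in counts.items() if d == "Tea") / total

print(p_m_and_tea, p_tea)  # 0.2 0.55
```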
Bayes' Theorem (developed by Thomas Bayes)
Formula
P(B | A) = P(A | B) * P(B) / P(A)
B' is the complement of B, so P(A) can be expanded as P(A | B) * P(B) + P(A | B') * P(B').
Example – will a movie be a hit?
P(Hit) = 40%
P(Flop) = 60%
P(Hit | Positive Survey) = P(Positive Survey | Hit) * P(Hit) / [P(Positive Survey | Hit) * P(Hit) + P(Positive Survey | Flop) * P(Flop)]
Example – defective parts from a new and an old machine:
P(Defect) = ?
P(New) = 75%
P(Old) = 25%
P(New | Defect) = P(Defect | New) * P(New) / [P(Defect | New) * P(New) + P(Defect | Old) * P(Old)]
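Bayes' theorem applied to the machine example above; the 75%/25% machine shares come from the notes, but the per-machine defect rates are hypothetical assumptions added for illustration:

```python
# Bayes' theorem: which machine did a defective part come from?
p_new, p_old = 0.75, 0.25          # priors (from the notes)
p_def_new, p_def_old = 0.02, 0.06  # assumed defect rates per machine

# Total probability of a defect (the denominator)
p_defect = p_def_new * p_new + p_def_old * p_old

# Posterior: probability a defective part came from the new machine
p_new_given_defect = p_def_new * p_new / p_defect

print(round(p_defect, 4), round(p_new_given_defect, 4))  # 0.03 0.5
```

Even though the new machine produces 75% of the parts, its lower defect rate means a defective part is only 50% likely to have come from it.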
Distributions: Binomial, Poisson, Normal

Binomial Distribution
Binomial Distribution is used for quality control and quality assurance function. Mostly used by manufacturing
units for defective analysis.
Cumulative would be false if the probability is required for exactly that number of successes, and true if it is
required for less than or equal to that number.
Python Formula
Absolute = stats.binom.pmf(k, n, p)
Cumulative = stats.binom.cdf(k, n, p)
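The same values can be computed from first principles with the standard library (a sketch; scipy's stats.binom.pmf / stats.binom.cdf give the same results, and the 10%-defective scenario is made up):

```python
# Binomial pmf and cdf from the formula C(n, k) * p^k * (1-p)^(n-k).
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(at most k successes)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Hypothetical defective analysis: 10 items inspected, 10% defective rate
print(round(binom_pmf(2, 10, 0.1), 4))  # P(exactly 2 defectives) -> 0.1937
print(round(binom_cdf(2, 10, 0.1), 4))  # P(at most 2 defectives) -> 0.9298
```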
Poisson Distribution
Difference between Poisson and Binomial – Defect analysis (number of defects per unit) uses the Poisson
distribution, while defective analysis (number of defective units out of n) uses the Binomial distribution.
Python Formula
Absolute = stats.poisson.pmf(k, mu)
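The Poisson pmf can likewise be written out directly (a sketch; scipy's stats.poisson.pmf gives the same value, and the 3-defects-per-unit rate is made up):

```python
# Poisson pmf from the formula P(X = k) = e^(-mu) * mu^k / k!
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(exactly k events when the average rate is mu)."""
    return exp(-mu) * mu**k / factorial(k)

# Hypothetical defect analysis: on average 3 defects per unit
print(round(poisson_pmf(2, 3), 4))  # P(exactly 2 defects) -> 0.224
```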
Normal Distribution
Z = (X - µ) / σ
The mean of the SND (Standard Normal Distribution) is always 0, and its SD is always 1.
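Standardizing a value with Z = (X - µ) / σ; a sketch with made-up numbers (scipy's stats.norm.cdf(z) would then give the cumulative probability):

```python
# Converting a raw score to a z-score on the standard normal scale.
mu, sigma = 100, 15  # hypothetical population mean and SD (e.g. an IQ scale)
x = 130

z = (x - mu) / sigma
print(z)  # 2.0: the value lies 2 standard deviations above the mean
```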
1. Identification of a problem
2. Collect necessary data
a. Primary Data
b. Secondary Data
3. Clean Data
a. Missing Data Imputation
b. Outlier Detection and Treatment
4. Transform Data
a. Statistical Transformation
b. Non-Statistical Transformation
5. Data visualization
6. Statistical modeling/Inferential Statistics
7. Insights Presentation