Notes


Descriptive Statistics

This covers the summarization of data, including graphs and charts.

Inferential Statistics

It is the method of drawing conclusions about population parameters from a sample. It involves point estimation, interval estimation and hypothesis testing.

Population is the set of all possible observations of a specific characteristic of interest.

Sample is a subset of the population.

Sampling

Sampling is a key task. It involves understanding the characteristics of the population and then choosing sample observations that represent the population effectively.

There are various techniques to do sampling –

 Cluster Sampling

Determining ideal sample size

Margin of Error

Interval Estimation – A technique to find, from the sample, the interval within which most of the population lies.

Point Estimation – Finding the average point of the sample distribution which most closely represents the population.

Hypothesis Testing – Take an assumption, and then try to prove or disprove it using sample data.

Qualitative – Describes the qualitative characteristics of the set, e.g. gender or preference. Usually textual data.

Quantitative – Measures a definite quantity of the set, i.e. numbers; it can be discrete or continuous.

a. Discrete – Takes fixed, integer values, e.g. family size, number of hotel rooms, shoe size, results of rolling dice. Best represented in a graph or table.
b. Continuous – Can take any value in a range and varies from sample to sample, e.g. waiting time, room temperature, weight, height. It can be in decimals. Best represented as a graph.

Data Type | Measurement Scale | Arithmetic Operations | Examples | Remarks
Qualitative | Nominal | =, ≠ | Gender, Eye Color |
Qualitative | Ordinal | =, ≠, >, <, >=, <= | Olympic Medal, Rating |
Quantitative | Interval | =, ≠, >, <, >=, <=, +, - | IQ Score, Temperature, Calendar Data | Interested in the gap between the data
Quantitative | Ratio | =, ≠, >, <, >=, <=, +, -, *, / | Cost of an item | Best to have, as it provides an exact count

Measurement Scales - Qualitative

Nominal – Categories where everything is equal; nothing is better or worse.

Ordinal – Has some order to it, where one category is better than another.

Measurement Scales - Quantitative

Interval – Data where the gap between the starting and ending points is what is evaluated. Taking ratios must be avoided on interval data.

Ratio – Has an absolute (true) zero, so values are exact. Money and length are examples of this.
Product – Nominal
Age – Ratio
Gender – Nominal
Education – Ratio
Marital Status – Nominal
Usage – Ratio
Fitness – Ordinal
Income – Ratio
Miles - Ratio

Descriptive Stats

1. Central Tendency –
a. Arithmetic mean (average) is the most powerful measure of central tendency, but it gets affected very much by extreme values (outliers).
b. Median is effective for managing outliers in the data set.
A wide gap between the mean and the median indicates that the data set has many outliers.
c. Mode – the most frequent (most often occurring) value in the distribution.
Most suited to qualitative data, such as brand.

Central Tendency tools are used to find the shape of the data set.
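A minimal Python sketch of how these measures behave (the data and the column names Income and Brand are made up for illustration):

import pandas as pd

# Hypothetical data set with one extreme income value (an outlier)
mydata = pd.DataFrame({
    'Income': [35000, 40000, 42000, 45000, 48000, 52000, 250000],
    'Brand': ['A', 'B', 'A', 'C', 'A', 'B', 'A'],
})

print(mydata['Income'].mean())    # pulled up sharply by the 250000 outlier
print(mydata['Income'].median())  # stays near the centre of the data
print(mydata['Brand'].mode()[0])  # mode suits qualitative data such as Brand

The wide gap between the mean and the median here is exactly the outlier signal described above.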

2. Dispersion (variation in data) –
a. Range
Measures variation in the distribution: less variation means more consistent, more variation means less consistent.

b. IQR (Inter Quartile Range) -> Q3 – Q1

Measures the variation in the middle 50% of the data.

c. Box Plot -> IQR is the basis for it.

Q1 = (n+1)/4 th observation
Q2 = (n+1)/2 th observation
Q3 = 3(n+1)/4 th observation
Q4 = (n+1) th observation
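A minimal numpy sketch of the quartiles and IQR (the waiting-time values are made up; note that np.percentile interpolates, so its output can differ slightly from the (n+1)/4-th observation rule above):

import numpy as np

# Hypothetical waiting times (minutes)
waiting = np.array([4, 5, 6, 7, 8, 9, 10, 12, 15, 30])

q1, q2, q3 = np.percentile(waiting, [25, 50, 75])  # quartiles
iqr = q3 - q1                                      # spread of the middle 50% of the data

print(q1, q2, q3, iqr)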

3. Pattern Recognition Tools – Histogram


It depicts the shape of the data.

Normal Distribution

Indicates that the process is in control and is the most desirable state for prediction.

Relates to the Six Sigma (within specification) quality measure in the manufacturing domain.

The first step when analyzing any data set is to find the mean, the median and the shape of the data, to understand the distribution of the data set.
Aim to achieve a normal distribution; this signifies that the process is in control and outliers have been treated.

Outliers

Natural – The case actually happened. Treatment can be done using transformation, to reduce the variation.

Artificial – A wrong data value. Treatment can be done by replacing it with the mean, median, etc.

If the mean is lower than the median, the data is left (negatively) skewed; if the mean is higher than the median, it is right (positively) skewed.

Quartiles

IQR = Inter Quartile Range (Area between Q1 and Q3)


IQR is effective for removing outliers and for considering the range (or interval) in which the bulk of the distribution lies.

Distribution plots are only for ordinal or ratio data.

hist() is used to build the histogram, e.g. Mydata['Education'].hist()

The x axis shows the column's observations and the y axis shows the frequency.

Dist plot is used to plot the distribution. It is part of the seaborn package and shows the curve on top of the histogram.

Count plot is used to represent nominal (categorical) data. It is part of the seaborn package.

Box plot is used to represent the dataset around its IQR. It is part of the seaborn package.
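A hedged sketch of these plots with a made-up data frame (the column names Education, Gender and Income mirror the variables listed earlier; the values are invented). Note that seaborn's distplot is deprecated in newer versions, so histplot with kde=True is used here as the equivalent:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data frame for illustration only
mydata = pd.DataFrame({
    'Education': [12, 14, 14, 16, 16, 16, 18, 18, 20, 21],
    'Gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
    'Income': [30, 35, 38, 45, 47, 52, 60, 62, 75, 120],
})

mydata['Education'].hist()                # pandas histogram: x = values, y = frequency
plt.figure()
sns.histplot(mydata['Income'], kde=True)  # histogram with the distribution curve on top
plt.figure()
sns.countplot(x='Gender', data=mydata)    # counts of a nominal (categorical) column
plt.figure()
sns.boxplot(x=mydata['Income'])           # box from Q1 to Q3, whiskers at 1.5 * IQR
plt.show()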

Standard Deviation is the average distance of all the data points from the mean.

Empirical Rule – "empirical" means coming from experiments.

If the mean is 60 and the standard deviation is 10:

1. 1 standard deviation – 68% of the distribution lies between 50 and 70.
2. 2 standard deviations – 95% of the distribution lies between 40 and 80.
3. 3 standard deviations – 99.7% of the distribution lies between 30 and 90.

Formula for the standard deviation of a population: σ = sqrt( Σ(x – µ)² / N )

Formula for the standard deviation of a sample: s = sqrt( Σ(x – x̄)² / (n – 1) )

Coefficient of Variation (CV) – helps to determine which distribution is more consistent. CV = standard deviation / mean.

It is used to normalize comparisons between data sets measured in different units. The lower the CV, the more consistent the data set.
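A short sketch (with invented values) comparing the consistency of two data sets measured in different units via the CV:

import numpy as np

# Hypothetical delivery times (minutes) and order values (currency units);
# their standard deviations cannot be compared directly because the units differ
times = np.array([28, 30, 31, 29, 32, 30])
values = np.array([450, 700, 520, 610, 480, 650])

def cv(x):
    # coefficient of variation = sample standard deviation / mean
    return np.std(x, ddof=1) / np.mean(x)

print(cv(times), cv(values))  # the smaller CV indicates the more consistent data set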

Box plot is an excellent way of detecting outliers.

The box in a box plot spans Q1 to Q3 of the distribution.

Box Plot Rule

The whiskers are set using 1.5 * IQR:

Right (upper) whisker -> Q3 + 1.5 * IQR, or stop at the last data point if it comes first.

Left (lower) whisker -> Q1 – 1.5 * IQR, or start from the first data point if it comes first.
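A short numpy sketch (with invented values) applying the 1.5 * IQR fences to flag outliers:

import numpy as np

# Hypothetical sample with one extreme value
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 45])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # lower (left) fence
upper = q3 + 1.5 * iqr  # upper (right) fence

outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)  # 45 falls outside the upper fence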


Day 2

Correlation – It is done to predict the occurrence of one variable on the basis of another, and to find dependent variables. It is only useful for checking linear relationships, and it can only be done with numerical data, not text data.

Scatter Plot – Plotting the data points to check the shape of the distribution (e.g. a linear trend).

Covariance -> The impact of the variation of one variable on another.

Formula: COV(X, Y) = Σ(X – Xbar)(Y – Ybar) / (n – 1)

If the covariance between two variables is a large positive number, it indicates that the two variables move in the same direction.

Drawback – It cannot be the sole decider of the relationship, because its magnitude depends on the scale/units of the variables.

Coefficient of correlation -> Measures the relative strength of the linear relationship between two numerical variables. It brings the value to between -1 and +1, which puts two numerical variables on the same evaluable scale and tests the strength of the linear relationship between them.

r = COV(X, Y) / (Sx * Sy)

where Sx is the standard deviation of X (and Sy of Y)

A value close to +1 or –1 means a strong linear relationship, while a value close to 0 means a weak, scattered relationship.
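A minimal pandas sketch (the columns ad_spend and sales are assumed names with invented values) showing the covariance, the correlation coefficient and a scatter plot:

import pandas as pd

# Hypothetical numerical columns for illustration only
df = pd.DataFrame({
    'ad_spend': [10, 12, 15, 17, 20, 22, 25],
    'sales': [40, 44, 52, 55, 61, 66, 70],
})

print(df.cov())                          # covariance matrix (scale-dependent)
print(df['ad_spend'].corr(df['sales']))  # r = COV(X, Y) / (Sx * Sy), between -1 and +1

# A scatter plot is the usual first check for a linear pattern
df.plot.scatter(x='ad_spend', y='sales')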
Inferential Statistics

Statistical Methods of Decision Making

Probability

Assessing Probability –

a. Empirical probability (based on experience) – based on historical data
b. Subjective probability – based on experience or expert analysis, when no past data is available
c. A priori classical probability – based on knowledge of the process

Formula: P(A) = Number of favourable events / Total number of events

where P(A) is always >= 0 and always <= 1. P(A) is a pure (dimensionless) number.

Sample Space is the set which contains all the possible values.

Mutually Exclusive Events – come from one sample space; the occurrence of one event prevents the occurrence of the other. They cannot be independent events.

Addition Rule (For Dependent Events)

Mutually Exclusive Events – P(A U B) = P(A) + P(B), read as P(A) or P(B)

Non-Mutually Exclusive Events – P(A U B) = P(A) + P(B) – P(A ∩ B)

* P(A ∩ B) is the probability of A and B occurring together

Difference between mutually exclusive and independent events – mutually exclusive events are part of the same sample space, while independent events come from different sample spaces.

Multiplication Rule (For Independent Events)

Multiple trials, multiple sample spaces.

Two events are independent if the occurrence of one does not influence the occurrence of the other.

P(A ∩ B) = P(A) * P(B)

Marginal Probability -> It is the probability of an event irrespective of the outcome of another variable .

Joint Probability -> The probability of two things happening jointly (together).

Conditional Probability (one event dependent on the occurrence of another)

P(B | A) -> the probability of B given that A has occurred. It applies to dependent events.

(For Dependent, i.e. Not Independent, Events)

P(A ∩ B) = P(A) * P(B | A), or equivalently P(B | A) = P(A ∩ B) / P(A)


All the recommendation engine algorithms use conditional probability.
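A toy Python sketch of the rule for dependent events (a hypothetical card-drawing example, not from the notes): drawing two aces in a row from a 52-card deck without replacement.

# P(first card is an ace)
p_a = 4 / 52
# P(second card is an ace | first card was an ace) - only 3 aces left in 51 cards
p_b_given_a = 3 / 51

p_a_and_b = p_a * p_b_given_a  # P(A ∩ B) = P(A) * P(B | A)
print(p_a_and_b)               # ≈ 0.0045
print(p_a_and_b / p_a)         # recovers P(B | A) = P(A ∩ B) / P(A) ≈ 0.0588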

Bayes' Theorem (developed by Thomas Bayes)

It is an extension of Conditional probability

It is a mixture of the prior probability with new evidence.

Formula:

P(B | A) = P(A | B) * P(B) / [ P(A | B) * P(B) + P(A | B') * P(B') ]

where B' is the complement of B

A = the new event (evidence) that impacts P(B)

P(A | B) -> found by checking past records for how often A occurred when B occurred

Used by Google Search

P(Hit) = 40%

P(Flop) = 60%

P(Positive Survey | Hit) = 90%

P(Negative Survey | Flop) = 70%

P(Hit | Positive survey) = ?

P(Hit | Positive Survey) = P(Positive Survey | Hit) * P(Hit) / [ P(Positive Survey | Hit) * P(Hit) + P(Positive Survey | Flop) * P(Flop) ]

= 90% * 40% / (90% * 40% + 30% * 60%) = 0.36 / 0.54 ≈ 0.67

(P(Positive Survey | Flop) = 1 – 70% = 30%)

P(Defect | Older Machine) = 23%

P(Defect | New Machine) = 8%

P(New Machine| Defect) = .51

P(Defect) = ?

P (New) = 75%

P (Old) = 25%

P(New Machine | Defect) = P(Defect | New Machine) * P(New Machine) / [ P(Defect | New Machine) * P(New Machine) + P(Defect | Old Machine) * P(Old Machine) ]

= 8% * 75% / (8% * 75% + 23% * 25%)

= 0.06 / (0.06 + 0.0575) = 0.06 / 0.1175 ≈ 0.51

P(Defect) = 0.08 * 0.75 + 0.23 * 0.25 = 0.1175 (the denominator above)
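A small Python sketch (my own helper function, not from the notes) that reproduces both worked examples above:

def bayes(p_b, p_a_given_b, p_a_given_not_b):
    # P(B | A) = P(A | B) * P(B) / [ P(A | B) * P(B) + P(A | B') * P(B') ]
    numerator = p_a_given_b * p_b
    return numerator / (numerator + p_a_given_not_b * (1 - p_b))

# Example 1: P(Hit | Positive Survey); P(Positive | Flop) = 1 - 0.70 = 0.30
print(bayes(p_b=0.40, p_a_given_b=0.90, p_a_given_not_b=0.30))  # ≈ 0.667

# Example 2: P(New Machine | Defect)
print(bayes(p_b=0.75, p_a_given_b=0.08, p_a_given_not_b=0.23))  # ≈ 0.51

# Total probability of a defect, P(Defect) = denominator of example 2
print(0.08 * 0.75 + 0.23 * 0.25)  # 0.1175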


Probability Distribution

Discrete Probability Distribution: Binomial, Poisson

Continuous Probability Distribution: Normal

Binomial Distribution can only occur when 4 things are true –

1. Trials are random and independent.
2. The number of trials is fixed.
3. There are only two outcomes of each trial.
4. The probability of success is uniform throughout the n trials.

Binomial Distribution is used for quality control and quality assurance functions. It is mostly used by manufacturing units for defectives analysis.

The Excel formula for the Binomial Distribution is

= BINOM.DIST(number_s, trials, probability_s, cumulative)

Cumulative is FALSE if the probability of exactly that number of successes is required, and TRUE if the probability of less than or equal to that number is required.

"At least 4 will pay" means P(4) + P(5) + P(6) + P(7), i.e. 1 – P(X <= 3).

Python Formula

Absolute (exact) = stats.binom.pmf(x, n, p)

Cumulative = stats.binom.cdf(x, n, p)
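A hedged scipy sketch of these formulas; n = 7 trials is implied by the "at least 4 will pay" example, while the success probability p = 0.6 is an assumed value used only for illustration:

from scipy import stats

n, p = 7, 0.6  # n from the example above; p is an assumption

p_exactly_4 = stats.binom.pmf(4, n, p)  # P(X = 4)
p_up_to_3 = stats.binom.cdf(3, n, p)    # P(X <= 3)
p_at_least_4 = 1 - p_up_to_3            # P(X >= 4) = P(4) + P(5) + P(6) + P(7)

print(p_exactly_4, p_at_least_4)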

Poisson Distribution

This is used for defect analysis.

Difference between Poisson and Binomial – when we are doing defect analysis (counting defects) we use the Poisson distribution, and when we are doing defectives analysis (counting defective items) we use the Binomial distribution.

It applies under the following conditions:

1. Trials are random and independent.
2. There can be multiple outcomes of the trial.
3. The probability of success is uniform throughout the n trials.

The Excel formula for the Poisson Distribution is

= POISSON.DIST(x, mean, cumulative)

Python Formula

Absolute (exact) = stats.poisson.pmf(x, mu)

Cumulative = stats.poisson.cdf(x, mu)
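A hedged scipy sketch of the Poisson formulas; the rate of 3 defects per unit is an assumed value (scipy names the rate mu, since lambda is a reserved word in Python):

from scipy import stats

mu = 3  # assumed average number of defects per unit

p_exactly_2 = stats.poisson.pmf(2, mu)  # P(X = 2) defects
p_at_most_2 = stats.poisson.cdf(2, mu)  # P(X <= 2) defects

print(p_exactly_2, p_at_most_2)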

Normal Distribution

Requires the mean (µ) and the standard deviation (σ).

The Excel formula for the Normal Distribution is NORM.DIST

Python Formula for ND is stats.norm.cdf

Example - stats.norm.cdf(0.28, loc=.295, scale=.025)

z = (X – µ) / σ

The mean of the SND (Standard Normal Distribution) is always 0, and its standard deviation is always 1.
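A sketch reproducing the stats.norm.cdf example from the notes, plus the equivalent calculation via the z-score on the Standard Normal Distribution:

from scipy import stats

# P(X <= 0.28) when µ = 0.295 and σ = 0.025 (the example above)
p = stats.norm.cdf(0.28, loc=0.295, scale=0.025)

# Equivalent via the z-score on the SND (mean 0, standard deviation 1)
z = (0.28 - 0.295) / 0.025  # z = (X - µ) / σ = -0.6
p_from_z = stats.norm.cdf(z)

print(p, p_from_z)  # both ≈ 0.274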

Steps to Solve a business Analytics Problem

1. Identification of a problem
2. Collect necessary data
a. Primary Data
b. Secondary Data
3. Clean Data
a. Missing Data Imputation
b. Outlier Detection and Treatment
4. Transform Data
a. Statistical Transformation
b. Non Statistical Transformation
5. Data visualization
6. Statistical modeling/Inferential Statistics
7. Insights Presentation
