Business Statistics Using Excel: Understanding Standard Deviation Well Is Critical For This Course
Business Statistics Using Excel: Understanding Standard Deviation Well Is Critical For This Course
Business Statistics Using Excel: Understanding Standard Deviation Well Is Critical For This Course
The art and science of collecting, analyzing, presenting and interpreting data.
From accounting, taking random samples of invoices, making a list of items of what we've sold,
monitoring whether the machine is filling too much too little, and counting the number of items items
To production, doing hypothesis testing, finance projecting: what a stock might earn in the future, and
even marketing,
Statistics is art because we have to be artful in the way we set up the data and present the data well.
Statistics is science because we must use the scientific methods of experiment and probability.
What is data?
Data is a set of values of qualitative variables or quantitative variables about one or more persons or
objects, while a datum (singular of data) is a single value of a single variable.
“Data” and “information” are not the same: data transforms into information when it is viewed in a set
context or in post-analysis.
Look For: Data vs Information vs Knowledge
What is a statistic?
Graph: A graph is a picture designed to express words, particularly the connection between two or more
quantities.
Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and
corrected by researchers.
Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors. Data
processing commonly occurs by stages, and the "processed data" from one stage may be considered the
"raw data" of the next stage. Field data is raw data that is collected in an uncontrolled "in situ"
environment.
Data has been described as the new oil of the digital economy
A variable is any measured characteristic or attribute that differs for different subjects. For example, if
the height of 1000 subjects was measured, then height would be a variable.
Variables can be quantitative variables or qualitative variables (sometimes called categorical variables).
Quantitative variables (or numerical variables) are measured on one of three different scales: Interval
scale, ratio scale, or ordinal scale.
Interval scale: An interval scale is a scale of measurement where the distance between any two adjacent
units of measurement (or ‘intervals’) is the same, but the zero point is arbitrary.
Ratio scale: Ratio scale consists not only of equidistant points but also has a meaningful zero point.
Ordinal scale: Ordinal scale is a scale where the values/observations belonging to it can be ranked (put
in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
Nominal scale: A set of data is said to be categorical if the values or observations belonging to it can be
sorted according to category.
If a group of business students was asked to name their favorite browser to browse the Web, then the
variable would be qualitative. If the time spent on the computer to research a topic was measured, then
the variable would be quantitative. Nominal measurement consists of assigning items to groups or
categories. No quantitative information is conveyed and no ordering of the items is implied. Nominal
scales are therefore qualitative rather than quantitative. Football club allegiance, sex or gender, degree
type, and courses studies are all examples of nominal scales.
Continuous variable: A set of data is said to be continuous if the values belong to a continuous interval
of real values.
Discrete variable: Variables that can only take on a finite number of values are called "discrete
variables”
Also Learn: Frequency Distribution, Grouped Frequency Distribution, class boundaries and Histogram
Classes: Classes provide several convenient intervals into which the values of the variable of a frequency
distribution may be grouped.
Class limit: Class limits separate one class in a grouped frequency distribution from another.
True or mathematical limits: True or mathematical limits separate one class in a grouped frequency
distribution from another.
If CW = class width, UCB = upper class boundary, and LCB = lower class boundary, then the class width is
calculated using equation. CW = UCB - LCB
Cross-tabulation: Cross-tabulation is the process made with two or more data sources (variables) that
are tabulating the results of one against the other
What is Statistics?
https://www.youtube.com/watch?
v=weHwUZdg7Qo&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&t=0s
Data Categorical, Quantitative, Nominal, Ordinal, Interval, Ratio
https://www.youtube.com/watch?
v=bfJZ5Hm8YqE&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=7
Frequency Distribution & Categorical Variables
https://www.youtube.com/watch?
v=roZrH6hcnic&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=8
Subtle Chart Junk
https://www.youtube.com/watch?
v=38D6nMQt6LU&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=9
Frequency Distributions, Histograms, Skew, Quantitative Variable
https://www.youtube.com/watch?
v=YfVu7xGHgnA&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=10
Cumulative Frequency Distribution & Chart, PivotTable & Formula
https://www.youtube.com/watch?
v=tdRZqfFI_1o&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=11
Dot Plot and Stem and Leaf Charts
https://www.youtube.com/watch?
v=tNtt7fyWcoM&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=12
Power Query Import Multiple Text Files, Grade Histogram by Year
https://www.youtube.com/watch?
v=gOdraewU6u4&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=13
Cross Tabulation & Charts For Two Variables, Simpson’s Paradox
https://www.youtube.com/watch?
v=3OwB8qtjYZw&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=14
VLOOKUP to Add New Field To Table & Create Cross Tabulation
https://www.youtube.com/watch?
v=GSCmJVEXe_I&list=PLrRPvpgDmw0ngx_uPhvasTbOWLOztsaBj&index=16
Also learn: Adding a PivotChart to an existing PivotTable, Refreshing a PivotTable, Extending the dataset
in a PivotTable, How to create a Pie Chart, How to create a bar chart, changing their labels, Formatting
the chart axis, Creating a Histogram, Creating a Frequency Polygon
A population includes all of the entities of interest in a study. A sample is a subset of the population,
often randomly chosen and preferably representative of the population as a whole.
A data set is usually a rectangular array of data, with variables in columns and observations in rows. A
variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or
salary. An observation (or case or record) is a list of all variable values for a single member of a
population.
A dummy variable is a 0−1 coded variable for a specific category. It is coded as 1 for all observations in
that category and 0 for all observations not in that category.
A binned (or discretized) variable corresponds to a numerical variable that has been categorized into
discrete categories. These categories are usually called bins.
Cross-sectional data are data on a cross section of a population at a distinct point in time. Time series
data are data collected over time.
While measures of variability include the standard deviation (or variance), the minimum and maximum
values of the variables, kurtosis and skewness
Inferential statistics is defined as using the sample descriptive statistics to make an inference
(estimation) of the population.
Mode: The mode is the most frequently occurring value in a set of discrete data.
Median: The median is the value halfway through the ordered data set.
Central tendency Measures: the location of the middle or the centre of a distribution.
Arithmetic mean: The sum of a list of numbers divided by the number of numbers.
Population mean: The population mean is the mean value of all possible values.
Extreme value: An extreme value is an unusually large or an unusually small value compared with the
others in the data set.
Outlier: An outlier is an observation in a data set which is far removed in value from the others in the
data set.
Quartiles: Quartiles are values that divide a sample of data into four groups containing an equal number
of observations.
Q1: Q1 is the lower quartile and is the data value a quarter way up through the ordered data set.
Q3: Q3is the upper quartile and is the data value a quarter way down through the ordered data set.
Average(B4:B16) Average/Mean
Use of SumProduct
Cumulative frequency distribution: The cumulative frequency for a value x is the total number of scores
that are less than or equal to x.
Symmetrical data set: A data set is symmetrical when the data values are distributed in the same way
above and below the middle value.
Range: The range of a data set is a measure of the dispersion of the observations.
Interquartile range: The interquartile range is a measure of the spread of or dispersion within a data set.
Standard deviation: Measure of the dispersion of the observations (a square root value of the variance).
Coefficient of variation: The coefficient of variation measures the spread of a set of data as a proportion
of its mean.
Let the values be between Cell B4:B16 and the let the column be called X
Then X^2 value in Cell C4 can be calculated using =B4^2
Copy formula down C4:C16
n = Cell F4 = number of values between B4 to B16 : =COUNT(B4:B16)
Σx = Cell F5 = Sums the value between B4 & B16: =SUM(B4:B16)
ΣX^2 = Cell F6= Sums the value between C4 & C16: =SUM(C4:C16)
mean = Cell F7 = Sum/Count: =F5/F4
variance = Cell F8 Variance of the values between B4 & B16 : =F6/F4−F7^2
standard deviation = Cell F9 =F8^0.5
Range = Cell F13 =MAX (B4:B16)−MIN (B4:B16)
Q1 = Cell F14 =QUARTILE.INC (B4:B16, 1)
Median or Q2= Cell F15 =MEDIAN(B4:B16) or QUARTILE.INC (B4:B16, 2)
Q3 = Cell F16 =QUARTILE.INC(B4:B16, 3)
QR = Cell F17 =F16−F14
SIQR = Cell F18 =(F16−F14)/2
Mean = Cell F19 =AVERAGE(B4:B16)
varp = Cell F20 =VAR.P(B4:B16)
std = Cell F21 =STDEV.P(B4:B16)
SIQR = Semi-interquartile range is the measure of spread or dispersion. It is computed as one half of the
difference between the 75th percentile as 25th percentile.
Variation: Variation is a measure that describes how spread out or scattered a set of data is.
It is very important to note that Excel contains two different functions (VAR.S (), VAR.P ()) to calculate
the value of the variance. The function that you use is dependent upon whether the data set represents
the complete population or is a sample from the population being measured.
1. If the data set is the complete population then the population variance (σ2) is given by the Excel
function VAR.P ().
2. If the data set is a sample from the population then the sample variance (S2) is given by the Excel
function VAR.S ().
Population variance: The population variance is the variance of all possible values.
Shape: The shape of the distribution refers to the shape of a probability distribution and involves the
calculation of skewness and kurtosis.
The value for skewness is zero for symmetric distributions (mean = median).
If mean < median, then the measure of skewness is negative and the distribution is said to be negatively
skewed.
If mean > median, then the measure of skewness is positive and the distribution is said to be positively
skewed.
1) Q3 — Median = Median — Q1
2) Largest value — Q3 = Q1 — smallest value
3) Median = Midrange
Right-skewed distributions: Largest value — Q3 greatly exceeds Q1 — Smallest value
Left-skewed distributions: Q1 — Smallest value greatly exceeds Largest value — Q3
Box plot: A box plot is a way of summarizing a set of data measured on an interval scale.
Also Learn: How to draw a Box plot in Excel
Different tools should be used to examine relationships, depending on whether the variables involved
are numeric or categorical.
Use a crosstabs, a table of counts of joint categories, to discover relationships between two
categorical variables.
Relationships between two variables are usually more evident when the counts are expressed as
percentages of row totals or percentages of column totals.
For finding relationships among numerical variables: We might want to examine the relationship
between heights and weights of people, or between salary and years of experience of employees. To
study such relationships, we introduce two new summary measures, correlation and covariance, and a
new type of chart called a scatterplot.
Scatterplots are appropriate only for truly numerical variables, not for categorical variables that have
been coded numerically. In particular, many people create dummy (0–1) variables for categorical
variables such as Gender and then include these dummies in a table of correlations. This is certainly
possible, but it is not a good practice.
A scatterplot is a scatter of points, where each point denotes the values of an observation for two
selected variables. The two variables are often labeled generically as X and Y, so a scatterplot is
sometimes called an X-Y chart. The whole purpose of a scatterplot is to make a relationship (or the lack
of it) apparent.
Scatterplots are great for initial exploration of the data. If a scatterplot suggests a relationship between
two variables, other methods can then be used to examine this relationship in more depth.
Excel allows you to superimpose a trend line, linear or curved, on a scatterplot. It is an easy way to
quantify the relationship apparent in the scatterplot.
A trend line is a line or curve that “fits” the scatter as well as possible. This could indeed be a straight
line, or it could be one of several types of curves.
Intuitively, the relationship is “strong” if the points in a scatterplot cluster tightly around some straight
line. If this straight line rises from left to right, the relationship is positive and the measures will be
positive numbers. If it falls from left to right, the relationship is negative and the measures will be
negative numbers.
Excel’s old COVAR function actually uses denominator n, so it gives the population covariance, not the
sample covariance (denominator n–1) in Equation (3.1). Starting in Excel 2010, both versions are
available, named COVARIANCE.P (population) and COVARIANCE.S (sample).
Covariance is too sensitive to the measurement scales of X and Y to make it interpretable, so it is better
to rely on correlation, which is unaffected by measurement scales.
Covariance has a serious limitation as a descriptive measure because it is very sensitive to the units in
which X and Y are measured. For example, the covariance can be inflated by a factor of 1000 simply by
measuring X in dollars rather than thousands of dollars. This limits the usefulness of covariance as a
descriptive measure
The correlation, denoted by Correl(X, Y), remedies this problem. It is a unitless quantity that is
unaffected by the measurement scale. For example, the correlation is the same regardless of whether
the variables are measured in dollars, thousands of dollars, or millions of dollars.
The correlation is not only unaffected by the units of measurement of the two variables, but it is always
between −1 and +1. The closer it is to either of these two extremes, the closer the points in a scatterplot
are to a straight line, either in the negative or positive direction. On the other hand, if the correlation is
close to 0, the scatterplot is typically a “cloud” of points with no apparent relationship. However,
although it is not common, it is also possible that the points are close to a curve and have a correlation
close to 0. This is because correlation is relevant only for measuring linear relationships.
Correlation is useful only for measuring the strength of a linear relationship. Strongly related variables
can have correlation close to 0 if the relationship is nonlinear.