1 Introduction To Statistics
Learning Objectives
In this chapter, you will learn:
• What is Statistics
• Why Statistics
• Basic vocabulary used in Statistics
• How statistics is used in Business
• The sources and types of data used in Business
• Types of Variables
• Levels of Measurement
• Tabular and Graphical Presentation of Data
What is Statistics?
Statistics: from Data to Information
Data: Facts, especially numerical facts, collected together for reference or
information.
Information: Knowledge communicated concerning some particular fact.
Descriptive Statistics
Collect data
ex. Survey
Present data
ex. Tables and graphs
Characterize data
ex. Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Steps: Collect, Organize, Summarize, Display, Analyze
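The sample mean formula can be checked with a few lines of Python. This is a minimal sketch: the values are the student marks used as example data in the vocabulary section of this chapter, and the function name `sample_mean` is an illustrative assumption.

```python
from statistics import mean


def sample_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


# Student marks from the "Data" example in this chapter.
marks = [67, 74, 71, 83, 93, 55, 48]

x_bar = sample_mean(marks)
print(round(x_bar, 2))  # 70.14

# The standard library gives the same result.
library_mean = mean(marks)
```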
Inferential Statistics
Estimation
ex. Estimate the population mean weight using the sample mean weight
Hypothesis testing
ex. Test the claim that the population mean weight is 120 pounds
Inferential statistics is used to:
• Predict and forecast values of population parameters
• Test hypotheses about values of population parameters
• Make decisions
Drawing conclusions and/or making decisions concerning a population
based on sample results.
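The two examples above (estimating the mean weight and testing the 120-pound claim) can be sketched in Python. The sample values below are hypothetical, and the one-sample t statistic is one common way to test such a claim, not necessarily the method this course uses later.

```python
import math
import statistics

# Hypothetical sample of 10 weights in pounds (made-up values for illustration).
weights = [118, 122, 125, 119, 121, 117, 124, 120, 123, 121]
claimed_mean = 120  # the claim under test, from the example above

n = len(weights)
x_bar = statistics.mean(weights)    # estimation: point estimate of the population mean
s = statistics.stdev(weights)       # sample standard deviation (n - 1 divisor)
standard_error = s / math.sqrt(n)

# Hypothesis testing: how many standard errors is the sample mean from the claim?
t_stat = (x_bar - claimed_mean) / standard_error

# Two-sided critical value for 9 degrees of freedom at the 5% level,
# taken from a standard t table.
T_CRITICAL = 2.262
reject_claim = abs(t_stat) > T_CRITICAL

print(f"sample mean = {x_bar}, t = {t_stat:.3f}, reject claim: {reject_claim}")
```

With these made-up values the sample mean is close to 120, so the claim is not rejected.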
Basic Vocabulary of Statistics
Population
A population consists of all the items or individuals about
which you want to draw a conclusion.
A population is the group of all items of interest to a
statistics practitioner.
Frequently very large; sometimes infinite.
E.g. all 1.252 billion people of India, as counted in census data.
Sample
A subset of the population.
A sample is a set of data drawn from the population.
Potentially very large, but less than the population.
E.g. a sample of 765 voters exit polled on election day
[Figure: a sample shown as a subset of the population; the population is
described by a parameter and the sample by a statistic]
Variable
A variable is some characteristic of a population or sample.
E.g. student grades. Typically denoted with a capital letter: A,
B, C…
The values of the variable are the range of possible values for
a variable.
E.g. student marks (0..100)
Data
Data are the observed values of a variable.
Data are the different values associated with a variable.
E.g. student marks: {67, 74, 71, 83, 93, 55, 48}
Example
NSB dean is interested in learning about the average age of PGDM (E)
students. Identify the basic terms in this situation.
The population is the age of all PGDM (E) students at the Institute.
A sample is any subset of that population. For example, we might
select 10 PGDM (E) students and determine their age.
The variable is the “age” of each PGDM (E) student.
The data would be the set of values in the sample.
The parameter of interest is the “average” age of all PGDM (E)
students at the Institute.
The statistic is the “average” age for all PGDM (E) students in the
sample.
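The distinction between parameter and statistic in this example can be sketched as follows. The ages are randomly generated, hypothetical values (the real student ages are not given), and the population size of 200 is an assumption.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 200 students (made-up values).
population_ages = [random.randint(24, 45) for _ in range(200)]

# Parameter: computed from every member of the population.
parameter = statistics.mean(population_ages)

# Sample: 10 students drawn without replacement, as in the example.
sample_ages = random.sample(population_ages, 10)

# Statistic: the same summary, computed from the sample only.
statistic = statistics.mean(sample_ages)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two printed values usually differ slightly, which is why a statistic is only an estimate of the parameter.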
Why Collect Data?
• A marketing research analyst needs to assess the
effectiveness of a new television advertisement.
• A pharmaceutical manufacturer needs to determine whether
a new drug is more effective than those currently in use.
• An operations manager wants to monitor a manufacturing
process to find out whether the quality of product being
manufactured is conforming to company standards.
• A power company collects data to predict electricity prices
and optimize operations.
Sources of Data
Primary Sources:
The data collector is the one using the data for analysis
Data from a political survey
Data collected from an experiment
Observed data
Secondary Sources:
The person performing data analysis is not the data collector
Analyzing census data
Examining data from print journals or data published on
the internet.
Types of Variables
Data
  Categorical
    Examples: Marital Status, Political Party, Eye Color (defined categories)
  Numerical
    Discrete
      Examples: Number of Children, Defects per hour
    Continuous
      Examples: Weight, Voltage
Levels of measurement (NOIR): Nominal, Ordinal, Interval, Ratio
Nominal scale
A nominal scale classifies data into distinct categories in
which no ranking is implied.
Interval scale
• Distances between consecutive integers are equal
– Relative magnitude of numbers is meaningful
– Differences between numbers are comparable
– Location of origin, zero, is arbitrary
Example:
The difference between 1 and 2 years of age is the
same amount as the difference between 21 and 22
years of age, or 50 and 51, or 65 and 66.
The difference between a height of 60 inches and a
height of 55 inches is the same amount of difference
as between a height of 72 inches and a height of 67 inches.
Ratio Level Data
• Highest level of measurement
– Relative magnitude of numbers is meaningful
– Differences between numbers are comparable
– Location of origin, zero, is absolute (natural)
Examples: Height, Weight, and Volume
Example: Monetary Variables, such as Profit and Loss,
Revenues, and Expenses
Example: Financial ratios, such as P/E Ratio, Inventory
Turnover, and Quick Ratio.
The Hierarchy of Levels
Nominal
Ordinal: Attributes can be ordered
Interval: Distance is meaningful
Ratio
Categorical Data
Tallying Data
Counts of sampled invoices:
                 No Errors   Errors   Total
Small Amount     170         20       190
Medium Amount    100         40       140
Large Amount      65          5        70
Total            335         65       400
Contingency Table Based on
% of Overall Total
                 No Errors   Errors   Total
Small Amount     42.50%       5.00%    47.50%
Medium Amount    25.00%      10.00%    35.00%
Large Amount     16.25%       1.25%    17.50%
Total            83.75%      16.25%   100.0%
42.50% = 170 / 400
25.00% = 100 / 400
16.25% = 65 / 400
83.75% of sampled invoices have no errors and 47.50% of sampled invoices
are for small amounts.
Contingency Table Based on
% of Row Totals
                 No Errors   Errors   Total
Small Amount     89.47%      10.53%   100.0%
Medium Amount    71.43%      28.57%   100.0%
Large Amount     92.86%       7.14%   100.0%
Total            83.75%      16.25%   100.0%
89.47% = 170 / 190
71.43% = 100 / 140
92.86% = 65 / 70
Medium invoices have a larger chance (28.57%) of having errors than small
(10.53%) or large (7.14%) invoices.
Contingency Table Based on
Percentage of Column Total
                 No Errors   Errors   Total
Small Amount     50.75%      30.77%    47.50%
Medium Amount    29.85%      61.54%    35.00%
Large Amount     19.40%       7.69%    17.50%
Total            100.0%      100.0%   100.0%
50.75% = 170 / 335
30.77% = 20 / 65
There is a 61.54% chance that invoices with errors are of medium size.
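The three kinds of percentages (overall, row, and column) differ only in the denominator. A minimal Python sketch using the invoice counts from the tables above; the variable and function names are illustrative assumptions.

```python
# Invoice counts from the contingency table above.
counts = {
    "Small":  {"No Errors": 170, "Errors": 20},
    "Medium": {"No Errors": 100, "Errors": 40},
    "Large":  {"No Errors": 65,  "Errors": 5},
}
columns = ["No Errors", "Errors"]

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


# Same numerators, three different denominators.
overall_pct = {r: {c: pct(counts[r][c], grand_total) for c in columns} for r in counts}
row_pct = {r: {c: pct(counts[r][c], row_totals[r]) for c in columns} for r in counts}
col_pct = {r: {c: pct(counts[r][c], col_totals[c]) for c in columns} for r in counts}

print(overall_pct["Small"]["No Errors"])  # 42.5
print(row_pct["Medium"]["Errors"])        # 28.57
print(col_pct["Medium"]["Errors"])        # 61.54
```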
Tables Used For Organizing
Numerical Data
Numerical Data: Ordered Array, Frequency Distributions, Cumulative Distributions
Organizing Numerical Data:
Ordered Array
An ordered array is a sequence of data, in rank order,
from the smallest value to the largest value.
Age of Surveyed College Students
Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42
Night Students
18 18 19 19 20 21
23 28 32 33 41 45
Organizing Numerical Data:
Frequency Distribution
The frequency distribution is a summary table in which the
data are arranged into numerically ordered class groupings.
You must give attention to selecting the appropriate number
of class groupings for the table, determining a suitable width
of a class grouping, and establishing the boundaries of each
class grouping to avoid overlapping.
The number of classes depends on the number of values in
the data. With a larger number of values, typically there are
more classes. In general, a frequency distribution should
have at least 5 but no more than 15 classes.
To determine the width of a class interval, you divide the
range (Highest value–Lowest value) of the data by the
number of class groupings desired.
Organizing Numerical Data:
Frequency Distribution Example
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41,
43, 44, 27, 53, 27
STEPS
1. Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43,
44, 46, 53, 58
2. Find range: 58 - 12 = 46
3. Select number of classes: 5 (usually between 5 and 15)
4. Compute class interval (width): 10 (46/5 then round up)
5. Determine class boundaries (limits):
1. Class 1: 10 to less than 20
2. Class 2: 20 to less than 30
3. Class 3: 30 to less than 40
4. Class 4: 40 to less than 50
5. Class 5: 50 to less than 60
6. Compute class midpoints: 15, 25, 35, 45, 55
7. Count observations & assign to classes
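The steps above can be sketched in Python. This is an illustration under two assumptions not stated in the steps: the width is rounded up to the next whole number, and the first lower boundary is the largest multiple of the width that does not exceed the smallest value.

```python
import math

raw = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41,
       43, 44, 27, 53, 27]

data = sorted(raw)                               # step 1: ordered array
data_range = data[-1] - data[0]                  # step 2: 58 - 12 = 46
num_classes = 5                                  # step 3: chosen by the analyst
width = math.ceil(data_range / num_classes)      # step 4: 9.2 rounded up to 10

# Step 5: class boundaries, starting at a multiple of the width (assumption).
start = (data[0] // width) * width
boundaries = [(start + i * width, start + (i + 1) * width)
              for i in range(num_classes)]

# Step 6: class midpoints.
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]

# Step 7: count observations with lo <= x < hi ("lo to less than hi").
frequencies = [sum(lo <= x < hi for x in data) for lo, hi in boundaries]

for (lo, hi), mid, freq in zip(boundaries, midpoints, frequencies):
    print(f"{lo} to less than {hi}: midpoint {mid}, frequency {freq}")
```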
Organizing Numerical Data:
Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                 Frequency   Relative Frequency   Percentage
10 but less than 20   3           .15                  15
20 but less than 30   6           .30                  30
30 but less than 40   5           .25                  25
40 but less than 50   4           .20                  20
50 but less than 60   2           .10                  10
Total                 20          1.00                 100
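The relative frequency and percentage columns follow directly from the class frequencies: each frequency is divided by the total number of observations. A short sketch with illustrative variable names:

```python
# Class frequencies from the table above.
frequencies = {
    "10 but less than 20": 3,
    "20 but less than 30": 6,
    "30 but less than 40": 5,
    "40 but less than 50": 4,
    "50 but less than 60": 2,
}

total = sum(frequencies.values())

# Relative frequency = class frequency / total number of observations.
relative_frequency = {cls: f / total for cls, f in frequencies.items()}

# Percentage = relative frequency expressed out of 100.
percentage = {cls: 100 * f / total for cls, f in frequencies.items()}

for cls in frequencies:
    print(f"{cls}: {relative_frequency[cls]:.2f}  {percentage[cls]:.0f}%")
```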
Organizing Numerical Data:
Cumulative Frequency
Distribution
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
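Class                 Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20   3           15           3                      15
20 but less than 30   6           30           9                      45
30 but less than 40   5           25           14                     70
40 but less than 50   4           20           18                     90
50 but less than 60   2           10           20                     100
Total                 20          100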
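Cumulative columns are running totals of the class frequencies. A minimal sketch using `itertools.accumulate` from the Python standard library, with the frequencies from the frequency distribution example:

```python
from itertools import accumulate

# Class frequencies for 10-20, 20-30, 30-40, 40-50, 50-60.
frequencies = [3, 6, 5, 4, 2]

# Running total of observations up to and including each class.
cumulative_frequency = list(accumulate(frequencies))

total = cumulative_frequency[-1]

# Running total expressed as a percentage of all observations.
cumulative_percentage = [100 * cf / total for cf in cumulative_frequency]

print(cumulative_frequency)   # [3, 9, 14, 18, 20]
print(cumulative_percentage)  # [15.0, 45.0, 70.0, 90.0, 100.0]
```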
Visualizing Data
[Figure: pie chart; surviving labels: Vacation, 38%, Catching up on work, Other]
Organizing Categorical Data:
Pareto Diagram
[Figure: Pareto diagram for the categories Stocks, Bonds, Savings, CD; bars
(bar graph) on the left axis, cumulative % invested (line graph) on the right
axis]
Graphs for numerical data: Stem-and-Leaf Display, Histogram, Polygon, Ogive
Organizing Numerical Data:
Stem and Leaf Display
A stem-and-leaf display organizes data into groups
(called stems) so that the values within each group
(the leaves) branch out to the right on each row.
Age of College Students
Day Students Night Students
Stem Leaf Stem Leaf
1 67788899 1 8899
2 0012257 2 0138
3 28 3 23
4 2 4 15
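The display above can be reproduced with a short Python sketch. It assumes two-digit values, so the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by tens digit (stem); units digits become the leaves."""
    rows = defaultdict(str)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem] += str(leaf)
    return dict(rows)


# Ages of surveyed college students from the ordered array example.
day = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22,
       22, 25, 27, 32, 38, 42]
night = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]

day_display = stem_and_leaf(day)
night_display = stem_and_leaf(night)

for label, display in (("Day Students", day_display),
                       ("Night Students", night_display)):
    print(label)
    for stem, leaves in display.items():
        print(f"{stem} | {leaves}")
```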
Visualizing Numerical Data:
The Histogram
A graph of the data in a frequency distribution is
called a histogram.
In a histogram there are no gaps between adjacent
bars.
The class boundaries (or class midpoints) are shown
on the horizontal axis.
The vertical axis is either frequency, relative
frequency, or percentage.
Bars of the appropriate heights are used to represent
the number of observations within each class.
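As a rough illustration, the frequency distribution from the earlier example can be drawn as a text histogram. This sketch avoids any plotting library, so the bars run horizontally and bar length stands in for bar height; the function name `text_histogram` is an assumption.

```python
# Classes and frequencies from the frequency distribution example.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 6, 5, 4, 2]


def text_histogram(classes, frequencies, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    lines = []
    for (lo, hi), freq in zip(classes, frequencies):
        lines.append(f"{lo:>2} to <{hi:<2} | {symbol * freq} ({freq})")
    return lines


lines = text_histogram(classes, frequencies)
print("\n".join(lines))
```

Because the classes share boundaries (20 ends one class and starts the next), adjacent bars touch, which is why a histogram has no gaps.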
Visualizing Numerical Data:
The Histogram
[Figure: histogram of the frequency distribution; vertical axis shows
Frequency, horizontal axis is labeled 5, 15, 25, 35, 45, 55, More]
Visualizing Numerical Data:
The Polygon
[Figure: frequency polygon of the frequency distribution; vertical axis shows
Frequency, horizontal axis is labeled 5, 15, 25, 35, 45, 55, More]
(In a percentage polygon the vertical axis would be defined to show the
percentage of observations per class.)
Organizing Numerical Data:
The Cumulative Percentage Polygon
Class     Lower Boundary   % Less Than Lower Boundary
10<20     10               0
20<30     20               15
30<40     30               45
40<50     40               70
50<60     50               90
          60               100
[Figure: ogive titled "Ogive: Daily High Temperature"; vertical axis shows
Cumulative Percentage from 0 to 100, horizontal axis runs from 10 to 60]
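The points of the cumulative percentage polygon can be computed directly from the ordered data. This sketch uses `bisect_left` from the standard library, which returns how many sorted values lie strictly below a boundary; the variable names are illustrative.

```python
from bisect import bisect_left

# Data from the frequency distribution example, sorted.
data = sorted([24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41,
               43, 44, 27, 53, 27])

boundaries = [10, 20, 30, 40, 50, 60]

# For each boundary, the percentage of observations strictly less than it.
ogive_points = [(b, 100 * bisect_left(data, b) / len(data))
                for b in boundaries]

for boundary, pct_below in ogive_points:
    print(f"{boundary}: {pct_below:.0f}% less than lower boundary")
```

Plotting these (boundary, percentage) pairs and joining them with straight lines gives the ogive.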
Scatter Plots
[Figures: a scatter plot (vertical axis 316 to 328, horizontal axis 0 to 6),
charts by class year (FR, SO, JR, SR), and charts by quarter (Q1 to Q4)]
Class Exercise 1
The owner of a restaurant wanted to study the demand for
dessert. He decided that, in addition to studying whether dessert
was ordered, he would also study the gender of the individual. Data
were collected from 600 customers and organized in the following
contingency table.
                     Gender
Dessert Ordered   Male   Female   Total
Yes               40     96       136
No                240    224      464
Total             280    320      600
a. Construct contingency tables based on row, column, and total percentages.
b. Which type of percentage (row, column, or total) do you think is most
informative for each gender?
c. What conclusions concerning the pattern of dessert ordering can the
restaurant owner reach?
Class Exercise 2
The following table represents estimated green power sales
by renewable energy source in 2008.
Source                      Percentage
Geothermal                  2.8
Hydro                       11.3
Landfill gas and biomass    28.1
Solar                       0.2
Unreported                  2.5
Wind                        55.1
a. Construct a bar chart, a pie chart, and a Pareto chart.
b. What conclusions can you reach about the sources of green
power?
Source: National Renewable Energy Laboratory, 2008
Class Exercise 3
Calculate the following:
a. Divide the data into classes
b. Absolute frequency
c. Relative frequency
d. Percentages
e. Cumulative frequency
f. Cumulative percentage
g. Midpoints
h. Draw the histogram and the relative frequency
polygon