1 Introduction To Statistics


Introduction to Statistics

Learning Objectives
In this chapter, you will learn:

• What is Statistics
• Why Statistics
• Basic vocabulary used in Statistics
• How statistics is used in Business
• The sources of data and its types used in Business
• Types of Variables
• Level of Measurement
• Tabular and Graphical Presentation of Data
What is Statistics?

The science of collecting, describing, and interpreting data.

“Statistics is a way to get information from data”


Data: Facts, especially numerical facts, collected together for reference or information.
Information: Knowledge communicated concerning some particular fact.

Statistics is a tool for creating new understanding from a set of numbers.

What is statistics?
• The word “statistics” is used in 3 main ways:
– Common meaning: factual information involving numbers. A better word for this is data.
– Precise meaning: quantities which have been derived from sample data, e.g. the mean (or average) of a data set.
– Common meaning: an academic subject which involves reasoning about statistical quantities.
Why Study Statistics?
Decision Makers Use Statistics To:

• Present and describe business data and information

• Draw conclusions about large populations, using
information collected from samples
• Make reliable forecasts about a business activity
• Improve business processes
Types of Statistics
Descriptive Statistics
• Descriptive Statistics is that branch of Statistics
that summarizes, presents and analyzes the great
bodies of statistical data for describing their
salient features.
• If a business analyst uses data gathered on a group to describe or reach conclusions about that same group, the statistics are called descriptive.
• Descriptive statistics includes methods of organizing, summarizing, analyzing, and presenting data in an informative way.
Descriptive Statistics

• Collect data
– e.g. Survey
• Present data
– e.g. Tables and graphs
• Characterize data
– e.g. Sample mean: $\bar{X} = \frac{\sum X_i}{n}$

In short: Collect, Organize, Summarize, Display, Analyze.
Inferential Statistics

• Another facet of statistics is inferential statistics, also called statistical inference and inductive statistics.
• If a researcher gathers data from a sample and uses the statistics generated to reach conclusions about the population from which the sample was taken, the statistics are inferential.
• Statistical inference is that branch of Statistics that deals with drawing valid inferences about the population parameters on the basis of sample data, along with an associated degree of their reliability.
Inferential Statistics

• Estimation
– e.g. Estimate the population mean weight using the sample mean weight
• Hypothesis testing
– e.g. Test the claim that the population mean weight is 120 pounds

Inferential statistics is used to:
• Predict and forecast values of population parameters
• Test hypotheses about values of population parameters
• Make decisions

Drawing conclusions and/or making decisions concerning a population
based on sample results.
Basic Vocabulary of Statistics
A population consists of all the items or individuals about which you want to draw a conclusion.
• A population is the group of all items of interest to a statistics practitioner.
• Frequently very large; sometimes infinite.
E.g. all 1.252 billion people of India, i.e. census data.
A sample is a subset of the population.
• A sample is a set of data drawn from the population.
• Potentially very large, but smaller than the population.
E.g. a sample of 765 voters exit-polled on election day.
Basic Vocabulary of Statistics

Parameter: Measures used to describe the population are called parameters.
Statistic: Measures computed from sample data are called statistics.
Basic Vocabulary of Statistics

• A variable is some characteristic of a population or sample.
E.g. student grades. Typically denoted with a capital letter: A, B, C…
• The values of the variable are the range of possible values for a variable.
E.g. student marks (0..100)
• Data are the observed values of a variable.
• Data are the different values associated with a variable.
E.g. student marks: {67, 74, 71, 83, 93, 55, 48}
The NSB dean is interested in learning about the average age of PGDM (E) students. Identify the basic terms in this situation.

The population is the age of all PGDM (E) students at the Institute.
A sample is any subset of that population. For example, we might select 10 PGDM (E) students and determine their age.
The variable is the “age” of each PGDM (E) student.
The data would be the set of values in the sample.
The parameter of interest is the “average” age of all PGDM (E) students at the Institute.
The statistic is the “average” age for all PGDM (E) students in the sample.
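The difference between a parameter and a statistic can be sketched in a few lines of Python. The ages below are hypothetical (the slides give no actual ages) and the fixed random seed is only there to make the sketch repeatable; the sample size of 10 follows the example above.

```python
import random
from statistics import mean

# Hypothetical ages of all students in the programme (the population).
# These numbers are invented for illustration only.
population_ages = [24, 27, 31, 29, 35, 26, 28, 33, 30, 25,
                   38, 27, 29, 32, 26, 41, 28, 30, 34, 27]

# Parameter: a measure computed from the whole population.
parameter = mean(population_ages)

# Sample: a subset of the population (10 students drawn at random).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_ages = rng.sample(population_ages, 10)

# Statistic: the same measure computed from the sample only.
statistic = mean(sample_ages)

print(f"Population mean age (parameter): {parameter:.2f}")
print(f"Sample mean age (statistic):     {statistic:.2f}")
```

The statistic will usually differ somewhat from the parameter, which is why inferential methods attach a measure of uncertainty to sample results.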
Why Collect Data?
• A marketing research analyst needs to assess the
effectiveness of a new television advertisement.
• A pharmaceutical manufacturer needs to determine whether
a new drug is more effective than those currently in use.
• An operations manager wants to monitor a manufacturing
process to find out whether the quality of product being
manufactured is conforming to company standards.
• A power company collects data to predict electricity prices and optimize operations.
Sources of Data
• Primary Sources: The data collector is the one using the data for analysis
– Data from a political survey
– Data collected from an experiment
– Observed data
• Secondary Sources: The person performing data analysis is not the data collector
– Analyzing census data
– Examining data from print journals or data published on the internet
Types of Variables

• Categorical (defined categories)
– Examples: Marital Status, Political Party, Eye Color
• Numerical
– Discrete (counted items). Examples: Number of Children, Defects per hour
– Continuous (measured characteristics). Examples: Weight, Voltage

Types of Variables
• Qualitative variables have values that can only be placed into categories, such as “yes” and “no.”
• A qualitative variable categorizes or describes an element of a population.
Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable.
• Quantitative variables have values that represent quantities.
• A quantitative variable quantifies an element of a population.
Note: Arithmetic operations, such as addition and averaging, are meaningful for data resulting from a quantitative variable.
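A small Python sketch can make the note about arithmetic concrete. The responses below are invented for illustration; the point is only that qualitative data are summarized by counting categories, while quantitative data can be averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses, invented for illustration.
marital_status = ["Married", "Single", "Married",
                  "Divorced", "Single", "Married"]   # qualitative
number_of_children = [2, 0, 3, 1, 0, 2]              # quantitative

# Qualitative data: count how often each category occurs.
status_counts = Counter(marital_status)

# Quantitative data: arithmetic such as averaging is meaningful.
average_children = mean(number_of_children)

print(status_counts.most_common())
print(round(average_children, 2))
```

Averaging the marital status values would have no meaning, even if the categories were coded as numbers.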
Identify each of the following examples as attribute (qualitative) or numerical (quantitative) variables.

• The amount of CNG pumped by the next 10 customers at the local HP pump. (Numerical)
• The amount of radon in the basement of each of 25 homes in a new development. (Numerical)
• The color of the baseball cap worn by each of 20 students. (Attribute)
• The length of time to complete a mathematics homework assignment. (Numerical)
• The state in which each truck is registered when stopped and inspected at a weigh station. (Attribute)
Identify each of the following as examples of qualitative or
quantitative variables:
The temperature in Barrow, Alaska at 12:00 pm on any
given day.
The make of automobile driven by each faculty member.
Whether or not a 6 volt lantern battery is defective.
The weight of a lead pencil.
The length of time billed for a long distance telephone call.
The brand of cereal children eat for breakfast.
The type of book taken out of the library by an adult.
Level of Measurement

NOIR: Nominal, Ordinal, Interval, Ratio
Nominal scale
A nominal scale classifies data into distinct categories in which no ranking is implied.

Categorical Variables     Categories
Personal Computer         Yes / No
Type of Stocks            Growth, Value, Other
Internet Provider         Microsoft Network /

Ordinal scale
An ordinal scale classifies data into distinct categories in which ranking is implied.

Categorical Variable              Ordered Categories
Student class designation         Freshman, Junior, Senior
Product satisfaction              Satisfied, Neutral, Unsatisfied
Faculty rank                      Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings    AAA, AA, A, BBB, BB, B, CCC, CC,
Student Grades                    A, B, C, D, F
Example of Ordinal Measurement
[Figure: finishing positions (1, 2, 3, 4, 6) shown at a finish line]
Interval scale
• Distances between consecutive integers are equal
– Relative magnitude of numbers is meaningful
– Differences between numbers are comparable
– Location of origin, zero, is arbitrary
Examples:
• The difference between 1 and 2 years of age is the same amount as the difference between 21 and 22 years of age, or 50 and 51, or 65 and 66.
• The difference between a height of 60 inches and a height of 55 inches is the same amount as the difference between a height of 72 inches and a height of 67 inches.
Ratio Level Data
• Highest level of measurement
– Relative magnitude of numbers is meaningful
– Differences between numbers are comparable
– Location of origin, zero, is absolute (natural)
Examples: Height, Weight, and Volume
Example: Monetary Variables, such as Profit and Loss,
Revenues, and Expenses
Example: Financial ratios, such as P/E Ratio, Inventory
Turnover, and Quick Ratio.
The Hierarchy of Levels

Ratio Absolute zero

Interval Distance is meaningful

Ordinal Attributes can be ordered

Nominal Attributes are only named; weakest

Level of Measurement: Statistical Tests
Identify each of the following as examples of (1) nominal, (2) ordinal, (3) discrete, or (4) continuous variables:
• The length of time until a pain reliever begins to work.
• The number of chocolate chips in a cookie.
• The number of colors used in a statistics textbook.
• The brand of refrigerator in a home.
• The overall satisfaction rating of a new car.
• The number of files on a computer’s hard disk.
• The pH level of the water in a swimming pool.
• The number of staples in a stapler.
Class Exercise
Q 1: Determine whether each variable is categorical or numerical. If numerical, determine whether the variable is discrete or continuous. Determine the level of measurement.
Amount of money spent on clothing in past
Favorite department store?
Most likely time period during which shopping for
clothing takes place?
Number of pairs of shoes owned?
Class Exercise
Q 2: A manufacturer of dog food was planning to survey households in India to determine the purchasing habits of dog owners. Among the variables to be collected are:

The primary place of purchase of dog food?
Whether dry or moist food can be purchased?
Number of dogs living in the household?
Whether the dog is pedigreed?
Class Exercise
Q 3: Suppose the following information is collected from Mr X on his application for a home loan at the HDFC Bank loan department:
a. Monthly payment: Rs 25100
b. Annual family income:
c. Marital status: Married
d. Number of jobs changed in past 10 years: 2

Classify each of the responses by type of data and level of measurement.

Organizing and Visualizing
Categorical and Numerical Data
Categorical Data Are Organized By Utilizing Tables

Categorical Data: Tallying Data
• One Categorical Variable: Summary Table
• Two Categorical Variables: Contingency Table

Organizing Categorical Data:
Summary Table
A summary table indicates the frequency, amount, or
percentage of items in a set of categories so that you can
see differences between categories.

How do you spend the holidays? Percent

At home with family 45%
Travel to visit family 38%
Vacation 5%
Catching up on work 5%
Other 7%
Contingency Table

• Used to study patterns that may exist between the responses of two or more categorical variables
• Cross tabulates or tallies jointly the responses of the categorical variables
• For two variables, the tallies for one variable are located in the rows and the tallies for the second variable are located in the columns
Contingency Table - Example

• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.

Contingency Table Showing Frequency of Invoices Categorized By Size and The Presence Of Errors

                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
Contingency Table Based on % of Overall Total

Each count is divided by the overall total of 400, e.g. 42.50% = 170 / 400, 25.00% = 100 / 400, 16.25% = 65 / 400.

                 No Errors   Errors   Total
Small Amount       42.50%     5.00%   47.50%
Medium Amount      25.00%    10.00%   35.00%
Large Amount       16.25%     1.25%   17.50%
Total              83.75%    16.25%   100.0%

83.75% of sampled invoices have no errors and 47.50% of sampled invoices are for small amounts.
Contingency Table Based on % of Row Totals

Each count is divided by its row total, e.g. 89.47% = 170 / 190, 71.43% = 100 / 140, 92.86% = 65 / 70.

                 No Errors   Errors   Total
Small Amount       89.47%    10.53%   100.0%
Medium Amount      71.43%    28.57%   100.0%
Large Amount       92.86%     7.14%   100.0%
Total              83.75%    16.25%   100.0%

Medium invoices have a larger chance (28.57%) of having errors than small (10.53%) or large (7.14%) invoices.
Contingency Table Based on Percentage of Column Total

Each count is divided by its column total, e.g. 50.75% = 170 / 335, 30.77% = 20 / 65.

                 No Errors   Errors   Total
Small Amount       50.75%    30.77%   47.50%
Medium Amount      29.85%    61.54%   35.00%
Large Amount       19.40%     7.69%   17.50%
Total              100.0%    100.0%   100.0%

There is a 61.54% chance that invoices with errors are of medium size.
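The three percentage views of the invoice table can be reproduced with a short Python sketch. It uses only the counts from the example; the function and variable names are illustrative choices, not part of the original slides.

```python
# Counts from the invoice example: rows are invoice sizes,
# columns are (no errors, errors).
counts = {
    "Small": (170, 20),
    "Medium": (100, 40),
    "Large": (65, 5),
}

grand_total = sum(sum(row) for row in counts.values())        # 400
column_totals = [sum(col) for col in zip(*counts.values())]   # [335, 65]


def pct(part, whole):
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Percentage of overall total: divide every cell by 400.
overall = {size: [pct(c, grand_total) for c in row]
           for size, row in counts.items()}
# Percentage of row total: divide every cell by its row sum.
by_row = {size: [pct(c, sum(row)) for c in row]
          for size, row in counts.items()}
# Percentage of column total: divide every cell by its column sum.
by_column = {size: [pct(c, t) for c, t in zip(row, column_totals)]
             for size, row in counts.items()}

for size in counts:
    print(size, overall[size], by_row[size], by_column[size])
```

Which view to use depends on the question: row percentages compare error rates across invoice sizes, while column percentages describe the size mix among invoices with and without errors.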
Tables Used For Organizing Numerical Data

Numerical Data
• Ordered Array
• Frequency Distributions
• Cumulative Distributions
Organizing Numerical Data:
Ordered Array
An ordered array is a sequence of data, in rank order,
from the smallest value to the largest value.

Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Students
18 18 19 19 20 21
23 28 32 33 41 45
Organizing Numerical Data:
Frequency Distribution
• The frequency distribution is a summary table in which the data are arranged into numerically ordered class groupings.
• You must give attention to selecting the appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping.
• The number of classes depends on the number of values in the data. With a larger number of values, typically there are more classes. In general, a frequency distribution should have at least 5 but no more than 15 classes.
• To determine the width of a class interval, you divide the range (Highest value – Lowest value) of the data by the number of class groupings desired.
Organizing Numerical Data:
Frequency Distribution Example

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
1. Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43,
44, 46, 53, 58
2. Find range: 58 - 12 = 46
3. Select number of classes: 5 (usually between 5 and 15)
4. Compute class interval (width): 10 (46/5 then round up)
5. Determine class boundaries (limits):
– Class 1: 10 to less than 20
– Class 2: 20 to less than 30
– Class 3: 30 to less than 40
– Class 4: 40 to less than 50
– Class 5: 50 to less than 60
6. Compute class midpoints: 15, 25, 35, 45, 55
7. Count observations & assign to classes
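The seven steps can be sketched in Python as follows. The data are the 20 temperatures from the example; starting the first class at a multiple of the class width (here 10) is an assumption made to match the class boundaries on the slide, not a general rule.

```python
import math

temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

ordered = sorted(temperatures)                 # step 1: ordered array
data_range = ordered[-1] - ordered[0]          # step 2: 58 - 12 = 46
num_classes = 5                                # step 3
# Step 4: 46 / 5 = 9.2, rounded up to 10.
width = math.ceil(data_range / num_classes)
# Step 5: assume the first class starts at a multiple of the width
# at or below the minimum value (10 here).
start = (ordered[0] // width) * width
lower_bounds = [start + i * width for i in range(num_classes)]
# Step 6: the midpoint is halfway between the class limits.
midpoints = [lb + width / 2 for lb in lower_bounds]
# Step 7: a value x belongs to the class with lower <= x < lower + width.
frequencies = [sum(lb <= x < lb + width for x in ordered)
               for lb in lower_bounds]

for lb, mid, f in zip(lower_bounds, midpoints, frequencies):
    print(f"{lb} but less than {lb + width}: midpoint {mid:g}, frequency {f}")
```

Using "lower <= x < upper" for every class guarantees that the classes do not overlap and that each observation is counted exactly once.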
Organizing Numerical Data:
Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Midpoint   Frequency
10 but less than 20       15          3
20 but less than 30       25          6
30 but less than 40       35          5
40 but less than 50       45          4
50 but less than 60       55          2
Total                                20
Organizing Numerical Data:
Relative & Percent Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20        3             .15                15
20 but less than 30        6             .30                30
30 but less than 40        5             .25                25
40 but less than 50        4             .20                20
50 but less than 60        2             .10                10
Total                     20            1.00               100
Organizing Numerical Data:
Cumulative Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20        3          15%                3                     15%
20 but less than 30        6          30%                9                     45%
30 but less than 40        5          25%               14                     70%
40 but less than 50        4          20%               18                     90%
50 but less than 60        2          10%               20                    100%
Total                     20         100%               20                    100%
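The relative, percentage, and cumulative columns all follow from the class frequencies. A minimal Python sketch, using the frequencies from the temperature example (variable names are illustrative):

```python
from itertools import accumulate

classes = ["10 but less than 20", "20 but less than 30",
           "30 but less than 40", "40 but less than 50",
           "50 but less than 60"]
frequencies = [3, 6, 5, 4, 2]

total = sum(frequencies)                               # 20 observations
relative = [f / total for f in frequencies]            # proportions
percentage = [100 * f / total for f in frequencies]    # proportions x 100
cumulative_frequency = list(accumulate(frequencies))   # running total
cumulative_percentage = [100 * c / total for c in cumulative_frequency]

for name, f, r, p, cf, cp in zip(classes, frequencies, relative, percentage,
                                 cumulative_frequency, cumulative_percentage):
    print(f"{name}: {f}  {r:.2f}  {p:.0f}%  {cf}  {cp:.0f}%")
```

The last cumulative frequency always equals the number of observations, and the last cumulative percentage is always 100%, which is a quick check on the arithmetic.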
Why Use a Frequency Distribution?

• It condenses the raw data into a more useful form
• It allows for a quick visual interpretation of the data
• It enables the determination of the major characteristics of the data set, including where the data are concentrated / clustered
Frequency Distributions:
Some Tips
• Different class boundaries may provide different pictures for the same data (especially for smaller data sets)
• Shifts in data concentration may show up when different class boundaries are chosen
• As the size of the data set increases, the impact of alterations in the selection of class boundaries is greatly reduced
• When comparing two or more groups with different sample sizes, you must use either a relative frequency or a percentage distribution
Visualizing Categorical Data Through Graphical Displays

Categorical Data: Visualizing Data
• Summary Table For One Variable: Bar Chart, Pareto Chart, Pie Chart
• Contingency Table For Two Variables: Side By Side Bar Chart
Organizing Categorical Data:
Bar Chart
In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category.
[Bar chart: How Do You Spend the Holidays? At home with family 45%, Travel to visit family 38%, Vacation 5%, Catching up on work 5%, Other 7%; horizontal axis from 0% to 50%]

Organizing Categorical Data:
Pie Chart
The pie chart is a circle broken up into slices that
represent categories. The size of each slice of the pie
varies according to the percentage in each category.
[Pie chart: How Do You Spend the Holidays? Slices: At home with family 45%, Travel to visit family 38%, Vacation 5%, Catching up on work 5%, Other 7%]
Organizing Categorical Data:
Pareto Diagram

• Used to portray categorical data
• A bar chart, where categories are shown in descending order of frequency
• A cumulative polygon is shown in the same graph
• Used to separate the “vital few” from the “trivial many”
Organizing Categorical Data:
Pareto Diagram

[Pareto diagram: Current Investment Portfolio. Left vertical axis: % invested in each category (bar graph), 0% to 45%. Right vertical axis: cumulative % invested (line graph), 0% to 100%. Categories: Stocks, Bonds, Savings, CD]
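The two ingredients of a Pareto diagram, bars in descending order and a cumulative line, can be computed with a short Python sketch. It reuses the holiday percentages from the summary table earlier in the chapter; sorting purely by size (so that "Other" is not forced to the end) is a simplifying assumption.

```python
from itertools import accumulate

# Percentages from the holiday summary table earlier in the chapter.
holiday = {
    "At home with family": 45,
    "Travel to visit family": 38,
    "Vacation": 5,
    "Catching up on work": 5,
    "Other": 7,
}

# Bars: categories sorted in descending order of percentage.
ordered = sorted(holiday.items(), key=lambda item: item[1], reverse=True)
# Line: running total of the sorted percentages.
cumulative = list(accumulate(value for _, value in ordered))

for (category, value), running in zip(ordered, cumulative):
    print(f"{category:<24}{value:>4}%{running:>6}%")
```

Here the first two categories already account for 83% of responses, which is the "vital few" pattern a Pareto diagram is meant to expose.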
Plot the Ogive

Interval   Frequency   Rel. frequency   % frequency   Cum. frequency   Cum. rel. frequency   Cum. %
10-19          5           0.125           12.5              5               0.125             12.5
20-29          7           0.175           17.5             12               0.300             30.0
30-39         12           0.300           30.0             24               0.600             60.0
40-49         10           0.250           25.0             34               0.850             85.0
50-59          6           0.150           15.0             40               1.000            100.0
Visualizing Categorical Data:
Side By Side Bar Charts
The side by side bar chart represents the data from a contingency table.

Percentage of column total:

                 No Errors   Errors   Total
Small Amount       50.75%    30.77%   47.50%
Medium Amount      29.85%    61.54%   35.00%
Large Amount       19.40%     7.69%   17.50%
Total              100.0%    100.0%   100.0%

[Side by side bar chart: Invoice Size Split Out By Errors & No Errors. Bars for Large, Medium, and Small invoices in two series (Errors, No Errors); horizontal axis from 0.0% to 70.0%]

Invoices with errors are much more likely to be of medium size (61.54% vs 30.77% and 7.69%).
Visualizing Numerical Data By
Using Graphical Displays
Numerical Data
• Ordered Array: Stem-and-Leaf Display
• Frequency Distributions and Cumulative Distributions: Histogram, Polygon, Ogive
Organizing Numerical Data:
Stem and Leaf Display
• A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.

Age of Surveyed College Students
Day Students:   16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students: 18 18 19 19 20 21 23 28 32 33 41 45

Day Students          Night Students
Stem  Leaf            Stem  Leaf
1     67788899        1     8899
2     0012257         2     0138
3     28              3     23
4     2               4     15
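A stem-and-leaf display for two-digit values can be built by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below uses the student ages from the slides; the function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22,
                22, 25, 27, 32, 38, 42]
night_students = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); leaves are units digits."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}


day_display = stem_and_leaf(day_students)
night_display = stem_and_leaf(night_students)

for stem, leaves in day_display.items():
    print(f"{stem} | {leaves}")
```

Unlike a frequency distribution, the display keeps every original value visible while still showing the shape of the data.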
Visualizing Numerical Data:
The Histogram
• A graph of the data in a frequency distribution is called a histogram.
• In a histogram there are no gaps between adjacent bars.
• The class boundaries (or class midpoints) are shown on the horizontal axis.
• The vertical axis is either frequency, relative frequency, or percentage.
• Bars of the appropriate heights are used to represent the number of observations within each class.
Visualizing Numerical Data:
The Histogram

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20        3             .15                15
20 but less than 30        6             .30                30
30 but less than 40        5             .25                25
40 but less than 50        4             .20                20
50 but less than 60        2             .10                10
Total                     20            1.00               100

[Histogram: Daily High Temperature. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More; vertical axis: frequency]
Visualizing Numerical Data:
The Polygon

• A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
• The cumulative percentage polygon, or ogive, displays the variable of interest along the X axis, and the cumulative percentages along the Y axis.
• Useful when there are two or more groups to compare.
Visualizing Numerical Data:
The Frequency Polygon

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20        3             .15                15
20 but less than 30        6             .30                30
30 but less than 40        5             .25                25
40 but less than 50        4             .20                20
50 but less than 60        2             .10                10
Total                     20            1.00               100

[Frequency polygon: Daily High Temperature. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More; vertical axis: frequency]

(In a percentage polygon the vertical axis would be defined to show the percentage of observations per class.)
Organizing Numerical Data:
The Cumulative Percentage Polygon
Class    Lower Boundary   % Less Than Lower Boundary
10<20         10                    0
20<30         20                   15
30<40         30                   45
40<50         40                   70
50<60         50                   90
              60                  100

[Ogive: Daily High Temperature. Horizontal axis: 10, 20, 30, 40, 50, 60; vertical axis: Cumulative Percentage]
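The points plotted on the ogive pair each class boundary with the percentage of observations lying below it. A minimal Python sketch using the temperature frequencies (the name `ogive_points` is illustrative):

```python
from itertools import accumulate

lower_boundaries = [10, 20, 30, 40, 50, 60]   # 60 closes the last class
frequencies = [3, 6, 5, 4, 2]                 # one frequency per class
total = sum(frequencies)

# Nothing lies below the first boundary, so the running total starts at 0.
cumulative = [0] + list(accumulate(frequencies))
ogive_points = [(boundary, 100 * count / total)
                for boundary, count in zip(lower_boundaries, cumulative)]

print(ogive_points)
```

Connecting these points with straight lines gives the cumulative percentage polygon, which always starts at 0% and ends at 100%.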
Scatter Plots

• Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables
• One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
• Scatter plots are used to examine possible relationships between two numerical variables
Scatter Plot Example

Volume per day   Cost per day
      23             125
      26             140
      29             146
      33             160
      38             167
      42             170
      50             188
      55             195
      60             200

[Scatter plot: Cost per Day vs. Production Volume. Horizontal axis: Volume per Day (20 to 70); vertical axis: Cost per Day (0 to 250)]
Time Series
• A Time Series Plot is used to study patterns in the values of a numeric variable over time.
• In a Time Series Plot, the numeric variable is measured on the vertical axis and the time period is measured on the horizontal axis.
Attendance (in millions) at USA amusement/theme parks from 2000-2005
Year Year Number Attendance
2000 0 317
2001 1 319
2002 2 324
2003 3 322
2004 4 328
2005 5 335
Time Series Example

[Time series plot: Attendance (in millions) at US Theme Parks. Horizontal axis: Year (Since 2000), 0 to 6]

Principles of Excellent Graphs

• The graph should not distort the data.
• The graph should not contain unnecessary adornments (sometimes referred to as chart junk).
• The scale on the vertical axis should begin at zero.
• All axes should be properly labeled.
• The graph should contain a title.
• The simplest possible graph should be used for a given set of data.
Graphical Errors: Chart Junk

[Figure: "Minimum Wage" shown as a bad presentation and as a good presentation. Values: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80; the good presentation uses a horizontal axis of 1960, 1970, 1980, 1990]
Graphical Errors:
No Relative Basis

[Figure: "A’s received by students" shown as a bad presentation (vertical axis: frequency, 0 to 200) and as a good presentation (vertical axis: percentage, 0% to 20%), by class. FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior]


Graphical Errors:
Compressing the Vertical Axis

[Figure: "Quarterly Sales" in $ for Q1 to Q4, shown as a bad presentation (vertical axis 0 to 200) and as a good presentation (vertical axis 0 to 50)]
Class Exercise 1
The owner of a restaurant wanted to study the demand for dessert. He decided that, in addition to studying whether dessert was ordered, he would also study the gender of the individual. Data were collected from 600 customers and organized in the following contingency table.
Dessert Ordered Male Female Total
Yes 40 96 136
No 240 224 464
Total 280 320 600
a. Construct contingency tables based on row, column, and total percentages.
b. Which type of percentage (row, column, or total) do you think is most informative for each gender?
c. What conclusions concerning the pattern of dessert ordering can the restaurant owner reach?
Class Exercise 2
The following table represents estimated green power sales by renewable energy source, 2008.

Source                       Percentage
Geothermal                       2.8
Hydro                           11.3
Landfill mass and biomass       28.1
Solar                            0.2
Unreported                       2.5
Wind                            55.1

a. Construct a bar chart, pie chart, and Pareto chart.
b. What conclusion can you reach about the sources of green power?
Source: National Renewable Energy Laboratory, 2008
Class Exercise 3
Calculate the following:
a. Divide the data into classes
b. Absolute frequency
c. Relative frequency
d. Percentages
e. Cumulative frequency
f. Cumulative percentage
g. Midpoints
h. Draw Histogram and relative frequency
