Stat Introduction Units 1 & 2


Some Basic Concepts

• Population and Sample

Population: Collection of all individuals or individual items under consideration.
Sample: A sample is a subset of the population. Samples are drawn from the population.

• Data Array: Simplest way to arrange data, in either ascending or descending order.

Introduction - Statistics & Data Analysis 1


3. Some Basic Concepts
Frequency Table – Data Array & Grouped Data
-Actual Data : 20 observations 2.0, 3.8, 4.1, 4.7, 5.5, 3.4,
4.0, 4.2, 4.8, 5.5, 3.4, 4.1, 4.3, 4.9, 5.5, 3.8, 4.1, 4.7, 4.9, 5.5
-Data Array (Ascending Order): Arrange the data as
2.0, 3.4, 3.4, 3.8, 3.8, 4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.7, 4.7, 4.8,
4.9, 4.9, 5.5, 5.5, 5.5, 5.5

Class                       Frequency
(Group of similar values)   (No. of observations in each class)
2.0 - 2.5                   1
2.6 - 3.1                   0
3.2 - 3.7                   2
3.8 - 4.3                   8
4.4 - 4.9                   5
5.0 - 5.5                   4
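
The data array and frequency table above can be reproduced with a short sketch. It assumes the six inclusive class limits shown in the table; the variable names are illustrative only.

```python
# The 20 observations listed in the document.
data = [2.0, 3.8, 4.1, 4.7, 5.5, 3.4, 4.0, 4.2, 4.8, 5.5,
        3.4, 4.1, 4.3, 4.9, 5.5, 3.8, 4.1, 4.7, 4.9, 5.5]

# Data array: the observations in ascending order.
data_array = sorted(data)

# Inclusive class limits (lower, upper), taken from the frequency table.
classes = [(2.0, 2.5), (2.6, 3.1), (3.2, 3.7),
           (3.8, 4.3), (4.4, 4.9), (5.0, 5.5)]

# Count how many observations fall inside each class.
frequency_table = {
    (lower, upper): sum(1 for x in data_array if lower <= x <= upper)
    for lower, upper in classes
}

for (lower, upper), count in frequency_table.items():
    print(f"{lower:.1f} - {upper:.1f}  {count}")
```

The counts should add up to the number of observations, which is a quick check that no value fell between two classes.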
Sample of daily production in yards of 30 carpet looms

16.2 15.8 15.8 15.8 16.3 15.6
15.7 16.0 16.2 16.1 16.8 16.0
16.4 15.2 15.9 15.9 15.9 16.8
15.4 15.7 15.9 16.0 16.3 16.0
16.4 16.6 15.6 15.6 16.9 16.3

Data array of daily production in yards of 30 carpet looms
(ascending order, read down each column)

15.2 15.7 15.9 16.0 16.2 16.4
15.4 15.7 15.9 16.0 16.3 16.6
15.6 15.8 15.9 16.0 16.3 16.8
15.6 15.8 15.9 16.1 16.3 16.8
15.6 15.8 16.0 16.2 16.4 16.9



Types of Statistics
Descriptive Statistics
• It deals with collecting, summarizing and simplifying data, which
are otherwise quite unwieldy and voluminous.
• When the population of interest is small, we can directly
describe the important aspects of the population measurements.

Inferential Statistics
• It is the science of using a sample to make generalizations about
the important aspects of a population.
• A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic.
Statistical Data
• Statistical data are the basic raw material of
statistics.
• They refer to those aspects of a problem
situation that can be measured, quantified
or counted.
Data Sources
Data sources are of two types:
• Secondary
• Primary
Secondary data: Data that already exist in some form,
published or unpublished, in an identifiable
secondary source. They are generally available from
published source(s), though not necessarily in the
form actually required.
Primary data: Data that do not already exist in
any form, and thus have to be collected for the first
time from the primary source(s). By their very nature,
these data require fresh, first-time collection
covering the whole population or a sample drawn
from it.
Types of Data
• In statistics, data are classified into two broad
categories:
Quantitative data: Data that can be quantified in
definite units of measurement.
• Discrete data
e.g. the number of customers visiting a departmental
store every day, the number of incoming flights at an
airport, the number of defective items in a consignment
received for sale.
• Continuous data
e.g. characteristics such as weight, length, height,
thickness, velocity, temperature, etc.
Types of Data
Qualitative data: Data that refer to the qualitative characteristics
of a subject or an object.
• Nominal data
They are the outcome of classifying the items or units
comprising a sample or a population into two or more
categories according to some quality characteristic.
e.g. classification of students according to gender (as males and
females), of workers according to skill (as skilled, semi-skilled
and unskilled) and of employees according to level of
education (as matriculates, undergraduates and postgraduates).
Types of Data
• Rank data
o They are the result of assigning ranks to specify order in
terms of the integers 1, 2, 3, ..., n.
o Ranks may be assigned according to the level of
performance in a test, a contest, a competition, an
interview or a show.
e.g. the candidates appearing in an interview may be
assigned ranks in integers ranging from 1 to n, depending
on their performance in the interview.
Variables
• A variable is a characteristic or condition that can
change or take on different values.
• Most research begins with a general question
about the relationship between two variables for a
specific group of individuals.
Population
• A population is the set of all elements about which
we wish to draw conclusions.

Sample
• Usually populations are so large that a researcher
cannot examine the entire group. Therefore, a
sample is selected to represent the population in a
research study. The goal is to use the results
obtained from the sample to help answer questions
about the population.
• A sample is a subset of the elements of a population.
Methods of Classification
Every item of the collected data has its own characteristics.
These characteristics can be of two types:
(i) Descriptive (e.g. honesty, beauty, etc.):
Characteristics that cannot be measured directly; they are
counted on the basis of presence or absence.
(Non-measurable characteristics or attributes)
(ii) Numerical (e.g. height, weight, profit, etc.):
Characteristics that can be measured.
Types of Classification

Statistical data can be classified in two ways:
(1) Qualitative classification
(2) Quantitative classification
Qualitative classification can be of two types:
• Dichotomy or Two-fold Classification
• Manifold Classification

Students
  Male
    Employed
    Unemployed
  Female
    Employed
    Unemployed
Quantitative Classification

Data are classified on the basis of phenomena that are
capable of quantitative measurement, such as age, height, weight,
prices, production, income, expenditure, sales, profits, etc.

The main methods of such classification are:
(i) Geographical Classification
(ii) Chronological Classification
(iii) Variable Classification
(a) Continuous Variable (b) Discrete Variable


(i) Geographical Classification: This type of classification
is based on geographical or locational differences between
the items in the data, such as states, cities, regions, zones,
etc. For example, the yield of agricultural output per hectare for
different countries in a given period may be presented
as follows:

Agricultural output of different countries (in kg per hectare)

Country       India   USA   Pakistan   Japan   China
Avg. output   125     585   140        410     330
(ii) Chronological Classification: When data are
classified with respect to different periods of time
(hour, day, week, month, year, etc.), it is known as
chronological or temporal classification. For
example, the population of India for different
decades may be presented as follows:

Population of India (in crores)

Year         1951   1961   1971   1981   1991   2001
Population   36.1   43.9   54.7   68.5   84.4   102.7
(iii) Variable Classification: When data are classified
according to the values of a variable, it is known as
variable classification. Variables are of two kinds:
(a) Discrete variable (b) Continuous variable

Classification based on discrete values

Height (cm)   No. of Students
154           8
155           10
156           6
157           2
158           12
159           12
Total         50

Classification based on continuous values

Income (Rs.)   No. of Employees
1000-1500      15
1500-2000      33
2000-2500      22
2500-3000      18
3000-3500      12
Total          100



Tabular and Graphical Methods
• Summarizing Qualitative Data
• Summarizing Quantitative Data
• Exploratory Data Analysis
• Scatter Diagrams
Summarizing Qualitative Data
• Frequency Distribution
• Relative Frequency
• Percent Frequency Distribution
• Bar Graph
• Pie Chart
Exploratory Data Analysis

• The techniques of exploratory data analysis
consist of simple arithmetic and easy-to-draw
pictures that can be used to summarize data
quickly.
• One such technique is the stem-and-leaf
display.
Stem-and-Leaf Display
• A stem-and-leaf display shows both the rank order
and shape of the distribution of the data.
• It is similar to a histogram on its side, but it has the
advantage of showing the actual data values.
• The first digits of each data item are arranged to the
left of a vertical line.
• To the right of the vertical line we record the last
digit for each item in rank order.
• Each line in the display is referred to as a stem.
• Each digit on a stem is a leaf.
8 | 5 7
9 | 3 6 7 8
Stem-and-Leaf Display
• Leaf Units
– A single digit is used to define each leaf.
– In the preceding example, the leaf unit was 1.
– Leaf units may be 100, 10, 1, 0.1, and so on.
– Where the leaf unit is not shown, it is assumed
to equal 1.
Example: Leaf Unit = 0.1
If we have data with values such as
8.6 11.7 9.4 9.1 10.2 11.0 8.8
a stem-and-leaf display of these data will be

Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
Example: Hudson Auto Repair
The manager of Hudson Auto would like to
get a better picture of the distribution of
costs for engine tune-up parts. A sample of
50 customer invoices has been taken and
the costs of parts, rounded to the nearest
dollar, are listed below.

 91  78  93  57  75  52  99  80  97  62
 71  69  72  89  66  75  79  75  72  76
104  74  62  68  97 105  77  65  80 109
 85  97  88  68  83  68  71  69  67  74
 62  82  98 101  79 105  79  69  62  73
Example: Hudson Auto Repair
• Stem-and-Leaf Display

 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
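
As a rough illustration of how such a display is built, the sketch below groups each value by stem and lists its leaves in rank order. The function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical helpers for this example, and it assumes non-negative values.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaf digits in rank order]} for non-negative values."""
    display = defaultdict(list)
    for value in sorted(values):
        scaled = round(value / leaf_unit)  # express the value in leaf units
        stem, leaf = divmod(scaled, 10)    # last digit is the leaf
        display[stem].append(leaf)
    return dict(display)

# Hudson Auto Repair: cost of parts for 50 tune-ups (from the document).
costs = [91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
         71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
         104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
         85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
         62, 82, 98, 101, 79, 105, 79, 69, 62, 73]

hudson = stem_and_leaf(costs)
for stem, leaves in hudson.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")

# The earlier example with Leaf Unit = 0.1.
small = stem_and_leaf([8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8], leaf_unit=0.1)
```

Because the leaves keep the last digit of every value, the original data can be read back from the display, which a histogram does not allow.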
Scatter Diagram

• A scatter diagram is a graphical presentation of
the relationship between two quantitative
variables.
• One variable is shown on the horizontal axis
and the other variable is shown on the vertical
axis.
• The general pattern of the plotted points
suggests the overall relationship between the
variables.
Example: Panthers Football Team
• Scatter Diagram
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.

x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             27
Example: Panthers Football Team

• Scatter Diagram
[Figure: scatter diagram with Number of Interceptions (x, 0 to 3) on the horizontal axis and Number of Points Scored (y, 0 to 30) on the vertical axis]
Example: Panthers Football Team

• The preceding scatter diagram indicates a positive
relationship between the number of interceptions
and the number of points scored.
• Higher points scored are associated with a higher
number of interceptions.
• The relationship is not perfect; not all plotted points in
the scatter diagram lie on a straight line.
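Scatter Diagram
• A Positive Relationship
[Figure: scatter diagram of y against x]
Scatter Diagram
• A Negative Relationship
[Figure: scatter diagram of y against x]
Scatter Diagram
• No Apparent Relationship
[Figure: scatter diagram of y against x]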
Tabular and Graphical Procedures

Data
  Qualitative Data
    Tabular Methods
      • Frequency Distribution
      • Rel. Freq. Dist.
      • % Freq. Dist.
      • Crosstabulation
    Graphical Methods
      • Bar Graph
      • Pie Chart
  Quantitative Data
    Tabular Methods
      • Frequency Distribution
      • Rel. Freq. Dist.
      • Cum. Freq. Dist.
      • Cum. Rel. Freq. Distribution
      • Stem-and-Leaf Display
    Graphical Methods
      • Histogram
      • Ogive
      • Scatter Diagram
3. Some Basic Concepts
Frequency Table
This shows the number of times different values or
categories of observations occur in a dataset.
Example: A system administrator maintains records of computer
network failures. In one year there were 58 failures in total, of which
10 were due to electrical causes, 14 to hardware problems and 34 to
software misuse. This information can be represented by the following
frequency table:

Cause of Network Failure   Frequency
Electrical                 10
Hardware Problem           14
Software Misuse            34
Total                      58
Classes: 2 types
• Exclusive method: the upper limit of one class is the lower
limit of the next class
10 to 15; 15 to 20; 20 to 25, etc.
• Inclusive method: the upper limit of one class is included in
that class itself
10 to 14; 15 to 19; 20 to 24
• Constructing a frequency distribution
Width of class intervals = (Largest value in data - Smallest value in data) / Total no. of class intervals



Sturges' rule
Total no. of classes k = 1 + 3.322 log10(N)
where N = total no. of observations
If there are 10 observations, the number of classes
is k = 1 + (3.322 x 1) = 4.322, or 4
If there are 100 observations, the number of classes
is k = 1 + (3.322 x 2) = 1 + 6.644 = 7.644, or 8
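
A minimal sketch of the two calculations, Sturges' rule and the class-width formula, is given below. The function names are illustrative; the logarithm is taken to base 10, which is what the worked values for N = 10 and N = 100 imply.

```python
import math

def sturges_classes(n_observations):
    """Suggested number of classes, k = 1 + 3.322 log10(N)."""
    return 1 + 3.322 * math.log10(n_observations)

def class_width(largest, smallest, n_classes):
    """Width of each class interval for a chosen number of classes."""
    return (largest - smallest) / n_classes

for n in (10, 30, 100):
    k = sturges_classes(n)
    print(f"N = {n}: k = {k:.3f}, rounded to {round(k)}")

# Carpet-loom sample from the document: 6 classes spanning 15.2 to 17.0.
width = class_width(17.0, 15.2, 6)
print(f"class width = {width:.1f} yd")
```

The rule only gives a starting point; the final number of classes is usually adjusted so that the class limits are convenient round values.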



Example: Daily production of 30 carpet looms

Class          Frequency
15.2 to 15.4   2
15.5 to 15.7   5
15.8 to 16.0   11
16.1 to 16.3   6
16.4 to 16.6   3
16.7 to 16.9   3
Total          30

Width of class = (17.0 - 15.2) / 6 = 0.3 yd
Questions
• Prepare a frequency table for the following
data with the width of each class interval as 10.
Use the exclusive method of classification:
• 57, 72, 96, 22, 10, 44, 51, 56, 10, 34, 80, 69, 50, 84, 66,
75, 34, 47, 50, 53, 0, 22, 10, 47, 75, 18, 83, 34, 73, 90,
45, 70, 61, 42, 58, 14, 20, 66, 33, 46, 04, 57, 80, 48, 39,
64, 28, 46, 65, 69.



Questions
• Classify the following data by taking class
intervals such that their mid-values are
17, 22, 27, 32 and so on.
• 30, 30, 36, 33, 42, 27, 22, 41, 30, 42, 30, 21, 54, 36,
31, 40, 28, 19, 48, 26, 48, 15, 37, 16, 17, 54, 42, 51,
44, 32, 42, 31, 21, 25, 36, 22, 41, 40, 46.



4. Graphical Representation of Data
Bar Diagram
Pie Chart
Stem-and-Leaf Displays
Histogram
Frequency Polygon
Ogives (Cumulative Frequency Distribution)



4. Graphical Representation of Data

Bar Diagram
[Figure: horizontal bar chart of sales by quarter (1st Qtr to 4th Qtr) for the East, West and North regions; horizontal axis 0 to 200]

Sales in the Year 1990
Region   Q1     Q2     Q3     Q4
East     20.0   30.0   90.0   60.0
West     30.6   38.6   34.6   31.6
North    45.9   46.9   45.0   43.9



4. Graphical Representation of Data

Bar Diagram
[Figure: horizontal bar chart of the "Sales in the Year 1990" data by quarter (1st Qtr to 4th Qtr) for the East, West and North regions; horizontal axis 0 to 200]



4. Graphical Representation of Data

Bar (Column) Diagram
[Figure: column chart of the "Sales in the Year 1990" data by quarter (1st Qtr to 4th Qtr) for the East, West and North regions; vertical axis 0 to 90]



4. Graphical Representation of Data

Pie Chart
[Figure: pie chart of East region sales by quarter (1st Qtr to 4th Qtr)]

Sales in the Year 1990
Region   Q1   Q2   Q3   Q4   Total
East     20   30   90   60   200
(In %)   10   15   45   30   100

Note:
The angle of 360° at the centre is distributed in proportion to the % share.


Pie-Diagrams
• Are very popular diagrams used for
representing the breakdown of an aggregate
into its components or sub-divisions
• Generally used to compare the relationship
between various components
• The % share is converted into degrees, keeping in view
that the whole circle covers 360°
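
The conversion from shares to angles can be sketched as follows, using the East region's quarterly sales from the pie chart example. The dictionary names are illustrative.

```python
# East region sales by quarter, from the "Sales in the Year 1990" table.
sales = {"Q1": 20, "Q2": 30, "Q3": 90, "Q4": 60}
total = sum(sales.values())

# Percentage share of each quarter in the total.
percent = {quarter: 100 * value / total for quarter, value in sales.items()}

# Angle at the centre: the same share applied to the full 360 degrees.
angle = {quarter: 360 * value / total for quarter, value in sales.items()}

for quarter in sales:
    print(f"{quarter}: {percent[quarter]:.0f}%  {angle[quarter]:.0f} degrees")
```

Each 1% of the total corresponds to 3.6 degrees, so the angles always add up to 360.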



Pictograms & Cartograms
• Pictograms present data by means of
pictorial representations
• Cartograms represent data by maps
• Major limitations of diagrammatic
representation of data: it presents
limited information, it is subjective in character,
and the real statistical values are suppressed



4. Graphical Representation of Data
Stem-and-Leaf Display
(1) Select one or more leading digits as stem values. Trailing digits become leaves.
(2) List possible stem values in a vertical column.
(3) Record the leaf for every observation beside the corresponding stem value.
(4) Indicate the units for stems and leaves somewhere in the display.
Note: Apply when not all values are single-digit.

A Typical Stem-and-Leaf Display
0 | 4
1 | 1345678889
2 | 12234566667778899
3 | 0112233344556
4 | 11222

Stem: tens digit
Leaf: ones digit
4. Graphical Representation of Data
• Histogram
– Graphical representation of a frequency
distribution of a continuous series
– For each class interval, a rectangle is
constructed with base equal to the width of
the class interval and height proportional to
the frequency



4. Graphical Representation of Data
Frequency Table/Distribution

Class   Frequency
7-12    2
13-18   5
19-24   8
25-30   6
31-36   3
37-42   1
Total   25

[Figure: Frequency Histogram; x-axis: Class (7-12 to 37-42), y-axis: Frequency (0 to 10); bars labelled 2, 5, 8, 6, 3, 1]
Note: Showing Label/Value in the Histogram is Optional
4. Graphical Representation of Data
Frequency Table/Distribution

Class   Frequency   Relative Frequency
7-12    2           0.08
13-18   5           0.20
19-24   8           0.32
25-30   6           0.24
31-36   3           0.12
37-42   1           0.04
Total   25          1.00

[Figure: Relative Frequency Histogram; x-axis: Class (7-12 to 37-42), y-axis: Relative Frequency (0 to 0.4)]
4. Graphical Representation of Data
• Frequency Polygon
– A line graph that connects the midpoints of the tops of all the
bars in a histogram
– A graphical representation of a frequency
distribution; it assumes that the distribution
has equal class widths, whereas histograms may also have
unequal class widths
– Two or more frequency polygons can be drawn on
the same graph, whereas two histograms cannot



4. Graphical Representation of Data
Data for Frequency Polygon

Class   Frequency   Relative Frequency
7-12    2           0.08
13-18   5           0.20
19-24   8           0.32
25-30   6           0.24
31-36   3           0.12
37-42   1           0.04
Total   25          1.00

[Figure: Frequency Polygon (with Histogram); x-axis: Class Mid-Point (3.5, 9.5, 15.5, 21.5, 27.5, 33.5, 39.5, 45.5), y-axis: Relative Frequency (0 to 0.4)]
4. Graphical Representation of Data
Data for Ogives

Class   Relative Frequency   Cumulative Relative Frequency
7-12    0.08                 0.08
13-18   0.20                 0.28
19-24   0.32                 0.60
25-30   0.24                 0.84
31-36   0.12                 0.96
37-42   0.04                 1.00
Total   1.00

[Figure: Ogives; x-axis: Class (7, 13, 19, 25, 31, 37, 43), y-axis: Cumulative Relative Frequency (0 to 1)]
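
The relative and cumulative relative frequencies plotted in the histogram, frequency polygon and ogive slides can be derived from the class frequencies alone. The sketch below is one way to do it; the list names are illustrative.

```python
from itertools import accumulate

classes = ["7-12", "13-18", "19-24", "25-30", "31-36", "37-42"]
frequency = [2, 5, 8, 6, 3, 1]
total = sum(frequency)  # 25 observations

# Relative frequency: share of the observations in each class.
relative = [f / total for f in frequency]

# Cumulative frequency: running total up to and including each class.
cumulative = list(accumulate(frequency))

# Cumulative relative frequency: the values plotted on a less-than ogive.
cumulative_relative = [c / total for c in cumulative]

for row in zip(classes, frequency, relative, cumulative_relative):
    print("{:<6} {:>2}  {:.2f}  {:.2f}".format(*row))
```

The last cumulative relative frequency is always 1, which is a convenient check on the arithmetic.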
Ogives (cumulative frequency curves)
• A graph of a cumulative frequency distribution is
called an ogive
• A cumulative frequency distribution enables
us to see how many observations lie above or
below certain values, rather than merely recording
the number of items within intervals
• A less-than or a greater-than ogive can be
constructed for a given frequency distribution



Questions
• 1. Which of the following is not a type of bar chart?
– Multiple
– Percentage
– Ogive
• 2. A line graph indicates
– Comparison
– Variation
– Range
– All of the above



Questions
• 3. Which of the following is not an example of compressed data?
– Frequency distribution
– Data array
– Histogram
– Ogive
• 4. When constructing a frequency distribution, the first step
is
– Divide the data into at least 5 classes
– Sort the data points into classes and count the no. of points in each
class
– Decide on the type and no. of classes for dividing the data
– None of the above
Questions
• 5. A relative frequency distribution presents frequencies in
terms of?
– Fractions
– Whole numbers
– Percentages
– Both a and c



Questions
• A single observation in a data set is
called ........
• The ........ & ........ are two methods of data
arrangement.
• A multiple bar diagram is a ........ dimensional
diagram
• Pie-diagrams are ........ dimensional diagrams



Questions
• The following table gives the marks of 100 students
in the subject “Microbiology”
– Draw more-than and less-than type ogives. Using these
curves, find the no. of students
– With marks less than 45
– With marks more than 65
– With marks between 45 & 65

Marks   Students
0-9     5
10-19   15
20-29   18
30-39   30
40-49   15
50-59   10
60-69   5
70-79   2
5. Descriptive Statistics
What are Descriptive Statistics?
-These are a set of single-number statistics, useful for gaining
some overall idea about the data without making use of
any ‘statistical inference’.
-Descriptive Statistics may be accompanied by meaningful
graphical representation of data.
Widely Used Descriptive Statistics
- Measures of Location/Central Tendency
- Measures of Dispersion/variability
- Measures of Skewness
- Measures of Kurtosis
Measures of Central Tendency
• Central Tendency
– Middle point of a distribution
– Measures of location
– Mean, Median, Mode
• Dispersion
– Spread of data in a distribution (extent to
which the observations are scattered)



Mean
• Another name for average.
• If describing a population, denoted as μ, the
Greek letter “mu”. (PARAMETER)
• If describing a sample, denoted as x̄, called
“x-bar”. (STATISTIC)
• Appropriate for describing measurement data.
• Seriously affected by unusual values called
“outliers”.
5. Descriptive Statistics
• Select Measures of Location/Central Tendency

Arithmetic Mean (AM) - for ungrouped data

$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

Example: AM of the numbers 2, 3, 7 is (2 + 3 + 7)/3 = 4



5. Descriptive Statistics
• Select Measures of Location/Central Tendency
Arithmetic Mean (AM) - for grouped data

$$\bar{x} = \frac{\sum_{j=1}^{k} f_j x_j}{n} = \frac{\sum_{j=1}^{k} f_j x_j}{\sum_{j=1}^{k} f_j} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{f_1 + f_2 + \dots + f_k}$$

where n = no. of observations; k = no. of classes
x_j = mid-point of j-th class
f_j = frequency of j-th class (Note: the f_j's add to n)
Exercise: weights in pounds of a sample of packages
are given; calculate the sample mean

Class       Frequency
10.0-10.9   1
11.0-11.9   4
12.0-12.9   6
13.0-13.9   8
14.0-14.9   12
15.0-15.9   11
16.0-16.9   8
17.0-17.9   7
18.0-18.9   6
19.0-19.9   2
Exercise: weights in pounds of a sample of
packages are given; calculate the sample mean

Class       Frequency   x (midpoint)   fx
10.0-10.9   1           10.5           10.5
11.0-11.9   4           11.5           46.0
12.0-12.9   6           12.5           75.0
13.0-13.9   8           13.5           108.0
14.0-14.9   12          14.5           174.0
15.0-15.9   11          15.5           170.5
16.0-16.9   8           16.5           132.0
17.0-17.9   7           17.5           122.5
18.0-18.9   6           18.5           111.0
19.0-19.9   2           19.5           39.0
Total       65                         988.5

Ans: 988.5/65 = 15.2077 pounds
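
The same grouped-data calculation can be checked with a few lines of code. This is only a sketch: it takes the class midpoints exactly as listed in the table, and the names are illustrative.

```python
# Class midpoints (x) and frequencies (f) from the package-weight table.
midpoints = [10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5]
frequencies = [1, 4, 6, 8, 12, 11, 8, 7, 6, 2]

n = sum(frequencies)                                        # number of observations
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f_j * x_j
grouped_mean = sum_fx / n

print(f"n = {n}, sum of fx = {sum_fx}, mean = {grouped_mean:.4f} pounds")
```

The result is an estimate of the mean, since every observation in a class is treated as if it sat at the class midpoint.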
Exercise: time in seconds needed to serve a sample
of customers is given; calculate the sample mean
Class Frequency
20-29 6
30-39 16
40-49 21
50-59 29
60-69 25
70-79 22
80-89 11
90-99 7
100-109 4
110-119 0
120-129 2



Descriptive Statistics
• Weighted Mean
– Calculate an average that takes into account the
importance of each value to the overall total



5. Descriptive Statistics
• Select Measures of Location/Central Tendency

Weighted Arithmetic Mean (AM)

$$\bar{x}_w = \frac{\sum_{j=1}^{n} w_j x_j}{\sum_{j=1}^{n} w_j}$$

where n = no. of observations;
x_j = value of j-th observation
w_j = weight assigned to j-th observation
i.e. the sum of each observation multiplied by its weight, divided by the sum of
all the weights
Weighted average: A company uses three grades of labor
(unskilled, semiskilled and skilled) to produce 2 end
products. Find the average cost of labor per hour for each
of these products.

Grade of labor   Hourly wage (x)   Labor hrs per unit of output
                                   Product 1   Product 2
Unskilled        $5.00             1           4
Semiskilled      $7.00             2           3
Skilled          $9.00             5           3



Exercise: contd…

• A simple AM gives the average labor wage rate
as (5+7+9)/3 = $7/hr
• But the correct average would be a weighted
average
• We can see that for product 1, the average cost
of labor would be
(1/8)x5 + (2/8)x7 + (5/8)x9 = $8.00/hr
• For product 2, the average cost would be
(4/10)x5 + (3/10)x7 + (3/10)x9 = $6.80/hr
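
The labor-cost exercise can be verified with a small sketch in which the labor hours per unit act as the weights. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

hourly_wage = [5.00, 7.00, 9.00]  # unskilled, semiskilled, skilled

simple_average = sum(hourly_wage) / len(hourly_wage)
product_1 = weighted_mean(hourly_wage, [1, 2, 5])  # labor hrs per unit, product 1
product_2 = weighted_mean(hourly_wage, [4, 3, 3])  # labor hrs per unit, product 2

print(f"simple AM: ${simple_average:.2f}/hr")
print(f"product 1: ${product_1:.2f}/hr")
print(f"product 2: ${product_2:.2f}/hr")
```

Product 1 uses mostly skilled labor, so its weighted average lies above the simple average, while product 2 lies below it.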



Descriptive Statistics

• In the case of quantities that change over a period
of time, we need to know about an average
growth rate over a period of several years
• Here AM becomes inappropriate and hence
the need for GM
• E.g. bank interest rates, rate of price rise, etc.



5. Descriptive Statistics
• Select Measures of Location/Central Tendency

Geometric Mean (GM)

$$\mathrm{GM} = \sqrt[n]{\text{Product of all } x \text{ values}} = \sqrt[n]{x_1 x_2 \cdots x_n}$$

Examples
(1) GM of the numbers 2 & 8 = √(2 x 8) = 4
(2) GM of the numbers 1, 3 & 9 = ∛(1 x 3 x 9) = 3
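
A short sketch of the n-th root calculation follows. The hand-written `geometric_mean` function is illustrative; Python's standard library also offers `statistics.geometric_mean` (Python 3.8 and later), shown here for comparison. Both assume positive values.

```python
import math
import statistics

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

gm_two = geometric_mean([2, 8])       # square root of 16
gm_three = geometric_mean([1, 3, 9])  # cube root of 27

print(gm_two, gm_three)
print(statistics.geometric_mean([2, 8]))
```

Floating-point roots can differ from the exact answer in the last decimal place, so results are best compared with a small tolerance.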



5. Descriptive Statistics
• Select Measures of Location/Central Tendency
Median
- This is a single value that measures the central item
in the data, i.e. the middlemost or most central item in
the set of numbers
- So, Median ≥ lowest 50% of observations
and Median ≤ remaining 50% of observations.



5. Descriptive Statistics
• Select Measures of Location/Central Tendency
Median
- Let the n observations be x1, x2, ..., xn
- Let y1, y2, ..., yn represent the corresponding data array
(ascending/descending order)
- Then the median is calculated as

$$\text{Median} = \begin{cases} y_{(n+1)/2} & \text{if } n \text{ is an odd number} \\ \dfrac{y_{n/2} + y_{n/2+1}}{2} & \text{if } n \text{ is an even number} \end{cases}$$
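
The two cases of the median formula translate directly into code. The sketch below is illustrative; the odd-sized sample is the document's own set of seven values, while the even-sized sample is a hypothetical one added only to exercise the second case.

```python
def median(observations):
    """Median of a list via the data array y1, ..., yn (ascending order)."""
    y = sorted(observations)
    n = len(y)
    if n % 2 == 1:
        # y_{(n+1)/2}; subtract 1 because Python lists are 0-indexed.
        return y[(n + 1) // 2 - 1]
    # Average of y_{n/2} and y_{n/2 + 1}.
    return (y[n // 2 - 1] + y[n // 2]) / 2

odd_case = median([2, 4, 8, 10, 300, 256, 310])  # seven values
even_case = median([1, 3, 5, 7])                 # hypothetical four values

print(odd_case, even_case)
```

Only the position of a value in the data array matters, which is why the large values 256, 300 and 310 do not pull the median upward.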
5. Descriptive Statistics
• Disadvantages of the Median
– The median is the value at the middle position of the data array
– In the case of a large data array, it becomes
difficult to calculate the median, and
sometimes it may take an unusual value
– E.g. consider the values
2, 4, 8, 10, 300, 256, 310: the median is 10, which
has no apparent relationship with the other
values in the distribution



Estimate the median for the
following frequency distribution
Class Frequency Cum.freq.
100-149.5 12 12
150-199.5 14 26
200-249.5 27 53
250-299.5 58 111
300-349.5 72 183
350-399.5 63 246
400-449.5 36 282
450-499.5 18 300



5. Descriptive Statistics
• Select Measures of Location/Central Tendency
Mode
- This is the value that has the highest frequency of
occurrence (at least locally) in the data set.
- A data set may have more than one mode.
- A mode from ungrouped data may be very
unreliable; it may occur just by chance,
and so may not be representative as a central
value of the data set.
5. Descriptive Statistics
• Mo = L_Mo + [d1 / (d1 + d2)] w
– Where
• L_Mo is the lower limit of the modal class
• d1 is the frequency of the modal class minus the
frequency of the class directly below it
• d2 is the frequency of the modal class minus the
frequency of the class directly above it
• w is the width of the modal class interval
• The modal class is the class with the highest frequency
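
As a hedged illustration of the grouped-mode formula, the sketch below applies it to the package-weight frequency distribution used earlier in the document (classes 10.0-10.9 up to 19.0-19.9, taken to have width 1.0). The function name `grouped_mode` is hypothetical, and a missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mo = L_Mo + d1 / (d1 + d2) * w for equal-width classes."""
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    below = frequencies[m - 1] if m > 0 else 0
    above = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    d1 = frequencies[m] - below
    d2 = frequencies[m] - above
    # Note: d1 + d2 is 0 if both neighbours tie with the modal class.
    return lower_limits[m] + d1 / (d1 + d2) * width

lower_limits = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
frequencies = [1, 4, 6, 8, 12, 11, 8, 7, 6, 2]

mode = grouped_mode(lower_limits, frequencies, width=1.0)
print(f"estimated mode = {mode:.1f} pounds")
```

Here the modal class is 14.0-14.9 with d1 = 12 - 8 = 4 and d2 = 12 - 11 = 1, so the estimate sits close to the upper end of that class.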
Exercise
• The ages of a sample of students in a college are as
follows:
– Calculate the mean (frequency distribution can be
15-19, 20-24 etc.)
– Estimate the median and mode

19,17,15,20,23,41,33,21,18,20,18,33,32,29,24,19,18,
20,17,22,55,19,22,25,28,30,44,19,20,39





Measures of Variability
• Range
• Interquartile range (IQR)
• Variance and standard deviation
• Coefficient of variation (CV)



[Figure: two histograms, "heart rate of population 1" and "heart rate of population 2"; Mean = 79; x-axis: heart rate, y-axis: distribution]
• Both populations have a similar mean but a different
spread of values
• We could quote a range (population 1: 96-62 = 34
beats; population 2: 88-70 = 18 beats)
• However, the problem is that this range depends just on
the extreme values we measure



Variance and standard deviation
[Figure: frequency plots of two populations over the values 1 to 25]
• Both populations have the same range but
clearly population 2 has less spread across
most values.
• Better to measure deviation from the mean



The normal distribution
• Many variables in nature form
a bell-shaped distribution
• This normal or Gaussian
curve can be used to calculate
the probability of a given
measurement being found
assuming it belongs to the
population
• A vertical line drawn from the
centre of the curve to the
horizontal axis divides the
area of the curve into two
equal parts. Each is the mirror
image of the other
Skewness
(Asymmetrical distribution)
[Figure: frequency curve; x-axis: Value, y-axis: Frequency]
• This curve is skewed towards the right
(positively skewed)
• i.e. the values are not equally distributed



Skewness
(Asymmetrical distribution)
[Figure: frequency curve; x-axis: Value, y-axis: Frequency]
• This curve is skewed to the left
(negatively skewed)



Kurtosis
[Figure: three frequency curves labelled k>3, k=3 and k<3; x-axis: Value, y-axis: Frequency]
• Kurtosis measures the peakedness of a
distribution
• These curves have the same central location and
dispersion and they are symmetrical
• They differ only in their degrees of kurtosis
Measures of Variability/Dispersion
• Range
• Interquartile range (IQR)
• Variance and standard deviation (a measure of the average distance of
the observations in the data set from the mean)
• Coefficient of variation (CV)

– Dispersion is an important characteristic since it gives us
additional information to judge the reliability of our measure
of central tendency
– i.e. if the data are widely dispersed, then the central location is
less representative of the data as a whole than it would be for
data more closely centered around the mean.



5. Descriptive Statistics
• Dispersion/Variability Measures
Range
- This is the spread between the maximum and the minimum
values in the data set, i.e.

Range = (Value of highest observation) - (Value of lowest observation)

Example: Let the given observations be 1.5, 1.0, 4.5, 5, 0.5
Highest Value = 5.0; and Lowest Value = 0.5
So, Range = 5.0 - 0.5 = 4.5
5. Descriptive Statistics
• Measures of Dispersion/Variability
Interquartile Range
- To compute this, we divide the data into 4 equal parts,
each of which contains 25% of the items in the
distribution
- The quartiles are then the highest values in each of
these four parts, and the interquartile range is the
difference between the values of the first and third
quartiles (Q3 - Q1)
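
A minimal sketch of the range and interquartile range is given below. It relies on `statistics.quantiles` from Python's standard library with `method="inclusive"`; this is one of several quartile conventions, so other textbooks or tools may give slightly different Q1 and Q3 for the same data. The nine-value sample is hypothetical.

```python
import statistics

def data_range(observations):
    """Highest observation minus lowest observation."""
    return max(observations) - min(observations)

def interquartile_range(observations):
    """Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = statistics.quantiles(observations, n=4, method="inclusive")
    return q3 - q1

# Range example from the document.
range_example = data_range([1.5, 1.0, 4.5, 5, 0.5])

# Hypothetical sample: the integers 1 to 9.
iqr_example = interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(range_example, iqr_example)
```

Unlike the range, the interquartile range ignores the lowest and highest quarters of the data, so a single extreme value does not change it much.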



5. Descriptive Statistics
• Measures of Dispersion/Variability
Variance
- This gives the average squared deviation of
observations from the Arithmetic Mean
- Population Variance

$$\sigma^2 = \frac{1}{N}\sum_{j=1}^{N}(X_j - \mu)^2 = \frac{1}{N}\sum_{j=1}^{N} X_j^2 - \mu^2$$

where σ² = Population Variance
N = No. of observations in the Population
X_j = j-th observation, j = 1, 2, ..., N
μ = Arithmetic Mean of the Population
5.Descriptive Statistics
 Measures of Dispersion/Variability
Variance
-Sample Variance

$$S^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 \quad \text{or} \quad S^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$$

where S² = Sample Variance
n = No. of sample observations
x_j = j-th sample observation, j = 1, 2, ....., n
x̄ = Sample Arithmetic Mean
Note:
(1) Each of the above forms has certain merits & demerits.
Details will be discussed in subsequent sessions.
(2) The average of sample variances taken together for a
particular population tends not to equal the population variance,
unless we use n-1 as the denominator.
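Note (2) can be illustrated by brute force on a tiny case. The sketch below uses a hypothetical population {2, 4, 6}, lists every ordered sample of size 2 drawn with replacement, and averages the sample variances under each denominator. Exact fractions are used to avoid rounding; the population and the sampling scheme are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def variance(data, denominator):
    """Sum of squared deviations from the mean, divided by `denominator`."""
    mean = Fraction(sum(data), len(data))
    return sum((x - mean) ** 2 for x in data) / denominator


population = [2, 4, 6]  # hypothetical population
pop_var = variance(population, len(population))

n = 2
# All 9 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_with_n = sum(variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(pop_var, avg_with_n, avg_with_n_minus_1)  # 8/3 4/3 8/3
```

With denominator n the average of the sample variances falls short of the population variance; with n-1 it matches it exactly in this setup.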
5.Descriptive Statistics
 Measures of Dispersion/Variability
Standard Deviation

-Population/Sample Standard Deviation

$$\text{Standard Deviation (SD)} = \sqrt{\text{Variance}}$$

Note:
(1) SD is never negative. It is the positive square
root of the variance
(2) If variance = 25, then SD = 5 (but not –5)
5.Descriptive Statistics
 Relative Dispersion/Variability
Coefficient of Variation (CV)
-The CV is useful for measuring the extent of variability in
relation to a central tendency measure

$$\text{CV (in \%)} = \frac{\text{SD}}{\text{Arithmetic Mean (AM)}} \times 100$$

- Note:
(1) CV is undefined when AM=0
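A short sketch of the SD and CV calculations, reusing the five observations from the range example and treating them as a population (denominator N); both choices are assumptions for illustration, and the function names are hypothetical.

```python
import math


def population_sd(data):
    """Positive square root of the population variance."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)


def coefficient_of_variation(data):
    """CV in per cent: SD / arithmetic mean * 100."""
    mean = sum(data) / len(data)
    if mean == 0:
        raise ValueError("CV is undefined when the arithmetic mean is 0")
    return population_sd(data) / mean * 100


observations = [1.5, 1.0, 4.5, 5, 0.5]
sd = population_sd(observations)
cv = coefficient_of_variation(observations)
print(round(sd, 4), round(cv, 2))  # 1.8708 74.83
```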
5.Descriptive Statistics
 Units of Descriptive Statistics
---------------------------------------------------------------
Statistic    Unit
---------------------------------------------------------------
Mean         Same as the original data
Median       --do--
Mode         --do--
SD           --do--
Variance     Square of the unit of the original data
CV           Per cent
---------------------------------------------------------------
More words about the normal curve:
Chebyshev’s theorem
• According to a theorem devised by the Russian mathematician
P.L. Chebyshev, no matter what the shape of the distribution, at
least 75% of the values will fall within ±2 standard deviations
from the mean of the distribution and at least 89% of the values
will lie within ±3 standard deviations from the mean.
• In the case of a symmetrical bell-shaped curve, we can say that
– About 68% of the values in the population will fall within ±1 standard
deviation from the mean
– About 95% of the values will lie within ±2 standard deviations from the
mean
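Chebyshev's theorem guarantees that at least 1 - 1/k² of the values lie within ±k standard deviations of the mean, for any k > 1. The sketch below evaluates that bound; the function name is an illustrative choice.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    print(k, round(100 * chebyshev_lower_bound(k), 1))
# 2 75.0
# 3 88.9
```

These are guaranteed minimums for any distribution. The 68% and 95% figures for a bell-shaped curve are approximate values for that specific shape.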

Probability Theory 102


More about the normal curve:
Chebyshev’s theorem
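[Figure: bell-shaped curve, Frequency vs. Value, centered on the mean; 34% of the values lie on each side of the mean within 1 standard deviation, and 47.7% on each side within 2 standard deviations]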
Determine the variance and standard deviation of
the following data set

Question 1
0.04, 0.06, 0.12, 0.14, 0.14, 0.15, 0.17, 0.17, 0.18,
0.19, 0.21, 0.21, 0.22, 0.24, 0.25



Determine the sample variance and standard deviation of the following data

Class        Frequency
700-799      4
800-899      7
900-999      8
1000-1099    10
1100-1199    12
1200-1299    17
1300-1399    13
1400-1499    10
1500-1599    9
1600-1699    7
1700-1799    2
1800-1899    1
Questions
Q1: Determine the sample variance and
sample standard deviation of annual charity
payments to a hospital.
Set of payments:
863, 903, 957, 1041, 1138, 1204, 1354, 1624,
1698, 1745, 1802, 1883



Questions
Q2: In an attempt to estimate potential future demand,
the National Motor Company did a study asking
married couples how many cars the average energy-minded
family should own in 2010. For each couple,
the two responses were averaged to get the overall couple
response, and the answers were tabulated as follows.
Calculate the variance and standard deviation.

No. of cars   0    0.5   1.0   1.5   2.0   2.5
Frequency     2    14    23    7     4     2



Questions
• Intel is considering employing one of two training
programs. Two groups were trained for the same task.
Group 1 was trained by Program A; Group 2, by Program B.
For the first group, the times required to train the
employees had an average of 32.11 hours and a variance of
68.09. In the second group, the average was 19.75 hours
and the variance was 71.14. Which training program has
less relative variability in its performance?

