Chapter 1 and 2


Introduction to Statistics

(Stat 2181)

Chapter 1

Introduction

Chapter Goals
After completing this chapter, you will be able to:
• Explain the reasons for studying statistics
• Explain the difference between descriptive and inferential statistics
• Describe the applications, uses and limitations of statistics
• Identify types of variables and scales of measurement
1.1: Definition and Classification of Statistics

• The word "statistics" derives from Status (Latin), Statista (Italian) and Statistik (German), each of which means "political state".
• In ancient times, statistics was used for administrative purposes only.
• People usually think of statistics as numerical descriptions. However, statistics also refers to a discipline that deals with making sense of data.
• Thus, "statistics" is defined in two senses: plural and singular.
Plural vs Singular
• Statistics in the plural sense: a collection of numerical facts.
Examples
• Data on the human population of a region.
• Infants' birth weights at a public hospital in three consecutive months.
• Number of man-hours lost in industry in specific years.
• Statistics in the singular sense: the study and use of theory and methods for the analysis of data arising from a random process or phenomenon.
Definition and . . .

• A population is the complete set of things (usually people, objects, transactions, or events) that have a specified property in common.
• A sample is a part of a population.
• Parameters are descriptive measures computed from direct measurements on all population elements.
• A statistic (as opposed to the field of statistics) is a numerical quantity computed from sample data, e.g., the sample mean, median or maximum.
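The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The ten weights below are invented purely for illustration, and the sample size of 4 and the seed are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: weights (kg) of 10 people, invented for illustration.
population = [52, 58, 61, 64, 65, 67, 70, 73, 78, 82]

# Parameter: computed from every element of the population.
population_mean = statistics.mean(population)

# Statistic: computed from a sample (4 elements drawn without replacement).
rng = random.Random(1)  # fixed seed so the same sample is drawn on every run
sample = rng.sample(population, 4)
sample_mean = statistics.mean(sample)

print("Population mean (parameter):", population_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

The parameter is a single fixed number, whereas the statistic changes whenever a different sample is drawn (try another seed).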
Why Statistics?

• To manage data effectively and undertake research.
• To facilitate communication.
• To make valid decisions based on a part of the subjects taken from a large population.
• To monitor and evaluate activities performed at different institutions.
Data → Statistical tools → Information
Classification of Statistics
o Descriptive statistics comprises a set of methods for describing the characteristics of a set of data.
o Inferential statistics proceeds from data characteristics to making generalizations, estimates, forecasts, or other judgments based on the data.
o Example: On the last 3 Sundays, a car salesman sold 2, 1, and 0 new cars, respectively.
o Descriptive: "The salesman averaged 1 new car sold for the last 3 Sundays."
o Inferential: "The car salesman never sells more than 2 cars on any Sunday."
Descriptive Statistics

• Collect data
– e.g., Survey

• Present data
– e.g., Tables and graphs

• Summarize data
– e.g., Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Inferential Statistics
• Estimation
– e.g., Estimate the population
mean weight using the sample
mean weight
• Hypothesis testing
– e.g., Test the claim that the
population mean weight is 65 kg

Inference is the process of drawing conclusions or making decisions about a population based on sample results.
o In generic terms, the field of statistics provides
some of the most fundamental tools and
techniques of the scientific method:
o forming hypotheses,
o designing experiments and observational
studies,
o gathering data,
o summarizing data,
o drawing inferences from data, i.e., estimation
and testing hypotheses.
1.2 Stages in Statistical Investigation
There are five basic stages in any statistical investigation.
1. Collection of Data: the process of gathering observations (measurements, survey responses, etc.).
2. Organization of Data: the arrangement of data in a suitable form. It comprises editing, classifying and tabulating.
3. Presentation of Data: the process of displaying data in a precise manner using tables, graphs and diagrams.
4. Analysis of Data: the process of systematically applying statistical and/or logical techniques to describe, illustrate, and evaluate data.
5. Interpretation of Data: generalizing characteristics observed in the sample to the population.
1.3: Application, Uses and Limitations of Statistics

• Applications
• Statistics is applied in almost all areas of research, such as:
• Industry – control charts and inspection plans.
• Commerce – demand and supply.
• Agriculture – mean comparison (ANOVA).
• Economics – index numbers, time series and estimation.
• Education – formulation of policies for starting new courses.
• Planning – data related to production and consumption.
• Medicine – testing the efficacy of a new drug.
• Modern applications, for example, software engineering.
1.3: Application, Uses …

• Uses of Statistics
• Statistics can be used to express facts about different situations in numbers.
• Statistics presents messy data in a simple and easily understandable manner.
• We can compare facts using data.
• Statistics helps to show existing trends and make future predictions.
• It is also used to run a successful business: a businessman should estimate the demand for and supply of a commodity based on relevant data.
Limitations

1. Statistics is not suitable for studying qualitative phenomena directly.
2. Statistics does not study individual cases.
3. Statistical laws are not exact; they hold only on average.
4. Statistics may easily be misused.
5. Statistics is only one of the methods of studying a problem.
1.4 Types of Variables and Measurement Scales

• Variable: a characteristic that takes on different values.
• Value: a specific amount that a variable can take.
Types of Variables
o Qualitative Variables
o Attributes, categories
o Examples: male/female, registered to vote/not, ethnicity, eye color, etc.
o Quantitative Variables
o Discrete variable: can assume only a countable number of values.
o Continuous variable: can take on any value along an interval (measurements; answers "how much").
Scales of Measurement
• Ratio Scale (quantitative variable): differences between measurements; true zero exists.
• Interval Scale (quantitative variable): differences between measurements, but no true zero.
• Ordinal Scale (qualitative variable): ordered categories (rankings, order, or scaling, but no exact difference).
• Nominal Scale (qualitative variable): categories (no ordering or direction).
Example
• Nominal: marital status, eye color, gender, race
• Ordinal: stage of disease, severity of pain, level of satisfaction
• Interval: temperature, exam scores
• Ratio: distance, length, time until death, weight
Chapter Two

Methods of Data Collection and Presentation
Chapter Goals
After completing this chapter, you are expected to:
• Explain why we collect data
• Identify sources of data
• Describe the various methods of data collection
• Create and interpret diagrams to describe categorical variables:
– frequency distribution, bar chart, pie chart, pictogram
• Create and interpret graphs to describe numerical variables:
– frequency distribution, histogram, ogive, stem-and-leaf plot
2.1 Methods of Data Collection
• Why do we collect data?
– To answer questions,
– To make decisions, and
– To gain a deeper understanding of some phenomena.
• Examples
– Does lowering the speed limit reduce the number of fatal traffic accidents?
– What fraction of the students in a college belongs to blood group O?
• Data: a plural noun (the singular form is datum) meaning a set of known or given facts.
• Data can be collected using a survey or an experiment.
2.1.1 Sources of Data

• Primary
– Data generated by the immediate user(s) of the data.
– Surveys, experiments and observational research are the most common sources.
– Tends to require more time and expense than secondary data.
• Secondary
– Data gathered from another source for a similar or different purpose.
– Internal sources within the researcher's organization.
– External sources, including governmental, trade, commercial and internet sources.
Sources of Data . . .

• Example: If the attrition rate of students at a university is required, the data can be obtained from the registrar's office of that university.
• Uses of Secondary Data
• Secondary data save time and cost compared to primary data.
• They are less subject to intentional bias.
• Secondary data are the only option when the information is otherwise inaccessible.
• Drawback of Secondary Data
• They may not meet all of our requirements.
2.1.2 Types of Data

Data
• Categorical (Qualitative): defined categories or groups
– Examples: data on marital status, cause of death, eye color
• Numerical (Quantitative)
– Discrete. Examples: data on number of patients, frequency of cough at night, number of missing teeth
– Continuous (measured characteristics). Examples: data on weight, blood sugar level, survival time
Types of Data . . .
• Cross-Section Data: a set of observations taken at one point in time.
• Example: data collected on the HIV/AIDS status of all students enrolled at Addis Ababa University in a given year.
• Spatial Data: data associated with a place.
• Example: district-wise rainfall in Addis Ababa.
• Time Series Data: a set of observations collected over a sequence of times, usually at equal intervals (e.g., weekly, monthly or yearly).
• Example: the following data show three types of expenditure, in Birr, for a family over the four years 2005, 2006, 2007 and 2008.
Types of Data . . .

Year Food Education Others Total
2005 5400 1500 5500 12400
2006 5700 2000 6000 13700
2007 5900 1800 6200 13900
2008 6000 2100 5800 13900

Methods of Data Collection
• Various methods are available; the choice depends on the nature of the investigation and on the resources available.
1. Direct Observation: the investigator observes the behavior of the subjects/individuals being studied.
• Though costly, it is arguably a good method, as it reduces the chance of recording incorrect data.
2. Enumeration: well-trained enumerators ask a selected group of respondents a set of questions listed in a schedule.
• Can be time-consuming if the coverage area is wide.
Data Collection …

3. Direct Personal Interview: perhaps best suited when the problems are not completely understood.
• It is also recommended when the information collected is of a confidential nature.
4. Telephone Interview: questions are prepared and then put to the respondents via telephone calls.
• Recommended if a respondent cannot easily be reached other than by telephone.
Data Collection …
5. Indirect Oral Interview: the researcher contacts third parties, called witnesses, who are capable of supplying the necessary information.
• Recommended if the information is of a complex nature or the informants are not inclined to respond.
6. Mailed Enquiry Method: letters with a set of questions are sent to the respondents, and the responses are collected afterwards.
• Recommended if the survey covers a large area and the respondents are widely scattered.
7. Old Records: the researcher uses data collected by others and stored in some form, such as books, newspapers, almanacs or even unpublished sources.
2.2 Methods of Data Presentation
• Data in raw form are usually not easy to use for decision making.
• Data can be summarized using
• Tables
• Diagrams
• Graphs
• Statistical quantities such as the mean, standard deviation, etc.
• The type of diagram/graph to use depends on the variable being summarized.
Data Presentation . . .
Data Displays
• Categorical data
– Frequency tables of counts or percentages
– Bar or column charts
– Pie chart
• Quantitative data
– Frequency tables
– Histograms
– Frequency polygon
– Ogive
– Stem-and-leaf plot
2.2.1 Frequency Distributions
Key Terms
• Class: a category or range within which the data fall.
• Frequency: the number of observations in each class.
• Class relative frequency: the class frequency divided by the total number of observations in the data set.
• Class limits: the lowest and highest values of each class, i.e., the lower class limit (LCL) and the upper class limit (UCL).
• Class mark: the midpoint of each class.
• Class boundaries: values that fall midway between the UCL of one class and the LCL of the next higher class.
Let d = LCL of 2nd class – UCL of 1st class.
Then the lower class boundary is LCB = LCL – ½ × d and the upper class boundary is UCB = UCL + ½ × d.
• Class width: the difference between the lower and upper class boundaries of the same class.
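The boundary, class mark and width definitions above can be checked with a short Python sketch. The function name `class_summary` and the two example classes (1–10 and 11–20) are illustrative assumptions, not part of the course material.

```python
def class_summary(limits):
    """Return (LCB, UCB, class mark, width) for each (LCL, UCL) pair."""
    # d = LCL of the 2nd class - UCL of the 1st class
    d = limits[1][0] - limits[0][1]
    rows = []
    for lcl, ucl in limits:
        lcb = lcl - d / 2          # lower class boundary
        ucb = ucl + d / 2          # upper class boundary
        mark = (lcl + ucl) / 2     # class mark (midpoint)
        rows.append((lcb, ucb, mark, ucb - lcb))
    return rows

limits = [(1, 10), (11, 20)]  # illustrative class limits
for (lcl, ucl), (lcb, ucb, mark, width) in zip(limits, class_summary(limits)):
    print(f"{lcl}-{ucl}: boundaries {lcb}-{ucb}, mark {mark}, width {width}")
```

The sketch assumes at least two classes and the same gap d between every pair of consecutive classes.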
Frequency Distributions . . .
• Example

Class Limits   Class Boundaries   Class Mark   Frequency   Relative Frequency
1 – 10         0.5 – 10.5         5.5          12          12/100
11 – 20        10.5 – 20.5        15.5         10          10/100
…              …                  …            …           …
81 – 90        80.5 – 90.5        85.5         6           6/100
Total                                          100         1.00

Class Width = 10
Frequency Distribution …
Frequency Distribution
• Qualitative
• Quantitative: ungrouped or grouped
• Frequency Distribution (FD): a table that presents data in classes and shows the number of observations in each class.
• Qualitative FD: a frequency distribution in which the data presented are nominal or ordinal.
• Ungrouped FD: a frequency distribution in which each distinct value in the dataset forms its own class.
• Grouped FD: a frequency distribution in which several values are grouped into one class.
Steps in the Construction of Grouped FD
1. Find the difference between the largest and smallest values in the raw data and denote it by R (the range).
2. Set the number of classes (K), usually between 5 and 20, or use Sturges' rule: K = 1 + 3.322 log10(n), where n is the number of observations.
3. Estimate the class width W = R/K; round the estimate to a convenient value.
4. Determine the LCL of the first class by selecting a convenient number that is less than or equal to the lowest data value. Then add the class width to it to get the lower class limit of the second class. Keep adding until the desired number of classes is reached.
5.1. If the observations are whole numbers (e.g., 12, 23, 78), subtract 1 from the lower class limit of the second class to get the upper class limit of the first class.
5.2. If the observations have one decimal place (e.g., 1.2, 2.3, 7.8), subtract 0.1 from the lower class limit of the second class to get the upper class limit of the first class.
5.3. If the observations have two decimal places (e.g., 1.32, 2.35, 7.84), subtract 0.01 from the lower class limit of the second class to get the upper class limit of the first class.
6. Count the number of observations falling in each class and record the frequencies against the corresponding classes.
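Steps 2 to 5 can be sketched in Python as follows. The function names, the `unit` parameter and the summary values (n = 60, smallest value 10, largest value 74) are illustrative assumptions. Rounding the width to a "convenient value" is a judgment call, so it is set by hand here.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def class_limits(first_lcl, width, k, unit=1):
    """Build k (LCL, UCL) pairs; `unit` is the measurement precision (1 for whole numbers)."""
    limits = []
    for i in range(k):
        lcl = first_lcl + i * width
        ucl = lcl + width - unit  # UCL = LCL of the next class minus one unit
        limits.append((lcl, ucl))
    return limits

n, smallest, largest = 60, 10, 74      # illustrative summary of a raw data set
k = sturges_classes(n)                 # step 2
raw_width = (largest - smallest) / k   # step 3, before rounding
width = 10                             # chosen by hand as a convenient value
print(k, round(raw_width, 2))
print(class_limits(smallest, width, k))  # steps 4 and 5.1
```

For data recorded to one or two decimal places, `unit` would be 0.1 or 0.01. Floating-point rounding may then need extra care, which this sketch does not handle.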
Relative and Cumulative FD
• Relative frequency table: a table showing the relative frequency of each class.
– Relative frequency can be expressed as a percentage.
• Cumulative frequency (cf): the sum of the frequencies of all classes preceding (or succeeding) a class k, including the frequency of class k itself.
– The cumulative relative frequency expresses the same information as a percentage, obtained by multiplying the cumulative frequency by 100%/n.
• Less than cf: the number of observations less than or equal to the upper class boundary of a class.
• More than cf: the number of observations greater than the lower class boundary of a class.
Example
• Consider the following data

30 40 41 33 70 51 37 10 31 21 60 44 63 72 23 37 65 14
25 28 64 39 17 74 53 34 51 27 43 45 33 16 23 68 47 32
36 19 48 49 67 60 45 54 44 30 15 38 22 46 61 25 29 55
48 49 35 13 37 36
• Prepare i) absolute frequency distribution;
ii) relative frequency distribution;
iii) less than and more than cumulative
frequency distributions.

Example …
R = 74 – 10 = 64, n = 60
Using Sturges' rule:
K = 1 + 3.322 log10(60) = 1 + 3.322(1.778151) = 6.9070 ≈ 7
W = 64/7 = 9.14 ≈ 10

1st Class: 10 – 19 f(1): 7
2nd Class: 20 – 29 f(2): 9
3rd Class: 30 – 39 f(3): 15
…
Example …

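Class     Frequency   RF      LCF   MCF
10 – 19   7           0.117   7     60
20 – 29   9           0.150   16    53
30 – 39   15          0.250   31    44
40 – 49   13          0.217   44    29
50 – 59   5           0.083   49    16
60 – 69   8           0.133   57    11
70 – 79   3           0.050   60    3
Total     60          1.00

RF = relative frequency; LCF = less than cumulative frequency; MCF = more than cumulative frequency.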
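The frequency table for this example can be reproduced with a short Python sketch. It assumes the 60 raw values listed in the example and the classes 10–19 through 70–79. The variable names are illustrative.

```python
data = [
    30, 40, 41, 33, 70, 51, 37, 10, 31, 21, 60, 44, 63, 72, 23, 37, 65, 14,
    25, 28, 64, 39, 17, 74, 53, 34, 51, 27, 43, 45, 33, 16, 23, 68, 47, 32,
    36, 19, 48, 49, 67, 60, 45, 54, 44, 30, 15, 38, 22, 46, 61, 25, 29, 55,
    48, 49, 35, 13, 37, 36,
]
classes = [(lo, lo + 9) for lo in range(10, 80, 10)]  # 10-19, 20-29, ..., 70-79

n = len(data)
freq = [sum(lcl <= x <= ucl for x in data) for lcl, ucl in classes]
rel_freq = [f / n for f in freq]

# Less than cf: running total of frequencies from the first class onward.
less_than = []
running = 0
for f in freq:
    running += f
    less_than.append(running)

# More than cf: observations in this class and in all later classes.
more_than = [n - before for before in [0] + less_than[:-1]]

print("Class  Freq  RF     LCF  MCF")
for (lcl, ucl), f, rf, lcf, mcf in zip(classes, freq, rel_freq, less_than, more_than):
    print(f"{lcl}-{ucl}  {f:>4}  {rf:.3f}  {lcf:>3}  {mcf:>3}")
```

Because the observations are whole numbers, counting with the class limits gives the same result as counting with the class boundaries.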
2.2.2 Diagrammatic Presentation of Data
• Diagrams include the bar chart, pie diagram, pictogram and stem-and-leaf plot.
• Bar charts are the simplest and most widely used diagrams for data presentation.
• Bar charts display absolute or relative frequency distributions of categorical variables.
• Types of bar chart: simple, multiple, subdivided, two-way and broken.
Simple Bar Chart

• A simple bar chart consists of a number of rectangles arranged either horizontally or vertically.
• Horizontal bar chart: the X-axis represents the frequencies while the Y-axis represents the categories.
• Vertical bar chart: the Y-axis represents the frequencies while the X-axis represents the categories.
• A simple bar chart is useful for one-dimensional comparison only.
• Example
• Represent the data given in the following table using vertical and horizontal bar charts.
Simple Bar Chart . . .
Year No. of students
2000 3005
2001 3567
2002 3800
2003 4300
2004 3650
2005 5000

Two Way Bar Chart
• Used to represent data having both negative and positive values.
• Example
Year            1990     1991     1992     1993
Net Migration   50,000   -5,000   20,000   40,000
Multiple Bar Chart
• Used to compare two or more variables.
• Example: A number of accounting firms were audited and classified according to size (I [large], II [medium] and III [small]) and the degree to which income-changing accounting practices were used in preparing clients' tax returns.

         Degree of Change
Size     No changes   Some changes   Total
Large    23           36             59
Medium   52           61             113
Small    22           21             43
Total    97           118            215
Subdivided Bar Chart
• Used to show and compare the breakdown of one variable into several components.
Year             2000   2001   2002   2003   2004
No. of females   800    824    856    768    900
No. of males     1389   2450   1245   1655   1445
Total            2189   3274   2101   2423   2345
Broken Chart
• Used to represent data with wide variation in values.
• One observation may be far larger than the others.
• If we use a scale proportional to the value (frequency), it will be almost impossible to see the bars for the small values.
• Example
• Represent the data given below using a suitable chart.
Year    1990   1991   1992   1993    1994   1995
Value   899    543    787    35323   121    234
Broken Bars . . .
[Figures: the same data drawn as a simple bar chart and as a broken bar chart]
Pie Diagram
• A circular diagram where a circle is divided into sectors with
areas proportional to the corresponding components.
• Pie diagrams are useful for displaying the relative frequency
distribution of a categorical variable.
University        Addis Ababa   Gondar   Jimma   Total
No. of students   8000          6000     6000    20000

Sector angles:
Addis Ababa = [8000/20000] × 360° = 144°
Gondar = [6000/20000] × 360° = 108°
Jimma = [6000/20000] × 360° = 108°
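The sector angles can be computed with a few lines of Python. This is a minimal sketch using the student counts from the table. The dictionary and variable names are illustrative.

```python
students = {"Addis Ababa": 8000, "Gondar": 6000, "Jimma": 6000}
total = sum(students.values())

# Each sector's angle is its share of the total, scaled to 360 degrees.
angles = {name: count * 360 / total for name, count in students.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

The angles always sum to 360, which is a quick check on the arithmetic.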
Pictograms
• Pictograms present data using pictures.
• Example: Represent the following data using a pictogram.
Department           Accounting   Statistics   Computer Science   Chemistry
Number of Students   150          200          250                200
[Pictogram, key: one symbol = 50 students, so Accounting has 3 symbols, Statistics 4, Computer Science 5 and Chemistry 4]
Stem-and-Leaf Plot
• A stem-and-leaf plot is a special table in which each data value is split into a stem (the first digit or digits) and a leaf (usually the last digit).
Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• 21 is shown as stem 2, leaf 1
• 38 is shown as stem 3, leaf 8
• Completed stem-and-leaf plot:
Stem   Leaf
2      1 4 4 6 7 7
3      0 2 8
4      1
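One way to build such a plot programmatically is sketched below in Python. It assumes non-negative whole numbers, with the last digit as the leaf and the remaining digits as the stem. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Stems with no observations are simply omitted in this sketch. A fuller version might print them with an empty leaf row.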
Stem-and-Leaf Plot . . .
• Give a stem-and-leaf plot for the following data.
• 3.584, 3.615, 3.586, 3.712, 3.823, 3.616, 3.580, 3.888, 3.617, 3.584, 3.882, 3.912, 3.910, 3.712, 3.580, 3.917
Stem   Leaf
3.58   0 0 4 4 6
3.61   5 6 7
3.71   2 2
3.82   3
3.88   2 8
3.91   0 2 7
• 3.58|4 represents 3.584
2.2.4 Graphical Presentation of Data

• Graphs include the histogram, frequency polygon and ogive.
• A histogram is a set of rectangles whose areas are proportional to the class frequencies.
• A histogram depicts the frequency distribution of a quantitative variable.
• The x-axis represents the class boundaries (so each bar's width equals the class width) and the y-axis indicates the frequency.
Histogram Example

Daily High Temperature   Frequency
10 but less than 20      3
20 but less than 30      6
30 but less than 40      5
40 but less than 50      4
50 but less than 60      2

[Figure: histogram of daily high temperature; x-axis: temperature in degrees, y-axis: frequency; no gaps between bars]
Frequency Polygon
• A frequency polygon is a line graph of class frequencies plotted against class marks.
• The end points must be joined to the x-axis (y = 0) at the midpoints of empty classes: one before the first class and one after the last class.
• Frequency polygons serve the same purpose as histograms, but are especially helpful for comparing sets of data.
• Example
• 1. Represent the following data using a frequency polygon.
Class       14.5–24.5   24.5–34.5   34.5–44.5   44.5–54.5   54.5–64.5
Frequency   3           4           8           6           7
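The points to plot for this polygon can be worked out with a small Python sketch. It assumes equal class widths. The function name `polygon_points` is illustrative, and the actual drawing is left to whatever plotting tool is available.

```python
def polygon_points(boundaries, freqs):
    """Return (class mark, frequency) points, closed by zero-frequency end classes."""
    width = boundaries[1] - boundaries[0]  # assumes all classes have equal width
    marks = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    # Add the midpoints of the empty classes before the first and after the last class.
    xs = [marks[0] - width] + marks + [marks[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

boundaries = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [3, 4, 8, 6, 7]
print(polygon_points(boundaries, freqs))
```

Joining these points in order with straight lines gives the frequency polygon, anchored to the x-axis at 9.5 and 69.5.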
Frequency Polygon . . .
• 2. The following frequency distributions refer to the scores of 28 students on two tests. Plot frequency polygons for the two datasets.
Score    0–5   5–10   10–15   15–20   20–25
Test 1   3     4      8       6       7
Test 2   1     2      5       12      8
Ogive
o The ogive is a line plot of the cumulative frequency or the cumulative relative frequency.
o The x-axis shows the class boundaries and the y-axis shows either the less than or the more than cumulative frequency.
o Example
Price in Birr   Frequency   Less than cf   More than cf
10–20           2           2              26
20–30           3           5              24
30–40           6           11             21
40–50           8           19             15
50–60           5           24             7
60–70           2           26             2
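The points for both ogives can be derived from the class frequencies with a short Python sketch. It uses the price data from the example. The variable names are illustrative.

```python
boundaries = [10, 20, 30, 40, 50, 60, 70]
freqs = [2, 3, 6, 8, 5, 2]
n = sum(freqs)

# Less than ogive: cumulative count below each boundary, starting from 0.
less_than = [0]
for f in freqs:
    less_than.append(less_than[-1] + f)

# More than ogive: count at or above each boundary, ending at 0.
more_than = [n - c for c in less_than]

less_than_points = list(zip(boundaries, less_than))
more_than_points = list(zip(boundaries, more_than))
print(less_than_points)
print(more_than_points)
```

The less than ogive rises from 0 to n, and the more than ogive falls from n to 0.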