Introduction To Statistics: Haramaya University College of Computing and Informatics Department of Statistics
Introduction to Statistics
Writer: Teshome Kebede (MSc)
Editor: Awol Seid (MSc)
© September 2015
Contents
1 Introduction
1.1 History and Definition of Statistics
1.2 Classification of Statistics
1.3 Application of Statistics
1.4 Uses of Statistics
1.5 Limitation of Statistics
1.6 Variable
1.7 Measurement Scales
4 Measures of Variation
4.1 Introduction
4.2 Objectives of Measures of Variation
4.3 Types of Measures of Variation
4.3.1 Range and Relative Range
4.3.2 Quartile Deviation and Coefficient of Quartile Deviation
4.3.3 Mean Deviation and Coefficient of Mean Deviation
4.3.4 Variance and Standard Deviation
4.3.5 Coefficient of Variation
4.4 Exercises
5 Elementary Probability
5.1 What is Probability?
5.2 Concept of Set
5.2.1 Set Operation
5.3 Definition and Some Basic Concepts
5.4 Counting Rules
5.5 Approaches in Probability Definition
5.6 Some Probability Rules/Axioms
5.7 Conditional Probability
5.8 Independence
5.9 Exercises
6 Probability Distributions
6.1 Type of Random Variables
1 Introduction
All of us are familiar with statistics in everyday life. As a discipline of study and research it
has a short history, but as numerical information it dates back to antiquity. Various
documents from ancient times contain numerical information about countries (states), their
resources and the composition of their people. This explains the origin of the word statistics as
a factual description of a state. The term ‘statistics’ is derived from the Latin word status,
meaning state, and historically statistics referred to the display of facts and figures relating
to the demography of states or countries. Generally, it can be defined in two senses: plural
(as statistical data) and singular (as statistical methods).
Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely
used when reference is made to facts and figures on sales, employment or unemployment,
accidents, weather, deaths, education, etc. In this sense the word statistics serves simply
as data. But not all numerical data are statistics. For numerical data to
be identified as statistics, they must possess certain identifiable characteristics. Some of
these characteristics are described as follows:
as statistics since the average has been computed from many related figures such
as yearly salaries of many professors.
2. Statistics, generally, are not the outcome of a single cause but are affected
by multiple causes. A number of forces work together to affect
the facts and figures. For example, when we say the crime rate in a certain city
has increased by 15% over the last year, a number of factors might have contributed to this
change. These factors may include the general state of the economy (such as an economic
recession), the unemployment rate, the extent of drug use, the effectiveness of the legal
system and so on. While these factors can be identified, their effects
cannot be isolated and measured individually. Similarly, a marked increase in food
grain production in a certain country may have been due to the combined effect of
many factors such as better seeds, more extensive use of fertilizers, governmental
and banking support, adequate rainfall and so on. It is generally not possible to
segregate and study the effect of each of these forces individually.
Singular sense: Statistics is the science that deals with the methods of collecting,
organizing, presenting, analyzing and interpreting data. It refers to the subject
area that is concerned with extracting relevant information from available data with the
aim of making sound decisions. According to this meaning, statistics is concerned with
the development and application of methods and techniques for collecting, organizing,
presenting, analyzing and interpreting statistical data.
According to the singular sense definition of statistics, a statistical study (statistical
investigation) involves five stages: collection of data, organization of data, presentation of data,
analysis of data and interpretation of data.
1. Collection of Data: This is the first stage in any statistical investigation and involves
the process of obtaining (gathering) a set of related measurements or counts to meet
predetermined objectives. The data collected may be primary data (data collected directly
by the investigator) or secondary data (data obtained from intermediate
sources such as newspapers, journals, official records, etc.).
2. Organization of Data: It is usually not possible to derive any conclusion about the
main features of the data from direct inspection of the observations. The second purpose
of statistics is describing the properties of the data in a summary form. This stage
of statistical investigation helps to give a clear understanding of the information gathered
and includes editing (correcting), classifying and tabulating the collected data in a
systematic manner. The first step in the organization of data is editing, which means
correcting (adjusting) omissions, inconsistencies, irrelevant answers and wrong computations
in the collected data. The second step is classification,
that is, arranging the collected data according to some common characteristics. The last
step is tabulation, that is, presenting the classified data in tabular form, using
rows and columns.
4. Analysis of Data: The analysis of data is the extraction of a summarized and comprehensive
numerical description in order to reach conclusions or provide answers to a
problem. The problem may require simple or sophisticated mathematical expressions.
Based on the scope of the decision making, statistics can be classified into two: Descriptive
and Inferential Statistics.
Descriptive Statistics: refers to the procedures used to organize and summarize masses of
data. It is concerned with describing or summarizing the most important features of
the data. It deals only with the characteristics of the collected data, without
attempting to infer (conclude) anything that goes beyond the
data themselves.
Inferential Statistics: includes the methods used to find out something about a population,
based on a sample. It is concerned with drawing statistically valid conclusions about
the characteristics of the population based on information obtained from a sample. In
this form of statistical analysis, descriptive statistics is linked with probability theory in
order to generalize the results of the sample to the population. Performing hypothesis
tests, determining relationships between variables and making predictions also belong to
inferential statistics.
(b) At least 5% of the killings reported last year in city X were due to tourists.
(c) Of the students enrolled in Haramaya University this year, 74% are male and 26% are
female.
(d) The chance of winning the Ethiopian National Lottery on any day is 1 out of 167000.
(e) The demand for automobiles may decline next year in Europe.
(f) It has been raining continuously in Harar from Monday to Friday. It will continue to
rain over the weekend.
In modern times, statistical information plays a very important role in a wide range of
fields. Today, statistics is applied in almost all fields of human endeavor.
In Scientific Research: Statistics plays an important role in the collection of data through
efficiently designed experiments, in testing hypotheses and estimating unknown parameters,
and in the interpretation of results.
In Industry: Statistical techniques are used to improve and maintain the quality of manufactured
goods at a desired level. Statistical methods help to check whether a product
satisfies a given standard.
In Business: Statistical methods are employed to forecast future demand for goods, to plan
production, and to develop efficient management techniques to maximize profit.
In Literature: Statistical methods are used in quantifying an author’s style, which is useful
in settling cases of disputed authorship.
In Detective Work: Statistics helps in analyzing bits and pieces of information, which individually
may appear to be unrelated or even inconsistent, to see an underlying pattern.
There seems to be no human activity whose value cannot be enhanced by injecting statistical
ideas into planning and by using statistical methods for efficient analysis of data and
assessment of results for feedback and control.
For formulating and testing hypotheses: For instance, hypotheses such as whether a
new medicine is effective in curing a disease, or whether there is an association between
variables, can be tested using statistical tools.
For forecasting: Statistical methods help in studying past data and predicting future
trends.
• Statistics does not deal with a single observation; rather, as discussed earlier, it deals only
with aggregates of facts. For example, the marks obtained by one student in a class do not
carry any meaning in themselves unless they are compared with a set standard, with other
students in the same class or with the student’s own earlier marks.
• Statistical methods are not applicable to qualitative characteristics that cannot be coded as
numerical values.
• Statistical results are true on average, i.e. for the majority of cases. Since statistics is
not an exact science, statistical conclusions are not universally true. That is, statistical
laws are not universally true like the laws of physics, chemistry and mathematics.
• Statistics are liable to be misused or misinterpreted. This may be due to incomplete information,
inadequate and faulty procedures during data collection and sample selection
and, mainly, ignorance (lack of knowledge).
1.6. Variable
A variable is any phenomenon or attribute that can assume different values. The most important
single distinguishing feature of a variable is that it varies; that is, it can take on different
values. Based on the values that they assume, variables can be classified as
1. Qualitative variables: A qualitative variable has values that are intrinsically nonnumerical
(categorical).
2. Quantitative variables: A quantitative variable has values that are intrinsically numerical.
Example: Height, Family size, Weight, etc.
• Discrete variable: takes whole number values and consists of distinct recognizable
individual elements that can be counted. It is a variable that assumes a finite
or countable number of possible values. These values are obtained by counting
(0, 1, 2, ...).
• Continuous variable: takes any value including decimals. Such a variable can
theoretically assume an infinite number of possible values. These values are obtained
by measuring.
Example: Height, Weight, Time, Temperature, etc.
Generally the values of a variable can be obtained by counting for discrete
variables, by measuring for continuous variables or by making categories for qualitative
variables.
Example: Classify each of the following as qualitative or quantitative and, if it is quantitative,
classify it as discrete or continuous.
The level of measurement is one way in which variables can be classified. Broadly, this relates
to the level of information content implicit in the set of values and how each value may be
interpreted (mathematically) relative to other values on the variable - an issue which dictates
how the variable can be used and interpreted in statistical analysis. Consider the following
illustrations.
• Mr A wears shirt number 5 when he plays football and Mr B wears shirt number 6.
Based on the numbers on the shirts it is not possible to judge whether Mr B plays better.
But by using the test scores, it is possible to judge that Mr B did better in the exam. Also,
it is not possible to find the average shirt number (or the average shirt number means nothing)
because the numbers on the shirts are simply codes, but it is possible to obtain the average
test score. Therefore, the scale of measurement
also shows which mathematical operations and which statistical analyses are permissible
on the values of the variable.
Different measurement scales allow for different levels of exactness, depending upon the
characteristics of the variables being measured. The four types of scales available in statistical
analysis are
1. Nominal Scales apply to qualitative variables whose values identify the category to which
an individual belongs. They reflect classification into categories (names of groups) where there
is no particular order or ranking implied by the labels. Numbers may be assigned
to the categories simply for coding purposes. It is not possible to compare individuals
on the basis of the numbers assigned to them. The only mathematical operation permissible
on these variables is counting. These variables
2. Ordinal Scales apply to qualitative variables whose values can be
ordered and ranked. Ranking and counting are the only mathematical operations that can
be performed on the values of such variables, because the differences between the
values (categories) are not precisely defined.
Example: Academic Rank (BSc, MSc, PhD), Grade Scores (A, B, C, D, F), Strength
(Very Weak, Weak, Strong, Very Strong), Health Status (Very Sick, Sick, Cured), Economic
Status (Lower Class, Middle Class, Higher Class), etc.
3. Interval Scales apply to quantitative variables for which a value of
zero does not indicate absence of the characteristic, i.e. there is no true zero.
Zero is an arbitrary reference point on the scale. For example, for temperature measured in
degrees Celsius, the difference between 5℃ and 10℃ is treated the same as the difference
between 10℃ and 15℃. However, we cannot say that 20℃ is twice as hot as 10℃,
i.e. the ratio between two different values has no quantitative meaning. This is because
there is no absolute zero on the Celsius scale; 0℃ does not imply ‘no heat’.
4. Ratio Scales apply to quantitative variables for which a value of
zero indicates absence of the characteristic, i.e. there is a true zero.
All mathematical operations are permissible on the values
of such variables.
For instance, a zero unemployment rate implies zero unemployment. Thus, we can
legitimately say that an unemployment rate of 20 percent is twice a rate of 10 percent, or that
one person is twice as old as another. In the case of temperature, we can use the Kelvin
scale instead of the Celsius scale: the Kelvin scale is a ratio scale because 0 Kelvin is
‘absolute zero’ (-273℃) and this does imply no heat.
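The contrast between interval and ratio scales can be checked numerically. The short Python sketch below is only an illustration: the function and variable names are not from these notes, and it assumes the standard conversion offset of 273.15 between Celsius and Kelvin. It shows that differences agree on both scales, while only the Kelvin ratio is meaningful.

```python
# Illustrative sketch: ratios are meaningful on a ratio scale (Kelvin)
# but not on an interval scale (Celsius). Names here are hypothetical.

def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to Kelvin (0 K is absolute zero)."""
    return temp_c + 273.15

cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# The naive Celsius ratio suggests "twice as hot", which is not meaningful.
ratio_c = warm_c / cold_c

# The Kelvin ratio compares the temperatures relative to a true zero.
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"Difference: {diff_c:.2f} degrees C, {diff_k:.2f} K")
print(f"Celsius ratio: {ratio_c:.3f}")
print(f"Kelvin ratio:  {ratio_k:.3f}")
```

On the Kelvin scale the ratio comes out only slightly above 1, so 20℃ is far from being "twice as hot" as 10℃.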
2
Research findings are the output of relevant data that have been properly
and carefully collected and then analyzed with legitimate data
analysis tools. So, data are always a base (or an input) for research. This implies
that the quality of a study depends heavily on the quality of its data. Data can
be collected from different sources, which are generally grouped under two major categories,
namely, primary and secondary sources of data. Thus, regardless of their nature (i.e., qualitative
or quantitative, discrete or continuous, etc.), data are either:
1. Primary Data: Primary data are data collected by the investigator himself or herself
for the purpose of a specific inquiry or study. They are collected for
the first time, either through direct observation or by questioning individuals, under the
direct supervision and instruction of the researcher. Such data are original in character
and are generated in surveys conducted by individuals or research institutions.
2. Secondary Data: When an investigator uses data which have already been collected
by others, such data are called secondary data. These data are primary data for the agency
that collected them and become secondary data for someone else who uses them for
his or her own purposes. Secondary data can be obtained from journals, official reports,
government publications, publications of professional and research organizations and so
on.
Based on the role of time, data can be classified as cross-sectional and time series.
2. Time series data: a set of observations collected over a sequence of time points, usually at
equal intervals.
The first and foremost task in a statistical investigation is data collection. Before the actual
data collection, four important points should be considered. These are the purpose of data
collection (why do we need to collect data?), the data to be collected (what kind of data should
be collected?), the source of data (where can we get the data?) and the methods of data
collection (how can we collect the data?).
Once it is decided what type of study is to be made, it becomes necessary to collect information
about the concerned body. This information has to be collected from certain individuals
directly or indirectly. Such a technique is known as survey method. The survey methods
are commonly used in social sciences, i.e., problems related to sociology, political science,
psychology and various economic studies.
Another way of collecting data is experimentation, i.e., an actual experiment is conducted and
the observations (measurements and counts) are recorded. Such experimental studies
are common in the natural sciences, agriculture, biology, medical science, industry, etc.
2.2.1. Questionnaire
The most common methods of data collection for surveys are the personal interview and the
self-administered questionnaire. In these and other methods of data collection, it is necessary
to prepare a document, called a questionnaire, which contains a number of questions to be
answered and is used to record the responses.
A questionnaire is a form containing a cover letter, which introduces the person conducting
the survey and explains the objectives of the survey, and a set of related questions to be
answered by the respondents. One of the most important points in preparing it is that all
questions in it must be relevant to the objectives of the survey. In short, the following
points should be kept in mind while designing a questionnaire:
• Questions should be simple, short and easy to understand and they should convey one
and only one idea. Technical terms should be avoided.
• Sensitive questions (questions of a personal or financial nature) should be avoided. If such
information is needed, it should be obtained indirectly, by offering a set of ranges, and the
questions should be placed in the last part of the questionnaire.
Examples: age (0−25, 26−50, 51−75, > 75), salary (below 200, 200−500, 501−1000, > 1000).
• Leading questions should be completely avoided. If you ask a person “Do you not
smoke?”, the person will automatically say ‘Yes, I do not’.
Secondary data should be used with utmost care. The investigator, before using these data,
must observe that they possess the following characteristics.
1. Reliability of Data: The data collected from other sources should be reliable enough
to be used by the investigator. Determining and testing the reliability of secondary data
is the most important, as well as the most difficult, task. Reliability can be tested by answering
questions like:
2. Suitability of Data: Before secondary data are used, they must be evaluated to see whether
they can serve a purpose other than the one for which they were collected.
The suitability of data can be evaluated from the point of view of the nature and scope of
the investigation.
3. Adequacy of Data: Adequacy can be tested by evaluating the data in terms of area
coverage, level of accuracy, number of respondents who participated and so on.
Once the above points have been checked in the secondary data, the data are ready to be used
for further analysis.
It is almost impossible for management to deal with all the collected data in raw form,
as raw data are haphazard and unsystematic. In order to describe situations and make
inferences about the population, or even to describe the sample, the data must be organized in
some meaningful way.
Before further analysis, the collected data should be edited for completeness, consistency,
accuracy and homogeneity.
Consistency: Some information given by the respondent may not be compatible, in the
sense that a piece of information furnished by the individual either does not agree with some
other information or contradicts an earlier answer.
Accuracy: It is of vital importance. If the data are inaccurate, the conclusions drawn from
them have no relevance. If the investigator has made a false report or the respondent
has deliberately supplied wrong information, editing will be of no use. In recent
times, checks have been developed to attain accuracy, for example by sending supervisors to
check the work of investigators or by reinvestigating a few respondents after a certain gap
of time.
Homogeneity: To maintain homogeneity, the information sheets are checked to see whether
the unit of information or measurement is the same in all the questionnaires. If there are
differences, the values have to be converted to the same unit during editing.
The next important step towards organizing data is classification. Classification is the
separation of items according to similar characteristics and grouping them into various groups.
A table is a systematic arrangement of data in rows and columns, which is easy to understand
and makes data fit for further analysis and drawing conclusions. Tabulation should not be
confused with classification, as the two differ in many ways. Mainly, the purpose of classification
is to divide the data into homogeneous groups, whereas in tabulation the data are presented in
rows and columns. Hence, classification is a preliminary step prior to tabulation.
A statistical table, in general, should have the following parts.
2. Title: There should be a title at the top of every statistical table. The title should be
clear, concise and adequate. The title should answer the questions: What are the data?
Where are the data? How are the data classified? What is the time period of the data?
4. Caption: The caption labels the data presented in a column of the table. There may
be sub-captions in each caption.
5. Body: The body of the table is the most important part. The information given in the
rows and columns forms the body of the table. It contains the quantitative information
to be presented.
6. Footnote: Any explanatory note concerning the table itself, placed directly beneath
the table, is called a ‘footnote’. The main purpose of a footnote is to clarify some of the
specific items given in the table or to explain any ambiguities or omissions in
the data shown in the table.
7. Source Note: If the data is collected from secondary sources, a source note is given
to disclose the sources from which the data is collected.
Though the format of a table has already been discussed, some guidelines for preparing a
table are as follows:
The table should contain the required number of rows and columns with stubs and captions,
and the whole data should be accommodated within the cells formed corresponding
to these rows and columns.
If the quantity is zero, it should be entered as zero. Leaving blank space or putting
dash in place of zero is confusing and undesirable.
The unit of measurement should either be given in parentheses just below the column’s
caption or in parentheses along with the stub in the row.
If any figure in the table has to be specified for a particular purpose, it should be marked
with an asterisk or another symbol. The specification of the marked figure should be
The most convenient way of organizing numerical data is to construct a frequency distribution.
A frequency distribution is the organization of raw data in table form, using classes and
frequencies. Here the term ‘class’ is a description of a group of similar numbers in a data set
while ‘frequency’ is the number of times a variable value is repeated. Hence, ‘class frequency’
is the number of observations belonging to a certain class.
There are three types of frequency distributions: categorical, ungrouped and grouped frequency
distributions.
Example: The blood types of 22 students are given below. Construct a categorical frequency
distribution.
A B B AB O A O O B AB B
A B B O A O AB A O O AB
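The tally for this example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original solution; it uses `collections.Counter` from the standard library, and the variable names are chosen here for readability.

```python
from collections import Counter

# Blood types of the 22 students, copied from the example above.
blood_types = (
    "A B B AB O A O O B AB B "
    "A B B O A O AB A O O AB"
).split()

# Counter tallies how many times each category occurs.
frequency = Counter(blood_types)

# Print the classes in a fixed order, followed by the total.
print(f"{'Blood type':<12}{'Frequency':>9}")
for blood_type in ("A", "B", "AB", "O"):
    print(f"{blood_type:<12}{frequency[blood_type]:>9}")
print(f"{'Total':<12}{sum(frequency.values()):>9}")
```

The resulting categorical frequency distribution has frequencies 5, 6, 4 and 7 for types A, B, AB and O respectively, which add up to the 22 students.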
235433231043221114222
Basic Terms
Class Limits: the lowest and highest values that can be included in a class are called
class limits. The lowest values are called lower class limits and the highest values are
called upper class limits. For example: Class limit for the first class is 1-25, where 1 is
the lower class limit and 25 is the upper class limit of the first class.
Class Boundaries: are the class limits adjusted so that there is no gap between the UCL of one
class and the LCL of the next class. The lowest values are called lower class boundaries (LCB)
and the highest values are called upper class boundaries (UCB). The class boundaries for the
first class are 0.5-25.5, where the lower class boundary is 0.5 and the upper class boundary
is 25.5. Note that the UCB of one class is the LCB of the next class.
Class Width: the difference between UCB and LCB of a class. It is also the difference
between the lower limits of two consecutive classes or it is the difference between upper
limits of two consecutive classes.
\[
\begin{aligned}
w &= UCB_i - LCB_i \\
  &= LCL_i - LCL_{i-1} \\
  &= UCL_i - UCL_{i-1} \\
  &= CM_i - CM_{i-1}
\end{aligned}
\]
Class Mark: the midpoint of the class limits or, equivalently, of the class boundaries.
\[
CM_i = \frac{LCL_i + UCL_i}{2} = \frac{LCB_i + UCB_i}{2}
\]
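The basic terms above can be tied together with a small numerical sketch. Only the first class (1-25, with boundaries 0.5-25.5) comes from the text; the second and third classes, the variable names, and the use of a unit of measurement u = 1 (so that each boundary lies half a unit beyond the corresponding limit) are assumptions made for illustration.

```python
# Class limits (LCL, UCL). Only the first class, 1-25, comes from the text;
# the other two classes are assumed for illustration.
class_limits = [(1, 25), (26, 50), (51, 75)]
u = 1  # assumed unit of measurement (smallest gap between distinct values)

rows = []
for lcl, ucl in class_limits:
    lcb = lcl - u / 2                    # lower class boundary
    ucb = ucl + u / 2                    # upper class boundary
    width = ucb - lcb                    # class width w = UCB - LCB
    mark_from_limits = (lcl + ucl) / 2   # class mark from the limits
    mark_from_bounds = (lcb + ucb) / 2   # class mark from the boundaries
    rows.append((lcb, ucb, width, mark_from_limits, mark_from_bounds))
    print(f"limits {lcl}-{ucl}: boundaries {lcb}-{ucb}, "
          f"width {width}, class mark {mark_from_limits}")
```

For the first class this reproduces the boundaries 0.5-25.5 quoted above, with a width of 25 and a class mark of 13; both ways of computing the class mark give the same value.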
Relative Frequency
An absolute frequency distribution is a summary table in which the original data are
condensed into groups together with their frequencies.
But if a researcher would like to know the proportion or percentage of cases in
each group, instead of simply the number of cases, s/he can do so by constructing a
relative frequency distribution table. The relative frequency distribution is formed
by dividing the frequency in each class of the frequency distribution by the total number
of observations. It can be converted into a percentage frequency distribution by simply
multiplying each relative frequency by 100.
Relative frequencies are particularly helpful when comparing two or more frequency
distributions in which the numbers of cases under investigation are not equal.
Percentage distributions make such a comparison more meaningful, since percentages
are relative frequencies and hence the total number in the sample or population under
consideration becomes irrelevant.
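As a rough sketch of this calculation, the Python snippet below starts from the class frequencies obtained by tallying the blood-type data given earlier (A: 5, B: 6, AB: 4, O: 7, for 22 students) and derives the relative and percentage frequencies. The variable names are illustrative only.

```python
# Absolute frequencies tallied from the blood-type example (22 students).
frequency = {"A": 5, "B": 6, "AB": 4, "O": 7}
total = sum(frequency.values())

# Relative frequency = class frequency / total number of observations.
relative = {cls: count / total for cls, count in frequency.items()}

# Percentage frequency = relative frequency * 100.
percentage = {cls: 100 * rf for cls, rf in relative.items()}

print(f"{'Class':<6}{'f':>4}{'rf':>8}{'%':>8}")
for cls, count in frequency.items():
    print(f"{cls:<6}{count:>4}{relative[cls]:>8.3f}{percentage[cls]:>8.1f}")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a convenient check on the arithmetic.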
Cumulative Frequency
The above frequency distributions tell us the actual number (percentage) of units in each
class, but they do not tell us directly the total number (percentage) of units that lie below
or above specified values of the classes. This can be determined from a cumulative
frequency distribution. A cumulative frequency distribution displays the total number
of observations above (or below) a certain value. When the interest of the investigator
focuses on the number of items below a specified value, this specified value is taken as the
upper boundary of a class, and the result is known as a less than cumulative frequency
distribution. Similarly, when the interest lies in finding the number of cases above a specified
value, this value is taken as the lower boundary of a class, and the result is known as a
more than cumulative frequency distribution.
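The two kinds of cumulative frequency can be sketched with running totals. The class boundaries and frequencies below are hypothetical values chosen only for illustration; `itertools.accumulate` is the standard-library function for running sums.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(0.5, 25.5, 4), (25.5, 50.5, 10), (50.5, 75.5, 7), (75.5, 100.5, 3)]
freqs = [f for _, _, f in classes]

# "Less than" cumulative frequency: running total from the first class,
# read against each class's upper boundary.
less_than = list(accumulate(freqs))

# "More than" cumulative frequency: running total from the last class,
# read against each class's lower boundary.
more_than = list(accumulate(reversed(freqs)))[::-1]

for (lcb, ucb, f), lt, mt in zip(classes, less_than, more_than):
    print(f"{lcb:>5}-{ucb:<6} f={f:<3} less than {ucb}: {lt:<3} more than {lcb}: {mt}")
```

The last "less than" value and the first "more than" value both equal the total number of observations.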
(b) Find the unit of measurement (u). u is the smallest difference between any two
distinct values of the data.
(c) Find the Range(R). R is the difference between the largest and the smallest values
of the variable.
R = max − min
k = 1 + 3.322 log N
\[
w = \frac{R}{k} = \frac{R}{1 + 3.322 \log N}
\]
(f) Put the smallest value of the data set as the LCL of the first class. To obtain
the LCL of the second class add the class width w to the LCL of the first class.
Continue adding until you get k classes.
Obtain the UCLs of the frequency distribution by adding w−u to the corresponding
LCLs.
\[
LCB_i = LCL_i - \frac{u}{2} \quad \text{and} \quad UCB_i = UCL_i + \frac{u}{2} \quad \text{for } i = 2, 3, \ldots, k.
\]
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28
25 7 17 22 28 19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24
Solution:
3 3 7 8 9 11 12 13 13 15 16 17 17 18 19 19 20 20 21 21 22 22 22 22 22 22
23 23 23 23 23 24 24 24 24 24 25 25 26 26 26 27 27 28 28 28 29 30 31 33
u = 9 − 8 = 1, R = max − min = 33 − 3 = 30
k = 1 + 3.322 log N = 1 + 3.322 log 50 = 6.64 ≈ 7
w = R/k = 30/6.64 = 4.5 ≈ 5
w−u=5−1=4
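The steps worked out above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the notes' own procedure: the logarithm is taken to base 10, both k and w are rounded up (which matches 6.64 ≈ 7 and 4.5 ≈ 5 above), and all variable names are hypothetical.

```python
import math

# The 50 observations from the example above.
data = [
    16, 21, 26, 24, 11, 17, 25, 26, 13, 27, 24, 26, 3, 27, 23, 24, 15,
    22, 22, 12, 22, 29, 18, 22, 28, 25, 7, 17, 22, 28, 19, 23, 23, 22,
    3, 19, 13, 31, 23, 28, 24, 9, 20, 33, 30, 23, 20, 8, 21, 24,
]

n = len(data)
values = sorted(set(data))
# Unit of measurement: smallest difference between two distinct values.
u = min(b - a for a, b in zip(values, values[1:]))
data_range = max(data) - min(data)

k_exact = 1 + 3.322 * math.log10(n)   # number of classes before rounding
k = math.ceil(k_exact)                # rounded up (assumption)
w = math.ceil(data_range / k_exact)   # class width, rounded up (assumption)

table = []
for i in range(k):
    lcl = min(data) + i * w           # lower class limit
    ucl = lcl + (w - u)               # upper class limit
    freq = sum(lcl <= x <= ucl for x in data)
    table.append((lcl, ucl, lcl - u / 2, ucl + u / 2, freq))

print("limits  boundaries  frequency")
for lcl, ucl, lcb, ucb, freq in table:
    print(f"{lcl:>2}-{ucl:<2}  {lcb:>4}-{ucb:<4}  {freq:>2}")
```

With these choices the classes run from 3-7 to 33-37, and the class frequencies are 3, 4, 6, 13, 17, 6 and 1, which add up to the 50 observations.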
Advantages:
– It attracts the attention of even a layman and gives him an insight into the nature
of the distribution.
– It helps in further statistical analysis of the data, such as its central tendency, scatter
and symmetry.
Disadvantages:
– Because the selection of the class width and the lower class limit of the first class are
to a certain extent arbitrary, different frequency distributions may be constructed
for the same data and hence may give contradictory impressions.
This section covers methods for organizing and displaying data. Such methods provide summary
information about a data set and may be used to conduct exploratory data analyses.
The methods for providing summary information are essential to the development of hypotheses
and to establishing the groundwork for more complex statistical analyses.
Although data presented in a table are informative, tables are not always easy for every
reader to interpret. Showing data in the form of a graph can make complex and confusing
information appear simpler and more straightforward.
Bar Chart
It is the simplest and most commonly used diagrammatic representation of a frequency
distribution. It is the most common presentation for nominal, categorical or discrete data. It
uses a series of separated and equally spaced bars. The heights of the bars represent the
frequency or relative frequency of the classes. The width of the bars has no meaning;
however, all the bars should have the same width to avoid distortion, and the bars are
separated by a constant distance.
• Simple Bar Chart: is a diagram in which categories of a variable are marked on the X
axis and the frequencies of the categories are marked on the Y axis. It is applicable for
discrete variables, that is, for data given according to some period, places and timings.
These periods and timings are represented on the base line (X axis) at regular interval
and the corresponding frequencies are represented on the Y-axis.
– The width of the bar represents nothing (it is meaningless), but it should be equal
for all bars.
– It can also represent some magnitude (on the Y axis) over time, space, groups, etc
(on the X axis).
Example:
Construct simple bar chart for the following data.
• Component Bar Chart: is used when there is a desire to show how a total or aggregate is
divided into its component parts. The bars represent the total value of a variable with each
total broken into its component parts, and different colors are used for identification.
In such type of diagrams, a bar is subdivided into parts in proportion to the size of the
subdivision. These subdivided rectangles are shaded differently by lines, dots and colors
so that they will be very easy to compare the components. Sometimes the volumes of
different attributes may be greatly different.
For making meaningful comparisons, the components of the attributes are reduced to
percentages. In that case each attribute will have 100 as its maximum volume. This
sort of component bar chart is known as percentage bar chart.
Example:
Consider the following table and the corresponding chart.
• Multiple Bar Chart: is used to display data on more than one variable. In the multiple
bar diagram two or more sets of inter-related data are presented side by side.
Example: Consider the following table, which shows the export of some item for a given
country, and the corresponding chart.
Pie Chart
A pie chart is popularly used in practice to show the percentage breakdown of data. It is a circle
representing a set of data by dividing the circle into sectors proportional to the number of
items in the categories; equivalently, it is a circle representing the total, cut into slices in proportion to
the size of the parts that make up the total. It gives the proportional sizes of different data
groups as slices of a pie or a circle.
Histogram
Histogram is the most common graphical presentation of a frequency distribution for numer-
ical data. It uses a series of adjacent bars in which the width of each bar represents the class
width and the heights represent the frequency or relative frequency of the class. It is used for
grouped data in which the class boundaries are marked on the X axis and the frequencies are
marked along the Y axis.
Example:
In the following, the heights of 45 female students at Haramaya University are recorded to the
nearest inch. Construct a histogram by hand first. Check your result by using any statistical
package.
67 67 64 64 74 61 68 71 69 61 65 64 62 63 59
70 66 66 63 59 64 67 70 65 66 66 56 65 67 69
64 67 68 67 67 65 74 64 62 68 65 65 65 66 67
Frequency Polygon
It is a graph that consists of line segments connecting the intersection of the class marks
and the frequencies of a continuous frequency distribution. It can also be constructed from
histogram by joining the mid-points of each bar.
As there are two cumulative frequency distributions, there are two ogive (pronounced as “oh-
jive”) curves. These are the less than cumulative frequency which is a line graph joining the
intersection points of the upper class boundaries and their corresponding less than cumulative
frequencies and the more than cumulative frequency which is a line graph joining the inter-
section points of the lower class boundaries and their corresponding more than cumulative
frequencies.
Example: Consider the following ogive curves for the marks of 50 students.
2.5. Exercises
1. A car salesman takes inventory and finds that he has a total of 125 cars to sell. Of
these, 97 are the 2001 model, 11 are the 2000 model, 12 are the 1999 model, and 5 are
the 1998 model. Which two types of charts are most appropriate to display the data?
Construct one of the plots.
2. Define the following graphical methods and describe how they are used.
a) Bar chart
b) Histogram
d) Frequency polygon
e) Ogive
3. The following are the ages of 30 patients in the emergency room of a hospital on a
Friday night. Construct a histogram display from these data.
35 32 21 43 39 60
36 12 54 45 37 53
45 23 64 10 34 22
36 45 55 44 55 46
22 38 35 56 45 57
4. The final grades in Basic Statistics of 80 students at Haramaya University are recorded
in the accompanying table.
68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 81 72 63 76 75 85 77
(d) a histogram.
3
3.1. Introduction
Raw data are usually not suitable for drawing conclusions about the population from which
they were taken. Even after the data have been summarized in frequency distributions and
presented in graphs and diagrams, we still cannot easily make inferences, because many
groups remain. Organizing data into a frequency distribution is therefore not sufficient;
further condensation is needed, particularly when we want to compare two or more
distributions. In that case we may reduce each distribution to one number that represents it.
A single value which can be considered typical or representative of a set of observations,
and around which the observations can be considered as centered, is called an average (or
average value or center of location). Since such typical values tend to lie centrally within a
set of observations arranged according to magnitude, averages are called measures of
central tendency (MCT).
To condense a mass of data into one single value. That is, to get a single value which
best represents the data (one that describes the characteristics of the entire data set).
Measures of central tendency, by condensing masses of data into one single value, enable us
to get an idea of the entire data. Thus one value can represent thousands of observations or
even more.
To facilitate comparison. Statistical devices like averages, percentages and ratios are used
for this purpose. Measures of central tendency, by condensing masses of data into one single
value, facilitate comparison. For instance, to compare two classes A and B, instead
of comparing each student's result, which is practically infeasible, we can compare the
average marks of the two classes.
Suppose we have a variable x having successive values $x_1, x_2, \ldots, x_n$. The sum of these values,
$x_1 + x_2 + \cdots + x_n$, can be written using the Greek letter $\sum$ (sigma) as
$$x_1 + x_2 + \cdots + x_n = \sum_{i=1}^{n} x_i$$
Similarly,
$$x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^{n} x_i y_i$$
$$\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \frac{1}{x_4} = \sum_{i=1}^{4} \frac{1}{x_i}$$
Rules of Summation
There are many types of measures of central tendency, each possessing particular properties
and each being typical in some unique way. The most frequently encountered ones are
– Arithmetic mean (simple arithmetic mean, weighted arithmetic mean and combined mean)
– Geometric mean
– Harmonic mean
• Positional averages
– Median
3.6. Mean
1. Suppose a variable x has observed values x1 , x2 , ..., xn . The simple arithmetic mean
denoted by x̄ (for sample) and µ (for population) is the sum of these observations
divided by the total number of observations. Symbolically,
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Simple AM is the most commonly used average.
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i x_i}{\sum f_i}$$
3. For data in grouped frequency distribution we use the class mark instead of each ob-
served value and simple AM is given by
$$\bar{x} = \frac{f_1 m_1 + f_2 m_2 + \cdots + f_n m_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i m_i}{\sum f_i}$$
Example 1: The heights of 7 students selected from a class are given below in centimeters:
165, 160, 172, 168, 159, 170, 173. Calculate the simple AM of the heights.
$$\bar{x} = \frac{\sum_{i=1}^{7} x_i}{7} = \frac{x_1 + x_2 + \cdots + x_7}{7} = \frac{1167}{7} \approx 166.71 \text{ cm}$$
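As a quick check of the arithmetic, the simple AM of the seven heights can be computed in a few lines of Python. This is only an illustrative sketch; the variable names are not part of the text.

```python
# Heights (cm) of the 7 students in Example 1.
heights_cm = [165, 160, 172, 168, 159, 170, 173]

total = sum(heights_cm)            # sum of the observations
x_bar = total / len(heights_cm)    # simple arithmetic mean

print(total, round(x_bar, 2))
```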
Example 2: The following is the frequency distribution of marks in Stat 1011 of 46 students
(out of 20). Find the mean mark of this class.
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i x_i}{\sum f_i} = \frac{623}{46} = 13.54$$
Example 3: Calculate the mean amount of yield of maize from the grouped frequency
distribution given below.
The weighted arithmetic mean is used when the observations in the data have unequal relative
importance (technically termed weight). Suppose $x_1, x_2, \ldots, x_n$ have weights $w_1, w_2, \ldots, w_n$
respectively; then the weighted arithmetic mean ($\bar{x}_w$) is given by
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}$$
Example: Semester grade point average (GPA) of a student is a good example of weighted
arithmetic mean.
Course Weights (Credit hours) Grade (x)
Stat 281 4 B=3
Math 261 4 B=3
Math 224 3 C=2
Phil 201 3 B=3
Comp 201 3 C=2
Calculate the GPA of this student.
$$GPA = \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5}{w_1 + w_2 + w_3 + w_4 + w_5} = \frac{45}{17} \approx 2.65$$
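The GPA calculation is a direct application of the weighted mean formula. A minimal Python sketch, using the credit hours as weights and the grade points as values (the function name `weighted_mean` is illustrative):

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Credit hours (weights) and grade points (values) from the table.
credit_hours = [4, 4, 3, 3, 3]
grade_points = [3, 3, 2, 3, 2]

weighted_sum = sum(w * x for w, x in zip(credit_hours, grade_points))
gpa = weighted_mean(grade_points, credit_hours)

print(weighted_sum, sum(credit_hours), round(gpa, 2))
```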
Combined Mean
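If there are k different groups (having the same unit of measurement) with means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$
and numbers of observations $n_1, n_2, \ldots, n_k$ respectively, then the mean of all the groups, i.e. the
combined mean, is given by
$$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n_i \bar{x}_i}{\sum n_i}$$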
• If a constant k is added to or subtracted from each value in a distribution, then the new
mean will be
$$\bar{x}_{new} = \bar{x}_{old} \pm k$$
• If each value of a distribution is multiplied by a constant k, the new mean will be the
original mean multiplied by k. That is,
$$\bar{x}_{new} = k\bar{x}_{old}$$
• The arithmetic mean can be calculated for any set of quantitative data, and it will
be unique. We cannot calculate the AM for an open-ended grouped frequency distribution.
• It lends itself to further statistical analysis, for example in the combined mean.
• The algebraic sum of the deviations of each value from the arithmetic mean is zero.
That is,
$$\sum (x_i - \bar{x}) = 0$$
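These properties are easy to verify numerically. The sketch below uses the seven heights from Example 1 and an arbitrarily chosen constant k = 10; both choices are for illustration only.

```python
import math

def mean(values):
    return sum(values) / len(values)

heights = [165, 160, 172, 168, 159, 170, 173]
k = 10                                   # arbitrary constant for illustration

x_bar = mean(heights)
shifted_mean = mean([x + k for x in heights])    # should equal x_bar + k
scaled_mean = mean([k * x for x in heights])     # should equal k * x_bar
deviation_sum = sum(x - x_bar for x in heights)  # should be (numerically) zero

print(math.isclose(shifted_mean, x_bar + k))
print(math.isclose(scaled_mean, k * x_bar))
print(math.isclose(deviation_sum, 0.0, abs_tol=1e-9))
```

The comparisons use a tolerance because floating-point sums are rarely exactly zero.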
Example 1: The mean age of a group of 100 students was found to be 32.02 years. Later it
was discovered that age of 57 was misread as 27. Find the correct mean.
Solution:
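Let $\bar{x}_{cor}$ and $\bar{x}_{wr}$ be the correct and wrong means respectively. From the given problem,
$\bar{x}_{wr} = 32.02$, $n = 100$, $x_{wr} = 27$ and $x_{cor} = 57$.
$$\bar{x}_{wr} = \frac{(\sum x_i)_{wr}}{n} \;\Rightarrow\; \left(\sum x_i\right)_{wr} = \bar{x}_{wr} \times n = 32.02 \times 100 = 3202$$
$$\left(\sum x_i\right)_{cor} = \left(\sum x_i\right)_{wr} - x_{wr} + x_{cor} = 3202 - 27 + 57 = 3232$$
$$\bar{x}_{cor} = \frac{(\sum x_i)_{cor}}{n} = \frac{3232}{100} = 32.32 \text{ years}$$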
Example 2: The mean weight of 150 students in certain class is 60 kg. The mean weight of
boys in the class is 70 kg and that of the girls is 55 kg. Find the number of boys and girls in
the class.
Solution:
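Let $n_b$ and $n_g$ be the numbers of boys and girls in the class, respectively. Further, let
$\bar{\bar{x}} = 60$ kg, $\bar{x}_b = 70$ kg and $\bar{x}_g = 55$ kg be the mean weights of all students, of the boys and of
the girls, respectively. Then
$$n_b + n_g = 150 \qquad (3.1)$$
From the combined mean, $60 = \dfrac{70 n_b + 55 n_g}{n_b + n_g}$, so $5 n_g = 10 n_b$, that is,
$$n_g = 2 n_b \qquad (3.2)$$
Substituting (3.2) into (3.1) gives $3 n_b = 150$, so $n_b = 50$ and $n_g = 100$.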
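The same result follows from solving the combined-mean equation directly. The sketch below assumes the two-group form of the combined mean, (n_b x̄_b + n_g x̄_g)/(n_b + n_g); the helper `combined_mean` is an illustrative name.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

n_total = 150
mean_all, mean_boys, mean_girls = 60, 70, 55

# Solve 70*n_b + 55*(150 - n_b) = 60*150 for n_b.
n_boys = n_total * (mean_all - mean_girls) / (mean_boys - mean_girls)
n_girls = n_total - n_boys

print(n_boys, n_girls)
print(combined_mean([mean_boys, mean_girls], [n_boys, n_girls]))
```

Plugging the group sizes back into `combined_mean` reproduces the overall mean of 60 kg.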
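The geometric mean of n positive numbers is the nth root of their product. The geometric
mean of $x_1, x_2, \ldots, x_n$ is given by the following for raw data, ungrouped frequency
distributions and grouped frequency distributions respectively.
$$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$$
$$GM = \sqrt[\sum f_i]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_n^{f_n}} = \sqrt[\sum f_i]{\prod_{i=1}^{n} x_i^{f_i}}$$
$$GM = \sqrt[\sum f_i]{m_1^{f_1} \times m_2^{f_2} \times \cdots \times m_n^{f_n}} = \sqrt[\sum f_i]{\prod_{i=1}^{n} m_i^{f_i}}$$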
Examples
1. A given epidemic was spreading at the rate of 1.5 and 2.67 in two successive days. What
is its average spread rate?
Solution:
$$GM = \sqrt{x_1 \times x_2} = \sqrt{1.5 \times 2.67} = \sqrt{4.005} = 2.001$$
2. The price of a commodity increased by 5% from 1989 to 1990, 8% from 1990 to 1991
and by 77% from 1991 to 1992. Find the average price increase.
Solution:
For increment, take the base line value as 100% and then add the % increase so as to
get the values in successive years.
Then,
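With $x_i$ = 105, 108 and 177,
$$GM = \text{antilog}\left(\frac{1}{n}\sum \log x_i\right) = \text{antilog}\left(\frac{1}{3} \times 6.3026\right) = \text{antilog}(2.1009) \approx 126.14$$
Therefore, the average price increase is about 26.14% per year.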
3. A machine depreciated by 10% each in the first two years and by 40% in the third year.
Find out the average rate of depreciation.
Solution:
Like the previous one, take the base line value of the machine as 100% and then deduct
the % of depreciation so as to get the depreciated values in successive years.
Then,
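With $x_i$ = 90, 90 and 60,
$$GM = \text{antilog}\left(\frac{1}{n}\sum \log x_i\right) = \text{antilog}\left(\frac{1}{3} \times 5.6866\right) = \text{antilog}(1.8955) \approx 78.62$$
Therefore, the machine depreciated by about 21.38% per year on average.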
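The three geometric-mean examples can be recomputed without logarithm tables. The sketch below relies on `statistics.geometric_mean` from the Python standard library (available from Python 3.8) and follows the text in expressing each year's value as a percentage of the previous year's value.

```python
from statistics import geometric_mean

# Example 1: spread rates on two successive days.
gm_spread = geometric_mean([1.5, 2.67])

# Example 2: increases of 5%, 8% and 77% -> 105, 108, 177 (base = 100).
gm_price = geometric_mean([105, 108, 177])
avg_increase = gm_price - 100          # average percentage increase

# Example 3: depreciation of 10%, 10% and 40% -> 90, 90, 60 (base = 100).
gm_machine = geometric_mean([90, 90, 60])
avg_depreciation = 100 - gm_machine    # average percentage depreciation

print(round(gm_spread, 3))
print(round(avg_increase, 2))
print(round(avg_depreciation, 2))
```

Working at full precision avoids the rounding of the logarithm sum that hand calculation with tables introduces.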
Harmonic mean is another specialized average which is useful in averaging variables expressed
as a rate per unit of time, such as speed or the number of units produced per day. The simple
harmonic mean of $x_1, x_2, \ldots, x_n$ is
$$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum \frac{1}{x_i}}$$
The simple HM is preferably used to calculate average speed for fixed distance, average
price for fixed total cost, average time for fixed total distance.
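$$HM = \frac{f_1 + f_2 + \cdots + f_n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_n}{x_n}} = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}$$
$$HM = \frac{f_1 + f_2 + \cdots + f_n}{\frac{f_1}{m_1} + \frac{f_2}{m_2} + \cdots + \frac{f_n}{m_n}} = \frac{\sum f_i}{\sum \frac{f_i}{m_i}}$$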
The weighted HM is used to compute mean speed to cover differing distances, mean
prices when the total cost is not fixed, etc.
Examples
1. A driver travels for 3 days at speed of 48 km/hr for about 10 hrs, 40 km/hr for 12 hrs,
32 km/hr for 15 hrs respectively. What is the average speed of the driver in 3 days?
Solution:
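Using $d_i = s_i \times t_i$, $i = 1, 2, 3$, the distance covered on each of the three days is the same, 480 km.
So the simple HM is appropriate to compute the average speed.
$$HM = \frac{3}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3}} = \frac{3}{\sum \frac{1}{x_i}} = \frac{3}{\frac{1}{48} + \frac{1}{40} + \frac{1}{32}} = \frac{3}{0.07708} \approx 38.92 \text{ km/hr}$$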
2. A driver travelled for 3 days. On the first day he drove for 10 hrs at a speed of 48 km/hr,
on the second day for 12 hrs at 45 km/hr, and on the third day for 15 hrs at 40 km/hr. What
is the average speed?
Solution:
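Using $d_i = s_i \times t_i$, $i = 1, 2, 3$, the distance covered each day is not fixed: it is
480 km, 540 km and 600 km respectively. So the weighted HM, with the distances as weights
($w_i = d_i$), is appropriate to compute the average speed.
$$HM_w = \frac{w_1 + w_2 + w_3}{\frac{w_1}{x_1} + \frac{w_2}{x_2} + \frac{w_3}{x_3}} = \frac{\sum w_i}{\sum \frac{w_i}{x_i}} = \frac{480 + 540 + 600}{\frac{480}{48} + \frac{540}{45} + \frac{600}{40}} = \frac{1620}{37} \approx 43.78 \text{ km/hr}$$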
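Both speed examples can be checked with one small function. This is a sketch that assumes the weights of the weighted HM are the distances travelled, so that the result equals total distance divided by total time; the function name is illustrative.

```python
def harmonic_mean(values, weights=None):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i).

    With no weights given, every value gets weight 1 (simple HM).
    """
    if weights is None:
        weights = [1] * len(values)
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

# Example 1: equal distances (480 km each day), so the simple HM applies.
hm_simple = harmonic_mean([48, 40, 32])

# Example 2: unequal distances, so weight each speed by its distance.
speeds = [48, 45, 40]
hours = [10, 12, 15]
distances = [s * t for s, t in zip(speeds, hours)]
hm_weighted = harmonic_mean(speeds, weights=distances)

# Cross-check: average speed is total distance over total time.
overall_speed = sum(distances) / sum(hours)

print(round(hm_simple, 2), round(hm_weighted, 2), round(overall_speed, 2))
```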
• The GM of two numbers $x_1$ and $x_2$ is equal to the GM of their AM and HM. That is,
$$GM = \sqrt{x_1 \times x_2} = \sqrt{AM \times HM}$$
3.7. Mode
The mode (modal value) of data set is the value that occurs most frequently. When two values
occur with the same greatest frequency, each one is a mode and the data set is bimodal. When
more than two values occur with the greatest frequency, each is a mode and the data set is
said to be multimodal. When no value is repeated or values are equally repeated, we say that
there is no mode.
• 5 5 5 3 1 5 1 4 3 5 (the mode is 5)
• 1 2 2 2 3 4 5 6 6 6 7 9 (bimodal: the modes are 2 and 6)
• 1 2 3 6 7 8 9 10 (no mode)
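The definition above can be turned into a small helper. The sketch below follows the convention stated in the text that a data set has no mode when all values occur equally often; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(data)
    highest = max(counts.values())
    if all(c == highest for c in counts.values()):
        return []                      # convention in the text: no mode
    return sorted(v for v, c in counts.items() if c == highest)

unimodal = modes([5, 5, 5, 3, 1, 5, 1, 4, 3, 5])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9])
no_mode = modes([1, 2, 3, 6, 7, 8, 9, 10])

print(unimodal, bimodal, no_mode)
```

Note that the standard-library function `statistics.multimode` would instead return every value for the third data set, so the convention has to be coded explicitly.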
In a frequency distribution, the mode is located in the class with highest frequency and that
class is the modal class. Then the formula for mode is
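$$\hat{x} = L_{\hat{x}} + \frac{f_{\hat{x}} - f_{\hat{x}-1}}{(f_{\hat{x}} - f_{\hat{x}-1}) + (f_{\hat{x}} - f_{\hat{x}+1})}\, w$$
where $L_{\hat{x}}$ is the lower class boundary of the modal class, $w$ is the class width, $f_{\hat{x}}$ is the
frequency of the modal class, and $f_{\hat{x}-1}$ and $f_{\hat{x}+1}$ are the frequencies of the classes
preceding and following the modal class, respectively.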
Example: Use the frequency distribution of heights in the following table to find the mode
of height of the 100 male students at XYZ university and interpret the result.
Solution:
A class having the highest frequency is considered as a modal class. Thus the 3rd class
(65.5-68.5) is the modal class.
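$$\begin{aligned}
\hat{x} &= L_{\hat{x}} + \frac{f_{\hat{x}} - f_{\hat{x}-1}}{(f_{\hat{x}} - f_{\hat{x}-1}) + (f_{\hat{x}} - f_{\hat{x}+1})}\, w \\
&= 65.5 + \frac{42 - 18}{(42 - 18) + (42 - 27)} \times 3 \\
&= 65.5 + \frac{24}{39} \times 3 \\
&= 65.5 + 1.846 \\
&= 67.346
\end{aligned}$$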
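The grouped-data mode formula translates directly into a function. A minimal sketch, with illustrative parameter names and the values of the modal class used in this example:

```python
def grouped_mode(lower_boundary, width, f_prev, f_modal, f_next):
    """Mode of grouped data from the modal class and its two neighbours."""
    d1 = f_modal - f_prev      # excess over the preceding class
    d2 = f_modal - f_next      # excess over the following class
    return lower_boundary + d1 / (d1 + d2) * width

mode_height = grouped_mode(lower_boundary=65.5, width=3,
                           f_prev=18, f_modal=42, f_next=27)
print(round(mode_height, 3))
```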
The mode is not affected by extreme values and can be calculated for open-ended classes. But it
often does not exist, and its value may not be unique. In such cases the mode is ill-defined.
Properties of Mode
3. The mode can be used for both qualitative (such as religious preference, gender, political
affiliation, etc) and quantitative data types.
3.8. Median
A median is a value which divides a set of data into two equal parts such that the number of
observations below it is the same as the number of observations above it. It is the middle
value when the values are arranged in order of increasing (or decreasing) magnitude. To
find the median, first sort the values (arrange them in order), then use one of the following
procedures.
1. If the number of values is odd, the median is the number that is located in the exact
middle of the list.
$$\tilde{x} = \left(\frac{n+1}{2}\right)^{th} \text{ value}$$
Example: What is the median of 180, 201, 220, 191, 219, 209 and 220.
Solution:
First we have to sort the data: 180, 191, 201, 209, 219, 220, 220. Since n = 7 is
odd,
$$\tilde{x} = \left(\frac{7+1}{2}\right)^{th} \text{ value} = 4^{th} \text{ value} = 209$$
2. If the number of values is even, the median is found by computing the mean of the two
middle numbers.
$$\tilde{x} = \frac{\left(\frac{n}{2}\right)^{th} \text{ value} + \left(\frac{n}{2}+1\right)^{th} \text{ value}}{2}$$
Example: What is the median of 62, 63, 64, 65, 66, 66, 68 and 78.
Solution:
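First we have to sort the data: 62, 63, 64, 65, 66, 66, 68, 78. Since n = 8 is even,
$$\begin{aligned}
\tilde{x} &= \frac{\left(\frac{n}{2}\right)^{th} \text{ value} + \left(\frac{n}{2}+1\right)^{th} \text{ value}}{2} \\
&= \frac{4^{th} \text{ value} + 5^{th} \text{ value}}{2} \\
&= \frac{65 + 66}{2} = 65.5
\end{aligned}$$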
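Both cases (odd and even n) can be handled by one function that applies the position rules above. The sketch below also compares the result with `statistics.median` from the Python standard library; the helper name is illustrative.

```python
from statistics import median

def median_by_position(values):
    """Median via the (n+1)/2 rule (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                # the ((n+1)/2)th value
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # mean of the (n/2)th and (n/2 + 1)th values

odd_data = [180, 201, 220, 191, 219, 209, 220]
even_data = [62, 63, 64, 65, 66, 66, 68, 78]

median_odd = median_by_position(odd_data)
median_even = median_by_position(even_data)

print(median_odd, median_even)
print(median(odd_data), median(even_data))   # library cross-check
```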
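$$\tilde{x} = L_{\tilde{x}} + \frac{\frac{n}{2} - F_{\tilde{x}-1}}{f_{\tilde{x}}}\, w$$
where $L_{\tilde{x}}$ is the lower class boundary of the median class, $F_{\tilde{x}-1}$ is the less than cumulative
frequency of the class preceding the median class, $f_{\tilde{x}}$ is the frequency of the median class,
$w$ is the class width and $n$ is the total number of observations.
The median class is the class which includes the $\left(\frac{n}{2}\right)^{th}$ value.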
Example: The following table shows a frequency distribution of grades on a final examination
in college algebra for 120 students. Then, obtain median and interpret the results.
Grade No of students
30-39 1
40-49 3
50-59 11
60-69 21
70-79 43
80-89 32
90-99 9
Solution:
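The class which includes the $\left(\frac{n}{2}\right)^{th} = 60^{th}$ value is considered as the median class. The less than
cumulative frequencies are 1, 4, 15, 36, 79, 111 and 120, so the 60th value falls in the 5th class
(70-79, with class boundaries 69.5-79.5). Hence, the 5th class is the median class.
$$\begin{aligned}
\tilde{x} &= L_{\tilde{x}} + \frac{\frac{n}{2} - F_{\tilde{x}-1}}{f_{\tilde{x}}}\, w \\
&= 69.5 + \left(\frac{\frac{120}{2} - 36}{43}\right) \times 10 \\
&= 69.5 + 5.581 \\
&= 75.081
\end{aligned}$$
Therefore, out of 120 students, 60 of them scored less than 75.081 and 60 of them scored
greater than 75.081 on the college algebra examination.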
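The grouped-data median can be recomputed from the grade table with a short function that locates the median class from the cumulative frequencies. This is a sketch under the assumption that class boundaries lie 0.5 below the lower class limits (29.5, 39.5, and so on); the names are illustrative.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L + ((n/2 - F) / f) * w."""
    n = sum(frequencies)
    target = n / 2
    cumulative = 0                      # less-than cumulative frequency so far
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= target:
            # This class contains the (n/2)th value.
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency table is empty")

lower_boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
frequencies = [1, 3, 11, 21, 43, 32, 9]

cumulative_before_median_class = sum(frequencies[:4])
median_grade = grouped_median(lower_boundaries, frequencies, width=10)

print(cumulative_before_median_class, round(median_grade, 3))
```

The cumulative frequency of the four classes below the median class is 1 + 3 + 11 + 21 = 36.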
1. It is an average of location, not the average of the values in the data set.
3.9. Quantiles
The median gives us a value which divides the data set in to two equal parts. There are
also other positional measures that divide a given data set into more than two equal parts.
These measures are collectively known as quantiles. Quantiles include quartiles, deciles and
percentiles.
Quartiles are three points that divide the array into four parts in such a way that each portion
contains an equal number of observations. The first, second and third points are called the
first ($Q_1$), second ($Q_2$) and third ($Q_3$) quartiles respectively. 25% of the data fall below
$Q_1$, 50% below $Q_2$ and 75% below $Q_3$, and
$$Q_1 \le Q_2 \le Q_3$$
Deciles are nine points that divide the array into ten equal parts. The first, second, . . . ,
ninth deciles are denoted by $D_1, D_2, \ldots, D_9$ respectively. 10% of the data fall below $D_1$,
20% below $D_2$, . . . , 90% below $D_9$, and
$$D_1 \le D_2 \le \ldots \le D_9$$
Percentiles are ninety-nine points that divide the array into 100 equal parts. They are
denoted by $P_1, P_2, \ldots, P_{99}$. Always
$$P_1 \le P_2 \le \ldots \le P_{99}$$
1. For raw data and data in ungrouped frequency distribution. After arranging data in
ascending order, we apply the following formula.
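$$Q_i = \left(\frac{i(n+1)}{4}\right)^{th} \text{ value}, \quad i = 1, 2, 3$$
$$D_i = \left(\frac{i(n+1)}{10}\right)^{th} \text{ value}, \quad i = 1, 2, \ldots, 9$$
$$P_i = \left(\frac{i(n+1)}{100}\right)^{th} \text{ value}, \quad i = 1, \ldots, 99$$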
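Example: Given the data 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515, find the
following quantiles.
$$\begin{aligned}
Q_1 &= \left(\frac{1 \times (10+1)}{4}\right)^{th} \text{ value} = 2.75^{th} \text{ value} \\
&= 2^{nd} \text{ value} + 0.75(3^{rd} \text{ value} - 2^{nd} \text{ value}) \\
&= 433.75
\end{aligned}$$
$$\begin{aligned}
Q_2 &= \left(\frac{2 \times (10+1)}{4}\right)^{th} \text{ value} = 5.5^{th} \text{ value} \\
&= 5^{th} \text{ value} + 0.5(6^{th} \text{ value} - 5^{th} \text{ value}) \\
&= 445
\end{aligned}$$
$$\begin{aligned}
Q_3 &= \left(\frac{3 \times (10+1)}{4}\right)^{th} \text{ value} = 8.25^{th} \text{ value} \\
&= 8^{th} \text{ value} + 0.25(9^{th} \text{ value} - 8^{th} \text{ value}) \\
&= 502.5
\end{aligned}$$
$$\begin{aligned}
D_1 &= \left(\frac{1 \times (10+1)}{10}\right)^{th} \text{ value} = 1.1^{th} \text{ value} \\
&= 1^{st} \text{ value} + 0.1(2^{nd} \text{ value} - 1^{st} \text{ value}) \\
&= 421
\end{aligned}$$
$$\begin{aligned}
D_7 &= \left(\frac{7 \times (10+1)}{10}\right)^{th} \text{ value} = 7.7^{th} \text{ value} \\
&= 7^{th} \text{ value} + 0.7(8^{th} \text{ value} - 7^{th} \text{ value}) \\
&= 497
\end{aligned}$$
$$\begin{aligned}
P_{40} &= \left(\frac{40 \times (10+1)}{100}\right)^{th} \text{ value} = 4.4^{th} \text{ value} \\
&= 4^{th} \text{ value} + 0.4(5^{th} \text{ value} - 4^{th} \text{ value}) \\
&= 439.2
\end{aligned}$$
$$\begin{aligned}
P_{75} &= \left(\frac{75 \times (10+1)}{100}\right)^{th} \text{ value} = 8.25^{th} \text{ value} \\
&= 8^{th} \text{ value} + 0.25(9^{th} \text{ value} - 8^{th} \text{ value}) \\
&= 502.5
\end{aligned}$$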
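The position rule i(n+1)/m with linear interpolation can be sketched as a single Python function, where m is 4 for quartiles, 10 for deciles and 100 for percentiles. The function name `quantile` is illustrative. As a cross-check, `statistics.quantiles` with `method="exclusive"` uses the same (n+1)-based positions.

```python
from statistics import quantiles

def quantile(values, i, m):
    """i-th of the quantiles that split the data into m parts, position i(n+1)/m."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / m
    whole = int(position)              # e.g. 2 for the 2.75th value
    fraction = position - whole        # e.g. 0.75
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)

data = [420, 430, 435, 438, 441, 449, 490, 500, 510, 515]

q1, q2, q3 = (quantile(data, i, 4) for i in (1, 2, 3))
d7 = quantile(data, 7, 10)
p75 = quantile(data, 75, 100)

print(q1, q2, q3, round(d7, 2), p75)
print(quantiles(data, n=4, method="exclusive"))   # library cross-check
```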
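$$Q_i = L_{q_i} + \frac{\left(\frac{in}{4} - F_{q_{i-1}}\right)}{f_{q_i}}\, w$$
$$D_i = L_{d_i} + \frac{\left(\frac{in}{10} - F_{d_{i-1}}\right)}{f_{d_i}}\, w$$
$$P_i = L_{p_i} + \frac{\left(\frac{in}{100} - F_{p_{i-1}}\right)}{f_{p_i}}\, w$$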
where
Lqi , Ldi , Lpi are the lower class boundaries of the classes containing the con-
cerned quantile points,
Fqi−1 , Fdi−1 , Fpi−1 are the LCF of the class which precedes the class containing
the concerned quantile points,
fqi , fdi , fpi are frequencies of classes containing the concerned quantile points
and
w is the class width of a class containing the concerned quantile point.
Note
• $Q_i$ is found in the class containing the $\left(\frac{in}{4}\right)^{th}$ observation.
• $D_i$ is found in the class containing the $\left(\frac{in}{10}\right)^{th}$ observation.
• $P_i$ is found in the class containing the $\left(\frac{in}{100}\right)^{th}$ observation.
Example: Calculate all quartiles, the 5th and 8th deciles, and the 30th and 90th percentiles
for the students score data and interpret the results.
Solution:
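$Q_1$ is found in the 3rd class (18.5-22.5) because this class includes the $\left(\frac{1 \times 56}{4}\right)^{th} = 14^{th}$ value.
$$\begin{aligned}
Q_1 &= L_{q_1} + \frac{\left(\frac{1 \times 56}{4} - F_{q_0}\right)}{f_{q_1}} \times 4 \\
&= 18.5 + \frac{\left(\frac{1 \times 56}{4} - 11\right)}{8} \times 4 \\
&= 18.5 + 1.5 = 20
\end{aligned}$$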
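$Q_2$ is found in the 4th class (22.5-26.5) because this class includes the $\left(\frac{2 \times 56}{4}\right)^{th} = 28^{th}$ value.
$$\begin{aligned}
Q_2 &= L_{q_2} + \frac{\left(\frac{2 \times 56}{4} - F_{q_1}\right)}{f_{q_2}} \times 4 \\
&= 22.5 + \frac{\left(\frac{2 \times 56}{4} - 19\right)}{10} \times 4 \\
&= 22.5 + 3.6 = 26.1
\end{aligned}$$
$Q_3$ is found in the 6th class (30.5-34.5) because this class includes the $\left(\frac{3 \times 56}{4}\right)^{th} = 42^{nd}$ value.
$$\begin{aligned}
Q_3 &= L_{q_3} + \frac{\left(\frac{3 \times 56}{4} - F_{q_2}\right)}{f_{q_3}} \times 4 \\
&= 30.5 + \frac{\left(\frac{3 \times 56}{4} - 41\right)}{7} \times 4 \\
&= 30.5 + 0.57 = 31.07
\end{aligned}$$
$D_5$ is found in the 4th class (22.5-26.5) because this class includes the $\left(\frac{5 \times 56}{10}\right)^{th} = 28^{th}$ value.
$$\begin{aligned}
D_5 &= L_{d_5} + \frac{\left(\frac{5 \times 56}{10} - F_{d_4}\right)}{f_{d_5}} \times 4 \\
&= 22.5 + \frac{\left(\frac{5 \times 56}{10} - 19\right)}{10} \times 4 \\
&= 22.5 + 3.6 = 26.1
\end{aligned}$$
$D_8$ is found in the 6th class (30.5-34.5) because this class includes the $\left(\frac{8 \times 56}{10}\right)^{th} = 44.8^{th}$ value.
$$\begin{aligned}
D_8 &= L_{d_8} + \frac{\left(\frac{8 \times 56}{10} - F_{d_7}\right)}{f_{d_8}} \times 4 \\
&= 30.5 + \frac{\left(\frac{8 \times 56}{10} - 41\right)}{7} \times 4 \\
&= 30.5 + 2.17 = 32.67
\end{aligned}$$
$P_{30}$ is found in the 3rd class (18.5-22.5) because this class includes the $\left(\frac{30 \times 56}{100}\right)^{th} = 16.8^{th}$ value.
The frequency of this class is 8, the same value used for $Q_1$.
$$\begin{aligned}
P_{30} &= L_{p_{30}} + \frac{\left(\frac{30 \times 56}{100} - F_{p_{29}}\right)}{f_{p_{30}}} \times 4 \\
&= 18.5 + \frac{\left(\frac{30 \times 56}{100} - 11\right)}{8} \times 4 \\
&= 18.5 + 2.9 = 21.4
\end{aligned}$$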
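$P_{90}$ is found in the 7th class (34.5-38.5) because this class includes the $\left(\frac{90 \times 56}{100}\right)^{th} = 50.4^{th}$ value.
$$\begin{aligned}
P_{90} &= L_{p_{90}} + \frac{\left(\frac{90 \times 56}{100} - F_{p_{89}}\right)}{f_{p_{90}}} \times 4 \\
&= 34.5 + \frac{\left(\frac{90 \times 56}{100} - 48\right)}{8} \times 4 \\
&= 34.5 + 1.2 = 35.7
\end{aligned}$$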
3.10. Exercises
1. Define and compare the characteristics of the mean, the median and the mode.
2. Your statistics instructor tells you on the first day of class that there will be five tests
during the term. From the scores on these tests for each student he will compute a
measure of central tendency that will serve as the student’s final course grade. Before
taking the first test you must choose whether you want your final grade to be the mean
or the median of the five test scores. Which would you choose? Why? Justify your
answer.
3. A student’s final grades in mathematics, physics, chemistry and sport are, respectively,
82, 86, 90, and 70. If the respective credits received for these courses are 3, 5, 3, and 2,
determine an appropriate average grade.
4. A large department store collects data on sales made by each of its sales people. The
number of sales made on a given day by each of 20 sales people is shown below.
9 6 12 10 13 15 16 14 14 16 17 16 24 21 22 18 19 18 20 17
5. In a certain investigation, 460 persons were involved in the study, and based on an
enquiry on their age, it was known that 75% of them were 22 or more. The following
frequency distribution shows the age composition of the persons under study.
(a) Find the median and modal life of condensers and interpret them.
(c) Compute the 5th decile, 25th percentile, 50th percentile and the 75th percentile and
interpret the results.
(a) If 75% of the items were sold in birr 45 or less and most items were sold in birr 34,
find the missing frequencies.
(b) If 25% of the items were sold in greater than or equal to birr 45 and most items
were sold in birr 34, find the missing frequencies.
4
Measures of Variation
4.1. Introduction
In the third chapter, we concentrated on a central value (measures of central tendency), which
gives an idea of the whole mass that is a complete set of values. However the information
so obtained is neither exhaustive nor comprehensive, as the mean does not lead us to know
whether the observations are close to each other or far apart. Median is a positional average
and has nothing to do with the variability of the observations in a data set. Mode is the
most frequently occurring value, independent of the other values in the set. This leads us to conclude
that a measure of central tendency is not enough to have a clear idea about the data unless all
observations are the same. Moreover two or more data sets may have the same mean and/or
median but they may be quite different. So MCT alone do not provide enough information
about the nature of the data. The table below displays the price of a certain commodity in
four cities. Find the mean and median prices of the four cities and interpret it.
City A 30 30 30
City B 29 30 31
City C 15 30 45
City D 5 30 55
All the four data sets have mean 30 and median is also 30. But by inspection it is apparent
that the four data sets differ remarkably from one another. So measures of central tendency
alone do not provide enough information about the nature of the data. Thus, to have a
clear picture of the data, one needs to have a measure of dispersion or variability among
observations in the data set.
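The point of the four-city table can be reproduced in a few lines of Python: the mean and median coincide for all cities, while a simple measure of spread (here the range, chosen only for illustration) separates them.

```python
from statistics import mean, median

prices = {
    "City A": [30, 30, 30],
    "City B": [29, 30, 31],
    "City C": [15, 30, 45],
    "City D": [5, 30, 55],
}

summary = {}
for city, values in prices.items():
    # (mean, median, range) for each city
    summary[city] = (mean(values), median(values), max(values) - min(values))
    print(city, summary[city])
```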
Variation or dispersion may be defined as the extent of scatteredness of value around the
measures of central tendency. Thus, a measure of dispersion tells us the extent to which the
values of a variable vary about the measure of central tendency.
2. To compare two or more sets of data with regard to their variability. Two or
more data sets can be compared by calculating the same measure of variation, expressed in
the same units of measurement. A set with a smaller value possesses less variability, that is, it is
more uniform (or more consistent).
Before giving the details of these measures of dispersion, it is worthwhile to point out that a
measure of dispersion (variation) is to be judged on the basis of all those properties of good
measures of central tendency. Hence, their repetition is superfluous.
Range is the simplest and crudest measure of dispersion. It is defined as the difference
between the largest and the smallest values in the data:
$$R = \max - \min$$
Coefficient of Range:
$$CR = \frac{\max - \min}{\max + \min}$$
Range hardly satisfies any property of a good measure of dispersion, as it is based on the two extreme
values only and ignores the others. It is also not amenable to further algebraic treatment. The main
advantage of the range is the simplicity of its computation.
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
QD involves only the middle 50% of the observations, excluding the observations below
the lower quartile and the observations above the upper quartile. Note also that it does not
take into account the individual values occurring between $Q_1$ and $Q_3$. This means that no
idea about the variation of even the middle 50% of the values is available from this measure. It
provides some idea only if the values are uniformly distributed between $Q_1$ and $Q_3$.
The measures of variation discussed so far are not satisfactory in the sense that they lack
most of the requirements of a good measure. Mean deviation is a better measure than range
and quartile deviation. Mean deviation is the arithmetic mean of the absolute values of the
deviation from some measures of central tendency usually the mean and the median of a
distribution. Hence we have mean deviation about the mean M D(x̄) and mean deviation
about the median M D(x̃).
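• For raw data: $MD(\bar{x}) = \dfrac{\sum |x_i - \bar{x}|}{n}$ and $MD(\tilde{x}) = \dfrac{\sum |x_i - \tilde{x}|}{n}$
• For grouped data: $MD(\bar{x}) = \dfrac{\sum f_i |m_i - \bar{x}|}{\sum f_i}$ and $MD(\tilde{x}) = \dfrac{\sum f_i |m_i - \tilde{x}|}{\sum f_i}$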
M D is not much affected by extreme values. Its main drawback is that the algebraic negative
signs of the deviations are ignored. M D is minimum when the deviation is taken from median.
The coefficients of mean deviation are:
$$CMD(\bar{x}) = \frac{MD(\bar{x})}{\bar{x}}$$
$$CMD(\tilde{x}) = \frac{MD(\tilde{x})}{\tilde{x}}$$
Examples
1. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute
the range, coefficient of range, quartile deviation, coefficient of quartile deviation, mean
deviation about mean, mean deviation about median, coefficient of mean deviation
about mean and coefficient of mean deviation about median.
Solution:
Data: 15, 20, 25, 25, 27, 28, 30, 34
$$R = \max - \min = 34 - 15 = 19, \qquad CR = \frac{\max - \min}{\max + \min} = \frac{34 - 15}{34 + 15} = 0.388$$
To find QD and CQD, we have to calculate Q1 and Q3 first.
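$$\begin{aligned}
Q_1 &= \left(\frac{1 \times (8+1)}{4}\right)^{th} \text{ value} = 2.25^{th} \text{ value} \\
&= 2^{nd} \text{ value} + 0.25(3^{rd} \text{ value} - 2^{nd} \text{ value}) \\
&= 21.25
\end{aligned}$$
$$\begin{aligned}
Q_3 &= \left(\frac{3 \times (8+1)}{4}\right)^{th} \text{ value} = 6.75^{th} \text{ value} \\
&= 6^{th} \text{ value} + 0.75(7^{th} \text{ value} - 6^{th} \text{ value}) \\
&= 29.5
\end{aligned}$$
$$QD = \frac{Q_3 - Q_1}{2} = \frac{29.5 - 21.25}{2} = 4.125$$
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{29.5 - 21.25}{29.5 + 21.25} = \frac{8.25}{50.75} = 0.163$$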
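Besides this, to compute $MD(\bar{x})$, $MD(\tilde{x})$, $CMD(\bar{x})$ and $CMD(\tilde{x})$ we should first obtain $\bar{x}$
and $\tilde{x}$.
$$\bar{x} = \frac{\sum x_i}{n} = \frac{204}{8} = 25.5; \qquad \tilde{x} = 26$$
$$\begin{aligned}
MD(\bar{x}) &= \frac{\sum |x_i - \bar{x}|}{n} = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_8 - \bar{x}|}{8} \\
&= \frac{|15 - 25.5| + |20 - 25.5| + \cdots + |34 - 25.5|}{8} \\
&= \frac{34}{8} = 4.25
\end{aligned}$$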
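$$\begin{aligned}
MD(\tilde{x}) &= \frac{\sum |x_i - \tilde{x}|}{n} = \frac{|x_1 - \tilde{x}| + |x_2 - \tilde{x}| + \cdots + |x_8 - \tilde{x}|}{8} \\
&= \frac{|15 - 26| + |20 - 26| + \cdots + |34 - 26|}{8} \\
&= \frac{11 + 6 + 1 + 1 + 1 + 2 + 4 + 8}{8} = \frac{34}{8} = 4.25
\end{aligned}$$
Thus,
$$CMD(\bar{x}) = \frac{MD(\bar{x})}{\bar{x}} = \frac{4.25}{25.5} = 0.1667$$
$$CMD(\tilde{x}) = \frac{MD(\tilde{x})}{\tilde{x}} = \frac{4.25}{26} = 0.163$$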
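All the measures in this example can be recomputed in Python. The sketch below obtains the quartiles with `statistics.quantiles(..., method="exclusive")`, which uses the same (n+1)-based positions as the quartile formula for raw data; the variable names are illustrative.

```python
from statistics import mean, median, quantiles

data = sorted([27, 25, 20, 15, 30, 34, 28, 25])
n = len(data)

data_range = max(data) - min(data)
coef_range = data_range / (max(data) + min(data))

q1, _, q3 = quantiles(data, n=4, method="exclusive")
qd = (q3 - q1) / 2                     # quartile deviation
cqd = (q3 - q1) / (q3 + q1)            # coefficient of quartile deviation

x_bar = mean(data)
x_med = median(data)
md_mean = sum(abs(x - x_bar) for x in data) / n     # MD about the mean
md_median = sum(abs(x - x_med) for x in data) / n   # MD about the median
cmd_mean = md_mean / x_bar
cmd_median = md_median / x_med

print(data_range, round(coef_range, 3))
print(q1, q3, qd, round(cqd, 3))
print(x_bar, x_med, md_mean, md_median)
print(round(cmd_mean, 4), round(cmd_median, 4))
```

The absolute deviations from the median are 11, 6, 1, 1, 1, 2, 4 and 8, which sum to 34.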
Solution:
Previously, we have obtained the following quantities for the students score data:
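$$QD = \frac{Q_3 - Q_1}{2} = \frac{31.07 - 20}{2} = \frac{11.07}{2} = 5.54$$
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{31.07 - 20}{31.07 + 20} = \frac{11.07}{51.07} = 0.22$$
$$CMD(\bar{x}) = \frac{MD(\bar{x})}{\bar{x}} = \frac{6.04}{25.64} = 0.24$$
$$CMD(\tilde{x}) = \frac{MD(\tilde{x})}{\tilde{x}} = \frac{6.06}{26.1} = 0.23$$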
Variance and standard deviation are the most widely used measures of dispersion, and
both measure the average dispersion of the observations around the mean. The
variance of a data set is the sum of the squared deviations of the observations
from the mean divided by the total number of observations in the data set (or by one less
than that number for a sample). The positive square root of the variance is called the
standard deviation.
For a population containing N elements, the population standard deviation is denoted by the
Greek letter σ (sigma) and hence the population variance is denoted by σ 2 .
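• For raw data: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$ and $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
• For grouped data: $\sigma^2 = \dfrac{\sum f_i (m_i - \mu)^2}{N}$ and $\sigma = \sqrt{\dfrac{\sum f_i (m_i - \mu)^2}{N}}$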
For a sample of n elements, the sample variance and standard deviation, denoted by $s^2$ and
$s$ respectively, are calculated using the formulae:
• For raw data: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$ and $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
• For grouped data: $s^2 = \dfrac{\sum f_i (m_i - \bar{x})^2}{\sum f_i - 1}$ and $s = \sqrt{\dfrac{\sum f_i (m_i - \bar{x})^2}{\sum f_i - 1}}$
Examples
1. Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the variance and standard deviation.
Solution:
We compute the sample mean $\bar{x}$ first, since the sample variance is a function of the sample mean.
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{10 + 20 + 12 + 17 + 16}{5} = \frac{75}{5} = 15 \]
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(10 - 15)^2 + (20 - 15)^2 + (12 - 15)^2 + (17 - 15)^2 + (16 - 15)^2}{5 - 1} = \frac{64}{4} = 16 \]
Hence, $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{16} = 4$.
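The calculation in this example can be checked with a short Python sketch. The variable names are illustrative only; the standard library `statistics` module is used as a cross-check because its `variance` and `stdev` functions also use the n − 1 divisor.

```python
import math
import statistics

data = [10, 20, 12, 17, 16]
n = len(data)

# Sample mean
mean = sum(data) / n

# Sample variance: squared deviations from the mean divided by n - 1
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)

# Sample standard deviation: positive square root of the variance
sample_sd = math.sqrt(sample_var)

print(mean, sample_var, sample_sd)

# Cross-check against the standard library (also uses n - 1)
print(statistics.variance(data), statistics.stdev(data))
```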
2. Calculate the variance and standard deviation for the following frequency distribution.
Squaring the deviations from the mean removes the main objection to the mean deviation, namely that it ignores the signs of the deviations. The first main demerit of variance is that its unit is the square of the unit of measurement of the variable. For example, the sample variance of 2 m, 6 m and 4 m is 4 m². Interpreting this as "on average, each value differs from the mean by 4 m²" is wrong, because the unit of measurement of the variance is not the same as that of the data set. The other disadvantage of variance is that the variation in the data is exaggerated, because the deviation of each value from the mean is squared. In the given example, the variation is exaggerated from two to four by squaring the deviations. Variance also gives more weight to extreme values than to values near the mean.
Standard deviation is considered to be the best measure of dispersion because its unit of measurement is the same as that of the data set, and the exaggeration introduced by the variance is removed by taking the square root. In simple words, it describes the average amount of variation on either side of the mean. If the standard deviation of the data is small, the values are concentrated near the mean; if it is large, the values are scattered away from the mean.
1. If a constant is added to (or subtracted from) every observation, the standard deviation and the variance remain the same.
2. If every value is multiplied by a nonzero constant k, the standard deviation is multiplied by |k| and the variance is multiplied by k².
3. If there are k different groups having the same units of measurement, with sample means $\bar{x}_1, \bar{x}_2, ..., \bar{x}_k$, numbers of sample observations $n_1, n_2, ..., n_k$ and sample variances $s_1^2, s_2^2, ..., s_k^2$ respectively, then the variance of all the groups, called the pooled variance and denoted by $s_p^2$, is given by:
Examples
1. The mean weight of 150 students is 60 kilograms. The mean weight of boys is 70 kg
with a standard deviation of 10 kg. For the girls, the mean weight is 55 kg and the
standard deviation 15 kg. Then,
2. A distribution consists of four parts characterized as follows. Find the mean and stan-
dard deviation of the distribution.
3. The arithmetic mean and standard deviation of a series of 20 items were computed as
20 and 5 respectively. While calculating these, an item 13 was misread as 30. Find the
correct mean and standard deviation.
4. The following data are some of the particulars of the distribution of weights of boys and
girls in a class.
Boys Girls
Number 100 50
Mean 60 45
Variance 9 4
All absolute measures of dispersion have units. If two or more distributions differ in their units of measurement, their variability cannot be compared using any absolute measure of variation. The size of an absolute measure of dispersion also depends on the size of the values: the larger the values, the larger the absolute measure will be. In general, absolute measures of variation are not appropriate for comparing two or more groups if:
• The size of the data between the groups is not the same.
• For population: $cv = \dfrac{\sigma}{\mu} \times 100\%$
• For sample: $cv = \dfrac{s}{\bar{x}} \times 100\%$
The distribution with the smaller cv is said to be less variable, more consistent or more uniform. For field experiments, the cv is generally reported. A small cv indicates more reliable experimental findings.
Examples
1. Compare the variability of the following two sample data sets using standard deviation
and coefficient of variation.
2. The average IQ of statistics students is 110 with standard deviation 5 and the average
IQ of mathematics students is 106 with standard deviation 4. Which class is less variable
in terms of IQ?
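As a hedged illustration of how the coefficient of variation is used for such a comparison, the following Python sketch applies the sample formula to the figures stated in Example 2. The function name `coefficient_of_variation` is an assumption introduced here, not part of the text.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage."""
    return sd / mean * 100

# Figures taken from Example 2
cv_statistics = coefficient_of_variation(5, 110)
cv_mathematics = coefficient_of_variation(4, 106)

print(f"Statistics:  {cv_statistics:.2f}%")
print(f"Mathematics: {cv_mathematics:.2f}%")

# The class with the smaller cv is the less variable one
less_variable = "mathematics" if cv_mathematics < cv_statistics else "statistics"
print(less_variable)
```

Under these figures the mathematics class has the smaller cv, so it is the less variable class in terms of IQ.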
4.4. Exercises
1. Find the range, quartile deviation, mean deviation about the mean, mean deviation
about the median, mean deviation about the mode, variance, standard deviation and
coefficient of variation for the following distribution.
4. Two persons participated in five shooting competitions. The numbers of times each hit the target correctly out of fifteen shots are given below.
Competitor A 6 12 12 10 7
Competitor B 12 15 7 7 4
Which competitor is more uniform in shooting performance?
5 Elementary Probability
Probability theory plays a central role in statistics. After all, statistical analysis is applied to
a collection of data in order to discover something about the underlying events. These events
may be connected to one another. However, the individual choices involved are assumed to
be random. Alternatively, we may sample a population at random and make inferences about
the population as a whole from the sample by using statistical analysis. Therefore, a solid
understanding of probability theory - the study of random events - is necessary to understand
how the statistical analysis works and also to correctly interpret the results.
In order to discuss the theory of probability, it is essential to be familiar with some ideas and concepts of the mathematical theory of sets. A set is a collection of well-defined objects and is denoted by a capital letter such as A, B, C, etc.
In describing which objects are contained in set A, two common methods are available. These
methods are:
1. Listing all objects of A. For example, A = {1, 2, 3, 4} describes the set consisting of the
positive integers 1, 2, 3 and 4.
2. Describing a set in words, for example, set A consists of all real numbers between 0 and 1, inclusive. It can be written as A = {x : 0 ≤ x ≤ 1}, that is, A is the set of all x such that 0 ≤ x ≤ 1.
If $A = \{a_1, a_2, ..., a_n\}$, then each object $a_i$, i = 1, 2, ..., n, belonging to set A is called a member or an element of set A, i.e., $a_i \in A$. A set consisting of all possible elements under consideration is called a universal set (denoted by U). On the other hand, a set containing no elements is called an empty set (denoted by ∅ or {}).
If every element of set A is also an element of set B, A is said to be a subset of B, written as A ⊂ B. Every set is a subset of itself, i.e., A ⊂ A. The empty set is a subset of every set. If A ⊂ B and B ⊂ C, then A ⊂ C. If A ⊂ B and B ⊂ A, then A and B are said to be equal.
1. Union (Or): The set consisting of all elements in A or B or both is called the union of A and B, written as A ∪ B. That is, A ∪ B = {x : x ∈ A or x ∈ B}. The set A ∪ B is also called the sum of A and B.
Important Laws
• Commutative laws:
– A∪B =B∪A
– A∩B =B∩A
• Associative laws:
– A ∪ (B ∪ C) = (A ∪ B) ∪ C
– A ∩ (B ∩ C) = (A ∩ B) ∩ C
• Distributive laws:
– A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
– A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• Identity laws:
– A ∪ A = A, A ∩ A = A
– A ∪ U = U, A ∩ U = A
– A ∪ ∅ = A, A ∩ ∅ = ∅
1. Experiment (ξ): any statistical process that can be repeated several times and whose outcome in any trial is unpredictable.
2. Sample Space (S): the set consisting of all possible outcomes of a given experiment, ξ.
4. Independent Event: two or more events are independent if the occurrence of one
event has no effect on the probability of occurrence of the other.
5. Mutually Exclusive Events: two or more events are mutually exclusive, if they have
no outcome in common. They cannot occur together simultaneously.
Counting techniques are mathematical methods used to determine the number of possible ways of arranging or ordering objects. They are used to find the size of a sample space that is too large to list. To count the possible outcomes of a sample space or an event, we use the following counting techniques.
Addition Rule: states that if a task can be accomplished by any one of k procedures, where the i-th procedure has $n_i$ alternatives, then the total number of ways of doing the task is
\[ n_1 + n_2 + ... + n_k = \sum_{i=1}^{k} n_i \]
Example: Suppose a lady wants to make a journey from Harar to Dire Dawa. She can travel by plane, bus, cycle or horse, and there are 3 flights, 4 buses, 2 cycles and 3 horses available. In how many different ways can she make her journey?
Solution:
\[ n_f + n_b + n_c + n_h = 3 + 4 + 2 + 3 = 12 \]
Multiplication Rule: states that if a choice consists of k steps, where the first step can be done in $n_1$ ways, for each of these the second step can be done in $n_2$ ways, ..., and for each of those the k-th step can be done in $n_k$ ways, then the total number of distinct ways to accomplish the task/choice is equal to
\[ n_1 \times n_2 \times ... \times n_k = \prod_{i=1}^{k} n_i \]
Example 1: Suppose a cafeteria provides 5 kinds of cake, which it serves with tea, coffee, milk or Coca-Cola. In how many different ways can you order a breakfast of cake with a drink?
Solution:
The task has two steps. First, we choose a type of cake in $n_1 = 5$ ways, and then we choose a kind of drink in $n_2 = 4$ ways. Thus, the number of possible orders is
\[ n_1 \times n_2 = 5 \times 4 = 20 \]
Example 2: There are 2 bus routes from city X to city Y and 3 train routes from city
Y to city Z. In how many ways can a person go from city X to city Z?
Solution:
n1 × n2 = 2 × 3 = 6
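The multiplication rule can be illustrated by listing every possible outcome and counting them. The sketch below does this for the two examples above; the cake, bus and train labels are hypothetical placeholders, since the text gives only their numbers.

```python
from itertools import product

# Example 1: 5 kinds of cake (hypothetical labels) and 4 drinks
cakes = ["cake 1", "cake 2", "cake 3", "cake 4", "cake 5"]
drinks = ["tea", "coffee", "milk", "coca cola"]
orders = list(product(cakes, drinks))  # every (cake, drink) pair
print(len(orders))

# Example 2: 2 bus routes X -> Y followed by 3 train routes Y -> Z
bus_routes = ["bus 1", "bus 2"]
train_routes = ["train 1", "train 2", "train 3"]
journeys = list(product(bus_routes, train_routes))
print(len(journeys))
```

The counts agree with $n_1 \times n_2$ in each case, because every choice in the first step is paired with every choice in the second.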
n! = n × (n − 1) × (n − 2) × ... × (1)
By definition 1! = 0! = 1.
Example 1: In how many different ways can 3 persons sleep in a bed?
Solution:
n! = 3! = 3 × 2 × 1 = 6 ways.
Solution:
n! = 4! = 4 × 3 × 2 × 1 = 24 ways.
Rule 2: Given n distinct objects, the number of permutations of r objects taken from the n objects is denoted by nPr and given by
\[ nPr = \frac{n!}{(n - r)!}; \quad r \le n \]
Example 1: In how many ways can 10 people be seated on a bench if only 4 seats are
available?
Solution:
\[ nPr = 10P4 = \frac{10!}{(10 - 4)!} = \frac{10 \times 9 \times 8 \times 7 \times 6!}{6!} = 5040 \text{ ways.} \]
Example 2: How many 5 letter permutations can be formed from the letters in the
word DISCOVER?
Solution:
\[ nPr = 8P5 = \frac{8!}{(8 - 5)!} = \frac{8 \times 7 \times 6 \times 5 \times 4 \times 3!}{3!} = 6720 \]
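The two permutation examples can be verified with Python's standard library. This is only a sketch: `math.perm(n, r)` computes n!/(n − r)!, and the brute-force count relies on the fact that the eight letters of DISCOVER are all distinct.

```python
import math
from itertools import permutations

# Example 1: 10 people, 4 seats
seatings = math.perm(10, 4)
print(seatings)

# Example 2: 5-letter permutations of the 8 distinct letters in DISCOVER
arrangements = math.perm(8, 5)
print(arrangements)

# Brute-force cross-check by generating every 5-letter arrangement
brute_force = sum(1 for _ in permutations("DISCOVER", 5))
print(brute_force)
```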
Rule 3: Given n objects in which $n_1$ are alike, $n_2$ are alike, ..., $n_r$ are alike, the number of distinct permutations of the n objects is given by
\[ \frac{n!}{n_1! \times n_2! \times ... \times n_r!} \]
Example: How many different permutations can be made from the letters in the word:
• STATISTICS
Solution:
• MISSISSIPPI
Solution:
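A minimal Python sketch of Rule 3 applied to these two words is given below. The helper name `distinct_arrangements` is introduced here for illustration; it counts how often each letter occurs and divides n! by the factorial of each count.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(word):
    """Number of distinct permutations of the letters of `word` (Rule 3)."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)  # divide out arrangements of alike letters
    return total

# STATISTICS: S x3, T x3, I x2, A x1, C x1  ->  10! / (3! 3! 2!)
print(distinct_arrangements("STATISTICS"))

# MISSISSIPPI: I x4, S x4, P x2, M x1  ->  11! / (4! 4! 2!)
print(distinct_arrangements("MISSISSIPPI"))
```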
Combination: A selection of distinct objects considered without regard to the order of appearance is called a combination. For example, abc, acb, bac, bca, cab and cba are six different permutations, but they are the same combination.
Rule 1: The number of ways of selecting r objects from n distinct objects is called the combination of r objects from n objects, denoted by nCr or $\binom{n}{r}$, and given by
\[ nCr = \binom{n}{r} = \frac{n!}{(n - r)! \times r!}; \quad r \le n \]
Example: In how many ways can a student choose 3 books from a list of 12 different books?
Solution:
\[ \binom{n}{r} = \binom{12}{3} = \frac{n!}{(n - r)! \times r!} = \frac{12!}{(12 - 3)! \times 3!} = \frac{12!}{9! \times 3!} = \frac{12 \times 11 \times 10 \times 9!}{9! \times 3!} = 220 \]
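\[ \binom{n_1}{r_1} \times \binom{n_2}{r_2} = \binom{5}{2} \times \binom{7}{3} = 10 \times 35 = 350 \]
\[ \binom{n_1}{r_1} \times \binom{n_2}{r_2} = \binom{5}{2} \times \binom{6}{2} = 10 \times 15 = 150 \]
(c) two particular male workers cannot be members for some reason.
\[ \binom{n_1}{r_1} \times \binom{n_2}{r_2} = \binom{3}{2} \times \binom{7}{3} = 3 \times 35 = 105 \]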
The difference between permutation and combination is that in combination the order
of objects being selected (arranged) is not important, but order matters in permutation.
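The difference between the two ideas can be seen by generating both for the letters a, b, c, and the book example can be checked with `math.comb`. This is a sketch using only the Python standard library.

```python
import math
from itertools import combinations, permutations

# Choosing 3 books from 12 different books: order does not matter
book_choices = math.comb(12, 3)
print(book_choices)

letters = "abc"

# Order matters: abc, acb, bac, bca, cab, cba are all different
perms = ["".join(p) for p in permutations(letters, 3)]
print(perms)

# Order ignored: all six arrangements are one and the same selection
combs = ["".join(c) for c in combinations(letters, 3)]
print(combs)
```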
1. The Classical Approach (also called the Mathematical Approach): Suppose there are N possible outcomes in the sample space S of an experiment. If n of these N outcomes are favorable to the event E, then the probability that the event E will occur is:
\[ P(E) = \frac{n(E)}{n(S)} = \frac{n}{N} \]
Solution:
The sample space of the given experiment is S = {1, 2, 3, 4, 5, 6}. Let A be the event of getting an odd number in rolling a die only once.
\[ P(A) = \frac{n(A)}{n(S)} = \frac{3}{6} = 0.5 \]
b) number 4 occurs.
Solution:
\[ P(B) = \frac{n(B)}{n(S)} = \frac{1}{6} = 0.167 \]
Solution:
\[ P(C) = \frac{n(C)}{n(S)} = \frac{0}{6} = 0 \]
Solution:
\[ P(D) = \frac{n(D)}{n(S)} = \frac{6}{6} = 1 \]
• Events with zero probability of occurrence are known as null or impossible events.
Example 2: What is the probability of getting exactly one head in tossing two coins?
Solution:
S = {HH, HT, TH, TT}. Let E be the event of getting exactly one head in an experiment of tossing two coins.
\[ P(E) = \frac{n(E)}{n(S)} = \frac{2}{4} = 0.5 \]
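The classical approach amounts to counting favorable outcomes and dividing by the size of the sample space. The following Python sketch does this for the die and coin examples; the helper name `classical_probability` is an assumption made for illustration, and it presumes that all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """P(E) = n(E) / n(S), assuming equally likely outcomes."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

# Rolling a die once
die = [1, 2, 3, 4, 5, 6]
p_odd = classical_probability(die, lambda x: x % 2 == 1)
p_four = classical_probability(die, lambda x: x == 4)

# Tossing two coins: S = {HH, HT, TH, TT}
coins = list(product("HT", repeat=2))
p_one_head = classical_probability(coins, lambda o: o.count("H") == 1)

print(p_odd, p_four, p_one_head)
```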
Let S be a sample space associated with a random experiment. Then with any event E, in
this sample space, we associate a real number called probability of E satisfying the following
properties (axioms).
• 0 ≤ P(E) ≤ 1
• P(S) = 1
• If A and B are mutually exclusive events, then
P(A or B) = P(A ∪ B) = P(A) + P(B)
• $P(A \cup A^c) = P(A) + P(A^c)$
• P(∅) = 0
Using the above axioms, it can be shown that for any two events A and B,
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Solution:
Let A be the event that all candles are defective.
\[ P(A) = \frac{n(A)}{n(S)} = \frac{\binom{5}{4} \times \binom{15}{0}}{\binom{20}{4}} = 0.001032 \]
Example 2: An urn contains 6 white, 4 red and 9 black balls. If 3 balls are drawn at random,
find the probability that
When the occurrence of one event affects the occurrence of another event, the two events are said to be dependent (conditional). If two events A and B are dependent, the probability that event A occurs given that event B has already occurred is called the conditional probability of A given B:
\[ P(A/B) = \frac{P(A \cap B)}{P(B)}; \quad P(B) \ne 0 \]
The probability that event B occurs given that event A has already occurred is called the conditional probability of B given A:
\[ P(B/A) = \frac{P(A \cap B)}{P(A)}; \quad P(A) \ne 0 \]
Remarks
(i) 0 ≤ P (A/B) ≤ 1
(ii) P (S/B) = 1
Example: The probability that a research project will be well planned is 0.6, and the probability that it will be well planned and well executed is 0.54. What is the probability that it will be
Solution:
Let D and E be the events that the research project is well planned and well executed, respectively. Then P(D) = 0.6 and P(D ∩ E) = 0.54.
\[ P(E/D) = \frac{P(D \cap E)}{P(D)} = \frac{0.54}{0.6} = 0.9 \]
Solution:
\[ P(E^c/D) = \frac{P(D \cap E^c)}{P(D)} = \frac{P(D) - P(D \cap E)}{P(D)} = 1 - \frac{P(D \cap E)}{P(D)} = 1 - P(E/D) = 1 - 0.9 = 0.1 \]
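The same conditional probabilities can be reproduced in a few lines of Python. Exact fractions are used here, as an implementation choice, to avoid floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

p_d = Fraction(60, 100)        # P(D): project is well planned
p_d_and_e = Fraction(54, 100)  # P(D and E): well planned and well executed

# Conditional probability of E given D
p_e_given_d = p_d_and_e / p_d

# Complement rule within the condition D
p_not_e_given_d = 1 - p_e_given_d

print(float(p_e_given_d), float(p_not_e_given_d))
```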
5.8. Independence
P (A ∩ B) = P (A) × P (B)
Solution:
→ 1 2 3 4 5 6
1 (1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
2 (2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
3 (3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
4 (4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
5 (5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)
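\[ P(A) = \frac{n(A)}{n(S)} = \frac{18}{36}, \qquad P(A \cap B) = \frac{n(A \cap B)}{n(S)} = \frac{9}{36} \]
\[ P(B) = \frac{n(B)}{n(S)} = \frac{18}{36}, \qquad P(A \cap C) = \frac{n(A \cap C)}{n(S)} = \frac{9}{36} \]
\[ P(C) = \frac{n(C)}{n(S)} = \frac{9}{36}, \qquad P(B \cap C) = \frac{n(B \cap C)}{n(S)} = \frac{0}{36} \]
\[ P(A \cap B) = P(A) \times P(B): \quad \frac{9}{36} = \frac{18}{36} \times \frac{18}{36} \]
\[ P(A \cap C) \ne P(A) \times P(C): \quad \frac{9}{36} \ne \frac{18}{36} \times \frac{9}{36} \]
\[ P(B \cap C) \ne P(B) \times P(C): \quad \frac{0}{36} \ne \frac{18}{36} \times \frac{9}{36} \]
Therefore, based on the above results, A and B are statistically independent events. However, events A and C, and events B and C, are not statistically independent.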
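The independence check can be carried out by enumerating the 36 outcomes of two dice. The definitions of A, B and C are not stated in this part of the text, so the sketch below assumes one choice that reproduces the counts used above (n(A) = 18, n(B) = 18, n(C) = 9): A = "first die is even", B = "second die is odd", C = "both dice are even". These definitions are an assumption, not the author's stated events.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # 36 ordered pairs

def prob(event):
    """Classical probability of an event over the two-dice sample space."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

# Assumed event definitions (hypothetical, chosen to match the counts)
A = lambda o: o[0] % 2 == 0                     # first die even
B = lambda o: o[1] % 2 == 1                     # second die odd
C = lambda o: o[0] % 2 == 0 and o[1] % 2 == 0   # both dice even

def independent(e1, e2):
    """True when P(e1 and e2) equals P(e1) * P(e2)."""
    joint = prob(lambda o: e1(o) and e2(o))
    return joint == prob(e1) * prob(e2)

print(independent(A, B), independent(A, C), independent(B, C))
```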
5.9. Exercises
(b) Events
2. A package contains 12 resistors, 3 of which are defective. If 3 are selected, find the
probability of getting
3. Let A and B be two events associated with an experiment and suppose that P (A) = 0.4
while P (A ∪ B) = 0.7. Let P (B) = p. For what choice of p
4. The personnel department of a company has records which show the following analysis
of its 200 accountants.
Age Bachelor’s Master’s
Under 30 90 10
30 to 40 20 30
Over 40 40 10
If one accountant is selected at random from the company, find the probability that
6 Probability Distributions
S = {HH, HT, T H, T T }
Let X be number of heads. Thus, another sample space with respect to X (also called the
range space of X) is
Rx = {0, 1, 2}
Definition: A function X which assigns a real number to each outcome in a sample space is called a random variable. A random variable is a variable that has a single numerical value (determined by chance) for each outcome of a procedure.
A random variable can be classified as being either discrete or continuous depending on the
numerical values it assumes.
A discrete random variable has either a finite number of values or a countable number of values; that is, its values result from a counting process. The possible values of X may be $x_1, x_2, ..., x_n$.
For any discrete random variable X the following will be true.
(i) $0 \le p(x_i) \le 1$
(ii) $\sum_{i=1}^{n} p(x_i) = 1$ for a finite number of values and $\sum_{i=1}^{\infty} p(x_i) = 1$ for a countably infinite number of values.
$p(x_i)$ is called the probability function, point probability function or mass function. The collection of pairs $(x_i, p(x_i))$ is called the probability distribution. It gives the probability for each value or range of values of the random variable.
Solution:
S = {HH, HT, TH, TT}. Let X be the number of heads obtained in tossing a coin two times. Then $R_x$ = {0, 1, 2}.
\[ P(X = 0) = P(TT) = \frac{1}{4}, \quad P(X = 1) = P(HT, TH) = \frac{2}{4}, \quad P(X = 2) = P(HH) = \frac{1}{4} \]
X             0     1     2
P(X = x_i)    1/4   2/4   1/4
\[ P(Y = y) = c y^2, \quad y = 0, 1, 2, 3, 4 \]
Solution:
The point probabilities must sum to one, which allows us to find the value of c.
\[ \sum_{i=0}^{4} P(Y = y_i) = 1 \]
\[ 0 + c + 4c + 9c + 16c = 1 \]
\[ c = \frac{1}{30} \]
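The normalising constant can also be found programmatically: sum the unnormalised values y² and take the reciprocal. This is a small sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

values = range(5)  # y = 0, 1, 2, 3, 4

# The probabilities c * y^2 must sum to 1, so c = 1 / sum(y^2)
c = 1 / Fraction(sum(y ** 2 for y in values))

# Resulting probability mass function
pmf = {y: c * y ** 2 for y in values}

print(c)
for y, p in pmf.items():
    print(y, p)
```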
A continuous random variable has infinitely many values, and those values can be associated with measurements on a continuous scale in such a way that there are no gaps or interruptions. That is, X assumes all possible values in an interval (a, b), where $a, b \in \mathbb{R}$, and there exists a function f(x), called the probability density function (pdf), satisfying the following conditions.
(i) $f(x) \ge 0, \; \forall x$
(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
For any two real numbers a and b such that −∞ < a < b < ∞,
\[ P(a < X < b) = \int_{a}^{b} f(x)\,dx \]
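Example: Let X be a continuous random variable whose pdf is given by:
\[ f(x) = \begin{cases} 2x, & \text{for } 0 < x < 1 \\ 0, & \text{otherwise} \end{cases} \]
• $f(x) \ge 0 \; \forall x$, since $2x > 0$ for $0 < x < 1$ and $f(x) = 0$ otherwise
• $\int_{0}^{1} f(x)\,dx = \int_{0}^{1} 2x\,dx = 1$
\[ P(0.5 < X < 0.75) = \int_{0.5}^{0.75} 2x\,dx = 2\int_{0.5}^{0.75} x\,dx = 2\left[\frac{x^2}{2}\right]_{0.5}^{0.75} = 0.5625 - 0.25 = 0.3125 \]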
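The two integrals in this example can be approximated numerically. The sketch below uses a simple midpoint rule written by hand (an illustrative choice, not a method prescribed by the text); the function names `pdf` and `integrate` are assumptions.

```python
def pdf(x):
    """Density f(x) = 2x on (0, 1) and 0 elsewhere."""
    return 2 * x if 0 < x < 1 else 0.0

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = integrate(pdf, 0, 1)       # should be close to 1
p_interval = integrate(pdf, 0.5, 0.75)  # P(0.5 < X < 0.75)

print(round(total_area, 6), round(p_interval, 6))
```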
Definition: If X is a discrete random variable with possible values $x_1, x_2, ..., x_n$ having the probabilities $p(x_1), p(x_2), ..., p(x_n)$, then the mean value of X, denoted by E(X) or µ, is defined as:
\[ E(X) = \mu = \sum_{i=1}^{n} x_i \, p(x_i) \]
Definition: If X is a continuous random variable with pdf f(x), its mean is given by
\[ E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx \]
Properties of Expectation
E(aX) = aE(X) = aµ
2. If X = a, then E(X) = a.
Properties of Variance
B var(X + a) = var(X)
B var(aX) = a2 var(X)
Example 1: A coin is tossed two times. Let X be the number of heads. Find the mean
value and the standard deviation of X.
Solution:
We already constructed a probability distribution for number of heads in previous example.
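X             0     1     2
P(X = x_i)    1/4   2/4   1/4
\[ E(X) = \mu = \sum_{i=0}^{2} x_i \, p(x_i) = 0 \times \frac{1}{4} + 1 \times \frac{2}{4} + 2 \times \frac{1}{4} = 1 \]
\[ E(X^2) = 0^2 \times \frac{1}{4} + 1^2 \times \frac{2}{4} + 2^2 \times \frac{1}{4} = \frac{6}{4} = 1.5 \]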
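The remaining step of this example, going from E(X) and E(X²) to the standard deviation, can be sketched in Python. It relies on the standard identity var(X) = E(X²) − [E(X)]², which is assumed here because the corresponding line is not shown in this part of the text.

```python
import math
from fractions import Fraction

# Probability distribution of the number of heads in two tosses
pmf = {0: Fraction(1, 4), 1: Fraction(2, 4), 2: Fraction(1, 4)}

mean = sum(x * p for x, p in pmf.items())              # E(X)
second_moment = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2)

# var(X) = E(X^2) - [E(X)]^2 (standard identity, assumed here)
variance = second_moment - mean ** 2
sd = math.sqrt(variance)

print(mean, second_moment, variance, round(sd, 4))
```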
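Solution:
\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-1}^{0} x(1 + x)\,dx + \int_{0}^{1} x(1 - x)\,dx \]
\[ E(X) = \left[\frac{x^2}{2} + \frac{x^3}{3}\right]_{-1}^{0} + \left[\frac{x^2}{2} - \frac{x^3}{3}\right]_{0}^{1} = -\frac{1}{6} + \frac{1}{6} = 0 \]
\[ E(X^2) = \int_{-1}^{0} x^2(1 + x)\,dx + \int_{0}^{1} x^2(1 - x)\,dx \]
\[ E(X^2) = \left[\frac{x^3}{3} + \frac{x^4}{4}\right]_{-1}^{0} + \left[\frac{x^3}{3} - \frac{x^4}{4}\right]_{0}^{1} = \frac{1}{12} + \frac{1}{12} = 0.167 \]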
The binomial probability distribution is a discrete probability distribution that has many applications. It is associated with a multiple-step experiment that we call the binomial experiment. A binomial experiment exhibits the following four properties.
2. The trials are independent. (The outcome of any individual trial does not affect the
probabilities in the other trials.)
3. The outcome of each trial must be classifiable into one of two possible categories (success
or failure).
4. The probability of a success, denoted by p, does not change from trial to trial.
If a procedure satisfies these four requirements, the distribution of the random variable (X) is called a binomial probability distribution (or binomial distribution). To calculate probabilities we use the following formula.
\[ P(X = x) = \binom{n}{x} p^x q^{n-x} \quad \text{for } x = 0, 1, 2, ..., n \]
where
The expected value and standard deviation of a binomially distributed random variable [X ∼ Bin(n, p)] can be obtained using the following.
\[ E(X) = \mu = np \]
\[ SD(X) = \sigma = \sqrt{np(1 - p)} = \sqrt{npq} \]
Example: A university found that 20% of its students withdraw without completing the
introductory statistics course. Assume that 20 students registered for the course. Compute
the probability
Let X be the number of students who will withdraw without completing the introductory statistics course. From the given problem p = 0.2 = 20%, n = 20 and X ∼ Bin(20, 0.2).
\[ P(X = 4) = \binom{20}{4} 0.2^4 \, 0.8^{16} = \frac{20!}{4!(20 - 4)!} 0.2^4 \, 0.8^{16} = 0.2182 \]
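\[ P(X \le 2) = \sum_{i=0}^{2} P(X = i) = \binom{20}{0} 0.2^0 \, 0.8^{20} + \binom{20}{1} 0.2^1 \, 0.8^{19} + \binom{20}{2} 0.2^2 \, 0.8^{18} \]
\[ = \frac{20!}{0!(20 - 0)!} 0.2^0 \, 0.8^{20} + \frac{20!}{1!(20 - 1)!} 0.2^1 \, 0.8^{19} + \frac{20!}{2!(20 - 2)!} 0.2^2 \, 0.8^{18} = 0.2061 \]
\[ P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0.2061 + \frac{20!}{3!(20 - 3)!} 0.2^3 \, 0.8^{17} = 0.2061 + 0.2054 = 0.4114 \]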
E(X) = np = 20 × 0.2 = 4
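The binomial calculations for this example can be reproduced with a short Python sketch built directly on the formula $P(X = x) = \binom{n}{x} p^x q^{n-x}$. The function name `binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 20, 0.2

p_exactly_4 = binomial_pmf(4, n, p)
p_at_most_2 = sum(binomial_pmf(x, n, p) for x in range(3))  # x = 0, 1, 2
p_at_most_3 = sum(binomial_pmf(x, n, p) for x in range(4))  # x = 0, 1, 2, 3
expected = n * p

print(round(p_exactly_4, 4), round(p_at_most_2, 4),
      round(p_at_most_3, 4), expected)
```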
In this section we consider a discrete random variable that is often useful in estimating the
number of occurrences over a specified interval of time or space. For example, the random
variable of interest might be the number of arrivals at a car wash in one hour, the number
of repairs needed in 10 miles of highway, or the number of leaks in 100 miles of pipeline. If
the following two properties are satisfied, the number of occurrences is a random variable
described by the Poisson probability distribution.
1. The probability of an occurrence is the same for any two intervals of equal length.
For the Poisson probability distribution, X is a discrete random variable indicating the num-
ber of occurrences in the interval. Since there is no stated upper limit for the number of
occurrences, the probability function p(x) is applicable for values x = 0, 1, 2, ... without limit.
In practical applications, x will eventually become large enough so that p(x) is approximately
zero and the probability of any larger values of x becomes negligible.
A property of the Poisson distribution is that the mean and variance are equal. That is,
E(X) = var(X) = λ
Example: A student finds that the average number of amoeba in 10 ml of pond water is 4.
Find the probability that in 10 ml of water from that pond there are
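Let X be the number of amoeba found in 10 ml of pond water. From the given question λ = 4, which implies that X ∼ Poisson(4).
\[ P(X = 5) = \frac{e^{-4} 4^5}{5!} = 0.156 \]
(b) no amoeba.
\[ P(X = 0) = \frac{e^{-4} 4^0}{0!} = e^{-4} = 0.0183 \]
\[ P(X \le 2) = \sum_{i=0}^{2} \frac{e^{-4} 4^i}{i!} = \frac{e^{-4} 4^0}{0!} + \frac{e^{-4} 4^1}{1!} + \frac{e^{-4} 4^2}{2!} = e^{-4} + 4e^{-4} + 8e^{-4} = 0.238 \]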
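The Poisson probabilities in this example follow from the expression $e^{-\lambda}\lambda^x / x!$ used in the solution. A minimal Python sketch is shown below; the function name `poisson_pmf` is an illustrative assumption.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 4  # average number of amoeba in 10 ml of pond water

p_five = poisson_pmf(5, lam)
p_none = poisson_pmf(0, lam)
p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))  # x = 0, 1, 2

print(round(p_five, 3), round(p_none, 4), round(p_at_most_2, 3))
```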
The most important probability distribution for describing a continuous random variable is
the normal probability distribution. The normal distribution has been used in a wide variety
of practical applications in which the random variables are heights and weights of people,
test scores, scientific measurements, amounts of rainfall, and other similar values. It is also
widely used in statistical inference. In such applications, the normal distribution provides a
description of the likely results obtained through sampling.
Normal Curve
The form or shape of the normal distribution is illustrated by the bell-shaped normal curve
in the following figure. The probability density function (pdf) that defines the bell-shaped
curve of the normal distribution follows.
If a random variable X ∼ N(µ, σ²), its probability density function (pdf) is given by:
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2 / 2\sigma^2}, \quad -\infty < x < \infty \]
The normal curve has two parameters, µ and σ. They determine the location and shape of the normal distribution.
1. The entire family of normal distributions is differentiated by two parameters: the mean
µ and the standard deviation σ.
2. The highest point on the normal curve is at the mean, which is also the median and
mode of the distribution.
3. The mean of the distribution can be any numerical value: negative, zero, or positive.
Three normal distributions with the same standard deviation but three different means
(-10, 0, and 20) are shown here.
4. The normal distribution is symmetric, with the shape of the normal curve to the left
of the mean a mirror image of the shape of the normal curve to the right of the mean.
The tails of the normal curve extend to infinity in both directions and theoretically
never touch the horizontal axis. Because it is symmetric, the normal distribution is not
skewed; its skewness measure is zero.
5. The standard deviation determines how flat and wide the normal curve is. Larger
values of the standard deviation result in wider, flatter curves showing more variability
in the data. Two normal distributions with the same mean but with different standard
deviations are shown here.
6. Probabilities for the normal random variable are given by areas under the normal curve.
The total area under the curve for the normal distribution is 1. Because the distribution
is symmetric, the area under the curve to the left of the mean is 0.50 and the area under
the curve to the right of the mean is 0.50.
a) 68.3% of the values of a normal random variable are within plus or minus one
standard deviation of its mean.
b) 95.4% of the values of a normal random variable are within plus or minus two
standard deviations of its mean.
c) 99.7% of the values of a normal random variable are within plus or minus three
standard deviations of its mean.
A random variable that has a normal distribution with a mean of zero and a standard deviation
of one is said to have a standard normal probability distribution. The letter z is commonly
used to designate this particular normal random variable, that is z ∼ N (0, 1). The reason for
discussing the standard normal distribution so extensively is that probabilities for all normal
distributions are computed by using the standard normal distribution. That is, when we have
a normal distribution with any mean µ and any standard deviation σ, we answer probability
questions about the distribution by first converting to the standard normal distribution. Then
we can use the standard normal probability table and the appropriate z values to find the
\[ z = \frac{x - \mu}{\sigma} \]
Consequently, the standard normal density is given by:
\[ f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty \]
which is graphically shown below.
Example 1: Given that z is a standard normal random variable, compute the following
probabilities.
a) P (0 ≤ z ≤ 2.5) = 0.4938
\[ P(-1 < z < 1.5) = P(-1 < z < 0) + P(0 < z < 1.5) = 0.3413 + 0.4332 = 0.7745 \]
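The table look-ups in this example can be checked with the `NormalDist` class from Python's standard `statistics` module. This is a sketch only; the text itself uses the standard normal probability table.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# P(0 <= z <= 2.5): area between 0 and 2.5
p_a = z.cdf(2.5) - z.cdf(0)

# P(-1 < z < 1.5): area between -1 and 1.5
p_b = z.cdf(1.5) - z.cdf(-1)

print(round(p_a, 4), round(p_b, 4))
```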
Example 2: The college boards, which are administered each year to many thousands of
high school students, are scored so as to yield a mean of 500 and a standard deviation of 100.
These scores are close to being normally distributed. What percentage of the scores can be
expected to satisfy each condition?
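Let X be the score of students with mean µ = 500 and standard deviation σ = 100, so that X ∼ N(500, 100²).
\[ P(X > 600) = P\left[\frac{X - \mu}{\sigma} > \frac{600 - \mu}{\sigma}\right] = P\left[z > \frac{600 - 500}{100}\right] = P[z > 1] = 0.1587 \]
\[ P(X < 450) = P\left[\frac{X - \mu}{\sigma} < \frac{450 - \mu}{\sigma}\right] = P\left[z < \frac{450 - 500}{100}\right] = P[z < -0.5] = P[z < 0] - P[-0.5 < z < 0] = 0.5 - 0.1915 = 0.3085 \]
\[ P(450 < X < 600) = P\left[\frac{450 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{600 - \mu}{\sigma}\right] = P\left[\frac{450 - 500}{100} < z < \frac{600 - 500}{100}\right] = P[-0.5 < z < 1] = 0.1915 + 0.3413 = 0.5328 \]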
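The three probabilities for the college board scores can be verified without a table by using `statistics.NormalDist`, which takes the mean and the standard deviation (not the variance) as arguments. The variable names below are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

p_above_600 = 1 - scores.cdf(600)              # P(X > 600)
p_below_450 = scores.cdf(450)                  # P(X < 450)
p_between = scores.cdf(600) - scores.cdf(450)  # P(450 < X < 600)

print(round(p_above_600, 4), round(p_below_450, 4), round(p_between, 4))
```

Multiplying each result by 100 gives the percentage of scores expected to satisfy the corresponding condition.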
6.5. Exercises
1. State the conditions (assumptions) under which random variable can have a binomial
distribution.
2. The probability that a freshman entering Haramaya University will survive first semester
is 0.92. Assuming this pattern remain unchanged over the subsequent years, what is
the probability that among 100 randomly selected freshmen in first semester,
3. A normal distribution has mean µ = 62.4. Find its standard deviation if 20% of the
area under the curve lies to the right of 79.2.
4. Show that 65.24% of the observations in a normally distributed population lie between
µ − 1.1σ and µ + 0.8σ.
(a) the lowest mark if the lowest 10% of the students are given F’s.
(b) the lowest mark to get grade A if the top 5% of the students are given A’s.
(c) the lowest mark to get grade B if the top 10% of the students are given A’s and
the next 25% are given B’s.
7
One of the principal objectives of statistical analysis is to draw inference about the population
on the basis of data collected by sampling from population. In other words, one is required to
draw inference (or to generalize) about the population from the sample data. The inference to
be drawn relates to some parameters of the population, such as the mean, standard deviation
or some other feature like the proportion of an attribute occurring in the population. The
two most important types of problems of inference in statistics are:
In the absence of the complete data or information about the population, it would not be
possible to determine the exact or true value of a parameter. It would be worthwhile to obtain
from the sample data an estimate of the unknown true or exact value of the parameter or an
interval of values in which the parameter lies, and also determine a procedure for determining
the accuracy of the estimate. This type of inference is known as estimation of parameters.
There are two types of estimation.
1. Point estimation: here the objective is to find a single value for the unknown param-
eter.
Suppose that we are concerned with the estimation of a parameter of a population from a given sample of the population. The procedure of point estimation consists of determining a single quantity from the given sample values such that this single number is fairly close to the unknown value of the parameter of the population. Suppose that the sample (of size n) drawn from the population is denoted by $x_1, x_2, ..., x_n$, and that the unknown parameter is denoted by θ. The point estimation of θ will be based on the sample observations $x_1, x_2, ..., x_n$. It will be a function of the sample observations, that is, a statistic. The statistic to be used for point estimation of θ is called a point estimator and is denoted by $\hat{\theta}$. When an actual set of sample values is given, we can compute a numerical value, which is called a point estimate of θ. The estimator $\hat{\theta}$ of the parameter θ is a function of the sample observations $x_1, x_2, ..., x_n$, and will assume different values corresponding to different sets of sample observations. For a given set of sample observations, we get a point estimate of θ; this is one of the possible values of $\hat{\theta}$.
The point estimator of µ can assume an infinite number of values corresponding to the infinite
set of (the numerical) sample values that x1 , x2 , ..., xn take. From one given set of sample
values, that is, a particular set of numerical values, one can compute one particular value of the
estimator of µ and this value is a point estimate of µ. Besides the mean \( \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n} \),
there may be other types of estimator of µ, based on some other function of the same set of
sample observations x1 , x2 , ..., xn ; in fact the sample median is also an estimator of µ. The
question then arises: which of the sample estimators is to be preferred and why. This raises
another question: what should be the basis of selecting an estimator, or in other words what
should be the criteria of a good estimator. Without going into details, we would like to state
below four desirable properties that an estimator should possess.
3. Consistency: It refers to the effect of sample size on the accuracy of the estimator. A
statistic is said to be consistent estimator of the population parameter if it approaches
the parameter as the sample size increases, i.e. θ̂ → θ as n → N .
4. Sufficiency: An estimator is said to be sufficient if it uses all the information about the
population parameter contained in the sample. For example, the statistic mean uses all
the sample values in its computation while median and mode do not. Hence, the mean
is a better estimator in this sense.
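The idea of consistency can be sketched numerically. The example below assumes a hypothetical finite population (the integers 1 to 1000) and draws simple random samples without replacement. It is an illustration only, not a proof: for any single run the error need not shrink at every step, but when n reaches N the sample is the whole population and the estimate equals the parameter.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical finite population of size N (integers 1..N).
N = 1000
population = list(range(1, N + 1))
mu = mean(population)  # population mean

def sample_mean(n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return mean(random.sample(population, n))

for n in (10, 100, 500, N):
    estimate = sample_mean(n)
    print(n, estimate, abs(estimate - mu))
```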
In the previous section, we stated that a point estimator is a sample statistic used to estimate a
population parameter. For instance, the sample mean x̄ is a point estimator of the population
mean µ and the sample proportion p̂ is a point estimator of the population proportion p.
Because a point estimator cannot be expected to provide the exact value of the population
parameter, an interval estimate is often computed by adding and subtracting a value, called
the margin of error, to the point estimate. The general form of an interval estimate is as
follows:
point estimate ± margin of error
Thus, a confidence interval (or interval estimate) is a range (or an interval) of values that is
likely to contain the true value of the population parameter. A confidence interval is associated
with a degree of confidence, which is a measure of how certain we are that our interval contains
the population parameter.
The degree of confidence is the probability 1 − α (often expressed as the equivalent percentage
value) that the confidence interval contains the true value of population parameter. The
degree of confidence is also called the level of confidence or the confidence coefficient.
The purpose of an interval estimate is to provide information about how close the point
estimate, provided by the sample, is to the value of the population parameter. Thus, the
general form of an interval estimate of a population mean is
x̄ ± margin of error
\[ \bar{x} \pm t_{\alpha/2}(n-1)\,\frac{s}{\sqrt{n}} \]
where s is the sample standard deviation, 1 − α is the confidence coefficient, and tα/2
is the t value providing an area of α/2 in the upper tail of the t distribution with n − 1
degrees of freedom. The reason the number of degrees of freedom associated with the t
value in above expression is n − 1 concerns the use of s as an estimate of the population
standard deviation σ. The expression for the sample standard deviation is
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
1. If all possible samples of size n were drawn, then on an average 100(1 − α)% of these
samples would include the population mean within the interval around their sample
means bounded by L and U .
3. If a random sample of size n was taken from the population, we can be 100(1 − α)%
confident in our assertion that the population mean would lie around the sample mean
in the interval bounded by the values L and U .
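The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with hypothetical values µ = 50 and σ = 10, σ is treated as known, and the z interval is used instead of the t interval for simplicity. The code counts how often the interval built from each sample contains µ.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(2015)  # fixed seed so the run is repeatable

mu, sigma = 50.0, 10.0      # hypothetical population mean and standard deviation
n, repetitions = 30, 2000   # sample size and number of repeated samples
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point, about 1.96
margin = z * sigma / sqrt(n)              # sigma is treated as known here

covered = 0
for _ in range(repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(sample)
    if xbar - margin <= mu <= xbar + margin:
        covered += 1

coverage = covered / repetitions
print(f"Proportion of intervals containing mu: {coverage:.3f}")
```

The proportion printed should be close to 1 − α = 0.95, although it will not equal it exactly in a finite number of repetitions.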
Example 1: Haramaya University wishes to estimate the average age of students who graduate
with a B.Sc. degree. A random sample of 625 graduating students showed that the average
age was 24 with a standard deviation of 5 years. Construct the 95% confidence interval for
the true average age of all such graduating students at the University and interpret it.
Solution:
Let µ be the true average age of all graduating students with B.Sc. degree from the University.
From the sample data we have n = 625, x̄ = 24 and s = 5.
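\[ \left[\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right] = \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} \]
\[ \left[24 - 1.96 \times \frac{5}{\sqrt{625}},\ 24 + 1.96 \times \frac{5}{\sqrt{625}}\right] \]
\[ [24 - 0.392,\ 24 + 0.392] = [23.608,\ 24.392] \]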
Interpretation:
We are 95% confident that the true average age of all graduating students with a B.Sc. degree
from the University lies between 23.608 and 24.392 years.
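The arithmetic of Example 1 can be reproduced with a short sketch. It uses only the values given in the example; the z value is obtained from Python's `statistics.NormalDist` rather than from a printed table, so it is slightly more precise than 1.96.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 625, 24, 5   # values given in Example 1
confidence = 0.95

alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point, about 1.96
margin = z * s / sqrt(n)                  # margin of error
lower, upper = xbar - margin, xbar + margin
print(f"[{lower:.3f}, {upper:.3f}]")
```

To three decimal places the limits agree with the interval obtained in the solution.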
Example 2: An airline wants to evaluate the depth perception of its pilots over the age of
50. A random sample of 14 airline pilots over the age of 50 are asked to judge the distance
between two markers placed 20 feet apart at the opposite end of the laboratory. The sample
data listed here are the pilots’ error (recorded in feet) in judging the distance.
Use the sample data to place a 95% confidence interval on µ, the average error in depth
perception for the company’s pilots over the age of 50.
Solution:
Since the sample size is small, it is appropriate to construct the confidence interval based on
the t distribution. Let µ be the average error in depth perception for the company’s pilots
over the age of 50. We can verify that ȳ = 2.26 and s = 0.28.
\[ \left[\bar{y} - t_{\alpha/2}(n-1)\frac{s}{\sqrt{n}},\ \bar{y} + t_{\alpha/2}(n-1)\frac{s}{\sqrt{n}}\right] = \bar{y} \pm t_{\alpha/2}(n-1)\frac{s}{\sqrt{n}} \]
\[ \left[2.26 - 2.16 \times \frac{0.28}{\sqrt{14}},\ 2.26 + 2.16 \times \frac{0.28}{\sqrt{14}}\right] = [2.10,\ 2.42] \]
Interpretation:
Therefore, we are 95% confident that the average error in the pilots’ judgment of the distance
is between 2.10 and 2.42 feet.
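The small-sample interval of Example 2 can be sketched in the same way. The function name `t_interval` is an assumption made for this illustration. The critical value is passed in as read from a t table, because the Python standard library does not provide the t distribution.

```python
from math import sqrt

def t_interval(ybar, s, n, t_crit):
    """Return (lower, upper) for ybar ± t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return ybar - margin, ybar + margin

# Summary values from Example 2; 2.16 is the tabulated t value for
# alpha/2 = 0.025 with n - 1 = 13 degrees of freedom, as used in the solution.
lower, upper = t_interval(ybar=2.26, s=0.28, n=14, t_crit=2.16)
print(f"[{lower:.2f}, {upper:.2f}]")
```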
7.3.1. Introduction
1. Type-I Error: the mistake made when one rejects a null hypothesis that is actually
true.
2. Type-II Error: the mistake made when one fails to reject a null hypothesis that
is actually false.
There are four possible outcomes of any hypothesis test, two of which are correct and two of
which are incorrect. The incorrect ones are called type I and type II.
                     State of Nature
Decision             H0 True                     H0 False
Do not reject H0     Correct decision (1 − α)    Type II error (β)
Reject H0            Type I error (α)            Correct decision (1 − β)
1. State (formulate) the null and alternative hypotheses. The hypotheses may be either of
the following.
3. Calculate the appropriate test statistic. The following general formulas for a test
statistic are applicable in many hypothesis tests about a mean.
▷ Use the t statistic if n is small: \( t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \).
▷ Use the z statistic if n is large enough: \( z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \).
4. Obtain the tabulated (critical) value. For two tailed test the critical value is zα/2 (tα/2 ),
for right tailed zα (tα ) and for left tailed −zα (−tα ) respectively.
5. Define the critical (rejection) region. If the value of the test statistic falls in the critical
region (rejection region), reject the null hypothesis; otherwise do not reject it.
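Steps 4 and 5 can be sketched for the z test as follows. The function names `z_critical` and `reject_h0` are assumptions made for this illustration. Only the standard normal case is covered, since the Python standard library has no t distribution; for the t test the critical value would come from a table or from a library routine such as SciPy's `scipy.stats.t.ppf`.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Critical value of the standard normal for tail in {'two', 'right', 'left'}."""
    std_normal = NormalDist()
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)   # compare with |z|
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)       # reject if z > value
    if tail == "left":
        return std_normal.inv_cdf(alpha)           # negative; reject if z < value
    raise ValueError("tail must be 'two', 'right' or 'left'")

def reject_h0(z, alpha, tail):
    """Return True if the test statistic z falls in the rejection region."""
    crit = z_critical(alpha, tail)
    if tail == "two":
        return abs(z) > crit
    if tail == "right":
        return z > crit
    return z < crit

print(round(z_critical(0.05, "two"), 3))
print(round(z_critical(0.05, "right"), 3))
```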
Examples
1. A professor wants to know if the introductory statistics class has a good grasp of basic
math. Six students are chosen at random from the class and given a math proficiency
test. The professor wants the class to be able to score at least 70 on the test. The six
students get scores of 62, 92, 75, 68, 83, and 95. Can the professor be certain that the
mean score for the class on the test would be at least 70 at the 0.05 level of significance?
Solution:
First, compute the sample mean and standard deviation.
\[ \bar{x} = \frac{\sum x_i}{n} = 79.17, \qquad s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = 13.17 \]
H0 : µ = 70
Ha : µ > 70
✓ Compute the appropriate test statistic. Since the sample size is small, t is appropriate
in this case.
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{79.17 - 70}{13.17/\sqrt{6}} = 1.71 \]
✓ Here we define the rejection region. Reject H0 if t > 2.015; otherwise do not reject
H0. Since the computed t value 1.71 is not greater than the critical value 2.015, we
fail to reject the null hypothesis.
✓ Interpretation:
At the 5% level of significance, the data do not provide sufficient evidence that the
mean score of the class is greater than 70, so the professor cannot be certain that
it is at least 70.
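The calculations of this example can be checked with a short sketch using the six scores given above. The critical value 2.015 is taken from the solution (t table, α = 0.05, 5 degrees of freedom) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

scores = [62, 92, 75, 68, 83, 95]   # data from the example
mu0 = 70                            # hypothesized mean
t_crit = 2.015                      # tabulated t value, alpha = 0.05, 5 d.f.

n = len(scores)
xbar = mean(scores)
s = stdev(scores)                   # sample standard deviation, divisor n - 1
t = (xbar - mu0) / (s / sqrt(n))    # test statistic

print(f"xbar = {xbar:.2f}, s = {s:.2f}, t = {t:.2f}")
print("Reject H0" if t > t_crit else "Fail to reject H0")
```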
2. A merchant believes that the average age of customers who purchase a certain brand
of wear is 13 years. A random sample of 35 customers had an average age of
15.6 years. At the 1% level of significance, should this conjecture be rejected? The
standard deviation of the population is 1 year.
Solution:
Let x be the age of customers who purchase a certain brand of wear. Given
µ0 = 13, n = 35, x̄ = 15.6 and σ = 1.
H0 : µ = 13
Ha : µ ≠ 13
✓ Specify the level of significance (α = 0.01). This is a two-tailed test, so
divide the alpha level by two: α/2 = 0.005.
✓ Compute the appropriate test statistic. Since the sample size is large (n > 30), z is
appropriate in this case.
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{15.6 - 13}{1/\sqrt{35}} = \frac{2.6}{0.169} = 15.38 \]
✓ Obtain the tabulated value from the standard normal table, which is zα/2 = z0.005 = 2.575.
✓ Define the rejection region. Reject H0 if |z| > zα/2; otherwise do not reject H0.
Since the calculated z value 15.38 is much greater than the tabulated value 2.575, the
null hypothesis is rejected.
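✓ Interpretation:
At the 1% level of significance, the merchant's conjecture that the average age of
customers is 13 years is rejected.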
7.4. Exercises
1. In order to ensure efficient usage of a server, it is necessary to estimate the mean number
of concurrent users. According to records, the sample mean and sample standard
deviation of number of concurrent users at 100 randomly selected times is 37.7 and 9.2,
respectively.
(a) Construct a 90% confidence interval for the mean number of concurrent users.
(b) Do these data provide significant evidence, at 1% significance level, that the mean
number of concurrent users is greater than 35?
2. To assess the accuracy of a laboratory scale, a standard weight that is known to weigh 1
gram is repeatedly weighed 4 times. The resulting measurements (in grams) are: 0.95,
1.02, 1.01, 0.98. Assume that the weighings by the scale when the true weight is 1 gram
are normally distributed with mean µ.
(b) Do these data give evidence at 5% significance level that the scale is not accurate?
8
Correlation is a mathematical tool for measuring the degree of linear relationship
(degree of association) between variables. Correlation that involves only two variables
is called simple correlation, and correlation that involves more than two variables is called
multiple correlation.
Covariance is a measure of the joint variation in two variables, i.e. it measures the way in
which the values of the two variables vary together.
1. If the covariance is zero, there is no linear relationship between the two variables.
3. If the covariance is positive, there is a direct linear relationship between the variables.
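The sign of the covariance can be illustrated with a small sketch. The formula used here, the sum of products of deviations divided by n − 1, is an assumed, commonly used definition of the sample covariance, and the data are hypothetical.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sample covariance with divisor n - 1 (an assumed, commonly used definition)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
print(sample_covariance(x, [2, 4, 6, 8, 10]))   # y rises with x: positive
print(sample_covariance(x, [10, 8, 6, 4, 2]))   # y falls as x rises: negative
print(sample_covariance(x, [2, 1, 4, 1, 2]))    # no linear trend: zero
```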
Pearson’s coefficient of correlation (r) is used to measure the strength of the linear relationship
between two variables.
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \times \sqrt{n\sum y^2 - (\sum y)^2}} \]
Interpretation of r
1. If the value of r is -1 or +1, there is perfect negative or perfect positive linear relationship
between the variables.
3. If r is -0.5 (or approximately -0.5) or +0.5 (or approximately +0.5), there is moderate
negative or moderate positive linear relationship between the variables.
Regression is defined as the estimation or prediction of the unknown value of one variable
from the known values of one or more variables. It also describes the functional relationship between two
or more variables. The variable whose values are to be estimated or predicted is known as the
dependent or explained variable, while the variables which are used in determining the value
of the dependent variable are called independent or predictor variables. The regression study
that involves only two variables is called simple regression and the regression analysis that
studies more than two variables is called multiple regression.
y = α + βx + ε
where
To estimate the regression coefficients (α̂ and β̂), the sum of the squares of the errors is
minimized (the method of least squares).
Then, from sample data the values of α̂ and β̂ can be obtained as follows:
\[ \hat{\beta} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}; \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \]
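The least-squares idea can be sketched numerically. The function names `fit_line` and `sse` and the data are assumptions made for this illustration. The code computes α̂ and β̂ from the formulas above and then checks that moving either coefficient slightly away from its estimate makes the sum of squared errors larger.

```python
def fit_line(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) from the raw-sum formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha_hat = sum_y / n - beta_hat * sum_x / n
    return alpha_hat, beta_hat

def sse(x, y, alpha, beta):
    """Sum of squared errors of the line y = alpha + beta * x."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
best = sse(x, y, a, b)
# Nudging either coefficient away from the estimates increases the SSE.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert sse(x, y, a + da, b + db) > best
print(a, b, best)
```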
The coefficient of determination tells how well the estimated model fits the data. For simple
linear regression (two variables case), it is defined as the square of the sample correlation
coefficient, and denoted by r². Hence r² measures the proportion or percentage of the variation
in the dependent variable explained by the independent variable. r² is a nonnegative quantity
which lies between 0 and 1. A value close to 1 indicates a good fit, and a value close to
0 indicates little or no linear relationship between the variables.
Example
A researcher wants to find out if there is a relationship between the heights of sons and the
heights of their fathers. In other words, do taller fathers have taller sons? The researcher
took a random sample of 6 fathers and their 6 sons. Their height in inches is given below in
an ordered array.
Father (X) 63 65 66 67 67 68
Son (Y) 66 68 65 67 69 70
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \times \sqrt{n\sum y^2 - (\sum y)^2}} = 0.597 \]
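The value of r can be verified with a short sketch applied to the father and son heights given in the table. The function name `pearson_r` is an assumption made for this illustration; it implements the raw-sum formula used in the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed with the raw-sum formula used in the text."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

father = [63, 65, 66, 67, 67, 68]   # X, heights in inches
son = [66, 68, 65, 67, 69, 70]      # Y, heights in inches

r = pearson_r(father, son)
print(f"r = {r:.4f}")
```

The computed value is about 0.5976, in agreement with the value 0.597 reported above, which indicates a moderate positive linear relationship.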
(b) Estimate the regression model of height of sons on height of fathers and interpret the
estimated parameters.
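\[ \hat{\beta} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6(26740) - (396)(405)}{6(26152) - (396)^2} = \frac{60}{96} = 0.625 \]
\[ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 67.5 - 0.625 \times 66 = 26.25 \]
\[ \hat{y} = 26.25 + 0.625x \]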
For each one-inch increase in the father's height, the son's height is estimated to increase by
0.625 inches on average.
Thus 35.7% of variation in the dependent variable (son height) is accounted for by the
variation of the independent variable (father height).
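The regression estimates and the coefficient of determination for this example can be reproduced with the following sketch. It uses the heights from the table; the helper `predict` is an assumption added only to show how the fitted line would be used.

```python
father = [63, 65, 66, 67, 67, 68]   # X, heights in inches
son = [66, 68, 65, 67, 69, 70]      # Y, heights in inches
n = len(father)

sum_x, sum_y = sum(father), sum(son)
sum_xy = sum(x * y for x, y in zip(father, son))
sum_x2 = sum(x * x for x in father)
sum_y2 = sum(y * y for y in son)

# Least-squares slope and intercept.
beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
alpha_hat = sum_y / n - beta_hat * sum_x / n

# r squared from the same sums (square of Pearson's r).
r_squared = (n * sum_xy - sum_x * sum_y) ** 2 / (
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

def predict(x):
    """Predicted son's height for a father's height x (inches)."""
    return alpha_hat + beta_hat * x

print(beta_hat, alpha_hat, round(r_squared, 3))
print(predict(66))
```

Note that a father of average height (66 inches) is predicted to have a son of average height (67.5 inches), since the fitted line always passes through the point (x̄, ȳ).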
8.4. Exercises
AGE SBP
15 116
20 120
25 130
30 132
40 150
50 148
a) Compute a regression line (systolic blood pressure (SBP) on AGE) and interpret
the results.
b) Compute correlation coefficient between SBP and AGE and also interpret the
result.
c) How much of the variance of SBP can be explained by the variability
in AGE?
2. An experiment was conducted to study the effect on sleeping time of increasing the
dosage of a certain barbiturate. Three readings were made at each of three dose levels:
b) Determine the regression line relating dosage (X) to sleeping time (Y).
THE END!!