
MAT271E

PROBABILITY AND STATISTICS

V9.05 2022

CONTENTS

Chapter 1. INTRODUCTION TO STATISTICS
    DEFINITION
    METHOD OF STATISTICS
    STATISTICAL VARIABLES
    MISUSE OF STATISTICS

Chapter 2. PROBABILITY
    THE PROBABILITIES OF SIMPLE EVENTS
    THE PROBABILITIES OF COMPOUND EVENTS
    PROBABILITIES OF ENGINEERING PROBLEMS
    EXAMPLES
    BERNOULLI TRIALS

Chapter 3. FREQUENCY ANALYSIS
    DEFINITIONS
    FREQUENCY DISTRIBUTIONS
    HISTOGRAMS AND FREQUENCY POLYGONS
    FREQUENCY HISTOGRAMS
    CUMULATIVE–FREQUENCY DISTRIBUTIONS AND OGIVES
    GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS OF CONTINUOUS DATA
    GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS OF DISCRETE DATA
    TYPES OF FREQUENCY CURVES

Chapter 4. THE MEAN, MEDIAN, MODE, AND OTHER MEASURES OF CENTRAL TENDENCY
    THE ARITHMETIC MEAN
    THE WEIGHTED ARITHMETIC MEAN
    THE GEOMETRIC MEAN G
    THE HARMONIC MEAN H
    THE RELATION BETWEEN THE ARITHMETIC, GEOMETRIC, AND HARMONIC MEANS
    THE MEDIAN
    THE MODE
    THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE
    THE ROOT MEAN SQUARE (RMS)
    QUANTILES
    QUARTILES, DECILES, AND PERCENTILES

Chapter 5. THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION
    THE VARIANCE
    THE STANDARD DEVIATION
    COEFFICIENT OF VARIATION
    SKEWNESS
    KURTOSIS

Chapter 6. PROBABILITY DISTRIBUTION FUNCTIONS
    INTRODUCTION
    NORMAL DISTRIBUTION
    LOGNORMAL DISTRIBUTION
    GAMMA DISTRIBUTION

Chapter 7. SAMPLING DISTRIBUTIONS
    THE CONCEPT OF SAMPLING DISTRIBUTION
    SAMPLING DISTRIBUTIONS

Chapter 8. STATISTICAL HYPOTHESIS TESTING
    HYPOTHESIS TESTS FOR PARAMETERS
    APPLICATIONS
    COMPARISON TEST WITH T-DISTRIBUTION

Chapter 9. REGRESSION ANALYSIS
    SIMPLE LINEAR REGRESSION ANALYSIS
    COMMON MISCONCEPTIONS ABOUT CORRELATION

Chapter 10. VARIANCE ANALYSIS
    INTRODUCTION
    STEPS OF VARIANCE ANALYSIS
    LSD (LEAST SIGNIFICANT DIFFERENCE) TEST

REFERENCES

Chapter 1

INTRODUCTION TO STATISTICS
DEFINITION
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation, and presentation of data. It is applicable to a wide variety of academic disciplines,
from the natural and social sciences to the humanities, and to government and business.
In short, statistics produces new information from existing data.
Statistical methods can be used to summarize or describe a collection of data; this is called
descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for
randomness and uncertainty in the observations, and then used to draw inferences about the process
or population being studied; this is called inferential statistics. Both descriptive and inferential
statistics comprise applied statistics. There is also a discipline called mathematical statistics,
which is concerned with the theoretical basis of the subject.
In applying statistics to a scientific, industrial, or societal problem, one begins with a process or
population to be studied. This might be a population of people in a country, of crystal grains in a
rock, or of goods manufactured by a particular factory during a given period. It may instead be a
process observed at various times; data collected about this kind of "population" constitute what
is called a time series.
For practical reasons, rather than compiling data about an entire population, one usually studies a
chosen subset of the population, called a sample. Data are collected about the sample in an
observational or experimental setting. The data are then subjected to statistical analysis, which
serves two related purposes: description and inference.
* Descriptive statistics can be used to summarize the data, either numerically or graphically, to
describe the sample. Basic examples of numerical descriptors include the mean and standard
deviation. Graphical summarizations include various kinds of charts and graphs.
* Inferential statistics is used to model patterns in the data, accounting for randomness and
drawing inferences about the larger population. These inferences may take the form of answers to
yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation),
descriptions of association (correlation), or modeling of relationships (regression). Other modeling
techniques include ANOVA, time series, and data mining.
The word statistics is also the plural of statistic (singular), which refers to the result of applying a
statistical algorithm to a set of data, as in economic statistics, crime statistics, etc.
The word "statistics" is used in several different senses. In the broadest sense, "statistics" refers to
a range of techniques and procedures for analyzing data, interpreting data, displaying data, and
making decisions based on data. This is what courses in "statistics" generally cover.

METHOD OF STATISTICS
Exact and unique solutions can be found for many problems encountered in the natural sciences when the values of the relevant variables are known. For example, we can determine the acceleration of a body of known mass under a given force by applying Newton's law of motion. This is known as the deterministic approach.

Consider, however, the throw of a die. Nobody could possibly tell which side would show up.
There are several problems where the outcome cannot be predicted with certainty.
Many examples can be given for engineering problems that cannot be solved deterministically.
We cannot know how much precipitation will fall in Istanbul next year, or how large a force would
cause a certain beam to fail. Uncertainties due to natural causes or the variability of the properties
of materials prohibit the prediction of the outcome in such problems. Problems of this type can be solved by a probabilistic approach.
Engineers often have to deal with uncertainties. Examples are the annual volume of flow in a
stream, the traffic density at a junction, the magnitude of the next earthquake at a certain location.
These variables will behave randomly, making an exact prediction impossible. Engineers must
take the uncertainty into consideration to achieve reliable and economical solutions. Only when the scatter around the mean value is small enough not to have a significant effect can we neglect the uncertainty and adopt the mean value of the variable.

STATISTICAL VARIABLES
A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a prescribed set of values,
called the domain of the variable. If the variable can assume only one value, it is called a constant.

Continuous and Discrete Variables


Some variables (such as reaction time) are measured on a continuous scale. There is an infinite
number of possible values these variables can take on. Other variables can only take on a limited
number of values. For example, if a dependent variable were a subject's rating on a five-point scale where only the values 1, 2, 3, 4, and 5 were allowed, then only five possible values could occur. Such variables are called "discrete" variables.

EXAMPLE 1.1
The number N of children in a family, which can assume any of the values 0, 1, 2, 3,... but cannot
be 2.5 or 3.842, is a discrete variable.

EXAMPLE 1.2
The age A of an individual, which can be 62 years, 63.8 years, or 65.8341 years, depending on the
accuracy of measurement, is a continuous variable.
Data that can be described by a discrete or continuous variable are called discrete data or
continuous data, respectively. The number of children in each of 1000 families is an example of
discrete data, while the heights of 100 university students is an example of continuous data. In
general, measurements give rise to continuous data, while enumerations, or countings, give rise to
discrete data.

Quantitative and Qualitative Variables


A variable is any measured characteristic or attribute that differs for different subjects. For
example, if the weight of 30 subjects were measured, then weight would be a variable.
Quantitative variables are measured on an ordinal, interval, or ratio scale; qualitative variables are
measured on a nominal scale.

EXAMPLE 1.3
If students were asked to name their favorite color, then the variable would be qualitative (e.g., pink, yellow, red, ...).

EXAMPLE 1.4
If the time taken to answer the above question were measured, then the variable would be quantitative (e.g., 1.35 seconds, 2 seconds, ...).

MISUSE OF STATISTICS
“If you need statistics to analyze your experiment, then you've done the wrong experiment. If your
data speak for themselves, don't interrupt!”
“There are three kinds of lies: lies, damned lies, and statistics.”

This well–known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the
U.S. by Mark Twain: The semi–ironic statement refers to the persuasive power of numbers, and
succinctly describes how even accurate statistics can be used to bolster inaccurate arguments.
How to Lie with Statistics is Darrell Huff's perennially popular introduction to statistics for the
general reader. Written in 1954, it is a brief, breezy, illustrated volume outlining the common
errors, both intentional and unintentional, associated with the interpretation of statistics, and how
these errors can lead to biased or inaccurate conclusions. Although a number of more recent
versions have been released, the original edition contained humorous, witty illustrations by Irving
Geis.
Over time it has become one of the most widely read statistics books in history, with over one and
a half million copies sold in the English–language edition. It has also been widely translated.
Themes of the book include "correlation does not imply causation" and the importance of random sampling. It also shows how statistical graphs can be used to distort reality:
– By truncating the bottom of a line or bar chart, one makes differences seem larger than they are.
– By representing one-dimensional quantities on a pictogram by two- or three-dimensional objects to compare their sizes, one makes the reader forget that the images do not scale the same way the quantities do. Two rows of small images would give a better idea than one small and one big one.

Chapter 2

PROBABILITY
The chance of the occurrence of a random event is defined as its probability. The basic axiom
of the probability theory states that each random event has a certain probability that varies in the
range of 0 to 1. Denoting the random variable by a capital letter and its value in an observation by
the corresponding small letter we can write
P(X=xi)=pi
where X = xi, is a random event, P is the symbol for the probability of the event, and pi is the
probability of the event X = xi
pi = 0 implies that the event X = xi will never occur, pi =1 means that the event will occur in all
observations. With the increase of the probability from 0 to 1, the chance of occurrence of the
event also increases, i.e. the event is seen more frequently.

THE PROBABILITIES OF SIMPLE EVENTS


How can we estimate the probabilities of random events? In the case of simple variables, the
probabilities can be determined by logic.

EXAMPLE 2.1
We are throwing a die. The probability of throwing a 6 is given as:
P(X = 6) = 1/6
There are six random events in the throw of a die with equal probabilities.
P( X = 1) = P(X = 2) = P(X = 3) = (P(X = 4) = P(X = 5) = P(X = 6) = 1/6
As one of these events will certainly occur in each throw, the total probability is

    Σ_{i=1}^{6} P(X = i) = 1

As a rule, the total probability of all simple events is always equal to one:

    Σ_{i=1}^{n} P(X = i) = 1

EXAMPLE 2.2
What is the probability that a card drawn at random from a deck of cards will be an ace?

SOLUTION
Since of the 52 cards in the deck, 4 are aces, the probability is 4/52.

1 ♠ 1 ♥ 1 ♦ 1 ♣
2 ♠ 2 ♥ 2 ♦ 2 ♣
3 ♠ 3 ♥ 3 ♦ 3 ♣
4 ♠ 4 ♥ 4 ♦ 4 ♣
5 ♠ 5 ♥ 5 ♦ 5 ♣
6 ♠ 6 ♥ 6 ♦ 6 ♣
7 ♠ 7 ♥ 7 ♦ 7 ♣
8 ♠ 8 ♥ 8 ♦ 8 ♣
9 ♠ 9 ♥ 9 ♦ 9 ♣
10 ♠ 10 ♥ 10 ♦ 10 ♣
11 ♠ 11 ♥ 11 ♦ 11 ♣
12 ♠ 12 ♥ 12 ♦ 12 ♣
13 ♠ 13 ♥ 13 ♦ 13 ♣
♠: spade ♥: heart ♦: diamond ♣: club
In general, the probability of an event is the number of favorable outcomes divided by the total
number of possible outcomes. (This assumes the outcomes are all equally likely.)

EXAMPLE 2.3
The same principle can be applied to the problem of determining the probability of obtaining
different totals from a pair of dice.

SOLUTION
As shown below, there are 36 possible outcomes when a pair of dice is thrown.

To calculate the probability that the sum of the two dice will equal 5, calculate the number of
outcomes that sum to 5 and divide by the total number of outcomes (36). Since four of the
outcomes have a total of 5 (1,4; 2,3; 3,2; 4,1), the probability of the two dice adding up to 5 is 4/36
= 1/9 . In like manner, the probability of obtaining a sum of 12 is computed by dividing the number
of favorable outcomes (there is only one) by the total number of outcomes (36). The probability is
therefore 1/36 .
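This kind of enumeration is easy to reproduce in code. The short Python sketch below is illustrative only (the function name p_sum is ours, not part of the text); it counts the favorable outcomes for any target sum:

    from itertools import product

    def p_sum(target):
        # enumerate all 36 equally likely (die1, die2) outcomes
        outcomes = list(product(range(1, 7), repeat=2))
        favorable = sum(1 for a, b in outcomes if a + b == target)
        return favorable / len(outcomes)

    print(p_sum(5))    # 4/36 = 1/9 = 0.111...
    print(p_sum(12))   # 1/36 = 0.0277...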

THE PROBABILITIES OF COMPOUND EVENTS


While the probabilities of simple events are calculated easily by using the simple probability rules, the probabilities of compound events are calculated by using set theory. The same rules of set theory apply to the probabilities of compound events.

Set Theory
If two sets have no common point, they are called disjoint sets; their intersection is the empty set. In Fig. 2.1, the sets A and B are disjoint, A∩B = φ. Just as in set theory, the union of two events, AUB, consists of the points that are in at least one of the events.

Fig. 2.1. Disjoint and non-disjoint events in the sample space of a random variable

Probability theory tells us how the probability of a compound random event can be computed when
the probabilities of (simple or compound) events that constitute it are known.

Disjoint events (Mutually exclusive events)


The probability of a compound event consisting of two disjoint events is the sum of their
probabilities. In Fig. 2.3 the probability of AUB is the sum of the probabilities P(A) and P(B).

P(AUB) = P(A) + P(B) if (A∩B) = φ

By extending this theorem, we can see that the sum of the probabilities of all simple events of a random variable is equal to one, as one of these simple events will certainly occur in an observation. Such events are called mutually exclusive events.

Mutually Exclusive Events


Two events are mutually exclusive if it is not possible for both of them to occur. For example, if a die is rolled, the
event "getting a 1" and the event "getting a 2" are mutually exclusive since it is not possible for the die to be both a
one and a two on the same roll. The occurrence of one event "excludes" the possibility of the other event.

EXAMPLE 2.4
What is the probability of rolling a die and getting either a 1 or a 6?

SOLUTION
Since it is impossible to get both a 1 and a 6, these two events are mutually exclusive. Therefore,
P(1 U 6) = P(1) + P(6) = 1/6 + 1/6 = 1/3

Nondisjoint events (Not mutually exclusive events)
The probability of the union of two events that are not disjoint can easily be computed by the Venn
diagram, where each event is represented by the region inside a closed curve whose area is assumed
to be proportional to the probability of the event. In the Venn diagram of Fig. 2.4.

P(AUC) = P(A) + P(C) – P(A∩C)

It is seen that in the general case we must subtract the probability of the intersection from the sum of the probabilities P(A) and P(C). Eq. (2.5) reduces to Eq. (2.4) when A and C are disjoint events.

EXAMPLE 2.5
What is the probability that a card selected from a deck will be either an ace (1) or a spade
(♠)?
SOLUTION
The relevant probabilities are:
P(1) = 4/52
P(♠) = 13/52
The only way in which an ace and a spade can both be drawn is to draw the ace of spades. There
is only one ace of spades, so:
P(1 and ♠) = 1/52 .
The probability of an ace or a spade can be computed as:
P(1 or spade) =P(1) + P(♠) – P(1 and ♠) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13.

EXAMPLE 2.6
Find the probability of the union of the following two events in the throw of a die:
P(E) = P{2,4,6} = 1/2
P(S) = P{1,2} = 1/3
SOLUTION
The intersection of the events E and S is E∩S = {2}, so P(E∩S) = 1/6.
The probability of the union of the events E and S is then
P(EUS) = P{1,2,4,6} = P(E) + P(S) – P(E∩S) = 1/2 + 1/3 – 1/6 = 2/3

Multiplication rule

The probability that several independent events all occur (one event, then another, and so on) is the product of their individual probabilities:

P(A and B) = P(A∩B)= P(A).P(B)

In other words, the probability of A and B both occurring is the product of the probability of A
and the probability of B.

EXAMPLE 2.7
What is the probability that a fair coin will come up with heads twice in a row? Two events
must occur: a head on the first toss and a head on the second toss.

SOLUTION
Since the probability of each event is 1/2, the probability of both events is: 1/2 x 1/2 = 1/4.

EXAMPLE 2.8
Consider a problem: someone draws a card at random out of a deck, replaces it, and then draws another card at random. What is the probability that the first card is the ace of clubs and the second card is a club (any club)?

SOLUTION
Since there is only one ace of clubs in the deck, the probability of the first event is 1/52. Since
13/52 = 1/4 of the deck is composed of clubs, the probability of the second event is 1/4. Therefore,
the probability of both events is: 1/52 x 1/4 = 1/208 .

EXAMPLE 2.9
Consider the probability of rolling two dice and getting a 6 on both of the rolls.
SOLUTION
The events are defined in the following way:
Event A: 6 on the first roll: p(A) = 1/6
Event B: 6 on the second roll: p(B) = 1/6
P(6 ∩ 6) = 1/6 x 1/6
P(6 ∩ 6) = 1/36

Conditional Probability
A conditional probability is the probability of an event given that another event has occurred. For
example, what is the probability that the total of two dice will be greater than 8 given that the first
die is a 6?
This can be computed by considering only outcomes for which the first die is a 6. Then, determine
the proportion of these outcomes that total more than 8. All the possible outcomes for two dice are
shown below:

There are 6 outcomes for which the first die is a 6, and of these, there are four that total more than
8 (6,3; 6,4; 6,5; 6,6). The probability of a total greater than 8 given that the first die is 6 is therefore
4/6 = 2/3.
More formally, this probability can be written as:
P(total>8 | Die 1 = 6) = 2/3.
In this equation, the expression to the left of the vertical bar represents the event and the expression
to the right of the vertical bar represents the condition. Thus it would be read as "The probability
that the total is greater than 8 given that Die 1 is 6 is 2/3."
In more abstract form, P(A|B) is the probability of event A given that event B occurred.

P(A∩B)= P(A).P(B|A)

where P(B|A) is the conditional probability of B given A.
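As a quick check of the dice example above, the following sketch (Python, illustrative) enumerates the outcomes with Die 1 = 6 and measures the proportion that total more than 8:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))      # all 36 outcomes
    given = [(a, b) for a, b in outcomes if a == 6]      # condition: Die 1 = 6
    favorable = [(a, b) for a, b in given if a + b > 8]  # event: total > 8
    print(len(favorable), "/", len(given))               # 4 / 6, i.e. 2/3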

EXAMPLE 2.10
If someone draws a card at random from a deck and then, without replacing the first card,
draws a second card, what is the probability that both cards will be aces?

SOLUTION
Event A is that the first card is an ace. Since 4 of the 52 cards are aces, P(A) = 4/52 = 1/13. Given
that the first card is an ace, what is the probability that the second card will be an ace as well? Of
the 51 remaining cards, 3 are aces.
Therefore, P(B|A) = 3/51 = 1/17
P (A and B) = 1/13 x 1/17 = 1/221.

EXAMPLE 2.11
Find the probability of drawing the letter İ, then the letter T, then the letter Ü, when the letters are drawn from a box containing the 29 letters of the Turkish alphabet.

SOLUTION
Two cases:
A) The probability when we replace the letters in the box
B) The probability when we do not replace the letters in the box

A) Replacing the letters in the box after each draw, the probability is


P(İ and T and Ü) = 1/29 . 1/29 . 1/29 = 1 / 24389

B) Without replacing the letters, the probability is


P(İ and T and Ü) = 1/29 . 1/28 . 1/27 = 1 / 21924
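The two cases differ only in whether the pool of letters shrinks after each draw. A minimal sketch of both computations (Python, illustrative; exact fractions are used so the results match the ones above):

    from fractions import Fraction

    n = 29  # letters in the Turkish alphabet

    # (A) with replacement: the pool stays at 29 letters for every draw
    p_with = Fraction(1, n) ** 3
    # (B) without replacement: the pool shrinks by one letter per draw
    p_without = Fraction(1, n) * Fraction(1, n - 1) * Fraction(1, n - 2)

    print(p_with)      # 1/24389
    print(p_without)   # 1/21924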

Complementary Event Rule

The event that an odd number shows in the throw of a die (E) is the complementary event of an even number (Ē).
Points that are not in a region E of the sample space are in the region Ē, called the complementary event. It is easily seen that
P(E) + P(Ē) = 1

EXAMPLE 2.12
Consider the throw of a die and find the probabilities of getting more than 4 and of getting less than or equal to 4.

SOLUTION
The events are defined in the following way:
P(X > 4) = 2/6
P(X ≤ 4) = 4/6
Getting more than 4 and getting less than or equal to 4 are complementary events; therefore
P(X > 4) + P(X ≤ 4) = 2/6 + 4/6 = 1

PROBABILITIES OF ENGINEERING PROBLEMS


In engineering problems the above approach is usually not applicable for estimating probabilities. We must base our estimates on frequencies. The frequency of a random event is the ratio of the number of times it occurs to the total number of observations. If the event X = xᵢ is observed nᵢ times during N experiments, its frequency is

    fᵢ = nᵢ / N

The probability of a random event is defined as the limit of its frequency as the number of observations approaches infinity:

    pᵢ = lim_{N→∞} (nᵢ / N)

Although we can never make an infinite number of observations, it can be assumed that fᵢ approaches pᵢ quite rapidly as N increases. If no precipitation has been observed at a station on nᵢ = 900 days during a period of N = 1500 days, then the probability of no precipitation can be estimated as

    P(X = 0) = 900 / 1500 = 0.60

As the observation period increases, the estimated frequency becomes a better estimate of the true probability.
Sometimes the term "probability" is used in a somewhat different sense. The probability of an
event that cannot be observed many times cannot be estimated by using the mentioned equations
but we can provide an estimate on the basis of our past experience and information on its structure.
Such an estimate will obviously be subjective. Compare, for example, the statement "The probability of rain tomorrow is 50%" with the statement "The probability of a rainy day at this location is 50%". The methods of probability theory, however, can be applied no matter how the probabilities have been estimated.
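The convergence of the frequency fᵢ toward the probability pᵢ can be illustrated by simulation. The sketch below (Python, illustrative; it assumes a fair die) shows the frequency of a six drifting toward p = 1/6 as N grows:

    import random

    random.seed(1)                      # fixed seed, so the run is reproducible
    for n in (100, 10_000, 1_000_000):
        sixes = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
        print(n, sixes / n)             # frequency approaches p = 1/6 = 0.167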

EXAMPLES

EXAMPLE 2.13
The number of vehicles waiting for a left turn at a cross–section is observed to vary between
0 and 6, with the following probabilities:
P(X=0)=4/60
P(X=1)=16/60
P(X=2)=20/60
P(X=3)=14/60
P(X=4)=3/60
P(X=5)=2/60
P(X=6)=1/60

SOLUTION
There are 7 simple events in the sample space of the random variable X "the number of vehicles
waiting for a left turn", the sum of their probabilities adding up to 1.
We can compute, for example, the probability that more than 3 vehicles are waiting as simple
events are disjoint:
P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6) = 3 / 60 + 2 / 60 + 1 / 60 = 6/60=1/10
The event "less than or equal to 3 vehicles waiting" is the complementary of the above event.
Therefore
P(X ≤ 3) = 1 − P(X > 3) = 1 − 1/10 = 9/10

EXAMPLE 2.14
There are 3 bulldozers at a construction site, each having a probability of no failure during
the total period of construction equal to 0.50. Let us consider the random variable X “the
number of bulldozers in operation throughout the construction period".

SOLUTION
There are 4 events in the sample space of X as X can be equal to 0, 1, 2 or 3. Let us compute their
probabilities.
Denoting a bulldozer in operation by S, and a bulldozer not in operation by F, the following 8 combinations are possible for the 3 bulldozers: FFF, FFS, FSF, SFF, FSS, SFS, SSF, SSS. These combinations have equal probabilities because the probability of failure (F) is assumed to be equal to the probability of no failure (success, S). Since the sum of the probabilities should be equal to 1, each of the above combinations has a probability of 1/8.
Now we can determine the probabilities of the random events of the variable X:
P(X = 0) = P(FFF) = 1/8
P(X = 1) = P(FFS) + P(FSF) + P(SFF) = 1/8 + 1/8 + 1/8 = 3/8
P(X = 2) = P(FSS) + P(SFS) + P(SSF) = 1/8 + 1/8 + 1/8 = 3/8
P(X = 3) = P(SSS) = 1/8
The sum of these probabilities is 1, as expected.
We shall discuss later, in relation to the Bernoulli trials, how this problem can be solved when the probability of failure is not equal to 0.50.
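The enumeration of the S/F combinations can also be done mechanically. The sketch below (Python, illustrative; the function name is ours) reproduces the probabilities above and also works when the failure probability is not 0.50:

    from itertools import product

    def pmf_in_operation(p_fail=0.5, machines=3):
        # P(X = x), X = number of machines in operation (S), found by
        # enumerating all S/F combinations exactly as in the example above
        probs = {}
        for combo in product("SF", repeat=machines):
            x = combo.count("S")
            p = 1.0
            for c in combo:
                p *= (1 - p_fail) if c == "S" else p_fail
            probs[x] = probs.get(x, 0.0) + p
        return dict(sorted(probs.items()))

    print(pmf_in_operation())      # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
    print(pmf_in_operation(0.3))   # also works for unequal probabilities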

EXAMPLE 2.15
It is known that the probability that a job is completed in 2–4 days is 0.50, in 4–6 days
0.25, and in 2–6 days 0.55. Let us compute the probability that the job is completed in 4
days.

SOLUTION
Let us define the following events:
A = {(X = 2) ∪ (X = 3) ∪ (X = 4)},  P(A) = 0.50
B = {(X = 4) ∪ (X = 5) ∪ (X = 6)},  P(B) = 0.25
The union and intersection of the events A and B are:
A ∪ B = {(X = 2) ∪ (X = 3) ∪ (X = 4) ∪ (X = 5) ∪ (X = 6)},  P(A ∪ B) = 0.55
A ∩ B = {X = 4}
The probability of the intersection can be computed by Eq. (2.5):
P(A ∩ B) = P(X = 4) = P(A) + P(B) − P(A ∪ B) = 0.50 + 0.25 − 0.55 = 0.20
The probability that the job is completed in 4 days is therefore 0.20.

EXAMPLE 2.16
The electric circuit below has three switches, A, B, and C. The probabilities that the switches are open are P(A) = 0.15, P(B) = 0.10, and P(C) = 0.02. Calculate the probability that the lamp will be off.

(Figure: electric circuit with switches A, B, and C and the lamp.)
SOLUTION
The lamp will be off when both A and B are open, or when C is open. Thus we must compute the probability of the union of A ∩ B and C. Using the addition and multiplication rules together:

P((A ∩ B) ∪ C) = P(A ∩ B) + P(C) − P((A ∩ B) ∩ C)

Assuming that the events are independent, the probabilities of the intersections are
P(A ∩ B) = P(A) · P(B) = 0.15 × 0.10 = 0.015
P((A ∩ B) ∩ C) = P(A ∩ B) · P(C) = 0.015 × 0.02 = 0.0003
Substituting into the equation for the probability of the union:
P((A ∩ B) ∪ C) = 0.015 + 0.02 − 0.0003 = 0.0347

EXAMPLE 2.17
One can get from town 1 to town 2 either by route A, or by routes B and C through town 3. In winter the probabilities of the routes being open are P(A) = 0.40, P(B) = 0.75, P(C) = 0.67. These events are not independent. The probability that route C is open when B is open is P(C|B) = 0.80, and the probability that route A is open when both B and C are open is P(A|B∩C) = 0.5. Let us determine the probability that one can get from 1 to 2 in winter.

SOLUTION
Travel between points 1 and 2 is possible if route A is open, or if both B and C are open. First, the probability that both B and C are open is
P(B ∩ C) = P(C|B) P(B) = 0.80 × 0.75 = 0.60
The probability of the intersection P(A ∩ (B ∩ C)) is
P(A ∩ (B ∩ C)) = P(A|(B ∩ C)) P(B ∩ C) = 0.50 × 0.60 = 0.30
Substituting into the equation for the probability of the union, the probability of travel is
P(A ∪ (B ∩ C)) = P(A) + P(B ∩ C) − P(A ∩ (B ∩ C)) = 0.40 + 0.60 − 0.30 = 0.70
EXAMPLE 2.18
A structural frame consists of 3 elements with probabilities of failure P(A) = 0.05, P(B) = 0.04, P(C) = 0.03. The events of failure of the elements are assumed to be independent.
What is the probability of failure of the frame (the frame fails when at least one of its elements fails)?

SOLUTION
Since the frame fails when at least one of the elements fails, we must determine the probability of the union of the failure events, P(A ∪ B ∪ C).

The probability of the union of three events can be written as:


P(A U B U C) = P(A) + P(B) + P(C)– P(A∩B) – P(A∩C) – P(B∩C) + P(A∩B∩C)
The probabilities of intersection can be computed as the products of the probabilities of individual
events as the events are assumed to be independent:
P(A ∩ B) = P(A) P(B) = 0.05 × 0.04 = 0.002
P(A ∩ C) = P(A) P(C) = 0.05 × 0.03 = 0.0015
P(B ∩ C) = P(B) P(C) = 0.04 × 0.03 = 0.0012
P(A ∩ B ∩ C) = P(A) P(B) P(C) = 0.05 × 0.04 × 0.03 = 0.00006
Substituting into the first equation we get the probability of failure of the frame:
P(A ∪ B ∪ C) = 0.05 + 0.04 + 0.03 − 0.002 − 0.0015 − 0.0012 + 0.00006 = 0.11536
We could also solve the problem by computing the probability that no element of the frame fails.
Since the events of failure and no failure are complementary:
P(Ā) = 1 − P(A) = 0.95
P(B̄) = 1 − P(B) = 0.96
P(C̄) = 1 − P(C) = 0.97
Because of the assumption of independence:
P(Ā ∩ B̄ ∩ C̄) = P(Ā) P(B̄) P(C̄) = 0.95 × 0.96 × 0.97 = 0.88464
The probability of failure of the frame is
P(A ∪ B ∪ C) = 1 − P(Ā ∩ B̄ ∩ C̄) = 1 − 0.88464 = 0.11536
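Both routes to the answer take only a few lines of arithmetic; a quick numerical check (Python, illustrative):

    pA, pB, pC = 0.05, 0.04, 0.03

    # inclusion-exclusion for P(A U B U C), with independent events
    union = (pA + pB + pC
             - pA*pB - pA*pC - pB*pC
             + pA*pB*pC)
    # complement route: 1 - P(no element fails)
    complement = 1 - (1 - pA) * (1 - pB) * (1 - pC)

    print(union, complement)       # both are 0.11536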

EXAMPLE 2.19

A frame is held by two rods, A and B, each with a probability of breaking equal to 0.1. The probability that one of the rods breaks when the other does is 0.8. Determine the probability of failure of the frame. (The frame fails when either one of the rods breaks.)


SOLUTION
We are looking for the probability P(A ∪ B), where A and B denote the events of breaking of the rods. From the equation
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
where the probability of the intersection is
P(A ∩ B) = P(A|B) P(B) = 0.8 × 0.1 = 0.08
Substituting into the first equation:
P(A ∪ B) = 0.1 + 0.1 − 0.08 = 0.12
If the events A and B were independent:
P(A ∩ B) = P(A) P(B) = 0.1 × 0.1 = 0.01
P(A ∪ B) = 0.1 + 0.1 − 0.01 = 0.19
If the events A and B were functionally dependent, i.e. P(A|B) = 1:
P(A ∩ B) = P(A|B) P(B) = 1 × 0.1 = 0.1
P(A ∪ B) = 0.1 + 0.1 − 0.1 = 0.10
The probability of failure of the frame may thus vary between 0.10 and 0.19, depending on the degree of dependence of the events A and B.

BERNOULLI TRIALS
Let us consider an experiment where only two outcomes are possible (there are only two simple
events in the sample space). Suppose one of the outcomes corresponds (arbitrarily) to "success"
and the other to "failure". The probability of the success will be denoted by p, and the probability
of the failure by q=1–p. Such an experiment is called a Bernoulli trial.
As a simple example, if the "success" in the throw of a die is equated with the throw of a six, then
p=1/6 and q=5/6

EXAMPLE 2.20

Let us repeat the die-throwing experiment n times (independent Bernoulli trials), and consider the random variable X, the number of successes in n trials. X is a discrete variable that is an integer in the range 0 to n. Let us compute the probability of X = x in n trials. Suppose n = 3 (three trials).

SOLUTION
The event of no success (X = 0) occurs only when all three trials are failures. Since the trials are independent, this has the probability
P(X = 0) = qqq = q³
One success in 3 trials can occur in three different ways: first trial successful and the others failures, second trial successful and the others failures, or third trial successful and the others failures. Each of these three events has the probability pq², and the probability of their union is
P(X = 1) = 3pq² : (pqq) + (qpq) + (qqp)
Two successes in 3 trials can also occur in three different ways: the first two trials successful and the third a failure, the first and third trials successful and the second a failure, or the second and third trials successful and the first a failure. Each event has the probability p²q, and therefore the probability of 2 successes in 3 trials is
P(X = 2) = 3p²q : (ppq) + (pqp) + (qpp)
Finally, the probability of 3 successes in 3 trials is
P(X = 3) = ppp = p³

Generalizing to n trials, the probability of x successes can be computed as

    P(X = x) = C(n, x) p^x q^(n−x),   x = 0, 1, ..., n

The distribution of the variable X given by this equation is called the Bernoulli (binomial) distribution. Here C(n, x) denotes the number of combinations of n different things taken x at a time, computed as

    C(n, x) = n! / (x! (n − x)!)

These coefficients can easily be found from Pascal's triangle:

n=0    1
n=1    1 1
n=2    1 2 1
n=3    1 3 3 1
n=4    1 4 6 4 1
n=5    1 5 10 10 5 1
n=6    1 6 15 20 15 6 1
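The binomial probabilities are also easy to evaluate directly; Python's standard library provides the combination count C(n, x). A minimal sketch (illustrative; the function name binom_pmf is ours):

    from math import comb

    def binom_pmf(x, n, p):
        # P(X = x): probability of x successes in n Bernoulli trials
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # three throws of a die, "success" = throwing a six (p = 1/6)
    for x in range(4):
        print(x, binom_pmf(x, 3, 1/6))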

EXAMPLE 2.21
The probability of a successful bid for a contractor is assumed to be p = 1/3. Let us compute the probabilities of 0, 1, 2, 3, and 4 successes in 4 bids. The variable X denotes the number of successes in 4 bids.

SOLUTION
P(X = 0) = C(4, 0) (1/3)⁰ (2/3)⁴ = 16/81
P(X = 1) = C(4, 1) (1/3)¹ (2/3)³ = 32/81
P(X = 2) = C(4, 2) (1/3)² (2/3)² = 24/81
P(X = 3) = C(4, 3) (1/3)³ (2/3)¹ = 8/81
P(X = 4) = C(4, 4) (1/3)⁴ (2/3)⁰ = 1/81
The probability that the contractor is successful at least once in 4 bids can be computed as follows:
P[X ≥ 1] = 1 − P[X < 1] = 1 − P[X = 0] = 1 − 16/81 = 65/81

We can compute the probability that the first success occurs on the yth trial as follows. This happens when the first y − 1 trials are failures and the yth trial is a success, which has the probability

    P(Y = y) = q^(y−1) p,   y = 1, 2, ...   (2.72)

This is called the geometric distribution. Its parameters are:

    E(Y) = 1/p,   Var(Y) = q/p²   (2.73)

As expected, a success will occur, on the average, once in every 1/p trials. In the example of the throw of a die, the probability of a six in the first trial is P(Y = 1) = 1/6, in the second trial P(Y = 2) = (5/6)(1/6) = 5/36, etc. We can expect to throw a six once in every six trials, on the average.

EXAMPLE 2.23
In the previous example, the probabilities of the first success occurring in the first, second, third, ... bids are:
P(Y = 1) = (2/3)⁰ · 1/3 = 1/3
P(Y = 2) = (2/3)¹ · 1/3 = 2/9
P(Y = 3) = (2/3)² · 1/3 = 4/27
P(Y = 4) = (2/3)³ · 1/3 = 8/81
...
It can be expected that the first success will occur, on the average, in trial no. E(Y) = 1/p = 3.
The return period (recurrence interval) T is defined as the average interval between two consecutive successes. As this coincides with the average time to the first success, it is seen that

    T = 1/p   (2.74)
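A short sketch (Python, illustrative) of the geometric probabilities and the return period for the bidding example above (p = 1/3):

    def geom_pmf(y, p):
        # P(Y = y): the first success occurs on trial y
        return (1 - p)**(y - 1) * p

    p = 1/3
    for y in range(1, 5):
        print(y, geom_pmf(y, p))     # 1/3, 2/9, 4/27, 8/81
    print("E(Y) = T =", 1/p)         # mean waiting time = return period = 3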

EXAMPLE 2.24
The spillway of a dam is designed for a discharge that is exceeded with the probability of
0.01 each year. The life of the dam is assumed to be 50 years. What is the probability that
the design flow is exceeded (at least once) during the life of the project?

SOLUTION
Using the binomial (Bernoulli) distribution:

    P(X ≥ 1) = 1 − P(X = 0) = 1 − C(50, 0) (0.01)⁰ (0.99)⁵⁰ = 0.395

What is the probability that the first exceedance occurs after more than 10 years? Using Eq. (2.72) of the geometric distribution:

    P(Y > 10) = 1 − Σ_{y=1}^{10} (0.99)^(y−1) (0.01) = (0.99)¹⁰ = 0.90

What is the average interval between two consecutive exceedances of the design flood? Eq. (2.73) of the geometric distribution gives:

    E(Y) = 1/0.01 = 100 years

By Eq. (2.74) this is also the return period of the design flood, T = 100 years, which means that the design flood (or a larger flood) will occur, on the average, once every 100 years (this is called the 100-year flood). It should not be understood that this event (exceedance of the design flood) will occur at regular intervals of 100 years. In fact there is a probability of 0.01² = 0.0001 that it will occur twice in two consecutive years.
If the spillway were designed for the 50-year flood (p = 1/50), the probability of no such flood occurring during the life of the project would be:

    P(X = 0) = C(50, 0) (1/50)⁰ (49/50)⁵⁰ = 0.364
The probability of no occurrence of an event with a return period of T years during a time interval of T years can be computed as

    P(X = 0) = (1 − p)^T = 1 − Tp + [T(T − 1)/2] p² − ...

which approaches e^(−Tp) for large values of T. Therefore, with T = 1/p:

    P(X = 0) ≅ e^(−Tp) = e^(−1) = 0.368

The value of 0.364 obtained for T = 50 years is very close to this.
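The approach of (1 − 1/T)^T to e⁻¹ = 0.368 is quick to tabulate (Python, illustrative):

    import math

    for T in (10, 50, 100, 1000):
        print(T, (1 - 1/T)**T)       # 0.349, 0.364, 0.366, 0.368 -> 1/e
    print(1/math.e)                  # 0.3679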

EXAMPLE 2.25
A breakwater is designed for waves of the return period T=10 years (probability of
exceedance p=0.10). The probability that the breakwater is damaged when larger waves
occur is assumed to be 0.20. What is the probability of damage in a 3 year period?

SOLUTION
The probabilities of larger waves occurring in X = 0, 1, 2, and 3 years of a 3-year period are calculated from the binomial (Bernoulli) distribution:
P(X = 0) = C(3, 0) (0.10)⁰ (0.90)³ = 0.729
P(X = 1) = C(3, 1) (0.10)¹ (0.90)² = 0.243
P(X = 2) = C(3, 2) (0.10)² (0.90)¹ = 0.027
P(X = 3) = C(3, 3) (0.10)³ (0.90)⁰ = 0.001
Using the total probability theorem, the probability of no damage in a 3-year period is
1.0 × 0.729 + (1 − 0.20) × 0.243 + (1 − 0.20)² × 0.027 + (1 − 0.20)³ × 0.001 = 0.94
The probability of damage in 3 years is
1 − 0.94 = 0.06
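The total-probability computation generalizes to any period length. A brief sketch (Python, illustrative; the function name and default arguments are ours):

    from math import comb

    def p_damage(years, p_wave=0.10, p_dmg=0.20):
        # P(at least one damaging event in `years` years), by total probability
        p_no_damage = 0.0
        for x in range(years + 1):      # x = number of years with larger waves
            p_x = comb(years, x) * p_wave**x * (1 - p_wave)**(years - x)
            p_no_damage += p_x * (1 - p_dmg)**x   # no damage in any wave year
        return 1 - p_no_damage

    print(p_damage(3))                  # about 0.06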


Chapter 3

FREQUENCY ANALYSIS
DEFINITIONS

Raw Data
Raw data are collected data that have not been organized numerically. An example is the set of
weights of 100 male students obtained from an alphabetical listing of university records.

Arrays
An array is an arrangement of raw numerical data in ascending or descending order of magnitude.

Range
The difference between the largest and smallest numbers is called the range of the data. For example, if the largest weight of 100 male students is 74 kilograms (kg) and the smallest weight is 60 kg, the range is 74 − 60 = 14 kg.

FREQUENCY DISTRIBUTIONS
When summarizing large masses of raw data, it is often useful to distribute the data into classes,
or categories, and to determine the number of individuals belonging to each class, called the class
frequency.

A tabular arrangement of data by classes together with the corresponding class frequencies is called a frequency distribution, or frequency table. Table 2.1 lists the weights of 100 male students in the MAT271E course (recorded to the nearest kilogram), and Table 2.2 is a frequency distribution of these weights.

Table 2.1 Weights of 100 Male Students at MAT271E Course


66 68 63 68 68 67 70 69 69 71 60 63 63 66 72
67 67 67 67 68 67 70 71 69 71 60 63 74 66 72
66 68 64 67 68 66 67 70 69 71 61 67 68 68 72
66 68 65 62 64 65 70 68 66 67 62 64 65 65 73
68 70 70 67 69 64 64 65 69 71 62 64 65 65 73
71 68 68 67 66 68 63 71 71 71 63 64 67 66 73
68 66 67 67 69 66 66 68 65 70 66 70 70 66 69
70 70 69 67 68 69 71 62 64 71 65 65 73 65 73
68 66 67 67 69 66 70 70 69 66

Table 2.2 Frequency distribution of weights of the students
Weight (kg) Number of Students
60–62 5
63–65 18
66–68 42
69–71 27
72–74 8
Total 100

The first class (or category), for example, consists of weights from 60 to 62 kg and is indicated by
the range symbol 60–62. Since five students have weights belonging to this class, the
corresponding class frequency is 5.
Data organized and summarized as in the above frequency distribution are often called grouped data. Although the grouping process generally destroys much of the original detail of the data, a clear "overall" picture of the data is gained.

HISTOGRAMS AND FREQUENCY POLYGONS


Histograms and frequency polygons are graphic representations of frequency distributions.
1. A histogram consists of a set of rectangles having (a) bases on a horizontal axis (the X axis), with centers at the class marks and lengths equal to the class interval sizes, and (b) areas proportional to the class frequencies.
If the class intervals all have equal size, the heights of the rectangles are proportional to the class
frequencies, and it is then customary to take the heights numerically equal to the class frequencies.
2. A frequency polygon is a line graph of the class frequency plotted against the class mark. It can
be obtained by connecting the midpoints of the tops of the rectangles in the histogram.
The histogram and frequency polygon corresponding to the frequency distribution of weights in Table 2.2 are shown on the same set of axes in Figure 2.1.
Figure 2.1. Histogram and frequency polygon of the data set

Class Intervals and Class Limits
A symbol defining a class, such as 60–62 in Table 2.2, is called a class interval. The end numbers,
60 and 62, are called class limits; the smaller number (60) is the lower class limit, and the larger
number (62) is the upper class limit. The terms class and class interval are often used
interchangeably, although the class interval is actually a symbol for the class.

FREQUENCY HISTOGRAMS
The relative frequency of a class is the frequency of the class divided by the total frequency of all classes and is generally expressed as a percentage; a histogram drawn with these relative frequencies is a relative-frequency histogram. For example, the relative frequency of the class 66–68 in Table 2.2 is 42/100 = 42%. The sum of the relative frequencies of all classes is clearly 1, or 100%.
Figure 2.2 Frequency histogram of the data set


If the frequencies in Table 2.2 are replaced with the corresponding relative frequencies, the resulting table is called a relative-frequency distribution, percentage distribution, or relative-frequency table.
Graphic representation of relative–frequency distributions can be obtained from the histogram or
frequency polygon simply by changing the vertical scale from frequency to relative frequency,
keeping exactly the same diagram. The resulting graphs are called relative–frequency histograms
(or percentage histograms) and relative–frequency polygons (or percentage polygons),
respectively.

CUMULATIVE–FREQUENCY DISTRIBUTIONS AND OGIVES


The total frequency of all values less than the upper class boundary of a given class interval is called the cumulative frequency up to and including that class interval. For example, the cumulative frequency up to and including the class interval 66–68 in Table 2.2 is 5 + 18 + 42 = 65, signifying that 65 students have weights less than 68.5 kg.
A table presenting such cumulative frequencies is called a cumulative-frequency distribution, cumulative-frequency table, or briefly a cumulative distribution, and is shown in Table 2.3 for the student weight distribution of Table 2.1.

Table 2.3 Cumulative-frequency distribution of the student weights

Weight (kg)          Number of Students
Less than 59.5 0
Less than 62.5 5
Less than 65.5 23
Less than 68.5 65
Less than 71.5 92
Less than 74.5 100

A graph showing the cumulative frequency less than any upper class boundary plotted against the
upper class boundary is called a cumulative–frequency polygon, or ogive, and is shown in Fig. 2–
2 for the student weight distribution of Table 2.1.

For some purposes, it is desirable to consider a cumulative–frequency distribution of all values


greater than or equal to the lower class boundary of each class interval. Because in this case we
consider weights of 59.5 kg or more, 62.5 kg or more, etc., this is sometimes called an "or more"
cumulative distribution, while the one considered above is a "less than" cumulative distribution.
One is easily obtained from the other. The corresponding ogives are then called "or more" and
"less than" ogives. Whenever we refer to cumulative distributions or ogives without qualification,
the "less than" type is implied.

GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS OF CONTINUOUS DATA
1. Determine the largest and smallest numbers in the raw data and thus find the range (the
difference between the largest and smallest numbers).
2. Divide the range into a convenient number of class intervals having the same size. You can find
the convenient number of class by using the formula

M=1+3.3 log N

where M is the number of classes and N is the number of data values. M is rounded to the nearest integer.

3. Divide the range by M to obtain the width of the class intervals. The width is usually rounded to a convenient number (e.g., 92 is rounded to 100, 22 to 25, etc.).
4. Start the lower class limit of the first interval at the lowest number in the data set, or a bit lower (e.g., if the lowest number is 23, start from 20).
5. Form each class by successively adding the class width obtained above. Make sure that the upper limit of the last interval includes the greatest number in the data set.
6. The number of classes should be between 5 and 20, depending on the data. (A short code sketch implementing these rules is given below.)
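The rules above translate directly into code. A sketch (Python, illustrative; the rounding choices are judgment calls, exactly as in the rules, so other reasonable implementations will differ slightly):

    from math import ceil, log10

    def frequency_table(data):
        n = len(data)
        m = round(1 + 3.3 * log10(n))       # rule 2: number of classes
        lo, hi = min(data), max(data)
        width = ceil((hi - lo) / m)         # rule 3: class width, rounded up
        counts = [0] * m
        for x in data:
            k = min(int((x - lo) // width), m - 1)
            counts[k] += 1                  # rule 5: tally each class
        return [(lo + k*width, lo + (k+1)*width, counts[k]) for k in range(m)]

    # For the rainfall data of Example 3.1 below, this gives M = 6 classes.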

EXAMPLE 3.1
The annual rainfall amounts (mm) recorded in a city are given in the following table.
700 315 450 615 625 625
420 645 635 895 500 565
650 585 665 555 410 715
365 455 535 545 575 645
550 735 615 675 835 595

The histogram of the data is shown in Fig. 3.1.


(Histogram: class frequencies 2, 4, 9, 10, 3, 2 for the classes 300–399, 400–499, 500–599, 600–699, 700–799, 800–899 mm.)

Fig. 3.1. Histogram of annual amount of rain


The frequency histogram of the data is shown in Fig. 3.2.

Fig. 3.2. Frequency of annual amount of rain

The cumulative frequency distribution of the data is shown in Fig. 3.3.

Fig. 3.3. The cumulative frequency histogram of the data

Fig. 3.3 shows the cumulative frequency diagram. It is seen that 50% of the observations are below 600 mm.
The appearance of the frequency histogram is affected by the number of class intervals. The use of too few classes causes too much loss of information, whereas too many class intervals may lead to irregular histograms, with very few observations (or maybe none) in some intervals. The choice of the number of class intervals is therefore important (Fig. 3.4).

Fig. 3.4. Effect of the number of class intervals on the frequency histogram: (a) small number of classes, (b) large number of classes

GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS OF DISCRETE DATA
Since discrete data sets include only integer values, it is generally not necessary to group the values into intervals; each value may be used directly as a class. If the discrete data are widely dispersed, they may be grouped as explained in the rules for continuous data.

EXAMPLE 3.2
Two dice are thrown 65 times, and the observed frequencies of their sums are given in Table 2.4.

Table 2.4. Observed frequencies of the sum of two dice
Sum    Number of observations
2 1
3 2
4 3
5 6
6 12
7 16
8 12
9 7
10 4
11 1
12 1

The sum of two dice is a discrete type of data, and its histogram can generally be drawn directly, without grouping (Figure 2.4).

Figure 2.4. Histogram of sum of two dice

TYPES OF FREQUENCY CURVES


Frequency curves arising in practice take on certain characteristic shapes, as shown in Fig. 2–3.
1. The symmetrical, or bell–shaped, frequency curves are characterized by the fact that
observations equidistant from the central maximum have the same frequency. An important
example is the normal curve.
2. In the moderately asymmetrical, or skewed, frequency curves the tail of the curve to one side
of the central maximum is longer than that to the other. If the longer tail occurs to the right, the
curve is said to be skewed to the right or to have positive skewness, while if the reverse is true, the
curve is said to be skewed to the left or to have negative skewness.
3. In a J–shaped or reverse J–shaped curve a maximum occurs at one end.
4. A U–shaped frequency curve has maxima at both ends.
5. A bimodal frequency curve has two maxima.

6. A multimodal frequency curve has more than two maxima.

Chapter 4

THE MEAN, MEDIAN, MODE, AND OTHER MEASURES OF CENTRAL TENDENCY
There are several kinds of averages, such as the mean, median, and mode, which are used to show the center of a data set.
If the arithmetic mean is computed from a sample drawn from a population, it is called the sample mean and denoted by X̄. If it is calculated for the whole population, it is called the population mean and denoted by μ or μₓ.

THE ARITHMETIC MEAN


For a data set, the arithmetic mean, also called the mathematical expectation or average, is the
central value of a discrete set of numbers: specifically, the sum of the values divided by the number
of values.
The arithmetic mean, or briefly the mean, of a set of N numbers X₁, X₂, X₃, ..., X_N is denoted by X̄ (read "X bar") and is defined as

    X̄ = (X₁ + X₂ + X₃ + ... + X_N) / N = (Σ_{j=1}^{N} X_j) / N = ΣX / N

EXAMPLE 4.4
The arithmetic mean of the numbers 2, 4, and 8 is

    X̄ = (2 + 4 + 8) / 3 = 4.67

If the numbers X₁, X₂, ..., X_K occur with frequencies f₁, f₂, ..., f_K, respectively, the arithmetic mean is

    X̄ = (f₁X₁ + f₂X₂ + ... + f_K X_K) / (f₁ + f₂ + ... + f_K) = Σ fX / Σ f = Σ fX / N

where N = Σf is the total frequency (i.e., the total number of cases).

EXAMPLE 4.5
If 5, 8, 6, and 2 occur with frequencies 3, 2, 4, and 1, respectively, the arithmetic mean is

    X̄ = [(3)(5) + (2)(8) + (4)(6) + (1)(2)] / (3 + 2 + 4 + 1) = (15 + 16 + 24 + 2) / 10 = 5.7

THE WEIGHTED ARITHMETIC MEAN
Sometimes we associate with the numbers X₁, X₂, ..., X_K certain weighting factors (or weights) w₁, w₂, ..., w_K, depending on the significance or importance attached to the numbers. In this case

    X̄ = (w₁X₁ + w₂X₂ + ... + w_K X_K) / (w₁ + w₂ + ... + w_K)

is called the weighted arithmetic mean. Note the similarity to the frequency formula above, which can be considered a weighted arithmetic mean with weights f₁, f₂, ..., f_K.

EXAMPLE 4.6
The midterm examination, final examination, and homework in a course are weighted 0.4, 0.4, and 0.2, respectively. A student has a midterm grade of 70, a final examination grade of 85, and a homework grade of 90. The mean grade of the student is

    X̄ = [(0.4)(70) + (0.4)(85) + (0.2)(90)] / (0.4 + 0.4 + 0.2) = 80

Properties of the Arithmetic Mean


1. The algebraic sum of the deviations of a set of numbers from their arithmetic mean is zero.

EXAMPLE 4.7
The deviations of the numbers 8, 3, 5, 12, and 10 from their arithmetic mean 7.6 are 8–7.6,
3–7.6, 5–7.6, 12–7.6, and 10–7.6, or 0.4, –4.6, –2.6, 4.4, and 2.4, with algebraic sum 0.4–
4.6–2.6 + 4.4 + 2.4 = 0.
2. The sum of the squares of the deviations of a set of numbers X_j from any number a is a minimum if and only if a = X̄.
3. If f₁ numbers have mean m₁, f₂ numbers have mean m₂, ..., f_K numbers have mean m_K, then the mean of all the numbers is

    X̄ = (f₁m₁ + f₂m₂ + ... + f_K m_K) / (f₁ + f₂ + ... + f_K)
that is, a weighted arithmetic mean of all the means

THE GEOMETRIC MEAN G


The geometric mean G of a set of N positive numbers X₁, X₂, X₃, ..., X_N is the Nth root of the product of the numbers:

    G = (X₁ X₂ X₃ ··· X_N)^(1/N)

EXAMPLE 4.8
The geometric mean of the numbers 2, 4, and 8 is

    G = ((2)(4)(8))^(1/3) = 64^(1/3) = 4

THE HARMONIC MEAN H
The harmonic mean H of a set of N numbers X1, X2, X3,..., XN is the reciprocal of the arithmetic
mean of the reciprocals of the numbers:
$$H = \frac{1}{\dfrac{1}{N} \sum_{j=1}^{N} \dfrac{1}{X_j}} = \frac{N}{\sum \dfrac{1}{X}}$$
In practice it may be easier to remember that
$$\frac{1}{H} = \frac{\sum \dfrac{1}{X}}{N} = \frac{1}{N} \sum \frac{1}{X}$$

EXAMPLE 4.9
The harmonic mean of the numbers 2, 4, and 8 is
$$H = \frac{3}{\dfrac{1}{2} + \dfrac{1}{4} + \dfrac{1}{8}} = \frac{3}{7/8} = \frac{24}{7} = 3.43$$

THE RELATION BETWEEN THE ARITHMETIC, GEOMETRIC, AND HARMONIC


MEANS
The geometric mean of a set of positive numbers X1, X2, ..., XN is less than or equal to their arithmetic mean but is greater than or equal to their harmonic mean. In symbols,
$$H \leq G \leq \bar{X}$$
The equality signs hold only if all the numbers X1, X2, ..., XN are identical.

The Pythagorean means are the three "classic" means A (the arithmetic mean), G (the geometric mean), and H (the harmonic mean). These means of two elements a and b can be constructed geometrically, and the construction also demonstrates that H ≤ G ≤ A.

EXAMPLE 4.10
The set 2, 4, 8 has arithmetic mean 4.67, geometric mean 4, and harmonic mean 3.43.
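These three values are easy to reproduce with a short Python sketch of our own, which also confirms the ordering H ≤ G ≤ A:

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        # Nth root of the product of the N values
        product = 1.0
        for x in xs:
            product *= x
        return product ** (1 / len(xs))

    def harmonic_mean(xs):
        # Reciprocal of the arithmetic mean of the reciprocals
        return len(xs) / sum(1 / x for x in xs)

    data = [2, 4, 8]
    print(arithmetic_mean(data))  # 4.67 (rounded)
    print(geometric_mean(data))   # 4.0
    print(harmonic_mean(data))    # 3.43 (rounded)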

THE MEDIAN
The median of a set of numbers arranged in order of magnitude (i.e., in an array) is either the
middle value or the arithmetic mean of the two middle values.

EXAMPLE 4.11
The set of numbers 3, 4, 4, 5, 6, 8, 8, 8, and 10 has median 6.

EXAMPLE 4.12
The set of numbers 5, 5, 7, 9, 11, 12, 15, and 18 has median ½(9 + 11) = 10.

Geometrically the median is the value of X (abscissa) corresponding to the vertical line which divides a histogram into two parts having equal areas. This value of X is sometimes denoted by $\tilde{X}$.

THE MODE
The mode of a set of numbers is that value which occurs with the greatest frequency; that is, it is
the most common value. The mode may not exist, and even if it does exist it may not be unique.

EXAMPLE 4.13
The set 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, and 18 has mode 9.

EXAMPLE 4.14
The set 3, 5, 8, 10, 12, 15, and 16 has no mode.

EXAMPLE 4.15
The set 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, and 9 has two modes, 4 and 7, and is called bimodal.
A distribution having only one mode is called unimodal.

THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE


The figures below show the relative positions of the mean, median, and mode for frequency curves which are symmetrical, skewed to the right, and skewed to the left.
For symmetrical curves, mean = median = mode (Fig. 4.2)
For curves skewed to the left, mean < median < mode (Fig. 4.3)
For curves skewed to the right, mean > median > mode (Fig. 4.4)
Fig. 4.2. Symmetrical frequency curve: the mean, median, and mode coincide.

Fig. 4.3. Frequency curve skewed to the left (mean < median < mode).

Fig. 4.4. Frequency curve skewed to the right (mean > median > mode).

THE ROOT MEAN SQUARE (RMS)


The root mean square (RMS), or quadratic mean, of a set of numbers X1, X2,..., XN is defined by

$$RMS = \sqrt{\frac{\sum_{j=1}^{N} X_j^2}{N}}$$
RMS is a statistical measure of the magnitude of a varying quantity. It is especially useful when
variates are positive and negative, e.g., sinusoids. RMS is used in various fields, including
electrical engineering; one of the more prominent uses of RMS is in the field of signal amplifiers.

Fig.4.5. Sinusoidal curve and heights of waves

EXAMPLE 4.16
The RMS of the set −1, 1.2, −1.1, 1, −0.9 and 0.9 is
$$RMS = \sqrt{\frac{(-1)^2 + (1.2)^2 + (-1.1)^2 + (1)^2 + (-0.9)^2 + (0.9)^2}{6}} = 1.02$$
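The same value follows from a direct Python computation (a minimal illustrative snippet):

    data = [-1, 1.2, -1.1, 1, -0.9, 0.9]
    rms = (sum(x ** 2 for x in data) / len(data)) ** 0.5
    print(round(rms, 2))  # 1.02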

QUANTILES

QUARTILES, DECILES, AND PERCENTILES


If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean of the two
middle values) that divides the set into two equal parts is the median. By extending this idea, we
can think of those values which divide the set into four equal parts.
These values, denoted by Q1, Q2, and Q3, are called the first, second, and third quartiles, respectively, the value Q2 being equal to the median.
Similarly, the values that divide the data into 10 equal parts are called deciles and are denoted by
D1, D2,..., D9, while the values dividing the data into 100 equal parts are called percentiles and are
denoted by P1, P2,..., P99.
The fifth decile and the 50th percentile correspond to the median. The 25th and 75th percentiles correspond to the first and third quartiles, respectively.
Collectively, quartiles, deciles, percentiles, and other values obtained by equal subdivisions of the
data are called quantiles.

Fig. 4.6. Quartiles

Fig. 4.7. Deciles

Fig. 4.8. Percentiles

ABOUT MEAN, MODE AND MEDIAN


That's why when you read an announcement by a corporation executive or a business proprietor
that the average pay of the people who work in his establishment is so much, the figure may
mean something and it may not. If the average is a median, you can learn something significant
from it: Half the employees make more than that; half make less. But if it is a mean (and believe
me it may be that if its nature is unspecified) you may be getting nothing more revealing than
the average of one $45,000 income—the proprietor's—and the salaries of a crew of underpaid
workers. "Average annual pay of $5,700" may conceal both the $2,000 salaries and the owner's
profits taken in the form of a whopping salary.
Let's take a longer look at that one. The facing page shows how many people get how much.
The boss might like to express the situation as "average wage $5,700"— using that deceptive
mean. The mode, however, is more revealing: most common rate of pay in this business is
$2,000 a year. As usual, the median tells more about the situation than any other single figure
does; half the people get more than $3,000 and half get less.

From "How to Lie with Statistics", Darrell Huff, 1954.

COMPARISON BETWEEN MEAN, MEDIAN AND MODE
1. Use of average:
The arithmetic mean is comparatively stable and more widely used than the median and mode. It is suitable for general purposes, unless there is a particular reason to select another type of average. As far as simplicity is concerned, the mode is the simplest of the three: it is the most usual or typical item, so it can be located by inspection. The median divides the curve into two equal parts and is simpler than the mean. In certain cases the median is as stable as the mean.
2. Algebraic manipulation:
The mean lends itself to algebraic manipulation. For example, we can calculate the aggregate when the number of items and the average of the series are given. The median and mode cannot be manipulated algebraically.
3. Extreme and abnormal items:
The presence of extreme or abnormal items can lead to misleading conclusions in the case of the mean. The mode and median, on the other hand, are not much influenced by abnormal items in the series. Statisticians therefore recommend the median or mode in such cases, because they are least influenced.
4. Qualitative expression:
The mean cannot be used when the data are qualitative, i.e., not capable of numerical expression. With the help of the median, however, we can compare qualities that can only be ranked, such as the intelligence or health of boys. Similarly, the mode is the average that proves useful for non-numerical data.
5. Presence of skewness:
For a symmetrical curve, the values of the mean, median and mode coincide. When skewness is present, the value of the mode changes little, while the median and mean shift toward the positive or negative side according to the direction of the skewness. The mean changes to a greater extent than the median, because it is affected by the position and value of every item.
6. Fluctuations of sampling:
The mean is least affected by fluctuations of sampling: if the number of items is large, the abnormalities on one side cancel those on the other. The median, which divides the curve into two equal parts, is more affected by fluctuations of sampling, and the mode is affected to an even greater extent than the median.
7. As a measure of dispersion:
Dispersion is a measure of variability within a group of data, and averages are used as the reference point from which deviations are measured. Since the sum of the deviations from the mean is zero and the sum of squared deviations about the mean is the minimum, the mean is the usual basis for measures of dispersion. The median is sometimes considered a better basis, because the sum of absolute deviations from the median is the least and the median is in wide practice. The mode is not very suitable as a basis for a measure of dispersion.

WHICH MEAN SHOULD BE USED?
ARITHMETIC MEAN
Although the arithmetic mean is very popular and widely used, it has some important disadvantages.
It is a measure of central location that is sensitive to extreme values (i.e., not robust). If a data series contains a single asymmetric extreme value, either very small or very large, the arithmetic mean is pulled toward that extreme.
The arithmetic mean cannot be used for numerical data of every measurement scale. For nominal-scale data it is meaningless. Its use for ordinal-scale data is highly debatable: since different persons' rankings cannot be assumed equivalent, many consider the sum of such data, and hence the mean derived from it, to be meaningless. Nevertheless, in business, the behavioral sciences and the social sciences, survey data in particular are ordinal, and their arithmetic means are still used in practice in important fields. For interval-scale and ratio-scale numerical data the arithmetic mean is meaningful.

GEOMETRIC MEAN
In statistical studies the geometric mean is used when the proportional (relative) differences between observations are more important than the absolute differences. In other words, if each observation changes relative to the previous one and the rate of this change is to be determined, the geometric mean gives sound results.
Computing the geometric mean requires the data values to be positive. If even a single data value is zero, the geometric mean becomes meaningless.

HARMONIC MEAN
The harmonic mean is generally used in economic problems, when the average quantity obtained per unit, or the average expenditure per unit of a product, is needed.

MEDIAN
If the distribution of the data is not symmetric but skewed, the median is the preferred measure of central location and is considered a more suitable measure than the arithmetic mean. Asymmetry arises when, among the ordered data values, either the smallest or the largest values lie much farther away from the rest. Such unexpectedly small or large values are called outliers. If the data contain outliers and their distribution is asymmetric, the median is preferred over the arithmetic mean as the measure of central location. In that case, in statistical terminology, the median is a more robust measure than the arithmetic mean.

MODE
The mode is used in situations where the most frequently observed value is meaningful. Although it reports a numerical value, the way it is obtained, as with the median, is not arithmetical. It can also be used for qualitative data: the most frequently observed qualitative or quantitative value is chosen as the mode.

Chapter 5

THE STANDARD DEVIATION AND OTHER


MEASURES OF DISPERSION
The degree to which numerical data tend to spread about an average value is called the dispersion,
or variation, of the data. Various measures of this dispersion (or variation) are available, the most
common being the range, mean deviation, semi–interquartile range, 10–90 percentile range, and
standard deviation.

THE VARIANCE
The variance of a set of data is the square of the standard deviation and is denoted by Varx or S²:
$$Var_x = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
When it is necessary to distinguish the standard deviation of a population from the standard
deviation of a sample drawn from this population, we often use the symbol S2 for the latter and σ2
(lowercase Greek sigma) for the former. Thus S2 and σ2 would represent the sample variance and
population variance, respectively.

THE STANDARD DEVIATION


Since the variance is expressed in squared units, it is necessary to take the square root to return to the original units. The square root of the variance is the standard deviation:
$$S_x = \sqrt{Var_x}$$
Sometimes the standard deviation of a sample's data is defined with (N–l) replacing N because the
resulting value represents a better estimate of the standard deviation of a population from which
the sample is taken. For large values of N (certainly N>30), there is practically no difference
between the two definitions.

EXAMPLE 5.1
The heights of the players (cm) of two different basketball teams A and B are given in the following table.
A B
210 190
210 190
160 190
170 180
180 180
X̄A = 186   X̄B = 186

The mean heights of the two teams are the same (186 cm). This does not mean that the teams are equivalent: there is a difference in the dispersion of the data within the teams. Team A has very dispersed player heights, while Team B has similar heights. In order to interpret the difference within the teams, the variance or the standard deviation must be calculated.
$$Var_x = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
VarA = 424.0 cm²
VarB = 24.0 cm²
$$S_x = \sqrt{Var_x}$$
SA = 20.6 cm
SB = 4.9 cm
Since SA is greater than SB, the dispersion within Team A is greater than that within Team B.
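The calculation can be checked with a small Python function (an illustrative sketch using the population form of the variance, with divisor N as in the formula above):

    def variance(xs):
        # Population variance: mean of squared deviations from the mean
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    team_a = [210, 210, 160, 170, 180]
    team_b = [190, 190, 190, 180, 180]
    print(variance(team_a), variance(team_a) ** 0.5)  # 424.0  20.6 (rounded)
    print(variance(team_b), variance(team_b) ** 0.5)  # 24.0   4.9 (rounded)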

COEFFICIENT OF VARIATION
The coefficient of variation shows the relative variation in a data set and is used for comparing the variation of two sets of data having different means:
$$Cv_x = \frac{S_x}{\bar{X}}$$

EXAMPLE 5.2
The mean and the standard deviation of X data set are given as 789 and 139 and of Y set
750 and 135, respectively. Compare the sets and find which set has higher variation.

SOLUTION
$$Cv_x = \frac{S_x}{\bar{X}} = \frac{139}{789} = 0.176$$

$$Cv_y = \frac{S_y}{\bar{Y}} = \frac{135}{750} = 0.180$$

Since Cvy > Cvx, set Y has higher variation than set X.

SKEWNESS
Skewness is the degree of asymmetry, or departure from symmetry, of a distribution. The normal distribution gives a symmetrical bell shape (Fig. 5.1). For skewed distributions, the mean tends to lie on the same side of the mode as the longer tail (see Figs. 5.2 and 5.3). The measure of skewness is

$$Cs_x = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3 / N}{S_x^3}$$
If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right
of the central maximum than to the left, the distribution is said to be skewed to the right, or to have
positive skewness. If the reverse is true, it is said to be skewed to the left, or to have negative
skewness.

Figure 5.1. Symmetrical distribution (Csx = 0)

Figure 5.2. Positively skewed distribution (Csx > 0)

Figure 5.3. Negatively skewed distribution (Csx < 0)

KURTOSIS
Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal
distribution. The normal distribution shown in Figure 5.4, which is not very peaked or very flat–
topped, is called mesokurtic. A distribution having a relatively high peak, such as the curve of
Figure 5.5 is called leptokurtic, while the curve of Figure 5.6, which is flat–topped, is called
platykurtic.
The kurtosis is defined by
$$k_x = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4 / N}{S_x^4} - 3$$
which is positive for a leptokurtic distribution, negative for a platykurtic distribution, and zero for
the normal distribution.
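Both coefficients follow directly from their definitions; the Python sketch below (our own, population form with divisor N) returns Csx and kx for a data set:

    def skew_kurtosis(xs):
        n = len(xs)
        m = sum(xs) / n
        s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
        cs = (sum((x - m) ** 3 for x in xs) / n) / s ** 3     # skewness Csx
        k = (sum((x - m) ** 4 for x in xs) / n) / s ** 4 - 3  # kurtosis kx
        return cs, k

    # A symmetric data set has Csx = 0
    print(skew_kurtosis([2, 3, 3, 4, 4, 4, 5, 5, 6]))  # (0.0, -0.75)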

Figure 5.4. Mesokurtic distribution (kx = 0)

Figure 5.5. Leptokurtic distribution (kx > 0)

Figure 5.6. Platykurtic distribution (kx < 0)

Chapter 6

PROBABILITY DISTRIBUTION FUNCTIONS


INTRODUCTION
It has been observed that certain functions F(x) and f(x) can successfully express the distributions
of many random variables. In engineering practice it is frequently attempted to adopt one of these
functions whose analytical forms are known and values are tabulated because of the facility of
their use.
There are a lot of distribution functions that express the distributions of different types of data. It is important to choose the most appropriate one when analyzing your data, but this is not easy. Although there are no general rules for selecting the best distribution function, the engineer has to make a choice based on experience and knowledge of the properties of the commonly used distributions. Comparing the histogram of the observed data with the chosen probability density function generally helps in the decision. If possible, the mechanism that gives rise to the chosen function should also be considered.
In this chapter some probability distribution functions that are commonly used in engineering
applications will be introduced.

NORMAL DISTRIBUTION
A large number of random variables encountered in practical applications fit the normal (Gaussian) distribution with the following probability density function:
$$f(x) = \frac{1}{\sigma_x \sqrt{2\pi}} \exp\left[ -\frac{(x - \mu_x)^2}{2\sigma_x^2} \right], \qquad -\infty < x < \infty$$
This distribution is denoted briefly as N(μ, σ²). It has two parameters: μx, the mean of the random variable, and σx, its standard deviation. The normal distribution is symmetrical (Cs = 0) with a kurtosis coefficient equal to 0 (k = 0).
It is not easy to integrate the analytical form of the probability distribution function F(x) of the normal distribution. Instead, its tabulated form (Z-table) is used, together with the following standardization:
$$Z = \frac{X - \mu_x}{\sigma_x}$$
where the standard normal variable Z has the mean 0 and standard deviation 1. The distribution
N(0,1) of the variable Z is called the standard normal distribution. Its probability distribution
function is given in Z–Table.
Since the normal distribution is symmetrical, this table is prepared for the positive values of Z only. The probabilities F1(z) of Z exceeding a certain positive value z are given. For positive z, we can compute the probability of nonexceedance as F(z) = 1 − F1(z), and for negative z we have F(z) = F1(|z|), because of the symmetry around the mean 0.

Figure 6.1. Normal distribution and the probabilities of the normal variable remaining in the
intervals of certain lengths around the mean

The probability density function of the normal distribution is bell-shaped around the mean μx. The mode and median are equal to the mean because of the symmetry. The probabilities of the normal variable remaining in intervals around the mean of width one, two and three standard deviations (on each side) are 0.683, 0.955 and 0.997 (nearly 1), respectively.
The probability paper of the normal distribution facilitates applications for normal variables. The ordinate axis of this paper is scaled such that the cumulative distribution function of the normal distribution appears as a straight line (Figure 6.2). Since the distribution is symmetrical, the value (the median) corresponding to F = 0.50 gives the mean.

Figure 6.2. Normal probability paper

Several variables of the natural and social sciences are found to be normally distributed. This can be explained by the central limit theorem. This theorem states that the distribution of the variable
$$X = \sum_{i=1}^{n} c_i X_i$$
where the Xi are independent random variables, approaches the normal distribution as n increases, whatever the distributions of the variables Xi may be. The approach is rather fast, such that the normal distribution can be assumed for n ≥ 10. Thus, if a random variable is affected by a large number of independent variables whose effects are additive, it can be assumed to be normally distributed.
A difficulty in using the normal distribution for physical (engineering) variables is the following. Such variables can usually take only positive values, whereas a normal variable may vary in the range (−∞, +∞). However, the probability of a normal variable assuming values outside the interval (μx − 3σx, μx + 3σx) is negligibly small; therefore, if the mean is much larger than the standard deviation, the probability of the variable assuming a negative value practically vanishes.
A useful property of the normal distribution is that the sum (or the difference) of normally distributed variables also follows the normal distribution: if X and Y are two independent normal variables, then X + Y and X − Y are normally distributed as well.

Testing whether the data are normally distributed:
1. Sketch the cumulative frequency distribution of the data on normal probability paper. If the plot is nearly a straight line, we can decide that the data follow the normal distribution.
2. Calculate the skewness coefficient (Cs) and kurtosis coefficient (kx) of the data. If Cs is equal to 0 (±0.05) and kx = 0 (±0.10), we can say that the data are normally distributed.
3. Calculate the mean, median and mode of the data. If they are equal or quite similar to each other, the distribution is normal.
Many variables encountered in engineering problems can be assumed to be normal. Random measurement errors occur due to the additive effects of several factors; therefore they are expected to be normally distributed, with a standard deviation called the standard error. Certain properties of building materials are also normally distributed.
In some cases where there is no reason to expect the normal distribution, the assumption is still made because of its ease of use. However, the normal distribution is certainly not valid in every case, because the variable may be skewed. Most hydrologic variables (such as the discharge in a stream or the precipitation depth at a location) are not symmetrically distributed. For such variables, distributions other than the normal must be used.

EXAMPLE 6.1
Annual flow of a river (m³/s) is assumed to be normally distributed with μx = 60 and σx = 8. What is the probability that the flow remains
a) less than 55,
b) more than 73,
c) in the range of 55 to 73 m³/s?

SOLUTION
The values of the standard normal variable corresponding to 55 and 73, respectively, are computed
by:
$$Z = \frac{X - \mu_x}{\sigma_x}$$

a)

$$Z = \frac{55 - 60}{8} = -0.63$$
F1(−0.63) = F1(0.63) = 0.2643
Then the probability of the flow being less than 55 is 26.43%.

b)

$$Z = \frac{73 - 60}{8} = 1.63$$
F1(1.63) = 0.0516
Then the probability of the flow being more than 73 is 5.16%.

c)
P(55 < X < 73) = P(−0.63 < Z < 1.63)
Since the total area under the normal distribution curve is equal to one, the probability between 55 and 73 (or z = −0.63 and z = 1.63) is
1 − (0.2643 + 0.0516) = 0.6841, i.e., 68.41%.
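These table lookups can be cross-checked numerically. The standard normal distribution function can be written with Python's math.erf, so no table is needed (an illustrative sketch; the small differences from the table values come from rounding z to two decimals):

    from math import erf, sqrt

    def normal_cdf(x, mu, sigma):
        # P(X <= x) for X ~ N(mu, sigma^2)
        z = (x - mu) / sigma
        return 0.5 * (1 + erf(z / sqrt(2)))

    p_a = normal_cdf(55, 60, 8)        # P(X < 55)       ~ 0.266
    p_b = 1 - normal_cdf(73, 60, 8)    # P(X > 73)       ~ 0.052
    p_c = 1 - p_a - p_b                # P(55 < X < 73)  ~ 0.682
    print(p_a, p_b, p_c)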

LOGNORMAL DISTRIBUTION
It is often attempted to transform a nonnormal random variable to a normal variable because the
normal distribution has well known properties and is easy to use. The most commonly used
transformation is the logarithmic transformation.
If the transformed variable
Y = ln X
fits the normal distribution, then the distribution of the original variable X is called lognormal:
$$f(x) = \frac{1}{x\,\sigma_Y \sqrt{2\pi}} \exp\left[ -\frac{(\ln x - \mu_Y)^2}{2\sigma_Y^2} \right], \qquad x > 0$$
The lognormal distribution is positively skewed.

Fig 6.3. Lognormal distribution
This distribution is defined only for positive values of the variable X. The parameters μY and σY are related to μx and σx, the parameters of X, by the following equations:
$$Z = \frac{Y - \mu_Y}{\sigma_Y}, \qquad Y = \ln X$$
$$\mu_Y = \ln\left( \mu_x \Big/ \sqrt{\frac{\sigma_x^2}{\mu_x^2} + 1} \right), \qquad \sigma_Y = \sqrt{\ln\left( \frac{\sigma_x^2}{\mu_x^2} + 1 \right)}$$

Since the logarithm of a product is the sum of the logarithms of the multipliers, it may be expected
by the central limit theorem that if a random variable arises by the multiplication of the effects of
several independent variables, then its distribution will approach the lognormal.
The fact that a lognormal random variable can take only positive values facilitates the fitting of this distribution to physical variables.
In civil engineering applications, the lognormal distribution has been used for hydrologic variables, in problems related to fatigue, and in earthquake engineering. It has also been widely used in the mining and extraction industries.
The z–table prepared for the normal distribution can be used for the lognormal distribution as well.
The parameters μy and σy can be estimated in two ways. Either they are computed from the
logarithms of the observations of the X variable, or they are estimated from equations above using
the computed values of μx and σx. The second approach preserves the parameters of the original
variable.
The probability paper of the normal distribution can be used for the lognormal distribution when
the abscissa axis is logarithmically scaled.
A property of the lognormal distribution is that the product of lognormal variables is lognormally
distributed.

EXAMPLE 6.2
Solve the problem of previous example assuming that X is lognormally distributed.

SOLUTION
The parameters of the transformed variable Y = ln X are
$$\mu_Y = \ln\left( 60 \Big/ \sqrt{\frac{8^2}{60^2} + 1} \right) = 4.086$$
$$\sigma_Y = \sqrt{\ln\left( \frac{8^2}{60^2} + 1 \right)} = 0.132$$

a) The probability that the flow is less than 55:

Y = ln 55 = 4.007
$$Z_1 = \frac{4.007 - 4.086}{0.132} = -0.60$$
From the Z-table,
P(Y < 4.007) = F(−0.60) = 0.2743 → 27.43%

b) The probability that the flow is more than 73:

Y = ln 73 = 4.290
$$Z_2 = \frac{4.290 - 4.086}{0.132} = 1.55$$
From the Z-table,
P(Y > 4.290) = F1(1.55) = 0.0606 → 6.06%

c) The probability of remaining between 55 and 73:

1 − (0.2743 + 0.0606) = 0.6651 → 66.51%

This value is somewhat smaller than the 68.41% obtained under the assumption of a normal distribution.
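The parameter conversion is mechanical and easy to script. The Python sketch below (ours) uses the identity ln(μx/√(σx²/μx² + 1)) = ln μx − ½ ln(σx²/μx² + 1):

    from math import log, sqrt

    def lognormal_params(mu_x, sigma_x):
        # Parameters of Y = ln X matched to the mean and standard deviation of X
        r = (sigma_x / mu_x) ** 2
        sigma_y = sqrt(log(r + 1))
        mu_y = log(mu_x) - 0.5 * log(r + 1)
        return mu_y, sigma_y

    print(lognormal_params(60, 8))  # approximately (4.086, 0.132)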

Table 6.1. Z-Table (Normal Distribution)
z 0 0,01 0,02 0,03 0,04 0,05 0,06 0,07 0,08 0,09
0,0 0,5000 0,4960 0,4920 0,4880 0,4840 0,4801 0,4761 0,4721 0,4681 0,4641
0,1 0,4602 0,4562 0,4522 0,4483 0,4443 0,4404 0,4364 0,4325 0,4286 0,4247
0,2 0,4207 0,4168 0,4129 0,4090 0,4052 0,4013 0,3974 0,3936 0,3897 0,3859
0,3 0,3821 0,3783 0,3745 0,3707 0,3669 0,3632 0,3594 0,3557 0,3520 0,3483
0,4 0,3446 0,3409 0,3372 0,3336 0,3300 0,3264 0,3228 0,3192 0,3156 0,3121
0,5 0,3085 0,3050 0,3015 0,2981 0,2946 0,2912 0,2877 0,2843 0,2810 0,2776
0,6 0,2743 0,2709 0,2676 0,2643 0,2611 0,2578 0,2546 0,2514 0,2483 0,2451
0,7 0,2420 0,2389 0,2358 0,2327 0,2296 0,2266 0,2236 0,2206 0,2177 0,2148
0,8 0,2119 0,2090 0,2061 0,2033 0,2005 0,1977 0,1949 0,1922 0,1894 0,1867
0,9 0,1841 0,1814 0,1788 0,1762 0,1736 0,1711 0,1685 0,1660 0,1635 0,1611
1,0 0,1587 0,1562 0,1539 0,1515 0,1492 0,1469 0,1446 0,1423 0,1401 0,1379
1,1 0,1357 0,1335 0,1314 0,1292 0,1271 0,1251 0,1230 0,1210 0,1190 0,1170
1,2 0,1151 0,1131 0,1112 0,1093 0,1075 0,1056 0,1038 0,1020 0,1003 0,0985
1,3 0,0968 0,0951 0,0934 0,0918 0,0901 0,0885 0,0869 0,0853 0,0838 0,0823
1,4 0,0808 0,0793 0,0778 0,0764 0,0749 0,0735 0,0721 0,0708 0,0694 0,0681
1,5 0,0668 0,0655 0,0643 0,0630 0,0618 0,0606 0,0594 0,0582 0,0571 0,0559
1,6 0,0548 0,0537 0,0526 0,0516 0,0505 0,0495 0,0485 0,0475 0,0465 0,0455
1,7 0,0446 0,0436 0,0427 0,0418 0,0409 0,0401 0,0392 0,0384 0,0375 0,0367
1,8 0,0359 0,0351 0,0344 0,0336 0,0329 0,0322 0,0314 0,0307 0,0301 0,0294
1,9 0,0287 0,0281 0,0274 0,0268 0,0262 0,0256 0,0250 0,0244 0,0239 0,0233
2,0 0,0228 0,0222 0,0217 0,0212 0,0207 0,0202 0,0197 0,0192 0,0188 0,0183
2,1 0,0179 0,0174 0,0170 0,0166 0,0162 0,0158 0,0154 0,0150 0,0146 0,0143
2,2 0,0139 0,0136 0,0132 0,0129 0,0125 0,0122 0,0119 0,0116 0,0113 0,0110
2,3 0,0107 0,0104 0,0102 0,0099 0,0096 0,0094 0,0091 0,0089 0,0087 0,0084
2,4 0,0082 0,0080 0,0078 0,0075 0,0073 0,0071 0,0069 0,0068 0,0066 0,0064
2,5 0,0062 0,0060 0,0059 0,0057 0,0055 0,0054 0,0052 0,0051 0,0049 0,0048
2,6 0,0047 0,0045 0,0044 0,0043 0,0041 0,0040 0,0039 0,0038 0,0037 0,0036
2,7 0,0035 0,0034 0,0033 0,0032 0,0031 0,0030 0,0029 0,0028 0,0027 0,0026
2,8 0,0026 0,0025 0,0024 0,0023 0,0023 0,0022 0,0021 0,0021 0,0020 0,0019
2,9 0,0019 0,0018 0,0018 0,0017 0,0016 0,0016 0,0015 0,0015 0,0014 0,0014
3,0 0,0013 0,0013 0,0013 0,0012 0,0012 0,0011 0,0011 0,0011 0,0010 0,0010
3,1 0.0010 0.0009 0.0009 0.0009 0,0008 0,0008 0,0008 0,0008 0,0007 0,0007
3,2 0,0007 0,0007 0,0006 0,0006 0,0006 0,0006 0,0006 0,0005 0,0005 0,0005
3,3 0,0005 0,0005 0,0005 0,0004 0,0004 0,0004 0,0004 0,0004 0,0004 0,0003
3,4 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0002
3,5 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002
3,6 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002
3,7 0,0002 0,0002 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001
3,8 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001
3,9 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000

NORMAL DISTRIBUTION PAPER

LOGNORMAL DISTRIBUTION PAPER

GAMMA DISTRIBUTION

EXAMPLE 6.3

Chapter 7

SAMPLING DISTRIBUTIONS
THE CONCEPT OF SAMPLING DISTRIBUTION
The real value of any parameter β of a random variable can never be determined, since it is not possible to observe the whole population of this random variable. We can only calculate the value of a statistic b, which is an estimate of this parameter obtained from a sample. (Generally, Greek letters are used for parameters and the corresponding Latin letters for their statistics.) The statistic b is not equal to the parameter β; it is the best estimate of β that can be obtained from the sample at hand. If we had various samples drawn from the same population, the statistics bi corresponding to the parameter β computed from these samples would not be equal. For instance, if the statistic Sx corresponding to the standard deviation parameter σx is calculated from various samples, different values are obtained.
The values of a statistic calculated from different samples have a distribution since we can treat
any statistic as a random variable. The probability distribution of the values of any statistic to be
calculated from various samples of same size is called the sampling distribution of this statistic
(Fig. 7.1).

Population → Sample 1 (size n) → b1
Population → Sample 2 (size n) → b2
Population → Sample 3 (size n) → b3
…
Population → Sample N (size n) → bN

Fig. 7.1. The values of b statistic estimated from various samples of same size n (b1, b2, …..,bN)
corresponding to parameter β

Knowing the sampling distribution of a statistic is important for the following reason. As mentioned above, the statistic determined from the sample at hand is not equal to the real value of the population parameter. Without observing the whole population it is not possible to determine the value of the parameter with absolute correctness. However, we can determine the interval in which the unknown parameter value will remain, around the calculated statistic, with a given probability. For this purpose, the sampling distribution of that statistic must be known.

Let us sketch the sampling distribution f(b) of the statistic b, corresponding to the parameter β, calculated from samples of size N (Fig. 7.2). The expected value of this distribution is taken as the value b0 calculated from the sample at hand. To determine the interval (b1, b2) in which the unknown population parameter β will remain with a given probability Pc, a symmetrical interval (b1, b2) is chosen around b0 such that the percentage of the sampling distribution within this interval is Pc (Fig. 7.2). Here Pc is called the confidence level, and the interval (b1, b2) is the confidence interval at this level. Values such as 0.90, 0.95 and 0.99 are used for Pc in practice. The confidence interval widens as Pc increases, since the probability that β remains within a wider interval is higher.
The properties of the sampling distribution depend on the distribution of the random variable of the population, the parameter under consideration and the size of the sample. As the number of elements N of the sample increases, the confidence interval corresponding to a certain confidence level gets narrower (Fig. 7.3). In other words, the confidence interval within which the parameter will remain with a certain probability is smaller for large samples, expressing that the error in parameter estimation is reduced.

Fig. 7.2. Determination of the confidence interval (b1, b2) in the sampling distribution f(b) of the
statistic b at a confidence level Pc

Fig. 7.3. Narrowing of the confidence interval with the increase of N, the number of elements in
the sample
From the above explanations one might think that the parameter β is considered a random variable which remains within a certain interval with a certain probability. As a matter of fact, the parameter β is not a random variable but an unknown constant. Therefore, it is more correct to interpret the concepts of confidence level and confidence interval as follows: a ratio Pc of the intervals (b1, b2), determined at this confidence level from numerous samples of the same size drawn from the same population, will contain the parameter β. In other words, we may rely at level Pc that the unknown value of β is in the confidence interval (b1, b2); that is, the random interval (b1, b2) contains β with probability Pc.

(An important point to note is that the boundaries of the confidence interval will change from sample to sample, since the assumption E(b) = b0 is made for each sample. Therefore this is a random interval, and Pc is the ratio of such intervals that contain β. However, since there is only one sample at hand in practice, the confidence interval determined from this sample is used.)
In some problems it might be more meaningful to choose the confidence interval only to the left (or to the right) of b0 (a one-sided confidence interval) instead of choosing it symmetrically on both sides of b0 (a two-sided confidence interval). For example, when the resistance of a material or the capacity of a channel is in question, it is more meaningful to use a one-sided confidence interval to the left of b0, thus determining the lowest value for the parameter considered at the chosen confidence level. To find the lower bound of the one-sided interval, the value b is calculated from the sampling distribution such that the probability of remaining smaller than b is 1 − Pc.
On the other hand, if the wind load acting on a structure or the flood of a river is in question, the one-sided confidence interval to the right of b0 can be used. In this case, to determine the upper bound of the confidence interval, the value b whose exceedance probability is 1 − Pc is calculated from the sampling distribution. It can then be said that the parameter under consideration will not exceed this value at a confidence level of Pc.
Sampling distributions can be determined theoretically only for some statistics. This is generally possible for large samples; sampling distributions which are valid when the number of elements in the sample approaches infinity (N → ∞) are called asymptotic distributions. Asymptotic distributions can be used only approximately for small samples (N < 30); as N decreases, the errors increase rapidly. Exact distributions, which are also valid for small samples, have been obtained theoretically only for some special statistics when the random variable is normally distributed. In other cases sampling distributions can only be found experimentally. To do this, numerous samples of the required size of the random variable fitting the given distribution are generated by means of a computer, and the value of the required statistic is calculated from each sample. The frequency distribution of these values is determined, and this frequency distribution can approximately be taken as the sampling distribution sought.

SAMPLING DISTRIBUTIONS

1. Normal Distribution For Samples


$$Z = \frac{X - \bar{X}}{S_x / \sqrt{N}}, \qquad b_1 = \bar{X} - Z\,S_x/\sqrt{N}, \qquad b_2 = \bar{X} + Z\,S_x/\sqrt{N}$$

EXAMPLE 7.1
Total dissolved solids (TDS) of a river is a random variable with a mean value of X̄ = 80 mg/l and a standard deviation of Sx = 16 mg/l. What is the probability that the annual mean TDS value, calculated from samples taken over 36 months, exceeds 90 mg/l? (Distribution is normal)

SOLUTION
The mean and standard deviation of the annual mean TDS calculated from a sample of N = 36 are 80 mg/l and 16/√36 mg/l, respectively. The value of the standard normal variable corresponding to 90 mg/l under the normal distribution assumption is:
$$Z = \frac{90 - 80}{16/\sqrt{36}} = 3.75$$
From the normal distribution table (Z-table) the exceedance probability of this value is F1(Z) = 0.0001. Thus the probability that the annual mean TDS calculated from samples recorded during 36 months is greater than 90 mg/l is 0.01%.

EXAMPLE 7.2
The mean and standard deviation of the failure load experiments performed on 36 steel
beams are found as 8490 kg and 300 kg, respectively. Find the limits of confidence interval
of the mean at Pc=95%. (Distribution is normal)

SOLUTION
Since the sampling distribution is normal, the value of the standard normal variable whose exceedance probability is (1 − Pc)/2 = (1 − 0.95)/2 = 0.025 can be read from the normal distribution table (Table 6.1) as Z0.025 = 1.96.
$$b_1 = \bar{X} - Z\,S_x/\sqrt{N} = 8490 - 1.96\,(300/\sqrt{36}) = 8392$$
$$b_2 = \bar{X} + Z\,S_x/\sqrt{N} = 8490 + 1.96\,(300/\sqrt{36}) = 8588$$
The limits of the confidence interval are shown in Fig. 7.4.

Fig 7.4. Confidence interval of the mean failure load in previous example at 95% confidence level

It can be stated that μx (the population mean), whose value is unknown, remains within the interval (8392, 8588) with a probability of 95%. The result is approximate, since the asymptotic (normal) sampling distribution is used although the sample is only moderately large.
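The same interval can be computed in a few lines of Python (an illustrative sketch of the two-sided formula above):

    from math import sqrt

    mean, s, n = 8490, 300, 36
    z = 1.96                          # two-sided 95% critical value
    half = z * s / sqrt(n)
    print(mean - half, mean + half)   # 8392.0 8588.0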

2. t (Student) Distribution
Distributions that are valid for small samples can be determined analytically only for some special statistics, and only when the random variable is normally distributed. The t distribution was developed especially for samples with a small number of data (N < 30):
$$t = \frac{X - \bar{X}}{S_x / \sqrt{N}}$$

The distribution of the t statistic is the t distribution with d.f. = N − 1 degrees of freedom. The t distribution table (t-Table) gives the probability that the variable t is greater than a selected value t0, P(t > t0). The t distribution is symmetrical like the normal distribution. Its mean value is 0; its variance is greater than that of the standard normal distribution, since it equals ν/(ν − 2) > 1, where ν is the number of degrees of freedom. For large N the variance approaches 1 and the t distribution approaches the standard normal distribution.

Fig. 7.5 t-distribution (r: number of samples)

The limits of the interval for the t distribution are calculated with the following formulas:
$$b_1 = \bar{X} - t\,S_x/\sqrt{N}, \qquad b_2 = \bar{X} + t\,S_x/\sqrt{N}$$

EXAMPLE 7.3
Total dissolved solids (TDS) of a river is a random variable with a mean value of X=80
mg/l and a standard deviation of Sx=16 mg/l. What is the probability of the annual mean
TDS value, calculated from samples taken during 12 months of the year from a river,
exceeding 90 mg/1? (Distribution is symmetrical)

SOLUTION
Since the sample is small (N < 30), the t distribution is used. The value of the variable t corresponding to 90 mg/l is
$$t = \frac{90 - 80}{16/\sqrt{12}} = 2.16$$
From the t-Table (Table 7.1) the exceedance probability of this t value is read as approximately 0.025 for d.f. = N − 1 = 11. The probability that the annual mean TDS, evaluated from records of 12 months, exceeds 90 mg/l is therefore about 0.025 (this value is greater than the 0.015 calculated with the assumption of an asymptotic normal distribution).

EXAMPLE 7.4
In Example 7.2, if the sample size is 25, it is more suitable to use the exact t distribution instead of the normal distribution (mean = 8490 kg, standard deviation = 300 kg, distribution symmetrical). Find the limits of the confidence interval of the mean at Pc = 95%.

SOLUTION
From the t-Table (Table 7.1), the value of t whose exceedance probability is 0.025 is read as t0.025 = 2.064 for d.f. = N − 1 = 24. Thus the limits of the confidence interval of the mean at the 95% confidence level are:
$$b_1 = \bar{X} - t\,S_x/\sqrt{N} = 8490 - 2.064\,(300)/\sqrt{25} = 8366.2$$
$$b_2 = \bar{X} + t\,S_x/\sqrt{N} = 8490 + 2.064\,(300)/\sqrt{25} = 8613.8$$
Thus it can be said that the unknown population parameter μx will remain within the interval (8366, 8614) with a probability of 95%. Since the sample is small, the confidence interval widens when the t distribution is used.
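If SciPy is available, the critical value can be obtained programmatically instead of from Table 7.1 (a sketch that assumes the scipy package is installed):

    from scipy.stats import t

    n, mean, s = 25, 8490, 300
    t_crit = t.ppf(1 - 0.025, df=n - 1)   # ~ 2.064 for d.f. = 24
    half = t_crit * s / n ** 0.5
    print(mean - half, mean + half)       # ~ 8366.2  8613.8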

Table 7.1. Student’s t- Distribution Table

P 0.45 0.40 0.35 0.20 0.15 0.10 0.05 0.025 0.01 0.005
(d.f)
1 0.158 0.325 0.510 1.376 1.963 3.078 6.314 12.706 31.821 63.657
2 0.142 0.289 0.445 1.061 1.386 1.886 2.920 4.303 6.965 9.925
3 0.137 0.277 0.424 0.978 1.250 1.638 2.353 3.182 4.541 5.841
4 0.134 0.271 0.414 0.941 1.190 1.533 2.132 2.776 3.747 4.604
5 0.132 0.267 0.408 0.920 1.156 1.476 2.015 2.571 3.365 4.032
6 0.131 0.265 0.404 0.906 1.134 1.440 1.943 2.447 3.143 3.707
7 0.130 0.263 0.402 0.896 1.119 1.415 1.895 2.365 2.998 3.499
8 0.130 0.262 0.399 0.889 1.108 1.397 1.860 2.306 2.896 3.355
9 0.129 0.261 0.398 0.883 1.100 1.383 1.833 2.262 2.821 3.250
10 0.129 0.260 0.397 0.879 1.093 1.372 1.812 2.228 2.764 3.169
11 0.129 0.260 0.396 0.876 1.088 1.363 1.796 2.201 2.718 3.106
12 0.128 0.259 0.395 0.873 1.083 1.356 1.782 2.179 2.681 3.055
13 0.128 0.259 0.394 0.870 1.079 1.350 1.771 2.160 2.650 3.012
14 0.128 0.258 0.393 0.868 1.076 1.345 1.761 2.145 2.624 2.977
15 0.128 0.258 0.393 0.866 1.074 1.341 1.753 2.131 2.602 2.947
16 0.128 0.258 0.392 0.865 1.071 1.337 1.746 2.120 2.583 2.921
17 0.128 0.257 0.392 0.863 1.069 1.333 1.740 2.110 2.567 2.898
18 0.127 0.257 0.392 0.862 1.067 1.330 1.734 2.101 2.552 2.878
19 0.127 0.257 0.391 0.861 1.066 1.328 1.729 2.093 2.539 2.861
20 0.127 0.257 0.391 0.860 1.064 1.325 1.725 2.086 2.528 2.845
21 0.127 0.257 0.391 0.859 1.063 1.323 1.721 2.080 2.518 2.831
22 0.127 0.256 0.390 0.858 1.061 1.321 1.717 2.074 2.508 2.819
23 0.127 0.256 0.390 0.858 1.060 1.319 1.714 2.069 2.500 2.807
24 0.127 0.256 0.390 0.857 1.059 1.318 1.711 2.064 2.492 2.797
25 0.127 0.256 0.390 0.856 1.058 1.316 1.708 2.060 2.485 2.787
26 0.127 0.256 0.390 0.856 1.058 1.315 1.706 2.056 2.479 2.779
27 0.127 0.256 0.389 0.855 1.057 1.314 1.703 2.052 2.473 2.771
28 0.127 0.256 0.389 0.855 1.056 1.313 1.701 2.048 2.467 2.763
29 0.127 0.256 0.389 0.854 1.055 1.311 1.699 2.045 2.462 2.756
30 0.127 0.256 0.389 0.854 1.055 1.310 1.697 2.042 2.457 2.750
40 0.126 0.255 0.388 0.851 1.050 1.303 1.684 2.021 2.423 2.704
60 0.126 0.254 0.387 0.848 1.046 1.296 1.671 2.000 2.390 2.660
120 0.126 0.254 0.386 0.845 1.041 1.289 1.658 1.980 2.358 2.617
∞ 0.126 0.253 0.385 0.842 1.036 1.282 1.645 1.960 2.326 2.576

3. χ² (Chi-Square) Distribution
The sampling distribution of the standard deviation can be determined through the χ² (chi-square) statistic defined below:
$$\chi^2 = \frac{N\,S_x^2}{\sigma_x^2}$$
The distribution of this statistic is the χ² distribution with d.f. = N − 1 degrees of freedom.

The χ² table (Table 7.2) gives the probability that the χ² statistic exceeds a selected value χ0², P(χ² > χ0²). The χ² distribution is a special case of the two-parameter gamma distribution with α = n/2, β = 2. For large N this distribution approaches the normal distribution with mean N and variance 2N.
The limits of the interval are calculated with the following formula:
$$b_{1,2} = \frac{1}{\chi^2}\,N\,S_x^2$$

EXAMPLE 7.5
In Example 7.2, if the distribution is positively skewed and the sample size is 27, find the limits of the confidence interval of the variance for Pc = 90% by the χ² distribution. (Mean = 8490 kg, standard deviation = 300 kg)

SOLUTION
From the χ² table (Table 7.2) the values with exceedance probabilities (1 − Pc)/2 = 0.05 and 1 − 0.05 = 0.95 are found for d.f. = 26 as χ²0.05 = 38.885 and χ²0.95 = 15.379, respectively. Thus the limits of the confidence interval are:
$$b_1 = \frac{1}{38.885}\,27 \times 300^2 = 62\,500$$
$$b_2 = \frac{1}{15.379}\,27 \times 300^2 = 157\,997$$

Thus the population variance remains within the interval (62 500, 157 997) with a probability of 90%.
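With SciPy the same quantiles and limits can be computed directly (again a sketch that assumes scipy is installed; chi2.ppf returns the quantile for a given non-exceedance probability):

    from scipy.stats import chi2

    n, s, pc = 27, 300, 0.90
    x2_hi = chi2.ppf(1 - (1 - pc) / 2, df=n - 1)   # ~ 38.885 (exceedance prob. 0.05)
    x2_lo = chi2.ppf((1 - pc) / 2, df=n - 1)       # ~ 15.379 (exceedance prob. 0.95)
    print(n * s ** 2 / x2_hi, n * s ** 2 / x2_lo)  # ~ 62 492  158 008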

Table 7.2. χ² Distribution Table

Degree of Probability
freedom 0,99 0,975 0,95 0,90 0,50 0,10 0,05 0,025 0,01
1 0.000 0.001 0.004 0.016 0.455 2.706 3.841 5.024 6.635
2 0.020 0.051 0.103 0.211 1.386 4.605 5.991 7.378 9.210
3 0.115 0.216 0.352 0.584 2.366 6.251 7.815 9.348 11.345
4 0.297 0.484 0.711 1.064 3.357 7.779 9.488 11.143 13.277
5 0.554 0.831 1.145 1.610 4.351 9.236 11.071 12.833 15.086
6 0.872 1.237 1.635 2.204 5.348 10.645 12.592 14.449 16.812
7 1.239 1.690 2.167 2.833 6.346 12.017 14.067 16.013 18.475
8 1.646 2.180 2.733 3.490 7.344 13.362 15.507 17.535 20.090
9 2.088 2.700 3.325 4.168 8.343 14.684 16.919 19.023 21.666
10 2.558 3.247 3.940 4.865 9.342 15.987 18.307 20.483 23.209
11 3.053 3.816 4.575 5.578 10.341 17.275 19.675 21.920 24.725
12 3.571 4.404 5.226 6.304 11.340 18.549 21.026 23.337 26.217
13 4.107 5.009 5.892 7.042 12.340 19.812 22.362 24.736 27.688
14 4.660 5.629 6.571 7.790 13.339 21.064 23.685 26.119 29.141
15 5.229 6.262 7.261 8.547 14.339 22.307 24.996 27.488 30.578
16 5.812 6.908 7.962 9.312 15.339 23.542 26.296 28.845 32.000
17 6.408 7.564 8.672 10.085 16.338 24.769 27.587 30.191 33.409
18 7.015 8.231 9.390 10.865 17.338 25.989 28.869 31.526 34.805
19 7.633 8.907 10.117 11.651 18.338 27.204 30.144 32.852 36.191
20 8.260 9.591 10.851 12.443 19.337 28.412 31.410 34.170 37.566
21 8.897 10.283 11.591 13.240 20.337 29.615 32.671 35.479 38.932
22 9.542 10.982 12.338 14.042 21.337 30.813 33.924 36.781 40.289
23 10.196 11.689 13.091 14.848 22.337 32.007 35.173 38.076 41.638
24 10.856 12.401 13.848 15.659 23.337 33.196 36.415 39.364 42.980
25 11.524 13.120 14.611 16.473 24.337 34.382 37.653 40.647 44.314
26 12.198 13.844 15.379 17.292 25.336 35.567 38.885 41.923 45.642
27 12.879 14.573 16.151 18.114 26.336 36.741 40.113 43.194 46.963
28 13.565 15.308 16.928 18.939 27.336 37.916 41.337 44.461 48.278
29 14.257 16.047 17.708 19.768 28.336 39.088 42.557 45.722 49.588
30 14.954 16.791 18.493 20.599 29.336 40.256 43.773 46.979 50.892
40 22.164 24.433 26.509 29.051 39.335 51.805 55.759 59.342 63.691
50 29.707 32.357 34.764 37.689 49.335 63.167 67.505 71.420 76.154
60 37.485 40.482 43.188 46.459 59.335 74.397 79.082 83.298 88.379
70 45.442 48.756 51.739 55.329 69.334 85.527 90.531 95.023 100.425
80 53.540 57.153 60.392 64.278 79.334 96.578 101.879 106.629 112.329
90 61.754 65.647 69.126 73.291 89.334 107.561 113.145 118.136 124.116
100 70.065 74.222 77.930 82.358 99.334 118.498 124.342 129.561 135.807

REFERENCES:
– Bayazıt, M., Oğuz, B., 1998, Probability and Statistics for Engineers, Birsen Yayınevi.
– Spiegel, M.R., 1992, Schaum's Outline Series: Theory and Problems of Statistics, 2nd ed., McGraw-Hill.

Chapter 8

STATISTICAL HYPOTHESIS TESTING


HYPOTHESIS TESTS FOR PARAMETERS
It is known that the true value of a parameter of a random variable can never be known, as its population cannot be observed in its entirety. Sometimes we make a statistical hypothesis that the value of a parameter β equals β0, a value chosen by us. The procedure of checking whether this value can be accepted is called testing the hypothesis β = β0, where β is any parameter (such as the mean or standard deviation) and β0 is the population value we have chosen for it in the particular problem we are concerned with.
We can test a hypothesis only by comparing the value of β0 with b, the value of the statistic corresponding to the parameter β that we have computed from a sample. Obviously, we cannot simply decide that the hypothesis is false when b is not equal to β0. It could be that the value of b estimated from the sample is somewhat different from the assumed value β0 because of the sampling distribution, although β is in fact equal to β0. Therefore we must accept the hypothesis if the value of b is not too far from β0. How large should the difference between b and β0 be for rejecting the hypothesis? An exact answer to this question cannot be given, because we cannot expect to be correct all the time. However, we should use a systematic approach in testing statistical hypotheses to standardize the procedure. This is explained below.
In testing the hypothesis β = β0, called H0 (the null hypothesis), we must first decide on the following:
1. The alternate hypothesis H1 should be chosen. This is the hypothesis that will be accepted when the null hypothesis is rejected by the test. The alternate hypothesis may have the forms H1: β ≠ β0, β < β0 or β > β0, depending on the structure of the problem. If our objective is simply to check whether β equals β0 or not, we work with the hypothesis H1: β ≠ β0. But when we are especially interested in knowing whether β is greater than β0, we must test against the hypothesis H1: β > β0. Similarly, when we want to know whether β is smaller than β0, the alternate hypothesis should be of the form β < β0. For example, in examining the strength of a material it is critical if the strength is below a particular value, so the alternate hypothesis is H1: β < β0. On the other hand, in working with flood flows the exceedance of a certain value is critical, and we must choose H1: β > β0.
2. The difference β − β0 that can be tolerated for accepting the hypothesis H0 is related to the level of significance α for which the test is performed. Once the value of α is chosen, the sampling distribution of the statistic b, drawn with mean β0, is divided into regions of acceptance and rejection. The acceptance region is near the value β0. The rejection region, with an area equal to α (the probability of the statistic falling in this critical region), is farther from β0. This region is either on one tail or on both tails of the sampling distribution, depending on the type of the alternate hypothesis H1:

I. Two–tailed test
Ho: β=βo
H1: β≠βo

If the hypothesis H0: β = β0 is to be checked against the hypothesis H1: β ≠ β0, the null hypothesis should be rejected either when the observed value of b is much larger or much smaller than β0. Therefore the region of rejection is placed on both tails of the sampling distribution symmetrically. The region of acceptance in this case lies between the values of the statistic with the exceedance probabilities of 1 − α/2 and α/2 (Fig. 8.1). If the observed value of the statistic lies in this region the null hypothesis is accepted; otherwise it is rejected, implying that the alternate hypothesis H1: β ≠ β0 is accepted.

Fig. 8.1. Testing of the hypothesis Ho: β=βo with H1: β≠βo

II. One–tailed test


Ho: β=βo
H1: β<βo or β>βo
If the hypothesis H0: β = β0 is to be checked against H1: β > β0 (or β < β0), the null hypothesis will be rejected only when the observed value of b is much larger (or much smaller) than β0. Therefore the critical (rejection) region is on the right (left) tail of the sampling distribution: it is to the right of the value of the statistic with the exceedance probability α (or to the left of the value of the statistic with the exceedance probability 1 − α) (Fig. 8.2). If the observed value of the statistic lies in this region of rejection the null hypothesis is rejected (the alternate hypothesis H1 is accepted); otherwise H0 is accepted.

Fig. 8.2. Testing of the hypothesis Ho: β=βo with H1: β>βo
It is seen that statistical hypothesis testing helps us decide whether an assumed value β0 for a parameter β of a random variable can be accepted as true, by comparing it with the value of the corresponding statistic b obtained from a sample. If their difference is not too big, it is considered to be caused by the sampling distribution, and the hypothesis β = β0 is accepted. However, errors are unavoidable, because the whole population can never be observed. Decisions made in hypothesis testing can have four different relations to the unknown reality, as shown in the following table.

                              REAL SITUATION
DECISION                      H0 IS TRUE                           H0 IS FALSE
ACCEPT H0                     CORRECT DECISION                     INCORRECT DECISION (TYPE II ERROR)
REJECT H0 (ACCEPT H1)         INCORRECT DECISION (TYPE I ERROR)    CORRECT DECISION

It is seen that two kinds of errors may exist in the decisions made in hypothesis testing. Type I
error corresponds to the rejection of the null hypothesis when it is in fact true. Type II error is
made when we accept the null hypothesis although it is in fact false.

Fig. 8.3. Probability of type II error increases with the decrease of the probability of type I error,
α

A type II error is generally regarded as the less serious of the two and is usually tolerated.

APPLICATIONS
Hypothesis testing has several applications in engineering problems. As an example, we can check whether the expected value of the strength of a material conforms to the specifications by comparing it with experimental results: it is tested whether the mean of the experimental data is significantly lower than the specified value. For another example, we can check whether the mean precipitation depths before and after the construction of a reservoir are significantly different, to decide about the possible effect of the reservoir on the precipitation. The result of the test may differ according to the chosen level of significance. In practice the value of α is usually taken as 0.05 or 0.10; such a standard value of the significance level facilitates the transmission of information. Reducing the value of α decreases the probability of a type I error (rejecting the null hypothesis when it is true) but increases the probability of a type II error.
Note: In hypothesis testing, the limits of the confidence interval (b1 and b2) are calculated from the population values, and the sample statistic (mean or standard deviation) is checked to see whether it lies within these limits.

EXAMPLE 8.1
It is known that the mean annual precipitation at a location is 68 cm (H0: μx = 68) with a standard deviation of 12 cm. A sample of 36 years is taken and its mean is calculated as 71 cm (H1: μx ≠ 68).
Determine whether the mean calculated from the sample (x̄ = 71) is consistent with the population mean (μx = 68). (Distribution is normal, α = 0.10)

SOLUTION

Ho: β=βo (𝑋𝑋̄=μx )


H1: β≠βo (𝑋𝑋̄≠μx)

The mean precipitation estimated from measurements over N = 36 years is 71 cm. The sampling distribution of the mean is normal, with standard deviation σx/√N = 12/√36 = 2.0 cm.
$$b_1 = \mu_x - Z\,\sigma_x/\sqrt{N}, \qquad b_2 = \mu_x + Z\,\sigma_x/\sqrt{N}$$
Since the level of significance is α = 0.10 and α/2 = 0.05, the z-score from the Z-table is 1.65. Therefore the limits of the acceptance region are:
$$b_1 = 68 - 1.65 \times 12/\sqrt{36} = 64.7$$
$$b_2 = 68 + 1.65 \times 12/\sqrt{36} = 71.3$$

The measured value (71.0) is inside this region, therefore the H0 is accepted at the α=0.10 level.
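The decision rule of this two-tailed test can be expressed compactly in Python (an illustrative sketch of the acceptance-region check above):

    from math import sqrt

    mu0, sigma, n, xbar = 68, 12, 36, 71
    z = 1.65                               # two-tailed critical value for alpha = 0.10
    half = z * sigma / sqrt(n)
    accept_h0 = mu0 - half <= xbar <= mu0 + half
    print(accept_h0)                       # True -> H0 accepted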

EXAMPLE 8.2
If the record length were N = 64 years (assuming that all the other data remain the same), test the hypothesis.

SOLUTION

Ho: β=βo (𝑋𝑋̄=μx)


H1: β≠βo (𝑋𝑋̄≠μx)
The standard deviation of the sampling distribution of the mean would be 12/√64 = 1.5 cm, and the boundaries of the region of acceptance:

$$b_1 = 68 - 1.65 \times 1.5 = 65.5 \text{ cm}$$
$$b_2 = 68 + 1.65 \times 1.5 = 70.5 \text{ cm}$$

In this case the measured value is outside the acceptance region, and H1 is accepted. This is because the acceptance region is narrower when the size of the sample is larger.
In the test, the probability of making a type I error is α = 0.10. The probability of making a type II error can be computed only if a value is assumed for the population mean μx. Assuming μx = 72 cm, a type II error will be made for N = 36 when the measured mean is between 64.7 and 71.3 cm, because in this case the null hypothesis H0: μx = 68 will be accepted although it is false (we assumed μx = 72 cm). Let us compute the probability of making a type II error:
$$z_1 = \frac{64.7 - 72}{2} = -3.65, \qquad z_2 = \frac{71.3 - 72}{2} = -0.35$$
$$P(64.7 < \bar{x} < 71.3) = P(-3.65 < z < -0.35) = P(z < -0.35) - P(z < -3.65) = 0.3632 - 0.0001 = 0.3631$$

On the other hand, if μx were equal to 74 cm, the probability of a type II error would be:
$$z_1 = \frac{64.7 - 74}{2} = -4.65, \qquad z_2 = \frac{71.3 - 74}{2} = -1.35$$
$$P(z < -1.35) - P(z < -4.65) = 0.0885$$
It is seen that the probability of a type II error increases rapidly as the true population parameter value approaches the hypothesized value. But in this case, making a type II error (i.e., accepting the null hypothesis when it is false) will not have serious consequences. For this reason, an appropriate value for α (such as 0.05 or 0.10) is chosen in practice and the probability of the type II error is not considered.

EXAMPLE 8.3
A manufacturer claims that the mean weight of his products is μx = 2.15 kg. In order to check the manufacturer's claim, 9 samples are taken and the mean is found to be x̄ = 1.95 kg. Test the hypothesis at a significance level of α = 0.10. (N = 9, σx = 0.4, distribution is normal)
SOLUTION

For small samples the sampling distribution of the mean is the t distribution with N − 1 = 8 degrees of freedom. For the two-tailed test, α/2 = 0.10/2 = 0.05 and t0.05 = 1.860 from the t-Table. The boundaries of the acceptance region:
H0: x̄ = μx
H1: x̄ ≠ μx
$$b_1 = \mu_x - t\,\sigma_x/\sqrt{N} = 2.15 - 1.860\,(0.4)/\sqrt{9} = 1.90$$
$$b_2 = \mu_x + t\,\sigma_x/\sqrt{N} = 2.15 + 1.860\,(0.4)/\sqrt{9} = 2.40$$
The measured value x̄ = 1.95 is inside this region, and H0 is accepted.

EXAMPLE 8.4
What happens if, in the previous problem, we test whether the mean of the sample is less than the mean of the population (i.e., x̄ < μx)?

SOLUTION
The test changes into a one-tailed one.

H0: x̄ = μx
H1: x̄ < μx

For the one-tailed test, t0.10 = 1.397. The lower boundary of the acceptance region:

b = 2.15 − 1.397 × 0.4/√9 = 1.96

Since x̄ = 1.95 < 1.96, H1 is accepted, although the difference (1.96 − 1.95) is very small.

EXAMPLE 8.5
It is desired to produce steel bars with length μx = 12 cm and σx = 2.5 cm. If the mean length
of a 49-element sample is x̄ = 11.2 cm, can it be accepted that the desired mean value has been
achieved? (α = 0.05)

SOLUTION
H0: μx=12 cm
H1: μx≠12 cm
At the significance level of 0.05, the boundaries of the region of acceptance are:

b1 = 12 − 1.96 × 2.5/√49 = 11.3
b2 = 12 + 1.96 × 2.5/√49 = 12.7

The normal distribution is used as the sampling distribution of the mean because the population
standard deviation is assumed to be known; thus z0.025 = 1.96 from the z-table. The measured mean
11.2 cm is outside the acceptance region and H1 is accepted. It means that the target value for the
mean was not achieved.

EXAMPLE 8.6
A company claims that the standard deviation of their products is 20. A sample of 23 is taken
from the products and the standard deviation is found to be 24. Test whether the company's
claim is correct. (α = 0.10; note that the χ² distribution used below is positively skewed)
SOLUTION
H0: σx² = Sx²
H1: σx² ≠ Sx²

We must test the hypothesis H0: σx = 20 against the alternative hypothesis H1: σx ≠ 20. At the
significance level of 0.10, the boundaries of the region of acceptance are:

b1,2 = N·Sx² / χ²1,2

where χ²1 and χ²2 are the upper and lower α/2 = 0.05 points of the χ² distribution with
N − 1 = 22 degrees of freedom (33.924 and 12.338 from the χ²-table).

b1 = 23 × 24² / 33.924 = 390.5
b2 = 23 × 24² / 12.338 = 1073.8

Since the variance of the population (20² = 400) is between b1 and b2, H0 is accepted.
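
This variance test can be checked with scipy's χ² percent-point function, which supplies the table
values used above (a sketch; the names are our own):

    from scipy.stats import chi2

    sigma0, s, N, alpha = 20.0, 24.0, 23, 0.10
    df = N - 1                                     # 22 degrees of freedom

    b1 = N * s ** 2 / chi2.ppf(1 - alpha / 2, df)  # 13248/33.924 ~ 390.5
    b2 = N * s ** 2 / chi2.ppf(alpha / 2, df)      # 13248/12.338 ~ 1073.8

    print("accept H0" if b1 < sigma0 ** 2 < b2 else "accept H1")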

COMPARISON TEST WITH T-DISTRIBUTION


A comparison t-test is used to compare two population means using two samples, including the
paired case in which observations in one sample can be paired with observations in the other
sample. Examples of where this might occur are:
• Before-and-after observations on the same subjects (e.g. students' diagnostic test results
before and after a particular module or course).
• A comparison of two different methods of measurement or two different treatments, where
the measurements/treatments are applied to the same subjects (e.g. blood pressure
measurements using a stethoscope and a Dynamap).

STEPS OF THE TEST

1. Setting the hypotheses

For the two-tailed test:
H0: μ1 = μ2
H1: μ1 ≠ μ2

For the one-tailed test:
H0: μ1 = μ2
H1: μ1 < μ2 or H1: μ1 > μ2

2. Finding the critical t-value

For the two-tailed test:
tcrt = tα/2, (n1+n2−2)

For the one-tailed test:
tcrt = tα, (n1+n2−2)

3. Finding the calculated t-value

tcal = (x̄1 − x̄2 − (μ1 − μ2)) / Sx̄1−x̄2,  with  Sx̄1−x̄2 = √((S1² + S2²)/n)

(the expression for Sx̄1−x̄2 applies for equal sample sizes n1 = n2 = n, as in the example below).

4. Comparing the t-values

tcal < tcrt => Accept H0
tcal > tcrt => Accept H1
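
The four steps can be collected into one small helper; this is a sketch under the equal-sample-size
assumption used in the formula above (the function name compare_means is our own):

    from scipy.stats import t

    def compare_means(x1_bar, s1, x2_bar, s2, n, alpha, one_tailed=False):
        """Two-sample t comparison for equal sample sizes (n1 = n2 = n).

        Returns (t_cal, t_crt); H1 is accepted when t_cal exceeds t_crt.
        """
        df = 2 * n - 2                              # n1 + n2 - 2
        tail = alpha if one_tailed else alpha / 2   # tail probability
        t_crt = t.ppf(1 - tail, df)                 # critical value from the t-table
        se = ((s1 ** 2 + s2 ** 2) / n) ** 0.5       # std. error of the difference
        t_cal = (x1_bar - x2_bar) / se              # under H0: mu1 - mu2 = 0
        return t_cal, t_crt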

EXAMPLE 8.7
In order to compare the performance of two types of oil, a test drive is performed with 15
motorbikes fuelled with TypeA oil and the same number of motorbikes with TypeB. The mean
distance measured for TypeA is 25 km and the standard deviation is 0.5 km. For TypeB they are
measured as 22 km and 0.4 km, respectively. Is the performance of A statistically better than
that of B? (α = 0.01)

SOLUTION
TypeA oil: x̄A = 25 km, nA = 15, SA = 0.5 km
TypeB oil: x̄B = 22 km, nB = 15, SB = 0.4 km

1) One-tailed test

H0: μA = μB
H1: μA > μB

2) tcrt = tα, (nA+nB−2) = t0.01, 28 = 2.467

3) Sx̄A−x̄B = √((SA² + SB²)/n) = √((0.5² + 0.4²)/15) = 0.165

   tcal = (x̄A − x̄B − (μA − μB)) / Sx̄A−x̄B = (25 − 22 − 0) / 0.165 = 18.18

4) Since tcal > tcrt (18.18 > 2.467), H1 is accepted. It means that the performance of TypeA
oil is statistically better than that of TypeB.
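
Using the compare_means sketch above on this example, and cross-checking with scipy's
summary-statistics t-test (which, for equal sample sizes, uses the same pooled standard error):

    from scipy.stats import ttest_ind_from_stats

    t_cal, t_crt = compare_means(25, 0.5, 22, 0.4, n=15, alpha=0.01, one_tailed=True)
    print(t_cal, t_crt)            # ~18.1 and 2.467 -> accept H1

    # Cross-check: scipy's pooled two-sample t statistic matches t_cal.
    res = ttest_ind_from_stats(25, 0.5, 15, 22, 0.4, 15, equal_var=True)
    print(res.statistic)           # ~18.1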


Chapter 9

REGRESSION ANALYSIS
Regression analysis is a technique used for the modeling and analysis of numerical data consisting
of values of at least two variables: (1) a dependent variable (response variable) and (2) one or
more independent variables (explanatory variables). The dependent variable in the regression
equation is modeled as a function of the independent variables and the corresponding parameters
("constants"). The parameters are estimated so as to give a "best fit" to the data.

y: Dependent variable
x: Independent variable
In engineering problems, the values that two (or more) random variables take in an observation
may not be statistically independent of each other; thus there may be a relation between these
variables. The existence of such a relation shows either that one variable is affected by the other
or that both variables are affected by other variables. As an example, the relation between
precipitation and flow in a basin originates because flow takes place as a consequence of
precipitation. The relation between flows in neighboring basins arises because the flows are
affected by the precipitation of the same region.
However, these relations are not of a deterministic (functional) character; in other words, when one
of the variables takes a certain value the other will not always take the same value. This value will
change more or less in various observations with the effect of other variables which we have not
considered in the relation. As an example, when the flow of one of two neighboring basins takes
a certain value, the flow of the other does not always take the same value. Still, determining the
existence and the form of such a nonfunctional relationship between the variables has great
importance in practice, because by using this relationship it is possible to estimate a future value
of a variable from the known value(s) of one or more other variables. While this estimate will not
be the exact future value of the variable under consideration, it will be the best estimate closest
to this value. The interval within which the difference of the estimated value from the real value
(the error) will remain can be determined with a certain probability.
The mathematical expression showing a relation of the above-mentioned type is called the
regression equation. The aim of regression analysis is to check whether there is a significant
relation between the variables under consideration and, if there is one, to obtain the regression
equation expressing this relation and to evaluate the confidence interval of the estimates to be
made by using this equation.
Regression analysis can be classified as simple linear or multivariate linear regression analysis.
Simple linear regression analysis, the most frequently used one, assumes a linear relationship
between two variables. Multivariate linear regression analysis, on the other hand, assumes a
linear relationship among more than two variables.
In these course notes, only simple linear regression analysis will be discussed.

SIMPLE LINEAR REGRESSION ANALYSIS

Correlation Coefficient (r)


There are several types of correlation coefficients, but the most common one is the Pearson
correlation (r). It measures the strength and direction of the linear relationship between two
variables. It cannot capture nonlinear relationships between two variables and cannot differentiate
between dependent and independent variables.

rx,y = Σ (xi − x̄)(yi − ȳ) / (N·Sx·Sy),  the sum running over i = 1, ..., N
A value of exactly 1.0 means there is a perfect positive linear relationship between the two
variables: for a positive increase in one variable, there is also a positive increase in the second
variable. A value of −1.0 means there is a perfect negative linear relationship: the variables move
in opposite directions, so an increase in one variable is accompanied by a decrease in the second
variable. If the correlation is 0, there is no linear relationship between the two variables.
The strength of the relationship varies in degree based on the value of the correlation coefficient.
For example, a value of 0.2 shows there is a positive relationship between the two variables, but
it is weak and likely unimportant in practice. Many analysts do not consider a correlation strong
until the absolute value surpasses about 0.8, and a correlation coefficient with an absolute value
of 0.9 or greater represents a very strong relationship.

Regression Equation
Let us assume that X and Y are two random variables between which there is a significant
relationship. In order to express this relation we should determine the regression equation of Y
with respect to X,

y = a + b·x

Minimizing the sum of squared deviations of the observations from this line (least squares) gives
the following expressions for the regression coefficients:

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = r·Sy/Sx
a = ȳ − b·x̄

EXAMPLE 9.1
Annual flows (10⁶ m³) at two stations on the river Dicle are given below:

Year 1956  1957  1958  1959  1960  1961  1962  1963
X    16077 14817 11720 9352  10537 6743  10162 29232
Y    4629  4556  2507  1612  2125  1054  2272  11883

Year 1964  1965  1966  1967  1968  1969  1970  1971
X    16439 13019 17729 20368 26748 33566 12314 10914
Y    –     4041  5191  5328  6543  7606  3445  3161

There are 15 years of joint observations at the two stations. If each of these observations is shown
as a point on the X–Y coordinate system, we see that the points are distributed with a small
scatter around a straight line (Fig. 9.1). (The point for the year 1964 is not shown, since the Y
flow of that year was not recorded.)

a) Find the correlation coefficient if there is a linear relation between the two stations.
b) Find the regression equation.
c) Estimate the missing Y value of 1964.

[Fig. 9.1. Plot of the annual flows measured at two stations on the river Dicle. The fitted trend
line is y = 0.3139x − 694.56 with R² = 0.8106.]

a) The following values (10⁶ m³) of the statistics are computed for the sample:

x̄ = 16220
ȳ = 4397
Sx = 7670
Sy = 2674

(Note: do not use the X and Y values of 1964, and take N as 15.)

The correlation coefficient: rx,y = 0.90

This indicates a strong positive linear relation between the flows at the two stations.

b) The coefficients of the regression line of flows at station Y with respect to flows at station
X are:

b = r·Sy/Sx = 0.90 × 2674/7670 = 0.313
a = ȳ − b·x̄ = 4397 − 0.313 × 16220 = −694 (10⁶ m³)
The regression equation is:

y = 0.313x − 694

c) The estimated value of Y for the year 1964 is:

y = 0.313 × 16439 − 694 = 4451
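
The whole example can be reproduced with numpy; the data below are the 15 joint observations
from the table above (1964 excluded). A sketch; note that np.polyfit returns the unrounded
coefficients shown on the figure:

    import numpy as np

    # 15 joint annual flows (10^6 m^3); the 1964 pair is excluded.
    x = np.array([16077, 14817, 11720, 9352, 10537, 6743, 10162, 29232,
                  13019, 17729, 20368, 26748, 33566, 12314, 10914])
    y = np.array([4629, 4556, 2507, 1612, 2125, 1054, 2272, 11883,
                  4041, 5191, 5328, 6543, 7606, 3445, 3161])

    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation, ~0.90
    b, a = np.polyfit(x, y, 1)    # slope ~0.3139, intercept ~-694.6

    print(f"r = {r:.2f}, y = {b:.3f}x {a:+.0f}")
    # Estimate for 1964 (X = 16439): ~4466 with unrounded coefficients;
    # the text's 4451 uses the rounded values 0.313 and -694.
    print(f"Y(1964) = {b * 16439 + a:.0f}")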

COMMON MISCONCEPTIONS ABOUT CORRELATION

Correlation and causality


The conventional dictum that "correlation does not imply causation" means that correlation cannot
be validly used to infer a causal relationship between the variables. This dictum should not be
taken to mean that correlations cannot indicate causal relations. However, the causes underlying
the correlation, if any, may be indirect and unknown. Consequently, establishing a correlation
between two variables is not a sufficient condition to establish a causal relationship (in either
direction).
Here is a simple example: hot weather may cause both a reduction in purchases of warm clothing
and an increase in ice–cream purchases. Therefore warm clothing purchases are correlated with
ice–cream purchases. But a reduction in warm clothing purchases does not cause ice–cream
purchases and ice–cream purchases do not cause a reduction in warm clothing purchases.
A correlation between age and height in children is fairly causally transparent, but a correlation
between mood and health in people is less so. Does improved mood lead to improved health? Or
does good health lead to good mood? Or does some other factor underlie both? Or is it pure
coincidence? In other words, a correlation can be taken as evidence for a possible causal
relationship, but cannot indicate what the causal relationship, if any, might be.

Correlation and linearity


Anscombe's quartet comprises four datasets that have identical simple statistical properties, yet
appear very different when graphed. Each dataset (Table 9.1) consists of eleven (x, y) points. They
were constructed in 1973 by the statistician F.J. Anscombe to demonstrate both the importance of
graphing data before analyzing it and the effect of outliers on statistical properties.
Table 9.1 Anscombe's Quartet
1 2 3 4
X1 Y1 X2 Y2 X3 Y3 X4 Y4
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

Statistical properties of each set are given in Table 9.2.

Table 9.2. Statistical properties of Anscombe's quartet
Property Value
Mean of x in each case 9.0
Variance of x in each case 11.0
Mean of y in each case 7.5
Variance of y in each case 4.12
Correlation between x and y in each case 0.816
Linear regression equation in each case y = 3 + 0.5x

Fig. 9.2. Four sets of data, each with the same correlation of 0.816

The images in the figure (Fig. 9.2) show plots of Anscombe's quartet and the linear regression
line obtained for each. As seen from Table 9.2, the correlation coefficient for each set is the same
(0.816). However, as can be seen in the plots, the distributions of the variables are very different.
The first one (top left) seems to be distributed normally and corresponds to what one would expect
when considering two correlated variables following the assumption of normality. The second
one (top right) is not distributed normally; while an obvious relationship between the two variables
can be observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third
case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough
influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom
right) shows that one outlier is enough to produce a high correlation coefficient, even though the
relationship between the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the
individual examination of the data. They also show that the correlation coefficient or regression
equation should be calculated only after one is visually convinced that there is an actual correlation
between the x and y variables.
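
The identical summary statistics of Table 9.2 can be verified directly from the data of Table 9.1
(a sketch; note that the sample variance of 11.0 requires ddof=1):

    import numpy as np

    x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # sets 1-3 share the same x
    x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
    ys = [
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ]

    for i, (x, y) in enumerate(zip([x123, x123, x123, x4], ys), start=1):
        x, y = np.array(x), np.array(y)
        r = np.corrcoef(x, y)[0, 1]   # ~0.816 in every set
        b, a = np.polyfit(x, y, 1)    # ~y = 3.00 + 0.500x in every set
        print(f"set {i}: mean_x = {x.mean():.1f}, var_x = {x.var(ddof=1):.1f}, "
              f"r = {r:.3f}, y = {a:.2f} + {b:.3f}x")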

Chapter 10

VARIANCE ANALYSIS
INTRODUCTION
Variance analysis is a statistical method for revealing the differences among data sets. It is very
similar to the comparison test in Chapter 8; however, by the comparison test only two sets of data
can be compared, whereas by variance analysis more than two sets of data can be compared with
each other.
Variance analysis is based on Fisher's F-distribution, which is related to the χ² distribution. The
distribution has tables for α = 0.01, 0.05, 0.10 and 0.25.

STEPS OF VARIANCE ANALYSIS


1– The hypotheses are set:
H0: μ1 = μ2 = μ3 = ... = μp
H1: the means are not all equal (μi ≠ μj for at least one pair i, j)

2– The critical table value Fc is determined from the F-table.

Fc = Fα(mdf, rdf)

3– The Fcal value is calculated.

Fcal = MMS / RMS

4– Fcal is compared with the Fc value.

If Fc > Fcal, H0 is accepted.
If Fc < Fcal, H1 is accepted.

FINDING FCAL VALUE


The total sum of squares of the deviations of all observations from the overall mean is

TSS = Σi Σj xij² − x..²/(n·p)

where xij is observation j of treatment i and x.. is the grand total of all observations. The
treatment sum of squares of the deviations of the treatment means is

MSS = (Σi xi.²)/n − x..²/(n·p)

where xi. is the total of treatment i and n is the number of observations per treatment.

The residual sum of squares of the deviations from the means within treatments is

RSS = TSS − MSS

The degrees of freedom are calculated as given below.

Total degrees of freedom (tdf) = (n·p) − 1
Treatment degrees of freedom (mdf) = p − 1
Residual degrees of freedom (rdf) = tdf − mdf

Treatment mean square (MMS) = MSS / mdf
Residual mean square (RMS) = RSS / rdf

Finally, the Fcal value is obtained as

Fcal = MMS (variance of treatments) / RMS (variance of residuals)

Mathematical operations can be summarized as in the following table.

Table 10.1 Variance Analysis Table (VAT)

Variation   Degrees of Freedom   Sum of Squares   Mean of Squares   Fcal
Treatment   mdf                  MSS              MMS               MMS/RMS
Residual    rdf                  RSS              RMS
Total       tdf                  TSS

EXAMPLE 10.1
Four types of batteries are produced in a battery plant. In order to find the differences among the
performances of the batteries, 6 samples are taken of each type and the life of each battery is
measured. Is there any difference among the performances of the batteries at α = 0.01?

Table 10.2 Information about the batteries

Life of batteries (h)
Sample   A     B      C     D
1        64    98     75    55
2        72    91     93    66
3        68    97     78    49
4        77    82     71    64
5        56    85     63    70
6        95    77     76    68
x̄        72    88.3   76    62     Total
Σxi      432   530    456   372    1790

SOLUTION
1) H0: µA = µB = µC = µD
H1: the means are not all equal

2) Fc = Fα(mdf, rdf)

Since mdf = 3, rdf = 20 and α = 0.01, the Fc value can be found from Fisher's table as
Fc = F0.01(3, 20) = 4.94
3) Fcal = MMS / RMS

n (number of samples per type) = 6
p (number of battery types) = 4

TSS = Σ Σ xij² − x..²/(n·p) = 64² + 72² + ... + 68² − 1790²/24 = 4207.8

MSS = (Σ xi.²)/n − x..²/(n·p) = (432² + 530² + 456² + 372²)/6 − 1790²/24 = 2136.5

RSS = TSS − MSS = 4207.8 − 2136.5 = 2071.3

The degrees of freedom can be calculated as

tdf = n·p − 1 = 6 × 4 − 1 = 23
mdf = p − 1 = 4 − 1 = 3
rdf = tdf − mdf = 23 − 3 = 20

Finally,

MMS = MSS/mdf = 2136.5/3 = 712.2
RMS = RSS/rdf = 2071.3/20 = 103.6
Fcal = MMS/RMS = 712.2/103.6 = 6.87

4) Since Fcal (6.87) > Fc (4.94), H1 is accepted. There is a significant difference among the
battery performances.

The calculations can be summarized in the Variance Analysis Table below.

Table 10.3 VAT of Example 10.1

Variation   Degrees of Freedom   Sum of Squares   Mean of Squares   Fcal
Treatment   3                    2136.5           712.2             6.87
Residual    20                   2071.3           103.6
Total       23                   4207.8
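
The same analysis can be carried out in Python, both with the hand formulas above and with
scipy's one-way ANOVA as a cross-check (a sketch; the array layout is our own choice):

    import numpy as np
    from scipy.stats import f, f_oneway

    # Rows = battery types A-D, columns = the 6 samples of each type.
    data = np.array([
        [64, 72, 68, 77, 56, 95],
        [98, 91, 97, 82, 85, 77],
        [75, 93, 78, 71, 63, 76],
        [55, 66, 49, 64, 70, 68],
    ])
    p, n = data.shape                 # p = 4 treatments, n = 6 samples each
    c = data.sum() ** 2 / (n * p)     # correction term x..^2 / (n p)

    tss = (data ** 2).sum() - c                     # ~4207.8
    mss = (data.sum(axis=1) ** 2).sum() / n - c     # ~2136.5
    rss = tss - mss                                 # ~2071.3

    f_cal = (mss / (p - 1)) / (rss / (n * p - p))   # ~6.87
    f_c = f.ppf(0.99, p - 1, n * p - p)             # ~4.94
    print("accept H1" if f_cal > f_c else "accept H0")

    print(f_oneway(*data).statistic)                # ~6.87, same result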

In order to find the different one(s), multiple comparison tests such as the LSD, Duncan, SNK,
Bonferroni, Tukey, Scheffé and Dunnett tests are used. In this book, the LSD test will be used.

LSD (LEAST SIGNIFICANT DIFFERENCE) TEST

In this test, the LSD value is calculated first. Then the absolute value of the difference between
each pair of set means is calculated and compared with the LSD value.

LSD = √( Fα(1, rdf) × 2·RMS/n )

If a difference is greater than the LSD value, there is an important (significant) difference
between the corresponding data sets.

EXAMPLE 10.2

Apply the LSD test to the previous example and find the different one(s) among the batteries.
(α = 0.01)

SOLUTION

LSD = √( F0.01(1, 20) × 2 × RMS/n ) = √( 8.10 × 2 × 103.6/6 ) = 16.7

Compared means   Absolute difference (AD)   LSD    Result
|A–B|            |72 − 88.3| = 16.3         16.7   AD < LSD   Not important
|A–C|            |72 − 76| = 4.0            16.7   AD < LSD   Not important
|A–D|            |72 − 62| = 10.0           16.7   AD < LSD   Not important
|B–C|            |88.3 − 76| = 12.3         16.7   AD < LSD   Not important
|B–D|            |88.3 − 62| = 26.3         16.7   AD > LSD   Important *
|C–D|            |76 − 62| = 14.0           16.7   AD < LSD   Not important

As a result, it can be stated that there is an important difference between the performance of
Battery B and Battery D.
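
A sketch of the LSD test in Python, reusing the RMS, n and rdf values from the variance
analysis above (the names are our own; combinations comes from the standard library):

    from itertools import combinations
    from scipy.stats import f

    rms, n, rdf = 103.6, 6, 20
    lsd = (f.ppf(0.99, 1, rdf) * 2 * rms / n) ** 0.5   # ~16.7

    means = {"A": 72.0, "B": 88.3, "C": 76.0, "D": 62.0}
    for (g1, m1), (g2, m2) in combinations(means.items(), 2):
        diff = abs(m1 - m2)
        verdict = "important" if diff > lsd else "not important"
        print(f"|{g1}-{g2}| = {diff:.1f}: {verdict}")   # only |B-D| exceeds LSD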

Fisher's Table (0.01 level)
mdf
rdf 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 4052 5000 5403 5625 5764 5859 5928 5982 6023 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98,50 99,00 99,20 99,20 99,30 99,30 99,40 99,40 99,40 99,00 99,40 99,40 99,40 99,50 99,50 99,50 99,50 99,50 99,50
3 34,10 30,80 29,50 28,70 28,20 27,90 27,70 27,50 27,30 27,00 27,10 26,90 26,70 26,60 26,50 26,40 26,30 26,20 26,10
4 21,20 18,00 16,70 16,00 15,50 15,20 15,00 14,80 14,70 15,00 14,40 14,20 14,00 13,90 13,80 13,70 13,70 13,60 13,50
5 16,30 13,30 12,10 11,40 11,00 10,70 10,50 10,30 10,20 10,00 9,89 9,72 9,55 9,47 9,38 9,29 9,20 9,11 9,02
6 13,70 10,90 9,78 9,15 8,75 8,47 8,26 8,10 7,98 7,90 7,72 7,56 7,40 7,31 7,23 7,14 7,06 6,97 6,88
7 12,20 9,55 8,45 7,85 7,46 7,19 6,99 6,84 6,72 6,60 6,47 6,31 6,16 6,07 5,99 5,91 5,82 5,74 5,65
8 11,30 8,65 7,59 7,01 6,63 6,37 6,18 6,03 5,91 5,80 5,67 5,52 5,36 5,28 5,20 5,12 5,03 4,95 4,86
9 10,60 8,02 6,99 6,42 6,06 5,80 5,61 5,47 5,35 5,30 5,11 4,96 4,81 4,73 4,65 4,57 4,48 4,40 4,31
10 10,00 7,56 6,55 5,99 5,64 5,39 5,20 5,06 4,94 4,85 4,71 4,56 4,41 4,33 4,21 4,17 4,08 4,00 3,91
11 9,65 7,21 6,22 5,67 5,32 5,07 4,89 4,74 4,63 4,54 4,40 4,25 4,10 4,02 3,94 3,86 3,78 3,69 3,60
12 9,33 6,93 5,95 5,41 5,06 4,82 4,64 4,50 4,39 4,30 4,16 4,01 3,86 3,78 3,70 3,62 3,54 3,45 3,36
13 9,07 6,70 5,74 5,21 4,86 4,62 4,44 4,30 4,19 4,10 3,96 3,82 3,66 3,59 3,51 3,43 3,34 3,25 3,17
14 8,86 6,51 5,56 5,04 4,69 4,46 4,28 4,14 4,03 3,94 3,80 3,66 3,51 3,43 3,35 3,27 3,18 3,09 3,00
15 8,68 6,36 5,42 4,89 4,56 4,32 4,14 4,00 3,89 3,80 3,67 3,52 3,37 3,29 3,21 3,13 3,05 2,96 2,87
16 8,53 6,23 5,29 4,77 4,44 4,20 4,03 3,89 3,78 3,69 3,55 3,41 3,26 3,18 3,10 3,02 2,93 2,84 2,75
17 8,40 6,11 5,18 4,67 4,34 4,10 3,93 3,79 3,68 3,59 3,46 3,31 3,16 3,08 3,00 2,92 2,83 2,75 2,65
18 8,29 6,01 5,09 4,58 4,25 4,01 3,84 3,71 3,60 3,51 3,37 3,23 3,08 3,00 2,92 2,84 2,75 2,66 2,57
19 8,18 5,93 5,01 4,50 4,17 3,94 3,77 3,63 3,52 3,43 3,30 3,15 3,00 2,92 2,84 2,76 2,67 2,58 2,49
20 8,10 5,85 4,94 4,43 4,10 3,87 3,70 3,56 3,46 3,37 3,23 3,09 2,94 2,86 2,78 2,69 2,61 2,52 2,42
22 7,95 5,72 4,80 4,31 3,99 3,76 3,59 3,45 3,35 3,26 3,12 2,98 2,83 2,75 2,70 2,58 2,50 2,40 2,30
24 7,82 5,61 4,70 4,22 3,90 3,67 3,50 3,36 3,26 3,17 3,03 2,89 2,74 2,66 2,60 2,49 2,40 2,31 2,20
26 7,72 5,53 4,60 4,14 3,82 3,59 3,42 3,29 3,18 3,09 2,96 2,81 2,66 2,58 2,50 2,42 2,33 2,23 2,10
28 7,64 5,45 4,60 4,07 3,75 3,53 3,36 3,23 3,12 3,03 2,90 2,75 2,60 2,52 2,40 2,35 2,26 2,17 2,10
30 7,56 5,39 4,50 4,02 3,70 3,47 3,30 3,17 3,07 2,98 2,84 2,70 2,55 2,47 2,40 2,30 2,21 2,11 2,00
40 7,31 5,18 4,30 3,83 3,51 3,29 3,12 2,99 2,89 2,80 2,66 2,52 2,37 2,29 2,20 2,11 2,02 1,92 1,80
60 7,08 4,98 4,10 3,65 3,34 3,12 2,95 2,82 2,72 2,63 2,50 2,35 2,20 2,12 2,00 1,94 1,84 1,73 1,60
120 6,85 4,79 4,00 3,48 3,17 2,96 2,79 2,66 2,56 2,47 2,34 2,19 2,03 1,95 1,90 1,76 1,66 1,53 1,40
200 6,76 4,71 3,90 3,41 3,11 2,89 2,73 2,60 2,50 2,41 2,27 2,13 1,97 1,89 1,80 1,69 1,58 1,44 1,30
∞ 6,63 4,61 3,80 3,32 3,02 2,80 2,64 2,51 2,41 2,32 2,18 2,04 1,88 1,79 1,70 1,59 1,47 1,32 1,00

Fisher's Table (0.05 level)
mdf
rdf 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 161 200 216 225 230 234 237 239 241 242 244 246 248 249 250 251 252 253 254
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.5 19.5 19.5 19.5 19.5 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00

REFERENCES

Bayazıt, M., Oğuz, B., 1998, Probability and Statistics for Engineers, Birsen Yayınevi.

Spiegel, M.R., 1992, Schaum's Outline Series, Theory and Problems of Statistics, 2nd ed. in SI units, McGraw-Hill.

Yıldız, N., Akbulut, Ö., Bircan, H., 2005, İstatistiğe Giriş, Aktif Yayınevi.

