Probability and Statistics Ver 2.11 (2022)
CONTENTS
CONTENTS............................................................................................................................ 2
Chapter 1................................................................................................................................... 4
INTRODUCTION TO STATISTICS ........................................................................................ 4
DEFINITION .......................................................................................................................... 4
METHOD OF STATISTICS .................................................................................................. 4
STATISTICAL VARIABLES ................................................................................................ 5
MISUSE OF STATISTICS ..................................................................................................... 6
Chapter 2................................................................................................................................... 7
PROBABILITY ....................................................................................................................... 7
THE PROBABILITIES OF SIMPLE EVENTS ..................................................................... 7
THE PROBABILITIES OF COMPOUND EVENTS ............................................................ 8
PROBABILITIES OF ENGINEERING PROBLEMS ......................................................... 13
EXAMPLES .......................................................................................................................... 14
BERNOULLI TRIALS ......................................................................................................... 18
Chapter 3................................................................................................................................. 23
FREQUENCY ANALYSIS .................................................................................................... 23
DEFINITIONS ...................................................................................................................... 23
FREQUENCY DISTRIBUTIONS ........................................................................................ 23
HISTOGRAMS AND FREQUENCY POLYGONS ............................................................. 24
FREQUENCY HISTOGRAMS ............................................................................................ 25
CUMULATIVE–FREQUENCY DISTRIBUTIONS AND OGIVES .................................. 25
GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS of CONTINUOUS
DATA .................................................................................................................................... 26
GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS of DISCRETE
DATA .................................................................................................................................... 28
TYPES OF FREQUENCY CURVES ................................................................................... 29
Chapter 4................................................................................................................................. 31
THE MEAN, MEDIAN, MODE, AND OTHER MEASURES OF CENTRAL TENDENCY ...... 31
THE ARITHMETIC MEAN ................................................................................................. 31
THE WEIGHTED ARITHMETIC MEAN........................................................................... 32
THE GEOMETRIC MEAN G .............................................................................................. 32
THE HARMONIC MEAN H ................................................................................................ 33
THE RELATION BETWEEN THE ARITHMETIC, GEOMETRIC, AND HARMONIC
MEANS ................................................................................................................................. 33
THE MEDIAN ...................................................................................................................... 33
THE MODE .......................................................................................................................... 34
THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE ........ 34
THE ROOT MEAN SQUARE (RMS) ................................................................................. 35
QUANTILES ........................................................................................................................ 36
QUARTILES, DECILES, AND PERCENTILES ................................................................ 36
Chapter 5................................................................................................................................. 41
THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION ..................... 41
THE VARIANCE ................................................................................................................. 41
THE STANDARD DEVIATION ......................................................................................... 41
COEFFICIENT OF VARIATION ........................................................................................ 42
SKEWNESS .......................................................................................................................... 42
KURTOSIS ........................................................................................................................... 44
Chapter 6................................................................................................................................. 45
PROBABILITY DISTRIBUTION FUNCTIONS ..................................................................... 45
INTRODUCTION ................................................................................................................. 45
NORMAL DISTRIBUTION................................................................................................. 45
LOGNORMAL DISTRIBUTION ........................................................................................ 49
GAMMA DISTRIBUTION .................................................................................................. 55
Chapter 7................................................................................................................................. 58
SAMPLING DISTRIBUTIONS .............................................................................................. 58
THE CONCEPT OF SAMPLING DISTRIBUTION ........................................................... 58
SAMPLING DISTRIBUTIONS ........................................................................................... 60
Chapter 8................................................................................................................................. 69
STATISTICAL HYPOTHESIS TESTING .............................................................................. 69
HYPOTHESIS TESTS FOR PARAMETERS ..................................................................... 69
APPLICATIONS .................................................................................................................. 71
COMPARISON TEST WITH T-DISTRIBUTION .............................................................. 76
Chapter 9................................................................................................................................. 78
REGRESSION ANALYSIS................................................................................................... 78
SIMPLE LINEAR REGRESSION ANALYSIS................................................................... 79
COMMON MISCONCEPTIONS ABOUT CORRELATION ............................................. 81
Chapter 10............................................................................................................................... 83
VARIANCE ANALYSIS ........................................................................................................ 83
INTRODUCTION ................................................................................................................. 83
STEPS OF VARIANCE ANALYSIS ................................................................................... 83
LSD (Least Significant Difference) Test .................................................................................. 86
REFERENCES ......................................................................................................................... 89
Chapter 1
INTRODUCTION TO STATISTICS
DEFINITION
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation, and presentation of data. It is applicable to a wide variety of academic disciplines,
from the natural and social sciences to the humanities, and to government and business.
In short, statistics produces new information from existing data.
Statistical methods can be used to summarize or describe a collection of data; this is called
descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for
randomness and uncertainty in the observations, and then used to draw inferences about the process
or population being studied; this is called inferential statistics. Both descriptive and inferential
statistics comprise applied statistics. There is also a discipline called mathematical statistics,
which is concerned with the theoretical basis of the subject.
In applying statistics to a scientific, industrial, or societal problem, one begins with a process or
population to be studied. This might be a population of people in a country, of crystal grains in a
rock, or of goods manufactured by a particular factory during a given period. It may instead be a
process observed at various times; data collected about this kind of "population" constitute what
is called a time series.
For practical reasons, rather than compiling data about an entire population, one usually studies a
chosen subset of the population, called a sample. Data are collected about the sample in an
observational or experimental setting. The data are then subjected to statistical analysis, which
serves two related purposes: description and inference.
* Descriptive statistics can be used to summarize the data, either numerically or graphically, to
describe the sample. Basic examples of numerical descriptors include the mean and standard
deviation. Graphical summarizations include various kinds of charts and graphs.
* Inferential statistics is used to model patterns in the data, accounting for randomness and
drawing inferences about the larger population. These inferences may take the form of answers to
yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation),
descriptions of association (correlation), or modeling of relationships (regression). Other modeling
techniques include ANOVA, time series, and data mining.
The word statistics is also the plural of statistic (singular), which refers to the result of applying a
statistical algorithm to a set of data, as in economic statistics, crime statistics, etc.
The word "statistics" is used in several different senses. In the broadest sense, "statistics" refers to
a range of techniques and procedures for analyzing data, interpreting data, displaying data, and
making decisions based on data. This is what courses in "statistics" generally cover.
METHOD OF STATISTICS
Exact and unique solutions can be found for many problems encountered in the natural sciences when the values of the relevant variables are known. For example, we can determine the acceleration of a body of known mass under a given force by applying Newton's law of motion. This is known as the deterministic approach.
Consider, however, the throw of a die. Nobody can tell which side will show up. There are many problems where the outcome cannot be predicted with certainty.
Many examples can be given of engineering problems that cannot be solved deterministically. We cannot know how much precipitation will fall in Istanbul next year, or how large a force would cause a certain beam to fail. Uncertainties due to natural causes or the variability of material properties prevent the prediction of the outcome in such problems. Problems of this type are solved by the probabilistic approach.
Engineers often have to deal with uncertainties. Examples are the annual volume of flow in a
stream, the traffic density at a junction, the magnitude of the next earthquake at a certain location.
These variables behave randomly, making exact prediction impossible. Engineers must take this uncertainty into consideration to achieve reliable and economical solutions. Only when the scatter around the mean value is small enough to have no significant effect can we neglect the uncertainty and simply adopt the mean value of the variable.
STATISTICAL VARIABLES
A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a prescribed set of values,
called the domain of the variable. If the variable can assume only one value, it is called a constant.
EXAMPLE 1.1
The number N of children in a family, which can assume any of the values 0, 1, 2, 3,... but cannot
be 2.5 or 3.842, is a discrete variable.
EXAMPLE 1.2
The age A of an individual, which can be 62 years, 63.8 years, or 65.8341 years, depending on the
accuracy of measurement, is a continuous variable.
Data that can be described by a discrete or continuous variable are called discrete data or
continuous data, respectively. The number of children in each of 1000 families is an example of
discrete data, while the heights of 100 university students are an example of continuous data. In
general, measurements give rise to continuous data, while enumerations, or countings, give rise to
discrete data.
EXAMPLE 1.3
If students were asked to name their favorite color, the variable would be qualitative (e.g. pink, yellow, red, …).
EXAMPLE 1.4
If the time taken to answer the above question were measured, the variable would be quantitative (e.g. 1.35 seconds, 2 seconds, …).
MISUSE OF STATISTICS
“If you need statistics to analyze your experiment, then you've done the wrong experiment. If your data speak for themselves, don't interrupt!” (attributed to Ernest Rutherford)
“There are three kinds of lies: lies, damned lies, and statistics.”
This well-known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain. The semi-ironic statement refers to the persuasive power of numbers, and succinctly describes how even accurate statistics can be used to bolster inaccurate arguments.
How to Lie with Statistics is Darrell Huff's perennially popular introduction to statistics for the
general reader. Written in 1954, it is a brief, breezy, illustrated volume outlining the common
errors, both intentional and unintentional, associated with the interpretation of statistics, and how
these errors can lead to biased or inaccurate conclusions. Although a number of more recent
versions have been released, the original edition contained humorous, witty illustrations by Irving
Geis.
Over time it has become one of the most widely read statistics books in history, with over one and
a half million copies sold in the English–language edition. It has also been widely translated.
Themes of the book include "Correlation does not imply causation" and "Using Random Sampling". It also shows how statistical graphs can be used to distort reality:
* By truncating the bottom of a line or bar chart, one makes differences seem larger than they are.
* By representing one-dimensional quantities on a pictogram by two- or three-dimensional objects to compare their sizes, one makes the reader forget that the images don't scale the same way the quantities do. Two rows of small images would give a better idea than one small and one big one.
Chapter 2
PROBABILITY
The chance of the occurrence of a random event is defined as its probability. The basic axiom
of the probability theory states that each random event has a certain probability that varies in the
range of 0 to 1. Denoting the random variable by a capital letter and its value in an observation by the corresponding small letter, we can write
P(X=xi)=pi
where X = xi, is a random event, P is the symbol for the probability of the event, and pi is the
probability of the event X = xi
pi = 0 implies that the event X = xi will never occur, pi =1 means that the event will occur in all
observations. With the increase of the probability from 0 to 1, the chance of occurrence of the
event also increases, i.e. the event is seen more frequently.
EXAMPLE 2.1
We are throwing a die. The probability of having a 6 is given as:
P(X = 6) = 1/6
There are six random events in the throw of a die with equal probabilities.
P( X = 1) = P(X = 2) = P(X = 3) = (P(X = 4) = P(X = 5) = P(X = 6) = 1/6
As one of these events will certainly occur in each throw, the total probability is given as:
∑_{i=1}^{6} P(X = i) = 1
In general, for a random variable with n simple events,
∑_{i=1}^{n} P(X = i) = 1
EXAMPLE 2.2
What is the probability that a card drawn at random from a deck of cards will be an ace?
SOLUTION
Since of the 52 cards in the deck, 4 are aces, the probability is 4/52.
1 ♠ 1 ♥ 1 ♦ 1 ♣
2 ♠ 2 ♥ 2 ♦ 2 ♣
3 ♠ 3 ♥ 3 ♦ 3 ♣
4 ♠ 4 ♥ 4 ♦ 4 ♣
5 ♠ 5 ♥ 5 ♦ 5 ♣
6 ♠ 6 ♥ 6 ♦ 6 ♣
7 ♠ 7 ♥ 7 ♦ 7 ♣
8 ♠ 8 ♥ 8 ♦ 8 ♣
9 ♠ 9 ♥ 9 ♦ 9 ♣
10 ♠ 10 ♥ 10 ♦ 10 ♣
11 ♠ 11 ♥ 11 ♦ 11 ♣
12 ♠ 12 ♥ 12 ♦ 12 ♣
13 ♠ 13 ♥ 13 ♦ 13 ♣
♠: spade ♥: heart ♦: diamond ♣: club
In general, the probability of an event is the number of favorable outcomes divided by the total
number of possible outcomes. (This assumes the outcomes are all equally likely.)
EXAMPLE 2.3
The same principle can be applied to the problem of determining the probability of obtaining
different totals from a pair of dice.
SOLUTION
As shown below, there are 36 possible outcomes when a pair of dice is thrown.
To calculate the probability that the sum of the two dice will equal 5, calculate the number of
outcomes that sum to 5 and divide by the total number of outcomes (36). Since four of the
outcomes have a total of 5 (1,4; 2,3; 3,2; 4,1), the probability of the two dice adding up to 5 is 4/36
= 1/9 . In like manner, the probability of obtaining a sum of 12 is computed by dividing the number
of favorable outcomes (there is only one) by the total number of outcomes (36). The probability is
therefore 1/36 .
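The counting argument above can be checked by brute-force enumeration. A minimal sketch in Python (the language is our illustrative choice, not part of the text):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of a pair of dice.
outcomes = list(product(range(1, 7), repeat=2))

def p_sum(total):
    """Probability that the two dice add up to `total`."""
    favorable = sum(1 for a, b in outcomes if a + b == total)
    return Fraction(favorable, len(outcomes))

print(p_sum(5))   # 1/9  (four favorable outcomes out of 36)
print(p_sum(12))  # 1/36 (only 6,6)
```

Using `Fraction` keeps the results exact, matching the hand computation.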
Theory of Set
If two sets have no common point, they are called disjoint sets; their intersection is the empty set.
In Fig. 2.1, the sets A and B are disjoint, A∩B = φ. Just as in set theory, the union of two events consists of the points that are in at least one of the events, A∪B.
Fig. 2.1. Events that are disjoint and not in the sample space of a random variable
Probability theory tells us how the probability of a compound random event can be computed when
the probabilities of (simple or compound) events that constitute it are known.
By extending this theorem, we can see that the sum of the probabilities of all simple events of a random variable is equal to one, as one of these simple events will certainly occur in an observation. Events of this type are called mutually exclusive events.
EXAMPLE 2.4
What is the probability of rolling a die and getting either a 1 or a 6?
SOLUTION
Since it is impossible to get both a 1 and a 6, these two events are mutually exclusive. Therefore,
P(1 U 6) = P(1) + P(6) = 1/6 + 1/6 = 1/3
Nondisjoint events (Not mutually exclusive events)
The probability of the union of two events that are not disjoint can easily be computed with a Venn diagram, where each event is represented by the region inside a closed curve whose area is assumed to be proportional to the probability of the event, as in the Venn diagram of Fig. 2.4. It is seen that in the general case we must subtract the probability of the intersection from the sum of the probabilities P(A) and P(B):
P(A∪B) = P(A) + P(B) − P(A∩B)   (2.5)
Eq. (2.5) reduces to Eq. (2.4), P(A∪B) = P(A) + P(B), when A and B are disjoint events.
EXAMPLE 2.5
What is the probability that a card selected from a deck will be either an ace (1) or a spade
(♠)?
SOLUTION
The relevant probabilities are:
P(1) = 4/52
P(♠) = 13/52
The only way in which an ace and a spade can both be drawn is to draw the ace of spades. There
is only one ace of spades, so:
P(1 and ♠) = 1/52 .
The probability of an ace or a spade can be computed as:
P(1 or spade) =P(1) + P(♠) – P(1 and ♠) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13.
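The inclusion–exclusion count above can be verified by listing the deck explicitly; a short Python sketch (our illustration, not part of the original notes):

```python
from fractions import Fraction

# Build a 52-card deck as (rank, suit) pairs; ranks 1..13, ace = 1.
suits = ["spade", "heart", "diamond", "club"]
deck = [(rank, suit) for rank in range(1, 14) for suit in suits]

# A card is favorable if it is an ace OR a spade; the ace of spades
# satisfies both conditions but is counted only once.
ace_or_spade = [c for c in deck if c[0] == 1 or c[1] == "spade"]
print(Fraction(len(ace_or_spade), len(deck)))  # 4/13
```

The list comprehension counts each card once, which is exactly why the overlap P(1 and ♠) must be subtracted in the formula.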
EXAMPLE 2.6
In the throw of a die, consider the events E = {2, 4, 6} and S = {1, 2}:
P(E) = P({2, 4, 6}) = 1/2
P(S) = P({1, 2}) = 1/3
Find the probability of the union of E and S.
SOLUTION
The probability of the union of the sets E and S can be computed as
P(E∪S) = P({1, 2, 4, 6}) = P(E) + P(S) − P(E∩S)
where the intersection contains the single outcome E∩S = {2}:
P(E∩S) = P({2}) = 1/6
P(E∪S) = 1/2 + 1/3 − 1/6
P(E∪S) = 2/3
Multiplication rule
The probability that several events occur together, one after another, is found by multiplying their individual probabilities, provided the events are independent.
In other words, when A and B are independent, the probability of A and B both occurring is the product of the probability of A and the probability of B.
EXAMPLE 2.7
What is the probability that a fair coin will come up with heads twice in a row? Two events
must occur: a head on the first toss and a head on the second toss.
SOLUTION
Since the probability of each event is 1/2, the probability of both events is: 1/2 x 1/2 = 1/4.
EXAMPLE 2.8
Consider a problem: Someone draws a card at random out of a deck, replaces it, and then
draws another card at random. What is the probability that the first card is the ace of clubs
and the second card is a club (any club)?
SOLUTION
Since there is only one ace of clubs in the deck, the probability of the first event is 1/52. Since
13/52 = 1/4 of the deck is composed of clubs, the probability of the second event is 1/4. Therefore,
the probability of both events is: 1/52 x 1/4 = 1/208 .
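Because the first card is replaced, the two draws are independent and their probabilities simply multiply. A quick exact-arithmetic check (Python, our illustrative choice):

```python
from fractions import Fraction

# Independent draws (the card is replaced), so probabilities multiply.
p_ace_of_clubs = Fraction(1, 52)   # first card: the single ace of clubs
p_any_club = Fraction(13, 52)      # second card: any of the 13 clubs
print(p_ace_of_clubs * p_any_club)  # 1/208
```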
EXAMPLE 2.9
Consider the probability of rolling two dice and getting a 6 on both of the rolls.
SOLUTION
The events are defined in the following way:
Event A: 6 on the first roll: P(A) = 1/6
Event B: 6 on the second roll: P(B) = 1/6
P(6 ∩ 6) = 1/6 x 1/6
P(6 ∩ 6) = 1/36
Conditional Probability
A conditional probability is the probability of an event given that another event has occurred. For
example, what is the probability that the total of two dice will be greater than 8 given that the first
die is a 6?
This can be computed by considering only outcomes for which the first die is a 6. Then, determine
the proportion of these outcomes that total more than 8. All the possible outcomes for two dice are
shown below:
There are 6 outcomes for which the first die is a 6, and of these, there are four that total more than
8 (6,3; 6,4; 6,5; 6,6). The probability of a total greater than 8 given that the first die is 6 is therefore
4/6 = 2/3.
More formally, this probability can be written as:
P(total>8 | Die 1 = 6) = 2/3.
In this equation, the expression to the left of the vertical bar represents the event and the expression
to the right of the vertical bar represents the condition. Thus it would be read as "The probability
that the total is greater than 8 given that Die 1 is 6 is 2/3."
In more abstract form, P(A|B) is the probability of event A given that event B occurred.
P(A∩B)= P(A).P(B|A)
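The "restrict the sample space, then count" procedure described above can be expressed directly in code; the following Python sketch (our illustration) reproduces the two-dice result:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for a pair of dice.
outcomes = list(product(range(1, 7), repeat=2))

# P(total > 8 | first die = 6): restrict the sample space to the
# outcomes satisfying the condition, then count within it.
given = [(a, b) for a, b in outcomes if a == 6]
favorable = [(a, b) for a, b in given if a + b > 8]
print(Fraction(len(favorable), len(given)))  # 2/3
```

Dividing the two counts is equivalent to P(A∩B)/P(B), since the factor 1/36 cancels.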
EXAMPLE 2.11
If someone draws a card at random from a deck and then, without replacing the first card,
draws a second card, what is the probability that both cards will be aces?
SOLUTION
Event A is that the first card is an ace. Since 4 of the 52 cards are aces, P(A) = 4/52 = 1/13. Given
that the first card is an ace, what is the probability that the second card will be an ace as well? Of
the 51 remaining cards, 3 are aces.
Therefore, P(B|A) = 3/51 = 1/17
P (A and B) = 1/13 x 1/17 = 1/221.
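Drawing without replacement can also be checked by enumerating every ordered pair of distinct cards; a Python sketch (our illustrative choice):

```python
from fractions import Fraction
from itertools import permutations

# 52-card deck: ranks 1..13 (ace = 1) in four suits.
suits = "SHDC"
deck = [(rank, suit) for rank in range(1, 14) for suit in suits]

# All ordered draws of two distinct cards (without replacement).
pairs = list(permutations(deck, 2))
both_aces = [p for p in pairs if p[0][0] == 1 and p[1][0] == 1]
print(Fraction(len(both_aces), len(pairs)))  # 1/221
```

There are 52 × 51 = 2652 ordered pairs and 4 × 3 = 12 of them are two aces, reproducing 12/2652 = 1/221.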
EXAMPLE 2.12
Consider the probability of drawing the letter İ, then the letter T, and then the letter Ü from a box containing the 29 letters of the Turkish alphabet (assuming one of each letter).
SOLUTION
Two cases:
A) When each letter is replaced into the box before the next draw, the draws are independent:
P = (1/29) × (1/29) × (1/29) = 1/24389
B) When the letters are not replaced, each draw changes the remaining set of letters:
P = (1/29) × (1/28) × (1/27) = 1/21924
EXAMPLE 2.10
Consider the throw of a die and define the probabilities of getting more than 4 and of getting less than or equal to 4.
SOLUTION
The events are defined in the following way:
P(X > 4) = 2/6
P(X ≤ 4) = 4/6
Getting more than 4 and getting less than or equal to 4 are complementary events, so
P(X > 4) + P(X ≤ 4) = 1
2/6 + 4/6 = 1
PROBABILITIES OF ENGINEERING PROBLEMS
The probability of an event can be estimated by its relative frequency fi = ni/N, where ni is the number of observations of the event among a total of N observations. Although we can never make an infinite number of observations, it can be assumed that fi approaches pi quite rapidly as N increases. If no precipitation has been observed at a station for ni = 900 days over a period of N = 1500 days, then the probability of no precipitation can be estimated as
P(X = 0) = 900 / 1500 = 0.60
With the increase of the observation period the estimated frequency will be a better estimate of the
true probability.
Sometimes the term "probability" is used in a somewhat different sense. The probability of an
event that cannot be observed many times cannot be estimated by using the mentioned equations
but we can provide an estimate on the basis of our past experience and information on its structure.
Such an estimate will obviously be subjective. Compare, for example, the statement "The probability of rain tomorrow is 50%" with the statement "The probability of a rainy day in this location is 50%". The methods of the probability theory, however, can be applied no matter how
the probabilities have been estimated.
EXAMPLES
EXAMPLE 2.13
The number of vehicles waiting for a left turn at a junction is observed to vary between 0 and 6, with the following probabilities:
P(X=0)=4/60
P(X=1)=16/60
P(X=2)=20/60
P(X=3)=14/60
P(X=4)=3/60
P(X=5)=2/60
P(X=6)=1/60
SOLUTION
There are 7 simple events in the sample space of the random variable X "the number of vehicles
waiting for a left turn", the sum of their probabilities adding up to 1.
We can compute, for example, the probability that more than 3 vehicles are waiting, since the simple events are disjoint:
P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6) = 3 / 60 + 2 / 60 + 1 / 60 = 6/60=1/10
The event "less than or equal to 3 vehicles waiting" is the complement of the above event. Therefore
P(X ≤ 3) = 1 − P(X > 3) = 1 − 1/10 = 9/10
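A discrete distribution like this is conveniently handled as a table of exact probabilities; a Python sketch (our illustrative choice, not part of the notes):

```python
from fractions import Fraction

# Probability distribution of X, the number of vehicles waiting.
p = {x: Fraction(n, 60) for x, n in
     zip(range(7), [4, 16, 20, 14, 3, 2, 1])}

assert sum(p.values()) == 1          # the probabilities add up to 1
p_more_than_3 = sum(p[x] for x in p if x > 3)
print(p_more_than_3)       # 1/10
print(1 - p_more_than_3)   # 9/10, the complementary event P(X <= 3)
```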
EXAMPLE 2.14
There are 3 bulldozers at a construction site, each having a probability of no failure during
the total period of construction equal to 0.50. Let us consider the random variable X “the
number of bulldozers in operation throughout the construction period".
SOLUTION
There are 4 events in the sample space of X as X can be equal to 0, 1, 2 or 3. Let us compute their
probabilities.
Denoting a bulldozer in operation by S, and a bulldozer not in operation by F, the following 8 combinations are possible for the 3 bulldozers: FFF, FFS, FSF, SFF, FSS, SFS, SSF, SSS. These combinations have equal probabilities because the probability of failure F is assumed to be equal to the probability of no failure (success S). Since the sum of the probabilities should be equal to 1, each of the above combinations has a probability of 1/8.
Now we can determine the probabilities of the random events of the variable X:
P(X = 0) = P(FFF) = 1/8
P(X = 1) = P(FFS) + P(FSF) + P(SFF) =1/8 + 1/8 + 1/8 = 3/8
P(X = 2) = P(FSS) + P(SFS) + P(SSF) =1/8 + 1/8 + 1/8 = 3/8
P(X = 3) = P(SSS) = 1/8
The sum of these probabilities is 1, as expected.
We shall discuss later, in relation to Bernoulli trials, how this problem can be solved when the probability of failure is not equal to 0.50.
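The enumeration of the eight S/F combinations can be automated; the following Python sketch (our illustrative choice) counts the combinations for each value of X:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# The 8 equally likely S/F combinations for 3 bulldozers;
# count how many have 0, 1, 2, or 3 bulldozers in operation (S).
counts = Counter(combo.count("S") for combo in
                 ("".join(c) for c in product("SF", repeat=3)))
for x in range(4):
    print(x, Fraction(counts[x], 8))
# 0 -> 1/8, 1 -> 3/8, 2 -> 3/8, 3 -> 1/8
```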
EXAMPLE 2.15
It is known that the probability that a job is completed in 2–4 days is 0.50, in 4–6 days is 0.25, and in 2–6 days is 0.55. Let us compute the probability that the job is completed in exactly 4 days.
SOLUTION
Let us define the following events:
A = {(X = 2) ∪ (X = 3) ∪ (X = 4)},   P(A) = 0.50
B = {(X = 4) ∪ (X = 5) ∪ (X = 6)},   P(B) = 0.25
The union and intersection of the events A and B are:
A ∪ B = {(X = 2) ∪ (X = 3) ∪ (X = 4) ∪ (X = 5) ∪ (X = 6)}
A ∩ B = {X = 4}
The probability of the intersection can be computed by Eq. (2.5):
P(A ∩ B) = P(X = 4) = P(A) + P(B) − P(A ∪ B) = 0.50 + 0.25 − 0.55 = 0.20
The probability that the job is completed in 4 days is determined as 0.20.
EXAMPLE 2.16
The following electric circuit has three switches, A, B, and C. The probabilities of these switches being open are P(A) = 0.15, P(B) = 0.10, and P(C) = 0.02. Calculate the probability that the lamp will be off.
[Figure: circuit diagram with switches A, B, and C]
SOLUTION
The lamp will be off when switches A and B are both open, or when switch C is open. Thus we must compute the probability of the union of A∩B and C. Using the addition and multiplication rules together, and assuming the switches operate independently:
P((A∩B) ∪ C) = P(A)P(B) + P(C) − P(A)P(B)P(C) = 0.015 + 0.02 − 0.0003 = 0.0347
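As a check, the lamp-off probability for this circuit can be computed directly; the sketch below assumes the three switches open independently (Python is our illustrative choice):

```python
pA, pB, pC = 0.15, 0.10, 0.02

# Lamp is off if (A and B are both open) or C is open.
# With independent switches: P((A∩B) ∪ C) = P(A)P(B) + P(C) − P(A)P(B)P(C)
p_AB = pA * pB
p_off = p_AB + pC - p_AB * pC
print(round(p_off, 4))  # 0.0347
```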
EXAMPLE 2.17
One can get to town 2 from town 1 either by route A, or by routes B and C through town 3. In winter the probabilities of the routes being open are P(A) = 0.40, P(B) = 0.75, P(C) = 0.67. These events are not independent. The probability of route C being open when B is open is given as P(C|B) = 0.80, and the probability of route A being open when both B and C are open is P(A|B∩C) = 0.5. Let us determine the probability that one can get from 1 to 2 in winter.
SOLUTION
The travel between the points 1 and 2 is possible if route A is open, or if both B and C are open. The probability that both B and C are open is
P(B∩C) = P(C|B) P(B) = 0.80 × 0.75 = 0.60
The probability that one can get from 1 to 2 is then
P(A ∪ (B∩C)) = P(A) + P(B∩C) − P(A ∩ B ∩ C)
where P(A ∩ B ∩ C) = P(A|B∩C) P(B∩C) = 0.5 × 0.60 = 0.30, so
P(A ∪ (B∩C)) = 0.40 + 0.60 − 0.30 = 0.70
EXAMPLE 2.18
A system consists of three elements A, B, and C, and fails when any one of the elements fails.
SOLUTION
Since failure occurs when any one of the elements fails, we must determine the probability of the union of the failure events of the elements, P(A∪B∪C).
[Figure: elements A, B, and C]
EXAMPLE 2.19
A frame is held by two rods, A and B, each with a probability of breaking equal to 0.1. The probability that one of the rods breaks when the other does is 0.8. Determine the probability of failure of the frame. (The frame fails when either one of the rods breaks.)
[Figure: frame held by rods A and B]
SOLUTION
We are looking for the probability P(AUB) where A and B denote the events of breaking off of
the robs. From the equation:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
where the probability of the intersection is:
P(A ∩ B) = P(A|B) P(B) = 0.8 × 0.1 = 0.08
Substituting into the first equation:
P(A ∪ B) = 0.1 + 0.1 − 0.08 = 0.12
If the events A and B were independent:
P(A ∩ B) = P(A) P(B) = 0.1 × 0.1 = 0.01
P(A ∪ B) = 0.1 + 0.1 − 0.01 = 0.19
If the events A and B were functionally dependent, i.e. P(A|B) = 1:
P(A ∩ B) = P(A|B) P(B) = 1 × 0.1 = 0.1
P(A ∪ B) = 0.1 + 0.1 − 0.1 = 0.10
The probability of failure of the frame may vary in the range 0.10 to 0.19, depending on the degree of dependence between the events A and B.
BERNOULLI TRIALS
Let us consider an experiment where only two outcomes are possible (there are only two simple
events in the sample space). Suppose one of the outcomes corresponds (arbitrarily) to "success"
and the other to "failure". The probability of the success will be denoted by p, and the probability
of the failure by q=1–p. Such an experiment is called a Bernoulli trial.
As a simple example, if the "success" in the throw of a die is equated with the throw of a six, then
p=1/6 and q=5/6
EXAMPLE 2.20
Let us repeat the die-throwing experiment n times (independent Bernoulli trials). Now consider the random variable X, the number of successes in n trials. X is a discrete variable that is an integer in the range 0 to n. Let us compute the probability of X = x in n trials. Suppose n = 3 (three trials).
SOLUTION
The event of no success (X=0) will occur only when all the three trials are failures. Since the trials
are considered independent this has the probability:
P(X=0) = qqq = q3
1 success in 3 trials can occur in three different ways: first trial successful and the others failures,
second trial successful and the others failures, third trial successful and the others failures. Each
of these three events has the probability pq2 and the probability of their union is:
P(X=l) = 3pq2 : (pqq) + (qpq) + (qqp)
Two successes in 3 trials can also occur in three different ways: first two trials successful and the third a failure, first and third trials successful and the second a failure, or second and third trials successful and the first a failure. Each event has the probability p^2 q, and therefore the probability of 2 successes in 3 trials is:
P(X=2) = ppq + pqp + qpp = 3p^2 q
Finally, the probability of 3 successes in 3 trials is:
P(X=3) = ppp = p^3
In general, P(X=x) = C(n,x) p^x q^(n−x), where the binomial coefficients C(n,x) form Pascal's triangle:
n=0 1
n=1 1 1
n=2 1 2 1
n=3 1 3 3 1
n=4 1 4 6 4 1
n=5 1 5 10 10 5 1
n=6 1 6 15 20 15 6 1
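The three-trial results above can be reproduced with a minimal Python sketch using the standard-library binomial coefficient (the choice p = 1/6, a six on a die, is only illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The n = 3 case derived symbolically above, with p = 1/6:
p = 1 / 6
for x in range(4):
    print(x, round(binom_pmf(x, 3, p), 4))
```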
EXAMPLE 2.21
The probability of a successful bid for a contractor is assumed to be p = 1/3. Let us compute
the probability of 0, 1, 2, 3 and 4 successes in 4 bids. The variable X denotes the number
of successes in 4 bids.
SOLUTION
P(X=0) = C(4,0) (1/3)^0 (2/3)^4 = 16/81
P(X=1) = C(4,1) (1/3)^1 (2/3)^3 = 32/81
P(X=2) = C(4,2) (1/3)^2 (2/3)^2 = 24/81
P(X=3) = C(4,3) (1/3)^3 (2/3)^1 = 8/81
P(X=4) = C(4,4) (1/3)^4 (2/3)^0 = 1/81
The probability that the contractor is successful at least once in 4 bids can be computed as follows:
P(X ≥ 1) = 1 − P(X < 1) = 1 − P(X = 0) = 1 − 16/81 = 65/81
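The bid probabilities can be verified exactly with Python's fractions module (a sketch, not part of the original example; note that Fraction prints results in reduced form, so 24/81 appears as 8/27):

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 3)
probs = {x: comb(4, x) * p**x * (1 - p)**(4 - x) for x in range(5)}

print(probs[0], probs[1], probs[2], probs[3], probs[4])  # 16/81 32/81 8/27 8/81 1/81
print(1 - probs[0])                                      # P(X >= 1) = 65/81
```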
We can compute the probability that the first success occurs at the yth trial as follows. This will happen when the first y−1 trials are failures and the yth trial is a success. Since each failure has probability q and the success at the yth trial has probability p:
P(Y=y) = q^(y−1) p,  y = 1, 2, ...  (2.72)
EXAMPLE 2.23
In the previous example, the probabilities of first success in the first, second, third,... bids
are:
P(Y=1) = (2/3)^0 (1/3) = 1/3
P(Y=2) = (2/3)^1 (1/3) = 2/9
P(Y=3) = (2/3)^2 (1/3) = 4/27
P(Y=4) = (2/3)^3 (1/3) = 8/81
...
It can be expected that the first success will occur, on average, at trial number
E(Y) = 1/p = 3  (2.73)
The return period (recurrence interval) T is defined as the average interval between two consecutive successes. As this coincides with the average time to the first success, it is seen that
T = 1/p  (2.74)
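Equations (2.72) and (2.73) can be illustrated numerically: the sketch below evaluates the geometric pmf and estimates E(Y) by truncating the infinite sum (with the p = 1/3 of the bidding example; the tail beyond y = 1000 is negligible):

```python
def geom_pmf(y, p):
    """Eq. (2.72): probability that the first success occurs at trial y."""
    return (1 - p) ** (y - 1) * p

p = 1 / 3
# E(Y) = sum over y of y * P(Y = y), truncated at y = 1000.
mean_y = sum(y * geom_pmf(y, p) for y in range(1, 1001))
print(round(mean_y, 6))  # -> 3.0, i.e. E(Y) = 1/p
```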
EXAMPLE 2.24
The spillway of a dam is designed for a discharge that is exceeded with the probability of
0.01 each year. The life of the dam is assumed to be 50 years. What is the probability that
the design flow is exceeded (at least once) during the life of the project?
SOLUTION
Using the equation of the Bernoulli (binomial) distribution:
P(X ≥ 1) = 1 − P(X = 0) = 1 − C(50,0) (0.01)^0 (0.99)^50 = 0.395
What is the probability that the first exceedance occurs after more than 10 years? Using Eq. (2.72) of the geometric distribution:
P(Y > 10) = (1 − p)^10 = 0.99^10 = 0.904
What is the average interval between two consecutive exceedances of the design flood? Eq. (2.73)
of the geometric distribution gives:
E(Y) = 1/0.01 = 100 years
By Eq. (2.74) this is also the return period of the design flood, T = 100 years, which means that the design flood (or a larger flood) will occur, on average, once every 100 years (this is called the 100-year flood). It should not be understood that this event (exceedance of the design flood) will occur at regular intervals of 100 years. In fact, there is a probability of 0.01^2 = 0.0001 that it will occur in two consecutive years.
If the spillway were designed for the 50-year flood (p = 1/50), the probability of no such flood occurring during the life of the project would be:
P(X = 0) = C(50,0) (1/50)^0 (49/50)^50 = 0.364
The probability of no occurrence of an event with a return period of T years during a time interval of T years can be computed as
P(X = 0) = (1 − p)^T = 1 − Tp + [T(T−1)/2] p^2 − ...
which approaches e^(−Tp) for large values of T. Since T = 1/p, we have Tp = 1, and therefore:
P(X = 0) ≅ e^(−Tp) = e^(−1) = 0.368
The value of 0.364 obtained for T = 50 years is very close to this.
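The convergence of (1 − 1/T)^T to e^(−1) can be seen numerically with a small Python check:

```python
from math import exp

# P(no occurrence of the T-year event during T years) = (1 - 1/T)**T
for T in (10, 50, 100, 1000):
    print(T, round((1 - 1 / T) ** T, 4))
print("e**-1 =", round(exp(-1), 4))  # 0.3679
```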
EXAMPLE 2.25
A breakwater is designed for waves of the return period T=10 years (probability of
exceedance p=0.10). The probability that the breakwater is damaged when larger waves
occur is assumed to be 0.20. What is the probability of damage in a 3 year period?
SOLUTION
The probabilities of larger waves occurring in X = 0, 1, 2, and 3 years of a 3-year period are calculated by the Bernoulli distribution:
P(X=0) = C(3,0) (0.10)^0 (0.90)^3 = 0.729
P(X=1) = C(3,1) (0.10)^1 (0.90)^2 = 0.243
P(X=2) = C(3,2) (0.10)^2 (0.90)^1 = 0.027
P(X=3) = C(3,3) (0.10)^3 (0.90)^0 = 0.001
Using the total probability theorem, the probability of no damage in a 3-year period is computed as:
1.0 × 0.729 + (1 − 0.20) × 0.243 + (1 − 0.20)^2 × 0.027 + (1 − 0.20)^3 × 0.001 = 0.94
The probability of damage in 3 years is:
1 − 0.94 = 0.06
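The total-probability computation of this example can be condensed into a few lines of Python (probabilities as given in the example):

```python
from math import comb

p_wave, p_damage, years = 0.10, 0.20, 3

p_no_damage = sum(
    comb(years, x) * p_wave**x * (1 - p_wave) ** (years - x)  # x years with larger waves
    * (1 - p_damage) ** x                                     # no damage in any of them
    for x in range(years + 1)
)
print(round(p_no_damage, 2), round(1 - p_no_damage, 2))  # 0.94 0.06
```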
Chapter 3
FREQUENCY ANALYSIS
DEFINITIONS
Raw Data
Raw data are collected data that have not been organized numerically. An example is the set of
weights of 100 male students obtained from an alphabetical listing of university records.
Arrays
An array is an arrangement of raw numerical data in ascending or descending order of magnitude.
Range
The difference between the largest and smallest numbers is called the range of the data. For
example, if the largest weight of 100 male students is 74 kilograms (kg) and the smallest weight
is 60 kg, the range is 74–60 =14 kg.
FREQUENCY DISTRIBUTIONS
When summarizing large masses of raw data, it is often useful to distribute the data into classes,
or categories, and to determine the number of individuals belonging to each class, called the class
frequency.
A tabular arrangement of data by classes together with the corresponding class frequencies is called a frequency distribution, or frequency table. Table 2.1 lists the weights of 100 male students in the MAT271E course (recorded to the nearest kilogram), and Table 2.2 is a frequency distribution of the weights.
Table 2.2 Frequency distribution of weights of the students
Weight (kg) Number of Students
60–62 5
63–65 18
66–68 42
69–71 27
72–74 8
Total 100
The first class (or category), for example, consists of weights from 60 to 62 kg and is indicated by
the range symbol 60–62. Since five students have weights belonging to this class, the
corresponding class frequency is 5.
Data organized and summarized as in the above frequency distribution are often called grouped data. Although the grouping process generally destroys much of the original detail of the data, a clear "overall" picture of the data is gained.
(Figure: frequency histogram of the student weights; frequencies 5, 18, 42, 27, 8 for the classes 60–62, 63–65, 66–68, 69–71, 72–74.)
Class Intervals and Class Limits
A symbol defining a class, such as 60–62 in Table 2.1, is called a class interval. The end numbers,
60 and 62, are called class limits; the smaller number (60) is the lower class limit, and the larger
number (62) is the upper class limit. The terms class and class interval are often used
interchangeably, although the class interval is actually a symbol for the class.
FREQUENCY HISTOGRAMS
A relative-frequency histogram plots, for each class, its relative frequency: the frequency of the class divided by the total frequency of all classes, generally expressed as a percentage. For example, the relative frequency of the class 66–68 in Table 2.2 is 42/100 = 42%. The sum of the relative frequencies of all classes is clearly 1, or 100%.
(Figure: relative-frequency histogram of the student weights; relative frequencies 0.05, 0.18, 0.42, 0.27, 0.08 for the classes 60–62 to 72–74.)
Table 2.2 Cumulative-frequency distribution of the student weights
A graph showing the cumulative frequency less than any upper class boundary plotted against the
upper class boundary is called a cumulative–frequency polygon, or ogive, and is shown in Fig. 2–
2 for the student weight distribution of Table 2.1.
M=1+3.3 log N
where M is the number of classes and N is the number of data values. The value of M is rounded to an integer.
3. The range of the numbers is divided by M to obtain the class-interval width. Usually the width is rounded to a more convenient number (e.g., 92 is rounded to 100, 22 to 25, etc.).
4. Start the lower class limit of the first interval at the lowest number, or a bit lower than the lowest number of the data set (e.g., if the lowest number is 23, start from 20).
5. Form each class by adding the width obtained in the previous step. Be sure that the upper limit of the last interval includes the greatest number in the data set.
6. The number of classes should be between 5 and 20, depending on the data.
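The steps above can be sketched in Python (the data set and the starting value 20 are only illustrative):

```python
from math import ceil, log10

data = [23, 31, 45, 27, 52, 39, 48, 60, 35, 41]   # hypothetical observations

m = round(1 + 3.3 * log10(len(data)))             # step 2: number of classes
width = ceil((max(data) - min(data)) / m)         # step 3: class-interval width, rounded up
lower = 20                                        # step 4: start a bit below min(data) = 23
classes = [(lower + i * width, lower + (i + 1) * width) for i in range(m)]
print(m, width, classes)  # 4 10 [(20, 30), (30, 40), (40, 50), (50, 60)]
```

Step 5's check — that the upper limit of the last interval (60) covers the greatest observation — holds here; otherwise the width or the starting point would be adjusted.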
EXAMPLE 3.2
The annual amounts of rain (mm) of a city are given in the following table.
700 315 450 615 625 625
420 645 635 895 500 565
650 585 665 555 410 715
365 455 535 545 575 645
550 735 615 675 835 595
(Figure: frequency histogram of the annual rain (mm); frequencies 2, 4, 9, 10, 3, 2 for the classes 300–399, 400–499, 500–599, 600–699, 700–799, 800–899.)
The cumulative frequency distribution of the data is shown in Fig. 3.3. It is seen from the cumulative frequency diagram that 50% of the annual rain values are below 600 mm.
The appearance of the frequency histogram is affected by the number of class intervals.
The use of too few classes causes too much loss of information whereas too many class
intervals may lead to irregular histograms, with very few observations (or maybe none) in
some intervals. The choice of the number of class intervals is important.
(Figure: (a) histogram with three wide classes 300–499, 500–699, 700–899 and frequencies 6, 19, 5; (b) histogram with many narrow classes, relative frequencies between 0/30 and 20/30.)
Fig. 1.7. Effect of number of class intervals on the frequency histogram (a) Low class number (b)
High class number
EXAMPLE 3.1
2 dice are thrown 65 times and the number of observations for sum of them are given in
Table 2.3. Table 2.3. Number of observation for the Sum of two dices
Number of
Sum observation
2 1
3 2
4 3
5 6
6 12
7 16
8 12
9 7
10 4
11 1
12 1
The sum of two dice is a discrete type of data, and in general its histogram can be drawn directly without grouping (Figure 2.4).
6. A multimodal frequency curve has more than two maxima.
Chapter 4
EXAMPLE 4.4
The arithmetic mean of the numbers 2, 4, and 8 is
X̄ = (2 + 4 + 8) / 3 = 4.67
EXAMPLE 4.5
If 5, 8, 6, and 2 occur with frequencies 3, 2, 4, and 1, respectively, the arithmetic mean is
X̄ = [(3)(5) + (2)(8) + (4)(6) + (1)(2)] / (3 + 2 + 4 + 1) = 57/10 = 5.7
THE WEIGHTED ARITHMETIC MEAN
Sometimes we associate with the numbers X1, X2,..., XK certain weighting factors (or weights) w1,
w2, w3 … wK depending on the significance or importance attached to the numbers. In this case,
X̄ = (w1 X1 + w2 X2 + ... + wK XK) / (w1 + w2 + ... + wK)
is called the weighted arithmetic mean. Note the similarity to equation (2), which can be considered
a weighted arithmetic mean with weights f1, f2,.... ,fK.
EXAMPLE 4.6
Midterm examination, final examination and homeworks in a course are weighted as 0.4, 0.4 and
0.2, respectively. A student has a midterm grade of 70, final examination grade of 85 and
homeworks grade of 90. The mean grade of the student is
X̄ = [(0.4)(70) + (0.4)(85) + (0.2)(90)] / (0.4 + 0.4 + 0.2)
X̄ = 80
EXAMPLE 4.7
The deviations of the numbers 8, 3, 5, 12, and 10 from their arithmetic mean 7.6 are 8–7.6,
3–7.6, 5–7.6, 12–7.6, and 10–7.6, or 0.4, –4.6, –2.6, 4.4, and 2.4, with algebraic sum 0.4–
4.6–2.6 + 4.4 + 2.4 = 0.
2. The sum of the squares of the deviations of a set of numbers Xj from any number a is a minimum if and only if a = X̄.
3. If f1 numbers have mean m1, f2 numbers have mean m2,... ,fK numbers have mean mK, then the
mean of all the numbers is
X̄ = (f1 m1 + f2 m2 + ... + fK mK) / (f1 + f2 + ... + fK)
that is, a weighted arithmetic mean of all the means
EXAMPLE 4.8
The geometric mean of the numbers 2, 4, and 8 is
G = [(2)(4)(8)]^(1/3) = 64^(1/3) = 4
THE HARMONIC MEAN H
The harmonic mean H of a set of N numbers X1, X2, X3,..., XN is the reciprocal of the arithmetic
mean of the reciprocals of the numbers:
H = 1 / [(1/N) Σ (1/Xj)] = N / Σ (1/X)
In practice it may be easier to remember that
1/H = [Σ (1/X)] / N = (1/N) Σ (1/X)
EXAMPLE 4.9
The harmonic mean of the numbers 2, 4, and 8 is
H = 3 / (1/2 + 1/4 + 1/8) = 3 / (7/8) = 24/7 = 3.43
The Pythagorean means are the three "classic" means: A (the arithmetic mean), G (the geometric mean), and H (the harmonic mean). These means of two elements a and b can be constructed geometrically, and such a construction demonstrates that
H ≤ G ≤ A
The equality signs hold only if all the numbers X1, X2,..., XN are identical.
EXAMPLE 4.10
The set 2, 4, 8 has arithmetic mean 4.67, geometric mean 4, and harmonic mean 3.43.
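A quick numerical check of the three Pythagorean means and the inequality H ≤ G ≤ A for the set {2, 4, 8}:

```python
from math import prod

def a_mean(xs): return sum(xs) / len(xs)
def g_mean(xs): return prod(xs) ** (1 / len(xs))          # Nth root of the product
def h_mean(xs): return len(xs) / sum(1 / x for x in xs)   # reciprocal of the mean reciprocal

xs = [2, 4, 8]
print(round(a_mean(xs), 2), round(g_mean(xs), 2), round(h_mean(xs), 2))  # 4.67 4.0 3.43
assert h_mean(xs) <= g_mean(xs) <= a_mean(xs)
```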
THE MEDIAN
The median of a set of numbers arranged in order of magnitude (i.e., in an array) is either the
middle value or the arithmetic mean of the two middle values.
EXAMPLE 4.11
The set of numbers 3, 4, 4, 5, 6, 8, 8, 8, and 10 has median 6.
EXAMPLE 4.12
The set of numbers 5, 5, 7, 9, 11, 12, 15, and 18 has median (9 + 11)/2 = 10.
Geometrically, the median is the value of X (abscissa) corresponding to the vertical line that divides a histogram into two parts having equal areas. This value of X is sometimes denoted by X̃.
THE MODE
The mode of a set of numbers is that value which occurs with the greatest frequency; that is, it is
the most common value. The mode may not exist, and even if it does exist it may not be unique.
EXAMPLE 4.13
The set 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, and 18 has mode 9.
EXAMPLE 4.14
The set 3, 5, 8, 10, 12, 15, and 16 has no mode.
EXAMPLE 4.14
The set 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, and 9 has two modes, 4 and 7, and is called bimodal.
A distribution having only one mode is called unimodal.
(Fig. 4.2. Relative positions of the mean, median, and mode on a frequency curve.)
Fig. 4.3
Fig. 4.4
RMS = √( Σ Xj^2 / N )
RMS is a statistical measure of the magnitude of a varying quantity. It is especially useful when
variates are positive and negative, e.g., sinusoids. RMS is used in various fields, including
electrical engineering; one of the more prominent uses of RMS is in the field of signal amplifiers.
EXAMPLE 4.15
The RMS of the set −1, 1.2, −1.1, 1, −0.9, and 0.9 is
RMS = √[(1 + 1.44 + 1.21 + 1 + 0.81 + 0.81) / 6] = 1.02
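The RMS computation is direct to code (a minimal sketch):

```python
from math import sqrt

def rms(xs):
    """Root mean square: sqrt(sum of squares / N)."""
    return sqrt(sum(x * x for x in xs) / len(xs))

print(round(rms([-1, 1.2, -1.1, 1, -0.9, 0.9]), 2))  # 1.02
```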
QUANTILES
Fig. 4.7. Percentiles
From “How to Lie With Statistics, Darrel Huff, 1954”
COMPARISON BETWEEN MEAN, MEDIAN AND MODE
1. Use of average:
The arithmetic mean is comparatively stable and more widely used than the median and mode. It is suitable for general purposes, unless there is a particular reason to select another type of average. As far as simplicity is concerned, the mode is the simplest of the three.
The mode is the most usual or typical item; hence it can also be located by inspection. The median divides the curve into two equal parts and is simpler than the mean. In certain cases the median is as stable as the mean.
2. Algebraic manipulation:
The mean lends itself to algebraic manipulation. For example, we can calculate the aggregate when the number of items and the average of the series are given. The median and mode cannot be manipulated algebraically.
3. Extreme and abnormal items:
The presence of extreme and abnormal items can lead to misleading conclusions in the case of the mean. The mode and median, by contrast, are not much influenced by the presence of abnormal items in the series. Statisticians are of the view that the median or mode should be used in such cases because they are least influenced.
4. Qualitative expression:
The mean cannot be used when the data are qualitative or not capable of numerical expression. With the help of the median we can deal with qualities that are not capable of numerical expression, such as the intelligence or health of boys. Similarly, the mode is the average that proves useful for non-numerical data.
5. Presence of Skewness:
In case of a symmetrical curve, the value of mean, median and mode would coincide. But when
skewness is present, there is not much change in the value of mode. The value of median and
mean changes with the presence of positive or negative skewness to the positive or negative side
respectively. The value of mean changes to a greater extent than the value of median because it
is affected by the position and value of every item.
6. Fluctuations of sampling:
The mean is least affected by fluctuations of sampling. If the number of items is large, the abnormalities on one side cancel the abnormalities on the other. The median divides the curve into two equal parts and is affected by fluctuations of sampling. The mode is affected to an even greater extent than the median.
7. As a measure of dispersion:
Dispersion is a measure of variability within a group of data and for this measure, averages are
used to ascertain the degree of deviation. We know that the total of the deviations from the mean is equal to zero, and that the sum of the squared deviations is a minimum about the mean.
Due to this fact, the mean is the usual basis for this measure of dispersion. The median as a basis of dispersion is considered better because the absolute deviations from the median are least, and the median is in wide practice. The mode is not very suitable as a measure of dispersion.
WHICH MEAN SHOULD BE USED?
THE ARITHMETIC MEAN
Although the arithmetic mean is very popular and widely used, it has some important disadvantages.
It is a measure of central location that is sensitive to extreme values (i.e., it is not robust). If the data set contains a single asymmetric extreme value, either very small or very large, the arithmetic mean is pulled toward that extreme value.
The arithmetic mean cannot be used for numerical data of every measurement scale. For nominal data the arithmetic mean is meaningless. Its use for ordinal data is open to considerable debate: many argue that, since the rankings of different individuals cannot be assumed equivalent, the sum of such data and the arithmetic mean derived from it are meaningless. Nevertheless, in business, the behavioral sciences, and the social sciences, survey data in particular are ordinal, and the arithmetic means of such data are used in practice in important areas. For interval-scaled and ratio-scaled numerical data the arithmetic mean is meaningful.
THE GEOMETRIC MEAN
In statistical studies, the geometric mean is used when the proportional (relative) differences between observations are more important than the absolute differences. In other words, if each observation changes in proportion to the preceding observation and the rate of this change is to be determined, the geometric mean gives sound results.
To compute the geometric mean, the data values must be positive. If even a single data value is zero, the geometric mean is meaningless.
THE HARMONIC MEAN
The harmonic mean is generally used in economic problems when the average quantity obtained per unit, or the average expenditure per unit of a product, is needed.
THE MEDIAN
If the distribution of the data is not symmetric but skewed, the median is the preferred measure of central location and is considered a more suitable measure than the arithmetic mean. Asymmetry arises when, among the ordered data values, either the smallest or the largest values lie much farther from the rest. Such unexpectedly small or large values are called outliers. If the data contain outliers with an asymmetric distribution, the median is the preferred measure of central location over the arithmetic mean. In this case, in statistical terminology, the median is a more robust measure than the arithmetic mean.
THE MODE
The mode is used in situations where the most frequently observed value is meaningful. Although it denotes a numerical value, the way it is obtained is not numerical, as with the median. It is also used for qualitative data: the most frequently observed qualitative or quantitative value is taken as the mode.
Chapter 5
THE VARIANCE
The variance of a set of data is defined as the square of the standard deviation and is denoted by Varx or S^2:
Varx = Σ (Xi − X̄)^2 / N
When it is necessary to distinguish the standard deviation of a population from the standard deviation of a sample drawn from this population, we often use the symbol S^2 for the latter and σ^2 (lowercase Greek sigma) for the former. Thus S^2 and σ^2 represent the sample variance and population variance, respectively.
EXAMPLE 5.1
The heights (cm) of the players of two different basketball teams, A and B, are given in the following table.
A B
210 190
210 190
160 190
170 180
180 180
X̄A = 186   X̄B = 186
The means of the heights of the two teams are the same (186 cm). This does not mean that the teams are equivalent: there is a difference in the dispersion of the data within the teams. Team A has very dispersed player heights, while Team B has similar heights. To interpret the difference within the teams, the variance or the standard deviation must be calculated.
Varx = Σ (Xi − X̄)^2 / N
VarA = 424.0 cm²
VarB = 24.0 cm²
Sx = √Varx
SA = 20.6 cm
SB = 4.9 cm
Since SA is greater than SB, the dispersion within Team A is greater than that within Team B.
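The comparison of the two teams can be reproduced with a short sketch:

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

team_a = [210, 210, 160, 170, 180]
team_b = [190, 190, 190, 180, 180]
print(variance(team_a), round(variance(team_a) ** 0.5, 1))  # 424.0 20.6
print(variance(team_b), round(variance(team_b) ** 0.5, 1))  # 24.0 4.9
```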
COEFFICIENT OF VARIATION
Coefficient of variation shows the variation in a set and is used for comparing the variation in
two sets of data having different means.
Cvx = Sx / X̄
EXAMPLE 5.2
The mean and the standard deviation of X data set are given as 789 and 139 and of Y set
750 and 135, respectively. Compare the sets and find which set has higher variation.
SOLUTION
Cvx = Sx / X̄ = 139 / 789 = 0.176
Cvy = Sy / Ȳ = 135 / 750 = 0.180
Since Cvy > Cvx, the Y set has the higher variation.
SKEWNESS
Skewness is the degree of asymmetry, or departure from symmetry, of a distribution. Normal
distribution gives symmetrical bell-shape (Fig 5.1) For skewed distributions, the mean tends to lie
on the same side of the mode as the longer tail (see Figs. 5.2 and 5.3). The measure of skewness
is
Csx = [Σ (Xi − X̄)^3 / N] / Sx^3
If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right
of the central maximum than to the left, the distribution is said to be skewed to the right, or to have
positive skewness. If the reverse is true, it is said to be skewed to the left, or to have negative
skewness.
(Sketches: symmetric distribution, Csx = 0; skewed to the right, Csx > 0; skewed to the left, Csx < 0.)
KURTOSIS
Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal
distribution. The normal distribution shown in Figure 5.4, which is not very peaked or very flat–
topped, is called mesokurtic. A distribution having a relatively high peak, such as the curve of
Figure 5.5 is called leptokurtic, while the curve of Figure 5.6, which is flat–topped, is called
platykurtic.
The kurtosis is defined by
kx = [Σ (Xi − X̄)^4 / N] / Sx^4 − 3
which is positive for a leptokurtic distribution, negative for a platykurtic distribution, and zero for
the normal distribution.
(Sketches: mesokurtic distribution, kx = 0; leptokurtic, kx > 0; platykurtic, kx < 0.)
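The skewness and kurtosis coefficients defined above can be computed together; a sketch (the sample is illustrative and symmetric, so Csx comes out exactly 0):

```python
def cs_and_k(xs):
    """Skewness Csx and kurtosis kx as defined above (population moments)."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    csx = sum((x - m) ** 3 for x in xs) / n / s**3
    kx = sum((x - m) ** 4 for x in xs) / n / s**4 - 3
    return csx, kx

csx, kx = cs_and_k([1, 2, 2, 3, 3, 3, 4, 4, 5])
print(round(csx, 6), round(kx, 3))  # 0.0 -0.75
```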
Chapter 6
NORMAL DISTRIBUTION
A large number of random variables encountered in practical applications fit to the normal
(Gaussian) distribution with the following probability density function:
f(x) = [1 / (σx √(2π))] exp[−(x − μx)^2 / (2σx^2)],   −∞ < x < ∞
This distribution is denoted briefly as N(μ, σ^2). It has two parameters: μx, the mean of the random variable, and σx, its standard deviation. The normal distribution is symmetrical (Cs = 0), with a kurtosis coefficient equal to 0 (k = 0).
It is not easy to integrate the analytical form of the probability density function of the normal distribution to obtain F(x). Instead, its tabulated form (the Z-table) is easily used together with the standardization
Z = (X − μx) / σx
where the standard normal variable Z has the mean 0 and standard deviation 1. The distribution
N(0,1) of the variable Z is called the standard normal distribution. Its probability distribution
function is given in Z–Table.
Since the normal distribution is symmetrical, this table is prepared for positive values of Z only. It gives the probabilities F1(z) of Z exceeding a certain positive value z. For positive z we can compute the probability of nonexceedance as F(z) = 1 − F1(z), and for negative z we have F(z) = F1(|z|) because of the symmetry around the mean 0.
Figure 6.1. Normal distribution and the probabilities of the normal variable remaining in the
intervals of certain lengths around the mean
The probability density function of the normal distribution is bell-shaped around the mean μx. The mode and the median are equal to the mean because of the symmetry. The probabilities of the normal variable remaining within the intervals μx ± σx, μx ± 2σx, and μx ± 3σx around the mean are 0.683, 0.955, and 0.997 (nearly 1), respectively.
The probability paper of the normal distribution facilitates applications for normal variables. The ordinate axis of this paper is scaled such that the cumulative distribution function of the normal distribution appears as a straight line (Figure). Since the distribution is symmetrical, the value corresponding to F = 0.5 (the median) gives the mean.
Several variables of the natural and social sciences are found to be normally distributed. This can be explained by the central limit theorem. This theorem states that the distribution of the variable
X = Σ ci Xi   (i = 1, ..., n)
where the Xi are independent random variables, approaches the normal distribution with the increase of n, whatever the distributions of the variables Xi. The approach is rather fast, such that the normal distribution can be assumed for n ≥ 10. Thus, if a random variable is affected by a large number of independent variables such that the effects are additive, it can be assumed to be distributed normally.
A difficulty in using the normal distribution for physical (engineering) variables is the following. Such variables can usually take only positive values, whereas a normal variable may vary in the range (−∞, +∞). However, the probability of a normal variable assuming values outside the interval μx ± 3σx is negligibly small; therefore, if the mean is much larger than the standard deviation, the probability of the variable assuming a negative value will almost vanish.
A useful property of the normal distribution is that the sum (or the difference) of normally distributed variables also follows the normal distribution. If X and Y are two independent normal variables, then X + Y (or X − Y) is also normal, with mean μx + μy (or μx − μy) and variance σx^2 + σy^2.
Testing whether data are normally distributed:
1. Sketch the cumulative frequency distribution of the data on normal probability paper. If the plot is nearly a straight line, we can decide that the data follow the normal distribution.
2. Calculate the skewness coefficient (Cs) and kurtosis coefficient (kx) of the data. If Cs is equal to 0 (±0.05) and kx = 0 (±0.10), we can say that the data are normally distributed.
3. Calculate the mean, median, and mode of the data. If they are equal or quite similar to each other, the distribution is normal.
Many variables encountered in engineering problems can be assumed to be normal. Random
measurement errors occur due to the additive effects of several factors. Therefore they are expected
to be normally distributed, with a standard deviation called the Standard error. Certain properties
of building materials are also normally distributed.
In some cases where there is no reason to expect the normal distribution, the assumption is still made because of the ease of using it. However, the normal distribution is certainly not valid in every case, because the variable may be skewed. Most hydrologic variables (such as the discharge in a stream or the precipitation depth at a location) are not symmetrically distributed. For such variables, distributions other than the normal must be used.
EXAMPLE 6.1
Annual flow of a river (m3/s) is assumed to be normally distributed with μx=60, σx=8. What
is the probability of the flow to remain
a) less than 55
b) more than 73
c) in the range of 55 and 73 m3/s?
SOLUTION
The values of the standard normal variable corresponding to 55 and 73, respectively, are computed
by:
Z = (X − μx) / σx
a)
Z = (55 − 60) / 8 = −0.63
F(−0.63) = F1(0.63) = 0.2643
Then the probability of the flow being less than 55 is 26.43%.
b)
Z = (73 − 60) / 8 = 1.63
F1(1.63) = 0.0516
Then the probability of the flow being more than 73 is 5.16%.
c)
P(55 < X < 73) = P(−0.63 < Z < 1.63)
Since the total area under the normal distribution curve is equal to one, the probability between 55 and 73 (or z = −0.63 and z = 1.63) is
1 − (0.2643 + 0.0516) = 0.6841, i.e. 68.41%.
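Instead of the Z-table, F(z) can be evaluated directly with the error function. A sketch for this example (small differences from the table values come from rounding z to two decimals):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function F(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 60, 8
p_a = phi((55 - mu) / sigma)        # P(X < 55)
p_b = 1 - phi((73 - mu) / sigma)    # P(X > 73)
print(round(p_a, 4), round(p_b, 4), round(1 - p_a - p_b, 4))
```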
LOGNORMAL DISTRIBUTION
It is often attempted to transform a nonnormal random variable to a normal variable because the
normal distribution has well known properties and is easy to use. The most commonly used
transformation is the logarithmic transformation.
If the transformed variable
Y=ln X
fits the normal distribution, then the distribution of the original variable X is called lognormal:
f(x) = [1 / (x σY √(2π))] exp[−(ln x − μY)^2 / (2σY^2)],   x > 0
The lognormal distribution is positively skewed.
Fig 6.3. Lognormal distribution
This distribution is defined only for the positive values of the variable X. The parameters of μy and
σy are related to μx and σx, the parameters of X, by the following equations:
Z = (Y − μy) / σy
Y = ln X,   μY = ln[μx / (σx^2/μx^2 + 1)^0.5],   σY = [ln(σx^2/μx^2 + 1)]^0.5
Since the logarithm of a product is the sum of the logarithms of the multipliers, it may be expected
by the central limit theorem that if a random variable arises by the multiplication of the effects of
several independent variables, then its distribution will approach the lognormal.
The fact that a lognormal random variable can take only positive values, facilitates the fitting of
this distribution to the physical variables.
In civil engineering applications, the lognormal distribution has been used for hydrologic variables and in problems related to fatigue and earthquakes. In the mining and extraction industries the lognormal distribution has also been used on a large scale.
The z–table prepared for the normal distribution can be used for the lognormal distribution as well.
The parameters μy and σy can be estimated in two ways. Either they are computed from the
logarithms of the observations of the X variable, or they are estimated from equations above using
the computed values of μx and σx. The second approach preserves the parameters of the original
variable.
The probability paper of the normal distribution can be used for the lognormal distribution when
the abscissa axis is logarithmically scaled.
A property of the lognormal distribution is that the product of lognormal variables is lognormally
distributed.
EXAMPLE 6.2
Solve the problem of previous example assuming that X is lognormally distributed.
SOLUTION
The parameters of the variable are
μY = ln[60 / (8^2/60^2 + 1)^0.5] = 4.086
σY = [ln(8^2/60^2 + 1)]^0.5 = 0.132
a) Y = ln 55 = 4.007
Z1 = (4.007 − 4.086) / 0.132 = −0.60
From the Z-table:
P(Y < 4.007) = F(−0.60) = F1(0.60) = 0.2743, i.e. 27.43%.
b) Y = ln 73 = 4.290
Z2 = (4.290 − 4.086) / 0.132 = 1.55
From the Z-table:
P(Y > 4.290) = F1(1.55) = 0.0606, i.e. 6.06%.
c) The probability of the flow being between 55 and 73 is
1 − (0.2743 + 0.0606) = 0.6651, i.e. 66.51%.
This value is somewhat smaller than the 68.41% obtained by the assumption of the normal distribution.
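The lognormal computation can be checked with the same error-function approach, working on ln X (small differences from the table-based values come from rounding z):

```python
from math import erf, log, sqrt

def phi(z):
    """Standard normal cumulative distribution function F(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu_x, sigma_x = 60, 8
var_ratio = sigma_x**2 / mu_x**2
mu_y = log(mu_x / sqrt(var_ratio + 1))   # ~4.086
sigma_y = sqrt(log(var_ratio + 1))       # ~0.1327

p_a = phi((log(55) - mu_y) / sigma_y)      # P(X < 55)
p_b = 1 - phi((log(73) - mu_y) / sigma_y)  # P(X > 73)
print(round(p_a, 4), round(p_b, 4), round(1 - p_a - p_b, 4))
```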
Table 4.1 Z – Table (Normal Distribution)
z 0 0,01 0,02 0,03 0,04 0,05 0,06 0,07 0,08 0,09
0,0 0,5000 0,4960 0,4920 0,4880 0,4840 0,4801 0,4761 0,4721 0,4681 0,4641
0,1 0,4602 0,4562 0,4522 0,4483 0,4443 0,4404 0,4364 0,4325 0,4286 0,4247
0,2 0,4207 0,4168 0,4129 0,4090 0,4052 0,4013 0,3974 0,3936 0,3897 0,3859
0,3 0,3821 0,3783 0,3745 0,3707 0,3669 0,3632 0,3594 0,3557 0,3520 0,3483
0,4 0,3446 0,3409 0,3372 0,3336 0,3300 0,3264 0,3228 0,3192 0,3156 0,3121
0,5 0,3085 0,3050 0,3015 0,2981 0,2946 0,2912 0,2877 0,2843 0,2810 0,2776
0,6 0,2743 0,2709 0,2676 0,2643 0,2611 0,2578 0,2546 0,2514 0,2483 0,2451
0,7 0,2420 0,2389 0,2358 0,2327 0,2296 0,2266 0,2236 0,2206 0,2177 0,2148
0,8 0,2119 0,2090 0,2061 0,2033 0,2005 0,1977 0,1949 0,1922 0,1894 0,1867
0,9 0,1841 0,1814 0,1788 0,1762 0,1736 0,1711 0,1685 0,1660 0,1635 0,1611
1,0 0,1587 0,1562 0,1539 0,1515 0,1492 0,1469 0,1446 0,1423 0,1401 0,1379
1,1 0,1357 0,1335 0,1314 0,1292 0,1271 0,1251 0,1230 0,1210 0,1190 0,1170
1,2 0,1151 0,1131 0,1112 0,1093 0,1075 0,1056 0,1038 0,1020 0,1003 0,0985
1,3 0,0968 0,0951 0,0934 0,0918 0,0901 0,0885 0,0869 0,0853 0,0838 0,0823
1,4 0,0808 0,0793 0,0778 0,0764 0,0749 0,0735 0,0721 0,0708 0,0694 0,0681
1,5 0,0668 0,0655 0,0643 0,0630 0,0618 0,0606 0,0594 0,0582 0,0571 0,0559
1,6 0,0548 0,0537 0,0526 0,0516 0,0505 0,0495 0,0485 0,0475 0,0465 0,0455
1,7 0,0446 0,0436 0,0427 0,0418 0,0409 0,0401 0,0392 0,0384 0,0375 0,0367
1,8 0,0359 0,0351 0,0344 0,0336 0,0329 0,0322 0,0314 0,0307 0,0301 0,0294
1,9 0,0287 0,0281 0,0274 0,0268 0,0262 0,0256 0,0250 0,0244 0,0239 0,0233
2,0 0,0228 0,0222 0,0217 0,0212 0,0207 0,0202 0,0197 0,0192 0,0188 0,0183
2,1 0,0179 0,0174 0,0170 0,0166 0,0162 0,0158 0,0154 0,0150 0,0146 0,0143
2,2 0,0139 0,0136 0,0132 0,0129 0,0125 0,0122 0,0119 0,0116 0,0113 0,0110
2,3 0,0107 0,0104 0,0102 0,0099 0,0096 0,0094 0,0091 0,0089 0,0087 0,0084
2,4 0,0082 0,0080 0,0078 0,0075 0,0073 0,0071 0,0069 0,0068 0,0066 0,0064
2,5 0,0062 0,0060 0,0059 0,0057 0,0055 0,0054 0,0052 0,0051 0,0049 0,0048
2,6 0,0047 0,0045 0,0044 0,0043 0,0041 0,0040 0,0039 0,0038 0,0037 0,0036
2,7 0,0035 0,0034 0,0033 0,0032 0,0031 0,0030 0,0029 0,0028 0,0027 0,0026
2,8 0,0026 0,0025 0,0024 0,0023 0,0023 0,0022 0,0021 0,0021 0,0020 0,0019
2,9 0,0019 0,0018 0,0018 0,0017 0,0016 0,0016 0,0015 0,0015 0,0014 0,0014
3,0 0,0013 0,0013 0,0013 0,0012 0,0012 0,0011 0,0011 0,0011 0,0010 0,0010
3,1 0,0010 0,0009 0,0009 0,0009 0,0008 0,0008 0,0008 0,0008 0,0007 0,0007
3,2 0,0007 0,0007 0,0006 0,0006 0,0006 0,0006 0,0006 0,0005 0,0005 0,0005
3,3 0,0005 0,0005 0,0005 0,0004 0,0004 0,0004 0,0004 0,0004 0,0004 0,0003
3,4 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0003 0,0002
3,5 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002
3,6 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002 0,0002
3,7 0,0002 0,0002 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001
3,8 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001 0,0001
3,9 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000 0,0000
NORMAL DISTRIBUTION PAPER
LOGNORMAL DISTRIBUTION PAPER
GAMMA DISTRIBUTION
EXAMPLE 6.3
Chapter 7
SAMPLING DISTRIBUTIONS
THE CONCEPT OF SAMPLING DISTRIBUTION
The real value of any parameter β of a random variable is never determined exactly, since it is not possible to observe the whole population of this random variable. We can only calculate the value of a statistic b, which is an estimate of this parameter, from a sample. (Generally, Greek letters are used for parameters and the corresponding Latin letters for their statistics.) The statistic b is not equal to the parameter β; it is the best estimate of β that can be obtained from the sample at hand. If we had various samples drawn from the same population, the statistics bi corresponding to the parameter β computed from these samples would not be equal. For instance, if the statistic Sx corresponding to the standard deviation parameter σx is calculated from various samples, different values are obtained.
The values of a statistic calculated from different samples have a distribution since we can treat
any statistic as a random variable. The probability distribution of the values of any statistic to be
calculated from various samples of same size is called the sampling distribution of this statistic
(Fig. 7.1).
Fig. 7.1. The values of the statistic b estimated from various samples of the same size n (b1, b2, …, bN) corresponding to the parameter β
Knowing the sampling distribution of a statistic is important for the following reason. As mentioned above, the statistic determined from the sample at hand is not equal to the real value of the population parameter. Without observing the whole population it is not possible to determine the value of the parameter with absolute correctness. However, we can determine the interval around the calculated statistic in which the unknown parameter value will remain with a given probability. For this, the sampling distribution of that statistic must be known.
Let us sketch the sampling distribution f(b) of the statistic b of the parameter β calculated from samples of size N (Fig. 7.2). The expected value of this distribution will be the value b0 calculated from the sample at hand. To determine the interval (b1, b2) in which the unknown population parameter β will remain with a given probability Pc, a symmetrical interval (b1, b2) is chosen around b0 such that the percentage of the sampling distribution within this interval is Pc (Fig. 7.2). Here Pc is called the confidence level and the interval (b1, b2) is called the confidence interval at this level. Values such as 0.90, 0.95, 0.99 are used for Pc in practice. The confidence interval widens as Pc increases, since the probability that β remains within a wider interval is higher.
The properties of the sampling distribution depend on the distribution of the random variable of the population, the parameter under consideration, and the size of the sample. As the number of elements of the sample N increases, the confidence interval corresponding to a certain confidence level gets narrower (Fig. 7.3). In other words, the confidence interval within which the parameter will remain with a certain probability is smaller for large samples, expressing that the error in parameter estimation is reduced.
Fig. 7.2. Determination of the confidence interval (b1, b2) in the sampling distribution f(b) of the
statistic b at a confidence level Pc
Fig. 7.3. Narrowing of the confidence interval with the increase of N, the number of elements in
the sample
From the above explanations one might think that the parameter β is considered a random variable which remains within a certain interval with a certain probability. As a matter of fact, the parameter β is not a random variable but an unknown constant. Therefore, it is more correct to interpret the concepts of confidence level and confidence interval as follows. A proportion Pc of the intervals (b1, b2) determined at this confidence level from numerous samples of the same size drawn from the same population will contain the parameter β. In other words, we can rely at level Pc on the unknown value β being in the confidence interval (b1, b2). Or: the random interval (b1, b2) contains β with probability Pc.
(An important point to be noted here is that the boundaries of the confidence interval will change from sample to sample, since the assumption E(b) = b0 is made for each sample. Therefore this is a random interval, and Pc is the ratio of such intervals that contain β. However, since there is only one sample at hand in practice, the confidence interval determined from this sample is used.)
In some problems it might be more meaningful to choose the confidence interval only to the left (or to the right) of b0 (one-sided confidence interval) instead of choosing it symmetrically on both sides of b0 (two-sided confidence interval). For example, when the resistance of a material or the capacity of a channel is of interest, it is more meaningful to use a one-sided confidence interval to the left of b0, thus determining the lowest value for the parameter considered at the chosen confidence level. To find the lower bound of the one-sided confidence interval, a value b is calculated from the sampling distribution such that the probability of remaining smaller than b is 1–Pc.
On the other hand, if the wind load acting on a structure or the flood of a river is of interest, the one-sided confidence interval to the right of b0 can be used. In this case, to determine the upper bound of the confidence interval, the value b whose exceedance probability is 1–Pc is calculated from the sampling distribution. It can be said that the parameter under consideration will not exceed this value at a confidence level of Pc.
Sampling distributions can be determined theoretically only for some statistics. This is generally possible for large samples; sampling distributions which are valid in cases when the number of elements in the sample approaches infinity (N → ∞) are called asymptotic distributions. Asymptotic distributions can only be used approximately for small samples (N < 30); as N decreases, the errors increase rapidly. Exact distributions, which are also valid for small samples, have been obtained theoretically only for some special statistics when the random variable is normally distributed. In other cases sampling distributions can only be found experimentally. To do this, numerous samples of the required size of a random variable fitting the given distribution are generated by computer, and the value of the required statistic is calculated from each sample. The frequency distribution of the values obtained is determined; this frequency distribution can approximately be taken as the sampling distribution that is sought.
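The experimental (Monte Carlo) procedure just described can be sketched in a few lines. The population here is an illustrative assumption (normal with μ = 60, σ = 8), and the statistic collected from each generated sample is the standard deviation Sx:

```python
import random
import statistics

def empirical_sampling_distribution(sample_size, n_samples, rng):
    """Generate many samples from an assumed normal population (mu=60, sigma=8)
    and collect the value of the statistic Sx from each sample."""
    values = []
    for _ in range(n_samples):
        sample = [rng.gauss(60.0, 8.0) for _ in range(sample_size)]
        values.append(statistics.stdev(sample))  # the statistic of interest
    return values

rng = random.Random(42)  # fixed seed so the experiment is repeatable
s_values = empirical_sampling_distribution(sample_size=25, n_samples=2000, rng=rng)

# The frequency distribution of s_values approximates the sampling
# distribution of Sx; the values scatter around the population sigma = 8.
mean_s = statistics.mean(s_values)
```

Plotting a histogram of `s_values` would show the sampling distribution of Sx directly; increasing `sample_size` narrows it, as stated above.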
SAMPLING DISTRIBUTIONS
1. Normal Distribution
For large samples the sampling distribution of the mean is normal, and the standardized statistic is
Z = (X − X̄) / (Sx/√N)
so that the limits of the confidence interval of the mean are
b1 = X̄ − Z Sx/√N    b2 = X̄ + Z Sx/√N
EXAMPLE 7.1
Total dissolved solids (TDS) of a river is a random variable with a mean value of X̄ = 80 mg/l and a standard deviation of Sx = 16 mg/l. What is the probability that the annual mean TDS value, calculated from samples taken during 36 months, exceeds 90 mg/l? (Distribution is normal)
SOLUTION
The mean and standard deviation of the annual mean TDS calculated from a sample of N = 36 are 80 mg/l and 16/√36 = 2.67 mg/l, respectively. The value of the standard normal variable corresponding to 90 mg/l under the normal distribution assumption is:
Z = (90 − 80) / (16/√36) = 3.75
From the normal distribution table (z–Table) the exceedance probability of this value is F1(Z) = 0.0001. Thus the probability that the annual mean TDS calculated from samples recorded during 36 months is greater than 90 mg/l is 0.01%.
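The same steps can be checked numerically. The sketch below reproduces Example 7.1 with the standard library only, using `erfc` for the upper tail of the normal distribution:

```python
import math

def exceedance_prob_of_mean(mu, sigma, n, threshold):
    """P(sample mean > threshold) under the normal sampling distribution
    of the mean, whose standard deviation is sigma / sqrt(n)."""
    z = (threshold - mu) / (sigma / math.sqrt(n))
    p_exceed = 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - F(z)
    return z, p_exceed

z, p = exceedance_prob_of_mean(mu=80.0, sigma=16.0, n=36, threshold=90.0)
```

This gives z = 3.75 and an exceedance probability of about 0.0001, matching the table lookup.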
EXAMPLE 7.2
The mean and standard deviation of the failure load experiments performed on 36 steel
beams are found as 8490 kg and 300 kg, respectively. Find the limits of confidence interval
of the mean at Pc=95%. (Distribution is normal)
SOLUTION
Since the sampling distribution is normal, the value of the standard normal variable whose exceedance probability is (1 − Pc)/2 = (1 − 0.95)/2 = 0.025 can be read from the normal distribution table (Table 4.1) as Z0.025 = 1.96.
b1 = X̄ − Z Sx/√N = 8490 − 1.96 (300/√36) = 8392
b2 = X̄ + Z Sx/√N = 8490 + 1.96 (300/√36) = 8588
The limits of confidence interval is shown in Fig. 7.4
Fig 7.4. Confidence interval of the mean failure load in previous example at 95% confidence level
It can be stated that μx (the population mean), whose value is unknown, remains within the interval (8392, 8588) with a probability of 95%. The result is only approximately correct, since an asymptotic sampling distribution is used with a sample of finite size.
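The interval of Example 7.2 can be reproduced with a short helper; note that the critical value z = 1.96 is read from the z-table, not computed:

```python
import math

def mean_confidence_interval(xbar, sx, n, z_crit):
    """Two-sided confidence interval for the mean, b = xbar -/+ z*Sx/sqrt(N).
    z_crit is the table value for the chosen confidence level (1.96 for 95%)."""
    half_width = z_crit * sx / math.sqrt(n)
    return xbar - half_width, xbar + half_width

b1, b2 = mean_confidence_interval(xbar=8490.0, sx=300.0, n=36, z_crit=1.96)
```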
2. t – (Student) Distribution
Distributions valid for small samples can be determined analytically only for some special statistics; this is possible only when the random variable is normally distributed. The t distribution was developed especially for samples having a small number of data (N < 30):
t = (X − X̄) / (Sx/√N)
The distribution of the t statistic is the t distribution with degrees of freedom d.f. = N − 1. The t distribution table (t–Table) gives the probability that the variable t is greater than a selected value t0, P(t > t0). The t distribution is symmetrical like the normal distribution. Its mean value is 0; its variance, N/(N − 2) > 1, is greater than that of the standard normal distribution. For large N the variance approaches 1 and the t distribution approaches the standard normal distribution.
The limits of the confidence interval for the t distribution are calculated with the following formulas:
b1 = X̄ − t Sx/√N    b2 = X̄ + t Sx/√N
EXAMPLE 7.3
Total dissolved solids (TDS) of a river is a random variable with a mean value of X̄ = 80 mg/l and a standard deviation of Sx = 16 mg/l. What is the probability that the annual mean TDS value, calculated from samples taken during 12 months of the year, exceeds 90 mg/l? (Distribution is symmetrical)
SOLUTION
Since the sample is a small one (N<30), t–distribution is used.
The value of the variable t corresponding to 90 mg/l is
t = (90 − 80) / (16/√12) = 2.16
From the t–Table (Table 7.1) the exceedance probability of this t value is read as approximately 0.025 for d.f. = N − 1 = 11. The probability that the annual mean TDS evaluated from records of 12 months exceeds 90 mg/l is therefore 0.025 (this value is greater than the 0.015 calculated under the assumption of an asymptotic normal distribution).
EXAMPLE 7.4
In Example 7.2, if the number of samples is 25, it is more suitable to use the exact t distribution instead of the normal distribution. (Mean = 8490 kg and standard deviation = 300 kg; distribution is symmetrical.)
Find the limits of confidence interval of the mean at Pc=95%.
SOLUTION
From t–Table 7.1, the value of t of which the exceedance probability is 0.025 is read as t0.025 =2.064
for d.f.=N-1=24. Thus the limits of the confidence interval of the mean at 95% confidence level:
b1 = X̄ − t Sx/√N = 8490 − 2.064 (300/√25) = 8366.2
b2 = X̄ + t Sx/√N = 8490 + 2.064 (300/√25) = 8613.8
Thus it can be said that the unknown population parameter μx will remain within the interval (8366,
8614) with a probability of 95 %. Since the sample is small, the confidence interval widens when
the t distribution is used.
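The t-based interval of Example 7.4 follows the same pattern as the normal one; only the critical value changes, and it must be read from the t-table for d.f. = N − 1:

```python
import math

def t_confidence_interval(xbar, sx, n, t_crit):
    """Confidence interval for the mean using the t distribution;
    t_crit is read from the t-table for d.f. = n - 1
    (t_0.025 = 2.064 for d.f. = 24 in Example 7.4)."""
    half_width = t_crit * sx / math.sqrt(n)
    return xbar - half_width, xbar + half_width

b1, b2 = t_confidence_interval(xbar=8490.0, sx=300.0, n=25, t_crit=2.064)
```

Comparing with Example 7.2 shows directly how the interval widens for the smaller sample.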
Table 7.1. Student’s t- Distribution Table
P 0.45 0.40 0.35 0.20 0.15 0.10 0.05 0.025 0.01 0.005
(d.f)
1 0.158 0.325 0.510 1.376 1.963 3.078 6.314 12.71 31.82 63.66
2 0.142 0.289 0.445 1.061 1.386 1.886 2.920 4.303 6.965 9.925
3 0.137 0.277 0.424 0.978 1.250 1.638 2.353 3.182 4.541 5.841
4 0.134 0.271 0.414 0.941 1.190 1.533 2.132 2.776 3.747 4.604
5 0.132 0.267 0.408 0.920 1.156 1.476 2.015 2.571 3.365 4.032
6 0.131 0.265 0.404 0.906 1.134 1.440 1.943 2.447 3.143 3.707
7 0.130 0.263 0.402 0.896 1.119 1.415 1.895 2.365 2.998 3.499
8 0.130 0.262 0.399 0.889 1.108 1.397 1.860 2.306 2.896 3.355
9 0.129 0.261 0.398 0.883 1.100 1.383 1.833 2.262 2.821 3.250
10 0.129 0.260 0.397 0.879 1.093 1.372 1.812 2.228 2.764 3.169
11 0.129 0.260 0.396 0.876 1.088 1.363 1.796 2.201 2.718 3.106
12 0.128 0.259 0.395 0.873 1.083 1.356 1.782 2.179 2.681 3.055
13 0.128 0.259 0.394 0.870 1.079 1.350 1.771 2.160 2.650 3.012
14 0.128 0.258 0.393 0.868 1.076 1.345 1.761 2.145 2.624 2.977
15 0.128 0.258 0.393 0.866 1.074 1.341 1.753 2.131 2.602 2.947
16 0.128 0.258 0.392 0.865 1.071 1.337 1.746 2.120 2.583 2.921
17 0.128 0.257 0.392 0.863 1.069 1.333 1.740 2.110 2.567 2.898
18 0.127 0.257 0.392 0.862 1.067 1.330 1.734 2.101 2.552 2.878
19 0.127 0.257 0.391 0.861 1.066 1.328 1.729 2.093 2.539 2.861
20 0.127 0.257 0.391 0.860 1.064 1.325 1.725 2.086 2.528 2.845
21 0.127 0.257 0.391 0.859 1.063 1.323 1.721 2.080 2.518 2.831
22 0.127 0.256 0.390 0.858 1.061 1.321 1.717 2.074 2.508 2.819
23 0.127 0.256 0.390 0.858 1.060 1.319 1.714 2.069 2.500 2.807
24 0.127 0.256 0.390 0.857 1.059 1.318 1.711 2.064 2.492 2.797
25 0.127 0.256 0.390 0.856 1.058 1.316 1.708 2.060 2.485 2.787
26 0.127 0.256 0.390 0.856 1.058 1.315 1.706 2.056 2.479 2.779
27 0.127 0.256 0.389 0.855 1.057 1.314 1.703 2.052 2.473 2.771
28 0.127 0.256 0.389 0.855 1.056 1.313 1.701 2.048 2.467 2.763
29 0.127 0.256 0.389 0.854 1.055 1.311 1.699 2.045 2.462 2.756
30 0.127 0.256 0.389 0.854 1.055 1.310 1.697 2.042 2.457 2.750
40 0.126 0.255 0.388 0.851 1.050 1.303 1.684 2.021 2.423 2.704
60 0.126 0.254 0.387 0.848 1.046 1.296 1.671 2.000 2.390 2.660
120 0.126 0.254 0.386 0.845 1.041 1.289 1.658 1.980 2.358 2.617
∞ 0.126 0.253 0.385 0.842 1.036 1.282 1.645 1.960 2.326 2.576
3. χ² – Distribution
The sampling distribution of the standard deviation can be determined through the χ² (chi–square) statistic defined below:
χ² = N Sx² / σx²
The distribution of this statistic is the χ² distribution with degrees of freedom d.f. = N − 1. The χ² table (Table 7.2) gives the probability that the χ² statistic exceeds a selected value χo², P(χ² > χo²). The χ² distribution is a special case of the two-parameter gamma distribution with α = N/2, β = 2. For large N this distribution approaches the normal distribution with mean N and variance 2N.
The limits of the interval are calculated with the following formula:
b1,2 = (1/χ²) N Sx²
EXAMPLE 7.5
In Example 7.2, if the distribution is positively skewed and the number of samples is 27, find the limits of the confidence interval of the variance for Pc = 90% by the χ² distribution. (Mean = 8490 kg and standard deviation = 300 kg)
SOLUTION
From the χ² Table (Table 7.2) the values with exceedance probabilities (1 − Pc)/2 = 0.05 and 1 − 0.05 = 0.95 are found for d.f. = N − 1 = 26 as χ²0.05 = 38.885 and χ²0.95 = 15.379, respectively. Thus the limits of the confidence interval:
b1 = (1/38.885) × 27 × 300² = 62 500
b2 = (1/15.379) × 27 × 300² = 157 997
Then the population variance remains within the interval (62 500, 157 997) with a probability of 90%.
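Example 7.5 reduces to two divisions once the chi-square critical values have been read from the table; a minimal sketch:

```python
def variance_confidence_interval(sx, n, chi2_upper, chi2_lower):
    """Confidence interval for the population variance, b = N*Sx^2 / chi^2.
    The chi-square values are read from Table 7.2 for d.f. = n - 1
    (38.885 and 15.379 for d.f. = 26 at Pc = 90%)."""
    n_sx2 = n * sx ** 2
    return n_sx2 / chi2_upper, n_sx2 / chi2_lower

b1, b2 = variance_confidence_interval(sx=300.0, n=27,
                                      chi2_upper=38.885, chi2_lower=15.379)
```

Note that the interval is not symmetric around Sx² = 90 000, reflecting the skewness of the χ² distribution.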
Table 7.2. X2 Distribution Table
Degree of Probability
freedom 0,99 0,975 0,95 0,90 0,50 0,10 0,05 0,025 0,01
1 0.000 0.001 0.004 0.016 0.455 2.706 3.841 5.024 6.635
2 0.020 0.051 0.103 0.211 1.386 4.605 5.991 7.378 9.210
3 0.115 0.216 0.352 0.584 2.366 6.251 7.815 9.348 11.345
4 0.297 0.484 0.711 1.064 3.357 7.779 9.488 11.143 13.277
5 0.554 0.831 1.145 1.610 4.351 9.236 11.071 12.833 15.086
6 0.872 1.237 1.635 2.204 5.348 10.645 12.592 14.449 16.812
7 1.239 1.690 2.167 2.833 6.346 12.017 14.067 16.013 18.475
8 1.646 2.180 2.733 3.490 7.344 13.362 15.507 17.535 20.090
9 2.088 2.700 3.325 4.168 8.343 14.684 16.919 19.023 21.666
10 2.558 3.247 3.940 4.865 9.342 15.987 18.307 20.483 23.209
11 3.053 3.816 4.575 5.578 10.341 17.275 19.675 21.920 24.725
12 3.571 4.404 5.226 6.304 11.340 18.549 21.026 23.337 26.217
13 4.107 5.009 5.892 7.042 12.340 19.812 22.362 24.736 27.688
14 4.660 5.629 6.571 7.790 13.339 21.064 23.685 26.119 29.141
15 5.229 6.262 7.261 8.547 14.339 22.307 24.996 27.488 30.578
16 5.812 6.908 7.962 9.312 15.339 23.542 26.296 28.845 32.000
17 6.408 7.564 8.672 10.085 16.338 24.769 27.587 30.191 33.409
18 7.015 8.231 9.390 10.865 17.338 25.989 28.869 31.526 34.805
19 7.633 8.907 10.117 11.651 18.338 27.204 30.144 32.852 36.191
20 8.260 9.591 10.851 12.443 19.337 28.412 31.410 34.170 37.566
21 8.897 10.283 11.591 13.240 20.337 29.615 32.671 35.479 38.932
22 9.542 10.982 12.338 14.042 21.337 30.813 33.924 36.781 40.289
23 10.196 11.689 13.091 14.848 22.337 32.007 35.173 38.076 41.638
24 10.856 12.401 13.848 15.659 23.337 33.196 36.415 39.364 42.980
25 11.524 13.120 14.611 16.473 24.337 34.382 37.653 40.647 44.314
26 12.198 13.844 15.379 17.292 25.336 35.567 38.885 41.923 45.642
27 12.879 14.573 16.151 18.114 26.336 36.741 40.113 43.194 46.963
28 13.565 15.308 16.928 18.939 27.336 37.916 41.337 44.461 48.278
29 14.257 16.047 17.708 19.768 28.336 39.088 42.557 45.722 49.588
30 14.954 16.791 18.493 20.599 29.336 40.256 43.773 46.979 50.892
40 22.164 24.433 26.509 29.051 39.335 51.805 55.759 59.342 63.691
50 29.707 32.357 34.764 37.689 49.335 63.167 67.505 71.420 76.154
60 37.485 40.482 43.188 46.459 59.335 74.397 79.082 83.298 88.379
70 45.442 48.756 51.739 55.329 69.334 85.527 90.531 95.023 100.425
80 53.540 57.153 60.392 64.278 79.334 96.578 101.879 106.629 112.329
90 61.754 65.647 69.126 73.291 89.334 107.561 113.145 118.136 124.116
100 70.065 74.222 77.930 82.358 99.334 118.498 124.342 129.561 135.807
REFERENCES:
– Bayazıt, M., Oğuz, B., 1998, Probability and Statistics for Engineers, Birsen Yayınevi.
– Spiegel, M.R., 1992, Schaum’s Outline Series, Theory and Problems of Statistics, 2nd Ed., McGraw Hill.
Chapter 8
HYPOTHESIS TESTING
I. Two–tailed test
Ho: β=βo
H1: β≠βo
If the hypothesis Ho: β=βo is to be checked against the hypothesis H1: β≠βo, then the null hypothesis should be rejected either when the observed value of b is much larger or much smaller than βo. Therefore the region of rejection is placed on both tails of the sampling distribution symmetrically. The region of acceptance in this case is between the values of the statistic with the exceedance probabilities 1–α/2 and α/2 (Fig. 8.1). If the observed value of the statistic lies in this region the null hypothesis is accepted; otherwise it is rejected, implying that the alternate hypothesis H1: β≠βo is accepted.
Fig. 8.1. Testing of the hypothesis Ho: β=βo with H1: β≠βo
Fig. 8.2. Testing of the hypothesis Ho: β=βo with H1: β>βo
It is seen that statistical hypothesis testing helps us decide whether an assumed value βo for a parameter β of a random variable can be accepted as true, by comparing it with the value of the corresponding statistic b obtained from a sample. If their difference is not too big, it is considered explainable by sampling variability and the hypothesis β=βo is accepted. However, errors are unavoidable because the whole population can never be observed.
Decisions made in hypothesis testing can have four different relations to the unknown reality, as
shown in the following table.
REAL SITUATION
DECISION H0 IS TRUE H0 IS FALSE
ACCEPT H0 CORRECT DECISION INCORRECT DECISION
(TYPE II ERROR)
REJECT H0 INCORRECT DECISION CORRECT DECISION
(ACCEPT H1) (TYPE I ERROR)
It is seen that two kinds of errors may exist in the decisions made in hypothesis testing. Type I
error corresponds to the rejection of the null hypothesis when it is in fact true. Type II error is
made when we accept the null hypothesis although it is in fact false.
Fig. 8.3. Probability of type II error increases with the decrease of the probability of type I error, α
The type II error is usually regarded as the less serious one and is tolerated in practice.
APPLICATIONS
Hypothesis testing has several applications in engineering problems. As an example, we can check whether the expected value of the strength of a material conforms to the specifications by comparing it with experimental results: it is tested whether the mean of the experimental data is significantly lower than the specified value or not. As another example, we can check whether the mean precipitation depths before and after the construction of a reservoir are significantly different, to decide about the possible effect of reservoir construction on precipitation. The test result may differ according to the chosen level of significance. In practice the value of α is usually taken as 0.05 or 0.10; such a standard value of the significance level facilitates the transmission of information. Reducing the value of α decreases the probability that an error is made when the null hypothesis is rejected (type I error) but increases the probability of a type II error.
Note: In hypothesis testing, the limits of the acceptance region (b1 and b2) are calculated according to the population, and the statistical value (mean or standard deviation) of the sample is checked as to whether it is within these limits.
EXAMPLE 8.1
It is known that the mean annual precipitation at a location is 68 cm (H0:μx=68) with a
standard deviation of 12 cm. 36 years of samples are taken and the mean is calculated as
71 cm (H1:μx≠68).
Find if the mean calculated from samples (𝑋𝑋̄=71) is similar to the mean calculated from
population (μx=68) (Distribution is normal, α=0.10)
SOLUTION
The mean precipitation estimated from measurements over N = 36 years is 71 cm. The sampling distribution of the mean is normal with standard deviation σx/√N = 12/√36 = 2.0 cm.
b1 = μx − Z σx/√N
b2 = μx + Z σx/√N
Since the level of significance is α = 0.10, α/2 = 0.05 and the corresponding z-score from the z–Table is 1.65. Therefore the limits of the acceptance region are:
b1 = 68 − 1.65 × 12/√36 = 64.7
b2 = 68 + 1.65 × 12/√36 = 71.3
The measured value (71.0) is inside this region, therefore the H0 is accepted at the α=0.10 level.
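The two-tailed test of Example 8.1 can be packaged into one helper; the critical value (1.65 for α = 0.10) is read from the z-table:

```python
import math

def two_tailed_z_test(mu0, sigma, n, xbar, z_crit):
    """Acceptance region for H0: mu = mu0 at significance level alpha,
    where z_crit is the table value for alpha/2 (1.65 for alpha = 0.10)."""
    half_width = z_crit * sigma / math.sqrt(n)
    b1, b2 = mu0 - half_width, mu0 + half_width
    return b1, b2, b1 <= xbar <= b2  # True means H0 is accepted

b1, b2, accept_h0 = two_tailed_z_test(mu0=68.0, sigma=12.0, n=36,
                                      xbar=71.0, z_crit=1.65)
```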
EXAMPLE 8.2
If the record length was N=64 years (assuming that all the other data remain the same), test
the hypotheses.
SOLUTION
b1 = 68 − 1.65 × 1.5 = 65.5 cm
b2 = 68 + 1.65 × 1.5 = 70.5 cm
In this case the measured value is outside of the acceptance region, and H1 is accepted. This is
because the acceptance region is narrower when the size of the sample is larger.
In the test, the probability of making a type I error is α = 0.10. The probability of making a type II error can be computed only if a value is assumed for the population mean μx. Assuming μx = 72 cm, a type II error will be made for N = 36 when the measured mean is between 64.7 and 71.3 cm, because in this case the null hypothesis H0: μx = 68 will be accepted although it is false (we assumed μx = 72 cm). Let us compute the probability of making a type II error:
z1 = (64.7 − 72)/2 = −3.65
z2 = (71.3 − 72)/2 = −0.35
P(64.7 < x̄ < 71.3) = P(−3.65 < z < −0.35) = P(z < −0.35) − P(z < −3.65) = 0.3632 − 0.0001 = 0.3631
On the other hand, if μx were equal to 74 cm, the probability of a type II error would be:
z1 = (64.7 − 74)/2 = −4.65
z2 = (71.3 − 74)/2 = −1.35
P(z < −1.35) − P(z < −4.65) = 0.0885
It is seen that the probability of a type II error increases rapidly as the hypothesized parameter value approaches the population parameter value. But in this case, making a type II error (i.e. accepting the null hypothesis when it is false) will not have serious consequences. For this reason, an appropriate value for α (such as 0.05 or 0.10) is chosen in practice and the probability of the type II error is not considered.
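The two type II error probabilities above can be recomputed directly from the normal CDF; the acceptance region (64.7, 71.3) and σ/√N = 2.0 cm come from the N = 36 test:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def type_two_error_prob(b1, b2, true_mu, sigma_mean):
    """Probability of accepting H0 (the sample mean falls inside the
    acceptance region b1..b2) when the true population mean is true_mu."""
    z1 = (b1 - true_mu) / sigma_mean
    z2 = (b2 - true_mu) / sigma_mean
    return normal_cdf(z2) - normal_cdf(z1)

# Acceptance region (64.7, 71.3) from the N=36 test, sigma_mean = 2.0 cm
beta_72 = type_two_error_prob(64.7, 71.3, true_mu=72.0, sigma_mean=2.0)
beta_74 = type_two_error_prob(64.7, 71.3, true_mu=74.0, sigma_mean=2.0)
```

The result confirms the trend: β is much larger when the true mean (72) is close to the hypothesized 68 than when it is farther away (74).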
EXAMPLE 8.3
A manufacturer claims that the mean weight of his products is μx = 2.15 kg. In order to check the manufacturer’s claim, 9 samples are taken and the mean x̄ is found as 1.95 kg. Test the hypothesis at the level of error α = 0.10. (N = 9, σx = 0.4, distribution is normal)
SOLUTION
For small samples the sampling distribution of the mean is the t distribution with degrees of freedom N − 1 = 8. For the two-tailed test 0.10/2 = 0.05 and t0.05 = 1.860 from the t–Table. The boundaries of the acceptance region:
Ho: x̄ = μx
H1: x̄ ≠ μx
b1 = μx − t σx/√N = 2.15 − 1.860 × 0.4/√9 = 1.90
b2 = μx + t σx/√N = 2.15 + 1.860 × 0.4/√9 = 2.40
The measured value x̄ = 1.95 is inside this region and H0 is accepted.
EXAMPLE 8.4
What happens, if we change the previous problem as if the mean of sample is less than the
mean of population (ie. 𝑥𝑥̄ <μx) ?
SOLUTION
The test changes into a one-tailed one. The critical value is now t0.10 = 1.397 for d.f. = 8, so the lower boundary of the acceptance region is
b1 = 2.15 − 1.397 × 0.4/√9 = 1.96
Since x̄ = 1.95 < 1.96, H1 is accepted, although the difference (1.96 − 1.95) is very small.
EXAMPLE 8.5
It is desired to produce steel bars with length μx = 12 cm, σx = 2.5 cm. If the mean length of a 49-element sample is x̄ = 11.2 cm, can it be accepted that the desired mean value is achieved? (α = 0.05)
SOLUTION
H0: μx=12 cm
H1: μx≠12 cm
At the significance level of 0.05, the boundaries of the region of acceptance are:
b1 = 12 − 1.96 × 2.5/√49 = 11.3
b2 = 12 + 1.96 × 2.5/√49 = 12.7
The normal distribution is used as the sampling distribution of the mean because the population standard deviation is assumed to be known; thus z0.025 = 1.96 from the z–Table. The measured mean 11.2 cm is outside the acceptance region and H1 is accepted. It means that the standard value for the mean was not achieved.
EXAMPLE 8.6
A company claims that the standard deviation of their products is 20. A sample of 23 is taken from the products and the standard deviation is found to be 24. Determine whether the company’s claim is correct or not. (α = 0.10; distribution is positively skewed)
SOLUTION
Ho: σx = 20
H1: σx ≠ 20
We must test the hypothesis Ho: σx = 20 against the alternate hypothesis H1: σx ≠ 20. At the significance level of 0.10, the boundaries of the region of acceptance are obtained from
b1,2 = (1/χ²) N Sx²
using χ²0.05 = 33.924 and χ²0.95 = 12.338 for d.f. = N − 1 = 22:
b1 = (1/33.924) × 23 × 24² = 390.5
b2 = (1/12.338) × 23 × 24² = 1073.8
Since σx² = 20² = 400 lies within the interval (390.5, 1073.8), H0 is accepted: the company’s claim cannot be rejected at this level.
4. Comparing t-values
tcal < tcrt ⇒ Accept H0
tcal > tcrt ⇒ Accept H1
EXAMPLE 8.7
In order to compare the performance of two types of oil, a test drive is performed with 15 motorbikes filled with Type A oil and the same number of motorbikes with Type B. The mean distance measured for Type A is 25 km and the standard deviation is 0.5 km. For Type B they are measured as 22 km and 0.4 km, respectively. (α = 0.01)
Is the performance of A statistically better than that of B?
SOLUTION
Type A oil: x̄A = 25 km, nA = 15, SA = 0.5 km
Type B oil: x̄B = 22 km, nB = 15, SB = 0.4 km
1) Ho: μA = μB, H1: μA > μB (one-tailed test)
2) tcal = (x̄A − x̄B) / √(SA²/nA + SB²/nB) = 3/0.165 = 18.18
3) tcrt = t0.01 = 2.467 for d.f. = nA + nB − 2 = 28
4) Since tcal > tcrt (18.18 > 2.467), H1 is accepted. It means that the performance of Type A oil is statistically greater than that of Type B.
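The t statistic of this comparison is a single formula; a sketch (the critical value 2.467 is read from the t-table, and with equal sample sizes this unpooled standard error coincides with the pooled one):

```python
import math

def two_sample_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for comparing two sample means, using the standard
    error sqrt(Sa^2/na + Sb^2/nb)."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se

t_cal = two_sample_t_statistic(25.0, 0.5, 15, 22.0, 0.4, 15)
t_crt = 2.467  # t_0.01 for d.f. = nA + nB - 2 = 28, from the t-table
type_a_better = t_cal > t_crt
```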
REFERENCES:
– Bayazıt, M., Oğuz, B., 1998, Probability and Statistics for Engineers, Birsen Yayınevi.
– Spiegel, M.R., 1992, Schaum’s Outline Series, Theory and Problems of Statistics, 2nd ed. in SI units, McGraw Hill.
– Yıldız, N., Akbulut, Ö., Bircan, H., 2005, İstatistiğe Giriş, Aktif Yayınevi.
Chapter 9
REGRESSION ANALYSIS
Regression analysis is a technique used for modeling and analyzing numerical data consisting of values of at least two variables: (1) a dependent variable (response variable) and (2) one or more independent variables (explanatory variables). The dependent variable in the regression equation is modeled as a function of the independent variables and corresponding parameters ("constants"). The parameters are estimated so as to give a "best fit" to the data.
y: Dependent variable
x: Independent variable
In engineering problems, the values two (or more) random variables take in an observation may not be statistically independent of each other; thus there may be a relation between these variables. The existence of such a relation shows either that one variable is affected by the other or that both variables are affected by other variables. As an example, the relation between precipitation and flow in a basin originates because flow takes place as a consequence of precipitation. The relation between flows in neighboring basins arises because the flows are affected by the precipitation of the same region.
However, these relations are not of a deterministic (functional) character; in other words, when one of the variables takes a certain value the other will not always take the same value. This value will change more or less in various observations under the effect of other variables not considered in the relation. As an example, when the flow of one of two neighboring basins takes a certain value, the flow of the other does not always take the same value. Still, determining the existence and the form of a nonfunctional relationship between the variables has great importance in practice, because by using this relationship it is possible to estimate a future value of a variable depending on known value(s) of another variable (or variables). While this estimate will not be the exact future value of the variable under consideration, it will be the best estimate closest to this value. The interval within which the difference of the estimated value from the real value (the error) will remain can be determined with a certain probability.
The mathematical expression showing a relation of the above mentioned type is called the
regression equation. The aim of the regression analysis is to check whether there is a significant
relation between the variables under consideration and, if there is one, then to obtain the regression
equation expressing this relation and to evaluate the confidence interval of the estimates to be
made by using this equation.
Regression analysis can be classified as simple linear or multivariate linear regression analysis. Simple linear regression analysis, in which there is a linear relationship between two variables, is the most frequently used. In multivariate linear regression analysis, on the other hand, it is assumed that there is a linear relationship among more than two variables.
In these course notes, only simple linear regression analysis will be discussed.
SIMPLE LINEAR REGRESSION ANALYSIS
Regression Equation
Let us assume that X and Y are two random variables between which there is a significant
relationship. To express this relation we determine the regression equation of Y with respect to X:

y = a + b x

The regression coefficients a and b are evaluated from the sample by the least-squares method as

b = r · Sy / Sx
a = ȳ − b · x̄

where x̄ and ȳ are the sample means, Sx and Sy the sample standard deviations, and r the
correlation coefficient between X and Y.
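As a sketch of how the least-squares coefficients are obtained, the short Python block below fits a line to a small set of data points; the values are hypothetical, chosen only for illustration:

```python
# Least-squares estimates for the simple linear regression y = a + b*x.
# The data points below are hypothetical, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b * x_bar
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar
print(f"y = {a:.3f} + {b:.3f} x")
```

The same slope can equivalently be written as b = r · Sy / Sx, since the covariance term divided by the standard deviations gives the correlation coefficient.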
EXAMPLE 9.1
Annual flows (10⁶ m³) at two stations on the river Dicle are given below:
Year 1956 1957 1958 1959 1960 1961 1962 1963
X 16077 14817 11720 9352 10537 6743 10162 29232
Y 4629 4556 2507 1612 2125 1054 2272 11883
There are 15 years of joint observations at the two stations. If each of these observations is shown
by a point on the X–Y coordinate system, we see that these points are distributed with a small
scatter around a straight line (Fig. 7.1). (The point for the year 1964 is not shown, since the Y flow
of this year was not recorded.)
a) Find the correlation coefficient, assuming a linear relation between the two stations.
b) Find the regression equation.
c) Find the missing Y value of 1964.
[Scatter plot of annual flows X versus Y with fitted trend line y = 0.3139x − 694.56, R² = 0.8106]
Fig. 7.1. Plot of the annual flows measured at two stations on the river Dicle
a) The following values (10⁶ m³) of the statistics are computed for the sample:
x̄ = 16220
ȳ = 4397
Sx = 7670
Sy = 2674
(Note: Do not use the X and Y values of 1964 and take N as 15.)
With R² = 0.8106 read from Fig. 7.1, the correlation coefficient is r = √0.8106 = 0.90.
b) Using b = r · Sy / Sx and a = ȳ − b · x̄, the coefficients of the regression line of flows at
station Y with respect to flows at station X are:
b = 0.313
a = −694 (10⁶ m³)
The regression equation is:
y = 0.313 x − 694
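As a check, the coefficients can be re-derived from the example's summary statistics, taking r = √0.8106 from the R² value shown in Fig. 7.1 (a short Python sketch):

```python
# Re-derive the regression coefficients of Example 9.1 from its summary
# statistics, using b = r * Sy / Sx and a = y_bar - b * x_bar.
x_bar, y_bar = 16220, 4397   # sample means (10^6 m^3)
sx, sy = 7670, 2674          # sample standard deviations
r = 0.8106 ** 0.5            # correlation, from R^2 = 0.8106 in Fig. 7.1

b = r * sy / sx
a = y_bar - b * x_bar
print(f"b = {b:.3f}, a = {a:.0f}")   # b = 0.314, a = -694
```

The small difference in the last digit of b comes only from rounding r to 0.90 in the hand computation.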
Table 9.2. Statistical properties of Anscombe’s quartet
Property Value
Mean of x in each case 9.0
Variance of x in each case 11.0
Mean of y in each case 7.5
Variance of y in each case 4.12
Correlation between x and y in each case 0.816
Linear regression equation in each case y = 3 + 0.5x
Fig 9.1. Four sets of data with the same correlation of 0.81
The images in the figure (Fig 9.1) show plots of Anscombe's quartet and the linear regression line
obtained for each. As seen from Table 9.2, the correlation coefficient for each set is the same
(0.816). However, as can be seen on the plots, the distribution of the variables is very different.
The first one (top left) seems to be distributed normally and corresponds to what one would expect
when considering two correlated variables under the assumption of normality. The second one (top
right) is not distributed normally; while an obvious relationship between the two variables can be
observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third
case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough
influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom
right) shows that one outlier is enough to produce a high correlation coefficient, even though the
relationship between the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the
individual examination of the data. They also show that the correlation coefficient or regression
equation should be calculated only after visually confirming that there is an actual correlation
between the x and y variables.
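This can be verified numerically. The Python sketch below computes the Pearson correlation coefficient for each set of Anscombe's quartet using the published data values:

```python
# Pearson correlation for each set of Anscombe's quartet; despite very
# different shapes, all four sets give r close to 0.816 (Table 9.2).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]       # shared by sets 1-3
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

rs = [pearson(x123, y1), pearson(x123, y2), pearson(x123, y3), pearson(x4, y4)]
print([round(r, 3) for r in rs])   # all close to 0.816
```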
Chapter 10
VARIANCE ANALYSIS
INTRODUCTION
Variance analysis is a statistical method used to reveal the differences among data sets. It is very
similar to the comparison test (in Chapter 8). However, by the comparison test only two sets of data
can be compared, whereas by variance analysis more than two sets of data can be compared with each
other.
Variance analysis is based on Fisher’s (F) distribution, which is a modification of the χ²
distribution. The distribution has tables for α = 0.01, 0.05, 0.10 and 0.25.
The Total Sum of Squares of the deviations of all observations from the grand mean is

TSS = Σᵢ Σⱼ xᵢⱼ² − x..² / (n·p)

and the Treatment Sum of Squares of the deviations of the treatment means from the grand mean is

MSS = Σᵢ xᵢ.² / n − x..² / (n·p)

where xᵢⱼ is the j-th observation of the i-th treatment, xᵢ. is the total of the i-th treatment,
x.. is the grand total, n is the number of samples per treatment and p is the number of treatments.
The Residual Sum of Squares of the deviations from the means within treatments is

RSS = TSS − MSS
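The sums of squares and the F statistic can be sketched in Python. The three treatments below are hypothetical data, chosen only to illustrate the formulas (they are not the battery data of the example that follows):

```python
# One-way variance analysis on hypothetical data: p = 3 treatments,
# n = 4 samples each, following the TSS / MSS / RSS formulas above.
groups = [[10, 12, 11, 13], [14, 16, 15, 17], [10, 11, 12, 11]]
p = len(groups)          # number of treatments
n = len(groups[0])       # samples per treatment
grand = sum(sum(g) for g in groups)       # grand total x..
corr = grand ** 2 / (n * p)               # correction term x..^2 / (n p)

tss = sum(x ** 2 for g in groups for x in g) - corr   # total sum of squares
mss = sum(sum(g) ** 2 for g in groups) / n - corr     # treatment sum of squares
rss = tss - mss                                       # residual sum of squares

mdf, rdf = p - 1, p * (n - 1)             # treatment and residual d.o.f.
f_cal = (mss / mdf) / (rss / rdf)
print(f"TSS={tss:.2f} MSS={mss:.2f} RSS={rss:.2f} Fcal={f_cal:.2f}")
```

The computed Fcal would then be compared against the tabulated F value for (mdf, rdf) degrees of freedom at the chosen α, exactly as in the example below.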
EXAMPLE 10.1
Four types of batteries are produced in a battery plant. In order to find the differences among the
performances of the batteries, 6 samples are taken of each type and the life of each battery is
measured. Are there any differences among the performances of the batteries at α = 0.01?
SOLUTION
1) H0 : µA = µB = µC = µD
H1 : at least one of the means is different
2) Fc = Fα(mdf, rdf)
Since mdf = p − 1 = 3, rdf = p(n − 1) = 20 and α = 0.01, the Fc value can be found from Fisher’s Table as
Fc = F0.01(3, 20) = 4.94
3) Fcal = (MSS / mdf) / (RSS / rdf)

TSS = Σᵢ Σⱼ xᵢⱼ² − x..²/(n·p) = 64² + 72² + ... + 68² − 1790²/24 = 4207.8

n (number of samples per type) = 6
p (number of battery types) = 4

MSS / mdf = 2136.5 / 3 = 712.2

RSS / rdf = 2071.3 / 20 = 103.6

Fcal = 712.2 / 103.6 = 6.87
4) Since Fcal (6.87) > Fc (4.94), H0 is rejected and H1 is accepted: there is a significant
difference among the performances of the batteries.
Variation    Degrees of Freedom    Sum of Squares    Mean of Squares    Fcal
Treatment            3                 2136.5             712.2         6.87
Residual            20                 2071.3             103.6
Total               23                 4207.8
In order to find which one(s) differ, multiple comparison tests such as the LSD, Duncan, SNK,
Bonferroni, Tukey, Scheffe and Dunnett tests are used. In this book, the LSD Test will be used.
LSD = √( Fα(1, rdf) · 2 (RSS / rdf) / n )
If the difference between two treatment means is greater than the LSD value, there is a significant
difference between those data sets.
EXAMPLE 10.2
Apply the LSD Test to the previous example and find the different one(s) of the batteries. (α = 0.01)
SOLUTION
LSD = √( F0.01(1, 20) · 2 (RSS / rdf) / n ) = √( 8.10 · 2 × 103.6 / 6 ) = 16.7
𝑛𝑛 6
As a result, it can be stated that there is a significant difference between the performances of
Battery B and Battery D.
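The LSD value of this example can be reproduced with a short Python sketch, using F0.01(1, 20) = 8.10 from Fisher's Table and the residual mean square RSS/rdf = 103.6 from Example 10.1:

```python
from math import sqrt

# LSD for Example 10.2: F_0.01(1, 20) = 8.10 (Fisher's table),
# residual mean square RSS/rdf = 2071.3 / 20 = 103.6, n = 6 samples per type.
f_crit = 8.10
res_ms = 103.6
n = 6

lsd = sqrt(f_crit * 2 * res_ms / n)
print(f"LSD = {lsd:.1f}")   # LSD = 16.7
```

Any pair of battery-type means whose difference exceeds this LSD value would be declared significantly different.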
Fisher’s Table (0.01 level)
mdf
rdf 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 4052 5000 5403 5625 5764 5859 5928 5982 6023 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98,50 99,00 99,20 99,20 99,30 99,30 99,40 99,40 99,40 99,00 99,40 99,40 99,40 99,50 99,50 99,50 99,50 99,50 99,50
3 34,10 30,80 29,50 28,70 28,20 27,90 27,70 27,50 27,30 27,00 27,10 26,90 26,70 26,60 26,50 26,40 26,30 26,20 26,10
4 21,20 18,00 16,70 16,00 15,50 15,20 15,00 14,80 14,70 15,00 14,40 14,20 14,00 13,90 13,80 13,70 13,70 13,60 13,50
5 16,30 13,30 12,10 11,40 11,00 10,70 10,50 10,30 10,20 10,00 9,89 9,72 9,55 9,47 9,38 9,29 9,20 9,11 9,02
6 13,70 10,90 9,78 9,15 8,75 8,47 8,26 8,10 7,98 7,90 7,72 7,56 7,40 7,31 7,23 7,14 7,06 6,97 6,88
7 12,20 9,55 8,45 7,85 7,46 7,19 6,99 6,84 6,72 6,60 6,47 6,31 6,16 6,07 5,99 5,91 5,82 5,74 5,65
8 11,30 8,65 7,59 7,01 6,63 6,37 6,18 6,03 5,91 5,80 5,67 5,52 5,36 5,28 5,20 5,12 5,03 4,95 4,86
9 10,60 8,02 6,99 6,42 6,06 5,80 5,61 5,47 5,35 5,30 5,11 4,96 4,81 4,73 4,65 4,57 4,48 4,40 4,31
10 10,00 7,56 6,55 5,99 5,64 5,39 5,20 5,06 4,94 4,85 4,71 4,56 4,41 4,33 4,21 4,17 4,08 4,00 3,91
11 9,65 7,21 6,22 5,67 5,32 5,07 4,89 4,74 4,63 4,54 4,40 4,25 4,10 4,02 3,94 3,86 3,78 3,69 3,60
12 9,33 6,93 5,95 5,41 5,06 4,82 4,64 4,50 4,39 4,30 4,16 4,01 3,86 3,78 3,70 3,62 3,54 3,45 3,36
13 9,07 6,70 5,74 5,21 4,86 4,62 4,44 4,30 4,19 4,10 3,96 3,82 3,66 3,59 3,51 3,43 3,34 3,25 3,17
14 8,86 6,51 5,56 5,04 4,69 4,46 4,28 4,14 4,03 3,94 3,80 3,66 3,51 3,43 3,35 3,27 3,18 3,09 3,00
15 8,68 6,36 5,42 4,89 4,56 4,32 4,14 4,00 3,89 3,80 3,67 3,52 3,37 3,29 3,21 3,13 3,05 2,96 2,87
16 8,53 6,23 5,29 4,77 4,44 4,20 4,03 3,89 3,78 3,69 3,55 3,41 3,26 3,18 3,10 3,02 2,93 2,84 2,75
17 8,40 6,11 5,18 4,67 4,34 4,10 3,93 3,79 3,68 3,59 3,46 3,31 3,16 3,08 3,00 2,92 2,83 2,75 2,65
18 8,29 6,01 5,09 4,58 4,25 4,01 3,84 3,71 3,60 3,51 3,37 3,23 3,08 3,00 2,92 2,84 2,75 2,66 2,57
19 8,18 5,93 5,01 4,50 4,17 3,94 3,77 3,63 3,52 3,43 3,30 3,15 3,00 2,92 2,84 2,76 2,67 2,58 2,49
20 8,10 5,85 4,94 4,43 4,10 3,87 3,70 3,56 3,46 3,37 3,23 3,09 2,94 2,86 2,78 2,69 2,61 2,52 2,42
22 7,95 5,72 4,80 4,31 3,99 3,76 3,59 3,45 3,35 3,26 3,12 2,98 2,83 2,75 2,70 2,58 2,50 2,40 2,30
24 7,82 5,61 4,70 4,22 3,90 3,67 3,50 3,36 3,26 3,17 3,03 2,89 2,74 2,66 2,60 2,49 2,40 2,31 2,20
26 7,72 5,53 4,60 4,14 3,82 3,59 3,42 3,29 3,18 3,09 2,96 2,81 2,66 2,58 2,50 2,42 2,33 2,23 2,10
28 7,64 5,45 4,60 4,07 3,75 3,53 3,36 3,23 3,12 3,03 2,90 2,75 2,60 2,52 2,40 2,35 2,26 2,17 2,10
30 7,56 5,39 4,50 4,02 3,70 3,47 3,30 3,17 3,07 2,98 2,84 2,70 2,55 2,47 2,40 2,30 2,21 2,11 2,00
40 7,31 5,18 4,30 3,83 3,51 3,29 3,12 2,99 2,89 2,80 2,66 2,52 2,37 2,29 2,20 2,11 2,02 1,92 1,80
60 7,08 4,98 4,10 3,65 3,34 3,12 2,95 2,82 2,72 2,63 2,50 2,35 2,20 2,12 2,00 1,94 1,84 1,73 1,60
120 6,85 4,79 4,00 3,48 3,17 2,96 2,79 2,66 2,56 2,47 2,34 2,19 2,03 1,95 1,90 1,76 1,66 1,53 1,40
200 6,76 4,71 3,90 3,41 3,11 2,89 2,73 2,60 2,50 2,41 2,27 2,13 1,97 1,89 1,80 1,69 1,58 1,44 1,30
∞ 6,63 4,61 3,80 3,32 3,02 2,80 2,64 2,51 2,41 2,32 2,18 2,04 1,88 1,79 1,70 1,59 1,47 1,32 1,00
Fisher’s Table (0.05 level)
mdf
rdf 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 161 200 216 225 230 234 237 239 241 242 244 246 248 249 250 251 252 253 254
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.5 19.5 19.5 19.5 19.5 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00