Introduction To Statistics PDF


Introduction to Statistics

K M Billah
Lecturer, Dept. of Civil Engineering, UU
Two Types of Statistics
• Descriptive statistics of a POPULATION
• Relevant notation (Greek):
– μ mean
– N population size
– Σ sum

• Inferential statistics of SAMPLES from a population.
– Assumptions are made that the sample reflects
the population in an unbiased form.
• Relevant notation (Roman):
– X̄ mean
– n sample size
– Σ sum
Measures of Central Tendency
• These measures describe the central, or typical,
value of a set of data.
– Mean
– Median
– Mode

The mean is the average of a set of data.

To calculate the mean, find the sum of the data
and then divide by the number of data points.
12, 15, 11, 11, 7, 13
First, find the sum of the data.
12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the number of data points.
69 / 6 = 11.5

The mean is 11.5.
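The two steps above can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative and do not come from the slides.

```python
# Data from the worked example above.
data = [12, 15, 11, 11, 7, 13]

total = sum(data)          # step 1: find the sum of the data
mean = total / len(data)   # step 2: divide by the number of data points

print(total, mean)         # prints: 69 11.5
```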


Sigma notation Σ
• The sigma notation is a shorthand notation
used to sum up a large number of terms.
• $\sum x = x_1 + x_2 + x_3 + \cdots + x_n$
• One uses this notation because it is more
convenient to write the sum in this fashion.
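As a rough illustration, the sigma notation corresponds to a simple accumulation loop. The sketch below reuses the six values from the mean example; the name `sigma_x` is only an illustrative choice.

```python
x = [12, 15, 11, 11, 7, 13]

# Σx written out term by term: x1 + x2 + ... + xn
sigma_x = 0
for value in x:
    sigma_x += value

# Python's built-in sum() is the shorthand, just as Σ is on paper.
assert sigma_x == sum(x)
print(sigma_x)
```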
The Mean
• Given a sample of n data points, $x_1, x_2, x_3, \ldots, x_n$,
the formula for the mean
or average is given below.

$$\bar{x} = \frac{\sum x}{n}$$

where $\sum x$ is the sum of the data points and n is the number of data points.
Inferential mean of a sample: $\bar{X} = (\sum x)/n$
Mean of a population: $\mu = (\sum x)/N$
The Median
• Because the mean can be sensitive to
extreme values, the median is sometimes
a more useful and more representative measure.

• The median is simply the middle value
among the scores of a variable.
The Median
• The median is the middle value of a distribution of data.
• How do you find the median?
• First, if possible, arrange the data from smallest value to
largest value.
• The location of the median can be calculated using this
formula: (n+1)/2.
• If (n+1)/2 is a whole number then that value gives the
location. Just report the value of that location as the
median.
• If (n+1)/2 is not a whole number, then the first whole
number less than the location value and the first whole
number greater than the location value are used to
calculate the median. Take the data located at those 2
positions and calculate their average; this is the median.
63 73 84 86 88 95 97 97 100

The median is 88.

Half the numbers are less than the median.
Half the numbers are greater than the median.


63 73 84 88 95 97 97 100

88 + 95 = 183

183 ÷ 2 = 91.5

The median is 91.5.
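The (n+1)/2 location rule can be sketched in Python as follows. The function name `median_by_location` is hypothetical, and the sketch assumes a non-empty list of numbers.

```python
import math

def median_by_location(values):
    """Median via the (n + 1) / 2 location rule, using 1-based positions."""
    ordered = sorted(values)            # arrange from smallest to largest
    location = (len(ordered) + 1) / 2   # may be a whole number or end in .5
    lower = math.floor(location)        # whole-number position at or below the location
    upper = math.ceil(location)         # whole-number position at or above the location
    # If the location is whole, lower == upper and the average is that value itself.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

print(median_by_location([63, 73, 84, 86, 88, 95, 97, 97, 100]))  # 9 values, location 5
print(median_by_location([63, 73, 84, 88, 95, 97, 97, 100]))      # 8 values, location 4.5
```

For the two data sets above this gives 88 and 91.5, matching the worked examples.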
The Mode
• The most frequent response or value for a
variable.
• In a frequency distribution graph, the
mode is the score corresponding to the
peak or highest point of the distribution.
• Multiple modes are possible: bimodal or
multimodal data.
The Mode
• The mode is the most frequent number in a
collection of data.
• Example A: 3, 10, 8, 8, 7, 8, 10, 3, 3, 3
• The mode of the above example is 3, because 3
has a frequency of 4.
• Example B: 2, 5, 1, 5, 1, 2
• This example has no mode because 1, 2, and 5
all have the same frequency of 2.
• Example C: 5, 7, 9, 1, 7, 5, 0, 4
• This example has two modes, 5 and 7. This is
said to be bimodal.
63 73 84 86 88 95 97 97 100

The value 97 appears twice.


All other numbers appear just once.

97 is the MODE
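The mode examples above can be reproduced with a small counting function. This is a sketch under one assumption taken from Example B: when every distinct value occurs equally often (and there is more than one distinct value), the data set is treated as having no mode. The function name `modes` is hypothetical, and a non-empty list is assumed.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes, or [] when all values are equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention from Example B: all distinct values tie, so there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 10, 8, 8, 7, 8, 10, 3, 3, 3]))           # Example A
print(modes([2, 5, 1, 5, 1, 2]))                         # Example B
print(modes([5, 7, 9, 1, 7, 5, 0, 4]))                   # Example C
print(modes([63, 73, 84, 86, 88, 95, 97, 97, 100]))      # last example
```

Note that other conventions exist; some libraries report every tied value as a mode instead of returning nothing.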
Central Tendency and the
Shape of the Distribution
• Because the mean, the median, and the mode are all measuring
central tendency, the three measures are often systematically
related to each other.
• In a symmetrical distribution, the mean and median will always be
equal.
• If a symmetrical distribution has only one mode, then the mode,
mean, and median will all have the same value.
• In a skewed distribution, the mode will be located at the peak on one
side and the mean usually will be displaced toward the tail on the
other side.
• The median is usually located between the mean and the mode.
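[Figure: three distribution curves, each on a horizontal axis running from 10 to 90. In the symmetrical distribution the mean, median, and mode coincide. In the two skewed distributions the median lies between the mean and the mode: one is ordered mean, median, mode and the other mode, median, mean.]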
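The ordering of the three measures in a skewed distribution can be seen on a small made-up data set. The values below are hypothetical and chosen only so that a few large values form a tail on the right; they do not come from the slides.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mode_value = statistics.mode(skewed)      # most frequent value (the peak)
median_value = statistics.median(skewed)  # middle of the ordered data
mean_value = statistics.mean(skewed)      # pulled toward the long tail

print(mode_value, median_value, mean_value)
```

Here the mode is the smallest of the three, the mean is the largest, and the median falls between them, which is the pattern described in the bullets above.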
Shape of the distribution:
Skewness
• A measure of the lack of symmetry, or the
lopsidedness of a distribution.
• For a skewed distribution, use the median as the measure of central tendency.
Shape of Distribution:
Kurtosis
How flat or peaked a distribution appears.
(Does not affect the central tendency)

[Figure: three curves labelled Leptokurtic, Mesokurtic (Normal Distribution), and Platykurtic.]
Measures of Dispersion
• Measures of dispersion tell us about
variability in the data.
• Basic question: how much do the values of a
variable differ from the minimum to the maximum, and
how far apart are the scores in between?
• We use:
– Range
– Variance
– Standard Deviation
• Measures of dispersion give us information
about how much our variables vary from the
mean of the data.
• Dispersion is also known as the spread or
range of variability.
• Ask the following questions:
1. Where is the middle of the distribution?
2. How wide is the distribution?
3. What is the shape of the distribution?
The Range
• r = h – l,
where h is the highest value and l is the lowest value.
• In other words, the range gives us the
difference between the minimum and
maximum values of a variable.
• Understanding this statistic is important in
understanding your data, especially for
management and diagnostic purposes.
63 73 84 86 88 95 97 97

97 – 63 = 34

34 is the RANGE, or spread, of this set of data.
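As a quick check, the range formula r = h – l can be computed directly. The variable names `high` and `low` stand in for h and l and are only illustrative.

```python
data = [63, 73, 84, 86, 88, 95, 97, 97]

high = max(data)         # h: the highest value
low = min(data)          # l: the lowest value
data_range = high - low  # r = h - l

print(high, low, data_range)  # prints: 97 63 34
```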
Variance: a measure of how data
points differ from the mean
• Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7

What are the mean and median of the above data sets?

Data Set 1: mean = 7, median = 7

Data Set 2: mean = 7, median = 7

But we know that the two data sets are not identical!
The variance shows how they are different.
We want to find a way to represent the difference between these two data sets
numerically.
How to Calculate?
• If we conceptualize the spread of a distribution
as the extent to which the values in the
distribution differ from the mean and from each
other, then a reasonable measure of spread
might be the average deviation, or difference, of
the values from the mean:

$$\frac{\sum (x - \bar{X})}{N}$$
• Although this might seem reasonable, this expression
always equals 0, because the negative deviations about the
mean always cancel out the positive deviations about the
mean.
• We could just drop the negative signs, which is
mathematically the same as taking the absolute value of each deviation;
the resulting average is known as the mean deviation.
• The concept of absolute value does not lend itself to the kind
of advanced mathematical manipulation necessary for the
development of inferential statistical formulas.
• The average of the squared deviations about the mean is
called the variance.

For population variance:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

For sample variance:

$$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$$
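The argument in the bullets above, that signed deviations cancel while absolute deviations do not, can be checked on the two data sets introduced earlier. This is a minimal sketch; the helper name `deviations` is hypothetical.

```python
def deviations(values):
    """Signed deviation of each value from the mean of the data."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

data_sets = {
    "Data Set 1": [3, 5, 7, 10, 10],
    "Data Set 2": [7, 7, 7, 7, 7],
}

results = {}
for name, data in data_sets.items():
    devs = deviations(data)
    signed_sum = sum(devs)                                # positives and negatives cancel
    mean_abs_dev = sum(abs(d) for d in devs) / len(devs)  # the mean deviation
    results[name] = (signed_sum, mean_abs_dev)
    print(name, signed_sum, mean_abs_dev)
```

Both signed sums come out as zero, so they cannot tell the two data sets apart, whereas the mean deviation is positive for Data Set 1 and zero for Data Set 2.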
MEASURES OF VARIABILITY
POPULATION VARIANCE

• The population variance is the mean squared
deviation from the population mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

• Where $\sigma^2$ stands for the population variance
• $\mu$ is the population mean
• N is the total number of values in the population
• $x_i$ is the value of the i-th observation.
• $\sum$ represents a summation
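A direct translation of the population variance formula is sketched below, using Data Set 1 from the earlier slide and treating it as a complete population (an assumption made only for illustration). The result is compared with `statistics.pvariance` from the Python standard library.

```python
import statistics

def population_variance(values):
    """Mean squared deviation from the population mean (divides by N)."""
    N = len(values)
    mu = sum(values) / N
    return sum((x_i - mu) ** 2 for x_i in values) / N

population = [3, 5, 7, 10, 10]   # Data Set 1, treated here as a whole population

by_formula = population_variance(population)
by_library = statistics.pvariance(population)
print(by_formula, by_library)
```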
MEASURES OF VARIABILITY
SAMPLE VARIANCE

• The sample variance is defined as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• Where $s^2$ stands for the sample variance
• $\bar{x}$ is the sample mean
• n is the total number of values in the sample
• $x_i$ is the value of the i-th observation.
• $\sum$ represents a summation
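The only change from the population formula is the divisor n - 1. The sketch below applies it to the same five values, now treated as a sample (again an assumption for illustration), and compares the result with `statistics.variance`. It needs at least two values, since n - 1 would otherwise be zero.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x_i - x_bar) ** 2 for x_i in values) / (n - 1)

sample = [3, 5, 7, 10, 10]   # same values as Data Set 1, treated as a sample

s_squared = sample_variance(sample)
library_value = statistics.variance(sample)
print(s_squared, library_value)
```

Because the divisor is smaller, the sample variance of a data set is always at least as large as the population variance computed from the same values.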
An example related to deviation
about the central value
• There are five exam scores below:
584, 613, 622, 693, 755.
• The mean is
(584+613+622+693+755)/5 = 653.4
• The deviation for each score can be computed
by subtracting the mean from each score:
755-653.4 = 101.6
693-653.4 = 39.6
622-653.4 = -31.4
613-653.4 = -40.4
584-653.4 = -69.4
These deviations may be summarized by a collective measure that
considers each deviation: the average of the squared deviations.
With the previous data, this procedure results in

$$\frac{(101.6)^2 + (-40.4)^2 + (-69.4)^2 + (39.6)^2 + (-31.4)^2}{5} = \frac{19325.2}{5} = 3865.04$$
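The arithmetic in this example can be verified with the short sketch below. It divides by 5, as the slide does, which treats the five scores as a population. The final square-root step is an addition for illustration: it gives the standard deviation, which is listed among the measures of dispersion but is not worked out in the example.

```python
import math

scores = [584, 613, 622, 693, 755]

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]          # score minus mean, one per score
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / len(scores)          # divides by 5, as in the example
std_dev = math.sqrt(variance)                    # standard deviation (illustrative addition)

print(round(mean, 1), round(sum_of_squares, 1), round(variance, 2))
print(round(std_dev, 2))
```

Rounding is used in the printout because floating-point arithmetic can leave tiny errors in the last digits.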
