
Dr. Dalia El-Shafei
Assistant Professor, Community Medicine Department, Zagazig University
STATISTICS

 Statistics is the science of dealing with numbers.

 It is used for the collection, summarization, presentation and analysis of data.

 It provides a way of organizing data to obtain information on a wider and more formal (objective) basis than relying on personal experience (subjective).

Collection → Summarization → Presentation → Analysis


USES OF MEDICAL STATISTICS:
 Planning, monitoring & evaluating community health care programs.

 Epidemiological research studies.

 Diagnosis of community health problems.

 Comparison of health status & diseases in different countries, and in one country over the years.

 Forming standards for different biological measurements, such as weight and height.

 Differentiating between diseased & normal groups.

TYPES OF STATISTICS
• Describe or summarize the data • Use data to make inferences or
of a target population. generalizations about population.
• Describe the data which is • Make conclusions for population
already known. that is beyond available data.
• Organize, analyze & present • Compare, test and predicts future
data in a meaningful manner. outcomes.
• Final results are shown in • Final results is the probability
forms of tables and graphs. scores.
• Tools: measures of central • Tools: hypothesis tests
tendency & dispersion.

Descriptiv Inferential
e
TYPES OF DATA

Quantitative:
• Continuous (decimals allowed): weight, height, hemoglobin level.
• Discrete (no decimals): number of hospitals, number of patients.

Qualitative:
• Categorical: blood groups, male & female, black & white.
• Ordinal: has levels, such as low, moderate, high.
SOURCES OF DATA COLLECTION
• Primary (1ry) sources.
• Secondary (2ry) sources.

PRESENTATION OF DATA
• Tabular presentation.
• Graphical presentation.

Graphic presentations usually accompany tables to illustrate & clarify information. Tables are essential in the presentation of scientific data, & diagrams are complementary, summarizing these tables in an easy way.
TABULATION

 The basic form of presentation.

• The table must be self-explanatory.

• Title: written at the top of the table to define precisely the content, the place and the time.

• Clear headings for the columns & rows.

• Units of measurement should be indicated.

• The size of the table depends on the number of classes: "2-10 rows or classes".
TYPES OF TABLES

• List
• Frequency distribution table
LIST

The number of patients in each hospital department:

Medicine        100 patients
Surgery         80 patients
ENT             28 patients
Ophthalmology   30 patients
FREQUENCY DISTRIBUTION TABLE

Assume we have a group of 20 individuals whose blood groups were as follows: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. We want to present these data in a table.

Distribution of the studied individuals according to blood group:
These are the blood pressure measurements of 30 patients with hypertension. Present these data in a frequency table: 150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180, 156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160.

Frequency distribution of blood pressure measurements among the studied patients:

Blood pressure "mmHg"   Number   %
150-                    6        20
160-                    6        20
170-                    8        26.7
180-                    6        20
190-                    3        10
200-                    1        3.3
Total                   30       100
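The counting above can be reproduced mechanically. A minimal Python sketch, using the standard-library `collections.Counter` to assign each reading to its 10-mmHg class:

```python
from collections import Counter

# Blood pressure readings of the 30 hypertensive patients from the example.
readings = [150, 155, 160, 154, 162, 170, 165, 155, 190, 186,
            180, 178, 195, 200, 180, 156, 173, 188, 173, 189,
            190, 177, 186, 177, 174, 155, 164, 163, 172, 160]

# Assign each reading to the lower bound of its 10-mmHg class (150-, 160-, ...).
classes = Counter((r // 10) * 10 for r in readings)

for lower in sorted(classes):
    n = classes[lower]
    print(f"{lower}-  n={n}  {100 * n / len(readings):.1f}%")
```

Running it reproduces the table: 6, 6, 8, 6, 3 and 1 readings in the six classes.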
GRAPHICAL PRESENTATION

 Simple and easy to understand.

 Saves a lot of words.

 Self-explanatory.

 Has a clear title indicating its content, "written under the graph".

 Fully labeled.

 The y-axis (vertical) is usually used for frequency.


Graphs:
• Bar chart
• Pie diagram
• Histogram
• Scatter diagram
• Line graph
• Frequency polygon
BAR CHART
 Used for presenting discrete or qualitative data.
 A graphical presentation of magnitude (value or percentage) by rectangles of constant width, with lengths proportional to the frequency & separated by gaps.
 Types: simple, multiple & component.
SIMPLE BAR CHART
MULTIPLE BAR CHART
Percentage of Persons Aged ≥18 Years Who Were Current Smokers,
by Age and Sex — United States, 2002
COMPONENT BAR CHART
PIE DIAGRAM
 Consists of a circle whose area represents the total frequency (100%), divided into segments.
 Each segment represents a proportional share of the total frequency.
HISTOGRAM
• Very similar to the bar chart, with the difference that the rectangles or bars are adherent (without gaps).
• Used for presenting a class frequency table (continuous data).
• Each bar represents a class: its height represents the frequency (number of cases) and its width represents the class interval.
SCATTER DIAGRAM

 Useful for representing the relationship between 2 numeric measurements; each observation is represented by a point corresponding to its value on each axis.
LINE GRAPH
• A diagram showing the relationship between two numeric variables (like the scatter diagram), but the points are joined together to form a line (either a broken line or a smooth curve).

FREQUENCY POLYGON
 Derived from a histogram by connecting the midpoints of the tops of the rectangles in the histogram.
 The line connecting the centers of the histogram rectangles is called the frequency polygon. We can draw the polygon without the rectangles, giving a simpler form of line graph.
 A special type of frequency polygon is the "Normal Distribution Curve".
NORMAL DISTRIBUTION CURVE
"GAUSSIAN DISTRIBUTION CURVE"
The NDC is the frequency polygon of a quantitative continuous variable measured in a large number of individuals. It is a form of presentation of the frequency distribution of biologic variables: "weights, heights, hemoglobin level and blood pressure".

CHARACTERISTICS OF THE CURVE:
 Bell-shaped, continuous curve.

 Symmetrical, i.e. can be divided vertically into 2 equal halves.

 Tails never touch the base line but extend to infinity in either direction.

 The mean, median and mode values coincide.

 Described by 2 parameters: the arithmetic mean (X), "the location of the center of the curve", & the standard deviation (SD), "the scatter around the mean".
AREAS UNDER THE NORMAL CURVE:
X ± 1 SD covers ≈68% of the area under the curve.
X ± 2 SD covers ≈95% of the area under the curve.
X ± 3 SD covers ≈99.7% of the area under the curve.
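These areas follow directly from the normal distribution's cumulative function. A small sketch using only the standard library (`math.erf` gives the fraction of a normal distribution within k standard deviations of the mean):

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within mean +/- k SD."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean ± {k} SD covers {100 * area_within(k):.1f}% of the area")
```

This prints the familiar 68.3%, 95.4% and 99.7% figures (often rounded to 68%, 95% and 99.7% in teaching).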
SKEWED DATA
If we represent collected data by a frequency polygon & the resulting curve does not resemble the NDC (with all its characteristics), then these data are
"not normally distributed".
"The curve may be skewed to the right or to the left side."

CAUSES OF A SKEWED CURVE
The data collected come from:

 A heterogeneous group.
 A diseased or abnormal population.

So the results obtained from these data cannot be applied or generalized to the whole population. The NDC can be used to distinguish normal from abnormal measurements.

Example:
Suppose we have the NDC for Hb levels in a population of normal adult males, with mean ± SD = 11 ± 1.5.

If we obtain a Hb reading of 8.1 for an individual & want to know whether he is normal or anemic: if this reading lies within the area under the curve covering 95% of normals (i.e. mean ± 2 SD), he will be considered normal. If his reading is lower, he is anemic.

• The normal range for Hb in this example will be:
Higher Hb level: 11 + 2 (1.5) = 14.
Lower Hb level: 11 - 2 (1.5) = 8.
i.e. the normal Hb range of adult males is from 8 to 14.

Our reading (8.1) lies within the 95% range of this population, so this individual is considered normal.
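The mean ± 2 SD check above can be sketched in a few lines of Python (the variable names are illustrative, not from the slides):

```python
# Population parameters for Hb in normal adult males (from the example).
mean_hb, sd_hb = 11.0, 1.5

# Normal range = mean +/- 2 SD, covering ~95% of normal individuals.
lower, upper = mean_hb - 2 * sd_hb, mean_hb + 2 * sd_hb   # 8.0 .. 14.0

reading = 8.1
status = "normal" if lower <= reading <= upper else "outside the normal range"
print(f"Normal Hb range: {lower}-{upper} g/dL; a reading of {reading} is {status}")
```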
DATA SUMMARIZATION
Mean

Data summarization
Measures of
Mode
Central tendency

Median

Range

Variance
Measures of
Dispersion Standard
deviation
Coefficient of
variation
Mean

Data summarization
Measures of
Mode
Central tendency

Median

Range

Variance
Measures of
Dispersion Standard
deviation
Coefficient of
variation
ARITHMETIC MEAN
The sum of the observations divided by the number of observations:

x̄ = ∑x / n

x̄ = the mean
∑ denotes "the sum of"
x = the values of the observations
n = the number of observations

For frequency distribution data we calculate the mean by:

x̄ = ∑fx / ∑f

where f is the frequency of each value.

 If the data are presented in a frequency table with class intervals, we calculate the mean by the same equation, but using the midpoint of each class interval.
MEDIAN
 The middle observation in a series of observations after arranging them in ascending or descending order.

Rank of the median:
• Odd n: the (n + 1)/2-th observation.
• Even n: the average of the (n/2)-th and (n/2 + 1)-th observations.
MODE

The most frequently occurring value in the data.

ADVANTAGES & DISADVANTAGES OF THE MEASURES OF CENTRAL TENDENCY:
• Mean: usually preferred, since it takes into account each individual observation. Its main disadvantage is that it is affected by the value of extreme observations.
• Median: a useful descriptive measure if there are one or two extremely high or low values.
• Mode: seldom used.
MEASURES OF DISPERSION

Describe the degree of variation or scatter of the data around its central values (dispersion = variation = spread = scatter).

RANGE
 The difference between the largest & smallest values.
 The simplest measure of variation.
 It can be expressed as an interval such as 4-10, where 4 is the smallest value & 10 the highest. But often it is expressed as the interval width: for example, the range of 4-10 can also be expressed as a range of 6.
 Disadvantages: it is based only on the two extreme values, so it ignores the rest of the data & is strongly affected by outliers.
VARIANCE
• To get the average of the differences between the mean & each observation in the data, we would subtract each value from the mean, sum these differences and divide by the number of observations:

V = ∑(x̄ - x) / n

• The value of this expression is always zero, because the differences between each value & the mean carry negative and positive signs that cancel out on algebraic summation.

• To overcome this we square the difference between the mean & each value, so the sign is always positive. Thus we get:

V = ∑(x̄ - x)² / (n - 1)
STANDARD DEVIATION "SD"

The main disadvantage of the variance is that it is in the square of the units used. So it is more convenient to express the variation in the original units by taking the square root of the variance. This is called the standard deviation (SD). Therefore:

SD = √V, i.e. SD = √[∑(x̄ - x)² / (n - 1)]
COEFFICIENT OF VARIATION "CV"

• The coefficient of variation expresses the standard deviation as a percentage of the sample mean.

• CV is useful when we are interested in the relative size of the variability in the data.

• Example:
If we have the observations 5, 7, 10, 12 and 16, their mean will be 50/5 = 10.
SD = √[(25 + 9 + 0 + 4 + 36) / (5 - 1)] = √(74/4) = 4.3
CV = 4.3 / 10 × 100 = 43%
Another set of observations is 2, 2, 5, 10, and 11. Their mean = 30/5 = 6.
SD = √[(16 + 16 + 1 + 16 + 25) / (5 - 1)] = √(74/4) = 4.3
CV = 4.3 / 6 × 100 = 71.6%

Both sets of observations have the same SD, but they differ in CV: the data in the 1st group are homogeneous (so the CV is not high), while the data in the 2nd set are heterogeneous (so the CV is high).
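The two worked examples can be checked with the standard-library `statistics` module (a sketch; `cv` is a helper name introduced here, not a library function):

```python
import statistics

def cv(data):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

group1 = [5, 7, 10, 12, 16]
group2 = [2, 2, 5, 10, 11]

print(f"group 1: SD = {statistics.stdev(group1):.1f}, CV = {cv(group1):.1f}%")
print(f"group 2: SD = {statistics.stdev(group2):.1f}, CV = {cv(group2):.1f}%")
```

Both groups share SD ≈ 4.3, yet their CVs (≈43% vs ≈71.7%) differ, exactly as the example argues.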
• Example: In a study where age was recorded, the observed values were: 6, 8, 9, 7, 6, and the number of observations was 5.
• Calculate the mean, SD, range, mode and median.

Mean = (6 + 8 + 9 + 7 + 6) / 5 = 7.2   Median = 7   Mode = 6

Range = 9 - 6 = 3

Variance = [(7.2 - 6)² + (7.2 - 8)² + (7.2 - 9)² + (7.2 - 7)² + (7.2 - 6)²] / (5 - 1)
         = [(1.2)² + (-0.8)² + (-1.8)² + (0.2)² + (1.2)²] / 4 = 1.7

SD = √1.7 = 1.3
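All of these summary measures are one-liners with the standard-library `statistics` module; a sketch verifying the worked example:

```python
import statistics

ages = [6, 8, 9, 7, 6]

mean = statistics.mean(ages)          # 7.2
median = statistics.median(ages)      # 7
mode = statistics.mode(ages)          # 6
rng = max(ages) - min(ages)           # 3
variance = statistics.variance(ages)  # sample variance, divides by n-1 -> 1.7
sd = statistics.stdev(ages)           # sqrt(1.7) ~ 1.3

print(mean, median, mode, rng, variance, round(sd, 1))
```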


INFERENTIAL STATISTICS
TYPES OF STATISTICS
• Describe or summarize the data • Use data to make inferences or
of a target population. generalizations about population.
• Describe the data which is • Make conclusions for population
already known. that is beyond available data.
• Organize, analyze & present • Compare, test and predicts future
data in a meaningful manner. outcomes.
• Final results are shown in • Final results is the probability
forms of tables and graphs. scores.
• Tools: measures of central • Tools: hypothesis tests
tendency & dispersion.

Descriptiv Inferential
e
INFERENCE
Inference involves making a generalization about a larger group of individuals on the basis of a subset or sample.

HYPOTHESIS TESTING
Aims to find out whether the observed variation between groups is explained by sampling variation (chance) or is really a difference between the groups.

The method of assessing hypothesis testing is known as a "significance test".

Significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect.
NULL & ALTERNATIVE HYPOTHESES:
 In hypothesis testing, a specific hypothesis is formulated & data are collected to accept or reject it.
 The null hypothesis, H0: x̄1 = x̄2, means that there is no difference between x̄1 & x̄2.
 If we reject the null hypothesis, i.e. there is a difference between the 2 readings, the alternative is either H1: x̄1 < x̄2 or H1: x̄1 > x̄2.
 In other words, the null hypothesis is rejected because x̄1 is different from x̄2.
GENERAL PRINCIPLES OF TESTS OF SIGNIFICANCE

 Set up a null hypothesis and its alternative.

 Find the value of the test statistic.

 Refer the value of the test statistic to a known distribution, which it would follow if the null hypothesis were true.

 Conclude that the data are consistent or inconsistent with the null hypothesis.

 If the data are not consistent with the null hypothesis, the difference is said to be "statistically significant".
 If the data are consistent with the null hypothesis, we accept it, i.e. the difference is statistically insignificant.

 In medicine, we usually consider differences significant if the probability is <0.05.
 This means that if the null hypothesis is true, we shall make a wrong decision fewer than 5 times in 100.
TESTS OF SIGNIFICANCE

Quantitative variables:
• 2 means, large sample "≥60": z test.
• 2 means, small sample "<60": t-test (paired t-test for paired data).
• >2 means: ANOVA.

Qualitative variables:
• X² test.
• Z test (for percentages).
COMPARING TWO MEANS OF LARGE SAMPLES USING THE NORMAL DISTRIBUTION (Z TEST, OR SND: STANDARD NORMAL DEVIATE)

 If we have a large sample size "≥60" & the data follow a normal distribution, we use the z-test.

 z = (population mean - sample mean) / SD

 If the result is z > 2, the difference is significant.

 The normal range for any biological reading lies between the mean value of the population reading ± 2 SD (this includes 95% of the area under the normal distribution curve).
COMPARING TWO MEANS OF SMALL SAMPLES USING THE T-TEST
 If we have a small sample size (<60), we use the t distribution instead of the normal distribution.

Degrees of freedom = (n1 + n2) - 2

 The value of t is compared to the values in the "t distribution" table at that number of degrees of freedom.

 If the t-value is less than the table value, the difference between the samples is insignificant.
 If the t-value is larger than the table value, the difference is significant, i.e. the null hypothesis is rejected.
 A big t-value corresponds to a small P-value, i.e. statistical significance.
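A minimal sketch of the pooled two-sample t statistic (equal-variance form, as in the classic t-test) using only the standard library; the sample values are hypothetical, chosen for illustration:

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic and its degrees of freedom
    (assumes equal variances in the two groups)."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(a) +
           (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical small samples of some laboratory measurement.
t, df = two_sample_t([5.1, 5.4, 5.8, 6.0], [6.2, 6.5, 6.9, 7.1])
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting |t| would then be compared against the t-table value at that df, exactly as described above.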
PAIRED T-TEST:
If we are comparing repeated observations in the same individuals, or the difference between paired data, we use the paired t-test, where the analysis is carried out using the mean and standard deviation of the difference between each pair:

Paired t = mean of the differences / √(SD² of the differences / number of pairs)
d.f. = n - 1
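The paired formula above can be sketched directly; the before/after blood pressure values here are hypothetical:

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic: mean difference over its standard error; df = n - 1."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / n)
    return t, n - 1

# Hypothetical blood pressure in the same 5 patients before and after treatment.
before = [150, 160, 155, 148, 162]
after = [145, 152, 150, 146, 158]
t, df = paired_t(before, after)
print(f"paired t = {t:.2f} with {df} degrees of freedom")
```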
ANALYSIS OF VARIANCE "ANOVA"

• One-way ANOVA: the subgroups to be compared are defined by just one factor, e.g. a comparison between the means of different socio-economic classes.
• Two-way ANOVA: the subdivision is based upon more than one factor.

 The main idea in ANOVA is that we take into account the variability within the groups & between the groups. The F-value is the ratio between the mean sum of squares between the groups & within the groups:

F = between-groups MS / within-groups MS
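A one-way ANOVA F ratio can be computed by hand from that definition. A sketch with three hypothetical groups (standing in for, say, three socio-economic classes):

```python
import statistics

def one_way_f(*groups):
    """One-way ANOVA F = between-groups mean square / within-groups mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-groups sum of squares: group means vs. grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: each value vs. its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical measurements from three groups.
f = one_way_f([4, 5, 6], [7, 8, 9], [10, 11, 12])
print(f"F = {f:.1f}")
```

A large F (relative to the F-table value at k-1 and n-k degrees of freedom) indicates that the group means differ more than chance alone would explain.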
CHI-SQUARED TEST

 Qualitative data are arranged in a table formed by rows & columns: one variable defines the rows & the categories of the other variable define the columns.

 A chi-squared test is used to test whether there is an association between the row variable & the column variable or, in other words, whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other.

O = observed value in the table
E = expected value
Expected (E) = (row total × column total) / grand total

Degrees of freedom = (rows - 1) × (columns - 1)
EXAMPLE: HYPOTHETICAL STUDY

 Two groups of patients are treated using different spinal manipulation techniques: Gonstead vs. Diversified.

 The presence or absence of pain after treatment is the outcome measure.

 Two categories:
  Technique used.
  Pain after treatment.
GONSTEAD VS. DIVERSIFIED EXAMPLE - RESULTS

Pain after treatment:

Technique      Yes   No   Row total
Gonstead       9     21   30
Diversified    11    29   40
Column total   20    50   70 (grand total)

9 out of 30 (30%) still had pain after Gonstead treatment and 11 out of 40 (27.5%) still had pain after Diversified, but is this difference statistically significant?
FIRST FIND THE EXPECTED VALUES FOR EACH CELL

 Expected (E) = (row total × column total) / grand total
 To find E for cell a (and similarly for the rest): multiply its row total by its column total, then divide by the grand total.

Evidence-based Chiropractic

 E for all cells:

Technique      Pain: Yes                  Pain: No                   Row total
Gonstead       O = 9, E = 30×20/70 = 8.6  O = 21, E = 30×50/70 = 21.4  30
Diversified    O = 11, E = 40×20/70 = 11.4  O = 29, E = 40×50/70 = 28.6  40
Column total   20                         50                         70 (grand total)
 Use the X² formula, (O - E)²/E, for each cell and then add them together:

(9 - 8.6)²/8.6 = 0.0186    (21 - 21.4)²/21.4 = 0.0075
(11 - 11.4)²/11.4 = 0.0140   (29 - 28.6)²/28.6 = 0.0056

X² = 0.0186 + 0.0075 + 0.0140 + 0.0056 ≈ 0.046

o Find the df and then consult a X² table to see if the result is statistically significant.
o Degrees of freedom = (rows - 1) × (columns - 1).
o There are two categories for each variable in this case, so df = 1.
o The critical value at the 0.05 level with one df is 3.84.
o Therefore, X² is not statistically significant.
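The whole calculation can be sketched in a few lines of plain Python. With unrounded expected values the statistic comes out ≈0.05 (the hand calculation with one-decimal expecteds gives a slightly different figure), still far below the 3.84 critical value:

```python
# Observed counts from the Gonstead vs. Diversified table.
observed = [[9, 21],    # Gonstead: pain yes / pain no
            [11, 29]]   # Diversified

row_totals = [sum(row) for row in observed]        # [30, 40]
col_totals = [sum(col) for col in zip(*observed)]  # [20, 50]
grand = sum(row_totals)                            # 70

# Chi-squared = sum over cells of (O - E)^2 / E, with E = row*col/grand.
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i in range(2) for j in range(2))

df = (2 - 1) * (2 - 1)
print(f"chi-squared = {chi2:.3f}, df = {df}")  # well below the 3.84 critical value
```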
Z TEST FOR COMPARING TWO PERCENTAGES

z = (p1 - p2) / √(p1q1/n1 + p2q2/n2)

p1 = % in the 1st group, p2 = % in the 2nd group
q1 = 100 - p1, q2 = 100 - p2
n1 = sample size of the 1st group
n2 = sample size of the 2nd group
The z test is significant (at the 0.05 level) if the result is >2.

Example:
The number of anemic patients in group 1, which includes 50 patients, is 5, & the number of anemic patients in group 2, which contains 60 patients, is 20.

To find out whether groups 1 & 2 differ statistically in the prevalence of anemia, we calculate the z test:

p1 = 5/50 = 10%, p2 = 20/60 = 33%
q1 = 100 - 10 = 90, q2 = 100 - 33 = 67
z = (10 - 33) / √(10×90/50 + 33×67/60)
z = 23 / √(18 + 36.85) = 23 / 7.4 = 3.1

Therefore there is a statistically significant difference between the percentages of anemia in the studied groups (because z > 2).
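The anemia example can be checked with a short helper (the function name is illustrative; percentages are kept on the 0-100 scale to match the formula above):

```python
import math

def z_two_percentages(x1, n1, x2, n2):
    """z = (p1 - p2) / sqrt(p1*q1/n1 + p2*q2/n2), with p and q in percent."""
    p1, p2 = 100 * x1 / n1, 100 * x2 / n2
    q1, q2 = 100 - p1, 100 - p2
    return abs(p1 - p2) / math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)

# 5 anemic of 50 patients in group 1 vs. 20 anemic of 60 patients in group 2.
z = z_two_percentages(5, 50, 20, 60)
print(f"z = {z:.1f}")  # > 2, so the difference is significant
```

(The hand calculation rounds p2 to 33%; keeping the exact 33.3% gives z ≈ 3.1 as well.)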
CORRELATION & REGRESSION
Correlation measures the closeness of the association between 2 continuous variables, while linear regression gives the equation of the straight line that best describes & enables the prediction of one variable from the other.

CORRELATION
A t-test for correlation is used to test the significance of the association.
CORRELATION IS NOT CAUSATION!!!
LINEAR REGRESSION

Same as correlation:
• Determines the relation & predicts the change in one variable due to changes in the other variable.
• The t-test is also used for assessment of the level of significance.

Differs from correlation:
• The independent factor has to be specified, as distinct from the dependent variable.
• The dependent variable in linear regression must be a continuous one.
• Allows prediction of the dependent variable for a particular value of the independent variable, "but should not be used outside the range of the original data".

SCATTERPLOTS

 An X-Y graph with symbols that represent the values of two variables.

 The regression line can be drawn through the cloud of points.
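A least-squares regression line and Pearson's r can be computed from the same sums. A self-contained sketch on hypothetical gestational-age vs. birth-weight data (the numbers are invented for illustration):

```python
import math

def linear_regression(x, y):
    """Least-squares slope and intercept, plus Pearson's correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: gestational age (weeks) vs. birth weight (kg).
weeks = [36, 37, 38, 39, 40]
weight = [2.6, 2.9, 3.0, 3.2, 3.4]
slope, intercept, r = linear_regression(weeks, weight)
print(f"weight = {intercept:.2f} + {slope:.2f} * weeks, r = {r:.2f}")
```

The fitted line predicts the dependent variable within the observed range of x; as noted above, it should not be extrapolated beyond the original data.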
MULTIPLE REGRESSION
 Models the dependency of a dependent variable on several independent variables, not just one.
 The test of significance used is the ANOVA (F test).
For example: neonatal birth weight depends on these factors: gestational age, length of the baby and head circumference. Each factor correlates significantly with baby birth weight (i.e. has a +ve linear correlation). We can do a multiple regression analysis to obtain a mathematical equation by which we can predict the birth weight of any neonate if we know the values of these factors.
