
Dr. Dalia El-Shafei
Assistant Professor, Community Medicine Department, Zagazig University
STATISTICS

 Statistics is the science of dealing with numbers.

 It is used for the collection, summarization, presentation and analysis of data.

 It provides a way of organizing data to obtain information on a wider and more formal (objective) basis than relying on personal experience (subjective).

Collection → Summarization → Presentation → Analysis


USES OF MEDICAL STATISTICS:
 Planning, monitoring & evaluating community health care programs.

 Epidemiological research studies.

 Diagnosis of community health problems.

 Comparison of health status & diseases in different countries, and in one country over the years.

 Forming standards for different biological measurements, such as weight and height.

 Differentiating between diseased & normal groups.

TYPES OF STATISTICS
• Describe or summarize the data • Use data to make inferences or
of a target population. generalizations about population.
• Describe the data which is • Make conclusions for population
already known. that is beyond available data.
• Organize, analyze & present • Compare, test and predicts future
data in a meaningful manner. outcomes.
• Final results are shown in • Final results is the probability
forms of tables and graphs. scores.
• Tools: measures of central • Tools: hypothesis tests
tendency & dispersion.

Descriptiv Inferential
e
TYPES OF DATA

Quantitative:
• Continuous (decimals allowed): weight, height, hemoglobin level.
• Discrete (no decimals): number of hospitals, number of patients.

Qualitative:
• Categorical: blood groups, male & female, black & white.
• Ordinal: has levels, such as low, moderate, high.
SOURCES OF DATA COLLECTION
• Primary (1ry) sources.
• Secondary (2ry) sources.

PRESENTATION OF DATA
• Tabular presentation.
• Graphical presentation.

Graphic presentations usually accompany tables to illustrate & clarify information. Tables are essential in the presentation of scientific data, & diagrams are complementary, summarizing these tables in an easy way.
TABULATION

 The basic form of presentation.

• The table must be self-explanatory.

• Title: written at the top of the table to define precisely the content, the place and the time.

• Clear headings for the columns & rows.

• Units of measurement should be indicated.

• The size of the table depends on the number of classes: "2-10 rows or classes".
TYPES OF TABLES

• List
• Frequency distribution table
LIST

The number of patients in each hospital department:

Medicine        100 patients
Surgery         80 patients
ENT             28 patients
Ophthalmology   30 patients
FREQUENCY DISTRIBUTION TABLE

Assume we have a group of 20 individuals whose blood groups were as follows: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. We want to present these data in a table.

Distribution of the studied individuals according to blood group:
These are the blood pressure measurements of 30 patients with hypertension. Present these data in a frequency table: 150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180, 156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160.

Frequency distribution of blood pressure measurements among the studied patients:

Blood pressure "mmHg"   Number   %
150-                    6        20
160-                    6        20
170-                    8        26.7
180-                    6        20
190-                    3        10
200-                    1        3.3
Total                   30       100
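The counting above can be reproduced mechanically. A minimal Python sketch, using the standard-library `collections.Counter` to assign each reading to its 10-mmHg class:

```python
from collections import Counter

# Blood pressure readings of the 30 hypertensive patients from the example.
readings = [150, 155, 160, 154, 162, 170, 165, 155, 190, 186,
            180, 178, 195, 200, 180, 156, 173, 188, 173, 189,
            190, 177, 186, 177, 174, 155, 164, 163, 172, 160]

# Assign each reading to the lower bound of its 10-mmHg class (150-, 160-, ...).
classes = Counter((r // 10) * 10 for r in readings)

for lower in sorted(classes):
    n = classes[lower]
    print(f"{lower}-  n={n}  {100 * n / len(readings):.1f}%")
```

Running it reproduces the table: 6, 6, 8, 6, 3 and 1 readings in the six classes.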
GRAPHICAL PRESENTATION

 Simple and easy to understand.

 Saves a lot of words.

 Self-explanatory.

 Has a clear title indicating its content, "written under the graph".

 Fully labeled.

 The y-axis (vertical) is usually used for frequency.


Graphs:
• Bar chart
• Pie diagram
• Histogram
• Scatter diagram
• Line graph
• Frequency polygon
BAR CHART
 Used for presenting discrete or qualitative data.
 A graphical presentation of magnitude (value or percentage) by rectangles of constant width, with lengths proportional to the frequency & separated by gaps.
 Types: simple, multiple & component.
SIMPLE BAR CHART
MULTIPLE BAR CHART
Percentage of Persons Aged ≥18 Years Who Were Current Smokers,
by Age and Sex — United States, 2002
COMPONENT BAR CHART
PIE DIAGRAM
 Consists of a circle whose area represents the total frequency (100%), divided into segments.
 Each segment represents a proportional share of the total frequency.
HISTOGRAM
• Very similar to the bar chart, with the difference that the rectangles or bars are adherent (without gaps).
• Used for presenting a class frequency table (continuous data).
• Each bar represents a class: its height represents the frequency (number of cases) and its width represents the class interval.
SCATTER DIAGRAM

 Useful for representing the relationship between 2 numeric measurements; each observation is represented by a point corresponding to its value on each axis.
LINE GRAPH
• A diagram showing the relationship between two numeric variables (like the scatter diagram), but the points are joined together to form a line (either a broken line or a smooth curve).

FREQUENCY POLYGON
 Derived from a histogram by connecting the midpoints of the tops of the rectangles in the histogram.
 The line connecting the centers of the histogram rectangles is called the frequency polygon. We can draw the polygon without the rectangles, giving a simpler form of line graph.
 A special type of frequency polygon is the "Normal Distribution Curve".
NORMAL DISTRIBUTION CURVE
"GAUSSIAN DISTRIBUTION CURVE"
The NDC is the frequency polygon of a quantitative continuous variable measured in a large number of individuals. It is a form of presentation of the frequency distribution of biologic variables: "weights, heights, hemoglobin level and blood pressure".

CHARACTERISTICS OF THE CURVE:
 Bell-shaped, continuous curve.

 Symmetrical, i.e. can be divided vertically into 2 equal halves.

 Tails never touch the base line but extend to infinity in either direction.

 The mean, median and mode values coincide.

 Described by 2 parameters: the arithmetic mean (X), "the location of the center of the curve", & the standard deviation (SD), "the scatter around the mean".
AREAS UNDER THE NORMAL CURVE:
X ± 1 SD covers ≈68% of the area under the curve.
X ± 2 SD covers ≈95% of the area under the curve.
X ± 3 SD covers ≈99.7% of the area under the curve.
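These areas follow directly from the normal distribution's cumulative function. A small sketch using only the standard library (`math.erf` gives the fraction of a normal distribution within k standard deviations of the mean):

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within mean +/- k SD."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean ± {k} SD covers {100 * area_within(k):.1f}% of the area")
```

This prints the familiar 68.3%, 95.4% and 99.7% figures (often rounded to 68%, 95% and 99.7% in teaching).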
SKEWED DATA
If we represent collected data by a frequency polygon & the resulting curve does not resemble the NDC (with all its characteristics), then these data are
"not normally distributed".
"The curve may be skewed to the right or to the left side."

CAUSES OF A SKEWED CURVE
The data collected come from:

 A heterogeneous group.
 A diseased or abnormal population.

So the results obtained from these data cannot be applied or generalized to the whole population. The NDC can be used to distinguish normal from abnormal measurements.

Example:
Suppose we have the NDC for Hb levels in a population of normal adult males, with mean ± SD = 11 ± 1.5.

If we obtain a Hb reading of 8.1 for an individual & want to know whether he is normal or anemic: if this reading lies within the area under the curve covering 95% of normals (i.e. mean ± 2 SD), he will be considered normal. If his reading is lower, he is anemic.

• The normal range for Hb in this example will be:
Higher Hb level: 11 + 2 (1.5) = 14.
Lower Hb level: 11 - 2 (1.5) = 8.
i.e. the normal Hb range of adult males is from 8 to 14.

Our reading (8.1) lies within the 95% range of this population, so this individual is considered normal.
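The mean ± 2 SD check above can be sketched in a few lines of Python (the variable names are illustrative, not from the slides):

```python
# Population parameters for Hb in normal adult males (from the example).
mean_hb, sd_hb = 11.0, 1.5

# Normal range = mean +/- 2 SD, covering ~95% of normal individuals.
lower, upper = mean_hb - 2 * sd_hb, mean_hb + 2 * sd_hb   # 8.0 .. 14.0

reading = 8.1
status = "normal" if lower <= reading <= upper else "outside the normal range"
print(f"Normal Hb range: {lower}-{upper} g/dL; a reading of {reading} is {status}")
```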
DATA SUMMARIZATION
Mean

Data summarization
Measures of
Mode
Central tendency

Median

Range

Variance
Measures of
Dispersion Standard
deviation
Coefficient of
variation
Mean

Data summarization
Measures of
Mode
Central tendency

Median

Range

Variance
Measures of
Dispersion Standard
deviation
Coefficient of
variation
ARITHMETIC MEAN
The sum of the observations divided by the number of observations:

x̄ = ∑x / n

x̄ = the mean
∑ denotes "the sum of"
x = the values of the observations
n = the number of observations

For frequency distribution data we calculate the mean by:

x̄ = ∑fx / ∑f

where f is the frequency of each value.

 If the data are presented in a frequency table with class intervals, we calculate the mean by the same equation, but using the midpoint of each class interval.
MEDIAN
 The middle observation in a series of observations after arranging them in ascending or descending order.

Rank of the median:
• Odd n: the (n + 1)/2-th observation.
• Even n: the average of the (n/2)-th and (n/2 + 1)-th observations.
MODE

The most frequently occurring value in the data.

ADVANTAGES & DISADVANTAGES OF THE MEASURES OF CENTRAL TENDENCY:
• Mean: usually preferred, since it takes into account each individual observation. Its main disadvantage is that it is affected by the value of extreme observations.
• Median: a useful descriptive measure if there are one or two extremely high or low values.
• Mode: seldom used.
MEASURES OF DISPERSION

Describe the degree of variation or scatter of the data around its central values (dispersion = variation = spread = scatter).

RANGE
 The difference between the largest & smallest values.
 The simplest measure of variation.
 It can be expressed as an interval such as 4-10, where 4 is the smallest value & 10 the highest. But often it is expressed as the interval width: for example, the range of 4-10 can also be expressed as a range of 6.
 Disadvantages: it is based only on the two extreme values, so it ignores the rest of the data & is strongly affected by outliers.
VARIANCE
• To get the average of the differences between the mean & each observation in the data, we would subtract each value from the mean, sum these differences and divide by the number of observations:

V = ∑(x̄ - x) / n

• The value of this expression is always zero, because the differences between each value & the mean carry negative and positive signs that cancel out on algebraic summation.

• To overcome this we square the difference between the mean & each value, so the sign is always positive. Thus we get:

V = ∑(x̄ - x)² / (n - 1)
STANDARD DEVIATION "SD"

The main disadvantage of the variance is that it is in the square of the units used. So it is more convenient to express the variation in the original units by taking the square root of the variance. This is called the standard deviation (SD). Therefore:

SD = √V, i.e. SD = √[∑(x̄ - x)² / (n - 1)]
COEFFICIENT OF VARIATION "CV"

• The coefficient of variation expresses the standard deviation as a percentage of the sample mean.

• CV is useful when we are interested in the relative size of the variability in the data.

• Example:
If we have the observations 5, 7, 10, 12 and 16, their mean will be 50/5 = 10.
SD = √[(25 + 9 + 0 + 4 + 36) / (5 - 1)] = √(74/4) = 4.3
CV = 4.3 / 10 × 100 = 43%
Another set of observations is 2, 2, 5, 10, and 11. Their mean = 30/5 = 6.
SD = √[(16 + 16 + 1 + 16 + 25) / (5 - 1)] = √(74/4) = 4.3
CV = 4.3 / 6 × 100 = 71.6%

Both sets of observations have the same SD, but they differ in CV: the data in the 1st group are homogeneous (so the CV is not high), while the data in the 2nd set are heterogeneous (so the CV is high).
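The two worked examples can be checked with the standard-library `statistics` module (a sketch; `cv` is a helper name introduced here, not a library function):

```python
import statistics

def cv(data):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

group1 = [5, 7, 10, 12, 16]
group2 = [2, 2, 5, 10, 11]

print(f"group 1: SD = {statistics.stdev(group1):.1f}, CV = {cv(group1):.1f}%")
print(f"group 2: SD = {statistics.stdev(group2):.1f}, CV = {cv(group2):.1f}%")
```

Both groups share SD ≈ 4.3, yet their CVs (≈43% vs ≈71.7%) differ, exactly as the example argues.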
• Example: In a study where age was recorded, the observed values were: 6, 8, 9, 7, 6, and the number of observations was 5.
• Calculate the mean, SD, range, mode and median.

Mean = (6 + 8 + 9 + 7 + 6) / 5 = 7.2   Median = 7   Mode = 6

Range = 9 - 6 = 3

Variance = [(7.2 - 6)² + (7.2 - 8)² + (7.2 - 9)² + (7.2 - 7)² + (7.2 - 6)²] / (5 - 1)
         = [(1.2)² + (-0.8)² + (-1.8)² + (0.2)² + (1.2)²] / 4 = 1.7

SD = √1.7 = 1.3
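All of these summary measures are one-liners with the standard-library `statistics` module; a sketch verifying the worked example:

```python
import statistics

ages = [6, 8, 9, 7, 6]

mean = statistics.mean(ages)          # 7.2
median = statistics.median(ages)      # 7
mode = statistics.mode(ages)          # 6
rng = max(ages) - min(ages)           # 3
variance = statistics.variance(ages)  # sample variance, divides by n-1 -> 1.7
sd = statistics.stdev(ages)           # sqrt(1.7) ~ 1.3

print(mean, median, mode, rng, variance, round(sd, 1))
```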


INFERENTIAL STATISTICS
TYPES OF STATISTICS
• Describe or summarize the data • Use data to make inferences or
of a target population. generalizations about population.
• Describe the data which is • Make conclusions for population
already known. that is beyond available data.
• Organize, analyze & present • Compare, test and predicts future
data in a meaningful manner. outcomes.
• Final results are shown in • Final results is the probability
forms of tables and graphs. scores.
• Tools: measures of central • Tools: hypothesis tests
tendency & dispersion.

Descriptiv Inferential
e
INFERENCE
Inference involves making a generalization about a larger group of individuals on the basis of a subset or sample.

HYPOTHESIS TESTING
Aims to find out whether the observed variation between groups is explained by sampling variation (chance) or is really a difference between the groups.

The method of assessing hypothesis testing is known as a "significance test".

Significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect.
NULL & ALTERNATIVE HYPOTHESES:
 In hypothesis testing, a specific hypothesis is formulated & data are collected to accept or reject it.
 The null hypothesis, H0: x̄1 = x̄2, means that there is no difference between x̄1 & x̄2.
 If we reject the null hypothesis, i.e. there is a difference between the 2 readings, the alternative is either H1: x̄1 < x̄2 or H1: x̄1 > x̄2.
 In other words, the null hypothesis is rejected because x̄1 is different from x̄2.
GENERAL PRINCIPLES OF TESTS OF SIGNIFICANCE

 Set up a null hypothesis and its alternative.

 Find the value of the test statistic.

 Refer the value of the test statistic to a known distribution, which it would follow if the null hypothesis were true.

 Conclude that the data are consistent or inconsistent with the null hypothesis.

 If the data are not consistent with the null hypothesis, the difference is said to be "statistically significant".
 If the data are consistent with the null hypothesis, we accept it, i.e. the difference is statistically insignificant.

 In medicine, we usually consider differences significant if the probability is <0.05.
 This means that if the null hypothesis is true, we shall make a wrong decision fewer than 5 times in 100.
TESTS OF SIGNIFICANCE

Quantitative variables:
• 2 means, large sample "≥60": z test.
• 2 means, small sample "<60": t-test (paired t-test for paired data).
• >2 means: ANOVA.

Qualitative variables:
• X² test.
• Z test (for percentages).
COMPARING TWO MEANS OF LARGE SAMPLES USING THE NORMAL DISTRIBUTION (Z TEST, OR SND: STANDARD NORMAL DEVIATE)

 If we have a large sample size "≥60" & the data follow a normal distribution, we use the z-test.

 z = (population mean - sample mean) / SD

 If the result is z > 2, the difference is significant.

 The normal range for any biological reading lies between the mean value of the population reading ± 2 SD (this includes 95% of the area under the normal distribution curve).
COMPARING TWO MEANS OF SMALL SAMPLES USING THE T-TEST
 If we have a small sample size (<60), we use the t distribution instead of the normal distribution.

Degrees of freedom = (n1 + n2) - 2

 The value of t is compared to the values in the "t distribution" table at that number of degrees of freedom.

 If the t-value is less than the table value, the difference between the samples is insignificant.
 If the t-value is larger than the table value, the difference is significant, i.e. the null hypothesis is rejected.
 A big t-value corresponds to a small P-value, i.e. statistical significance.
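A minimal sketch of the pooled two-sample t statistic (equal-variance form, as in the classic t-test) using only the standard library; the sample values are hypothetical, chosen for illustration:

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic and its degrees of freedom
    (assumes equal variances in the two groups)."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(a) +
           (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical small samples of some laboratory measurement.
t, df = two_sample_t([5.1, 5.4, 5.8, 6.0], [6.2, 6.5, 6.9, 7.1])
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting |t| would then be compared against the t-table value at that df, exactly as described above.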
PAIRED T-TEST:
If we are comparing repeated observations in the same individuals, or the difference between paired data, we use the paired t-test, where the analysis is carried out using the mean and standard deviation of the difference between each pair:

Paired t = mean of the differences / √(SD² of the differences / number of pairs)
d.f. = n - 1
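The paired formula above can be sketched directly; the before/after blood pressure values here are hypothetical:

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic: mean difference over its standard error; df = n - 1."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / n)
    return t, n - 1

# Hypothetical blood pressure in the same 5 patients before and after treatment.
before = [150, 160, 155, 148, 162]
after = [145, 152, 150, 146, 158]
t, df = paired_t(before, after)
print(f"paired t = {t:.2f} with {df} degrees of freedom")
```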
ANALYSIS OF VARIANCE "ANOVA"

• One-way ANOVA: the subgroups to be compared are defined by just one factor, e.g. a comparison between the means of different socio-economic classes.
• Two-way ANOVA: the subdivision is based upon more than one factor.

 The main idea in ANOVA is that we take into account the variability within the groups & between the groups. The F-value is the ratio between the mean sum of squares between the groups & within the groups:

F = between-groups MS / within-groups MS
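A one-way ANOVA F ratio can be computed by hand from that definition. A sketch with three hypothetical groups (standing in for, say, three socio-economic classes):

```python
import statistics

def one_way_f(*groups):
    """One-way ANOVA F = between-groups mean square / within-groups mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-groups sum of squares: group means vs. grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: each value vs. its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical measurements from three groups.
f = one_way_f([4, 5, 6], [7, 8, 9], [10, 11, 12])
print(f"F = {f:.1f}")
```

A large F (relative to the F-table value at k-1 and n-k degrees of freedom) indicates that the group means differ more than chance alone would explain.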
CHI-SQUARED TEST

 Qualitative data are arranged in a table formed by rows & columns: one variable defines the rows & the categories of the other variable define the columns.

 A chi-squared test is used to test whether there is an association between the row variable & the column variable or, in other words, whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other.

O = observed value in the table
E = expected value
Expected (E) = (row total × column total) / grand total

Degrees of freedom = (rows - 1) × (columns - 1)
EXAMPLE: HYPOTHETICAL STUDY

 Two groups of patients are treated using different spinal manipulation techniques: Gonstead vs. Diversified.

 The presence or absence of pain after treatment is the outcome measure.

 Two categories:
  Technique used.
  Pain after treatment.
GONSTEAD VS. DIVERSIFIED EXAMPLE - RESULTS

Pain after treatment:

Technique      Yes   No   Row total
Gonstead       9     21   30
Diversified    11    29   40
Column total   20    50   70 (grand total)

9 out of 30 (30%) still had pain after Gonstead treatment and 11 out of 40 (27.5%) still had pain after Diversified, but is this difference statistically significant?
FIRST FIND THE EXPECTED VALUES FOR EACH CELL

 Expected (E) = (row total × column total) / grand total
 To find E for cell a (and similarly for the rest): multiply its row total by its column total, then divide by the grand total.

Evidence-based Chiropractic

 E for all cells:

Technique      Pain: Yes                  Pain: No                   Row total
Gonstead       O = 9, E = 30×20/70 = 8.6  O = 21, E = 30×50/70 = 21.4  30
Diversified    O = 11, E = 40×20/70 = 11.4  O = 29, E = 40×50/70 = 28.6  40
Column total   20                         50                         70 (grand total)
 Use the X² formula, (O - E)²/E, for each cell and then add them together:

(9 - 8.6)²/8.6 = 0.0186    (21 - 21.4)²/21.4 = 0.0075
(11 - 11.4)²/11.4 = 0.0140   (29 - 28.6)²/28.6 = 0.0056

X² = 0.0186 + 0.0075 + 0.0140 + 0.0056 ≈ 0.046

o Find the df and then consult a X² table to see if the result is statistically significant.
o Degrees of freedom = (rows - 1) × (columns - 1).
o There are two categories for each variable in this case, so df = 1.
o The critical value at the 0.05 level with one df is 3.84.
o Therefore, X² is not statistically significant.
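The whole calculation can be sketched in a few lines of plain Python. With unrounded expected values the statistic comes out ≈0.05 (the hand calculation with one-decimal expecteds gives a slightly different figure), still far below the 3.84 critical value:

```python
# Observed counts from the Gonstead vs. Diversified table.
observed = [[9, 21],    # Gonstead: pain yes / pain no
            [11, 29]]   # Diversified

row_totals = [sum(row) for row in observed]        # [30, 40]
col_totals = [sum(col) for col in zip(*observed)]  # [20, 50]
grand = sum(row_totals)                            # 70

# Chi-squared = sum over cells of (O - E)^2 / E, with E = row*col/grand.
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i in range(2) for j in range(2))

df = (2 - 1) * (2 - 1)
print(f"chi-squared = {chi2:.3f}, df = {df}")  # well below the 3.84 critical value
```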
Z TEST FOR COMPARING TWO PERCENTAGES

z = (p1 - p2) / √(p1q1/n1 + p2q2/n2)

p1 = % in the 1st group, p2 = % in the 2nd group
q1 = 100 - p1, q2 = 100 - p2
n1 = sample size of the 1st group
n2 = sample size of the 2nd group
The z test is significant (at the 0.05 level) if the result is >2.

Example:
The number of anemic patients in group 1, which includes 50 patients, is 5, & the number of anemic patients in group 2, which contains 60 patients, is 20.

To find out whether groups 1 & 2 differ statistically in the prevalence of anemia, we calculate the z test:

p1 = 5/50 = 10%, p2 = 20/60 = 33%
q1 = 100 - 10 = 90, q2 = 100 - 33 = 67
z = (10 - 33) / √(10×90/50 + 33×67/60)
z = 23 / √(18 + 36.85) = 23 / 7.4 = 3.1

Therefore there is a statistically significant difference between the percentages of anemia in the studied groups (because z > 2).
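The anemia example can be checked with a short helper (the function name is illustrative; percentages are kept on the 0-100 scale to match the formula above):

```python
import math

def z_two_percentages(x1, n1, x2, n2):
    """z = (p1 - p2) / sqrt(p1*q1/n1 + p2*q2/n2), with p and q in percent."""
    p1, p2 = 100 * x1 / n1, 100 * x2 / n2
    q1, q2 = 100 - p1, 100 - p2
    return abs(p1 - p2) / math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)

# 5 anemic of 50 patients in group 1 vs. 20 anemic of 60 patients in group 2.
z = z_two_percentages(5, 50, 20, 60)
print(f"z = {z:.1f}")  # > 2, so the difference is significant
```

(The hand calculation rounds p2 to 33%; keeping the exact 33.3% gives z ≈ 3.1 as well.)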
CORRELATION & REGRESSION
Correlation measures the closeness of the association between 2 continuous variables, while linear regression gives the equation of the straight line that best describes & enables the prediction of one variable from the other.

CORRELATION
A t-test for correlation is used to test the significance of the association.
CORRELATION IS NOT CAUSATION!!!
LINEAR REGRESSION

Same as correlation:
• Determines the relation & predicts the change in one variable due to changes in the other variable.
• The t-test is also used for assessment of the level of significance.

Differs from correlation:
• The independent factor has to be specified, as distinct from the dependent variable.
• The dependent variable in linear regression must be a continuous one.
• Allows prediction of the dependent variable for a particular value of the independent variable, "but should not be used outside the range of the original data".

SCATTERPLOTS

 An X-Y graph with symbols that represent the values of two variables.

 The regression line can be drawn through the cloud of points.
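A least-squares regression line and Pearson's r can be computed from the same sums. A self-contained sketch on hypothetical gestational-age vs. birth-weight data (the numbers are invented for illustration):

```python
import math

def linear_regression(x, y):
    """Least-squares slope and intercept, plus Pearson's correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: gestational age (weeks) vs. birth weight (kg).
weeks = [36, 37, 38, 39, 40]
weight = [2.6, 2.9, 3.0, 3.2, 3.4]
slope, intercept, r = linear_regression(weeks, weight)
print(f"weight = {intercept:.2f} + {slope:.2f} * weeks, r = {r:.2f}")
```

The fitted line predicts the dependent variable within the observed range of x; as noted above, it should not be extrapolated beyond the original data.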
MULTIPLE REGRESSION
 Models the dependency of a dependent variable on several independent variables, not just one.
 The test of significance used is the ANOVA (F test).
For example: neonatal birth weight depends on these factors: gestational age, length of the baby and head circumference. Each factor correlates significantly with baby birth weight (i.e. has a +ve linear correlation). We can do a multiple regression analysis to obtain a mathematical equation by which we can predict the birth weight of any neonate if we know the values of these factors.
