Biostatistics Word New
Terminologies:
Variable: This is a quantity which varies such that it may take any one of a
specified set of values. It may be measurable or non-measurable.
Parameter: A summary value which characterizes the population with respect to
the variable under study.
Descriptive biostatistics:
It is the study of biostatistical procedures which deal with the collection,
representation, calculation and processing, i.e., the summarization of the data to
make it more informative and comprehensible. The primary function of
descriptive statistics is to provide meaningful and convenient techniques for
describing features of data that are of interest. Failure to choose appropriate
descriptive statistics often leads to faulty scientific inference. The field of
descriptive statistics is not concerned with the implications or conclusions that can
be drawn from the sets of the data.
Inferential biostatistics:
It comprises the procedures used to make generalizations or draw
conclusions on the basis of the study of a sample. This is also known as
sampling biostatistics. The study of the quantitative aspects of the inferential
process provides a solid basis, on which the more general substantive process of
inference can be founded.
Collection of data
Sampling method
In this method the data are collected from a small group drawn from the population,
which is termed a sample. A sample is a portion of the population selected to represent
the population.
Types of samples
There are two types of samples which are used in biostatistics :
1. Qualitative samples: when we say that children from an African population are
taller than those in India, the sample is called a qualitative sample.
2. Quantitative samples: when we count the number of decayed teeth in
individuals of a particular age group, the sample is called a quantitative sample.
Size of samples
The total number of units used in the study to get significant
results is termed the sample size. Selecting the proper sample size is very important.
The sample size should not be very small or very large because the conclusions are
directly affected by it.
Advantages of the sampling method
This method is comparatively more economical as it consumes less energy,
less time and less expenditure.
It requires fewer investigations.
It is most suited to those places and situations where census method cannot
be applied.
Disadvantages of sampling method
It requires services of experts, otherwise incorrect or misleading results will
be obtained.
Selection of an appropriate method of sampling is necessary.
If the population is very small and precise information is needed, the census
method is preferred. If the population is very large or the field of investigation is
very wide and quick results are required, sampling methods should be used.
Cluster sampling
In this method the population is divided into separate natural groups of
elements. These groups are called clusters. Each cluster includes only one type of
element. A simple random sample of clusters is then taken. A cluster may
consist of units such as villages, wards, blocks, factories, slums of a town, children
of a school, etc.
Generally the clusters are natural groupings, and if they are geographic
regions, the sampling is called 'area sampling'.
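The selection of clusters can be sketched in Python as follows. This is only an illustration: the village names and household labels are hypothetical, and the sketch assumes that every unit in a chosen cluster is studied.

```python
import random

# Hypothetical population: each village (cluster) lists its households (units).
clusters = {
    "village_A": ["A1", "A2", "A3"],
    "village_B": ["B1", "B2"],
    "village_C": ["C1", "C2", "C3", "C4"],
    "village_D": ["D1", "D2"],
}

def cluster_sample(clusters, n_clusters, seed=None):
    """Select n_clusters clusters at random and return every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

chosen, units = cluster_sample(clusters, n_clusters=2, seed=1)
print(chosen, units)
```

Fixing the seed only makes the illustration repeatable; in a real survey the clusters would be drawn without a fixed seed.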
Cluster sampling
o Probability sampling procedure
o Clusters of population units are selected at random
o All or some units in the chosen clusters are studied.
Nonprobability sampling
o Subjective procedure
o Probability of selection for the population units cannot be determined.
In non random sampling, the samples are drawn without following any
criteria or yardstick. The sample collected does not follow any specific
approach, nor can it be used to assess properly the accuracy of the
estimator. In this sampling procedure many investigator biases are likely to occur.
The main types are:
1. Accidental, Haphazard or Convenience Sampling: this is also known as
accidental accessibility sampling. The major reason for using it is
administrative convenience. The sample is chosen with ease of access being
the sole concern.
Convenience sampling
o A researcher's convenience forms the basis for selecting a sample of units.
Judgment sampling
o The researcher exerts some effort in selecting a sample that is believed to be most appropriate.
Researcher will usually be knowledgeable about the nature of the ideal population.
Requires greater researcher effort
Generally more appropriate than a convenience sample.
Can be very useful
when you need to reach a targeted sample quickly
when sampling for proportionality is not the primary concern.
Likely to yield opinions of your target population
Likely to overweight more readily accessible subgroups.
Nonproportional quota sampling
Specify the minimum number of sampled units you want in each category.
Not concerned with numbers that match the proportions in the population.
Simply need enough to assure the ability to talk about even small groups in the population.
Nonprobabilistic analogue of stratified random sampling
Typically used to assure that smaller groups are adequately represented
5. Snowball Sampling
o Identify someone who meets the criteria for inclusion
o Ask them to recommend others they may know who also meet the criteria.
o Useful when trying to reach populations that are inaccessible or hard to find.
A. Sampling Error
Sampling error
o The difference between a statistic value generated through sampling and the parameter value, which can be determined only through a census study
o Magnitude of the sampling error says how precisely the population parameter can be estimated from a sample value
o Estimate the average amount of sampling error associated with a given sampling procedure
o True population parameter value is unknown
o Sample statistic value may vary from sample to sample within the population
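A small simulation can make the idea of sampling error concrete. The sketch below is illustrative only: the population is simulated (hypothetical heights), so the parameter is known here, which is not the case in practice.

```python
import random
import statistics

# Hypothetical population of 1,000 values (e.g. heights in cm).
rng = random.Random(0)
population = [rng.gauss(160, 10) for _ in range(1000)]

# The parameter: in practice it could be determined only through a census.
parameter = statistics.mean(population)

def sampling_error(sample_size):
    """Statistic (sample mean) minus parameter (population mean)."""
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample) - parameter

# The statistic varies from sample to sample, so the error does too.
errors = [sampling_error(30) for _ in range(200)]
average_abs_error = statistics.mean([abs(e) for e in errors])
print(round(average_abs_error, 2))
```

Repeating the draw many times gives an estimate of the average amount of sampling error for this procedure and sample size.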
PRESENTATION OF DATA
Objective of classification of data :
make the data simple,
concise, meaningful,
interesting and
helpful in further analysis.
two main methods of presenting data:
Tabulation and
Diagrams
TABULATION
Data are classified on the following bases:
Geographical, i.e., area-wise, e.g. cities, districts, etc.
Chronological, i.e., on the basis of time.
Qualitative, i.e., according to some attribute.
Quantitative, i.e., in terms of magnitude.
The two elements of classification are
The variable and
The frequency.
Variable: a name denoting a condition, occurrence or effect that can assume
different values.
It is divided into subgroups or classes,
each of which has a lowest and a highest value.
Class interval : difference between the upper and lower limit of a class
Eg: in the class 5 -14,
5 - lower limit and 14 - upper limit.
class interval = 14 - 5 =9.
Frequency: is the number of units belonging to each group of the variable.
Frequency distribution table: way of presenting data in the tables
Frequency distribution table
• Title of the table – named at the bottom
• The number of class intervals – between 5 and 20; there is no rigid rule about it.
• The class intervals – of equal width.
• Clearly defined class limits – to avoid ambiguity.
E.g., 0-4, 5-9, 10-14, etc.
• Clearly defined row and column with the headings
• Units of measurement should be specified.
• If the data is not original, the source of the data should be mentioned at the
bottom of the table.
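As an illustration of building a frequency distribution, the sketch below counts observations into the classes 0-4, 5-9 and 10-14 used as the example above. The data values are hypothetical.

```python
# Hypothetical data: number of decayed teeth in 12 individuals.
data = [0, 3, 4, 5, 7, 9, 9, 10, 12, 14, 2, 6]

# Clearly defined, equal-width class limits: 0-4, 5-9, 10-14.
classes = [(0, 4), (5, 9), (10, 14)]

def frequency_table(values, classes):
    """Count how many values fall within each (lower, upper) class."""
    table = {}
    for lower, upper in classes:
        table[(lower, upper)] = sum(lower <= v <= upper for v in values)
    return table

table = frequency_table(data, classes)
for (lower, upper), freq in table.items():
    print(f"{lower}-{upper}: {freq}")
```

Because the class limits do not overlap, every observation is counted in exactly one class.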
Diagrams:
Extremely useful
attractive to the eyes,
give a bird's eye view of the entire data,
have a lasting impression
TYPES OF DIAGRAMS:
Bar Diagram : qualitative data.
Multiple Bar: qualitative data
Component Bar Diagram: qualitative data.
Proportional Bar Diagram
Histogram: quantitative data of continuous type.
Frequency Polygon: quantitative data
Pie Diagram: qualitative data
Line diagram: changes of values in a variable over time
Cartograms or Spot Map: geographical distribution of frequencies
Basic rules :
Self explanatory
Simple and consistent with the data.
Values of the variable - on the horizontal or X-axis; the frequency - on the vertical
or Y-axis.
Not too many lines on the graph; it should not look clumsy.
The scale of presentation – right hand top corner of the graph.
The scale of division of the two axes should be proportional.
The details of the variables and frequencies presented on the axes.
Bar Diagram
Represent qualitative data.
Only one variable.
width of the bar remains the same
The length varies according to the frequency in each category.
Bars: vertical or horizontal.
Limitation:
represent only one classification
cannot be used for comparison
Facilitate comparison of data relating to different time periods and regions.
Multiple Bar:
compare qualitative data with respect to a single variable.
Eg: sex wise or with respect to time or region.
each category of the variable has a set of bars of the same width,
corresponding to the different sections, without any gap between the bars in a set;
the length corresponds to the frequency.
Component Bar Diagram:
represent qualitative data.
both, the number of cases in major groups as well as the subgroups
simultaneously
a rectangle is drawn for the cases of each major group
each rectangle is divided according to the number in the subgroups.
Proportional Bar Diagram:
represent qualitative data.
used to compare only the proportions of sub-groups between different major groups
of observations. Bars are drawn for each group with the same length,
either as 1 or 100%. These are then divided according to the sub-group
proportion in each major group.
PIE DIAGRAM
The frequency of the group is shown in a circle.
Degree of angle denotes the frequency.
Instead of comparing the length of bar , the areas of segments are compared.
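The angle for each segment can be worked out as (frequency / total frequency) × 360 degrees. A minimal sketch, using hypothetical group names and frequencies:

```python
def pie_angles(frequencies):
    """Return the angle in degrees of each pie segment."""
    total = sum(frequencies.values())
    return {group: 360 * f / total for group, f in frequencies.items()}

# Hypothetical frequencies for three groups.
frequencies = {"group A": 10, "group B": 20, "group C": 30}
angles = pie_angles(frequencies)
print(angles)  # {'group A': 60.0, 'group B': 120.0, 'group C': 180.0}
```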
Line diagram:
useful to study changes of values in the variable over time
simplest type
X-axis, - hours, days, weeks, months or years
Y-axis- value of any quantity pertaining to X-axis,
Histogram
quantitative data of continuous type.
bar diagram without gap between the bars.
represents a frequency distribution.
X-axis: the size of an observation is marked. Starting from 0 the limit of
each class interval is marked, the width corresponding to the width of the
class interval in the frequency distribution.
Y-axis :the frequencies are marked. A rectangle is drawn above each class
interval with height proportional to the frequency of that interval.
Frequency Polygon
frequency distribution of quantitative data
compare two or more frequency distributions.
The first point and last point are joined to the midpoint of previous and next
class respectively.
SCATTER DIAGRAM
[Figure: Height and weight of 20 students of CODS. Scatter diagram with height in feet on the X-axis and weight in kg on the Y-axis.]
Mean: $\bar{X} = \frac{\sum X_i}{n}$
Σ : sigma, means the sum of.
Xi : is the value of each observation in the data,
n: is the number of observations in the data.
MODE
value in a series of observations which occurs with the greatest frequency.
Eg: series on age at eruption of the canine as 6,6,5,7, 8, 6, 7, 5;
6 - mode.
Ill-defined mode: when the mode is not clearly defined, it can be estimated as
Mode = 3 Median - 2 Mean.
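The eruption-age series above can be checked with Python's standard `statistics` module. The sketch also evaluates 3 Median - 2 Mean for the same data, purely to show the arithmetic; the variable names are illustrative.

```python
import statistics

# Age at eruption of the canine, from the example above.
ages = [6, 6, 5, 7, 8, 6, 7, 5]

mean = statistics.mean(ages)      # sum of observations / number of observations
median = statistics.median(ages)  # middle value of the sorted series
mode = statistics.mode(ages)      # most frequent value

# Empirical estimate, for use when the mode is ill defined.
estimated_mode = 3 * median - 2 * mean

print(mean, median, mode, estimated_mode)  # 6.25 6.0 6 5.5
```

Here the mode is well defined (6), so the empirical estimate (5.5) is shown only for comparison.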
Variability & its measures
Types –
Biological variability
Real variability
Experimental variability
Biological variability
Normal or natural differences within accepted biological limits
Individual variability
Periodical variability
Class , group or category variability
Real variability
When the difference b/w two readings is more than the defined limits
Due to the external factors
Experimental variability
Errors or variations due to materials & methods
Observer error – Subjective error
Objective error
Instrument error
Sampling error
Measures of variability
Synonyms:
Measures of dispersion
Measures of variation or scatter
Dispersion is the degree of spread or variation of the variable about a central
value.
Uses:
Determine reliability of an average
Serve as a basis of control of variability
Comparison of two or more series
Facilitate further statistical analysis
A good measure of dispersion : simple , easy to compute , based on all items
, amenable for further analysis and not affected by extreme values.
Of individual observations -
Range
Interquartile range
Mean deviation
Standard deviation
Coefficient of variation
Variability of samples-
Standard error of mean
Standard error of difference b/w 2 means
Standard error of proportion
Difference b/w 2 proportions
Standard error of correlation coefficient
Standard deviation of regression coefficient
Range
Difference between the value of the smallest item and the value of the
largest item.
simplest method.
gives no information about the values that lie between the extreme values.
subject to fluctuations from sample to sample.
Mean deviation
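The average of the absolute deviations from the arithmetic mean
$M.D. = \frac{\sum |x - \bar{x}|}{n}$
Example: two series with the same mean (60) but different spread:
52,44,54,56,60,64,66,76,60,68
41,54,43,45,60,75,77,66,79,60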
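The two series listed above can be used to compare the range and the mean deviation. The sketch below takes the mean deviation as the average of the absolute deviations from the arithmetic mean; the function names are illustrative.

```python
series_1 = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
series_2 = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]

def value_range(values):
    """Difference between the largest and the smallest item."""
    return max(values) - min(values)

def mean_deviation(values):
    """Average absolute deviation from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

for name, series in (("series 1", series_1), ("series 2", series_2)):
    print(name, sum(series) / len(series), value_range(series), mean_deviation(series))
# series 1 60.0 32 6.8
# series 2 60.0 38 11.4
```

Both series have the same mean, but the second is more widely scattered on both measures.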
Standard deviation:
most important and widely used
it is the square root of the mean of the squared deviations from arithmetic
mean.
root mean square deviation
Greater the deviation – greater the dispersion
Smaller the deviation- higher degree of uniformity
Calculation of S.D
For ungrouped data:
Calculate the mean, $\bar{x}$
Find the difference of each observation from the mean,
$d = x_i - \bar{x}$
Square these: $d^2$
Total these: $\sum d^2$
Divide this by the number of observations minus 1,
variance $= \frac{\sum d^2}{n-1}$
The square root of this variance is
$S.D. = \sqrt{\frac{\sum d^2}{n-1}}$
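The steps for ungrouped data translate directly into code. In this sketch the data are the first example series from the mean deviation section, and the result is compared with `statistics.stdev` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics

def standard_deviation(values):
    """Sample standard deviation, following the steps for ungrouped data."""
    n = len(values)
    mean = sum(values) / n                          # calculate the mean
    sum_sq = sum((x - mean) ** 2 for x in values)   # square and total the deviations
    variance = sum_sq / (n - 1)                     # divide by n - 1
    return math.sqrt(variance)                      # square root of the variance

values = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
sd = standard_deviation(values)
print(round(sd, 2), round(statistics.stdev(values), 2))
```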
For grouped data: with single units for class intervals
Make frequency table
Determine the midpoint of each range
$SD = \sqrt{\frac{\sum (X_i - \bar{x})^2 f_i}{n-1}}$
$X_i$ – individual observation in the class
$\bar{x}$ – mean
$f_i$ – frequency
$n$ – total frequency
Calculation for grouped data with range for class interval:
Class intervals in terms of range:
Frequency – centered at the midpoints
$S = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{n-1}}$
$x_i$ – midpoint of class interval
$\bar{x}$ – mean
$f_i$ – frequency
$n$ – total frequency
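For grouped data the same calculation can be sketched with each frequency centered at the midpoint of its class interval. The class intervals and frequencies below are hypothetical, and the mean is taken as the frequency-weighted mean of the midpoints.

```python
import math

def grouped_sd(class_intervals, frequencies):
    """Standard deviation of grouped data, using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_intervals]
    n = sum(frequencies)  # total frequency
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return math.sqrt(sum_sq / (n - 1))

# Hypothetical frequency distribution.
class_intervals = [(0, 4), (5, 9), (10, 14)]
frequencies = [3, 5, 2]

sd = grouped_sd(class_intervals, frequencies)
print(round(sd, 2))
```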
Uses of standard deviation
Summarizes the deviations of a large distribution
Indicates whether the variation from mean is by chance or real
Helps in finding standard error
Helps in finding the suitable size of sample
Standard deviation is only interpretable as a summary measure for variables
having approximately symmetric distributions
Coefficient of variation
Compare relative variability
Variation of same character in two or more series
To compare the variability of one character in two different groups having
different magnitudes of values, or
to compare two characters in the same group, by expressing the variability as a percentage
$CV = \frac{S.D.}{\text{mean}} \times 100$
The higher the C.V., the greater the variability
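As an illustration, the coefficient of variation can be computed for the two example series from the mean deviation section. The sketch uses `statistics.stdev`, which divides by n - 1.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

series_1 = [52, 44, 54, 56, 60, 64, 66, 76, 60, 68]
series_2 = [41, 54, 43, 45, 60, 75, 77, 66, 79, 60]

cv_1 = coefficient_of_variation(series_1)
cv_2 = coefficient_of_variation(series_2)
print(round(cv_1, 2), round(cv_2, 2))
```

The second series has the higher C.V. and therefore the greater relative variability.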
Normal distribution & Normal curve
Height of bars or curve greatest in middle
Values are spread around mean
Maximum values around mean , few at extremes
half values above & half below mean
Tests of significance
Parametric and non parametric tests or methods
Parametric methods
Methods of statistical inference that are based on the assumption that the
population has a certain probability distribution are referred to as parametric
methods. For example, the t-distribution and F-distribution are associated with the
values of parameters of an assumed normal probability distribution.
Non parametric methods
Statistical procedures that do not require any assumption about the form of the
probability distribution from which the observations come are known as non
parametric methods. These are also called distribution free methods. For example,
chi square frequency techniques are non parametric.
Parametric tests
Eg. t test, Z test, Pearson correlation coefficient
Non parametric tests
Eg. Chi-square test, Kruskal-Wallis test, Spearman correlation
coefficient
Tests of significance- Steps involved
Define the problem
state the hypothesis
Null hypothesis
Alternate hypothesis
Fix the level of significance
Select appropriate test to find test statistic
Find degree of freedom (df)
Compare the observed test statistic with theoretical one at desired level of
significance & corresponding DF
If the observed test statistic value is greater than the theoretical value, reject
the null hypothesis.
Draw the inference based on the level of significance
Objective of using tests of significance
To compare – sample mean with population
Means of two samples
Sample proportion with population
Proportion of two samples
Association b/w two attributes
t - test
Student’s t-test
Designed by W.S. Gosset
Unpaired t-test (two independent samples)
Paired t-test (single sample, correlated observations)
Essential conditions:
randomly selected samples from the corresponding populations
Homogeneity of variances in the 2 samples
Quantitative data
Variable normally distributed
Sample size < 30
Unpaired t- test
Applied to unpaired data of independent observations made on individuals of two
different or separate groups or samples drawn from two populations
Null hypothesis is stated
Find the difference between the means of the two samples
$(\bar{X}_1 - \bar{X}_2)$ measures the variation in the variable
Calculate the t value
$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$
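A minimal sketch of the unpaired t calculation follows. The notes do not give the formula for SE, so the sketch assumes the usual pooled-variance standard error of the difference between two means (consistent with the homogeneity-of-variances condition), with n1 + n2 - 2 degrees of freedom. The data are hypothetical.

```python
import math

def unpaired_t(sample_1, sample_2):
    """t value and degrees of freedom for two independent samples."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = sum(sample_1) / n1
    mean_2 = sum(sample_2) / n2
    ss_1 = sum((x - mean_1) ** 2 for x in sample_1)
    ss_2 = sum((x - mean_2) ** 2 for x in sample_2)
    df = n1 + n2 - 2
    # Assumption: pooled variance, i.e. equal variances in the two populations.
    pooled_variance = (ss_1 + ss_2) / df
    se = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / se, df

# Hypothetical observations on two separate groups.
group_1 = [5, 7, 6, 8, 9]
group_2 = [3, 4, 5, 4, 4]

t, df = unpaired_t(group_1, group_2)
print(round(t, 3), df)
```

The computed t is then compared with the tabulated value at the chosen level of significance and the corresponding degrees of freedom.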
Paired t- test
To study the role of a factor or cause when the observations are made
before and after it acts:
Eg: exertion on pulse rate, effect of a drug on blood pressure, etc.
To compare the effect of 2 drugs, given to the same individual in the sample
on two different occasions
Eg: adrenaline & noradrenaline on pulse rate
To study the comparative accuracy of 2 different instruments
Eg: 2 different types of sphygmomanometers
To compare the results of 2 different lab techniques
To compare the observations made at two different sites in the same body
Testing procedure:
Null hypothesis
$X_1 - X_2 = x$
Calculate the mean of the differences, $\bar{x} = \frac{\sum x}{n}$
Calculate the SD of the differences and the SE of the mean,
$SE = \frac{SD}{\sqrt{n}}$
Determine the t value,
$t = \frac{\bar{x} - 0}{SD / \sqrt{n}}$
• Find the degrees of freedom , n-1
• refer the table & find the probability
• P >0.05 not significant
• P< 0.05 significant
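The paired testing procedure can be sketched as below. The pulse-rate readings are hypothetical, and the SD of the differences is taken with n - 1 in the denominator via `statistics.stdev`.

```python
import math
import statistics

def paired_t(first, second):
    """t value and degrees of freedom for paired observations."""
    differences = [a - b for a, b in zip(first, second)]  # x = X1 - X2
    n = len(differences)
    mean_diff = statistics.mean(differences)   # mean of the differences
    sd_diff = statistics.stdev(differences)    # SD of the differences
    se = sd_diff / math.sqrt(n)                # SE of the mean difference
    return (mean_diff - 0) / se, n - 1

# Hypothetical pulse rates of five subjects after and before exertion.
after = [76, 80, 73, 83, 77]
before = [72, 75, 70, 78, 74]

t, df = paired_t(after, before)
print(round(t, 3), df)
```

The probability P is then read from the t table at n - 1 degrees of freedom.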
Variance ratio test or F test
Variance: a measure of the extent of the variation present in a set of data
Obtained by taking the sum of squares
Measured in squared units
Comparison of variance b/w two samples
Test developed by Fisher & Snedecor
Involves another distribution called F – distribution
Calculate the variances of the two samples first, $S_1^2$ and $S_2^2$
(Variance = SD²)
$F = S_1^2 / S_2^2$
where $S_1^2 > S_2^2$,
i.e. the larger variance, $S_1^2$, is the numerator
Significance of F is found by referring to the F-table
Degrees of freedom: (n1 – 1) and (n2 – 1) in the two samples
The table gives variance ratio values at different levels of significance, with df (n1 – 1)
given horizontally and (n2 – 1) vertically
E.g. Sample A: sum of squares = 36; df = 8
Sample B: sum of squares = 42; df = 9
F = (42/9) / (36/8) = (42/9) × (8/36) = 1.04
This value of F < table value at P = 0.05, so it is not significant
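The worked example can be reproduced in a few lines. The function name is illustrative; each variance is taken as the sum of squares divided by its degrees of freedom, with the larger variance in the numerator.

```python
def variance_ratio(ss_a, df_a, ss_b, df_b):
    """F = larger variance / smaller variance."""
    var_a = ss_a / df_a
    var_b = ss_b / df_b
    return max(var_a, var_b) / min(var_a, var_b)

# Sample A: sum of squares 36, df 8; Sample B: sum of squares 42, df 9.
f = variance_ratio(36, 8, 42, 9)
print(round(f, 2))  # 1.04
```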
Analysis of variance(ANOVA) test
Compare more than two samples
Compares variation between the classes as well as within the classes
For such comparisons there is high chance of error using t or Z test
Variation in experimental studies:
Natural variation, also called random or error variation
Imposed variation or treatment variation, caused by the experimenter
A: Between-groups variation = random variation (always) + imposed variation (maybe)
B: Within-group variation = random variation
Total variation = A + B
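The split of total variation into A and B can be illustrated numerically. The sketch below assumes that variation is measured by sums of squared deviations, as is usual in ANOVA; the three groups are hypothetical.

```python
def variation_components(groups):
    """Return (between-groups, within-groups, total) sums of squares."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # A: deviations of group means from the grand mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # B: deviations of observations from their own group mean.
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    # Total: deviations of all observations from the grand mean.
    total = sum((x - grand_mean) ** 2 for x in all_values)
    return between, within, total

# Hypothetical observations in three groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
between, within, total = variation_components(groups)
print(between, within, total)  # 24.0 6.0 30.0
```

For these data A + B equals the total variation, as stated above.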
If there is no real difference b/w groups, then