International Journal of Medicine Research

ISSN: 2455-7404; Impact Factor: RJIF 5.42
www.medicinesjournal.com
Volume 1; Issue 5; November 2016; Page No. 36-44

Application of statistical test in clinical research


1 Anirban Goswami, 2 Dr. Mohd Wasim Ahmed, 3 Dr. Rajesh, 4 Dr. Najmus Sehar, 5 Dr. Mohd Ishtiyaque Alam

1 Investigator (Statistics), Regional Research Institute of Unani Medicine, Patna, under CCRUM, Ministry of Ayush, India
2, 3 Research Officer (U), Scientist L-1, Regional Research Institute of Unani Medicine, Patna, under CCRUM, Ministry of Ayush, India
4 Research Officer (U), Scientist L-3, Regional Research Institute of Unani Medicine, Patna, under CCRUM, Ministry of Ayush, India
5 Research Officer Incharge, Scientist L-4, Regional Research Institute of Unani Medicine, Patna, under CCRUM, Ministry of Ayush, India

Abstract
Clinical research is increasingly based on empirical studies, and the results of these studies are usually presented and analyzed with statistical methods. This paper therefore discusses frequently used statistical tests for different types of data sets under the assumption of normality or non-normality. Parametric statistical tests are applied when the normality (and homogeneity of variance) assumptions are satisfied; otherwise the equivalent non-parametric statistical tests are used. Advice is presented for selecting statistical tests on the basis of very simple cases. It is therefore an advantage for any physician or researcher to be familiar with the frequently used statistical tests, as this is the only way he or she can evaluate the statistical methods in scientific publications and thus correctly interpret their findings.

Keywords: clinical research, statistical test

1. Introduction
Clinical research is conducted to collect and record data on each subject, such as the patient's demographic characteristics, disease-related risk factors, medical history, biochemical markers, pathological history, medical therapies, and outcome or endpoint data at different time points, for the assessment of the safety, efficacy, and/or mechanism of action of an investigational medicinal product, new drug, or device in development. These data may be continuous or discrete. Understanding the types of data and their assumptions is important, because they determine which method of data analysis is to be used and how the results are reported [1].

Data can be divided into two main types: quantitative and qualitative. Quantitative data can be either continuous variables that one can measure (such as height, weight, or blood pressure) or discrete variables (such as the number of patients attending the OPD per day or the number of attacks of asthma per child per month). Qualitative data tend to be categories: people are male or female, Indian or Bangladeshi, they have a disease or are in good health, and they belong to lower, middle, or higher socio-economic status. Four types of scales appear in the social sciences: nominal, ordinal, interval, and ratio. They fall into two groups: categorical and continuous scale data. Nominal and ordinal scales are categorical (non-parametric) data; interval and ratio scales are continuous (parametric) data. Categorical data with unordered scales are called nominal; blood group and gender are examples of the nominal scale. Categorical data with ordered scales are called ordinal; severity of illness and amount of pain are examples of the ordinal scale. The distinction matters because the data analysis method differs depending on the scale of measurement [2].

In clinical research, patients' and investigators' responses to treatments can be documented according to the occurrence of some meaningful and well-defined event such as death, infection or cure of a certain disease, any serious adverse event, or biochemical and pathological findings. The nature of these data can be parametric or non-parametric: parametric tests are used on parametric data, while non-parametric data are examined with non-parametric tests. Parametric statistical tests are used when the data follow the normal distribution; they are the most powerful tests because they use all of the information in the numbers. Non-parametric statistical tests are used when the data do not follow a particular distribution but can be ordered, and they are sometimes called distribution-free tests.

2. Statistical tests used in Clinical Research

Z-test
A Z-test is a hypothesis test based on the Z-statistic, which follows the standard normal distribution under the null hypothesis. It is used when the outcome is continuous and the exposure, or predictor, is binary. The test assumes that the sample size is greater than 30, that observations are independent of each other (one observation is not related to and does not affect another), that the data are approximately normally distributed, and that the data are randomly selected from a population in which each item has an equal chance of being selected. There are two types of Z-test: the one-sample Z-test and the two-sample Z-test. The one-sample Z-test examines the mean of a normally distributed population with known variance. For example, in clinical research, if someone claimed to have found a new drug that cures cancer, others would want to be sure the claim was probably true; a hypothesis test tells them whether it is probably true or probably not. The two-sample Z-test is used to determine whether two population means are different when the variances are known and the statistic is assumed to have a normal distribution. For example, in clinical research, suppose there are two flu drugs A and B: drug A works on 41 people out of a sample of 195 and drug B works on 351 people in a sample of 605, and we wish to test whether the effects of the two drugs are equal.
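
The flu-drug comparison above is a two-sample test of proportions. A minimal sketch in Python with the statsmodels package, using the counts quoted in the example (the variable names are ours, and the snippet is an illustration rather than the authors' own analysis):

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

responders = np.array([41, 351])    # patients who responded on drug A and drug B
sample_sizes = np.array([195, 605])

z_stat, p_value = proportions_ztest(count=responders, nobs=sample_sizes)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

A small p-value would suggest that the response proportions of the two drugs differ.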

Student t-test
The Student t-test is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown. It assumes that the sample size is less than 30, that observations are independent of each other (one observation is not related to and does not affect another), that the data are approximately normally distributed, and that the data are randomly selected from a population in which each item has an equal chance of being selected. There are two types of Student t-test: one-sample and two-sample. The one-sample t-test is a statistical procedure used to examine the difference between the sample mean and a known population value; it is used to determine whether a mean response changes under different experimental conditions. The two-sample t-test, on the other hand, is used to compare the means of two independent populations, denoted µ1 and µ2, under the assumption that the standard deviations of the populations are equal. This test has ubiquitous application in the analysis of controlled clinical research, for example in comparing the mean decrease in diastolic blood pressure between two groups of patients receiving different antihypertensive agents, or in estimating pain relief from a new treatment relative to that of a placebo based on subjective assessment of percent improvement in two parallel groups [3, 4].

Student paired 't' test
The paired t-test is a statistical technique applied to paired data of independent observations from one sample, in which each individual gives a pair of observations, or used to compare two population means in the case of two correlated samples. The paired-sample t-test is used in 'before-after' studies, when the samples are matched pairs, or in case-control studies. Its assumptions are that the number of observations in each data set is the same and that they are organized in pairs with a definite relationship between each pair of observations; that the data were taken as random samples and follow a normal distribution; that the variances of the two samples are equal; and that cases are independent of each other. In clinical research this test is used to compare the effect of two drugs given to the same individuals in the sample on two different occasions, e.g., adrenaline and noradrenaline on pulse rate, or the number of hours of sleep induced by two hypnotics [5].

Hotelling's T2 test
Hotelling's T2 test is the multivariate generalization of Student's t-test [6] and is appropriate when each subject is described by multiple response variables. A one-sample Hotelling's T2 test can be used to test whether a set of objects (which should be a sample from a single statistical population) has a mean vector equal to a hypothetical mean. A two-sample Hotelling's T2 test may be used to test for significant differences between the mean vectors (multivariate means) of two multivariate data sets. Its assumptions are: (1) the variables of each data set follow a multivariate normal distribution (each variable may be tested for univariate normality); (2) the objects have been independently sampled; (3) in a two-sample test, the two data sets being tested have (near) equivalent variance-covariance matrices, and Bartlett's test may be used to evaluate whether this assumption holds; (4) each data set describes one population with one multivariate mean, with no subpopulations within it. As an example in clinical research, a certain type of tropical disease is characterized by fever, low blood pressure and body aches. Suppose a research team working on a new drug to treat this disease wants to determine whether the drug is effective. They take a random sample of 20 people with this type of disease and 18 given a placebo, and based on these data they want to determine whether the drug is effective at reducing the three symptoms.

ANOVA
Analysis of variance (ANOVA) splits the total variability found in a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA test to determine the effect that independent variables have on the dependent variable in a regression study. It is an extension of the two-sample t-test and Z-test. In 1918 Ronald Fisher developed the test, which is therefore also called the Fisher analysis of variance; it is used to analyze the variance between and within groups whenever there are more than two groups [7]. If we set the type I error to 0.05 and have several groups, each time we test one mean against another there is a 0.05 probability of a type I error, so with six t-tests we would have roughly a 0.30 (0.05 × 6) probability of a type I error, which is much higher than the desired 0.05. ANOVA provides a way to test several null hypotheses at the same time while keeping the type I error at 0.05. Its assumptions are that each group sample is drawn from a normally distributed population, that all populations have a common variance, that all samples are drawn independently of each other, that within each sample the observations are sampled randomly and independently of each other, and that factor effects are additive in nature. In clinical research, for example, ANOVA might be appropriate for comparing mean responses among a number of parallel-dose groups or among various strata based on patients' background information, such as race, age group, or disease severity [4].
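
As an illustration of how the two-sample t-test and one-way ANOVA are run in practice, here is a minimal Python sketch using SciPy; the blood-pressure reductions below are invented placeholder numbers, not data from any study cited here:

from scipy import stats

# Hypothetical mean decreases in diastolic blood pressure (mmHg) in three dose groups
drug_a = [8.2, 9.1, 7.5, 10.3, 8.8, 9.6]
drug_b = [6.1, 7.4, 5.9, 6.8, 7.0, 6.5]
drug_c = [4.2, 5.0, 4.8, 5.5, 4.1, 4.9]

# Two-sample (independent) t-test comparing two groups
t_stat, p_ttest = stats.ttest_ind(drug_a, drug_b)

# Paired data would use stats.ttest_rel(before, after) instead

# One-way ANOVA comparing all three groups at once
f_stat, p_anova = stats.f_oneway(drug_a, drug_b, drug_c)
print(t_stat, p_ttest, f_stat, p_anova)
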
ANCOVA
In clinical research, patients who meet the inclusion and exclusion criteria are randomly assigned to each treatment group. Under the assumption that the targeted patient population is homogeneous, we can expect patient characteristics such as age, gender and weight to be comparable between treatment groups. If the patient population is known to be heterogeneous in terms of some demographic variables, then stratified randomization according to these variables should be applied. At the beginning of the study, clinical data are usually collected at randomization to establish baseline values. After administration of the study drug, clinical data are often collected at each visit over the entire duration of the study, and these data are analyzed to assess the efficacy and safety of the treatments. As pointed out earlier, before the analysis of endpoint values, patient characteristics are usually compared between treatments by an analysis of variance (ANOVA) if the variable is continuous. For the analysis of endpoint values, although ANOVA can be applied directly, the endpoint values are usually believed to be linearly related to the baseline values; therefore an adjusted analysis of variance should be considered to account for the baseline values. This adjusted analysis of variance is called analysis of covariance (ANCOVA) [8]. ANCOVA provides a method for comparing response means among two or more groups adjusted for a quantitative concomitant variable, or 'covariate', thought to influence the response. Attention here is confined to cases in which the response, y, may be linearly related to the covariate, x. ANCOVA combines regression and ANOVA methods by fitting simple linear regression models within each group and comparing the regressions among groups. The assumptions of ANCOVA are that, for each independent variable, the relationship between the response (y) and the covariate (x) is linear; that the lines expressing these linear relationships are all parallel (homogeneity of regression slopes); and that the covariate is independent of the treatment effects (i.e. the covariate and independent variables are independent). ANCOVA might be applied to (1) comparing cholesterol levels (y) between a treated group and a reference group adjusted for age (x, in years), (2) comparing scar healing (y) between conventional and laser surgery adjusted for excision size (x, in mm), or (3) comparing exercise tolerance (y) across 3 dose levels of a treatment used for angina patients adjusted for smoking habits (x, in cigarettes/day).

Bartlett Test
The Bartlett test can be used to test for homogeneity of variance [9], for example to check the equality of variances among the treatment groups; when the variances across groups are not equal, the usual analysis of variance assumptions are not satisfied and the ANOVA F test is not valid. The test assumes equal sample sizes drawn from several normal populations. Levene's, Cochran's and Hartley's statistical tests are also used to test for homogeneity of variance.

Bonferroni Test
The Bonferroni test is a multiple comparison test of significance based on the individual p-values derived for each comparison [10]. It can be used to correct any set of p-values for multiple comparisons and is not restricted to use as a follow-up to ANOVA. It works as follows: (1) compute a p-value for each comparison, with no correction for multiple comparisons at this stage; (2) define the familywise significance threshold, often kept at the traditional value of 0.05; (3) divide the value chosen in step 2 by the number of comparisons being made in this family of comparisons (with the traditional 0.05 definition of significance and 20 comparisons, the new threshold is 0.05/20, or 0.0025); (4) call each comparison "statistically significant" if the p-value from step 1 is less than or equal to the value computed in step 3; otherwise, declare that comparison not statistically significant.

Holm's Test
The Holm test is a powerful and versatile multiple comparison test. It can be used in clinical research to compare all pairs of means, compare each group mean to a control mean, or compare preselected pairs of means. It is not restricted to being used as a follow-up to ANOVA but can be used in any multiple comparisons context [11].

Newman-Keuls Test
The Newman-Keuls test, also referred to as the "Student-Newman-Keuls test", is described variously as a stepwise or multiple-stage test. The range statistic varies for each pairwise comparison as a function of the number of group means between the two being compared, and a different shortest significant range is computed for each pairwise comparison of means. Means are first ordered by rank, and the largest and smallest means are tested. If there is no significant difference, testing stops there and it is concluded that none of the means is significantly different. Otherwise, the means with the next greatest difference are tested using a different shortest significant range, and testing continues until no further significant differences are found. The test is used when the group sample sizes are equal. For example, with 5 treatment means, suppose X5 > X1 with p < 0.05 but X4 = X1 (p not significant); then the differences between X1 and X3, X1 and X2, and X2 and X3 cannot be tested, while the difference between X2 and X5 can be tested if it exceeds the difference between the means of X1 and X5. The Student-Newman-Keuls (SNK) test is more powerful than Tukey's method, so it will detect real differences more frequently [12]. However, the Newman-Keuls test offers poor protection against type I error, especially when treatment means fall into groups that are themselves widely spaced apart; differences between means within groups will then be significant more often than they should be at the specified level of α.

Tukey Multiple Comparison Test
In clinical research, the researcher may still need to understand subgroup differences among the different experimental and control groups; these subgroup differences are called "pairwise" differences. ANOVA does not provide tests of pairwise differences, so when the researcher needs to test them, Tukey's multiple comparison method tests each experimental group against each control group [13]. The Tukey method is preferred if there are equal group sizes among the experimental and control groups; a modified Tukey-Kramer method can be applied for comparisons of unequal-sized groups. The test assumes that the observations being tested are independent within and among the groups, that the groups associated with each mean in the test are normally distributed, and that there is equal within-group variance across the groups associated with each mean (homogeneity of variance). As an example in clinical research, consider data on the effect of maternal smoking on child birth weight in which only the effect of duration of smoking is statistically significant; to find which duration or durations are making a significant impact, compare the mean birth weight for the different durations.
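
The Bonferroni and Holm corrections and the Tukey comparison described above are available off the shelf in Python; the following hedged sketch uses statsmodels, with invented p-values and invented birth-weight figures purely for illustration (they are not data from the smoking example above):

import numpy as np
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Bonferroni and Holm adjustment of a set of raw p-values
raw_p = [0.003, 0.012, 0.040, 0.210]
print(multipletests(raw_p, alpha=0.05, method="bonferroni")[1])  # adjusted p-values
print(multipletests(raw_p, alpha=0.05, method="holm")[1])

# Tukey HSD for all pairwise comparisons after a significant ANOVA
birth_weight = np.array([3.4, 3.1, 3.3, 2.9, 2.7, 2.8, 2.5, 2.4, 2.6])   # kg, hypothetical
duration = np.array(["none"] * 3 + ["short"] * 3 + ["long"] * 3)         # smoking-duration groups
print(pairwise_tukeyhsd(endog=birth_weight, groups=duration, alpha=0.05))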

Scheffe Test
The Scheffe test (also called Scheffe's procedure or Scheffe's method) is a multiple comparison test used in analysis of variance [14]. In clinical research, when researchers have run an ANOVA and obtained a significant F-statistic (i.e. rejected the null hypothesis that the means are the same), Scheffe's test is then used to find out which pairs of means are significantly different. The Scheffe test corrects alpha (the level of significance) for simple and complex mean comparisons; complex mean comparisons involve comparing more than one pair of means simultaneously. For example, suppose four different antibiotics were tested for mortality rates among patients with necrotizing fasciitis. All that ANOVA can determine is whether there were significant differences among the groups' mortality rates; it cannot identify which drug produced the lowest mortality rate, or whether two or three of the drugs were equivalent in effectiveness and one was ineffective. The Scheffe method provides that detailed information about each drug.

Dunnett's test
While the Tukey-Kramer method has wide appeal for all pairwise comparisons, Dunnett's test is the preferred method if the goal is to maintain the overall significance level when performing multiple tests that compare a set of treatment means with a control group. Dunnett developed this method of multiple comparisons for obtaining a set of simultaneous confidence intervals for preplanned treatment-versus-control contrasts ti - t1 (i = 2, ..., v), where level 1 corresponds to the control treatment [15]. The Dunnett test is therefore quite useful in clinical research when the researcher wishes to test two or more experimental groups against a single control group [16]. It tests each experimental group's mean against the control group mean, whereas the other methods test each study group against the total group mean (i.e., the grand mean). This difference in testing approach makes the Dunnett method much more likely to find a significant difference, because the grand mean includes all group means and is thus mathematically less extreme than the individual group means; the more extreme group means will produce larger mean differences than tests comparing one group mean to the grand mean.

Repeated Measurement of ANOVA
In clinical research we often record data on the patients more than two times. In such a situation the standard ANOVA procedures are not appropriate, as they do not consider dependencies between observations within subjects in the analysis; to deal with such study data, repeated measures ANOVA should be used [17]. The assumptions of this method are: (1) the dependent variable should be measured at the continuous level, for example time (measured in hours), intelligence (measured using an IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth; (2) the independent variable should consist of at least two categorical "related groups" or "matched pairs", where "related groups" indicates that the same subjects are present in both groups; (3) the distribution of the dependent variable in the two or more related groups should be approximately normal; (4) the variances of the differences between all combinations of related groups must be equal, and there should be no significant outliers in the related groups. As an example in clinical research, consider two groups given two different treatment modalities, with different physical and biochemical parameters (e.g. pulse, systolic blood pressure, serum sodium level, etc.) measured in each group at different time intervals (say pre-intervention, after 1 month and after 2 months); repeated measures ANOVA is used to test the effect of each treatment modality on these parameters over time and, at the same time, to look for any significant difference between the two groups.

Repeated Measurement of ANCOVA
This approach is used in randomized clinical research in which measurements are collected on each patient at a baseline visit and at several post-randomization time points. A longitudinal analysis of covariance, in which the post-baseline values form the response vector and the baseline value is treated as a covariate, can be used to evaluate the treatment differences at the post-baseline time points. Alternatively, a constrained longitudinal data analysis can be used, in which the baseline value is included in the response vector together with the post-baseline values and a constraint of a common baseline mean across treatment groups is imposed on the model as a result of randomization [18]. If the baseline value is subject to missingness, the constrained longitudinal data analysis is more efficient for estimating the treatment differences at post-baseline time points than the longitudinal analysis of covariance. The efficiency gain increases with the number of subjects missing baseline and the number of subjects missing all post-baseline values and, for the pre-post design, decreases with the absolute correlation between baseline and post-baseline values.

Pearson Correlation Test
Pearson correlation is a statistical procedure applied to measure the association between two continuous or ordinal-scale variables. It is used when both variables being studied are normally distributed. The coefficient is affected by extreme values, which may exaggerate or dampen the strength of the relationship, and is therefore inappropriate when either or both variables are not normally distributed. The strength of association is given by Pearson's coefficient, while the significance of the coefficient is expressed by a p-value. Pearson's correlation is denoted by a small letter 'r' and its value may range from -1 to +1. A value of the correlation coefficient between 0 and 1 indicates positive correlation and designates proportional growth of the values of both variables; an example of positive correlation is the duration of diabetes mellitus and the degree of damage to the eye capillaries. A value of the correlation coefficient between 0 and -1 indicates negative correlation, i.e. a rise in the value of one variable that is proportional to a decline in the value of the other, e.g. oxygen concentration in the air drops with the rise in altitude above sea level. Perfect correlations, i.e. values of the coefficient of correlation r = ±1, are not characteristic of biological systems and most frequently refer to theoretical models. A zero value of the coefficient of correlation indicates the absence of linear correlation, i.e. knowing the values of one variable tells us nothing about the values of the other.
Chi-square test (of independence)
The chi-square test of independence is used to assess the association between two independent categorical variables. The idea behind this test is to compare the observed frequencies with the frequencies that would be expected if the null hypothesis of no association/statistical independence were true. By assuming the variables are independent, we can predict an expected frequency for each cell in the contingency table. If the value of the test statistic for the chi-squared test of association is too large, it indicates poor agreement between the observed and expected frequencies, and the null hypothesis of independence/no association is rejected. In clinical research, for example, it can be used to test the association between an adverse event and the treatment used. The assumptions of the chi-square test are independent random sampling, no more than 20% of the cells with an expected frequency of less than five, and no empty cells. If the chi-square test shows a significant result we may be interested in the degree or strength of association among the variables, which the test does not provide; it also cannot handle the situation in which 20% or more of the cells have an expected frequency of less than five, where the usual chi-square test is not valid. In that case the Fisher exact test is used to test the association among the variables, although this method also fails to give the strength of association.

Chi-square test (of Homogeneity)
The chi-square test of homogeneity is applied to a single categorical variable from two different populations. It is used to determine whether frequency counts are distributed identically across the different populations. The assumptions are that, for each population, the sampling method is simple random sampling, that the sample data are displayed in a contingency table (populations x category levels), and that the expected frequency count for each cell of the table is at least 5. For example, in multicenter clinical trials it can be used to test for differences among the centres in the response to the particular drug(s).

Fisher Exact Test
Fisher's exact test is used in place of the chi-squared and normal-approximation tests for a 2 x 2 contingency table when cells have an expected frequency of five or less [19]. The chi-square test assumes that each cell has an expected frequency of five or more, but Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. For example, in clinical research it could be used in a study comparing two treatment regimes for controlling bleeding in haemophiliacs undergoing surgery, when a cell frequency of the 2 x 2 contingency table is five or less [20].

G–test of independence
The G–test of independence is used when the researcher has two nominal variables, each with two or more possible values, and wants to see whether the proportions of one variable are different for different values of the other variable. For example, suppose a researcher wanted to know whether it is better to give the diphtheria, tetanus and pertussis (DTaP) vaccine in the thigh or in the arm, and collected data on severe reactions to this vaccine in children aged 3 to 6 years. One nominal variable is severe reaction vs. no severe reaction; the other nominal variable is thigh vs. arm [21]. If a higher proportion of severe reactions is seen in children vaccinated in the arm, a G–test of independence will tell whether a difference this big is likely to have occurred by chance. Fisher's exact test is more accurate than the G–test of independence when the expected numbers are small.

Binomial Test
The binomial test is used for testing whether a proportion from a single dichotomous variable is equal to a presumed population value, and it serves as an alternative to the z-test for population proportions. The assumptions for the test are that (a) the data are dichotomous, (b) observations are independent of each other, and (c) the proportion of observations in category A multiplied by the total number of observations (A + B) is greater than 10, and likewise for category B, so that the normal approximation to the binomial can be used and a z-score calculated. In clinical research, a common use of the binomial test is for estimating a response rate, p, using the number of patients (X) who respond to an investigative treatment out of a total of n studied.

McNemar test
In clinical research the McNemar test is used when the researcher is interested in testing for an improvement in response rate after a particular treatment, or in finding a change in proportion for paired data (e.g., studies in which patients serve as their own controls, or studies with a before-and-after design). The three main assumptions for this test are that the variable is nominal with two categories (i.e. dichotomous) and there is one independent variable with two connected groups; that the two groups of the dependent variable are mutually exclusive; and that the sample is a random sample and no expected frequencies are less than five. Data should be placed in a 2×2 contingency table, with the cell frequencies equalling the number of pairs. For example, a researcher testing a new medication records whether the drug worked ("yes") or did not ("no") for each patient.
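
A hedged sketch of the McNemar test in Python with statsmodels, using an invented 2×2 table of paired before/after responses (the counts are placeholders, not data from any study cited here):

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: response before treatment (yes/no); columns: response after treatment (yes/no)
paired_counts = np.array([[30, 5],
                          [15, 10]])

result = mcnemar(paired_counts, exact=True)   # exact binomial form, useful for small discordant counts
print(result.statistic, result.pvalue)

Only the discordant cells (5 and 15 here) drive the test, which is why the table must contain pair counts rather than pooled totals.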
Generalized McNemar/Stuart-Maxwell Test
The generalization of McNemar's test from 2x2 square tables to KxK tables is often referred to as the generalized McNemar or Stuart-Maxwell test [22, 23]. In clinical research, this test is used to analyze matched-pair pre-post (treatment) data with multiple discrete levels (e.g. severity of pain) of the exposure (outcome) variable.

Bhapkar's test
Bhapkar's test is a test of marginal homogeneity that exploits the asymptotic normality of the marginal proportions [24]. The idea behind the construction of the test statistic is similar to that of the generalized McNemar test statistic; the main difference lies in the calculation of the elements of the variance-covariance matrix. Although the Bhapkar and Stuart-Maxwell tests are asymptotically equivalent [25], the Bhapkar test is a more powerful alternative to the Stuart-Maxwell test; in large samples both will produce the same chi-squared value [24].

Cochran's Q test
This test is used to determine whether there are differences on a dichotomous dependent variable between three or more related groups. When a binary response is measured several times or under different conditions, Cochran's Q tests whether the marginal probability of a positive response is unchanged across the times or conditions. The Cochran Q test is an extension of the McNemar test for related samples that provides a method for testing the differences between three or more matched sets of frequencies or proportions. Its assumptions are one dependent variable with two mutually exclusive groups (i.e., the variable is dichotomous; an example is perceived safety, with the two groups "safe" and "unsafe"), one independent variable with three or more related groups, and cases (e.g., participants) that are a random sample from the population of interest. For example, a drug data set contains data from a study of three drugs to treat a chronic disease, in which forty-six subjects receive drugs A, B and C [26]; the response to each drug is either favorable or unfavorable, and the aim is to test for differences in favorable response among the three drugs.

Cohen's kappa statistic
Cohen's kappa statistic is a measure of agreement between categorical variables. For example, kappa can be used to compare the ability of different raters to classify subjects into one of several groups. Kappa can also be used to assess the agreement between alternative methods of categorical assessment when new techniques are under study. In the clinical setting, comparison of a new measurement technique with an established one is often needed to check whether they agree sufficiently for the new to replace the old; correlation is often misleading for this purpose [27]. Cohen's kappa is used, and the level of agreement between raters assessed, in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder).
The kappa coefficient (κ) is used to assess inter-rater agreement. One of the most important features of the kappa statistic is that it is a measure of agreement which naturally controls for chance. Kappa is always less than or equal to 1: a value of 1 implies perfect agreement and values less than 1 imply less-than-perfect agreement. In rare situations kappa can be negative, a sign that the two observers agreed less than would be expected just by chance. A possible interpretation of the kappa coefficient (κ) is as follows:
- Poor agreement = less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00

Cronbach's α (alpha) Statistic
Cronbach's alpha is a statistic for investigating the internal consistency of a questionnaire [28, 29]. Generally, many quantities of interest in medicine, such as anxiety or degree of handicap, are impossible to measure explicitly; in such cases, we ask a series of questions and combine the answers into a single numerical value. For example, a Quality of Life (QoL) scale used in clinical research should have demonstrated reliability and validity and be responsive to change in health status; reliability is assessed through examination of the internal consistency at a single administration of the instrument using Cronbach's α (alpha).

Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a non-parametric, or distribution-free, test for the case of two related samples or repeated measurements on a single sample. It can be used (a) in place of a one-sample t-test, (b) in place of a paired t-test, or (c) for ordered categorical data where a numerical scale is inappropriate but it is possible to rank the observations, when the population cannot be assumed to be normally distributed. For example, it can be applied to the hours of relief provided by two analgesic drugs in patients suffering from arthritis, to test whether one drug provides longer relief than the other.

Mann–Whitney U test
The Mann–Whitney U test is a non-parametric, or distribution-free, test for comparing differences between two independent groups when the dependent variable is either ordinal or continuous but not normally distributed. The Mann-Whitney (or Wilcoxon-Mann-Whitney) test is sometimes used for comparing the efficacy of two treatments in clinical research and is often presented as an alternative to a t-test when the data are not normally distributed. Whereas a t-test is a test of population means, the Mann-Whitney test is commonly regarded as a test of population medians.

Kruskal-Wallis H test
The Kruskal-Wallis H test is a rank-based nonparametric test that can be used to determine whether there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. The test is sometimes described as an ANOVA with the data replaced by their ranks, and it is an extension of the Mann-Whitney U test to three or more groups. For example, in clinical research it can be used to assess differences in albumin levels among adults on diets with different amounts of protein.

Friedman Post Hoc test
The Friedman test is a non-parametric (distribution-free) test used to compare observations repeated on the same subjects; it is an alternative to the repeated measures ANOVA when the assumption of normality or equality of variance is not met. If the Friedman test gives a significant p-value, it means that some of the groups in the data have a different distribution from one another, but it does not say which. It is therefore necessary to find out which pairs of groups are significantly different from each other; with N groups, checking all of their pairs requires [n over 2] comparisons, so the need to correct for multiple comparisons arises. In that situation the Friedman post hoc test is used. In clinical research, this test can be used to find out the improvement due to the drug(s) among the patients across follow-up visits for a particular disease.
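
One common way to run this in Python is the Friedman test followed by pairwise Wilcoxon signed-rank tests with a Holm correction; this is only one of several possible post hoc strategies, and the symptom scores below are invented for illustration:

from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical symptom scores for the same six patients at three follow-up visits
visits = {
    "visit1": [7, 6, 8, 5, 7, 6],
    "visit2": [5, 5, 6, 4, 6, 5],
    "visit3": [3, 4, 4, 3, 5, 4],
}

stat, p = stats.friedmanchisquare(*visits.values())
print("Friedman:", stat, p)

# Post hoc: pairwise Wilcoxon signed-rank tests, Holm-adjusted for multiple comparisons
pairs = list(combinations(visits, 2))
raw_p = [stats.wilcoxon(visits[a], visits[b]).pvalue for a, b in pairs]
adjusted = multipletests(raw_p, method="holm")[1]
print(list(zip(pairs, adjusted)))
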
Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test is a nonparametric statistical test that compares the cumulative distributions of two data sets. It does not assume that the data are sampled from Gaussian distributions (or any other defined distribution). The one-sample K-S test is used to decide whether a sample comes from a population with a completely specified continuous distribution; it assumes that the population distribution is fully specified (i.e. that the mean and standard deviation (SD) of the overall population are known, perhaps from prior work) [30, 31]. For example, in clinical research it has been applied to the serum antioxidant levels of 30 patients with pemphigus vulgaris, an auto-immune blistering disorder [32].

Spearman Correlation Test
Spearman correlation tests the association between two ranked variables, or between one ranked variable and one measurement variable. It is appropriate when one or both variables are skewed or ordinal [33] and is robust when extreme values are present. It can be used instead of linear regression/correlation for two measurement variables if one is worried about non-normality, although this is not usually necessary. The Spearman correlation coefficient tests solely for monotonic relationships for at least ordinally scaled parameters; its advantages are its robustness to outliers and to skewed distributions. Correlation coefficients measure the strength of association and can take values between -1 and +1; the closer they are to ±1, the stronger the association. A test variable and a statistical test can be constructed from the correlation coefficient; the null hypothesis to be tested is then that there is no linear (or monotonic) correlation.

Cochran Armitage trend test
In clinical research it is often of interest to investigate the relationship between increasing dosage and the effect of the drug under study. Usually the dose levels tested are ordinal and the effect of the drug is measured as a binary outcome. In this case the Cochran-Armitage trend test is used to test for a trend among binomial proportions across the levels of a single factor or covariate [34, 35]. This test is appropriate for a two-way table where one variable has two levels and the other variable is ordinal; the two-level variable represents the response, and the other variable represents an explanatory variable with ordered levels.

Mantel Haenszel (MH) test
The Mantel-Haenszel (MH) statistic is used to analyze two dichotomous variables while adjusting for a third variable, in order to determine whether there is a relationship between the two variables after controlling for the levels of the third variable. For example, it can be used to compare the frequency of smoking vs. non-smoking in teenage boys vs. girls across several different cities, treated as replicated 2x2 tables.

Cochran Mantel Haenszel (CMH) test
The Cochran-Mantel-Haenszel test is a non-model-based test used to identify confounders and to control for confounding in the statistical analysis; it is used to test conditional independence in 2x2xK tables. The CMH test is often used in the comparison of response rates between two treatment groups in a multi-centre study, using the study centres as strata [26]. The CMH test can be generalized to IxJxK tables.

Log-rank test
The log-rank test is a nonparametric test for comparing distributions of time until the occurrence of an event of interest among independent groups. The event is often death due to disease, but it might be any binary outcome, such as cure, response, relapse or failure. Examples where use of the log-rank test might be appropriate include comparing survival times in cancer patients who are given a new treatment with those of patients who receive standard chemotherapy, or comparing times-to-cure among several doses of a topical antifungal preparation where the patient is treated for 10 weeks or until cured, whichever comes first.

Peto log-rank or Peto's generalized Wilcoxon test
This test gives more weight to the initial interval of the study, where the largest number of patients is at risk. If the rate of death is similar over time, the Peto log-rank test and the log-rank test will produce similar results. The log-rank test is more appropriate than the Peto generalized Wilcoxon test when the alternative hypothesis is that the risk of death for an individual in one group is proportional to the risk at that time for a similar individual in the other group. The validity of this proportional-risk assumption can be checked from the survivor functions of both groups: if it is clear that they do not cross each other, the proportional-risk assumption is quite probably true and the log-rank test should be used; otherwise, the Peto log-rank test is used instead.

Odds Ratio (OR)
The odds ratio is the ratio of the odds of disease in the exposed to the odds of disease in the non-exposed. It is used to measure the association between the risk of a particular outcome (or disease) and the presence of a certain factor (or exposure). The odds ratio is a relative measure of risk, telling us how much more likely it is that someone who is exposed to the factor under study will develop the outcome compared with someone who is not exposed. For a 2x2 contingency table:
- OR = 1 suggests an equal chance of getting the disease in the exposed group compared with the unexposed group.
- OR > 1 suggests a greater chance or likelihood of getting the disease in the exposed group compared with the unexposed group.
- OR < 1 suggests a lesser chance or likelihood of getting the disease in the exposed group compared with the unexposed group.
The odds ratio can be used in both retrospective and prospective studies. It is useful for analysing associations between groups from case-control and prevalence (or cross-sectional) data; for rare diseases (or diseases with long latency periods) the OR can be an approximate measure of the RR (relative risk), and it can be used to estimate the strength of an association between exposures and outcomes.
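
A minimal sketch of the calculation from a 2x2 table, written directly with NumPy; the counts, and the Woolf (log-scale Wald) confidence interval shown with them, are illustrative choices rather than data from any study cited in the text:

import numpy as np

# Hypothetical 2x2 table: rows = exposed / unexposed, columns = disease / no disease
a, b = 40, 60    # exposed: with disease, without disease
c, d = 20, 80    # unexposed: with disease, without disease

odds_ratio = (a * d) / (b * c)

# Approximate 95% confidence interval on the log-odds scale
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
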
Relative Risk (RR)
The risk of disease is the probability that an individual becomes newly diseased, given that the individual has the particular attribute. The relative risk is the ratio of the risk of disease for those with the risk factor to the risk of disease for those without the risk factor. In clinical research it is used to compare the risk of developing a disease in people not receiving the treatment (or receiving a placebo) with that in people who are receiving the treatment; alternatively, it is used to compare the risk of developing a side effect in people receiving a drug with that in people who are not receiving the treatment. For a 2x2 contingency table:
- RR = 1 implies that the two groups (exposed and unexposed) have the same risk.
- RR > 1 implies a higher risk of getting the disease in the exposed group compared with the unexposed group.
- RR < 1 implies a lower risk of getting the disease in the exposed group compared with the unexposed group.

Sensitivity, specificity, Predictive Value of a Positive Test (PPV) and Predictive Value of a Negative Test (NPV)
- Sensitivity: the sensitivity of a test is its ability to identify correctly those who have the disease; it is the proportion of patients with the disease in whom the test is positive.
- Specificity: the specificity of a test is its ability to identify correctly those who do not have the disease; it is the proportion of patients without the disease in whom the test is negative.
- Predictive value of a positive test (PPV): the likelihood that an individual with a positive test has the disease.
- Predictive value of a negative test (NPV): the likelihood that an individual with a negative test does not have the disease.

Simpson's Paradox
Simpson's paradox, also known as the Yule-Simpson effect, was first described by Yule [36] and is named after Simpson [37]. In clinical research, Simpson's paradox arises when the association between an exposure and an outcome is investigated but the exposure and outcome are strongly associated with a third variable. A real-life example comes from a medical study comparing the success rates of two treatments for kidney stones [38].

Tests for Linear Trend
In a clinical study the researcher may be interested in a dose-response effect, that is, a situation in which an increased value of the risk factor means a greater likelihood of disease. A test for linear trend is used to test for a dose-response trend whenever there are different levels of the risk factor (i.e. the risk factor is ordinal or is at least treated as such); Armitage described the details of the theory [34]. For example, it can be used to test whether the prevalence of cough is greater for greater amounts of smoking.

Tests for Nonlinearity
Sometimes the relationship between the risk factor and disease is nonlinear. For example, it could be that low and high doses of the risk factor are harmful compared with average doses. Such a U-shaped relationship has been found by several authors who have investigated the relationship between alcohol consumption and death from any cause, and a test for nonlinearity is used to test for it [39].

Permutation test
The permutation test is used to perform a nonparametric test of the difference between treatment groups in the assessment of new medical interventions. In addition, it is used to study efficacy in a randomized clinical trial which compares, in a heterogeneous patient population, two or more treatments, each of which may be most effective in some patients, when the primary analysis does not adjust for covariates. A general discussion and application of permutation tests is described by Zucker DM [40].
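
A minimal sketch of a two-group permutation test on the difference in means, written directly with NumPy; the group values and the number of resamples are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
treatment = np.array([5.1, 6.3, 7.0, 5.8, 6.9, 7.4])   # hypothetical responses
control = np.array([4.2, 5.0, 4.8, 5.6, 4.9, 5.3])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])

n_resamples = 10000
extreme = 0
for _ in range(n_resamples):
    rng.shuffle(pooled)                          # relabel subjects at random
    diff = pooled[:treatment.size].mean() - pooled[treatment.size:].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = (extreme + 1) / (n_resamples + 1)      # two-sided permutation p-value
print(observed, p_value)

Recent SciPy versions also provide scipy.stats.permutation_test for the same purpose.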

3. Conclusion
Statistical tests are used to analyze different types of data in different situations, according to the nature of the data set. Each statistical test has its limitations, and to overcome them another method is used. Before using a statistical test in clinical research we need to check its assumptions and the type of study. Most of these statistical tests play a very important role in obtaining appropriate and desired results in clinical research and in making decisions about the study objectives. Statistical tests help researchers and physicians to determine results from experiments, from clinical research on medicines and from the symptoms of diseases. The use of statistical tests in medicine provides generalizations that help the public better understand their risks for certain diseases, the links between certain behaviours and diseases, the effectiveness of drug(s), and the significance of findings relative to the experimental objectives.

4. References
1. Wang D, Bakhai A. Clinical Trials - A Practical Guide to Design, Analysis, and Reporting. Remedica Publishing, USA, 2006.
2. Campbell MJ. Statistics at Square Two (2nd Ed.). Blackwell, USA, 2006.
3. Box JF. Guinness, Gosset, Fisher, and Small Samples. Statistical Science. 1987; 2(1):45-52.
4. Walker GA, Shostak J. Common Statistical Methods for Clinical Research with SAS® Examples (3rd Ed.). SAS Publishing, USA, 2010.
5. Mahajan BK. Methods in Biostatistics for Medical Students and Research Workers (7th Ed.). Jaypee, India, 2010.
6. Hotelling H. The generalization of Student's ratio. Ann Math Stat. 1931; 2(3):360-378.
7. Scheffé H. The Analysis of Variance (Classics Ed.). John Wiley & Sons, USA, 1999.
8. Chow SC, Liu JP. Design and Analysis of Clinical Trials: Concepts and Methodologies (2nd Ed.). John Wiley & Sons, New Jersey, 2004.
9. Bartlett MS. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London Series A. 1937; 160:268-282.
10. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988; 75(4):800-802.
11. Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking (2nd Ed.). New York, NY: Oxford University Press, 2010.
12. Herve Abdi, Lynne JW. Newman-Keuls Test and Tukey Test. In: Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage, 2010.
13. McHugh ML. Multiple comparison analysis testing in ANOVA. Biochemia Medica. 2011; 21(3):203-209.
14. Scheffé H. The Analysis of Variance. New York: Wiley, 1959.
15. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association. 1955; 50:1096-1121.
16. Dunnett CW. New tables for multiple comparisons with a control. Biometrics. 1964; 20:482-491.
17. Singh V, Rana RK, Singhal R. Analysis of repeated measurement data in the clinical trials. Journal of Ayurveda and Integrative Medicine. 2013; 4(2):77-81.
18. Liang KY, Zeger S. Longitudinal data analysis of continuous and discrete responses for pre-post designs. The Indian Journal of Statistics. 2000; 62:134-148.
19. Fisher RA. Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd, 1925.
20. Sarmukaddan SB. Clinical Biostatistics (1st Ed.). New Age International, India, 2014.
21. Jackson LA, Peterson D, Nelson JC, et al. (13 co-authors). Vaccination site and risk of local reactions in children one through six years of age. Pediatrics. 2013; 131:283-289.
22. Stuart A. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika. 1955; 42:412-416.
23. Maxwell AE. Comparing the classification of subjects by two independent judges. British Journal of Psychiatry. 1970; 116:651-655.
24. Bhapkar VP. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association. 1966; 61:228-235.
25. Keefe TJ. On the relationship between two tests for homogeneity of the marginal distributions in a two-way classification. Biometrics. 1982; 69:683-684.
26. Agresti A. Categorical Data Analysis (2nd Ed.). John Wiley & Sons, New Jersey, 2002.
27. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986; i:307-310.
28. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951; 16:297-334.
29. Bland JM, Altman DG. Statistics notes: Cronbach's alpha. British Medical Journal. 1997; 314:572.
30. Lilliefors H. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. JASA. 1967; 62:399-402.
31. Sprent P, Smeeton NC. Applied Nonparametric Statistical Methods (4th Ed.). Florida: Chapman and Hall/CRC, 2001.
32. Alireza AB, Shima Y, Sara J, Maryam Y, Farid Z, Farid AJ. How to test normality distribution for a variable: a real example and a simulation study. Journal of Paramedical Sciences (JPS). 2013; 4(1):73-77.
33. Altman DG. Practical Statistics for Medical Research. Chapman & Hall/CRC, 1990.
34. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955; 11:375-386.
35. Cochran WG. Some methods for strengthening the common chi-square tests. Biometrics. 1954; 10:417-451.
36. Yule G. Notes on the theory of association of attributes in statistics. Biometrika. 1903; 2:121-134.
37. Simpson EH. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B. 1951; 13:238-241.
38. Julious SA, Mullee MA. Confounding and Simpson's paradox. British Medical Journal. 1994; 309:1480-1481.
39. Duffy JC. Alcohol consumption and all-cause mortality. International Journal of Epidemiology. 1995; 24(1):100-105.
40. Zucker DM. Permutation Tests in Clinical Trials. Wiley Encyclopedia of Clinical Trials, 2007. (http://pluto.mscc.huji.ac.il/~mszucker/DESIGN/perm.pdf)
