Basic Statistical Tools in Research and Data Analysis
Variables
[Figure: classification of variables into qualitative and quantitative types, with interval and ratio scales]
Categorical variable: a variable that can be put into categories. For example, the category “Toothpaste Brands” might contain the
variables Colgate and Aquafresh.
Confounding variable: an extra variable that has a hidden effect on your experimental results.
Continuous variable: a variable with an infinite number of values, like “time” or “weight”.
Control variable: a factor in an experiment which must be held constant. For example, in an experiment to determine whether
light makes plants grow faster, you would have to control for soil quality and water.
Dependent variable: the outcome of an experiment. As you change the independent variable, you watch what happens to the
dependent variable.
Discrete variable: a variable that can only take on a certain number of values. For example, “number of cars in a parking lot” is
discrete because a car park can only hold so many cars.
Independent variable: a variable that is not affected by anything that you, the researcher, do. Usually plotted on the x-axis.
A measurement variable has a number associated with it. It’s an “amount” of something, or a “number” of something.
Nominal variable: another name for categorical variable.
Ordinal variable: similar to a categorical variable, but there is a clear order. For example, income levels of low, middle, and high
could be considered ordinal.
Qualitative variable: a broad category for any variable that can’t be counted (i.e. has no numerical value). Nominal and ordinal
variables fall under this umbrella term.
Quantitative variable: A broad category that includes any variable that can be counted, or has a numerical value associated with
it. Examples of variables that fall into this category include discrete variables and ratio variables.
Random variables are associated with random processes and give numbers to outcomes of random events.
A ranked variable is an ordinal variable; a variable where every data point can be put in order (1st, 2nd, 3rd, etc.).
Ratio variables: similar to interval variables, but have a meaningful zero.
Experimental research: In experimental research, the aim is to manipulate an independent variable(s) and then examine the effect that
this change has on a dependent variable(s). Since it is possible to manipulate the independent variable(s), experimental research has the
advantage of enabling a researcher to identify a cause and effect between variables. For example, take our example of 100 students
completing a maths exam where the dependent variable was the exam mark (measured from 0 to 100), and the independent variables
were revision time (measured in hours) and intelligence (measured using IQ score). Here, it would be possible to use an experimental
design and manipulate the revision time of the students. The tutor could divide the students into two groups, each made up of 50
students. In "group one", the tutor could ask the students not to do any revision. Alternately, "group two" could be asked to do 20 hours
of revision in the two weeks prior to the test. The tutor could then compare the marks that the students achieved.
Non-experimental research: In non-experimental research, the researcher does not manipulate the independent variable(s). This is not
to say that it is impossible to do so, but it would be either impractical or unethical. For example, a researcher may be interested in
the effect of illegal, recreational drug use (the independent variable(s)) on certain types of behaviour (the dependent variable(s)).
However, whilst possible, it would be unethical to ask individuals to take illegal drugs in order to study what effect this had on certain
behaviours. As such, a researcher could ask both drug and non-drug users to complete a questionnaire that had been constructed to
indicate the extent to which they exhibited certain behaviours. Whilst it is not possible to identify the cause and effect between the
variables, we can still examine the association or relationship between them. In addition to understanding the difference between
dependent and independent variables, and experimental and non-experimental research, it is also important to understand the different
characteristics amongst variables. This is discussed next.
Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we
would most probably categorize somebody as either "male" or "female". This is an example of a dichotomous variable (and also a
nominal variable). Another example might be if we asked a person if they owned a mobile phone. Here, we may categorise mobile phone
ownership as either "Yes" or "No". In the real estate agent example, if type of property had been classified as either residential or
commercial then "type of property" would be a dichotomous variable.
Ordinal variables are variables that have two or more categories, just like nominal variables, only the categories can also be ordered or
ranked. So if you asked someone if they liked the policies of the Democratic Party and they could answer either "Not very much", "They
are OK" or "Yes, a lot" then you have an ordinal variable. Why? Because you have 3 categories, namely "Not very much", "They are OK"
and "Yes, a lot" and you can rank them from the most positive (Yes, a lot), to the middle response (They are OK), to the least positive (Not
very much). However, whilst we can rank the levels, we cannot assign a "value" to them; we cannot say that "They are OK" is twice as
positive as "Not very much", for example.
Quantitative variables
Quantitative or numerical data are subdivided into discrete and continuous measurements. Discrete numerical data are recorded as a
whole number such as 0, 1, 2, 3,… (integer), whereas continuous data can assume any value. Observations that can be counted constitute
the discrete data and observations that can be measured constitute the continuous data. Examples of discrete data are number of
episodes of respiratory arrests or the number of re-intubations in an intensive care unit. Similarly, examples of continuous data are the
serial serum glucose levels, partial pressure of oxygen in arterial blood and the oesophageal temperature.
A hierarchical scale of increasing precision can be used for observing and recording the data which is based on categorical, ordinal,
interval and ratio scales.
Categorical or nominal variables are unordered. The data are merely classified into categories and cannot be arranged in any particular
order. If only two categories exist (as in gender: male and female), it is called dichotomous (or binary) data. The various causes of re-
intubation in an intensive care unit (upper airway obstruction, impaired clearance of secretions, hypoxemia, hypercapnia,
pulmonary oedema and neurological impairment) are examples of categorical variables.
Ordinal variables have a clear ordering between the categories. However, the ordered data may not have equal intervals. Examples are the
American Society of Anesthesiologists status or Richmond agitation-sedation scale.
Interval variables are similar to ordinal variables, except that the intervals between the values of the interval variable are equally
spaced. A good example of an interval scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit scale, the
difference between 70° and 75° is equal to the difference between 80° and 85°: The units of measurement are equal throughout the full
range of the scale.
Ratio scales are similar to interval scales, in that equal differences between scale values have equal quantitative meaning. However, ratio
scales also have a true zero point, which gives them an additional property. For example, the system of centimetres is an example of a
ratio scale. There is a true zero point and the value of 0 cm means a complete absence of length. The thyromental distance of 6 cm in an
adult may be twice that of a child in whom it may be 3 cm.
STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS
Descriptive statistics[4] try to describe the relationship between variables in a sample or population. Descriptive statistics provide a
summary of data in the form of mean, median and mode. Inferential statistics[4] use a random sample of data taken from a population to
describe and make inferences about the whole population. It is valuable when it is not possible to examine each member of an entire
population.
Mean, x̄ = Σx / n,
where x = each observation and n = number of observations. Median[6] is defined as the middle of a distribution in ranked data (with
half of the variables in the sample above and half below the median value) while mode is the most frequently occurring variable in a
distribution. Range defines the spread, or variability, of a sample.[7] It is described by the minimum and maximum values of the
variables. If we rank the data and after ranking, group the observations into percentiles, we can get better information of the pattern of
spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe 25%, 50%, 75% or any other
percentile amount. The median is the 50th percentile. The interquartile range will be the observations in the middle 50% of the
observations about the median (25th–75th percentile). Variance[7] is a measure of how spread out the distribution is. It gives an indication
of how close an individual observation clusters about the mean value. The variance of a population is defined by the following formula:
σ² = Σ (Xi − X)² / N
where σ² is the population variance, X is the population mean, Xi is the ith element from the population and N is the number of elements in
the population. The variance of a sample is defined by a slightly different formula:
s² = Σ (xi − x̄)² / (n − 1)
where s² is the sample variance, x̄ is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample.
The formula for the variance of a population has ‘N’ as the denominator, whereas the formula for the variance of a sample has ‘n − 1’. The
expression ‘n − 1’ is known as the degrees of freedom and is one less than the number of observations: given the sample mean, each
observation is free to vary except the last one, which must be a defined value. The variance is measured in squared units. To make the
interpretation of the data simple and to retain the basic unit of
observation, the square root of variance is used. The square root of the variance is the standard deviation (SD).[8] The SD of a population
is defined by the following formula:
σ = √[Σ (Xi − X)² / N]
where σ is the population SD, X is the population mean, Xi is the ith element from the population and N is the number of elements in the
population. The SD of a sample is defined by a slightly different formula:
s = √[Σ (xi − x̄)² / (n − 1)]
where s is the sample SD, x̄ is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample. An
example for calculation of variation and SD is illustrated in Table 2.
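The descriptive measures above can be computed directly. A minimal Python sketch using the standard library, with hypothetical serum glucose readings (illustrative values, not the data from the article's tables), shows the mean, median, mode, range, and the population versus sample forms of the variance and SD:

```python
import statistics
import math

# Hypothetical sample of 8 serum glucose readings (mg/dL); illustrative only.
data = [90, 95, 100, 100, 105, 110, 115, 120]
n = len(data)

mean = sum(data) / n                      # arithmetic mean: sum of x over n
median = statistics.median(data)          # middle of the ranked data (50th percentile)
mode = statistics.mode(data)              # most frequently occurring value
data_range = max(data) - min(data)        # spread: maximum minus minimum

# Variance: the population form divides by N, the sample form by n - 1
# (the degrees of freedom, since the sample mean fixes one value).
pop_var = sum((x - mean) ** 2 for x in data) / n
samp_var = sum((x - mean) ** 2 for x in data) / (n - 1)

# The SD is the square root of the variance, restoring the original units.
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)
```

Note that `statistics.variance` and `statistics.pvariance` give the sample and population forms respectively, matching the hand computation above.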
t = (X1 − X2) / SE
where X1 − X2 is the difference between the means of the two groups and SE denotes the standard error of the difference.
The paired t-test is used to test if the population means estimated by two dependent samples differ significantly. A usual setting for the
paired t-test is when measurements are made on the same subjects before and after a treatment.
The formula for paired t-test is:
t = d / SE
where d is the mean difference and SE denotes the standard error of this difference.
The group variances can be compared using the F-test. The F-test is the ratio of variances (var 1/var 2). If F differs significantly from 1.0,
then it is concluded that the group variances differ significantly.
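These three statistics can be sketched in Python from the formulas above. The function names (`unpaired_t`, `paired_t`) and the exam-mark figures are invented for illustration; the unpaired form shown is the pooled-variance version for two independent groups:

```python
import statistics

# Hypothetical exam marks for two groups of students (illustrative data).
group1 = [62, 70, 65, 72, 68]   # no revision
group2 = [75, 80, 78, 85, 82]   # 20 hours of revision

def unpaired_t(a, b):
    """t = (mean1 - mean2) / SE, pooled-variance form for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)   # n - 1 denominators
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = (pooled * (1 / n1 + 1 / n2)) ** 0.5                  # SE of the difference
    return (m1 - m2) / se

def paired_t(before, after):
    """t = mean difference / SE of the differences (same subjects measured twice)."""
    d = [b - a for a, b in zip(before, after)]
    se = statistics.stdev(d) / len(d) ** 0.5
    return statistics.mean(d) / se

# F-test: the ratio of the two group variances (var 1 / var 2).
f_ratio = statistics.variance(group1) / statistics.variance(group2)
```

The resulting t and F values would then be referred to their sampling distributions for the appropriate degrees of freedom to obtain P values.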
Analysis of variance
The Student's t-test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant
difference between the means of two or more groups.
In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error
variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.
However, the between-group (or effect variance) is the result of our treatment. These two estimates of variances are compared using the
F-test.
A simplified formula for the F statistic is:
F = MSb / MSw
where MSb is the mean squares between the groups and MSw is the mean squares within groups.
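The F statistic can be built from its parts. A small from-scratch Python sketch (illustrative, not the article's code; the function name is hypothetical) computes MSb and MSw for any number of groups:

```python
import statistics

def one_way_anova_f(*groups):
    """F = MSb / MSw: between-group over within-group mean squares."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total     # grand mean of all observations

    # Between-group sum of squares: squared deviation of each group mean
    # from the grand mean, weighted by group size (the effect variance).
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)

    # Within-group sum of squares: deviations of observations from their
    # own group mean (the error variance).
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)                 # MSb
    ms_within = ss_within / (n_total - k)             # MSw
    return ms_between / ms_within
```

For example, `one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])` gives F = 3.0, which would then be compared against the F distribution with (k − 1, n − k) degrees of freedom.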
Repeated measures analysis of variance
As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measure
ANOVA is used when all variables of a sample are measured under different conditions or at different points in time.
As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a
standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate
the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should
be used.
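A one-way repeated measures ANOVA can be sketched by partitioning the total sum of squares so that subject-to-subject variability is removed from the error term, which is exactly what a standard ANOVA fails to do here. The data below are hypothetical measurements on four subjects under three conditions:

```python
# data[subject][condition]: 4 subjects, each measured under 3 conditions
# (hypothetical values for illustration).
data = [
    [30, 28, 25],
    [35, 32, 30],
    [28, 26, 22],
    [33, 30, 27],
]
n = len(data)        # number of subjects
k = len(data[0])     # number of conditions (repeated measures)

grand = sum(sum(row) for row in data) / (n * k)
cond_means = [sum(row[j] for row in data) / n for j in range(k)]
subj_means = [sum(row) / k for row in data]

# Partition the total variability: condition effect, subject effect, and error.
ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_error = ss_total - ss_cond - ss_subj      # residual after removing subjects

df_cond, df_error = k - 1, (k - 1) * (n - 1)
f_stat = (ss_cond / df_cond) / (ss_error / df_error)
```

Because the consistent differences between subjects are absorbed into `ss_subj` rather than the error term, the resulting F is far larger than an ordinary one-way ANOVA on the same numbers would give.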
Non-parametric tests
When the assumptions of normality are not met and the sample means are not normally distributed, parametric tests can lead to
erroneous results. Non-parametric tests (distribution-free tests) are used in such situations as they do not require the normality
assumption.[15] Non-parametric tests may fail to detect a significant difference when compared with a parametric test. That is, they
usually have less power.
As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the
null hypothesis is accepted or rejected. The types of non-parametric analysis techniques and the corresponding parametric analysis
techniques are delineated in Table 5.
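As an illustration of one such pairing, the Mann-Whitney U test (the non-parametric counterpart of the unpaired t-test) can be sketched from scratch in Python; it replaces the raw values with their ranks in the pooled data, using average ranks for ties:

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for two independent samples.

    Ranks the pooled data (average rank for tied values), sums the ranks of
    the first group, and returns the smaller of the two U statistics.
    """
    pooled = sorted(a + b)

    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j

    r1 = sum(ranks[x] for x in a)            # rank sum of group a
    n1, n2 = len(a), len(b)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)
```

The smaller U is then referred to tabulated critical values (or a normal approximation for larger samples) to decide whether to reject the null hypothesis.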
A Yates correction factor is used when the sample size is small. Fisher's exact test is used to determine if there are non-random
associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a
sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to 2 × 2 table with
paired-dependent samples. It is used to determine whether the row and column frequencies are equal (that is, whether there is ‘marginal
homogeneity’). The null hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a multivariate test as
it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affect
primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.
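Fisher's exact test and McNemar's test can both be sketched from first principles for a 2 × 2 table. The implementations below are illustrative (the two-sided Fisher p-value shown sums the probabilities of all tables with the same margins that are no more probable than the observed one, and the McNemar form includes the continuity correction):

```python
from math import comb

def table_prob(a, b, c, d):
    """Exact (hypergeometric) probability of one 2x2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum probabilities of all tables
    (same margins) that are no more likely than the observed table."""
    p_obs = table_prob(a, b, c, d)
    row1, col1, n = a + b, a + c, a + b + c + d
    total = 0.0
    for x in range(max(0, col1 - (n - row1)), min(row1, col1) + 1):
        p = table_prob(x, row1 - x, col1 - x, (n - row1) - (col1 - x))
        if p <= p_obs + 1e-12:      # small tolerance for float comparison
            total += p
    return total

def mcnemar_chi2(b, c):
    """McNemar's chi-square for paired nominal data, with continuity
    correction; uses only the discordant cell counts b and c."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

For the classic "lady tasting tea" table (3, 1, 1, 3), `fisher_exact_two_sided` returns 34/70 ≈ 0.486; the McNemar statistic is referred to the chi-square distribution with one degree of freedom.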