
EXPLORATORY DATA ANALYSIS WITH R
MODULE-IV
HYPOTHESIS TESTING AND GRAPHICAL ANALYSIS
HYPOTHESIS TESTING: Using the Student’s t-test, The Wilcoxon U-Test (Mann-Whitney), Paired t- and
U-Tests, Correlation and Covariance, Tests for Association.
GRAPHICAL ANALYSIS: Box-whisker Plots, Scatter Plots, Pairs Plots (Multiple Correlation Plots) Line
Charts, Pie Charts, Cleveland Dot Charts, Bar Charts, Copy Graphics to Other Applications.
HYPOTHESIS TESTING
A hypothesis is an assumption that researchers make about the data collected for an experiment or data set; it is
not necessarily true. In simple words, a hypothesis is a claim made by the researchers about the population from
which the data were collected.
Hypothesis testing in R is the process of testing, or validating, a hypothesis made by the researcher. To perform
hypothesis testing, a random sample is drawn from the population and a test is carried out on it. Based on the
result of the test, the hypothesis is either retained or rejected.
A statistical hypothesis is an assumption about a population which may or may not be true. Hypothesis testing is
a set of formal procedures used by statisticians to either accept or reject statistical hypotheses. Statistical
hypotheses are of two types:
• Null hypothesis (H0) - states that the observations are the result of chance alone.
• Alternative hypothesis (Ha) - states that the observations are influenced by some non-random cause.
Example
Suppose we wanted to check whether a coin was fair and balanced. The null hypothesis might say that half of the
flips will be heads and half will be tails, whereas the alternative hypothesis might say that the numbers of heads
and tails will be very different.
H0: P=0.5
Ha: P≠0.5
For example, suppose we flipped the coin 50 times and obtained 40 heads and 10 tails. Given this result, we would
reject the null hypothesis and conclude, based on the evidence, that the coin was probably not fair and
balanced.
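As a minimal sketch, the coin example can be checked in R with the exact binomial test, binom.test(). The choice of this particular test is an assumption made for illustration; the notes themselves do not name a test for the coin.

```r
# Exact binomial test of H0: P = 0.5 using the counts from the coin example
coin <- binom.test(x = 40, n = 50, p = 0.5, alternative = "two.sided")

coin$estimate   # observed proportion of heads
coin$p.value    # well below 0.05, so H0 is rejected at the 5% level
```

A small p-value means that a result this extreme would be very unlikely if the coin really were fair.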
Using the Student’s t-test:

The Student’s t-test is a method for comparing two samples; it compares their means to determine whether the
samples are different. It is a parametric test, so the data should be normally distributed. Several versions of
the t-test exist, and R handles them all through the t.test() command, which has a variety of options. The
command can be used for two-sample and one-sample tests as well as paired tests, as shown in the table
below:
Two-Sample t-Test with Unequal Variance:
The general way to use the t.test() command is to compare two vectors of numeric values. We can specify the
vectors in a variety of ways, depending on how the data objects are set out. By default, t.test() does not
assume that the samples have equal variance, so the Welch two-sample test is carried out unless we specify
otherwise:
> t.test(data2, data3)
Welch Two Sample t-test
data: data2 and data3
t = -2.8151, df = 24.564, p-value = 0.009462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.5366789 -0.5466544
sample estimates:
mean of x mean of y
5.125000 7.166667
Two-Sample t-Test with Equal Variance:
We can override the default and use the classic t-test by adding the var.equal = TRUE instruction, which forces
the command to assume that the variances of the two samples are equal. The t-value is then calculated using the
pooled variance and the degrees of freedom are unmodified; as a result, the p-value is slightly different from
the Welch version:
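The difference between the two versions can be sketched as follows. The vectors a and b are illustrative values chosen for this sketch; they are not the data2 and data3 objects used in these notes.

```r
# Illustrative samples (hypothetical values)
a <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4)
b <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

welch   <- t.test(a, b)                    # default: Welch test, unequal variances
classic <- t.test(a, b, var.equal = TRUE)  # classic test with pooled variance

classic$parameter  # degrees of freedom = length(a) + length(b) - 2
welch$parameter    # Welch degrees of freedom, usually fractional
```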
One-Sample t-Testing:
We can also carry out a one-sample t-test. In this version, we supply the name of a single vector and, through
the mu = instruction, the mean to compare it to (which defaults to 0):
Using Directional Hypotheses:

We can also specify a "direction" for the hypothesis. In many cases we simply test whether the means of two
samples are different, but we may want to know whether one sample mean is lower (or greater) than the
other. To do this, use the alternative = instruction. The choices are "two.sided", "less", or
"greater", and the choice can be abbreviated.
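A short sketch of the one-sample and directional forms, again using illustrative vectors (hypothetical values, not the data objects from these notes):

```r
a <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4)
b <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

one  <- t.test(b, mu = 5)                   # one-sample test, H0: mean of b is 5
two  <- t.test(a, b)                        # default two-sided alternative
less <- t.test(a, b, alternative = "less")  # Ha: mean of a is less than mean of b

one$null.value
two$p.value
less$p.value  # half the two-sided value here, because mean(a) < mean(b)
```

The one-sided p-value is only half the two-sided one when the observed difference lies in the direction named by the alternative.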
Formula Syntax and Subsetting Samples in the t-Test:
The t-test is designed to compare two samples (or one sample with a "standard"). However, data are often in a
more structured form, with one column for the response variable and another for the predictor variable. For
example, consider data arranged in this manner:
R deals with this by having a "formula syntax." We create a formula using the tilde (~) symbol. Essentially, the
response variable goes on the left of the ~ and the predictor goes on the right, like so:
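As a runnable sketch of the formula form, the built-in ToothGrowth data frame can stand in for the example data (an assumption made here because the notes' own data frame is not reproduced): len is the response and supp is a predictor with exactly two levels.

```r
# Formula syntax: response ~ predictor, with the data frame named separately
by.formula <- t.test(len ~ supp, data = ToothGrowth)

# The same comparison written with two separate vectors
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
by.vectors <- t.test(oj, vc)

by.formula$statistic
by.vectors$statistic  # identical t-value
```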
If the predictor column contains more than two groups, the t-test cannot be used on it directly. However, we can
still carry out a test by subsetting the predictor column and specifying which two samples we want to compare.
To do so, use the subset = instruction as part of the t.test() command.
The following example illustrates how to do this using the same data:
> t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))
First specify the column to take the subset from (graze in this case) and then type %in%; this tells the
command to keep only the rows whose graze value appears in the list that follows.
The Wilcoxon U-Test (Mann-Whitney)
When we have two samples to compare and the data are non-parametric, we can use the U-test. This test goes by
various names and may be known as the Mann-Whitney U-test or the Wilcoxon rank sum test. We use the
wilcox.test() command to carry out the analysis. The wilcox.test() command can conduct two-sample or
one-sample tests, and we can add a variety of instructions to carry out the test. The main options are shown in
the table below:
Two-Sample U-Test:
The basic way of using wilcox.test() is to specify the two samples we want to compare as separate vectors, as
the following example shows:
In this case there is a warning message because the data contain tied values, so an exact p-value cannot be
computed. If we set exact = FALSE, this message is not displayed because the p-value is determined from a
normal approximation.
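A minimal sketch of the two-sample call, using illustrative vectors that contain ties (hypothetical values, not the data objects from these notes):

```r
a <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4)
b <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# exact = FALSE requests the normal approximation, which avoids the
# warning about ties that the default call would produce
u <- wilcox.test(a, b, exact = FALSE)

u$method     # names the test that was carried out
u$statistic  # the W statistic
u$p.value
```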
One-Sample U-Test
If we specify a single numerical vector, a one-sample U-test is carried out; the default is to set mu = 0, as in the
following example:
> wilcox.test(data3, exact = FALSE)

Wilcoxon signed rank test with continuity correction
data: data3
V = 78, p-value = 0.002430
alternative hypothesis: true location is not equal to 0
In this case the p-value is taken from a normal approximation because the exact = FALSE instruction is used.
The command has assumed mu = 0 because it is not specified explicitly.
Using Directional Hypotheses

Both one- and two-sample tests use an alternative hypothesis that the location shift is not equal to 0 as their
default. This is essentially a two-sided hypothesis. We can change this by using the alternative = instruction,
where we can select "two.sided", "less", or "greater" as the alternative hypothesis.
We can also specify mu, the location shift. By default mu = 0. In the following example the hypothesis is set to
something other than 0.
In this example a one-sample test is carried out on the data3 sample vector. The test checks whether the sample
median is less than 8. The instructions also specify that the confidence interval should be displayed and that
an exact p-value should not be used.
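The call described above might look like the following sketch. The vector x holds illustrative values and is not claimed to be the data3 object.

```r
x <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)  # hypothetical sample

# One-sample test of H0: location = 8 against Ha: location < 8,
# with a confidence interval and a normal-approximation p-value
res <- wilcox.test(x, mu = 8, alternative = "less",
                   conf.int = TRUE, exact = FALSE)

res$null.value  # the hypothesized location, 8
res$conf.int    # one-sided interval
res$p.value
```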
Formula Syntax and Subsetting Samples in the U-test

Data are often arranged in a data frame where one column represents the response variable and another represents
the predictor variable. In this case, we can use the formula syntax to describe the situation and carry out the
wilcox.test() on the data. The basic form of the command becomes:
wilcox.test(response ~ predictor, data = my.data)
We can also use additional instructions. If the predictor variable contains more than two samples, we cannot
conduct a U-test directly and must use a subset that contains exactly two samples.
The subset instruction works like so:
wilcox.test(response ~ predictor, data = my.data, subset = predictor %in% c("sample1", "sample2"))
The U-test is one of the most widely used statistical methods, so it is important to be comfortable using the
wilcox.test() command.
PAIRED T- AND U-TESTS
If we have paired data, we can use matched-pair versions of the t-test and the U-test with one extra
instruction: simply add paired = TRUE to the command. It does not matter whether the data are in two separate
sample columns or are represented as response and predictor. In fact, R will carry out a paired test even if the
data do not really match up as pairs, provided the two samples are the same length. We can use all the regular
syntax and instructions, so subsetting and directional hypotheses work as before. The following activity uses
paired tests:
Look at the mpd data; it contains two samples, white and yellow. These data are matched-pair data and each
row represents a bi-colored target. The values are the numbers of whitefly attracted to each half of the target.
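A sketch of the paired tests follows. The white and yellow vectors below are invented for illustration and are not the values in the mpd data.

```r
# Hypothetical matched pairs: one row per bi-colored target
white  <- c(4, 3, 4, 1, 6, 4, 6, 4)
yellow <- c(4, 7, 2, 2, 7, 10, 5, 8)

pt <- t.test(white, yellow, paired = TRUE)                      # paired t-test
pu <- wilcox.test(white, yellow, paired = TRUE, exact = FALSE)  # paired U-test

# A paired t-test is a one-sample t-test on the pairwise differences
d <- t.test(white - yellow)

pt$statistic
d$statistic   # same t-value
pu$method
```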
CORRELATION AND COVARIANCE:
Correlation means association - more precisely it is a measure of the extent to which two variables are related.
There are three possible results of a correlational study: a positive correlation, a negative correlation, and no
correlation.
• A positive correlation is a relationship between two variables in which both variables move in the same
direction: one variable increases as the other increases, or one decreases as the other decreases. An example
of positive correlation would be height and weight. Taller people tend to be heavier.
• A negative correlation is a relationship between two variables in which an increase in one variable is
associated with a decrease in the other. An example of negative correlation would be height above sea level and
temperature. As we climb a mountain (increase in height) it gets colder (decrease in temperature).
• A zero correlation exists when there is no relationship between two variables. For example, there is no
relationship between the amount of tea drunk and level of intelligence.
A correlation can be expressed visually. This is done by drawing a scattergram (also known as a scatter plot,
scatter graph, scatter chart, or scatter diagram).
A scattergram is a graphical display that shows the relationships or associations between two numerical
variables (or co-variables), which are represented as points (or dots), one for each pair of scores.
A scattergram indicates the strength and direction of the correlation between the co-variables.
When we have two continuous variables we can look for a link between them; this link is called a correlation. We
can go about finding this in several ways using R. The cor() command determines correlations between two
vectors, all the columns of a data frame (or matrix), or two data frames (or matrix objects).
The cov() command examines covariance. By default the Pearson product moment (that is, regular parametric
correlation) is used, but Spearman (rho) and Kendall (tau) methods (both non-parametric correlation) can be
specified instead. The cor.test() command carries out a test of significance of the correlation.
A variety of additional instructions can be added to these commands, as listed in the following table:
Simple Correlation:
Simple correlations are between two continuous variables and we can use the cor() command to obtain a
correlation coefficient like so:
> count = c(9, 25, 15, 2, 14, 25, 24, 47)
> speed = c(2, 3, 5, 9, 14, 24, 29, 34)
> cor(count, speed)
[1] 0.7237206
The default for R is to carry out the Pearson product moment, but we can specify other correlations using the
method = instruction, like so:
> cor(count, speed, method = 'spearman')
[1] 0.5269556
This example used the Spearman rho correlation, but we can also apply Kendall’s tau by specifying method = "kendall".
If the vectors are contained within a data frame or some other object, we need to extract them in a different
fashion. Look at the women data frame, which comes as example data with the distribution of R.
> data(women)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
We need to use the attach() or with() commands to allow R to "read inside" the data frame and access the
variables within, or use the $ syntax so that the command can access the variables, as the following example
shows:
> cor(women$height, women$weight)
[1] 0.9954948
In this example the cor() command has calculated the Pearson correlation coefficient between the height and
weight variables contained in the women data frame. We can also use the cor() command directly on a data
frame (or matrix). Using the women data frame gives the following:
> cor(women)
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
When we have more columns, the matrix can be much more complex. The following example contains five
columns of data:
We can also select a single variable and compare it to all the others. Here the Length variable is compared to
every column in the mf data frame using the default Pearson coefficient:
> cor(mf$Length, mf)
Length Speed Algae NO3 BOD
[1, ] 1 -0.3432297 0.7650757 0.4547609 -0.8055507
Covariance:
In R, covariance is calculated using the cov() function. Covariance is a statistical measure of the direction of
the linear relationship between two data vectors.
The cov() command uses syntax similar to the cor() command. The women data are used with the cov() command in
the following example:
> cov(women$height, women$weight)
[1] 69
> cov(women)
height weight
height 20 69.0000
weight 69 240.2095
The cov2cor() command is used to determine the correlation from a matrix of covariance in the following
example:
> women.cv = cov(women)
> cov2cor(women.cv)
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
Significance Testing in Correlation Tests:
We can apply a significance test to correlations using the cor.test() command. If the test concludes that the
correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."
Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship
between x and y because the correlation coefficient is significantly different from zero."
If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero),
we say that the correlation coefficient is "not significant".
Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship
between x and y because the correlation coefficient is not significantly different from zero."
With cor.test() we can compare only two vectors at a time, as the following example shows:
In the example, the Pearson correlation has been carried out between height and weight in the
women data, and the result also shows the statistical significance of the correlation.
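A minimal sketch of such a test on the built-in women data, using the $ syntax to pass the two vectors:

```r
# Pearson correlation test between two vectors
ct <- cor.test(women$height, women$weight)

ct$estimate  # the correlation coefficient, as returned by cor()
ct$p.value   # significance of the correlation
ct$conf.int  # 95 percent confidence interval for the coefficient
```

A p-value below the chosen significance level (commonly 0.05) leads to the "significant" conclusion described above.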
Formula Syntax:
If the data are contained in a data frame, using the attach() or with() commands is tedious, as is using the $
syntax. A formula syntax is available as an alternative, which provides a neater representation of the data:
The formula is slightly different from the one used in the t-test: both variables are specified to the right of
the ~, joined by +, and nothing goes on the left. We also give the name of the data as a separate instruction.
All the additional instructions are available when using the formula syntax, as is the subset instruction.
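The formula form can be sketched on the women data as follows. The subset condition (height > 60) is an arbitrary choice made only to show where the instruction goes.

```r
# Both variables go to the right of the ~
by.formula <- cor.test(~ height + weight, data = women)
by.vectors <- cor.test(women$height, women$weight)

# subset = restricts the rows used in the test
tall <- cor.test(~ height + weight, data = women, subset = height > 60)

by.formula$estimate
by.vectors$estimate  # same coefficient
tall$parameter       # degrees of freedom = rows used - 2
```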
If the data contain a separate grouping column, we can specify the samples to use from it with an instruction
along the following lines:
subset = grouping %in% "sample"
TESTS FOR ASSOCIATION:

When we have categorical data we can look for associations between categories by using the chi-squared test. We
do this with the chisq.test() command. We can add various additional instructions to the basic
command, as summarized in the table below:
Multiple Categories: Chi-Squared Tests
The most common use for a chi-squared test is where we have multiple categories and want to see whether
associations exist between them. The following example shows some categorical data set out in a data
frame:
The data here are already in a contingency table and each cell represents a unique combination of the two
categories; here we have several habitats and several species. To run the chisq.test() command, simply give
the name of the data to the command like so:
In this case, we gave the result a name and set it up as a new object, which we can examine in more detail. We
get a warning message in this example; this is because some of the observed values are small, so the expected
values probably include some that are smaller than 5. To examine the result object in more detail, start by
trying a summary() command:
The result object we created contains several parts. A simpler way to see what we are dealing with is to use the
names() command:
> names(bird.cs)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals"
We can access the various parts of the result object by using the $ syntax and adding the part we want to
examine. For example:
> bird.cs$stat
X-squared
78.27364
> bird.cs$p.val
[1] 7.693581e-09
We can see the calculated expected values as well as the Pearson residuals by using the appropriate
abbreviation. In the following example we look at the expected values:
The example shows that some expected values are < 5, and this is the reason for the warning message.
We might prefer to display the values with fewer decimal places; the round() command lets us choose how many
decimal places to display, like so:
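The steps in this section can be sketched end to end as follows. The contingency table is invented for illustration (hypothetical species and habitats) and is not the bird data from these notes.

```r
# Hypothetical contingency table: rows are species, columns are habitats
birds <- matrix(c(47, 10, 40, 2,
                  19,  3,  5, 2,
                  50,  0, 10, 7),
                nrow = 3, byrow = TRUE,
                dimnames = list(species = c("sp1", "sp2", "sp3"),
                                habitat = c("garden", "hedge", "park", "wood")))

cs <- chisq.test(birds)  # a warning is expected: some expected counts are small

names(cs)                # the parts of the result object
cs$statistic             # X-squared value
cs$parameter             # degrees of freedom = (rows - 1) * (columns - 1)
round(cs$expected, 1)    # expected values to one decimal place
any(cs$expected < 5)     # TRUE here, which is what triggers the warning
```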
Monte Carlo Simulation

We can choose to determine the p-value by a slightly different method, using a Monte Carlo simulation.
To do this, add an extra instruction to the chisq.test() command, simulate.p.value = TRUE, like so:
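A sketch of the simulated version, on a small invented table (hypothetical counts). The B = instruction sets the number of replicates, and set.seed() is used here only so that the simulated p-value is repeatable.

```r
# Hypothetical contingency table with some small counts
tab <- matrix(c(47, 10, 40, 2,
                19,  3,  5, 2,
                50,  0, 10, 7),
              nrow = 3, byrow = TRUE)

set.seed(1)  # make the simulation repeatable
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

mc$method     # states that the p-value was simulated
mc$parameter  # NA: no degrees of freedom are used with simulation
mc$p.value
```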
Yates’ Correction for 2 × 2 Tables:
When we have a 2 × 2 contingency table it is common to apply Yates’ correction. By default this is used if
the contingency table has two rows and two columns. We can turn off the correction using the correct = FALSE
instruction in the command. Consider the following example of a 2 × 2 table:
In the first example, we run the chisq.test() command on the data and see that Yates’ correction is applied
automatically. In the second example, we force the command not to apply the correction by setting correct = FALSE.
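The two calls can be sketched on an invented 2 × 2 table (hypothetical counts):

```r
tab <- matrix(c(12, 5, 7, 9), nrow = 2)  # hypothetical 2 x 2 table

with.yates    <- chisq.test(tab)                   # correction applied by default
without.yates <- chisq.test(tab, correct = FALSE)  # correction turned off

with.yates$method
with.yates$statistic     # smaller: the correction shrinks each |O - E| by up to 0.5
without.yates$statistic
```

Because the correction reduces the test statistic, the corrected test gives the larger (more conservative) p-value.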
Single Category: Goodness of Fit Tests

We can use the chisq.test() command to carry out a goodness of fit test. In this case, we must have two vectors
of numerical values, one representing the observed values and the other representing the expected ratio of
values.
The goodness of fit test compares the data against the ratios (probabilities) specified. If we do not specify
any, the data are tested against equal probability.
In the following example, we have a simple data frame containing two columns; the first column contains
values relating to an old survey. The second column contains values relating to a new survey. We want to see
whether the proportions of the new survey match the old one, so we perform a goodness of fit test:
To run the test, use the chisq.test() command, but this time specify the test data as a single vector and
also point, using the p = instruction, to the vector that contains the probabilities:
In this example, the probabilities were not given as true probabilities but as frequencies, so we use the
rescale.p = TRUE instruction to make sure that these are converted to probabilities.
The result contains all the usual items for a chi-squared result object, but if we display the expected values,
for example, we do not automatically get to see the row names, even though they are present in the data:
> survey.cs$exp
[1] 20.25195 29.93766 116.22857 86.29091 39.62338 46.66753
We can get the row names from the original data using the row.names() command and assign them to the expected
values with the names() command, in the following way:
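The whole goodness of fit procedure, including the naming step, can be sketched as follows. The survey data frame is invented for illustration (hypothetical categories and counts).

```r
# Hypothetical survey counts; row names identify the categories
survey <- data.frame(old = c(10, 20, 30, 40),
                     new = c(15, 18, 35, 32),
                     row.names = c("A", "B", "C", "D"))

# Test the new counts against the proportions of the old survey;
# rescale.p = TRUE converts the old frequencies to probabilities
gof <- chisq.test(survey$new, p = survey$old, rescale.p = TRUE)

expd <- gof$expected
names(expd) <- row.names(survey)  # attach the category names
expd
```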
PART-II
GRAPHICAL ANALYSIS:
Graphs are a powerful way to present data and results in a concise manner. Whatever kind of data we have, there
is a way to illustrate it graphically. A graph is often more readily understandable than words and numbers, and
producing good graphs is a vital skill.
Some graphs are also useful for examining data to gain some idea of the patterns that may exist; this can
direct us toward the correct statistical analysis.
R has powerful and flexible graphical capabilities. In general terms, R has two kinds of graphical commands:
some commands generate a basic plot of some sort, and other commands are used to adjust the output and
to produce a more customized finish.
BOX-WHISKER PLOTS:
The box-whisker plot (abbreviated to boxplot) is a useful way to visualize complex data where we have
multiple samples, and to display differences between samples. The basic form of the box-whisker plot shows
the median value, the quartiles, and the max/min values, so we get a lot of information in a compact
manner.
The box-whisker plot is also useful for visualizing a single sample because it can show outliers. We can use the
boxplot() command to create box-whisker plots. The command can work in a variety of ways to visualize
simple or quite complex data.
Basic Boxplots
The following example shows a simple data frame composed of two columns:
We can use the boxplot() command to visualize one of the variables here:
> boxplot(fw$speed)
This produces a simple graph like the one shown in the figure below:
This graph shows the typical layout of a box-whisker plot. The stripe shows the median, the box represents the
upper and lower quartiles, and the whiskers show the maximum and minimum values.
If we have several items to plot, simply give the vector names in the boxplot() command:
> boxplot(fw$count, fw$speed)
The resulting graph appears as shown in the figure below:
In this case, we specified vectors that correspond to the two columns in the data frame, but they could be
completely separate.
Customizing Boxplots:
A plot without labels is of little use, so the plot needs labels. We can use the xlab and ylab instructions to
label the axes. We can use the names instruction to set the labels (currently displayed as 1 and 2) for the two
samples, like so:
> boxplot(fw$count, fw$speed, names = c('count', 'speed'))
> title(xlab = 'Variable', ylab = 'Value')
The resulting plot looks like the one shown in the figure below:
In this case we used the title() command to add the axis labels, but we could have specified xlab and ylab
within the boxplot() command. Now there are names for each of the samples as well as axis labels.
Notice in the figure that the whiskers of the count sample do not extend to the top, and that a separate point
is displayed above them. We can determine how far out the whiskers extend; by default this is at most
1.5 times the interquartile range from the box, and points beyond that are drawn separately as outliers.
We can alter this by using the range = instruction; if we specify range = 0 as shown in the following example,
the whiskers extend to the maximum and minimum values:
> boxplot(fw$count, fw$speed, names = c('count', 'speed'), range = 0, xlab = 'Variable', ylab = 'Value',
col = 'gray90')
The final graph appears as shown in the figure below:
Now consider the data in a different arrangement; commonly we have a data frame with one column representing
the response variable and another representing a predictor (or grouping) variable. In practice this means we
have one vector containing all the numerical data and another vector containing the grouping information as
text. Look at the following example:
> grass
With data in this format, it is best to use the same formula notation as before. Use the ~ symbol to
separate the response variable on the left from the predictor variable on the right, tell the
command where to find the data, and set range = 0 to force the whiskers to the maximum and minimum as
before.
Consider the following example for details:
> boxplot(rich ~ graze, data = grass, range = 0)
> title(xlab = 'cutting treatment', ylab = 'species richness')
This time the samples are automatically labeled; the command takes the names of the samples from the levels
of the factor, presented in alphabetical order. The resulting graph looks like the one shown in the figure below:
Horizontal Boxplots
With a simple additional instruction we can display the boxes horizontally rather than vertically:
> boxplot(rich ~ graze, data = grass, range = 0, horizontal = TRUE)
> title(ylab = 'cutting treatment', xlab = 'species richness')
When we use the horizontal = TRUE instruction, the graph is displayed with horizontal boxes as shown
below:
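As a runnable sketch of the formula-based boxplots, the built-in InsectSprays data frame can stand in for the grass data (an assumption made for illustration): count is the response and spray is the grouping factor.

```r
# Vertical boxplot; range = 0 extends the whiskers to the extremes
bp <- boxplot(count ~ spray, data = InsectSprays, range = 0,
              xlab = "Spray", ylab = "Insect count")

# Horizontal version: the value axis is now the x-axis, so the labels swap
boxplot(count ~ spray, data = InsectSprays, range = 0, horizontal = TRUE,
        xlab = "Insect count", ylab = "Spray")

bp$names  # sample labels, taken from the factor levels
bp$stats  # whisker ends, quartiles, and median for each sample
bp$out    # empty: with range = 0 no points are drawn as outliers
```

boxplot() returns its summary statistics invisibly, so assigning the result to a name is a convenient way to inspect the numbers behind the plot.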
SCATTER PLOTS
The basic plot() command is a generic function that can be pressed into service for a variety of uses. Many
specialized statistical routines include a plotting routine to produce a specialized graph. We will use the plot()
command to produce xy scatter plots. The scatter plot is used especially to show the relationship between two
variables.
Basic Scatter Plots
The following data frame contains two columns of numeric values, and because they contain the same
number of observations, they could form the basis for a scatter plot:
The basic form of the plot() command requires us to specify the x and y data, each being a numeric vector. We
use it like so:
> plot(x, y, ...)
If the data are contained in a data frame, we must use the $ syntax to get at the variables, or use the with()
or attach() commands. For the example data here, the following commands all produce a similar result:
> plot(fw$speed, fw$count)
> with(fw, plot(speed, count))
> attach(fw)
> plot(speed, count)
> detach(fw)
The resulting graph looks like the one shown in the figure below:
The axis labels match what we typed into the command.
Adding Axis Labels
We can produce our own axis labels easily using the xlab and ylab instructions. For example, to create labels
for these data we might use something like the following:
> plot(fw$speed, fw$count, xlab = 'Speed m/s', ylab = 'Count of Mayfly')
We can still use the title() command to add axis titles later, but we need to produce blank titles to start
with. To do so, set each title in the plot() command to blank using a pair of quotes as shown in the following:
> plot(fw$speed, fw$count, xlab = " ", ylab = " ")
Plotting Symbols:
We can use many other graphical parameters to modify the basic scatter plot. To alter the plotting
symbol, use the pch = instruction; it refers to the plotting character and can be specified in one of
several ways. We can type an integer value, and this code determines the symbol that is produced.
For values from 0 to 25, we get symbols that look like the ones depicted in the figure:
These were produced on a scatter plot using the following commands:
> plot(0:25, rep(1, 26), pch = 0:25, cex = 2)
> text(0:25, 0.95, as.character(0:25))
The first command produces a series of points and sets the x values to range from 0 to 25 (to correspond to the
pch values). The y values are set at 1 so that we get a horizontal line of points; the rep() command is used to
repeat the value 1 a total of 26 times. In other words, we get 26 1s to correspond to the 26 x values. The
plotting character is set to vary from 0 to 25 using pch = 0:25. Finally, the points are made a bit bigger using
a character expansion factor (cex = 2). The text() command adds text to the current plot; here it writes each
pch number just below its symbol.
We can also specify a character from the keyboard directly by enclosing it in quotes; to produce + symbols, for
example, type the following:
> plot(fw$speed, fw$count, pch = "+")
The + symbol is also obtained via pch = 3.
Setting Axis Limits:

The plot() command works out the best size and scale of each axis to fit the plotting area. We can set the
limits of each axis quite easily using the xlim = and ylim = instructions. The basic form of these instructions
requires two values, a start and an end:
xlim = c(start, end)
ylim = c(start, end)
We can add all of these elements together to produce a plot that matches particular requirements. In the
current example, we might type the following plot() command:
> plot(fw$speed, fw$count, xlab = 'Speed m/s', ylab = 'Count of Mayfly', pch = 18, cex = 2, col =
'gray50', xlim = c(0, 50), ylim = c(0, 50))
The resulting scatter plot looks like the one shown in the figure below:
Using Formula Syntax
There is another way to specify what we want to plot; rather than giving the x and y values as
separate components, we can use a formula to describe the situation:
> plot(count ~ speed, data = fw)
Use the tilde character (~) to create the formula. Place the response variable on the left and the
predictor variable on the right.
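The scatter plot instructions covered so far can be combined in one runnable sketch. The built-in women data frame stands in for the fw data (an assumption for illustration), and the axis limits are arbitrary choices.

```r
# Formula syntax: response (y) on the left, predictor (x) on the right
plot(weight ~ height, data = women,
     xlab = "Height (in)", ylab = "Weight (lb)",  # axis labels
     pch = 18, cex = 2, col = "gray50",           # symbol, size, and color
     xlim = c(55, 75), ylim = c(100, 180))        # axis limits

par("usr")  # plotted extent of the axes: slightly wider than xlim and ylim
```

By default R pads each axis by 4 percent at both ends, which is why the plotted extent is a little wider than the limits requested.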
Pairs Plots (Multiple Correlation Plots):

The plot() command is one way to produce a scatter plot. If we use the same data as before (two columns of
numerical data) but do not specify the columns explicitly, we still get a plot of sorts:
This produces a graph like the one shown in the figure below:
The command has taken the first column as the x values and the second column as the y values. If we try this
on a data frame with more than two columns, for example the mf data frame, the result is different:
> plot(mf)
We end up with a scatterplot matrix where each pair-wise combination is plotted. This has created a pairs
plot. We can use a special command, pairs(), to create customized pairs plots as shown below:
By default, the pairs() command takes all the columns in a data frame and creates a matrix of scatter plots. We
can choose which columns to display by using the formula notation along the following lines:
pairs(~ x + y + z, data = our.data)
We simply provide the required variables and separate them with + signs. If we are using a data frame, we also
give the name of the data frame. In the current example we can select some of the columns like so:
> pairs(~ Length + Speed + NO3, data = mf)
This produces a graph like the one shown in the figure below:
We can alter the plotting characters, their size, and color using the pch, cex, and col instructions. The following
command produces large red crosses but otherwise is essentially the same graph:
> pairs(~ Length + Speed + NO3, data = mf, col = 'red', cex = 2, pch = 'X')
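A runnable sketch of a customized pairs plot, with the built-in mtcars data frame standing in for the mf data (an assumption for illustration). The matching correlation matrix from cor() is shown alongside, since a pairs plot is the graphical counterpart of that matrix.

```r
# Scatterplot matrix for three selected columns
pairs(~ mpg + wt + hp, data = mtcars, col = "red", cex = 2, pch = "X")

# Correlation matrix for the same three columns
r <- cor(mtcars[, c("mpg", "wt", "hp")])
round(r, 2)
```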
LINE CHARTS
The plot() command is used to produce scatter plots, either of a single pair of variables or as a multiple-pairs
plot. On many occasions we have data that are time-dependent, that is, data collected over
a period of time. We would want to display these data as a plot where the y-axis reflects the
magnitude of the data recorded and the x-axis reflects the time.
Line Charts Using Numeric Data
If the time variable is recorded as a numeric variable, we can use a regular plot() command. We can
specify different ways to present the data using the type instruction. The following table lists the main
options we can set using the type instruction.
Therefore, if we want to highlight the pattern, we can specify type = "l" to draw a line, leaving the points
out entirely. Notice that we can use type = "n" to produce nothing at all.
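As a sketch of what the type instruction does, the following draws the same illustrative data (hypothetical values) with several of the standard type settings side by side.

```r
x <- 1:10
y <- c(2, 4, 3, 6, 8, 7, 9, 12, 11, 14)  # hypothetical measurements

# "p" points, "l" lines, "b" both, "o" overplotted, "h" vertical bars, "s" steps
types <- c("p", "l", "b", "o", "h", "s")

op <- par(mfrow = c(2, 3))  # 2 x 3 grid of plots
for (t in types) {
  plot(x, y, type = t, main = paste0("type = '", t, "'"))
}
par(op)                     # restore the previous layout
```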
Look at the Nile data that comes with R. This is stored as a special kind of object called a time series.
Essentially, this enables us to specify the time in a more space-efficient manner than using a separate column
of data. The Nile data contain measurements of the flow of the river Nile from 1871 to 1970. If we plot these
data, the result is as shown in the figure below:
> plot(Nile, type = 'l')
If the data are not in numerical order along the x-axis, we can end up with some odd-looking line charts. We can
use the order() command to reorder the rows of the data by the x-axis variable, which usually sorts out the
problem. Look at the following examples:
> with(mf, plot(Length, NO3, type = 'l'))
> with(mf[order(mf$Length), ], plot(Length, NO3, type = 'l'))
In the first case the data are not sorted, and the result is a bit of a mess. In the second case the rows are
sorted by Length, so each NO3 value stays paired with its Length value and the result is a lot better.
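The reason for reordering whole rows, rather than sorting the x values alone, can be seen in a small sketch with invented data (hypothetical values):

```r
d <- data.frame(x = c(5, 1, 4, 2, 3),
                y = c(25, 1, 16, 4, 9))  # here y is x squared

plot(d$x, d$y, type = "l")  # unsorted: the line doubles back on itself

o <- d[order(d$x), ]        # reorder the rows by x, keeping each x with its y
plot(o$x, o$y, type = "l")  # sorted: a clean line

o
```

Sorting only the x vector would pair each x with the wrong y; order() returns row positions, so each pair moves together.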
Line Charts Using Categorical Data:
If the data form a sequence but the positions have no numerical value, the situation is trickier. For example,
consider numeric data whose labels are categorical, such as rainfall values labelled by month. The data may be
held as a simple vector (rain) or as a data frame (rainfall) in which the labels are in a second column.
In either case, try plotting the data using the plot() command like so:
> plot(rain, type = 'b')
> plot(rainfall$rain, type = 'b')
The resulting plot is shown below:
The x-axis remains a simple numeric index. To alter the x-axis as desired we need to remove the
existing x-axis and create our own using the character vector as the labels. Perform the following steps to do so:
1. Start by turning off the axes using the axes = FALSE instruction. We can still label the axes using the xlab and
ylab instructions as seen before. To produce blank labels and add them later using the title() command, set
them using a pair of quotes; for example, xlab = "":
> plot(rain, type = 'b', axes = FALSE, xlab = 'Month', ylab = 'Rainfall cm')
2. Now construct the x-axis using the character labels we already have. The axis() command creates an axis for
a plot. The basic layout of the command is like so:
axis(side, at = NULL, labels = TRUE)
The first part sets which side of the plot the axis is created on; 1 is the bottom, 2 is the left, 3 is the
top, and 4 is the right. The at = part sets the positions of the tick marks; here we give the range 1:n, where
n is the number of tick marks required (12 in this case).
3. Next, point to the labels. In this example, we use a separate character vector for the labels:
> month = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
> axis(side = 1, at = 1:length(rain), labels = month)
This creates an axis at the bottom of the plot (the x-axis) and sets the tick marks from 1 to 12.
4. To finish off the plot, make the y-axis:
> axis(side = 2)
This creates the y-axis and takes its scale from the existing plot.
5. Finally, enclose the whole lot in a neat bounding box. Use the box() command to make an enclosing
bounding box for the entire plot.
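The five steps above can be run together as one short script. This is a sketch that assumes the rain values are the twelve monthly figures used elsewhere in this module; the built-in constant month.abb is used here in place of typing the month names.

```r
rain  <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)  # twelve monthly values
month <- month.abb                               # built-in "Jan" ... "Dec"

# Step 1: draw the points and lines, but no axes
plot(rain, type = "b", axes = FALSE, xlab = "Month", ylab = "Rainfall cm")

# Steps 2 and 3: x-axis with one tick per value, labelled by month
axis(side = 1, at = seq_along(rain), labels = month)

# Step 4: y-axis, scale taken from the existing plot
axis(side = 2)

# Step 5: bounding box around the plot region
box()
```

Here seq_along(rain) gives the same positions as 1:length(rain).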
PIE CHARTS:
If we have data that represent how something is divided up between various categories, the pie chart is a
common graphic choice to illustrate them. For example, we might have data that show sales for various items
for a whole year. The pie chart shows how each item contributed to total sales. Each item is represented
by a slice of pie: the bigger the slice, the bigger the contribution to the total sales. In simple terms, the pie
chart takes a series of data, determines the proportion of each item toward the total, and then represents
these as different slices of the pie.
The pie chart is commonly used to display proportional data. We can create pie charts using the pie()
command, using a vector of numeric values to create the plot like so:
> data11
[1] 3 5 7 5 3 2 6 8 5 6 9 8
When using the pie() command, these values are converted to proportions of the total and then the angle of
each pie slice is determined. If possible, the slices are labeled with the names of the data. The current
example is a simple vector of values with no names, so we must supply them separately. We can do this in a
variety of ways; in this instance we have a vector of character labels:
> data8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
To create a pie chart with labels, use the pie() command in the following manner:
> pie(data11, labels = data8)
This produces a plot like the one shown in the figure below:
We can alter the direction and starting point of the slices using the clockwise = and init.angle = instructions. By
default the slices are drawn counter-clockwise (clockwise = FALSE); we can set this to TRUE to produce
clockwise slices. The default starting angle is 0° (3 o'clock) when clockwise = FALSE and 90° (12 o'clock)
when clockwise = TRUE.
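The conversion from values to slices can be checked by hand. The sketch below uses the data11 values from the text; the names props and angles are introduced here only for illustration.

```r
data11 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
data8  <- month.abb                 # built-in labels "Jan" ... "Dec"

props  <- data11 / sum(data11)      # share of the total for each slice
angles <- 360 * props               # angle of each slice in degrees
round(angles, 1)

# Clockwise slices; with clockwise = TRUE the default start is 12 o'clock
pie(data11, labels = data8, clockwise = TRUE)
```

The proportions sum to 1 and the angles to 360, so the slices always fill the whole circle whatever the scale of the data.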
The default colors used are a range of six pastel colors; these are recycled as necessary. We can specify a
range of colors to use with the col = instruction. One way to do this is to make a vector of color names. In the
following example, we make a vector of gray colors and then use these as the chart colors:
> pc = c('gray40', 'gray50', 'gray60', 'gray70', 'gray80', 'gray90')
> pie(data11, labels = data8, col = pc, clockwise = TRUE, init.angle = 180)
This command also draws the slices clockwise and sets the starting point to 180°, which is 9 o'clock. The
resulting plot is shown in the figure below:
When data are part of a data frame, we must use the $ syntax to access the column required, or use the with()
or attach() commands. If the data frame has row names, we can use them to label the pie slices; the
labels = instruction then points to the row names of the data frame. The final graph is shown in the figure
below:
When data are in matrix form, we have a few additional options: we can produce pie charts of the rows or of
the columns. The following example uses a matrix of bird observation data in which the rows and the columns
are named.
We can use the [row, column] syntax with the pie() command; here we examine the first row:
> pie(bird[1,], col = pc)
This produces a graph like the figure below:
If the data are in a data frame rather than a matrix, we get an error message. In that case, first convert the
data frame row into a matrix and then plot the pie chart, as shown below:
> pie(as.matrix(mf[1,]), labels = names(mf), col = pc)
Similarly, we can make pie charts from the columns, in which case we specify the column required using the
[row, column] syntax. Both of the following commands produce a pie chart of the Hedgerow column in the bird
data seen previously:
> pie(bird[,2])
> pie(bird[,'Hedgerow'])
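Row and column selection can be tried on any named matrix. The sketch below builds a small stand-in called bird_demo; its counts, species names, and habitat names other than Hedgerow are hypothetical and are not the bird data from the text.

```r
# Hypothetical counts: rows are species, columns are habitats
bird_demo <- matrix(c(47, 10, 40,
                      19,  3,  5,
                      50,  4, 10),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Blackbird", "Robin", "Sparrow"),
                                    c("Garden", "Hedgerow", "Parkland")))

first_row    <- bird_demo[1, ]            # one species across all habitats
hedgerow_col <- bird_demo[, "Hedgerow"]   # one habitat across all species

pie(first_row)      # slices labelled by the column names
pie(hedgerow_col)   # slices labelled by the row names
```

Extracting a single row or column from a named matrix gives a named vector, which is why pie() can label the slices without a labels = instruction.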
CLEVELAND DOT CHARTS
An alternative to the pie chart is the Cleveland dot plot. All data that might be presented as a pie chart could
also be presented as a bar chart or a dot plot. We can create Cleveland dot plots using the dotchart()
command. If the data are a simple vector of values then, as with the pie() command, we simply give the vector
name. To create labels we need to specify them. The following example uses a vector of numeric values and a
vector of character labels:
> data11; data8
[1] 3 5 7 5 3 2 6 8 5 6 9 8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> dotchart(data11, labels = data8)
The resulting dot plot is shown in the figure below:
For more complex data, the dot plot works best if the data are in the form of a matrix; the following example
uses the bird observation matrix.
With a pie chart we must create a pie for the rows or the columns separately; with the dot plot we can show
both at once. We can create a basic dot plot grouped by columns simply by specifying the matrix name like so:
> dotchart(bird)
This produces a dot plot like the figure below:
Here we see the data shown column by column; in other words, we see the data for each column broken
down by rows. We might choose to view the data in a different order; transposing the matrix displays
the rows as groups, broken down by column:
> dotchart(t(bird))
The t() command transposes the matrix before the dot plot is drawn; the result is shown below:
We can alter a variety of parameters on the plot. A few of the options are color (the points and labels),
gcolor (the group labels and values), lcolor (the horizontal lines), bg (the background fill of the plotting
symbols), pch (the plotting symbol), and cex (the character size).
The following command uses some of these instructions to produce the graph shown in the figure:
> dotchart(bird, color = 'gray30', gcolor = 'black', lcolor = 'gray30', cex = 0.8, xlab = 'Bird Counts', bg = 'gray90',
pch = 21)
We can also add a summary value for each of the groups using the gdata = instruction, which takes one value
per group. It makes the most sense to use an average of some kind (mean or median) to do so. In the following
example the column means are used as the group values:
> dotchart(bird, gdata = colMeans(bird), gpch = 16, gcolor = 'blue')
> mtext('Grouping = mean', side =3, adj = 1)
> title(main = 'Bird species and Habitat')
> title(xlab = 'Bird abundance')
The result is shown below:
The first command draws the main plot; the mean of each column is calculated using the colMeans() command
and applied to the plot via the gdata = instruction. The plotting character for the group values is set using
the gpch = instruction; here, a filled circle is used to make it stand out from the main points. The gcolor =
instruction sets a color for the group points (and labels).
The second command adds some text to the margin of the plot; here we use the top margin (side = 3; side = 1 is
the bottom and 2 is the left) and adjust the text to be at the extreme right-hand end (adj = 1; adj = 0 would
place it at the other end).
The final two commands add titles to the main plot and the value axis (the x-axis).
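The grouped dot plot can be reproduced on a small stand-in matrix. This is a sketch under stated assumptions: bird_demo and its counts are hypothetical, and only the dotchart(), colMeans(), mtext(), and title() calls follow the commands described above.

```r
# Hypothetical counts: rows are species, columns are habitats
bird_demo <- matrix(c(47, 10, 40,
                      19,  3,  5,
                      50,  4, 10),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Blackbird", "Robin", "Sparrow"),
                                    c("Garden", "Hedgerow", "Parkland")))

group_means <- colMeans(bird_demo)   # one summary value per column (group)

# Columns form the groups; the group means are drawn as filled blue circles
dotchart(bird_demo, gdata = group_means, gpch = 16, gcolor = "blue")
mtext("Grouping = mean", side = 3, adj = 1)   # top margin, right-hand end
title(main = "Bird species and Habitat", xlab = "Bird abundance")
```

Replacing colMeans() with apply(bird_demo, 2, median) would show the group medians instead.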
BAR CHARTS
The bar chart is suitable for showing data that fall into discrete categories. The histogram is a form of
bar chart in which each bar shows the number of items in a certain range of data values. Bar
charts are widely used because they convey information in a readily understood fashion. They are also flexible
and can show items in various groupings.
We use the barplot() command to produce bar charts.
Single-Category Bar Charts:

The simplest plot can be made from a single vector of numeric values. In the following example, we have such
an item:
> rain
[1] 3 5 7 5 3 2 6 8 5 6 9 8
To make a bar chart, use the barplot() command and specify the vector name in the instruction like so:
> barplot(rain)
This makes a primitive plot like the figure below:
The chart has no axis labels of any kind, but we can add them quite simply. To start with, make names for the
bars: use the names = instruction (an abbreviation of the full names.arg = instruction) to point to a vector
of names.
The following example shows one way to do this:
> rain
[1] 3 5 7 5 3 2 6 8 5 6 9 8
> month
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> barplot(rain, names = month)
If we do not have a vector of names, we can make one or simply specify the names using a c() command like so:
> barplot(rain, names = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
If the vector has a names attribute, then the barplot() command can read the names directly.
In the following example, we set the names() of the rain vector and then use the barplot() command:
> rain ; month
[1] 3 5 7 5 3 2 6 8 5 6 9 8
> names(rain) = month
> rain
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
3 5 7 5 3 2 6 8 5 6 9 8
> barplot(rain)
Now the bars are neatly labeled with the names taken from the data itself, as shown below:
To add axis labels we can use the xlab and ylab instructions, either as part of the barplot() command itself
or later using the title() command.
In the following example, we create the axis titles afterwards:
> barplot(rain)
> title(xlab = 'Month', ylab = 'Rainfall cm')
In the resulting plot the y-axis stops short of the tallest bar, so alter the y-axis scale using the ylim
instruction as shown in the following example:
> barplot(rain, xlab = 'Month', ylab = 'Rainfall cm', ylim = c(0,10))
The result is as follows:
We can alter the color of the bars using the col = instruction. If we want to "ground" the plot, we can add a
line under the bars using the abline() command:
> abline(h = 0)
In other words, we add a horizontal line at 0 on the y-axis. If we want the whole plot enclosed in a box, we use
the box() command. We can also use the abline() command to add gridlines:
> abline(h = seq(1, 9, 2), lty = 2, lwd = 0.5, col = 'gray70')
In this example, we create horizontal lines at a sequence of heights using the seq() command, which takes the
starting value, the ending value, and the interval. The lty = instruction sets the line to be dashed, and the
lwd = instruction makes the lines a bit thinner than usual.
Finally, the col = instruction sets the gridline color to a light gray. When we put the commands together,
we end up with something like this:
> barplot(rain, xlab = 'Month', ylab = 'Rainfall cm', ylim = c(0,10), col = 'lightblue')
> abline(h = seq(1,9,2), lty = 2, lwd = 0.5, col = 'gray40')
> box()
The final graph is shown in the figure below:
We can create a bar chart of frequencies that is similar to a histogram by using the table() command:
> table(rain)
rain
2 3 5 6 7 8 9
1 2 3 2 1 2 1
Here the table() command splits the data into a simple frequency table. The first row shows the categories
(each relating to an actual numeric value), and the second row shows the frequency in each of these
categories. If we create a barplot() using these data, we get something like the figure below, which is
produced using the following commands:
> barplot(table(rain), ylab = 'Frequency', xlab = 'Numeric category')
> abline(h = 0)
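The frequency table can be inspected before it is plotted. The following is a minimal sketch using the rain values from the text; the name freq is introduced here only for illustration.

```r
rain <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)

freq <- table(rain)   # count how often each distinct value occurs
names(freq)           # the categories (the distinct values, as text)
as.vector(freq)       # the frequency of each category

# One bar per distinct value; bar height is the frequency
barplot(freq, ylab = "Frequency", xlab = "Numeric category")
abline(h = 0)         # ground the bars with a line at zero
```

Unlike a true histogram, values that never occur (1 and 4 here) get no bar, so the x-axis is a set of categories rather than a continuous scale.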
Multiple Category Bar Charts

The bar charts seen so far have all involved a single "row" of data; that is, all the data relate
to categories in one group. It is also quite common to have several groups of categories. We can display these
groups in several ways, the most primitive being a separate graph for each group. To show them in a single
chart we have two options: stacked bars and grouped bars.
Stacked Bar Charts

If the data contain several groups of categories, we can display them in a bar chart in one of two ways: with
the bars in blocks (or groups) or with the bars stacked. The bird matrix used in previous examples makes this
clearer. Passing the matrix to the barplot() command with no other instructions gives a stacked bar chart, in
which each column is split into its row components, as shown below:
We can use any of the additional instructions seen so far to modify the plot.
Grouped Bar Charts

When data are in a matrix with several rows, the default bar chart is a stacked chart, as seen in the previous
section. We can force the elements of each column to be unstacked by using the beside = TRUE instruction
(the default is FALSE), as shown in the following code:
> barplot(bird, beside = TRUE, ylab = 'Total birds counted', xlab = 'Habitat')
The resulting graph now shows a series of bars in each of the column categories:
This is useful, but it is even better to see which bar relates to which row category; for this we need a legend.
We can add one automatically using the legend = instruction, which creates a default legend that takes the
colors and text from the plot itself:
> barplot(bird, beside = TRUE, legend = TRUE)
> title(ylab = 'Total birds counted', xlab = 'Habitat')
The legend appears at the top right of the plot window, so it may be necessary to alter the y-axis scale using
the ylim = instruction to make it fit. In this case, the legend fits comfortably without any additional
adjustment, as shown below:
We can alter the colors of the bars by supplying a vector of color names; we might create a separate
vector or simply type the names into a col = instruction:
> barplot(bird, beside = TRUE, legend = TRUE, col = c('black', 'pink', 'lightblue', 'tan', 'red', 'brown'))
If we would rather have the row categories as the main bars, split by column, we need to rotate (transpose)
the matrix of data. We can use the t() command to do this like so:
> barplot(t(bird), beside = TRUE, legend = TRUE, cex.names = 0.8, col = c('black', 'pink', 'lightblue', 'tan', 'red',
'brown'))
> title(ylab = 'Bird Count', xlab = 'Bird Species')
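The difference between the stacked, grouped, and transposed layouts can be seen on a small stand-in matrix. This is a sketch: bird_demo and its counts are hypothetical, and the midpoints returned by barplot() are stored only to show how many bars each layout draws.

```r
# Hypothetical counts: 2 species (rows) in 3 habitats (columns)
bird_demo <- matrix(c(47, 10, 40,
                      19,  3,  5),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Blackbird", "Robin"),
                                    c("Garden", "Hedgerow", "Parkland")))

# Stacked (default): one bar per column, split into its row components
mid_stacked <- barplot(bird_demo, legend = TRUE)

# Grouped: one bar per cell, clustered by column
mid_grouped <- barplot(bird_demo, beside = TRUE, legend = TRUE)

# Transposed: rows become the main groups, split by column
mid_transposed <- barplot(t(bird_demo), beside = TRUE, legend = TRUE)
```

The stacked chart draws three bars, whereas each grouped chart draws six; transposing changes only which dimension forms the clusters.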
Horizontal Bars
We can make the bars horizontal rather than the default vertical using the horiz = TRUE instruction:
> barplot(bird, beside = TRUE, horiz = TRUE)
We can use all the regular instructions met previously on horizontal bar charts as well, for example:
> bccol = c('black', 'pink', 'lightblue', 'tan', 'red', 'brown')
> barplot(bird, beside = TRUE, legend = TRUE, horiz = TRUE, xlim = c(0, 60), col = bccol)
> title(ylab = 'Habitat', xlab = 'Bird count')
The bars now point horizontally, as shown below:

COPY GRAPHICS TO OTHER APPLICATIONS:
Being able to create a graphic is a useful start, but generally we need to transfer the graphs we have made to
another application. We may need to make a report and want to include a graph in a word processor or
presentation. We may also want to save a graph as a file on disk for later use.
Use Copy/Paste to Copy Graphs:

When we make a graph using R, it opens in a separate window. We can use copy to transfer the graphic to the
clipboard and then use paste to place it in another program. This method works on all operating systems, and
the image we get depends on the size of the graphics window and the resolution of the screen.
Save a Graphic to Disk:

We can save a graphics window to a file in a variety of formats, including PNG, JPEG, TIFF, BMP, and PDF. The
way to save graphics depends on the operating system in use. In Windows and on the Mac the GUI has options to
save graphics. In Linux, we can save graphics only via direct commands, which can also be used on the other
operating systems.
Windows
The Windows GUI allows graphics to be saved in various file formats. Once we have created the graphic, click
the graphics window and select Save As from the File menu. We have several options to choose from, as shown
below:
The JPEG option gives the opportunity to select one of several compression levels. The TIFF option produces
the largest files because no compression is used. The PNG option is useful because the PNG file format is
widely used and file sizes are quite small.
We can also use commands typed from the keyboard to save graphics files to disk, and can go about this in
several ways. The simplest is via the dev.copy() command.
This command copies the contents of the graphics window to a file; we designate the type of file and the
filename.
To finish the process, type the dev.off() command. In the following example the graphics window is saved using
the png option:
> dev.copy(png, file = 'R graphic test.png')
png:R graphic test.png
3
> dev.off()
windows
2
Macintosh
The Macintosh GUI allows graphics to be saved as PDF files. PDF is handled easily by the Mac and is a good
option because PDF graphics can be rescaled easily. To save the graphics window, click the window to select it
and then choose Save or Save As from the File menu. To save the graphic in another format, we need to
use the dev.copy() and dev.off() commands.
In the following example, the graphics window is saved as a PDF file. The filename must be specified
explicitly; the default location for saved files is the current working directory.
> dev.copy(pdf, file = 'Rplot eg.pdf')
pdf
3
> dev.off()
quartz
2
Linux
In Linux, R is run via the terminal and we cannot save a graphics file using "point and click" options. To save
a graphics file we need to use commands typed from the keyboard. The dev.copy() and dev.off() commands are
used in the same way as described for the Windows and Mac operating systems.
Start by creating the graphic required and then use the dev.copy() command to write the file to disk in the
format wanted. The process is completed by typing the dev.off() command.
In the following example, the graphics window is saved as a PNG file. The file is saved to the current working
directory; to save it somewhere else, specify the path in full as part of the filename.
> dev.copy(png, file = 'R graphic test.png')
png
3
> dev.off()
X11cairo
2
The dev.copy() command requires the type of graphic file and the filename. We can specify other options to
alter the size of the final image; the most basic options are summarized in the accompanying table.
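The whole save sequence can be sketched as a short script. Several details here are assumptions made for illustration: the file name nile_plot.pdf, the use of a temporary directory, and the width and height values (in inches for the pdf device). The first block is needed only when the script runs without a screen window, because dev.copy() needs an open device that records what was drawn.

```r
# Without a screen window, open a null PDF device and switch on its
# display list; in an interactive session plot() opens a window itself.
if (!interactive()) {
  pdf(NULL)
  dev.control(displaylist = "enable")
}

plot(Nile, type = "l")                              # the graphic to save

out_file <- file.path(tempdir(), "nile_plot.pdf")   # hypothetical file name

# Copy the current graphic to a PDF file, 7 by 5 inches
dev.copy(pdf, file = out_file, width = 7, height = 5)
dev.off()                                           # close the file device
```

The file is complete only after dev.off() has been called, so that step should never be skipped.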
