Edar M-4
MODULE-IV
HYPOTHESIS TESTING AND GRAPHICAL ANALYSIS
HYPOTHESIS TESTING: Using the Student’s t-test, The Wilcoxon U-Test (Mann-Whitney), Paired t- and
U-Tests, Correlation and Covariance, Tests for Association.
GRAPHICAL ANALYSIS: Box-whisker Plots, Scatter Plots, Pairs Plots (Multiple Correlation Plots), Line
Charts, Pie Charts, Cleveland Dot Charts, Bar Charts, Copy Graphics to Other Applications.
HYPOTHESIS TESTING
A hypothesis is an assumption that researchers make about the data collected for an experiment or data set; it
is not necessarily true. In simple words, a hypothesis is a claim made by the researchers about the population,
based on the data collected.
Hypothesis testing in R is the process of testing, or validating, a hypothesis made by the researcher. To
perform hypothesis testing, a random sample of data is taken from the population and the test is performed on
it. Based on the results of the test, the hypothesis is either accepted or rejected.
A statistical hypothesis is an assumption about a population which may or may not be true. Hypothesis testing is
a set of formal procedures used by statisticians to either accept or reject statistical hypotheses. Statistical
hypotheses are of two types:
Null hypothesis (H0) - represents the hypothesis that the observations are the result of chance alone.
Alternative hypothesis (Ha) - represents the hypothesis that the observations are influenced by some
non-random cause.
Example
Suppose we wanted to check whether a coin was fair and balanced. A null hypothesis might say that half of the
flips will be heads and half will be tails, whereas the alternative hypothesis might say that the numbers of
heads and tails will be very different.
H0: P=0.5
Ha: P≠0.5
For example, suppose we flipped the coin 50 times and obtained 40 heads and 10 tails. Given this result, we would
reject the null hypothesis and conclude, based on the evidence, that the coin was probably not fair and
balanced.
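The coin example can be checked in R. The sketch below uses binom.test(), an exact binomial test; this command is not discussed in these notes and is used here only as an assumption about how the example might be tested.

```r
# Exact binomial test of H0: P = 0.5 for 40 heads in 50 flips
coin.test = binom.test(x = 40, n = 50, p = 0.5, alternative = "two.sided")

# A p-value far below 0.05 leads us to reject H0
coin.test$p.value
```

The object coin.test is a hypothetical name; any name could be used.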
One-Sample t-Testing:
We can carry out a one-sample t-test using the t.test() command. In this version, supply the name of a single
vector and the mean to compare it to (mu, which defaults to 0):
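A minimal sketch of a one-sample t-test follows. The vector data1 and its values are invented for illustration and do not come from the original example.

```r
# Hypothetical sample (values invented for illustration)
data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# Default: compare the sample mean with mu = 0
t.test(data1)

# Compare the sample mean with a "standard" of 5 instead
res = t.test(data1, mu = 5)
res$p.value
```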
The t-test is designed to compare two samples (or one sample with a “standard”). Often, however, the data will be
in a more structured form, with a column for the response variable and a column for the predictor variable. For
example, consider data arranged in this manner:
R deals with this by having a “formula syntax.” We create a formula using the tilde (~) symbol. Essentially, the
response variable goes on the left of the ~ and the predictor goes on the right, like so:
If the predictor column contains more than two items, the t-test cannot be used directly. However, we can still
carry out a test by subsetting this predictor column and specifying which two samples we want to compare. We
must use the subset = instruction as part of the t.test() command.
The following example illustrates how to do this:
> t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))
First specify which column we want to take the subset from (graze in this case) and then type %in%; this tells
the command that the list that follows is contained in the graze column.
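The formula syntax and the subset = instruction can be sketched together as follows. The grass data frame built here is a stand-in: its values and the third level 'burn' are invented, and only the column names rich and graze come from the notes.

```r
# Hypothetical data: response (rich) and a three-level predictor (graze)
grass = data.frame(
  rich  = c(12, 15, 17, 11, 15, 8, 9, 7, 9, 10, 6, 5, 7),
  graze = c(rep("mow", 5), rep("unmow", 4), rep("burn", 4))
)

# Response on the left of ~, predictor on the right;
# subset keeps only the two samples we want to compare
res = t.test(rich ~ graze, data = grass,
             subset = graze %in% c("mow", "unmow"))
res$estimate   # the two sample means
```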
The Wilcoxon U-Test (Mann-Whitney)
When we have two samples to compare and the data are non-parametric, we can use the U-test. This test goes by
various names and may be known as the Mann-Whitney U-test or the Wilcoxon rank sum test. We use the
wilcox.test() command to carry out the analysis. The wilcox.test() command can conduct two-sample or
one-sample tests, and we can add a variety of instructions to carry out the test. The main options are shown in
the Table below:
Two-Sample U-Test:
The basic way of using the wilcox.test() command is to specify the two samples we want to compare as separate
vectors, as the following example shows:
In this case there is a warning message because there are tied values in the data. If we set exact = FALSE, this
message would not be displayed because the p-value would be determined from a normal approximation method.
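A minimal sketch of the two-sample call, using invented vectors data1 and data2 that contain tied values:

```r
# Hypothetical samples containing ties (values invented for illustration)
data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
data2 = c(3, 4, 7, 4, 4, 6, 7, 8, 5)

# exact = FALSE requests the normal approximation, so no warning about ties appears
res = wilcox.test(data1, data2, exact = FALSE)
res$method
res$p.value
```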
One-Sample U-Test
If we specify a single numerical vector, a one-sample U-test is carried out; the default is to set mu = 0, as in the
following example:
In this case the p-value is taken from a normal approximation because the exact = FALSE instruction is used.
The command has assumed mu = 0 because it is not specified explicitly.
We can also specify mu, the location shift. By default mu = 0. In the following example the hypothesis is set to
something other than 0.
In this example a one-sample test is carried out on the data3 sample vector. The test examines whether the
sample median is less than 8. The instructions also specify to display the confidence interval and not to use an
exact p-value.
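The call described above might look like the following sketch. Only the name data3 and the value mu = 8 come from the text; the sample values are invented.

```r
# Hypothetical sample (values invented for illustration)
data3 = c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# H0: median = 8; Ha: median < 8; show a confidence interval, no exact p-value
res = wilcox.test(data3, mu = 8, alternative = "less",
                  conf.int = TRUE, exact = FALSE)
res$p.value
res$conf.int
```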
We can also use additional instructions. If the predictor variable contains more than two samples, we cannot
conduct a U-test directly and must use a subset that contains exactly two samples.
The subset instruction works like so:
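As a sketch, the subset = instruction is used with wilcox.test() in the same way as with t.test(). The grass data frame below is hypothetical (invented values, with an invented third level 'burn').

```r
# Hypothetical data with three samples in the predictor column
grass = data.frame(
  rich  = c(12, 15, 17, 11, 15, 8, 9, 7, 9, 10, 6, 5, 7),
  graze = c(rep("mow", 5), rep("unmow", 4), rep("burn", 4))
)

# Keep exactly two samples for the U-test
res = wilcox.test(rich ~ graze, data = grass,
                  subset = graze %in% c("mow", "unmow"), exact = FALSE)
res$p.value
```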
The U-test is one of the most widely used statistical methods, so it is important to be comfortable using the
wilcox.test() command.
Paired t- and U-Tests:
If we have paired data, we can use matched-pair versions of the t-test and the U-test with a simple extra
instruction: add paired = TRUE to the command. It does not matter if the data
are in two separate sample columns or are represented as response and predictor. In fact, R will carry out a
paired test even if the data do not really match up as pairs. We can use all the regular syntax and instructions,
so we can use subsetting and directional hypotheses as we like. The following activity carries out paired tests:
Look at the mpd data; it contains two samples, white and yellow. These data are matched pair data and each
row represents a bi-colored target. The values are for numbers of whitefly attracted to each half of the target.
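A sketch of the paired tests follows. The column names white and yellow come from the description above, but the counts are invented; the two samples are passed as separate vectors.

```r
# Hypothetical matched-pair counts; each row is one bi-colored target
mpd = data.frame(white  = c(4, 3, 4, 1, 6, 7, 6, 5),
                 yellow = c(4, 7, 2, 2, 7, 10, 5, 8))

# Paired t-test
pt = t.test(mpd$white, mpd$yellow, paired = TRUE)

# Paired U-test (normal approximation because of ties and zero differences)
pu = wilcox.test(mpd$white, mpd$yellow, paired = TRUE, exact = FALSE)

pt$p.value
pu$p.value
```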
CORRELATION AND COVARIANCE:
Correlation means association - more precisely it is a measure of the extent to which two variables are related.
There are three possible results of a correlational study: a positive correlation, a negative correlation, and no
correlation.
A positive correlation is a relationship between two variables in which both variables move in the same
direction. That is, one variable increases as the other increases, or one variable decreases
as the other decreases. An example of positive correlation would be height and weight. Taller people tend to
be heavier.
A negative correlation is a relationship between two variables in which an increase in one variable is
associated with a decrease in the other. An example of negative correlation would be height above sea level and
temperature. As we climb the mountain (increase in height) it gets colder (decrease in temperature).
A zero correlation exists when there is no relationship between two variables. For example there is no
relationship between the amount of tea drunk and level of intelligence.
A correlation can be expressed visually. This is done by drawing a scattergram (also known as a scatter plot,
scatter graph, scatter chart, or scatter diagram).
A scattergram is a graphical display that shows the relationships or associations between two numerical
variables (or co-variables), which are represented as points (or dots) for each pair of scores.
A scattergram indicates the strength and direction of the correlation between the co-variables.
When we have two continuous variables, we can look for a link between them; this link is called a correlation. We
can go about finding this in several ways using R. The cor() command determines correlations between two
vectors, all the columns of a data frame (or matrix), or two data frames (or matrix objects).
The cov() command examines covariance. By default the Pearson product moment (that is, regular parametric
correlation) is used, but the Spearman (rho) and Kendall (tau) methods (both non-parametric correlation) can be
specified instead. The cor.test() command carries out a test of significance of the correlation.
A variety of additional instructions can be added to these commands, as listed in the following Table:
Simple Correlation:
Simple correlations are between two continuous variables and we can use the cor() command to obtain a
correlation coefficient like so:
> count = c(9, 25, 15, 2, 14, 25, 24, 47)
> speed = c(2, 3, 5, 9, 14, 24, 29, 34)
> cor(count, speed)
[1] 0.7237206
The default for R is to carry out the Pearson product moment, but we can specify other correlations using the
method = instruction, like so:
> cor(count, speed, method = 'spearman')
[1] 0.5269556
This example used the Spearman rho correlation, but we can also apply Kendall’s tau by specifying method = 'kendall'.
If the vectors are contained within a data frame or some other object, we need to extract them in a different
fashion. Look at the women data frame, which comes as example data with the distribution of R.
> data(women)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
We need to use the attach() or with() commands to allow R to “read inside” the data frame and access the
variables within, or use the $ syntax so that the command can access the variables, as the following example
shows:
> cor(women$height, women$weight)
[1] 0.9954948
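The three ways of reaching the variables give the same answer, as this short sketch shows (the names r.dollar, r.with and r.attach are hypothetical):

```r
# $ syntax
r.dollar = cor(women$height, women$weight)

# with(): evaluate the expression "inside" the data frame
r.with = with(women, cor(height, weight))

# attach()/detach(): make the columns visible temporarily
attach(women)
r.attach = cor(height, weight)
detach(women)

c(r.dollar, r.with, r.attach)
```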
In this example the cor() command has calculated the Pearson correlation coefficient between the height and
weight variables contained in the women data frame. We can also use the cor() command directly on a data
frame (or matrix). If we use the women data frame, we get the following:
> cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
When we have more columns the matrix can be much more complex. The following example contains five
columns of data:
We can also select a single variable and compare it to all the others. Here the Length variable is compared to
all the variables in the mf data frame using the default Pearson coefficient:
> cor(mf$Length, mf)
     Length      Speed     Algae       NO3        BOD
[1,]      1 -0.3432297 0.7650757 0.4547609 -0.8055507
Covariance:
In R, covariance can be measured using the cov() function. Covariance is a statistical measure of the direction
of the linear relationship between two data vectors.
The cov() command uses syntax similar to the cor() command to examine covariance. The women data are
used with the cov() command in the following example:
> cov(women$height, women$weight)
[1] 69
> cov(women)
       height   weight
height     20  69.0000
weight     69 240.2095
The cov2cor() command is used to determine the correlation from a matrix of covariance in the following
example:
> women.cv = cov(women)
> cov2cor(women.cv)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
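To see what cov() and cov2cor() compute, the sample covariance and the Pearson correlation can be worked out by hand. This is a sketch using the standard formulas; the names cov.manual and cor.manual are hypothetical.

```r
x = women$height
y = women$weight
n = length(x)

# Sample covariance: sum of products of deviations, divided by n - 1
cov.manual = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Pearson correlation: covariance scaled by the two standard deviations
cor.manual = cov.manual / (sd(x) * sd(y))

cov.manual   # agrees with cov(women$height, women$weight)
cor.manual   # agrees with cor(women$height, women$weight)
```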
Significance Testing in Correlation Tests:
We can apply a significance test to correlations using the cor.test() command. If the test concludes that the
correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."
Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship
between x and y because the correlation coefficient is significantly different from zero.
If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero),
we say that the correlation coefficient is "not significant".
Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship
between x and y because the correlation coefficient is not significantly different from zero."
With cor.test() we can compare only two vectors at a time, as the following example shows:
From the example, we see that the Pearson correlation has been carried out between height and weight in the
women data and the result also shows the statistical significance of the correlation.
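A minimal sketch of the call on the women data (res is a hypothetical name):

```r
# Pearson correlation test between two vectors
res = cor.test(women$height, women$weight)

res$estimate   # the correlation coefficient
res$p.value    # significance of the correlation
```

A different coefficient can be requested with the method = instruction, in the same way as for cor().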
Formula Syntax:
If our data are contained in a data frame, using the attach() or with() commands is tedious, as is using the $
syntax. A formula syntax is available as an alternative, which provides a neater representation of the data:
The formula is slightly different from the one used earlier: we need to specify both variables to the right of
the ~. We also give the name of the data as a separate instruction. All the additional instructions are available
when using the formula syntax, as well as the subset instruction.
If the data contain a separate grouping column, we can specify the samples to use from it using an instruction
along the following lines:
subset = grouping %in% 'sample'
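The formula form of cor.test() can be sketched as follows. The women data are used for the basic call; the data frame d, with its grouping column, is invented to illustrate the subset = instruction.

```r
# Both variables go to the right of the ~
res = cor.test(~ height + weight, data = women)

# Hypothetical data with a grouping column (values invented for illustration)
d = data.frame(x = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6),
               y = c(2, 4, 5, 4, 7, 9, 9, 7, 8, 4, 3, 1),
               grouping = rep(c("a", "b"), each = 6))

# Test the correlation within sample "a" only
res.a = cor.test(~ x + y, data = d, subset = grouping %in% "a")
res.a$estimate
```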
TESTS FOR ASSOCIATION:
When the data are already in a contingency table, each cell represents a unique combination of the two
categories; in the example here we have several habitats and several species. To run the chisq.test() command,
simply give the name of the data to the command like so:
In this case, we give the result a name and set it up as a new object, which we can examine in more detail. We
get a warning message in this example; this is because of some small values for the observed data, and the
expected values will probably include some that are smaller than 5. The result object can be examined in more
detail; start by trying a summary() command:
The result object we created contains several parts. A simpler way to see what we are dealing with is to use the
names() command:
> names(bird.cs)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals"
We can access the various parts of the result object by using the $ syntax and adding the part we want to
examine. For example:
> bird.cs$stat
X-squared
78.27364
> bird.cs$p.val
[1] 7.693581e-09
We can see the calculated expected values as well as the Pearson residuals by using the appropriate
abbreviation. In the following example we look at the expected values:
From the above example we can see that some expected values are < 5, and this is the reason for the warning
message. We might prefer to display the values as whole numbers; we can adjust the output by using the round()
command to choose how many decimal places to display, like so:
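The whole sequence can be sketched with a small contingency table. The matrix below is invented for illustration (it is not the bird data referred to above), but the object name bird.cs follows the notes.

```r
# Hypothetical contingency table: species (rows) by habitat (columns)
bird = matrix(c(47, 19, 50,
                10,  3,  8,
                40,  5, 10,
                 2,  4,  6),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Blackbird", "Chaffinch", "Robin", "Thrush"),
                              c("Garden", "Hedgerow", "Parkland")))

# R warns that the approximation may be incorrect (small expected values)
bird.cs = chisq.test(bird)

names(bird.cs)               # parts of the result object
bird.cs$expected             # expected values; some are below 5
round(bird.cs$expected, 0)   # the same values as whole numbers
```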
In the first example, when we run the chisq.test() command on the data, we see that Yates’ correction is applied
automatically. In the second example, we force the command not to apply the correction by setting correct = FALSE.
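A sketch of the two calls on a 2 x 2 table follows; Yates’ correction only applies to tables of this size. The counts and object names are invented.

```r
# Hypothetical 2 x 2 table of counts
tab = matrix(c(12, 5, 7, 9), nrow = 2)

with.yates = chisq.test(tab)                   # correction applied by default
no.yates   = chisq.test(tab, correct = FALSE)  # correction switched off

with.yates$method
no.yates$method
c(with.yates$statistic, no.yates$statistic)    # corrected statistic is smaller
```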
In this example, we did not have the probabilities as true probabilities but as frequencies, so we use the
rescale.p = TRUE instruction to make sure that these are converted to probabilities.
The result contains all the usual items for a chi-squared result object, but if we display the expected values, for
example, we do not automatically get to see the row names, even though they are present in the data:
> survey.cs$exp
[1] 20.25195 29.93766 116.22857 86.29091 39.62338 46.66753
We can get the row names from the original data using the row.names() command. We could set the names of
the expected values in the following way:
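A goodness-of-fit sketch along these lines is shown below. The survey data frame, its column names visit and ratio, and its values are invented; only the object name survey.cs comes from the notes.

```r
# Hypothetical data: observed counts and expected frequencies (not probabilities)
survey = data.frame(visit = c(22, 30, 110, 90, 40, 47),
                    ratio = c(12, 18, 70, 52, 24, 28),
                    row.names = c("A", "B", "C", "D", "E", "F"))

# rescale.p = TRUE converts the frequencies to probabilities that sum to 1
survey.cs = chisq.test(survey$visit, p = survey$ratio, rescale.p = TRUE)

# The expected values carry no names, so take them from the original data
expected = survey.cs$expected
names(expected) = row.names(survey)
expected
```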
PART-II
GRAPHICAL ANALYSIS:
Graphs are a powerful way to present data and results in a concise manner. Whatever kind of data we have, there
is a way to illustrate it graphically. A graph is more readily understandable than words and numbers, and
producing good graphs is a vital skill.
Some graphs are also useful in examining data, so that we gain some idea of the patterns that may exist; this can
direct us toward the correct statistical analysis.
R has powerful and flexible graphical capabilities. In general terms, R has two kinds of graphical commands:
some commands generate a basic plot of some sort, and other commands are used to adjust the output and
to produce a more customized finish.
BOX-WHISKER PLOTS:
The box-whisker plot (abbreviated to boxplot) is a useful way to visualize complex data where we have
multiple samples, and to display differences between samples. The basic form of the box-whisker plot shows
the median value, the quartiles, and the max/min values. That means, we get a lot of information in a compact
manner.
The box-whisker plot is also useful to visualize a single sample because we can show outliers. We can use the
boxplot() command to create box-whisker plots. The command can work in a variety of ways to visualize
simple or quite complex data.
Basic Boxplots
The following example shows a simple data frame composed of two columns:
We can use the boxplot() command to visualize one of the variables here:
> boxplot(fw$speed)
This produces a simple graph, as shown in the Figure below:
This graph shows the typical layout of a box-whisker plot. The stripe shows the median, the box represents the
upper and lower quartiles, and the whiskers show the maximum and minimum values.
If we have several items to plot, simply give the vector names in the boxplot() command:
> boxplot(fw$count, fw$speed)
The resulting graph appears as shown in the Figure below:
In this case, we specify vectors that correspond to the two columns in the data frame, but they could be
completely separate.
Customizing Boxplots:
A plot without labels is useless; the plot needs labels. We can use the xlab and ylab instructions to label the
axes. We can use the names instruction to set the labels (currently displayed as 1 and 2) for the two samples,
like so:
> boxplot(fw$count, fw$speed, names = c('count', 'speed'))
> title(xlab = 'Variable', ylab = 'Value')
The resulting plot is shown in the Figure below:
In this case we used the title() command to add the axis labels, but we could have specified xlab and ylab
within the boxplot() command. Now there are names for each of the samples as well as axis labels.
Notice from the above figure that the whiskers of the count sample do not extend to the top, and that there
appears to be a separate point displayed. We can determine how far out the whiskers extend; by default this is
1.5 times the interquartile range.
We can alter this by using the range = instruction; if we specify range = 0 as shown in the following example,
the whiskers extend to the maximum and minimum values:
> boxplot(fw$count, fw$speed, names = c('count', 'speed'), range = 0, xlab = 'Variable', ylab = 'Value',
col = 'gray90')
The final graph appears as shown in the Figure below:
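The effect of the range = instruction can also be seen numerically. In this sketch the vector x is invented, and plot = FALSE is used so that boxplot() returns its statistics without drawing.

```r
# Hypothetical sample with one extreme value
x = c(2, 3, 4, 5, 6, 7, 30)

default.bp = boxplot(x, plot = FALSE)             # whiskers limited to 1.5 x IQR
full.bp    = boxplot(x, range = 0, plot = FALSE)  # whiskers reach the extremes

default.bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
default.bp$out     # points drawn separately as outliers
full.bp$stats
full.bp$out        # empty: no separate points
```

Dropping plot = FALSE draws the corresponding graphs.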
Consider the data in a different arrangement; commonly we have a data frame with one column representing
the response variable and another representing a predictor (or grouping) variable. In practice this means we
have one vector containing all the numerical data and another vector containing the grouping information as
text. Look at the following example:
> grass
With data in this format, it is best to use the same formula notation as before. When doing so, use the ~ symbol
to separate the response variable on the left from the predictor variable on the right. We also instruct the
command where to find the data and set range = 0 to force the whiskers to the maximum and minimum as
before.
Consider the following example for details:
> boxplot(rich ~ graze, data = grass, range = 0)
> title(xlab = 'cutting treatment', ylab = 'species richness')
This time the samples are automatically labeled; the command takes the names of the samples from the levels
of the factor, presented in alphabetical order. The resulting graph is shown in the Figure below:
Horizontal Boxplots
With a simple additional instruction we can display the boxes horizontally rather than vertically:
> boxplot(rich ~ graze, data = grass, range = 0, horizontal = TRUE)
> title(ylab = 'cutting treatment', xlab = 'species richness')
When we use the horizontal = TRUE instruction, the graph is displayed with horizontal boxes, as shown
below:
SCATTER PLOTS
The basic plot() command is a generic function that can be pressed into service for a variety of uses. Many
specialized statistical routines include a plotting routine to produce a specialized graph. We will use the plot()
command to produce xy scatter plots. The scatter plot is used especially to show the relationship between two
variables.
Basic Scatter Plots
The following data frame contains two columns of numeric values, and because they contain the same
number of observations, they could form the basis for a scatter plot:
The basic form of the plot() command requires us to specify the x and y data, each being a numeric vector. We
use it like so:
> plot(x, y, ...)
If we have data contained in a data frame, we must use the $ syntax to get at the variables, or use the with()
or attach() commands. For the example data here, the following commands all produce a similar result:
> plot(fw$speed, fw$count)
> with(fw, plot(speed, count))
> attach(fw)
> plot(speed, count)
> detach(fw)
The resulting graph is shown in the Figure below:
The names of the axis labels match up with what we typed into the command.
Adding Axis Labels
We can produce our own axis labels easily using the xlab and ylab instructions. For example, to create labels for
these data we might use something like the following:
> plot(fw$speed, fw$count, xlab = 'Speed m/s', ylab = 'Count of Mayfly')
We can still use the title() command to add axis titles later, but we need to produce blank titles to start with.
We must set each title in the plot() command to blank using a pair of quotes, as shown in the following:
> plot(fw$speed, fw$count, xlab = " ", ylab = " ")
Plotting Symbols:
We can use many other graphical parameters to modify the basic scatter plot. If we want to alter the plotting
symbol, we use the pch = instruction; pch refers to the plotting character and can be specified in one of
several ways. We can type an integer value, and this code will be reflected in the symbol/character produced.
For values from 0 to 25, we get symbols that look like the ones depicted in the following Figure:
These were produced on a scatter plot using the following lines of command:
> plot(0:25, rep(1, 26), pch = 0:25, cex = 2)
> text(0:25, 0.95, as.character(0:25))
The first part produces a series of points and sets the x values to range from 0 to 25 (to correspond to the pch
values). The y values are set at 1 so that we get a horizontal line of points; the rep() command is used to
repeat the value 1 for 26 times. In other words, we get 26 1s to correspond to the various x values. We then set
the plotting character to vary from 0 to 25 using pch = 0:25. Finally, we make the points a bit bigger using a
character expansion factor (cex = 2). The text() command is used to add text to a current plot.
We can also specify a character from the keyboard directly by enclosing it in quotes; to produce + symbols, for
example, type the following:
> plot(fw$speed, fw$count, pch = "+")
The + symbol is also obtained via pch = 3.
PAIRS PLOTS (MULTIPLE CORRELATION PLOTS)
By default, the pairs() command takes all the columns in a data frame and creates a matrix of scatter plots. We
can choose which columns we want to display by using the formula notation along the following lines:
pairs(~ x + y + z, data = our.data)
We simply provide the required variables and separate them with + signs. If we are using a data frame, also
give the name of the data frame. In the current example we can select some of the columns like so:
> pairs(~ Length + Speed + NO3, data = mf)
This produces a graph as shown in the Figure below:
We can alter the plotting characters, their size, and color using the pch, cex, and col instructions. The following
command produces large red crosses but otherwise is essentially the same graph:
> pairs(~ Length + Speed + NO3, data = mf, col = 'red', cex = 2, pch = 'X')
LINE CHARTS
The plot() command is used to produce scatter plots, either as a single pair of variables or a multiple-pairs
plot. There may be many occasions when we have data that are time-dependent, that is, data collected over
a period of time. We would want to display these data as a scatter plot where the y-axis reflects the
magnitude of the data recorded and the x-axis reflects the time.
Line Charts Using Numeric Data
If the time variable recorded is in the form of a numeric variable, we can use a regular plot() command. We can
specify different ways to present the data using the type instruction. The following Table lists the main options
we can set using the type instruction.
Therefore, if we want to highlight the pattern, we can specify type = "l" and draw a line, leaving the points out
entirely. Notice that we can use type = "n" to produce nothing at all.
Look at the Nile data that comes with R. This is stored as a special kind of object called a time series.
Essentially, this enables us to specify the time in a more space-efficient manner than using a separate column of
data. In the Nile data we have measurements of the flow of the Nile river from 1871 to 1970. If we plot these
data, we get the result shown in the figure below:
> plot(Nile, type = 'l')
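The effect of the type instruction can be compared side by side. This sketch draws the Nile data with four common settings ("p" points, "l" lines, "b" both, "n" nothing); the 2 x 2 panel layout is an arbitrary choice.

```r
# Draw four panels, one per setting of type
op = par(mfrow = c(2, 2))
for (tp in c("p", "l", "b", "n")) {
  plot(Nile, type = tp, main = paste("type =", tp))
}
par(op)   # restore the previous layout

# The time information is stored with the series itself
start(Nile)
end(Nile)
```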
If the data are not in numerical order, we can end up with some odd-looking line charts. We can use the sort()
command to reorder the data using the x-axis data, which usually sorts out the problem. Look at the following
examples:
In the first case the data are not sorted, and the result is a bit of a mess. In the second case the data are
sorted, and the result is a lot better.
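A sketch of the problem and the fix follows, with invented x and y vectors. Note that order() is used here rather than sort(): this is a substitution made for the sketch, because the x and y values have to be rearranged together.

```r
# Hypothetical data where x is not in numerical order
x = c(5, 1, 4, 2, 3)
y = c(25, 1, 16, 4, 9)

plot(x, y, type = "l")         # line zigzags back and forth

o = order(x)                   # positions that put x in increasing order
plot(x[o], y[o], type = "l")   # line now runs left to right
```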
If the data form a sequence but the x-axis does not have numerical values, the situation is trickier. For example,
consider numeric data with labels that are categorical, as shown below:
The data may also be in the form of a data frame; the following example shows the same data, but this time the
labels are in a second column:
In either case, we can try plotting the data using the plot() command like so:
> plot(rain, type = 'b')
> plot(rainfall$rain, type = 'b')
The resulting plot is shown below:
The x-axis remains a simple numeric index. To alter the x-axis as desired we need to remove the
existing x-axis and create our own using the character vector as the labels. Perform the following steps to do so:
1. Start by turning off the axes using the axes = FALSE instruction. We can still label the axes using the xlab and
ylab instructions as seen before. If we want to produce blank labels and add them later using the title()
command, set them using a pair of quotes; for example, xlab = " ":
> plot(rain, type = 'b', axes = FALSE, xlab = 'Month', ylab = 'Rainfall cm')
2. Now construct the x-axis using the character labels we already have. The axis() command creates an axis for a
plot. The basic layout of the command is like so:
axis(side, at = NULL, labels = TRUE)
The first part sets which side of the plot the axis is created on; 1 is the bottom, 2 is the left, 3 is the
top, and 4 is the right side of the plot. The at = part determines where the tick marks are placed;
we give this as a range 1:n, where n is the number of tick marks required (12 in this case).
3. Finally, point to the labels. In this example, we use a separate character vector for the labels:
> month = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
> axis(side = 1, at = 1: length(rain), labels = month)
This creates an axis at the bottom of the plot (the x-axis) and sets the tick marks from 1 to 12.
4. To finish off the plot, make the y-axis. We can make the y-axis using:
> axis(side = 2)
This creates the y-axis and takes the scale from the existing plot.
5. Finally, enclose the whole lot in a neat bounding box. Use the box() command to make an enclosing
bounding box for the entire plot.
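The five steps above can be put together as one sequence. In this sketch the rain values are invented for illustration; the commands are the ones described in the steps.

```r
# Hypothetical monthly rainfall values
rain  = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
month = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

# Step 1: plot without axes
plot(rain, type = 'b', axes = FALSE, xlab = 'Month', ylab = 'Rainfall cm')
# Steps 2 and 3: x-axis with one tick per month, labeled from the character vector
axis(side = 1, at = 1:length(rain), labels = month)
# Step 4: y-axis, scale taken from the existing plot
axis(side = 2)
# Step 5: bounding box
box()
```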
PIE CHARTS:
If we have data that represent how something is divided up between various categories, the pie chart is a
common graphic choice to illustrate those data. For example, we might have data that show sales for various
items for a whole year. The pie chart enables us to show how each item contributed to total sales. Each item is
represented by a slice of pie; the bigger the slice, the bigger the contribution to the total sales. In simple
terms, the pie chart takes a series of data, determines the proportion of each item toward the total, and then
represents these as different slices of the pie.
The pie chart is commonly used to display proportional data. We can create pie charts using the pie()
command and use a vector of numeric values to create the plot like so:
> data11
[1] 3 5 7 5 3 2 6 8 5 6 9 8
When using the pie() command, these values are converted to proportions of the total and then the angle of
the pie slices is determined. If possible, the slices are labeled with the names of the data. In the current
example, we have a simple vector of values with no names, so we must supply them separately. We can do this in
a variety of ways; in this instance we have a vector of character labels:
> data8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
To create a pie chart with labels use the pie() command in the following manner:
> pie(data11, labels = data8)
This produces a plot as shown in the Figure below:
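The conversion from values to slices can be worked through by hand. This sketch uses the data11 values shown above; the names props and angles are hypothetical.

```r
data11 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)

# Each value as a proportion of the total (the total here is 67)
props = data11 / sum(data11)

# Each proportion as an angle; the slices together make a full circle
angles = props * 360

round(props, 3)
round(angles, 1)
```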
We can alter the direction and starting point of the slices using the clockwise = and init.angle = instructions. By
default the slices are drawn counter-clockwise, so clockwise = FALSE; we can set this to TRUE to produce
clockwise slices. The starting angle is set to 0º (this is 3 o’clock) by default when we have clockwise = FALSE.
The starting angle is set to 90º (12 o’clock) when we have clockwise = TRUE.
The default colors used are a range of six pastel colors; these are recycled as necessary. We can specify a
range of colors to use with the col = instruction. One way to do this is to make a vector of color names. In the
following example, we make a vector of gray colors and then use these as the chart colors:
> pc = c('gray40', 'gray50', 'gray60', 'gray70', 'gray80', 'gray90')
> pie(data11, labels = data8, col = pc, clockwise = TRUE, init.angle = 180)
This command also sets the slices to be drawn clockwise and sets the starting point to 180º, which is 9 o’clock.
The resulting plot is shown in the Figure below:
When data are part of a data frame, we must use the $ syntax to access the column that we require, or use the
with() or attach() commands. In the following example, the data frame contains row names that we can use to
label the pie slices:
The labels = instruction points to the row names of the data frame. The final graph looks like the following Figure:
When data are in matrix form, we have a few additional options: we can produce pie charts of the rows or the
columns. The following data example shows a matrix of bird observation data; the rows and the columns are
named:
We can use the [row, column] syntax with the pie() command; here we examine the first row:
> pie(bird[1,], col = pc)
This produces a graph like the following Figure:
If the data are in a data frame rather than a matrix, we get an error message. In this case, first convert the
data frame row into a matrix and then plot the pie chart, as shown below:
> pie(as.matrix(mf[1,]), labels = names(mf), col = pc)
Similarly, we can make pie charts from the columns, in which case we specify the column required using the
[row, column] syntax. The following command examples both produce a pie chart of the Hedgerow column in the
bird data we saw previously:
> pie(bird[,2])
> pie(bird[,'Hedgerow'])
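The difference between selecting a row and selecting a column can be sketched with a small named matrix. The bird matrix below is a stand-in with invented counts and names (apart from Hedgerow); it is not the data set used in the notes.

```r
# Hypothetical bird observation matrix with named rows and columns
bird = matrix(c(47, 19, 50,
                10,  3,  8,
                40,  5, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Blackbird", "Chaffinch", "Robin"),
                              c("Garden", "Hedgerow", "Parkland")))

pie(bird[1, ])            # first row: one species across the habitats
pie(bird[, 2])            # second column: one habitat across the species
pie(bird[, 'Hedgerow'])   # the same column, selected by name
```

The slice labels are taken from the names of the selected row or column.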
CLEVELAND DOT CHARTS
An alternative to the pie chart is a Cleveland dot plot. All data that might be presented as a pie chart could
also be presented as a bar chart or a dot plot. We can create Cleveland dot plots using the dotchart()
command. If the data are a simple vector of values then, as with the pie() command, we simply give the vector
name. To create labels we need to specify them. In the following example, we have a vector of numeric values and
a vector of character labels:
> data11; data8
[1] 3 5 7 5 3 2 6 8 5 6 9 8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> dotchart(data11, labels = data8)
The resulting dot plot looks like the following Figure:
For more complex data, it is best if they are in the form of a matrix; the following data
are bird observations:
With a pie chart we must create a pie for the rows or the columns separately; with the dot plot we can do both at
once. We can create a basic dot plot grouped by columns simply by specifying the matrix name like so:
> dotchart(bird)
This produces a dot plot that looks like the following figure:
Here we see the data shown column by column; in other words, we see the data for each column broken
down by rows. We might choose to view the data in a different order; by transposing the matrix we can display
the rows as groups, broken down by column:
> dotchart(t(bird))
We use the t() command to transpose the matrix and produce the dot plot, which looks like the figure below:
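The effect of transposing can be checked on a small stand-in matrix. The counts and most names below are hypothetical, since the original bird data are not reproduced here; the sketch only illustrates which dimension becomes the grouping.

```r
# Hypothetical bird counts (illustrative values only).
bird <- matrix(c(47, 10, 40,
                 19,  3,  5,
                 50,  1, 10,
                  9,  4,  2),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Blackbird", "Chaffinch", "Great Tit", "Robin"),
                               c("Garden", "Hedgerow", "Parkland")))

# Default: one group per column (habitat), one point per row (species).
dotchart(bird)

# Transposed: one group per species, one point per habitat.
bird_t <- t(bird)
dotchart(bird_t)

dim(bird)     # 4 rows, 3 columns
dim(bird_t)   # 3 rows, 4 columns
```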
We can alter a variety of parameters on the plot. The following table illustrates a few of the options:
The following command uses some of these instructions to produce the graph shown in the figure:
> dotchart(bird, color = 'gray30', gcolor = 'black', lcolor = 'gray30', cex = 0.8, xlab = 'Bird Counts', bg = 'gray90',
pch = 21)
We can also show a summary value for each of the groups using the gdata = instruction, which takes one value
per group. It makes the most sense to use an average of some kind (mean or median) for this. In the following
example the mean of each column is used as the group summary:
> dotchart(bird, gdata = colMeans(bird), gpch = 16, gcolor = 'blue')
> mtext('Grouping = mean', side =3, adj = 1)
> title(main = 'Bird species and Habitat')
> title(xlab = 'Bird abundance')
The result is shown below:
The first line of command draws the main plot; the mean is taken by using the colMeans() command and
applying it to the plot via the gdata = instruction. The plotting character of the grouping function is set using
the gpch = instruction; here, a filled circle is used to make it stand out from the main points. The gcolor =
instruction sets a color for the grouping points (and labels).
The second line adds some text to the margin of the plot; here we use the top axis (side = 3; side = 1 is the
bottom and 2 is the left) and adjust the text to be at the right-hand end (adj = 1; adj = 0 would place it at the
other end of the axis).
The final two lines add titles to the main plot and the value axis (the x-axis).
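Because dotchart() groups a matrix by column, the gdata = instruction needs exactly one value per column. The sketch below, which uses hypothetical counts in place of the original bird data, computes both the mean and the median per column and passes each to gdata.

```r
# Hypothetical bird counts (illustrative values only).
bird <- matrix(c(47, 10, 40,
                 19,  3,  5,
                 50,  1, 10,
                  9,  4,  2),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Blackbird", "Chaffinch", "Great Tit", "Robin"),
                               c("Garden", "Hedgerow", "Parkland")))

# One summary value per column, that is, per group in the dot chart.
group_means   <- colMeans(bird)
group_medians <- apply(bird, 2, median)

dotchart(bird, gdata = group_means, gpch = 16, gcolor = "blue")
mtext("Grouping = mean", side = 3, adj = 1)

dotchart(bird, gdata = group_medians, gpch = 16, gcolor = "blue")
mtext("Grouping = median", side = 3, adj = 1)
```

For the transposed chart the groups are the species, so the matching summary would be rowMeans(bird) rather than colMeans(bird).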
BAR CHARTS
The bar chart is suitable for showing data that fall into discrete categories. The histogram is a form of
bar chart in which each bar shows the number of items in a certain range of data values. Bar
charts are widely used because they convey information in a readily understood fashion. They are also flexible
and can show items in various groupings.
We use the barplot() command to produce bar charts.
By default the chart has no axis labels of any kind, but we can add them quite simply. To start with, to give
names to the bars, use the names = instruction (an abbreviation of the full argument name, names.arg) to point
to a vector of names.
The following example shows one way to do this:
> rain
[1] 3 5 7 5 3 2 6 8 5 6 9 8
> month
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> barplot(rain, names = month)
If we do not have a names vector, we can make one or simply specify the names using a c() command like so:
> barplot(rain, names = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
If the vector has a names attribute, the barplot() command can read the names directly.
In the following example, we set the names() of the rain vector and then use the barplot() command:
> rain ; month
[1] 3 5 7 5 3 2 6 8 5 6 9 8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> names(rain) = month
> rain
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
3 5 7 5 3 2 6 8 5 6 9 8
> barplot(rain)
Now the bars are neatly labeled with the names taken from the data itself, as shown below:
To add axis labels we can use the xlab and ylab instructions, either as part of the barplot() command itself or
later using the title() command.
In the following example, we create the axis titles afterwards:
> barplot(rain)
> title(xlab = 'Month', ylab = 'Rainfall cm')
In the resulting plot the y-axis is shorter than the tallest bar, so we extend the y-axis scale using the ylim
instruction, for example ylim = c(0, 10).
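A minimal sketch of the ylim adjustment follows. Rounding the upper limit up to the next multiple of 5 is just one possible convention, and the name y_top is introduced here for illustration; the rain values are those shown earlier, and month.abb is R's built-in vector of month abbreviations.

```r
rain <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
names(rain) <- month.abb   # built-in "Jan" ... "Dec"

# Round the largest value up to the next multiple of 5 so that the
# axis extends past the tallest bar (an illustrative convention).
y_top <- ceiling(max(rain) / 5) * 5

barplot(rain, xlab = "Month", ylab = "Rainfall cm", ylim = c(0, y_top))
```

Here y_top works out to 10, which gives the same limits as typing ylim = c(0, 10) by hand.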
We can also add horizontal gridlines with the abline() command, giving their positions as a sequence created
with the seq() command. With this command we specify the starting value, the ending value, and the interval.
The lty = instruction sets the lines to be dashed, and the lwd = instruction makes the lines a bit thinner than
usual.
Finally, we set the gridline color to gray using the col = instruction. When we put the commands together, we
end up with something like this:
> barplot(rain, xlab = 'Month', ylab = 'Rainfall cm', ylim = c(0,10), col = 'lightblue')
> abline(h = seq(1,9,2), lty = 2, lwd = 0.5, col = 'gray40')
> box()
The final graph looks like the following figure:
We can create a bar chart of frequencies that is similar to a histogram by using the table() command:
> table(rain)
rain
2 3 5 6 7 8 9
1 2 3 2 1 2 1
This is the result of using the table() command on the data: the values are split into a simple frequency table.
The first row shows the categories (each relating to an actual numeric value), and the second row shows the
frequencies in each of these categories. If we create a barplot() using these data, we get something like the
figure shown below, which is produced using the following commands:
> barplot(table(rain), ylab = 'Frequency', xlab = 'Numeric category')
> abline(h = 0)
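One difference from a true histogram is that table() only lists values that actually occur, so the value 4, which is absent from rain, gets no bar at all. The sketch below shows one possible way to keep such empty categories by converting the data to a factor with explicit levels; this is an illustrative variation, not part of the original example.

```r
rain <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)

# Frequencies of the values that occur in the data.
freq <- table(rain)
barplot(freq, ylab = "Frequency", xlab = "Numeric category")
abline(h = 0)

# Assumed variation: declare every integer from min to max as a level,
# so that values with zero frequency (here 4) still appear on the axis.
freq_all <- table(factor(rain, levels = min(rain):max(rain)))
barplot(freq_all, ylab = "Frequency", xlab = "Numeric category")
abline(h = 0)
```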
When the data are in a matrix, as with the bird data, the plot that results from barplot() is by default a stacked
bar chart in which each column is split into its row components, as shown below:
We can use any of the additional instructions that we have seen so far to modify the plot. To show the row
components side by side rather than stacked, use the beside = TRUE instruction; the resulting graph then shows
a series of bars in each of the column categories:
This is useful, but it is even better to see which bar relates to which row category; for this we need a legend.
We can add one automatically using the legend = instruction, which creates a default legend that takes the
colors and text from the plot itself:
> barplot(bird, beside = TRUE, legend = TRUE)
> title(ylab = 'Total birds counted', xlab = 'Habitat')
The legend appears at the top right of the plot window, so if necessary we must alter the y-axis scale using the
ylim = instruction to get it to fit. In this case, the legend fits comfortably without any additional adjustments,
as shown below:
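The stacked and side-by-side layouts, and the ylim adjustment that makes room for a legend, can be compared in a short sketch. The counts below are hypothetical stand-ins for the bird data, and the 30% headroom is an arbitrary choice; legend.text is the full name of the argument that the notes abbreviate as legend.

```r
# Hypothetical bird counts (illustrative values only).
bird <- matrix(c(47, 10, 40,
                 19,  3,  5,
                 50,  1, 10,
                  9,  4,  2),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Blackbird", "Chaffinch", "Great Tit", "Robin"),
                               c("Garden", "Hedgerow", "Parkland")))

# Default: one stacked bar per column; barplot() returns the bar midpoints.
stacked <- barplot(bird, legend.text = TRUE)

# Side by side: one bar per cell, with extra headroom so that the
# legend at the top right does not overlap the tallest bars.
grouped <- barplot(bird, beside = TRUE, legend.text = TRUE,
                   ylim = c(0, max(bird) * 1.3))
title(ylab = "Total birds counted", xlab = "Habitat")
```

The returned midpoints (a vector for stacked bars, a matrix for grouped bars) are handy if text or error bars need to be added above individual bars later.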
We can alter the colors of the bars by supplying a vector of color names; we might create a separate
vector or simply type the names into a col = instruction:
> barplot(bird, beside = TRUE, legend = TRUE, col = c('black', 'pink', 'lightblue', 'tan', 'red', 'brown'))
If we would rather have the row categories as the main bars, split by column, we need to rotate or transpose
the matrix of data. We can use the t() command to do this like so:
> barplot(t(bird), beside = TRUE, legend = TRUE, cex.names = 0.8, col = c('black', 'pink', 'lightblue', 'tan', 'red',
'brown'))
> title(ylab = 'Bird Count', xlab = 'Bird Species')
Horizontal Bars
We can make the bars horizontal rather than the default vertical using the horiz = TRUE instruction.
We can use all the regular instructions that we met previously on horizontal bar charts as well, for example:
> bccol = c('black', 'pink', 'lightblue', 'tan', 'red', 'brown')
> barplot(bird, beside = TRUE, legend = TRUE, horiz = TRUE, xlim = c(0, 60), col = bccol)
> title(ylab = 'Habitat', xlab = 'Bird count')
COPY GRAPHICS TO OTHER APPLICATIONS
The JPEG option gives the opportunity to select from one of several compressions. The TIFF option produces
the largest files because no compression is used. The PNG option is useful because the PNG file format is
widely used and file sizes are quite small.
We can also use commands typed from the keyboard to save graphics files to disk, and we can go about this in
several ways. The simplest is via the dev.copy() command.
This command copies the contents of the graphics window to a file; we designate the type of file and the
filename.
To finish the process we type the dev.off() command. In the following example the graphics window is saved
using the png option:
> dev.copy(png, file = 'R graphic test.png')
png:R graphic test.png
                     3
> dev.off()
windows
      2
Macintosh
The Macintosh GUI allows graphics to be saved as PDF files. PDF is handled easily by the Mac and is a good
option because PDF graphics can be easily rescaled. To save the graphics window, click the window to select it
and then choose Save or Save As from the File menu. If we want to save the graphic in another format, we need
to use the dev.copy() and dev.off() commands.
When the graphics window is saved as a PDF file with these commands, the filename must be specified
explicitly; the default location for saved files is the current working directory.
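As an alternative to copying an on-screen window with dev.copy(), a plot can be written straight to a PDF file by opening the pdf() device, drawing, and closing the device with dev.off(). This swaps in a different technique from the one described above, chosen because it also works when no graphics window is open. The file name and the temporary folder are hypothetical choices for this sketch; the width and height are given in inches.

```r
rain <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
names(rain) <- month.abb   # built-in "Jan" ... "Dec"

# Hypothetical output location: a file in R's temporary folder.
# A bare file name would be written to the current working directory.
out_file <- file.path(tempdir(), "rain_barplot.pdf")

pdf(file = out_file, width = 7, height = 5)   # open the PDF device
barplot(rain, xlab = "Month", ylab = "Rainfall cm", ylim = c(0, 10))
dev.off()                                     # close the device to finish the file

file.exists(out_file)
```

The file is only complete once dev.off() has been called, just as with dev.copy().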
In the following example, the graphics window is saved as a PNG file. The file is saved to the default working
directory; if we want it to go somewhere else, we need to specify the path in full as part of the filename.
> dev.copy(png, file = 'R graphic test.png')
png
  3
> dev.off()
X11cairo
       2
The dev.copy() command requires us to specify the type of graphic file required and the filename. We can also
specify other options to alter the size of the final image. The most basic options are summarized in the following table: