notes

Download as pdf or txt
Download as pdf or txt
You are on page 1of 17

Introduction to coding

HTML Version: https://stirlingcodingclub.github.io/getting_started

Brad Duthie

08 NOV 2023

The purpose of this introduction


The purpose here is to get readers past the initial learning curve of coding as quickly as possible. If you
want to start coding for yourself, particularly in R for data analysis, but are not sure how, then read on. By
the end of these notes, you should be able to navigate through the basic graphical user interface of Rstudio
and write some basic lines of code. The goal is not to develop proficiency in coding or R yet, but to help you
get to the point at which it is possible to write and run code, make coding mistakes, and learn from other
researcher’s code.

Why use the R programming language?


The computer programming language R is a powerful and very widely-used tool among scientists for analysing
data. You can use it to analyse and plot data, run computer simulations, or even write slides, papers, or
books. The R programming language is completely free and open source, as is the popular Rstudio software
for using it. It specialises in statistical computing, which is part of the reason for its popularity among
scientists.
Another reason for its popularity is its versatility, and the ease with which new techniques can be shared.
Imagine that you develop a new method for analysing data. If you want other researchers to be able to use
your method in their research, then you could write your own software from scratch for them to install and
use. But doing this would be very time consuming, and a lot of that time would likely be spent writing the
graphical user interface and making sure that your program worked across platforms (e.g., on Windows and
Mac). Worse, once written, there would be no easy way to make your program work with other statistical
software should you need to integrate different analyses or visualisation tools (e.g., plotting data). To avoid
all of this, you could instead just present your new method for data analysis and let other researchers write
their own code for implementing it. But not all researchers will have the time or expertise to do this.
Instead, R allows researchers to write new tools for data analysis using simple coding scripts. These scripts
are organised into R packages, which can be uploaded by authors to the Comprehensive R Archive Network
(CRAN), then downloaded by users with a single command in R. This way, there is no need for completely
different software to be used for different analyses – all analyses can be written and run in R.
The downside to all of this is that learning R can be a bit daunting at first. Running analyses is not
done by pointing and clicking on icons as in Excel, SigmaPlot, or JMP. You need to use code. Here we will
start with the very basics and work our way up to some simple data analyses.

1
Getting started in R, and the basics
Installation. The first thing to do is download Rstudio if you have not already (but see below if you’re
eager to get started and want to skip this step). Note that R and Rstudio are not the same thing; R is a
language for scientific computing, and can be used outside of Rstudio. Rstudio is a very useful tool for coding
in the R language. As a very loose analogy, R is like a written language (e.g., English, Spanish) that can be
used to write inside Rstudio (e.g., a word processor such as Microsoft Word, LibreOffice). Look carefully at
the version of Rstudio that you download; different installers exist for Windows, Mac, and Linux. The most
recent version of Rstudio requires a 64-bit operating system. Unless your computer is quite old (say, over
seven years), you most likely have a 64-bit operating system rather than a 32-bit operating system, but if
you are uncertain, then it is best to check.
Bypassing installation with Rstudio Cloud. If you do not want to install R or Rstudio, or are having
trouble doing so but want to get started in R quickly, then an alternative is to use R through the Rstudio
cloud (https://rstudio.cloud). The Rstudio cloud allows you to run R right from your browser, and you can
sign up for free. You can watch this five minute video to see how to sign up and get started.
Running Rstudio. When you first run Rstudio, you will see several windows open. It will look something
like the below, except probably with a standard black on white theme (if you want, you can change this by
selecting from the toolbar ‘Tools > Global Options. . . ’, then selecting the ‘Appearance’ tab on the left).

This might look a bit intimidating at first. Unlike Microsoft Excel, SigmaPlot, or JMP, there is no spreadsheet
that opens up for you. You interact with R mostly by typing lines of commands rather than using a mouse
to point and click on different options. Eventually, this will feel liberating, but at first it will probably feel
overwhelming. First, let us look at all of the four panes in the Figure above. Your panes might be organised
a bit differently, but the important ones to start out with are the ‘Source’ and the ‘Console’. These are
shown in the right hand panes in the above Figure.
To make sure that the Source pane is available to you, open an new R script by selecting from the toolbar
‘File > New File > Rscript’ (shortcut: Shift+Ctrl+N). You should see a new Rscript open up that looks

2
something like the below (again, the colour scheme might differ).

Think of this Source file like a Word document that you have just opened up – completely blank and ready
for typing new lines of commands to read in data and run analyses. We will come back to this Source file,
but for now just know that the Source file stores commands that we want R to intepret and use. The Source
file does this by sending commands to the R console, which we will look at now.
The R console should be located somewhere in Rstudio (I like to keep it directly underneath my R Source
files). You can identify it by finding the standard R information printed off, which should look something
like the below.

R version 3.6.1 (2019-07-05) -- "Action of the Toes"


Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.


You are welcome to redistribute it under certain conditions.
Type ’license()’ or ’licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.


Type ’contributors()’ for more information and
’citation()’ on how to cite R or R packages in publications.

Type ’demo()’ for some demos, ’help()’ for on-line help, or


’help.start()’ for an HTML browser interface to help.
Type ’q()’ to quit R.

>

The console is where all of the code is run. To get started, you can actually ignore everything else and just
focus on this pane. If you click to the right of the greater than sign > at the bottom, then you can start
using R right in the console. To get a feel for running code in the console, you can use the R console as a
standard calculator. Try typing something like the below to the right of the >, then hit ‘Return’ on your
keyboard (note, all of my semi-colons are optional – you do not actually need to put them at the end of each
line).

2 + 5;

## [1] 7

Now try some other common mathematical operations, one line at a time.

4 * 4;

## [1] 16

3
12 - 3;

## [1] 9

5ˆ2;

## [1] 25

Notice that R does the calculation of each of the above mathematical operations and returns the correct
value on the line below. If you are familiar with using Microsoft Excel, this is the equivalent to typing [=
2 + 5], or [= 4 * 4], etc., into a cell of an Excel spreadsheet. You might also be familiar with spreadsheet
functions as well, such as the square root function, which you could use in Excel by typing, e.g., [= sqrt(25)]
into a spreadsheet cell. This works in the R console too; the functions actually have the same syntax, so you
could type the below into the console and hit ‘Enter’.

sqrt(256);

## [1] 16

The console returns the correct answer 16. Similar functions exist for logarithms (log) and trigonmetric
functions (e.g., sin, cos), as they do in Microsoft Excel. But this is just the beginning in R. Functions can
be used to do any number of tasks. Some of these functions are built into the base R language; others are
written by researchers and distributed in R packages, but you can also learn to write your own R functions
to do any number of customised tasks. You will need to use functions in nearly every line of code you write
(technically, even the +, -, etc., are also functions), so it is good to know the basics of how to use them.
Most functions are called with open and closed parentheses, as in the sqrt(256) above. The sqrt is the
function, while the 256 is a function argument. An argument is a specific input to a function, and functions
can take any number of arguments. For the sqrt function, only one argument is needed, but many arguments
will have more than one argument. For example, if we want to take the logarithm of some number using the
log function, we might need to specify the base. In this case, we clarify the number for which we want to
compute the log (x) and the logarithm base (base). Let us say that we want to compute the logarithm of
256 in base 10.

log(x = 256, base = 10);

## [1] 2.40824

Note that some arguments are required, while some are optional. In the case of log, the first argument is
required, but the base is actually optional. If we do not specify a base, then the function simply defaults
to calculating the natural logarithm (i.e., base e). Hence, the below also works (note that we get a different
answer because the bases differ).

log(x = 256);

## [1] 5.545177

In fact, we do not even need to specify the x because only one argument is needed for the log function.
Hence, if only one argument is specified, the function just assumes that this argument is x. Try the below.

4
log(256);

## [1] 5.545177

Note that functions can be nested inside other functions, though this can get messy. For example, if you
wanted to get the logarithm of the logarithm of 256, then you could write the below.

log( log(256) );

## [1] 1.712929

Also note that functions do not need to be mathematical in R; they do not even need to operate on numbers.
One very useful function is the help function, which provides documentation for other R functions. If, for
example, we were not sure what log did, or what arguments it accepted, then we could run the code below.

help(topic = log);

Try running the above line of code in the R console. You should see a description of the log function, along
with some examples of how it is used and the two arguments (x and base) that it accepts. Anytime you
get stuck with a function, you should be able to use the help function for clarification. You can even use a
shortcut that returns the same as the help(topic = log); above.

?log;

We now have looked at three functions, sqrt, log, and help. If you have previous experience with Microsoft
Excel spreadsheets, you should now be able to make the conceptual connection between typing =sqrt(25)
into a spreadsheet cell and sqrt(25) into the R console. You should also recognise that R functions serve a
much broader set of purposes in R. Next, we will move onto assigning variables in the R console.

Assigning variables in the R console


In R, we can also assign values to variables using the characters <- to make an arrow. Say, for example, that
we wanted to make var_1 equal 10.

var_1 <- 10;

We can now use var_1 in the console.

var_1 * 5; # Multiplying the variable by 5

## [1] 50

Note that the correct value of 50 is returned because var_1 equals 10. Also note the comment left after the
# key. In R, anything that comes after # on a line is a comment that R ignores. Comments are ways of
explaining in plain words what the code is doing, or drawing attention to important notes about the code.
Note that we can assign multiple things to a single variable. Here is a vector of numbers created using the c
function, which combines multiple arguments into a vector or list. Below, we combine six numbers to form
a vector called vector_1.

5
vector_1 <- c(5, 1, 3, 5, 7, 11); # Six numbers

We can now print and perform operations on vector_1.

vector_1; # Prints out the vector

## [1] 5 1 3 5 7 11

vector_1 + 10; # Prints the vector elements plus ten

## [1] 15 11 13 15 17 21

vector_1 * 2; # Prints the vector elements times two

## [1] 10 2 6 10 14 22

vector_1 <- c(vector_1, 31, 100); # Appends the vector


vector_1;

## [1] 5 1 3 5 7 11 31 100

We can also assign lists, matrices, or other types of objects using the list function.

object_1 <- list(vector_1, 54, "string of words");


object_1;

## [[1]]
## [1] 5 1 3 5 7 11 31 100
##
## [[2]]
## [1] 54
##
## [[3]]
## [1] "string of words"

Play around a bit with R before moving on, and try to get confortable using the console. When you have
finished with the R console, continue reading to learn how to store lines of code using an R script.

Using R script to save and run code


Up until now, we have focused on running code directly into the console. This works, but if you want to
run multiple lines of code, or just save your code for later use, then you will need more than the console. R
scripts are plain text files with a ‘.R’ extension, which can be used to save R code. The R code itself is no
different than what we have already run into the console. For example, we could save an R file with all of
the code that we have read into the console up to this point; it would look like the below.

6
2 + 5;
4 * 4;
12 - 3;
5ˆ2;
sqrt(256);
log(x = 256, base = 10);
log(x = 256);
log(256);
log( log(256) );
help(topic = log);
?log;
var_1 <- 10;
var_1 * 5; # Multiplying the variable by 5
vector_1 <- c(5, 1, 3, 5, 7, 11); # Six numbers
vector_1; # Prints out the vector
vector_1 + 10; # Prints the vector elements plus ten
vector_1 * 2; # Prints the vector elements times two
vector_1 <- c(vector_1, 31, 100); # Appends the vector
vector_1;
object_1 <- list(vector_1, 54, "string of words");
object_1;

You can find and download the code above in the file sample_code.R on GitHub. If you wanted to redo all
of the calculations in this R script, you could open it and run each line one by one, or as a group. In the
figure below, I have read sample_code.R into R by first saving it to my computer, then from the toolbar
selecting ‘File > Open File. . . ’ and opening it from the saved location.

7
There are a few things to note in the Figure above. First, the script sample_code.R now sits above the
console; the position of the script might be different depending on your pane settings, but you should be able
to see it appear somewhere after opening the file. Second, note how different parts of the text in the R script
are coloured differently; this makes reading the text a bit easier. The variables, assignments, arguments,
and functions, all appear in white. Numbers are shown in pink, strings of words are in green, and comments
are in light blue (your colours might differ, but you should see some distinction among different types of
text). Third, note the ‘Run’ button in the upper right corner above the script. This allows you to run the
commands in the R script lines directly into the R console so that you do not have to retype them directly.
You can read the code from the Rscript into the console in multiple ways. The easiest is to simply click with
your mouse on whatever line you want to run, then click the ‘Run’ button. Try clicking anywhere on line 1,
for example, so that the cursor is blinking somewhere on the line. Then click ‘Run’; you should see > 2 + 5
appear in the R console, followed by the correct answer 7. After you have done this, R moves the cursor to
the next line in anticipation of you wanting to run line 2. If you want to run line 2, then you could just hit
‘Run’ again, and repeat for line 3, 4, etc. Give this a try.
If you do not want to go through all of the code line by line, you could instead highlight a block of code,
as I did above for lines 1-8. If you highlight these lines, then click ‘Run’, the R console will run every line

8
one after another, producing the output shown in the console of the Figure above. Try this as well to get a
feel for running multiple lines of code at once. You now know the basics of getting started with coding in
R. Next, we will move onto reading data into R for analysis, and doing a very simple correlation analysis on
the classic Bumpus data set (Bumpus 1898; Johnston, Niles, and Rohwer 1972). Very briefly, the Bumpus
data includes characteristics and morphological measurements from sparrows in North America following a
severe storm (the specifics are not important for our purposes).

Reading in data
Now we need to read in the Bumpus data. This is actually a challenging part because the data need to be in
a correct format and location to be read into R successfully. The format is best read in as a CSV file, though
other formats are also possible (TXT can work, or XLSX if you download and load the openxlsx R package
and use the read.xlsx function). For now, I will use a CSV file with the Bumpus data set. Reading CSV
files into R can be challenging at first, and I encourage you to first read in the example data set, then try
reading in your own data sets into R. You can read your own dat sets into R by saving them as CSV files
in Excel; as a general rule, it is good to avoid spaces in these files (replacing them, e.g., with an underscore
’_’). Also make sure that all rows and columns are filled in; any empty values can be replaced with an NA,
which R reads as unavailable data. See the Bumpus CSV file online to get an idea of what a data file looks
like.
Two common errors arise at this point, which can be sources of frustration for getting started. First, the
data might not be organised correctly for reading into R. Note that rows and columns should start in row
‘1’ and column ‘A’ of Excel (i.e., don’t leave empty rows and columns), and additional cells should not be
used outside of rows and columns (if, e.g., you have a value in cell M4, when the last column in your table
is K, then R will intepret this as column L being full of empty values). You should be fine if R includes
some number of completely filled in rows and columns, with nothing filled in outside. If you want to, you
can download the Bumpus CSV file from Dropbox and open it up in Excel for an example of what a good
CSV file looks like. Note that there are no empty cells inside the table, and no values outside the table. This
should therefore be read easily into R.
Second, you need to make sure that the file you are trying to read into R is located in the same place as
your current working directory. You can see what your current working directory (i.e., ‘folder’) is using the
command below.

getwd(); # No argument is needed here for the function

## [1] "/home/brad/Dropbox/projects/StirlingCodingClub/getting_started"

The above function returns the current working directory. If this is the same as the CSV file that you want
to read into R, then all is well. But if this is not the working directory where your CSV file is located, then
you need to find it. You could do this from the R console, but the easiest way is to go to the toolbar and go
to ‘Session > Set Working Directory > Choose Directory. . . ’ and find the location where your CSV file is
saved. The easiest way to do this is to save your data in the same place that you have saved your R script.
If you do this, then you can simply go to ‘Session > Set Working Directory > To Source File Location’, and
R will set the directory to the same file as your current R script. From there you can read in your CSV file
with the read.csv function in R. Note that the first row of the file ‘Bumpus_data.csv’ is a header, which
gives the column names, so we should specify the argument ‘header = TRUE’.

dat <- read.csv(file = "Bumpus_data.csv", header = TRUE);

If you get an error message, double-check that the file name and the working directory are correct (if there
is an error, this is the problem most of the time). Note that the everything in R is case sensitive. That

9
means that if a letter is capitalised in the file name, but you do not capitalise it in the file argument above,
then R will not recognise it. A lot of errors are caused by capitalisation issues in R.
Once you have succeeded in reading in a file without getting an error message, to make sure that everything
looks correct, you can type dat in the console to see all of the data print out. I will use the ‘head’ function
below to just print off the first six rows.

head(dat);

## sex surv totlen wingext wgt head humer femur tibio skull stern
## 1 male alive 154 241 24.5 31.2 0.687 0.668 1.022 0.587 0.830
## 2 male alive 160 252 26.9 30.8 0.736 0.709 1.180 0.602 0.841
## 3 male alive 155 243 26.9 30.6 0.733 0.704 1.151 0.602 0.846
## 4 male alive 154 245 24.3 31.7 0.741 0.688 1.146 0.584 0.839
## 5 male alive 156 247 24.1 31.5 0.715 0.706 1.129 0.575 0.821
## 6 male alive 161 253 26.5 31.8 0.780 0.743 1.144 0.607 0.893

If the data appear to be read into R correctly, then you can move on to working with the data and performing
analyses in R. Note that dat is a big table that is now read into R. While we do not necessarily see the entire
table at once, as we would in Excel, we can pull out any of the information in that we want. For example,
if we want to see how many rows and columns are in dat, we can use the following functions.

nrow(dat);

## [1] 136

ncol(dat);

## [1] 11

We could also just use the function dim to get the dimensions of dat (note that this would work for an array
of any number of dimensions).

dim(dat);

## [1] 136 11

So we know that our table dat, which contains the Bumpus data, includes 136 rows and 11 columns. Having
read this table into R successfully, we can now perform any number of statistical analyses on the contents.
The different ways to analyse theses data are beyond the scope of these notes, but there are a few useful
things to know. First, the row and columns in dat can be indexed using square brackets. If, for example,
we wanted to just look at the value of the fourth row and sixth column, we could type the following.

dat[4, 6]; # First row, second column

## [1] 31.7

The first position within the brackets is the row (4), and the second position is the column (6). Note that
R is not restricted to two dimensions; it is possible to have three or more dimensions of an array, in which
case we might refer to an array element as dat[x_dim, y_dim, z_dim] for a dat of three dimensions. Note
that we can also store any particular value in dat as a variable, if we want. We could, for example store the
above as dat_point_1 using the code below.

10
dat_point_1 <- dat[4, 6];

We could then use dat_point_1 in place of dat[4, 6]. We can also define entire rows or columns. For
example, if we wanted to return all of the values of row 4, then we could leaven the second index blank, as
below.

dat[4, ]; # Note the empty space where a column was previously

## sex surv totlen wingext wgt head humer femur tibio skull stern
## 4 male alive 154 245 24.3 31.7 0.741 0.688 1.146 0.584 0.839

In the Bumpus data set, this gives us all the information of measurements for sparrow number 4. We can
do the same for columns. Note that column 5 holds the mass of each sparrow (in grams). We could look at
all of the sparrow masses using the code below.

dat[, 5]; # Note the empty space is now where a row used to be.

## [1] 24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 26.2 24.8 25.4 23.7 25.7
## [16] 25.7 26.5 26.7 23.9 24.7 28.0 27.9 25.9 25.7 26.6 23.2 25.7 26.3 24.3 26.7
## [31] 24.9 23.8 25.6 27.0 24.7 26.5 26.1 25.6 25.9 25.5 27.6 25.8 24.9 26.0 26.5
## [46] 26.0 27.1 25.1 26.0 25.6 25.0 24.6 25.0 26.0 28.3 24.6 27.5 31.0 28.3 24.6
## [61] 25.5 24.8 26.3 24.4 23.3 26.7 26.4 26.9 24.3 27.0 26.8 24.9 26.1 26.6 23.3
## [76] 24.2 26.8 23.5 26.9 28.6 24.7 27.3 25.7 29.0 25.0 27.5 26.0 25.3 22.6 25.1
## [91] 23.2 24.4 25.1 24.6 24.0 24.2 24.9 24.1 24.0 26.0 24.9 25.5 23.4 25.9 24.2
## [106] 24.2 27.4 24.0 26.3 25.8 26.0 23.2 26.5 24.2 26.9 27.7 23.9 26.1 24.6 23.6
## [121] 26.0 25.0 24.8 22.8 24.8 24.6 30.5 24.8 23.9 24.7 26.9 22.6 26.1 24.8 26.2
## [136] 26.1

Note that this now returns the masses of all 136 sparrows. Since our table has headers, and the header for
column 5 is wgt, we could also use the code below.

dat$wgt; # R sees the column header and returns column 5; same as above

## [1] 24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 26.2 24.8 25.4 23.7 25.7
## [16] 25.7 26.5 26.7 23.9 24.7 28.0 27.9 25.9 25.7 26.6 23.2 25.7 26.3 24.3 26.7
## [31] 24.9 23.8 25.6 27.0 24.7 26.5 26.1 25.6 25.9 25.5 27.6 25.8 24.9 26.0 26.5
## [46] 26.0 27.1 25.1 26.0 25.6 25.0 24.6 25.0 26.0 28.3 24.6 27.5 31.0 28.3 24.6
## [61] 25.5 24.8 26.3 24.4 23.3 26.7 26.4 26.9 24.3 27.0 26.8 24.9 26.1 26.6 23.3
## [76] 24.2 26.8 23.5 26.9 28.6 24.7 27.3 25.7 29.0 25.0 27.5 26.0 25.3 22.6 25.1
## [91] 23.2 24.4 25.1 24.6 24.0 24.2 24.9 24.1 24.0 26.0 24.9 25.5 23.4 25.9 24.2
## [106] 24.2 27.4 24.0 26.3 25.8 26.0 23.2 26.5 24.2 26.9 27.7 23.9 26.1 24.6 23.6
## [121] 26.0 25.0 24.8 22.8 24.8 24.6 30.5 24.8 23.9 24.7 26.9 22.6 26.1 24.8 26.2
## [136] 26.1

Even better, we can plot a histogram of sparrow weights using the built-in function hist in R.

hist(x = dat$wgt);

11
Histogram of dat$wgt
40
30
Frequency

20
10
0

22 24 26 28 30

dat$wgt
This looks a bit rubbish; the main title is unnecessary, and the axis labels are not terribly informative. We
can tweak the axis labels and colours using the following arguments.

hist(x = dat$wgt, main = "", xlab = "Sparrow weight (g)", ylab = "Frequency",
cex.lab = 1.25, cex.axis = 1.25, col = "grey");
40
30
Frequency
20
10
0

22 24 26 28 30
Sparrow weight (g)

12
Some basic analyses
Since the R programming language is developed primarily for scientific analysis and statistical computing,
there are several built-in functions for doing simple analyses. More complex analyses that are not possible
with built-in functions can be performed by downloading R packages (this can be done from the Compre-
hensive R Archive Network (CRAN) using the funciton install.packages, or by looking at the packages
tab in Rstudio). If you have heard of an analysis, then it is probably available within an R package. For
now, let us just do some basic statistical analyses. For example, let us say that we want to summarise the
data on sparrow body mass. We can do this with the summary function in R.

summary(dat$wgt);

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 22.60 24.57 25.55 25.52 26.50 31.00

The summary function returns six numbers, including the minimum value, first quartile, median, mean, third
quartile, and maximum value of numbers in the data, in this case a column of sparrow body masses. Now
say that we want to see if sparrow body mass is correlated with sparrow body length (dat$totlen in our
data frame). We could first look at a scatter plot of body mass versus length.

plot(x = dat$wgt, y = dat$totlen, xlab = "Sparrow body mass (g)",


ylab = "Sparrow body length (mm)", cex.lab = 1.25, cex.axis = 1.25,
pch = 20); # Note: cex.lab, cex.axis, and pch are purely cosmetic
Sparrow body length (mm)
165
160
155

24 26 28 30
Sparrow body mass (g)
It clearly looks like there is a correlation between the two variables of interest. We can find out what this
correlation is below.

cor(dat$wgt, dat$totlen);

## [1] 0.5838648

13
Hence, the correlation between sparrow body mass and sparrow total body length is 0.5838648. We can even
test to see if this correlation is significant using the cor.test function.

cor.test(dat$wgt, dat$totlen);

##
## Pearson’s product-moment correlation
##
## data: dat$wgt and dat$totlen
## t = 8.3251, df = 134, p-value = 8.612e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4608233 0.6848847
## sample estimates:
## cor
## 0.5838648

As you can see above, this returns the correlation, along with a t-statistic, degrees of freedom, p-value, and
confidence interval for the Pearson product-moment correlation between the two variables of interest. If we
wanted to instead run a simple linear regression of body total length against body mass, we could use the
lm function as below.

lm(dat$totlen ~ dat$wgt);

##
## Call:
## lm(formula = dat$totlen ~ dat$wgt)
##
## Coefficients:
## (Intercept) dat$wgt
## 123.571 1.409

Note that the above function returns the intercept and slope, but not any results from statistical null
hypothesis tests. To do this, we need to wrap lm in the function summary, as below (the R function summary
can tell the difference between a vector of numbers and a model, and handles each differently).

summary( lm(dat$totlen ~ dat$wgt) );

##
## Call:
## lm(formula = dat$totlen ~ dat$wgt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4819 -2.1078 -0.2135 2.1958 6.3232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.5713 4.3282 28.550 < 2e-16 ***
## dat$wgt 1.4093 0.1693 8.325 8.61e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

14
##
## Residual standard error: 2.902 on 134 degrees of freedom
## Multiple R-squared: 0.3409, Adjusted R-squared: 0.336
## F-statistic: 69.31 on 1 and 134 DF, p-value: 8.612e-14

Now we get a bit more information, including significance tests for our intercept (Intercept) and slope
dat$wgt. The linear model function (lm), and the generalised linear model function (glm) are very flexible,
and can be used for a variety of purposes that I will not elaborate on here, except for one more example.
Let us look at the categorical variable of sparrow sex, and test whether or not bird mass difference between
sexes. To do this, we could use a simple t-test with the t.test function below. For illustrative purposes, I
have set the argument var.equal equal to TRUE to assume equal variances (had I not done this, the default
would have been var.equal = FALSE).

t.test(dat$wgt ~ dat$sex, var.equal = TRUE);

##
## Two Sample t-test
##
## data: dat$wgt by dat$sex
## t = -3.0333, df = 134, p-value = 0.002906
## alternative hypothesis: true difference in means between group female and group male is not equal to
## 95 percent confidence interval:
## -1.2820247 -0.2700279
## sample estimates:
## mean in group female mean in group male
## 25.02857 25.80460

Since a t-test is equivalent to a linear model with two categorical response variables, we could also use the
lm function to do the same null hypothesis test.

summary( lm(dat$wgt ~ dat$sex) );

##
## Call:
## lm(formula = dat$wgt ~ dat$sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6046 -1.0286 -0.1046 0.9144 5.4714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.0286 0.2046 122.316 < 2e-16 ***
## dat$sexmale 0.7760 0.2558 3.033 0.00291 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.432 on 134 degrees of freedom
## Multiple R-squared: 0.06425, Adjusted R-squared: 0.05727
## F-statistic: 9.201 on 1 and 134 DF, p-value: 0.002906

Or we could use aov to run an (again, equivalent) analysis of variance (ANOVA).

15
summary( aov(dat$wgt ~ dat$sex) );

## Df Sum Sq Mean Sq F value Pr(>F)


## dat$sex 1 18.88 18.877 9.201 0.00291 **
## Residuals 134 274.92 2.052
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Note that the p-value is identical in each of the above three cases because the underlying mathematics is the
same. The point here is not to demonstrate all of the different statistics that can be performed within R.
Because R is a programming language, the possibilities are limitless; in the unlikely chance that no one has
written code for a particular statistical analysis that you need to perform, you could write your own code to
do the analysis yourself.

Practicing with the Bumpus data set, or your own


If you want to continue practicing working in R, then you can download the Bumpus data set here (right
click on ‘Bumpus_data.csv’ and select ‘Save link as. . . ’) or here (select ‘Open with’ in the upper right to
download). Save the file somewhere on your computer, then make sure that you set R to the same working
directory as the saved file by going to the toolbar of Rstudio and selecting ‘Session > Set Working Directory
> To Source File Location’ (note, you could also do this from the R console using the setwd function). You
should now be able to read these data into R and start doing some analyses. I have recreated the above
analysis with the Bumpus data set into an Rscript. You can download it here, or recreate it yourself by
copying and pasting the code from the appendix below into a new Rscript.

Appendix: R script file

# First we need to read in the Bumpus data below


# Make sure that R is set to the same working directory as Bumpus_data.csv
# Also make sure that the filenames match exact (including capitalisation)
dat <- read.csv(file = "Bumpus_data.csv", header = TRUE);
# Now the whole data table is saved as the variable 'dat'
# You can save it as something different if you want, but avoid spaces

# Let's take a look at the first six rows of the data


head(x = dat); # Note that the 'x = ' is not necessary. 'head(dat)' works fine

# How many rows and columns are in the data?


dim(x = dat);

dat[4, 6]; # First row, second column?

# The below produces a histogram of sparrow mass


# Note that I break the line below mid-function after specifying 'ylab'. This
# isn't required, but it often makes code more readable to break to a new line
# when the line exceeds 80 characters.
hist(x = dat$wgt, main = "", xlab = "Sparrow weight (g)", ylab = "Frequency",
cex.lab = 1.25, cex.axis = 1.25, col = "grey");

16
# Let's get a summary of just the sparrow mass
summary(object = dat$wgt);

# Now let's plot sparrow totel length against sparrow mass


plot(x = dat$wgt, y = dat$totlen, xlab = "Sparrow body mass (g)",
ylab = "Sparrow body length (mm)", cex.lab = 1.25, cex.axis = 1.25,
pch = 20); # Note: cex.lab, cex.axis, and pch are purely cosmetic

# Test the correlatoin between total length and mass


cor.test(x = dat$wgt, y = dat$totlen);

# Make a linear model of total length regressed against body mass


our_model <- lm(formula = dat$totlen ~ dat$wgt);

# Now summarise 'our_model' that we saved above


summary(object = our_model);

# Now test if bird mass differs by sex using a t-test


t.test(formula = dat$wgt ~ dat$sex, var.equal = TRUE);

# We could also do a linear model get the same results


our_lm <- lm(formula = dat$wgt ~ dat$sex)
summary(object = our_lm);

# And we can do the same with an ANOVA


our_aov <- aov(dat$wgt ~ dat$sex)
summary(our_aov);

References
Bumpus, Hermon C. 1898. “Eleventh lecture. The elimination of the unfit as illustrated by the introduced
sparrow, Passer domesticus. (A fourth contribution to the study of variation.).” Biological Lectures:
Woods Hole Marine Biological Laboratory, 209–25.
Johnston, R F, D M Niles, and S A Rohwer. 1972. “Hermon Bumpus and natural selection in the House
Sparrow Passer domesticus .” Evolution 26: 20–31.

17

You might also like