YaRrr Book
YaRrr Book
YaRrr Book
YaRrr!
YA R R R ! T H E P I R AT E S
GUIDE TO R
Copyright 2016 Dr. Nathaniel D. Phillips
published by
http://www.thepiratesguidetor.com
This document may not be used for any commercial purposes. All rights are reserved by Nathaniel
Phillips.
First printing,
Contents
Introduction 11
2: R Basics 29
The basics of R programming 29
A brief style guide: Commenting and spacing 32
Creating new objects with <- 34
Test your R might! 38
4: Vector functions 49
Arithmetic operations on vectors 49
Summary statistic functions for numeric vectors 51
Counting functions for discrete and non-numeric data 54
Test your R Might! 58
Appendix 241
Index 245
9
with LaTeX.
Introduction
Who am I?
My name is Nathaniel. I am a psychologist with a background in
statistics and judgment and decision making. You can find my R (and
non-R) related musings at www.nathanieldphillips.com. As you will
learn in a minute, I didnt write this book, but I did translate it from
pirate-speak to English.
While this book was originally written for pirates, I think that anyone
who wants to learn R can benefit from this book. If you havent had
an introductory course in statistics, some of the later statistical con-
cepts may be difficult, but Ill try my best to add brief descriptions of
new topics when necessary. Likewise, if R is your first programming
language, youll likely find the first few chapters quite challenging
as you learn the basics of programming. However, if R is your first
programming language, thats totally fine as what you learn here
will help you in learning other languages as well (if you choose to).
Finally,while the techniques in this book apply to most data analysis
problems, because my background is in experimental psychology I
will cater the course to solving analysis problems commonly faced in
psychological research.
Why is R so great?
As youve already gotten this book, you probably already have some
idea why R is so great. However, in order to help prevent you from
giving up the first time you run into a programming wall, let me give
you a few more reasons:
completely free. This doesnt just help your wallet - it means that
a huge community of R programmers will constantly develop an
distribute new R functionality and packages at a speed that leaves
all those other packages in the dust! Unlike Fight Club, the first
rule of R is "Do talk about R!" The size of the R programming
community is staggering. If you ever have a question about how to
implement something in R, a quick Poogle1 search will lead you to 1
I am in the process of creating Poogle
- Google for Pirates. Kickstarter page
your answer virtually every single time.
coming soon...
2. R is the future. To illustrate this, look at the following three figures.
These are Google trend searches for three terms: R Programming,
Matlab, and SPSS. Try and guess which one is which.
able to show you the exact code they used. Of course, this doesnt
library("circlize")
mean that they used the appropriate analysis or interpreted it mat = matrix(sample(1:100, 18, replace = TRUE), 3, 6)
rownames(mat) = letters[1:3]
colnames(mat) = LETTERS[1:6]
correctly, but with all the original code, any problems should be chordDiagram(mat)
completely transparent!
C D
40
0
12 E
choice for pirates. 80
B
80
12
0
40
160
0
0
80
Additional Tips
40
F
A
40
80
0
0200
Because this is a beginners book, I try to avoid bombarding you with
40
160
too many details and tips when I first introduce a function. For this
80
120
0
c
12
reason, at the end of every chapter, I present several tips and tricks
80
a
0
16
40
that I think most R users should know once they get comfortable 0
20
0
160
0
24
0
20
0
120 40
80
tips as I expect youll find at least some of them very useful if not Figure 2: This is a chordDiagram
invaluable. plot that comes with the R package
circlize. How cool is this?! Have fun
trying to make this with SPSS.
1: Getting Started (and why R is like a relationship)
Yes, R is very much like a relationship. Like relationships, there are I love you R!
1. There is nothing more frustrating than when your code does Figure 3: Yep, R will become both your
not work best friend and your worst nightmare.
The bad times will make the good times
oh so much sweeter.
2. There is nothing more satisfying than when your code does
work!
First things first, lets download both Base R and Rstudio. Of course,
they are totally free and open source.
Download Base R
Windows: http://cran.r-project.org/bin/windows/base/
Mac: http://cran.r-project.org/bin/macosx/
Download RStudio
Figure 5: Here is how the standard R
application looks. Not too exciting - just
how we like it!
While you can do pretty much everything you want within base
R, youll find that most people these days do their R programming in
an application called RStudio. RStudio is a graphical user interface
(GUI)-like interface for R that makes programming in R a bit easier.
In fact, once youve installed RStudio, youll likely never need to
open the base R application again. To download and install RStudio
(around 40mb), go to one of the links above and follow the instruc-
tions.
Lets go ahead and boot up RStudio and see how she looks!
Code Chunks
In this book, R code is (almost) always presented in a separate gray
box like this one:
a <- 1 + 2 + 3 + 4 + 5
a
## [1] 15
1: getting started (and why r is like a relationship) 17
1+1
## [1] 2
Try calculating 1+1 by typing the code directly into the console -
then press Enter. You should see the result [1] 2. Dont worry about
the [1] for now, well get to that later. For now, were happy if we
just see the 2. Then, type the same code into the Source, and then
send the code to the Console by highlighting the code and clicking
the Run" button on the top right hand corner of the Source window.
Alternatively, you can use the hot-key "Command + Return" on Mac
or "Control + Enter" on Windows. Tip: Try to write most of your code in
So as you can see, you can execute code either by running it from a document in the Source. Only type
directly into the Console to de-bug or
the Source or by typing it directly into the Console. However, 99% do quick analyses.
1: getting started (and why r is like a relationship) 19
most of the time, you should be using the Source rather than the
Console. The reason for this is straightforward: If you type code into
the console, it wont be saved (though you can look back on your
command History). And if you make a mistake in typing code into
the console, youd have to re-type everything all over again. Instead,
its better to write all your code in the Source. When you are ready to
execute some code, you can then send "Run" it to the console.
Environment / History
The Environment tab of this panel shows you the names of all the
data objects (like vectors, matrices, and dataframes) that youve
defined in your current R session. You can also see information like
the number of observations and rows in data objects. The tab also
has a few clickable actions like Import Dataset" which will open a
graphical user interface (GUI) for important data into R. However, I
almost never look at this menu.
The History tab of this panel simply shows you a history of all
the code youve previously evaluated in the Console. To be honest,
I never look at this. In fact, I didnt realize it was even there until I
started writing this tutorial.
As you get more comfortable with R, you might find the Envi-
ronment / History panel useful. But for now you can just ignore
it. If you want to declutter your screen, you can even just minimize
the window by clicking the minimize button on the top right of the
panel.
1. Files - The files panel gives you access to the file directory on your
harddrive. One nice feature of the "Files" panel is that you can use
it to set your working directory - once you navigate to a folder
you want to read and save files to, click "More" and then "Set As
Working Directory." Well talk about working directories in more
detail soon.
2. Plots - The Plots panel (no big surprise), shows all your plots.
There are buttons for opening the plot in a separate window and
exporting the plot as a pdf or jpeg (though you can also do this
with code using the pdf() or jpeg() functions.) Most - if not all - of the time when
Lets see how plots are displayed in the Plots panel. Run the you perform actions using your mouse
by pointing and clicking in RStudio,
following code to display a histogram of 100 values randomly RStudio will perform the function by
drawn from a standard normal distribution. When you do, you sending the appropriate R Code to
the console. You can then copy and
paste this code into your documents to
automate the process later.
20 yarrr! the pirates guide to r
should see a plot similar to this one show up in the Plots panel.
Dont worry if your plot looks slightly different from this one: as
youll learn later, the rnorm() function generates different data
each time you evaluate it!
Plunder in 2015
15
Frequency
10
5
0
Amount
?hist
When you download and install R for the first time, you are installing
the Base R software. Base R will will contain most if not all the
functions you need. However, one of the great things about R is that
people are constantly writing and sharing new functions that you
can use. When people share a new function, they usually do so in the
form of an R package which contains anything from functions, to help
menus, to vignettes (examples), to data.
Most R packages are hosted at the Comprehensive R Archive Net-
work (CRAN) https://cran.r-project.org/. To install a new R package
from CRAN, you can simply run the code install.packages("package"),
where "package" is the name of the package. After youve installed
the package, you need to load it into R by running the code library("package").
This will load the package into your current R session and allow you
to use its contents. Once youve installed a package on
To see how this works in action, lets install and load the wordcloud your computer, you never need to
install it again. However, you do need
package. This package contains the wordcloud function which allows to load the package every time you start
you to easily create those really cool wordcloud plots youve seen on a new R session.
the Internets.
Well start by installing the package. When you run the following
code, R will download the package from CRAN. If everything works,
you should see some information about where the package is being
downloaded from, in addition to a progress bar.
install.packages("wordcloud")
Now that the package is installed on your computer, you can use it
anytime you want by loading the package:
library("wordcloud")
Monkey.Island
cutlass
Jolly.Roger
Blackbeard
Grogg
Treasure
For much of this book, you will need the yarrr package. This pack-
age contains every dataset, function, and plotting code from this
book. Unlike most packages that youll be using, the yarrr pack-
age is not hosted at CRAN. Instead, all the code is on github2 at 2
For those of you unfamiliar with
www.github.com/ndphillips/yarrr. To install the yarrr package on github, its basically Facebook for
programmers.
your machine, youll first need to install the devtools package which
will then allow you to directly install packages from github:
Now that youve installed and loaded the devtools package, you
can install and load the yarrr package from github with the follow-
ing code:
To see the dataset, you can execute the the View() function
View(pirates)
When you run this command, you should see the first several rows
and columns of the dataset (like this:)
Over the course of this book, you will be learning lots of new func-
tions. Wouldnt it be nice if someone created a Cheatsheet / Notecard
of many common R functions? Yes it would, and thankfully Tom
Short has done this in his creation of the R Reference Card. I highly
encourage you to print this out and start highlighting functions as
you learn them! .
You can access the pdf of this card in two ways. You can get it on
the web at http://nathanieldphillips.com/wp-content/uploads/2015/09/R-
Reference-Card.pdf. Alternatively, you can get it from the yarrr
package. To find where the file is located on your computer, run the
24 yarrr! the pirates guide to r
following code (the output after the code is the file location on my
computer):
system.file("RReferenceCard.pdf", package="yarrr")
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/yarrr/RReferenceCard.pdf"
Finished!
Thats it for this lecture! All you did was install the most powerful
statistical package on the planet used by top universities and compa-
nies like Google. No big deal.
1.5: Jump off the plank and dive in
Whats the first exercise on the first day of pirate swimming lessons? Figure 8: Despite what you might find
While it would be cute if they all had little inflatible pirate ships to at family friendly waterparks this is
NOT how real pirate swimming lessons
swim around in unfortunately this is not the case. Instead, those look.
baby pirates take a walk off their baby planks so they can get a
taste of what theyre in for. Turns out, learning R is the same way.
Lets jump in. In this chapter, youll see how easy it is to calculate
basic statistics and create plots in R. Dont worry if the code youre
running doesnt make immediate sense just marvel at how easy it is
to do this in R!
Movies
If you havent downloaded the
In this section, well analyze a dataset of the top 5,000 grossing yarrr package yet, run the following
movies of all time. This dataset is called movies and is contained in commands:
library(yarrr)
Next, well look at the help menu for the movies dataset using the
question mark ?
?movies
First, lets take a look at the first few rows of the dataset using the
head() function. This will give you a visual idea of how the dataset is
structured.
head(movies)
Now lets calculate some basic statistics on the entire dataset. Well
calculate the mean revenue, highest running time, and number of
movies in each genre3 . 3
I got a mean revenue of 98.21, a
maximum running time of 240, and
a table which showed, among other
things, that there were 692 Action
movies, 486 Adventure movies...
26 yarrr! the pirates guide to r
Cool stuff, now lets make a plot! Well plot the relationship be- My first scatterplot of movie data!
tween movie budgets and revenue. I would hope that the higher a
2500
movies budget is, the more money it should make. Lets find out if
2000
this is the case. Well also include a blue regression line to measure
1500
the strength of the relationship.
Revenue
1000
plot(x = movies$budget, 500
y = movies$revenue.all,
main = 'My first scatterplot of movie data!',
0
xlab = 'Budget',
0 100 200 300 400
Budget
ylab = 'Revenue'
)
abline(lm(revenue.all ~ budget,
data = movies),
col = 'blue'
)
Now that the packages is installed and loaded, were good to go!
Lets do a Bayesian version of the t-test comparing the budgets of
movie sequels to non-sequels8 : 8
I got a Bayes Factor of 5.0869775 1038
which is pretty darn strong evidence
against the null hypothesis.
ttestBF(formula = budget ~ sequel,
data = movies[is.finite(movies$budget) &
is.finite(movies$sequel),])
Now, lets test the correlation between a movies budget and its
revenue with a Bayesian regression analysis9 : 9
I got a Bayes Factor of (Infinite!)
which is such strong evidence against
the null hypothesis that the time and
lmBF(formula = revenue.all ~ budget, space continuum is on the verge of
data = movies[is.finite(movies$revenue.all) & collapsing.
is.finite(movies$budget),])
really learned how R works yet, Id bet my beard that you could Bayesian statistics
dont worry if some or all of the previous code didnt make sense.
Soon...it will all be clear.
Now that youve gotten wet, lets learn how to swim.
2: R Basics
## [1] 23.83333
As you can see, the result of the mean() function, with the tattoos
vector object as an input, returned a single number (a scaler object)
with a value of 23.8333333.
Different functions require different types of objects as inputs. For
example, the mean() function has to have a numeric object (typically
a vector) as an input. If you try to use a non-numeric object as an
argument to mean(), youll get an error because the function wasnt
designed to work with that type of argument.
As you learn R, youll spend a lot of time learning new types
of objects (and their attributes) and new functions. To see a list
of the most commonly used built-in functions and procedures in
R, check out the great R Reference Card written by Tom Short at
http://nathanieldphillips.com/wp-content/uploads/2015/09/R-
Reference-Card.pdf. I highly encourage you to print the R Reference
Card out and keep it handy while youre learning new functions.
prompting you to feed it with some R code. The fastest way to have
R evaluate code is to type your R code directly into the command-
line interpreter. For example, if you type 1+1 into the interpreter and
hit enter youll see the following
1+1
## [1] 2
1. Copy the code from the Editor (or anywhere that has valid R code),
and paste it into the Console (using CommandV).
2. Highlight the code you want to run (with your mouse or by hold-
ing Shift), then use the CommandReturn shortcut (see Figure 12).
32 yarrr! the pirates guide to r
3. Place the cursor on a single line you want to run, then use the
CommandReturn shortcut to run just that line.
99% of the time, I use method 2, where I highlight the code I want,
then use the CommandReturn shortcut. However, method 3 is great
for trouble-shooting code line-by-line.
# Step 3: Calculations
Spacing
Howwouldyouliketoreadabookiftherewerenospacesbetweenwords?
Imguessingyouwouldnt. Soeverytimeyouwritecodewithoutprop-
erspacing,rememberthissentence.
Commenting isnt the only way to make your code legible. Its
important to make appropriate use of spaces and line breaks. For
example, I include spaces between arithmetic operators (like =, + and
-) and after commas (which well get to later). For example, look at
the following code:
a<-(100+3)-2
mean(c(a/100,642564624.34))
t.test(formula=revenue.all~sequel,data=movies)
plot(x=movies$budget,y=movies$dvd.usa,main="myplot")
That code looks like shit. Dont write code like that. It makes my
eyes hurt. Now, lets use some liberal amounts of commenting and
spacing to make it look less shitty.
34 yarrr! the pirates guide to r
a <- (100 + 3) - 2
mean(c(a / 100, 642564624.34))
plot(x = movies$budget,
y = movies$dvd.usa,
main = "myplot"
)
See how much better that second chunk of code looks? Not only
do the comments tell us the purpose behind the code, but there are
spaces and line-breaks separating distinct elements.
By now you know that you can use R to do simple calculations. But
to really take advantage of R, you need to know how to create and
manipulate objects. All of the data, analyses, and even plots, you
use and create are, or can be, saved as objects in R. For example
the movies dataset which weve used before is an object stored in
the yarrr package. This object was defined in the yarrr package
with the name movies. When you loaded the yarrr package with
the library(yarrr) command, you told R to give you access to
the movies object. Once the object was loaded, we could use it to
calculate descriptive statistics, hypothesis tests, and to create plots.
To create new objects in R, you need to do object assignment. Object
assignment is our way of storing information, such as a number or
a statistical test, into something we can easily refer to later. This is a
pretty big deal. Object assignment allows us to store data objects un-
der relevant names which we can then use to slice and dice specific
data objects anytime wed like to.
To do an assignment, we use the almighty <- operator15 called 15
You can use = instead of <- for object
assign. assignment but I recommend you stick
with <- because the direction of the
assignment is clear.
object <- [ ]
2: r basics 35
## [1] 100
Now, R will print the value of a (in this case 100) to the console. If
you try to evaluate an object that is not yet defined, R will return an
error. For example, lets try to print the object b which we havent yet
defined:
a <- 1
b <- 100
# What is a + b?
a + b
## [1] 101
# What is c?
c
## [1] 101
36 yarrr! the pirates guide to r
z <- 0
z + 1
## [1] 0
Damn! As you can see, the value of z is still 0! What went wrong?
Oh yeah...
z <- 0
z <- z + 1 # Now I'm REALLY changing z
z
## [1] 1
All the object names above are perfectly valid. Now, lets look at
some examples of invalid object names. These object names are all
invalid because they either contain spaces, start with numbers, or
have invalid characters:
If you try running the code above in R, you will receive a warning
message starting with Error: unexpected symbol. Anytime you see
this warning in R, it almost always means that you have a syntax
error of some kind.
R is case-sensitive!
Like English, R is case-sensitive it R treats capital letters differently
from lower-case letters. For example, the four following objects
Plunder, plunder and PLUNDER are totally different objects in R:
Avoid using too many capital letters in object names because they
Figure 14: Like a text message, you
require you to hold the shift key. This may sound silly, but youd be
should probably watch your use of
surprised how much easier it is to type mydata than MyData 100 times. capitalization in R.
search showed me that the revenue was $634,954,103. Ill create the
new object using assignment:
## [1] 558759611
It looks like the movie made 5.5875961 108 in Euros. Not bad.
Now, lets see how much more Pirates of the Caribbean 2: Dead
Mans Chest made compared to "Curse of the Black Pearl." Another
Google search uncovered that Dead Mans Chest made $1,066,215,812
(that wasnt a mistype, the freaking movie made over a billion dol-
lars).
deadman.usd / blackpearl.usd
## [1] 1.679201
It looks like "Dead Mans Chest" made 68% more than "Curse of
the Black Pearl" - not bad for two movies based off of a ride from
Disneyland.
1. Create a new R script. Using comments, write your name, the date,
and Testing my Chapter 2 R Might at the top of the script. Write
your answers to the rest of these exercises on this script, and be
sure to copy and paste the original questions using comments!
Your script should only contain valid R code and comments.
thisone <- 1
THISONE <- 2
this.one <- 3
2: r basics 39
This.1 <- 4
ThIS.....ON...E <- 5
This!One! <- 6
lkjasdfkjsdf <- 7
3. 2015 was a good year for pirate booty - your ship collected 100,800
gold coins. Create an object called gold.in.2015 and assign the
correct value to it.
5. Look at the code below. What will R return after the third line?
Make a prediction, then test the code yourself.
a <- 10
a + 10
a
3: Scaler and vector objects
People are not objects. But R is full of them. Here are some of the
# scalar v vector v matrix
basic ones. par(mar = rep(1, 4))
plot(1, xlim = c(0, 4), ylim = c(-.5, 5),
xlab = "", ylab = "",
xaxt = "n", yaxt = "n",
Scalers bty = "n", type = "n")
# scalar
rect(rep(0, 1), rep(0, 1), rep(1, 1), rep(1, 1))
text(.5, -.5, "scalar")
The simplest object type in R is a scaler. A scalar object is just a single
# Vector
value like a number or a name. In the previous chapter we defined rect(rep(2, 5), 0:4, rep(3, 5), 1:5)
text(2.5, -.5, "Vector")
several scaler objects. Here are examples of numeric scalers:
a <- 100
b <- 3 / 100
c <- (a + b) / c
d <- "ship"
e <- "cannon"
f <- "Do any modern armies still use cannons?"
As you can imagine, R treats numeric and character scalers dif- scalar Vector
ferently. For example, while you can do basic arithmetic operations
on numeric scalers they wont work on character scalers. If you try Figure 15: Visual depiction of a scalar
and vector. Deep shit. Wait until we get
to perform numeric operations (like addition) on character scalers, to matrices - youre going to lose it.
youll get an error like this one:
a <- "1"
b <- "2"
a + b
If you see an error like this one, it means that youre trying to
apply numeric operations to character objects. Thats just sick and
wrong.
42 yarrr! the pirates guide to r
Vectors
Ok now its time to learn our first function! Well start with c()
which allows you to build vectors.
c(a, b, c, ...)
a, b, c, ...
One or more objects to be combined into a vector
The simplest way to create a vector is with the c() function. The c
here stands for concatenate, which means "bring them together". The
c() function takes several scalers as arguments, and returns a vector
containing those objects. When using c(), place a comma in between
the objects (scalers or vectors) you want to combine:
Lets use the c() function to create a vector called a containing the
integers from 1 to 5.
a <- c(1, 2, 3, 4, 5)
## [1] 1 2 3 4 5
As you can see, R has stored all 5 numbers in the object a. Thanks
R!
3: scaler and vector objects 43
You can also create longer vectors by combining vectors you have
already defined. Lets create a vector of the numbers from 1 to 10
by first generating a vector a from 1 to 5, and a vector b from 6 to 10
then combine them into a single vector c:
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
c <- c(a, b)
c
## [1] 1 2 3 4 5 6 7 8 9 10
You can also create character vectors by using the c() function to
combine character scalers into character vectors:
a:b
The a:b function takes two scalers a and b as arguments, and returns
a vector of numbers from the starting point a to the ending point b in
steps of 1. You can go forwards or backwards depending on which
number is larger.
a:b
a
The start of the sequence
b
The end of the sequence
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
2.5:8.5
seq()
The seq() function is a more flexible version of a:b. Like a:b, seq()
allows you to create a sequence from a starting number to an ending
number. However, seq(), has additional arguments that allow you
to specify either the size of the steps between numbers, or the total
length of the sequence:
3: scaler and vector objects 45
from
The start of the sequence
to
The end of the sequence
by
The step-size of the sequence
length.out
The desired length of the final sequence (only use if you dont
specify by)
seq(from = 1, to = 10, by = 1)
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 0 10 20 30 40 50 60 70 80 90 100
If you use the length.out argument, the sequence will have length
equal to the input of length.out.
## [1] 0 10 20 30 40 50 60 70 80 90 100
## [1] 0 25 50 75 100
rep()
The rep() function allows you to repeat a scaler (or vector) a speci-
fied number of times.
times
The number of times to repeat the sequence
each
The number of times to repeat each value within the sequence
length.out (optional)
The desired length of the final sequence
## [1] 3 3 3 3 3 3 3 3 3 3
## [1] 1 2 1 2 1 2 1 2 1 2
## [1] 1 2 3 1 2 3 1 2 3 1
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
As you can see in the fourth example above, you can can include
an a:b call within a rep()!
You can even combine the times and each arguments within a
single rep() function. For example, heres how to create the sequence
1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3 with one call to rep():
## [1] 1 1 2 2 3 3 1 1 2 2 3 3
3: scaler and vector objects 47
pirate.jobs.num
## [1] 1 2 3 1 2 3 1 2 3 1
## [1] 1 1 2 2 3 3 4 4
## [1] "a" "a" "a" "b" "b" "b" "a" "a" "a" "b" "b" "b"
2. Create the vector [2.1, 4.1, 6.1, 8.1] in two ways, once using c() and
once using seq()
48 yarrr! the pirates guide to r
3. Create the vector [0, 5, 10, 15] in 3 ways: using c(), seq() with a by
argument, and seq() with a length.out argument.
4. Create the vector [101, 102, 103, 200, 205, 210, 1000, 1100, 1200]
using a combination of the c() and seq() functions
5. A new batch of 100 pirates are boarding your ship and need
new swords. You have 10 scimitars, 40 broadswords, and 50 cut-
lasses that you need to distribute evenly to the 100 pirates as they
board. Create a vector of length 100 where there is 1 scimitar, 4
broadswords, and 5 cutlasses in each group of 10. That is, in the
first 10 elements there should be exactly 1 scimitar, 4 broadswords
and 5 cutlasses. The next 10 elements should also have the same
number of each sword (and so on).
In this chapter, well cover some basic vector operations and func-
tions.
length()
Once you have a vector, you may want to know how long it is. When
this happens, dont stare at your computer screen and count the
elements one by one! Instead, use length() function. The length()
function takes a vector as an argument, and returns a scaler repre-
senting the number of elements in the vector:
a <- 1:10
length(a)
## [1] 10
## [1] 20
## [1] 6
## [1] 1
and want to add 10 to each element in the vector, just add the vector
and scaler objects. Lets create a vector with the integers from 1 to 10,
and add then add 100 to each element:
a <- 1:10
a + 100
## [1] 101 102 103 104 105 106 107 108 109 110
a <- 1:10
a / 100
## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10
a ^ 2
## [1] 1 4 9 16 25 36 49 64 81 100
c(1, 2, 3, 4, 5) + c(5, 4, 3, 2, 1)
## [1] 6 6 6 6 6
Lets create two vectors a and b where each vector contains the
integers from 1 to 5. Well then create two new vectors ab.sum, the
sum of the two vectors and ab.diff, the difference of the two vectors,
and ab.prod, the product of the two vectors:
4: vector functions 51
a <- 1:5
b <- 1:5
ab.sum <- a + b
ab.diff <- a - b
ab.prod <- a * b
ab.sum
## [1] 2 4 6 8 10
ab.diff
## [1] 0 0 0 0 0
ab.prod
## [1] 1 4 9 16 25
Now, lets say you want to know how many total items each pirate
sold. You can do this by just adding the two vectors:
## [1] 73 46 42 210 64
Ok, now that we can create vectors, lets learn the basic descriptive
statistics functions. Well start with functions that apply to both
continuous and discrete data16 . Each of the following functions takes 16
Continuous data is data that, gener-
a numeric vector as an argument, and returns either a scalar (or in ally speaking, can take on an infinite
number of values. Height and weight
the case of summary(), a table) as a result. are good examples of continuous data.
Discrete data are those that can only
take on a finite number of values. The
number of pirates on a ship, or the
number of times a monkey farts in a
day are great examples of discrete data
52 yarrr! the pirates guide to r
min(x), max(x)
The minimum and maximum values of a vector x
mean(x)
The arithmetic mean of a numeric vector x
median(x)
The median of a numeric vector x. 50% of the data should be less
than median(x) and 50% should be greater than median(x).
sd(x), var(x)
The standard deviation and variance of a numeric vector x.
quantile(x, p)
The pth sample quantile of a numeric vector x. For example,
quantile(x, .2) will tell you the value at which 20% of cases are
less than x. The function quantile(x, .5) is identical to median(x)
summary(x)
Shows you several descriptive statistics of a vector x, including
min(x), max(x), median(x), mean(x)
min(tattoos)
## [1] 2
max(tattoos)
## [1] 100
mean(tattoos)
4: vector functions 53
## [1] 24.1
median(tattoos)
## [1] 9
sd(tattoos)
## [1] 31.32074
## [1] NA
mean(a, na.rm = T)
## [1] 4.5
Now, the function ignored the NA value and returned the mean
of the remaining data. While this may seem trivial now (why did we
include an NA value in the vector if we wanted to ignore it?!), it will
be become very important when we apply the function to real data
which, very often, contains missing values.
If you want to get many summary statistics from a vector, you can
use the summary() function which gives you several key statistics:
summary(tattoos)
ceiling(x), floor(x)
Round a number to the next largest integer with ceiling(x) or
down to the next lowest integer with floor(x).
x %% y
Modular arithmetic (i.e.; x mod y). You can interpret x %% y as
What is the remainder after dividing x by y?" For example, 10 %%
3 equals 1 because 3 times 3 is 9 (which leaves a remainder of 1).
unique(x)
Returns a vector of all unique values in the vector x.
table(x)
Returns a table showing all the unique values in the vector x as
well as a count of each occurrence. By default, the table() func-
tion does NOT count NA values. To include a count of NA values,
include the argument exclude = NULL
4: vector functions 55
The function unique(x) will tell you all the unique values in the
vector, but wont tell you anything about how often each value
occurs.
unique(vec)
## [1] 1 5 10
unique(gender)
The function table() does the same thing as unique(), but goes
a step further in telling you how often each of the unique values
occurs:
table(vec)
## vec
## 1 5 10
## 5 1 3
table(gender)
## gender
## F M
## 5 4
table(vec) / sum(table(vec))
## vec
## 1 5 10
## 0.5555556 0.1111111 0.3333333
table(gender) / sum(table(gender))
## gender
## F M
## 0.5555556 0.4444444
56 yarrr! the pirates guide to r
Standardization (z-score)
A common task in statistics is to standardize variables also known
as calculating z-scores. The purpose of standardizing a vector is to
put it on a common scale which allows you to compare it to other
(standardized) variables. To standardize a vector, you simply subtract
the vector by its mean, and then divide the result by the vectors
standard deviation.
If the concept of z-scores is new to you dont worry. In the next
worked example, youll see how it can help you compare two sets
of data. But for now, lets see how easy it is to standardize a vector
using basic arithmetic.
Lets say you have a vector a containing some data. Well assign
the vector to a new object called a then calculate the mean and stan-
dard deviation with the mean() and sd() functions:
a <- c(5, 3, 7, 5, 5, 3, 4)
mean(a)
## [1] 4.571429
sd(a)
## [1] 1.397276
Ok. Now well create a new vector called a.z which is a standard-
ized version of a. To do this, well simply subtract the mean of the
vector, then divide by the standard deviation.
mean(a.z)
## [1] 1.982309e-16
sd(a.z)
## [1] 1
Sweet17 . 17
Oh, dont worry that the mean of a.z
doesnt look like exactly zero. Using
non-scientific notation, the result is
A worked example: Evaluating a competition 0.000000000000000198. For all intents
and purposes, thats 0. The reason
Your gluten-intolerant first mate just perished in a tragic soy sauce the result is not exactly 0 is due to
computer science theoretical reasons
incident and its time to promote another member of your crew to the that I cannot explain (because I dont
understand them)
4: vector functions 57
Now youve got the data, but theres a problem: the scales of the
numbers are very different. While the grogg numbers range from 1 to
12, the climbing numbers have a much larger range from 100 to 700.
This makes it difficult to compare the two sets of numbers directly.
To solve this problem, well use standardization. Lets create new
standardized vectors called grogg.z and climbing.z
Now lets look at the final results. To make them easier to read, Ill
round them to 2 digits:
round(grogg.z, 2)
round(climbing.z, 2)
round(average.z, 1)
The highest average z-score belongs to the second pirate who had
an average z-score value of 0.5. The first and last pirates, who did
well in one event, seemed to have done poorly in the other event.
Moral of the story: promote the pirate who can drink and climb.
1. Create a vector that shows the square root of the integers from 1 to
10.
2. Theres an old parable that goes something like this. A man does
some work for a king and needs to be paid. Because the man
loves rice (who doesnt?!), the man offers the king two different
ways that he can be paid. You can either pay me 1000 bushels of
rice, or, you can pay me as follows: get a chessboard and put one
grain of rice in the top left square. Then put 2 grains of rice on
the next square, followed by 4 grains on the next, 8 grains on the
next...and so on, where the amount of rice doubles on each square,
until you get to the last square. When you are finished, give me
all the grains of rice that would (in theory), fit on the chessboard.
The king, sensing that the man was an idiot for making such a
stupid offer, immediately accepts the second option. He summons
a chessboard, and begins counting out grains of rice one by one...
Assuming that there are 64 squares on a chessboard, calculate how
many grains of rice the main will receive18 . 18
Hint: If you have trouble coming up
with the answer, imagine how many
3. Renata thinks that she finds more treasure when shes had a mug grains are on the first, second, third and
of grogg than when she doesnt. To test this, she recorded how fourth squares, then try to create the
vector that shows the number of grains
much treasure she found over 7 days without drinking any grogg, on each square. Once you come up with
and then did the same over 7 days while drinking grogg. Here are that vector, you can easily calculate the
her results: final answer with the sum() function.
## [1] 30
## [1] 2
## [1] 30 24 40
62 yarrr! the pirates guide to r
## [1] 30 24 40
Numerical Indexing
With numerical indexing, you enter the integers corresponding to
the values in the vector you want to access in the form a[index],
where a is the vector, and index is a vector of index values. For
example, to get the first value in a vector called a, youd write a[1]. It it makes your code clearer, you
To get the first, second, and third value, you can either type a[c(1, can define an indexing object before
doing your actual indexing. For ex-
2, 3)] or a[1:3]. As you can see, you can use any of the vector ample, lets define an object called
generation functions we learned in the previous chapter to index a get.these.values and use this object to
index our data vector:
vector. However, the values must be integers.
Lets do a few more examples. Well create an a vector again and my.index <- 6:10
use indexing to extract specific values: a[my.index]
## [1] 3 4 NA NA NA
a <- c(0, 50, 2, 39, 9, 20, 17, 8, 10, 100)
# 1st element
a[1]
## [1] 0
# 1-5 elements
a[1:5]
## [1] 0 50 2 39 9
## [1] 50 39 20 8 100
5: working with vectors 63
Logical Indexing
The second way to index vectors is with logical vectors. A logical
vector is a vector that only contains TRUE and FALSE values. In R,
true values are designated with either TRUE or T, and false values
with either FALSE or F.
You can create logical vectors from existing vectors using com-
parison operators like < (less than), == (equals to), and != (not equal
to). A complete list of the most common comparison operators is
in Figure 19. For example, lets start with a vector a containing five
numbers:
For character vectors, you can also use the == (equals) or != (not-
== equal
equal) comparisons (other comparisons such as < dont make sense
!= not equal
for character vectors).
< less than
b <- c("m", "f", "f", "f", "m") <= less than or equal
> greater than
b == "m" >= greater than or equal
| or
## [1] TRUE FALSE FALSE FALSE TRUE not
!
b != "m"
Figure 19: Comparison operators in R
a <- c(1, 2, 3, 4, 5)
b <- c(3, 3, 3, 3, 3)
a > b
a != b
Once you create a logical vector, you can use that logical vector to
index another vector. When you do this, R will return the values of
the vector for all TRUE values of the logical vector, and will ignore
all values for which the logical vector is FALSE. If that was confusing,
think about it this way: a logical vector, combined with the brackets [
], acts as a filter for the vector it is indexing. It only lets values of the
vector pass through for which the logical vector is TRUE.
For example, lets look at our sex and beard data again.
sex
beard
## [1] 30 24 0 40 0
5: working with vectors 65
Now, well use the logical vector (sex == "f") to index the beard
vector. This will give us the values of beard (representing beard
lengths) only for cases where the sex vector is equal to f:
beard[sex == "f"]
## [1] 0 0
This result tells us there were two females (that is, there were two
TRUE values in the logical vector sex == "f"), and both had a beard
length of 0.
x %in% y
%in%
The %in% operation helps you to easily create multiple OR argu-
ments.Imagine you have a vector of categorical data that can take
on many different values. For example, you could have a vector x
indicating peoples favorite letters.
Now, lets say you want to create a logical vector indicating which
values are either a or b or c or d. You could create this logical vector
with multiple | (OR) commands:
## [1] 9 9 9 9 9 6 7 8 9 10
Now lets change the last 5 values to 0s. Well index the values 6
through 10, and assign a value of 0.
a[6:10] <- 0
a
## [1] 9 9 9 9 9 0 0 0 0 0
As you can see, we have some invalid values (999 and -2) in this
vector. If we try to take the mean of this vector, well get a nonsense
response:
mean(happy)
## [1] 201.3
happy[log.vec] <- NA
happy
Of course, because we now have NA
## [1] 1 4 2 NA 2 3 NA 3 2 NA values in our vector, we need to include
the na.rm = T argument to the mean()
function! If we dont, well just get an
As you can see, the new vector happy.valid replaced the invalid NA response:
values with NAs. Now we can take a mean() of the vector and see the
mean(happy)
mean of the valid responses.
## [1] NA
mean(happy, na.rm = T)
mean(happy, na.rm = T)
## [1] 2.428571 ## [1] 2.428571
Taking the sum and mean of logical vectors to get counts and per-
centages
Many (if not all) R functions will interpret TRUE values as 1 and
FALSE values as 0. This allows us to easily answer questions like
"How many values in a data vector are greater than 0?" or "What
percentage of values are equal to 5?" by applying the sum() or mean()
function to a logical vector.
Well start with a vector x of length 10
Now, well use sum() and mean() to see how many of the values
in x are positive, and what percent are positive. We should find that
there are 6 TRUE values, and that 60% of the values (6 / 10) are
TRUE.
sum(x > 0)
## [1] 6
mean(x > 0)
## [1] 0.6
This is a really powerful tool. Pretty much any time you want to
answer a question like "How many of X are Y" or "What percent of X
are Y", you use sum() or mean() function with a logical vector as an
argument.
5: working with vectors 69
year <- c(2009, 2015, 2015, 1985, 2012, 2007, 2009, 1988, 2005, 2014)
boxoffice <- c(35, 15, 15, 62, 3, 10, 321, 79, 39, 1.5)
time <- c(92, 97, 120, 90, 81, 158, 97, -84, 119, 84)
6. One of the values in the time vector is invalid. Convert any invalid
values in this vector to NA. Then, calculate the mean movie time
7. What were the names of the Comedy movies? What were their
boxoffice totals? (Two separate questions)
8. What were the names of the movies that made less than $30 Mil-
lion dollars AND were Comedies?
10. What percent of the movies were rated R and were comedies?
70 yarrr! the pirates guide to r
Additional Tips
R has lots of special functions that take vectors as arguments, and
return logical vectors based on multiple criteria. Here are some that I
frequently use:
is.integer(x)
Tests if values in a vector are integers
is.na(x), is.null(x)
Tests if values in a vector are NA or NULL
is.finite(x)
Tests if a value is a finite numerical value. If a value is NA, NULL,
Inf, or -Inf, is.finite() will return FALSE.
duplicated(x))
Returns FALSE at the first location of each unique value in x,
and TRUE for all future locations of unique values. For example,
duplicated(c(1, 2, 1, 2, 3)) returns (FALSE, FALSE, TRUE,
TRUE, FALSE). If you want to remove duplicated values from a
vector, just run x <- x[!duplicated(x)]
which(log.vec)
Logical vectors arent just good for indexing, you can also use
them to figure out which values in a vector satisfy some criteria. To
do this, use the function which(). If you apply the function which() to
a logical vector, R will tell you which values of the index are TRUE.
For example, given our sex vector, we can use the which() function to
see which pirates are male and which are female:
which(sex == "m")
## [1] 1 2 4
which(sex == "f")
## [1] 3 5
6: Matrices and Data Frames
sion (its length), matrices and dataframes both have 2-dimensions # scalar
rect(rep(0, 1), rep(0, 1), rep(1, 1), rep(1, 1))
text(.5, -.5, "scalar")
representing their width and height. You can think of a matrix or
# Vector
dataframe as a combination of n vectors, where each vector has a rect(rep(2, 5), 0:4, rep(3, 5), 1:5)
text(2.5, -.5, "Vector")
length of m. See Figure 22 to see the shocking difference between # Matrix
rect(rep(4:8, each = 5),
these data objects. rep(0:4, times = 5),
rep(5:9, each = 5),
While matrices and dataframes look very similar, they arent rep(1:5, times = 5))
text(6.5, -.5, "Matrix / Data Frame"
exactly the same. While a matrix can contain either character or nu- )
that take matrices and dataframes as inputs. Figure 22: scalar, Vector, Matrix...
::drops mike::
19
WTF If dataframes are more flexible
than matrices, why do we use matrices
at all? The answer is that, because
they are simpler, matrices take up less
computational space than dataframes.
Additionally, some functions require
matrices as inputs to ensure that they
work correctly.
72 yarrr! the pirates guide to r
There are a number of ways to create your own matrix and dataframe
objects in R. Because matrices and dataframes are just combinations
of vectors, each function takes one or more vectors as inputs, and
returns a matrix or a dataframe.
matrix()
matrix()
The matrix() function creates a matrix form a single vector of data.
The function has 3 main inputs: data a vector of data, nrow the
We can also organize the vector by rows
number of rows you want in the matrix, and ncol the number of instead of columns using the argument
columns you want in the matrix, and byrow a logical value indicat- byrow = T:
ing whether you want to fill the matrix by rows. Check out the help
# Create a matrix of the integers 1:10,
menu for the matrix function (?matrix) to see some additional inputs. # with 2 rows and 5 columns entered by row
Lets use the matrix() function to re-create a matrix containing the
matrix(data = 1:10,
values from 1 to 10. nrow = 2,
ncol = 5,
# Create a matrix of the integers 1:10, byrow = T)
# with 5 rows and 2 columns ## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
matrix(data = 1:10, ## [2,] 6 7 8 9 10
nrow = 5,
ncol = 2)
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
data.frame()
To create a dataframe from vectors, use the data.frame() function.
The data.frame() function works very similarly to cbind() the only
difference is that in data.frame() you specify names to each of the
columns as you define them. Again, unlike matrices, dataframes
can contain both string vectors and numeric vectors within the same
data.frame()
object. Because they are more flexible than matrices, most large
datasets in R will be stored as dataframes.
Lets create a simple dataframe using the data.frame() function
with a mixture of text and numeric columns:
Dataframes pre-loaded in R
Now you know how to use functions like cbind() and data.frame()
to manually create your own matrices and dataframes in R. However,
for demonstration purposes, its frequently easier to use existing
dataframes rather than always having to create your own. Thankfully,
R has us covered: R has several datasets that come pre-installed in a
package called datasets you dont need to install this package, its
included in the base R software. While you probably wont make any
major scientific discoveries with these datasets, they allow all R users
to test and compare code on the same sets of data. Here are a few To see a complete list of all the datasets
datasets that we will be using in future examples: included in the datasets package, run
the code: library(help = "datasets")
View()
You can print an entire matrix or dataframe in the console by just
executing the name of the object, but if the matrix or dataframe is
very large, it can overwhelm the console. Instead, its best to use one
of the viewing functions like View() or head(). The View() function
will open the object in its own window: View()
# Print the entire ChickWeight dataframe
# in a new data window
View(ChickWeight)
head()
To print the first few rows of a dataframe or matrix to the console,
use the head() function:
76 yarrr! the pirates guide to r
head(ChickWeight)
summary()
summary()
To get summary statistics on all columns in a dataframe, use the
summary() function:
summary(ChickWeight)
## (Other):506
str()
To learn about the classes of columns in a dataframe, in addition to
some other summary information, use the str() (structure) function.
str()
This function returns information for more advanced R users, so
dont worry if the output looks confusing.
# Print the classes of the columns of
# ChickWeight to the console
str(ChickWeight)
## [1] 578
## [1] 4
One of the nice things about dataframes is that each column will have
a name. You can use these name to access specific columns by name
78 yarrr! the pirates guide to r
names()
names()
To access the names of a dataframe, use the function names(). This
will return a string vector with the names of the dataframe. Lets use
names() to get the names of the ToothGrowth dataframe:
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3
## [15] 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5
## [29] 23.3 29.5 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3
## [43] 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9
## [57] 26.4 27.3 29.4 23.0
If you want to access several columns by name, you can forgo the $
operator, and put a character vector of column names in brackets:
## [1] 18.81333
##
## OJ VC
## 30 30
You can add new columns with any information that you want to
a dataframe - even basic numerical vectors. For example, lets add
a simple index column to the dataframe showing the original order
of the rows of the data. To do this, Ill assign the numeric vector
1:nrow(ToothGrowth) to a new column called index
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
6: matrices and data frames 81
## 8 11.2 0.5
Similarly, to get the entire 2nd column, set the column index to 2
and leave the row index blank:
subset()
subset()
subset()
x
The data (usually a dataframe)
subset
Figure 26: The subset() function is like
A logical vector indicating which rows you want to select
a lightsaber. An elegant function from a
more civilized age...
select
An optional vector of the columns you want to select
subset(x = ToothGrowth,
subset = len < 20 &
supp == "OJ" &
6: matrices and data frames 83
dose >= 1
)
mean(oj$len)
## [1] 20.66333
mean(oj$len)
## [1] 20.66333
84 yarrr! the pirates guide to r
mean(ToothGrowth$len[ToothGrowth$supp == "OJ"])
## [1] 20.66333
Additional tips
with()
with(df, )
The function with() helps to save you some typing when you are us-
ing multiple columns from a dataframe. Specifically, it allows you to
specify a dataframe (or any other object in R) once at the beginning
of a line then, for every object you refer to in the code in that line, R
will assume youre referring to that object in an expression.
For example, using the ToothGrowth dataset, we can calculate a
vector defined as len divided by dose using the with() function
where we specify the name of the dataframe (ToothGrowth) just once
at the beginning of the line, and then refer to the column names
without re-writing the dataframe or using the $ operator:
# THIS IS IDENTICAL TO
ToothGrowth$len / ToothGrowth$dose
3. What was the mean age of female and male pirates separately?
9. What are the names of the female pirates whose favorite superhero
is Superman?
10. What was the median number of tattoos of pirates over the age of
30 whose favorite superhero is Spiderman?
7: Importing, saving, and managing data
getwd()
related project you work on. To see your current working directory,
use the getwd():
## [1] "/Users/CaptainJack/Desktop/yarrr"
As you can see, when I run this code, it tells me that my working
directory is in a folder on my Desktop called yarrr.
If you want to change your working directory, use the setwd()
function. For example, if I wanted to change my working directory to
an existing Dropbox folder called yarrr, Id run the following code:
will automatically be set to the directory that the .RProj file is located
in. Personally, I recommend creating a new R Project whenever you
are starting a new research project. However, its not necessary.
Once youve decided on a working directory for a specific R
project, its a good idea to put new folders in the directory which
will contain your R code, data files, notes, and other material relevant
to your project. In Figure 29 you can see how my working directory
looks for a project I am working on called ForensicFFT.
ls()
If you want to see all the objects defined in your current workspace,
use the ls() function:
ls()
ls()
.RData files
When you save objects from R to your computer using the save.image()
and save() functions, the objects are saved as .RData files. If you
open a .RData file outside of R (say by double-clicking it), this will
90 yarrr! the pirates guide to r
save.image()
To save all the objects in your workspace as a .RData file, use the
save.image()
save.image() function. This will save all the objects as a .RData file
close to your working directory. For example, to save my workspace
in the data folder located in my working directory, Id run the follow-
ing:
save.image(file = "data/myimage.RData")
save()
save()
If you dont want to save your entire workspace, but instead only
want to save a few selected objects, you can use the save() func-
tion. For example, lets say I want to store both data from an ex-
periment, and the results of a hypothesis test, in one file called
Mar7_experiment.RData. If the data is stored in a dataframe called
experiment.df, and the hypothesis test is stored in an object called
htest, I can run the following code:
save(experiment.df, htest,
file = "data/Mar7_experiment.RData")
load()
load()
To load a saved workspace, use the load() function. For example, to
load the workspace that I just saved to the data folder in my working
directory, Id run the following:
load(file = "data/myimage.RData")
7: importing, saving, and managing data 91
rm()
rm()
At some points in your analyses, you may find that your workspace
is filled up with one or more objects that you dont need either
because theyre slowing down your computer, or because theyre just
distracting. To remove objects from your workspace, use the rm()
function.
To remove specific objects, enter the objects as arguments to rm().
For example, to remove a huge dataframe called huge.df, Id run the
following;
While .RData files are great for saving R objects, sometimes youll
want to export data (specifically dataframes) as a simple .txt text file
write.table()
that other programs, like Excel and SPSS, can also read. To do this,
use the write.table() function.
92 yarrr! the pirates guide to r
write.table()
x
The object you want to write to a text file. Usually a dataframe
object.
file
The documents file path relative to the working directory un-
less specified otherwise. For example file = "mydata.txt" will
save the data as a text file directly in the working directory, while
file = "data/mydata.txt" will save the data in a folder called
data inside the working directory. If you want to write the file to
someplace outside of your workspace, just put the full file path.
sep
A string indicating how the columns are separated. For comma
separated files, use ,, for tabdelimited files, use \t
row.names
A logical value (T or F) indicating whether or not save the row-
names in the text file.
For example, the following code will save the ToothGrowth object
as a tabdelimited text file in my working directory:
write.table(x = ChickWeight,
file = "chickweight.txt",
sep = "\t"
)
If you have a text file that you want to read into R, use the read.table()
read.table()
function.
read.table()
file
The documents file path (make sure to enter as a string with
quotation marks!) OR an html link to a file.
header
A logical value indicating whether the data has a header row
that is, whether the first row of the data represents the column
names.
sep
A string indicating how the columns are separated. For comma
separated files, use ",", for tab-delimited files, use "\t"
stringsAsFactors
A logical value indicating whether or not to convert strings to
factors. I always set this to FALSE (because I dont like using
factors)
sep = '\t',
header = T
)
If you absolutely have to read a non-text file into R, check out the
package called foreign. This package has functions for importing
Stata, SAS and Shitty Piece of Shitty Shit files directly into R. To
read Excel files, try the package xlsx
7: importing, saving, and managing data 95
4. In the script, create new objects called a, b, and c. You can assign
anything to these objects from vectors to dataframes. If you cant
think of any, use these:
b <- mean(a$age)
c <- table(a$sex)
5. Send the code to the Console so the objects are stored in your
current workspace. Use the ls() function to see that the objects
are indeed stored in your workspace.
6. I have a tabdelimited text file called club at the following web ad-
dress: http://nathanieldphillips.com/wp-content/uploads/2015/12/club.txt.
Using read.table(), load the data as a new object called club.df
in your workspace.
11. Now, in your AnalyzingObjects.R script, load the objects back into
your workspace from the "myobjects.RData" file using the load()
function. Again, run the ls() function to make sure all the objects
are back in your workspace.
12. Add some R code to your AnalyzingObjects.R script. Calculate
some means and percentages. Now save your AnalyzingObjects.R
script, and then save all the objects in your workspace to "myob-
jects.RData".
13. Congratulations! You are now a well-organized R Pirate! Quit
RStudio and go outside for some relaxing pillaging.
8: Advanced dataframe manipulation
In this chapter well cover some more advanced functions and proce-
dures for manipulating dataframes
exam
## q1 q2 q3
## 1 1 8 3
## 2 5 10 7
## 3 2 9 4
## 4 3 8 6
## 5 2 7 4
colMeans(exam)
## q1 q2 q3
## 2.6 8.4 4.8
It looks like question 1 was the hardest (with a mean score of 2.6),
and question 2 was the easiest (with a mean score of 8.4).
98 yarrr! the pirates guide to r
To calculate the mean score across rows (students), we can use the
rowMeans function. This will return the mean score for each row:
It looks like the second student did the best (with a mean score of colSums(exam)
8.4). ## q1 q2 q3
## 13 42 24
## q1 q2 q3
## 1 1 8 3
## 3 2 9 4
## 5 2 7 4
## 4 3 8 6
## 2 5 10 7
You can also use the order() function to order character vectors in
alphabetical order. For example, if I add a character vector of student
names to the exam dataframe, I can then sort the data according to
student in alphabetical order:
## q1 q2 q3 student
## 1 1 8 3 alice
## 2 5 10 7 david
## 3 2 9 4 heath
## 4 3 8 6 jack
## 5 2 7 4 vincent
will apply the input function (FUN) to the dependent variable (Y)
separately for each level(s) of the independent variable(s) (x1, x2, ...)
and then print the result. Lets see how it works:
aggregate()
formula
A formula in the form y x1 + x2 + ... where y is the depen-
dent variable, and x1, x2... are the index (independent) variables.
For example, salary sex + age will aggregate a salary column
at every unique combination of age and age
FUN
A function that you want to apply to x at every level of the in-
dependent variables. For example, FUN = mean will calculate the
mean for each level of the independent variables.
data
The dataset containing the variables in formula
...
Optional arguments passed on to FUN (like na.rm = T to ignore
NA values in x)
aggregate()
each supplement?" For this question, well set the value of the depen-
dent variable Y to len, x1 to supp, and FUN to mean
## supp len
## 1 OJ 20.66333
## 2 VC 16.96333
with a column for the independent variable (supp), and a column for
the results of the function mean applied to each level of the indepen-
dent variable. The result of this function is the same thing wed get
by manually indexing each level of supp individually.
Our new result now shows us the median length for all combina-
tions of both supplement (supp) and dose.
You can even use aggregate() to evaluate functions that return
more than one value. For example, the summary() function returns
102 yarrr! the pirates guide to r
## supp dose len.Min. len.1st Qu. len.Median len.Mean len.3rd Qu. len.Max.
## 1 OJ 0.5 8.20 9.70 12.25 13.23 16.18 21.50
## 2 VC 0.5 4.20 5.95 7.15 7.98 10.90 11.50
## 3 OJ 1.0 14.50 20.30 23.45 22.70 25.65 27.30
## 4 VC 1.0 13.60 15.27 16.50 16.77 17.30 22.50
## 5 OJ 2.0 22.40 24.58 25.95 26.06 27.08 30.90
## 6 VC 2.0 18.50 23.38 25.95 26.14 28.80 33.90
When you use dplyr, you write code that sounds like: "The origi-
nal dataframe is XXX, now filter the dataframe to only include rows
that satisfy the conditions YYY, now group the data at each level of
the variable(s) ZZZ, now summarize the data and calculate summary
functions XXX..."
Lets start with an example: Lets create a dataframe of aggregated
data from the ToothGrowth dataset. Im going to group the data
according to the columns supp and dose. Ill then create several
columns is a different summary statistic of some data across each
grouping. To create this aggregated data frame, I will use the new
function group_by and the verb summarise. I will assign the result to
a new dataframe called ToothGrowth.agg:
ToothGrowth.agg
Additional Tips
Weve only scratched the surface of what you can do with dplyr.
For more tips, check out the dplyr vignette at https://cran.r-
project.org/web/packages/dplyr/vignettes/databases.html.
106 yarrr! the pirates guide to r
5. Calculate the mean age for each combined level of both gender
and drink
7. For men only, calculate the maximum score for each age
9. Only for females above the age of 20, create a table showing, for
each combined level of drink and cups, the mean, median, max-
imum, and standard deviation of scores. Also include a column
showing how many people were in each group.
# Create blank plot
plot(1, xlim = c(0, 100), ylim = c(0, 100),
xlab = "Pirate Quality", ylab = "", type = "n",
main = "Two different Pirate colleges", yaxt = "n"
)
# Set colors
require("RColorBrewer")
col.vec <- brewer.pal(10, name = "Set3")[4:6]
# Draw Samples
for(i in 1:length(samples.1)) {
Sampling data from probability distributions segments(40, 65, 60, 65, col = col.vec[1], lty = 1, lwd = 2)
By now you know how to generate sequences of numbers with the for(i in 1:length(samples.2)) {
functions :, seq(), and rep(). However, these functions dont gener- points(samples.2[i], 25, pch = 21, bg = col.vec[2], cex = 3)
text(samples.2[i], 25, round(samples.2[i], 0))
sampled data from specified probability distributions. A probability segments(10, 15, 90, 15, col = col.vec[2], lty = 1, lwd = 2)
curve(expr = fun,
1.0
mean = 0, sd = 1
mean = 2, sd = 0.5
0.8 mean = 1, sd = 2
0.4
rnorm(n = 5, mean = 0, sd = 1)
0.2
0.0
x
This code returns a vector of 5 values, where each value is a new
Figure 31: Three different normal
random sample drawn from a Normal distribution with mean = 0
distributions with different means and
and standard deviation = 1. standard deviations.
Because the sampling is done randomly, youll get different values
each time you run the rnorm() (or any other random sampling)
function. To see this, lets create two different sets of samples from a
normal distribution with mean 10 and standard deviation 5 and see
how they compare:
As you can see, even though I used the exact same code to gen-
erate the vectors a and b, the numbers in each sample are different.
Thats because the samples are each drawn randomly and inde-
pendently from the normal distribution. To visualize the sampling
process, run the code in the margin Figure 31 on your machine sev- # uniform distribution
curve(dunif,
eral times. You should see the sampling lines dance around the from = 0, to = 1,
xlim = c(-.5, 1.5),
xlab = "x",
distribution. lwd = 2,
main = "Uniform\nmin = 0, max = 1")
1.4
tion gives equal probability to all values between the minimum and
1.2
maximum values.
dunif(x)
To generate samples from a uniform distribution, we use the
1.0
function runif(), the function has 3 arguments:
0.8
0.6
runif() 0.5 0.0 0.5 1.0 1.5
min
The lower bound of the Uniform distribution from which samples
are drawn
max
The upper bound of the Uniform distribution from which samples
are drawn
sample()
x
A vector of outcomes you want to sample from. For example, to
simulate coin flips, youd enter x = c("Heads", "Tails")
size
The number of samples you want to draw.
replace
Should sampling be done with replacement? If T, then each in-
dividual sample will be replaced in the data vector. If F, then the
same outcome will never be drawn more than once. Think about
replacement like drawing different balls from a bag. Sampling
with replacement (replace = T) means that each time you draw a
ball, you return the ball back into the bag before drawing another
ball. Sampling without replacement (replace = F) means that after
you draw a ball, you remove that ball from the bag before drawing
again.
prob
A vector of probabilities of the same length as x indicating how
likely each outcome in "x" is. For example, to sample equally
from two outcomes, youd enter prob = c(.5, .5). The first value
corresponds to the first value of x and the second corresponds to
the second value (etc.). The vector of probabilities you give as an
argument should add up to one. However, if they dont, R will just
rescale them so that they will sum to 1.
sponding to our sample size of 10. Keep in mind that, just like using
rnorm() and runif(), the sample() function can give you different Chest of 20 Gold, 30 Silver,
and 50 Bronze Coins
outcomes every time you run it.
sample(x = chest,
size = 10,
prob = rep(1 / 100, times = 100),
replace = F
)
## [1] "No Match" "No Match" "No Match" "No Match" "No Match" "No Match"
## [7] "No Match" "No Match" "No Match" "No Match" "No Match" "No Match"
## [13] "No Match" "No Match" "No Match" "Match!!!" "No Match" "No Match"
## [19] "No Match" "No Match"
The output of this function is a simulated response from 10 pirates Figure 33: A typical Pinder profile.
that you liked.
According to the law of large numbers, the larger our sample size,
the closer our sample mean should be to the population mean. In
other words, the more data (samples) you have, the more accurate
your estimate should be. Lets test this by drawing either a small
(N = 10) or a large (N = 1,000,000) number of observations from a
Normal distribution with mean = 100 and sd = 20: Tip: You can easily write large powers
of 10 by using the notation 1eN, where
small <- rnorm(10, mean = 100, sd = 20) # 10 observations N is the power of 10. For example: 1e6
large <- rnorm(1e6, mean = 100, sd = 20) # One million observations is the same as 1,000,000
If our test worked, then the difference for the small sample should
be larger than the large sample. Lets test this by calculating the
mean of each sample and see how close they are to the true popula-
tion mean of 100:
## [1] 100.9499
## [1] 100.0289
## [1] 0.9499404
## [1] 0.02889741
3. Above all else, pirates love two things: treasure and gummy bears.
There are few things a pirate likes more than reaching into a bag
of fresh gummy bears and pulling out a handful of those gifts
from the pirate gods. Imagine that you have a bag filled with 100 Figure 34: A gummy captain.
gummy bears: 30 Green, 40 Red, 10 Blue, and 20 Puce. Draw a
random sample of 10 gummy bears from this bag. How many of
each kind did you get in the sample? What percent of the sample
are the terrible vomit-flavored puce color?
Additional Tips
Use set.seed() to control what your
Sometimes you may want sampling functions like rnorm() and random sampling functions return
runif to return consistent values for demonstration purposes.
For example, lets say you are showing someone how to use the
rnorm() function, and you want them get the same values you
get. You can accomplish this by using the set.seed() function.
To use the function, enter an integer as the main argument - it
doesnt matter what integer you pick. When you do, you will
fix Rs random number generator to a certain state associated
with the specific integer you use. After you run set.seed(), the
next random number generation function you use will always
give the same result. Lets try generating 5 values from a Normal
distribution with mean = 10 and sd = 1. But first, Ill set the seed
to 10.
114 yarrr! the pirates guide to r
set.seed(10)
rnorm(n = 5, mean = 10, sd = 1)
set.seed(10)
rnorm(n = 5, mean = 10, sd = 1)
Sammy Davis Jr. was one of the greatest American performers of all
time. If you dont know him already, Sammy was an American enter-
tainer who lived from 1925 to 1990. The range of his talents was just
incredible. He could sing, dance, act, and play multiple instruments
with ease. So how is R like Sammy Davis Jr? Like Sammy Davis Jr.,
R is incredibly good at doing many different things. R does data
analysis like Sammy dances, and creates plot like Sammy sings. If
Sammy and R did just one of these things, theyd be great. The fact
that they can do both is pretty amazing.
# A basic scatterplot
plot(x = 1:10,
y = 1:10,
116 yarrr! the pirates guide to r
Main Title
10
8
Y Axis label
6
4
2
2 4 6 8 10
X Axis label
Color basics
colors like snow, papayawhip and lawngreen. To see all color names magenta4 pink2 gray1 olivedrab gray75
The gray() function takes two arguments, level and alpha, and
returns a shade of gray. For example, gray(level = 1) will return
white. The alpha argument specifies how transparent to make gray(x)
the color on a scale from 0 (completely transparent), to 1 (not
transparent at all). The default value for alpha is 1 (not transparent 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
at all.)
library(yarrr)
Standard Transparent
# Generate Random data
g1.x <- rnorm(100)
g1.y <- g1.x + rnorm(100)
points(g2.x, g2.y,
col = "red",
pch = 16)
4
any color transparent. To use it, just enter the original color as the
2
main argument (orig.col), then enter how transparent you want to
g1.y
0
make it (from 0 to 1) as the second argument (trans.val). You can
either save the transparent color as a new color object, or use the
2
function directly in a plotting function like I do in the scatterplot in
4
the margin. Here are some examples: 2 1 0 1 2 3
g1.x
library(yarrr) # load the yarrr package
Transparent Colors
plot(1, col = transparent("blue", .5))
4
tions that generate shades of colors based on numeric data (like the
2
2 1 0 1 2 3
g1.x
10: plotting: part 1 119
Plotting arguments
Most plotting functions have tons of optional arguments (also called
parameters) that you can use to customize virtually everything in a
plot. To see all of them, look at the help menu for par by executing
?par. However, the good news is that you dont need to specify all
possible parameters you create a plot. Instead, there are only a few
critical arguments that you must specify - usually one or two vectors
of data. For any optional arguments that you do not specify, R will
use either a default value, or choose a value that makes sense based
on the data you specify.
In the following examples, I will to cover the main plotting param-
eters for each plotting type. However, the best way to learn what you
can, and cant, do with plots, is to try to create them yourself!
I think the best way to learn how to create plots is to see some
examples. Lets start with the main high-level plotting functions.
120 yarrr! the pirates guide to r
circle to Blue.
8
0 2 4 6 8 10
Aside from the x and y arguments, all of the arguments are op-
tional. If you dont specify a specific argument, then R will use a
10: plotting: part 1 121
default value, or try to come up with a value that makes sense. For
example, if you dont specify the xlim and ylim arguments, R will set
the limits so that all the points fit inside the plot.
color or background for symbols 1 through 20, use the col argument.
For symbols 21 through 25, you set the color of the border with col, pch =
and the color of the background using bg
Lets look at some different symbol types in action:
1 6 11 16 21
2 7 12 17 22
pch = 2, pch = 16, 3 8 13 18 23
col = 'blue' col = 'orchid2'
4 9 14 19 24
5 10 15 20 25
Histogram: hist()
hist(x = ChickWeight$weight,
Histograms are the most common way to plot unidimensional nu- main = "Fancy Chicken Weight Histogram",
xlab = "Weight",
meric data. To create a histogram well use the hist() function. The
ylab = "Frequency",
main argument to hist() is a x, a vector of numeric data. If you want breaks = 20, # 20 Bins
to specify how the histogram bins are created, you can use the breaks xlim = c(0, 500),
col = "papayawhip", # Filling Color
argument. To change the color of the border or background of the border = "hotpink" # Border Color
bins, use col and border: )
Lets create a histogram of the weights in the ChickWeight dataset:
Fancy Chicken Weight Histogram
hist(x = ChickWeight$weight,
main = "Chicken Weights",
100
xlab = "Weight",
80
Frequency
xlim = c(0, 500)
60
)
40
20
0
Chicken Weights 0 100 200
Weight
300 400 500
200
xlab = "Weight",
ylab = "Frequency",
breaks = 20,
xlim = c(0, 500),
50
breaks = 30,
add = T, # Add plot to previous one!
col = gray(1, .5)
0 100 200 300 400 500 )
figure 37)
10
If you want to plot two histograms on the same plot, for example,
5
to show the distributions of two different groups, you can use the add
0
= T argument to the second plot. See Figure 38 to see this in action. 0 100 200 300 400 500
Weight
groups. The primary argument to a barplot is height: a vector of # Step 2: Plot the result
barplot(height = avg.weights$weight,
names.arg = avg.weights$Time,
numeric values which will generate the height of each bar. To add xlab = "Week",
ylab = "Average Weight",
names below the bars, use the names.arg argument. For additional ar- main = "Average Chicken Weights by Time",
col = "yellow",
guments specific to barplot(), look at the help menu with ?barplot: border = "gray")
200
main = "Example Barplot",
150
xlab = "Group",
Average Weight
ylab = "Height"
100
)
50
Example Barplot
0
0 2 4 6 8 10 12 14 16 18 20 21
5
Week
subset = Diet == 2,
FUN = mean)
barplot(height = d2.weights$weight,
names.arg = d2.weights$Time,
xlab = "Week",
ylab = "Average Weight",
main = "Average Chicken Weights by Time",
3
FUN = mean)
barplot(height = d1.weights$weight,
add = T, # Add to existing plot!
col = transparent("blue", .5),
border = NA)
1
150
G1 G2 G3 G4 G5
Average Weight
100
Group
50
Of course, you should plot more interesting data than just a vector
of integers with a barplot. In the margin figure, I create a barplot
0
with the average weight of chickens for each time point with yel- 0 2 4 6 8 10
Week
12 14 16 18 20 21
low bars. If you want to plot different colored bars from different
Figure 39: Stacked barplots by adding
datasets, create one normal barplot, then create a second barplot with an additional barplot with the add = T
the add = T argument. In Figure 39, I plotted the average weights for argument.
Clustered barplot
If you want to create a clustered barplot, with different bars for
different groups of data, you can enter a matrix as the argument to
height. R will then plot each column of the matrix as a separate set
of bars. For example, lets say I conducted an experiment where I
compared how fast pirates can swim under four conditions: Wearing
clothes versus being naked, and while being chased by a shark versus
not being chased by a shark. Lets say I conducted this experiment
and calculated the following average swimming speed:
swim.data
## Naked Clothed
## No Shark 2.1 1.5
## Shark 3.0 3.0
barplot(height = swim.data,
beside = T, # Put the bars next to each other
legend.text = T, # Add a legend
col = c(transparent("green", .7),
transparent("red", .7)),
main = "Swimming Speed Experiment",
ylab = "Speed (in meters / second)",
xlab = "Clothing Condition",
ylim = c(0, 3.5))
10: plotting: part 1 125
2.0
1.5
1.0
0.5
0.0
Naked Clothed
Clothing Condition
126 yarrr! the pirates guide to r
bean.o = 0, # No bean
lubitel
pancake
cake
emo
point.o = .2,
ipod
ghostbusters
ohbrother
usualsuspects
nemo
trans = 0
500
450
400
350
300
ge
w
n
n
1
ee
ow
llo
ue
an
Weight
gr
ye
bl
br
or
250
200
150
100
50
0
0 2 4 6 8 10 12 14 16 18 20 21
Time
ToothGrowth Data
50
45
40
35
30
Tooth Length
25
20
15
10
0.5 1 2 0.5 1 2
supp = VC supp = OJ
Dose
400
300
300
Weight
Weight
200
200
100
100
0
0
1 2 3 4 1 2 3 4
Diet Diet
400
300
300
Weight
Weight
200
200
100
100
0
1 2 3 4 1 2 3 4
Diet Diet
40
20
0
0 20 40 60 80 100
X Label
10: plotting: part 1 131
Now, Ill add points for the female data with points:
20
Number of Tattoos
Number of Tattoos
15
15
10
10
5
5
0
10 15 20 25 30 35 40 45 10 15 20 25 30 35 40 45
Age Age
132 yarrr! the pirates guide to r
will add a line across the entire plot, while segments() will add a line # Add horizontal and vertical gridlines at 0, 10, ... 100
abline(h = seq(from = 0, to = 100, by = 10),
with defined starting and end points. v = seq(from = 0, to = 100, by = 10))
100
80
# Create a blank plot.
60
plot(1, xlim = c(0, 100), ylim = c(0, 100), type = "n")
40
20
# Add horizontal and vertical gridlines at 0, 10, ... 100
abline(h = seq(from = 0, to = 100, by = 10),
0
v = seq(from = 0, to = 100, by = 10)) 0 20 40 60 80 100
Index
To change the look of your lines, use the lty argument, which Figure 42: Adding gridlines with
changes the type of line (see Figure ), lwd, which changes its thick- abline()
ness, and col which changes its color. The following code will use
some of these arguments to create Figure 43: plot(1, xlim = c(0, 100), ylim = c(0, 100), type = "n")
)
0
0 20 40 60 80 100
The segments() function works very similarly to abline() how- Index
ever, with the segments() function, you specify the beginning and
Figure 43: Adding thin, gray, dashed
end points of the segments. See the help menu (?segments) for de- gridlines using the lty, col, and lwd
tails. arguments to abline()
Once youve created a plot with gridlines using abline(), you can
lty = ...
then add symbols with points()!
1 2 3 4 5 6
With text(), you can add text to a plot. You can use text() to high- labels: The text to plot
light specific points of interest in the plot, or to add information (like cex: The size of the text
a third variable) for every point in a plot. Ive highlighted some of adj: How should the text be justi-
fied? 0 = left justified, .5 = centered,
the key arguments to text() in Figure 45 1 = right justified
For example, the following code adds the three words "Put", "Text", pos: The position of the text relative
and "Here" at the coordinates (1, 9), (5, 5), and (9, 1) respectively. See to the x and y coordinates. Values
of 1, 2, 3, and 4 put the text below,
Figure 46 for the plot: to the left, above, and to the right of
the x-y coordinates respectively.
plot(1, xlim = c(0, 10), ylim = c(0, 10), type = "n") Figure 45: Main arguments to the
text() function
You can do some cool things with text(). Ill create a scatterplot
of data, and add data labels above each point:
10
height <- c(156, 175, 160, 172, 159) Put
weight <- c(65, 74, 69, 72, 66)
8
id <- c("p1", "p2", "p3", "p4", "p5")
6
text
4
plot(x = height, y = weight,
xlim = c(155, 180), ylim = c(65, 80))
2
here
0
text(x = height, y = weight,
labels = id, pos = 3) 0 2 4 6 8 10
Index
p2
75
weight
p4
p3
70
p5
p1
65
height
134 yarrr! the pirates guide to r
abline(h = 1, lty = 2)
# Text with \n breaks
text(x = 1, y = .92, labels = "Text with \\n", font = 2)
data. For example, you may want to include the text "Mean = 3.14"
1.4
Text without \n
in a plot to show that the mean of the data is 3.14. But how can we
1.2
Haikus are easy. But sometimes they don't make sense. Refrigerator
combine numerical data with text? In R, we can do this with the
paste() function:
1.0
Text with \n
The paste function will be helpful to you anytime you want to
0.8
Haikus are easy
combine either multiple strings, or text and strings together. For But sometimes they don't make sense
Refrigerator
0.6
example, lets say you want to write text in a plot that says The mean
0.6 0.8 1.0 1.2 1.4
of these data are XXX, where XXX is replaced by the group mean.
To do this, just include the main text and the object referring to the
Figure 47: Using the "\n" tag to plot
numerical mean as arguments to paste():
text on separate lines.
When you include descriptive statistics
data <- ChickWeight$weight in a plot, you will almost always want
mean(data) to use the round(x, digits) function
to reduce the number of digits in the
statistic.
## [1] 121.8183
Lets create a plot with labels using the paste() function. Well
plot the chicken weights over time, and add text to the plot specify-
ing the overall mean and standard deviations of weights.
# Add text
10: plotting: part 1 135
text(x = 0,
y = 300,
labels = paste("Mean weight = ",
round(mean(ChickWeight$weight), 2),
"\nStandard deviation = ",
round(sd(ChickWeight$weight), 2),
sep = ""),
adj = 0)
Chicken Weights
350
ChickWeight$weight
0 5 10 15 20
ChickWeight$Time
136 yarrr! the pirates guide to r
from, to
4
The starting (from) and ending (to) value of x to be plotted.
2
0
add
2
x^2
x^.5
A logical value indicating whether or not to add the curve to an
4
sin(x)
dnorm(x, 2, .2
existing plot. If add = FALSE, then curve() will act like a high-
level plotting function and create a new plot. If add = TRUE, then 4 2 0 2 4
For example, to add the function x2 to a plot from the x-values -10
to 10, you can run the code:
If you want to add a custom function to a plot, you can define the
function and then use that function name as the argument to expr.
For example, to plot the normal distribution with a mean of 10 and
standard deviation of 3, you can use this code:
2
right corner of the plot.
0
labels
2
A string vector specifying the text in the legend. For example, Females
Males
legend = c("Males, "Females") will create two groups with
4
names Males and Females. 2 1 0 1 2
x
pch, lty, lwd, col, pt.bg, ...
Figure 49: Creating a legend labeling
Additional arguments specifying symbol types (pch), line types the symbol types from different groups
(lty), line widths (lwd), background color of symbol types 21
through 25 ((pt.bg)) and several other optional arguments. See
?legend for a complete list
tional elements to your plots. Here are some I use. To see examples text(25, 95, labels = "rect()")
of how to use each one, check out their associated help menus. rect(xleft = 10, ybottom = 70,
xright = 40, ytop = 90, lwd = 2, col = "coral")
Add a polygon to a plot at coordinates specified by vectors x and y. text(75, 30, labels = "arrows()")
Additional arguments such as col, border change the color of the arrows(x0 = runif(3, 60, 90),
y0 = runif(3, 10, 25),
x1 = runif(3, 60, 90),
inside and border of the polygon y1 = runif(3, 10, 25),
length = .1, lwd = 2
)
segments(), arrows()
Add segments (lines with fixed endings), or arrows to a plot. Adding simple figures to a plot
mtext()
Add text to the margins of a plot. Look at the help menu for
mtext() to see parameters for this function.
10: plotting: part 1 139
file
The name and file destination of the final plot entered as a string.
For example, to put a plot on my desktop, Id write file =
"/Users/nphillips/Desktop/plot.pdf" when creating a pdf, and
file = "/Users/nphillips/Desktop/plot.jpg" when creating a
jpeg.
width, height
The width and height of the final plot in inches.
family()
An optional name of the font family to use for the plot. For exam-
ple, family = "Helvetica" will use the Helvetica font for all text
(assuming you have Helvetica on your system). For more help on
using different fonts, look at section "Using extra fonts in R" in
Chapter XX
dev.off()
This is not an argument to pdf() and jpeg(). You just need to ex-
ecute this code after creating the plot to finish creating the image
file (see examples below).
Youll notice that after you close the plot with dev.off(), youll see
a message in the prompt like "null device".
Using the command pdf() will save the file as a pdf. If you use
jpeg(), it will be saved as a jpeg.
10: plotting: part 1 141
For the following exercises, youll use datasets from the yarrr pack-
age. Make sure to install and load the package
15
10
5
0
Ship
50
45
40
35
30
Beard
25
20
15
10
Ship
3.
10
5
0
15 20 25 30 35 40 45
Age
10: Plotting: Part Deux
Advanced colors
survey data and plotted income on the x-axis and happiness on the
y-axis of a scatterplot, you could determine the darkness of each
point as a function of a third quantitative variable (such as number of
children or amount of travel time to work). I plotted an example of
80
this in Figure 50.
60
happiness
40
20
0
30 40 50 60 70
income
pirate palettes
trans = 0
basel
southpark
appletv
drugs
google
info2
info
ge
n2
is
n
op
e1
e2
nk
ee
uo
llo
d
an
ee
up
po
re
u
pi
gr
rq
ye
bl
bl
or
gr
tu
rat
nemo
cars
bugs
brave
lubitel
pancake
cake # Save the South Park palette
emo sp.cols <- piratepal(palette = "southpark",
provoking action = "return")
goldfish
espresso
# Create a blank plot
ipod
ghostbusters
plot(1, xlim = c(0, 6), ylim = c(0, 2),
ohbrother bty = "n", type = "n")
usualsuspects
toystory # Add points
malcovich points(x = 1:5, y = rep(1, 5),
evildead
pch = c(21, 22, 23, 24, 25),
eternal
memento
bg = sp.cols, # Use the South Park Colors
monalisa col = "white",
cex = 5)
Once you find a color palette you like, you can save the colors as a
vector by setting the action argument to "return", and assigning the
2.0
require("RColorBrewer")
display.brewer.all()
YlOrRd
YlOrBr
YlGnBu
YlGn
Reds
RdPu
Purples
PuRd
PuBuGn
PuBu
OrRd
Oranges
Greys
Greens
GnBu
BuPu
BuGn
Blues
Set3
Set2
Set1
Pastel2
Pastel1
Paired
Dark2
Accent
Spectral
RdYlGn
RdYlBu
RdGy
RdBu
PuOr
PRGn
PiYG
BrBG
I know the results look like gibberish, but trust me, R will inter-
pret them as the colors in the palette. Once you store the output of
the brewer.pal() function as a vector (something like my.colors),
148 yarrr! the pirates guide to r
you can then use this vector as an argument for the colors in your
library("RColorBrewer")
plot. library("circlize")
# Create Data
drinks <- sample(1:30, size = 100, replace = T)
smokes <- sample(1:30, size = 100, replace = T)
Numerically defined color gradients with colorRamp2 risk <- 1 / (1 + exp(-drinks / 20 + rnorm(100, mean = 0, sd = 1)))
package that creates that really cool chordDiagram from Chapter 1). # Set up plot layout
layout(mat = matrix(c(1, 2), nrow = 2, ncol = 1),
heights = c(2.5, 5), widths = 4)
The colorRamp2 function allows you to easily generate shades of
# Top Plot
colors based on numerical data. par(mar = c(4, 4, 2, 1))
plot(1, xlim = c(-.5, 31.5), ylim = c(0, .3),
The best way to explain how colorRamp2 works is by giving you type = "n", xlab = "Cigarette Packs",
yaxt = "n", ylab = "", bty = "n",
main = "colorRamp2 Example")
an example. Lets say that you want to want to plot data showing the
segments(x0 = c(0, 15, 30),
relationship between the number of drinks someone has on average y0 = rep(0, 3),
x1 = c(0, 15, 30),
y1 = rep(.1, 3),
per week and the resulting risk of some adverse health effect. Further, lty = 2)
lets say you want to color the points as a function of the number points(x = 0:30,
y = rep(.1, 31), pch = 16,
of packs of cigarettes per week that person smokes, where a value col = smoking.colors(0:30))
Moreover, you want the values in between these break points of 0, 10 # Bottom Plot
par(mar = c(4, 4, 5, 1))
and 30 to be a mix of the colors. For example, the value of 5 (half way plot(x = drinks, y = risk, col = smoking.colors(smokes),
pch = 16, cex = 1.2, main = "Plot of (Made-up) Data",
xlab = "Drinks", ylab = "Risk")
between 0 and 10) should be an equal mix of Blue and Orange.
mtext(text = "Point color indicates smoking rate", line = .5, side = 3)
colorRamp2 allows you to do exactly this. The function has three
arguments: colorRamp2 Example
another function that you can then use to generate colors. Once you
store the resulting function as an object (something like my.color.fun
0.7
Risk
You can then apply this new function on numerical data (in our
0.5
For example, lets create the color ramp function for our smoking 0 5 10 15 20 25 30
data points. Ill use colorRamp2 to create a function that Ill call Drinks
smoking.colors which takes a number as an argument, and returns
the corresponding color:
## [1] "#0000FFB2"
## [1] "#FF8200B2"
To see this function in action, check out the the margin Figure for
an example, and check out the help menu ?colorRamp2 for more
information and examples.
G in Google is R: 19, G: 72, B: 206. Using this method, I figured out text(3.5, .7, "Look familiar?", cex = 1.5)
the four colors of Google! Check out the margin Figure for the grand
result.
Look familiar?
par(new = T)
You can adjust the size of the margins by specifying a margin par(mar = rep(0, 4))
plot(1, xlim = c(0, 1), ylim = c(0, 1), type ="n",
parameter using the syntax par(mar = c(a, b, c, d)) before you main = "", bty = "n", xlab = "", ylab = "", xaxt = "n", yaxt = "n")
execute your first high-level plotting function, where a, b, c and d are rect(0, 0, 1, 1)
# Second Plot
2.0
par(mar = rep(6, 4)) # Set the margin on all sides to 6
1.5
plot(1:10)
1.0
mtext("Large Margins", side = 3, line = 1, cex = 1.2)
0.5
0.0
mar[3] mar[4]
Small Margins
1.0
10
2.0
Large Margins
2.0 1.0 0.0 0.5 1.0 1.5 2.0
8
mar[2]
10
8
6
1:10
6
4
4
2 4 6 8 10
2
Index
cant even see the axis labels, while in the second plot there is plenty
(probably too much) white space around the plotting region.
In addition to using the mar parameter, you can also specify mar-
gin sizes with the mai parameter. This acts just like mar except that
the values for mai set the margin size in inches.
10: plotting: part deux 151
Arranging multiple plots with par(mfrow) and layout par(mfrow = c(3, 3))
par(mar = rep(2.5, 4))
R makes it easy to arrange multiple plots in the same plotting space. for(i in 1:9) { # Loop across plots
The most common ways to do this is with the par(mfrow) parameter, # Generate data
x <- rnorm(100)
y <- x + rnorm(100)
and the layout() function. Lets go over each in turn:
# Plot data
plot(x, y, xlim = c(-2, 2), ylim = c(-2, 2),
col.main = "gray",
pch = 16, col = gray(.0, alpha = .1),
Simple plot layouts with par(mfrow) and par(mfcol) xaxt = "n", yaxt = "n"
)
The mfrow and mfcol parameters allow you to create a matrix of plots # Add a regression line for fun
abline(lm(y ~ x), col = "gray", lty = 2)
par(mfrow = c(3, 3)) # Create a 3 x 3 plotting matrix # Create box around border
box(which = "figure", lty = 2)
}
When you execute this code, you wont see anything happen.
However, when you execute your first high-level plotting command,
2
1
1
youll see that the plot will show up in the space reserved for the first
1 2 3
0
y
y
1
1
plot (the top left). When you execute a second high-level plotting
2
2 1 0 1 2 2 1 0 1 2 2 1 0 1 2
command, R will place that plot in the second place in the plotting x x x
2
matrix - either the top middle (if using par(mfrow) or the left middle
1
4 5 6
0
y
y
(if using par(mfcol)). As you continue to add high-level plots, R will
1
2
2
continue to fill the plotting matrix. 2 1 0
x
1 2 2 1 0
x
1 2 2 1 0
x
1 2
2
1
1
only difference is that while par(mfrow) puts sequential plots into the 0
7 8 9
0
y
y
1
1
plotting matrix by row, par(mfcol) will fill them by column.
2
2
2 1 0 1 2 2 1 0 1 2 2 1 0 1 2
When you are finished using a plotting matrix, be sure to reset the Figure 52: A matrix of plotting regions
plotting parameter back to its default state: created by par(mfrow = c(3, 3))
heights: A vector of values for the heights of the rows of the layout.show(3)
plotting space.
## [,1] [,2]
Figure 53: A plotting layout created
## [1,] 0 3 by setting a layout matrix and specific
## [2,] 2 1 heights and widths.
layout(mat = layout.matrix,
heights = c(1, 2), # Heights of the two rows
widths = c(2, 1) # Widths of the two columns
)
# Plot 1: Scatterplot
par(mar = c(5, 4, 0, 0))
plot(x.vals, y.vals)
10: plotting: part deux 153
# Plot 2: X boxplot
par(mar = c(0, 4, 0, 0))
boxplot(x.vals, xaxt = "n",
yaxt = "n", bty = "n", yaxt = "n",
col = "white", frame = F, horizontal = T)
# Plot 3: Y boxplot
par(mar = c(5, 0, 0, 0))
boxplot(y.vals, xaxt = "n",
yaxt = "n", bty = "n", yaxt = "n",
col = "white", frame = F)
120
100
y.vals
80
60
x.vals
154 yarrr! the pirates guide to r
Additional Tips
rect(1:n.data + rect.space / 2,
rep(0, n.data),
1:n.data + rect.width + rect.space / 2,
parrot.data$male,
col = male.col, border = NA
See Figure 54 for a nicer example. )
code par() to print the default parameter values for all plotting mtext("Source: Drunken survey on 22 May 2015", side = 1,
at = 0, adj = 0, line = 3, font = 3, col = "white")
parameters to the console. dev.off()
## pdf
## 2
#embed_fonts("/Users/nphillips/Dropbox/Git/YaRrr_Book/media/parrothelvetica.pdf") #
150
100
50
Do we get more treasure from chests buried in the sand or at the bot-
tom of the ocean? Is there a relationship between the number of scars
a pirate has and how much grogg he can drink? Are pirates with
body piercings more likely to wear bandannas than those without
body piercings? Glad you asked, in this chapter, well answer these
questions using 1 and 2 sample frequentist hypothesis tests.
As this is a Pirates Guide to R, and not a Pirates Guide to Statis-
tics, we wont cover all the theoretical background behind frequentist
null hypothesis tests (like t-tests) in much detail. However, its impor-
tant to cover three main concepts: Descriptive statistics, Test statistics,
and p-values. To do this, lets talk about body piercings.
As you may know, pirates are quite fond of body piercings. Both Body Piercing Survey
9
piercings than American pirates. But is this true? To answer this, I 8
7
conducted a survey where I asked 10 American and 10 European 6
pirates how many body piercings they had. The results are below, 5
4
and a Pirateplot of the data is in Figure 55: 3
2
1
american.bp <- c(3, 5, 2, 1, 4, 4, 6, 3, 5, 4) American European
In null hypothesis tests, you always start with a null hypothesis. The
specific null hypothesis you choose will depend on the type of ques-
tion you are asking, but in general, the null hypothesis states that
156 yarrr! the pirates guide to r
nothing is going on and everything is the same. For example, in our body
piercing study, our null hypothesis is that American and European Null Hypothesis: Everything is the
same
pirates have the same number of body piercings on average.
Alternative Hypothesis: Everything is
The alternative hypothesis is the opposite of the null hypothesis. In not the same
this case, our alternative hypothesis is that American and European
pirates do not have the same number of piercings on average. We
can have different types of alternative hypotheses depending on how
specific we want to be about our prediction. We can make a 1-sided
(also called 1-tailed) hypothesis, by predicting the direction of the
difference between American and European pirates. For example,
our alternative hypothesis could be that European pirates have more
piercings on average than American pirates.
Alternatively, we could make a 2-sided (also called 2-tailed) alterna-
tive hypothesis that American and European pirates simply differ in
their average number of piercings, without stating which group has
more piercings than the other.
Once weve stated our null and alternative hypotheses, we collect
data and then calculate descriptive statistics.
Descriptive Statistics
Descriptive statistics describe samples of data. For example, a mean,
median, or standard deviation of a dataset is a descriptive statistic of
that dataset. Lets calculate some descriptive statistics on our body
piercing survey American and European pirates using the summary()
function:
summary(american.bp)
summary(european.bp)
Well, it looks like our sample of 10 American pirates had 3.7 body
piercings on average, while our sample of 10 European pirates had
7.3 piercings on average. But is this difference large or small? Are we
justified in concluding that American and European pirates in general
differ in how many body piercings they have? To answer this, we
need to calculate a test statistic
11: inferential statistics: 1 and 2-sample null-hypothesis tests 157
Test Statistics
An test statistic compares descriptive statistics, and determines how
different they are. The formula you use to calculate a test statistics
depends the type of test you are conducting, which depends on
many factors, from the scale of the data (i.e.; is it nominal or inter-
val?), to how it was collected (i.e.; was the data collected from the
same person over time or were they all different people?), to how its
distributed (i.e.; is it bell-shaped or highly skewed?).
For now, I can tell you that the type of data we are analyzing calls
for a two-sample T-test. This test will take the descriptive statistics
from our study, and return a test-statistic we can then use to make a
decision about whether American and European pirates really differ.
To calculate a test statistic from a two-sample t-test, we can use the
t.test() function in R. Dont worry if its confusing for now, well go
through it in detail shortly.
I can get the test statistic from my new bp.test object by using the
$ operator as follows:
bp.test$statistic
## t
## -5.676354
p-value
The p-value is a probability that reflects how consistent the test
statistic is with the hypothesis that the groups are actually the same.
bp.test$p.value
## [1] 2.304418e-05
The p-value we got was .000022, this means that, assuming the
two populations of American and European pirates have the same
number of body piercings on average, the probability that we would
obtain a test statistic as large as 5.68 is around 0.0022%. This is very
small in other words, its close to impossible. Because our p-value
is so small, we conclude that the two populations must not be the
same.
Does the p-value tell us the probability that the null hypothesis is true?
No. The p-value does not tell you the probability that the null
hypothesis is true. In other words, if you calculate a p-value of .04,
this does not mean that the probability that the null hypothesis is
true is 4%. Rather, it means that if the null hypothesis was true, the
probability of obtaining the result you got is 4%. Now, this does
indeed set off our bullshit detector, but again, it does not mean that
the probability that the null hypothesis is true is 4%.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 159
mean(No EP)
= 9.41
0 5 10 15 20 0 5 10 15 20
## 9.362
t.test(x = pirates$tattoos,
mu = 9.3) #Null: Mean is 9.3
##
## One Sample t-test
##
## data: pirates$tattoos
## t = 0.579, df = 999, p-value = 0.5627
## alternative hypothesis: true mean is not equal to 9.3
## 95 percent confidence interval:
## 9.15187 9.57213
## sample estimates:
## mean of x
## 9.362
Looks good. The test statistic decreased to just 0.58, and the p-
value increased to 0.56. In other words, our sample mean of 9.36 is
very consistent with the hypothesis that the true population mean is
9.30.
Two-sample t-test
In a two-sample t-test, you compare the means of two groups of data
and test whether or not they are the same. For example, if someone
said that pirates who wear eye patches have fewer tattoos on average
than those who dont wear eye patches, we could test this with a
two-sample t-test. To do a two-sample t-test, enter the dependent
variable (e.g.; tattoos) and independent variable (e.g.; eye patch) as a
formula in the form dv iv, and specify the dataset containing the
data in data:
##
## Welch Two Sample t-test
##
162 yarrr! the pirates guide to r
To see all the information contained in the test object, use the
names() function
names(tattoo.eyepatch.test)
Once you know the names of whats in the object, you can access
specific values by name using the $ operator (just like in dataframes).
In the code below, Ill access just the p-value and test statistic from
my new tattoo.eyepatch.test object:
## [1] 0.7684595
## t
## 0.2945193
11: inferential statistics: 1 and 2-sample null-hypothesis tests 163
y
columns in a dataframe.
data The dataframe containing the
variables x and y
alternative A string indicating the direction of
the test. t stands for two-sided, l
To run a correlation test, use the cor.test() function. Lets con- stands for less than, and g stands
duct a correlation test on the pirates dataset to see if there is a for greater than.
relationship between a pirates age and number of parrots theyve method A string indicating which corre-
lation coefficient to calculate and
had in their lifetime: test. pearson (the default) stands
for Pearson, while kendall and
# Is there a correlation between a pirate's age and spearman stand for Kendall and
Spearman correlations respectively.
# the number of parrots (s)he's owned?
conf.level The confidence level of the interval.
subset A vector specifying a subset of
age.parrots.test <- cor.test(formula = ~ age + parrots, observations to use.
data = pirates)
age.parrots.test
##
## Pearson's product-moment correlation
##
## data: age and parrots
## t = 7.4153, df = 998, p-value = 2.588e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1689151 0.2864504
## sample estimates:
## cor
## 0.2285152
11: inferential statistics: 1 and 2-sample null-hypothesis tests 165
names(age.parrots.test)
Just like the t.test() function, we can use the subset argument
in the cor.test() function to conduct a test on a subset of the entire
dataframe. For example, to run the same correlation test between
a pirates age and the number of parrots shes owned, but only for
female pirates, I can add the subset = sex == "female" argument:
##
## Pearson's product-moment correlation
##
## data: age and parrots
## t = 4.0423, df = 489, p-value = 6.147e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09280611 0.26410932
## sample estimates:
## cor
## 0.1798206
166 yarrr! the pirates guide to r
Chi-square test
number of pirates from each college. Here is a table of the college table(pirates$college)
data:
##
## CCCC JSSFP
college frequency ## 667 333
CCCC 667
JSSFP 333
Just by looking at the table, it looks like pirates are much more
likely to come from Captain Chunks Cannon Crew (CCCC) than Jack
Sparrows School of Fashion and Piratery (JSSFP). For this reason,
we should expect a very large test statistic and a very small p-value.
Lets test it using the chisq.test() function. As the main argument
to the chisq.test() function, you enter a table of the data using the Dont forget that you should enter your
table() function rather than the original vector. data into the chisq.test() as a table
using the table() function!
chisq.test(x = table(pirates$college))
##
## Chi-squared test for given probabilities
##
## data: table(pirates$college)
## X-squared = 111.56, df = 1, p-value < 2.2e-16
table(pirates$eyepatch,
pirates$favorite.pirate)
chisq.test(x = table(pirates$college,
pirates$eyepatch))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pirates$college, pirates$eyepatch)
## X-squared = 2.0729, df = 1, p-value = 0.1499
##
## Welch Two Sample t-test
##
## data: age by headband
## t = 1.3542, df = 116.92, p-value = 0.1783
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3834475 2.0416673
## sample estimates:
## mean in group no mean in group yes
## 28.14286 27.31375
As you can see, the apa function got the values we wanted and
reported them in proper APA style. The apa function will even
automatically adapt the output for Chi-Square and correlation tests
if you enter such a test object. Lets see how it works on a correlation
test where we correlate a pirates age with the number of parrots she
has owned:
The apa function has a few optional arguments that control things
like the number of significant digits in the output, and the number of
tails in the test. Run ?apa to see all the options.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 171
The following questions are based on data from either the movies
or the pirates dataset in the yarrr package. Make sure to load the
package first to get access to the data!
150
Cleaning Time (minutes)
125
100
75
50
25
0
a b c a b c
type = parrot type = shark
Cleaner
174 yarrr! the pirates guide to r
Between-Subjects ANOVA
There are many types of ANOVAs that depend on the type of data
you are analyzing. In fact, there are so many types of ANOVAs that
there are entire books explaining differences between one type and
another. For this book, well cover just one type of ANOVAs called
full-factorial, between-subjects ANOVAs. These are the simplest types of
ANOVAs which are used to analyze a standard experimental design.
In a full-factorial, between-subjects ANOVA, participants (aka, source
of data) are randomly assigned to a unique combination of factors
where a combination of factors means a specific experimental
condition26 26
For example, consider a psychology
study comparing the effects of caffeine
For the rest of this chapter, I will refer to full-factorial between-
on cognitive performance. The study
subjects ANOVAs as standard ANOVAs could have two independent variables:
drink type (soda vs. coffee vs. energy
drink), and drink dose (.25l, .5l, 1l). In a
What does ANOVA stand for? full-factorial design, each participant in
the study would be randomly assigned
ANOVA stands for "Analysis of variance." At first glance, this sounds to one drink type and one drink dose
like a strange name to give to a test that you use to find differences condition. In this design, there would
be 3 x 3 = 9 conditions.
in means, not differences in variances. However, ANOVA actually uses
variances to determine whether or not there are real differences in
12: anova and factorial designs 175
1 2 3 3
2
1 3
2
1
Significant ANOVA
Between var LARGE compared to Within var
Between Within
100
90
80
70
data
60
Between Within
50 Variability Variability
40
30
20
10
0
1 2 3 3
2
1 3
2
1
176 yarrr! the pirates guide to r
You conduct a one-way ANOVA when you testing one factor inde-
pendent variable and are ignoring all other possible variables. Lets
use the poopdeck data and do a one-way ANOVA with the cleaner
type as the independent variable, and the cleaning time as the depen-
dent variable.
Step 1: Create an ANOVA object from the regression object with aov()
First, well create an ANOVA object with aov. Because cleaner is
the independent variable, and time is the dependent variable, well
include the argument formula = time cleaner
summary(cleaner.aov)
The main result from our table is that we have a significant effect
of cleaner on cleaning time (F(2, 597) = 5.29, p = 0.005). However,
the ANOVA table does not tell us which levels of the independent
variable differ. In other words, we dont know which cleaner is better
than which. To answer this,w e need to conduct a post-hoc test.
# Show summary
summary(cleaner.lm)
##
## Call:
## lm(formula = time ~ cleaner, data = poopdeck)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.02 -16.60 -1.05 16.92 71.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.020 1.691 39.037 < 2e-16 ***
## cleanerb -0.420 2.392 -0.176 0.86066
## cleanerc -6.940 2.392 -2.902 0.00385 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.92 on 597 degrees of freedom
## Multiple R-squared: 0.01743,Adjusted R-squared: 0.01413
## F-statistic: 5.294 on 2 and 597 DF, p-value: 0.005261
As you can see, the regression table does not give us tests for each
variable like the ANOVA table does. Instead, it tells us how different
each level of an independent variable is from a default value. You can
tell which value of an independent variable is the default variable
just by seeing which value is missing from the table. In this case, I
dont see a coefficient for cleaner a, so that must be the default value.
The intercept in the table tells us the mean of the default value. In
12: anova and factorial designs 179
this case, the mean time of cleaner a was 66.02. The coefficients for
the other levels tell us that cleaner b is, on average 0.42 minutes faster
than cleaner a, and cleaner c is on average 6.94 minutes faster than
cleaner a. Not surprisingly, these are the same differences we saw in
the Tukey HSD test!
Interactions between variables test whether or not the effect of one summary(cleaner.type.lm)
variable depends on another variable. For example, we could use an
interaction to answer the question: Does the effect of cleaners depend on ##
## Call:
## lm(formula = time ~ cleaner + type, data = poopdeck)
the type of poop they are used to clean? To include interaction terms in ##
## Residuals:
an ANOVA, just use an asterix (*) instead of the plus (+) between the ## Min 1Q Median 3Q Max
## -59.743 -13.792 -0.683 13.583 83.583
terms in your formula29 . ##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
Lets repeat our previous ANOVA with two independent variables, ## (Intercept) 54.357 1.705 31.881 < 2e-16 ***
## cleanerb -0.420 2.088 -0.201 0.840665
but now well include the interaction between cleaner and type. To ## cleanerc -6.940 2.088 -3.323 0.000944 ***
## typeshark 23.327 1.705 13.681 < 2e-16 ***
do this, well set the formula to time cleaner * type. ## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.88 on 596 degrees of freedom
# Step 1: Create ANOVA object ##
##
Multiple R-squared: 0.2523,Adjusted R-squared: 0.2485
F-statistic: 67.02 on 3 and 596 DF, p-value: < 2.2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(cleaner.type.int.lm)
12: anova and factorial designs 181
##
## Call:
## lm(formula = time ~ cleaner * type, data = poopdeck)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.28 -12.83 -0.08 12.29 74.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.760 1.967 23.259 < 2e-16 ***
## cleanerb 8.060 2.782 2.897 0.003908 **
## cleanerc 10.370 2.782 3.727 0.000212 ***
## typeshark 40.520 2.782 14.563 < 2e-16 ***
## cleanerb:typeshark -16.960 3.935 -4.310 1.91e-05 ***
## cleanerc:typeshark -34.620 3.935 -8.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.67 on 594 degrees of freedom
## Multiple R-squared: 0.3385,Adjusted R-squared: 0.3329
## F-statistic: 60.79 on 5 and 594 DF, p-value: < 2.2e-16
Additional tips
names(cleaner.type.int.aov)
# Add the fits for the interaction model to the dataframe as int.fit
# Add the fits for the main effects model to the dataframe as me.fit
Now lets look at the first few rows in the table to see the fits for
the first few observations.
head(poopdeck)
You can use these fits to see how well (or poorly) the model(s)
12: anova and factorial designs 183
were able to fit the data. For example, we can calculate how far each
models fits were from the true data as follows:
# How far were the interaction model fits from the data on average? We can also see how far the mod-
els were on average from the data
separately for each condition using
mean(abs(poopdeck$int.fit - poopdeck$time))
dplyr.
# How far were the main effect model fits from the data on average? poopdeck %>% group_by(cleaner, type) %>%
summarise(
int.fit.err = mean(abs(int.fit - time)),
mean(abs(poopdeck$me.fit - poopdeck$time)) me.fit.err = mean(abs(me.fit - time))
)
## [1] 16.5351
## Source: local data frame [6 x 4]
## Groups: cleaner [?]
As you can see, the interaction model was off from the data by ##
## cleaner type int.fit.err me.fit.err
15.35 minutes on average, while the main effects model was off from
## (chr) (chr) (dbl) (dbl)
the data by 16.54 on average. This is not surprising as the interaction ## 1 a parrot 14.2752 16.14407
model is more complex than the main effects only model. However, ## 2 a shark 15.4944 17.36033
## 3 b parrot 13.9236 13.92127
just because the interaction model is better at fitting the data doesnt ## 4 b shark 18.4552 18.45053
necessarily mean that the interaction is either meaningful or reliable. ## 5 c parrot 15.4926 17.27167
## 6 c shark 14.4694 16.06273
Repeated measures (linear mixed-effects) ANOVA using the lm4 packge The results show that the interaction
model had better fits (i.e.; lower errors)
If you are conducting an analyses where youre repeating measure- for virtually every condition
ments over one or more third variables, like giving the same partici-
pant different tests, you should do a mixed-effects regression analysis.
To do this, you should use the lmer function in the lme4 package.
For example, in our poopdeck data, we have repeated measurements
for days. That is, on each day, we had 6 measurements. Now, its
possible that the overall cleaning times differed depending on the day.
We can account for this by including random intercepts for day by
adding the (1|day) term to the formula specification30 . 30
For more tips on mixed-
effects analyses, check out
this great tutorial by Bodo Winter
# install.packages(lme4) # If you don't have the package already http://www.bodowinter.com/tutorial/bw_LME_tutorial2.p
library(lme4)
For the following questions, use the pirates dataframe in the yarrr
package
3. Now, repeat your analysis from the previous two questions, but
include both indpendent variables fav.pixar and favorite.pirate
in the ANOVA. Do your conclusions differ when you include both
variables?
The linear model is easily the most famous and widely used model
in all of statistics. Why? Because it can apply to so many interesting
research questions where you are trying to predict a continuous
variable of interest (the response or dependent variable) on the basis of
one or more other variables (the predictor or independent variables). Figure 60: Insert funny caption here.
The linear model takes the following form, where the x values
represent the predictors, while the beta values represent weights.
y = 0 + 1 x1 + 2 x2 + ... n xn
lm()
formula
A formula in a form y x1 + x2 + ..., where y is the dependent
variable, and x1, x2, ... are the independent variables. If you want
to include all columns (excluding y) as independent variables, just
enter y .
data
The dataframe containing the columns specified in the formula.
Well start with a simple example using a dataset in the yarrr package
called diamonds. The dataset includes data on 150 diamonds sold at
an auction. Here are the first few rows of the dataset
library(yarrr)
head(diamonds)
Here, we can see from the summary table that the model esti-
mated Int (the intercept), to be 148.34, weight to be 2.19, clarity to
be 21.69, and , color to be 0.45. You can see the full linear model in
Figure 61 below:
You can access lots of different aspects of the regression object. To
see whats inside, use names()
190 yarrr! the pirates guide to r
For example, to get the estimated coefficients from the model, just
acecss the coefficients attribute:
abline(b = 1, a = 0)
195
True Values
13: regression 191
Ill use the predict() function to predict the value of each of these
diamonds using the regression model diamond.lm that I created
before. The two main arguments to predict() are object the regres-
sion object weve already defined), and newdata the dataframe of
new data: The dataframe that you use in the
newdata argument to predict() must
have column names equal to the names
# Create a dataframe of new diamond data
of the coefficients in the model. If the
diamonds.new <- data.frame(weight = c(12, 6, 5), names are different, the predict()
clarity = c(1.3, 1, 1.5), function wont know which column of
data applies to which coeffiicent
color = c(5, 2, 3))
## 1 2 3
## 200.5 182.3 190.5
This result tells us the the new diamonds are expected to have
values of 200.53, 182.25, and 190.46 respectively according to our
regression model.
# Print summary
summary(diamonds.int.lm)$coefficients
Hey that looks much better! Now we see that the main effects are
significant and the interaction is non-significant.
194 yarrr! the pirates guide to r
We can use standard regression with lm() when your dependent vari-
able is Normally distributed (more or less). When your dependent
variable does not follow a nice bell-shaped Normal distribution, you
need to use the Generalized Linear Model (GLM). the GLM is a more
general class of linear models that change the distribution of your
dependent variable. In other words, it allows you to use the linear
model even when your dependent variable isnt a normal bell-shape.
Here are 4 of the most common distributions you can can model with
glm():
Normal Poisson
Continuous, bellshaped Positive Integers
8.93 2
9.92 0
3
10.52 5
8.71 2
8.95 2
0
9.62 3
9.58 5
2
11.07
10.15
10.76
5 10 15 1 3 5 7 9
Binomial Gamma
Only 0s and 1s Continuous, positive
1 2.36
0 4.72
Frequency
0
1 0.62
0 1.95
1 2.52
0 2.36
0 4.72
1 0.62
0
1.95
2.52
0 1
0 2 4 6 8 10
13: regression 195
You can use the glm() function just like lm(). To specify the distri-
bution of the dependent variable, use the family argument.
glm()
family
One of the following strings, indicating the link function for the
general linear model
lifetime?" curve(logit.fun,
from = -3,
to = 3,
lwd = 2,
main = "Inverse Logit",
ylab = "p(y = 1)",
xlab = "Logit(x)"
)
variable. Well set family = binomial to tell glm() that the dependent
variable is binary. 3 2 1 0 1 2 3
Logit(x)
Just like with regular regression with lm(), we can get the fitted
values from the model and put them back into our dataset to see how
well the model fit the data: Looking at the first few observations, it
. looks like the probabilities match the
data pretty well. For example, the first
diamond with a value of 182.5 had a
# Add the fitted values from the GLM back to the data fitted probability of just 0.16 of being
diamonds$pred.g190 <- diamond.glm$fitted.values valued greater than 190. In contrast,
the second diamond, which had a
value of 191.2 had a much higher fitted
# Look at the first few rows (of the named columns) probability of 0.82.
head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")])
Just like we did with regular regression, you can use the predict()
function along with the results of a glm() object to predict new data.
Lets use the diamond.glm object to predict the probability that the
new diamonds will have a value greater than 190:
predict(object = diamond.glm,
newdata = diamonds.new)
13: regression 197
## 1 2 3
## 4.8605 -3.5265 -0.3898
What the heck, these dont look like probabilities! True, theyre
not. They are logit-transformed probabilities. To turn them back into
probabilities, we need to invert them by applying the inverse logit
function (see Figure ??).
## 1 2 3
## 0.99231 0.02857 0.40376
So, the model predicts that the probability that the three new
diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38%
respectively.
Once youve created a regression object with lm() or glm(), you can
summarize the results in an ANOVA table with aov():
Additional Tips
205
put the regression object you created with lm as the main argument
to abline(). For example, the following code will create the scat-
195
Value
terplot on the right (Figure 63) showing the relationship between a
185
diamonds weight and its value including a red regression line:
175
# Scatterplot of diamond weight and value
6 8 10 12 14
plot(x = diamonds$weight,
Weight
y = diamonds$value,
xlab = "Weight", Figure 63: Adding a regression line to a
scatterplot using abline()
ylab = "Value",
main = "Adding a regression line with abline()"
)
3000
For example, look at the distribution of movie revenues in the
Frequency
movies dataset in the margin Figure 64:
As you can see, these data dont look Normally distributed at all.
1000
There are a few movies (like Avatar) that just an obscene amount
0
of money, and many movies that made much less. If we want to 0 500 1500 2500
conduct a standard regression analysis on these data, we need to movies$revenue.all
create a new log-transformed version of the variable. In the following
Figure 64: Distribution of movie
code, Ill create a new variable called revenue.all.lt defined as the
revenues
logarithm of revenue.all
hist(movies$revenue.all.lt,
In Figure 65 you can see a histogram of the new log-transformed main = "Log-transformed Movie revenue")
variable. Its still skewed, but not nearly as badly as before, so I
would be ok using this variable in a standard regression analysis Logtransformed Movie revenue
with lm().
1000
Frequency
600
200
0
3 4 5 6 7 8
movies$revenue.all.lt
1. The column jbb is the "Jacks Blue Book" value of a ship. Create a
regression object called jbb.cannon.lm predicting the JBB value of
ships based on the number of cannons it has. Based on your result,
how much value does each additional cannon bring to a ship?
7. Using price.all.lm, predict the selling price of the 3 new ships cannons
12
rooms
34
age
43
condition
7
color
black
style
classic
8 26 54 3 black modern
displayed on the right in Figure 7 32 65 100 5 red modern
Figure 66: Data from 3 new ships about
8. Using price.all.blr, predict the probability that the three new to be auctioned.
ships will have a selling price greater than 3500.
14: Writing your own functions
Throughout this book, you have been using tons of functions either
built into base-R like mean(), hist(), t.test(), or written by other
people and saved in packages like pirateplot() and apa()in the
yarrr package. However, because R is a complete programming Figure 67: Functions. Theyre kind of a
language, you can easily write your own functions that perform big deal.
specific tasks you want. Boring standard histogram with default
For example, lets say you think the standard histograms made values of hist():
with hist() are pretty boring. Check out the top figure on the right hist(pirates$age,
for an example. Instead of using these boring plots, youd like to xlab = "Age",
easily create a version like the bottom figure just below it. This plot main = "Pirates' Ages")
not only has a more modern design, but it also includes statistical
information, like the data mean, standard deviation, and a 95% Pirates' Ages
way that youd like by adding customer parameter values like col,
Frequency
150
bg (etc.). But as you can see from the code above Figure , the plot
has exactly the same inputs as the boring histogram just above it! No
0 50
extra arguments necessary. How did I do that? Well as you may have
guessed, I wrote a custom function called my.hist() that contains all 10 15 20 25 30 35 40 45
the code necessary to build the plot. So now, anytime I want to make Age
a fancy histogram, I can just use the my.hist() function rather than
having to always write the raw code from scratch. Heres an advanced version made with
Of course, functions are limited to creating plots...oh no. You our new function my.hist()!
can write a function to do anything that you can program once in R. my.hist(pirates$age,
If theres anything you like to do repeatedly in R, you will almost xlab = "Age",
main = "Pirates' Ages")
certainly like to write your own custom function to perform that
action quickly and easily whenever youd like. In fact, thats all
functions really are, theyre just chunks of R code that are stored Pirates' Ages
Mean = 27.39 (95% CI [27.05, 27.74]), SD = 5.54
behind the scenes for you to use without having to see (or write) the
code yourself. Some of you reading this will quickly see how how
writing your own functions can save you tons of time. For those of
you who havent...trust me, this is a big deal.
10 15 20 25 30 35 40 45
Age
202 yarrr! the pirates guide to r
1. Name: What is the name of your function? You can give it any
valid object name. However, be careful not to use names of exist-
ing functions or R might get confused.
2. Inputs: What are the inputs to the function? Does it need a vector
of numeric data? Or some text? You can specify as many inputs as
you want.
4. Output: What do you want the code to return when its finished
with the actions? Should it return a scaler statistic? A vector of
data? A dataframe?
Heres how your function will look in R. Youll use two new
functions, called function() and return(). As you can see, you put Yes, you use functions to create func-
the function inputs as arguments to the function() function, and the tions! Very Inception-y
ACTIONS
return(OUTPUT)
A simple example
Lets create a very simple example of a function. Well create a func-
tion called my.mean that does the exact same thing as the mean()
function in R. This function will take a vector x as an argument, cre-
ates a new vector called output that is the mean of all the elements
14: writing your own functions 203
of x (by summing all the values in x and dividing by the length of x),
then return the output object to the user.
Try running the code above. When you do, nothing obvious hap- If you ever want to see the exact
pens. However, R has now stored the new function my.mean() in the code used to generate a function, you
can just call the name of the function
current working directory for later use. To use the function, we can without the parentheses. For example,
then just call our function like any other function in R. Lets call our to see the code underlying our new
function my.mean you can run the
new function on some data and make sure that it gives us the same following:
result as mean():
my.mean
mean(data)
## [1] 3.75
As you can see, our new function my.mean() gave the same result
as Rs built in mean() function! Obviously, this was a bit of a waste
of time as we simply recreated a built-in R function. But you get the
idea...
Now lets test our new function with a few different values for the output <- grogg * 1 + port * 3 + crabjuice * 10
inputs grogg, port, and crab juice. How much gold did Tamara, who return(output)
}
had had 10 mugs of grogg, 3 glasses of wine, and 0 shots of crab juice
spend? To use this function, the pirate will
enter the number of drinks she had as
oh.god.how.much.did.i.spend(grogg = 10, a single vector with length three rather
than as 3 separate scalers.
port = 3,
crabjuice = 0)
## [1] 19
Looks like Tamara spent 19 gold last night. Ok, now how about
Cosima, who didnt drink any grogg or port, but went a bit nuts on
the crab juice:
oh.god.how.much.did.i.spend(grogg = 0,
port = 0,
crabjuice = 7)
## [1] 70
Cosimas taste for crab juice set her back 70 gold pieces.
return(output)
}
Lets test the new version of our function with data from Hyejeong,
who had 5 glasses of port but no grogg or crab juice. Because 0 is the
default, we can just ignore these arguments:
If you have a default value for ev-
ery input, you can even call the
oh.god.how.much.did.i.spend(port = 5) function without specifying any
inputs R will set all of them to
## [1] 15 the default. For example, if we call
oh.god.how.much.did.i.spend with-
out specifying any inputs, R will set
Looks like Hyejeong only spent 15 by sticking with port. them all to 0 (which should make the
result 0).
Imagine a restaurant waiter who gives everyone crab juice no matter ## [1] 0
what they ordered. Or a guy who wears a tuxedo no matter if hes
going to a dance or a barbeque. Well, unlike these guys, a good
function needs to change what it does depending on what you give it.
That is, before serving someone crab juice, a good server() function
needs to first make sure that the order argument was actually crab
juice (order == "crabjuice").
Youve already used functions that change based on their inputs.
Consider the summary() function: If you evaluate the summary
function on a numeric vector, it will return summary statistics of that
vector. However, if you give the summary function a regression object,
it will return output specific to regression objects like estimates
of the beta values. In other words, before actually calculating and
displaying summary data, the summary() function first checks what
type of input it received, and then calculates summary information
specific to the type of input it is given.
Like all programming languages, R uses if-then statements to se-
lectively run code based on some criteria. The function for running
if-then statements is if() . The if() function has two main ele-
ments, a logical test and a chunk of conditional code (in curly braces).
The chunk of code in the curly braces is conditional because it is only
evaluated if the logical test is TRUE. If the logical test is FALSE, R will
completely ignore the conditional code.
For example, in the following code chunk, I will define an object a
206 yarrr! the pirates guide to r
and assign the value TRUE to it. I will then write an ifstatement that
checks if a is TRUE, and if so, prints the statement "a was true!":
a <- TRUE
a <- FALSE
if(a) {
if(x > 0) {output <- "WTF WAS THAT?! I am feed.me.negatives, not be.a.douchebag"}
return(output)
}
14: writing your own functions 207
feed.me.negatives(-2)
feed.me.negatives(10)
feed.me.negatives(0)
my.hist(x)
output <- "Ok! I hope you like the plot..."
}
if(what == "stats") {
output <- paste("Yarr! The mean of this data be ", round(mean(x), 2),
" and the standard deviation be ", round(sd(x), 2),
sep = ""
)
}
if(what == "tellmeajoke") {
return(output)
208 yarrr! the pirates guide to r
Lets test show.me() on the lengths of pirate beards. First, lets get
it to create a histogram by setting what = "plot"
Histogram of x
Mean = 9.82 (95% CI [9.19, 10.46]), SD = 10.24
0 10 20 30 40
Looks good! Now lets get the same function to tell us some
statistics about the data by setting what = "stats":
## [1] "Yarr! The mean of this data be 9.82 and the standard deviation be 10.24"
Phew that was exhausting, I need to hear a funny joke. Lets set
what = "tellmeajoke":
show.me(what = "tellmeajoke")
Storing and loading your functions to and from a function file with
source()
source(file = "Users/Nathaniel/Dropbox/Custom_Pirate_Functions.R")
Here are several tips and tricks that will help you to build good
functions:
not valid, you can return an error message to the user telling them
what is wrong.
Lets add a quality control check to the show.me function. Well
do this by creating a logical value called valid.input which is TRUE
when the input for what is valid, and FALSE when the input for what
is not valid. Then, well define a warning message in the case where
the input is not valid:
if(what == "plot") {
my.hist(x)
output <- "Ok here's your histogram!"
if(what == "stats") {
output <- (paste("Yarr! The mean of this data be ", round(mean(x), 2),
" and the standard deviation be ", round(sd(x), 2),
sep = ""
))
if(what == "tellmeajoke") {
output <- "Bad what input. Please enter plot, stats, or tellmeajoke."
return(output)
}
is.outlier <- x > (mean(x) + outlier.def * sd(x)) | x < (mean(x) - outlier.def * sd(x))
x.nooutliers <- x[is.outlier == F]
return(x.nooutliers)
Trust me, when you start building large complex functions, hard-
coding these test values will save you many headaches. Just dont
forget to comment them out when you are done testing or the func-
tion will always use those values!
212 yarrr! the pirates guide to r
if(add.ci == T) {
Now, lets test our function with the optional inputs main, xlab,
Frequency
15
and col. These arguments will be passed down to the hist() func-
10
can see, R has passed our optional plotting arguments down to the
main hist() function in the function code.
0
2 1 0 1 2 3
Number of Chests
A worked example: Custom plotting functions
Lets create our own advanced own custom plotting function called
plot.advanced that acts like the normal plotting function, but has
several additional arguments
add.mean: A logical value indicating whether or not to add verti-
cal and horizontal lines at the mean value of x and y.
14: writing your own functions 213
This plotting code is a bit long, but its all stuff youve learned
before.
plot.advanced <- function (x = rnorm(100),
y = rnorm(100),
add.mean = F,
add.regression = F,
sig.line.col = "red",
nonsig.line.col = "black",
p.threshold = .05,
add.modeltext = F,
... # Optional further arguments passed on to plot
) {
# Run regression
model <- lm(y ~ x)
50
Treasure Chests Found
40
can do, well turn on some of the additional elements, and include
30
optional titles for the plot. You can see the result in the margin figure
20
on the right!
10
0
15 20 25 30 35 40 45
Age
15: Loops
Pretty brutal right? Imagine if you have a huge dataset with 1,000
columns, now youre really doing a lot of typing. Thankfully, with
a loop you can take care of this in no time. Check out this following
code chunk which uses a loop to convert the data for all 100 columns
in our survey dataframe.
y <- survey.df[, i] # Get data for ith column and save in a new object y
} # Close loop!
Done. All 100 columns. Take a look at the code and see if you can
understand the general idea. But if not, no worries. By the end of this
chapter, youll know all the basics of how to construct loops like this
one.
A loop is, very simply, code that tells a program like R to repeat a
certain chunk of code several times with different values of an index
that changes for every run of the loop. In R, the format of a for-loop
is as follows:
for(INDEX.OBJECT in INDEX.VALUES) {
LOOP.CODE
As you can see, there are three key aspects of loops: The index
object, the index vector, and the loop code:
2. index vector: A vector specifying all values that the index will
take over the loop. In the previous example, the index values
To see if our loop gave us the correct
are 1:10. You can specify the index values any way youd like.
answer, we can do the same calculation
If youre running a loop over numbers, youll probably want to without a loop by using a:b and the
use a:b or seq(). However, if you want to run a loop over a few sum() function:
specific values, you can just use the c() function to type the values sum(1:100)
manually. For example, to run a loop over three different pirate
## [1] 5050
ships, you could set the index values as c("Jolly Roger", "Black
Pearl", "Queen Annes Revenge"). Theres actually a funny story about
how to quickly add integers (without
3. loop code: The code that will be executed for all index values. In a loop). According to the story, a lazy
the previous example, the loop code is print(i). You can write teacher who wanted to take a nap
decided that the best way to occupy his
any R code youd like in a loop - from plotting to analyses. students was to ask them to privately
count all the integers from 1 to 100 at
their desks. To his surprise, a young
A simple loop: Adding integers from 1 to 100 student approached him after a few
moments with the correct answer: 5050.
Lets use a loop to add all the integers from 1 to 100. To do this, The teacher suspected a cheat, but
well start by creating an object called current.sum that contains the the student didnt count the numbers.
Instead he realized that he could use
running sum of the numbers. This is where well store the running the formula n(n+1) / 2. Dont believe
sum. Well set the index object to i, the index vector to 1:100, and the the story? Check it out:
loop code to current.sum <- current.sum + i. Because we want the
100 * 101 / 2
starting sum to be 0, well set the initial value of current.sum to 0.
Here is the code: ## [1] 5050
for(i in 1:100) {
current.sum
## [1] 5050
You can use loops to do much more than basic math. Oh no. One of
Figure 69: Gauss. Guys a total pirate.
the best uses of a loop is to put multiple graphs quickly and easily And totally would give us shit for using
on the same chart. Lets use a loop to quickly create a matrix of four a loop to calculate the sum of 1 to 100...
plots. For this example, well plot a histogram of the ages of pirates
based on their favorite pixar movie from the pirates dataset.
218 yarrr! the pirates guide to r
0 1 2 3 4 5 6 7
20
15
15
Frequency
Frequency
Frequency
Frequency
10
10
5
5
0
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
15
100
50
8
Frequency
Frequency
Frequency
Frequency
10
6
60
30
5
2
0 20
0 10
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
8
15
15
3
6
Frequency
Frequency
Frequency
Frequency
10
10
2
4
5
2
0
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
0.8
5
Frequency
Frequency
Frequency
4
10
0.4
2
5
0.0
0
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
Now we can do our loop. Our index value will be i and our index
vector will be 1:10. For our loop code, well calculate the square of
the index value, and assign the result to the ith index of squares
for(i in 1:10) {
result.i <- i ^ 2
squares[i] <- result.i
Now lets look at the result. Hopefully well get the squares of 1 to
10...
squares
## [1] 1 4 9 16 25 36 49 64 81 100
Again, there are easier ways to calculate
the squares of integers than using a
Ok that loop was pretty simple but hopefully it shows you the loop. Here, we can do the same thing
basic idea of combining indexing with assignment in loops. using basic vector arithmetic:
# CREATING THE DATA CONTAINER parrot.ci.df 1. Step 1: Determine all the age values
to analyze and assign them to
# Step 1 - Determine the age levels to analyze the object ages.to.analyze. Then
ages.to.analyze <- 20:30 determine total number of unique
n.ages <- length(ages.to.analyze) age values to analyze and assign the
total to the object n.ages.
# Step 2 - Create the storage dataframe parrot.ci.df 2. Step 2: Create a storage dataframe
parrot.ci.df <- data.frame( called parrot.ci.df with N rows
age = rep(NA, n.ages), (where N is the number of ages
mean = rep(NA, n.ages), we will analyze), and 3 columns
lb = rep(NA, n.ages), (age, lower.bound, upper.bound). I
ub = rep(NA, n.ages), started by filling parrot.ci.df with
stringsAsFactors = F NA values. We will use the loop to
) replace these NA values with our
names(parrot.ci.df) <- c("age", "mean", "lb", "ub") statistics.
## age mean lb ub
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 NA NA NA NA
## 11 NA NA NA NA
parrot.ci.df
## age mean lb ub
## 1 20 1.667 5.493 35
## 2 21 2.265 6.265 33
## 3 22 2.150 6.954 39
## 4 23 2.300 7.893 59
## 5 24 1.797 7.154 63
## 6 25 2.324 10.41 67
## 7 26 2.415 6.463 52
## 8 27 2.702 8.656 83
## 9 28 2.400 8.053 59
## 10 29 2.532 7.218 61
## 11 30 2.947 7.409 74
So far weve covered simple loops with a single index value - but
how can you do loops over multiple indices? You could do this
by creating multiple nested loops. However, these are ugly and
cumbersome. Instead, I recommend that you use design matrices to
reduce loops with multiple index values into a single loop with just
one index. Heres how you do it:
Lets say you want to calculate the mean, median, and standard
deviation of some quantitative variable for all combinations of two
factors. For a concrete example, lets say we wanted to calculate
these summary statistics on the age of pirates for all combinations of
colleges and sex.
Design Matrices
To do this, well start by creating a design matrix. This matrix will
have all combinations of our two factors. To create this design matrix
matrix, well use the expand.grid() function. This function takes
several vectors as arguments, and returns a dataframe with all combi-
nations of values of those vectors. For our two factors college and sex,
well enter all the factor values we want. Additionally, well add NA
columns for the three summary statistics we want to calculate
design.matrix
As you can see, the design matrix contains all combinations of our
factors in addition to three NA columns for our future statistics. Now
that we have the matrix, we can use a single loop where the index is
the row of the design.matrix, and the index values are all the rows in
the design matrix. For each index value (that is, for each row), well
get the value of each factor (college and sex) by indexing the current
row of the design matrix. Well then subset the pirates dataframe
with those factor values, calculate our summary statistics, then assign
them
for(row.i in 1:nrow(design.matrix)) {
# Calculate statistics
median.i <- median(data.temp$age)
mean.i <- mean(data.temp$age)
sd.i <- sd(data.temp$age)
}
224 yarrr! the pirates guide to r
design.matrix
Lets say you are conducting a loop where the outcome of each index
is a vector. However, the length of each vector could change - one
might have a length of 1 and one might have a length of 100. How
can you store each of these results in one object? Unfortunately, a
vector, matrix or dataframe might not be appropriate because their
size is fixed. The solution to this problem is to use a list(). A list is
a special object in R that can store virtually anything. You can have a
list that contains several vectors, matrices, or dataframes of any size.
If you want to get really Inception-y, you can even make lists of lists
(of lists of lists....).
To create a list in R, use the list() function. Lets create a list
that contains 3 vectors where each vector is a random sample from a
normal distribution. Well have the first element have 10 samples, the
second will have 5, and the third will have 15.
number.list
## $first
## [1] -1.2430 0.1740 -1.1913 -0.5787 -1.2293 0.9176 1.6177 -2.0238
## [9] -0.2475 0.9151
##
## $second
## [1] -1.62736 -1.10305 -0.52422 0.07205 -2.35107
##
## $third
15: loops 225
To index a list, use double brackets [[]] or $ if the list has names.
For example, to get the first element of a list named number.list,
wed use number.ls[[1]]:
number.list[[1]]
number.list$second
Ok, now lets use the list object within a loop. Well create a loop
that generates 5 different samples from a Normal distribution with
mean 0 and standard deviation 1 and saves the results in a list called
samples.ls. The first element will have 1 sample, the second element
will have 2 samples, etc.
First, we need to set up an empty list container object. To do this,
use the vector function:
samples.ls
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
226 yarrr! the pirates guide to r
Now, lets run the loop. For each run of the loop, well generate
random samples and assign them to an object called samples. Well
then assign the samples object to the ith entry in samples.ls
for(i in 1:5) {
samples <- rnorm(n = i, mean = 0, sd = 1)
samples.ls[[i]] <- samples
}
samples.ls
## [[1]]
## [1] 0.06156
##
## [[2]]
## [1] 0.9576 2.0565
##
## [[3]]
## [1] 0.48995 0.94618 -0.02269
##
## [[4]]
## [1] 1.4533 0.2751 0.6653 -0.9734
##
## [[5]]
## [1] 0.61423 -0.09755 0.77833 -0.20921 0.28763
Looks like it worked. The first element has one sample, the second
element has two samples
If you want to convert a list to a vector format, you can use the
command unlist().
unlist(samples.ls)
Now all the results are compressed into one vector. Of course, the
resulting vector has lost some information because you dont know
which values came from which loop index.
Loops are great because they save you a lot of code. However, a
drawback of loops is that they can be slow relative to other functions.
15: loops 227
## [1] 1 2 3 4 5 6 7 8 9 10
While this for-loop works just fine, you may have noticed that its
a bit silly. Why? Because R has built-in functions for quickly and
easily calculating sequences of numbers. In fact, we used one of those
functions in creating this loop! (See if you can spot it...its 1:10). The
lesson is: before creating a loop, make sure theres not already a
function in R that can do what you want.
If youre running a long loop and find that its taking a long time,
you can try running in in parallel using snowfall(). Snowfall is a
package that allows you to use multiple processors (called "slaves")
on your computer to run a loop. Theoretically, this can increase the
speed of your loop by 4, 16, 24 (etc.) times, but obviously it depends
on how many cores your computer (or the server youre running the
loop on) has.
To install the snowfall package, run the code
install.packages("snowfall")
2. Set up the slaves with sfInit(). Here, you determine how many
slaves (processors) you want to run.
4. Write a function that each slave will evaluate (e.g.; slave.fun). The
function must have a single input.
library("snowfall")
sfExport("pirates")
sfLibrary(dplyr)
Fourth, well define the function that you want each slave to run.
In this case, well create a function called slave.fun() that takes the
name of a college as an input, and return the average age of pirates
from that college as an output. Each slave will evaluate this function
on a specific college.
return(output)
Fifth, now were ready to run the cluster! We do this using the
sfSapply()33 function. There are two arguments to sfSapply(): x, a 33
In addition to sfSapply(), you can
also use sfLapply(), which returns
vector of index values sent to the function, and fun, a function that
stores the results as a list
will be evaluated with each element of x. For this example, well set x
to a vector of the two colleges, and the function to be slave.fun. The
function will then send each value of the vector x to a slave, the slave
will evaluate slave.fun() for a value of x, and return the result to the
main R session:
Sixth, now that the cluster is finished, we need to close the clusters
with the command sfStop().
sfStop()
##
## Stopping cluster
cluster.result
Looks like our result is what we want. Now, of course this was
a very simple example, and it would have been much easier to use
the aggregate() function to accomplish this task. But the general
structure should allow you to perform much more complicated loops
in parallel.
In this chapter, well cover many tips and tricks for preparing a
dataset for analysis - from recoding values in a dataframe, to merging
two dataframes together, to ... lots of other fun things. Some of the
material here has been covered in other chapters - however, because
data preparation is so important and because so many of these
procedures go together, I thought it would be wise to have them all
in one place.
The Basics
If you dont know the exact index of a column whose name you
want to change, but you know the original name, you can use logical
indexing. For example, to change the name of a column titled sex to
gender, you can do the following:
ChickWeight[1:2,]
232 yarrr! the pirates guide to r
Now lets look at the new version of gender. The former values of
0 should now be female, 1 should now be male, and 2 should now be
other:
234 yarrr! the pirates guide to r
gender
As you can see, age is coded as a character (not numeric), and # We'll get an error because age is not
# numeric (yet)
gender is coded as a factor. Wed like to convert age to numeric, and mean(data$age)
gender to character.
To make age numeric, just use the as.numeric() function: ## Warning in
mean.default(data$age): argument
is not numeric or logical: returning
data$age <- as.numeric(data$age) NA
## [1] NA
To convert gender from a factor to a character, use the as.character()
function:
str(data)
mean(data$age)
## [1] 24
cut()
x
A vector of numeric data
breaks
Either a numeric vector of two or more unique cut points or a
single number (greater than or equal to 2) giving the number of
intervals into which x is to be cut. For example, breaks = 1:10
will put break points at all integers from 1 to 10, while breaks = 5
will split the data into 5 equal sections.
labels
An optional string vector of labels for each grouping. By default,
labels are constructed using "(a,b]" interval notation. If labels =
FALSE, simple integer codes are returned instead of a factor.
right
A logical value indicating if the intervals should be closed on the
right (and open on the left) or vice versa.
As you can see, our result is a vector of factors, where the first ten
elements are (0, 10], the next ten elements are (10, 20], and so on. In
other words, the new vector treats all numbers from 1 to 10 as being
the same, and all numbers from 11 to 20 as being the same.
Lets test the cut() function on the age data from pirates. Well
Once youve used cut() to convert a
add a new column to the dataset called age.decade, which separates numeric variable into bins, you can then
the age data into bins of size 10. This means that every pirate be- use aggregate() or dplyr to calculate
aggregate statistics for each bin. For
tween the ages of 10 and 20 will be in the first bin, those between example, to calculate the mean number
the ages of 21 and 30 will be in the second bin, and so on. To do this, of tattoos of pirates in their 20s, 30s, 40s,
well enter pirates$age as the x argument, and seq(10, 60, 10) as ... we could do the following:
To show you how this worked, lets look at the first few rows of # Calculate the mean number of tattoos
the columns age and age.cut # in each decade
aggregate(tattoos ~ age.decade,
head(pirates[c("age", "age.decade")]) FUN = mean,
data = pirates)
## age age.decade
## age.decade tattoos
## 1 30 (20,30] ## 1 (10,20] 9.722
## 2 25 (20,30] ## 2 (20,30] 9.320
## 3 25 (20,30] ## 3 (30,40] 9.335
## 4 (40,50] 8.143
## 4 29 (20,30]
## 5 31 (30,40]
## 6 30 (20,30]
As you can see, age.cut has correctly converted the original age
variable to a factor.
From these data, we can now easily calculate how many pirates
are in each age group using table()
table(pirates$age.decade)
##
## (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
## 0 115 600 278 7 0 0 0
## (80,90] (90,100]
## 0 0
16: data cleaning and preparation 237
merge()
x, y
2 dataframes to be merged
all.x, all.y
A logical value indicating whether or not to include non-matching
rows of the dataframes in the final output. The default value is
all.y = FALSE, such that any non-matching rows in y are not
included in the final merged dataframe.
For example, lets say that we have some survey data in a dataframe
called survey. survey <- data.frame(
Heres how the survey data looks: "pirate" = 1:10,
"country" = c("Germany",
"Portugal",
survey "Spain",
"Austria",
## pirate country "Australia",
## 1 1 Germany "Austria",
"Germany",
## 2 2 Portugal "Portugal",
## 3 3 Spain "Portugal",
"Germany"
## 4 4 Austria ),
## 5 5 Australia stringsAsFactors = F
## 6 6 Austria )
## 7 7 Germany
## 8 8 Portugal
## 9 9 Portugal
## 10 10 Germany
As you can see, the merge() function added all the country.info
data to the survey data.
Named Colors
white darkgoldenrod4
blue1 chocolate1 dodgerblue3gray4 gray30 gray56 gray82 grey0 grey26 grey52 grey78honeydew3
darkseagreen3 lightgreen
lavenderblush3 lightyellow
mediumpurple2
olivedrab2 slategray thistlewhitesmoke
purplesandybrown
paleturquoise2
darkgray
aliceblue blue2 chocolate2 dodgerblue4gray5 gray31 gray57 gray83 grey1 grey27 grey53 grey79honeydew4
darkseagreen4 lightgrey
lavenderblush4 lightyellow1
mediumpurple3 purple1seagreen
paleturquoise3
olivedrab3 slategray1thistle1 yellow
darkgreen
antiquewhiteblue3 chocolate3 firebrick gray6 gray32 gray58 gray84 grey2 grey28 grey54 grey80 hotpinklawngreenlightpink
darkslateblue lightyellow2
mediumpurple4 purple2seagreen1
paleturquoise4
olivedrab4 slategray2thistle2 yellow1
antiquewhite1 darkgrey
blue4 chocolate4 firebrick1 gray7 gray33 gray59 gray85 grey3 grey29 grey55 grey81 hotpink1
darkslategray lightpink1
lemonchiffon lightyellow3 orange
mediumseagreen purple3seagreen2
palevioletred slategray3thistle3 yellow2
antiquewhite2
blueviolet coral darkkhaki firebrick2 gray8 gray34 gray60 gray86 grey4 grey30 grey56 grey82 hotpink2
darkslategray1 lightpink2
lemonchiffon1 lightyellow4 orange1
mediumslateblue purple4seagreen3
palevioletred1 slategray4thistle4 yellow3
antiquewhite3 darkmagenta
brown coral1 firebrick3 gray9 gray35 gray61 gray87 grey5 grey31 grey57 grey83 hotpink3
darkslategray2 lightpink3
lemonchiffon2 limegreen orange2
mediumspringgreenpalevioletred2red seagreen4
slategrey tomato yellow4
antiquewhite4 darkolivegreen
brown1 coral2 firebrick4 gray10 gray36 gray62 gray88 grey6 grey32 grey58 grey84 hotpink4
darkslategray3 lightpink4 linen
lemonchiffon3 orange3
mediumturquoise yellowgreen
palevioletred3red1 seashell snow tomato1
aquamarine darkolivegreen1
brown2 coral3 floralwhitegray11 gray37 gray63 gray89 grey7 grey33 grey59 grey85 indianred
darkslategray4 magenta
lightsalmon
lemonchiffon4 orange4
mediumvioletred palevioletred4red2 seashell1 snow1 tomato2
aquamarine1 darkolivegreen2
brown3 coral4 forestgreengray12 gray38 gray64 gray90 grey8 grey34 grey60 grey86indianred1lightblue
darkslategrey magenta1
lightsalmon1 midnightblue
orangered
papayawhip red3 seashell2 snow2 tomato3
aquamarine2
brown4 darkolivegreen3
cornflowerblue gainsborogray13 gray39 gray65 gray91 grey9 grey35 grey61 grey87indianred2
darkturquoise lightblue1 magenta2
lightsalmon2 orangered1
mintcream peachpuff red4 seashell3 snow3 tomato4
darkolivegreen4
burlywoodcornsilk
aquamarine3 ghostwhitegray14 gray40 gray66 gray92 grey10 grey36 grey62 grey88indianred3
darkviolet lightblue2 magenta3
lightsalmon3 orangered2
mistyrose rosybrown
peachpuff1 seashell4 snow4 turquoise
burlywood1
aquamarine4 darkorange
cornsilk1 deeppink gold gray15 gray41 gray67 gray93 grey11 grey37 grey63 grey89indianred4
lightblue3 magenta4
lightsalmon4 orangered3
mistyrose1 rosybrown1siennaspringgreen
peachpuff2 turquoise1
azureburlywood2
cornsilk2 deeppink1 gold1 gray16 gray42 gray68 gray94 grey12 grey38 grey64 grey90
darkorange1 ivory lightblue4
lightseagreen orangered4
maroonmistyrose2 rosybrown2
peachpuff3 springgreen1
sienna1 turquoise2
azure1burlywood3
cornsilk3 deeppink2 gold2 gray17 gray43 gray69 gray95 grey13 grey39 grey65 grey91 ivory1 lightcoral
darkorange2 lightskyblue mistyrose3orchidpeachpuff4
maroon1 rosybrown3 springgreen2
sienna2 turquoise3
azure2burlywood4
cornsilk4 deeppink3 gold3 gray18 gray44 gray70 gray96 grey14 grey40 grey66 grey92 ivory2 lightcyan
darkorange3 lightskyblue1 mistyrose4orchid1
maroon2 peru rosybrown4 springgreen3
sienna3 turquoise4
azure3cadetblue cyandarkorange4
deeppink4 gold4 gray19 gray45 gray71 gray97 grey15 grey41 grey67 grey93 ivory3 lightcyan1
lightskyblue2
maroon3moccasinorchid2 springgreen4violet
pink royalbluesienna4
azure4cadetblue1cyan1 darkorchid goldenrodgray20 gray46 gray72 gray98 grey16 grey42 grey68 grey94 ivory4 lightcyan2
deepskyblue lightskyblue3 navajowhite
maroon4 orchid3 pink1 royalblue1skybluesteelbluevioletred
beige cadetblue2cyan2darkorchid1 goldenrod1gray21 gray47 gray73 gray99 grey17 grey43 grey69 grey95
deepskyblue1 khaki lightcyan3
lightskyblue4 navajowhite1
mediumaquamarine orchid4 pink2 royalblue2
skyblue1
steelblue1
violetred1
bisquecadetblue3cyan3darkorchid2 goldenrod2gray22 gray48 gray74 gray100 grey18 grey44 grey70 grey96 khaki1lightcyan4
deepskyblue2 lightslateblue palegoldenrod
navajowhite2
mediumblue pink3 royalblue3
skyblue2
steelblue2
violetred2
bisque1cadetblue4cyan4darkorchid3 goldenrod3gray23 gray49 gray75 green grey19 grey45 grey71 grey97 khaki2
deepskyblue3 lightgoldenrod
lightslategray
mediumorchid palegreen pink4 royalblue4
navajowhite3 skyblue3
steelblue3
violetred3
bisque2chartreuse
darkblue
darkorchid4 goldenrod4gray24 gray50 gray76 green1 grey20 grey46 grey72 grey98 khaki3
deepskyblue4 lightgoldenrod1
lightslategrey palegreen1 plumsaddlebrown
navajowhite4
mediumorchid1 skyblue4
steelblue4
violetred4
bisque4 darkgoldenrod
chartreuse2 dimgrey gray0 gray26 gray52 gray78 green3 grey22 grey48 grey74 grey100lavender
darksalmon lightgoldenrod3
lightsteelblue1 navyblue
mediumorchid3 palegreen3plum2 salmon1slateblue1 tan1 wheat1
darkgoldenrod1
blackchartreuse3 dodgerbluegray1 gray27 gray53 gray79 green4 grey23 grey49 grey75honeydew
darkseagreen lightgoldenrod4
lavenderblush lightsteelblue2 oldlacepalegreen4plum3 salmon2slateblue2 tan2
mediumorchid4 wheat2
blanchedalmond darkgoldenrod2
chartreuse4 dodgerblue1gray2 gray28 gray54 gray80greenyellowgrey24 grey50 grey76honeydew1
darkseagreen1 lightgoldenrodyellow
lavenderblush1 lightsteelblue3
mediumpurple paleturquoise
olivedrab plum4 salmon3slateblue3 tan3 wheat3
darkgoldenrod3
blue chocolate dodgerblue2gray3 gray29 gray55 gray81
darkseagreen2 grey grey25 grey51 grey77honeydew2 lightgray
lavenderblush2 lightsteelblue4
mediumpurple1 paleturquoise1
olivedrab1 powderblue
salmon4slateblue4 tan4 wheat4