YaRrr Book

Download as pdf or txt
Download as pdf or txt
You are on page 1of 247

yarrr!

the pirates guide to r 1


2 dr. nathaniel d. phillips

YaRrr!

The Pirates Guide to R


D R . N AT H A N I E L D . P H I L L I P S

YA R R R ! T H E P I R AT E S
GUIDE TO R
Copyright 2016 Dr. Nathaniel D. Phillips

published by

http://www.thepiratesguidetor.com

This document may not be used for any commercial purposes. All rights are reserved by Nathaniel
Phillips.

First printing,
Contents

Introduction 11

1: Getting Started (and why R is like a relationship) 15


R is like a relationship... 15
Installing R and RStudio 16
Installing and loading packages 21
The yarrr package 22
The R Reference Card 23

1.5: Jump off the plank and dive in 25


Whats the best way to learn how to swim? 25
Wasnt that easy?! 27

2: R Basics 29
The basics of R programming 29
A brief style guide: Commenting and spacing 32
Creating new objects with <- 34
Test your R might! 38

3: Scaler and vector objects 41


Scalers 41
Vectors 42
4

Functions to generate numeric vectors 43


Test your R might! 47

4: Vector functions 49
Arithmetic operations on vectors 49
Summary statistic functions for numeric vectors 51
Counting functions for discrete and non-numeric data 54
Test your R Might! 58

5: Working with vectors 61


Indexing vectors with brackets 62
Additional ways to create and use logical vectors 66
Changing values in a vector with indexing 66
Taking the sum and mean of logical vectors to get counts and percentages 68
Test your R Might!: Movie data 69

6: Matrices and Data Frames 71


What are matrices and dataframes? 71
Creating matrices and dataframe objects 72
Matrix and dataframe functions 75
Dataframe column names 77
Slicing and dicing dataframes 80
Additional tips 84
Test your R might! Pirates and superheroes 85

7: Importing, saving, and managing data 87


Workspace / Working Environment 88
Exporting dataframes as .txt files 91
Reading dataframes from .txt files 93
Additional Tips 94
Test your R Might! 95
5

8: Advanced dataframe manipulation 97


Row and column statistics: rowMeans, colMeans, rowSums, colSums 97
Sorting dataframes with order() 98
Grouped aggregation with aggregate() 99
Aggregation with the dplyr package 102
Additional Tips 105
Test your R might!: Mmmmm...caffeine 106

9: Sampling and Probability Distributions 107


Sampling data from probability distributions 107
A worked example: A quick test of the law of large numbers 112
Test your R might! 113
Additional Tips 113

10: Plotting: Part 1 115


How does R manage plots? 115
Color basics 117
Scatterplot: plot() 120
Histogram: hist() 122
Barplot: barplot() 123
The Pirate Plot: pirateplot() 126
Low-level plotting functions 130
Adding new points to a plot with points() 131
Adding straight lines with abline() 132
Adding text to a plot with text() 133
Combining text and numbers with paste() 134
Additional low-level plotting functions 138
Saving plots to a file 139
Test your R Might! Purdy pictures 141
6

10: Plotting: Part Deux 145


Advanced colors 145
Plot margins 150
Arranging multiple plots with par(mfrow) and layout 151
Additional Tips 154

11: Inferential Statistics: 1 and 2-sample Null-Hypothesis tests 155


Null vs. Alternative Hypotheses, Descriptive Statistics, Test Statistics, and p-values: A
very short introduction 155
Null v Alternative Hypothesis 155
T-test with t.test() 160
Correlation test with cor.test() 164
Chi-square test 167
Getting APA-style conclusions with the apa function 169
Test your R might! 171

12: ANOVA and Factorial Designs 173


Between-Subjects ANOVA 174
4 Steps to conduct a standard ANOVA in R 176
One-way (1 IV) ANOVA 176
Interactions between variables with * (y x1 * x2) 180
Additional tips 182
Test your R Might! 185

13: Regression 187


The Linear Model 187
Linear regression with lm() 187
Estimating the value of diamonds with lm() 188
Including interactions in models: dv x1 * x2 192
Regression on non-Normal data with glm() 194
Getting an ANOVA from a regression model with aov() 197
Additional Tips 198
Test your Might! A ship auction 200
7

14: Writing your own functions 201


Why would you want to write your own function? 201
The basic structure of a function 202
Storing and loading your functions to and from a function file with source() 209
Tips and tricks for complex functions 209
A worked example: Custom plotting functions 212

15: Loops 215


What are loops? 216
Creating multiple plots with a loop 217
Storing sequential loop results in a container object 219
Loops over multiple indices 222
The list object 224
When and when not to use loops 226
Parallel computing with snowfall() 227

16: Data Cleaning and preparation 231


The Basics 231
Splitting numerical data into groups using cut() 235
Merging two dataframes 237
Random Data Preparation Tips 239

Appendix 241

Index 245
9

This book is dedicated to Dr. Thomas Moore

and Dr. Wei Lin who taught me everything

I know about statistics, Dr. Dirk Wulff

who taught me everything I know about R,

and Dr. Hansjrg Neth who did the same

with LaTeX.
Introduction

Who am I?
My name is Nathaniel. I am a psychologist with a background in
statistics and judgment and decision making. You can find my R (and
non-R) related musings at www.nathanieldphillips.com. As you will
learn in a minute, I didnt write this book, but I did translate it from
pirate-speak to English.

Buy me a mug of beer!


This book is totally free. You are welcome to share it with your
friends, family, hairdresser, plumber, and other loved ones. If you like
it, and want to support me in improving the book, consider buying
me a mug of beer with a donation. You can find a donation link at
http://www.thepiratesguidetor.com. Like a pirate, I work best with a mug of
beer within arms reach.

Where did this book come from?


This whole story started in the Summer of 2015. I was taking a late
night swim on the Bodensee in Konstanz and saw a rusty object stick-
ing out of the water. Upon digging it out, I realized it was an ancient
usb-stick with the word YaRrr inscribed on the side. Intrigued, I
brought it home and plugged it into my laptop. Inside the stick, I
found a single pdf file written entirely in pirate-speak. After watch-
ing several pirate movies, I learned enough pirate-speak to begin
translating the text to English. Sure enough, the book turned out to
be an introduction to R called The Pirates Guide to R.
This book clearly has both massive historical and pedagogical
significance. Most importantly, it turns out that pirates were pro-
gramming in R well before the earliest known advent of computers.
Of slightly less significance is that the book has turned out to be a
surprisingly up-to-date and approachable introductory text to R. For
both of these reasons, I felt it was my duty to share the book with the
world.
12

This book is in progress..


This book is very much a work in progress and was last updated
on . As I am still improving the pirate-English translations, there
are likely many typos, errors and omissions in the document. Im
constantly experimenting with the material and the layout. If you
have any recommendations for changes or spot any errors, please
write me at [email protected] or tweet me @YaRrrBook. Email me with comments, rec-
ommendations or typos at:
[email protected] or tweet me
Who is this book for? at @YaRrrBook

While this book was originally written for pirates, I think that anyone
who wants to learn R can benefit from this book. If you havent had
an introductory course in statistics, some of the later statistical con-
cepts may be difficult, but Ill try my best to add brief descriptions of
new topics when necessary. Likewise, if R is your first programming
language, youll likely find the first few chapters quite challenging
as you learn the basics of programming. However, if R is your first
programming language, thats totally fine as what you learn here
will help you in learning other languages as well (if you choose to).
Finally,while the techniques in this book apply to most data analysis
problems, because my background is in experimental psychology I
will cater the course to solving analysis problems commonly faced in
psychological research.

What this book is


This book is meant to introduce you to the basic analytical tools in
R, from basic coding and analyses, to data wrangling, plotting, and
statistical inference.

What this book is not


This book does not cover any one topic in extensive detail. If you are
interested in conducting analyses or creating plots not covered in the
book, Im sure youll find the answer with a quick Google search!

Why is R so great?
As youve already gotten this book, you probably already have some
idea why R is so great. However, in order to help prevent you from
giving up the first time you run into a programming wall, let me give
you a few more reasons:

1. R is 100% free and as a result, has a huge support community.


Unlike SPSS, Matlab, Excel and JMP, R is, and always will be
13

completely free. This doesnt just help your wallet - it means that
a huge community of R programmers will constantly develop an
distribute new R functionality and packages at a speed that leaves
all those other packages in the dust! Unlike Fight Club, the first
rule of R is "Do talk about R!" The size of the R programming
community is staggering. If you ever have a question about how to
implement something in R, a quick Poogle1 search will lead you to 1
I am in the process of creating Poogle
- Google for Pirates. Kickstarter page
your answer virtually every single time.
coming soon...
2. R is the future. To illustrate this, look at the following three figures.
These are Google trend searches for three terms: R Programming,
Matlab, and SPSS. Try and guess which one is which.

Figure 1: I wonder which trend is for


R...
3. R is incredibly versatile. You can use R to do everything from
calculating simple summary statistics, to performing complex
simulations to creating gorgeous plots like the chord diagram in
Figure 2. If you can imagine an analytical task, you can almost
certainly implement it in R.
4. Using RStudio, You can easily and seamlessly combine R code,
analyses, plots, and written text into elegant documents all in
one place using Sweave (R and Latex) or RMarkdown. In fact, I
translated this entire book (the text, formatting, plots, code...yes,
everything) in RStudio using Sweave. With RStudio and Sweave,
instead of trying to manage two or three programs, say Excel,
Word and (sigh) SPSS, where you find yourself spending half your
time copying, pasting and formatting data, images and test, you
can do everything in one place so nothing gets misread, mistyped,
or forgotten.
5. Analyses conducted in R are transparent, easily shareable, and
reproducible. If you ask an SPSS user how they conducted a spe-
cific analyses, they will either A) Not remember, B) Try (nervously)
to construct an analysis procedure on the spot that makes sense
- which may or may not correspond to what they actually did
months or years ago, or C) Ask you what you are doing in their
kitchen. I used to primarily use SPSS, so I speak from experience
on this. If you ask an R user (who uses good programming tech-
niques!) how they conducted an analysis, they should always be
14

able to show you the exact code they used. Of course, this doesnt
library("circlize")
mean that they used the appropriate analysis or interpreted it mat = matrix(sample(1:100, 18, replace = TRUE), 3, 6)
rownames(mat) = letters[1:3]
colnames(mat) = LETTERS[1:6]
correctly, but with all the original code, any problems should be chordDiagram(mat)

completely transparent!
C D

6. And most importantly of all, R is the programming language of 0


40
80 0
40
0

40
0
12 E
choice for pirates. 80

B
80

12
0
40

160
0
0
80
Additional Tips

40
F
A
40

80
0
0200
Because this is a beginners book, I try to avoid bombarding you with

40
160
too many details and tips when I first introduce a function. For this

80
120

0
c

12
reason, at the end of every chapter, I present several tips and tricks

80

a
0
16
40
that I think most R users should know once they get comfortable 0
20
0

160
0
24
0
20
0

120 40
80

with the basics. I highly encourage you to read these additional b

tips as I expect youll find at least some of them very useful if not Figure 2: This is a chordDiagram
invaluable. plot that comes with the R package
circlize. How cool is this?! Have fun
trying to make this with SPSS.
1: Getting Started (and why R is like a relationship)

Good Times Bad Times


R is like a relationship... ok JUST DO THE
DAMN T-TEST

Yes, R is very much like a relationship. Like relationships, there are I love you R!

two major truths to R programming: no

1. There is nothing more frustrating than when your code does Figure 3: Yep, R will become both your
not work best friend and your worst nightmare.
The bad times will make the good times
oh so much sweeter.
2. There is nothing more satisfying than when your code does
work!

So by now youve installed R and youre ready to get started. But


first, let me give you a brief word of warning: Especially if this is
your first experience programming, you are going to experience a lot
of headaches when you get started. You will run into error after error $RG:LKSD
%#
and pound your fists against the table screaming: "WHY ISNT MY Day 1
CODE WORKING?!?!? There must be something wrong with this
stupid software!!!" You will spend hours trying to find a bug in your
code, only to find that - frustratingly enough, you had had an extra
space or missed a comma somewhere. Youll then wonder why you
Askand you
ever decided to learn R when (::sigh::) SPSS was so "nice and easy." Day 100 shall receive.

This is perfectly normal! Dont get discouraged and DONT GO


BACK TO SPSS! Trust me, as you gain more programming experi-
ence, youll experience fewer and fewer bugs (though theyll never go
away completely). Once you get over the initial barriers, youll find Figure 4: When you first meet R, it will
look so fugly that youll wonder if this
yourself conducting analyses much, much faster than you ever did is all some kind of sick joke. But trust
before. me, once you learn how to talk to it,
and clean it up a bit, all your friends
will be crazy jealous.
Fun fact: SPSS stands for "Shitty Piece
of Shitty Shit". True story.
16 yarrr! the pirates guide to r

Installing R and RStudio

First things first, lets download both Base R and Rstudio. Of course,
they are totally free and open source.

Download Base R

Windows: http://cran.r-project.org/bin/windows/base/

Mac: http://cran.r-project.org/bin/macosx/

Once youve installed base R on your computer, try opening it.


When you do you should see a screen like the one in Figure 5 (this
is the Mac version). As you can see, base R is very much bare-bones
software. Its kind of the equivalent of a simple text editor that comes
with your computer.

Download RStudio
Figure 5: Here is how the standard R
application looks. Not too exciting - just
how we like it!

Windows and Mac: http://www.rstudio.com/products/rstudio/download/

While you can do pretty much everything you want within base
R, youll find that most people these days do their R programming in
an application called RStudio. RStudio is a graphical user interface
(GUI)-like interface for R that makes programming in R a bit easier.
In fact, once youve installed RStudio, youll likely never need to
open the base R application again. To download and install RStudio
(around 40mb), go to one of the links above and follow the instruc-
tions.
Lets go ahead and boot up RStudio and see how she looks!

Code Chunks
In this book, R code is (almost) always presented in a separate gray
box like this one:

a <- 1 + 2 + 3 + 4 + 5
a

## [1] 15
1: getting started (and why r is like a relationship) 17

This is called a code chunk. You should always be able to directly


copy and paste code chunks directly into R. If you copy a chunk and
it does not work for you, it is most likely because the code refers to
a package, function, or object that I defined in a previous chunk. If
so, read back and look for a previous chunk that contains the missing
definition. As youll soon learn, lines that begin with # are either
comments or output from prior code that R will ignore.
As youll notice, Ill include code chunks before all plots in the
book. In early chapters, the code might not make sense just yet.
However, I elected to always include plotting code so you have the
option of re-creating (and tweaking) any plot in the book.

The four RStudio windows


When you open RStudio, youll see the following four windows (also
called panes) shown in in Figure 6. However, your windows might
be in a different order that those in Figure 6. If youd like, you can
change the order of the windows under RStudio preferences. You can
also change their shape by either clicking the minimize or maximize
buttons on the top right of each panel, or by clicking and dragging
the middle of the borders of the windows.

Figure 6: The four panes of RStudio.


18 yarrr! the pirates guide to r

Now, lets see what each window does in detail.

Source - Your notepad for code


The source pane is where you create and edit R Scripts" - your
collections of code. Dont worry, R scripts are just text files with
the ".R" extension. When you open RStudio, it will automatically
start a new Untitled script. Before you start typing in an untitled R
script, you should always save the file under a new file name (like,
"2015PirateSurvey.R"). That way, if something on your computer
crashes while youre working, R will have your code waiting for you
when you re-open RStudio.
Youll notice that when youre typing code in a script in the Source
panel, R wont actually evaluate the code as you type. To have R
actually evaluate your code, you need to first send the code to the
Console (well talk about this in the next section).
There are many ways to send your code from the Source to the
console. The slowest way is to copy and paste. A faster way is to
highlight the code you wish to evaluate and clicking on the "Run"
button on the top right of the Source. Alternatively, you can use the
hot-key "Command + Return" on Mac, or "Control + Enter" on PC to
send all highlighted code to the console.

Console: The calculator


The console is where R actually evaluates code. At the beginning
of the console youll see the character >. This is a prompt that tells
you that R is ready for new code. You can type code directly into
the console after the > prompt and get an immediate response. For
example, if you type 1+1 into the console and press enter, youll see
that R immediately gives an output of 2.

1+1

## [1] 2

Try calculating 1+1 by typing the code directly into the console -
then press Enter. You should see the result [1] 2. Dont worry about
the [1] for now, well get to that later. For now, were happy if we
just see the 2. Then, type the same code into the Source, and then
send the code to the Console by highlighting the code and clicking
the Run" button on the top right hand corner of the Source window.
Alternatively, you can use the hot-key "Command + Return" on Mac
or "Control + Enter" on Windows. Tip: Try to write most of your code in
So as you can see, you can execute code either by running it from a document in the Source. Only type
directly into the Console to de-bug or
the Source or by typing it directly into the Console. However, 99% do quick analyses.
1: getting started (and why r is like a relationship) 19

most of the time, you should be using the Source rather than the
Console. The reason for this is straightforward: If you type code into
the console, it wont be saved (though you can look back on your
command History). And if you make a mistake in typing code into
the console, youd have to re-type everything all over again. Instead,
its better to write all your code in the Source. When you are ready to
execute some code, you can then send "Run" it to the console.

Environment / History
The Environment tab of this panel shows you the names of all the
data objects (like vectors, matrices, and dataframes) that youve
defined in your current R session. You can also see information like
the number of observations and rows in data objects. The tab also
has a few clickable actions like Import Dataset" which will open a
graphical user interface (GUI) for important data into R. However, I
almost never look at this menu.
The History tab of this panel simply shows you a history of all
the code youve previously evaluated in the Console. To be honest,
I never look at this. In fact, I didnt realize it was even there until I
started writing this tutorial.
As you get more comfortable with R, you might find the Envi-
ronment / History panel useful. But for now you can just ignore
it. If you want to declutter your screen, you can even just minimize
the window by clicking the minimize button on the top right of the
panel.

Files / Plots / Packages / Help


The Files / Plots / Packages / Help panel shows you lots of helpful
information. Lets go through each tab in detail:

1. Files - The files panel gives you access to the file directory on your
harddrive. One nice feature of the "Files" panel is that you can use
it to set your working directory - once you navigate to a folder
you want to read and save files to, click "More" and then "Set As
Working Directory." Well talk about working directories in more
detail soon.

2. Plots - The Plots panel (no big surprise), shows all your plots.
There are buttons for opening the plot in a separate window and
exporting the plot as a pdf or jpeg (though you can also do this
with code using the pdf() or jpeg() functions.) Most - if not all - of the time when
Lets see how plots are displayed in the Plots panel. Run the you perform actions using your mouse
by pointing and clicking in RStudio,
following code to display a histogram of 100 values randomly RStudio will perform the function by
drawn from a standard normal distribution. When you do, you sending the appropriate R Code to
the console. You can then copy and
paste this code into your documents to
automate the process later.
20 yarrr! the pirates guide to r

should see a plot similar to this one show up in the Plots panel.
Dont worry if your plot looks slightly different from this one: as
youll learn later, the rnorm() function generates different data
each time you evaluate it!

hist(x = rnorm(n = 100, mean = 1000, sd = 100),


main = "Plunder in 2015",
xlab = "Amount"
)

Plunder in 2015
15
Frequency

10
5
0

800 900 1000 1100 1200

Amount

3. Packages - Shows a list of all the R packages installed on your


harddrive and indicates whether or not they are currently loaded.
Packages that are loaded in the current session are checked while
those that are installed but not yet loaded are unchecked. Well
discuss packages in more detail in the next section.
4. Help - Help menu for R functions. You can either type the name of
a function in the search window, or use the code ?function.name
to search for a function with the name function.name
To get help and see documentation for a function, type ?fun,
where fun is the name of the function. For example, to get addi-
tional information on the histogram function, run the following
code: Tip: If you ever need to learn
more about an R function: type
?functionname, where functionname
is the name of the function.
1: getting started (and why r is like a relationship) 21

?hist

Installing and loading packages

When you download and install R for the first time, you are installing
the Base R software. Base R will will contain most if not all the
functions you need. However, one of the great things about R is that
people are constantly writing and sharing new functions that you
can use. When people share a new function, they usually do so in the
form of an R package which contains anything from functions, to help
menus, to vignettes (examples), to data.
Most R packages are hosted at the Comprehensive R Archive Net-
work (CRAN) https://cran.r-project.org/. To install a new R package
from CRAN, you can simply run the code install.packages("package"),
where "package" is the name of the package. After youve installed
the package, you need to load it into R by running the code library("package").
This will load the package into your current R session and allow you
to use its contents. Once youve installed a package on
To see how this works in action, lets install and load the wordcloud your computer, you never need to
install it again. However, you do need
package. This package contains the wordcloud function which allows to load the package every time you start
you to easily create those really cool wordcloud plots youve seen on a new R session.
the Internets.
Well start by installing the package. When you run the following
code, R will download the package from CRAN. If everything works,
you should see some information about where the package is being
downloaded from, in addition to a progress bar.

install.packages("wordcloud")

Now that the package is installed on your computer, you can use it
anytime you want by loading the package:

library("wordcloud")

## Loading required package: RColorBrewer

Now, lets create a wordcloud of pirate words! Dont worry about


the specifics of the code below, youll learn more about how all
this works later. For now, just run the code and marvel at your
wordcloud!

wordcloud(words = c("Blackbeard", "Jolly.Roger", "Grogg", "Monkey.Island", "Treasure", "cutlass"),


freq = c(100, 20, 50, 40, 200, 100))
22 yarrr! the pirates guide to r

Monkey.Island
cutlass
Jolly.Roger

Blackbeard
Grogg

Treasure

The yarrr package

For much of this book, you will need the yarrr package. This pack-
age contains every dataset, function, and plotting code from this
book. Unlike most packages that youll be using, the yarrr pack-
age is not hosted at CRAN. Instead, all the code is on github2 at 2
For those of you unfamiliar with
www.github.com/ndphillips/yarrr. To install the yarrr package on github, its basically Facebook for
programmers.
your machine, youll first need to install the devtools package which
will then allow you to directly install packages from github:

install.packages("devtools") # Install the devtools package

Now that youve installed and loaded the devtools package, you
can install and load the yarrr package from github with the follow-
ing code:

devtools::install_github("ndphillips/yarrr") # Install the yarrr package

If everything went correctly, you should have access to all the


datasets mentioned in this book in addition to many functions like
apa() and pirateplot(). Dont worry if you are unable to install
the yarrr package. Ill also provide
direct html links that you can use the
download the datasets when necessary.
1: getting started (and why r is like a relationship) 23

To see the dataset, you can execute the the View() function

View(pirates)

When you run this command, you should see the first several rows
and columns of the dataset (like this:)

Figure 7: The pirates dataset included


in the yarrr package.

The R Reference Card

Over the course of this book, you will be learning lots of new func-
tions. Wouldnt it be nice if someone created a Cheatsheet / Notecard
of many common R functions? Yes it would, and thankfully Tom
Short has done this in his creation of the R Reference Card. I highly
encourage you to print this out and start highlighting functions as
you learn them! .
You can access the pdf of this card in two ways. You can get it on
the web at http://nathanieldphillips.com/wp-content/uploads/2015/09/R-
Reference-Card.pdf. Alternatively, you can get it from the yarrr
package. To find where the file is located on your computer, run the
24 yarrr! the pirates guide to r

following code (the output after the code is the file location on my
computer):

system.file("RReferenceCard.pdf", package="yarrr")

## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/yarrr/RReferenceCard.pdf"

Finished!
Thats it for this lecture! All you did was install the most powerful
statistical package on the planet used by top universities and compa-
nies like Google. No big deal.
1.5: Jump off the plank and dive in

Whats the best way to learn how to swim?

Whats the first exercise on the first day of pirate swimming lessons? Figure 8: Despite what you might find
While it would be cute if they all had little inflatible pirate ships to at family friendly waterparks this is
NOT how real pirate swimming lessons
swim around in unfortunately this is not the case. Instead, those look.
baby pirates take a walk off their baby planks so they can get a
taste of what theyre in for. Turns out, learning R is the same way.
Lets jump in. In this chapter, youll see how easy it is to calculate
basic statistics and create plots in R. Dont worry if the code youre
running doesnt make immediate sense just marvel at how easy it is
to do this in R!

Movies
If you havent downloaded the
In this section, well analyze a dataset of the top 5,000 grossing yarrr package yet, run the following
movies of all time. This dataset is called movies and is contained in commands:

the yarrr package. install.packages('devtools')


First, well load the yarrr package. This will give us access to the library('devtools")
movies dataset. install_github('ndphillips/yarrr')

library(yarrr)

Next, well look at the help menu for the movies dataset using the
question mark ?

?movies

First, lets take a look at the first few rows of the dataset using the
head() function. This will give you a visual idea of how the dataset is
structured.

head(movies)

Now lets calculate some basic statistics on the entire dataset. Well
calculate the mean revenue, highest running time, and number of
movies in each genre3 . 3
I got a mean revenue of 98.21, a
maximum running time of 240, and
a table which showed, among other
things, that there were 692 Action
movies, 486 Adventure movies...
26 yarrr! the pirates guide to r

# What is the mean boxoffice total of all movies?


mean(movies$revenue.all)

# What was the longest movie time?


max(movies$time, na.rm = T)

# How many movies are there of each genre?


table(movies$genre)

Now, lets calculate statistics for different gropus of data. For


example, do you think that the budget of sequels is different from
that of non-sequels? Lets calculate the median budget separately for
movies that are sequels and those that are not4 . 4
For non-sequels (where sequel = 0), I
got a median budget of 28 Million. For
sequels, I got a median budget of 48
aggregate(formula = budget ~ sequel,
Million.
data = movies,
FUN = median
)
Heres how my plot looks

Cool stuff, now lets make a plot! Well plot the relationship be- My first scatterplot of movie data!

tween movie budgets and revenue. I would hope that the higher a

2500
movies budget is, the more money it should make. Lets find out if

2000
this is the case. Well also include a blue regression line to measure

1500
the strength of the relationship.
Revenue

1000
plot(x = movies$budget, 500

y = movies$revenue.all,
main = 'My first scatterplot of movie data!',
0

xlab = 'Budget',
0 100 200 300 400

Budget

ylab = 'Revenue'
)

abline(lm(revenue.all ~ budget,
data = movies),
col = 'blue'
)

Now, lets do some basic hypothesis tests. First, lets conduct a


two-sample t-test to see if there is a significant difference between the
budget of sequels and non sequels5 : 5
I got a test statistic of 9.44 and a
p-value of 2.1487864 1019 (which is
pretty much 0).
t.test(formula = budget ~ sequel,
data = movies,
alternative = 'two.sided'
)
1.5: jump off the plank and dive in 27

Next, lets test if there a significant correlation between a movies


budget and its length. In other words, do movies with higher bud-
gets tend to be longer?6 6
For this correlation test, I got a correla-
tion of 0.27, a test statistic of 11.65, and
another p-value of 3.0811405 1030 .
cor.test(formula = ~ budget + time,
data = movies
)

Finally, lets run a regression analysis to see if a movies budget,


length, and whether or not it is a sequel each predict how much a
movie will make:

revenue.model <- lm(formula = revenue.all ~ budget + time + sequel,


data = movies
)
summary(revenue.model)

Now, lets repeat some of our previous analyses with Bayesian


versions. First well install7 and load the BayesFactor package which 7
You need to be connected to the inter-
contains the Bayesian statistics functions well use: net in order to install the BayesFactor
package. If you arent, youll have to
skip the rest of these exercises.
install.packages('BayesFactor')
library(BayesFactor)

Now that the packages is installed and loaded, were good to go!
Lets do a Bayesian version of the t-test comparing the budgets of
movie sequels to non-sequels8 : 8
I got a Bayes Factor of 5.0869775 1038
which is pretty darn strong evidence
against the null hypothesis.
ttestBF(formula = budget ~ sequel,
data = movies[is.finite(movies$budget) &
is.finite(movies$sequel),])

Now, lets test the correlation between a movies budget and its
revenue with a Bayesian regression analysis9 : 9
I got a Bayes Factor of (Infinite!)
which is such strong evidence against
the null hypothesis that the time and
lmBF(formula = revenue.all ~ budget, space continuum is on the verge of
data = movies[is.finite(movies$revenue.all) & collapsing.
is.finite(movies$budget),])

Wasnt that easy?!

Ok seriously. Wasnt that easy? Imagine how long it would have


taken to explain how to do all that in SPSS10 . And while you havent Oh wait I forgot, SPSS cant even do
10

really learned how R works yet, Id bet my beard that you could Bayesian statistics

easily alter the previous code to do lots of other analyses. Of course,


28 yarrr! the pirates guide to r

dont worry if some or all of the previous code didnt make sense.
Soon...it will all be clear.
Now that youve gotten wet, lets learn how to swim.
2: R Basics

If youre like most people, you think of R as a statistics program.


However, while R is definitely the coolest, most badass, pirate-y
way to conduct statistics its not really a program. Rather, its a
programming language that was written by and for statisticians. To
learn more about the history of R...just...you know...Google it.
In this chapter, well go over the basics of the R language and the Figure 9: Ross Ihaka and Robert
Gentlemen. You have these two pirates
RStudio programming environment. to thank for creating R! You might not
think much of them now, but by the
end of this book theres a good chance
The basics of R programming youll be dressing up as one of them on
Halloween.
Objects and functions. Functions and objects
To understand how R works, you need to know that R revolves
around two things: objects and functions. Almost everything in R is
either an object or a function.
What is an object? An object is a thing like a number, a dataset,
a summary statistic like a mean or standard deviation, or a statistical
test. Objects come in many different shapes and sizes in R. There
are simple objects like scalers which represent single numbers, vec-
tors which represent a list of numbers, more complex objects like
dataframes which represent tables of data, and even more complex
objects like hypothesis tests or regression which contain all sorts of
statistical information.
Different types of objects have different attributes. For example,
a vector of data has a length attribute (i.e.; how many numbers are
in the vector), while a hypothesis test has many attributes such as
a test-statistic and a p-value. Dont worry if this is a bit confusing
now it will all become clearer when you meet these new objects
in person in later chapters. For now, just know that objects in R are
things, and different objects have different attributes.
What is a function? A function is a procedure that typically takes
one or more objects as arguments (aka, inputs), does something with
those objects, then returns a new object. For example, the mean()
function takes a vector of numeric data as an argument, calculates
30 yarrr! the pirates guide to r

the arithmetic mean of those data, then returns a single number (a


scaler) as a result. A great thing about R is that you can easily create
your own functions that do whatever you want but well get to that
much later in the book. Thankfully, R has hundreds (thousands?) of
built-in functions that perform most of the basic analysis tasks you
can think of.
Lets see a built-in function in practice. In the code below, Ill
define a vector object called tattoos. Ill then use the mean() function
to calculate the average number of tattoos in the tattoos vector by
setting the argument as the tattoos vector.

# 1: Create a vector object called tattoos


tattoos <- c(4, 67, 23, 4, 10, 35)

# 2: Apply the mean() function to the tattoos object


mean(tattoos)

## [1] 23.83333

As you can see, the result of the mean() function, with the tattoos
vector object as an input, returned a single number (a scaler object)
with a value of 23.8333333.
Different functions require different types of objects as inputs. For
example, the mean() function has to have a numeric object (typically
a vector) as an input. If you try to use a non-numeric object as an
argument to mean(), youll get an error because the function wasnt
designed to work with that type of argument.
As you learn R, youll spend a lot of time learning new types
of objects (and their attributes) and new functions. To see a list
of the most commonly used built-in functions and procedures in
R, check out the great R Reference Card written by Tom Short at
http://nathanieldphillips.com/wp-content/uploads/2015/09/R-
Reference-Card.pdf. I highly encourage you to print the R Reference
Card out and keep it handy while youre learning new functions.

The command-line interpreter


R code, on its own, is just text. You can write R code in a new script
within R or RStudio, or in any text editor. Hell, you can write R code
on Twitter if you want. However, just writing the code wont do the
whole job in order for your code to be executed (aka, interpreted)
you need to send it to Rs command-line interpreter. In RStudio, the
command-line interpreter is called the Console.
In R, the command-line interpreter starts with the > symbol.
This is called the prompt. Why is it called the prompt? Well, its
2: r basics 31

prompting you to feed it with some R code. The fastest way to have
R evaluate code is to type your R code directly into the command-
line interpreter. For example, if you type 1+1 into the interpreter and
hit enter youll see the following

1+1

## [1] 2

As you can see, R returned the (thankfully correct) value of 211 .


Figure 10: Yep. R is really just a fancy
As you can see, R can, thankfully, do basic calculations. In fact, at its
calculator. This R programming device
heart, R is technically just a fancy calculator. But thats like saying was found on a shipwreck on the
Michael Jordan is just a fancy ball bouncer or Donald Trump is Bodensee in Germany. I stole it from a
museum and made a pretty sweet plot
just an idiot. It (and they), are much more than that. with it. But I dont want to show it to
you.
11
Youll notice that the console also
Writing R scripts in an editor returns the text [1]. This is just telling
you you the index of the value next to it.
There are certainly many cases where it makes sense to type code Dont worry about this for now, it will
directly into the console. For example, to open a help menu for a new make more sense later.
function with the ? command, to take a quick look at a dataset with
the head() function, or to do simple calculations like 1+1, you should
type directly into the console. However, the problem with writing all
your code in the console is that nothing that you write will be saved.
So if you make an error, or want to make a change to some earlier
code, you have to type it all over again. Not very efficient. For this
(and many more reasons), youll should write any important code
that you want to save as an R script12 in an editor. 12
An R script is just a bunch of R code
In RStudio, youll write your R code in the...wait for it...Editor win- in a single file. You can write an R
script in any text editor, but you should
dow. To start writing a new R script in RStudio, click File New File save it with the .R suffix to make it
R Script13 . When you open a new script, youll see a blank page clear that it contains R code.we
13
Shortcut! To create a new script in R,
waiting for you to write as much R code as youd like. In Figure 11, I you can also use the commandshiftN
have a new script called myfirstscript.R with a few random calcula- shortcut on Mac. I dont know what it
tions. is on PC...and I dont want to know.

Send code from an editor to the console


When you type code into an R script, youll notice that, unlike typing
code into the Console, nothing happens. In order for R to interpret
the code, you need to send it from the Editor to the Console. There
are a few ways to do this, here are the three most common ways:

1. Copy the code from the Editor (or anywhere that has valid R code),
and paste it into the Console (using CommandV).

2. Highlight the code you want to run (with your mouse or by hold-
ing Shift), then use the CommandReturn shortcut (see Figure 12).
32 yarrr! the pirates guide to r

Figure 11: Heres how a new script


looks in the editor window on RStudio.
The code you type wont be executed
until you send it to the console.

3. Place the cursor on a single line you want to run, then use the
CommandReturn shortcut to run just that line.

99% of the time, I use method 2, where I highlight the code I want,
then use the CommandReturn shortcut. However, method 3 is great
for trouble-shooting code line-by-line.

A brief style guide: Commenting and spacing


Figure 12: Ah...the CommandReturn
Like all programming languages, R isnt just meant to be read by shortcut (ControlEnter on PC) to send
a computer, its also meant to be read by other humans or very highlighted code from the Editor to
the Console. Get used to this shortcut
well-trained dolphins. For this reason, its important that your code people. Youre going to be using this a
looks nice and is understandable to other people and your future self. lot
To keep things brief, I wont provide a complete style guide instead
Ill focus on the two most critical aspects of good style: commenting
and spacing14 . 14
For a list of recommendations on how
to make your code easier to follow,
check out Googles own company
Commenting code with the # (pound) sign R Style guide at https://google-
styleguide.googlecode.com/svn/trunk/Rguide.xml
Comments are completely ignored by R and are just there for
whomever is reading the code. You can use comments to explain
what a certain line of code is doing, or just to visually separate
meaningful chunks of code from each other. Comments in R are
designated by a # (pound) sign. Whenever R encounters a # sign, it
will ignore all the code after the # sign on that line. Additionally, in
most coding editors (like RStudio) the editor will display comments
in a separate color than standard R code to remind you that its a
comment: Figure 13: As Stan discovered in season
six of South Park, your future self is a
lazy, possibly intoxicated moron. So do
your future self a favor and make your
code look nice. Also maybe go for a run
once in a while.
2: r basics 33

Here is an example of a short script that is nicely commented. Try


to make your scripts look like this!

# Author: Pirate Jack


# Title: My nicely commented R Script
# Date: None today :(

# Step 1: Load the yarrr package


library(yarrr)

# Step 2: See the column names in the movies dataset


names(movies)

# Step 3: Calculations

# What percent of movies are sequels?


mean(movies$sequel, na.rm = T)

# How much did Pirate's of the Caribbean: On Strager Tides make?


movies$revenue.all[movies$name == 'Pirates of the Caribbean: On Stranger Tides']

I cannot stress enough how important it is to comment your code!


Trust me, even if you dont plan on sharing your code with anyone
else, keep in mind that your future self will be reading it in the
future.

Spacing
Howwouldyouliketoreadabookiftherewerenospacesbetweenwords?
Imguessingyouwouldnt. Soeverytimeyouwritecodewithoutprop-
erspacing,rememberthissentence.
Commenting isnt the only way to make your code legible. Its
important to make appropriate use of spaces and line breaks. For
example, I include spaces between arithmetic operators (like =, + and
-) and after commas (which well get to later). For example, look at
the following code:

a<-(100+3)-2
mean(c(a/100,642564624.34))
t.test(formula=revenue.all~sequel,data=movies)
plot(x=movies$budget,y=movies$dvd.usa,main="myplot")

That code looks like shit. Dont write code like that. It makes my
eyes hurt. Now, lets use some liberal amounts of commenting and
spacing to make it look less shitty.
34 yarrr! the pirates guide to r

# Some meaningless calculations. Not important

a <- (100 + 3) - 2
mean(c(a / 100, 642564624.34))

# t.test comparing revenue of sequels v non-sequels

t.test(formula = revenue.all ~ sequel,


data = movies)

# A scatterplot of budget and dvd revenue.


# Hard to see a relationship

plot(x = movies$budget,
y = movies$dvd.usa,
main = "myplot"
)

See how much better that second chunk of code looks? Not only
do the comments tell us the purpose behind the code, but there are
spaces and line-breaks separating distinct elements.

Creating new objects with <-

By now you know that you can use R to do simple calculations. But
to really take advantage of R, you need to know how to create and
manipulate objects. All of the data, analyses, and even plots, you
use and create are, or can be, saved as objects in R. For example
the movies dataset which weve used before is an object stored in
the yarrr package. This object was defined in the yarrr package
with the name movies. When you loaded the yarrr package with
the library(yarrr) command, you told R to give you access to
the movies object. Once the object was loaded, we could use it to
calculate descriptive statistics, hypothesis tests, and to create plots.
To create new objects in R, you need to do object assignment. Object
assignment is our way of storing information, such as a number or
a statistical test, into something we can easily refer to later. This is a
pretty big deal. Object assignment allows us to store data objects un-
der relevant names which we can then use to slice and dice specific
data objects anytime wed like to.
To do an assignment, we use the almighty <- operator15 called 15
You can use = instead of <- for object
assign. assignment but I recommend you stick
with <- because the direction of the
assignment is clear.
object <- [ ]
2: r basics 35

To assign something to a new object (or to update an existing


object), use the notation object <- [ ], where object is the new (or
updated) object, and [ ] is whatever you want to store in object.
Lets start by creating a very simple object called a and assigning the
value of 100 to it: Good object names strike a balance
between being easy to type (i.e.; short
a <- 100 names) and interpret. If you have
several datasets, its probably not a
Once you run this code, youll notice that R doesnt tell you any- good idea to name them a, b, c be-
cause youll forget which is which.
thing. However, as long as you didnt type something wrong, R However, using long names like
should now have a new object called a which contains the number March2015Group1OnlyFemales will
give you carpel tunnel syndrome.
100. If you want to see the value, you need to call the object by just
executing its name. This will print the value of the object to the
console:
a

## [1] 100

Now, R will print the value of a (in this case 100) to the console. If
you try to evaluate an object that is not yet defined, R will return an
error. For example, lets try to print the object b which we havent yet
defined:

## Error in eval(expr, envir, enclos): object b not found

As you can see, R yelled at us because the object b hasnt been


defined yet.
Once youve defined an object, you can combine it with other
objects using basic arithmetic. Lets create objects a and b and play
around with them.

a <- 1
b <- 100

# What is a + b?
a + b

## [1] 101

# Assign a + b to a new object (c)


c <- a + b

# What is c?
c

## [1] 101
36 yarrr! the pirates guide to r

Top change an object, you must assign it again!


Normally I try to avoid excessive emphasis, but because this next
sentence is so important, I have to just go for it. Here it goes...

To change an object, you must assign it again!

No matter what you do with an object, if you dont assign it again,


it wont change. For example, lets say you have an object z with a
value of 0. Youd like to add 1 to z in order to make it 1. To do this,
you might want to just enter z + 1 but that wont do the job. Heres
what happens if you dont assign it again:

z <- 0
z + 1

Ok! Now lets see the value of z

## [1] 0

Damn! As you can see, the value of z is still 0! What went wrong?
Oh yeah...

To change an object, you must


assign it again!
The problem is that when we wrote z + 1 on the second line, R
thought we just wanted it to calculate and print the value of z + 1,
without storing the result as a new z object. If we want to actually
update the value of z, we need to reassign the result back to z as
follows:

z <- 0
z <- z + 1 # Now I'm REALLY changing z
z

## [1] 1

Phew, z is now 1. Because we used assignment, z has been up-


dated. About freaking time.
2: r basics 37

How to name objects


You can create object names using any combination of letters and a
few special characters (like .). Here are some valid object names

# Valid object names


group.mean <- 10.21
my.age <- 32
FavoritePirate <- "Jack Sparrow"
sum.1.to.5 <- 1 + 2 + 3 + 4 + 5

All the object names above are perfectly valid. Now, lets look at
some examples of invalid object names. These object names are all
invalid because they either contain spaces, start with numbers, or
have invalid characters:

# Invalid object names!


famale ages <- 50 # spaces
5experiment <- 50 # starts with a number
a! <- 50 # has an invalid character

If you try running the code above in R, you will receive a warning
message starting with Error: unexpected symbol. Anytime you see
this warning in R, it almost always means that you have a syntax
error of some kind.

R is case-sensitive!
Like English, R is case-sensitive it R treats capital letters differently
from lower-case letters. For example, the four following objects
Plunder, plunder and PLUNDER are totally different objects in R:

# These are all different objects


Plunder <- 1
plunder <- 100
PLUNDER <- 5

Avoid using too many capital letters in object names because they
Figure 14: Like a text message, you
require you to hold the shift key. This may sound silly, but youd be
should probably watch your use of
surprised how much easier it is to type mydata than MyData 100 times. capitalization in R.

Example: Pirates of The Caribbean


Lets do a more practical example well define an object called
blackpearl.usd which has the global revenue of Pirates of the
Caribbean: Curse of the Black Pearl in U.S. dollars. A quick Google
38 yarrr! the pirates guide to r

search showed me that the revenue was $634,954,103. Ill create the
new object using assignment:

blackpearl.usd <- 634954103

Now, my fellow European pirates might want to know how much


this is in Euros. Lets create a new object called blackpearl.eur
which converts our original value to Euros by multiplying the origi-
nal amount by 0.88 (assuming 1 USD = 0.88 EUR)

blackpearl.eur <- blackpearl.usd * 0.88


blackpearl.eur

## [1] 558759611

It looks like the movie made 5.5875961 108 in Euros. Not bad.
Now, lets see how much more Pirates of the Caribbean 2: Dead
Mans Chest made compared to "Curse of the Black Pearl." Another
Google search uncovered that Dead Mans Chest made $1,066,215,812
(that wasnt a mistype, the freaking movie made over a billion dol-
lars).

deadman.usd <- 1066215812

Now, Ill divide deadman.usd by blackpearl.usd:

deadman.usd / blackpearl.usd

## [1] 1.679201

It looks like "Dead Mans Chest" made 68% more than "Curse of
the Black Pearl" - not bad for two movies based off of a ride from
Disneyland.

Test your R might!

1. Create a new R script. Using comments, write your name, the date,
and Testing my Chapter 2 R Might at the top of the script. Write
your answers to the rest of these exercises on this script, and be
sure to copy and paste the original questions using comments!
Your script should only contain valid R code and comments.

2. Which (if any) of the following objects names is/are invalid?

thisone <- 1
THISONE <- 2
this.one <- 3
2: r basics 39

This.1 <- 4
ThIS.....ON...E <- 5
This!One! <- 6
lkjasdfkjsdf <- 7

3. 2015 was a good year for pirate booty - your ship collected 100,800
gold coins. Create an object called gold.in.2015 and assign the
correct value to it.

4. Oops, during the last inspection we discovered that one of your


pirates Skippy McGee" hid 800 gold coins in his underwear. Go
ahead and add those gold coins to the object gold.in.2015. Next,
create an object called plank.list with the name of the pirate
thief.

5. Look at the code below. What will R return after the third line?
Make a prediction, then test the code yourself.

a <- 10
a + 10
a
3: Scaler and vector objects

People are not objects. But R is full of them. Here are some of the
# scalar v vector v matrix
basic ones. par(mar = rep(1, 4))
plot(1, xlim = c(0, 4), ylim = c(-.5, 5),
xlab = "", ylab = "",
xaxt = "n", yaxt = "n",
Scalers bty = "n", type = "n")

# scalar
rect(rep(0, 1), rep(0, 1), rep(1, 1), rep(1, 1))
text(.5, -.5, "scalar")
The simplest object type in R is a scaler. A scalar object is just a single
# Vector
value like a number or a name. In the previous chapter we defined rect(rep(2, 5), 0:4, rep(3, 5), 1:5)
text(2.5, -.5, "Vector")
several scaler objects. Here are examples of numeric scalers:

a <- 100
b <- 3 / 100
c <- (a + b) / c

Scalers dont have to be numeric, they can also be characters (also


known as strings). In R, you denote characters using quotation marks.
Here are examples of character scalers:

d <- "ship"
e <- "cannon"
f <- "Do any modern armies still use cannons?"

As you can imagine, R treats numeric and character scalers dif- scalar Vector
ferently. For example, while you can do basic arithmetic operations
on numeric scalers they wont work on character scalers. If you try Figure 15: Visual depiction of a scalar
and vector. Deep shit. Wait until we get
to perform numeric operations (like addition) on character scalers, to matrices - youre going to lose it.
youll get an error like this one:

a <- "1"
b <- "2"
a + b

## Error in a + b: non-numeric argument to binary operator

If you see an error like this one, it means that youre trying to
apply numeric operations to character objects. Thats just sick and
wrong.
42 yarrr! the pirates guide to r

Vectors

Now lets move onto vectors. A vector object is just a combination of


several scalars stored as a single object. For example, the numbers
from one to ten could be a vector of length 10, and the characters
in the English alphabet could be a vector of length 26. Like scalars,
vectors can be either numeric or character (but not both!).
There are many ways to create vectors in R. Here are the methods
we will cover in this chapter:

Function Example Result


c(a, b) c(1, 5, 9) [1, 5, 9]
a:b 5:10 [5, 6, 7, 8, 9, 10]
seq(from, to, by, length.out) seq(from = 0, to = 6, by = 2) [0, 2, 4, 6]
rep(x, times, each, length.out) rep(c(1, 5), times = 2, each = 2) [1, 1, 5, 5, 1, 1, 5, 5]

Ok now its time to learn our first function! Well start with c()
which allows you to build vectors.

c(a, b, c, ...)

a, b, c, ...
One or more objects to be combined into a vector

The simplest way to create a vector is with the c() function. The c
here stands for concatenate, which means "bring them together". The
c() function takes several scalers as arguments, and returns a vector
containing those objects. When using c(), place a comma in between
the objects (scalers or vectors) you want to combine:
Lets use the c() function to create a vector called a containing the
integers from 1 to 5.

a <- c(1, 2, 3, 4, 5)

Lets look at the object by evaluating it in the console:

## [1] 1 2 3 4 5

As you can see, R has stored all 5 numbers in the object a. Thanks
R!
3: scaler and vector objects 43

You can also create longer vectors by combining vectors you have
already defined. Lets create a vector of the numbers from 1 to 10
by first generating a vector a from 1 to 5, and a vector b from 6 to 10
then combine them into a single vector c:

a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
c <- c(a, b)
c

## [1] 1 2 3 4 5 6 7 8 9 10

You can also create character vectors by using the c() function to
combine character scalers into character vectors:

char.vec <- c("this", "is", "not", "a", "pipe")


char.vec

## [1] "this" "is" "not" "a" "pipe"

Vectors contain either numbers or characters, not both!


A vector can only contain one type of scalar: either numeric or char-
acter. If you try to create a vector with numeric and character scalars,
then R will convert all of the numeric scalars to characters. In the
next code chunk, Ill create a new vector called my.vec that contains a
mixture of numeric and character scalers.

my.vec <- c("a", 1, "b", 2, "c", 3)


my.vec

## [1] "a" "1" "b" "2" "c" "3"

As you can see from the output, my.vec is stored as a character


vector where all the numbers are converted to characters.

Functions to generate numeric vectors

While the c() function is the most straightforward way to create a


vector, its also one of the most tedious. For example, lets say you
wanted to create a vector of all integers from 1 to 100. You definitely
dont want to have to type all the numbers into a c() operator. Thank-
fully, R has many simple built-in functions for generating numeric
vectors. Lets start with three of them: a:b, seq(), and rep():
44 yarrr! the pirates guide to r

a:b
The a:b function takes two scalers a and b as arguments, and returns
a vector of numbers from the starting point a to the ending point b in
steps of 1. You can go forwards or backwards depending on which
number is larger.

a:b
a
The start of the sequence

b
The end of the sequence

Here are some examples of the a:b function in action. As youll


see, you can go backwards or forwards, or make sequences between
non-integers:

1:10

## [1] 1 2 3 4 5 6 7 8 9 10

10:1

## [1] 10 9 8 7 6 5 4 3 2 1

2.5:8.5

## [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5

seq()
The seq() function is a more flexible version of a:b. Like a:b, seq()
allows you to create a sequence from a starting number to an ending
number. However, seq(), has additional arguments that allow you
to specify either the size of the steps between numbers, or the total
length of the sequence:
3: scaler and vector objects 45

seq(from, to, by)

from
The start of the sequence

to
The end of the sequence

by
The step-size of the sequence

length.out
The desired length of the final sequence (only use if you dont
specify by)

The seq() function has two new arguments by and length.out. If


you use the by argument, the sequence will be in steps of the input to
the by argument:

seq(from = 1, to = 10, by = 1)

## [1] 1 2 3 4 5 6 7 8 9 10

seq(from = 0, to = 100, by = 10)

## [1] 0 10 20 30 40 50 60 70 80 90 100

If you use the length.out argument, the sequence will have length
equal to the input of length.out.

seq(from = 0, to = 100, length.out = 11)

## [1] 0 10 20 30 40 50 60 70 80 90 100

seq(from = 0, to = 100, length.out = 5)

## [1] 0 25 50 75 100

rep()
The rep() function allows you to repeat a scaler (or vector) a speci-
fied number of times.

Figure 16: Not a good depiction of a rep


in R.
46 yarrr! the pirates guide to r

rep(x, times, each)


x
A scaler or vector of values to repeat

times
The number of times to repeat the sequence

each
The number of times to repeat each value within the sequence

length.out (optional)
The desired length of the final sequence

Lets do some reps.

rep(x = 3, times = 10)

## [1] 3 3 3 3 3 3 3 3 3 3

rep(x = c(1, 2), times = 5)

## [1] 1 2 1 2 1 2 1 2 1 2

rep(x = c("a", "b"), each = 2)

## [1] "a" "a" "b" "b"

rep(x = 1:3, length.out = 10)

## [1] 1 2 3 1 2 3 1 2 3 1

rep(x = 1:5, times = 3)

## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

As you can see in the fourth example above, you can can include
an a:b call within a rep()!
You can even combine the times and each arguments within a
single rep() function. For example, heres how to create the sequence
1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3 with one call to rep():

rep(x = 1:3, each = 2, times = 2)

## [1] 1 1 2 2 3 3 1 1 2 2 3 3
3: scaler and vector objects 47

Assigning jobs to a new crew


Lets say you are are getting a batch of 10 new pirates on your crew,
and you need to assign each of them to one of three jobs. To help you,
you could use a vector with the numbers 1, 2, 3, 1, 2, 3, etc. As they
board the ship, youll just read off the next job to each pirate. Lets
create this vector using rep()

pirate.jobs.num <- rep(x = 1:3,


length.out = 10)

pirate.jobs.num

## [1] 1 2 3 1 2 3 1 2 3 1

Ok thats helpful, but it would be nice if the jobs were in text


instead of numbers. Lets repeat the previous example, but instead
of using the numbers 1, 2, 3, well use the names of the actual jobs
(which are Deck Swabber, Parrot Groomer, and App Developer)

pirate.jobs.char <- rep(x = c("Deck Swabber", "Parrot Groomer", "App Developer"),


length.out = 10)
pirate.jobs.char

## [1] "Deck Swabber" "Parrot Groomer" "App Developer" "Deck Swabber"


## [5] "Parrot Groomer" "App Developer" "Deck Swabber" "Parrot Groomer"
## [9] "App Developer" "Deck Swabber"

The each argument allows you to repeat each element in the


original vector within each repetition. Wow, that was a confusing
sentence. Let me just show you:

rep(x = 1:4, each = 2)

## [1] 1 1 2 2 3 3 4 4

rep(x = c("a", "b"), each = 3, times = 2)

## [1] "a" "a" "a" "b" "b" "b" "a" "a" "a" "b" "b" "b"

Test your R might!

1. Create the vector [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] in three ways: once


using c(), once using a:b, and once using seq().

2. Create the vector [2.1, 4.1, 6.1, 8.1] in two ways, once using c() and
once using seq()
48 yarrr! the pirates guide to r

3. Create the vector [0, 5, 10, 15] in 3 ways: using c(), seq() with a by
argument, and seq() with a length.out argument.

4. Create the vector [101, 102, 103, 200, 205, 210, 1000, 1100, 1200]
using a combination of the c() and seq() functions

5. A new batch of 100 pirates are boarding your ship and need
new swords. You have 10 scimitars, 40 broadswords, and 50 cut-
lasses that you need to distribute evenly to the 100 pirates as they
board. Create a vector of length 100 where there is 1 scimitar, 4
broadswords, and 5 cutlasses in each group of 10. That is, in the
first 10 elements there should be exactly 1 scimitar, 4 broadswords
and 5 cutlasses. The next 10 elements should also have the same
number of each sword (and so on).

6. Create an object called overboard containing the text NNNNNNOOOOOOoooooo!!!!!!


as a vector of length 24 using the rep() function.
4: Vector functions

In this chapter, well cover some basic vector operations and func-
tions.

length()
Once you have a vector, you may want to know how long it is. When
this happens, dont stare at your computer screen and count the
elements one by one! Instead, use length() function. The length()
function takes a vector as an argument, and returns a scaler repre-
senting the number of elements in the vector:

a <- 1:10
length(a)

## [1] 10

b <- seq(from = 1, to = 100, length.out = 20)


length(b)

## [1] 20

length(c("This", "character", "vector", "has", "six", "elements."))

## [1] 6

length("This character scaler has just one element.")

## [1] 1

Arithmetic operations on vectors

So far, you know how to do basic arithmetic operations like + (addi-


tion), - (subtraction), and * (multiplication) on scalars. Thankfully, R
makes it just as easy to do arithmetic operations on numeric vectors.
If you do an operation on a vector with a scalar, R will apply the
scalar to each element in the vector. For example, if you have a vector
50 yarrr! the pirates guide to r

and want to add 10 to each element in the vector, just add the vector
and scaler objects. Lets create a vector with the integers from 1 to 10,
and add then add 100 to each element:

a <- 1:10
a + 100

## [1] 101 102 103 104 105 106 107 108 109 110

As you can see, the result is [1 + 100, 2 + 100, ... 10 + 100]. Of


course, we could have made this vector with 101:110, but you get the
idea.
Of course, this doesnt only work with addition...oh no. Lets try
division, multiplication, and exponents. Lets create a vector a with
the integers from 1 to 10 and then change it up:

a <- 1:10
a / 100

## [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

a ^ 2

## [1] 1 4 9 16 25 36 49 64 81 100

Again, if you perform an algebraic operation on a vector with a


scaler, R will just apply the operation to every element in the vector.

Basic math with multiple vectors


What if you want to do some operation on two vectors of the same
length? Easy. Just apply the operation to both vectors. R will then
combine them elementbyelement. For example, if you add the
vector [1, 2, 3, 4, 5] to the vector [5, 4, 3, 2, 1], the resulting vector will
have the values [1 + 5, 2 + 4, 3 + 3, 4 + 2, 5 + 1] = [6, 6, 6, 6, 6]:

c(1, 2, 3, 4, 5) + c(5, 4, 3, 2, 1)

## [1] 6 6 6 6 6

Lets create two vectors a and b where each vector contains the
integers from 1 to 5. Well then create two new vectors ab.sum, the
sum of the two vectors and ab.diff, the difference of the two vectors,
and ab.prod, the product of the two vectors:
4: vector functions 51

a <- 1:5
b <- 1:5

ab.sum <- a + b
ab.diff <- a - b
ab.prod <- a * b

ab.sum

## [1] 2 4 6 8 10

ab.diff

## [1] 0 0 0 0 0

ab.prod

## [1] 1 4 9 16 25

Pirate Bake Sale


Lets say you had a bake sale on your ship where 5 pirates sold both
pies and cookies. You could record the total number of pies and
cookies sold in two vectors:

pies <- c(3, 6, 2, 10, 4)


cookies <- c(70, 40, 40, 200, 60)

Now, lets say you want to know how many total items each pirate
sold. You can do this by just adding the two vectors:

total.sold <- pies + cookies


total.sold

## [1] 73 46 42 210 64

Summary statistic functions for numeric vectors

Ok, now that we can create vectors, lets learn the basic descriptive
statistics functions. Well start with functions that apply to both
continuous and discrete data16 . Each of the following functions takes 16
Continuous data is data that, gener-
a numeric vector as an argument, and returns either a scalar (or in ally speaking, can take on an infinite
number of values. Height and weight
the case of summary(), a table) as a result. are good examples of continuous data.
Discrete data are those that can only
take on a finite number of values. The
number of pirates on a ship, or the
number of times a monkey farts in a
day are great examples of discrete data
52 yarrr! the pirates guide to r

Statistical functions for numeric vectors


sum(x), product(x)
The sum, or product, of a numeric vector x.

min(x), max(x)
The minimum and maximum values of a vector x

mean(x)
The arithmetic mean of a numeric vector x

median(x)
The median of a numeric vector x. 50% of the data should be less
than median(x) and 50% should be greater than median(x).

sd(x), var(x)
The standard deviation and variance of a numeric vector x.

quantile(x, p)
The pth sample quantile of a numeric vector x. For example,
quantile(x, .2) will tell you the value at which 20% of cases are
less than x. The function quantile(x, .5) is identical to median(x)

summary(x)
Shows you several descriptive statistics of a vector x, including
min(x), max(x), median(x), mean(x)

Lets calculate some descriptive statistics from some pirate related


data. Ill create a vector called data that contains the number of
tattoos from 10 random pirates.

tattoos <- c(4, 50, 2, 39, 4, 20, 4, 8, 10, 100)

Now, we can calculate several descriptive statistics on this vector


by using the summary statistics functions:

min(tattoos)

## [1] 2

max(tattoos)

## [1] 100

mean(tattoos)
4: vector functions 53

## [1] 24.1

median(tattoos)

## [1] 9

sd(tattoos)

## [1] 31.32074

Watch out for missing (NA) values!


In R, missing data are coded as NA. In real datasets, NA values turn
up all the time. Unfortunately, most descriptive statistics functions
will freak out if there is a missing (NA) value in the data. For exam-
ple, the following code will return NA as a result because there is an
NA value in the data vector: Important!!! Include the argument
na.rm = T to ignore missing (NA)
values when calculating a descriptive
a <- c(1, 5, NA, 2, 10) statistic.
mean(a)

## [1] NA

Thankfully, theres a way we can work around this. To tell a de-


scriptive statistic function to ignore missing (NA) values, include the
argument na.rm = T in the function. This argument explicitly tells
the function to ignore NA values. Lets try calculating the mean of
the vector a again, this time with the additional na.rm = T argument:

mean(a, na.rm = T)

## [1] 4.5

Now, the function ignored the NA value and returned the mean
of the remaining data. While this may seem trivial now (why did we
include an NA value in the vector if we wanted to ignore it?!), it will
be become very important when we apply the function to real data
which, very often, contains missing values.
If you want to get many summary statistics from a vector, you can
use the summary() function which gives you several key statistics:

summary(tattoos)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 2.00 4.00 9.00 24.10 34.25 100.00
54 yarrr! the pirates guide to r

Additional numeric vector functions


Here are some other functions that you will find useful when manag-
ing numeric vectors:

Additional numeric vector functions


round(x, digits)
Round values in a vector (or scaler) x to a certain number of digits.

ceiling(x), floor(x)
Round a number to the next largest integer with ceiling(x) or
down to the next lowest integer with floor(x).

x %% y
Modular arithmetic (i.e.; x mod y). You can interpret x %% y as
What is the remainder after dividing x by y?" For example, 10 %%
3 equals 1 because 3 times 3 is 9 (which leaves a remainder of 1).

Counting functions for discrete and non-numeric data

Next, well move on to common counting functions for vectors with


discrete or non-numeric data. Again, discrete data are those like
gender, occupation, and monkey farts, that only allow for a finite (or
at least, plausibly finite) set of responses. Each of these vectors takes
a vector as an argument however, unlike the previous functions we
looked at, the vectors can be either numeric or character.

Counting functions for discrete data

unique(x)
Returns a vector of all unique values in the vector x.

table(x)
Returns a table showing all the unique values in the vector x as
well as a count of each occurrence. By default, the table() func-
tion does NOT count NA values. To include a count of NA values,
include the argument exclude = NULL
4: vector functions 55

Lets test these functions by starting with two vectors of discrete


data:

vec <- c(1, 1, 1, 5, 1, 1, 10, 10, 10)


gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F")

The function unique(x) will tell you all the unique values in the
vector, but wont tell you anything about how often each value
occurs.

unique(vec)

## [1] 1 5 10

unique(gender)

## [1] "M" "F"

The function table() does the same thing as unique(), but goes
a step further in telling you how often each of the unique values
occurs:

table(vec)

## vec
## 1 5 10
## 5 1 3

table(gender)

## gender
## F M
## 5 4

If you want to get percentages instead of counts, you can just


divide the result of the table() function by the sum of the result:

table(vec) / sum(table(vec))

## vec
## 1 5 10
## 0.5555556 0.1111111 0.3333333

table(gender) / sum(table(gender))

## gender
## F M
## 0.5555556 0.4444444
56 yarrr! the pirates guide to r

Standardization (z-score)
A common task in statistics is to standardize variables also known
as calculating z-scores. The purpose of standardizing a vector is to
put it on a common scale which allows you to compare it to other
(standardized) variables. To standardize a vector, you simply subtract
the vector by its mean, and then divide the result by the vectors
standard deviation.
If the concept of z-scores is new to you dont worry. In the next
worked example, youll see how it can help you compare two sets
of data. But for now, lets see how easy it is to standardize a vector
using basic arithmetic.
Lets say you have a vector a containing some data. Well assign
the vector to a new object called a then calculate the mean and stan-
dard deviation with the mean() and sd() functions:

a <- c(5, 3, 7, 5, 5, 3, 4)
mean(a)

## [1] 4.571429

sd(a)

## [1] 1.397276

Ok. Now well create a new vector called a.z which is a standard-
ized version of a. To do this, well simply subtract the mean of the
vector, then divide by the standard deviation.

a.z <- (a - mean(a)) / sd(a)

The mean of a.z should now be 0, and the standard deviation of


a.z should now be 1. Lets make sure:

mean(a.z)

## [1] 1.982309e-16

sd(a.z)

## [1] 1

Sweet17 . 17
Oh, dont worry that the mean of a.z
doesnt look like exactly zero. Using
non-scientific notation, the result is
A worked example: Evaluating a competition 0.000000000000000198. For all intents
and purposes, thats 0. The reason
Your gluten-intolerant first mate just perished in a tragic soy sauce the result is not exactly 0 is due to
computer science theoretical reasons
incident and its time to promote another member of your crew to the that I cannot explain (because I dont
understand them)
4: vector functions 57

newly vacated position. Of course, only two qualities really matter


for a pirate: rope-climbing, and grogg drinking. Therefore, to see
which of your crew deserves the promotion, you decide to hold a
climbing and drinking competition. In the climbing competition, you
measure how many feet of rope a pirate can climb in an hour. In the
drinking competition, you measure how many mugs of grogg they
can drink in a minute. Five pirates volunteer for the competition
here are their results:

grogg <- c(12, 8, 1, 6, 2)


climbing <- c(100, 520, 430, 200, 700)

Now youve got the data, but theres a problem: the scales of the
numbers are very different. While the grogg numbers range from 1 to
12, the climbing numbers have a much larger range from 100 to 700.
This makes it difficult to compare the two sets of numbers directly.
To solve this problem, well use standardization. Lets create new
standardized vectors called grogg.z and climbing.z

grogg.z <- (grogg - mean(grogg)) / sd(grogg)


climbing.z <- (climbing - mean(climbing)) / sd(climbing)

Now lets look at the final results. To make them easier to read, Ill
round them to 2 digits:

round(grogg.z, 2)

## [1] 1.38 0.49 -1.07 0.04 -0.85

round(climbing.z, 2)

## [1] -1.20 0.54 0.17 -0.78 1.28

It looks like there were two outstanding performances in particular.


In the grogg drinking competition, the first pirate had a z-score of
1.4. We can interpret this by saying that this pirate drank 1.4 more
standard deviations of mugs of grogg than the average pirate. In the
climbing competition, the fifth pirate had a z-score of 1.3. Here, we
would conclude that this pirate climbed 1.3 standard deviations more
than the average pirate.
But which pirate was the best on average across both events? To
answer this, lets create a combined z-score for each pirate which
calculates the average z-scores for each pirate across the two events.
Well do this by adding two performances and dividing by two. This
will tell us, how good, on average, each pirate did relative to her
fellow pirates.
58 yarrr! the pirates guide to r

average.z <- (grogg.z + (climbing.z)) / 2

Lets look at the result

round(average.z, 1)

## [1] 0.1 0.5 -0.5 -0.4 0.2

The highest average z-score belongs to the second pirate who had
an average z-score value of 0.5. The first and last pirates, who did
well in one event, seemed to have done poorly in the other event.
Moral of the story: promote the pirate who can drink and climb.

Test your R Might!

1. Create a vector that shows the square root of the integers from 1 to
10.

2. Theres an old parable that goes something like this. A man does
some work for a king and needs to be paid. Because the man
loves rice (who doesnt?!), the man offers the king two different
ways that he can be paid. You can either pay me 1000 bushels of
rice, or, you can pay me as follows: get a chessboard and put one
grain of rice in the top left square. Then put 2 grains of rice on
the next square, followed by 4 grains on the next, 8 grains on the
next...and so on, where the amount of rice doubles on each square,
until you get to the last square. When you are finished, give me
all the grains of rice that would (in theory), fit on the chessboard.
The king, sensing that the man was an idiot for making such a
stupid offer, immediately accepts the second option. He summons
a chessboard, and begins counting out grains of rice one by one...
Assuming that there are 64 squares on a chessboard, calculate how
many grains of rice the main will receive18 . 18
Hint: If you have trouble coming up
with the answer, imagine how many
3. Renata thinks that she finds more treasure when shes had a mug grains are on the first, second, third and
of grogg than when she doesnt. To test this, she recorded how fourth squares, then try to create the
vector that shows the number of grains
much treasure she found over 7 days without drinking any grogg, on each square. Once you come up with
and then did the same over 7 days while drinking grogg. Here are that vector, you can easily calculate the
her results: final answer with the sum() function.

grogg <- c(2, 0, 3, 1, 0, 3, 5)


nogrogg <- c(0, 0, 1, 0, 1, 2, 2)

How much treasure did Renata find on average when drinking


grogg? What about when she did not drink grogg?
4: vector functions 59

4. Using Renatas data again, create a new vector called difference


that shows how much more treasure Renata found while drinking
grogg than when she didnt drink grogg. What was the mean,
median, and standard deviation of the difference?
5: Working with vectors

By now you should feel comfortable applying functions like mean()


and table() to vectors. However, in many analyses, you will want to
access specific values of a vector based on their position or on some
other criteria. For example, you may want to analyze values in a
specific location in the vector (i.e.; the first 10 elements) or based on
some criteria (i.e.; all values greater than 0). To access specific values
of a vector in R, we use indexing.
To show you where were going with this, consider the following
two vectors sex and beard these vectors represent responses to
two questions in a survey of five pirates. sex is a character vector
representing the pirates sex (m or f), while beard is a numeric vector
indicating the length of the pirates beards.
Figure 17: Traditionally, a pirate will
sex <- c("m", "m", "f", "m", "f") steal a Beard Bauble from every pirate
beard <- c(30, 24, 0, 40, 0) he beats in a game of Mario Kart. This
pirate is is the Mario Kart champion of
Heres how we can use indexing to answer questions about subsets Missionsstrasse.
of the data:

# What is the length of the first pirate's beard?


beard[1]

## [1] 30

# What is the sex of the 3rd, 4th, and 5th pirates?


sex[3:5]

## [1] "f" "m" "f"

# How many females were there?


sum(sex == "f")

## [1] 2

# What are the values of beard that are greater than 0?


beard[beard > 0]

## [1] 30 24 40
62 yarrr! the pirates guide to r

# What are the beard lengths of males?


beard[sex == "m"]

## [1] 30 24 40

As you can see, we use brackets [ ] in index (aka, select) specific


values of a vector. In this chapter well cover these and other ways of
indexing vectors.

Indexing vectors with brackets

To index a vector, we use brackets [ ] after the vector object. In gen-


a[]
eral, whatever you put inside the brackets, tells R which values of the
vector object you want. There are two main ways that you can use
indexing to access subsets of data in a vector: numerical and logical
indexing.

Numerical Indexing
With numerical indexing, you enter the integers corresponding to
the values in the vector you want to access in the form a[index],
where a is the vector, and index is a vector of index values. For
example, to get the first value in a vector called a, youd write a[1]. It it makes your code clearer, you
To get the first, second, and third value, you can either type a[c(1, can define an indexing object before
doing your actual indexing. For ex-
2, 3)] or a[1:3]. As you can see, you can use any of the vector ample, lets define an object called
generation functions we learned in the previous chapter to index a get.these.values and use this object to
index our data vector:
vector. However, the values must be integers.
Lets do a few more examples. Well create an a vector again and my.index <- 6:10
use indexing to extract specific values: a[my.index]

## [1] 3 4 NA NA NA
a <- c(0, 50, 2, 39, 9, 20, 17, 8, 10, 100)

# 1st element
a[1]

## [1] 0

# 1-5 elements
a[1:5]

## [1] 0 50 2 39 9

# Every second element


a[seq(from = 2, to = 10, by = 2)]

## [1] 50 39 20 8 100
5: working with vectors 63

Logical Indexing
The second way to index vectors is with logical vectors. A logical
vector is a vector that only contains TRUE and FALSE values. In R,
true values are designated with either TRUE or T, and false values
with either FALSE or F.
You can create logical vectors from existing vectors using com-
parison operators like < (less than), == (equals to), and != (not equal
to). A complete list of the most common comparison operators is
in Figure 19. For example, lets start with a vector a containing five
numbers:

a <- c(5, 3, -2, -5, 11)

We can create logical vectors by comparing the original vector a to


Figure 18: Logical indexing. Good for R
a scaler as follows: aliens and R pirates.

# Which values are > 0?


par(mar = rep(.1, 4))
a > 0 plot(1, xlim = c(0, 1.1), ylim = c(0, 9),
xlab = "", ylab = "", xaxt = "n", yaxt = "n",
type = "n")

## [1] TRUE TRUE FALSE FALSE TRUE text(rep(0, 8), 8:1,


labels = c("==", "!=", "<", "<=",
">", ">=", "|", "!"),
# Which values are equal to 5? adj = 0, cex = 3)

a == 5 text(rep(.2, 8), 8:1,


labels = c("equal", "not equal", "less than",
"less than or equal","greater than",
"greater than or equal", "or", "not"),
## [1] TRUE FALSE FALSE FALSE FALSE adj = 0, cex = 3)

For character vectors, you can also use the == (equals) or != (not-
== equal
equal) comparisons (other comparisons such as < dont make sense
!= not equal
for character vectors).
< less than
b <- c("m", "f", "f", "f", "m") <= less than or equal
> greater than
b == "m" >= greater than or equal
| or
## [1] TRUE FALSE FALSE FALSE TRUE not
!
b != "m"
Figure 19: Comparison operators in R

## [1] FALSE TRUE TRUE TRUE FALSE

You can also create logical vectors by comparing a vector to an-


other vector of the same length. When you do this, R will compare
values in the same position (e.g.; the first values will be compared,
then the second values, etc.)
64 yarrr! the pirates guide to r

a <- c(1, 2, 3, 4, 5)
b <- c(3, 3, 3, 3, 3)

a > b

## [1] FALSE FALSE FALSE TRUE TRUE

a != b

## [1] TRUE TRUE FALSE TRUE TRUE

Once you create a logical vector, you can use that logical vector to
index another vector. When you do this, R will return the values of
the vector for all TRUE values of the logical vector, and will ignore
all values for which the logical vector is FALSE. If that was confusing,
think about it this way: a logical vector, combined with the brackets [
], acts as a filter for the vector it is indexing. It only lets values of the
vector pass through for which the logical vector is TRUE.

Figure 20: FALSE values in a logical


vector are like lots of mini-Gandolfs. In
this example, I am indexing a vector x
with a logical vector y (y for example
could be x > 0, so all positive values of
x are TRUE and all negative values are
FALSE). The result is a vector of length
2, which are the values of x for which
the logical vector y was true. Gandolf
stopped all the values of x for which y
was FALSE.

For example, lets look at our sex and beard data again.

sex

## [1] "m" "m" "f" "m" "f"

beard

## [1] 30 24 0 40 0
5: working with vectors 65

Now, well use the logical vector (sex == "f") to index the beard
vector. This will give us the values of beard (representing beard
lengths) only for cases where the sex vector is equal to f:

beard[sex == "f"]

## [1] 0 0

This result tells us there were two females (that is, there were two
TRUE values in the logical vector sex == "f"), and both had a beard
length of 0.

Using & (AND), | (OR)


In addition to using single comparison operators, you can combine
multiple logical vectors using the OR (which looks like |) and AND &
commands. The OR | operation will return TRUE if any of the logical
vectors is TRUE, while the AND operation will only return TRUE
if all of the values in the logical vectors is TRUE. This is especially
powerful when you want to create a logical vector based on criteria
from multiple vectors.
For example, in addition to our sex and beard vectors, lets create
an age vector indicating the age of the 5 pirates:

age <- c(19, 24, 22, 40, 88)

Now, we can use the (AND), and | (OR) operations to create a


logical vector based on both age and sex criteria. In this example, Ill
create a logical vector for females (sex == "f") who are older than 20
(age > 20):

sex == "f" & age > 20

## [1] FALSE FALSE TRUE FALSE TRUE

Because the result is just another logical vector, we can use it to


index our main data vector (beard).

# Beard lengths of female AND older than 20?


beard[sex == "f" & age > 20]
You can combine as many logical vec-
## [1] 0 0
tors as you want to create increasingly
complex selection criteria. For example,
# Beard lengths of male OR less than 40? the following logical vector returns
beard[sex == "m" | age < 40] TRUE for cases for females with an age
less than 20 OR greater than 35, and
whose beard length is not equal to 20
## [1] 30 24 0 40
sex == "f" & (age < 20 | age > 35) & beard != 20

## [1] FALSE FALSE FALSE FALSE TRUE

When using multiple criteria, make


sure to use parentheses when appropri-
ate. If I didnt use parentheses above, I
would get a different answer.
66 yarrr! the pirates guide to r

Additional ways to create and use logical vectors

x %in% y
%in%
The %in% operation helps you to easily create multiple OR argu-
ments.Imagine you have a vector of categorical data that can take
on many different values. For example, you could have a vector x
indicating peoples favorite letters.

x <- c("a", "t", "a", "b", "z")

Now, lets say you want to create a logical vector indicating which
values are either a or b or c or d. You could create this logical vector
with multiple | (OR) commands:

x == "a" | x == "b" | x == "c" | x == "d"

## [1] TRUE FALSE TRUE TRUE FALSE

However, this takes a long time to write. Thankfully, the %in%


operation allows you to combine multiple OR comparisons much
faster. To use the %in% function, just put it in between the original
vector, and a new vector of possible values. The %in% function goes through every
value in the vector x, and returns TRUE
if it finds it in the vector of possible
x %in% c("a", "b", "c", "d")
values otherwise it returns FALSE.

## [1] TRUE FALSE TRUE TRUE FALSE

As you can see, the result is identical to our previous result.


Technically, when you assign new
Changing values in a vector with indexing values to a vector, you should always
assign a vector of the same length as the
number of values that you are updating.
Now that you know how to index a vector, you can easily change For example, given a vector a with the
specific values in a vector using the assignment (<-) operation. To do integers from 1 to 10:
this, just assign a vector of new values to the indexed values of the a <- rep(1:10)
original vector:
Lets create a vector a which contains 10 1s: To update the first 5 values with 5 9s,
we should assign a new vector of 5 9s
a <- rep(1, 10) a[1:5] <- c(9, 9, 9, 9, 9)
a
Now, lets change the first 5 values in the vector to 9s by indexing
## [1] 9 9 9 9 9 6 7 8 9 10
the first five values, and assigning the value of 9:
However, if we repeat this code but
a[1:5] <- 9 just assign a single 9, R will repeat the
a value as many times as necessary to fill
the indexed value of the vector. Thats
why the following code still works:
## [1] 9 9 9 9 9 6 7 8 9 10
a[1:5] <- 9
a

## [1] 9 9 9 9 9 6 7 8 9 10

In other languages this code wouldnt


work because were trying to replace
5 values with just 1. However, this is a
case where R bends the rules a bit.
5: working with vectors 67

Now lets change the last 5 values to 0s. Well index the values 6
through 10, and assign a value of 0.

a[6:10] <- 0
a

## [1] 9 9 9 9 9 0 0 0 0 0

Using indexing to change specific values of a vector


You can also change values of a vector with logical indexing. This is
a particularly helpful tool when, for example, you want to remove
invalid values in a vector before performing an analysis. For example,
lets say you asked 10 people how happy they were on a scale of 1 to
5 and received the following responses:

happy <- c(1, 4, 2, 999, 2, 3, -2, 3, 2, 999)

As you can see, we have some invalid values (999 and -2) in this
vector. If we try to take the mean of this vector, well get a nonsense
response:

mean(happy)

## [1] 201.3

We need to somehow take care of these invalid values. To do this,


well use logical indexing to change the invalid values (999 and -2) to
NA. Well start by creating a vector of valid responses:

valid.response <- 1:5

Now, well create a logical vector indicating which values of happy


are invalid using the %in% operation. Because we want to see which
values are invalid, well add the == FALSE condition (If we dont, the
index will tell us which values are valid).

log.vec <- (happy %in% valid.response) == FALSE


log.vec
We can also recode all the invalid
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE values of happy in one line as follows:

happy[(happy %in% 1:5) == FALSE] <- NA


Finally, well index happy with log.vec, and assign the invalid
values as NA:
68 yarrr! the pirates guide to r

happy[log.vec] <- NA
happy
Of course, because we now have NA
## [1] 1 4 2 NA 2 3 NA 3 2 NA values in our vector, we need to include
the na.rm = T argument to the mean()
function! If we dont, well just get an
As you can see, the new vector happy.valid replaced the invalid NA response:
values with NAs. Now we can take a mean() of the vector and see the
mean(happy)
mean of the valid responses.
## [1] NA
mean(happy, na.rm = T)
mean(happy, na.rm = T)
## [1] 2.428571 ## [1] 2.428571

Taking the sum and mean of logical vectors to get counts and per-
centages

Many (if not all) R functions will interpret TRUE values as 1 and
FALSE values as 0. This allows us to easily answer questions like
"How many values in a data vector are greater than 0?" or "What
percentage of values are equal to 5?" by applying the sum() or mean()
function to a logical vector.
Well start with a vector x of length 10

x <- c(5, 2, -2, -7, 2, -6, -2, 1, 8, 10)

Now, well use sum() and mean() to see how many of the values
in x are positive, and what percent are positive. We should find that
there are 6 TRUE values, and that 60% of the values (6 / 10) are
TRUE.

sum(x > 0)

## [1] 6

mean(x > 0)

## [1] 0.6

This is a really powerful tool. Pretty much any time you want to
answer a question like "How many of X are Y" or "What percent of X
are Y", you use sum() or mean() function with a logical vector as an
argument.
5: working with vectors 69

Test your R Might!: Movie data

The following vectors contain data about 10 of my favorite movies.

m.names <- c("Whatever Works", "It Follows", "Love and Mercy",


"The Goonies", "Jiro Dreams of Sushi",
"There Will be Blood", "Moon",
"Spice World", "Serenity", "Finding Vivian Maier")

year <- c(2009, 2015, 2015, 1985, 2012, 2007, 2009, 1988, 2005, 2014)

boxoffice <- c(35, 15, 15, 62, 3, 10, 321, 79, 39, 1.5)

genre <- c("Comedy", "Horror", "Drama", "Adventure", "Documentary",


"Drama", "Science Fiction", "Comedy", "Science Fiction",
"Documentary")

time <- c(92, 97, 120, 90, 81, 158, 97, -84, 119, 84)

rating <- c("PG-13", "R", "R", "PG", "G", "R", "R",


"PG-13", "PG-13", "Unrated")

1. What is the name of the 10th movie in the list?

2. What are the genres of the first 4 movies?

3. Some joker put Spice World in the movie names it should be


The Naked Gun Please correct the name.

4. What were the names of the movies made before 1990?

5. How many movies were Dramas? What percent of the 10 movies


were Dramas?

6. One of the values in the time vector is invalid. Convert any invalid
values in this vector to NA. Then, calculate the mean movie time

7. What were the names of the Comedy movies? What were their
boxoffice totals? (Two separate questions)

8. What were the names of the movies that made less than $30 Mil-
lion dollars AND were Comedies?

9. What was the median boxoffice revenue of movies rated either G


or PG?

10. What percent of the movies were rated R and were comedies?
70 yarrr! the pirates guide to r

Additional Tips
R has lots of special functions that take vectors as arguments, and
return logical vectors based on multiple criteria. Here are some that I
frequently use:

Logical testing functions

is.integer(x)
Tests if values in a vector are integers

is.na(x), is.null(x)
Tests if values in a vector are NA or NULL

is.finite(x)
Tests if a value is a finite numerical value. If a value is NA, NULL,
Inf, or -Inf, is.finite() will return FALSE.

duplicated(x))
Returns FALSE at the first location of each unique value in x,
and TRUE for all future locations of unique values. For example,
duplicated(c(1, 2, 1, 2, 3)) returns (FALSE, FALSE, TRUE,
TRUE, FALSE). If you want to remove duplicated values from a
vector, just run x <- x[!duplicated(x)]

which(log.vec)
Logical vectors arent just good for indexing, you can also use
them to figure out which values in a vector satisfy some criteria. To
do this, use the function which(). If you apply the function which() to
a logical vector, R will tell you which values of the index are TRUE.
For example, given our sex vector, we can use the which() function to
see which pirates are male and which are female:

which(sex == "m")

## [1] 1 2 4

which(sex == "f")

## [1] 3 5
6: Matrices and Data Frames

What are matrices and dataframes?

By now, you should be comfortable with scalar and vector objects.


However, you may have noticed that neither object types are appro- Figure 21: Did you actually think I
priate for storing lots of data such as the results of a survey or could talk about matrices without a
Matrix reference?!
experiment. Thankfully, R has two object types that represent large
data structures much better: matrices and dataframes.
# scalar v vector v matrix
Matrices and dataframes are very similar to spreadsheets in Excel par(mar = rep(1, 4))
plot(1, xlim = c(0, 10), ylim = c(-.5, 5),
or data files in SPSS. Every matrix or dataframe contains rows (call xlab = "", ylab = "",
xaxt = "n", yaxt = "n",
that number m) and columns (n). Thus, wile a vector has 1 dimen- bty = "n", type = "n")

sion (its length), matrices and dataframes both have 2-dimensions # scalar
rect(rep(0, 1), rep(0, 1), rep(1, 1), rep(1, 1))
text(.5, -.5, "scalar")
representing their width and height. You can think of a matrix or
# Vector
dataframe as a combination of n vectors, where each vector has a rect(rep(2, 5), 0:4, rep(3, 5), 1:5)
text(2.5, -.5, "Vector")
length of m. See Figure 22 to see the shocking difference between # Matrix
rect(rep(4:8, each = 5),
these data objects. rep(0:4, times = 5),
rep(5:9, each = 5),
While matrices and dataframes look very similar, they arent rep(1:5, times = 5))
text(6.5, -.5, "Matrix / Data Frame"
exactly the same. While a matrix can contain either character or nu- )

meric columns, a dataframe can contain both numeric and character


columns. Because dataframes are more flexible, most real-world
datasets, such as surveys containing both numeric (e.g.; age, response
times) and character (e.g.; sex, favorite movie) data, will be stored as
dataframes in R19 .
In the next section, well cover the most common functions for cre-
ating matrix and dataframe objects. Well then move on to functions scalar Vector Matrix / Data Frame

that take matrices and dataframes as inputs. Figure 22: scalar, Vector, Matrix...
::drops mike::
19
WTF If dataframes are more flexible
than matrices, why do we use matrices
at all? The answer is that, because
they are simpler, matrices take up less
computational space than dataframes.
Additionally, some functions require
matrices as inputs to ensure that they
work correctly.
72 yarrr! the pirates guide to r

Creating matrices and dataframe objects

There are a number of ways to create your own matrix and dataframe
objects in R. Because matrices and dataframes are just combinations
of vectors, each function takes one or more vectors as inputs, and
returns a matrix or a dataframe.

Creating matrices and


dataframes
Function Description
cbind() Combine vectors as columns in a matrix/dataframe
rbind() Combine vectors as rows in a matrix/dataframe
matrix() Create a matrix with a desired number of rows and
columns from a single vector
data.frame() Combine vectors as columns in a dataframe

cbind() and rbind()


cbind() and rbind() both create matrices by combining several
vectors of the same length. cbind() combines vectors as columns,
while rbind() combines them as rows.
cbind()
rbind()
Lets use these functions to create a matrix with the numbers 1
through 30. First, well create three vectors of length 10, then well
combine them into one matrix.
Keep in mind that matrices can either
contain numbers or characters. If
x <- 1:5 you try to create a matrix with both
y <- 6:10 numbers and characters, it will turn all
the numbers into characters:
z <- 11:15
cbind(1:5,
c("a", "b", "c", "d", "e"))
# Create a matrix where x, y and z are columns
cbind(x, y, z) ## [,1] [,2]
## [1,] "1" "a"
## x y z ## [2,] "2" "b"
## [3,] "3" "c"
## [1,] 1 6 11 ## [4,] "4" "d"
## [2,] 2 7 12 ## [5,] "5" "e"
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15

# Create a matrix where x, y and z are rows


rbind(x, y, z)
6: matrices and data frames 73

## [,1] [,2] [,3] [,4] [,5]


## x 1 2 3 4 5
## y 6 7 8 9 10
## z 11 12 13 14 15

As you can see, the cbind() function combined the vectors as


columns in the final matrix, while the rbind() function combined
them as rows.

matrix()
matrix()
The matrix() function creates a matrix form a single vector of data.
The function has 3 main inputs: data a vector of data, nrow the
We can also organize the vector by rows
number of rows you want in the matrix, and ncol the number of instead of columns using the argument
columns you want in the matrix, and byrow a logical value indicat- byrow = T:
ing whether you want to fill the matrix by rows. Check out the help
# Create a matrix of the integers 1:10,
menu for the matrix function (?matrix) to see some additional inputs. # with 2 rows and 5 columns entered by row
Lets use the matrix() function to re-create a matrix containing the
matrix(data = 1:10,
values from 1 to 10. nrow = 2,
ncol = 5,
# Create a matrix of the integers 1:10, byrow = T)
# with 5 rows and 2 columns ## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
matrix(data = 1:10, ## [2,] 6 7 8 9 10

nrow = 5,
ncol = 2)

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

# Now with 2 rows and 5 columns


matrix(data = 1:10,
nrow = 2,
ncol = 5)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
74 yarrr! the pirates guide to r

data.frame()
To create a dataframe from vectors, use the data.frame() function.
The data.frame() function works very similarly to cbind() the only
difference is that in data.frame() you specify names to each of the
columns as you define them. Again, unlike matrices, dataframes
can contain both string vectors and numeric vectors within the same

data.frame()
object. Because they are more flexible than matrices, most large
datasets in R will be stored as dataframes.
Lets create a simple dataframe using the data.frame() function
with a mixture of text and numeric columns:

# Create a dataframe with columns col.1,


# col.2, and col.3

data.frame("index" = c(1, 2, 3, 4, 5),


"sex" = c("m", "m", "m", "f", "f"),
"age" = c(99, 46, 23, 54, 23)
)

## index sex age


## 1 1 m 99
## 2 2 m 46
## 3 3 m 23
## 4 4 f 54
## 5 5 f 23

Dataframes pre-loaded in R
Now you know how to use functions like cbind() and data.frame()
to manually create your own matrices and dataframes in R. However,
for demonstration purposes, its frequently easier to use existing
dataframes rather than always having to create your own. Thankfully,
R has us covered: R has several datasets that come pre-installed in a
package called datasets you dont need to install this package, its
included in the base R software. While you probably wont make any
major scientific discoveries with these datasets, they allow all R users
to test and compare code on the same sets of data. Here are a few To see a complete list of all the datasets
datasets that we will be using in future examples: included in the datasets package, run
the code: library(help = "datasets")

ChickWeight: Weight versus age of chicks on four different diets

InsectSprays: Effectiveness of six different types of insect sprays

ToothGrowth: The effects of different levels of vitamin C on the


tooth growth of guinea pigs.
6: matrices and data frames 75

Since these datasets are preloaded in R, you can always access


them without having to create them manually. For example, if
you run head(ToothGrowth), youll see the first few rows of the
ToothGrowth dataframe.

Matrix and dataframe functions

R has lots of functions for viewing matrices and dataframes and


returning information about them. Here are the most common:

Matrix and Dataframe


Functions
Function Description
head() Print the first few rows to the console
tail() Print the last few rows to the console
View() Open the entire object in a new window
dim() Count the number of rows and columns
nrow(), ncol() Count the number of rows (or columns)
rownames(), colnames() Show the row (or column) names
names() Show the column names (only for dataframes)
str() Show the structure of a dataframe
summary() Show summary statistics

View()
You can print an entire matrix or dataframe in the console by just
executing the name of the object, but if the matrix or dataframe is
very large, it can overwhelm the console. Instead, its best to use one
of the viewing functions like View() or head(). The View() function
will open the object in its own window: View()
# Print the entire ChickWeight dataframe
# in a new data window

View(ChickWeight)

head()
To print the first few rows of a dataframe or matrix to the console,
use the head() function:
76 yarrr! the pirates guide to r

Figure 23: Screenshot of the window


from View(ChickWeight). You can use
this window to visually sort and filter
the data to get an idea of how it looks,
but you cant add or remove data and
nothing you do will actually change the
dataframe.

# Print the first few rows of ChickWeight


head()
# to the console

head(ChickWeight)

## weight Time Chick Diet


## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1

summary()

summary()
To get summary statistics on all columns in a dataframe, use the
summary() function:

# Print summary statistics of the columns of


# ChickWeight to the console

summary(ChickWeight)

## weight Time Chick Diet


## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
6: matrices and data frames 77

## (Other):506

str()
To learn about the classes of columns in a dataframe, in addition to
some other summary information, use the str() (structure) function.

str()
This function returns information for more advanced R users, so
dont worry if the output looks confusing.
# Print the classes of the columns of
# ChickWeight to the console

str(ChickWeight)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:


## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' length 2 ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"

Here are some of the other functions in action:

# What are the names of the ChickWeight columns?


names(ChickWeight)

## [1] "weight" "Time" "Chick" "Diet"

# How many rows are in ChickWeight?


nrow(ChickWeight)

## [1] 578

# How many columns are in ChickWeight?


ncol(ChickWeight)

## [1] 4

Dataframe column names

One of the nice things about dataframes is that each column will have
a name. You can use these name to access specific columns by name
78 yarrr! the pirates guide to r

without having to know which column number it is.

names()
names()
To access the names of a dataframe, use the function names(). This
will return a string vector with the names of the dataframe. Lets use
names() to get the names of the ToothGrowth dataframe:

# What are the names of the ToothGrowth dataframe?


names(ToothGrowth)

## [1] "len" "supp" "dose"

Accessing dataframe columns by name with $


To access a specific column in a dataframe by name, you use the $ op-
erator in the form df$colname where df is the name of the dataframe,
and colname is the name of the column you are interested in. This
operation will then return the column you want as a vector.
Lets use the $ operator to get a vector of just the length column
df$column
(called len) from the ToothGrowth dataset:
# Return the len column of ToothGrowth
ToothGrowth$len

## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3
## [15] 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5
## [29] 23.3 29.5 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3
## [43] 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9
## [57] 26.4 27.3 29.4 23.0

If you want to access several columns by name, you can forgo the $
operator, and put a character vector of column names in brackets:

# Give me the len AND supp columns of ToothGrowth


ToothGrowth[c("len", "supp")]

Because the $ operator returns a vector, you can easily calculate


descriptive statistics on columns of a dataframe using $. Lets calcu-
late the mean tooth length with mean(), and the frequency of each
supplement with table():

# What is the mean of the len column of ToothGrowth?


mean(ToothGrowth$len)

## [1] 18.81333

# Give me a table of the supp column of ToothGrowth.


table(ToothGrowth$supp)
6: matrices and data frames 79

##
## OJ VC
## 30 30

Adding new columns to a dataframe


You can add new columns to a dataframe using the $ and assignment
<- operators. To do this, just use the dataframe$colname notation and
assign a new vector to it. Lets test this by adding a new column to
ToothGrowth called len.in which converts the original len column
from centimeters to inches. Because 1 cm = 2.54 inches, well just
multiply the original len column by 2.54. df$new <- [ ]
# Add a new column called len.in to
# ToothGrowth calculated as len * 2.54

ToothGrowth$len.in <- ToothGrowth$len * 2.54

You can add new columns with any information that you want to
a dataframe - even basic numerical vectors. For example, lets add
a simple index column to the dataframe showing the original order
of the rows of the data. To do this, Ill assign the numeric vector
1:nrow(ToothGrowth) to a new column called index

# Add a new columnn called index to


# ToothGrowth with integers 1:nrow(ToothGrowth)

ToothGrowth$index <- 1:nrow(ToothGrowth)

Changing dataframe column names


To change the name of a column in a dataframe, just use a combina-
tion of the names() function, indexing, and reassignment. For exam-
ple, to change the name of the first column of ToothGrowth to length
just index the first value of names(ToothGrowth) and reassign: Tip! If you have a dataframe with
several columns, and you want to
# Change the name of the first column of change one of the column names, you
can avoid having to visually look up the
# ToothGrowth to "length" index by utilizing the which() function.
For example, in a dataframe called df,
you can change a column titled old to
names(ToothGrowth)[1] <- "length"
new as follows:

# Change the name of the "old" column


# in df to "new"
names(df)[which(names(df) == "old")] <- "new"

If you have a dataset with, say, over


100 variables, this trick can save you a
lot of time and eye-strain.
80 yarrr! the pirates guide to r

Slicing and dicing dataframes

Once you have a dataset stored as a matrix or dataframe in R, youll


want to start accessing specific parts of the data based on some crite-
ria. For example, if your dataset contains the result of an experiment
comparing different experimental groups, youll want to calculate
statistics for each experimental group separately. The process of se-
lecting specific rows and columns of data based on some criteria is
commonly known as slicing and dicing.

Indexing dataframes with brackets [rows, columns]


Just like vectors, you can access specific data in dataframes using
brackets. But now, instead of just using one indexing vector, we use
two indexing vectors: one for the rows and one for the columns.
Figure 24: Slicing and dicing data. The
To do this, use the notation data[rows, columns], where rows and
turnip represents your data, and the
columns are vectors of integers. knife represents indexing with brackets,
Lets try indexing the ToothGrowth dataframe. Again, the Tooth- or subsetting functions like subset().
The black-eyed clown holding the knife
Growth dataframe represents the results of a study testing the effec- is just off camera.
tiveness of different types of supplements on the length of guinea
pigs teeth. First, lets look at the entries in rows 1 through 5, and
df[rows, columns]
column 1:

# Give me the rows 1-5 and column 1 of ToothGrowth


ToothGrowth[1:5, 1]

## [1] 4.2 11.5 7.3 5.8 6.4

Because the first column is len, the primary dependent measure,


this means that the tooth lengths in the first 5 observations are 4.2,
11.5, 7.3, 5.8, 6.4.
Of course, you can index matrices and dataframes with longer vec-
tors to get more data. Now, lets look at the first 8 rows of columns 1
and 3:

# Give me rows 1-8 and columns 1 and 3 of ToothGrowth


ToothGrowth[1:8, c(1,3)]

## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
6: matrices and data frames 81

## 8 11.2 0.5

Get an entire row (or column)


If you want to look at an entire row or an entire column of a matrix
or dataframe, you can leave the corresponding index blank. For
Figure 25: Ah the ToothGrowth
example, to see the entire 1st row of the ToothGrowth dataframe, we dataframe. Yes, one of the dataframes
can set the row index to 1, and leave the column index blank: stored in R contains data from an ex-
periment testing the effectiveness of
different doses of Vitamin C supple-
# Give me the 1st row of ToothGrowth
ments on the growth of guinea pig
ToothGrowth[1, ] teeth. The images I found by Googling
guinea pig teeth were all pretty
## len supp dose len.in index horrifying, so lets just go with this one.

## 1 4.2 VC 0.5 10.668 1

Similarly, to get the entire 2nd column, set the column index to 2
and leave the row index blank:

# Give me the 2nd column of ToothGrowth


ToothGrowth[, 2]

Many, if not all, of the analyses you will be doing will be on


subsets of data, rather than entire datasets. For example, if you have
data from an experiment, you may wish to calculate the mean of
participants in one group separately from another. To do this, well
use subsetting selecting subsets of data based on some criteria.
To do this, we can use one of two methods: indexing with logical
vectors, or the subset() function. Well start with logical indexing
first.

Indexing dataframes with brackets [ , ]


Indexing dataframes with logical vectors is almost identical to index-
ing single vectors. First, we create a logical vector containing only
TRUE and FALSE values. Next, we index a dataframe (typically the
rows) using the logical vector to return only values for which the
logical vector is TRUE.
For example, to create a new dataframe called ToothGrowth.VC
containing only data from the guinea pigs who were given the VC
supplement, wed run the following code:

# Create a new df with only the rows of ToothGrowth


# where supp equals VC

ToothGrowth.VC <- ToothGrowth[ToothGrowth$supp == "VC", ]


82 yarrr! the pirates guide to r

Of course, just like we did with vectors, we can make logical


vectors based on multiple criteria and then index a dataframe
based on those criteria. For example, lets create a dataframe called
ToothGrowth.OJ.a that contains data from the guinea pigs who were
given an OJ supplement with a dose less than 1.0:

# Create a new df with only the rows of ToothGrowth


# where supp equals OJ and dose < 1

ToothGrowth.VC.a <- ToothGrowth[ToothGrowth$supp == "OJ" &


ToothGrowth$dose < 1, ]

Indexing with brackets is the standard way to slice and dice


dataframes. However, the code can get a bit messy. A more elegant
method is to use the subset() function.

subset()
subset()
subset()
x
The data (usually a dataframe)

subset
Figure 26: The subset() function is like
A logical vector indicating which rows you want to select
a lightsaber. An elegant function from a
more civilized age...
select
An optional vector of the columns you want to select

Lets use the subset() function to create a new, subsetted dataset


from the ToothGrowth dataframe containing data from guinea pigs
who had a tooth length less than 20cm (len < 20), given the OJ
supplement (supp == "OJ"), and with a dose greater than or equal to
1 (dose >= 1):

# Create a subsetted dataframe from ToothGrowth


# of rows where len < 20, supp == "OJ" and dose >= 1

subset(x = ToothGrowth,
subset = len < 20 &
supp == "OJ" &
6: matrices and data frames 83

dose >= 1
)

## len supp dose len.in index


## 41 19.7 OJ 1 50.038 41
## 49 14.5 OJ 1 36.830 49
# EXPLANATION:
#
As you can see, there were only two datapoints that satisfied all 3 # From the ToothGrowth dataframe,
of our subsetting criteria. # give me all rows where len < 20
In the example above, I didnt specify an input to the select # AND supp == "OJ" AND dose >= 1.
# Then, return just the len and
argument because I wanted all columns. However, if you just want # supp columns.
certain columns, you can just name the columns you want in the
subset(x = ToothGrowth,
select argument (see example on the right): subset = len < 20 &
supp == "OJ" &
dose >= 1,
Combining slicing and dicing with functions select = c(len, supp)
)
Now that you know how to slice and dice dataframes using indexing
and subset(), you can easily combine slicing and dicing with statis- ## len supp
## 41 19.7 OJ
tical functions to calculate summary statistics on groups of data. For
## 49 14.5 OJ
example, the following code will calculate the mean tooth length of
guinea pigs with the OJ supplement using the subset() function:

# Step 1: Create a subsettted dataframe called oj

oj <- subset(x = ToothGrowth,


subset = supp == "OJ"
)

# Step 2: Calculate the mean of the len column from


# the new subsetted dataset

mean(oj$len)

## [1] 20.66333

Or, we can do the same thing with logical indexing:

# Step 1: Create a subsettted dataframe called oj


oj <- ToothGrowth[ToothGrowth$supp == "OJ",]

# Step 2: Calculate the mean of the len column from


# the new subsetted dataset

mean(oj$len)

## [1] 20.66333
84 yarrr! the pirates guide to r

Or heck, we can do it all in one line by only referring to column


vectors:

# Step 1: Take the mean of len column, indexed by


# the supp column

mean(ToothGrowth$len[ToothGrowth$supp == "OJ"])

## [1] 20.66333

As you can see, R allows for many methods to accomplish the


same task. The choice is up to you.

Additional tips

with()
with(df, )
The function with() helps to save you some typing when you are us-
ing multiple columns from a dataframe. Specifically, it allows you to
specify a dataframe (or any other object in R) once at the beginning
of a line then, for every object you refer to in the code in that line, R
will assume youre referring to that object in an expression.
For example, using the ToothGrowth dataset, we can calculate a
vector defined as len divided by dose using the with() function
where we specify the name of the dataframe (ToothGrowth) just once
at the beginning of the line, and then refer to the column names
without re-writing the dataframe or using the $ operator:

# Create a vector of len / dose from ToothGrowth


with(ToothGrowth, len / dose)

# THIS IS IDENTICAL TO

ToothGrowth$len / ToothGrowth$dose

If you find yourself making several calculations on one dataframe,


the with() function can save you a lot of typing.
6: matrices and data frames 85

Test your R might! Pirates and superheroes

The following table shows the results of a survey of 10 pirates. In


addition to some basic demographic information, the survey asked
each pirate What is your favorite superhero? and How many
tattoos do you have?

Name Sex Age Superhero Tattoos


Astrid f 30 Batman 11
Lea m 25 Superman 15
Sarina m 25 Batman 12
Remon m 29 Spiderman 12
Letizia f 72 Batman 17 Figure 27: This is a lesser-known
Babice m 22 Antman 12 superhero named Maggott who
could transform his body to get
Jonas f 35 Batman 9
superhuman strength and endurance,
Wendy f 7 Superman 13 but to do so he needed to release two
Niveditha m 32 Maggott 875 huge parasitic worms from his stom-
ach cavity and have them eat things
Gioia f 21 Superman 0 (http://heavy.com/comedy/2010/04/the-
20-worst-superheroes/). Yeah...Im
1. Combine the data into a single dataframe. Complete all the follow- shocked this guy wasnt a hit.
ing exercises from the dataframe!

2. What is the median age of the 10 pirates?

3. What was the mean age of female and male pirates separately?

4. What was the most number of tattoos owned by a male pirate?

5. What percent of pirates under the age of 32 were female?

6. What percent of female pirates are under the age of 32?

7. Add a new column to the dataframe called tattoos.per.year


which shows how many tattoos each pirate has for each year in
their life.

8. Which pirate had the most number of tattoos per year?

9. What are the names of the female pirates whose favorite superhero
is Superman?

10. What was the median number of tattoos of pirates over the age of
30 whose favorite superhero is Spiderman?
7: Importing, saving, and managing data

Remember way back in Chapter 220 when I said everything in R is 20


You know...back when we first
an object? Well, thats still true. In this chapter, well cover the basics met...we were so young and full of
excitement then...sorry, now Im getting
of R object management. Well cover how to load new objects like emotional...lets move on.
external datasets into R, how to manage the objects that you already
have, and how to export objects from R into external files that you
can share with other people or store for your own future use.

Helpful workspace functions


Here are some functions helpful for managing your workspace that
well go over in this chapter:

Figure 28: Your workspace all the


objects, functions, and delicious glue
Code Result youve defined in your current session.
getwd() Returns the current working directory
setwd(file = ) Changes the working directory to a specified file location
list.files() Returns a vector of all files and folders in the working
directory
ls() Display all objects in the current workspace
rm(a, b, ..) Removes the objects a, b... from your workspace
rm(list = ls()) Deletes all objects in your workspace
save(a, b, .., file = "myimage.RData) Saves objects a, b, ... to myimage.RData
save.image(file = "myimage.RData") Saves your entire workspace to myimage.RData in the
working directory
load(file = "myimage.RData") Loads a stored workspace called myimage.RData from
the working directory
write.table(x, file = "mydata.txt") Saves the object x as a text file called "mydata.txt" to the
working directory
read.table(file = "mydata.txt") Reads a text file called "mydata.txt" in the working
directory into R.
88 yarrr! the pirates guide to r

Workspace / Working Environment

The workspace (aka your working environment) represents all of the


objects and functions you have either defined in the current session,
or have loaded from a previous session. When you started RStudio
for the first time, the working environment was empty because you
hadnt created any new objects or functions. However, as you defined
new objects and functions using the assignment operator <-, these
new objects were stored in your working environment. When you
closed RStudio after defining new objects, you likely got a message
asking you Save workspace image...? This is RStudios way of
asking you if you want to save all the objects currently defined in
your workspace as an image file on your computer.

The working directory


The working directory is just a file path on your computer that sets
the default location of any files you read into R, or save out of R. If
you ask R to import a dataset from a text file, or save a dataframe
as a text file, it will assume that the file is in, or near, your working
directory unless you say otherwise.
It is a good idea to have a new working directory for every R

getwd()
related project you work on. To see your current working directory,
use the getwd():

# Print my current working directory


getwd()

## [1] "/Users/CaptainJack/Desktop/yarrr"

As you can see, when I run this code, it tells me that my working
directory is in a folder on my Desktop called yarrr.
If you want to change your working directory, use the setwd()
function. For example, if I wanted to change my working directory to
an existing Dropbox folder called yarrr, Id run the following code:

# Chnage my working directory to the following path


setwd(dir = "/Users/CaptainJack/Dropbox/yarrr")
setwd()
Project in RSTudio
If youre using RStudio, you have the option of creating a new R
project. A project in RStudio is simply a working directory designated
with a .RProj file. When you open a project, the working directory
7: importing, saving, and managing data 89

will automatically be set to the directory that the .RProj file is located
in. Personally, I recommend creating a new R Project whenever you
are starting a new research project. However, its not necessary.
Once youve decided on a working directory for a specific R
project, its a good idea to put new folders in the directory which
will contain your R code, data files, notes, and other material relevant
to your project. In Figure 29 you can see how my working directory
looks for a project I am working on called ForensicFFT.

Figure 29: Here is the folder structure I


use for the working directory in my R
project called ForensicFFT. As you can
see, it contains an .Rproj file generated
by RStudio which sets this folder as
the working directory. I also created
separate folders for analyses (e.g.; R
code), data (e.g.; text files, and .RData
image files) and notes among others

ls()
If you want to see all the objects defined in your current workspace,
use the ls() function:

# Print all the objects in my workspace

ls()
ls()

## [1] "experiment.df" "htest" "a" "b"

The result above says that I have 4 objects in my workspace called


experiment.df, htest, a and b.

.RData files
When you save objects from R to your computer using the save.image()
and save() functions, the objects are saved as .RData files. If you
open a .RData file outside of R (say by double-clicking it), this will
90 yarrr! the pirates guide to r

open either R or RStudio (you can choose which by right-clicking)


and load the objects it contains into a new workspace.
The .RData file format is specific to R and cannot be opened in any
other software. If you want to save data like a dataframe in a format
that can be read by other software, like Excel or SPSS, youll need to
save the object as a text file using a function like write.table(). Well
get to this a bit later.

save.image()
To save all the objects in your workspace as a .RData file, use the
save.image()
save.image() function. This will save all the objects as a .RData file
close to your working directory. For example, to save my workspace
in the data folder located in my working directory, Id run the follow-
ing:

# Save my workspace to myimage.RData in th,e


# data folder of my working directory

save.image(file = "data/myimage.RData")

save()
save()
If you dont want to save your entire workspace, but instead only
want to save a few selected objects, you can use the save() func-
tion. For example, lets say I want to store both data from an ex-
periment, and the results of a hypothesis test, in one file called
Mar7_experiment.RData. If the data is stored in a dataframe called
experiment.df, and the hypothesis test is stored in an object called
htest, I can run the following code:

save(experiment.df, htest,
file = "data/Mar7_experiment.RData")

load()
load()
To load a saved workspace, use the load() function. For example, to
load the workspace that I just saved to the data folder in my working
directory, Id run the following:

load(file = "data/myimage.RData")
7: importing, saving, and managing data 91

rm()
rm()
At some points in your analyses, you may find that your workspace
is filled up with one or more objects that you dont need either
because theyre slowing down your computer, or because theyre just
distracting. To remove objects from your workspace, use the rm()
function.
To remove specific objects, enter the objects as arguments to rm().
For example, to remove a huge dataframe called huge.df, Id run the
following;

# Remove huge.df from workspace


rm(huge.df)

If you want to remove all of the objects in your working directory,


enter the argument list = ls()

# Remove ALL objects from workspace


rm(list = ls())

Exporting dataframes as .txt files

While .RData files are great for saving R objects, sometimes youll
want to export data (specifically dataframes) as a simple .txt text file

write.table()
that other programs, like Excel and SPSS, can also read. To do this,
use the write.table() function.
92 yarrr! the pirates guide to r

write.table()
x
The object you want to write to a text file. Usually a dataframe
object.

file
The documents file path relative to the working directory un-
less specified otherwise. For example file = "mydata.txt" will
save the data as a text file directly in the working directory, while
file = "data/mydata.txt" will save the data in a folder called
data inside the working directory. If you want to write the file to
someplace outside of your workspace, just put the full file path.

sep
A string indicating how the columns are separated. For comma
separated files, use ,, for tabdelimited files, use \t

row.names
A logical value (T or F) indicating whether or not save the row-
names in the text file.

For example, the following code will save the ToothGrowth object
as a tabdelimited text file in my working directory:

# Write the ChickWeight dataframe object to a tab-delimited


# text file called chickweight.txt in my working directory

write.table(x = ChickWeight,
file = "chickweight.txt",
sep = "\t"
)

If you want to save a file to a folder inside of your working di-


rectory, just include the folder name before the file name. For ex-
ample, if I have a folder called data in my working directory, I can
save the file to this folder with the file = "data/chickweight.txt"
argument. Finally, if you want to save a file to a location outside
of your working directory, just use the entire directory name. For
example, to save a text file to my Desktop, I would set file =
"Users/nphillips/Desktop". When you enter a long path name
into the file argument of read.table(), R will look for that directory
outside of your working directory.
7: importing, saving, and managing data 93

Reading dataframes from .txt files

If you have a text file that you want to read into R, use the read.table()
read.table()
function.

read.table()

file
The documents file path (make sure to enter as a string with
quotation marks!) OR an html link to a file.

header
A logical value indicating whether the data has a header row
that is, whether the first row of the data represents the column
names.

sep
A string indicating how the columns are separated. For comma
separated files, use ",", for tab-delimited files, use "\t"

stringsAsFactors
A logical value indicating whether or not to convert strings to
factors. I always set this to FALSE (because I dont like using
factors)

The three critical arguments to read.table() are file, sep, and


header. The file argument is a character value telling R where to
find the file. This can be either a specific directory on your harddrive,
or a folder in your working directory. The sep argument tells R how
the columns are separated in the file (again, for a commaseparated
file, use sep = ",", for a tabdelimited file, use sep = "\t". Finally,
the header is a logical value (T or F) telling R whether or not the first
row in the data is the name of the data columns.
Lets test this function out by reading in a text file titled mydata.txt.
Since the text file is located a folder called data in my working direc-
tory, Ill use the file path file = "data/mydata.txt" and since the file
is tabdelimited, Ill use the argument sep = "\t":

# Read a tab-delimited text file called mydata.txt into


# R and store as a new object called mydata

mydata <- read.table(file = 'data/mydata.txt',


94 yarrr! the pirates guide to r

sep = '\t',
header = T
)

Again, if you want to read a test file outside of your working


directory, just use the entire path as the argument to file = . For
example, to read in a datafile on my desktop, Id use the argument
file = "/Users/nphillips/Desktop/mydata.txt".

Reading datafiles directly from a web URL


A really neat feature of the read.table() function is that you can use
it to load datafiles directly from the web. To do this, just set the file
path to the documents web URL (beginning with http://www.). For
example, I have a text file stored at http://goo.gl/jTNf6P. You can
import and save this tabdelimited text file as a new object called
fromweb as follows:

fromweb <- read.table(file = 'http://goo.gl/jTNf6P',


sep = '\t',
header = T
)

## Warning in file(file, "rt"): unable to resolve goo.gl


## Error in file(file, "rt"): cannot open the connection
The fromweb dataframe should look
I think this is pretty darn cool. This means you can save your main like this:
data files on Dropbox or a web-server, and always have access to it
fromweb
from any computer by accessing it from its web URL.
## Error in eval(expr, envir,
enclos): object fromweb not found
Additional Tips

There are many functions other than read.table() for importing


data. For example, the functions read.csv and read.delim are
specific for importing comma-separated and tab-separated text
files. In practice, these functions do the same thing as read.table,
but they dont require you to specify a sep argument. Personally, I
always use read.table() because it always works and I dont like
trying to remember unnecessary functions.

If you absolutely have to read a non-text file into R, check out the
package called foreign. This package has functions for importing
Stata, SAS and Shitty Piece of Shitty Shit files directly into R. To
read Excel files, try the package xlsx
7: importing, saving, and managing data 95

Test your R Might!

1. In RStudio, open a new R Project in a new directory by clicking


File New Project. Call the directory MyRProject, and then
select a directory on your computer for the project. This will be the
projects working directory.

2. Outside of RStudio, navigate to the directory you selected in


Question 1 and create three new folders Call them data, r, and
notes.

3. Go back to RStudio and open a new R script. Save the script as


CreatingObjects.R in the r folder you created in Question 2.

4. In the script, create new objects called a, b, and c. You can assign
anything to these objects from vectors to dataframes. If you cant
think of any, use these:

a <- data.frame("sex" = c("m", "f", "m"),


"age" = c(19, 43, 25),
"favorite.movie" = c("Moon", "The Goonies", "Spice World")
)

b <- mean(a$age)

c <- table(a$sex)

5. Send the code to the Console so the objects are stored in your
current workspace. Use the ls() function to see that the objects
are indeed stored in your workspace.

6. I have a tabdelimited text file called club at the following web ad-
dress: http://nathanieldphillips.com/wp-content/uploads/2015/12/club.txt.
Using read.table(), load the data as a new object called club.df
in your workspace.

7. Using write.table(), save the dataframe as a tabdelimited text


file called club.txt to the data folder you created in Question 221 21
You wont use the text file again
for this exercise, but now you have it
8. Save the three objects a, b, c, and club.df to an .RData file called handy in case you need to share it with
"myobjects.RData" in your data folder using save(). someone who doesnt use R.

9. Clear your workspace using the rm(list = ls()) function. Then,


run the ls() function to make sure the objects are gone.
10. Open a new R script called AnalyzingObjects.R and save the script
to the r folder you created in Question 2.
96 yarrr! the pirates guide to r

11. Now, in your AnalyzingObjects.R script, load the objects back into
your workspace from the "myobjects.RData" file using the load()
function. Again, run the ls() function to make sure all the objects
are back in your workspace.
12. Add some R code to your AnalyzingObjects.R script. Calculate
some means and percentages. Now save your AnalyzingObjects.R
script, and then save all the objects in your workspace to "myob-
jects.RData".
13. Congratulations! You are now a well-organized R Pirate! Quit
RStudio and go outside for some relaxing pillaging.
8: Advanced dataframe manipulation

In this chapter well cover some more advanced functions and proce-
dures for manipulating dataframes

Row and column statistics: rowMeans, colMeans, rowSums, colSums

To calculate sums or means of numerical data across rows or columns


of a matrix or dataframe, you can use one of the functions: rowMeans,
colMeans, rowSums and colSums.
For example, lets say you give an exam containing 3 questions
to 5 different students. The data could be represented in a matrix,
where each row is a student and each column is a question as fol-
lows:

exam <- data.frame(


"q1" = c(1, 5, 2, 3, 2),
"q2" = c(8, 10, 9, 8, 7),
"q3" = c(3, 7, 4, 6, 4)
)

exam

## q1 q2 q3
## 1 1 8 3
## 2 5 10 7
## 3 2 9 4
## 4 3 8 6
## 5 2 7 4

To calculate the mean score across columns (questions), we can


use the colMeans function. This will return the mean score for each
column.

colMeans(exam)

## q1 q2 q3
## 2.6 8.4 4.8

It looks like question 1 was the hardest (with a mean score of 2.6),
and question 2 was the easiest (with a mean score of 8.4).
98 yarrr! the pirates guide to r

To calculate the mean score across rows (students), we can use the
rowMeans function. This will return the mean score for each row:

rowMeans(exam) As you can imagine, the rowSums and


colSums functions work exactly the
same way except they calculate sums
## [1] 4.000000 7.333333 5.000000 5.666667 4.333333 instead of means:

It looks like the second student did the best (with a mean score of colSums(exam)
8.4). ## q1 q2 q3
## 13 42 24

Sorting dataframes with order() rowSums(exam)

To sort the rows of a dataframe according to column values, use a ## [1] 12 22 15 17 13

combination of indexing, reassignment, and the order() function.


The order() function takes one or more vectors as arguments, and
returns an integer vector indicating the order of the vectors. You can
use the output of order() to index a dataframe, and thus change its
order.
Lets re-order our exam data by the values of the question 1 col-
umn. To do this, well put order(exam$q1) as the row index: Important! Dont forget than in order
to change an object in R, you must
reassign it! If you try to reorder a
# Sort the exam dataframe by Q1 dataframe without reassigning it with
exam <- exam[order(exam$q1),] the <- operator, it wont change.
exam

## q1 q2 q3
## 1 1 8 3
## 3 2 9 4
## 5 2 7 4
## 4 3 8 6
## 2 5 10 7

To order a dataframe by several columns, just add additional argu-


ments to order(). For example, to order the dataframe by Question 1
and then by Question 2, wed do the following:

# Sort the exam dataframe by Q1 and then Q2


exam <- exam[order(exam$q1, exam$q2),]

By default, the order() function will sort values in ascending


(increasing) order. If you want to order the values in descending
(decreasing) order, just add the argument decreasing = T to the
order() function:
8: advanced dataframe manipulation 99

# Sort the exam dataframe by Q3 in decreasing order


exam <- exam[order(exam$q3, decreasing = T),]

You can also use the order() function to order character vectors in
alphabetical order. For example, if I add a character vector of student
names to the exam dataframe, I can then sort the data according to
student in alphabetical order:

# Add student names to the exam dataframe


exam$student <- c("david", "jack", "vincent", "heath", "alice")

# Sort the exam data by the student names


exam <- exam[order(exam$student),]
exam

## q1 q2 q3 student
## 1 1 8 3 alice
## 2 5 10 7 david
## 3 2 9 4 heath
## 4 3 8 6 jack
## 5 2 7 4 vincent

Grouped aggregation with aggregate()

In Chapter 6, you learned how to calculate statistics on subsets of


data using subsetting, indexing, and basic functions. However, you
may have noticed that the code is not terribly efficient. For example,
lets say you want to know if a persons favorite TV show predicts
whether or not they smoke. To answer this, you might conduct a
study where you ask 1,000 people to name their favorite show and
indicate whether or not they smoke. After you collect the data, you
want to calculate, what percent of people smoke for every favorite
TV show. If you try to do this with Chapter 6 techniques, youd
have to write a separate line of code for every possible TV show in
the data. Thankfully, R has some great built-in functions that allow
you to easily apply functions (like mean()) to a dependent variable
(like whether or not people smoke) for every level of one or more
# The SLOW way to calculate summary statistics
independent variables (like their favorite TV show) with just a few
lines of code. mean.seinfeld <- mean(smoke[tv == "seinfeld"])
mean.simpsons <- mean(smoke[tv == "simpsons"])
The first aggregation function well cover is aggregate(). The func-
mean.tatort <- mean(smoke[tv == "tatort"])
tion aggregate() takes three arguments, a formula in the form y #.....
x1 + x2 + ... defining the dependent (Y) and one or more indepen-
dent variables (x1, x2, ...), a function (FUN), and a dataframe (data).
When you execute aggregate(y x1 + x2 + ..., data, FUN), R
100 yarrr! the pirates guide to r

will apply the input function (FUN) to the dependent variable (Y)
separately for each level(s) of the independent variable(s) (x1, x2, ...)
and then print the result. Lets see how it works:

aggregate()

formula
A formula in the form y x1 + x2 + ... where y is the depen-
dent variable, and x1, x2... are the index (independent) variables.
For example, salary sex + age will aggregate a salary column
at every unique combination of age and age

FUN
A function that you want to apply to x at every level of the in-
dependent variables. For example, FUN = mean will calculate the
mean for each level of the independent variables.

data
The dataset containing the variables in formula

...
Optional arguments passed on to FUN (like na.rm = T to ignore
NA values in x)

Lets give aggregate() a whirl. No...not a whirl...well give it a


spin. Definitely a spin. Well use aggregate() on the ToothGrowth
dataset to answer the question What is the mean tooth length for

aggregate()
each supplement?" For this question, well set the value of the depen-
dent variable Y to len, x1 to supp, and FUN to mean

# Calculate the mean tooth length (DV) for


# EVERY value of supp (IV)

aggregate(formula = len ~ supp, # DV is len, IV is supp


FUN = mean, # Calculate the mean of each group
data = ToothGrowth # dataframe is ToothGrowth
)

## supp len
## 1 OJ 20.66333
## 2 VC 16.96333

As you can see, the aggregate() function has returned a dataframe


8: advanced dataframe manipulation 101

with a column for the independent variable (supp), and a column for
the results of the function mean applied to each level of the indepen-
dent variable. The result of this function is the same thing wed get
by manually indexing each level of supp individually.

Combine aggregate() with subset()


If you want to aggregate statistics for a subset of a dataframe, just
subset the dataframe in the data argument with the subset() func-
tion:

# Calculate the median tooth length


# for EVERY value of SUPP
# ONLY for dose > .5

aggregate(formula = len ~ supp,


FUN = median,
data = subset(x = ToothGrowth,
subset = dose > .5)
)

You can also include multiple independent variables in the for-


mula argument to aggregate(). For example, lets use aggregate()
to now get the median value of len for all combinations of both supp
and dose (the amount of the supplement):

# Calculate the median tooth length


# for every combination of supp AND dose

aggregate(formula = len ~ supp + dose,


FUN = median,
data = ToothGrowth
)

## supp dose len


## 1 OJ 0.5 12.25
## 2 VC 0.5 7.15
## 3 OJ 1.0 23.45
## 4 VC 1.0 16.50
## 5 OJ 2.0 25.95
## 6 VC 2.0 25.95

Our new result now shows us the median length for all combina-
tions of both supplement (supp) and dose.
You can even use aggregate() to evaluate functions that return
more than one value. For example, the summary() function returns
102 yarrr! the pirates guide to r

several summary statistics from a vector. Lets use the summary()


function in aggregate() to calculate several summary statistics for
each combination of supp and dose:

# Calculate summary statistics of tooth length


# for every combination of supp AND dose

aggregate(formula = len ~ supp + dose,


FUN = summary,
data = ToothGrowth
)

## supp dose len.Min. len.1st Qu. len.Median len.Mean len.3rd Qu. len.Max.
## 1 OJ 0.5 8.20 9.70 12.25 13.23 16.18 21.50
## 2 VC 0.5 4.20 5.95 7.15 7.98 10.90 11.50
## 3 OJ 1.0 14.50 20.30 23.45 22.70 25.65 27.30
## 4 VC 1.0 13.60 15.27 16.50 16.77 17.30 22.50
## 5 OJ 2.0 22.40 24.58 25.95 26.06 27.08 30.90
## 6 VC 2.0 18.50 23.38 25.95 26.14 28.80 33.90

While aggregate() is good for calculating summary statistics


for a single dependent variable, it cant handle multiple dependent
variables. For example, if you wanted to calculate summary statistics
for multiple dependent variables in a dataset, youd need to execute
an aggregate() command for each dependent variable, and then
combine the results into a single dataframe. Thankfully, a recently
released R package called dplyr makes this process very simple!

Aggregation with the dplyr package

The dplyr package is a relatively new R package that allows you to


do all kinds of analyses quickly and easily. It is especially useful for
creating tables of summary statistics across specific groups of data.
In this section, well go over a very brief overview of how you can
use dplyr to easily do grouped aggregation. Just to be clear - you can
use dplyr to do everything the aggregate() function does and much
more! However, this will be a very brief overview and I strongly
recommend you look at the help menu for dplyr for additional
descriptions and examples.
To use the dplyr package. You first need to load it22 22
If the dplyr package isnt in-
stalled on your computer, youll
need to install it by running
library(dplyr) install.packages("dplyr").

Programming with dplyr looks a lot different than programming


in standard R. dplyr works by combining objects (dataframes and
8: advanced dataframe manipulation 103

columns in dataframes), functions (mean, median, etc.), and verbs


(special commands in dplyr). In between these commands is a new
operator called the pipe which looks like this: %>%. The pipe simply
tells R that you want to continue executing some functions or verbs
on the object you are working on. You can think about this pipe as The pipe operator %>% in dplyr
is used to link multiple arguments
meaning and then... sequentially. You can think of %>% as
To aggregate data with dplyr, your code will look something like meaning and then...
the following code. In this example, assume that the dataframe you
want to summarize is called my.df, the variable you want to group
the data by called grouping.column, and the columns you want to
aggregate are called col.a, col.b and col.c

my.df %>% # Specify original dataframe


group_by(grouping.column) %>% # Grouping variable
summarise(
a = mean(col.a), # calculate mean of column col.a in my.df
b = sd(col.b), # calculate sd of column col.b in my.df
c = max(col.c) # calculate max on column col.c in my.df, ...
)

When you use dplyr, you write code that sounds like: "The origi-
nal dataframe is XXX, now filter the dataframe to only include rows
that satisfy the conditions YYY, now group the data at each level of
the variable(s) ZZZ, now summarize the data and calculate summary
functions XXX..."
Lets start with an example: Lets create a dataframe of aggregated
data from the ToothGrowth dataset. Im going to group the data
according to the columns supp and dose. Ill then create several
columns is a different summary statistic of some data across each
grouping. To create this aggregated data frame, I will use the new
function group_by and the verb summarise. I will assign the result to
a new dataframe called ToothGrowth.agg:

ToothGrowth.agg <- ToothGrowth %>% # Define dataframe, THEN...


group_by(supp, dose) %>% # Define the grouping variable, THEN...
summarise( # Tell R you are going to calculate summaries
len.min = min(len), # Define first summary...
len.mean = mean(len), # Define second summary...
len.sd = sd(len),
n = n() # How many are in each group?
) # End

ToothGrowth.agg

## Source: local data frame [6 x 6]


## Groups: supp [?]
##
## supp dose len.min len.mean len.sd n
## (fctr) (dbl) (dbl) (dbl) (dbl) (int)
104 yarrr! the pirates guide to r

## 1 OJ 0.5 8.2 13.23 4.459709 10


## 2 OJ 1.0 14.5 22.70 3.910953 10
## 3 OJ 2.0 22.4 26.06 2.655058 10
## 4 VC 0.5 4.2 7.98 2.746634 10
## 5 VC 1.0 13.6 16.77 2.515309 10
## 6 VC 2.0 18.5 26.14 4.797731 10

As you can see, our final object ToothGrowth.agg is the aggregated


dataframe we want which aggregates all the columns we wanted
for each combination of supp and dose. One key new function here
is n(). This function is specific to dplyr and returns a frequency of
values in a summary command.
Lets do a more complex example where we combine multiple
verbs into one chunk of code. Well create a new dataframe called
movies.agg that aggregates data from the movies dataframe. In
addition to aggregating the data, well filter out data to only include
values that specify some criteria.
library(yarrr) # load the yarrr package

movies.agg <- movies %>% # From the movies dataframe...


filter(genre != "Horror" & time > 50) %>% # Select only these rows
group_by(rating, sequel) %>% # Group by genre and sequel
summarise( #
frequency = n(), # How many movies in each group?
budget.mean = mean(budget, na.rm = T), # Mean budget?
revenue.mean = mean(revenue.all), # Mean revenue?
billion.p = mean(revenue.all > 1000) # Percent of movies with revenue > 1000?
)

movies.agg # Print the result!

## Source: local data frame [14 x 6]


## Groups: rating [?]
##
## rating sequel frequency budget.mean revenue.mean billion.p
## (chr) (int) (int) (dbl) (dbl) (dbl)
## 1 G 0 59 56.564837 233.54391 0.000000000
## 2 G 1 12 111.500000 357.22634 0.083333333
## 3 NC-17 0 2 3.750000 18.48954 0.000000000
## 4 Not Rated 0 84 13.281818 55.52732 0.000000000
## 5 Not Rated 1 12 8.000000 66.06133 0.000000000
## 6 PG 0 312 64.885651 190.58432 0.009615385
## 7 PG 1 62 87.036364 371.67376 0.016129032
## 8 PG-13 0 645 59.893058 167.70012 0.006201550
## 9 PG-13 1 120 127.347863 523.79108 0.116666667
## 10 R 0 623 37.673610 109.14962 0.000000000
## 11 R 1 42 66.121622 226.16161 0.000000000
## 12 NA 0 86 8.332353 33.67170 0.000000000
## 13 NA 1 15 41.300000 48.07824 0.000000000
## 14 NA NA 11 NaN 34.05020 0.000000000

As you can see, our result is a dataframe with 14 rows and 6


columns. The data are summarized from the movie dataframe, only
include values where the genre is not Horror and the movie length is
longer than 50 minutes, is grouped by rating and sequel, and shows
8: advanced dataframe manipulation 105

several summary statistics.

Additional Tips

There is an entire class of functions in R that apply functions to


groups of data. One common one is tapply(), sapply() and
lapply() which work very similarly to aggregate(). For example,
you can calculate the average length of movies by genre with
tapply() as follows:

with(movies, tapply(X = time,


INDEX = genre,
FUN = mean,
na.rm = T))

Weve only scratched the surface of what you can do with dplyr.
For more tips, check out the dplyr vignette at https://cran.r-
project.org/web/packages/dplyr/vignettes/databases.html.
106 yarrr! the pirates guide to r

Test your R might!: Mmmmm...caffeine

Youre in charge of analyzing the results of an experiment testing the


effects of different forms of caffeine on a measure of performance. In
the experiment, 100 participants were given either Green tea or coffee,
in doses of either 1 or 5 servings. They then performed a cognitive
test where higher scores indicate better performance.
The data are stored in a tabdelimited dataframe at the following
link: https://dl.dropboxusercontent.com/u/7618380/datasets/caffeinestudy.txt

1. Load the dplyr library

2. Load the dataset from https://dl.dropboxusercontent.com/u/7618380/datasets/caffeinestudy.txt


as a new dataframe called caffeine.

3. Calculate the mean age for each gender

4. Calculate the mean age for each drink

5. Calculate the mean age for each combined level of both gender
and drink

6. Calculate the median score for each age

7. For men only, calculate the maximum score for each age

8. Create a dataframe showing, for each level of drink, the mean,


median, maximum, and standard deviation of scores.

9. Only for females above the age of 20, create a table showing, for
each combined level of drink and cups, the mean, median, max-
imum, and standard deviation of scores. Also include a column
showing how many people were in each group.
# Create blank plot
plot(1, xlim = c(0, 100), ylim = c(0, 100),
xlab = "Pirate Quality", ylab = "", type = "n",
main = "Two different Pirate colleges", yaxt = "n"
)

# Set colors
require("RColorBrewer")
col.vec <- brewer.pal(10, name = "Set3")[4:6]

# Draw Samples

9: Sampling and Probability Distributions samples.1 <- runif(n = 5, 40, 60)


samples.2 <- runif(n = 5, 10, 90)

text(50, 90, "Pirate Training Unlimited", font = 3)

for(i in 1:length(samples.1)) {

points(samples.1[i], 75, pch = 21, bg = col.vec[1], cex = 3)


text(samples.1[i], 75, round(samples.1[i], 0))

Sampling data from probability distributions segments(40, 65, 60, 65, col = col.vec[1], lty = 1, lwd = 2)

text(50, 40, "Unlimited Pirate Training", font = 3)

By now you know how to generate sequences of numbers with the for(i in 1:length(samples.2)) {

functions :, seq(), and rep(). However, these functions dont gener- points(samples.2[i], 25, pch = 21, bg = col.vec[2], cex = 3)
text(samples.2[i], 25, round(samples.2[i], 0))

ate very interesting data. Instead, we can use R to generate randomly }

sampled data from specified probability distributions. A probability segments(10, 15, 90, 15, col = col.vec[2], lty = 1, lwd = 2)

distribution is simply an equation also called a likelihood function


that indicates how likely certain numerical values are to be drawn. Two different Pirate colleges

For example, imagine you need to hire a new group of pirates


Pirate Training Unlimited
for your crew. You have the option of hiring people form one of two
4855
46
44
different pirate training colleges that produce pirates of varying
quality. One college Pirate Training Unlimited" might tend to pirates Unlimited Pirate Training
that are generally ok - never great but never terrible. While another 27 45 57 70
72

college Unlimited Pirate Training" might produce pirates with a


wide variety of quality, from very low to very high. In Figure 30 I 0 20 40 60 80 100
plotted 5 example pirates from each college, where each pirate is
Pirate Quality
shown as a ball with a number written on it. As you can see, pirates
from PTU all tend to be clustered between 40 and 60 (not terrible but Figure 30: Sampling 5 potential pirates
from two different pirate colleges.
not great), while pirates from UPT are all over the map, from 0 to Pirate Training Unlimited (PTU)
100. We can use probability distributions (in this case, the uniform consistently produces average pirates
(with scores between 40 and 60), while
distribution) to mathematically define how likely any possible value Unlimited Pirate Training (UPT),
is to be drawn at random from a distribution. produces a wide range of pirates from 0
In the next section well go over some of the most commonly used to 100.

sampling distributions: the Normal and Uniform distributions.


108 yarrr! the pirates guide to r

The Normal (Gaussian) distribution require("RColorBrewer")

# Create blank plot


Lets start with the most famous distribution in statistics: the Normal plot(1, xlim = c(-5, 5), ylim = c(0, 1),
xlab = "x", ylab = "dnorm(x)", type = "n",
(or if you want to sound pretentious, the Gaussian) distribution. The main = "Three Normal Distributions"
)
Normal distribution is bell-shaped, and has two parameters: a mean # Set up design matrix for loop
design.matrix <- data.frame("mean" = c(0, -2, 1),
and a standard deviation. See the margin figure 31 for plots of three "sd" = c(1, .5, 2)
)
different Normal distributions with different means and standard # Set colors
col.vec <- brewer.pal(10, name = "Set3")[4:6]
deviations. # Start loop over distributions
for (i in 1:nrow(design.matrix)) {
To generate samples from a normal distribution in R, we use the
mean.i <- design.matrix$mean[i]
function rnorm() this function has three arguments: sd.i <- design.matrix$sd[i]

fun <- function(x) {


dnorm(x, mean = mean.i, sd = sd.i)}

curve(expr = fun,

rnorm() from = -5,to = 5,


xlab = "x", lwd = 3,
add = T, col = col.vec[i])

samples <- rnorm(n = 10, mean = mean.i, sd = sd.i)


n segments(x0 = samples, y0 = rep(0, 10),
x1 = samples, y1 = fun(samples),
col = col.vec[i], lwd = 1, lty = 2
The number of observations )
}

mean legend.fun <- function(i) {


paste("mean = ", design.matrix$mean[i],
", sd = ", design.matrix$sd[i], sep = "")}
The mean of the Normal distribution from which samples are legend("topright",
legend = c(legend.fun(1),
drawn (not the sample mean!!) legend.fun(2),
legend.fun(3)
),
sd lwd = rep(3, 3),
col = col.vec[1:3]
)
The standard deviation of the Normal distribution from which
samples are drawn
Three Normal Distributions

1.0
mean = 0, sd = 1
mean = 2, sd = 0.5
0.8 mean = 1, sd = 2

For example, lets draw 5 samples (n = 5) from a normal distribu-


0.6

tion with mean 0 (mean = 0) and standard deviation 1 (sd = 1)


dnorm(x)

0.4

rnorm(n = 5, mean = 0, sd = 1)
0.2
0.0

## [1] 0.1756358 0.6535578 -0.0654903 0.6455222 0.3642940


4 2 0 2 4

x
This code returns a vector of 5 values, where each value is a new
Figure 31: Three different normal
random sample drawn from a Normal distribution with mean = 0
distributions with different means and
and standard deviation = 1. standard deviations.
Because the sampling is done randomly, youll get different values
each time you run the rnorm() (or any other random sampling)
function. To see this, lets create two different sets of samples from a
normal distribution with mean 10 and standard deviation 5 and see
how they compare:

rnorm(5, mean = 10, sd = 5)

## [1] 14.217208 6.617606 8.042835 12.024703 4.465707


9: sampling and probability distributions 109

rnorm(5, mean = 10, sd = 5)

## [1] 8.083498 11.470101 18.965303 14.070161 3.399878

As you can see, even though I used the exact same code to gen-
erate the vectors a and b, the numbers in each sample are different.
Thats because the samples are each drawn randomly and inde-
pendently from the normal distribution. To visualize the sampling
process, run the code in the margin Figure 31 on your machine sev- # uniform distribution
curve(dunif,
eral times. You should see the sampling lines dance around the from = 0, to = 1,
xlim = c(-.5, 1.5),
xlab = "x",
distribution. lwd = 2,
main = "Uniform\nmin = 0, max = 1")

The Uniform distribution Uniform


min = 0, max = 1

Next, lets move on to the uniform distribution. The uniform distribu-

1.4
tion gives equal probability to all values between the minimum and

1.2
maximum values.

dunif(x)
To generate samples from a uniform distribution, we use the

1.0
function runif(), the function has 3 arguments:

0.8
0.6
runif() 0.5 0.0 0.5 1.0 1.5

n Figure 32: The Uniform distribution


- known colloquially as the Anthony
The number of observations (i.e.; samples) Davis distribution.

min
The lower bound of the Uniform distribution from which samples
are drawn

max
The upper bound of the Uniform distribution from which samples
are drawn

Lets draw 5 samples from two uniform distributions, one with


bounds at 0 and 1, and one with bounds at -100 and 100:

runif(5, min = 0, max = 1) # 5 samples from U[0, 1]

## [1] 0.03176634 0.57970549 0.15420484 0.12527050 0.14798581

runif(5, min = -100, max = 100) # 5 samples from U[-100, 100]

## [1] 82.668526 -48.513919 8.493059 29.804493 -70.161126


110 yarrr! the pirates guide to r

Sampling from a set of values: sample()


The next function well use is sample(). The sample function works
a bit differently from runif() and rnorm() because it allows to you
to define which values you want to sample and the probability
associated with each value. For example, if you want to simulate the
flip of a fair coin, you can tell the sample function to draw the value
"Heads" with probability .50, and the value "Tails" with probability
.50.

sample()
x
A vector of outcomes you want to sample from. For example, to
simulate coin flips, youd enter x = c("Heads", "Tails")

size
The number of samples you want to draw.

replace
Should sampling be done with replacement? If T, then each in-
dividual sample will be replaced in the data vector. If F, then the
same outcome will never be drawn more than once. Think about
replacement like drawing different balls from a bag. Sampling
with replacement (replace = T) means that each time you draw a
ball, you return the ball back into the bag before drawing another
ball. Sampling without replacement (replace = F) means that after
you draw a ball, you remove that ball from the bag before drawing
again.

prob
A vector of probabilities of the same length as x indicating how
likely each outcome in "x" is. For example, to sample equally
from two outcomes, youd enter prob = c(.5, .5). The first value
corresponds to the first value of x and the second corresponds to
the second value (etc.). The vector of probabilities you give as an
argument should add up to one. However, if they dont, R will just
rescale them so that they will sum to 1.

As a simple example, lets simulate 10 flips of a fair coin, were the


probably of getting either a Head or Tail is .50:
9: sampling and probability distributions 111

sample(x = c("Heads", "Tails"), # The values you want to sample from


size = 10, # The number of samples
prob = c(.5, .5), # The probability of each value
replace = T # Sampling with replacement par(mar = c(3, 3, 3, 3))
plot(1, xlim = c(0, 1), ylim = c(0, 1),
) xlab = "", ylab = "", xaxt = "n",
yaxt = "n", type = "n",
main = "Chest of 20 Gold, 30 Silver,\nand 50 Bronze Coins")
## [1] "Heads" "Tails" "Heads" "Heads" "Heads" "Tails" "Tails" "Tails"
points(runif(100, .1, .9),
## [9] "Heads" "Tails" runif(100, .1, .9),
pch = 21, cex = 3,
bg = c(rep("gold", 20),
rep("gray94", 30),
As you can see, our function returned a vector of 10 values corre- )
rep("chocolate", 50))

sponding to our sample size of 10. Keep in mind that, just like using
rnorm() and runif(), the sample() function can give you different Chest of 20 Gold, 30 Silver,
and 50 Bronze Coins
outcomes every time you run it.

Drawing coins from a treasure chest


Now, lets sample drawing coins from a treasure chest Lets say the
chest has 100 coins: 20 gold, 30 silver, and 50 bronze. Lets draw 10
random coins from this chest. Because we remove coins when we
draw them, well set replace = F.

# Create chest with the 100 coins

chest <- c(rep("gold", 20),


rep("silver", 30),
rep("bronze", 50)
)

# Draw 10 coins from the chest without replacement

sample(x = chest,
size = 10,
prob = rep(1 / 100, times = 100),
replace = F
)

## [1] "bronze" "silver" "bronze" "bronze" "bronze" "bronze" "silver"


## [8] "silver" "silver" "bronze"

The output of the sample() function above is a vector of 10 strings


indicating the type of coin we drew on each sample. The order of
these strings matter: the first one is the first coin we drew, and the
last one is the 10th coin we drew. And like any random sampling
function, this code will likely give you different results every time
you run it! See how long it takes you to get 10 gold coins...
112 yarrr! the pirates guide to r

Simulating Pinder Outcomes


Lets simulate some Pinder outcomes. For those who dont know,
Pinder is an app that allows Pirates to view profiles of potential dates.
For each potential date, you can see their picture and either "like"
them by swiping right, or "dislike" them by swiping left. If a pirate
that you "liked" also "likes" you, then youve had a successful match
and will be able to start chatting. lets say you "swipe right" on 20
Pinder profiles and the probability you get a match is 20%. We can
simulate this using the sample function

sample(x = c("Match!!!", "No Match"),


size = 20,
replace = T, # Replace each sample back to the set
prob = c(.2, .8) # Probability of Match! is .2, and No Match :( is .8)
)

## [1] "No Match" "No Match" "No Match" "No Match" "No Match" "No Match"
## [7] "No Match" "No Match" "No Match" "No Match" "No Match" "No Match"
## [13] "No Match" "No Match" "No Match" "Match!!!" "No Match" "No Match"
## [19] "No Match" "No Match"

The output of this function is a simulated response from 10 pirates Figure 33: A typical Pinder profile.
that you liked.

A worked example: A quick test of the law of large numbers

According to the law of large numbers, the larger our sample size,
the closer our sample mean should be to the population mean. In
other words, the more data (samples) you have, the more accurate
your estimate should be. Lets test this by drawing either a small
(N = 10) or a large (N = 1,000,000) number of observations from a
Normal distribution with mean = 100 and sd = 20: Tip: You can easily write large powers
of 10 by using the notation 1eN, where
small <- rnorm(10, mean = 100, sd = 20) # 10 observations N is the power of 10. For example: 1e6
large <- rnorm(1e6, mean = 100, sd = 20) # One million observations is the same as 1,000,000

If our test worked, then the difference for the small sample should
be larger than the large sample. Lets test this by calculating the
mean of each sample and see how close they are to the true popula-
tion mean of 100:

mean(small) # What is the mean of the small sample?

## [1] 100.9499

mean(large) # What is the mean of the large sample?

## [1] 100.0289

mean(small) - 100 # How far is the mean of Small from 100?


9: sampling and probability distributions 113

## [1] 0.9499404

mean(large) - 100 # How far is the mean of Large from 100?

## [1] 0.02889741

Looks like the law of large numbers holds up!

Test your R might!

1. Create a vector samp1 with 100 values drawn from a Normal


distribution with a mean of 100 and a standard deviation of 20.
What is the mean and median of the samples? What percent of the
samples are greater than 100?

2. Create a vector called samp2 with 50 values drawn from a Uniform


distribution with a minimum of -100 and a maximum of +100.
What is the What is the mean and median of the samples? What
percent of the samples are greater than 50?

3. Above all else, pirates love two things: treasure and gummy bears.
There are few things a pirate likes more than reaching into a bag
of fresh gummy bears and pulling out a handful of those gifts
from the pirate gods. Imagine that you have a bag filled with 100 Figure 34: A gummy captain.
gummy bears: 30 Green, 40 Red, 10 Blue, and 20 Puce. Draw a
random sample of 10 gummy bears from this bag. How many of
each kind did you get in the sample? What percent of the sample
are the terrible vomit-flavored puce color?

Additional Tips
Use set.seed() to control what your
Sometimes you may want sampling functions like rnorm() and random sampling functions return
runif to return consistent values for demonstration purposes.
For example, lets say you are showing someone how to use the
rnorm() function, and you want them get the same values you
get. You can accomplish this by using the set.seed() function.
To use the function, enter an integer as the main argument - it
doesnt matter what integer you pick. When you do, you will
fix Rs random number generator to a certain state associated
with the specific integer you use. After you run set.seed(), the
next random number generation function you use will always
give the same result. Lets try generating 5 values from a Normal
distribution with mean = 10 and sd = 1. But first, Ill set the seed
to 10.
114 yarrr! the pirates guide to r

set.seed(10)
rnorm(n = 5, mean = 10, sd = 1)

## [1] 10.018746 9.815747 8.628669 9.400832 10.294545

The result doesnt look so interesting, but watch me get the


same result again by running the same code:

set.seed(10)
rnorm(n = 5, mean = 10, sd = 1)

## [1] 10.018746 9.815747 8.628669 9.400832 10.294545

Thankfully the change isnt permanent. If you run a sampling


function twice after setting the seed once, the next run will be a
new random sample.
10: Plotting: Part 1

Sammy Davis Jr. was one of the greatest American performers of all
time. If you dont know him already, Sammy was an American enter-
tainer who lived from 1925 to 1990. The range of his talents was just
incredible. He could sing, dance, act, and play multiple instruments
with ease. So how is R like Sammy Davis Jr? Like Sammy Davis Jr.,
R is incredibly good at doing many different things. R does data
analysis like Sammy dances, and creates plot like Sammy sings. If
Sammy and R did just one of these things, theyd be great. The fact
that they can do both is pretty amazing.

How does R manage plots?

When you evaluate plotting functions in R, R can build the plot in


different locations. The default location for plots is in a temporary
plotting window within your R programming environment. In Figure 35: The great Sammy Davis
Jr. Do yourself a favor and spend
RStudio, plots will show up in the Plot window (typically on the
an evening watching videos of him
bottom right hand window pane). In Base R, plots will show up in a performing on YouTube. Image used
Quartz window. entirely without permission.

You can think of these plotting locations as canvases. You only


have one canvas active at any given time, and any plotting command
you run will put more plotting elements on your active canvas.
Certain highlevel plotting functions like plot() and hist() create
brand new canvases, while other low--level plotting functions like
points() and segments() place elements on top of existing canvases.
Dont worry if thats confusing for now well go over all the
details soon.
Lets start by looking at a basic scatterplot in R using the plot()
function. When you execute the following code, you should see a
plot open in a new window:

# A basic scatterplot

plot(x = 1:10,
y = 1:10,
116 yarrr! the pirates guide to r

xlab = "X Axis label",


ylab = "Y Axis label",
main = "Main Title"
)

Main Title
10
8
Y Axis label

6
4
2

2 4 6 8 10

X Axis label

Lets take a look at the result. We see an xaxis, a yaxis, 10 data


points, an xaxis label, a yaxis label, and a main plot title. Some
of these items, like the labels and data points, were entered as ar-
guments to the function. For example, the main arguments x and
y are vectors indicating the x and y coordinates of the (in this case,
10) data points. The arguments xlab, ylab, and main set the labels to
the plot. However, there were many elements that I did not specify
from the x and y axis limits, to the color of the plotting points. As
youll discover later, you can change all of these elements (and many,
many more) by specifying additional arguments to the plot() func-
tion. However, because I did not specify them, R used default values
values that R uses unless you tell it to use something else.
For the rest of this chapter, well go over the main plotting func-
tions, along with the most common arguments you can use to cus-
tomize the look of your plot.
10: plotting: part 1 117

Color basics

Most plotting functions have a color argument (usually col) that


allows you to specify the color of whatever your plotting. There are
many ways to specify colors in R, lets start with the easiest ways.

Specifying simple colors


Here are some random colors that are
col = "red", col = "blue", col = "lawngreen" (etc.) stored in R, try running colors() to see
them all!
The easiest way to specify a color is to enter its name as a string.
For example col = "red" is Rs default version of the color red. Of
grey81 grey7
course, all the basic colors are there, but R also has tons of quirky coral3 darkslategray salmon3

colors like snow, papayawhip and lawngreen. To see all color names magenta4 pink2 gray1 olivedrab gray75

in R, run the code


rosybrown3 lemonchiffon4 orange2 gray76 grey0

colors() # Show me all the color names!


grey19 brown2 gray19 greenyellow plum1

col = gray(level, alpha) lightseagreen indianred1 darkgoldenrod lavenderblush2 gray81

The gray() function takes two arguments, level and alpha, and
returns a shade of gray. For example, gray(level = 1) will return
white. The alpha argument specifies how transparent to make gray(x)
the color on a scale from 0 (completely transparent), to 1 (not
transparent at all). The default value for alpha is 1 (not transparent 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
at all.)

Transparent Colors with transparent()


I dont know about you, but I almost always find transparent colors
to be more appealing than solid colors. Not only do they help you
see when multiple points are overlapping, but theyre just much nicer
to look at. Just look at the overlapping circles in the plot below.
par(mar = c(0, 0, 0, 0))
plot(1, xlim = c(0, 6), ylim = c(0, 1),
type = "n", xaxt = "n", yaxt = "n",
xlab = "", ylab = "", bty = "n")

text(1.5, .9, "Standard", cex = 1.5)

points(x = c(1, 2, 1.5),


y = c(.5, .5, .3),
col = c("red", "blue", "yellow"),
pch = 16, cex = 20)

text(4.5, .9, "Transparent", cex = 1.5)

library(yarrr)

points(x = c(4, 5, 4.5),


y = c(.5, .5, .3),
col = c(transparent("red", .5),
transparent("blue", .5),
transparent("yellow", .5)),
pch = 16, cex = 20)
118 yarrr! the pirates guide to r

Standard Transparent
# Generate Random data
g1.x <- rnorm(100)
g1.y <- g1.x + rnorm(100)

g2.x <- rnorm(100)


g2.y <- g2.x + rnorm(100)

# Plot with Standard Colors


plot(g1.x, g1.y,
col = "blue",
pch = 16,
main = "Standard Colors",
cex = 1)

points(g2.x, g2.y,
col = "red",
pch = 16)

Unfortunately, as far as I know, base-R does not make it easy


to make transparent colors. Thankfully, there is a function in the Standard Colors

yarrr package called transparent that makes it very easy to make

4
any color transparent. To use it, just enter the original color as the

2
main argument (orig.col), then enter how transparent you want to

g1.y

0
make it (from 0 to 1) as the second argument (trans.val). You can
either save the transparent color as a new color object, or use the

2
function directly in a plotting function like I do in the scatterplot in

4
the margin. Here are some examples: 2 1 0 1 2 3

g1.x
library(yarrr) # load the yarrr package

# Plot with Transparent olors


# Make my.blue, a transparent version of blue
my.blue <- transparent(orig.col = "blue", library(yarrr) # Load the yarrr package
trans.val = .5)
plot(g1.x, g1.y,
col = transparent("blue", .7),
# Use my.blue as a color pch = 16, main = "Transparent Colors",
cex = 1)
plot(1, col = my.blue)
points(g2.x, g2.y,
# Or put the transparent function in the plotting col = transparent("red", .7),
pch = 16)
# function directly:

Transparent Colors
plot(1, col = transparent("blue", .5))
4

Later on in the book, well cover more advanced ways to come


2

up with colors using color palettes (using the RColorBrewer pack-


g1.y

age or the piratepal() function in the yarrr package) and func-


0

tions that generate shades of colors based on numeric data (like the
2

colorRamp2() function in the circlize package).


4

2 1 0 1 2 3

g1.x
10: plotting: part 1 119

High vs. low-level plotting commands


There are two general types of plotting commands in R: high and
low-level. High level plotting commands, like plot(), hist() and
pirateplot() create entirely new plots. Within these high level
plotting commands, you can define the general layout of the plot -
like the axis limits and plot margins.
Low level plotting commands, like points(), segments(), and
text() add elements to existing plots. These low level commands
dont change the overall layout of a plot - they just add to what
youve already created. Once you are done creating a plot, you can
export the plot to a pdf or jpeg using the pdf() or jpeg() functions.
Or, if youre creating documents in Markdown or Latex, you can add
your plot directly to your document.

Plotting arguments
Most plotting functions have tons of optional arguments (also called
parameters) that you can use to customize virtually everything in a
plot. To see all of them, look at the help menu for par by executing
?par. However, the good news is that you dont need to specify all
possible parameters you create a plot. Instead, there are only a few
critical arguments that you must specify - usually one or two vectors
of data. For any optional arguments that you do not specify, R will
use either a default value, or choose a value that makes sense based
on the data you specify.
In the following examples, I will to cover the main plotting param-
eters for each plotting type. However, the best way to learn what you
can, and cant, do with plots, is to try to create them yourself!
I think the best way to learn how to create plots is to see some
examples. Lets start with the main high-level plotting functions.
120 yarrr! the pirates guide to r

Here are some of the main arguments


Scatterplot: plot() to plot()
x, y Vectors specifying the x and y
The most common high-level plotting function is plot(x, y). The values of the points
plot() function makes a scatterplot from two vectors x and y, where type Type of plot. "l" means lines, "p"
the x vector indicates the x (horizontal) values of the points, and the y means points, "b" means lines and
vector indicates the y (vertical) values. points, "n" means no plotting
main Label for the plot title
plot(x = 1:10, # x-coordinates xlab Label for the x-axis
y = 1:10, # y-coordinates ylab Labels for the y-axis
type = "p", # Draw points (not lines) xlim Limits to the x-axis. For example,
xlim = c(0, 100) will set the
main = "My First Plot", minimum and maximum of the
xlab = "This is the x-axis label", x-axis to 0 and 100.
ylab = "This is the y-axis label", ylim Limits to the y-axis. For example,
xlim = c(0, 11), # Min and max values for x-axis ylim = c(50, 200) will set the
minimum and maximum of the
ylim = c(0, 11), # Min and max values for y-axis y-axis to 50 and 200.
col = "blue", # Color of the points pch An integer indicating the type of
pch = 16, # Type of symbol (16 means Filled circle) plotting symbols (see ?points and
section below), or a string specifying
cex = 1 # Size of the symbols symbols as text. For example,
) pch = 21 will create a two-color
circle, while pch = "P" will plot the
character P. To see all the different
symbol types, run ?points.
My First Plot col Main color of the plotting symbols.
For example col = "red" will create
red symbols.
bg Color of the background of two-
color symbols 21 through 25. For
10

example pch = 21, bg = "blue"


will the background of the two-color
This is the yaxis label

circle to Blue.
8

cex A numeric vector specifying the size


of the symbols (from 0 to Inf). The
default size is 1. cex = 2 will make
6

the points very large, while cex =


.5 will make them very small.
4
2
0

0 2 4 6 8 10

This is the xaxis label

Aside from the x and y arguments, all of the arguments are op-
tional. If you dont specify a specific argument, then R will use a
10: plotting: part 1 121

default value, or try to come up with a value that makes sense. For
example, if you dont specify the xlim and ylim arguments, R will set
the limits so that all the points fit inside the plot.

Symbol types: pch


When you create a plot with plot() (or points with points()), you
can specify the type of symbol with the pch argument. You can
specify the symbol type in one of two ways: with an integer, or with
a string. If you use a string (like "p"), R will use that text as the
par(mar = c(1, 1, 3, 1))
plotting symbol. If you use an integer value, youll get the symbol
plot(x = rep(1:5 + .1, each = 5),
that correspond to that number. See Figure 36 for all the symbol y = rep(5:1, times = 5),
pch = 1:25,
xlab = "", ylab = "", xaxt = "n", yaxt = "n",
types you can specify with an integer. xlim = c(.5, 5.5),
ylim = c(0, 6),
Symbols differ in their shape and how they are colored. Symbols 1 bty = "n", bg = "gray", cex = 1.4,
main = "pch = "
through 14 only have borders and are always empty, while symbols )

text(x = rep(1:5, each = 5) - .35,


15 through 20 dont have a border and are always filled. Symbols 21 y = rep(5:1, times = 5),
labels = 1:25, cex = 1.2
through 25 have both a border and a filling. To specify the border )

color or background for symbols 1 through 20, use the col argument.
For symbols 21 through 25, you set the color of the border with col, pch =
and the color of the background using bg
Lets look at some different symbol types in action:
1 6 11 16 21
2 7 12 17 22
pch = 2, pch = 16, 3 8 13 18 23
col = 'blue' col = 'orchid2'
4 9 14 19 24
5 10 15 20 25

Figure 36: The symbol types associated


with the pch plotting parameter.

pch = 21, pch = 25,


col = 'black', col = 'pink3',
bg = 'orangered2 bg = 'plum3
122 yarrr! the pirates guide to r

Histogram: hist()
hist(x = ChickWeight$weight,
Histograms are the most common way to plot unidimensional nu- main = "Fancy Chicken Weight Histogram",
xlab = "Weight",
meric data. To create a histogram well use the hist() function. The
ylab = "Frequency",
main argument to hist() is a x, a vector of numeric data. If you want breaks = 20, # 20 Bins
to specify how the histogram bins are created, you can use the breaks xlim = c(0, 500),
col = "papayawhip", # Filling Color
argument. To change the color of the border or background of the border = "hotpink" # Border Color
bins, use col and border: )
Lets create a histogram of the weights in the ChickWeight dataset:
Fancy Chicken Weight Histogram

hist(x = ChickWeight$weight,
main = "Chicken Weights",

100
xlab = "Weight",

80
Frequency
xlim = c(0, 500)

60
)

40
20
0
Chicken Weights 0 100 200

Weight
300 400 500
200

Figure 37: Fancy histogram

You can plot two histograms in one plot


by adding the add = T argument to the
150

second hist() function:

hist(x = ChickWeight$weight[ChickWeight$Diet == 1],


Frequency

main = "Two Histograms in one",


100

xlab = "Weight",
ylab = "Frequency",
breaks = 20,
xlim = c(0, 500),
50

col = gray(0, .5)


)

hist(x = ChickWeight$weight[ChickWeight$Diet == 2],


0

breaks = 30,
add = T, # Add plot to previous one!
col = gray(1, .5)
0 100 200 300 400 500 )

Weight Two Histograms in one


35
30

We can get more fancy by adding additional arguments like


25

breaks = 20 to force there to be 20 bins, and col = "papayawhip"


20
Frequency

and bg = "hotpink" to make it a bit more colorful (see the margin


15

figure 37)
10

If you want to plot two histograms on the same plot, for example,
5

to show the distributions of two different groups, you can use the add
0

= T argument to the second plot. See Figure 38 to see this in action. 0 100 200 300 400 500

Weight

Figure 38: Plotting two histograms


in the same plot using the add = T
argument to the second plot.
10: plotting: part 1 123

Barplot: barplot() # Step 1: calculate the average weight by time


avg.weights <- aggregate(weight ~ Time,
data = ChickWeight,
A barplot is good for showing summary statistics for different FUN = mean)

groups. The primary argument to a barplot is height: a vector of # Step 2: Plot the result
barplot(height = avg.weights$weight,
names.arg = avg.weights$Time,
numeric values which will generate the height of each bar. To add xlab = "Week",
ylab = "Average Weight",
names below the bars, use the names.arg argument. For additional ar- main = "Average Chicken Weights by Time",
col = "yellow",

guments specific to barplot(), look at the help menu with ?barplot: border = "gray")

Average Chicken Weights by Time


barplot(height = 1:5, # A vector of heights
names.arg = c("G1", "G2", "G3", "G4", "G5"), # A vector of names

200
main = "Example Barplot",

150
xlab = "Group",

Average Weight
ylab = "Height"

100
)

50
Example Barplot

0
0 2 4 6 8 10 12 14 16 18 20 21
5

Week

# Calculate and plot average weights for Diet 2


d2.weights <- aggregate(weight ~ Time,
data = ChickWeight,
4

subset = Diet == 2,
FUN = mean)

barplot(height = d2.weights$weight,
names.arg = d2.weights$Time,
xlab = "Week",
ylab = "Average Weight",
main = "Average Chicken Weights by Time",
3

col = transparent("green", .5),


border = NA)
Height

# Do the same for Diet 1 and add to existing plot


d1.weights <- aggregate(weight ~ Time,
data = ChickWeight,
subset = Diet == 1,
2

FUN = mean)

barplot(height = d1.weights$weight,
add = T, # Add to existing plot!
col = transparent("blue", .5),
border = NA)
1

Average Chicken Weights by Time


200
0

150

G1 G2 G3 G4 G5
Average Weight

100

Group
50

Of course, you should plot more interesting data than just a vector
of integers with a barplot. In the margin figure, I create a barplot
0

with the average weight of chickens for each time point with yel- 0 2 4 6 8 10

Week
12 14 16 18 20 21

low bars. If you want to plot different colored bars from different
Figure 39: Stacked barplots by adding
datasets, create one normal barplot, then create a second barplot with an additional barplot with the add = T
the add = T argument. In Figure 39, I plotted the average weights for argument.

chickens on Diets 1 and 2 separately on the same plot.


124 yarrr! the pirates guide to r

Clustered barplot
If you want to create a clustered barplot, with different bars for
different groups of data, you can enter a matrix as the argument to
height. R will then plot each column of the matrix as a separate set
of bars. For example, lets say I conducted an experiment where I
compared how fast pirates can swim under four conditions: Wearing
clothes versus being naked, and while being chased by a shark versus
not being chased by a shark. Lets say I conducted this experiment
and calculated the following average swimming speed:

Table 1: Mean Swimming times (in


Naked Clothed meters / second)

No Shark 2.1 1.5


Shark 3.0 3.0

I can represent these data in a matrix as follows. In order for the


final barplot to include the condition names, Ill add row and column
names to the matrix with colnames() and rownames()

swim.data <- cbind(c(2.1, 3), # Naked Times


c(1.5, 3)) # Clothed Times

colnames(swim.data) <- c("Naked", "Clothed")


rownames(swim.data) <- c("No Shark", "Shark")

Heres how the final matrix looks:

swim.data

## Naked Clothed
## No Shark 2.1 1.5
## Shark 3.0 3.0

Now, when I enter this matrix as the height argument to barplot(),


Ill get multiple bars.

barplot(height = swim.data,
beside = T, # Put the bars next to each other
legend.text = T, # Add a legend
col = c(transparent("green", .7),
transparent("red", .7)),
main = "Swimming Speed Experiment",
ylab = "Speed (in meters / second)",
xlab = "Clothing Condition",
ylim = c(0, 3.5))
10: plotting: part 1 125

Swimming Speed Experiment


3.5
No Shark
Shark
3.0
2.5
Speed (in meters / second)

2.0
1.5
1.0
0.5
0.0

Naked Clothed

Clothing Condition
126 yarrr! the pirates guide to r

The Pirate Plot: pirateplot()

A Pirate Plot is the plotting choice of R pirates who want to plot a


numeric dependent variable (like beard length) for each level of a
categorical independent variable (like gender or favorite Spice Girl).
The two main arguments to pirateplot() are formula and data. In
formula, you specify plotting variables in the form dv iv, where
dv is the name of the dependent variable, and iv is the name of the
independent variable. In data, you specify the name of the dataframe
object where the variables are stored. The pirateplot() function is part of the
Ill create a pirateplot of the ChickWeight data. Ill set the depen- yarrr package. So to use it, youll first
need to install the yarrr package
dent variable to weight, and the independent variable to Diet.
Here are some of the main arguments
to pirateplot()
library("yarrr")
formula A formula in the form dv iv1
pirateplot(formula = weight ~ Diet, # dv is weight, iv is Diet + iv2 where where dv is the name
data = ChickWeight, of the dependent variable, and
main = "My first pirate plot!", iv1 and iv2 are the name(s) of the
independent variable(s). Up to two
xlab = "Diet", independent variables can be used.
ylab = "Weight") data A dataframe containing the depen-
dent and independent variables.
line.fun A function that determines the
height of the lines and bars. De-
My first pirate plot! faults to mean
pal A string indicating the color palette
of the plot. Can be a single color,
500 or the name of a palette in the
450 piratepal() function (e.g.; "basel",
"google", "drugs")
400 point.cex A number indicating the size of the
points
350
bar.o A number between 0 and 1 specify-
300 ing the opacity of the bars
Weight

point.o ... opacity of the points


250
hdi.o ... opacity of the 95% Highest
200 Density Interval (HDI). Note that
calculating HDIs can be time-
150 consuming!
100 line.o ... opacity of the average line (i.e. the
top of each bar).
50 bean.o ... opacity of the beans.
0 jitter.val A number indicating how much to
jitter the points (default is .05)
1 2 3 4 ... Additional arguments passed to
the plot (e.g.; main, xlab, ylab, xlim,
ylim)
Diet

There are many additional optional arguments to pirateplot()


for example, you can use different color palettes with the pal argu-
10: plotting: part 1 127

To see all the different palettes you can


ment or change how transparent different elements are with bar.o,
use in the piratepal() function, run the
bean.o, line.o, point.o, hdi.o. To see them all, look at the help following code:
menu with ?pirateplot.
piratepal(palette = "all",
For example, heres a second pirateplot where I put Time on action = "show")
the x-axis, change the height of the bars to median, added a gray
background pirate palettes
trans = 0
basel
southpark

pirateplot(formula = weight ~ Time, # dv is weight, iv is Diet


appletv
drugs
google
info2

data = ChickWeight, info


up
rat

line.fun = median, # median heights


nemo
cars
bugs
brave

bean.o = 0, # No bean
lubitel
pancake
cake
emo

point.pch = 16, # Solid points provoking


goldfish
espresso

point.o = .2,
ipod
ghostbusters
ohbrother
usualsuspects

point.cex = 1.2, # Larger points toystory


malcovich
evildead

pal = "basel", # Basel palette


eternal
memento
monalisa

back.col = gray(.97), # gray background


To see a specific palette in detail, run
bar.o = .2, # Lighter bar the following (replacing the name of the
main = "My second pirate plot!", palette with the one you want)

xlab = "Time", piratepal(palette = "nemo", # nemo palette


ylab = "Weight") action = "show")

nemo
trans = 0

My second pirate plot!

500
450
400
350
300
ge
w

n
n

1
ee
ow
llo

ue
an
Weight

gr
ye

bl
br
or

250
200
150
100
50
0
0 2 4 6 8 10 12 14 16 18 20 21

Time

You can include up to two independent variables in the formula


128 yarrr! the pirates guide to r

argument to the piratepal() function to create a souped up clustered


barplot. For example, the ToothGrowth dataframe measures the length
of Guinea Pigs teeth based on different supplements and different
doses. We can plot both as follows:

# Plotting data from two IVs in pirateplot

pirateplot(formula = len ~ dose + supp, # Two IVs


data = ToothGrowth,
xlab = "Dose",
ylab = "Tooth Length",
main = "ToothGrowth Data")

ToothGrowth Data

50

45

40

35

30
Tooth Length

25

20

15

10

0.5 1 2 0.5 1 2
supp = VC supp = OJ
Dose

Its really easy to customize your pirateplots like crazy. In Fig-


ure 40 are four different pirateplots you can make of the ChickWeight
data just by changing a few parameters in the pirateplot() function.
To see how they were created, look at the help menu for pirateplot
with ?pirateplot:
10: plotting: part 1 129

B&W with 95% HDI Super boring bars


400

400
300

300
Weight

Weight
200

200
100

100
0

0
1 2 3 4 1 2 3 4

Diet Diet

Large, jittered points HDI and points


Google colors Dark background
400

400
300

300
Weight

Weight
200

200
100

100
0

1 2 3 4 1 2 3 4

Diet Diet

Figure 40: 4 different pirate plots of the


ChickWeight dataset. Run ?pirateplot
to see the code for these examples.
130 yarrr! the pirates guide to r

Low-level plotting functions

Once youve created a plot with a high-level plotting function, you


can add additional elements with low-level functions. You can add
data points with points(), reference lines with abline(), text with
text(), and legends with legend().

Starting with a blank plot


Before you start adding elements with low-level plotting functions,
its useful to start with a blank plotting space. To do this, execute the
plot() function, but use the type = "n" argument to tell R that you
dont want to plot anything yet. Once youve created a blank plot,
you can additional elements with low-level plotting commands.

plot(x = 1, Figure 41: Ahhhh....a fresh blank canvas


xlab = "X Label", ylab = "Y Label", waiting for low-level plotting functions.
xlim = c(0, 100), ylim = c(0, 100),
main = "Blank Plotting Canvas",
type = "n")

Blank Plotting Canvas


100
80
60
Y Label

40
20
0

0 20 40 60 80 100

X Label
10: plotting: part 1 131

Adding new points to a plot with points()

To add new points to an existing plot, use the points() function23 . 23


The points function has many similar
Lets use points() to create a plot with different symbol types for arguments to the plot() function, like
x (for the x-coordinates), y (for the
different data. Ill use the pirates dataset and plot the relationship y-coordinates), and parameters like col
between a pirates age and the number of tattoos he/she has. Ill (border color), cex (point size), and pch
(symbol type). To see all of them, look
create separate points for male and female pirates: at the help menu with ?points()

library(yarrr) # Load the yarrr package (for pirates dataset)

# Create the plot with male data


plot(x = pirates$age[pirates$sex == "male"],
y = pirates$tattoos[pirates$sex == "male"],
xlim = c(10, 45), ylim = c(0, 22),
pch = 16,
col = transparent("coral2", trans.val = .2),
xlab = "Age", ylab = "Number of Tattoos",
main = "Pirate Tattoos by Age\nMale vs. Female"
)

Now, Ill add points for the female data with points:

# Add points for female data


points(x = pirates$age[pirates$sex == "female"],
y = pirates$tattoos[pirates$sex == "female"],
pch = 16,
col = transparent("steelblue3", trans.val = .3)
)
Because you can continue adding as
My first plotting command with plot() will create the left-hand many low-level plotting commands to a
plot as youd like, you can keep adding
figure below. The second plotting command with points() will add different types or colors of points by
to the plot and create the right-hand figure below. adding additional points() functions.
However, keep in mind that because
R plots each element on top of the
Pirate Tattoos by Age Pirate Tattoos by Age previous one, early calls to points()
Male vs. Female Male vs. Female might be covered by later calls. So
add the points that you want in the
foreground at the end!
20

20
Number of Tattoos

Number of Tattoos
15

15
10

10
5

5
0

10 15 20 25 30 35 40 45 10 15 20 25 30 35 40 45

Age Age
132 yarrr! the pirates guide to r

Adding straight lines with abline() # Create a blank plot.


plot(1,
xlim = c(0, 100),
To add straight lines to a plot, use abline() or segments(). abline() ylim = c(0, 100),
type = "n")

will add a line across the entire plot, while segments() will add a line # Add horizontal and vertical gridlines at 0, 10, ... 100
abline(h = seq(from = 0, to = 100, by = 10),
with defined starting and end points. v = seq(from = 0, to = 100, by = 10))

You can use abline() to add gridlines to a plot. To do this, enter


the locations of horizontal lines with the h argument, and vertical
lines with the v argument. The following code will create Figure 42:

100
80
# Create a blank plot.

60
plot(1, xlim = c(0, 100), ylim = c(0, 100), type = "n")

40
20
# Add horizontal and vertical gridlines at 0, 10, ... 100
abline(h = seq(from = 0, to = 100, by = 10),

0
v = seq(from = 0, to = 100, by = 10)) 0 20 40 60 80 100

Index

To change the look of your lines, use the lty argument, which Figure 42: Adding gridlines with
changes the type of line (see Figure ), lwd, which changes its thick- abline()
ness, and col which changes its color. The following code will use
some of these arguments to create Figure 43: plot(1, xlim = c(0, 100), ylim = c(0, 100), type = "n")

abline(h = seq(from = 0, to = 100, by = 10),


v = seq(from = 0, to = 100, by = 10),
# Create a blank plot. lty
col
= 2,
= gray(.5),
lwd = .5)
plot(1, xlim = c(0, 100), ylim = c(0, 100), type = "n")

# Add horizontal and vertical gridlines at 0, 10, ... 100


abline(h = seq(from = 0, to = 100, by = 10),
100
v = seq(from = 0, to = 100, by = 10), 80

lty = 2, # Dashed line


60
1

col = gray(.5), # Light gray color


40

lwd = .5 # Thin line


20

)
0

0 20 40 60 80 100
The segments() function works very similarly to abline() how- Index
ever, with the segments() function, you specify the beginning and
Figure 43: Adding thin, gray, dashed
end points of the segments. See the help menu (?segments) for de- gridlines using the lty, col, and lwd
tails. arguments to abline()
Once youve created a plot with gridlines using abline(), you can
lty = ...
then add symbols with points()!
1 2 3 4 5 6

Figure 44: Line types generated from


arguments to lty.
10: plotting: part 1 133

Main arguments to text():


Adding text to a plot with text()
x, y: The location(s) of the text

With text(), you can add text to a plot. You can use text() to high- labels: The text to plot

light specific points of interest in the plot, or to add information (like cex: The size of the text

a third variable) for every point in a plot. Ive highlighted some of adj: How should the text be justi-
fied? 0 = left justified, .5 = centered,
the key arguments to text() in Figure 45 1 = right justified
For example, the following code adds the three words "Put", "Text", pos: The position of the text relative
and "Here" at the coordinates (1, 9), (5, 5), and (9, 1) respectively. See to the x and y coordinates. Values
of 1, 2, 3, and 4 put the text below,
Figure 46 for the plot: to the left, above, and to the right of
the x-y coordinates respectively.
plot(1, xlim = c(0, 10), ylim = c(0, 10), type = "n") Figure 45: Main arguments to the
text() function

text(x = c(1, 5, 9),


y = c(9, 5, 1), plot(1, xlim = c(0, 10), ylim = c(0, 10), type = "n")

text(x = c(1, 5, 9),


labels = c("Put", "text", "here") y = c(9, 5, 1),
labels = c("Put", "text", "here")
) )

You can do some cool things with text(). Ill create a scatterplot
of data, and add data labels above each point:

10
height <- c(156, 175, 160, 172, 159) Put
weight <- c(65, 74, 69, 72, 66)

8
id <- c("p1", "p2", "p3", "p4", "p5")

6
text

4
plot(x = height, y = weight,
xlim = c(155, 180), ylim = c(65, 80))

2
here

0
text(x = height, y = weight,
labels = id, pos = 3) 0 2 4 6 8 10

Index

Figure 46: Adding text to a plot. The


characters in the argument to labels
will be plotted at the coordinates
indicated by x and y.
80

p2
75
weight

p4
p3
70

p5
p1
65

155 160 165 170 175 180

height
134 yarrr! the pirates guide to r

To plot text on separate lines in a plot,


When entering text in the labels argument, keep in mind that R put the tag "\n" between lines.
will, by default, plot the entire text in one line. However, if you are plot(1, type = "n", main = "The \\n tag",
xlab = "", ylab = "")
adding a long text string (like a sentence), you may want to separate
# Text without \n breaks
the text into separate lines. To do this, add the text "\n" where you text(x = 1, y = 1.3, labels = "Text without \\n", font = 2)
text(x = 1, y = 1.2,
want new lines to start. Look at Figure 47 for an example. labels = "Haikus are easy. But
)
sometimes they don't make sense. Refrigerator"

abline(h = 1, lty = 2)
# Text with \n breaks
text(x = 1, y = .92, labels = "Text with \\n", font = 2)

Combining text and numbers with paste() text(x = 1, y = .7,


labels = "Haikus are easy\nBut sometimes they don't make sense\nRefrigerator"
)

A common way to use text in a plot, either in the main title of a


plot or using the text() function, is to combine text with numerical The \n tag

data. For example, you may want to include the text "Mean = 3.14"

1.4
Text without \n
in a plot to show that the mean of the data is 3.14. But how can we

1.2
Haikus are easy. But sometimes they don't make sense. Refrigerator
combine numerical data with text? In R, we can do this with the
paste() function:

1.0
Text with \n
The paste function will be helpful to you anytime you want to

0.8
Haikus are easy
combine either multiple strings, or text and strings together. For But sometimes they don't make sense
Refrigerator

0.6
example, lets say you want to write text in a plot that says The mean
0.6 0.8 1.0 1.2 1.4
of these data are XXX, where XXX is replaced by the group mean.
To do this, just include the main text and the object referring to the
Figure 47: Using the "\n" tag to plot
numerical mean as arguments to paste():
text on separate lines.
When you include descriptive statistics
data <- ChickWeight$weight in a plot, you will almost always want
mean(data) to use the round(x, digits) function
to reduce the number of digits in the
statistic.
## [1] 121.8183

paste("The mean of data is", mean(data)) # No rounding

## [1] "The mean of data is 121.818339100346"

paste("The mean of data is", round(mean(data), 2)) # No rounding

## [1] "The mean of data is 121.82"

Lets create a plot with labels using the paste() function. Well
plot the chicken weights over time, and add text to the plot specify-
ing the overall mean and standard deviations of weights.

# Create the plot


plot(x = ChickWeight$Time,
y = ChickWeight$weight,
col = gray(.3, .5),
pch = 16,
main = "Chicken Weights")

# Add text
10: plotting: part 1 135

text(x = 0,
y = 300,
labels = paste("Mean weight = ",
round(mean(ChickWeight$weight), 2),
"\nStandard deviation = ",
round(sd(ChickWeight$weight), 2),
sep = ""),
adj = 0)

Chicken Weights
350
ChickWeight$weight

Mean weight = 121.82


Standard deviation = 71.07
250
150
50

0 5 10 15 20

ChickWeight$Time
136 yarrr! the pirates guide to r

curve() plot(1, xlim = c(-5, 5), ylim = c(-5, 5),


type = "n", main = "Plotting function lines with curve()",
The curve() function allows you to add a line showing a specific ylab = "", xlab = "")
abline(h = 0)
abline(v = 0)
function or equation to a plot
require("RColorBrewer")
col.vec <- brewer.pal(12, name = "Set3")[4:7]

curve(expr = x^2, from = -5, to = 5,


add = T, lwd = 2, col = col.vec[1])

curve() curve(expr = x^.5, from = 0, to = 5,


add = T, lwd = 2, col = col.vec[2])
curve(expr = sin, from = -5, to = 5,
add = T, lwd = 2, col = col.vec[3])

expr my.fun <- function(x) {return(dnorm(x, mean = 2, sd = .2))}


curve(expr = my.fun, from = -5, to = 5,
add = T, lwd = 2, col = col.vec[4])
The name of a function written as a function of x that returns a legend("bottomright",
legend = c("x^2", "x^.5", "sin(x)", "dnorm(x, 2, .2"),
single vector. You can either use base functions in R like expr = col = col.vec[1:4], lwd = 2,
lty = 1, cex = .8, bty = "n"
x2 , expr = x + 4 - 2, or use your own custom functions such as )

expr = my.fun, where my.fun is previously defined (e.g.; my.fun <-


function(x) dnorm(x, mean = 10, sd = 3)) Plotting function lines with curve()

from, to

4
The starting (from) and ending (to) value of x to be plotted.

2
0
add

2
x^2
x^.5
A logical value indicating whether or not to add the curve to an

4
sin(x)
dnorm(x, 2, .2
existing plot. If add = FALSE, then curve() will act like a high-
level plotting function and create a new plot. If add = TRUE, then 4 2 0 2 4

curve() will act like a low-level plotting function.


Figure 48: Using curve() to easily
lty, lwd, col create lines of functions
Additional arguments such as lty, col, lwd, ...

For example, to add the function x2 to a plot from the x-values -10
to 10, you can run the code:

curve(expr = x^2, from = -10, to = 10)

If you want to add a custom function to a plot, you can define the
function and then use that function name as the argument to expr.
For example, to plot the normal distribution with a mean of 10 and
standard deviation of 3, you can use this code:

my.fun <- function(x) {dnorm(x, mean = 10, sd = 3)}


curve(expr = my.fun, from = -10, to 10)

In Figure 48, I use the curve() function to create curves of several


mathematical formulas.
10: plotting: part 1 137

legend() # Generate some random data


female.x <- rnorm(100)
female.y <- female.x + rnorm(100)
The last low-level plotting function that well go over in detail is male.x <- rnorm(100)
male.y <- male.x + rnorm(100)
legend() which adds a legend to a plot. This function has the follow-
# Create plot with data from females
ing arguments plot(female.x, female.y, pch = 16, col = 'blue',
xlab = "x", ylab = "y", main = "Adding a legend with legend()"
)

# Add data from males


points(male.x, male.y, pch = 16, col = 'orange')

legend() # Add legend


legend("bottomright",
legend = c("Females", "Males"),
col = c('blue', 'orange'),
pch = c(16, 16),
x, y bg = "white"
)

Coordinates of the legend - for example, x = 0, y = 0 will put the


text at the coordinates (0, 0). Alternatively, you can enter a string Adding a legend with legend()
indicating where to put the legend (i.e.; "topright", "topleft"). For
example, "bottomright" will always put the legend at the bottom

2
right corner of the plot.

0
labels

2
A string vector specifying the text in the legend. For example, Females
Males
legend = c("Males, "Females") will create two groups with

4
names Males and Females. 2 1 0 1 2

x
pch, lty, lwd, col, pt.bg, ...
Figure 49: Creating a legend labeling
Additional arguments specifying symbol types (pch), line types the symbol types from different groups
(lty), line widths (lwd), background color of symbol types 21
through 25 ((pt.bg)) and several other optional arguments. See
?legend for a complete list

For example, to add a legend to to bottom-right of an existing


graph where data from females are plotted in blue circles and data
from males are plotted in pink circles, youd use the following code:

legend("bottomright", # Put legend in bottom right of graph


legend = c("Females", "Males"), # Names of groups
col = c("blue", "orange"), # Colors of symbols
pch = c(16, 16) # Point types
)

In margin Figure I use this code to add a legend to plot contain-


ing data from males and females.
138 yarrr! the pirates guide to r

Additional low-level plotting functions par(mar = c(0, 0, 3, 0))

plot(1, xlim = c(1, 100), ylim = c(1, 100),


There are many more low-level plotting functions that can add addi- type = "n", xaxt = "n", yaxt = "n",
ylab = "", xlab = "", main = "Adding simple figures to a plot")

tional elements to your plots. Here are some I use. To see examples text(25, 95, labels = "rect()")

of how to use each one, check out their associated help menus. rect(xleft = 10, ybottom = 70,
xright = 40, ytop = 90, lwd = 2, col = "coral")

text(25, 60, labels = "polygon()")

polygon(x = runif(6, 15, 35),

Additional low-level plotting functions y = runif(6, 40, 55),


col = "skyblue"
)

# polygon(x = c(15, 35, 25, 15),


rect() # y = c(40, 40, 55, 40),
# col = "skyblue"
# )
Add rectangles to a plot at coordinates specified by xleft,
text(25, 30, labels = "segments()")
ybottom, xright, ybottom. For example, to add a rectangle with segments(x0 = runif(5, 10, 40),

corners at (0, 0) and c(10, 10), specify xleft = 0, ybottom = 0, y0


x1
=
=
runif(5,
runif(5,
5, 25),
10, 40),
y1 = 5, 25), lwd = 2
xright = 10, ytop = 10. Additional arguments like col, border
runif(5,
)

change the color of the rectangle. text(75, 95, labels = "symbols(circles)")

symbols(x = runif(3, 60, 90),


y = runif(3, 60, 70),
polygon() circles = c(1, .1, .3),
add = T, bg = gray(.5, .1))

Add a polygon to a plot at coordinates specified by vectors x and y. text(75, 30, labels = "arrows()")

Additional arguments such as col, border change the color of the arrows(x0 = runif(3, 60, 90),
y0 = runif(3, 10, 25),
x1 = runif(3, 60, 90),
inside and border of the polygon y1 = runif(3, 10, 25),
length = .1, lwd = 2
)
segments(), arrows()
Add segments (lines with fixed endings), or arrows to a plot. Adding simple figures to a plot

symbols(add = T) rect() symbols(circles)

Add symbols (circles, squares, rectangles, stars, thermometers) to


a plot. The dimensions of each symbol are specified with specific
polygon()
input types. See ?symbols for details. Specify add = T to add to an
existing plot or add = F to create a new plot.
segments() arrows()
axis()
Add an additional axis to a plot (or add fully customizable x and y
axes). Usually you only use this if you set xaxt = "n", yaxt = "n"
in the original high-level plotting function.

mtext()
Add text to the margins of a plot. Look at the help menu for
mtext() to see parameters for this function.
10: plotting: part 1 139

Saving plots to a file

Once youve created a plot in R, you may wish to save it to a file so


you can use it in another document. To do this, youll use either the
pdf() or jpeg() functions. These functions will save your plot to
either a .pdf of jpeg file.

pdf() and jpeg()

file
The name and file destination of the final plot entered as a string.
For example, to put a plot on my desktop, Id write file =
"/Users/nphillips/Desktop/plot.pdf" when creating a pdf, and
file = "/Users/nphillips/Desktop/plot.jpg" when creating a
jpeg.

width, height
The width and height of the final plot in inches.

family()
An optional name of the font family to use for the plot. For exam-
ple, family = "Helvetica" will use the Helvetica font for all text
(assuming you have Helvetica on your system). For more help on
using different fonts, look at section "Using extra fonts in R" in
Chapter XX

dev.off()
This is not an argument to pdf() and jpeg(). You just need to ex-
ecute this code after creating the plot to finish creating the image
file (see examples below).

To use these functions to save files, you need to follow 3 steps

1. Execute the pdf() or jpeg() functions with file, width and


height arguments.

2. Execute all your plotting code.

3. Complete the file by executing the command dev.off(). This tells


R that youre done creating the file.

Heres an example of the three steps.


140 yarrr! the pirates guide to r

# Step 1: Call the pdf command


pdf(file = "/figures/My Plot.pdf", # The directory you want to save the file in
width = 4, # The width of the plot in inches
height = 4 # The height of the plot in inches
)

# Step 2: Create the plot


plot(1:10, 1:10)
abline(v = 0) # Additional low-level plotting commands
text(x = 0, y = 1, labels = "Random text")

# Step 3: Run dev.off() to create the file!


dev.off()

Youll notice that after you close the plot with dev.off(), youll see
a message in the prompt like "null device".
Using the command pdf() will save the file as a pdf. If you use
jpeg(), it will be saved as a jpeg.
10: plotting: part 1 141

Test your R Might! Purdy pictures

For the following exercises, youll use datasets from the yarrr pack-
age. Make sure to install and load the package

1. The BeardLengths dataframe contains data on the lengths of


beards from 3 different pirate ships. Calculate the average beard
length for each ship using aggregate(), then create the following
barplot:

Barplot of mean beard length by ship


25
20
Beard Length

15
10
5
0

Angry Badger Fearless Snake Nervous Goat

Ship

2. Now using the entire BeardLengths dataframe, create the following


pirateplot:
142 yarrr! the pirates guide to r

Pirateplot of beard lengths by ship

50

45

40

35

30
Beard

25

20

15

10

Fearless Snake Nervous Goat Angry Badger

Ship

3.

4. Using the pirates dataset, create the following scatterplot showing


the relationship between a pirates age and how many parrots
(s)he has owned (hint: to make the points solid and transparent,
use pch = 16, and col = gray(level = .5, alpha = .1)).
10: plotting: part 1 143

Pirate age and number of parrots owned


20
15
Parrots

10
5
0

15 20 25 30 35 40 45

Age
10: Plotting: Part Deux

Advanced colors

Shades of gray with gray()


If youre a lonely, sexually repressed, 50+ year old housewife, then
you might want to stick with shades of gray. If so, the function
gray(x) is your answer. gray() is a function that takes a number (or
vector of numbers) between 0 and 1 as an argument, and returns
inc <- rnorm(n = 200, mean = 50, sd = 10)
a shade of gray (or many shades of gray with a vector input). A hap <- inc + rnorm(n = 200, mean = 0, sd = 15)
drive <- inc + rnorm(n = 200, mean = 0, sd = 5)
value of 1 is equivalent to "white" while 0 is equivalent to "black".
plot(x = inc, y = hap, pch = 16,
This function is very helpful if you want to create shades of gray col = gray((drive - min(drive)) / max(drive - min(drive)), alpha = .4),
cex = 1.5,
xlab = "income", ylab = "happiness"
depending on the value of a numeric vector. For example, if you had )

survey data and plotted income on the x-axis and happiness on the
y-axis of a scatterplot, you could determine the darkness of each
point as a function of a third quantitative variable (such as number of
children or amount of travel time to work). I plotted an example of
80
this in Figure 50.
60
happiness

40
20
0

30 40 50 60 70

income

Figure 50: Using the gray() function to


easily create shades of gray in plotting
symbols based on numerical data.
146 yarrr! the pirates guide to r

If you see a palette you like, you can


Pirate Palettes see the colors (and their inspiration), in
more detail as follows:
The yarrr package comes with several color palettes ready for you to
use. The palettes are contained in the piratepal() function. To see all library("yarrr")
the palettes, run the following code: piratepal(palette = "basel",
action = "show"
)
library("yarrr")
piratepal(palette = "all",
basel
action = "show") trans = 0

pirate palettes
trans = 0
basel
southpark
appletv
drugs
google
info2
info

ge

n2
is
n

op
e1

e2

nk
ee

uo

llo
d

an

ee
up

po
re
u

pi
gr

rq

ye
bl

bl

or

gr
tu
rat
nemo
cars
bugs
brave
lubitel
pancake
cake # Save the South Park palette
emo sp.cols <- piratepal(palette = "southpark",
provoking action = "return")
goldfish
espresso
# Create a blank plot
ipod
ghostbusters
plot(1, xlim = c(0, 6), ylim = c(0, 2),
ohbrother bty = "n", type = "n")
usualsuspects
toystory # Add points
malcovich points(x = 1:5, y = rep(1, 5),
evildead
pch = c(21, 22, 23, 24, 25),
eternal
memento
bg = sp.cols, # Use the South Park Colors
monalisa col = "white",
cex = 5)

Once you find a color palette you like, you can save the colors as a
vector by setting the action argument to "return", and assigning the
2.0

result to an object. For example, if I want to use the southpark palette


1.5

and use them in a plot, I would do the following:

# Save the South Park palette


1.0
1

sp.cols <- piratepal(palette = "southpark", action = "return")


0.5

plot(x = 1:5, y = rep(1, 5),


0.0

pch = c(21, 22, 23, 24, 25), 0 1 2 3 4 5 6

main = "South Park Colors", Index

bg = sp.cols, # Use the South Park Colors


col = "white", cex = 3)
10: plotting: part deux 147

Color Palettes with the RColorBrewer package


If you use many colors in the same plot, its probably a good idea
to choose colors that compliment each other. An easy way to select
colors that go well together is to use a color palette - a collection of
colors known to go well together.
One package that is great for getting (and even creating) palettes
is RColorBrewer. Here are some of the palettes in the package. The
name of each palette is in the first column, and the colors in each
palette are in each row:

require("RColorBrewer")
display.brewer.all()

YlOrRd
YlOrBr
YlGnBu
YlGn
Reds
RdPu
Purples
PuRd
PuBuGn
PuBu
OrRd
Oranges
Greys
Greens
GnBu
BuPu
BuGn
Blues
Set3
Set2
Set1
Pastel2
Pastel1
Paired
Dark2
Accent
Spectral
RdYlGn
RdYlBu
RdGy
RdBu
PuOr
PRGn
PiYG
BrBG

To use one of the palettes, execute the function brewer.pal(n,


name), where n is the number of colors you want, and name is the
name of the palette. For example, to get 4 colors from the color set
Set1", youd use the code

my.colors <- brewer.pal(4, "Set1") # 4 colors from Set1


my.colors

## [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3"

I know the results look like gibberish, but trust me, R will inter-
pret them as the colors in the palette. Once you store the output of
the brewer.pal() function as a vector (something like my.colors),
148 yarrr! the pirates guide to r

you can then use this vector as an argument for the colors in your
library("RColorBrewer")
plot. library("circlize")

# Create Data
drinks <- sample(1:30, size = 100, replace = T)
smokes <- sample(1:30, size = 100, replace = T)
Numerically defined color gradients with colorRamp2 risk <- 1 / (1 + exp(-drinks / 20 + rnorm(100, mean = 0, sd = 1)))

# Create color function from colorRamp2


smoking.colors <- colorRamp2(breaks = c(0, 15, 30),
My favorite way to generate colors that represent numerical data colors = c("blue", "orange", "red"),
transparency = .3
is with the function colorRamp2 in the circlize package (the same )

package that creates that really cool chordDiagram from Chapter 1). # Set up plot layout
layout(mat = matrix(c(1, 2), nrow = 2, ncol = 1),
heights = c(2.5, 5), widths = 4)
The colorRamp2 function allows you to easily generate shades of
# Top Plot
colors based on numerical data. par(mar = c(4, 4, 2, 1))
plot(1, xlim = c(-.5, 31.5), ylim = c(0, .3),
The best way to explain how colorRamp2 works is by giving you type = "n", xlab = "Cigarette Packs",
yaxt = "n", ylab = "", bty = "n",
main = "colorRamp2 Example")
an example. Lets say that you want to want to plot data showing the
segments(x0 = c(0, 15, 30),
relationship between the number of drinks someone has on average y0 = rep(0, 3),
x1 = c(0, 15, 30),
y1 = rep(.1, 3),
per week and the resulting risk of some adverse health effect. Further, lty = 2)

lets say you want to color the points as a function of the number points(x = 0:30,
y = rep(.1, 31), pch = 16,
of packs of cigarettes per week that person smokes, where a value col = smoking.colors(0:30))

text(x = c(0, 15, 30), y = rep(.2, 3),


of 0 packs is colored Blue, 10 packs is Orange, and 30 packs is Red. labels = c("Blue", "Orange", "Red"))

Moreover, you want the values in between these break points of 0, 10 # Bottom Plot
par(mar = c(4, 4, 5, 1))
and 30 to be a mix of the colors. For example, the value of 5 (half way plot(x = drinks, y = risk, col = smoking.colors(smokes),
pch = 16, cex = 1.2, main = "Plot of (Made-up) Data",
xlab = "Drinks", ylab = "Risk")
between 0 and 10) should be an equal mix of Blue and Orange.
mtext(text = "Point color indicates smoking rate", line = .5, side = 3)
colorRamp2 allows you to do exactly this. The function has three
arguments: colorRamp2 Example

breaks: A vector indicating the break points Blue Orange Red

colors: A vector of colors corresponding to each value in breaks 0 5 10 15 20 25 30

transparency: A value between 0 and 1 indicating the trans- Cigarette Packs

parency (1 means fully transparent) Plot of (Madeup) Data


Point color indicates smoking rate
When you run the function, the function will actually return
0.9

another function that you can then use to generate colors. Once you
store the resulting function as an object (something like my.color.fun
0.7
Risk

You can then apply this new function on numerical data (in our
0.5

example, the number of cigarettes someone smokes) to obtain the


correct color for each data point.
0.3

For example, lets create the color ramp function for our smoking 0 5 10 15 20 25 30
data points. Ill use colorRamp2 to create a function that Ill call Drinks
smoking.colors which takes a number as an argument, and returns
the corresponding color:

smoking.colors <- colorRamp2(breaks = c(0, 15, 30),


colors = c("blue", "orange", "red"),
transparency = .3
)
10: plotting: part deux 149

smoking.colors(0) # Equivalent to blue

## [1] "#0000FFB2"

smoking.colors(20) # Mix of orange and red

## [1] "#FF8200B2"

To see this function in action, check out the the margin Figure for
an example, and check out the help menu ?colorRamp2 for more
information and examples.

Stealing any color from your screen with a kuler


One of my favorite tricks for getting great colors in R is to use a
color kuler. A color kuler is a tool that allows you to determine the google.colors <- c(
rgb(19, 72, 206, maxColorValue = 255),
exact RGB values for a color on a screen. For example, lets say that rgb(206, 45, 35, maxColorValue = 255),
rgb(253, 172, 10, maxColorValue = 255),
you wanted to use the exact colors used in the Google logo. To do rgb(18, 140, 70, maxColorValue = 255))

par(mar = rep(0, 4))


this, you need to use an app that allows you to pick colors off your
plot(1, xlim = c(0, 7), ylim = c(0, 1),
computer screen. On a Mac, you can use the program called "Digital xlab = "", ylab = "", xaxt = "n", yaxt = "n",
type = "n", bty = "n"
Color Meter." If you then move your mouse over the color you want, )

points(1:6, rep(.5, 6),


the software will tell you the exact RGB values of that color. In the pch = c(15, 16, 16, 17, 18, 15),
col = google.colors[c(1, 2, 3, 1, 4, 2)],
image below, you can see me figuring out that the RGB value of the cex = 2.5)

G in Google is R: 19, G: 72, B: 206. Using this method, I figured out text(3.5, .7, "Look familiar?", cex = 1.5)

the four colors of Google! Check out the margin Figure for the grand
result.

Look familiar?

Figure 51: Stealing colors from the


internet. Not illegal (yet).
150 yarrr! the pirates guide to r

Plot margins par(mar = rep(8, 4))

x.vals <- rnorm(500)


All plots in R have margins surrounding them that separate the main y.vals <- x.vals + rnorm(500, sd = .5)

plot(x.vals, y.vals, xlim = c(-2, 2), ylim = c(-2, 2),


plotting space from the area where the axes, labels and additional main = "", xlab = "", ylab = "", xaxt = "n",
yaxt = "n", bty = "n", pch = 16, col = gray(.5, alpha = .2))
text lie.. To visualize how R creates plot margins, look at margin
axis(1, at = seq(-2, 2, .5), col.axis = gray(.8), col = gray(.8))
Figure . axis(2, at = seq(-2, 2, .5), col.axis = gray(.8), col = gray(.8))

par(new = T)
You can adjust the size of the margins by specifying a margin par(mar = rep(0, 4))
plot(1, xlim = c(0, 1), ylim = c(0, 1), type ="n",
parameter using the syntax par(mar = c(a, b, c, d)) before you main = "", bty = "n", xlab = "", ylab = "", xaxt = "n", yaxt = "n")

execute your first high-level plotting function, where a, b, c and d are rect(0, 0, 1, 1)

rect(.21, .22, .85, .8, lty = 2)


the size of the margins on the bottom, left, top, and right of the plot.
arrows(c(.5, .5, 0, .85),
Lets see how this works by creating two plots with different margins: c(.8, .22, .5, .5),
c(.5, .5, .21, 1),
In the plot on the left, Ill set the margins to 3 on all sides. In the c(1, 0, .5, .5),
code = 3, length = .1
)
plot on the right, Ill set the margins to 6 on all sides.
text(c(.5, .5, .09, .93),
c(.88, .11, .5, .5),
par(mfrow = c(1, 2)) # Put plots next to each other labels = c("mar[1]", "mar[2]", "mar[3]", "mar[4]"),
pos = c(2, 2, 1, 1)
)
# First Plot text(.7, .9, "This is Margin Text\nMade with mtext()")
par(mar = rep(2, 4)) # Set the margin on all sides to 2
plot(1:10)
mtext("Small Margins", side = 3, line = 1, cex = 1.2)
This is Margin Text
mar[1] Made with mtext()

# Second Plot

2.0
par(mar = rep(6, 4)) # Set the margin on all sides to 6

1.5
plot(1:10)

1.0
mtext("Large Margins", side = 3, line = 1, cex = 1.2)

0.5
0.0
mar[3] mar[4]

Small Margins

1.0
10

2.0
Large Margins
2.0 1.0 0.0 0.5 1.0 1.5 2.0
8

mar[2]
10
8
6

1:10

6
4
4

2 4 6 8 10
2

Index

2 4 6 8 10 The default value for mar is c(5.1, 4.1,


Youll notice that the margins are so small in the first plot that you 4.1, 2.1)

cant even see the axis labels, while in the second plot there is plenty
(probably too much) white space around the plotting region.
In addition to using the mar parameter, you can also specify mar-
gin sizes with the mai parameter. This acts just like mar except that
the values for mai set the margin size in inches.
10: plotting: part deux 151

Arranging multiple plots with par(mfrow) and layout par(mfrow = c(3, 3))
par(mar = rep(2.5, 4))

R makes it easy to arrange multiple plots in the same plotting space. for(i in 1:9) { # Loop across plots

The most common ways to do this is with the par(mfrow) parameter, # Generate data
x <- rnorm(100)
y <- x + rnorm(100)
and the layout() function. Lets go over each in turn:
# Plot data
plot(x, y, xlim = c(-2, 2), ylim = c(-2, 2),
col.main = "gray",
pch = 16, col = gray(.0, alpha = .1),
Simple plot layouts with par(mfrow) and par(mfcol) xaxt = "n", yaxt = "n"
)

The mfrow and mfcol parameters allow you to create a matrix of plots # Add a regression line for fun
abline(lm(y ~ x), col = "gray", lty = 2)

in one plotting space. Both parameters take a vector of length two as


# Add gray axes
an argument, corresponding to the number of rows and columns in axis(1, col.axis = "gray",
col.lab = gray(.1), col = "gray")
the resulting plotting matrix. For example, the following code sets up axis(2, col.axis = "gray",
col.lab = gray(.1), col = "gray")
a 3 x 3 plotting matrix.
# Add large index text
text(0, 0, i, cex = 7)

par(mfrow = c(3, 3)) # Create a 3 x 3 plotting matrix # Create box around border
box(which = "figure", lty = 2)

}
When you execute this code, you wont see anything happen.
However, when you execute your first high-level plotting command,

2
1

1
youll see that the plot will show up in the space reserved for the first
1 2 3

0
y

y
1

1
plot (the top left). When you execute a second high-level plotting

2
2 1 0 1 2 2 1 0 1 2 2 1 0 1 2

command, R will place that plot in the second place in the plotting x x x

2
matrix - either the top middle (if using par(mfrow) or the left middle

1
4 5 6

0
y

y
(if using par(mfcol)). As you continue to add high-level plots, R will

1
2

2
continue to fill the plotting matrix. 2 1 0

x
1 2 2 1 0

x
1 2 2 1 0

x
1 2

So whats the difference between par(mfrow) and par(mfcol)? The


2

2
1

1
only difference is that while par(mfrow) puts sequential plots into the 0

7 8 9

0
y

y
1

1
plotting matrix by row, par(mfcol) will fill them by column.
2

2
2 1 0 1 2 2 1 0 1 2 2 1 0 1 2

When you are finished using a plotting matrix, be sure to reset the Figure 52: A matrix of plotting regions
plotting parameter back to its default state: created by par(mfrow = c(3, 3))

par(mfrow = c(1, 1))

If you dont reset the mfrow parameter, R will continue creating


new plotting matrices.

Complex plot layouts with layout()


While par(mfrow) allows you to create matrices of plots, it does
not allow you to create plots of different sizes. In order to arrange
plots in different sized plotting spaces, you need to use the layout()
function. Unlike par(mfrow), layout is not a plotting parameter,
rather it is a function all on its own. Lets go through the main
arguments of layout():

layout(mat, widths, heights)


152 yarrr! the pirates guide to r

mat: A matrix indicating the location of the next N figures in


the global plotting space. Each value in the matrix must be 0 or
a positive integer. R will plot the first plot in the entries of the
matrix with 1, the second plot in the entries with 2,...
layout.matrix <- matrix(c(2, 1, 0, 3), nrow = 2, ncol = 2)
widths: A vector of values for the widths of the columns of the layout(mat = layout.matrix,
heights = c(1, 2), # Heights of the two rows
plotting space. widths = c(2, 2) # Widths of the two columns
)

heights: A vector of values for the heights of the rows of the layout.show(3)

plotting space.

The layout() function can be a bit confusing at first, so I think its 2


best to start with an example. Lets say you want to place histograms
next to a scatterplot: Lets do this using layout
Well begin by creating the layout matrix, this matrix will tell R in
which order to create the plots:
1 3
layout.matrix <- matrix(c(0, 2, 3, 1), nrow = 2, ncol = 2)
layout.matrix

## [,1] [,2]
Figure 53: A plotting layout created
## [1,] 0 3 by setting a layout matrix and specific
## [2,] 2 1 heights and widths.

Looking at the values of layout.matrix, you can see that weve


told R to put the first plot in the bottom right, the second plot on the
bottom left, and the third plot in the top right. Because we put a 0 in
the first element, R knows that we dont plan to put anything in the
top left area.
Now, because our layout matrix has two rows and two columns,
we need to set the widths and heights of the two columns. We do this
using a numeric vector of length 2. Ill set the heights of the two rows
to 1 and 2 respectively, and the widths of the columns to 1 and 2
respectively. Now, when I run the code layout.show(3), R will show
us the plotting region we set up (see margin Figure 53)
Now were ready to put the plots together
layout.matrix <- matrix(c(2, 1, 0, 3), nrow = 2, ncol = 2)

layout(mat = layout.matrix,
heights = c(1, 2), # Heights of the two rows
widths = c(2, 1) # Widths of the two columns
)

x.vals <- rnorm(100, mean = 100, sd = 10)


y.vals <- x.vals + rnorm(100, mean = 0, sd = 10)

# Plot 1: Scatterplot
par(mar = c(5, 4, 0, 0))
plot(x.vals, y.vals)
10: plotting: part deux 153

abline(h = median(y.vals), lty = 1, col = "gray")


abline(v = median(x.vals), lty = 1, col = "gray")

# Plot 2: X boxplot
par(mar = c(0, 4, 0, 0))
boxplot(x.vals, xaxt = "n",
yaxt = "n", bty = "n", yaxt = "n",
col = "white", frame = F, horizontal = T)

# Plot 3: Y boxplot
par(mar = c(5, 0, 0, 0))
boxplot(y.vals, xaxt = "n",
yaxt = "n", bty = "n", yaxt = "n",
col = "white", frame = F)
120
100
y.vals

80
60

80 90 100 110 120

x.vals
154 yarrr! the pirates guide to r

Additional Tips

To change the background color of a plot, add the command


par(bg = mycolor) (where my.color is the color you want to use)
prior to creating the plot. For example, the following code will put
a light gray background behind a histogram: pdf("/Users/nphillips/Dropbox/Git/YaRrr_Book/media/parrothelvetica.pdf",
width = 8, height = 6)

parrot.data <- data.frame(


par(bg = gray(.9)) "parrots" = 0:6,
"female" = c(200, 150, 100, 175, 55, 25, 10),
hist(x = rnorm(100)) "male" = c(150, 125, 180, 242, 10, 62, 5)
)

n.data <- nrow(parrot.data)

par(bg = rgb(61, 55, 72, maxColorValue = 255),


mar = c(8, 6, 6, 3)
)

Histogram of rnorm(100) plot(1, xlab = "", ylab = "", xaxt = "n",


yaxt = "n", main = "", bty = "n", type = "n",
ylim = c(0, 250), xlim = c(.5, n.data + .5)
)

abline(h = seq(0, 250, 50), lty = 3, col = gray(.95), lwd = 1)


20

mtext(text = seq(50, 250, 50),


Frequency

side = 2, at = seq(50, 250, 50),


las = 1, line = 1, col = gray(.95))

mtext(text = paste(0:(n.data - 1), " Parrots"),


10

side = 1, at = 1:n.data, las = 1,


line = 1, col = gray(.95))

female.col <- gray(1, alpha = .7)


male.col <- rgb (226, 89, 92, maxColorValue = 255, alpha = 220)
0

rect.width <- .35


rect.space <- .04

2 1 0 1 2 rect(1:n.data - rect.width - rect.space / 2,


rep(0, n.data),
1:n.data - rect.space / 2,
parrot.data$female,
col = female.col, border = NA
rnorm(100) )

rect(1:n.data + rect.space / 2,
rep(0, n.data),
1:n.data + rect.width + rect.space / 2,
parrot.data$male,
col = male.col, border = NA
See Figure 54 for a nicer example. )

legend(n.data - 1, 250, c("Male Pirates", "Female Pirates"),


Sometimes youll mess so much with plotting parameters that col = c(female.col, male.col), pch = rep(15, 2),
bty = "n", pt.cex = 1.5, text.col = "white"
)
you may want to set things back to their default value. To see
mtext("Number of parrots owned by pirates", side = 3,
the default values for all the plotting parameters, execute the at = n.data + .5, adj = 1, cex = 1.2, col = "white")

code par() to print the default parameter values for all plotting mtext("Source: Drunken survey on 22 May 2015", side = 1,
at = 0, adj = 0, line = 3, font = 3, col = "white")
parameters to the console. dev.off()

## pdf
## 2

#embed_fonts("/Users/nphillips/Dropbox/Git/YaRrr_Book/media/parrothelvetica.pdf") #

Number of parrots owned by pirates


250
Male Pirates
Female Pirates
200

150

100

50

0 Parrots 1 Parrots 2 Parrots 3 Parrots 4 Parrots 5 Parrots 6 Parrots

Source: Drunken survey on 22 May 2015

Figure 54: Use par(bg = my.color)


before creating a plot to add
a colored background. The
design of this plot was inspired by
http://www.vox.com/2015/5/20/8625785/expensive-wine-
11: Inferential Statistics: 1 and 2-sample Null-Hypothesis
tests

Do we get more treasure from chests buried in the sand or at the bot-
tom of the ocean? Is there a relationship between the number of scars
a pirate has and how much grogg he can drink? Are pirates with
body piercings more likely to wear bandannas than those without
body piercings? Glad you asked, in this chapter, well answer these
questions using 1 and 2 sample frequentist hypothesis tests.
As this is a Pirates Guide to R, and not a Pirates Guide to Statis-
tics, we wont cover all the theoretical background behind frequentist
null hypothesis tests (like t-tests) in much detail. However, its impor-
tant to cover three main concepts: Descriptive statistics, Test statistics,
and p-values. To do this, lets talk about body piercings.

Null vs. Alternative Hypotheses, Descriptive Statistics, Test Statis-


tics, and p-values: A very short introduction

As you may know, pirates are quite fond of body piercings. Both Body Piercing Survey

as a fashion statement, and as a handy place to hang their laundry. 11

Now, there is a stereotype that European pirates have more body 10


Number of Body Piercings

9
piercings than American pirates. But is this true? To answer this, I 8
7
conducted a survey where I asked 10 American and 10 European 6

pirates how many body piercings they had. The results are below, 5
4
and a Pirateplot of the data is in Figure 55: 3
2
1
american.bp <- c(3, 5, 2, 1, 4, 4, 6, 3, 5, 4) American European

european.bp <- c(8, 5, 9, 7, 6, 8, 8, 6, 9, 7) Group

Figure 55: A Pirateplot (using the


pirateplot() function in the yarrr
package) of the body piercing data
Null v Alternative Hypothesis .

In null hypothesis tests, you always start with a null hypothesis. The
specific null hypothesis you choose will depend on the type of ques-
tion you are asking, but in general, the null hypothesis states that
156 yarrr! the pirates guide to r

nothing is going on and everything is the same. For example, in our body
piercing study, our null hypothesis is that American and European Null Hypothesis: Everything is the
same
pirates have the same number of body piercings on average.
Alternative Hypothesis: Everything is
The alternative hypothesis is the opposite of the null hypothesis. In not the same
this case, our alternative hypothesis is that American and European
pirates do not have the same number of piercings on average. We
can have different types of alternative hypotheses depending on how
specific we want to be about our prediction. We can make a 1-sided
(also called 1-tailed) hypothesis, by predicting the direction of the
difference between American and European pirates. For example,
our alternative hypothesis could be that European pirates have more
piercings on average than American pirates.
Alternatively, we could make a 2-sided (also called 2-tailed) alterna-
tive hypothesis that American and European pirates simply differ in
their average number of piercings, without stating which group has
more piercings than the other.
Once weve stated our null and alternative hypotheses, we collect
data and then calculate descriptive statistics.

Descriptive Statistics
Descriptive statistics describe samples of data. For example, a mean,
median, or standard deviation of a dataset is a descriptive statistic of
that dataset. Lets calculate some descriptive statistics on our body
piercing survey American and European pirates using the summary()
function:

summary(american.bp)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1.00 3.00 4.00 3.70 4.75 6.00

summary(european.bp)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 5.00 6.25 7.50 7.30 8.00 9.00

Well, it looks like our sample of 10 American pirates had 3.7 body
piercings on average, while our sample of 10 European pirates had
7.3 piercings on average. But is this difference large or small? Are we
justified in concluding that American and European pirates in general
differ in how many body piercings they have? To answer this, we
need to calculate a test statistic
11: inferential statistics: 1 and 2-sample null-hypothesis tests 157

Test Statistics
An test statistic compares descriptive statistics, and determines how
different they are. The formula you use to calculate a test statistics
depends the type of test you are conducting, which depends on
many factors, from the scale of the data (i.e.; is it nominal or inter-
val?), to how it was collected (i.e.; was the data collected from the
same person over time or were they all different people?), to how its
distributed (i.e.; is it bell-shaped or highly skewed?).
For now, I can tell you that the type of data we are analyzing calls
for a two-sample T-test. This test will take the descriptive statistics
from our study, and return a test-statistic we can then use to make a
decision about whether American and European pirates really differ.
To calculate a test statistic from a two-sample t-test, we can use the
t.test() function in R. Dont worry if its confusing for now, well go
through it in detail shortly.

bp.test <- t.test(x = american.bp,


y = european.bp,
alternative = "two.sided")

I can get the test statistic from my new bp.test object by using the
$ operator as follows:

bp.test$statistic

## t
## -5.676354

It looks like our test-statistic is 5.68. If there was really no differ-


ence between the groups of pirates, we would expect a test statistic
close to 0. Because test-statistic is 5.68, this makes us think that
there really is a difference. However, in order to make our decision,
we need to get the p-value from the test.

p-value
The p-value is a probability that reflects how consistent the test
statistic is with the hypothesis that the groups are actually the same.

Assuming that there is no difference between the groups,


what is the probability that we would have gotten a test
statistic as extreme as the one we actually got?
For this problem, we can access the p-value as follows:
158 yarrr! the pirates guide to r

bp.test$p.value

## [1] 2.304418e-05

The p-value we got was .000022, this means that, assuming the
two populations of American and European pirates have the same
number of body piercings on average, the probability that we would
obtain a test statistic as large as 5.68 is around 0.0022%. This is very
small in other words, its close to impossible. Because our p-value
is so small, we conclude that the two populations must not be the
same.

p-values are bullshit detectors against the null hypothesis


P-values sounds complicated because they are (In fact, most psy-
chology PhDs get the definition wrong). Its very easy to get con- Figure 56: p-values are like bullshit
fused and not know what they are or how to use them. But let me detectors against the null hypothesis.
The smaller the p-value, the more likely
help by putting it another way: a p-value is like a bullshit detector it is that the null-hypothesis (the idea
against the null hypothesis that goes off when the p-value is too small. that the groups are the same) is bullshit.
If a p-value is too small, the bullshit detector goes off and says "Bull-
shit! Theres no way you would get data like that if the groups were
the same!" If a p-value is not too small, the bullshit alarm stays silent,
and we conclude that we cannot reject the null hypothesis.

How small of a p-value is too small?


Traditionally a p-value of 0.05 (or sometimes 0.01) is used to deter-
mine statistical significance. In other words, if a p-value is less than
.05, most researchers then conclude that the null hypothesis is false.
However, .05 is not a magical number. Anyone who really believes
that a p-value of .06 is much less significant than a p-value of 0.04 has
been sniffing too much glue. However, in order to be consistent with
tradition, I will adopt this threshold for the remainder of this chapter.
That said, let me reiterate that a p-value threshold of 0.05 is just as
arbitrary as a p-value of 0.09, 0.06, or 0.12156325234.

Does the p-value tell us the probability that the null hypothesis is true?
No. The p-value does not tell you the probability that the null
hypothesis is true. In other words, if you calculate a p-value of .04,
this does not mean that the probability that the null hypothesis is
true is 4%. Rather, it means that if the null hypothesis was true, the
probability of obtaining the result you got is 4%. Now, this does
indeed set off our bullshit detector, but again, it does not mean that
the probability that the null hypothesis is true is 4%.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 159

Let me convince you of this with a short example. Imagine that


you and your partner have been trying to have a baby for the past
year. One day, your partner calls you and says "Guess what! I took a
pregnancy test and it came back positive!! Im pregnant!!" So, given
the positive pregnancy test, what is the probability that your partner
is really pregnant?
Now, imagine that the pregnancy test your partner took gives
incorrect results in 1% of cases. In other words, if you are pregnant,
there is a 1% chance that the test will make a mistake and say that
you are not pregnant. If you really are not pregnant, there is a 1%
change that the test make a mistake and say you are pregnant.
Ok, so in this case, the null hypothesis here is that your partner is
not pregnant, and the alternative hypothesis is that they are pregnant.
Now, if the null hypothesis is true, then the probability that they
would have gotten an (incorrect) positive test result is just 1%. Does
this mean that the probability that your partner is not pregnant is
Figure 57: Despite what you may see in
only 1%.
movies, men cannot get pregnant. And
No. Your partner is a man. The probability that the null hypothe- despite what you may want to believe,
sis is true (i.e. that he is not pregnant), is 100%, not 1%. Your stupid p-values do not tell you the probability
that the null hypothesis is true!
boyfriend doesnt understand basic biology and decided to buy an
expensive pregnancy test anyway.
This is an extreme example of course in most tests that you do,
there will be some positive probability that the null hypothesis is
false. However, in order to reasonably calculate an accurate proba-
bility that the null hypothesis is true after collecting data, you must
take into account the prior probability that the null hypothesis was
true before you collected data. The method we use to do this is with
Bayesian statistics. Well go over Bayesian statistics in a later chapter.
160 yarrr! the pirates guide to r

T-test with t.test()

To compare the mean of 1 group to a specific value, or to compare


the means of 2 groups, you do a t-test. The t-test function in R is
t.test(). The t.test() function can take several arguments, here Ill
emphasize a few of them. To see them all, check the help menu for
t.test (?t.test).

1Sample ttest 2Sample ttest


mean(EP)
= 9.34
Null Hypothesis
Mean = 5

mean(No EP)
= 9.41

0 5 10 15 20 0 5 10 15 20

Number of Tattoos Tattoos

Here are some optional arguments to


t.test()
One-Sample t-test alternative A character string specifying the
alternative hypothesis, must be one
In a one-sample t-test, you compare the data from one group of of "two.sided" (default), "greater" or
"less". You can specify just the initial
data to some hypothesized mean. For example, if someone said that letter.
pirates on average have 5 tattoos, we could conduct a one-sample test mu A number indicating the true value
comparing the data from a sample of pirates to a hypothesized mean of the mean (or difference in means
of 5. if you are performing a two sample
test). The default is 0.
To conduct a one-sample t-test in R, just enter a vector of data as
paired A logical indicating whether you
the main argument to t.test(), and put the hypothesized value as want a paired t-test.
the argument to mu: var.equal A logical variable indicating
whether to treat the two variances as
t.test(x = pirates$tattoos, # Vector of data being equal.
mu = 5) # Null: Mean is 5 conf.level The confidence level of the interval.
subset An optional vector specifying a
## subset of observations to be used.
## One Sample t-test
##
## data: pirates$tattoos
## t = 40.735, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 9.15187 9.57213
## sample estimates:
## mean of x
11: inferential statistics: 1 and 2-sample null-hypothesis tests 161

## 9.362

As you can see, the function printed lots of information: the


descriptive mean of the sample was 9.36, the test statistic (40.74), and
the p-value was 1.5546485 10214 .
Now, what happens if I change the null hypothesis to a mean of
9.3? Because the descriptive sample mean was 9.36, quite close to 9.3,
the test statistic should decrease, and the p-value should increase:

t.test(x = pirates$tattoos,
mu = 9.3) #Null: Mean is 9.3

##
## One Sample t-test
##
## data: pirates$tattoos
## t = 0.579, df = 999, p-value = 0.5627
## alternative hypothesis: true mean is not equal to 9.3
## 95 percent confidence interval:
## 9.15187 9.57213
## sample estimates:
## mean of x
## 9.362

Looks good. The test statistic decreased to just 0.58, and the p-
value increased to 0.56. In other words, our sample mean of 9.36 is
very consistent with the hypothesis that the true population mean is
9.30.

Two-sample t-test
In a two-sample t-test, you compare the means of two groups of data
and test whether or not they are the same. For example, if someone
said that pirates who wear eye patches have fewer tattoos on average
than those who dont wear eye patches, we could test this with a
two-sample t-test. To do a two-sample t-test, enter the dependent
variable (e.g.; tattoos) and independent variable (e.g.; eye patch) as a
formula in the form dv iv, and specify the dataset containing the
data in data:

t.test(formula = tattoos ~ eyepatch,


data = pirates)

##
## Welch Two Sample t-test
##
162 yarrr! the pirates guide to r

## data: tattoos by eyepatch


## t = 0.29452, df = 621.27, p-value = 0.7685
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3866154 0.5230411
## sample estimates:
## mean in group 0 mean in group 1
## 9.407975 9.339763

This test gave us a test statistic of 0.29 and a p-value of 0.77.

Saving and using hypothesis test objects


When you run a t-test using t.test(), you can save the entire object
as a hypothesis test object using basic assignment. Ill repeat the
previous test but save the result as a new object called tattoo.eye
patch.test

tattoo.eyepatch.test <- t.test(formula = tattoos ~ eyepatch,


data = pirates)

To see all the information contained in the test object, use the
names() function

names(tattoo.eyepatch.test)

## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"


## [6] "null.value" "alternative" "method" "data.name"

Once you know the names of whats in the object, you can access
specific values by name using the $ operator (just like in dataframes).
In the code below, Ill access just the p-value and test statistic from
my new tattoo.eyepatch.test object:

# What was the p-value?


tattoo.eyepatch.test$p.value

## [1] 0.7684595

# What was the test statistic?


tattoo.eyepatch.test$statistic

## t
## 0.2945193
11: inferential statistics: 1 and 2-sample null-hypothesis tests 163

Using subset to select levels of an IV


If your independent variable has more than two values, the t.test()
function will return an error because it doesnt know which two
groups you want to compare. For example, lets say I want to com-
pare the number of tattoos of pirates of different ages. Now, the age
column has many different values, so if I dont tell t.test() which
two values of age I want to compare, I will get an error like this:

# Will return an error because there are more than


# 2 levels of the age IV

t.test(formula = tattoos ~ age,


data = pirates) You can select any subset of data in
the subset argument to the t.test()
function not just the primary indepen-
## Error in t.test.formula(formula = tattoos ~ age, data =
dent variable. For example, if I wanted
pirates): grouping factor must have exactly 2 levels to compare the number of tattoos be-
tween pirates who wear handbands or
not, but only for female pirates, I would
To fix this, I need to tell the t.test() function which two values of do the following
age I want to test. To do this, use the subset argument and indicate
which values of the IV you want to test using the %in% operator. For # Is there an effect of college on # of tattoos
# only for female pirates?
example, to compare the number of tattoos between pirates of age 29
and 30, I would add the subset = age %in% c(29, 30) argument like t.test(formula = tattoos ~ college,
data = pirates,
this: subset = sex == "female")

t.test(formula = tattoos ~ age, ##


## Welch Two Sample t-test
data = pirates,
##
subset = age %in% c(29, 30)) # Compare age of 29 to 30 ## data: tattoos by college
## t = 1.5292, df = 480.25, p-value = 0.1269
## ## alternative hypothesis: true difference in means
## 95 percent confidence interval:
## Welch Two Sample t-test ## -0.1362564 1.0926772
## ## sample estimates:
## mean in group CCCC mean in group JSSFP
## data: tattoos by age
## 9.610442 9.132231
## t = -1.149, df = 132.74, p-value = 0.2526
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.7207322 0.4562161
## sample estimates:
## mean in group 29 mean in group 30
## 8.967742 9.600000

We got a p-value of 0.25 which is pretty high and suggests that we


should fail to reject the null hypothesis.
164 yarrr! the pirates guide to r

Correlation test with cor.test()

Next well cover two-sample correlation tests. In a correlation test,


you are accessing the relationship between two variables on a ratio
or interval scale: like height and weight, or income and beard length.
The test statistic in a correlation test is called a correlation coefficient
and is represented by the letter r. The coefficient can range from -1 to
+1, with -1 meaning a strong negative relationship, and +1 meaning a
Here are some optional arguments to
strong positive relationship. The null hypothesis in a correlation test cor.test(). As always, check the help
is a correlation of 0, which means no relationship at all: menu with ?cor.test for additional
information and examples:
r = 0.82 r = 0.11 r = 0.85
formula A formula in the form x + y,
where x and y are the names of the
two variables you are testing. These
variables should be two separate
y

y
columns in a dataframe.
data The dataframe containing the
variables x and y
alternative A string indicating the direction of
the test. t stands for two-sided, l
To run a correlation test, use the cor.test() function. Lets con- stands for less than, and g stands
duct a correlation test on the pirates dataset to see if there is a for greater than.

relationship between a pirates age and number of parrots theyve method A string indicating which corre-
lation coefficient to calculate and
had in their lifetime: test. pearson (the default) stands
for Pearson, while kendall and
# Is there a correlation between a pirate's age and spearman stand for Kendall and
Spearman correlations respectively.
# the number of parrots (s)he's owned?
conf.level The confidence level of the interval.
subset A vector specifying a subset of
age.parrots.test <- cor.test(formula = ~ age + parrots, observations to use.
data = pirates)

Now lets print the result:

age.parrots.test

##
## Pearson's product-moment correlation
##
## data: age and parrots
## t = 7.4153, df = 998, p-value = 2.588e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1689151 0.2864504
## sample estimates:
## cor
## 0.2285152
11: inferential statistics: 1 and 2-sample null-hypothesis tests 165

Looks like we have a positive correlation of 0.23 and a very small


p-value of 2.5884453 1013 . To see what information we can extract
for this test, lets run the command names() on the test object:

names(age.parrots.test)

## [1] "statistic" "parameter" "p.value" "estimate" "null.value"


## [6] "alternative" "method" "data.name" "conf.int"

Looks like weve got a lot of information in this test object. As


an example, lets look at the confidence interval for the population
correlation coefficient:

# 95% confidence interval of the correlation


# coefficient
age.parrots.test$conf.int

## [1] 0.1689151 0.2864504


## attr(,"conf.level")
## [1] 0.95

Just like the t.test() function, we can use the subset argument
in the cor.test() function to conduct a test on a subset of the entire
dataframe. For example, to run the same correlation test between
a pirates age and the number of parrots shes owned, but only for
female pirates, I can add the subset = sex == "female" argument:

# Is there a correlation between age and


# number parrots ONLY for female pirates?

cor.test(formula = ~ age + parrots,


data = pirates,
subset = sex == "female")

##
## Pearson's product-moment correlation
##
## data: age and parrots
## t = 4.0423, df = 489, p-value = 6.147e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09280611 0.26410932
## sample estimates:
## cor
## 0.1798206
166 yarrr! the pirates guide to r

The results look pretty much identical. In other words, the


strength of the relationship between a pirates age and the num-
ber of parrots theyve owned is pretty much the same for female
pirates and pirates in general.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 167

Chi-square test

Next, well cover chi-square tests. In a chi-square test test, we test


whether or not there is a difference in the rates of outcomes on a
nominal scale (like sex, eye color, first name etc.). The test statis-
tic of a chi-square text is 2 and can range from 0 to Infinity. The Here are some optional arguments
to chisq.test(). As always, check
null-hypothesis of a chi-square test is that 2 = 0 which means no
the help menu with ?chisq.test for
relationship. additional information and examples:
x A contingency table of one (or two)
1-sample Chi-square test categorical variables

If you conduct a 1-sample chi-square test, you are testing if there is a


difference in the number of members of each category in the vector.
Or in other words, are all category memberships equally prevalent? To get this table of the pirate college
We can test this on the pirates dataset by testing if there are an equal data, just use the table() function:

number of pirates from each college. Here is a table of the college table(pirates$college)
data:
##
## CCCC JSSFP
college frequency ## 667 333
CCCC 667
JSSFP 333

Just by looking at the table, it looks like pirates are much more
likely to come from Captain Chunks Cannon Crew (CCCC) than Jack
Sparrows School of Fashion and Piratery (JSSFP). For this reason,
we should expect a very large test statistic and a very small p-value.
Lets test it using the chisq.test() function. As the main argument
to the chisq.test() function, you enter a table of the data using the Dont forget that you should enter your
table() function rather than the original vector. data into the chisq.test() as a table
using the table() function!

# Are all colleges equally prevelant?

chisq.test(x = table(pirates$college))

##
## Chi-squared test for given probabilities
##
## data: table(pirates$college)
## X-squared = 111.56, df = 1, p-value < 2.2e-16

Indeed, with a test statistic of 111.56 and a p-value of 4.4699898


1026 , we can safely reject the null hypothesis and conclude that
certain college are more popular than others.
168 yarrr! the pirates guide to r

2-sample chi-square test


You can also conduct a 2-sample chi-square test by entering two
separate vectors as arguments to chisq.test(). For example, we
might want to know if there is a relationship between the college a
pirate went to, and whether or not he/she wears an eyepatch. We
can get a contingency table of the data from the pirates dataframe as
follows:

table(pirates$eyepatch,
pirates$favorite.pirate)

Anicetus Blackbeard Edward Low Hook Jack Sparrow Lewis Scot


CCCC 68 53 81 67 337 61
JSSFP 37 32 67 48 103 46

To conduct a chi-square test on these data, we will enter table of


the two data vectors:

# Is there a relationship between a pirate's


# college and whether or not they wear an eyepatch?

chisq.test(x = table(pirates$college,
pirates$eyepatch))

##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pirates$college, pirates$eyepatch)
## X-squared = 2.0729, df = 1, p-value = 0.1499

It looks like we got a test statistic of 2 = 2.07 and a p-value of


0.15. At the traditional p = .05 threshold for significance, we would
conclude that we fail to reject the null hypothesis and state that
we do not have enough information to determine if pirates from
different colleges differ in how likely they are to wear eye patches.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 169

Getting APA-style conclusions with the apa function

Most people think that R pirates are a completely unhinged, drunken


bunch of pillaging buffoons. But nothing could be further from the
truth! R pirates are a very organized and formal people who like
their statistical output to follow strict rules. The most famous rules
are those written by the American Pirate Association (APA). These
rules specify exactly how an R pirate should report the results of the
most common hypothesis tests to her fellow pirates.
For example, in reporting a t-test, APA style dictates that the result
should be in the form t(df) = X, p = Y (Z-tailed), where df is the
degrees of freedom of the text, X is the test statistic, Y is the p-value,
and Z is the number of tails in the test. Now you can of course read
these values directly from the test result, but if you want to save
some time and get the APA style conclusion quickly, just use the apa
function. Heres how it works:
Consider the following two-sample t-test on the pirates dataset
that compares whether or not there is a significant age difference
between pirates who wear headbands and those who do not:

test.result <- t.test(age ~ headband,


data = pirates)
test.result

##
## Welch Two Sample t-test
##
## data: age by headband
## t = 1.3542, df = 116.92, p-value = 0.1783
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3834475 2.0416673
## sample estimates:
## mean in group no mean in group yes
## 28.14286 27.31375

It looks like the test statistic is 1.35, degrees of freedom is 116.92,


and the p-value is 0.178. Lets see how the apa function gets these
values directly from the test object:

library(yarrr) # Load the yarrr library


apa(test.result)

## [1] "mean difference = -0.83, t(116.92) = 1.35, p = 0.18 (2-tailed)"


170 yarrr! the pirates guide to r

As you can see, the apa function got the values we wanted and
reported them in proper APA style. The apa function will even
automatically adapt the output for Chi-Square and correlation tests
if you enter such a test object. Lets see how it works on a correlation
test where we correlate a pirates age with the number of parrots she
has owned:

# Print an APA style conclusion of the correlation


# between a pirate's age and # of parrots

apa(cor.test(formula = ~ age + parrots,


data = pirates))

## [1] "r = 0.23, t(998) = 7.42, p < 0.01 (2-tailed)"

The apa function has a few optional arguments that control things
like the number of significant digits in the output, and the number of
tails in the test. Run ?apa to see all the options.
11: inferential statistics: 1 and 2-sample null-hypothesis tests 171

Test your R might!

The following questions are based on data from either the movies
or the pirates dataset in the yarrr package. Make sure to load the
package first to get access to the data!

1. Do male pirates have significantly longer beards than female


pirates? Test this by conducting a t-test on the relevant data in
the pirates dataset. (Hint: Youll have to select just the female and
male pirates and remove the other ones using subset())

2. Are pirates whose favorite pixar movie is Up more or less likely to


wear an eye patch than those whose favorite pixar movie is Inside
Out? Test this by conducting a chi-square test on the relevant
data in the pirates dataset. (Hint: Create a new dataframe that
only contains data from pirates whose favorite move is either
Up or Inside Out using subset(). Then do the test on this new
dataframe.)

3. Do longer movies have significantly higher budgets than shorter


movies? Answer this question by conducting a correlation test on
the appropriate data in the movies dataset.

4. Do R rated movies earn significantly more money than PG-13


movies? Test this by conducting a t-test on the relevant data in the
movies dataset.

5. Are certain movie genres significantly more common than others


in the movies dataset? Test this by conducting a 1-sample chi-
square test on the relevant data in the movies dataset.

6. Do sequels and non-sequels differ in their ratings? Test this by


conducting a 2-sample chi-square test on the relevant data in the
movies dataset.
12: ANOVA and Factorial Designs

In the last chapter we covered 1 and two sample hypothesis tests.


In these tests, you are either comparing 1 group to a hypothesized
value, or comparing the relationship between two groups (either
their means or their correlation). In this chapter, well cover how to
analyse more complex experimental designs with ANOVAs.
When do you conduct an ANOVA? You conduct an ANOVA
when you are testing the effect of one or more nominal (aka factor)
independent variable(s) on a numerical dependent variable24 . If 24
A nominal (factor) variable is one
you only include one independent variable, this is called a One-way that contains a finite number of cate-
gories with no inherent order. Gender,
ANOVA. If you include two independent variables, this is called a profession, experimental conditions,
Two-way ANOVA. If you include three independent variables it is and Justin Bieber albums are good
examples of factors (not necessarily of
called a Mnage trois NOVA25 . good music)
For example, lets say you want to test how well each of three 25
Ok maybe its not yet, but we you
different cleaning fluids are at getting poop off of your poop deck.To repeat it enough it will be
test this, you could do the following: over the course of 300 cleaning We can visualise the poopdeck dataus-
days, you clean different areas of the deck with the three different ing (of course) a pirate plot:
cleaners. You then record how long it takes for each cleaner to clean pirateplot(time ~ cleaner + type,
its portion of the deck. At the same time, you could also measure data = poopdeck,
how well the cleaner is cleaning two different types of poop that ylim = c(0, 150),
xlab = "Cleaner",
typically show up on your deck: shark and parrot. Here, your inde- ylab = "Cleaning Time (minutes)",
pendent variables cleaner and type are factors, and your dependent main = "poopdeck data",
point.pch = 16,
variable time is numeric. pal = "basel",
Thankfully, this experiment has already been conducted. The data back.col = gray(.98)
are recorded in a dataframe called poopdeck in the yarrr package. )

Heres how the first 8 rows of the data look:


poopdeck data

150
Cleaning Time (minutes)

125

100

75

50

25

0
a b c a b c
type = parrot type = shark
Cleaner
174 yarrr! the pirates guide to r

day cleaner type time


1 a parrot 47
1 b parrot 55
1 c parrot 64
1 a shark 101
1 b shark 76
1 c shark 63
2 a parrot 25
2 b parrot 44

Given this data, we can use ANOVAs to answer three separate


questions:

1. Is there a difference between the different cleaners on cleaning


time (ignoring poop type)? (One-way ANOVA)

2. Is there a difference between the different poop types on cleaning


time (ignoring which cleaner is used)? (One-way ANOVA)

3. Is there a unique effect of the cleaner or poop types on cleaning


time? (Two-way ANOVA)

4. Does the effect of cleaner depend on the poop type? (Interaction


between cleaner and type)

Between-Subjects ANOVA

There are many types of ANOVAs that depend on the type of data
you are analyzing. In fact, there are so many types of ANOVAs that
there are entire books explaining differences between one type and
another. For this book, well cover just one type of ANOVAs called
full-factorial, between-subjects ANOVAs. These are the simplest types of
ANOVAs which are used to analyze a standard experimental design.
In a full-factorial, between-subjects ANOVA, participants (aka, source
of data) are randomly assigned to a unique combination of factors
where a combination of factors means a specific experimental
condition26 26
For example, consider a psychology
study comparing the effects of caffeine
For the rest of this chapter, I will refer to full-factorial between-
on cognitive performance. The study
subjects ANOVAs as standard ANOVAs could have two independent variables:
drink type (soda vs. coffee vs. energy
drink), and drink dose (.25l, .5l, 1l). In a
What does ANOVA stand for? full-factorial design, each participant in
the study would be randomly assigned
ANOVA stands for "Analysis of variance." At first glance, this sounds to one drink type and one drink dose
like a strange name to give to a test that you use to find differences condition. In this design, there would
be 3 x 3 = 9 conditions.
in means, not differences in variances. However, ANOVA actually uses
variances to determine whether or not there are real differences in
12: anova and factorial designs 175

the means of groups. Specifically, it looks at how variable data are


within groups and comapres that to the variability of data between
groups. If the between-group variance is large compared to the
within group variance, the ANOVA will conclude that the groups
do differ in their means. If the between-group variance is small
compared to the within group variance, the ANOVA will conclude
that the groups are all the same. See Figure 58 for a visual depiction
of an ANOVA.

Figure 58: How ANOVAs work.


NonSignificant ANOVA ANOVA compares the variability
Between var SMALL compared to Within var between groups (i.e.; the differences
in the means) to the variability within
groups (i.e.; the differences between
Between Within
members within groups). If the variabil-
100 ity between groups is small compared to
90 the variability between groups, ANOVA
will return a non-significant result
80
suggesting taht the groups are not really
70 different. If the variability between
data

60 groups is large compared to the variabil-


50
Between Within ity within groups, ANOVA will return
Variability Variability
a significant result indicating that the
40
groups are really different.
30
20
10
0

1 2 3 3
2
1 3
2
1

Significant ANOVA
Between var LARGE compared to Within var

Between Within
100
90
80
70
data

60
Between Within
50 Variability Variability

40
30
20
10
0

1 2 3 3
2
1 3
2
1
176 yarrr! the pirates guide to r

4 Steps to conduct a standard ANOVA in R

Here are the 4 steps you should follow to conduct a standard


ANOVA in R:

Step 1 Create an ANOVA object using the aov() function. In the


aov() function, specify the independent and dependent variable(s)
with a formula with the format y x1 + x2 + ... where y is
the dependent variable, and x1, x2 ... are one (more more) factor
independent variables.

Step 2 Create a summary ANOVA table by applying the summary()


function to the ANOVA object you created in Step 1.

Step 3 If necessary, calculate post-hoc tests by applying a post-hoc


testing function like TukeyHSD() to the ANOVA object you created
in Step 1.

Step 4 If necessary, interpret the nature of the group differences


by creating a linear regression object using lm() using the same
arguments you used in the aov() function in Step 1.

Lets do an example by running both a one-way and two-way


ANOVA on the poopdeck data:

One-way (1 IV) ANOVA

You conduct a one-way ANOVA when you testing one factor inde-
pendent variable and are ignoring all other possible variables. Lets
use the poopdeck data and do a one-way ANOVA with the cleaner
type as the independent variable, and the cleaning time as the depen-
dent variable.

Step 1: Create an ANOVA object from the regression object with aov()
First, well create an ANOVA object with aov. Because cleaner is
the independent variable, and time is the dependent variable, well
include the argument formula = time cleaner

cleaner.aov <- aov(formula = time ~ cleaner,


data = poopdeck)

Step 2: Look at summary tables of the ANOVA with summary()


Now, to see a full ANOVA summary table of the ANOVA object,
apply the summary() to the ANOVA object from Step 1.
12: anova and factorial designs 177

summary(cleaner.aov)

## Df Sum Sq Mean Sq F value Pr(>F)


## cleaner 2 6057 3028 5.294 0.00526 **
## Residuals 597 341511 572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The main result from our table is that we have a significant effect
of cleaner on cleaning time (F(2, 597) = 5.29, p = 0.005). However,
the ANOVA table does not tell us which levels of the independent
variable differ. In other words, we dont know which cleaner is better
than which. To answer this,w e need to conduct a post-hoc test.

Step 3: Do pairwise comparisons with TukeyHSD()


If youve found a significant effect of a factor, you can then do post-
hoc tests to test the difference between each all pairs of levels of the
independent variable. There are many types of pairwise comparisons
that make different assumptions27 . One of the most common post- 27
To learn more about the logic behind
hoc tests for standard ANOVAs is the Tukey Honestly Significant different post-hoc tests, check out the
Wikipedia page here: Post-hoc Test
Difference (HSD) test28 . To do an HSD test, apply the TukeyHSD() Wikipedia
function to your ANOVA object as follows: 28
To see additional information about
the Tukey HSD test, check out the
Wikipedia page here: Tukey HSD
TukeyHSD(cleaner.aov) Wikipedia

## Tukey multiple comparisons of means


## 95% family-wise confidence level
##
## Fit: aov(formula = time ~ cleaner, data = poopdeck)
##
## $cleaner
## diff lwr upr p adj
## b-a -0.42 -6.039575 5.1995746 0.9831441
## c-a -6.94 -12.559575 -1.3204254 0.0107324
## c-b -6.52 -12.139575 -0.9004254 0.0180906

This table shows us pair-wise differences between each group pair.


The diff column shows us the mean differences between groups
(which thankfully are identical to what we found in the summary of
the regression object before), a confidence interval for the difference,
and a p-value testing the null hypothesis that the group differences
are not different.
178 yarrr! the pirates guide to r

Step 4: Look at the coefficients in a regression analysis with lm()


I almost always find it helpful to combine an ANOVA summary table
with a regression summary table. Because ANOVA is just a special
case of regression (where all the indpendent variables are factors),
youll get the same results with a regression object as you will with
an ANOVA object. However, the format of the results are different
and frequently easier to interpret.
To create a regression object, use the lm() function. Your inputs to
this function will be identical to your inputs to the aov() function

# Create a regression object


cleaner.lm <- lm(formula = time ~ cleaner,
data = poopdeck)

# Show summary
summary(cleaner.lm)

##
## Call:
## lm(formula = time ~ cleaner, data = poopdeck)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.02 -16.60 -1.05 16.92 71.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.020 1.691 39.037 < 2e-16 ***
## cleanerb -0.420 2.392 -0.176 0.86066
## cleanerc -6.940 2.392 -2.902 0.00385 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.92 on 597 degrees of freedom
## Multiple R-squared: 0.01743,Adjusted R-squared: 0.01413
## F-statistic: 5.294 on 2 and 597 DF, p-value: 0.005261

As you can see, the regression table does not give us tests for each
variable like the ANOVA table does. Instead, it tells us how different
each level of an independent variable is from a default value. You can
tell which value of an independent variable is the default variable
just by seeing which value is missing from the table. In this case, I
dont see a coefficient for cleaner a, so that must be the default value.
The intercept in the table tells us the mean of the default value. In
12: anova and factorial designs 179

this case, the mean time of cleaner a was 66.02. The coefficients for
the other levels tell us that cleaner b is, on average 0.42 minutes faster
than cleaner a, and cleaner c is on average 6.94 minutes faster than
cleaner a. Not surprisingly, these are the same differences we saw in
the Tukey HSD test!

Multiple independent variables with + (y x1 + x2 + ...)


To conduct a two-way ANOVA or a Mnage trois NOVA, just in-
clude additional independent variables in the regression model
formula with the + sign. Thats it. All the steps are the same. Lets
conduct a two-way ANOVA with both cleaner and type as indepen-
dent variables. To do this, well set formula = time cleaner +
type:

# Step 1: Create ANOVA object with aov()


cleaner.type.aov <- aov(formula = time ~ cleaner + type,
data = poopdeck)

Heres the summary table:

# Step 2: Get ANOVA table with summary()


summary(cleaner.type.aov)

## Df Sum Sq Mean Sq F value Pr(>F)


## cleaner 2 6057 3028 6.945 0.00104 **
## type 1 81620 81620 187.177 < 2e-16 ***
## Residuals 596 259891 436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, given significant effects, we can proceed with post-hoc


tests:

# Step 3: Conduct post-hoc tests


TukeyHSD(cleaner.type.aov)

## Tukey multiple comparisons of means


## 95% family-wise confidence level
##
## Fit: aov(formula = time ~ cleaner + type, data = poopdeck)
##
## $cleaner
## diff lwr upr p adj
## b-a -0.42 -5.326395 4.486395 0.9779465
## c-a -6.94 -11.846395 -2.033605 0.0027112
## c-b -6.52 -11.426395 -1.613605 0.0053376
##
## $type
## diff lwr upr p adj
## shark-parrot 23.32667 19.97811 26.67522 0
180 yarrr! the pirates guide to r

Again, to interpret that nature of


It looks like we found significant effects of both independent
the results, its helpful to repeat the
variables. The only non-significant group difference we found is analysis as a regression using the lm()
between cleaner b and cleaner a. function:

# Step 4: Look at regression coefficients


cleaner.type.lm <- lm(formula = time ~ cleaner + typ
Interactions between variables with * (y x1 * x2) data = poopdeck)

Interactions between variables test whether or not the effect of one summary(cleaner.type.lm)
variable depends on another variable. For example, we could use an
interaction to answer the question: Does the effect of cleaners depend on ##
## Call:
## lm(formula = time ~ cleaner + type, data = poopdeck)
the type of poop they are used to clean? To include interaction terms in ##
## Residuals:
an ANOVA, just use an asterix (*) instead of the plus (+) between the ## Min 1Q Median 3Q Max
## -59.743 -13.792 -0.683 13.583 83.583
terms in your formula29 . ##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
Lets repeat our previous ANOVA with two independent variables, ## (Intercept) 54.357 1.705 31.881 < 2e-16 ***
## cleanerb -0.420 2.088 -0.201 0.840665
but now well include the interaction between cleaner and type. To ## cleanerc -6.940 2.088 -3.323 0.000944 ***
## typeshark 23.327 1.705 13.681 < 2e-16 ***
do this, well set the formula to time cleaner * type. ## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.88 on 596 degrees of freedom
# Step 1: Create ANOVA object ##
##
Multiple R-squared: 0.2523,Adjusted R-squared: 0.2485
F-statistic: 67.02 on 3 and 596 DF, p-value: < 2.2e-16

cleaner.type.int.aov <- aov(formula = time ~ cleaner * type,


data = poopdeck) Now we need to interpret the results
in respect to two default values (here,
cleaner = a and type = parrot). The
# Step 2: Look at summary table intercept means that the average time
for cleaner a on parrot poop was 54.357
summary(cleaner.type.int.aov)
minutes. Additionally, the average time
to clean shark poop was 23.33 minutes
## Df Sum Sq Mean Sq F value Pr(>F) slower than when cleaning parrrot
## cleaner 2 6057 3028 7.824 0.000443 *** poop.
29
When you include an interaction
## type 1 81620 81620 210.863 < 2e-16 ***
term in a regression object, R will
## cleaner:type 2 29968 14984 38.710 < 2e-16 *** autmoatically include the main effects
## Residuals 594 229923 387 as well

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like we did indeed find a significant interaction between


cleaner and type. In other words, the effectiveness of a cleaner de-
pends on the type of poop its being applied to. This makes sense
given our plot of the data at the beginning of the chapter.
To understand the nature of the difference, well look at the regres-
sion coefficients from a regression object:

# Step 4: Calculate regression coefficients


cleaner.type.int.lm <- lm(formula = time ~ cleaner * type,
data = poopdeck)

summary(cleaner.type.int.lm)
12: anova and factorial designs 181

##
## Call:
## lm(formula = time ~ cleaner * type, data = poopdeck)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.28 -12.83 -0.08 12.29 74.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.760 1.967 23.259 < 2e-16 ***
## cleanerb 8.060 2.782 2.897 0.003908 **
## cleanerc 10.370 2.782 3.727 0.000212 ***
## typeshark 40.520 2.782 14.563 < 2e-16 ***
## cleanerb:typeshark -16.960 3.935 -4.310 1.91e-05 ***
## cleanerc:typeshark -34.620 3.935 -8.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.67 on 594 degrees of freedom
## Multiple R-squared: 0.3385,Adjusted R-squared: 0.3329
## F-statistic: 60.79 on 5 and 594 DF, p-value: < 2.2e-16

Again, to interpret this table, we first need to know what the


default values are. We can tell this from the coefficients that are
missing from the table. Because I dont see terms for cleanera or
typeparrot, this means that cleaner = "a" and type = "parrot" are
the defaults. Again, we can interpret the coefficients as differences
between a level and the default. It looks like for parrot poop, cleaners
b and c both take more time than cleaner a (the default). Additionally,
shark poop tends to take much longer than parrot poop to clean (the
estimate for typeshark is positive).
The interaction terms tell us how the effect of cleaners changes
when one is cleaning shark poop. The negative estimate (-16.96)
for cleanerb:typeshark means that cleaner b is, on average 16.96
minutes faster when cleaning shark poop compared to parrot poop.
Because the previous estimate for cleaner b (for parrot poop) was just
8.06, this suggests that cleaner b is slower than cleaner a for parrot
poop, but faster than cleaner a for shark poop. Same thing for cleaner
c which simply has stronger effects in both directions.
182 yarrr! the pirates guide to r

Additional tips

Getting additional information from ANOVA objects


You can get a lot of interesting information from ANOVA objects.
To see everything thats stored in one, run the names() command on
an ANOVA object. For example, heres whats in our last ANOVA
object:

names(cleaner.type.int.aov)

## [1] "coefficients" "residuals" "effects" "rank"


## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"

For example, the "fitted.values" contains the model fits for


the dependent variable (time) for every observation in our dataset.
We can add these fits back to the dataset with the $ operator and
assignment. For example, lets get the model fitted values from both
the interaction model (cleaner.type.aov) and the non-interaction
model (cleaner.type.int.aov) and assign them to new columns in the
dataframe:

# Add the fits for the interaction model to the dataframe as int.fit

poopdeck$int.fit <- cleaner.type.int.aov$fitted.values

# Add the fits for the main effects model to the dataframe as me.fit

poopdeck$me.fit <- cleaner.type.aov$fitted.values

Now lets look at the first few rows in the table to see the fits for
the first few observations.

head(poopdeck)

## day cleaner type time int.fit me.fit


## 1 1 a parrot 47 45.76 54.35667
## 2 1 b parrot 55 53.82 53.93667
## 3 1 c parrot 64 56.13 47.41667
## 4 1 a shark 101 86.28 77.68333
## 5 1 b shark 76 77.38 77.26333
## 6 1 c shark 63 62.03 70.74333

You can use these fits to see how well (or poorly) the model(s)
12: anova and factorial designs 183

were able to fit the data. For example, we can calculate how far each
models fits were from the true data as follows:

# How far were the interaction model fits from the data on average? We can also see how far the mod-
els were on average from the data
separately for each condition using
mean(abs(poopdeck$int.fit - poopdeck$time))
dplyr.

## [1] 15.35173 library(dplyr)

# How far were the main effect model fits from the data on average? poopdeck %>% group_by(cleaner, type) %>%
summarise(
int.fit.err = mean(abs(int.fit - time)),
mean(abs(poopdeck$me.fit - poopdeck$time)) me.fit.err = mean(abs(me.fit - time))
)
## [1] 16.5351
## Source: local data frame [6 x 4]
## Groups: cleaner [?]
As you can see, the interaction model was off from the data by ##
## cleaner type int.fit.err me.fit.err
15.35 minutes on average, while the main effects model was off from
## (chr) (chr) (dbl) (dbl)
the data by 16.54 on average. This is not surprising as the interaction ## 1 a parrot 14.2752 16.14407
model is more complex than the main effects only model. However, ## 2 a shark 15.4944 17.36033
## 3 b parrot 13.9236 13.92127
just because the interaction model is better at fitting the data doesnt ## 4 b shark 18.4552 18.45053
necessarily mean that the interaction is either meaningful or reliable. ## 5 c parrot 15.4926 17.27167
## 6 c shark 14.4694 16.06273

Repeated measures (linear mixed-effects) ANOVA using the lm4 packge The results show that the interaction
model had better fits (i.e.; lower errors)
If you are conducting an analyses where youre repeating measure- for virtually every condition
ments over one or more third variables, like giving the same partici-
pant different tests, you should do a mixed-effects regression analysis.
To do this, you should use the lmer function in the lme4 package.
For example, in our poopdeck data, we have repeated measurements
for days. That is, on each day, we had 6 measurements. Now, its
possible that the overall cleaning times differed depending on the day.
We can account for this by including random intercepts for day by
adding the (1|day) term to the formula specification30 . 30
For more tips on mixed-
effects analyses, check out
this great tutorial by Bodo Winter
# install.packages(lme4) # If you don't have the package already http://www.bodowinter.com/tutorial/bw_LME_tutorial2.p
library(lme4)

# Calculate a mixed-effects regression on time with


# Two fixed factors (cleaner and type)
# And one repeated measure (day)

my.mod <- lmer(formula = time ~ cleaner + type + (1|day),


data = poopdeck)
184 yarrr! the pirates guide to r

Type I, Type II, and Type III ANOVAs


It turns out that there is not just one way to calculate ANOVAs. In
fact, there are three different types - called, Type 1, 2, and 3 (or Type
I, II and III). These types differ in how they calculate variability
(specifically the sums of of squares). If your data is relatively balanced,
meaning that there are relatively equal numbers of observations
in each group, then all three types will give you the same answer.
However, if your data are unbalanced, meaning that some gropus of
data have many more observations than others, then you need to use
Type II (2) or Type III (3). For more detail on the dif-
The standard aov() function in base-R uses Type I sums of squares. ferent types, check out
https://mcfromnz.wordpress.com/2011/03/02/anova-
Therefore, it is only appropriate when your data are balanced. If type-iiiiii-ss-explained/
your data are unbalanced, you should conduct an ANOVA with Type
II or Type III sums of squares. To do this, you can use the Anova()
function in the car package. The Anova() function has an argument
called type that allows you to specify the type of ANOVA you want
to calculate.
In the next code chunk, Ill calculate 3 separate ANOVAs from
the poopdeck data using the three different types. First, Ill create a
regression object with lm():. As youll see, the Anova() function
requres you to enter a regression
object as the main argument, and not
# Step 1: Calculate regression object with lm() a formula and dataset. That is, you
time.lm <- lm(formula = time ~ type + cleaner, need to first create a regression object
data = poopdeck) from the data with lm() (or glm()), and
then enter that object into the Anova()
function. You can also do the same
Now that Ive created the regression object time.lm, I can calculate thing with the standard aov() function
the three different types of ANOVAs by entering the object as the
main argument to either aov() for a Type I ANOVA, or Anova() in
the car package for a Type II or Type III ANOVA:

# Type I ANOVA - aov()


time.I.aov <- aov(time.lm)

# Type II ANOVA - Anova(type = 2)


time.II.aov <- car::Anova(time.lm, type = 2)

To see if your data are balanced, you


# Type III ANOVA - Anova(type = 3) can use the table() function:
time.III.aov <- car::Anova(time.lm, type = 3)
# Are observations in the poopdeck data balanced?
with(poopdeck,
As it happens, the data in the poopdeck dataframe are perfectly table(cleaner, type))
balanced (see Figure 59), so well get exactly the same result for each
## type
ANOVA type. However, if they were not balanced, then we should ## cleaner parrot shark
not use the Type I ANOVA calculated with the aov() function. ## a 100 100
## b 100 100
## c 100 100

As you can see, in the poopdeck data,


the observations are perfectly balanced,
so it doesnt matter which type of
ANOVA we use to analyse the data.
Figure 59: Testing if variables are
balanced in an ANOVA.
12: anova and factorial designs 185

Test your R Might!

For the following questions, use the pirates dataframe in the yarrr
package

1. Is there a significant relationship between a pirates favorite pixar


movie and the number of tattoos (s)he has? Conduct an appropi-
rate ANOVA with fav.pixar as the independent variable, and
tattoos as the dependent variable. If there is a significant rela-
tionship, conduct a post-hoc test to determine which levels of the
independent variable(s) differ.

2. Is there a significant relationship between a pirates favorite pirate


and how many tattoos (s)he has? Conduct an appopriate ANOVA
with favorite.pirate as the independent variable, and tattoos as
the dependent variable. If there is a significant relationship, con-
duct a post-hoc test to determine which levels of the independent
variable(s) differ.

3. Now, repeat your analysis from the previous two questions, but
include both indpendent variables fav.pixar and favorite.pirate
in the ANOVA. Do your conclusions differ when you include both
variables?

4. Finally, test if there is an interaction between fav.pixar and


favorite.pirate on number of tattoos.
13: Regression

Pirates like diamonds. Who doesnt?! But as much as pirates love


diamonds, they hate getting ripped off. For this reason, a pirate
needs to know how to accurately asssess the value of a diamond. For
example, how much should a pirate pay for a diamond with a weight
of 2.0 grams, a clarity value of 1.0, and a color gradient of 4 out of
10? To answer this, wed like to know how the attributes of diamonds
(e.g.; weight, clarity, color) relate to its value. We can get these values
using linear regression.

The Linear Model

The linear model is easily the most famous and widely used model
in all of statistics. Why? Because it can apply to so many interesting
research questions where you are trying to predict a continuous
variable of interest (the response or dependent variable) on the basis of
one or more other variables (the predictor or independent variables). Figure 60: Insert funny caption here.

The linear model takes the following form, where the x values
represent the predictors, while the beta values represent weights.

y = 0 + 1 x1 + 2 x2 + ... n xn

For example, we could use a regression model to understand


how the value of a diamond relates to two independent variables:
its weight and clarity. In the model, we could define the value of
a diamond as weight weight + clarity clarity. Where weight
indicates how much a diamonds value changes as a function of its
weight, and clarity defines how much a diamonds value change as a
function of its clarity.

Linear regression with lm()

To estimate the beta weights of a linear model in R, we use the lm()


function. The function has three key arguments: formula, and data
188 yarrr! the pirates guide to r

lm()

formula
A formula in a form y x1 + x2 + ..., where y is the dependent
variable, and x1, x2, ... are the independent variables. If you want
to include all columns (excluding y) as independent variables, just
enter y .

data
The dataframe containing the columns specified in the formula.

Estimating the value of diamonds with lm()

Well start with a simple example using a dataset in the yarrr package
called diamonds. The dataset includes data on 150 diamonds sold at
an auction. Here are the first few rows of the dataset

library(yarrr)
head(diamonds)

weight clarity color value


9.3 0.88 4 182
11.1 1.05 5 191
8.7 0.85 6 176
10.4 1.15 5 195
10.6 0.92 5 182
12.3 0.44 4 183

Our goal is to come up with a linear model we can use to estimate


the value of each diamond (DV = value) as a linear combination
of three independent variables: its weight, clarity, and color. The
linear model will estimate each diamonds value using the following
equation:

Int + weight weight + clarity clarity + color color


13: regression 189

where weight is the increase in value for each increase of 1 in


weight, clarity is the increase in value for each increase of 1 in clarity
(etc.). Finally, Int is the baseline value of a diamond with a value of
If you want to include all variables
0 in all independent variables. in a dataframe in the regression, you
To estimate each of the 4 weights, well use lm(). Because value is dont need to type the name of every
the dependent variable, well specify the formula as formula = value independent variable in the formula
argument, just enter a period (.) and R
weight + clarity + color. Well assign the result of the function will assume you are using all variables.
to a new object called diamonds.lm:
# Shorten your formula with . to include
# all variables:
# Create a linear model of diamond values
# DV = value, IVs = weight, clarity, color diamonds.lm <- lm(formula = value ~ .,
data = diamonds)

diamonds.lm <- lm(formula = value ~ weight + clarity + color,


Here are the main components of
data = diamonds)
the summary of our regression object
diamonds.lm"
To see the results of the regression analysis, including estimates for Call This just repeats our arguments to
each of the beta values, well use the summary function: the lm() function
Residuals Summary statistics of how far away
# Print summary statistics from diamond model the model fits are away from the
summary(diamonds.lm) true values. A residual is defined as
the model fit minus the true value.
## For example, the median residual
## Call: of 0.11 means that the median
## lm(formula = value ~ weight + clarity + color, data = diamonds) difference between the model fits
## and the true values is 0.11.
## Residuals:
## Min 1Q Median 3Q Max Coefficients Estimates and inferential statistics
## -10.405 -3.547 -0.113 3.255 11.046 on the beta values.
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148.335 3.625 40.92 <2e-16 ***
## weight 2.189 0.200 10.95 <2e-16 ***
## clarity 21.692 2.143 10.12 <2e-16 ***
## color -0.455 0.365 -1.25 0.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 146 degrees of freedom
## Multiple R-squared: 0.637,Adjusted R-squared: 0.63
## F-statistic: 85.5 on 3 and 146 DF, p-value: <2e-16

Here, we can see from the summary table that the model esti-
mated Int (the intercept), to be 148.34, weight to be 2.19, clarity to
be 21.69, and , color to be 0.45. You can see the full linear model in
Figure 61 below:
You can access lots of different aspects of the regression object. To
see whats inside, use names()
190 yarrr! the pirates guide to r

Figure 61: A linear model estimating


Linear Model of Diamond Values the values of diamonds based on their
weight, clarity, and color.

148.3 + 2.19 xweight + 21.69 xclarity + ( 0.46) xcolor = Value

intercept weight clarity color

# Which components are in the regression object?


names(diamonds.lm)

## [1] "coefficients" "residuals" "effects" "rank"


## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"

For example, to get the estimated coefficients from the model, just
acecss the coefficients attribute:

# The coefficients in the diamond model


diamonds.lm$coefficients

## (Intercept) weight clarity color


## 148.3354 2.1894 21.6922 -0.4549
You can use the fitted values from a
regression object to plot the relationship
If you want to access the entire statistical summary table of the between the true values and the model
coefficients, you just need to access them from the summary object: fits. If the model does a good job in
fitting the data, the data should fall on
a diagonal line:
# Coefficient statistics in the diamond model
summary(diamonds.lm)$coefficients # Plot the relationship between true diamond values
# and linear model fitted values

## Estimate Std. Error t value Pr(>|t|) plot(x = diamonds$value,


## (Intercept) 148.3354 3.6253 40.917 7.009e-82 y = diamonds.lm$fitted.values,
xlab = "True Values",
## weight 2.1894 0.2000 10.948 9.706e-21
ylab = "Model Fitted Values",
## clarity 21.6922 2.1429 10.123 1.411e-18 main = "Regression fits of diamond values"
## color -0.4549 0.3646 -1.248 2.141e-01 )

abline(b = 1, a = 0)

Getting model fits with fitted.values


Regression fits of diamond values
To see the fitted values from a regresion object (the values of the de-
pendent variable predicted by the model), access the fitted.values
Model Fitted Values

195

attribute from a regression object with $fitted.values.


Here, Ill add the fitted values from the diamond regression model
185

as a new column in the diamonds dataframe:


175

175 185 195 205

True Values
13: regression 191

# Add the fitted values as a new column in the dataframe


diamonds$value.lm <- diamonds.lm$fitted.values

# Show the result


head(diamonds)

## weight clarity color value value.lm


## 1 9.35 0.88 4 182.5 186.1
## 2 11.10 1.05 5 191.2 193.1
## 3 8.65 0.85 6 175.7 183.0
## 4 10.43 1.15 5 195.2 193.8
## 5 10.62 0.92 5 181.6 189.3
## 6 12.35 0.44 4 182.9 183.1

According to the model, the first diamond, with a weight of 9.35, a


clarity of 0.88, and a color of 4 should have a value of 186.08. As we
can see, this is not far off from the true value of 182.5.

Using predict() to predict new data from a model


Once you have created a regression model with lm(), you can use
it to easily predict results from new datasets using the predict()
function.
For example, lets say I discovered 3 new diamonds with the
following characteristics:

weight clarity color


20 1.5 5
10 0.2 2
15 5.0 3

Ill use the predict() function to predict the value of each of these
diamonds using the regression model diamond.lm that I created
before. The two main arguments to predict() are object the regres-
sion object weve already defined), and newdata the dataframe of
new data: The dataframe that you use in the
newdata argument to predict() must
have column names equal to the names
# Create a dataframe of new diamond data
of the coefficients in the model. If the
diamonds.new <- data.frame(weight = c(12, 6, 5), names are different, the predict()
clarity = c(1.3, 1, 1.5), function wont know which column of
data applies to which coeffiicent
color = c(5, 2, 3))

# Predict the value of the new diamonds using


# the diamonds.lm regression model
192 yarrr! the pirates guide to r

predict(object = diamonds.lm, # The regression model


newdata = diamonds.new) # dataframe of new data

## 1 2 3
## 200.5 182.3 190.5

This result tells us the the new diamonds are expected to have
values of 200.53, 182.25, and 190.46 respectively according to our
regression model.

Including interactions in models: dv x1 * x2

To include interaction terms in a regression model, just put an asterix


(*) between the independent variables.
For example, to create a regression model on the diamonds data
with an interaction term between weight and clarity, wed use the
formula formula = value weight * clarity:

# Create a regression model with interactions between


# IVS weight and clarity
diamonds.int.lm <- lm(formula = value ~ weight * clarity,
data = diamonds)

# Show summary statistics of model coefficients


summary(diamonds.int.lm)$coefficients

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 157.472 10.569 14.899 0.000
## weight 0.979 1.070 0.915 0.362
## clarity 9.924 10.485 0.947 0.345
## weight:clarity 1.245 1.055 1.180 0.240

Center variables before computing interactions!


Hey what happened? Why are all the variables now non-significant?
Does this mean that there is really no relationship between weight
and clarity on value afterall? No. Recall from your second-year pirate
statistics class that when you include interaction terms in a model,
you should always center the independent variables first. Centering a
variable means simply subtracting the mean of the variable from all
observations.
In the following code, Ill repeat the previous regression, but
first Ill create new centered variables weight.c and clarity.c, and
13: regression 193

then run the regression on the interaction between these centered


variables:

# Create centered versions of weight and clarity


diamonds$weight.c <- diamonds$weight - mean(diamonds$weight)
diamonds$clarity.c <- diamonds$clarity - mean(diamonds$clarity)

# Create a regression model with interactions of centered variables


diamonds.int.lm <- lm(formula = value ~ weight.c * clarity.c,
data = diamonds)

# Print summary
summary(diamonds.int.lm)$coefficients

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 189.402 0.383 494.39 0.00
## weight.c 2.223 0.199 11.18 0.00
## clarity.c 22.248 2.134 10.43 0.00
## weight.c:clarity.c 1.245 1.055 1.18 0.24

Hey that looks much better! Now we see that the main effects are
significant and the interaction is non-significant.
194 yarrr! the pirates guide to r

Regression on non-Normal data with glm()

We can use standard regression with lm() when your dependent vari-
able is Normally distributed (more or less). When your dependent
variable does not follow a nice bell-shaped Normal distribution, you
need to use the Generalized Linear Model (GLM). the GLM is a more
general class of linear models that change the distribution of your
dependent variable. In other words, it allows you to use the linear
model even when your dependent variable isnt a normal bell-shape.
Here are 4 of the most common distributions you can can model with
glm():

Normal Poisson
Continuous, bellshaped Positive Integers

8.93 2
9.92 0
3
10.52 5
8.71 2
8.95 2
0
9.62 3
9.58 5
2
11.07
10.15
10.76

5 10 15 1 3 5 7 9

Binomial Gamma
Only 0s and 1s Continuous, positive

1 2.36
0 4.72
Frequency

0
1 0.62
0 1.95
1 2.52
0 2.36
0 4.72
1 0.62
0
1.95
2.52

0 1
0 2 4 6 8 10
13: regression 195

You can use the glm() function just like lm(). To specify the distri-
bution of the dependent variable, use the family argument.

glm()

function, data, subset


The same arguments as in lm()

family
One of the following strings, indicating the link function for the
general linear model

"binomial": Binary logistic regression, useful when the response


is either 0 or 1.
"gaussian": Standard linear regression. Using this family will
give you the same result as lm()
"Gamma": Gamma regression, useful for exponential response
data
"inverse.gaussian": Inverse-Gaussian regression, useful when
the dv is strictly positive and skewed to the right.
"poisson": Poisson regression, useful for count data. For ex-
# Logit
ample, How many parrots has a pirate owned over his/her logit.fun <- function(x) {1 / (1 + exp(-x))}

lifetime?" curve(logit.fun,
from = -3,
to = 3,
lwd = 2,
main = "Inverse Logit",
ylab = "p(y = 1)",
xlab = "Logit(x)"
)

abline(h = .5, lty = 2)


Logistic regression with glm(family = binomial) abline(v = 0, lty = 1)

The most common non-normal regression analysis is logistic regres-


Inverse Logit
sion, where your dependent variable is just 0s and 1. To do a logistic
regression analysis with glm(), use the family = binomial argument.
0.8

Lets run a logistic regression on the diamonds dataset. First, Ill


create a binary variable called value.g175 indicating whether the
0.6
p(y = 1)

value of a diamond is greater than 175 or not. Then, Ill conduct a


0.4

logistic regression with our new binary variable as the dependent


0.2

variable. Well set family = binomial to tell glm() that the dependent
variable is binary. 3 2 1 0 1 2 3

Logit(x)

# Create a binary variable indicating whether or not


Figure 62: The inverse logit function
# a diamond's value is greater than 190 used in binary logistic regression to
diamonds$value.g190 <- diamonds$value > 190 convert logits to probabilities.
196 yarrr! the pirates guide to r

# Conduct a logistic regression on the new binary variable


diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color,
data = diamonds,
family = binomial)

Here are the resulting coefficients:

# Print coefficients from logistic regression


summary(diamond.glm)$coefficients

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) -18.8009 3.4634 -5.428 5.686e-08
## weight 1.1251 0.1968 5.716 1.088e-08
## clarity 9.2910 1.9629 4.733 2.209e-06
## color -0.3836 0.2481 -1.547 1.220e-01

Just like with regular regression with lm(), we can get the fitted
values from the model and put them back into our dataset to see how
well the model fit the data: Looking at the first few observations, it
. looks like the probabilities match the
data pretty well. For example, the first
diamond with a value of 182.5 had a
# Add the fitted values from the GLM back to the data fitted probability of just 0.16 of being
diamonds$pred.g190 <- diamond.glm$fitted.values valued greater than 190. In contrast,
the second diamond, which had a
value of 191.2 had a much higher fitted
# Look at the first few rows (of the named columns) probability of 0.82.
head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")])

## weight clarity color value pred.g190


## 1 9.35 0.88 4 182.5 0.16252
## 2 11.10 1.05 5 191.2 0.82130
## 3 8.65 0.85 6 175.7 0.03008
## 4 10.43 1.15 5 195.2 0.84559
## 5 10.62 0.92 5 181.6 0.44455
## 6 12.35 0.44 4 182.9 0.08688

Just like we did with regular regression, you can use the predict()
function along with the results of a glm() object to predict new data.
Lets use the diamond.glm object to predict the probability that the
new diamonds will have a value greater than 190:

# Predict the 'probability' that the 3 new diamonds


# will have a value greater than 190

predict(object = diamond.glm,
newdata = diamonds.new)
13: regression 197

## 1 2 3
## 4.8605 -3.5265 -0.3898

What the heck, these dont look like probabilities! True, theyre
not. They are logit-transformed probabilities. To turn them back into
probabilities, we need to invert them by applying the inverse logit
function (see Figure ??).

# Get logit predictions of new diamonds


logit.predictions <- predict(object = diamond.glm,
newdata = diamonds.new
)

# Apply inverse logit to transform to probabilities


# (See Equation in the margin)
prob.predictions <- 1 / (1 + exp(-logit.predictions))

# Print final predictions!


prob.predictions

## 1 2 3
## 0.99231 0.02857 0.40376

So, the model predicts that the probability that the three new
diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38%
respectively.

Getting an ANOVA from a regression model with aov()

Once youve created a regression object with lm() or glm(), you can
summarize the results in an ANOVA table with aov():

# Create ANOVA object from regression


diamonds.aov <- aov(diamonds.lm)

# Print summary results


summary(diamonds.aov)

## Df Sum Sq Mean Sq F value Pr(>F)


## weight 1 3218 3218 147.40 <2e-16 ***
## clarity 1 2347 2347 107.53 <2e-16 ***
## color 1 34 34 1.56 0.21
## Residuals 146 3187 22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
198 yarrr! the pirates guide to r

Additional Tips

Adding a regression line to a plot


Adding a regression line with abline()

You can easily add a regression line to a scatterplot. To do this, just

205
put the regression object you created with lm as the main argument
to abline(). For example, the following code will create the scat-

195
Value
terplot on the right (Figure 63) showing the relationship between a

185
diamonds weight and its value including a red regression line:

175
# Scatterplot of diamond weight and value
6 8 10 12 14
plot(x = diamonds$weight,
Weight
y = diamonds$value,
xlab = "Weight", Figure 63: Adding a regression line to a
scatterplot using abline()
ylab = "Value",
main = "Adding a regression line with abline()"
)

# Calculate regression model


diamonds.lm <- lm(formula = value ~ weight,
data = diamonds)

# Add regression line


abline(diamonds.lm,
col = "red", lwd = 2)
13: regression 199

Transforming skewed variables prior to standard regression


# The distribution of movie revenus is highly
If you have a highly skewed variable that you want to include in # skewed.
a regression analysis, you can do one of two things. Option 1 is to hist(movies$revenue.all,
use the general linear model glm() with an appropriate family (like main = "Movie revenue\nBefore log-transformatio

family = gamma). Option 2 is to do a standard regression analysis


Movie revenue
with lm(), but before doing so, transforming the variable into some- Before logtransformation
thing less skewed. For highly skewed data, the most commmon
transformation is a log-transformation.

3000
For example, look at the distribution of movie revenues in the

Frequency
movies dataset in the margin Figure 64:
As you can see, these data dont look Normally distributed at all.

1000
There are a few movies (like Avatar) that just an obscene amount

0
of money, and many movies that made much less. If we want to 0 500 1500 2500
conduct a standard regression analysis on these data, we need to movies$revenue.all
create a new log-transformed version of the variable. In the following
Figure 64: Distribution of movie
code, Ill create a new variable called revenue.all.lt defined as the
revenues
logarithm of revenue.all

# Create a new log-transformed version of movie revenue # Distribution of log-transformed


movies$revenue.all.lt <- log(movies$revenue.all) # revenue is much less skewed

hist(movies$revenue.all.lt,
In Figure 65 you can see a histogram of the new log-transformed main = "Log-transformed Movie revenue")
variable. Its still skewed, but not nearly as badly as before, so I
would be ok using this variable in a standard regression analysis Logtransformed Movie revenue
with lm().
1000
Frequency

600
200
0

3 4 5 6 7 8

movies$revenue.all.lt

Figure 65: Distribution of movie


revenues
200 yarrr! the pirates guide to r

Test your Might! A ship auction

The following questions apply to the auction dataset in the yarrr


package. This dataset contains information about 1,000 ships sold at a
pirate auction. Heres how the first few rows of the dataframe should
look:

cannons rooms age condition color style jbb price


18 20 140 5 red classic 3976 3502
21 21 93 5 red modern 3463 2955
20 18 48 2 plum classic 3175 3281
24 20 81 5 salmon classic 4463 4400
20 21 93 2 red modern 2858 2177
21 19 60 6 red classic 4420 3792

1. The column jbb is the "Jacks Blue Book" value of a ship. Create a
regression object called jbb.cannon.lm predicting the JBB value of
ships based on the number of cannons it has. Based on your result,
how much value does each additional cannon bring to a ship?

2. Repeat your previous regression, but do two separate regressions:


one on modern ships and one on classic ships. Is there relationship
between cannons and JBB the same for both types of ships?

3. Is there a significant interaction between a ships style and its age


on its JBB value? If so, how do you interpret the interaction?

4. Create a regression object called jbb.all.lm predicting the JBB


value of ships based on cannons, rooms, age, condition, color, and
style. Which aspects of a ship significantly affect its JBB value?

5. Create a regression object called price.all.lm predicting the ac-


tual selling value of ships based on cannons, rooms, age, condition,
color, and style. Based on the results, does the JBB do a good job
of capturing the effect of each variable on a ships selling price?

6. Repeat your previous regression analysis, but instead of using


the price as the dependent variable, use the binary variable
price.gt.3500 indicating whether or not the ship had a selling price
greater than 3500. Call the new regression object price.all.blr.
Make sure to use the appropriate regression function!!.

7. Using price.all.lm, predict the selling price of the 3 new ships cannons
12
rooms
34
age
43
condition
7
color
black
style
classic
8 26 54 3 black modern
displayed on the right in Figure 7 32 65 100 5 red modern
Figure 66: Data from 3 new ships about
8. Using price.all.blr, predict the probability that the three new to be auctioned.
ships will have a selling price greater than 3500.
14: Writing your own functions

Why would you want to write your own function?

Throughout this book, you have been using tons of functions either
built into base-R like mean(), hist(), t.test(), or written by other
people and saved in packages like pirateplot() and apa()in the
yarrr package. However, because R is a complete programming Figure 67: Functions. Theyre kind of a
language, you can easily write your own functions that perform big deal.
specific tasks you want. Boring standard histogram with default
For example, lets say you think the standard histograms made values of hist():

with hist() are pretty boring. Check out the top figure on the right hist(pirates$age,
for an example. Instead of using these boring plots, youd like to xlab = "Age",
easily create a version like the bottom figure just below it. This plot main = "Pirates' Ages")

not only has a more modern design, but it also includes statistical
information, like the data mean, standard deviation, and a 95% Pirates' Ages

confidence interval for the mean as a subheading. Now of course


you know from chapter XX that you can customize plots in R any 250

way that youd like by adding customer parameter values like col,
Frequency

150

bg (etc.). But as you can see from the code above Figure , the plot
has exactly the same inputs as the boring histogram just above it! No
0 50

extra arguments necessary. How did I do that? Well as you may have
guessed, I wrote a custom function called my.hist() that contains all 10 15 20 25 30 35 40 45
the code necessary to build the plot. So now, anytime I want to make Age
a fancy histogram, I can just use the my.hist() function rather than
having to always write the raw code from scratch. Heres an advanced version made with
Of course, functions are limited to creating plots...oh no. You our new function my.hist()!

can write a function to do anything that you can program once in R. my.hist(pirates$age,
If theres anything you like to do repeatedly in R, you will almost xlab = "Age",
main = "Pirates' Ages")
certainly like to write your own custom function to perform that
action quickly and easily whenever youd like. In fact, thats all
functions really are, theyre just chunks of R code that are stored Pirates' Ages
Mean = 27.39 (95% CI [27.05, 27.74]), SD = 5.54
behind the scenes for you to use without having to see (or write) the
code yourself. Some of you reading this will quickly see how how
writing your own functions can save you tons of time. For those of
you who havent...trust me, this is a big deal.

10 15 20 25 30 35 40 45

Age
202 yarrr! the pirates guide to r

The basic structure of a function

A function is simply an object that (usually) takes some input, per-


forms some action (executes some R code), and then (usually) returns
some output. This might sound complicated, but youve been using
functions pre-defined in R throughout this book. For example, the
function mean() takes a numeric vector like the ages of pirates as
an argument, and then returns the arithmetic mean of that vector as a
single scaler value.
Your custom functions will have the following 4 attributes:

1. Name: What is the name of your function? You can give it any
valid object name. However, be careful not to use names of exist-
ing functions or R might get confused.

2. Inputs: What are the inputs to the function? Does it need a vector
of numeric data? Or some text? You can specify as many inputs as
you want.

3. Actions: What do you want the function to do with the inputs?


Create a plot? Calculate a statistic? Run a regression analysis? This
is where youll write all the real R code behind the function.

4. Output: What do you want the code to return when its finished
with the actions? Should it return a scaler statistic? A vector of
data? A dataframe?

Heres how your function will look in R. Youll use two new
functions, called function() and return(). As you can see, you put Yes, you use functions to create func-
the function inputs as arguments to the function() function, and the tions! Very Inception-y

output(s) as argument(s) to the return() function.

NAME <- function(INPUTS) {

ACTIONS

return(OUTPUT)

A simple example
Lets create a very simple example of a function. Well create a func-
tion called my.mean that does the exact same thing as the mean()
function in R. This function will take a vector x as an argument, cre-
ates a new vector called output that is the mean of all the elements
14: writing your own functions 203

of x (by summing all the values in x and dividing by the length of x),
then return the output object to the user.

my.mean <- function(x) { # Single input called x

output <- sum(x) / length(x) # Calculate output

return(output) # Return output to the user after running the function

Try running the code above. When you do, nothing obvious hap- If you ever want to see the exact
pens. However, R has now stored the new function my.mean() in the code used to generate a function, you
can just call the name of the function
current working directory for later use. To use the function, we can without the parentheses. For example,
then just call our function like any other function in R. Lets call our to see the code underlying our new
function my.mean you can run the
new function on some data and make sure that it gives us the same following:
result as mean():
my.mean

data <- c(3, 1, 6, 4, 2, 8, 4, 2) ## function(x) { # Single input called x


##
## output <- sum(x) / length(x) # Calculate output
my.mean(data) ##
## return(output) # Return output to the user after running the function
##
## }
## [1] 3.75

mean(data)

## [1] 3.75

As you can see, our new function my.mean() gave the same result
as Rs built in mean() function! Obviously, this was a bit of a waste
of time as we simply recreated a built-in R function. But you get the
idea...

Using multiple inputs


You can create functions with as many inputs as youd like (even 0!).
Lets do an example. Well create a function called oh.god.how.much.did.i.spend
that helps hungover pirates figure out how much gold they spent af-
ter a long night of pirate debauchery. The function will have three
inputs: grogg: the number of mugs of grogg the pirate drank, port:
the number of glasses of port the pirate drank, and crabjuice: the
number of shots of fermented crab juice the pirate drank. Based on
this input, the function will calculate how much gold the pirate spent.
Well also assume that a mug of grogg costs 1, a glass of port costs 3,
and a shot of fermented crab juice costs 10.
204 yarrr! the pirates guide to r

oh.god.how.much.did.i.spend <- function(grogg,


port, You can use any kind of object as
crabjuice) { an input to a function. For exam-
ple, we could re-create the function
oh.god.how.much.did.i.spend by hav-
output <- grogg * 1 + port * 3 + crabjuice * 10 ing a single vector object as the input.
In this version, well extract the values
of a, b and c using indexing:
return(output) oh.god.how.much.did.i.spend <- function(drinks.vec) {

} grogg <- drinks.vec[1]


port <- drinks.vec[2]
crabjuice <- drinks.vec[3]

Now lets test our new function with a few different values for the output <- grogg * 1 + port * 3 + crabjuice * 10

inputs grogg, port, and crab juice. How much gold did Tamara, who return(output)

}
had had 10 mugs of grogg, 3 glasses of wine, and 0 shots of crab juice
spend? To use this function, the pirate will
enter the number of drinks she had as
oh.god.how.much.did.i.spend(grogg = 10, a single vector with length three rather
than as 3 separate scalers.
port = 3,
crabjuice = 0)

## [1] 19

Looks like Tamara spent 19 gold last night. Ok, now how about
Cosima, who didnt drink any grogg or port, but went a bit nuts on
the crab juice:

oh.god.how.much.did.i.spend(grogg = 0,
port = 0,
crabjuice = 7)

## [1] 70

Cosimas taste for crab juice set her back 70 gold pieces.

Including default values


When you create functions with many inputs, youll probably want
to start adding default values. Default values are input values which
the function will use if the user does not specify their own. Including Most functions that youve used so far
defaults can save the user a lot of time because it keeps them from have default values. For example, the
hist() function will use default values
having to specify every possible input to a function. for inputs like main, xlab, (etc.) if you
To add a default value to a function input, just include = DEFAULT dont specify them
after the input. For example, lets add a default value of 0 to each
argument in the oh.god.how.much.did.i.spend function. By doing
this, R will set any inputs that the user does not specify to 0 in
other words, it will assume that if you dont tell it how many drinks
of a certain type you had, then you must have had 0.
14: writing your own functions 205

oh.god.how.much.did.i.spend <- function(grogg = 0,


port = 0,
crabjuice = 0) {

output <- grogg * 1 + port * 3 + crabjuice * 10

return(output)
}

Lets test the new version of our function with data from Hyejeong,
who had 5 glasses of port but no grogg or crab juice. Because 0 is the
default, we can just ignore these arguments:
If you have a default value for ev-
ery input, you can even call the
oh.god.how.much.did.i.spend(port = 5) function without specifying any
inputs R will set all of them to
## [1] 15 the default. For example, if we call
oh.god.how.much.did.i.spend with-
out specifying any inputs, R will set
Looks like Hyejeong only spent 15 by sticking with port. them all to 0 (which should make the
result 0).

Using if/then statements in functions oh.god.how.much.did.i.spend()

Imagine a restaurant waiter who gives everyone crab juice no matter ## [1] 0
what they ordered. Or a guy who wears a tuxedo no matter if hes
going to a dance or a barbeque. Well, unlike these guys, a good
function needs to change what it does depending on what you give it.
That is, before serving someone crab juice, a good server() function
needs to first make sure that the order argument was actually crab
juice (order == "crabjuice").
Youve already used functions that change based on their inputs.
Consider the summary() function: If you evaluate the summary
function on a numeric vector, it will return summary statistics of that
vector. However, if you give the summary function a regression object,
it will return output specific to regression objects like estimates
of the beta values. In other words, before actually calculating and
displaying summary data, the summary() function first checks what
type of input it received, and then calculates summary information
specific to the type of input it is given.
Like all programming languages, R uses if-then statements to se-
lectively run code based on some criteria. The function for running
if-then statements is if() . The if() function has two main ele-
ments, a logical test and a chunk of conditional code (in curly braces).
The chunk of code in the curly braces is conditional because it is only
evaluated if the logical test is TRUE. If the logical test is FALSE, R will
completely ignore the conditional code.
For example, in the following code chunk, I will define an object a
206 yarrr! the pirates guide to r

and assign the value TRUE to it. I will then write an ifstatement that
checks if a is TRUE, and if so, prints the statement "a was true!":

a <- TRUE

if(a) { # If a is TRUE, do the following...

print("a was true!")

} # End of conditional code

## [1] "a was true!"

Since a was TRUE, R executed the conditional code in the curly


braces (and printed the statement). Now, If I run the code again,
but set a to value of FALSE, the conditional code wont run, and the
statement wont print:

a <- FALSE

if(a) {

print("a was true!")

Of course, you dont have to explictly enter the value TRUE or


FALSE as a logical test. You can put any R code in the logical test
of an if statement as long as it returns a logical value of TRUE or
FALSE.
Lets use if() statements in a really important function called
feed.me.negatives. The function will have one input called x. The
function will check if x is negative. If it is, the function will do one
thing. If the input is not negative, it will do something else.

feed.me.negatives <- function(x) {

if(x < 0) {output <- "Yum! I love negative numbers!!"}

if(x > 0) {output <- "WTF WAS THAT?! I am feed.me.negatives, not be.a.douchebag"}

if(x == 0) {output <- "...I'm confused"}

return(output)

}
14: writing your own functions 207

Lets test the function on some different inputs

feed.me.negatives(-2)

## [1] "Yum! I love negative numbers!!"

feed.me.negatives(10)

## [1] "WTF WAS THAT?! I am feed.me.negatives, not be.a.douchebag"

feed.me.negatives(0)

## [1] "...I'm confused"

Using if() statements in your functions can allow you to do some


really neat things. Lets create a function called show.me() that takes
a vector of data, and either creates a plot, tells the user some statistics,
or tells a joke! The function has two inputs: x a vector of data, and
what a string value that tells the function what to do with x. Well
set the function up to accept three different values of what either
"plot", which will plot the data, "stats", which will return basic
statistics about the vector, or "tellmeajoke", which will return a
funny joke! Youll notice that in the show.me()
function I used the my.hist() function
that we specified earlier in this chap-
show.me <- function(x, what) { ter. If you try running show.me(what
= "plot") without first defining
if(what == "plot") { my.hist(), the function will return
an error!

my.hist(x)
output <- "Ok! I hope you like the plot..."
}

if(what == "stats") {

output <- paste("Yarr! The mean of this data be ", round(mean(x), 2),
" and the standard deviation be ", round(sd(x), 2),
sep = ""
)
}

if(what == "tellmeajoke") {

output <- "I am a pirate, not your joke monkey."


}

return(output)
208 yarrr! the pirates guide to r

Lets test show.me() on the lengths of pirate beards. First, lets get
it to create a histogram by setting what = "plot"

show.me(x = pirates$beard.length, what = "plot")

Histogram of x
Mean = 9.82 (95% CI [9.19, 10.46]), SD = 10.24

0 10 20 30 40

## [1] "Ok! I hope you like the plot..."

Looks good! Now lets get the same function to tell us some
statistics about the data by setting what = "stats":

show.me(x = pirates$beard.length, what = "stats")

## [1] "Yarr! The mean of this data be 9.82 and the standard deviation be 10.24"

Phew that was exhausting, I need to hear a funny joke. Lets set
what = "tellmeajoke":

show.me(what = "tellmeajoke")

## [1] "I am a pirate, not your joke monkey."

That wasnt very funny.


14: writing your own functions 209

Storing and loading your functions to and from a function file with
source()

As you do more programming in R, you may find yourself writing


several function that youll want to use again and again in many
different R scripts. It would be a bit of a pain to have to re-type your
functions every time you start a new R session, but thankfully you
dont need to do that. Instead, you can store all your functions in one
R file and then load that file into each R session.
I recommend that you put all of your custom R functions into a
single R script with a name like Custom_R_Functions.R. Mine is
called "Custom_Pirate_Functions.R. Once youve done this, you
can load all your functions into any R session by using the source()
function. The source function takes a file directory as an argument
(the location of your custom function file) and then executes the R
script into your current session.
For example, on my computer my custom function file is stored
at Users/Nathaniel/Dropbox/Custom_Pirate_Functions.R. When I
start a new R session, I load all of my custom functions by running
the following code:

source(file = "Users/Nathaniel/Dropbox/Custom_Pirate_Functions.R")

Once Ive run this, I have access to all of my functions, I highly


recommend that you do the same thing!

Tips and tricks for complex functions

Here are several tips and tricks that will help you to build good
functions:

Conduct input quality checks


In most of your functions, you will expect the user to specify cer-
tain kinds of inputs. For example, in our show.me function, we only
considered three possible inputs ("plot", "stats", "tellmeajoke")
for the what argument. If a user enters a different value for the what
argument, the function wont do anything because none of the if()
statements were satisfied. This can be frustrating for the user because
he doesnt know if he did something wrong or if the function has a
bug.
To help the user know which inputs are valid, you can include
a quality check, where you explicitly check if the users inputs are
valid. If they valid, you can run the function as normal. If they are
210 yarrr! the pirates guide to r

not valid, you can return an error message to the user telling them
what is wrong.
Lets add a quality control check to the show.me function. Well
do this by creating a logical value called valid.input which is TRUE
when the input for what is valid, and FALSE when the input for what
is not valid. Then, well define a warning message in the case where
the input is not valid:

show.me <- function(x, what) {

valid.input <- what %in% c("plot", "stats", "tellmeajoke")

if(valid.input == TRUE) { # Begin TRUE input section

if(what == "plot") {

my.hist(x)
output <- "Ok here's your histogram!"

if(what == "stats") {

output <- (paste("Yarr! The mean of this data be ", round(mean(x), 2),
" and the standard deviation be ", round(sd(x), 2),
sep = ""
))

if(what == "tellmeajoke") {

output <- "I am a pirate, not your joke monkey."

} # Close TRUE valid input section

if(valid.input == FALSE) { # Begin FALSE valid input section

output <- "Bad what input. Please enter plot, stats, or tellmeajoke."

} # Close FALSE valid input section

return(output)
}

Lets try it out. Well execute the show.me() function with an


invalid value of what. Now, instead of being silent (like before), it
should give us a warning telling us what went wrong:

show.me(x = 1, what = "surprise")

## [1] "Bad what input. Please enter plot, stats, or tellmeajoke."


14: writing your own functions 211

Test your functions by hard-coding input values


When you start writing more complex functions, with several inputs
and lots of function code, youll need to constantly test your function
line-by-line to make sure its working properly. However, because the
input values are defined in the input definitions (which you wont
execute when testing the function), you cant actually test the code
line-by-line until youve defined the input objects in some other way.
To do this, I recommend that you include temporary hard-coded
values for the inputs at the beginning of the function code.
For example, consider the following function called remove.outliers.
The goal of this function is to take a vector of data and remove any
data points that are outliers. This function takes two inputs x and
outlier.def, where x is a vector of numerical data, and outlier.def
is used to define what an outlier is: if a data point is outlier.def
standard deviations away from the mean, then it is defined as an
outlier and is removed from the data vector.
In the following function definition, Ive included two lines where
I directly assign the function inputs to certain values (in this case, I
set x to be a vector with 100 values of 1, and one outlier value of 999,
and outlier.def to be 2). Now, if I want to test the function code
line by line, I can uncomment these test values, execute the code that
assigns those test values to the input objects, then run the function
code line by line to make sure the rest of the code works.

remove.outliers <- function(x, outlier.def = 2) {

# Test values (only used to test the following code)


# x <- c(rep(1, 100), 999)
# outlier.def <- 2

is.outlier <- x > (mean(x) + outlier.def * sd(x)) | x < (mean(x) - outlier.def * sd(x))
x.nooutliers <- x[is.outlier == F]

return(x.nooutliers)

Trust me, when you start building large complex functions, hard-
coding these test values will save you many headaches. Just dont
forget to comment them out when you are done testing or the func-
tion will always use those values!
212 yarrr! the pirates guide to r

Using ... as option inputs


For some functions that you write, you may want the user to be
able to specify inputs to functions within your overall function. For
example, if I create a custom function that includes the histogram
function hist() in R, I might also want the user to be able to specify
optional inputs for the plot, like main, xlab, ylab, etc. However, it
would be a real pain in the pirate ass to have to include all possible
plotting parameters as inputs to our new function. Thankfully, we
can take care of all of this by using the ... notation as an input to
the function31 . The ... input tells R that the user might add additional 31
The ... notation will only pass ar-
inputs that should be used later in the function. guments on to functions that are
specifically written to allow for optional
Heres a quick example, lets create a function called hist.advanced inputs. If you look at the help menu for
that plots a histogram with some optional additional arguments hist(), youll see that it does indeed
allow for such option inputs passed on
passed on with ... from other functions.

hist.advanced <- function(x, add.ci = T, ...) {

hist(x, # Main Data


... # Here is where the additional arguments go
)

if(add.ci == T) {

ci <- t.test(x)$conf.int # Get 95% CI


segments(ci[1], 0, ci[2], 0, lwd = 5, col = "red") hist.advanced(x = rnorm(100), add.ci = T,
main = "Treasure Chests found",
xlab = "Number of Chests",
mtext(paste("95% CI of Mean = [", round(ci[1], 2), ",", col = "lightblue")
round(ci[2], 2), "]"), side = 3, line = 0)
} Treasure Chests found
} 95% CI of Mean = [ 0.33 , 0.06 ]
20

Now, lets test our function with the optional inputs main, xlab,
Frequency

15

and col. These arguments will be passed down to the hist() func-
10

tion within hist.advanced(). The result is in margin Figure . As you


5

can see, R has passed our optional plotting arguments down to the
main hist() function in the function code.
0

2 1 0 1 2 3

Number of Chests
A worked example: Custom plotting functions

Lets create our own advanced own custom plotting function called
plot.advanced that acts like the normal plotting function, but has
several additional arguments
add.mean: A logical value indicating whether or not to add verti-
cal and horizontal lines at the mean value of x and y.
14: writing your own functions 213

add.regression: A logical value indicating whether or not to add a


linear regression line

sig.line.col: The color of the regression line if the slope is signifi-


cant.

nonsig.line.col: The color of the regression line if the slope is NOT


significant.

p.threshold: A numeric scaler indicating the p.value threshold for


determining significance

add.modeltext: A logical value indicating whether or not to in-


clude the regression equation as a sub-title to the plot

This plotting code is a bit long, but its all stuff youve learned
before.
plot.advanced <- function (x = rnorm(100),
y = rnorm(100),
add.mean = F,
add.regression = F,
sig.line.col = "red",
nonsig.line.col = "black",
p.threshold = .05,
add.modeltext = F,
... # Optional further arguments passed on to plot
) {

# Generate the plot with optional arguments


# like main, xlab, ylab, etc.
plot(x, y, ...)

# Add mean reference lines if add.mean is TRUE


if(add.mean == T) {

abline(h = mean(y), lty = 2)


abline(v = mean(x), lty = 2)
}

# Add regression line if add.regression is TRUE


if(add.regression == T) {

model <- lm(y ~ x) # Run regression

p.value <- anova(model)$"Pr(>F)"[1] # Get p-value

# Define line color from model p-value and threshold


if(p.value < p.threshold) {line.col <- sig.line.col}
if(p.value >= p.threshold) {line.col <- nonsig.line.col}

abline(lm(y ~ x), col = line.col, lwd = 2) # Add regression line

# Add regression equation text if add.modeltext is TRUE


if(add.modeltext == T) {
214 yarrr! the pirates guide to r

# Run regression
model <- lm(y ~ x)

# Determine coefficients from model object


coefficients <- model$coefficients
a <- round(coefficients[1], 2)
b <- round(coefficients[2], 2)
plot.advanced(x = pirates$age,
y = pirates$tchests,
# Create text
add.regression = TRUE,
model.text <- paste("Regression Equation: ", a, " + ",
add.modeltext = TRUE,
b, " * x", sep = "")
p.threshold = .05,
main = "plot.advanced()",
# Add text to top of plot
xlab = "Age", ylab = "Treasure Chests
mtext(model.text, side = 3, line = .5, cex = .8)
pch = 16,
col = gray(.2, .3))
}
}
plot.advanced()
Regression Equation: 3.02 + 0.15 * x
Lets try the function on data from the pirates dataset. Well put
age on the x-axis and tchests on the y-axis. To see what the function

50
Treasure Chests Found

40
can do, well turn on some of the additional elements, and include

30
optional titles for the plot. You can see the result in the margin figure

20
on the right!

10
0
15 20 25 30 35 40 45

Age
15: Loops

One of the golden rules of programming is D.R.Y. Dont repeat


yourself." Why? Not because you cant, but because its almost
certainly a waste of time. You see, while computers are still much,
much worse than humans at some tasks (like recognizing faces), Figure 68: Loops in R can be fun.
they are much, much better than humans at doing a few key things Just...you know...dont screw it up.
- like doing the same thing over...and over...and over. To tell R to do
something over and over, we use a loop. Loops are absolutely critical
in conducting many analyses because they allow you to write code
once but evaluate it tens, hundreds, thousands, or millions of times
without ever repeating yourself.
For example, imagine that you conduct a survey of many pirates
containing 100 yes/no questions. Question 1 might be Do you
ever shower? and Question 2 might be No seriously, do you ever
shower?! When you finish the survey, you could store the data
as a dataframe with X rows (where X is the number of pirates you
surveyed), and 100 columns representing all 100 questions. Now,
because every question should have a yes or no answer, the only
values in the dataframe should be Y or N Unfortunately, as is
the case with all real world data collection, you will likely get some
invalid responses like "Maybe" or "What be yee phone number?!.
For this reason, youd like to go through all the data, and recode any
invalid response as NA (aka, missing). To do this sequentially, youd
have to write the following 100 lines of code...

survey.df$q.1[(survey.data$q1 %in% c("Y", "N")) == F] <- NA


survey.df$q.2[(survey.data$q2 %in% c("Y", "N")) == F] <- NA
# . ... Wait...I have to type this 98 more times?!
# .
# . ... My god this is boring...
# .
# .
# . ... I'm about to walk myself off the plank...
# .
survey.df$q.100[(survey.data$q100 %in% c("Y", "N")) == F] <- NA
216 yarrr! the pirates guide to r

Pretty brutal right? Imagine if you have a huge dataset with 1,000
columns, now youre really doing a lot of typing. Thankfully, with
a loop you can take care of this in no time. Check out this following
code chunk which uses a loop to convert the data for all 100 columns
in our survey dataframe.

for(i in 1:100) { # Loop over all 100 columns

y <- survey.df[, i] # Get data for ith column and save in a new object y

y[(y %in% c("Y", "N")) == F] <- NA # Convert invalid values in y to NA

survey.df[,column.i] <- y # Assign y back to survey.df!

} # Close loop!

Done. All 100 columns. Take a look at the code and see if you can
understand the general idea. But if not, no worries. By the end of this
chapter, youll know all the basics of how to construct loops like this
one.

What are loops?

A loop is, very simply, code that tells a program like R to repeat a
certain chunk of code several times with different values of an index
that changes for every run of the loop. In R, the format of a for-loop
is as follows:

for(INDEX.OBJECT in INDEX.VALUES) {

LOOP.CODE

As you can see, there are three key aspects of loops: The index
object, the index vector, and the loop code:

1. index object: The object that is changing according to the index


values. In the previous example, the index is just i. You can use
any object name that you want for the index. While most people
use single character object names, sometimes its more transparent
to use names that tell you something about the data the object
represents. For example, if you are doing a loop over participants
in a study, you can call the index participant.i.
15: loops 217

2. index vector: A vector specifying all values that the index will
take over the loop. In the previous example, the index values
To see if our loop gave us the correct
are 1:10. You can specify the index values any way youd like.
answer, we can do the same calculation
If youre running a loop over numbers, youll probably want to without a loop by using a:b and the
use a:b or seq(). However, if you want to run a loop over a few sum() function:

specific values, you can just use the c() function to type the values sum(1:100)
manually. For example, to run a loop over three different pirate
## [1] 5050
ships, you could set the index values as c("Jolly Roger", "Black
Pearl", "Queen Annes Revenge"). Theres actually a funny story about
how to quickly add integers (without
3. loop code: The code that will be executed for all index values. In a loop). According to the story, a lazy
the previous example, the loop code is print(i). You can write teacher who wanted to take a nap
decided that the best way to occupy his
any R code youd like in a loop - from plotting to analyses. students was to ask them to privately
count all the integers from 1 to 100 at
their desks. To his surprise, a young
A simple loop: Adding integers from 1 to 100 student approached him after a few
moments with the correct answer: 5050.
Lets use a loop to add all the integers from 1 to 100. To do this, The teacher suspected a cheat, but
well start by creating an object called current.sum that contains the the student didnt count the numbers.
Instead he realized that he could use
running sum of the numbers. This is where well store the running the formula n(n+1) / 2. Dont believe
sum. Well set the index object to i, the index vector to 1:100, and the the story? Check it out:
loop code to current.sum <- current.sum + i. Because we want the
100 * 101 / 2
starting sum to be 0, well set the initial value of current.sum to 0.
Here is the code: ## [1] 5050

This boy grew up to be Gauss, a super


current.sum <- 0 legit mathematician.

for(i in 1:100) {

current.sum <- current.sum + i

current.sum

## [1] 5050

Looks like we get an answer of 5050.

Creating multiple plots with a loop

You can use loops to do much more than basic math. Oh no. One of
Figure 69: Gauss. Guys a total pirate.
the best uses of a loop is to put multiple graphs quickly and easily And totally would give us shit for using
on the same chart. Lets use a loop to quickly create a matrix of four a loop to calculate the sum of 1 to 100...
plots. For this example, well plot a histogram of the ages of pirates
based on their favorite pixar movie from the pirates dataset.
218 yarrr! the pirates guide to r

Heres how the plotting loop works.


First, I set up a 4 x 4 plotting space
par(mfrow = c(4, 4)) # Set up a 4 x 4 plotting space with par(mfrow()) (If you havent seen
par(mfrow()) before, just know that
it allows you to put multiple plots
# Create the index vector pixar.movies (all possible movies) side-by-side). Next, I defined the
index vector as all the unique values
pixar.movies <- unique(pirates$fav.pixar) of fav.pixar in the pirates dataframe.
I assigned this object to pixar.movies
for (movie.i in pixar.movies) { which will later become our index
vector. I then started the loop by setting
the index object to movie.i and the
# subset data for current movie index vector to pixar.movies. Next, I
defined the loop code. In the loop code,
data.temp <- subset(pirates, I created a temporary data frame called
fav.pixar == movie.i) data.temp which is a subset of the
entire pirates data frame containing
only rows for which the column
# Plot histogram of temporary data fav.pixar is equal to the index object
hist(data.temp$age, movies.i. Finally, I created a histogram
of the age column from data.temp!
main = movie.i,
xlab = "age",
xlim = c(0, 50))
}

Up Toy Story 2 Cars The Incredibles


0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

20
15

15
Frequency

Frequency

Frequency

Frequency
10

10
5

5
0

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

age age age age

Inside Out Finding Nemo Toy Story WALLE


10

15
100

50

8
Frequency

Frequency

Frequency

Frequency

10
6
60

30

5
2
0 20

0 10

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

age age age age

Monsters, Inc. Toy Story 3 Monsters University Cars 2


20
4

8
15

15
3

6
Frequency

Frequency

Frequency

Frequency
10

10
2

4
5

2
0

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

age age age age

A Bug's Life Brave Ratatouille


6
15

0.8
5
Frequency

Frequency

Frequency
4
10

0.4
2
5

0.0
0

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

age age age


15: loops 219

Storing sequential loop results in a container object

One common use of loops is to create a table showing several statis-


tics over some index. For example, in a survey of college students,
you might want to calculate summary statistics for each level of sex,
graduation year, major, etc. If you need to calculate a simple sum-
mary statistic (like a sample mean or median), you could easily do
this using aggregate() or dplyr. However, for complex calculations,
you may want to use a loop.
To store loop results, youll need to start by setting up a container
where the results will be stored. This container will usually be a
vector, a data frame, or a list (well learn about lists later). Then, in
your loop code, you will assign each result in your loop to an entry
in this container.
Lets do a simple example. Well create a vector called squares,
Figure 70: This is what I got when I
where each entry in the vector is the square of the integers from one googled funny container.
to ten.
The first thing we need to do is create a container vector called
squares that is filled with NA values.

squares <- rep(NA, 10)

Now we can do our loop. Our index value will be i and our index
vector will be 1:10. For our loop code, well calculate the square of
the index value, and assign the result to the ith index of squares

for(i in 1:10) {

result.i <- i ^ 2
squares[i] <- result.i

Now lets look at the result. Hopefully well get the squares of 1 to
10...

squares

## [1] 1 4 9 16 25 36 49 64 81 100
Again, there are easier ways to calculate
the squares of integers than using a
Ok that loop was pretty simple but hopefully it shows you the loop. Here, we can do the same thing
basic idea of combining indexing with assignment in loops. using basic vector arithmetic:

squares <- (1:10) ^ 2


A complex data loop
Lets do a slightly more complex example. For this one, well use a
loop to create a matrix containing many different statistics calculated
220 yarrr! the pirates guide to r

from the pirates dataset. Specifically, well create a matrix called


parrot.ci.df that contains 95% confidence intervals for the mean
number of parrots owned by pirates in each age group. That is, we
want the 95% confidence interval of parrots owned by pirates who
are 20, 21, ... 30. We can do this using a loop.
First, lets create the storage dataframe parrot.ci.df. To keep
the dataframe small enough to print in this book, Ill restrict the age
values to 20, 21, ... 30. However, you can include as many ages as
youd like:
15: loops 221

Heres how I did each step:

# CREATING THE DATA CONTAINER parrot.ci.df 1. Step 1: Determine all the age values
to analyze and assign them to
# Step 1 - Determine the age levels to analyze the object ages.to.analyze. Then
ages.to.analyze <- 20:30 determine total number of unique
n.ages <- length(ages.to.analyze) age values to analyze and assign the
total to the object n.ages.
# Step 2 - Create the storage dataframe parrot.ci.df 2. Step 2: Create a storage dataframe
parrot.ci.df <- data.frame( called parrot.ci.df with N rows
age = rep(NA, n.ages), (where N is the number of ages
mean = rep(NA, n.ages), we will analyze), and 3 columns
lb = rep(NA, n.ages), (age, lower.bound, upper.bound). I
ub = rep(NA, n.ages), started by filling parrot.ci.df with
stringsAsFactors = F NA values. We will use the loop to
) replace these NA values with our
names(parrot.ci.df) <- c("age", "mean", "lb", "ub") statistics.

All done. Heres how the storage dataframe parrot.ci.df looks:


parrot.ci.df

## age mean lb ub
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 NA NA NA NA
## 11 NA NA NA NA

Now that we have the storage dataframe, we can use a loop to


replace the NA values with the values we want. For this loop, well
loop over the rows of the dataframe parrots.ci.df. Heres how I did each step:
1. Step 3: Set up the loop with an
# Step 3 - Define the index and index values
index of row.i where the index
for (row.i in 1:nrow(parrot.ci.df)) {
values are 1, 2, ... to the number of
rows in the dataframe result.df.
# Step 4 - Calculate statistics for current index value
Our loop will analyze data specific
age.i <- ages.to.analyze[row.i]
to each row in the dataframe.
data.temp <- subset(pirates, age == age.i) # Subset the original data
mean.i <- mean(data.temp$parrots) # Get the sample mean 2. Step 4: As a function of the current
ci.i <- t.test(data.temp$parrots) # Get the ci index, determine the age value
in that row, create a temporary
# Step 5 - Assign statistics to parrot.ci.df dataset of data for that age value,
parrot.ci.df$age[row.i] <- age.i then determine the sample mean
parrot.ci.df$mean[row.i] <- mean.i and confidence interval for that
parrot.ci.df$lb[row.i] <- ci.i[1] temporary dataset.
parrot.ci.df$ub[row.i] <- ci.i[2] 3. Step 5: Assign the results for the
current index value to row.i of the
} # Close loop appropriate column in result.df

Lets look at the result to make sure it worked correctly. We should


get a dataframe with 4 columns showing summary statistics for ages
20 through 30:
222 yarrr! the pirates guide to r

parrot.ci.df

## age mean lb ub
## 1 20 1.667 5.493 35
## 2 21 2.265 6.265 33
## 3 22 2.150 6.954 39
## 4 23 2.300 7.893 59
## 5 24 1.797 7.154 63
## 6 25 2.324 10.41 67
## 7 26 2.415 6.463 52
## 8 27 2.702 8.656 83
## 9 28 2.400 8.053 59
## 10 29 2.532 7.218 61
## 11 30 2.947 7.409 74

Looks like it worked!

Loops over multiple indices

So far weve covered simple loops with a single index value - but
how can you do loops over multiple indices? You could do this
by creating multiple nested loops. However, these are ugly and
cumbersome. Instead, I recommend that you use design matrices to
reduce loops with multiple index values into a single loop with just
one index. Heres how you do it:
Lets say you want to calculate the mean, median, and standard
deviation of some quantitative variable for all combinations of two
factors. For a concrete example, lets say we wanted to calculate
these summary statistics on the age of pirates for all combinations of
colleges and sex.

Design Matrices
To do this, well start by creating a design matrix. This matrix will
have all combinations of our two factors. To create this design matrix
matrix, well use the expand.grid() function. This function takes
several vectors as arguments, and returns a dataframe with all combi-
nations of values of those vectors. For our two factors college and sex,
well enter all the factor values we want. Additionally, well add NA
columns for the three summary statistics we want to calculate

design.matrix <- expand.grid(


"college" = c("JSSFP", "CCCC"), # college factor
"sex" = c("male", "female"), # sex factor
"median.age" = NA, # NA columns for our future calculations
15: loops 223

"mean.age" = NA, #...


"sd.age" = NA, #...
stringsAsFactors = F
)

Heres how the design matrix looks:

design.matrix

## college sex median.age mean.age sd.age


## 1 JSSFP male NA NA NA
## 2 CCCC male NA NA NA
## 3 JSSFP female NA NA NA
## 4 CCCC female NA NA NA

As you can see, the design matrix contains all combinations of our
factors in addition to three NA columns for our future statistics. Now
that we have the matrix, we can use a single loop where the index is
the row of the design.matrix, and the index values are all the rows in
the design matrix. For each index value (that is, for each row), well
get the value of each factor (college and sex) by indexing the current
row of the design matrix. Well then subset the pirates dataframe
with those factor values, calculate our summary statistics, then assign
them

for(row.i in 1:nrow(design.matrix)) {

# Get factor values for current row


college.i <- design.matrix$college[row.i]
sex.i <- design.matrix$sex[row.i]

# Subset pirates with current factor values


data.temp <- subset(pirates, college == college.i & sex == sex.i)

# Calculate statistics
median.i <- median(data.temp$age)
mean.i <- mean(data.temp$age)
sd.i <- sd(data.temp$age)

# Assign statistics to row.i of design.matrix


design.matrix$median.age[row.i] <- median.i
design.matrix$mean.age[row.i] <- mean.i
design.matrix$sd.age[row.i] <- sd.i

}
224 yarrr! the pirates guide to r

Lets look at the result to see if it worked!

design.matrix

## college sex median.age mean.age sd.age


## 1 JSSFP male 31 31.69 2.437
## 2 CCCC male 24 23.55 3.834
## 3 JSSFP female 34 33.67 3.151
## 4 CCCC female 27 26.07 3.521

Sweet! Our loop filled in the NA values with the statistics we


wanted.

The list object

Lets say you are conducting a loop where the outcome of each index
is a vector. However, the length of each vector could change - one
might have a length of 1 and one might have a length of 100. How
can you store each of these results in one object? Unfortunately, a
vector, matrix or dataframe might not be appropriate because their
size is fixed. The solution to this problem is to use a list(). A list is
a special object in R that can store virtually anything. You can have a
list that contains several vectors, matrices, or dataframes of any size.
If you want to get really Inception-y, you can even make lists of lists
(of lists of lists....).
To create a list in R, use the list() function. Lets create a list
that contains 3 vectors where each vector is a random sample from a
normal distribution. Well have the first element have 10 samples, the
second will have 5, and the third will have 15.

number.list <- list("first" = rnorm(n = 10),


"second" = rnorm(n = 5),
"third" = rnorm(n = 15)
)

number.list

## $first
## [1] -1.2430 0.1740 -1.1913 -0.5787 -1.2293 0.9176 1.6177 -2.0238
## [9] -0.2475 0.9151
##
## $second
## [1] -1.62736 -1.10305 -0.52422 0.07205 -2.35107
##
## $third
15: loops 225

## [1] -0.65392 0.40617 -0.61255 -0.35520 -1.00425 -0.42765 -2.32361


## [8] -0.69052 -0.42575 -1.04829 0.69998 0.74078 -0.08779 0.69702
## [15] -0.00156

To index a list, use double brackets [[]] or $ if the list has names.
For example, to get the first element of a list named number.list,
wed use number.ls[[1]]:

number.list[[1]]

## [1] -1.2430 0.1740 -1.1913 -0.5787 -1.2293 0.9176 1.6177 -2.0238


## [9] -0.2475 0.9151

number.list$second

## [1] -1.62736 -1.10305 -0.52422 0.07205 -2.35107

Ok, now lets use the list object within a loop. Well create a loop
that generates 5 different samples from a Normal distribution with
mean 0 and standard deviation 1 and saves the results in a list called
samples.ls. The first element will have 1 sample, the second element
will have 2 samples, etc.
First, we need to set up an empty list container object. To do this,
use the vector function:

samples.ls <- vector("list", 5)

If we look at sample.ls, we can see that it has 5 empty entries:

samples.ls

## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
226 yarrr! the pirates guide to r

Now, lets run the loop. For each run of the loop, well generate
random samples and assign them to an object called samples. Well
then assign the samples object to the ith entry in samples.ls

for(i in 1:5) {
samples <- rnorm(n = i, mean = 0, sd = 1)
samples.ls[[i]] <- samples
}

Lets look at the result:

samples.ls

## [[1]]
## [1] 0.06156
##
## [[2]]
## [1] 0.9576 2.0565
##
## [[3]]
## [1] 0.48995 0.94618 -0.02269
##
## [[4]]
## [1] 1.4533 0.2751 0.6653 -0.9734
##
## [[5]]
## [1] 0.61423 -0.09755 0.77833 -0.20921 0.28763

Looks like it worked. The first element has one sample, the second
element has two samples
If you want to convert a list to a vector format, you can use the
command unlist().

unlist(samples.ls)

## [1] 0.06156 0.95758 2.05650 0.48995 0.94618 -0.02269 1.45330


## [8] 0.27510 0.66534 -0.97345 0.61423 -0.09755 0.77833 -0.20921
## [15] 0.28763

Now all the results are compressed into one vector. Of course, the
resulting vector has lost some information because you dont know
which values came from which loop index.

When and when not to use loops

Loops are great because they save you a lot of code. However, a
drawback of loops is that they can be slow relative to other functions.
15: loops 227

For example, lets say we wanted to create a vector called one.to.ten


that contains the integers from one to ten. We could do this using the
following for-loop:

one.to.ten <- rep(NA, 10) # Create a dummy vector


for (i in 1:10) {one.to.ten[i] <- i} # Assign new values to vector
one.to.ten # Print the result

## [1] 1 2 3 4 5 6 7 8 9 10

While this for-loop works just fine, you may have noticed that its
a bit silly. Why? Because R has built-in functions for quickly and
easily calculating sequences of numbers. In fact, we used one of those
functions in creating this loop! (See if you can spot it...its 1:10). The
lesson is: before creating a loop, make sure theres not already a
function in R that can do what you want.

Parallel computing with snowfall()

If youre running a long loop and find that its taking a long time,
you can try running in in parallel using snowfall(). Snowfall is a
package that allows you to use multiple processors (called "slaves")
on your computer to run a loop. Theoretically, this can increase the
speed of your loop by 4, 16, 24 (etc.) times, but obviously it depends
on how many cores your computer (or the server youre running the
loop on) has.
To install the snowfall package, run the code

install.packages("snowfall")

There are 5 general steps to using snowfall():

1. Load the snowfall package with library("snowfall")

2. Set up the slaves with sfInit(). Here, you determine how many
slaves (processors) you want to run.

3. Send objects and libraries to the slaves with sfExport(), sfExportAll(),


and/or sfLibrary().

4. Write a function that each slave will evaluate (e.g.; slave.fun). The
function must have a single input.

5. Run the slaves using sfLapply(x, fun), or sfSapply(), where x are


the values that the slaves will evaluate on the function fun. When
the function is completed, it will return a list.

6. Turn the slaves off with sfStop()


228 yarrr! the pirates guide to r

Lets go through the five steps for a detailed example. In this


example, well use two slaves to simultaneously calculate the average
age of pirates who went to Jack Sparrows School of Fashion and
Piratry (JSSFP) and Captain Chunks Cannon Crew (CCCC).
First, well load the snowfall package.

library("snowfall")

## Loading required package: snow

Second, well set up the cluster of slaves by running the sfInit()


function. Enter the number of slaves you want to run using the cpus
argument32 . Once you execute this, R will prepare those slaves in the 32
There isnt really a correct number of
background. slaves to use. If you use too many, then
each might run very slowly. Personally,
I set the number of slaves to be 2 times
sfInit(parallel = T, cpus = 2) the number of cores on my computer.
My computer has 8 cores, so I usually
## R Version: R version 3.2.4 (2016-03-10) use 16 slaves. But again, this might
not be the optimal number. If youre
concerned about speed, you can try
## snowfall 1.84-6.1 initialized (using snow 0.4-1): parallel using different numbers of slaves and
execution on 2 CPUs. see which number seems to do the job
the fastest on your computer
Third, you need to send objects and libraries to the slaves. These
slaves are like brand new R sessions that are completely separate
from your current R session. They wont have access to any of the
objects you have defined or libraries youve loaded in your current
session. To send objects and libraries to the slaves, use the sfExport
and sfExportAll commands. If you run sfExportAll(), R will send
all the objects in your current R session to the slaves. If you want to
just send a few specific objects to the slaves, use sfExport().
Lets send the pirates dataset to the slaves using sfExport. For
some reason (that I dont quite understand), you need to name the
objects as strings (with quotation marks). If you dont include them,
the function wont work

sfExport("pirates")

## Error in sfExport("pirates"): Unknown/unfound variable


pirates in export. (local=TRUE)

Now lets load a library in each of the slaves, use sfLibrary().


Lets load the dplyr package on the slaves (we actually dont need it
for our loop, but Ill show you how to do it anyway):

sfLibrary(dplyr)

## Library dplyr loaded.


15: loops 229

## Library dplyr loaded in cluster.

Fourth, well define the function that you want each slave to run.
In this case, well create a function called slave.fun() that takes the
name of a college as an input, and return the average age of pirates
from that college as an output. Each slave will evaluate this function
on a specific college.

slave.fun <- function(college.i) {

data.temp <- subset(pirates, college == college.i)


output <- mean(data.temp$age)

return(output)

Fifth, now were ready to run the cluster! We do this using the
sfSapply()33 function. There are two arguments to sfSapply(): x, a 33
In addition to sfSapply(), you can
also use sfLapply(), which returns
vector of index values sent to the function, and fun, a function that
stores the results as a list
will be evaluated with each element of x. For this example, well set x
to a vector of the two colleges, and the function to be slave.fun. The
function will then send each value of the vector x to a slave, the slave
will evaluate slave.fun() for a value of x, and return the result to the
main R session:

cluster.result <- sfSapply(x = c("JSSFP", "CCCC"), fun = slave.fun)

## Error in checkForRemoteErrors(val): 2 nodes produced errors;


first error: object pirates not found

Sixth, now that the cluster is finished, we need to close the clusters
with the command sfStop().

sfStop()

##
## Stopping cluster

Now that were finished, we can look at the result. It should be


a vector of length two, where the first element is the mean age of
pirates who went to JSSFP and the second element is the mean age of
pirates who went to CCCC:
230 yarrr! the pirates guide to r

cluster.result

## Error in eval(expr, envir, enclos): object cluster.result


not found

Looks like our result is what we want. Now, of course this was
a very simple example, and it would have been much easier to use
the aggregate() function to accomplish this task. But the general
structure should allow you to perform much more complicated loops
in parallel.

Additional tips for using Snowfall


1. Run your simulations in the background! While youre using
snowfall (or running any processor intensive code), you wont be
able to continue executing any code in R. However, if you need to
scratch your R itch during a long simulation, you can open up a
second instance of R that will run independently of any other R
sessions. On a Mac, you can do this by opening up the Terminal
(in the Applications folder) and running the command open -n
-a RStudio.app. When you run this command, another instance
of RStudio should open up that you can now use while your
simulation is running

2. Print a simulation progress report. If your snowfall loop is taking


a long time to run, you can get a progress report that tells you
what percentage of the simulation is completed. While this wont
speed anything up, it will give you an idea of how fast things are
moving and help you know when it might be finished. To do this,
use the perUpdate argument, which takes an integer as an argu-
ment, and prints a notification for every ith% run of the simulation.
The perUpdate argument only works in the the cluster function
sfClusterApplySR() and not in sfSapply(), but the functions are
virtually identical. For example, lets say we want to execute a
slave function 1,000,000 times. We can get a progress report for
every 1% of simulations (every 10,000 instances) by using the
following code:

cluster.result <- sfClusterApplySR(1:1000000, # Execute the function from 1 to 1,000,000


fun = slave.fun, # The slave function
perUpdate = 1 # Print an update for every 1% completion of the simulatio
)
16: Data Cleaning and preparation

In this chapter, well cover many tips and tricks for preparing a
dataset for analysis - from recoding values in a dataframe, to merging
two dataframes together, to ... lots of other fun things. Some of the
material here has been covered in other chapters - however, because
data preparation is so important and because so many of these
procedures go together, I thought it would be wise to have them all
in one place.

The Basics

Changing column names in a dataframe


To change the names of the columns in a dataframe, use names(),
indexing, and assignment. For example, to change the name of the
first column in a dataframe called data to "participant", youd do
the following:

names(data)[1] <- "participant"

If you dont know the exact index of a column whose name you
want to change, but you know the original name, you can use logical
indexing. For example, to change the name of a column titled sex to
gender, you can do the following:

names(data)[names(data) == "sex"] <- "gender"

Changing the order of columns


You can change the order of columns in a dataframe with indexing
and reassignment. For example, consider the ChickWeight dataframe
which has four columns in the order: weight, Time, Chick and Diet

ChickWeight[1:2,]
232 yarrr! the pirates guide to r

## Grouped Data: weight ~ Time | Chick


## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
If you only want to move some
columns without changing the others,
As you can see, weight is in the first column, Time is in the second, you can use the following trick using
Chick is in the third, and Diet is in the fourth. To change the column the setdiff() function. For example,
lets say we want to move the two
order to, say: Chick, Diet, Time, weight, well create a new column in- columns Diet and weight to the front of
dex vector called new.order. This vector will put the original columns the ChickWeight dataset. Well create a
new index by combining the names of
in their new places. So, to put the Chick column first, well start with the first two columns, with the names
a column index of 3 (the original Chick column). To put the Diet of all remaining columns:
column second, well put the column index 4 next (etc.):
# Define first two column names
to.front <- c("Diet", "weight")
new.order <- c(3, 4, 2, 1)
# Chick, Diet, Time, weight # Get rest of names with setdiff()
others <- setdiff(names(ChickWeight),
to.front)
Now, well use reassignment to put the datframe in the new order:
# Index ChickWeight with c(to.front, others)
ChickWeight <- ChickWeight[c(to.front, others)]
ChickWeight <- ChickWeight[,new.order]
# Check Result
Here is the result: ChickWeight[1:2,]

## Grouped Data: weight ~ Time | Chick


ChickWeight[1:2,] ## Diet weight Time Chick
## 1 1 42 0 1
## Grouped Data: weight ~ Time | Chick ## 2 1 51 2 1
## Time Chick weight Diet
If you are working with a dataset with
## 1 0 1 42 1 many columns, this trick can save you a
## 2 2 1 51 1 lot of time

Instead of using numerical indices, you can also reorder the


columns using a vector of column names. Heres another index
vector containing the names of the columns in a new order:

new.order <- c("Time", "weight", "Diet", "Chick")

Here is the result:

ChickWeight <- ChickWeight[,new.order]


ChickWeight[1:2,]

## Grouped Data: weight ~ Time | Chick


## Time weight Diet Chick
## 1 0 42 1 1
## 2 2 51 1 1
16: data cleaning and preparation 233

Converting values in a vector or dataframe


In one of the early chapters of this book, we learned how to change
values in a vector item-by-item. For example, to change all values of 1 #recodev function
in a vector called vec to 0, wed use the code: # Execute this code in R to use it!

vec[vec == 1] <- 0 recodev <- function(original.vector,


old.values,
new.values,
However, if you need to convert many values in a vector, typing others = NULL) {
this type of code over and over can become tedious. Instead, I recom-
if(is.null(others)) {
mend writing a function to do this. As far as I know, this function
is not stored in base R, so well write one ourselves called recodev() new.vector <- original.vector
(which stands for recode vector). The recodev function is included
}
in the yarrr package. If you dont have the package, the full defini-
tion of the function is on the sidebar. just execute the code in your if(is.null(others) == F) {
R session to use it. The function has 4 inputs new.vector <- rep(others,
length(original.vector))
original.vector: The original vector that you want to recode
}
old.values: A vector of values that you want to replace. For
example, old.values = c(1, 2) means that you want to replace all for (i in 1:length(old.values)) {

values of 1 or 2 in the original vector. change.log <- original.vector == old.values[i] &


is.na(original.vector) == F
new.values: A vector of replacement values. This should be the
same length as old.values. For example, new.values = c("male", new.vector[change.log] <- new.values[i]
"female") means that you want to replace the two values in }
old.values with "male" and "female".
return(new.vector)
others: An optional value that is used to replace any values in
}
original.vector that are not found in old.values. For example,
others = NA will convert any values in original.vector not found
in old.values to NA. If you want to leave other values in their
original state, just leave the others argument blank.

Heres the function in action: Lets say we have a vector gender


which contains 0s, 1s, and 2s

gender <- c(0, 1, 0, 1, 1, 0, 0, 2, 1)

Lets use the recodev() function to convert the 0s to "female", 1s to


"male" and 2s to "other"
gender <- recodev(original.vector = gender,
old.values = c(0, 1, 2),
new.values = c("female", "male", "other")
)

Now lets look at the new version of gender. The former values of
0 should now be female, 1 should now be male, and 2 should now be
other:
234 yarrr! the pirates guide to r

gender

## [1] "female" "male" "female" "male" "male" "female" "female" "other"


## [9] "male"

Changing the class of a vector


If you would like to convert the class of a vector, use a combination
of the as.numeric() and as.character() functions.
For example, the following dataframe called data, has two columns:
one for age and one for gender.

data <- data.frame("age" = c("12", "20", "18", "46"),


"gender" = factor(c("m", "m", "f", "m")),
stringsAsFactors = F
)
str(data)

## 'data.frame': 4 obs. of 2 variables:


## $ age : chr "12" "20" "18" "46" If we try to calculate the mean() of age
without first converting it to numeric,
## $ gender: Factor w/ 2 levels "f","m": 2 2 1 2 well get an error:

As you can see, age is coded as a character (not numeric), and # We'll get an error because age is not
# numeric (yet)
gender is coded as a factor. Wed like to convert age to numeric, and mean(data$age)
gender to character.
To make age numeric, just use the as.numeric() function: ## Warning in
mean.default(data$age): argument
is not numeric or logical: returning
data$age <- as.numeric(data$age) NA

## [1] NA
To convert gender from a factor to a character, use the as.character()
function:

data$gender <- as.character(data$gender)

Lets make sure it worked:

str(data)

## 'data.frame': 4 obs. of 2 variables:


## $ age : num 12 20 18 46
## $ gender: chr "m" "m" "f" "m"

Now that age is numeric, we can apply numeric functions like


mean() to the column:
16: data cleaning and preparation 235

mean(data$age)

## [1] 24

Splitting numerical data into groups using cut()

When we create some plots and analyses, we may want to group


numerical data into bins of similar values. For example, in our pirate
survey, we might want to group pirates into age decades, where
all pirates in their 20s are in one group, all those in their 30s go
into another group, etc. Once we have these bins, we can calculate
aggregate statistics for each group.
R has a handy function for grouping numerical data called cut()

cut()
x
A vector of numeric data

breaks
Either a numeric vector of two or more unique cut points or a
single number (greater than or equal to 2) giving the number of
intervals into which x is to be cut. For example, breaks = 1:10
will put break points at all integers from 1 to 10, while breaks = 5
will split the data into 5 equal sections.

labels
An optional string vector of labels for each grouping. By default,
labels are constructed using "(a,b]" interval notation. If labels =
FALSE, simple integer codes are returned instead of a factor.

right
A logical value indicating if the intervals should be closed on the
right (and open on the left) or vice versa.

Lets try a simple example by converting the integers from 1 to 50


into bins of size 10:
cut(1:50, seq(0, 50, 10))

## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]


## [9] (0,10] (0,10] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20]
## [17] (10,20] (10,20] (10,20] (10,20] (20,30] (20,30] (20,30] (20,30]
236 yarrr! the pirates guide to r

## [25] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30] (30,40] (30,40]


## [33] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40]
## [41] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50]
## [49] (40,50] (40,50]
## Levels: (0,10] (10,20] (20,30] (30,40] (40,50]

As you can see, our result is a vector of factors, where the first ten
elements are (0, 10], the next ten elements are (10, 20], and so on. In
other words, the new vector treats all numbers from 1 to 10 as being
the same, and all numbers from 11 to 20 as being the same.
Lets test the cut() function on the age data from pirates. Well
Once youve used cut() to convert a
add a new column to the dataset called age.decade, which separates numeric variable into bins, you can then
the age data into bins of size 10. This means that every pirate be- use aggregate() or dplyr to calculate
aggregate statistics for each bin. For
tween the ages of 10 and 20 will be in the first bin, those between example, to calculate the mean number
the ages of 21 and 30 will be in the second bin, and so on. To do this, of tattoos of pirates in their 20s, 30s, 40s,
well enter pirates$age as the x argument, and seq(10, 60, 10) as ... we could do the following:

the breaks argument: # Calculate the decade for each pirate


pirates$age.decade <- cut(
pirates$age.decade <- cut(
x = pirates$age, # The raw data
pirates$age,
breaks = seq(10, 60, 10) # The break points of the cuts
breaks = seq(0, 100, 10)
)
)

To show you how this worked, lets look at the first few rows of # Calculate the mean number of tattoos
the columns age and age.cut # in each decade

aggregate(tattoos ~ age.decade,
head(pirates[c("age", "age.decade")]) FUN = mean,
data = pirates)
## age age.decade
## age.decade tattoos
## 1 30 (20,30] ## 1 (10,20] 9.722
## 2 25 (20,30] ## 2 (20,30] 9.320
## 3 25 (20,30] ## 3 (30,40] 9.335
## 4 (40,50] 8.143
## 4 29 (20,30]
## 5 31 (30,40]
## 6 30 (20,30]

As you can see, age.cut has correctly converted the original age
variable to a factor.
From these data, we can now easily calculate how many pirates
are in each age group using table()

table(pirates$age.decade)

##
## (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
## 0 115 600 278 7 0 0 0
## (80,90] (90,100]
## 0 0
16: data cleaning and preparation 237

Merging two dataframes

Merging two dataframes together allows you to combine information


from both dataframes into one. For example, a teacher might have
a dataframe called students containing information about her class.
She then might have another dataframe called exam1scores showing
the scores each student received on an exam. To combine these data
into one dataframe, you can use the merge() function. For those of
you who are used to working with Excel, merge() works a lot like
vlookup in Excel:

merge()
x, y
2 dataframes to be merged

by, by.x, by.y


The names of the columns that will be used for merging. If the
the merging columns have the same names in both dataframes,
you can just use by = c("col.1", "col.2"...). If the merging
columns have different names in both dataframes, use by.x to
name the columns in the x dataframe, and by.y to name the
columns in the y dataframe. For example, if the merging column is
called STUDENT.NAME in dataframe x, and name in dataframe y, you
can enter by.x = "STUDENT.NAME", by.y = "name"

all.x, all.y
A logical value indicating whether or not to include non-matching
rows of the dataframes in the final output. The default value is
all.y = FALSE, such that any non-matching rows in y are not
included in the final merged dataframe.

A generic use of merge(), looks like this:

new.df <- merge(x = df.1, # First dataframe


y = df.2, # Second dataframe
by = "column" # Common column name in both x and y
)

where df.1 is the first dataframe, df.2 is the second dataframe,


and "column" is the name of the column that is common to both
dataframes.
238 yarrr! the pirates guide to r

For example, lets say that we have some survey data in a dataframe
called survey. survey <- data.frame(
Heres how the survey data looks: "pirate" = 1:10,
"country" = c("Germany",
"Portugal",
survey "Spain",
"Austria",
## pirate country "Australia",
## 1 1 Germany "Austria",
"Germany",
## 2 2 Portugal "Portugal",
## 3 3 Spain "Portugal",
"Germany"
## 4 4 Austria ),
## 5 5 Australia stringsAsFactors = F
## 6 6 Austria )

## 7 7 Germany
## 8 8 Portugal
## 9 9 Portugal
## 10 10 Germany

Now, lets say we want to add some country-specific data to the


dataframe. For example, based on each pirates country, we could
add a column called language with the pirates native language, and
continent the continent the pirate is from. To do this, we can start country.info <- data.frame(
"country" = c("Germany", "Portugal",
by creating a new dataframe called country.info, which tell us the "Spain", "Austria", "Australia"),
language and continent for each country: "language" = c("German", "Portugese",
"Spanish", "German", "English"),
Lets take a look at the country.info dataframe:
"continent" = c("Europe", "Europe", "Europe",
"Europe", "Australia"),
country.info
stringsAsFactors = F)
## country language continent
## 1 Germany German Europe
## 2 Portugal Portugese Europe
## 3 Spain Spanish Europe
## 4 Austria German Europe
## 5 Australia English Australia

Now, using merge(), we can combine the information from the


country.info dataframe to the survey dataframe:

survey <- merge(survey,


country.info,
by = "country"
)

Lets look at the result!


survey

## country pirate language continent


## 1 Australia 5 English Australia
16: data cleaning and preparation 239

## 2 Austria 4 German Europe


## 3 Austria 6 German Europe
## 4 Germany 1 German Europe
## 5 Germany 10 German Europe
## 6 Germany 7 German Europe
## 7 Portugal 8 Portugese Europe
## 8 Portugal 9 Portugese Europe
## 9 Portugal 2 Portugese Europe
## 10 Spain 3 Spanish Europe

As you can see, the merge() function added all the country.info
data to the survey data.

Random Data Preparation Tips


Appendix
242 yarrr! the pirates guide to r

plot(1, xlim = c(0, 26), ylim = c(0, 26),


type = "n", main = "Named Colors", xlab = "", ylab = "",
xaxt = "n", yaxt = "n")

rect(xleft = rep(1:26, each = 26)[1:length(colors())] - .5,


ybottom = rep(26:1, times = 26)[1:length(colors())] - .5,
xright = rep(1:26, each = 26)[1:length(colors())] + .5,
ytop = rep(26:1, times = 26)[1:length(colors())] + .5,
col = colors()
)

text(x = rep(1:26, each = 26)[1:length(colors())],


y = rep(26:1, times = 26)[1:length(colors())],
labels = colors(), cex = .3
)

Named Colors

white darkgoldenrod4
blue1 chocolate1 dodgerblue3gray4 gray30 gray56 gray82 grey0 grey26 grey52 grey78honeydew3
darkseagreen3 lightgreen
lavenderblush3 lightyellow
mediumpurple2
olivedrab2 slategray thistlewhitesmoke
purplesandybrown
paleturquoise2

darkgray
aliceblue blue2 chocolate2 dodgerblue4gray5 gray31 gray57 gray83 grey1 grey27 grey53 grey79honeydew4
darkseagreen4 lightgrey
lavenderblush4 lightyellow1
mediumpurple3 purple1seagreen
paleturquoise3
olivedrab3 slategray1thistle1 yellow

darkgreen
antiquewhiteblue3 chocolate3 firebrick gray6 gray32 gray58 gray84 grey2 grey28 grey54 grey80 hotpinklawngreenlightpink
darkslateblue lightyellow2
mediumpurple4 purple2seagreen1
paleturquoise4
olivedrab4 slategray2thistle2 yellow1

antiquewhite1 darkgrey
blue4 chocolate4 firebrick1 gray7 gray33 gray59 gray85 grey3 grey29 grey55 grey81 hotpink1
darkslategray lightpink1
lemonchiffon lightyellow3 orange
mediumseagreen purple3seagreen2
palevioletred slategray3thistle3 yellow2

antiquewhite2
blueviolet coral darkkhaki firebrick2 gray8 gray34 gray60 gray86 grey4 grey30 grey56 grey82 hotpink2
darkslategray1 lightpink2
lemonchiffon1 lightyellow4 orange1
mediumslateblue purple4seagreen3
palevioletred1 slategray4thistle4 yellow3

antiquewhite3 darkmagenta
brown coral1 firebrick3 gray9 gray35 gray61 gray87 grey5 grey31 grey57 grey83 hotpink3
darkslategray2 lightpink3
lemonchiffon2 limegreen orange2
mediumspringgreenpalevioletred2red seagreen4
slategrey tomato yellow4

antiquewhite4 darkolivegreen
brown1 coral2 firebrick4 gray10 gray36 gray62 gray88 grey6 grey32 grey58 grey84 hotpink4
darkslategray3 lightpink4 linen
lemonchiffon3 orange3
mediumturquoise yellowgreen
palevioletred3red1 seashell snow tomato1

aquamarine darkolivegreen1
brown2 coral3 floralwhitegray11 gray37 gray63 gray89 grey7 grey33 grey59 grey85 indianred
darkslategray4 magenta
lightsalmon
lemonchiffon4 orange4
mediumvioletred palevioletred4red2 seashell1 snow1 tomato2

aquamarine1 darkolivegreen2
brown3 coral4 forestgreengray12 gray38 gray64 gray90 grey8 grey34 grey60 grey86indianred1lightblue
darkslategrey magenta1
lightsalmon1 midnightblue
orangered
papayawhip red3 seashell2 snow2 tomato3

aquamarine2
brown4 darkolivegreen3
cornflowerblue gainsborogray13 gray39 gray65 gray91 grey9 grey35 grey61 grey87indianred2
darkturquoise lightblue1 magenta2
lightsalmon2 orangered1
mintcream peachpuff red4 seashell3 snow3 tomato4

darkolivegreen4
burlywoodcornsilk
aquamarine3 ghostwhitegray14 gray40 gray66 gray92 grey10 grey36 grey62 grey88indianred3
darkviolet lightblue2 magenta3
lightsalmon3 orangered2
mistyrose rosybrown
peachpuff1 seashell4 snow4 turquoise

burlywood1
aquamarine4 darkorange
cornsilk1 deeppink gold gray15 gray41 gray67 gray93 grey11 grey37 grey63 grey89indianred4
lightblue3 magenta4
lightsalmon4 orangered3
mistyrose1 rosybrown1siennaspringgreen
peachpuff2 turquoise1

azureburlywood2
cornsilk2 deeppink1 gold1 gray16 gray42 gray68 gray94 grey12 grey38 grey64 grey90
darkorange1 ivory lightblue4
lightseagreen orangered4
maroonmistyrose2 rosybrown2
peachpuff3 springgreen1
sienna1 turquoise2

azure1burlywood3
cornsilk3 deeppink2 gold2 gray17 gray43 gray69 gray95 grey13 grey39 grey65 grey91 ivory1 lightcoral
darkorange2 lightskyblue mistyrose3orchidpeachpuff4
maroon1 rosybrown3 springgreen2
sienna2 turquoise3

azure2burlywood4
cornsilk4 deeppink3 gold3 gray18 gray44 gray70 gray96 grey14 grey40 grey66 grey92 ivory2 lightcyan
darkorange3 lightskyblue1 mistyrose4orchid1
maroon2 peru rosybrown4 springgreen3
sienna3 turquoise4

azure3cadetblue cyandarkorange4
deeppink4 gold4 gray19 gray45 gray71 gray97 grey15 grey41 grey67 grey93 ivory3 lightcyan1
lightskyblue2
maroon3moccasinorchid2 springgreen4violet
pink royalbluesienna4

azure4cadetblue1cyan1 darkorchid goldenrodgray20 gray46 gray72 gray98 grey16 grey42 grey68 grey94 ivory4 lightcyan2
deepskyblue lightskyblue3 navajowhite
maroon4 orchid3 pink1 royalblue1skybluesteelbluevioletred

beige cadetblue2cyan2darkorchid1 goldenrod1gray21 gray47 gray73 gray99 grey17 grey43 grey69 grey95
deepskyblue1 khaki lightcyan3
lightskyblue4 navajowhite1
mediumaquamarine orchid4 pink2 royalblue2
skyblue1
steelblue1
violetred1

bisquecadetblue3cyan3darkorchid2 goldenrod2gray22 gray48 gray74 gray100 grey18 grey44 grey70 grey96 khaki1lightcyan4
deepskyblue2 lightslateblue palegoldenrod
navajowhite2
mediumblue pink3 royalblue3
skyblue2
steelblue2
violetred2

bisque1cadetblue4cyan4darkorchid3 goldenrod3gray23 gray49 gray75 green grey19 grey45 grey71 grey97 khaki2
deepskyblue3 lightgoldenrod
lightslategray
mediumorchid palegreen pink4 royalblue4
navajowhite3 skyblue3
steelblue3
violetred3

bisque2chartreuse
darkblue
darkorchid4 goldenrod4gray24 gray50 gray76 green1 grey20 grey46 grey72 grey98 khaki3
deepskyblue4 lightgoldenrod1
lightslategrey palegreen1 plumsaddlebrown
navajowhite4
mediumorchid1 skyblue4
steelblue4
violetred4

bisque3 darkcyandarkred dimgray gray


chartreuse1 gray25 gray51 gray77 green2 grey21 grey47 grey73 grey99 khaki4
lightgoldenrod2
lightsteelblue navy palegreen2plum1 salmon slateblue
mediumorchid2 tan wheat

bisque4 darkgoldenrod
chartreuse2 dimgrey gray0 gray26 gray52 gray78 green3 grey22 grey48 grey74 grey100lavender
darksalmon lightgoldenrod3
lightsteelblue1 navyblue
mediumorchid3 palegreen3plum2 salmon1slateblue1 tan1 wheat1

darkgoldenrod1
blackchartreuse3 dodgerbluegray1 gray27 gray53 gray79 green4 grey23 grey49 grey75honeydew
darkseagreen lightgoldenrod4
lavenderblush lightsteelblue2 oldlacepalegreen4plum3 salmon2slateblue2 tan2
mediumorchid4 wheat2

blanchedalmond darkgoldenrod2
chartreuse4 dodgerblue1gray2 gray28 gray54 gray80greenyellowgrey24 grey50 grey76honeydew1
darkseagreen1 lightgoldenrodyellow
lavenderblush1 lightsteelblue3
mediumpurple paleturquoise
olivedrab plum4 salmon3slateblue3 tan3 wheat3

darkgoldenrod3
blue chocolate dodgerblue2gray3 gray29 gray55 gray81
darkseagreen2 grey grey25 grey51 grey77honeydew2 lightgray
lavenderblush2 lightsteelblue4
mediumpurple1 paleturquoise1
olivedrab1 powderblue
salmon4slateblue4 tan4 wheat4

Figure 71: The colors stored in


colors().
appendix 243
Index

%in%, 66 glm(), 195 rnorm(), 108


runif(), 109
legend(), 137
a:b, 44
license, 2 Sammy Davis Jr., 115
aggregate(), 100
Linear Model, 187 sample(), 110
assignment, 35
lm(), 188 seq(), 45
subset(), 82
c(), 42 merge(), 237
correlation, 164 t-test, 160
curve(), 136 read.table(), 93
cut(), 235 rep(), 46 write.table(), 92

You might also like