Mastering Software Development in R
Introduction
Setup
2. Advanced R Programming
2.1 Control Structures
2.2 Functions
2.3 Functional Programming
2.4 Expressions & Environments
2.5 Error Handling and Generation
2.6 Debugging
2.7 Profiling and Benchmarking
2.8 Non-standard Evaluation
2.9 Object Oriented Programming
2.10 Gaining Your tidyverse Citizenship
Setup
This book makes use of the following R packages, which should be installed
to take full advantage of the examples.
choroplethr
choroplethrMaps
data.table
datasets
devtools
dlnm
dplyr
faraway
forcats
GGally
ggmap
ggplot2
ggthemes
ghit
GISTools
grid
gridExtra
httr
knitr
leaflet
lubridate
magrittr
methods
microbenchmark
package
pander
plotly
profvis
pryr
purrr
rappdirs
raster
RColorBrewer
readr
rmarkdown
sp
stats
stringr
testthat
tidyr
tidyverse
tigris
titanic
viridis
You can install all of these packages with the following code:
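The installation code itself is not reproduced in this excerpt; a minimal sketch, assuming all of the listed packages are available on CRAN, would be:
# datasets, grid, methods, and stats ship with R and do not need to be installed
install.packages(c("choroplethr", "choroplethrMaps", "data.table", "devtools",
                   "dlnm", "dplyr", "faraway", "forcats", "GGally", "ggmap",
                   "ggplot2", "ggthemes", "ghit", "GISTools", "gridExtra",
                   "httr", "knitr", "leaflet", "lubridate", "magrittr",
                   "microbenchmark", "pander", "plotly", "profvis", "pryr",
                   "purrr", "rappdirs", "raster", "RColorBrewer", "readr",
                   "rmarkdown", "sp", "stringr", "testthat", "tidyr",
                   "tidyverse", "tigris", "titanic", "viridis"))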
Note: Some of the material in this section is taken from R Programming for Data
Science.
The learning objectives for this section are to:
At the R prompt we type expressions. The <- symbol (gets arrow) is the
assignment operator.
x <- 1
print(x)
[1] 1
x
[1] 1
msg <- "hello"
Evaluation
The [1] shown in the output indicates that x is a vector and 1 is its first element.
Typically with interactive work, we do not explicitly print objects with the
print function; it is much easier to just auto-print them by typing the name of
the object and hitting return/enter. However, when writing scripts, functions,
or longer programs, there is sometimes a need to explicitly print objects
because auto-printing does not work in those settings.
When an R vector is printed you will notice that an index for the vector
is printed in square brackets [] on the side. For example, see this integer
sequence of length 20.
x <- 11:30
x
[1] 11 12 13 14 15 16 17 18 19 20 21 22
[13] 23 24 25 26 27 28 29 30
The numbers in the square brackets are not part of the vector itself, they are
merely part of the printed output.
With R, it's important to understand that there is a difference between the actual R object and the manner in which that R object is printed to the console. Often, the printed output may have additional bells and whistles to make the output more friendly to users. However, these bells and whistles are not inherently part of the object.
Note that the : operator is used to create integer sequences.
R Objects
R has five basic or "atomic" classes of objects:
character
numeric (real numbers)
integer
complex
logical (True/False)
The most basic type of R object is a vector. Empty vectors can be created with
the vector() function. There is really only one rule about vectors in R, which
is: A vector can only contain objects of the same class.
But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later. A list is represented as a vector but can contain objects of different classes. Indeed, that's usually why we use them.
There is also a class for raw objects, but they are not commonly used directly in data analysis and we won't cover them here.
Numbers
Numbers in R are generally treated as numeric objects (double precision real numbers). Integers can be specified explicitly by adding the L suffix (e.g., 1L), and special numeric values include Inf (infinity) and NaN (not a number).
Creating Vectors
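The example code this paragraph refers to is not shown in this excerpt; a minimal sketch of creating vectors of different classes with the c() function (the specific values are illustrative) would be:
x <- c(0.5, 0.6)       ## numeric
x <- c(TRUE, FALSE)    ## logical
x <- c(T, F)           ## logical, using the short-hand forms
x <- c("a", "b", "c")  ## character
x <- 9:29              ## integer
x <- c(1 + 0i, 2 + 4i) ## complex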
Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE. However, in general one should try to use the explicit TRUE and FALSE values when indicating logical values. The T and F values are primarily there for when you're feeling lazy.
You can also use the vector() function to initialize vectors.
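For example (a short sketch; the zeros shown are the default fill value R uses for a numeric vector):
x <- vector("numeric", length = 10)
x
 [1] 0 0 0 0 0 0 0 0 0 0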
Mixing Objects
There are occasions when different classes of R objects get mixed together.
Sometimes this happens by accident but it can also happen on purpose. So
what happens with the following code?
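The code this question refers to is not reproduced in the text; the classic examples (a sketch) are:
y <- c(1.7, "a")    ## character
y <- c(TRUE, 2)     ## numeric
y <- c("a", TRUE)   ## character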
In each case above, we are mixing objects of two different classes in a vector.
But remember that the only rule about vectors says this is not allowed. When
different objects are mixed in a vector, coercion occurs so that every element
in the vector is of the same class.
In the example above, we see the effect of implicit coercion. What R tries to do is find a way to represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what you want and sometimes it does not.
For example, combining a numeric object with a character object will create
a character vector, because numbers can usually be easily represented as
strings.
Explicit Coercion
Objects can be explicitly coerced from one class to another using the as.*
functions, if available.
x <- 0:6
class(x)
[1] "integer"
as.numeric(x)
[1] 0 1 2 3 4 5 6
as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
Sometimes, R can't figure out how to coerce an object and this can result in NAs being produced.
When nonsensical coercion takes place, you will usually get a warning from
R.
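For example (a small sketch of a coercion that cannot succeed):
x <- c("a", "b", "c")
as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA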
Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, number of columns), and a vector can be turned into a matrix by assigning its dim attribute:
m <- 1:10
m
[1] 1 2 3 4 5 6 7 8 9 10
dim(m) <- c(2, 5)
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
x <- 1:3
y <- 10:12
cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
Lists
Lists are a special type of vector that can contain elements of different classes.
Lists are a very important data type in R and you should get to know them
well. Lists, in combination with the various apply functions discussed later,
make for a powerful combination.
Lists can be explicitly created using the list() function, which takes an
arbitrary number of arguments.
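The call that produced the output below is not shown in this excerpt; a call consistent with that output (assuming the first, unshown element is the number 1) would be:
x <- list(1, "a", TRUE, 1 + 4i)
x
[[1]]
[1] 1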
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
We can also create an empty list of a prespecified length with the vector() function:
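The call producing the output below is likewise missing; a consistent sketch (assuming a length of 5, matching the five NULL elements printed) would be:
x <- vector("list", length = 5)
x
[[1]]
NULL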
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
Factors
Factors are used to represent categorical data and can be unordered or ordered. Often factors will be automatically created for you when you read a dataset in using a function like read.table(). Those functions often, as a default, create factors when they encounter data that look like characters or strings.
The order of the levels of a factor can be set using the levels argument to
factor(). This can be important in linear modelling because the first level is
used as the baseline level. This feature can also be used to customize order in
plots that include factors, since by default factors are plotted in the order of
their levels.
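A short sketch of creating a factor and controlling the order of its levels:
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))
x
[1] yes yes no  yes no 
Levels: yes no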
Missing Values
Missing values are denoted in R by NA (or NaN for undefined mathematical operations) and can be tested for with the is.na() function.
Data Frames
Data frames are used to store tabular data in R. They are an important type
of object in R and are used in a variety of statistical modeling applications.
Hadley Wickham's package dplyr has an optimized set of functions designed to work efficiently with data frames, and ggplot2 plotting functions work best with data stored in data frames.
Data frames are represented as a special type of list where every element of
the list has to have the same length. Each element of the list can be thought of
as a column and the length of each element of the list is the number of rows.
Unlike matrices, data frames can store different classes of objects in each
column. Matrices must have every element be the same class (e.g. all integers
or all numeric).
In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called row.names which indicates information about each row of the data frame.
Data frames are usually created by reading in a dataset using the read.table() or read.csv() functions. However, data frames can also be created explicitly with the data.frame() function or they can be coerced from other types of objects like lists.
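A minimal sketch of creating a data frame directly with data.frame():
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
nrow(df)
[1] 4
ncol(df)
[1] 2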
Names
R objects can have names, which is very useful for writing readable code and
self-describing objects. Here is an example of assigning names to an integer
vector.
x <- 1:3
names(x)
NULL
names(x) <- c("New York", "Seattle", "Los Angeles")
x
New York Seattle Los Angeles
1 2 3
names(x)
[1] "New York" "Seattle" "Los Angeles"
$Boston
[1] 2
$London
[1] 3
names(x)
[1] "Los Angeles" "Boston" "London"
Column names and row names can be set separately using the colnames() and
rownames() functions.
Note that for data frames, there is a separate function for setting the row names, the row.names() function. Also, data frames do not have column names, they just have names (like lists). So to set the column names of a data frame just use the names() function. Yes, I know it's confusing. Here's a quick summary:
Object       Set column names    Set row names
data frame   names()             row.names()
matrix       colnames()          rownames()
Attributes
In general, R objects can have attributes, which are like metadata for the
object. These metadata can be very useful in that they help to describe the
object. For example, column names on a data frame help to tell us what data
are contained in each of the columns. Some examples of R object attributes
are
names, dimnames
dimensions (e.g. matrices, arrays)
class (e.g. integer, numeric)
length
other user-defined attributes/metadata
Attributes of an object (if any) can be accessed using the attributes() function.
Not all R objects contain attributes, in which case the attributes() function
returns NULL.
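For example (a small sketch using the named vector from earlier in this section):
x <- 1:3
names(x) <- c("New York", "Seattle", "Los Angeles")
attributes(x)
$names
[1] "New York"    "Seattle"     "Los Angeles"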
Summary
All R objects can have attributes that help to describe what is in the object.
Perhaps the most useful attributes are names, such as column and row names
in a data frame, or simply names in a vector or list. Attributes like dimensions
are also important as they can modify the behavior of objects, like turning a
vector into a matrix.
Define tidy data and to transform non-tidy data into tidy data
One unifying concept of this book is the notion of tidy data. As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
The purpose of defining tidy data is to highlight the fact that most data do not
start out life as tidy. In fact, much of the work of data analysis may involve
simply making the data tidy (at least this has been our experience). Once a
dataset is tidy, it can be used as input into a variety of other functions that
may transform, model, or visualize the data.
As a quick example, consider the following data illustrating death rates in
Virginia in 1940 in a classic table format:
While this format is canonical and is useful for quickly observing the rela-
tionship between multiple variables, it is not tidy. This format violates the
tidy form because there are variables in both the rows and columns. In this
case the variables are age category, gender, and urban-ness. Finally, the death
rate itself, which is the fourth variable, is presented inside the table.
Converting this data to tidy format would give us
library(tidyr)
library(dplyr)
VADeaths %>%
tbl_df() %>%
mutate(age = row.names(VADeaths)) %>%
gather(key, death_rate, -age) %>%
separate(key, c("urban", "gender"), sep = " ") %>%
mutate(age = factor(age), urban = factor(urban), gender = factor(gender))
# A tibble: 20 × 4
age urban gender death_rate
<fctr> <fctr> <fctr> <dbl>
1 50-54 Rural Male 11.7
2 55-59 Rural Male 18.1
3 60-64 Rural Male 26.9
4 65-69 Rural Male 41.0
5 70-74 Rural Male 66.0
6 50-54 Rural Female 8.7
7 55-59 Rural Female 11.7
8 60-64 Rural Female 20.3
9 65-69 Rural Female 30.9
10 70-74 Rural Female 54.3
11 50-54 Urban Male 15.4
12 55-59 Urban Male 24.3
13 60-64 Urban Male 37.0
14 65-69 Urban Male 54.6
15 70-74 Urban Male 71.1
16 50-54 Urban Female 8.4
17 55-59 Urban Female 13.6
18 60-64 Urban Female 19.3
19 65-69 Urban Female 35.1
20 70-74 Urban Female 50.0
The Tidyverse
There are a number of R packages that take advantage of the tidy data form
and can be used to do interesting things with data. Many (but not all) of these
packages are written by Hadley Wickham and the collection of packages is
sometimes referred to as the tidyverse because of their dependence on and
presumption of tidy data. Tidyverse packages include
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(tidyverse)
Read tabular data into R and read in web data via web scraping tools
and APIs
The readr package is the primary means by which we will read tabular data, most notably, comma-separated-value (CSV) files. The readr package has a few functions in it for reading and writing tabular data; we will focus on the read_csv function. The readr package is available on CRAN and the code for the package is maintained on GitHub.
The importance of the read_csv function is perhaps better understood from an historical perspective. R's built-in read.csv function similarly reads CSV files, but the read_csv function in readr builds on that by removing some of the quirks and gotchas of read.csv as well as dramatically optimizing the speed with which it can read data into R. The read_csv function also adds some nice user-oriented features like a progress meter and a compact method for specifying column types.
The only required argument to read_csv is a character string specifying the
path to the file to read. A typical call to read_csv will look as follows.
library(readr)
teams <- read_csv("data/team_standings.csv")
Parsed with column specification:
cols(
Standing = col_integer(),
Team = col_character()
)
teams
# A tibble: 32 × 2
Standing Team
<int> <chr>
1 1 Spain
2 2 Netherlands
3 3 Germany
4 4 Uruguay
5 5 Argentina
6 6 Brazil
7 7 Ghana
8 8 Paraguay
9 9 Japan
10 10 Chile
# ... with 22 more rows
By default, read_csv will open a CSV file and read it in line-by-line. It will also
(by default), read in the first few rows of the table in order to figure out
the type of each column (i.e. integer, character, etc.). In the code example
above, you can see that read_csv has correctly assigned an integer class to
the Standing variable in the input data and a character class to the Team
variable. From the read_csv help page:
If [the argument for col_types is] NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.
You can also specify the type of each column with the col_types argument.
In general, it's a good idea to specify the column types explicitly. This rules out any possible guessing errors on the part of read_csv. Also, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.
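The call this note refers to is not reproduced in this excerpt; a sketch consistent with the description below (two character columns in the team standings file) would be:
teams <- read_csv("data/team_standings.csv", col_types = "cc")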
Note that the col_types argument accepts a compact representation. Here "cc"
indicates that the first column is character and the second column is character
(there are only two columns). Using the col_types argument is useful because
often it is not easy to automatically figure out the type of a column by looking
at a few rows (especially if a column has many missing values).
The read_csv function will also read compressed files automatically. There is
no need to decompress the file first or use the gzfile connection function.
The following call reads a gzip-compressed CSV file containing download logs
from the RStudio CRAN mirror.
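The original call is not shown in this excerpt; a sketch with a hypothetical file name (the actual gzipped log file is not included here) would be:
logs <- read_csv("data/cran-logs.csv.gz", n_max = 10)  # hypothetical file name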
Note that the message (Parsed with column specification ...) printed after the call indicates that read_csv may have had some difficulty identifying the type of each column. This can be solved by using the col_types argument.
You can specify the column type in a more detailed fashion by using the
various col_* functions. For example, in the log data above, the first column
is actually a date, so it might make more sense to read it in as a Date variable.
If we wanted to just read in that first column, we could do
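A sketch, again using a hypothetical log file name, that reads only the first column as a Date with cols_only and col_date (the column is assumed here to be named date):
logdates <- read_csv("data/cran-logs.csv.gz",          # hypothetical file name
                     col_types = cols_only(date = col_date()),
                     n_max = 10)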
Now the date column is stored as a Date object which can be used for relevant
date-related computations (for example, see the lubridate package).
The read_csv function has a progress option that defaults to TRUE. This option provides a nice progress meter while the CSV file is being read. However, if you are using read_csv in a function, or perhaps embedding it in a loop, it's probably best to set progress = FALSE.
The readr package includes a variety of functions in the read_* family that
allow you to read in data from different formats of flat files. The following
table gives a guide to several functions in the read_* family.
Not only can you read in data locally stored on your computer, with R it is
also fairly easy to read in data stored on the web.
The simplest way to do this is if the data is available online as a flat file (see note below). For example, the Extended Best Tracks for the North Atlantic are hurricane tracks that include both the best estimate of the central location of each storm and also estimates of how far winds of certain speeds extended from the storm's center in four quadrants of the storm (northeast, northwest, southeast, southwest) at each measurement point. You can see this file online here.
How can you tell if you've found a flat file online? Here are a couple of clues:
It will not have any formatting. Instead, it will look online as if you
opened a file in a text editor on your own computer.
It will often have a web address that ends with a typical flat file
extension (".csv", ".txt", or ".fwf", for example).
If you copy and paste the web address for this file, you'll see that the URL for this example hurricane data file is non-secure (starts with http:) and that it ends with a typical flat file extension (.txt, in this case). You can read this file into your R session using the same readr function that you would use to read it in if the file were stored on your computer.
First, you can create an R object with the filepath to the file. In the case of online files, that's the URL. To fit the long web address comfortably in an R script window, you can use the paste0 function to paste pieces of the web address together:
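A sketch of that pattern (the server and path here are placeholders; substitute the actual web address of the hurricane tracks file linked in the text):
ext_tracks_file <- paste0("http://some.server.example.com/",   # placeholder host
                          "path/to/hurricane/tracks/",          # placeholder path
                          "ebtrk_atlc_1988_2015.txt")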
Next, since this web-based file is a fixed width file, you'll need to define the width of each column, so that R will know where to split between columns. You can then use the read_fwf function from the readr package to read the file into your R session. This data, like a lot of weather data, uses the string "-99" for missing data, and you can specify that missing value character with the na argument in read_fwf. Also, the online file does not include column names,
so you'll have to use the data documentation file for the dataset to determine and set those yourself.
library(readr)
# Create a vector of column names, based on the online documentation for this data
ext_tracks_colnames <- c("storm_id", "storm_name", "month", "day",
"hour", "year", "latitude", "longitude",
"max_wind", "min_pressure", "rad_max_wind",
"eye_diameter", "pressure_1", "pressure_2",
paste("radius_34", c("ne", "se", "sw", "nw"), sep = "_"),
paste("radius_50", c("ne", "se", "sw", "nw"), sep = "_"),
paste("radius_64", c("ne", "se", "sw", "nw"), sep = "_"),
"storm_type", "distance_to_land", "final")
For some fixed width files, you may be able to save the trouble of counting column widths by using the fwf_empty function in the readr package. This function guesses the widths of columns based on the positions of empty columns. However, the example hurricane dataset we are using here is a bit too messy for this: in some cases, there are values from different columns that are not separated by white space. Just as it is typically safer for you to specify column types yourself, rather than relying on R to correctly guess them, it is also safer when reading in a fixed width file to specify column widths yourself.
You can use some dplyr functions to check out the dataset once it's in R (there will be much more about dplyr in the next section). For example, the following call prints a sample of four rows of data from Hurricane Katrina, with, for each row, the date and time, maximum wind speed, minimum pressure, and the radius of maximum winds of the storm for that observation:
library(dplyr)
ext_tracks %>%
filter(storm_name == "KATRINA") %>%
select(month, day, hour, max_wind, min_pressure, rad_max_wind) %>%
sample_n(4)
# A tibble: 4 × 6
month day hour max_wind min_pressure rad_max_wind
<chr> <chr> <chr> <int> <int> <int>
1 11 01 12 20 1011 90
2 08 25 18 60 988 15
3 08 24 00 30 1007 40
4 08 29 12 110 923 20
With the functions in the readr package, you can also read in flat files from secure URLs (ones that start with https:). (This is not true with the read.table family of functions from base R.) One example where it is common to find flat files on secure sites is on GitHub. If you find a file with a flat file extension in a GitHub repository, you can usually click on it and then choose to view the Raw version of the file, and get to the flat file version of the file.
For example, the CDC Epidemic Prediction Initiative has a GitHub repository
with data on Zika cases, including files on cases in Brazil. When we wrote
this, the most current file was available here, with the raw version (i.e., a flat
file) available by clicking the Raw button on the top right of the first site.
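The read step is not reproduced in this excerpt; a sketch using a hypothetical raw GitHub URL (the actual address is the Raw link described above) would be:
library(readr)
zika_file <- paste0("https://raw.githubusercontent.com/cdcepi/zika/master/",
                    "path/to/the/current/brazil/file.csv")   # hypothetical path
zika_brazil <- read_csv(zika_file)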
zika_brazil %>%
select(location, value, unit)
# A tibble: 210 × 3
location value unit
<chr> <int> <chr>
1 Brazil-Acre 2 cases
2 Brazil-Alagoas 75 cases
3 Brazil-Amapa 7 cases
4 Brazil-Amazonas 8 cases
5 Brazil-Bahia 263 cases
6 Brazil-Ceara 124 cases
7 Brazil-Distrito_Federal 5 cases
8 Brazil-Espirito_Santo 13 cases
9 Brazil-Goias 14 cases
10 Brazil-Maranhao 131 cases
# ... with 200 more rows
Web APIs are growing in popularity as a way to access open data from
government agencies, companies, and other organizations. API stands for
Application Program Interface; an API provides the rules for software applications to interact. In the case of open data APIs, they provide the rules you need to know to write R code to request and pull data from the organization's web server into your R session. Usually, some of the computational burden of querying and subsetting the data is taken on by the source's server, to create the subset of requested data to pass to your computer. In practice, this means
you can often pull the subset of data you want from a very large available
dataset without having to download the full dataset and load it locally into
your R session.
As an overview, the basic steps for accessing and using data from a web API when working in R are:
1. Figure out the API rules for HTTP requests
2. Write R code to create a request in the proper format
3. Send the request using GET or POST HTTP methods
4. Once you get back data from the request, parse it into an easier-to-use format if necessary
To get the data from an API, you should first read the organization's API documentation. An organization will post details on what data is available through their API(s), as well as how to set up HTTP requests to get that data. To request the data through the API, you will typically need to send the organization's web server an HTTP request using a GET or POST method.
The API documentation details will typically show an example GET or POST
request for the API, including the base URL to use and the possible query
parameters that can be used to customize the dataset request.
For example, the National Aeronautics and Space Administration (NASA)
has an API for pulling the Astronomy Picture of the Day. In their API doc-
umentation, they specify that the base URL for the API request should be
https://api.nasa.gov/planetary/apod and that you can include parameters
to specify the date of the daily picture you want, whether to pull a high-resolution version of the picture, and an API key you have requested from NASA.
Many organizations will require you to get an API key and use this key in
each of your API requests. This key allows the organization to control API
access, including enforcing rate limits per user. API rate limits restrict how
often you can request data (e.g., an hourly limit of 1,000 requests per user for
NASA APIs).
API keys should be kept private, so if you are writing code that includes an
API key, be very careful not to include the actual key in any code made public
(including any code in public GitHub repositories). One way to do this is to
save the value of your key in a file named .Renviron in your home directory.
This file should be a plain text file and must end in a blank line. Once you've
saved your API key to a global variable in that file (e.g., with a line added to the
.Renviron file like NOAA_API_KEY="abdafjsiopnab038"), you can assign the key value
to an R object in an R session using the Sys.getenv function (e.g., noaa_api_key <-
Sys.getenv("NOAA_API_KEY")), and then use this object (noaa_api_key) anywhere
you would otherwise have used the character string with your API key.
library(httr)
meso_url <- "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/"
denver <- GET(url = meso_url,
query = list(station = "DEN",
data = "sped",
year1 = "2016",
month1 = "6",
day1 = "1",
year2 = "2016",
month2 = "6",
day2 = "30",
tz = "America/Denver",
format = "comma")) %>%
content() %>%
read_csv(skip = 5, na = "M")
The content call in this code extracts the content from the response to the
HTTP request sent by the GET function. The Iowa Environmental Mesonet
API offers the option to return the requested data in a comma-separated
file (format = "comma" in the GET request), so here content and read_csv are
used to extract and read in that csv file. Usually, data will be returned
in a JSON format instead. We include more details later in this section on
parsing data returned in a JSON format.
The only tricky part of this process is figuring out the available parameter
names (e.g., station) and possible values for each (e.g., "DEN" for Denver).
Currently, the details you can send in an HTTP request through Iowa Environmental Mesonet's API include:
The weather station(s) for which you'd like data (station)
The weather variables to pull, such as wind speed (data)
Starting and ending dates describing the range for which you'd like to pull data (year1, month1, day1, year2, month2, day2)
The time zone to use for date-times for the weather observations (tz)
Different formatting options (e.g., delimiter to use in the resulting data file [format], whether to include longitude and latitude)
Typically, these parameter names and possible values are explained in the
API documentation. In some cases, however, the documentation will be
limited. In that case, you may be able to figure out possible values, especially
if the API specifies a GET rather than POST method, by playing around with the website's point-and-click interface and then looking at the URL for the resulting data pages. For example, if you look at the Iowa Environmental Mesonet's page for accessing this data, you'll notice that the point-and-click web interface allows you the options in the list above, and if you click through to access a dataset using this interface, the web address of the data page includes these parameter names and values.
The riem package implements all these ideas in three very clean and straightforward functions. You can explore the code behind this package and see how these ideas can be incorporated into a small R package, in the /R directory of the package's GitHub page.
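For example, the riem_measures function can pull the same Denver observations requested with httr above (a sketch; riem_measures takes a station identifier and start and end dates):
library(riem)
denver <- riem_measures(station = "DEN",
                        date_start = "2016-06-01",
                        date_end = "2016-06-30")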
R packages already exist for many open data APIs. If an R package already
exists for an API, you can use functions from that package directly, rather
than writing your own code using the API protocols and httr functions. Other
examples of existing R packages to interact with open data APIs include:
twitteR: Twitter
rnoaa: National Oceanic and Atmospheric Administration
Quandl: Quandl (financial data)
RGoogleAnalytics: Google Analytics
censusr, acs: United States Census
WDI, wbstats: World Bank
GuardianR, rdian: The Guardian Media Group
blsAPI: Bureau of Labor Statistics
rtimes: New York Times
dataRetrieval, waterData: United States Geological Survey
If an R package doesn't exist for an open API and you'd like to write your own package, find out more about writing API packages with this vignette for the httr package. This document includes advice on error handling within R code that accesses data through an open API.
You can also use R to pull and clean web-based data that is not accessible
through a web API or as an online flat file. In this case, the strategy will often
be to pull in the full web page file (often in HTML or XML) and then parse or
clean it within R.
The rvest package is a good entry point for handling more complex collec-
tion and cleaning of web-based data. This package includes functions, for
example, that allow you to select certain elements from the code for a web
page (e.g., using the html_node and xml_node functions), to parse tables in an
HTML document into R data frames (html_table), and to parse, fill out, and
submit HTML forms (html_form, set_values, submit_form). Further details on web
scraping with R are beyond the scope of this course, but if you're interested,
you can find out more through the rvest GitHub README.
Often, data collected from the web, including the data returned from an open
API or obtained by scraping a web page, will be in JSON, XML, or HTML
format. To use data in a JSON, XML, or HTML format in R, you need to parse
the file from its current format and convert it into an R object more useful for
analysis.
Typically, JSON-, XML-, or HTML-formatted data is parsed into a list in R,
since list objects allow for a lot of flexibility in the structure of the data.
However, if the data is structured appropriately, you can often parse data
into another type of object (a data frame, for example, if the data fits well
into a two-dimensional format of rows and columns). If the data structure
of the data that you are pulling in is complex but consistent across different
observations, you may alternatively want to create a custom object type to
parse the data into.
There are a number of packages for parsing data from these formats, includ-
ing jsonlite and xml2. To find out more about parsing data from typical web
formats, and for more on working with web-based documents and data, see
the CRAN task view for Web Technologies and Services
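As a small illustration of parsing JSON into an R object, the fromJSON function in jsonlite converts a JSON string into a list or data frame (the JSON shown here is an invented example):
library(jsonlite)
storm_json <- '{"storm_name": "ANDREW", "year": 1992, "max_wind": 150}'
storm <- fromJSON(storm_json)
storm$storm_name
[1] "ANDREW"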
The two packages dplyr and tidyr, both tidyverse packages, allow you to
quickly and fairly easily clean up your data. These packages are not very
old, and so much of the example R code you might find in books or online
might not use the functions we use in examples in this section (although this
is quickly changing for new books and for online examples). Further, there
are many people who are used to using R base functionality to clean up their
data, and some of them still do not use these packages much when cleaning
data. We think, however, that dplyr is easier for people new to R to learn than cleaning up data using base R functions, and we also think it produces code that is much easier to read, which is useful in maintaining and sharing code.
For many of the examples in this section, we will use the ext_tracks hurricane
dataset we input from a url as an example in a previous section of this book.
If you need to load a version of that data, we have also saved it locally, so you
can create an R object with the example data for this section by running:
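The loading code is not included in this excerpt; a sketch assuming the data was saved locally as a CSV file (the file path is hypothetical) would be:
library(readr)
ext_tracks <- read_csv("data/ext_tracks.csv")   # hypothetical local copy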
Piping
The dplyr and tidyr functions are often used in conjunction with piping,
which is done with the %>% function from the magrittr package. Piping can be
done with many R functions, but is especially common with dplyr and tidyr
functions. The concept is straightforward: the pipe passes the data frame output that results from the function right before the pipe to be the first argument of the function right after the pipe.
Here is a generic view of how this works in code, for a pseudo-function named
function that inputs a data frame as its first argument:
# Without piping
function(dataframe, argument_2, argument_3)
# With piping
dataframe %>%
function(argument_2, argument_3)
For example, without piping, if you wanted to see the time, date, and maxi-
mum winds for Katrina from the first three rows of the ext_tracks hurricane
data, you could run:
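The step-by-step version is not shown in the text; a sketch equivalent to the piped call shown further below would be:
katrina <- filter(ext_tracks, storm_name == "KATRINA")
katrina_reduced <- select(katrina, month, day, hour, max_wind)
head(katrina_reduced, 3)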
In this code, you are creating new R objects at each step, which makes the
code cluttered and also requires copying the data frame several times into
memory. As an alternative, you could just wrap one function inside another:
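Again a sketch, nesting the same calls:
head(select(filter(ext_tracks, storm_name == "KATRINA"),
            month, day, hour, max_wind), 3)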
This avoids re-assigning the data frame at each step, but quickly becomes ungainly, and it's easy to put arguments in the wrong layer of parentheses. Piping avoids these problems, since at each step you can send the output from the last function into the next function as that next function's first argument:
ext_tracks %>%
filter(storm_name == "KATRINA") %>%
select(month, day, hour, max_wind) %>%
head(3)
# A tibble: 3 × 4
month day hour max_wind
<chr> <chr> <chr> <int>
1 10 28 18 30
2 10 29 00 30
3 10 29 06 30
Summarizing data
The dplyr and tidyr packages have numerous functions (sometimes referred to as verbs) for cleaning up data. We'll start with the functions to summarize data.
The primary one of these is summarize, which inputs a data frame and creates a
new data frame with the requested summaries. In conjunction with summarize,
you can use other functions from dplyr (e.g., n, which counts the number of
observations in a given column) to create this summary. You can also use R
functions from other packages or base R functions to create the summary.
For example, say we want a summary of the number of observations in
the ext_tracks hurricane dataset, as well as the highest measured maximum
windspeed (given by the column max_wind in the dataset) in any of the storms,
and the lowest minimum pressure (min_pressure). To create this summary, you
can run:
ext_tracks %>%
summarize(n_obs = n(),
worst_wind = max(max_wind),
worst_pressure = min(min_pressure))
# A tibble: 1 × 3
n_obs worst_wind worst_pressure
<int> <int> <int>
1 11824 160 0
This summary provides particularly useful information for this example data,
because it gives an unrealistic value for minimum pressure (0 hPa). This
shows that this dataset will need some cleaning. The highest wind speed
observed for any of the storms, 160 knots, is more reasonable.
You can also use summarize with functions you've written yourself, which gives
you a lot of power in summarizing data in interesting ways. As a simple
example, if you wanted to present the maximum wind speed in the summary
above using miles per hour rather than knots, you could write a function to
perform the conversion, and then use that function within the summarize call:
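The conversion function itself is not reproduced in this excerpt; consistent with the output below (184.32 mph for 160 knots), it multiplies knots by 1.152:
knots_to_mph <- function(knots){
  mph <- 1.152 * knots
  mph
}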
ext_tracks %>%
summarize(n_obs = n(),
worst_wind = knots_to_mph(max(max_wind)),
worst_pressure = min(min_pressure))
# A tibble: 1 × 3
n_obs worst_wind worst_pressure
<int> <dbl> <int>
1 11824 184.32 0
So far, weve only used summarize to create a single-line summary of the data
frame. In other words, the summary functions are applied across the entire
dataset, to return a single value for each summary statistic. However, often
you might want summaries stratified by a certain grouping characteristic of
the data. For the hurricane data, for example, you might want to get the worst
wind and worst pressure by storm, rather than across all storms.
You can do this by grouping your data frame by one of its column variables,
using the function group_by, and then using summarize. The group_by function
does not make a visible change to a data frame, although you can see, if you
print out a grouped data frame, that the new grouping variable will be listed
under Groups at the top of a print-out:
ext_tracks %>%
group_by(storm_name, year) %>%
head()
Source: local data frame [6 x 29]
Groups: storm_name, year [1]
As a note, since hurricane storm names repeat at regular intervals until they
are retired, to get a separate summary for each unique storm, this example
requires grouping by both storm_name and year.
Even though applying the group_by function does not cause a noticeable
change to the data frame itself, you'll notice the difference in grouped and
ungrouped data frames when you use summarize on the data frame. If a data
frame is grouped, all summaries are calculated and given separately for each
unique value of the grouping variable:
ext_tracks %>%
group_by(storm_name, year) %>%
summarize(n_obs = n(),
worst_wind = max(max_wind),
worst_pressure = min(min_pressure))
Source: local data frame [378 x 5]
Groups: storm_name [?]
This grouping / summarizing combination can be very useful for quickly plot-
ting interesting summaries of a dataset. For example, to plot a histogram of
maximum wind speed observed for each storm (Figure @ref(fig:windhistogram)),
you could run:
library(ggplot2)
ext_tracks %>%
group_by(storm_name) %>%
summarize(worst_wind = max(max_wind)) %>%
ggplot(aes(x = worst_wind)) + geom_histogram()
Histogram of the maximum wind speed observed during a storm for all Atlantic basin
tropical storms, 1988-2015.
From Figure @ref(fig:windhistogram), we can see that only two storms had maximum wind speeds at or above 160 knots (we'll check this later with some
other dplyr functions).
When cleaning up data, you will need to be able to create subsets of the data,
by selecting certain columns or filtering down to certain rows. These actions
can be done using the dplyr functions select and filter.
The select function subsets certain columns of a data frame. The most basic way to use select is to select certain columns by specifying their full column
names. For example, to select the storm name, date, time, latitude, longitude,
and maximum wind speed from the ext_tracks dataset, you can run:
ext_tracks %>%
select(storm_name, month, day, hour, year, latitude, longitude, max_wind)
# A tibble: 11,824 × 8
storm_name month day hour year latitude longitude max_wind
<chr> <chr> <chr> <chr> <int> <dbl> <dbl> <int>
1 ALBERTO 08 05 18 1988 32.0 77.5 20
2 ALBERTO 08 06 00 1988 32.8 76.2 20
3 ALBERTO 08 06 06 1988 34.0 75.2 20
4 ALBERTO 08 06 12 1988 35.2 74.6 25
5 ALBERTO 08 06 18 1988 37.0 73.5 25
6 ALBERTO 08 07 00 1988 38.7 72.4 25
7 ALBERTO 08 07 06 1988 40.0 70.8 30
8 ALBERTO 08 07 12 1988 41.5 69.0 35
9 ALBERTO 08 07 18 1988 43.0 67.5 35
10 ALBERTO 08 08 00 1988 45.0 65.5 35
# ... with 11,814 more rows
There are several functions you can use with select that give you more
flexibility, and so allow you to select columns without specifying the full
names of each column. For example, the starts_with function can be used
within a select function to pick out all the columns that start with a certain
text string. As an example of using starts_with in conjunction with select, in
the ext_tracks hurricane data, there are a number of columns that say how
far from the storm center winds of certain speeds extend. Tropical storms
often have asymmetrical wind fields, so these wind radii are given for each
quadrant of the storm (northeast, southeast, northwest, and southeast of the
storms center). All of the columns with the radius to which winds of 34 knots
or more extend start with radius_34. To get a dataset with storm names,
location, and radii of winds of 34 knots, you could run:
ext_tracks %>%
select(storm_name, latitude, longitude, starts_with("radius_34"))
# A tibble: 11,824 × 7
storm_name latitude longitude radius_34_ne radius_34_se radius_34_sw
<chr> <dbl> <dbl> <int> <int> <int>
1 ALBERTO 32.0 77.5 0 0 0
2 ALBERTO 32.8 76.2 0 0 0
3 ALBERTO 34.0 75.2 0 0 0
4 ALBERTO 35.2 74.6 0 0 0
5 ALBERTO 37.0 73.5 0 0 0
6 ALBERTO 38.7 72.4 0 0 0
7 ALBERTO 40.0 70.8 0 0 0
8 ALBERTO 41.5 69.0 100 100 50
9 ALBERTO 43.0 67.5 100 100 50
10 ALBERTO 45.0 65.5 NA NA NA
# ... with 11,814 more rows, and 1 more variables: radius_34_nw <int>
Other functions that can be used with select in a similar way include:
ends_with: Select all columns that end with a certain string (for example,
select(ext_tracks, ends_with("ne")) to get all the wind radii for the north-
east quadrant of a storm for the hurricane example data)
contains: Select all columns that include a certain string (select(ext_-
tracks, contains("34")) to get all wind radii for 34-knot winds)
matches: Select all columns that match a certain regular expression (select(ext_tracks, matches("_[0-9][0-9]_")) to get all columns where the column name includes two numbers between two underscores, a pattern that matches all of the wind radii columns)
While select picks out certain columns of the data frame, filter picks out
certain rows. With filter, you can specify certain conditions using Rs logical
operators, and the function will return rows that meet those conditions.
R's logical operators include:
==        equals
!=        does not equal
>         greater than
>=        greater than or equal to
<         less than
<=        less than or equal to
%in%      is included in
is.na()   is a missing value
If you are ever unsure of how to write a logical statement, but know how to
write its opposite, you can use the ! operator to negate the whole statement.
For example, if you wanted to get all storms except those named KATRINA
and ANDREW, you could use !(storm_name %in% c("KATRINA", "ANDREW")). A
common use of this is to identify observations with non-missing data (e.g.,
!(is.na(radius_34_ne))).
A logical statement, run by itself on a vector, will return a vector of the same
length with TRUE every time the condition is met and FALSE every time it is not.
head(ext_tracks$hour)
[1] "18" "00" "06" "12" "18" "00"
head(ext_tracks$hour == "00")
[1] FALSE TRUE FALSE FALSE FALSE TRUE
When you use a logical statement within filter, it will return just the rows
where the logical statement is true:
ext_tracks %>%
select(storm_name, hour, max_wind) %>%
head(9)
# A tibble: 9 × 3
storm_name hour max_wind
<chr> <chr> <int>
1 ALBERTO 18 20
2 ALBERTO 00 20
3 ALBERTO 06 20
4 ALBERTO 12 25
5 ALBERTO 18 25
6 ALBERTO 00 25
7 ALBERTO 06 30
8 ALBERTO 12 35
9 ALBERTO 18 35
ext_tracks %>%
select(storm_name, hour, max_wind) %>%
filter(hour == "00") %>%
head(3)
# A tibble: 3 × 3
storm_name hour max_wind
<chr> <chr> <int>
1 ALBERTO 00 20
2 ALBERTO 00 25
3 ALBERTO 00 35
Filtering can also be done after summarizing data. For example, to determine
which storms had maximum wind speed equal to or above 160 knots, run:
ext_tracks %>%
group_by(storm_name, year) %>%
summarize(worst_wind = max(max_wind)) %>%
filter(worst_wind >= 160)
Source: local data frame [2 x 3]
Groups: storm_name [2]
If you would like to string several logical conditions together and select rows
where all or any of the conditions are true, you can use the and (&) or or (|)
operators. For example, to pull out observations for Hurricane Andrew when
it was at or above Category 5 strength (137 knots or higher), you could run:
ext_tracks %>%
select(storm_name, month, day, hour, latitude, longitude, max_wind) %>%
filter(storm_name == "ANDREW" & max_wind >= 137)
# A tibble: 2 × 7
storm_name month day hour latitude longitude max_wind
<chr> <chr> <chr> <chr> <dbl> <dbl> <int>
1 ANDREW 08 23 12 25.4 74.2 145
2 ANDREW 08 23 18 25.4 75.8 150
If you want to check that two things are equal, make sure you use double equal signs (==), not a single one. At best, a single equals sign won't work; in some cases, it will cause a variable to be re-assigned (= can be used for assignment, just like <-).
If you are trying to check if one thing is equal to one of several
things, use %in% rather than ==. For example, if you want to
filter to rows of ext_tracks with storm names of KATRINA
and ANDREW, you need to use storm_name %in% c("KATRINA",
"ANDREW"), not storm_name == c("KATRINA", "ANDREW").
If you want to identify observations with missing values (or
without missing values), you must use the is.na function,
not == or !=. For example, is.na(radius_34_ne) will work, but
radius_34_ne == NA will not.
The mutate function in dplyr can be used to add new columns to a data frame or change existing columns in the data frame. As an example, I'll use the worldcup dataset from the faraway package, which contains statistics from the 2010 World Cup. To load this example data frame, you can run:
library(faraway)
data(worldcup)
This dataset has observations by player, including the player's team, position, amount of time played in this World Cup, and number of shots, passes, tackles, and saves. This dataset is currently not tidy, as it has one of the variables (players' names) as rownames, rather than as a column of the data frame. You can use the mutate function to move the player names to their own column:
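The call is not shown in this excerpt; a sketch consistent with the later rename(Name = player_name) step would be:
library(dplyr)
worldcup <- worldcup %>%
  mutate(player_name = rownames(worldcup))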
You can also use mutate in coordination with group_by to create new columns that give summaries within certain windows of the data. For example, the following code will add a column with the average number of shots for a player's position added as a new column. While this code is summarizing the original data to generate the values in this column, mutate will add these repeated summary values to the original dataset by group, rather than returning a dataframe with a single row for each of the grouping variables (try replacing mutate with summarize in this code to make sure you understand the difference).
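A sketch of that group_by / mutate combination, consistent with the ave_shots column that appears in the output below:
worldcup <- worldcup %>%
  group_by(Position) %>%
  mutate(ave_shots = mean(Shots)) %>%
  ungroup()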
If there is a column that you want to rename, but not change, you can use the
rename function. For example:
worldcup %>%
rename(Name = player_name) %>%
slice(1:3)
# A tibble: 3 × 9
Team Position Time Shots Passes Tackles Saves Name ave_shots
<fctr> <fctr> <int> <int> <int> <int> <int> <chr> <dbl>
1 Algeria Midfielder 16 0 6 0 0 Abdoun 2.394737
2 Japan Midfielder 351 0 101 14 0 Abe 2.394737
3 France Defender 180 0 91 6 0 Abidal 1.164894
The tidyr package includes functions to transfer a data frame between long and wide formats. Wide format data tends to have different attributes or variables describing an observation placed in separate columns. Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.
In the section on tidy data, we showed an example that used gather to convert
data into a tidy format. The data is first in an untidy format:
data("VADeaths")
head(VADeaths)
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
After changing the age categories from row names to a variable (which can be done with the mutate function), the key problem with the tidiness of the data is that the variables of urban / rural and male / female are not in their own columns, but rather are embedded in the structure of the columns.
To fix this, you can use the gather function to gather values spread across several columns into a single column, with the column names gathered into a key column. When gathering, exclude any columns that you don't want gathered (age in this case) by including the column names with a minus sign in the gather function. For example:
data("VADeaths")
library(tidyr)
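# The gather step itself is missing from this excerpt; a sketch mirroring the
# pipeline shown earlier in the tidy data section:
library(dplyr)
VADeaths %>%
  tbl_df() %>%
  mutate(age = row.names(VADeaths)) %>%
  gather(key, death_rate, -age)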
Even if your data is in a tidy format, gather is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a certain variable. For example:
library(tidyr)
library(ggplot2)
worldcup %>%
select(Position, Time, Shots, Tackles, Saves) %>%
gather(Type, Number, -Position, -Time) %>%
ggplot(aes(x = Time, y = Number)) +
geom_point() +
facet_grid(Type ~ Position)
Example of a faceted plot created by taking advantage of the gather function to pull
together data.
The spread function is less commonly needed to tidy data. It can, however,
be useful for creating summary tables. For example, if you wanted to print a
table of the average number and range of passes by position for the top four
teams in this World Cup (Spain, Netherlands, Uruguay, and Germany), you
could run:
library(knitr)
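# The summary code is not reproduced in this excerpt; a sketch consistent with
# the description below (spread uses Position for the new column names and
# pass_summary for the cell values):
library(dplyr)
library(tidyr)
worldcup %>%
  filter(Team %in% c("Spain", "Netherlands", "Uruguay", "Germany")) %>%
  select(Team, Position, Passes) %>%
  group_by(Team, Position) %>%
  summarize(ave_passes = mean(Passes),
            min_passes = min(Passes),
            max_passes = max(Passes),
            pass_summary = paste0(round(ave_passes), " (", min_passes,
                                  ", ", max_passes, ")")) %>%
  select(Team, Position, pass_summary) %>%
  spread(Position, pass_summary) %>%
  kable()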
Notice in this example how spread has been used at the very end of the code
sequence to convert the summarized data into a shape that offers a better
tabular presentation for a report. In the spread call, you first specify the name
of the column to use for the new column names (Position in this example) and
then specify the column to use for the cell values (pass_summary here).
In this code, I've used the kable function from the knitr package to create the summary table in a table format, rather than as basic R output. This function is very useful for formatting basic tables in R markdown documents. For more complex tables, check out the pander and xtable packages.
Merging datasets
Often, you will have data in two separate datasets that you'd like to combine based on a common variable or variables. For example, for the World Cup example data we've been using, it would be interesting to add in a column with the final standing of each player's team. We've included data with that information in a file called team_standings.csv, which can be read into the R object team_standings with the call:
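This is a sketch repeating the read_csv call shown earlier in this chapter:
library(readr)
team_standings <- read_csv("data/team_standings.csv")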
This data frame has one observation per team, and the team names are
consistent with the team names in the worldcup data frame.
You can use the different functions from the *_join family to merge this team
standing data with the player statistics in the worldcup data frame. Once you've done that, you can use other data cleaning tools from dplyr to quickly pull and explore interesting parts of the dataset. The main arguments for the *_join functions are the object names of the two data frames to join and by,
which specifies which variables to use to match up observations from the
two dataframes.
There are several functions in the *_join family. These functions all merge together two data frames; they differ in how they handle observations that exist in one but not both data frames. Here are the four functions from this family that you will likely use the most often:
left_join: Includes all observations in the left data frame, whether or not there is a match in the right data frame
right_join: Includes all observations in the right data frame, whether or not there is a match in the left data frame
inner_join: Includes only observations that are in both data frames
full_join: Includes all observations from both data frames
In this table, the left data frame refers to the first data frame input in the
*_join call, while the right data frame refers to the second data frame input
into the function. For example, in the call
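The call referred to here is not reproduced in the text; a sketch consistent with the discussion that follows would be:
left_join(world_cup, team_standings, by = "Team")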
the world_cup data frame is the left data frame and the team_standings data
frame is the right data frame. Therefore, using left_join would include all
rows from world_cup, whether or not the player had a team listed in team_standings, while right_join would include all the rows from team_standings, whether
or not there were any players from that team in world_cup.
Remember that if you are using piping, the first data frame (left
for these functions) is by default the dataframe created by the code
right before the pipe. When you merge data frames as a step in
piped code, therefore, the left data frame is the one piped into the
function while the right data frame is the one stated in the *_join
function call.
As an example of merging, say you want to create a table of the top 5 players by shots on goal, as well as the final standing for each of these players' teams,
using the worldcup and team_standings data. You can do this by running:
data(worldcup)
worldcup %>%
mutate(Name = rownames(worldcup),
Team = as.character(Team)) %>%
select(Name, Position, Shots, Team) %>%
arrange(desc(Shots)) %>%
slice(1:5) %>%
left_join(team_standings, by = "Team") %>% # Merge in team standings
rename("Team Standing" = Standing) %>%
kable()
In addition to the merging in this code, there are a few other interesting things
to point out:
The code uses the as.character function within a mutate call to change the team name from a factor to a character in the worldcup data frame. When merging two data frames, it's safest if the column you're using to merge has the same class in each data frame. The Team column is a character class in the team_standings data frame but a factor class in the worldcup data frame, so this call converts that column to a character class in worldcup. The left_join function will still perform a merge if you don't include this call, but it will throw a warning that it is coercing the column in worldcup to a character vector. It's generally safer to do this yourself explicitly.
It uses the select function both to remove columns we're not interested in and also to put the columns we want to keep in the order we'd like for the final table.
It uses arrange followed by slice to pull out the top 5 players and order
them by number of shots.
For one of the column names, we want to use Team Standing rather
than the current column name of Standing. This code uses rename at
the very end to make this change right before creating the table. You
can also use the col.names argument in the kable function to customize
all the column names in the final table, but this rename call is a quick fix
since we just want to change one column name.
R has special object classes for dates and date-times. It is often worthwhile to
convert a column in a data frame to one of these special object types, because
you can do some very useful things with date or date-time objects, including
pull out the month or day of the week from the observations in the object, or
determine the time difference between two values.
Many of the examples in this section use the ext_tracks object loaded earlier
in the book. If you need to reload that, you can use the following code to do
so:
The lubridate package (another package from the tidyverse) has some excellent functions for working with dates in R. First, this package includes functions to transform objects into date or date-time classes. For example, the ymd_hm function (along with other functions in the same family: ymd, ymd_h, and ymd_hms) can be used to convert a vector from character class to R's date and date-time classes (Date and POSIXct, respectively).
Functions in this family can be used to parse character strings into dates,
regardless of how the date is formatted, as long as the date is in the order:
year, month, day (and, for time values, hour, minute). For example:
library(lubridate)
ymd("2006-03-12")
[1] "2006-03-12"
ymd("'06 March 12")
[1] "2006-03-12"
ymd_hm("06/3/12 6:30 pm")
[1] "2006-03-12 18:30:00 UTC"
The following code shows how to use the ymd_h function to transform the
date and time information in a subset of the hurricane example data called
andrew_tracks (the storm tracks for Hurricane Andrew) to a date-time class
(POSIXct). This code also uses the unite function from the tidyr package to join
together date components that were originally in separate columns before
applying ymd_h.
library(dplyr)
library(tidyr)
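# The code creating andrew_tracks is not shown in this excerpt; a sketch
# consistent with the output below (datetime, max_wind, and min_pressure
# columns for Hurricane Andrew's 1992 track):
library(lubridate)
andrew_tracks <- ext_tracks %>%
  filter(storm_name == "ANDREW", year == 1992) %>%
  select(year, month, day, hour, max_wind, min_pressure) %>%
  unite(datetime, year, month, day, hour) %>%
  mutate(datetime = ymd_h(datetime))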
head(andrew_tracks, 3)
# A tibble: 3 × 3
datetime max_wind min_pressure
<dttm> <int> <int>
1 1992-08-16 18:00:00 25 1010
2 1992-08-17 00:00:00 30 1009
3 1992-08-17 06:00:00 30 1008
class(andrew_tracks$datetime)
[1] "POSIXct" "POSIXt"
Now that the datetime variable in this dataset has been converted to a date-time class, the variable becomes much more useful. For example, if you plot a time series using datetime, ggplot2 can recognize that this object is a date-time and will make sensible axis labels. The following code plots maximum wind speed and minimum air pressure at different observation times for Hurricane Andrew (Figure @ref(fig:andrewwind)); check the axis labels to see how they've been formatted. Note that this code uses gather from the tidyr
package to enable easy faceting, to create separate plots for wind speed and
air pressure.
andrew_tracks %>%
gather(measure, value, -datetime) %>%
ggplot(aes(x = datetime, y = value)) +
geom_point() + geom_line() +
facet_wrap(~ measure, ncol = 1, scales = "free_y")
Example of how variables in a date-time class can be parsed for sensible axis labels.
The lubridate package also includes functions to pull out certain elements of a date-time object, such as the year, month, weekday, day of the year, or hour. The following code uses the datetime variable in the Hurricane Andrew track data to add new columns for the year, month, weekday, year day, and hour of each observation:
andrew_tracks %>%
select(datetime) %>%
mutate(year = year(datetime),
month = months(datetime),
weekday = weekdays(datetime),
yday = yday(datetime),
hour = hour(datetime)) %>%
slice(1:3)
# A tibble: 3 × 6
datetime year month weekday yday hour
<dttm> <dbl> <chr> <chr> <dbl> <int>
1 1992-08-16 18:00:00 1992 August Sunday 229 18
2 1992-08-17 00:00:00 1992 August Monday 230 0
3 1992-08-17 06:00:00 1992 August Monday 230 6
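These elements can then be used to summarize and plot the observations by
time grouping. A sketch of code that would create bar charts of the number of
observations by weekday (a) and by month (b), consistent with the notes that
follow (the grouping column name and factor levels match that discussion; the
other details are assumptions):

library(ggplot2)
check_weekdays <- andrew_tracks %>%
  mutate(weekday = weekdays(datetime)) %>%
  group_by(weekday) %>%
  summarize(n = n()) %>%
  rename(grouping = weekday) %>%
  mutate(grouping = factor(grouping,
                           levels = c("Sunday", "Monday", "Tuesday",
                                      "Wednesday", "Thursday",
                                      "Friday", "Saturday")))
check_months <- andrew_tracks %>%
  mutate(month = months(datetime)) %>%
  group_by(month) %>%
  summarize(n = n()) %>%
  rename(grouping = month) %>%
  mutate(grouping = factor(grouping, levels = month.name))
a <- ggplot(check_weekdays, aes(x = grouping, y = n)) +
  geom_bar(stat = "identity") + xlab("")
b <- a %+% check_months

The two plots are then arranged together: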
library(gridExtra)
grid.arrange(a, b, ncol = 1)
Example of using lubridate functions to explore data with a date variable by different
time groupings
To get the weekday and month values in the right order, the code uses
the factor function in conjunction with the levels option, to control
the order in which R sets the factor levels. By specifying the order we
want to use with levels, the plot prints out using this order, rather than
alphabetical order (try the code without the factor calls for month and
weekday and compare the resulting graphs to the ones shown here).
The grid.arrange function, from the gridExtra package, allows you to
arrange different ggplot objects in the same plot area. Here, I've used
it to put the bar charts for weekday (a) and for month (b) together in one
column (ncol = 1).
If you ever have ggplot code that you would like to re-use for a new
plot with a different data frame, you can save a lot of copying and
pasting by using the %+% function. This function takes a ggplot object
(a in this case, which is the bar chart by weekday) and substitutes a
different data frame (check_months) for the original one (check_weekdays),
but otherwise maintains all code. Note that we used rename to give the
x-variable the same name in both datasets so we could take advantage
of the %+% function.
The lubridate package also has functions for handling time zones. The
hurricane tracks' date-times are, as is true for a lot of weather data, in Coordinated
Universal Time (UTC). This means that you can plot the storm track by
date, but the dates will be based on UTC rather than local time near where
the storm hit. Figure @ref(fig:andrewutc) shows the location of Hurricane
Andrew by date as it neared and crossed the United States, based on date-
time observations in UTC.
library(ggmap)
miami <- get_map("miami", zoom = 5)
ggmap(miami) +
geom_path(data = andrew_tracks, aes(x = -longitude, y = latitude),
color = "gray", size = 1.1) +
geom_point(data = andrew_tracks,
aes(x = -longitude, y = latitude, color = date),
size = 2)
To create this plot using local time for Miami, FL, rather than UTC (Figure
@ref(fig:andrewlocal)), you can use the with_tz function from lubridate to
convert the datetime variable in the track data from UTC to local time. This
function inputs a date-time object in the POSIXct class, as well as a character
string with the time zone of the location for which you'd like to get local time,
and returns the corresponding local time for that location.
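A sketch of that conversion for the Andrew track data (the tzone value and the
date column used to color the points are assumptions):

andrew_tracks <- andrew_tracks %>%
  mutate(datetime = with_tz(datetime, tzone = "America/New_York"),
         date = date(datetime))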
ggmap(miami) +
geom_path(data = andrew_tracks, aes(x = -longitude, y = latitude),
color = "gray", size = 1.1) +
geom_point(data = andrew_tracks,
aes(x = -longitude, y = latitude, color = date),
size = 2)
This section has only skimmed the surface of the date-time manipulations
you can do with the lubridate package. For more on what this package can do,
check out Garrett Grolemund and Hadley Wickham's article in the Journal of
Statistical Software on the package, "Dates and Times Made Easy with
lubridate", or the current package vignette.
Most common types of data are encoded in text, even if that text is rep-
resenting numerical values, so being able to manipulate text as a software
developer is essential. R provides several built-in tools for manipulating text,
and there is a rich ecosystem of packages for R for text-based analysis. First
let's concentrate on some basic text manipulation functions.
By default the paste() function inserts a space between each word. You can
insert a different string between each word by specifying the sep argument:
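For example:

paste("Square", "Circle", "Triangle")
[1] "Square Circle Triangle"
paste("Square", "Circle", "Triangle", sep = "+")
[1] "Square+Circle+Triangle"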
A shortcut for combining all of the string arguments without any characters
in between each of them is to use the paste0() function:
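For example:

paste0("Square", "Circle", "Triangle")
[1] "SquareCircleTriangle"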
As you can see, all of the possible string combinations are produced when you
provide a vector of strings as an argument to paste(). You can also collapse all
of the elements of a vector of strings into a single string by specifying the
collapse argument:
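For instance:

shapes <- c("Square", "Circle", "Triangle")
paste("My favorite shape is a", shapes)
[1] "My favorite shape is a Square"   "My favorite shape is a Circle"
[3] "My favorite shape is a Triangle"
paste(shapes, collapse = " ")
[1] "Square Circle Triangle"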
Besides pasting strings together, there are a few other basic string
manipulation functions you should be aware of. The nchar() function counts the
number of characters in a string:
nchar("Supercalifragilisticexpialidocious")
[1] 34
The toupper() and tolower() functions make strings all uppercase or lowercase
respectively:
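For example:

cases <- c("CAPS", "low", "Title")
tolower(cases)
[1] "caps"  "low"   "title"
toupper(cases)
[1] "CAPS"  "LOW"   "TITLE"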
Regular Expressions
Now that we've covered the basics of string manipulation in R, let's discuss
the more advanced topic of regular expressions. A regular expression is a
string that defines a pattern that could be contained within another string. A
regular expression can be used for searching for a string, searching within a
string, or replacing one part of a string with another string. In this section I
might refer to a regular expression as a regex; just know that they're the same
thing.
Regular expressions use characters to define patterns of other characters.
Although that approach may seem problematic at first, we'll discuss meta-
characters (characters that describe other characters) and how you can use
them to create powerful regular expressions.
One of the most basic functions in R that uses regular expressions is the
grepl() function, which takes two arguments: a regular expression and a
string to be searched. If the string contains the specified regular expression
then grepl() will return TRUE, otherwise it will return FALSE. Let's take a look at
one example:
regular_expression <- "a"
string_to_search <- "Maryland"
grepl(regular_expression, string_to_search)
[1] TRUE
In the example above we specify the regular expression "a" and store it
in a variable called regular_expression. Remember that regular expressions
are just strings! We also store the string "Maryland" in a variable called
string_to_search. The regular expression "a" represents a single occurrence
of the character "a". Since "a" is contained within "Maryland", grepl() returns
the value TRUE. Let's try another simple example:
regular_expression <- "u"   ## an illustrative pattern not contained in the string
string_to_search <- "Maryland"
grepl(regular_expression, string_to_search)
[1] FALSE
grepl("land", "Maryland")
[1] TRUE
grepl("ryla", "Maryland")
[1] TRUE
grepl("Marly", "Maryland")
[1] FALSE
grepl("dany", "Maryland")
[1] FALSE
Since "land" and "ryla" are sub-strings of "Maryland", grepl() returns TRUE,
however when a regular expression like "Marly" or "dany" is searched grepl()
returns FALSE because neither are sub-strings of "Maryland".
There's a dataset that comes with R called state.name, which is a vector of
strings, one for each state in the United States of America. We're going to use
this vector in several of the following examples.
head(state.name)
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado"
Lets build a regular expression for identifying several strings in this vector,
specifically a regular expression that will match names of states that both
start and end with a vowel. The state name could start and end with any
vowel, so we won't be able to match exact sub-strings like in the previous
examples. Thankfully we can use metacharacters to look for vowels and
other parts of strings. The first metacharacter that we'll discuss is ".". The
metacharacter that only consists of a period represents any character other
than a new line (we'll discuss new lines soon). Let's take a look at some
examples using the period regex:
grepl(".", "Maryland")
[1] TRUE
grepl(".", "*&2[0+,%<@#~|}")
[1] TRUE
grepl(".", "")
[1] FALSE
As you can see, the period metacharacter is very liberal. This metacharacter
is most useful when you don't care about a set of characters in a regular
expression. For example:
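One small illustration (the strings here are just example values):

grepl("a.b", c("aaa", "aab", "axb", "ab"))
[1] FALSE  TRUE  TRUE FALSE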
In the case above grepl() returns TRUE for all strings that contain an a followed
by any other character followed by a b.
You can specify a regular expression that contains a certain number of
characters or metacharacters using the enumeration metacharacters. The +
metacharacter indicates that one or more of the preceding expression should
be present and * indicates that zero or more of the preceding expression is
present. Let's take a look at some examples using these metacharacters:
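A few illustrative cases:

## Does "Maryland" contain one or more of "a"?
grepl("a+", "Maryland")
[1] TRUE
## Does "Maryland" contain one or more of "x"?
grepl("x+", "Maryland")
[1] FALSE
## Does "Maryland" contain zero or more of "x"?
grepl("x*", "Maryland")
[1] TRUE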
You can also specify exact numbers of expressions using curly brackets {}. For
example "a{5}" specifies a exactly five times, "a{2,5}" specifies a between
2 and 5 times, and "a{2,}" specifies a at least 2 times. Lets take a look at
some examples:
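A few illustrative cases:

## Does "Mississippi" contain exactly 2 adjacent "s" characters?
grepl("s{2}", "Mississippi")
[1] TRUE
## Does "Mississippi" contain between 1 and 3 adjacent "i" characters?
grepl("i{1,3}", "Mississippi")
[1] TRUE
## Does "Mississippi" contain 2 adjacent "iss" groups?
grepl("(iss){2}", "Mississippi")
[1] TRUE

Another group of metacharacters matches classes of characters: in R strings,
"\\w" matches any word character (a letter, digit, or underscore), "\\d"
matches any digit, and "\\s" matches whitespace, while their uppercase
counterparts "\\W", "\\D", and "\\S" match the complement of each class: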
grepl("\\w", "abcdefghijklmnopqrstuvwxyz0123456789")
[1] TRUE
grepl("\\d", "0123456789")
[1] TRUE
grepl("\\d", "abcdefghijklmnopqrstuvwxyz")
[1] FALSE
grepl("\\D", "abcdefghijklmnopqrstuvwxyz")
[1] TRUE
You can also specify specific character sets using straight brackets []. For
example a character set of just the vowels would look like: "[aeiou]". You can
find the complement to a specific character set by putting a caret ^ after the first
bracket. For example "[^aeiou]" matches all characters except the lowercase
vowels. You can also specify ranges of characters using a hyphen - inside
of the brackets. For example "[a-m]" matches all of the lowercase characters
between a and m, while "[5-8]" matches any digit between 5 and 8 inclusive.
Let's take a look at some examples using custom character sets:
grepl("[aeiou]", "rhythms")
[1] FALSE
grepl("[^aeiou]", "rhythms")
[1] TRUE
grepl("[a-m]", "xyz")
[1] FALSE
grepl("[a-m]", "ABC")
[1] FALSE
grepl("[a-mA-M]", "ABC")
[1] TRUE
You might be wondering how you can use regular expressions to match
a particular punctuation mark since many punctuation marks are used as
metacharacters! Putting two backslashes before a punctuation mark that is
also a metacharacter indicates that you are looking for the symbol and not
the metacharacter meaning. For example "\\." indicates you are trying to
match a period in a string. Let's take a look at a few examples:
grepl("\\.", "http://www.jhsph.edu/")
[1] TRUE
There are also metacharacters for matching the beginning and the end of a
string, which are "^" and "$" respectively. Let's take a look at a few examples:
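A few illustrative cases:

grepl("^a", c("bab", "aab"))
[1] FALSE  TRUE
grepl("b$", c("bab", "aab"))
[1]  TRUE  TRUE
grepl("^[ab]+$", c("bab", "aab", "abc"))
[1]  TRUE  TRUE FALSE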
Finally we've learned enough to create a regular expression that matches all
state names that both begin and end with a vowel:
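One such expression (a sketch: it requires an uppercase vowel at the start and
a lowercase vowel at the end) is:

vowel_state_lgl <- grepl("^[AEIOU]{1}.+[aeiou]{1}$", state.name)
head(vowel_state_lgl)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE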
state.name[vowel_state_lgl]
[1] "Alabama" "Alaska" "Arizona" "Idaho" "Indiana" "Iowa"
[7] "Ohio" "Oklahoma"
Metacharacter   Meaning
.               Any Character
\w              A Word
\W              Not a Word
\d              A Digit
\D              Not a Digit
\s              Whitespace
\S              Not Whitespace
[xyz]           A Set of Characters
[^xyz]          Negation of Set
[a-z]           A Range of Characters
^               Beginning of String
$               End of String
\n              Newline
+               One or More of Previous
*               Zero or More of Previous
?               Zero or One of Previous
|               Either the Previous or the Following
{5}             Exactly 5 of Previous
{2,5}           Between 2 and 5 of Previous
{2,}            At Least 2 of Previous
RegEx Functions in R
So far we've been using grepl() to see if a regex matches a string. There are a
few other built-in regex functions you should be aware of. First we'll review
our workhorse of this chapter, grepl(), which stands for "grep logical."
Then there's old-fashioned grep(), which returns the indices of the vector that
match the regex:
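For instance:

grep("[Ii]", c("Hawaii", "Illinois", "Kentucky"))
[1] 1 2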
The gsub() function is nearly the same as sub() except it will replace every
instance of the regex that is matched in each string.
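For instance, sub() replaces only the first match while gsub() replaces them all:

sub("[Ii]", "1", c("Hawaii", "Illinois", "Kentucky"))
[1] "Hawa1i"   "1llinois" "Kentucky"
gsub("[Ii]", "1", c("Hawaii", "Illinois", "Kentucky"))
[1] "Hawa11"   "1ll1no1s" "Kentucky"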
The strsplit() function will split up strings according to the provided regex.
If strsplit() is provided with a vector of strings it will return a list of string
vectors.
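A sketch of a call that produces a list of this kind, splitting the state names
that contain a double "s":

two_s <- state.name[grepl("ss", state.name)]
strsplit(two_s, "ss")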
[[2]]
[1] "Mi" "i" "ippi"
[[3]]
[1] "Mi" "ouri"
[[4]]
[1] "Tenne" "ee"
The str_extract() function returns the sub-string of a string that matches the
provided regular expression.
library(stringr)
state_tbl <- paste(state.name, state.area, state.abb)
head(state_tbl)
[1] "Alabama 51609 AL" "Alaska 589757 AK" "Arizona 113909 AZ"
[4] "Arkansas 53104 AR" "California 158693 CA" "Colorado 104247 CO"
str_extract(state_tbl, "[0-9]+")
[1] "51609" "589757" "113909" "53104" "158693" "104247" "5009"
[8] "2057" "58560" "58876" "6450" "83557" "56400" "36291"
[15] "56290" "82264" "40395" "48523" "33215" "10577" "8257"
[22] "58216" "84068" "47716" "69686" "147138" "77227" "110540"
[29] "9304" "7836" "121666" "49576" "52586" "70665" "41222"
[36] "69919" "96981" "45333" "1214" "31055" "77047" "42244"
[43] "267339" "84916" "9609" "40815" "68192" "24181" "56154"
[50] "97914"
The str_order() function returns a numeric vector giving the alphabetical order
of the strings in the provided vector:
head(state.name)
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado"
str_order(state.name)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
head(state.abb)
[1] "AL" "AK" "AZ" "AR" "CA" "CO"
str_order(state.abb)
[1] 2 1 4 3 5 6 7 8 9 10 11 15 12 13 14 16 17 18 21 20 19 22 23
[24] 25 24 26 33 34 27 29 30 31 28 32 35 36 37 38 39 40 41 42 43 44 46 45
[47] 47 49 48 50
The str_pad() function pads strings with other characters which is often useful
when the string is going to be eventually printed for a person to read.
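For example:

str_pad("Thai", width = 8, side = "left", pad = "-")
[1] "----Thai"
str_pad("Thai", width = 8, side = "right", pad = "-")
[1] "Thai----"
str_pad("Thai", width = 8, side = "both", pad = "-")
[1] "--Thai--"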
The str_to_title() function acts just like tolower() and toupper() except it puts
strings into Title Case.
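For example:

str_to_title(c("VENI", "vidi", "VICI"))
[1] "Veni" "Vidi" "Vici"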
The str_trim() function removes leading and trailing whitespace from a string:
to_trim <- c(" space", "the ", " final frontier ")
str_trim(to_trim)
[1] "space" "the" "final frontier"
The str_wrap() function inserts newlines in strings so that when the string is
printed, each line's length is limited.
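For example:

some_text <- "It was a dark and stormy night."
## prints the sentence broken across several short lines
cat(str_wrap(some_text, width = 12))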
The word() function allows you to index each word in a string as if it were a
vector.
a_tale <- "It was the best of times it was the worst of times it was the age of wisdom it was\
the age of foolishness"
word(a_tale, 2)
[1] "was"
word(a_tale, end = 3)
[1] "It was the"
Summary
The mem_used() function in the pryr package reports the total amount of memory
currently used by your R session:
library(pryr)
mem_used()
127 MB
The primary use of this function is to make sure your memory usage in R isn't
getting too big. If the output from mem_used() is in the neighborhood of 75%-
80% of your total physical RAM, you might need to consider a few things.
First, you might consider removing a few very large objects in your workspace.
You can see the memory usage of objects in your workspace by calling the
object_size() function.
The object_size() function will print the number of bytes (or kilobytes, or
megabytes) that a given object is using in your R session. If you want to see what
the memory usage of the largest 5 objects in your workspace is, you can use
the following code.
library(magrittr)
sapply(ls(), function(x) object.size(get(x))) %>% sort %>% tail(5)
worldcup denver check_tracks ext_tracks miami
61424 222768 239848 1842472 13121608
Note: We have had to use the object.size() function here (see note below)
because the current version of object_size() in pryr throws an error for
certain types of objects.
Here we can see that the miami and ext_tracks objects (created in previous
chapters of this book) are currently taking up the most memory in our R
session. Since we no longer need those objects, we can remove them from
the workspace and free up some memory.
mem_used()
127 MB
rm(ext_tracks, miami)
mem_used()
125 MB
Here you can see how much memory we save by deleting these two objects.
But you may be wondering why there isn't a larger savings, given the number
reported by object_size(). This has to do with the internal representation of
the miami object, which is of the class ggmap. Occasionally, certain types of R
objects can appear to take up more memory than they actually do, in which
case functions like object_size() will get confused.
Viewing the change in memory usage by executing an R expression can actu-
ally be simplified using the mem_change() function. We can see what happens
when we remove the next three largest objects.
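For example, a call along these lines reports the net change in memory (a
negative number here, since objects are being removed):

mem_change(rm(check_tracks, denver, worldcup))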
R has a built in function called object.size() that also calculates the size
of an object, but it uses a slightly different calculation than object_size()
in pryr. While the two functions will generally agree for most objects, for
things like functions and formulas, which have enclosing environments
attached to them, they will differ. Similarly, objects with shared elements
(i.e. character vectors) may result in different computations of their size.
The compare_size() function in pryr allows you to see how the two functions
compare in their calculations. We will discuss these concepts more in the
next chapter.
When reading in large datasets or creating large R objects, it's often useful
to do a back-of-the-envelope calculation of how much memory the object will
occupy in the R session (ideally before creating the object). To do this it's useful
to know roughly how much memory different types of atomic data types in R
use.
It's difficult to generalize how much memory is used by data types in R, but on
most 64 bit systems today, integers are 32 bits (4 bytes) and double-precision
floating point numbers (numerics in R) are 64 bits (8 bytes). Furthermore,
character data are usually 1 byte per character. Because most data come in
the form of numbers (integer or numeric) and letters, just knowing these
three bits of information can be useful for doing many back-of-the-envelope
calculations.
For example, an integer vector is roughly 4 bytes times the number of
elements in the vector. We can see that for a zero-length vector, that still
requires some memory to represent the data structure.
object_size(integer(0))
40 B
However, for longer vectors, the overhead stays roughly constant, and the
size of the object is determined by the number of elements.
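For example:

object_size(integer(1000))   ## roughly 1000 * 4 bytes, plus a small overhead
object_size(numeric(1000))   ## roughly 1000 * 8 bytes, plus a small overhead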
If you are reading in tabular data of integers and floating point numbers, you
can roughly estimate the memory requirements for that table by multiplying
the number of rows by the memory required for each of the columns. This
can be a useful exercise to do before reading in large datasets. If you acciden-
tally read in a dataset that requires more memory than your computer has
available, you may end up freezing your R session (or even your computer).
The .Machine object in R (found in the base package) can give you specific
details about how your computer/operating system stores different types of
data.
str(.Machine)
List of 18
$ double.eps : num 2.22e-16
$ double.neg.eps : num 1.11e-16
$ double.xmin : num 2.23e-308
$ double.xmax : num 1.8e+308
$ double.base : int 2
$ double.digits : int 53
$ double.rounding : int 5
$ double.guard : int 0
$ double.ulp.digits : int -52
$ double.neg.ulp.digits: int -53
$ double.exponent : int 11
$ double.min.exp : int -1022
$ double.max.exp : int 1024
$ integer.max : int 2147483647
$ sizeof.long : int 8
$ sizeof.longlong : int 8
$ sizeof.longdouble : int 16
$ sizeof.pointer : int 8
If you're familiar with other programming languages like C, you'll notice that
you do not need to explicitly allocate and de-allocate memory for objects in
R. This is because R has a garbage collection system that recycles unused
memory and gives it back to R. This happens automatically without the need
for user intervention.
Roughly, R will periodically cycle through all of the objects that have been
created and see if there are still any references to the object somewhere in
the session. If there are no references, the object is garbage-collected and
the memory returned. Under normal usage, the garbage collection is not
noticeable, but occasionally, when working with very large R objects, you
may notice a hiccup in your R session when R triggers a garbage collection
to reclaim unused memory. There's not really anything you can do about this
except not panic when it happens.
The gc() function in the base package can be used to explicitly trigger a
garbage collection in R. Calling gc() explicitly is never actually needed, but
it does produce some output that is worth understanding.
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1692825 90.5 2637877 140.9 2637877 140.9
Vcells 3766804 28.8 11515887 87.9 18887070 144.1
The used column gives you the amount of memory currently being used by
R. The distinction between Ncells and Vcells is not important; the mem_used()
function in pryr essentially gives you the sum of this column. The gc trigger
column gives you the amount of memory that can be used before a garbage
collection is triggered. Generally, you will see this number go up as you
allocate more objects and use more memory. The max used column shows
the maximum space used since the last call to gc(reset = TRUE) and is not
particularly useful.
R now offers a variety of options for working with large datasets.
We won't try to cover all these options in detail here, but rather give an
overview of strategies to consider if you need to work with a large dataset, as
well as point you to additional resources to learn more about working with
large datasets in R.
In-memory strategies
In this section, we introduce the basics of why and how to use data.table
to work with large datasets in R. We have included a video demonstration
online showing how functions from the data.table package can be used to
load and explore a large dataset more efficiently.
The data.table package can help you read a large dataset into R and explore
it more efficiently. The fread function in this package, for example, can read
in large flat files much more quickly than comparable base R packages. Since
all of the data.table functions will work with smaller datasets as well, we'll
illustrate using data.table with the Zika data accessed from GitHub in an
earlier section of this chapter. We've saved that data locally to illustrate how
to read it in and work with it using data.table.
First, to read this data in using fread, you can run:
library(data.table)
brazil_zika <- fread("data/COES_Microcephaly-2016-06-25.csv")
head(brazil_zika, 2)
report_date location location_type data_field
1: 2016-06-25 Brazil-Acre state microcephaly_confirmed
2: 2016-06-25 Brazil-Alagoas state microcephaly_confirmed
data_field_code time_period time_period_type value unit
1: BR0002 NA NA 2 cases
2: BR0002 NA NA 75 cases
class(brazil_zika)
[1] "data.table" "data.frame"
If you are working with a very large dataset, data.table will provide a status
bar showing your progress as the data are read in using fread.
If you have a large dataset for which you only want to read in certain columns,
you can save time when using data.table by only reading in the columns you
want with the select argument in fread. This argument takes a vector of either
the names or positions of the columns that you want to read in:
fread("data/COES_Microcephaly-2016-06-25.csv",
select = c("location", "value", "unit")) %>%
dplyr::slice(1:3)
location value unit
1 Brazil-Acre 2 cases
2 Brazil-Alagoas 75 cases
3 Brazil-Amapa 7 cases
The data.table package also includes a number of functions you can use to
quickly summarize and explore the data in a data.table object. You can find out
more about using data.table functions at the data.table wiki.
When you are working with datasets that are large, but can still fit in-memory,
you'll want to optimize your code as much as possible. There are more details
on profiling and optimizing code in a later chapter, but one strategy for
speeding up R code is to write some of the code in C++ and connect it to R
using the Rcpp package. Since C++ is a compiled rather than an interpreted
language, it runs much faster than similar code written in R. If you are
more comfortable coding in another compiled language (C or FORTRAN, for
example), you can also use those, although the Rcpp package is very nicely
written and well-maintained, which makes C++ an excellent first choice for
creating compiled code to speed up R.
Further, a variety of R packages have been written that help you run R code
in parallel, either locally or on a cluster. Parallel strategies may be worth
pursuing if you are working with very large datasets, and if the coding tasks
can be split to run in parallel. To get more ideas and find relevant packages,
visit CRAN's High-Performance and Parallel Computing with R task view.
Out-of-memory strategies
If you need to work with a very large dataset, there are also some options to
explore and model the dataset without ever loading it into R, while still using
R commands and working from the R console or an R script. These options
can make working with large datasets more efficient, because they let other
software handle the heavy lifting of sifting through the data and/or avoid
loading large datasets into RAM, instead using data stored on the hard drive.
For example, database management systems are optimized to more effi-
ciently store and better search through large sets of data; popular examples
include Oracle, MySQL, and PostgreSQL. There are several R packages that
allow you to connect your R session to a database. With these packages, you
can use functions from the R console or an R script to search and subset
data without loading the whole dataset into R, and so take advantage of the
improved efficiency of the database management system in handling data, as
well as work with data too big to fit in memory.
The DBI package is particularly convenient for interfacing R code with a
database management system, as it provides a top-level interface to a number
of different database management systems, with system-specific code applied
by a lower-level, more specific R package (Figure @ref(fig:rdbi)).
The DBI package therefore allows you to use the same commands for working
with database-stored data in R, without worrying about details specific to
the exact type of database management system you're connecting to. The
following table outlines the DBI functions you can use to perform a variety
of tasks when working with data stored in a database:
For more on the DBI package, including its history, see the package's
GitHub README page.
The packages for working with database management systems require you
to send commands to the database management system in that system's
command syntax (e.g., SQL). You can, however, do SELECT database queries
directly using dplyr syntax for some database systems, rather than with SQL
syntax. While this functionality is limited to SELECT calls, often this is all
you'll need within a data analysis script. For more details, see the dplyr
database vignette.
Inevitably, no matter what your level of expertise, you will get to a point in
your R programming where you're stuck. It happens to us every single day.
The first question is always "How do you know you have a problem?" Two
things must be satisfied in this situation: you had a specific expectation of
what your code was going to do, and the code did something other than what
you expected.
While it might seem overly didactic to separate out these two things, one
common mistake is to only focus on the second part, i.e. what actually
happened. Typically, we see an error message or a warning or some other
bad sign and we intuitively know that there is a problem. While it's important
to recognize these warning signs, it's equally important to be able to say
specifically what your expectation was. What output were you expecting to
see? What did you think the answer was going to be?
The more specific you can be with your expectation, the more likely you'll be
able to figure out what went wrong. In particular, in many cases it might be
that your expectations were incorrect. For example, you might think it's a bug
that the log() function returns NaN when called on a negative number. If you
were expecting there to be an error in this situation, then your expectation is
incorrect because the log() function was specifically designed to return the
NaN value (indicating an undefined operation) and give a warning when called
with negative numbers.
There are two basic approaches to diagnosing and solving problems.
1. Googling
2. Asking a human
Before asking a human, it's usually best to see if you can Google your way
out. This can be a real timesaver for all involved. We discuss both approaches
below.
Like with any other programming language, it's essential that you know how
to Google your way out of a jam. A related resource in this situation is the
Stack Overflow web site, which is a popular Q&A web site for programming-related
questions. However, often results from Google will simply point you
to Stack Overflow, so Google can serve as a useful wrapper around a variety of
web sites like this.
While we don't exactly have an algorithm for getting unstuck from a jam,
here are a few tips.
If you get an error message, copy and paste the entire error message
into Google. Why? Because, almost surely, someone else has gotten this
very same error and has asked a question about it on some forum that
Google has indexed. Chances are, that person copy-and-pasted the error
message into that forum posting and, presto! You have your answer. Or
something close to it.
For working with certain high-level functions, you can simply Google the
name of the function, perhaps with the phrase "R function" following it
in case it is a somewhat generic function name. This will usually bring
up the help page for the function first, but it will also commonly bring up
various tutorials that people have written that use this function. Often,
seeing how other people use a certain function can be very helpful in
understanding how a function works.
If you're trying to learn a new R package, Google "[package name] vignette"
and "[package name] tutorial". Often, someone will have written
course slides, a blog post, or a document that walks you through how to
use the package.
If you are struggling with how to write the code for a plot, try using
Google Images. Google "r [name or description of plot]" (e.g., "r pareto
plot") and then choose the Images tab in the results. Scroll through
to find something that looks like the plot you want to create, and then
check the image's website. It will often include the R code used to create
the image.
In the event that Googling around does not find you an answer, you may
need to wade into a forum like Stack Overflow, Reddit, or perhaps the R-help
mailing list to get help with a problem. When asking questions on a forum,
there are some general rules that are always worth following.
Read the posting guide for the forum, if there is one. This may cover the
rules of posting to the forum and will save you a bit of grief later on.
If the forum has a FAQ, read it. The answer to your question may already
be there.
State the problem youre trying to solve, along with the approach that
you took that led to your problem. In particular, state what you
were expecting to see from your code. Sometimes the source of your
problem lies higher up the chain than you might think. In particular,
your expectations may be incorrect.
Show that you've done your homework and have tried to diagnose the
problem yourself, read the help page, Googled for answers, etc.
Provide a reproducible example of your problem. This cannot be
stressed enough. In order for others to help you, it's critical that they
can reproduce the problem on their own machines. Otherwise, they will
have to diagnose your problem from afar, and much like with human
beings, this is often very difficult to do. If your problem involves massive
amounts of computation, try to come up with a simple example that
reproduces the same problem. Other people will not download your 100
GB dataset just so they can reproduce your error message.
2. Advanced R Programming
This course covers advanced topics in R programming that are necessary for
developing powerful, robust, and reusable data science tools. Topics covered
include functional programming in R, robust error handling, object oriented
programming, profiling and benchmarking, debugging, and proper design
of functions. Upon completing this course you will be able to identify and
abstract common data analysis tasks and to encapsulate them in user-facing
functions. Because every data science environment encounters unique data
challenges, there is always a need to develop custom software specific to
your organization's mission. You will also be able to define new data types
in R and to develop a universe of functionality specific to those data types to
enable cleaner execution of data science tasks and stronger reusability within
a team.
The learning objectives of the chapter are:
Note: Some of the material in this section is adapted from R Programming for
Data Science.
Most control structures are not used in interactive sessions, but rather when
writing functions or longer expressions. However, these constructs do not
have to be used in functions and it's a good idea to become familiar with them
before we delve into functions.
if-else
The if-else combination is probably the most commonly used control struc-
ture in R (or perhaps any language). This structure allows you to test a
condition and act on it depending on whether it's true or false.
For starters, you can just use the if statement.
if(<condition>) {
## do something
}
## Continue with rest of code
The above code does nothing if the condition is false. If you have an action
you want to execute when the condition is false, then you need an else clause.
if(<condition>) {
## do something
} else {
## do something else
}
You can have a series of tests by following the initial if with any number of
else ifs.
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
if statements can also be nested inside one another:
if(<condition1>) {
## do something
if(<condition2>) {
## do something else
}
}
for Loops
For loops are pretty much the only looping construct that you will need in R.
While you may occasionally find a need for other types of loops, in most data
analysis situations, there are very few cases where a for loop isn't sufficient.
In R, for loops take an iterator variable and assign it successive values from a
sequence or vector. For loops are most commonly used for iterating over the
elements of an object (list, vector, etc.)
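A sketch of the kind of loop described next:

for(i in 1:10) {
  print(i)
}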
This loop takes the i variable and in each iteration of the loop gives it values 1,
2, 3, ..., 10, executes the code within the curly braces, and then the loop exits.
The following three loops all have the same behavior.
x <- c("a", "b", "c", "d")
for(i in 1:4) {
## Print out each element of 'x'
print(x[i])
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
for(letter in x) {
print(letter)
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
For one line loops, the curly braces are not strictly necessary.
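For instance:

for(letter in x) print(letter)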
However, curly braces are sometimes useful even for one-line loops, because
that way if you decide to expand the loop to multiple lines, you won't be
burned because you forgot to add curly braces (and you will be burned by
this).
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
print(x[i, j])
}
}
next, break
next is used to skip an iteration of a loop:
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
break is used to exit a loop immediately, regardless of what iteration the loop
may be on.
for(i in 1:100) {
print(i)
if(i > 20) {
## Stop loop after 20 iterations
break
}
}
Summary
Control structures like if-else and for allow you to control the flow of an
R program.
2.2 Functions
Code
Often we start out analyzing data by writing straight R code at the console.
This code is designed to accomplish a single task, whatever it is that we
are trying to do right now. For example, consider the following code, which
reads the CRAN download logs for July 20, 2016 and counts the number of
downloads of the filehash package:
library(readr)
library(dplyr)
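## A sketch of the kind of one-off analysis being described; details such as
## the log-file URL and the column types are assumptions.
## Download the CRAN log file for July 20, 2016 if we don't already have it
if(!file.exists("data/2016-07-20.csv.gz")) {
  download.file("http://cran-logs.rstudio.com/2016/2016-07-20.csv.gz",
                "data/2016-07-20.csv.gz")
}
cran <- read_csv("data/2016-07-20.csv.gz", col_types = "ccicccccci")
cran %>% filter(package == "filehash") %>% nrow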
There are a few aspects of this code that one might want to modify or expand on:
the date: this code only reads data for July 20, 2016. But what about data
from other days? Note that we would first need to obtain that data if we
were interested in knowing download statistics from other days.
the package: this code only returns the number of downloads for the
filehash package. However, there are many other packages on CRAN
and we may want to know how many times these other packages were
downloaded.
Function interface
The following function, called num_download(), takes two arguments: pkgname, a
character string giving the name of the package, and date, a character string
indicating the date for which you want download statistics, in year-month-day
format.
Given the date and package name, the function downloads the appropriate
download logs from the RStudio server, reads the CSV file, and then returns
the number of downloads for the package.
library(dplyr)
library(readr)
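## A sketch of num_download(); the URL format and column types are carried
## over from the sketch above and remain assumptions.
num_download <- function(pkgname, date) {
  ## Construct the web URL for the daily log file
  year <- substr(date, 1, 4)
  src <- sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz", year, date)
  ## Construct the path for storing the local file
  dest <- file.path("data", basename(src))
  ## Don't download the file if we already have it
  if(!file.exists(dest))
    download.file(src, dest, quiet = TRUE)
  cran <- read_csv(dest, col_types = "ccicccccci", progress = FALSE)
  cran %>% filter(package == pkgname) %>% nrow
}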
Now we can call our function using whatever date or package name we
choose.
num_download("filehash", "2016-07-20")
[1] 179
num_download("Rcpp", "2016-07-19")
[1] 13572
Note that for this date, the CRAN log file had to be downloaded separately
because it had not yet been downloaded.
Default values
The way that the num_download() function is currently specified, the user must
enter the date and package name each time the function is called. However,
it may be that there is a logical default date for which we always want to
know the number of downloads, for any package. We can set a default value
for the date argument, for example, to be July 20, 2016. In that case, if the date
argument is not explicitly set by the user, the function can use the default
value. The revised function might look as follows:
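A sketch of the revised function, identical to the version above except for the
default value of date:

num_download <- function(pkgname, date = "2016-07-20") {
  year <- substr(date, 1, 4)
  src <- sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz", year, date)
  dest <- file.path("data", basename(src))
  if(!file.exists(dest))
    download.file(src, dest, quiet = TRUE)
  cran <- read_csv(dest, col_types = "ccicccccci", progress = FALSE)
  cran %>% filter(package == pkgname) %>% nrow
}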
Now we can call the function in the following manner. Notice that we do not
specify the date argument.
num_download("Rcpp")
[1] 14761
Default values play a critical role in R functions because R functions are often
called interactively. When using R in interactive mode, it can be a pain to
have to specify the value of every argument in every instance of calling the
function. Sometimes we want to call a function multiple times while varying
a single argument (keeping the other arguments at a sensible default).
Also, function arguments have a tendency to proliferate. As functions mature
and are continuously developed, one way to add more functionality is to
increase the number of arguments. But if these new arguments do not have
sensible default values, then users will generally have a harder time using
the function.
As a function author, you have tremendous influence over the user's behavior
by specifying defaults, so take care in choosing them. However, just note that
a judicious use of default values can greatly improve the user experience with
respect to your function.
Re-factoring code
Now that we have a function written that handles the task at hand in a more
general manner (i.e. it can handle any package and any date), it is worth
taking a closer look at the function and asking whether it is written in the most
useful possible manner. In particular, it could be argued that this function
does too many things:
1. Constructs the path to the remote and local log files
2. Downloads the log file (if it does not already exist locally)
3. Reads the log file into R
4. Finds the package and returns the number of downloads
It might make sense to abstract the first two things on this list into a separate
function. For example, we could create a function called check_for_logfile()
to see if we need to download the log file and then num_download() could call
this function.
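A sketch of check_for_logfile(), consistent with the description that follows
(the URL format is the same assumption as before):

check_for_logfile <- function(date) {
  year <- substr(date, 1, 4)
  src <- sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz", year, date)
  dest <- file.path("data", basename(src))
  if(!file.exists(dest)) {
    status <- download.file(src, dest, quiet = TRUE)
    if(status != 0)
      stop("unable to download file ", src)
  }
  dest
}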
This function takes the original download code from num_download() and adds a bit
of error checking to see if download.file() was successful (if not, an error is
thrown with stop()).
Now the num_download() function is somewhat simpler.
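A sketch of the simplified version:

num_download <- function(pkgname, date = "2016-07-20") {
  dest <- check_for_logfile(date)
  cran <- read_csv(dest, col_types = "ccicccccci", progress = FALSE)
  cran %>% filter(package == pkgname) %>% nrow
}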
In addition to being simpler to read, another key difference is that the
num_download() function does not need to know anything about downloading or
URLs or files. All it knows is that there is a function check_for_logfile() that
just deals with getting the data to your computer. From there, we can just read
the data with read_csv() and get the information we need. This is the value of
abstraction and writing functions.
Dependency Checking
The num_download() function depends on the readr and dplyr packages. Without
them installed, the function won't run. Sometimes it is useful to check to see
that the needed packages are installed so that a useful error message (or other
behavior) can be provided for the user.
We can write a separate function to check that the packages are installed.
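A sketch of such a function; the exact behavior chosen for each package is an
illustration, matching the discussion of require() below:

check_pkg_deps <- function() {
  if(!require(readr)) {
    message("installing the 'readr' package")
    install.packages("readr")
  }
  if(!require(dplyr))
    stop("the 'dplyr' package needs to be installed first")
}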
There are a few things to note about this function. First, it uses the re-
quire() function to attempt to load the readr and dplyr packages. The require()
function is similar to library(), however library() stops with an error if the
package cannot be loaded whereas require() returns TRUE or FALSE depending
on whether the package can be loaded or not. For both functions, if the
package is available, it is loaded and attached to the search() path.
Typically, library() is good for interactive work because you usually can't
go on without a specific package (that's why you're loading it in the first
place!). On the other hand, require() is good for programming because you
may want to engage in different behavior depending on which packages are
not available.
Vectorization
One final aspect of this function that is worth noting is that as currently
written it is not vectorized. This means that each argument must be a single
value: a single package name and a single date. However, in R, it is a common
paradigm for functions to take vector arguments and for those functions to
return vector or list results. Often, users are bitten by unexpected behavior
because a function is assumed to be vectorized when it is not.
One way to vectorize this function is to allow the pkgname argument to be a
character vector of package names. This way we can get download statistics
for multiple packages with a single function call. Luckily, this is fairly straight-
forward to do. The two things we need to do are
1. Adjust our call to filter() to grab rows of the data frame that fall within
a vector of package names
2. Use a group_by() %>% summarize() combination to count the downloads for
each package.
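A sketch of the vectorized version:

num_download <- function(pkgname, date = "2016-07-20") {
  dest <- check_for_logfile(date)
  cran <- read_csv(dest, col_types = "ccicccccci", progress = FALSE)
  cran %>% filter(package %in% pkgname) %>%
    group_by(package) %>%
    summarize(n = n())
}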
num_download(c("filehash", "weathermetrics"))
# A tibble: 2 x 2
package n
<chr> <int>
1 filehash 179
2 weathermetrics 7
Note that the output of num_download() has changed. While it previously re-
turned an integer vector, the vectorized function returns a data frame. If you
are authoring a function that is used by many people, it is usually wise to give
them some warning before changing the nature of the output.
Vectorizing the date argument is similarly possible, but it has the added
complication that for each date you need to download another log file. We
leave this as an exercise for the reader.
Argument Checking
Checking that the arguments supplied by the user are proper is a good way
to prevent confusing results or error messages from occurring later on in the
function. It is also a useful way to enforce documented requirements for a
function.
In this case, the num_download() function is expecting both the pkgname and date
arguments to be character vectors. In particular, the date argument should be
a character vector of length 1. We can check the class of an argument using
is.character() and the length using the length() function.
## Check arguments
if(!is.character(pkgname))
stop("'pkgname' should be character")
if(!is.character(date))
stop("'date' should be character")
if(length(date) != 1)
stop("'date' should be length 1")
Note that here, we chose to stop() and throw an error if the argument was not
of the appropriate type. However, an alternative would have been to simply
coerce the argument to be of character type using the as.character() function.
R package
Deciding when to write a function depends on the context in which you are
programming in R. For a one-off type of activity, it's probably not worth
considering the design of a function or set of functions. However, in our
Your closest collaborator is you six months ago, but you don't reply
to emails.
This comment relates to the general question of whether some code will ever
have any users, including yourself later on. If the code will likely have more
than one user, they will benefit from the abstraction and simplification af-
forded by encapsulating the code in functions and providing a clean interface.
In Roger's book, Executive Data Science, he writes about when to write a
function:
Summary
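The add2() and add3() functions used below are closures: functions created by
another function, which carry the enclosing function's data along with them. A
sketch of how they might have been created (adder_maker is an assumed name):

adder_maker <- function(n) {
  function(x) {
    n + x
  }
}
add2 <- adder_maker(2)
add3 <- adder_maker(3)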
add2(5)
[1] 7
add3(5)
[1] 8
There are groups of functions that are essential for functional programming.
In most cases they take a function and a data structure as arguments, and
that function is applied to that data structure in some way. The purrr library
contains many of these functions and we'll be using it throughout this section.
Functional programming is concerned mostly with lists and vectors. I may
refer to just lists or vectors, but you should know that what applies for lists
generally applies for vectors and vice-versa.
Map
library(purrr)
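The map family of functions applies a function to every element of a vector or
list and returns a vector of the corresponding type. A couple of small
illustrations:

map_chr(c(5, 4, 3, 2, 1), function(x){
  c("one", "two", "three", "four", "five")[x]
})
[1] "five"  "four"  "three" "two"   "one"
map_lgl(c(1, 2, 3, 4, 5), function(x){
  x > 3
})
[1] FALSE FALSE FALSE  TRUE  TRUE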
Think about evaluating each function above with just one of the arguments in
the specified numeric vector, and then combining all of those function results
into one vector.
The map_if() function takes as its arguments a list or vector containing data,
a predicate function, and then a function to be applied. A predicate function
is a function that returns TRUE or FALSE for each element in the provided list
or vector. In the case of map_if(): if the predicate function evaluates to TRUE,
then the function is applied to the corresponding vector element, however if
the predicate function evaluates to FALSE then the function is not applied. The
map_if() function always returns a list, so I'm piping the result of map_if() to
unlist() so it looks prettier:
map_if(1:5, function(x){
x %% 2 == 0
},
function(y){
y^2
}) %>% unlist()
[1] 1 4 3 16 5
Notice how only the even numbers are squared, while the odd numbers are
left alone.
The map_at() function only applies the provided function to elements of a
vector specified by their indexes. map_at() always returns a list so, like before,
I'm piping the result to unlist():
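For instance:

map_at(seq(100, 500, 100), c(1, 3, 5), function(x){
  x - 10
}) %>% unlist()
[1]  90 200 290 400 490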
As expected, the provided function is only applied to the first,
third, and fifth elements of the vector provided.
In each of the examples above we have only been mapping a function over
one data structure, however you can map a function over two data structures
with the map2() family of functions. The first two arguments should be two
vectors of the same length, followed by a function which will be evaluated
with an element of the first vector as the first argument and an element of
the second vector as the second argument. For example:
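For instance:

map2_chr(letters[1:5], 1:5, paste)
[1] "a 1" "b 2" "c 3" "d 4" "e 5"

The pmap() family of functions generalizes this idea to a list containing any
number of vectors or lists: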
pmap_chr(list(
list(1, 2, 3),
list("one", "two", "three"),
list("uno", "dos", "tres")
), paste)
[1] "1 one uno" "2 two dos" "3 three tres"
Reduce
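List or vector reduction iteratively combines the first element of a vector with
the second element, combines that result with the third element, and so on,
until the end of the vector is reached. A sketch of a reduce() call consistent
with the output shown here:

reduce(c(1, 3, 5, 7), function(x, y){
  message("x is ", x)
  message("y is ", y)
  x + y
})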
x is 1
y is 3
x is 4
y is 5
x is 9
y is 7
[1] 16
On the first iteration x has the value 1 and y has the value 3, then the two
values are combined (they're added together). On the second iteration x has
the value of the result from the first iteration (4) and y has the value of the
third element in the provided numeric vector (5). This process is repeated for
each iteration. Here's a similar example using string data:
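A sketch of such a call:

reduce(letters[1:4], function(x, y){
  message("x is ", x)
  message("y is ", y)
  paste0(x, y)
})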
x is ab
y is c
x is abc
y is d
[1] "abcd"
By default reduce() starts with the first element of a vector and then the second
element and so on. In contrast the reduce_right() function starts with the last
element of a vector and then proceeds to the second to last element of a vector
and so on:
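A sketch:

reduce_right(letters[1:4], function(x, y){
  message("x is ", x)
  message("y is ", y)
  paste0(x, y)
})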
x is dc
y is b
x is dcb
y is a
[1] "dcba"
Search
You can search for specific elements of a vector using the contains() and
detect() functions. contains() will return TRUE if a specified element is present
in a vector, otherwise it returns FALSE:
contains(letters, "a")
[1] TRUE
contains(letters, "A")
[1] FALSE
The detect() function takes a vector and a predicate function as arguments, and
returns the first element of the vector for which the predicate function returns
TRUE:
detect(20:40, function(x){
x > 22 && x %% 2 == 0
})
[1] 24
The detect_index() function takes the same arguments, however it returns the
index of the provided vector which contains the first element that satisfies
the predicate function:
detect_index(20:40, function(x){
x > 22 && x %% 2 == 0
})
[1] 5
Filter
The group of functions that includes keep(), discard(), every(), and some()
are known as filter functions. Each of these functions takes a vector and a
predicate function. For keep() only the elements of the vector that satisfy the
predicate function are returned while all other elements are removed:
keep(1:20, function(x){
x %% 2 == 0
})
[1] 2 4 6 8 10 12 14 16 18 20
The discard() function works similarly; it only returns elements that don't
satisfy the predicate function:
discard(1:20, function(x){
x %% 2 == 0
})
[1] 1 3 5 7 9 11 13 15 17 19
The every() function returns TRUE only if every element in the vector satisfies
the predicate function, while the some() function returns TRUE if at least one
element in the vector satisfies the predicate function:
every(1:20, function(x){
x %% 2 == 0
})
[1] FALSE
some(1:20, function(x){
x %% 2 == 0
})
[1] TRUE
Compose
Finally, the compose() function combines any number of functions into one
function:
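For example, the n_unique() function used below can be defined by composing
length() and unique(); the composed function applies unique() first and then
length():

n_unique <- compose(length, unique)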
rep(1:5, 1:5)
[1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5
n_unique(rep(1:5, 1:5))
[1] 5
Partial Application
library(purrr)
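## A sketch of definitions that would produce the result below; the name
## mult_three_n and the bound argument values are assumptions.
mult_three_n <- function(x, y, z){
  x * y * z
}
mult_by_15 <- partial(mult_three_n, x = 3, y = 5)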
mult_by_15(z = 4)
[1] 60
By using partial application you can bind some data to the arguments of a
function before using that function elsewhere.
Side Effects
Side effects of functions occur whenever a function interacts with the outside
world: reading or writing data, printing to the console, and displaying
a graph are all side effects. The results of side effects are one of the main
motivations for writing code in the first place! Side effects can be tricky to
handle though, since the order in which functions with side effects are exe-
cuted often matters and there are variables that are external to the program
(the relative location of some data). If you want to evaluate a function across
a data structure you should use the walk() function from purrr. Heres a simple
example:
library(purrr)
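## A small sketch: walk() applies a side-effect function (here message())
## to each element of a vector and invisibly returns its input.
mark_antony <- c("Friends, Romans, countrymen,",
                 "lend me your ears;",
                 "I come to bury Caesar,",
                 "not to praise him.")
walk(mark_antony, message)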
Recursion
Recursive functions have two main parts: a few easy to solve problems called
base cases, and then a case for more complicated
problems where the function is called inside of itself. The central philos-
ophy of recursive programming is that problems can be broken down into
simpler parts, and then combining those simple answers results in the answer
to a complex problem.
Imagine you wanted to write a function that adds together all of the numbers
in a vector. You could of course accomplish this with a loop:
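A sketch of the loop-based version (vector_sum_loop is an assumed name):

vector_sum_loop <- function(v){
  result <- 0
  for(i in v){
    result <- result + i
  }
  result
}
vector_sum_loop(c(5, 40, 91))
[1] 136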
You could also think about how to solve this problem recursively. First ask
yourself: what's the base case of finding the sum of a vector? If the vector
only contains one element, then the sum is just the value of that element. In
the more complex case the vector has more than one element. We can remove
the first element of the vector, but then what should we do with the rest of
the vector? Thankfully we have a function for computing the sum of all of the
elements of a vector because we're writing that function right now! So we'll
add the value of the first element of the vector to whatever the cumulative
sum is of the rest of the vector. The resulting function is illustrated below:
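A sketch of the recursive version:

vector_sum_rec <- function(v){
  if(length(v) == 1){
    v
  } else {
    v[1] + vector_sum_rec(v[-1])
  }
}
vector_sum_rec(c(5, 40, 91))
[1] 136

The Fibonacci sequence is another classic example of recursion; the outputs
below come from a recursive fib() function, a sketch of which (with fib(1)
defined as 0 and fib(2) as 1) is:

fib <- function(n){
  stopifnot(n > 0)
  if(n == 1){
    0
  } else if(n == 2){
    1
  } else {
    fib(n - 1) + fib(n - 2)
  }
}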
fib(1)
[1] 0
fib(2)
[1] 1
fib(3)
[1] 1
fib(4)
[1] 2
fib(5)
[1] 3
fib(6)
[1] 5
fib(7)
[1] 8
map_dbl(1:12, fib)
[1] 0 1 1 2 3 5 8 13 21 34 55 89
Looks like it's working well! There is one optimization that we could apply
here which comes up in recursive programming often. When you execute
the function fib(6), within that function youll execute fib(5) and fib(4). Then
within the execution of fib(5), fib(4) will be executed again. An illustration
of this phenomenon is below:
One way to avoid this repeated work is memoization: store already-computed
Fibonacci values in a lookup table, and only compute a value if it is not
already there. In the version below, if a value for n is already present in
fib_tbl it is returned directly; otherwise it is recursively calculated and
stored in the table. Notice that we're using the
complex assignment operator <<- in order to modify the table outside the
scope of the function. You'll learn more about the complex operator in the
section titled Expressions & Environments.
fib_tbl <- c(0, 1, rep(NA, 23))
fib_mem <- function(n){
if(!is.na(fib_tbl[n])){
fib_tbl[n]
} else {
fib_tbl[n - 1] <<- fib_mem(n - 1)
fib_tbl[n - 2] <<- fib_mem(n - 2)
fib_tbl[n - 1] + fib_tbl[n - 2]
}
}
map_dbl(1:12, fib_mem)
[1] 0 1 1 2 3 5 8 13 21 34 55 89
It works! But is it any faster than the original fib()? Below I'm going to use
the microbenchmark package in order to assess whether fib() or fib_mem() is faster:
library(purrr)
library(microbenchmark)
library(tidyr)
library(magrittr)
library(dplyr)
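## A sketch of how the fib_data and memo_data objects used below might be
## built; the details (10 input values, naming scheme) are assumptions.
fib_data <- map(1:10, function(x){ microbenchmark(fib(x), times = 100)$time })
names(fib_data) <- paste0("fib", 1:10)
fib_data <- as.data.frame(fib_data)
memo_data <- map(1:10, function(x){ microbenchmark(fib_mem(x), times = 100)$time })
names(memo_data) <- paste0("fib", 1:10)
memo_data <- as.data.frame(memo_data)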
fib_data %<>%
gather(num, time) %>%
group_by(num) %>%
summarise(med_time = median(time))
memo_data %<>%
gather(num, time) %>%
group_by(num) %>%
summarise(med_time = median(time))
As you can see as higher Fibonacci numbers are calculated the time it takes
to calculate a number with fib() grows exponentially, while the time it takes
to do the same task with fib_mem() stays constant.
Summary
Expressions
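Expressions are R objects that represent unevaluated R code; one way to create
them is with quote(). A sketch of the two_plus_two object that gets evaluated
below:

two_plus_two <- quote(2 + 2)
two_plus_two
2 + 2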
eval(two_plus_two)
[1] 4
You might encounter R code that is stored as a string that you want to evaluate
with eval(). You can use parse() to transform a string into an expression:
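For example (tpt_string is an assumed name for the intermediate string):

tpt_string <- "2 + 2"
tpt_expression <- parse(text = tpt_string)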
eval(tpt_expression)
[1] 4
You can reverse this process and transform an expression into a string using
deparse():
deparse(two_plus_two)
[1] "2 + 2"
One interesting feature about expressions is that you can access and modify
their contents like you would a list(). This means that you can change the values in
an expression, or even the function being executed in the expression before
it is evaluated:
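One illustration:

sum_expr <- quote(sum(1, 5))
eval(sum_expr)
[1] 6
sum_expr[[1]]
sum
sum_expr[[2]]
[1] 1
sum_expr[[3]]
[1] 5
sum_expr[[1]] <- quote(paste0)
sum_expr[[2]] <- quote(4)
sum_expr[[3]] <- quote(6)
eval(sum_expr)
[1] "46"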
You can compose expressions using the call() function. The first argument is
a string containing the name of a function, followed by the arguments that
will be provided to that function.
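For example:

sum_40_50_expr <- call("sum", 40, 50)
sum_40_50_expr
sum(40, 50)
eval(sum_40_50_expr)
[1] 90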
You can capture the expression an R user typed into the R console when
they executed a function by including match.call() in the function the user
executed:
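For example:

return_expression <- function(...){
  match.call()
}
return_expression(2, col = "blue", FALSE)
return_expression(2, col = "blue", FALSE)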
You could of course then manipulate this expression inside of the function
you're writing. The example below first uses match.call() to capture the
expression that the user entered. The first argument of the function is then
extracted and evaluated. If the first argument is a number, then a string
is returned describing the first argument, otherwise the string "The first
argument is not numeric." is returned.
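A sketch of such a function (first_arg is an assumed name):

first_arg <- function(...){
  expr <- match.call()
  first_arg_expr <- expr[[2]]
  first_arg <- eval(first_arg_expr)
  if(is.numeric(first_arg)){
    paste("The first argument is", first_arg)
  } else {
    "The first argument is not numeric."
  }
}
first_arg(2, 4, "seven", FALSE)
[1] "The first argument is 2"
first_arg("two", 4, "seven", FALSE)
[1] "The first argument is not numeric."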
Expressions are a powerful tool for writing R programs that can manipulate
other R programs.
Environments
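Environments are data structures that associate (bind) names to values. You can
create a new environment with new.env() and assign values into it; a small
sketch (my_new_env is reused below):

my_new_env <- new.env()
my_new_env$x <- 4
my_new_env$x
[1] 4
assign("y", 9, envir = my_new_env)
get("y", envir = my_new_env)
[1] 9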
You can get all of the variable names that have been assigned in an environ-
ment using ls(), you can remove an association between a variable name and
a value using rm(), and you can check if a variable name has been assigned in
an environment using exists():
ls(my_new_env)
[1] "x" "y"
rm(y, envir = my_new_env)
exists("y", envir = my_new_env)
[1] TRUE
exists("x", envir = my_new_env)
[1] TRUE
my_new_env$x
[1] 4
my_new_env$y
NULL
search()
[1] ".GlobalEnv" "package:magrittr"
[3] "package:tidyr" "package:microbenchmark"
[5] "package:purrr" "package:dplyr"
[7] "package:readr" "package:parallel"
[9] "package:knitr" "package:stats"
[11] "package:graphics" "package:grDevices"
[13] "package:utils" "package:datasets"
[15] "Autoloads" "package:base"
library(ggplot2)
search()
[1] ".GlobalEnv" "package:ggplot2"
[3] "package:magrittr" "package:tidyr"
[5] "package:microbenchmark" "package:purrr"
[7] "package:dplyr" "package:readr"
[9] "package:parallel" "package:knitr"
[11] "package:stats" "package:graphics"
[13] "package:grDevices" "package:utils"
[15] "package:datasets" "Autoloads"
[17] "package:base"
Execution Environments
Although there may be several cases where you need to create a new environ-
ment using new.env(), you will more often create new environments whenever
you execute functions. An execution environment is an environment that
exists temporarily within the scope of a function that is being executed. For
example if we have the following code:
x <- 10
my_func <- function(){
x <- 5
return(x)
}
my_func()
What do you think will be the result of my_func()? Make your guess and then
take a look at the executed code below:
x <- 10
my_func()
[1] 5
So what exactly is happening above? First the name x is being assigned the
value 10 in the global environment. Then the name my_func is being assigned
the value of the function function(){ x <- 5; return(x) } in the global
environment. When my_func() is executed, a new environment is created called the
execution environment which only exists while my_func() is running. Inside of
the execution environment the name x is assigned the value 5. When return()
is executed it looks first in the execution environment for a value that is
assigned to x. Then the value 5 is returned. In contrast to the situation above,
take a look at this variation:
x <- 10
another_func <- function(){
return(x)
}
another_func()
[1] 10
The complex assignment operator <<- assigns a value to a name in an enclosing
environment (here, the global environment) rather than in the function's own
execution environment:
x <- 10
x
[1] 10
assign1 <- function(){
x <<- "Wow!"
}
assign1()
x
[1] "Wow!"
You can see that the value associated with x has been changed from 10 to
"Wow!" in the global environment. You can also use <<- to assign names to
values that have not yet been defined in the global environment from
inside a function:
a_variable_name
Error in eval(expr, envir, enclos): object 'a_variable_name' not found
exists("a_variable_name")
[1] FALSE
assign2 <- function(){
a_variable_name <<- "Magic!"
}
assign2()
exists("a_variable_name")
[1] TRUE
a_variable_name
[1] "Magic!"
If you want to see a case for using <<- in action, see the section of this book
about functional programming and the discussion there about memoization.
Summary
What is an error?
Errors most often occur when code is used in a way that it is not intended
to be used. For example adding two strings together produces the following
error:
"hello" + "world"
Error in "hello" + "world": non-numeric argument to binary operator
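The as.numeric() call being described next is not shown in the text; it is
presumably:
as.numeric(c("5", "6", "seven"))
Warning: NAs introduced by coercion
[1]  5  6 NA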
The as.numeric() function attempts to convert each string in c("5", "6", "seven")
into a number, however it is impossible to convert "seven", so a warning is
generated. Execution of the code is not halted, and an NA is produced for
"seven" instead of a number.
Messages simply print to the R console, though they are generated by an
underlying mechanism that is similar to how errors and warnings are generated.
Here's a small function that will generate a message:
f <- function(){
message("This is a message.")
}
f()
This is a message.
Generating Errors
There are a few essential functions for generating errors, warnings, and
messages in R. The stop() function will generate an error. Let's generate an
error:
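For example (a minimal illustration; the original call is not shown in the text):
stop("Something erroneous has occurred!")
Error: Something erroneous has occurred!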
If an error occurs inside of a function then the name of that function will
appear in the error message:
name_of_function()
Error in name_of_function(): Something bad happened.
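The error_if_n_is_greater_than_zero() function used below is not shown in the
text; judging from the error message, it presumably uses stopifnot(), which
generates an error naming whichever logical expression is not TRUE:
error_if_n_is_greater_than_zero <- function(n){
  stopifnot(n <= 0)
  n
}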
error_if_n_is_greater_than_zero(5)
Error: n <= 0 is not TRUE
The warning() function creates a warning, and the function itself is very
similar to the stop() function. Remember that a warning does not stop the
execution of a program (unlike an error).
Just like errors, a warning generated inside of a function will include the
name of the function in which it was generated:
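The make_NA() function below is not shown in the text; it is presumably along
these lines:
make_NA <- function(x){
  warning("Generating an NA.")
  NA
}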
make_NA("Sodium")
Warning in make_NA("Sodium"): Generating an NA.
[1] NA
Messages are simpler than errors or warnings; they just print strings to the R
console. You can issue a message with the message() function:
message("In a bottle.")
In a bottle.
Stopping the execution of your program with stop() should only happen in
the event of a catastrophe, meaning only if it is impossible for your program
to continue. If there are conditions that you can anticipate that would cause
your program to create an error, then you should document those conditions
so whoever uses your software is aware. Common failure conditions, like
providing invalid arguments to a function, should be checked at the beginning
of your program so that the user can quickly realize something has gone wrong.
Imagine writing a program that will take a long time to complete because of
a complex calculation or because you're handling a large amount of data. If
an error occurs during this computation then you're liable to lose all of the
results that were calculated before the error, or your program may not finish
a critical task that a program further down your pipeline is depending on. If
you anticipate the possibility of errors occurring during the execution of your
program then you can design your program to handle them appropriately.
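The beera() function used in the examples below is not shown in the extracted
text. Judging from the output, it is presumably a wrapper around tryCatch()
that converts errors and warnings into messages, along these lines:
beera <- function(expr){
  tryCatch(expr,
           error = function(e){
             message("An error occurred:\n", e)
           },
           warning = function(w){
             message("A warning occurred:\n", w)
           },
           finally = {
             message("Finally done!")
           })
}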
beera({
2 + 2
})
Finally done!
[1] 4
beera({
"two" + 2
})
An error occurred:
Error in "two" + 2: non-numeric argument to binary operator
Finally done!
beera({
as.numeric(c(1, "two", 3))
})
A warning occurred:
simpleWarning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced by coercion
Finally done!
Notice that we've effectively transformed errors and warnings into messages.
Now that you know the basics of generating and catching errors, you'll need
to decide when your program should generate an error. My advice to you is
to limit the number of errors your program generates as much as possible.
Even if you design your program so that it's able to catch and handle errors,
the error handling process slows down your program by orders of magnitude.
Imagine you wanted to write a simple function that checks if an argument is
an even number. You might write the following:
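A minimal version of such a function (the original definition is not shown in
the text) might be:
is_even <- function(n){
  n %% 2 == 0
}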
is_even(768)
[1] TRUE
is_even("two")
Error in n%%2: non-numeric argument to binary operator
You can see that providing a string causes this function to raise an error.
You could imagine though that you want to use this function across a list of
different data types, and you only want to know which elements of that list
are even numbers. You might think to write the following:
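The is_even_error() function is presumably a version that catches the error
with tryCatch() and returns FALSE instead, for example:
is_even_error <- function(n){
  tryCatch(n %% 2 == 0,
           error = function(e){
             FALSE
           })
}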
is_even_error(714)
[1] TRUE
is_even_error("eight")
[1] FALSE
This appears to be working the way you intended; however, when applied to
more data this function will be seriously slow compared to alternatives. For
example, I could check that n is numeric before treating n like a number:
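A sketch of that check (the exact definition is not shown in the text):
is_even_check <- function(n){
  is.numeric(n) && n %% 2 == 0
}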
is_even_check(1876)
[1] TRUE
is_even_check("twelve")
[1] FALSE
Notice that by using is.numeric() before the AND operator (&&) the
expression n %% 2 == 0 is never evaluated. This is a programming
language design feature called short circuiting. The expression
can never evaluate to TRUE if the left hand side of && evaluates to
FALSE, so the right hand side is ignored.
To demonstrate the difference in the speed of the code, we'll use the
microbenchmark package to measure how long it takes for each function to be
applied to the same data.
library(microbenchmark)
microbenchmark(sapply(letters, is_even_check))
Unit: microseconds
expr min lq mean median uq max neval
sapply(letters, is_even_check) 46.224 47.7975 61.43616 48.6445 58.4755 167.091 100
microbenchmark(sapply(letters, is_even_error))
Unit: microseconds
expr min lq mean median uq max neval
sapply(letters, is_even_error) 640.067 678.0285 906.3037 784.4315 1044.501 2308.931 100
Summary
2.6 Debugging
traceback()
The traceback() function prints out the function call stack after an error has
occurred, so that you can see at what level of function calls the error occurred.
If you have many functions calling each other in succession, the traceback()
output can be useful for identifying where to go digging first.
For example, the following code gives an error.
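The code itself is not shown in the extracted text; based on the traceback
output and the function body displayed later in this section, it is presumably
along these lines:
check_n_value <- function(n) {
  if(n > 0) {
    stop("n should be <= 0")
  }
}
error_if_n_is_greater_than_zero <- function(n){
  check_n_value(n)
  n
}
error_if_n_is_greater_than_zero(5)
Error in check_n_value(n): n should be <= 0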
Running the traceback() function immediately after getting this error would
give us
traceback()
3: stop("n should be <= 0") at #2
2: check_n_value(n) at #2
1: error_if_n_is_greater_than_zero(5)
From the traceback, we can see that the error occurred in the check_n_value()
function. Put another way, the stop() function was called from within the
check_n_value() function.
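The browser output shown next presumably comes from redefining check_n_value()
with a call to browser() inserted just before stop(), for example:
check_n_value <- function(n) {
  if(n > 0) {
    browser()  ## Error occurs around here
    stop("n should be <= 0")
  }
}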
error_if_n_is_greater_than_zero(5)
Called from: check_n_value(n)
Browse[1]>
Tracing Functions
If you have easy access to the source code of a function (and can modify the
code), then it's usually easiest to insert browser() calls directly into the code as
you track down various bugs. However, if you do not have easy access to a
function's code, or perhaps a function is inside a package that would require
rebuilding after each edit, it is sometimes easier to make use of the trace()
function to make temporary code modifications.
The simplest use of trace() is to just call trace() on a function without any
other arguments.
trace("check_n_value")
Error in trace("check_n_value"): could not find function "check_n_value"
Now, whenever check_n_value() is called by any other functions, you will see
a message printed to the console indicating that the function was called.
error_if_n_is_greater_than_zero(5)
Error in check_n_value(n): n should be <= 0
Here we can see that check_n_value() was called once before the error
occurred. But we can do more with trace(), such as inserting a call to browser()
in a specific place, such as right before the call to stop().
We can obtain the expression numbers of each part of a function by calling
as.list() on the body() of a function.
as.list(body(check_n_value))
[[1]]
`{`
[[2]]
if (n > 0) {
stop("n should be <= 0")
}
Here, the if statement is the second expression in the function (the first
expression being the very beginning of the function). We can further break
down the second expression as follows.
as.list(body(check_n_value)[[2]])
[[1]]
`if`
[[2]]
n > 0
[[3]]
{
stop("n should be <= 0")
}
Now we can see the call to stop() is the third sub-expression within the second
expression of the overall function. We can specify this to trace() by passing
an integer vector wrapped in a list to the at argument.
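For example (the exact call is not shown in the extracted text, but trace()
accepts a tracer function and an at argument):
trace("check_n_value", browser, at = list(c(2, 3)))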
The trace() function has a side effect of modifying the function and converting
it into a new object of class function.
check_n_value
function(n) {
if(n > 0) {
stop("n should be <= 0")
}
}
<environment: 0x7fe991531f88>
body(check_n_value)
{
if (n > 0) {
stop("n should be <= 0")
}
}
Here we can see that the code has been altered to add a call to browser() just
before the call to stop().
We can add more complex expressions to a function by wrapping them in a
call to quote() within the trace() function. For example, we may only want
to invoke certain behaviors depending on the local conditions of the function.
trace("check_n_value", quote({
if(n == 5) {
message("invoking the browser")
browser()
}
}), at = 2)
Error in getFunction(what, where = whereF): no function 'check_n_value' found
body(check_n_value)
{
if (n > 0) {
stop("n should be <= 0")
}
}
Debugging functions within a package is another key use case for trace(). For
example, if we wanted to insert tracing code into the glm() function within
the stats package, the only addition to the trace() call we would need is to
provide the namespace information via the where argument.
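The call itself is not shown in the extracted text; based on the "step 4"
tracing code in the output below, it is presumably something like:
trace("glm", browser, at = 4, where = asNamespace("stats"))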
Here we show the first few expressions of the modified glm() function.
body(stats::glm)[1:5]
{
call <- match.call()
if (is.character(family))
family <- get(family, mode = "function", envir = parent.frame())
{
.doTrace(browser(), "step 4")
if (is.function(family))
family <- family()
}
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
}
The debug() and debugonce() functions can be called on other functions to turn
on the debugging state of a function. Calling debug() on a function makes it
such that when that function is called, you immediately enter a browser and
can step through the code one expression at a time.
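For example (a brief illustration; these calls are not shown in the text):
debug(check_n_value)       ## enter the browser every time check_n_value() is called
undebug(check_n_value)     ## turn debugging back off
debugonce(check_n_value)   ## enter the browser only on the next call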
recover()
The recover() function is not often used but can be an essential tool when
debugging complex code. Typically, you do not call recover() directly, but
rather set it as the function to invoke anytime an error occurs in code. This
can be done via the options() function.
options(error = recover)
Usually, when an error occurs in code, the code stops execution and you are
brought back to the usual R console prompt. However, when recover() is in
use and an error occurs, you are given the function call stack and a menu.
error_if_n_is_greater_than_zero(5)
Error in check_n_value(n) : n should be <= 0
1: error_if_n_is_greater_than_zero(5)
2: #2: check_n_value(n)
Selection:
Selecting a number from this menu will bring you into that function on the
call stack and you will be placed in a browser environment. You can exit the
browser and then return to this menu to jump to another function in the call
stack.
The recover() function is very useful if an error is deep inside a nested series
of function calls and it is difficult to pinpoint exactly where an error is
occurring (so that you might use browser() or trace()). In such cases, the debug()
function is often of little practical use because you may need to step through
many many expressions before the error actually occurs. Another scenario
is when there is a stochastic element to your code so that errors occur in
an unpredictable way. Using recover() will allow you to browse the function
environment only when the error eventually does occur.
Summary
Some of the R code that you write will be slow. Slow code often isn't worth
fixing in a script that you will only evaluate a few times, as the time it will
take to optimize the code will probably exceed the time it takes the computer
to run it. However, if you are writing functions that will be used repeatedly,
it is often worthwhile to identify slow sections of the code so you can try to
improve speed in those sections.
In this section, we will introduce the basics of profiling R code, using functions
from two packages, microbenchmark and profvis. The profvis package is fairly
new and requires recent versions of both R (version 3.0 or higher) and
RStudio. If you are having problems running either package, you should try
updating both R and RStudio (the Preview version of RStudio, which will
provide full functionality for profvis, is available for download here).
microbenchmark
library(microbenchmark)
microbenchmark(a <- rnorm(1000),
b <- mean(rnorm(1000)))
Unit: microseconds
expr min lq mean median uq max
a <- rnorm(1000) 74.060 85.650 97.13831 93.1350 96.739 368.743
b <- mean(rnorm(1000)) 80.677 101.783 108.15279 104.3055 111.648 253.702
neval
100
100
For example, suppose you want to write a function that inputs a data frame of
dates and daily temperatures like this one:
date temp
2015-07-01 26.5
2015-07-02 27.2
2015-07-03 28.0
2015-07-04 26.9
2015-07-05 27.5
2015-07-06 25.9
2015-07-07 28.0
2015-07-08 28.2
and outputs a data frame that has an additional binary record_temp column,
specifying whether that day met the two conditions (the day's temperature was at
or above a given threshold and was the highest temperature observed up to that
day), like this:
Below are two example functions that can perform these actions. Since the
record_temp column depends on temperatures up to that day, one option is to
use a loop to create this value. The first function takes this approach. The
second function instead uses tidyverse functions to perform the same tasks.
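The function definitions themselves are not included in the extracted text. The
loop-based version can be reconstructed from the profvis example later in this
section; the tidyverse version is a sketch using dplyr's mutate() and cummax()
(the original may differ in detail):
find_records_1 <- function(datafr, threshold){
  highest_temp <- c()
  record_temp <- c()
  for(i in 1:nrow(datafr)){
    highest_temp <- max(highest_temp, datafr$temp[i])
    record_temp[i] <- datafr$temp[i] >= threshold &
      datafr$temp[i] >= highest_temp
  }
  datafr <- cbind(datafr, record_temp)
  return(datafr)
}

find_records_2 <- function(datafr, threshold){
  datafr %>%
    dplyr::mutate(record_temp = temp >= threshold & temp == cummax(temp))
}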
If you apply the two functions to the small example data set, you can see that
they both create the desired output:
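Assuming the small example data frame above is stored as example_data (a name
used here for illustration), the comparison would be:
test_1 <- find_records_1(example_data, 27)
test_2 <- find_records_2(example_data, 27)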
all.equal(test_1, test_2)
[1] TRUE
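The benchmark described next is presumably along these lines (again using the
hypothetical example_data name):
record_temp_perf <- microbenchmark(find_records_1(example_data, 27),
                                   find_records_2(example_data, 27))
record_temp_perf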
This output gives summary statistics (min, lq, mean, median, uq, and max) describing
the time it took to run the two functions over the 100 iterations of each
function call. By default, these times are given in a reasonable unit, based on
the observed profiling times (units are given in microseconds in this case).
It's useful to check next to see if the relative performance of the two functions
is similar for a bigger data set. The chicagoNMMAPS data set from the dlnm package
includes temperature data over 15 years in Chicago, IL. Here are the results
when we benchmark the two functions with that data (note, this code takes a
minute or two to run):
library(dlnm)
data("chicagoNMMAPS")
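The benchmark on the larger data set would then look something like this (a
sketch; the exact call is not shown in the text):
record_temp_perf_2 <- microbenchmark(find_records_1(chicagoNMMAPS, 27),
                                     find_records_2(chicagoNMMAPS, 27))
record_temp_perf_2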
While the function with the loop (find_records_1) performed better with the
very small sample data, the function that uses tidyverse functions
(find_records_2) performs much, much better with a larger data set.
library(ggplot2)
# For small example data
autoplot(record_temp_perf)
By default, this plot gives the Time axis on a log scale. You can change this
with the argument log = FALSE.
profvis
Once you've identified slower code, you'll likely want to figure out which
parts of the code are causing bottlenecks. The profvis function from the
profvis package is very useful for this type of profiling. This function uses
the Rprof() function from base R to profile code, and then displays it in an
interactive visualization in RStudio. This profiling is done by sampling, with
the Rprof() function writing out the call stack every 10 milliseconds while
running the code.
To profile code with profvis, just input the code (in braces if it is multi-line)
into profvis within RStudio. For example, we found that the find_records_1
function was slow when used with a large data set. To profile the code in
that function, run:
library(profvis)
datafr <- chicagoNMMAPS
threshold <- 27
profvis({
  highest_temp <- c()
  record_temp <- c()
  for(i in 1:nrow(datafr)){
    highest_temp <- max(highest_temp, datafr$temp[i])
    record_temp[i] <- datafr$temp[i] >= threshold &
      datafr$temp[i] >= highest_temp
  }
  datafr <- cbind(datafr, record_temp)
})
The profvis output gives you two options for visualization: Flame Graph
or Data (a button to toggle between the two is given in the top left of
the profvis visualization created when you profile code). The Data output
defaults to show you the time usage of each first-level function call. Each of
these calls can be expanded to show deeper and deeper function calls within
the call stack. This expandable interface allows you to dig down within a call
stack to determine what calls are causing big bottlenecks. For functions that
are part of a package you have loaded with devtools::load_all, this output
includes a column with the file name where a given function is defined. This
functionality makes this Data output pane particularly useful in profiling
functions in a package you are creating.
The Flame Graph view in profvis output gives you two panels. The top panel
shows the code called, with bars on the right to show memory use and time
spent on the line. The bottom panel also visualizes the time used by each line
of code, but in this case it shows time use horizontally and shows the full call
stack at each time sample, with initial calls shown at the bottom of the graph,
and calls deeper in the call stack higher in the graph. Clicking on a block in
the bottom panel will show more information about a call, including which
file it was called from, how much time it took, how much memory it took, and
its depth in the call stack.
The accompanying figure shows example output from profiling the code in the
find_records_1 function defined earlier in this section.
You can use the argument interval in profvis to customize the sampling
interval. The default is to sample every 10 milliseconds (interval = 0.01), but
you can decrease this sampling interval. In some cases, you may be able to
use this option to profile faster-running code. However, you should avoid
using an interval smaller than about 5 milliseconds, as below that you will
get inaccurate estimates with profvis. If you are running very fast code, you're
better off profiling with microbenchmark, which can give accurate estimates at
finer time intervals.
Here are some tips for optimizing your use of profvis:
You may find it convenient to use the Show in new window button on
the RStudio pane with profiling results to expand this window while you
are interpreting results.
An Options button near the top right gives different options for how
to display the profiling results, including whether to include memory
profiling results and whether to include lines of code with zero time.
You can click-and-drag results in the bottom visualization panel, as well
as zoom in and out.
You may need to update your version of RStudio to be able to use the full
functionality of profvis. You can download a Preview version of RStudio
here.
If you'd like to share code profiling results from profvis publicly, you can
do that by using the Publish button on the top right of the rendered
profile visualization to publish the visualization to RPubs. The FAQ
section of RStudio's profvis documentation includes more tips for sharing
a code profile visualization online.
If you get a lot of blocks labeled <Anonymous>, try updating your version
of R. In newer versions of R, functions called using package::function()
syntax or list$function() syntax are labeled in profiling blocks in a more
meaningful way. This is likely to be a particular concern if you are
profiling code in a package you are developing, as you will often be using
package::function() syntax extensively to pass CRAN checks.
Summary
Functions from packages like dplyr, tidyr, and ggplot2 are excellent for creating
efficient and easy-to-read code that cleans and displays data. However,
they allow shortcuts in calling columns in data frames that leave some room
for ambiguity when you move from evaluating code interactively to writing
functions for others to use. The non-standard evaluation used within these
functions means that, if you use them as you would in an interactive session,
you'll get a lot of "no visible bindings" warnings when you run CRAN checks
on your package. These warnings will look something like this:
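The warning text is not reproduced in the extracted text; R CMD check notes of
this kind typically look something like (ex_fun and n are illustrative names):
ex_fun: no visible binding for global variable 'n'
Undefined global functions or variables:
  n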
When you write a function for others to use, you need to avoid non-standard
evaluation and so avoid all of these functions (culprits include many dplyr
and tidyr functions including mutate, select, filter, group_by, summarize, gather,
spread but also some functions in ggplot2, including aes). Fortunately, these
functions all have standard evaluation alternatives, which typically have the
same function name followed by an underscore (for example, the standard
evaluation version of mutate is mutate_).
The input to the function call will need to be a bit different for standard
evaluation versions of these functions. In many cases, this change is as easy
as using formula notation (~) within the call, but in some cases it requires
something more complex, including using the .dots argument.
Here is a table with examples of non-standard evaluation calls and their
standard evaluation alternatives (these are all written assuming that the
function is being used as a step in a piping flow, where the input data frame
has already been defined earlier in the piping sequence):
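The table itself is not included in the extracted text; for the older
underscore-suffixed verbs it described, the pairs looked roughly like this
(illustrative examples):
Non-standard evaluation    Standard evaluation
filter(fips %in% counties)    filter_(~ fips %in% counties)
mutate(max_rain = max(tot_precip))    mutate_(max_rain = ~ max(tot_precip))
summarize(tot_precip = sum(precip))    summarize_(tot_precip = ~ sum(precip))
group_by(storm_id, fips)    group_by_(~ storm_id, ~ fips)
aes(x = long, y = lat)    aes_(x = ~ long, y = ~ lat)
select(-start_date, -end_date)    select_(.dots = c("start_date", "end_date"))
gather(key, value)    gather_(key_col = "key", value_col = "value")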
If you have any non-standard evaluation in your package code (which you'll
notice because of the "no visible bindings" warnings you'll get when you
check the package), go through and change any instances to use standard
evaluation alternatives. This change prevents these warnings when you
check your package and will also ensure that the functions behave like you
expect them to when they are run by other users in their own R sessions.
In this section, we've explained only how to convert from functions that use
non-standard evaluation to those that use standard evaluation, to help in
passing CRAN checks as you go from coding scripts to writing functions for
packages. If you would like to learn more about non-standard evaluation in
R, you should check out the chapter on non-standard evaluation in Hadley
Wickham's Advanced R book.
Summary
Design and Implement a new S3, S4, or reference class with generics and
methods
Introduction
method will then return an individual bus object with the attributes that we
specified.
You could also imagine that after making the bus class you might want to
make a special kind of class for a party bus. Party buses have all of the
same attributes and methods as our bus class, but they also have additional
attributes and methods like the number of refrigerators, window blinds that
can be opened and closed, and smoke machines that can be turned on and off.
Instead of rewriting the entire bus class and then adding new attributes and
methods, it is possible for the party bus class to inherit all of the attributes
and methods from the bus class. In this framework of inheritance, we talk
about the bus class as the super-class of the party bus, and the party bus is
the sub-class of the bus. What this relationship means is that the party bus
has all of the same attributes and methods as the bus class plus additional
attributes and methods.
S3
class(2)
[1] "numeric"
class("is in session.")
[1] "character"
class(class)
[1] "function"
Now it's time to wade into some of the quirks of R's object oriented systems.
In the S3 system you can arbitrarily assign a class to any object, which goes
against most of what we discussed in the Object Oriented Principles section.
Class assignments can be made using the structure() function, or you can
assign the class using class() and <-:
special_num_2 <- 2
class(special_num_2)
[1] "numeric"
class(special_num_2) <- "special_number"
class(special_num_2)
[1] "special_number"
This is completely legal R code, but if you want to have a better behaved S3
class you should create a constructor which returns an S3 object. The
shape_s3() function below is a constructor that returns a shape_S3 object:
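The constructor itself is not included in the extracted text; based on the
printed structure of square_4 later in this section, it is presumably:
shape_s3 <- function(side_lengths){
  structure(list(side_lengths = side_lengths),
            class = "shape_S3")
}

square_4 <- shape_s3(c(4, 4, 4, 4))
triangle_3 <- shape_s3(c(3, 3, 3))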
We've now made two shape_S3 objects: square_4 and triangle_3, which are
both instantiations of the shape_S3 class. Imagine that you wanted to create
a method that would return TRUE if a shape_S3 object was a square, FALSE
if a shape_S3 object was not a square, and NA if the object provided as an
argument to the method was not a shape_S3 object. This can be achieved using
R's generic methods system. A generic method can return different values
depending on the class of its input. For example, mean() is a generic
method that can find the average of a vector of numbers or it can find the
average day from a vector of dates. The following snippet demonstrates this
behavior:
mean(c(2, 3, 7))
[1] 4
mean(c(as.Date("2016-09-01"), as.Date("2016-09-03")))
[1] "2016-09-02"
Now let's create a generic method for identifying shape_S3 objects that are
squares. The creation of every generic method uses the UseMethod() function
in the following way with only slight variations:
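The generic definition is not shown in the extracted text; it is presumably:
is_square <- function(x) UseMethod("is_square")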
Now we can add the actual function definition for detecting whether or not a
shape is a square by specifying is_square.shape_S3. By putting a dot (.) and then
the name of the class after is_square, we can create a method that associates
is_square with the shape_S3 class:
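A sketch consistent with the results shown below (the original definitions,
including a default method that returns NA for non-shape_S3 objects, are not
included in the extracted text):
is_square.shape_S3 <- function(x){
  length(x$side_lengths) == 4 &&
    all(x$side_lengths == x$side_lengths[1])
}

is_square.default <- function(x){
  NA
}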
is_square(square_4)
[1] TRUE
is_square(triangle_3)
[1] FALSE
is_square("square")
[1] NA
is_square(c(1, 1, 1, 1))
[1] NA
print(square_4)
$side_lengths
[1] 4 4 4 4
attr(,"class")
[1] "shape_S3"
Doesn't that look ugly? Lucky for us print() is a generic method, so we can
specify a print method for the shape_S3 class:
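A sketch of such a print method, consistent with the output below (the original
is not shown in the extracted text):
print.shape_S3 <- function(x){
  sides <- x$side_lengths
  if(length(sides) == 3){
    paste("A triangle with side lengths of", sides[1], sides[2], "and", sides[3])
  } else if(length(sides) == 4 && all(sides == sides[1])){
    paste("A square with four sides of length", sides[1])
  } else if(length(sides) == 4){
    paste("A quadrilateral with side lengths of",
          sides[1], sides[2], sides[3], "and", sides[4])
  } else {
    paste("A shape with", length(sides), "sides.")
  }
}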
print(square_4)
[1] "A square with four sides of length 4"
print(triangle_3)
[1] "A triangle with side lengths of 3 3 and 3"
print(shape_s3(c(10, 10, 20, 20, 15)))
[1] "A shape with 5 sides."
print(shape_s3(c(2, 3, 4, 5)))
[1] "A quadrilateral with side lengths of 2 3 4 and 5"
Since printing an object to the console is one of the most common things to
do in R, nearly every class has an associated print method! To see all of the
methods associated with a generic like print() use the methods() function:
head(methods(print), 10)
[1] "print,ANY-method" "print,diagonalMatrix-method"
[3] "print,sparseMatrix-method" "print.acf"
[5] "print.anova" "print.anova.gam"
[7] "print.anova.lme" "print.aov"
[9] "print.aovlist" "print.ar"
class(square_4)
[1] "shape_S3"
class(square_4) <- c("shape_S3", "square")
class(square_4)
[1] "shape_S3" "square"
inherits(square_4, "square")
[1] TRUE
The S3 system doesn't have a formal way to define a class, but typically we
use a list to define the class, and elements of the list serve as data elements.
Here is our definition of a polygon represented using Cartesian coordinates.
The class contains an element called xcoord and an element called ycoord for
the x- and y-coordinates, respectively. The make_poly() function is the constructor
function for polygon objects. It takes as arguments a numeric vector of
x-coordinates and a corresponding numeric vector of y-coordinates.
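The constructor is not included in the extracted text; a sketch consistent with
this description:
make_poly <- function(x, y) {
  if(length(x) != length(y))
    stop("'x' and 'y' should be the same length")
  ## Create the object as a list holding the coordinate vectors
  object <- list(xcoord = x, ycoord = y)
  ## Set the class name
  class(object) <- "polygon"
  object
}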
Now that we have a class definition, we can develop some methods for
operating on objects from that class.
The first method we'll define is the print() method. The print() method should
just show some simple information about the object and should not be too
verbose, just enough information that the user knows what the object is.
Here the print() method just shows the user how many vertices the polygon
has. It is a convention for print() methods to return the object x invisibly.
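A sketch of this method (the original is not shown in the extracted text):
print.polygon <- function(x, ...) {
  cat("a polygon with", length(x$xcoord), "vertices\n")
  invisible(x)
}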
Next is the summary() method. The summary() method typically shows a bit
more information and may even do some calculations. This summary() method
computes the ranges of the x- and y-coordinates.
The typical approach for summary() methods is to allow the summary method
to compute something, but not to print it. The strategy is to have summary()
return an object of a special summary class, and then to have a print() method
for that summary class display the results.
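A sketch of such a summary() method for the polygon class (the original is not
shown in the extracted text):
summary.polygon <- function(object, ...) {
  object <- list(rng.x = range(object$xcoord),
                 rng.y = range(object$ycoord))
  class(object) <- "summary_polygon"
  object
}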
Note that it simply returns an object of class summary_polygon. Now the
corresponding print() method.
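A corresponding print method and an example polygon x are sketched below (both
assumed; note that the summary(x) output shown further down reflects the default
list printing of the summary object):
print.summary_polygon <- function(x, ...) {
  cat("x:", x$rng.x[1], "-->", x$rng.x[2], "\n")
  cat("y:", x$rng.y[1], "-->", x$rng.y[2], "\n")
  invisible(x)
}

x <- make_poly(1:4, c(1, 5, 2, 1))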
print(x)
a polygon with 4 vertices
And we can use the summary() method to get a bit more information about the
object.
Because of auto-printing we can just call the summary() method and let the
results auto-print.
summary(x)
$rng.x
[1] 1 4
$rng.y
[1] 1 5
attr(,"class")
[1] "summary_polygon"
From here, we could build other methods for interacting with our polygon
object. For example, it may make sense to define a plot() method or maybe
methods for intersecting two polygons together.
S4
The S4 system is slightly more restrictive than S3, but it's similar in many
ways. To create a new class in S4 you need to use the setClass() function. You
need to specify two or three arguments for this function: Class, which is the
name of the class as a string; slots, which is a named list of attributes for
the class with the class of those attributes specified; and optionally contains,
which includes the super-class of the class you're specifying (if there is a
super-class). Take a look at the class definition for a bus_S4 and a party_bus_S4
below:
setClass("bus_S4",
slots = list(n_seats = "numeric",
top_speed = "numeric",
current_speed = "numeric",
brand = "character"))
setClass("party_bus_S4",
slots = list(n_subwoofers = "numeric",
smoke_machine_on = "logical"),
contains = "bus_S4")
Now that we've created the bus_S4 and the party_bus_S4 classes we can create
bus objects using the new() function. The new() function's arguments are the
name of the class and values for each slot in our S4 object.
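The creation of my_bus is not shown in the extracted text; judging from the slot
values printed below and the later my_bus@n_seats output, it is presumably:
my_bus <- new("bus_S4", n_seats = 20, top_speed = 80,
              current_speed = 0, brand = "Volvo")
my_bus
An object of class "bus_S4"
Slot "n_seats":
[1] 20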
Slot "top_speed":
[1] 80
Slot "current_speed":
[1] 0
Slot "brand":
[1] "Volvo"
my_party_bus <- new("party_bus_S4", n_seats = 10, top_speed = 100,
current_speed = 0, brand = "Mercedes-Benz",
n_subwoofers = 2, smoke_machine_on = FALSE)
my_party_bus
An object of class "party_bus_S4"
Slot "n_subwoofers":
[1] 2
Slot "smoke_machine_on":
[1] FALSE
Slot "n_seats":
[1] 10
Slot "top_speed":
[1] 100
Slot "current_speed":
[1] 0
Slot "brand":
[1] "Mercedes-Benz"
my_bus@n_seats
[1] 20
my_party_bus@top_speed
[1] 100
This is essentially the same as using the $ operator with a list or an environ-
ment.
S4 classes use a generic method system that is similar to S3 classes. In order
to implement a new generic method you need to use the setGeneric() function
and the standardGeneric() function in the following way:
setGeneric("new_generic", function(x){
standardGeneric("new_generic")
})
setGeneric("is_bus_moving", function(x){
standardGeneric("is_bus_moving")
})
[1] "is_bus_moving"
Now we need to actually define the function, which we can do with setMethod().
The setMethod() function takes as arguments the name of the method as a
string, the method signature, which specifies the class of each argument for
the method, and then the function definition of the method:
setMethod("is_bus_moving",
c(x = "bus_S4"),
function(x){
x@current_speed > 0
})
[1] "is_bus_moving"
is_bus_moving(my_bus)
[1] FALSE
my_bus@current_speed <- 1
is_bus_moving(my_bus)
[1] TRUE
In addition to creating your own generic methods, you can also create a
method for your new class from an existing generic. First use the setGeneric()
function with the name of the existing method you want to use with your
class, and then use the setMethod() function like in the previous example. Lets
make a print() method for the bus_S4 class:
setGeneric("print")
[1] "print"
setMethod("print",
c(x = "bus_S4"),
function(x){
paste("This", x@brand, "bus is traveling at a speed of", x@current_speed)
})
[1] "print"
print(my_bus)
[1] "This Volvo bus is traveling at a speed of 1"
print(my_party_bus)
[1] "This Mercedes-Benz bus is traveling at a speed of 0"
Reference Classes
With reference classes we leave the world of Rs old object oriented systems
and enter the philosophies of other prominent object oriented programming
languages. We can use the setRefClass() function to define a class fields,
methods, and super-classes. Lets make a reference class that represents a
student:
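The class definition is not included in the extracted text; a sketch consistent
with the fields and methods used below (the exact field names and the email
format are assumptions):
Student <- setRefClass("Student",
                       fields = list(name = "character",
                                     grad_year = "numeric",
                                     credits = "numeric",
                                     id = "character",
                                     courses = "list"),
                       methods = list(
                         hello = function(){
                           paste("Hi! My name is", name)
                         },
                         add_credits = function(n){
                           credits <<- credits + n
                         },
                         get_email = function(){
                           paste0(id, "@jhu.edu")
                         }
                       ))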
To recap: we've created a class definition called Student which defines the
student class. This class has five fields and three methods. To create a Student
object use the new() method:
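For example (the field values here, other than the ones shown in the output
below, are purely illustrative):
brooke <- Student$new(name = "Brooke", grad_year = 2019, credits = 40,
                      id = "ba123", courses = list("Ecology", "Calculus III"))
roger <- Student$new(name = "Roger", grad_year = 2020, credits = 10,
                     id = "rp456", courses = list("Statistical Computing"))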
You can access the fields and methods of each object using the $ operator:
brooke$credits
[1] 40
roger$hello()
[1] "Hi! My name is Roger"
roger$get_email()
[1] "[email protected]"
Methods can change the state of an object, for instance in the case of the
add_credits() function:
brooke$credits
[1] 40
brooke$add_credits(4)
brooke$credits
[1] 44
Notice that the add_credits() method uses the complex assignment operator
(<<-). You need to use this operator if you want to modify one of the fields of an
object with a method. You'll learn more about this operator in the Expressions
& Environments section.
Reference classes can inherit from other classes by specifying the contains
argument when they're defined. Let's create a sub-class of Student called
Grad_Student which includes a few extra features:
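A sketch of the sub-class and an example object (not shown in the extracted
text; the thesis_topic field and the other values for jeff are assumptions
based on the output below):
Grad_Student <- setRefClass("Grad_Student",
                            contains = "Student",
                            fields = list(thesis_topic = "character"),
                            methods = list(
                              defend = function(){
                                paste0(thesis_topic, ". QED.")
                              }
                            ))

jeff <- Grad_Student$new(name = "Jeff", grad_year = 2021, credits = 8,
                         id = "jl789", courses = list("Advanced Methods"),
                         thesis_topic = "Batch Effects")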
jeff$defend()
[1] "Batch Effects. QED."
Summary
R has three object oriented systems: S3, S4, and Reference Classes.
Reference Classes are the most similar to classes and objects in other
programming languages.
Classes are blueprints for an object.
Objects are individual instances of a class.
Methods are functions that are associated with a particular class.
Constructors are methods that create objects.
Everything in R is an object.
S3 is a liberal object oriented system that allows you to assign a class to
any object.
S4 is a more strict object oriented system that builds upon ideas in S3.
Reference Classes are a modern object oriented system that is similar to
Java, C++, Python, or Ruby.
Many of the tools that we discuss in this book revolve around the so-called
tidyverse set of tools. These tools, largely developed by Hadley Wickham
but also including a diverse community of developers, have a set of principles
that are adhered to when they are being developed. Hadley Wickham laid out
these principles in his Tidy Tools Manifesto, a vignette within the tidyverse
package.
The four basic principles of the tidyverse are:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
R has a number of data structures (data frames, vectors, etc.) that people have
grown accustomed to over the many years of Rs existence. While it is often
tempting to develop custom data structures, for example, by using S3 or S4
classes, it is often worthwhile to consider reusing a commonly used structure.
Youll notice that many tidyverse functions make heavy use of the data frame
(typically as their first argument), because the data frame is a well-known,
well-understood structure used by many analysts. Data frames have a well-
known and reasonably standardized corresponding file format in the CSV file.
While common data structures like the data frame may not be perfectly suited
to your needs as you develop your own software, it is worth considering
using them anyway because of the enormous value to the community that is
already familiar with them. If the user community feels familiar with the data
structures required by your code, they are likely to adopt them quicker.
One of the original principles of the Unix operating system was that every
program should do one thing well. The limitation of only doing one thing
(but well!) was removed by being able to easily pipe the output of one function
to be the input of another function (the pipe operator on Unix was the
| symbol). Typical Unix commands would contain long strings of commands
piped together to (eventually) produce some useful output. On Unix systems,
the unifying concept that allowed programs to pipe to each other was the use
of textual formats. All data was rendered in textual formats so that if you
wrote a new program, you would not need to worry about decoding some
obscure proprietary format.
Much like the original Unix systems, the tidyverse eschews building
monolithic functions that have many bells and whistles. Rather, once you are
finished writing a simple function, it is better to start afresh and work off
the input of another function to produce new output (using the %>% operator,
for example). The key to this type of development is having clean interfaces
between functions and an expectation that the output of every function may
serve as the input to another function. This is why the first principle (reuse
existing data structures) is important, because the reuse of data structures
that are well-understood and characterized lessens the burden on other
developers who are developing new code and would prefer not to worry
about new-fangled data structures at every turn.
This can be a tough principle for people coming from other non-functional
programming languages. But the reality is, R is a functional programming
language (with its roots in Scheme) and it's best not to go against the grain. In
our section on Functional Programming, we outlined many of the principles
that are fundamental to functional-style programming. In particular, the
purrr package implements many of those ideas.
also allows for simple parallelization, so that we can quickly parallelize any
code that uses lapply() or map().
Making your code readable and usable by people is a goal that is overlooked
surprisingly often. The result is things like function names that are obscure
and do not actually communicate what they do. When writing code, using
good, explicit function names with descriptive arguments can
allow users to quickly learn your API. If you have a set of functions with
a similar purpose, they might share a prefix (see e.g. geom_point(), geom_line(),
etc.). If you have an argument like color that could either take arguments 1,
2, and 3, or black, red, and green, think about which set of arguments might be
easier for humans to handle.
3. Building R Packages
This section covers building R packages. Writing good code for data science
is only part of the job. In order to maximize the usefulness and reusability of
data science software, code must be organized and distributed in a manner
that adheres to community-based standards and provides a good user experi-
ence. This section covers the primary means by which R software is organized
and distributed to others. We cover R package development, writing good
documentation and vignettes, writing robust software, cross-platform devel-
opment, continuous integration tools, and distributing packages via CRAN
and GitHub. Learners will produce R packages that satisfy the criteria for
submission to CRAN.
The Learning objectives for this section are:
If you only plan to write R code for your package, then the tools you need come
with R and RStudio. However, if you want to
build packages with compiled C, C++, or Fortran code (or wish to build other
people's packages with such code), then you will need to install additional
tools. Which tools you install depends on what platform you are running.
Mac OS
For developing in Mac OS, you will first need to download the Xcode
development environment. While you do not need the IDE that comes with Xcode to
develop R packages, you need many of the tools that come with it, including
the C compiler (clang). Xcode can be obtained from either the Mac App Store
or from Apple's Xcode developers page. Once this is installed you will have
the C compiler as well as a number of additional Unix shell tools. You will also
have the necessary header files for compiling C code.
While it's unlikely that you will be building your own packages with Fortran
code, many older packages (including R itself) contain Fortran code. Therefore,
in order to build these packages, you need a Fortran compiler. Mac OS
does not come with one by default and so you can download the GNU Fortran
Compiler from the R for Mac tools page.
There are more details provided on the R for Mac tools page maintained
by Simon Urbanek, particularly for older versions of Mac OS.
Windows
On Windows, the R Core has put together a package of tools that you can
download all at once and install via a simple installer tool. The Rtools package
comes in different versions, depending on the version of R that you are using.
Make sure to get the version of Rtools that matches your version of R. Once
you have installed this, you will have most of the tools needed to build R
packages. You can optionally install a few other tools, documented here.
Unix/Linux
If you are using R on a Unix-like system then you may already have the
tools for building R packages. In particular, if you built R from the sources,
then you already have a C compiler and Fortran compiler. If, however,
you installed R from a package management system, then you may need to
install the compilers, as well as the header files. These usually come in
packages with the suffix -devel. For example, the header files for the readline
package may come in the package readline-devel. The catch is that these -devel
packages are not needed to run R, only to build R packages from the sources.
3.2 R Packages
This chapter highlights the key elements of building R packages. The fine
details of building a package can be found in the Writing R Extensions
manual.
At the top level of your package directory you will have a DESCRIPTION file and
a NAMESPACE file. This represents the minimal requirements for an R package.
Other files and sub-directories can be added, and we will discuss how and why in
the sections below.
DESCRIPTION File
Package: mvtsplot
Version: 1.0-3
Date: 2016-05-13
Depends: R (>= 3.0.0)
Imports: splines, graphics, grDevices, stats, RColorBrewer
Title: Multivariate Time Series Plot
Author: Roger D. Peng <[email protected]>
Maintainer: Roger D. Peng <[email protected]>
Description: A function for plotting multivariate time series data.
License: GPL (>= 2)
URL: https://github.com/rdpeng/mvtsplot
NAMESPACE File
The NAMESPACE file specifies the interface to the package that is presented
to the user. This is done via a series of export() statements, which indicate
which functions in the package are exported to the user. Functions that are
not exported cannot be called directly by the user (although see below). In
addition to exports, the NAMESPACE file also specifies what functions or
packages are imported by the package. If your package depends on functions
from another package, you must import them via the NAMESPACE file.
Below is the NAMESPACE file for the mvtsplot package described above.
export("mvtsplot")
import(splines)
import(RColorBrewer)
importFrom("grDevices", "colorRampPalette", "gray")
importFrom("graphics", "abline", "axis", "box", "image", "layout",
"lines", "par", "plot", "points", "segments", "strwidth",
"text", "Axis")
importFrom("stats", "complete.cases", "lm", "na.exclude", "predict",
"quantile")
Here we can see that only a single function is exported from the package (the
mvtsplot() function). There are two types of import statements: import(), which
imports every exported function from the named package (as with splines and
RColorBrewer above), and importFrom(), which imports only the specific functions
named from a package (as with grDevices, graphics, and stats above).
As you start to use many packages in R, the likelihood of two functions having
the same name increases. For example, the commonly used dplyr package has
a function named filter(), which is also the name of a function in the stats
package. If one has both packages loaded (a more than likely scenario) how
can one specify exactly which filter() function they want to call?
In R, every function has a full name, which includes the package namespace
as part of the name. This format is along the lines of package_name::function_name().
For example, the filter() function from the dplyr package can be referenced
as dplyr::filter(). This way, there is no confusion over which filter() func-
tion we are calling. While in principle every function can be referenced in this
way, explicitly qualifying every call is usually only necessary when there is a
conflict between function names.
The R Sub-directory
The R sub-directory contains all of the R code files for the package.
The man Sub-directory
The man sub-directory contains the documentation files for all of the exported
objects of a package. With older versions of R one had to write the
documentation of R objects directly into the man directory using a LaTeX-style notation.
However, with the development of the roxygen2 package, we no longer need
to do that and can write the documentation directly into the R code files.
Therefore, you will likely have little interaction with the man directory as all
of the files in there will be auto-generated by the roxygen2 package.
Summary
Hands down, the best resource for mastering the devtools package is
the book R Packages by Hadley Wickham. The full book is available
online for free at http://r-pkgs.had.co.nz. It is also available as a
hard copy book published by O'Reilly. If you plan to develop a lot of
R packages, it is well worth your time to read this book closely.
Here are some of the key functions included in devtools and what they do,
roughly in the order you are likely to use them as you develop an R package:
Function Use
create Create the file structure for a new package
load_all Load the code for all functions in the package
document Create man documentation files and the
NAMESPACE file from roxygen2 code
use_data Save an object in your R session as a dataset
in the package
use_package Add a package you're using to the
DESCRIPTION file
use_vignette Set up the package to include a vignette
use_readme_rmd Set up the package to include a README file
in R Markdown format
use_build_ignore Specify files that should be ignored when
building the R package (for example, if you
have a folder where you're drafting a journal
article about the package, you can include all
related files in a folder that you set to be
ignored during the package build)
check Check the full R package for any ERRORs,
WARNINGs, or NOTEs
build_win Build a version of the package for Windows
and send it to be checked on a Windows
machine. You'll receive an email with a link
to the results.
use_travis Set the package up to facilitate using Travis
CI with the package
use_cran_comments Create a file where you can add comments to
include with your CRAN submission.
submit_cran Submit the package to CRAN
use_news_md Add a file to the package to give news on
changes in new versions
Some of these functions you'll only need to use once for a package. The
one-time (per package) functions are mostly those that set up a certain
type of infrastructure for the package. For example, if you want to use R
Markdown to create a README file for a package you are posting to GitHub,
you can create the proper infrastructure with the use_readme_rmd function. This
function adds a starter README file in the main directory of the package with
the name README.Rmd. You can edit this file and render it to Markdown to
provide GitHub users more information about your package. However, you
will have problems with your CRAN checks if there is a README file in this
top-level directory of the package, so the use_readme_rmd function also adds
the file names for the R Markdown README file, and the Markdown file it
renders to, to the .Rbuildignore file, so they are not included when the package
is built.
Creating a package
The earliest infrastructure function you will use from the devtools package is
create. This function inputs the filepath for the directory where you would
like to create the package and creates the initial package structure (as a note,
this directory should not yet exist). You will then add the elements (code, data,
etc.) for the package within this structure. As an alternative to create, you can
also initialize an R package in RStudio by selecting File -> New Project ->
New Directory -> R Package.
Example of the directory contents of the initial package structure created with devtools.
Other functions
In contrast to the devtools infrastructure functions that you will only use
once per package, there are other devtools functions youll use many times
as you develop a package. Two of the workhorses of devtools are load_all
and document. The load_all function loads the entire package (by default, based
on the current working directory, although you can also give the filepath to
load a package directory elsewhere). In addition to loading all R functions, it
also loads all package data and compiles and connects C, C++, and FORTRAN
code in the package. As you add to a package, you can use load_all to
ensure you're using the latest version of all package functions and data. The
document function rewrites the help files and NAMESPACE file based on the
latest version of the roxygen2 comments for each function (writing roxygen2 is
covered in more detail in the next section).
Summary
The devtools package contains functions that help with R package devel-
opment. These functions include create, which creates the initial structure
for a new package, as well as a number of functions for adding useful
infrastructure to the package directory and functions to load and document
the package.
3.4 Documentation
There are two main types of documentation you may want to include with
packages:
Longer documents that give tutorials or overviews for the whole pack-
age
Shorter, function-specific help files for each function or group of related
functions
You can create the first type of document using package vignettes, README
files, or both. For the function-specific help files, the easiest way to create
these is with the roxygen2 package.
In this section, well cover why and how to create this documentation. In
addition, vignette / README documentation can be done using knitr to create
R Markdown documents that mix R code and text, so well include more
details on that process.
You will likely want to create a document that walks users through the basics
of how to use your package. You can do this through two formats:
A package vignette, which is bundled with the package itself
A README file, which is displayed on the package's GitHub repository page
A package likely only needs a README file if you are posting the package
to GitHub. For any GitHub repository, if there is a README.md file in the
top directory of the repository, it will be rendered on the main GitHub
repository page below the listed repository content. For an example, visit
https://github.com/geanders/countytimezones and scroll down. Youll see a
list of all the files and subdirectories included in the package repository and
below that is the content in the package's README.md file, which gives a tutorial
on using the package.
If the README file does not need to include R code, you can write it directly
as an .md file, using Markdown syntax, which is explained in more detail
in the next section. If you want to include R code, you should start with a
README.Rmd file, which you can then render to Markdown using knitr. You can
use the devtools package to add either a README.md or README.Rmd file to a package
directory using use_readme_md or use_readme_rmd, respectively. These functions
will add the appropriate file to the top level of the package directory and will
also add the file name to .Rbuildignore, since having one of these files in
the top level of the package directory could otherwise cause some problems
when building the package.
The README file is a useful way to give GitHub users information about your
package, but it will not be included in builds of the package or be available
through CRAN for packages that are posted there. Instead, if you want to
create tutorials or overview documents that are included in a package build,
you should do that by adding one or more package vignettes. Vignettes are
stored in a vignettes subdirectory within the package directory.
To add a vignette file, saved within this subdirectory (which will be created if
you do not already have it), use the use_vignette function from devtools. This
function takes as arguments the file name of the vignette you'd like to create
and the package for which you'd like to create it (the default is the package in
the current working directory). For example, if you are currently working in
your package's top-level directory and you would like to add a vignette called
model_details, you can do that with the code:
use_vignette("model_details")
You can have more than one vignette per package, which can be useful if
you want to include one vignette that gives a more general overview of the
package as well as a few vignettes that go into greater detail about particular
aspects or applications.
Knitr / Markdown
Both vignettes and README files can be written as R Markdown files, which
will allow you to include R code examples and results from your package.
One of the most exciting tools in R is the knitr system for combining code and
text to create a reproducible document. In terms of the power you get for
time invested in learning a tool, knitr probably can't be beat. Everything you
need to know to create and knit a reproducible document can be learned
in about 20 minutes, and while there is a lot more you can do to customize
this process if you want to, probably 80% of what you'll ever want to do with
knitr you'll learn in those first 20 minutes.
To write a file in Markdown, you'll need to learn the conventions for creating
formatting. This table shows what you would need to write in a flat file for
some common formatting choices:
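The table is not included in the extracted text; common Markdown conventions of
the kind it covered include:
Code    Rendering
**text**    boldface
*text*    italicized
[text](www.website.com)    hyperlink
# text    first-level header
## text    second-level header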
The start of a Markdown file gives some metadata for the file (authors, title,
format) in a language called YAML. For example, the YAML section of a
package vignette might look like this:
---
title: "Model Details for example_package"
author: "Jane Doe"
date: "2017-01-05"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Model Details for example_package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
To open a new R Markdown file, go to File -> New File -> R Markdown.
To start, choose a Document in HTML format.
This will open a new R Markdown file in RStudio. The file extension for
R Markdown files is .Rmd.
The new file comes with some example code and text. You can run the
file as-is to try out the example. You will ultimately delete this example
code and text and replace it with your own.
Once you knit the R Markdown file, R will render an HTML file with
the output. This is automatically saved in the same directory where you
saved your .Rmd file.
Write everything besides R code using Markdown syntax.
The knit function from the knitr package works by taking a document in R
Markdown format (among a few possible formats), reading through it for
any markers of the start of R code, running any of the code between that
start marker and a marker showing a return to regular Markdown, writing
any of the relevant results from R code into the Markdown file in Markdown
format, and then passing the entire document to software that can render
from Markdown to the desired output format (for example, compile a pdf,
Word, or HTML document).
This means that all a user needs to do to include R code within a document
is to properly separate it from other parts of the document through the
appropriate markers. To indicate R code in an RMarkdown document, you
need to separate off the code chunk using the following syntax:
```{r}
my_vec <- 1:10
```
This syntax tells R how to find the start and end of pieces of R code (code
chunks) when the file is rendered. R will walk through, find each piece of R
code, run it and create output (printed output or figures, for example), and
then pass the file along to another program to complete rendering (e.g., Tex
for pdf files).
You can specify a name for each chunk, if you'd like, by including it after r
when you begin your chunk. For example, to give the name load_mtcars to a
code chunk that loads the mtcars dataset, specify that name in the start of the
code chunk:
```{r load_mtcars}
data(mtcars)
```
You do not have to name each chunk. However, naming chunks has some advantages:
named chunks make it easier to locate errors, and knitr uses chunk names when
naming any figure files a chunk generates, which makes those files easier to
identify later.
You can also add options when you start a chunk. Many of these options can
be set as TRUE / FALSE and include:
Option Action
echo Print out the R code?
eval Run the R code?
message Print out messages?
warning Print out warnings?
include If FALSE, run code, but dont print code or results
Other chunk options take values other than TRUE / FALSE. Some you might
want to include are:
Option Action
results How to print results (e.g., hide runs the code, but
doesn't print the results)
fig.width Width to print your figure, in inches (e.g., fig.width
= 4)
fig.height Height to print your figure
To include any of these options, add the option and value in the opening
brackets and separate multiple options with commas:
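For example (an illustrative chunk, not taken from the original text):
```{r load_mtcars, echo = FALSE, message = FALSE}
data(mtcars)
```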
You can set global options at the beginning of the document. This will create
new defaults for all of the chunks in the document. For example, if you want
echo, warning, and message to be FALSE by default in all code chunks, you can run:
```{r global_options}
knitr::opts_chunk$set(echo = FALSE, message = FALSE,
warning = FALSE)
```
If you set both global and local chunk options, the options that you set specifically
for a chunk will take precedence over the global options. For example, running a
document with:
```{r global_options}
knitr::opts_chunk$set(echo = FALSE, message = FALSE,
warning = FALSE)
```
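followed by a chunk along these lines (the original check_mtcars chunk is not
shown; this is an illustrative reconstruction):
```{r check_mtcars, echo = TRUE}
head(mtcars, 1)
```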
would print the code for the check_mtcars chunk, because the option specified
for that specific chunk (echo = TRUE) would override the global option (echo =
FALSE).
You can also include R output directly in your text (inline) using backticks:
There are `r nrow(mtcars)` observations in the mtcars data set. The average
miles per gallon is `r mean(mtcars$mpg, na.rm = TRUE)`.
Once the file is rendered, this gives:
There are 32 observations in the mtcars data set. The average miles per gallon
is 20.090625.
There are a few tips that can help you diagnose problems when rendering R
Markdown files. In particular, you'll want to try out pieces of your code as you
write an R Markdown document. There are a few ways you can do that:
You can run code in chunks just like you can run code from a script (Ctrl-
Return or the Run button).
You can run all the code in a chunk (or all the code in all chunks) using
the different options under the Run button in RStudio.
All the Run options have keyboard shortcuts, so you can use those.
#' @export
hello_world <- function(to_print = "Hello world", excited = FALSE){
if(excited) to_print <- paste0(to_print, "!")
print(to_print)
}
You can run the document function from the devtools package at any time
to render the latest version of these roxygen2 comments for each of your
functions. This will create function-specific help files in the package's man
subdirectory as well as update the package's NAMESPACE file.
Here are some of the common roxygen2 tags to use in creating this documentation:
Tag             Meaning
@return         A description of the object returned by the function
@param          Explanation of a function parameter
@inheritParams  Name of a function from which to get parameter definitions
@examples       Example code showing how to use the function
@details        Add more details on how the function works (for example, specifics of the algorithm being used)
@note           Add notes on the function or its use
@source         Add any details on the source of the code or ideas for the function
@references     Add any references relevant to the function
@importFrom     Import a function from another package to use in this function (this is especially useful for infix operators like %>% and %within%)
@export         Export the function, so users will have direct access to it when they load the package
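For a concrete sense of how these tags fit together, the hello_world() function shown earlier might be documented along these lines (the descriptive text is illustrative):

#' Print a greeting
#'
#' @param to_print A character string giving the text to print.
#' @param excited Logical; if TRUE, an exclamation point is appended.
#'
#' @return The character string that was printed.
#'
#' @examples
#' hello_world()
#' hello_world("Hi there", excited = TRUE)
#'
#' @export
hello_world <- function(to_print = "Hello world", excited = FALSE){
  if(excited) to_print <- paste0(to_print, "!")
  print(to_print)
}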
Here are a few things to keep in mind when writing help files using roxygen2:
The tags @example and @examples do different things. You should always
use the @examples (plural) tag for example code, or you will get errors
when you build the documentation.
The @inheritParams tag can save you a lot of time, because if you
are using the same parameters in multiple functions in your package,
you can write and edit those parameter descriptions just in one place.
However, keep in mind that you must point @inheritParams to the function
where you originally define the parameters using @param, not another
function where you use the parameters but define them using an
@inheritParams pointer.
If you want users to be able to directly use the function, you must include
@export in your roxygen2 documentation. If you have written a function
but then find it isn't being found when you try to compile a README
file or vignette, a common culprit is that you have forgotten to export
the function.
You can include formatting (lists, etc.) and equations in the roxygen2 documentation. Here are some of the common formatting tags you might want to use:
Tag         Meaning
\code{}     Format in a typeface to look like code
\dontrun{}  Use with examples, to avoid running the example code during package builds and testing
\link{}     Link to another R function
\eqn{}{}    Include an inline equation
\deqn{}{}   Include a display equation (i.e., shown on its own line)
\itemize{}  Create an itemized list
\url{}      Include a web link
\href{}{}   Include a web link with custom anchor text
Usually, you'll want to use the \link tag only in combination with
the \code tag, since you're linking to another R function. Make sure
you use these with \code wrapping \link, not the other way around
(\code{\link{other_function}}), or you'll get an error.
Some of the equation formatting, including superscripts and subscripts,
won't parse in Markdown-based documentation (but will for pdf-based
documentation). With the \eqn and \deqn tags, you can include two versions
of an equation, one with full formatting, which will be fully compiled
by pdf-based documentation, and one with a reduced form that
looks better in Markdown-based documentation (for example, \deqn{
\frac{X^2}{Y} }{ X2 / Y }).
For any examples in help files that take a while to run, you'll want to
wrap the example code in the \dontrun tag.
The tags \url and \href both include a web link. The difference between
the two is that \url will print out the web address in the help
documentation, while \href allows you to use text other than the web address
for the anchor text of the link. For example: "For more information, see
\url{www.google.com}."; "For more information, \href{www.google.com}{Google
it}.".
In addition to documenting functions, you should also document any data that
comes with your package. To do that, create a file called data.R in the /R folder
of the package and use it to document all of the package's datasets.
You can use roxygen2 to document each dataset, and end each documentation
block with the name of the dataset in quotation marks. There are more details
on documenting package data using roxygen2 in the next section.
As you prepare a package for sharing with others, you may want to
create a pdf manual, which provides a more user-friendly format
for proofreading all the package help files. You can create one
with the R CMD Rd2pdf shell command. To use this, open a shell and
navigate to the parent directory of your R package directory (an
easy way to do this is to open a shell using the Shell option
for the gear button in the Git pane in RStudio and then running
cd .. to move up one directory). Then, from the shell, run R CMD
Rd2pdf followed by your package's name (e.g., for a package named
examplepackage, run R CMD Rd2pdf examplepackage). This command
builds your package and creates and opens a pdf with the text of
all help files for exported functions. Check out this StackOverflow
thread for more.
Summary
You should include documentation to help others use your package, both
longer-form documentation through vignettes or README files and function-specific help files.
Data Objects
Users of your package should be able to look up a help page for each included
data set to find out more information about it. You should create
one R file called data.R in the R/ directory of your package and write the
data documentation in that file. Let's take a look at some documentation
examples from the minimap package. First we'll look at the documentation for
a data frame called maple:
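The minimap documentation itself is not reproduced in this excerpt; data frame documentation written with roxygen2 in data.R generally looks like the sketch below (field values are illustrative, not the actual minimap source):

#' Production and farm value of maple products in Canada
#'
#' @source Statistics Canada
#' @format A data frame with columns:
#' \describe{
#'  \item{Year}{A value between 1924 and 2015.}
#'  \item{Syrup}{Maple products expressed as syrup, in thousands of gallons.}
#' }
#' @examples
#'  head(maple)
"maple"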
Data frames that you include in your package should follow the general
schema above, where the documentation page has a title describing the data,
a @source tag for where the data came from, a @format block describing each
column, and a final line giving the name of the data frame as a character string.
The minimap package also includes a few vectors. Let's look at the documentation for mexico_abb:
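That documentation is likewise not reproduced here; a vector can be documented with just a title, an optional description or @source, and the object name as a string (an illustrative sketch, not the actual minimap source):

#' Abbreviations of Mexican states
#'
#' @source Compiled from publicly available lists of state abbreviations.
"mexico_abb"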
You should always include a title in the documentation of a vector or any other
object. If you need to elaborate on the details of a vector, you can include a
description in the documentation or a @source tag. Just like with data frames,
the documentation for a vector should end with a string containing the name
of the object.
Raw Data
A common task for R packages is to take raw data from files and to import
them into R objects so that they can be analyzed. You might want to include
some sample raw data files so you can show different methods and options
for importing the data. To include raw data files in your package you should
create an inst/extdata directory in your R package. If you stored a data
file called response.json in inst/extdata and your package
is named mypackage, then a user could access the path to this file with
system.file("extdata", "response.json", package = "mypackage"). Include that line
of code in the documentation for your package so that your users know how
to access the raw data file.
Internal Data
Functions in your package may need to have access to data that you don't
want your users to be able to access. For example, the swirl package contains
translations of menu items into languages other than English; that
data has nothing to do with the purpose of the swirl package and so it's hidden
from the user. To add internal data to your package you can use the
use_data() function from devtools, however you must specify the internal = TRUE
argument. All of the objects you pass to use_data(..., internal = TRUE) can be
referenced by the same name within your R package. All of these objects will
be saved to one file called R/sysdata.rda.
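For example, from the top level of your package directory you might run something like the following (the object here is hypothetical):

library(devtools)

# A lookup table used only inside the package, never exported to users
status_labels <- data.frame(code = c("a", "b"),
                            label = c("Alpha", "Beta"))

use_data(status_labels, internal = TRUE)   # written to R/sysdata.rda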
Data Packages
There are several packages which were created for the sole purpose of
distributing data including janeaustenr, gapminder, babynames, and lego.
Using an R package as a means of distributing data has advantages and
disadvantages. On one hand the data is extremely easy to load into R, as a
user only needs to install and load the package. This can be useful for teaching
folks who are new to R and may not be familiar with importing and cleaning
data. Data packages also allow you to document datasets using roxygen2, which
provides a much cleaner and more programmer-friendly kind of code book
compared to including a file that describes the data. On the other hand, data in
a data package is not accessible to people who are not using R, though there's
nothing stopping you from distributing the data in multiple ways.
If you decide to create a data package you should document the process that
you used to obtain, clean, and save the data. One approach to doing this is
to use the use_data_raw() function from devtools. This will create a directory
inside of your package called data-raw. Inside of this directory you should
include any raw files that the data objects in your package are derived from.
You should also include one or more R scripts which import, clean, and save
those data objects in your R package. Theoretically if you needed to update the
data package with new data files you should be able to just run these scripts
again in order to rebuild your package.
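A minimal sketch of that workflow (file and object names are hypothetical):

library(devtools)
use_data_raw()    # creates the data-raw/ directory in your package

# data-raw/mydata.R might then contain:
#   mydata <- read.csv("data-raw/mydata.csv")
#   # ... cleaning steps ...
#   devtools::use_data(mydata, overwrite = TRUE)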
Summary
Including data in a package is useful for showing new users how to use your
package, using data internally, and sharing and documenting datasets. The
devtools package includes several useful functions to help you add data to
your package including use_data() and use_data_raw(). You can document data
within your package just like you would document a function.
Once you've written code for an R package and have gotten that code to a
point where you believe it's working, it may be a good time to step back and
consider a few things about your code.
How do you know it's working? Given that you wrote the functions,
you have a certain set of expectations about how the functions should
behave. Specifically, for a given set of inputs you expect a certain output.
Having these expectations clearly in mind is an important aspect of
knowing whether code is working.
Have you already tested your code? Chances are, throughout the
development of your code, you ran little tests to see if your functions
were working. Assuming these tests were valid for the code you were
testing, it's worth keeping these tests on hand and making them part of
your package.
Setting up a battery of tests for the code in your package can play a big role in
maintaining the ongoing smooth operation of the package and in hunting down
bugs in the code, should they arise. Over time, many aspects of a package can
change. Specifically:
As you actively develop your code, you may change/break older code
without knowing it. For example, modifying a helper function that lots
of other functions rely on may be better for some functions but may
break behavior for other functions. Without a comprehensive testing
framework, you might not know that some behavior is broken until a
user reports it to you.
The environment in which your package runs can change. The version
of R, libraries, web sites and any other external resources, and packages
can all change without warning. In such cases, your code may be un-
changed, but because of an external change, your code may not produce
the expected output given a set of inputs. Having tests in place that are
run regularly can help to catch these changes even if your package isn't
under active development.
As you fix bugs in your code, it's often a good idea to include a specific
test that addresses each bug so that you can be sure that the bug does
not return in a future version of the package (such a test is also known
as a regression test).
Testing your code effectively has some implications for code design. In
particular, it may be more useful to divide your code into smaller functions so
that you can test individual pieces more effectively. For example, if you have
one large function that returns TRUE or FALSE, it is easy to test this function,
but ultimately it may not be possible to identify problems deep in the code
by simply checking if the function returns the correct logical value. It may be
better to divide up a large function into smaller functions so that core elements
of the function can be tested separately to ensure that they are behaving
appropriately.
The testthat package is designed to make it easy to set up a battery of tests for
your R package. A nice introduction to the package can be found in Hadley
Wickham's article in the R Journal. Essentially, the package contains a suite
of functions for testing function/expression output with the expected output.
The simplest use of the package is for testing a simple expression:
library(testthat)
expect_that(sqrt(3) * sqrt(3), equals(3))
Note that the equals() function allows for some numerical fuzz, which is why
this expression actually passes the test. When a test fails, expect_that() throws
an error rather than returning a value.
The expect_that() function can be used to wrap many different kinds of tests,
beyond just numerical output. The table below provides a brief summary of
the types of comparisons that can be made.
Expectation          Description
equals()             check for equality with numerical fuzz
is_identical_to()    strict equality via identical()
is_equivalent_to()   like equals() but ignores object attributes
is_a()               checks the class of an object (using inherits())
matches()            checks that a string matches a regular expression
prints_text()        checks that an expression prints to the console
shows_message()      checks for a message being generated
gives_warning()      checks that an expression gives a warning
throws_error()       checks that an expression (properly) throws an error
is_true()            checks that an expression is TRUE
Multiple expectations can be grouped together and given a label with the
test_that() function:
test_that("model fitting", {
data(airquality)
fit <- lm(Ozone ~ Wind, data = airquality)
expect_that(fit, is_a("lm"))
expect_that(1 + 1, equals(2))
})
Typically, you would put your tests in an R file. If you have multiple sets of
tests that test different domains of a package, you might put those tests in
different files. Individual files can have their tests run with the test_file()
function. A collection of tests files can be placed in a directory and tested all
together with the test_dir() function.
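For example, if you keep a set of test files in a directory called tests, you might run them all with (the directory name here is illustrative):

library(testthat)
test_dir("tests")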
In the context of an R package, it makes sense to put the test files in the tests
directory. This way, when running R CMD check (see the next section) all of the
tests will be run as part of the process of checking the entire package. If any
of your tests fail, then the entire package checking process will fail and will
prevent you from distributing buggy code. If you want users to be able to
easily see the tests from an installed package, you can place the tests in the
inst/tests directory and have a separate file in the tests directory to run all
of the tests.
Before submitting a package to CRAN, you must pass a battery of tests that
are run by R itself via the R CMD check program. In RStudio, if you are in
an R Package Project you can run R CMD check by clicking the Check button in
the build tab. This will run a series of tests that check the metadata in your
package, the NAMESPACE file, the code, the documentation, run any tests,
build any vignettes, and many others.
Here is an example of the output from R CMD check for the filehash package,
which currently passes all tests.
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking compiled code ... OK
* checking sizes of PDF files under 'inst/doc' ... OK
* checking installed files from 'inst/doc' ... OK
* checking files in 'vignettes' ... OK
* checking examples ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking running R code from vignettes ...
'filehash.Rnw' ... OK
OK
* checking re-building of vignette outputs ... OK
* checking PDF version of manual ... OK
* DONE
Status: OK
When a check does not pass, the output points you to the problem. In one such
example (the failing output is not reproduced here), the functions Axis() and
lm() were needed by the package but were not available because they were not
imported from their respective packages. In that case, R CMD check provides a
suggestion of how you can modify the NAMESPACE file, but you are probably
better off modifying the roxygen2 documentation in the code file instead.
Moving on to the rest of the checks, another common problem (again, the check
output is not reproduced here) is a mismatch between code and documentation:
for example, the code has the first argument named x while the documentation
has the first argument named y.
Because of the mismatch in code and documentation for the first argument,
we have an argument that is not properly documented (x) and an argument
that is documented but not used (y).
In case the checks fly by too quickly, you will receive a summary message at the
end saying what errors and warnings you got.
* DONE
Status: 2 WARNINGs, 1 NOTE
You can specify how your R package is licensed in the package DESCRIPTION
file under the License: section. How you license your R package is important
because it provides a set of constraints for how other R developers use your
code. If you're writing an R package to be used internally in your company,
then your company may choose not to share the package. In this case licensing
your R package is less important since the package belongs to your company.
In your package DESCRIPTION you can specify License: file LICENSE, and then
create a text file called LICENSE which explains that your company reserves all
rights to the package.
However, if you (or your company) would like to publicly share your R
package, you should consider open source licensing. The philosophy of open
source revolves around three principles: the freedom to use the software,
to inspect and modify its source code, and to redistribute the software
(including modified versions).
Nearly all open source licenses provide the protections above. Let's discuss
three of the most popular open source licenses among R packages.
Known as the GPL, the GNU GPL, and GPL-3, the General Public License was
originally written by Richard Stallman. The GPL is known as a copyleft license,
meaning that any software that is bundled with or originates from software
licensed under the GPL must also be released under the GPL. The exact
meaning of "bundle" will depend a bit on the circumstances. For example, software
distributed with an operating system can be licensed under different licenses
even if the operating system itself is licensed under the GPL. You can use the
GPL-3 as the license for your R package by specifying License: GPL-3 in the
DESCRIPTION file.
It is worth noting that R itself is licensed under version 2 of the GPL, or GPL-2,
which is an earlier version of this license.
The MIT license is a more permissive license compared to the GPL. MIT
licensed software can be modified or incorporated into software that is
not open source. The MIT license protects the copyright holder from legal
liability that might be incurred from using the software. When using the MIT
license in an R package you should specify License: MIT + file LICENSE in the
DESCRIPTION file. You should then add a file called LICENSE to your package
which uses the following template exactly:
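The template itself is not reproduced in this excerpt. The LICENSE file that devtools generates for MIT-licensed packages typically contains just two fields, for example (the year and name below are placeholders):

YEAR: 2017
COPYRIGHT HOLDER: Your Name Here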
The Creative Commons licenses are usually used for artistic and creative
works, however the CC0 license is also appropriate for software. The CC0
license dedicates your R package to the public domain, which means that you
give up all copyright claims to your R package. The CC0 license allows your
software to join other great works like Pride and Prejudice, The Adventures of
Huckleberry Finn, and The Scarlet Letter in the public domain. You can use the
CC0 license for your R package by specifying License: CC0 in the DESCRIPTION
file.
You've put weeks of sweat and mental anguish into writing a new R package,
so why should you provide an open source license for software that you or
your company owns by default? Let's discuss a few arguments for why open
sourcing your software is a good idea.
Paying it Forward
So with that in mind, if you feel that the R language or the R community has
contributed to your success or the success of your company consider open
sourcing your software so that the greater R community can benefit from its
availability.
Linus's Law
Now let's turn off the NPR pledge campaign and move our line of thinking
from the Berkeley Kumbaya circle to the Stanford MBA classroom: as a
business person, why should you open source your software? One great reason
is a concept called Linus's Law, which refers to Linus Torvalds, the creator
of Linux. The Linux operating system is a huge open source software project
involving thousands of people. Linux has a reputation for security and for its
lack of bugs which is in part a result of so many people looking at and being
able to modify the source code. If the users of your software are able to view
and modify the source code of your R package your package will likely be
improved because of Linus's Law.
Hiring
If your software is open source, you may be able to hire developers who are
already familiar with your source code. On the other hand, if you're looking
for a job, your contributions to open source software can be a part of a
compelling portfolio which showcases your software skills.
However there are pitfalls you should be aware of when weighing a candidate's
open source contributions. Many open source contributions are essentially
free work - work that a candidate was able to do in their spare time.
The best candidates often cannot afford to make open source contributions.
The most meaningful ways that an individual contributes to their community
usually have nothing to do with writing software.
Summary
Licensing and copyright laws vary between countries and jurisdictions. You
shouldn't consider any part of this chapter as legal advice. If you have
questions about open source licensing of software you're building at work, you
should consult with your legal department. In most situations software that
you write on your own time belongs to you, and software that you write while
being paid by somebody else belongs to whoever is paying you. Open source
licensing allows you to put restrictions on how your software can be used by
others. The open source philosophy does not oppose the commercial sale of
software. Many companies offer an open source version of their software that
comes with limitations, while also offering a paid license for more expansive
commercial use. This business model is used by companies like RStudio and
Highcharts.
GitHub allows you to post and interact with online code repositories, where
all repositories are under git version control. You can post R packages on
GitHub and, with the install_github function from the devtools package, install
R packages directly from GitHub. GitHub can be particularly useful for
collaborating with others on R packages, as it allows all collaborators to push
and pull code between their personal computers and a GitHub repository.
While git historically required you to leave R and run git functions at a
command line, RStudio now has a number of features that make it easier to
interface directly with GitHub.
When using git and GitHub, there are three levels of tasks you'll need to do:
1. Initial set-up: these are things you will only need to do once (at least
per computer).
Download git
Configure git with your user name and email
Set up a GitHub account
Set up an SSH key to link RStudio on your personal computer with
your GitHub account
2. Set-up of a specific repository: these are things you will need to do
every time you create a new repository, but will only need to do once
per repository.
Initialize the directory on your personal computer as a git repository
Make an initial commit of files in the repository
Create an empty GitHub repository
Add the GitHub repository as a remote branch of the local repository
Push the local repository to the GitHub remote branch
(If you are starting from a GitHub repository rather than a local
repository, either clone the repository or fork and clone the repository
instead.)
3. Day-to-day workflow for a repository: these are things you will do
regularly as you develop the code in a repository.
Commit changes in files in the repository to save git history locally
Push committed changes to the GitHub remote branch
Pull the latest version of the GitHub remote branch to incorporate
changes from collaborators into the repository code saved on your
personal computer
Write and resolve Issues with the code in the repository
Fix any merge conflicts that come up between different collaborators'
code edits
If the repository is a fork, keep up-to-date with changes in the
upstream branch
Each of these elements is described in detail in this section. More generally,
this section describes how to use git and GitHub for version control and
collaboration when building R packages.
git
Git is a version control system. When a repository is under git version control,
information about all changes made, saved, and committed on any non-
ignored file in a repository is saved. This allows you to revert back to previous
versions of the repository and search through the history for all commits
made to any tracked files in the repository. If you are working with others,
using git version control allows you to see every change made to the code,
who made it, and why (through the commit messages).
You will need git on your computer to create local git repositories that you can
sync with GitHub repositories. Like R, git is open source. You can download
it for different operating systems.
After downloading git but before you use it, you should configure it. For
example, you should make sure it has your name and email address. You can
configure git from a bash shell (for Macs, you can use Terminal, while for
PCs you can use GitBash, which comes with the git installation).
You can use git config functions to configure your version of git. Two changes
you should make are to include your name and email address as the user.name
and user.email. For example, the following code, if run in a bash shell, would
configure a git account for a user named Jane Doe who has a generic email
address:
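A minimal sketch (the email address below is just a placeholder):

git config --global user.name "Jane Doe"
git config --global user.email "jane.doe@example.com"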
Once you've installed git, you should restart RStudio so RStudio can identify
that git is now available. Often, just restarting RStudio will be enough.
However, in some cases, you may need to take some more steps to activate git
in RStudio. To do this, go to RStudio -> Preferences -> Git/SVN. Choose
Enable version control. If RStudio doesn't automatically find your version
of git in the Git executable box (you'll know it hasn't if that box is blank),
browse for your git executable file using the Browse button beside that box.
If you aren't sure where your git executable is saved, try opening a bash
shell and running which git, which should give you the filepath if you have
git installed.
You can initialize a git repository either using commands from a bash shell
or directly from RStudio. First, to initialize a git repository from a bash shell,
take the following steps:
cd ~/example_analysis
git init
If you do not see Git in the box for Version control system, it
means either that you do not have git installed on your computer or
that RStudio was unable to find it. If so, see the earlier instructions
for making sure that RStudio has identified the git executable.
Once you initialize the project as a git repository, you should have a Git
window in one of your RStudio panes (top right pane by default). As you make
and save changes to files, they will show up in this window for you to commit.
For example, when there are changes to two files that have not yet been
committed, both files appear in the Git window, ready to be checked and
committed.
(Figure: Example of a git window in RStudio when files in the repository have been changed and
saved, but the changes haven't yet been committed to git.)
Committing
When you want git to record changes, you commit the files with the changes.
Each time you commit, you have to include a short commit message with
some information about the changes. You can make commits from a shell.
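For example, from a shell inside the repository directory, you might run (the commit message here is illustrative):

git add .                            # stage all changed files
git commit -m "Update analysis"      # record the commit with a short message

Alternatively, you can commit directly from the Git window in RStudio: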
1. Click on the boxes by the filenames in the top left panel to select the files
to commit.
2. If you'd like, you can use the bottom part of the window to look through
the changes you are committing in each file.
3. Write a message in the Commit message box in the top right panel.
Keep the message to one line in this box if you can. If you need to explain
more, write a short one-line message, skip a line, and then write a longer
explanation.
4. Click on the Commit button on the right.
Once you commit changes to files, they will disappear from the Git window
until you make and save more changes.
Browsing history
On the top left of the Commit window, you can toggle to History. This
window allows you to explore the history of commits for the repository.
The top part of this window lists commits to the repository, from most recent
to least. The commit message and author are shown for each commit. If you
click on a commit, you can use the bottom panel to look through the changes
made with that specific commit.
(Figure: Example of the History window for exploring git commit history in RStudio.)
GitHub allows you to host git repositories online. This allows you to:
To do any of this, you will need a GitHub account. You can sign up at
https://github.com. A free account is fine as long as you don't mind all of your
repositories being Public (viewable by anyone).
The basic unit for working in GitHub is the repository. A repository is a direc-
tory of files with some supplemental files saving some additional information
about the directory. While R Projects have this additional information saved
as an .RProj file, git repositories have this information in a directory called
.git.
If you have a local directory that you would like to push to GitHub, these are
the steps to do it. First, you need to make sure that the directory is under
git version control. See the previous notes on initializing a repository. Next,
you need to create an empty repository on GitHub to sync with your local
repository. To do that:
Now you are ready to connect the two repositories. First, you should change
some settings in RStudio so GitHub will recognize that your computer can be
trusted, rather than asking for your password every time. Do this by adding
an SSH key from RStudio to your GitHub account with the following steps:
Now you're ready to push your local repository to the empty GitHub repository
you created.
1. Open a shell and navigate to the directory you want to push. (You can
open a shell from RStudio using the gear button in the Git window.)
2. Add the GitHub repository as a remote branch with the following command
(this gives an example for adding a GitHub repository named
ex_repo in my GitHub account, geanders): git remote add origin
git@github.com:geanders/ex_repo.git As a note, when you create a repository
in GitHub, GitHub will provide suggested git code for adding the
GitHub repository as the origin remote branch to a repository. That
code is similar to the code shown above, but it uses https://github.com
rather than git@github.com; the latter tends to work better with
RStudio.
3. Push the contents of the local repository to the GitHub repository:
git push -u origin master
To pull a repository that already exists on GitHub and to which you have
access (or that you've forked and so have access to the forked branch), first
use cd from a bash shell on your personal computer to move into the directory
where you want to put the repository. Then, use the git clone command to
clone the repository locally. For example, to clone a GitHub repository called
ex_repo posted in a GitHub account with the user name janedoe, you could
run:
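The command itself is not shown in this excerpt; it would be along these lines (use the HTTPS address if you have not set up an SSH key):

git clone git@github.com:janedoe/ex_repo.git
# or, over HTTPS:
git clone https://github.com/janedoe/ex_repo.git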
Once you have linked a local R project with a GitHub repository, you can
push and pull commits using the blue down arrow (pull from GitHub) and
green up arrow (push to GitHub) in the Git window in RStudio (the same Git
window described earlier).
GitHub helps you work with others on code. There are two main ways you
can do this:
Collaborating: Different people have the ability to push and pull di-
rectly to and from the same repository. When one person pushes a
change to the repository, other collaborators can immediately get the
changes by pulling the latest GitHub commits to their local repository.
Forking: Different people have their own GitHub repositories, with each
linked to their own local repository. When a person pushes changes to
GitHub, it only makes changes to his own repository. The person must
issue a pull request to another person's fork of the repository to share
the changes.
Issues
Each original GitHub repository (i.e., not a fork of another repository) has a
tab for Issues. This page works like a Discussion Forum. You can create new
Issue threads to describe and discuss things that you want to change about
the repository.
Issues can be closed once the problem has been resolved. You can close issues
on the Issue page with the Close issue button. If a commit you make in
RStudio closes an issue, you can automatically close the issue on GitHub by
including Close #[issue number] in your commit message and then pushing
to GitHub. For example, if issue #5 is Fix typo in section 3, and you make a
change to fix that typo, you could make and save the change locally, commit
that change with the commit message Close #5, and then push to GitHub,
and issue #5 in Issues for that GitHub repository will automatically be
closed, with a link to the commit that fixed the issue.
Pull request
You can use a pull request to suggest changes to a repository that you do not
own or otherwise have the permission to directly change. Take the following
steps to suggest changes to someone else's repository:
You can also use pull requests within your own repositories. Some people will
create a pull request every time they have a big issue they want to fix in one
of their repositories.
In GitHub, each repository has a Pull requests tab where you can manage
pull requests (submit a pull request to another fork or merge in someone
else's pull request for your fork).
Merge conflicts
At some point, if you are using GitHub to collaborate on code, you will get
merge conflicts. These happen when two people have changed the same piece
of code in two different ways at the same time.
For example, say two people are both working on local versions of the same
repository, and the first person changes a line to mtcars[1, ] while the second
person changes the same line to head(mtcars, 1). The second person pushes his
commits to the GitHub version of the repository before the first person does.
Now, when the first person pulls the latest commits to the GitHub repository,
he will have a merge conflict for this line. To be able to commit a final version,
the first person will need to decide which version of the code to use and
commit a version of the file with that code.
If there are merge conflicts, they'll show up like this in the file:
<<<<<<< HEAD
mtcars[1, ]
=======
head(mtcars, 1)
>>>>>>> remote-branch
To fix them, search for all these spots in files with conflicts (Ctrl-F can be
useful for this), pick the code you want to use, and delete everything else.
For the example conflict, it could be resolved by changing the file from this:
<<<<<<< HEAD
mtcars[1, ]
=======
head(mtcars, 1)
>>>>>>> remote-branch
To this:
head(mtcars, 1)
That merge conflict is now resolved. Once you resolve all merge conflicts in
all files in the repository, you can save and commit the files.
These merge conflicts can come up in a few situations:
You pull in commits from the GitHub branch of a repository you've been
working on locally.
Someone sends a pull request for one of your repositories, and you
have updated some of the code between when the person forked the
repository and submitted the pull request.
Summary
R code can be kept under version control using git, and RStudio offers con-
venient functionality for working with a directory under git version control.
A directory under git version control can also be pushed to GitHub, which
provides a useful platform for sharing and collaborating on code.
The R programming language is open source software and many open source
software packages draw some inspiration from the design of the Unix oper-
ating system which macOS and Linux are based on. Ken Thompson - one
of the designers of Unix - first laid out this philosophy, and many Unix
philosophy principles can be applied to R programs. The overarching philo-
sophical theme of Unix programs is to do one thing well. Sticking to this rule
accomplishes several objectives:
1. Since your program only does one thing, the chance that your program
contains many lines of code is reduced. This means that others can more
easily read the code for your program so they can understand exactly
how it works (if they need to know).
2. Simplicity in your program reduces the chance there will be major bugs
in your program since fewer lines of code means fewer opportunities to
make a mistake.
3. Your program will be easier for users to understand since the number of
inputs and outputs are reduced for a program that only does one thing.
4. Programs built with other small programs have a higher chance of also
being small. This ability to string several small programs together to
make a more complex (but also small) program is called composability.
Unix command line programs are notable for their use of the pipe operator
(|) and so the Unix philosophy also encourages programs to produce outputs
that can be piped into program inputs. Recently pipes in R have surged
in popularity thanks to projects like the magrittr package. When it makes
sense for your function to take data (usually a vector or a data frame) as
an argument and then return data, you should consider making the data
argument the first argument in your function so that your function can be
part of a data pipeline.
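For example, a function that takes a data frame as its first argument drops naturally into a magrittr pipeline (the function below is a hypothetical illustration, not from this book):

library(magrittr)

# A data-first function: because the data argument comes first, the function
# can be used as one step in a pipeline.
count_complete <- function(data) {
  sum(complete.cases(data))
}

airquality %>% count_complete()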
One case where many R programs differ from the greater Unix philosophy is
in terms of user interaction. Unix programs will usually only print a message
to the user if a program produces an error or warning. Although this is a
good guideline for your programs, many R programs print messages to the
console even if the program works correctly. Many R users only use the
language interactively, so showing messages to your users might make sense
for your package. One issue with messages is that they produce output which
is separate from the results of your program, and therefore messages are
harder to capture.
Default Values
Every function argument is an opportunity for your function to fail the user
by producing an error because of bad or unexpected inputs. Therefore you
should provide as many default values for your functions as is reasonable. If
there's an argument in your function that should only be one of a handful
of values, you should use the match.arg() function to check that one of the
permitted values is provided:
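The definition of multiply_by() is not included in this excerpt; a definition consistent with the output shown below would be (a sketch):

multiply_by <- function(n, multiplier = c("two", "three", "four")){
  # match.arg() checks the supplied value against the permitted set and
  # throws an informative error if it does not match
  multiplier <- match.arg(multiplier)
  if(multiplier == "two"){
    n * 2
  } else if(multiplier == "three"){
    n * 3
  } else {
    n * 4
  }
}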
multiply_by(5, "two")
[1] 10
multiply_by(5, "six")
Error in match.arg(multiplier): 'arg' should be one of "two", "three", "four"
Naming Things
1. Use snake case and lowercase. Modern R packages use function and
variable names like geom_line(), bind_rows(), and unnest_token() where
words are separated by underscores (_) and all characters are lower-
case. Once upon a time words were commonly separated by periods (.)
but that scheme can cause confusion with regard to generic functions
(see the object oriented programming chapter for more information).
2. Names should be short. A short name is faster to type and is more
memorable than a long and complicated name. The desire for brevity,
however, has to be balanced with making the name descriptive enough to
be meaningful (see the next point).
3. Names should be meaningful and descriptive. Function names should
generally describe the actions they perform. Other object names should
describe the data or attributes they encompass. In general you should
avoid numbering variable names like apple1, apple2, and apple3. Instead
you should create a data structure called apples so you can access each
apple with apples[[1]], apples[[2]], and apples[[3]].
4. Be sure that you're not assigning names that already exist and are
common in R. For example mean, summary, and rt are already names of
functions in R, so try to avoid overwriting them. You can check if a name
is taken using the apropos() function:
apropos("mean")
[1] ".colMeans" ".rowMeans" "colMeans" "kmeans"
[5] "mean" "mean.Date" "mean.default" "mean.difftime"
[9] "mean.POSIXct" "mean.POSIXlt" "rowMeans" "weighted.mean"
apropos("my_new_function")
character(0)
If you write a package with useful functions that are well designed, then you
may be lucky enough that your package becomes popular! Others may build
upon your functions to extend or adapt their features for other purposes.
This means that when you establish a set of arguments for a function, you're
implicitly promising some amount of stability for the inputs and outputs of
that function. Changing the order or the nature of function arguments or
return values can break other peoples code, creating work and causing pain
for those who have chosen to use your software. For this reason you should
think very carefully about function arguments and outputs to ensure that
both can grow and change sustainably. You should seek to strike a balance
between frustrating your users by making breaking changes and ensuring
that your package follows up-to-date programming patterns and ideas. If you
believe that the functions in a package you're developing are not yet stable,
you should make users aware of that fact so that they're warned if they choose
to build on your work.
Summary
Most of software design is ensuring that your users stumble into their desired
outcome. You may think you're writing the most intuitive package, but sitting
down with a colleague and watching them use your package can teach you
volumes about what users want and expect out of your package. There are
libraries full of books written about software design and this chapter is
only meant to serve as a jumping-off point. If you happen to be looking for
inspiration, I highly recommend the talk by Bret Victor called The Future of
Programming.
We'll discuss two services for continuous integration: the first is Travis, which
will test your package on Linux, and then there's AppVeyor, which will test
your package on Windows. Both of these services are free for R packages
that are built in public GitHub repositories. These continuous integration
services will run every time you push a new set of commits for your package
repository. Both services integrate nicely with GitHub, so you can see in
GitHub's pull request pages whether or not your package is building correctly.
Using Travis
Open up your R console and navigate to your R package repository. Now load
the devtools package with library(devtools) and enter use_travis() into your R
console. This command will set up a basic .travis.yml for your R package. You
can now add, commit, and push your changes to GitHub, which will trigger
the first build of your package on Travis. Go back to https://travis-ci.org to
watch your package be built and tested at the same time! You may want to
make some changes to your .travis.yml file, and you can see all of the options
available in this guide.
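As a rough idea of what use_travis() produces, a minimal .travis.yml for an R package can be as short as the following (the exact contents generated may differ by devtools version):

language: r
cache: packages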
Once your package has been built for the first time, you'll be able to obtain
a badge, which is just a small image generated by Travis which indicates
whether your package is building properly and passing all of your tests. You
should display this badge in the README.md file of your package's GitHub
repository so that you and others can monitor the build status of your package.
Using AppVeyor
Summary
Continuous integration services make it possible to re-test your code on different platforms after every git
push. Using continuous integration makes it easy for you and for others to
simultaneously work on building an R package without breaking package
features by mistake.
One of the great features about R is that you can run R code on multiple kinds
of computers and operating systems and it will behave the same way on each
one. Most of the time you don't need to worry about what platform your R code is
running on. The following sections discuss strategies and functions that you
should use to ensure that your R code runs uniformly on every kind of system.
Handling Paths
Paths to files and folders can have big differences between operating systems.
In general you should avoid constructing a path by hand. For example, if I
wanted to access a file called data.txt that I know will be located on the user's
desktop, using the string "~/Desktop/data.txt" would not work if that code was
run on a Windows machine. In general you should always use functions to
construct and find paths to files and folders. The correct programmatic way
to construct the path above is to use the file.path() function. So to get the file
above I would do the following:
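A minimal sketch (the [1] line shows the result you would see on a Mac or Linux system):

file.path("~", "Desktop", "data.txt")
[1] "~/Desktop/data.txt"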
Sys.info()['sysname']
sysname
"Darwin"
If the resulting line above says "Darwin", it's referring to the core of macOS. If
you don't have a Mac, try running both lines of code above to see the resulting
path and the type of system that you're running.
In general it's not guaranteed on any system that a particular file or folder
you're looking for will exist; however, if the user of your package has
installed your package, you can be sure that any files within your package
exist on their machine. You can find the path to files included in your package
using the system.file() function. Any files or folders in the inst/ directory
of your package will be copied one level up once your package is installed.
If your package is called ggplyr2 and there's a file in your package under
inst/data/first.txt, you can get the path to that file with system.file("data",
"first.txt", package = "ggplyr2"). Packaging files with your package is the
best way to ensure that users have access to them when theyre using your
package.
In terms of constructing paths there are a few other functions you should
be aware of. Remember that the results for many of these functions are
contingent on this book being built on a Mac, so if you're using Windows
I encourage you to run these functions yourself to see their results. The
path.expand() function is usually used to find the absolute path name of a
user's home directory when the tilde (~) is included in the path. The tilde
is a shortcut for the path to the current user's home directory. Let's take a
look at path.expand() in action:
path.expand("~")
[1] "/Users/rdpeng"
path.expand(file.path("~", "Desktop"))
[1] "/Users/rdpeng/Desktop"
The normalizePath() function is another way to produce canonical absolute paths:
normalizePath(file.path("~", "R"))
[1] "/Users/sean/R"
normalizePath(".")
[1] "/Users/sean/books/msdr"
normalizePath("..")
[1] "/Users/sean/books"
To extract parts of a path you can use the basename() function to get the name
of the file or the deepest directory in the path, and you can use dirname() to
get the part of the path that does not include either the file or the deepest
directory. Let's take a look at some examples:
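The examples themselves are not included in this excerpt; applied to the path constructed earlier, the functions behave like this (a sketch):

data_file <- file.path("~", "Desktop", "data.txt")
basename(data_file)
[1] "data.txt"
dirname(data_file)
[1] "~/Desktop"
dirname(dirname(data_file))
[1] "~"

The passage below, from the CRAN Repository Policy, constrains where packages may write files: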
Packages should not write in the users home filespace, nor any-
where else on the file system apart from the R sessions temporary
directory (or during installation in the location pointed to by
TMPDIR: and such usage should be cleaned up). Installing into
the systems R installation (e.g., scripts to its bin directory) is not
allowed. Limited exceptions may be allowed in interactive sessions
if the package obtains confirmation from the user.
In general you should strive to get the user's consent before you create or save
files on their computer. With some functions consent is implicit; for example,
it's clear that somebody using write.csv() consents to producing a csv file at a
specified path. When it's not absolutely clear that the user will be creating a
file or folder when they use your functions, you should ask them specifically.
Take a look at the code below for a skeleton of a function that asks for a user's
consent:
# The original excerpt omits the enclosing function definition; the wrapper
# name below is hypothetical.
create_desktop_file <- function(force = FALSE){
  #
  # ... some code that does something useful ...
  #
  if(!dir.exists(file.path("~", "Desktop"))){
    warning("No Desktop found.")
  } else {
    if(!force && interactive()){
      result <- select.list(c("Yes", "No"),
                  title = "May this program create data.txt on your desktop?")
      if(result == "Yes"){
        file.create(file.path("~", "Desktop", "data.txt"))
      }
    } else if(force){
      file.create(file.path("~", "Desktop", "data.txt"))
    } else {
      warning("data.txt was not created on the Desktop.")
    }
  }
}
The description of the function clearly states that the function attempts to
create the data.txt file. This function has a force argument which will create
the data.txt file without asking the user first. By setting force = FALSE as the
default, the user must set force = TRUE, which is one method to get consent
from the user. The function above uses the interactive() function in order
to determine whether the user is using this function in an R console or
if this function is being run in a non-interactive session. If the user is in
an interactive R session then using select.list() is a decent method to ask
the user a question. You should strive to use select.list() and interactive()
together in order to prevent an R session from waiting for input from a user
that doesn't exist.
rappdirs
Even the contrived example above implicitly raises a good question: where
should your package save files? The most obvious answer is to allow the user
to provide an argument for the path where a file should be saved. This is a
good idea as long as your package won't need to depend on the location of that
file in the future, for example if your package is creating an output data file.
But what if you need persistent and consistent access to a file? You might be
tempted to use path.package() in order to find the directory that your package
is installed in so you can store files there. This isn't a good idea because file
access permissions often do not allow users to modify files where R packages
are stored.
In order to find a location where you can read and write files that will persist
on a users computer you should use the rappdirs package. This package
contains functions that will return paths to directories where your package
can store files for future use. The user_data_dir() function will provide a user-
specific path for your package, while the site_data_dir() function will return a
directory path that is shared by all users. Let's take a look at rappdirs in action:
library(rappdirs)
Loading required package: methods
site_data_dir(appname = "ggplyr2")
[1] "/Library/Application Support/ggplyr2"
user_data_dir(appname = "ggplyr2")
[1] "/Users/rdpeng/Library/Application Support/ggplyr2"
Both of the examples above are probably the Mac-specific paths. We can get
the Windows-specific paths by specifying the os argument:
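For example (output omitted here, since the exact Windows paths depend on the rappdirs version and the user name):

site_data_dir(appname = "ggplyr2", os = "win")
user_data_dir(appname = "ggplyr2", os = "win")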
If you dont supply the os argument then the function will determine the
operating system automatically. One feature about user_data_dir() you should
note is the roaming = TRUE argument. Many Windows networks are configured
so that any authorized user can log in to any computer on the network and
have access to their desktop, settings, and files. Setting roaming = TRUE returns
a special path so that R will have access to your package's files everywhere, but
this requires the directory to be synced often. Make sure to only use roaming =
TRUE if the files your package will be storing with rappdirs are going to be small.
For more information about rappdirs see https://github.com/hadley/rappdirs.
Several R Packages allow users to set global options that affect the behavior
of the package using the options() function. The options() function returns
a list, and named values in this list can be set using the following syntax:
options(key = value). It's a common feature for packages to allow a user to set
options which may specify package defaults, or change the behavior of the
package in some way. You should thoroughly document how your package is
affected by which options are set.
When an R session begins a series of files are searched for and run if found as
detailed in help("Startup"). One of those files is .Rprofile. The .Rprofile file is
just a regular R file which is usually located in a user's home directory (which
you can find with normalizePath("~")). A user's .Rprofile is run every time they
start an R session, so it's a good file for setting options that a user wants to be
set when using R. If you want a user to be able to set an option that is related
to your package that is unlikely to change (like a username or a key), then you
should consider instructing them to create or make changes to their .Rprofile.
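For example, a user could add a line like the following to their .Rprofile (the option name is hypothetical), and your package could then read it with getOption():

# In the user's ~/.Rprofile
options(mypackage.username = "janedoe")

# Inside your package code
getOption("mypackage.username")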
Package Installation
In cases where users might have a weak internet connection, it's often easier
for a user to download the source of your package as a zip file and
then to install it using install.packages(). Instead of asking users to discern
the path of the zip file they've downloaded, you should ask them to enter
install.packages(file.choose(), repos = NULL, type = "source") into the R console
and then they can interactively select the file they just downloaded. If a user
is denied permission to modify their local package directory, they still may
be able to use a package if they specify a directory they have access to with
the lib argument for install.packages().
Environmental Attributes
Occasionally you may need to know specific information about the hardware
and software limitations of the computer that is running your R code. The
environmental variables .Platform and .Machine are lists which contain named
elements that can tell your program about the underlying machine. For ex-
ample .Platform$OS.type is a good method for checking whether your program
is in a Windows environment since the only values it can return are "windows"
and "unix":
.Platform$OS.type
[1] "unix"
For more information about information contained in .Platform see the help
file: help(".Platform").
The .Machine variable contains information specific to the computer architecture
that your program is being run on. For example, .Machine$double.xmax is the
largest double-precision value that can be represented:
.Machine$double.xmax
[1] 1.797693e+308
.Machine$double.xmax + 100 == .Machine$double.xmax
[1] TRUE
.Machine$double.xmin
[1] 2.225074e-308
You might also find .Machine$double.eps useful, which is the smallest number
on a machine such that 1 + .Machine$double.eps != 1 evaluates to TRUE:
1 + .Machine$double.eps != 1
[1] TRUE
1 + .Machine$double.xmin != 1
[1] FALSE
Summary
File and folder paths differ across platforms so R provides several functions
to ensure that your program can construct paths correctly. The rappdirs
package helps further by identifying locations where you can safely store files
that your package can access. However, before creating files anywhere on a
user's disk you should always ask the user's permission. You should provide
clear and easy instructions so people can easily install your package. The
.Platform and .Machine variables can inform your program about hardware
and software details.
4. Building Data Visualization Tools
The data science revolution has produced reams of new data from a wide
variety of new sources. These new datasets are being used to answer new
questions in ways never before conceived. Visualization remains one of the
most powerful ways to draw conclusions from data, but the influx of new
data types requires the development of new visualization techniques and
building blocks. This section provides you with the skills for creating those
new visualization building blocks. We focus on the ggplot2 framework and
describe how to use and extend the system to suit the specific needs of your
organization or team.
The objectives for this section are:
The ggplot2 package allows you to quickly plot attractive graphics and to
visualize and explore data. Objects created with ggplot2 can also be exten-
sively customized with ggplot2 functions (more on that in the next subsec-
tion), and because ggplot2 is built using grid graphics, anything that cannot
be customized using ggplot2 functions can often be customized using grid
graphics. While the structure of ggplot2 code differs substantially from that
of base R graphics, it offers a lot of power for the required effort. This
first subsection focuses on useful, rather than attractive, graphs, since this subsection focuses on exploring rather than presenting data. Later sections will give more information about making more attractive or customized plots, as you'd want to do for final reports, papers, etc.
To show how to use basic ggplot2, we'll use a dataset of Titanic passengers, their characteristics, and whether or not they survived the sinking. This dataset has become fairly famous in data science, because it's used, among other things, for one of Kaggle's long-term learning competitions, as well as in many tutorials and texts on building classification models.
To get this dataset, you'll need to install and load the titanic package, and then you can load and rename the training dataset, which includes data on about two-thirds of the Titanic passengers:
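For example (installation is only needed once):

# install.packages("titanic")
library(titanic)
data("titanic_train", package = "titanic")
titanic <- titanic_train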
The other data example we'll use in this subsection is some data on players in the 2010 World Cup. This is available from the faraway package:
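For example:

# install.packages("faraway")
library(faraway)
data("worldcup")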
Unlike most data objects you'll work with, the data that comes with an R package will often have its own help file. You can access this using the ? operator. For example, try running: ?worldcup.
All of the plots we'll make in this subsection will use the ggplot2 package (another member of the tidyverse!). If you don't already have that installed, you'll need to install it. You then need to load the package in your current session of R:
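For example:

# install.packages("ggplot2")
library(ggplot2)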
The process of creating a plot using ggplot2 follows conventions that are a bit different than most of the code you've seen so far in R (although it is somewhat similar to the idea of piping we introduced in an earlier course).
The basic steps behind creating a plot with ggplot2 are:
1. Create an object of the ggplot class, typically specifying the data and
some or all of the aesthetics;
2. Add on geoms and other elements to create and customize the plot,
using +.
You can add on one or many geoms and other elements to create plots that range from very simple to very customized. We'll focus on simple geoms and added elements first, and then explore more detailed customization later.
The first step in creating a plot using ggplot2 is to create a ggplot object. This
object will not, by itself, create a plot with anything in it. Instead, it typically
specifies the data frame you want to use and which aesthetics will be mapped
to certain columns of that data frame (aesthetics are explained more in the
next subsection).
Use the following conventions to initialize a ggplot object:
## Generic code
object <- ggplot(dataframe, aes(x = column_1, y = column_2))
## or, if you don't need to save the object
ggplot(dataframe, aes(x = column_1, y = column_2))
The dataframe is the first parameter in a ggplot call and, if you like, you can
use the parameter definition with that call (e.g., data = dataframe). Aesthetics
are defined within an aes function call that typically is used within the ggplot
call.
Plot aesthetics
Aesthetics are properties of the plot that can show certain elements of the
data. For example, in Figure @ref(fig:aesmapex), color shows (i.e., is mapped
to) gender, x-position shows height, and y-position shows weight in a sample
data set of measurements of children in Nepal.
Example of how different properties of a plot can show different elements of the data. Here, color indicates gender, position along the x-axis shows height, and position along the y-axis shows weight. This example is a subset of data from the nepali dataset in the faraway package.
Which aesthetics are required for a plot depends on which geoms (more on those in a second) you're adding to the plot. You can find out which aesthetics you can use for a geom in the "Aesthetics" section of the geom's help file (e.g., ?geom_point). Required aesthetics are in bold in this section of the help file and optional ones are not. Common plot aesthetics you might want to specify include:
Code Description
x Position on x-axis
y Position on y-axis
shape Shape
color Color of border of elements
fill Color of inside of elements
size Size
alpha Transparency (1: opaque; 0: transparent)
linetype Type of line (e.g., solid, dashed)
To create a plot, you need to add one or more geoms to the ggplot object. The system of creating a ggplot object, mapping aesthetics to columns of the data, and adding geoms makes more sense once you try a few plots. For example, say you'd like to create a histogram showing the fares paid by passengers in the example Titanic data set. To plot the histogram, you'll first need to create a ggplot object, using a dataframe with the Fare column you want to show in the plot. In creating this ggplot object, you only need one aesthetic (x, which in this case you want to map to Fare), and then you'll need to add a histogram geom. In code, this is:
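A sketch of that call (the default number of bins is used here):

ggplot(data = titanic, aes(x = Fare)) +
  geom_histogram()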
Titanic data
This code sets the dataframe as the titanic object in the user's working session, maps the values in the Fare column to the x aesthetic, and adds a histogram geom to generate a histogram.
If R gets to the end of a line and there is not some indication that the call is not over (e.g., %>% for piping or + for ggplot2 plots), R interprets that as a message to run the call without reading in further code. A common error when writing ggplot2 code is to put the + to add a geom or element at the beginning of a line rather than at the end of the previous line; in this case, R will try to execute the call too soon. To avoid errors, be sure to end lines with +; don't start lines with it.
There is some flexibility in writing the code to create this plot. For example,
you could specify the aesthetic for the histogram in an aes statement when
adding the geom (geom_histogram) rather than in the ggplot call:
ggplot(data = titanic) +
geom_histogram(aes(x = Fare))
Similarly, you could specify the dataframe when adding the geom rather than
in the ggplot call:
ggplot() +
geom_histogram(data = titanic, aes(x = Fare))
Finally, you can pipe the titanic dataframe into a ggplot call, since the ggplot
function takes a dataframe as its first argument:
titanic %>%
ggplot() +
geom_histogram(aes(x = Fare))
# or
titanic %>%
ggplot(aes(x = Fare)) +
geom_histogram()
While all of these work, for simplicity we will use the syntax of specifying the data and aesthetics in the ggplot call for most examples in this subsection. Later, we'll show how this flexibility can be used to use data from different dataframes for different geoms or to change aesthetic mappings between geoms.
A key thing to remember, however, is that ggplot is not flexible about whether you specify aesthetics within an aes call or not. We will discuss what happens if you do not later in the book, but it is very important that if you want to show values from a column of the data using aesthetics like color, size, shape, or position, you remember to make that specification within aes. Also, be sure that you specify the dataframe before or when you specify aesthetics (i.e., you can't specify aesthetics in the ggplot statement if you haven't specified the dataframe yet), and if you specify a dataframe within a geom, be sure to use data = syntax rather than relying on parameter position, as data is not the first parameter expected for geom functions.
When you run the code to create a plot in RStudio, the plot will be shown in the "Plots" tab in one of the RStudio panels. If you would like to save the plot, you can do so using the "Export" button in this tab. However, if you would like to use code in an R script to save a plot, you can do so (and it's more reproducible!).
To save a plot using code in a script, take the following steps: (1) open a graphics device (e.g., using the function pdf or png); (2) run the code to draw the plot; and (3) close the graphics device using the dev.off function. Note that the function you use to open a graphics device will depend on the type of device you want to open, but you close all devices with the same function (dev.off).
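A minimal sketch of those three steps (the file name here is arbitrary):

fare_hist <- ggplot(titanic, aes(x = Fare)) +
  geom_histogram()
pdf("fare_histogram.pdf", width = 6, height = 4)  # step 1: open a graphics device
print(fare_hist)                                  # step 2: draw the plot on that device
dev.off()                                         # step 3: close the device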
Geoms
Geom functions add the graphical elements of the plot; if you do not include at least one geom, you'll get a blank plot space. Each geom function has its own arguments to adjust how the graph is created. For example, when adding a histogram geom, you can use the bins argument to change the number of bins used to create the histogram; try:
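For instance (the bin count here is just an illustration):

ggplot(titanic, aes(x = Fare)) +
  geom_histogram(bins = 15)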
As with any R function, you can find out more about the arguments available for a geom function by reading the function's help file (e.g., ?geom_histogram).
Geom functions differ in the aesthetic inputs they require. For example, the geom_histogram function only requires a single aesthetic (x). If you want to create a scatterplot, you'll need two aesthetics, x and y. In the worldcup dataset,
the Time column gives the amount of time each player played in the World Cup
2010 and the Passes column gives the number of passes he made. To see the
relationship between these two variables, you can create a ggplot object with
the dataframe, mapping the x aesthetic to Time and the y aesthetic to Passes,
and then adding a point geom:
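That code would look something like:

ggplot(worldcup, aes(x = Time, y = Passes)) +
  geom_point()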
All geom functions have both required and accepted aesthetics. For example,
the geom_point function requires x and y, but the function will also accept alpha
(transparency), color, fill, group, size, shape, and stroke aesthetics. If you try to create a geom without one of its required aesthetics, you will get an error.
You can, however, add accepted aesthetics to the geom if you'd like; for example, to use color to show player position and size to show shots on goal for the World Cup data, you could call:
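A sketch of that call:

ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point()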
The following table gives some of the geom functions you may find useful in ggplot2, along with the required aesthetics and some of the most useful specific arguments for each geom function (there are other useful arguments that can be applied to many different geom functions, which will be covered later). The elements created by these geom functions are usually clear from the function names (e.g., geom_point plots points; geom_segment plots segments).
Several geoms can be added to the same ggplot object, which allows you to build up layers to create interesting graphs. For example, we previously made a scatterplot of time versus shots for World Cup 2010 data. You could make that plot more interesting by adding labels for noteworthy players, giving those players' team names and positions. First, you can create a subset of data with the information for noteworthy players and add a column with the text to include on the plot. Then you can add a text geom to the previous ggplot object:
library(dplyr)
noteworthy_players <- worldcup %>%
  filter(Shots == max(Shots) | Passes == max(Passes)) %>%
  mutate(point_label = paste(Team, Position, sep = ", "))
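The text geom can then be layered on top of the point geom; a sketch (the "inward" justification keeps labels inside the plot area):

ggplot(worldcup, aes(x = Time, y = Shots)) +
  geom_point() +
  geom_text(data = noteworthy_players, aes(label = point_label),
            vjust = "inward", hjust = "inward")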
Soccer games last 90 minutes each, and different teams play a different
number of games at the World Cup, based on how well they do. To check
if horizontal clustering is at 90-minute intervals, you can plot a histogram
of player time (Time), with reference lines every 90 minutes. First initialize
the ggplot object, with the dataframe to use and appropriate mapping to
aesthetics, then add geoms for a histogram as well as vertical reference lines:
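A sketch of that code (the bin width and line color are arbitrary choices):

ggplot(worldcup, aes(x = Time)) +
  geom_histogram(binwidth = 10) +
  geom_vline(xintercept = 90 * 0:6, color = "blue", alpha = 0.5)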
Constant aesthetics
Instead of mapping an aesthetic to a column of the data, you can set it to a constant value for a geom. You can do this with any of the aesthetics for a geom, including color, fill, shape, and size. If you want to change the shape of points, in R, you use a number to specify the shape you want to use. Figure @ref(fig:shapeexamples) shows the shapes that correspond to the numbers 1 to 25 in the shape aesthetic.
This figure also provides an example of the difference between the color
aesthetic (black for all these example points) and fill aesthetic (red for these
examples). If a geom has both a border and an interior, the color aesthetic
specifies the color of the border while the fill aesthetic specifies the color of
the interior. You can see that, for point geoms, some shapes include a fill (21
for example), while some are either empty (1) or solid (19).
Examples of the shapes corresponding to different numeric choices for the shape aes-
thetic. For all examples, color is set to black and fill to red.
If you want to set color to be a constant value, you can do that in R using character strings for different colors. Figure @ref(fig:colorexamples) gives an example of a few of the different blues available in R. To find images that show all these named choices for colors in R, google "R colors" and search by "Images" (for example, there is a pdf here: http://www.stat.columbia.edu/tzheng/files/Rcolo
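Constant aesthetics are set outside of aes(), so every element of the geom gets the same value; for example (the specific values here are arbitrary):

ggplot(worldcup, aes(x = Time, y = Passes)) +
  geom_point(color = "darkorange", size = 2.3, shape = 17, alpha = 0.5)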
There are also a number of elements besides geoms that you can add onto a
ggplot object using +. A few that are used very frequently are:
Element Description
ggtitle Plot title
xlab, ylab x- and y-axis labels
xlim, ylim Limits of x- and y-axis
You can also use this syntax to customize plot scales and themes, which we
will discuss later in this section.
Example plots
In this subsection, we'll show a few more examples of basic plots created with ggplot2. For the example plots in this subsection, we'll use a dataset in the faraway package called nepali. This gives data from a study of the health of a group of Nepalese children. You can load this data using:
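The head() output shown below reflects a cleaned version of the data; a sketch of loading and cleaning steps that would be consistent with that output (the exact cleaning steps here are an assumption):

library(faraway)
library(dplyr)
data("nepali")
nepali <- nepali %>%
  dplyr::select(id, sex, wt, ht, age) %>%
  mutate(id = factor(id),
         sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female"))) %>%
  distinct(id, .keep_all = TRUE)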
head(nepali)
id sex wt ht age
1 120011 Male 12.8 91.2 41
2 120012 Female 14.9 103.9 57
3 120021 Female 7.7 70.1 8
4 120022 Female 12.1 86.4 35
5 120023 Male 14.2 99.4 49
6 120031 Male 13.9 96.4 46
We'll use this cleaned dataset to show how to use ggplot2 to make histograms, scatterplots, and boxplots.
Histograms
Basic example of plotting a histogram with ggplot2. This histogram shows the distribution
of heights for the first recorded measurements of each child in the nepali dataset.
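A sketch of code that would produce a histogram like the one described in that caption (fill and border colors are arbitrary):

ggplot(nepali, aes(x = ht)) +
  geom_histogram(fill = "lightblue", color = "black")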
If you run the code with no arguments for binwidth or bins in geom_histogram, you will get a message saying "stat_bin() using bins = 30. Pick better value with binwidth.". This message is just saying that a default number of bins was used to create the histogram. You can use arguments to change the number of bins used, but often this default is fine. You may also get a message that observations with missing values were removed.
You can add some elements to this plot to customize it a bit. For example
(Figure @ref(fig:nepalihist2)), you can add a figure title (ggtitle) and clearer
labels for the x-axis (xlab). You can also change the range of values shown by
the x-axis (xlim).
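A sketch of such additions (the title, labels, and limits here are illustrative):

ggplot(nepali, aes(x = ht)) +
  geom_histogram(fill = "lightblue", color = "black") +
  ggtitle("Height of children") +
  xlab("Height (cm)") +
  xlim(c(0, 120))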
Note that these additional graphical elements are added on by adding func-
tion calls to ggtitle, xlab, and xlim to our ggplot object.
Scatterplots
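A basic scatterplot of height versus weight in the nepali data needs only x and y aesthetics and a point geom; a sketch:

ggplot(nepali, aes(x = ht, y = wt)) +
  geom_point()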
Again, you can use some of the options and additions to change the plot appearance. For example, to add a title, change the x- and y-axis labels, and change the color and size of the points on the scatterplot (Figure @ref(fig:nepaliscatter2)), you can run:
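Code along these lines (the specific title, labels, color, and size are illustrative):

ggplot(nepali, aes(x = ht, y = wt)) +
  geom_point(color = "blue", size = 0.5) +
  ggtitle("Weight versus Height") +
  xlab("Height (cm)") + ylab("Weight (kg)")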
You can also try mapping a variable to the color aesthetic of the plot. For example, to use color to show the sex of each child in the scatterplot (Figure @ref(fig:nepaliscatter3)), you can add an additional mapping of this optional aesthetic to the sex column of the nepali dataframe with the following code:
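For example:

ggplot(nepali, aes(x = ht, y = wt, color = sex)) +
  geom_point(size = 0.5) +
  xlab("Height (cm)") + ylab("Weight (kg)")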
Boxplots
Boxplots are one way to show the distribution of a continuous variable. You
can add a boxplot geom with the geom_boxplot function. To plot a boxplot for
a single, continuous variable, you can map that variable to y in the aes call
and map x to the constant 1. For example, to create a boxplot of the heights of
children in the Nepali dataset (Figure @ref(fig:nepaliboxplot1)), you can run:
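A sketch:

ggplot(nepali, aes(x = 1, y = ht)) +
  geom_boxplot() +
  xlab("") + ylab("Height (cm)")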
Example of creating a boxplot. The example shows the distribution of height data for
children in the nepali dataset.
You can also create separate boxplots, one for each level of a factor (Figure @ref(fig:nepaliboxplot2)). In this case, you'll need to map columns in the input dataframe to two aesthetics (x and y) when initializing the ggplot object. The y variable is the variable for which the distribution will be shown, and the x variable should be a discrete (categorical or TRUE/FALSE) variable, which will be used to group the variable.
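For example, to show the distribution of height by sex (a sketch):

ggplot(nepali, aes(x = sex, y = ht)) +
  geom_boxplot() +
  xlab("Sex") + ylab("Height (cm)")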
Extensions of ggplot2
There are a number of packages that extend ggplot2 and allow you to create
a variety of interesting plots. For example, you can use the ggpairs function
from the GGally package to plot all pairs of scatterplots for several variables
(Figure @ref(fig:ggallyexample)).
library(GGally)
ggpairs(nepali %>% select(sex, wt, ht, age))
Example of using ggpairs from the GGally package for exploratory data analysis.
Notice how this output shows continuous and binary variables differently.
For example, the center diagonal shows density plots for continuous vari-
ables, but a bar chart for the categorical variable.
See https://www.ggplot2-exts.org to find more ggplot2 extensions. Later in this
course, we will give an overview of how to make your own extensions.
With slightly more complex code, you can create very interesting and customized plots using ggplot2. In this section, we'll provide an overview of some guidelines for creating good plots, based on the work of Edward Tufte and others, and show how you can customize ggplot objects to adhere to some of these guidelines. This overview will provide a framework for describing how to customize ggplot objects. We'll end the subsection by going over scales and color specifically.
A number of very thoughtful books and articles have been written about creating graphics that effectively communicate information. Some of the authors we highly recommend (and from whose work we've pulled and aggregated the guidelines for good graphics we'll go over) are:
In this section, we'll overview six guidelines for good graphics, based on the writings of these and other specialists in data display. The guidelines, each covered in its own subsection below, are to aim for a high data-to-ink ratio, use clear and meaningful labels, add useful references, highlight interesting aspects of the data, consider small multiples, and use order to aid interpretation.
For the examples in this subsection, we'll use dplyr for data cleaning and, for plotting, the packages ggplot2, gridExtra, and ggthemes, so you should load those packages if you plan to follow along with the examples.
library(dplyr)
library(ggplot2)
library(gridExtra)
library(ggthemes)
You can load the data for the examples in this subsection with the following
code:
# install.packages("faraway") ## Uncomment and run if you do not have the `faraway` package \
installed
library(faraway)
data(nepali)
data(worldcup)
# install.packages("dlnm") ## Uncomment and run if you do not have the `dlnm` package ins\
talled
library(dlnm)
data(chicagoNMMAPS)
chic <- chicagoNMMAPS
chic_july <- chic %>%
filter(month == 7 & year == 1995)
You should try to increase, as much as possible, the data-to-ink ratio in your graphs. This is the ratio of "ink" providing information to all ink used in the figure. In other words, if an element of the plot is redundant, take it out.
Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows
the number of players in each position in the worldcup dataset from the faraway
package.
Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows
daily mortality in Chicago, IL, in July 1995 using the chicagoNMMAPS data from the dlnm
package.
By increasing the data-to-ink ratio in a plot, you can help viewers see the message of the data more quickly. A cluttered plot is harder to interpret. Further, you leave room to add some of the other elements we'll talk about, including elements to highlight interesting data and useful references. Notice how the plots on the left in Figures @ref(fig:datainkratio1) and @ref(fig:datainkratio2) are already cluttered and leave little room for adding extra elements, while the plots on the right of those figures have much more room for additions.
One quick way to increase data density in ggplot2 is to change the theme for the plot, which will quickly change several elements of the plot's appearance. There are several themes that come with ggplot2, including a black-and-white theme and a minimal theme. To use a theme, you can add it to a ggplot object by using a theme function like theme_bw. For example, to use the classic theme for a scatterplot using the World Cup 2010 data, you can run:
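A sketch of that call:

ggplot(worldcup, aes(x = Time, y = Shots)) +
  geom_point() +
  theme_classic()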
Minimal theme
theme_linedraw
theme_bw
theme_minimal
theme_void
theme_dark
theme_classic
You can find even more theme functions in packages that extend ggplot2. The
ggthemes package, in particular, has some excellent additional themes. These
include themes based on the graphing principles of Stephen Few (theme_few)
and Edward Tufte (theme_tufte). Again, you can use one of these themes by
adding it to a ggplot object:
library(ggthemes)
ggplot(worldcup, aes(x = Time, y = Shots)) +
geom_point() +
theme_tufte()
Tufte theme
Daily mortality in Chicago, IL, in July 1995. This figure gives an example of the plot using
different themes.
You can see that these themes can vary substantially in their data-to-ink ratios. Between changing themes and choosing geoms carefully, you can increase the data-to-ink ratio in a plot substantially. For example, here is the code for the two plots from Figure @ref(fig:datainkratio2):
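The chicago_plot object used below is assumed to be a ggplot object initialized with the Chicago July 1995 data, for example (a sketch):

chicago_plot <- ggplot(chic_july, aes(x = date, y = death)) +
  xlab("Day in July 1995") +
  ylab("All-cause deaths")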
chicago_plot +
geom_area(fill = "black") +
theme_excel()
chicago_plot +
geom_line() +
theme_tufte()
We will teach you how to make your own ggplot theme later in the course.
Meaningful labels
Graphs often default to using abbreviations for axis labels and other labeling. For example, the default is for ggplot2 plots to use column names as labels for the x- and y-axes of a scatterplot. While this is convenient for exploratory plots, it's often not adequate for plots for presentations and papers. You'll want to use short and easy-to-type column names in your dataframe to make coding easier (e.g., wt), but you should use longer and more meaningful labeling in plots and tables that others need to interpret (e.g., "Weight (kg)").
Furthermore, text labels are often aligned in a way that makes them hard to read. For example, when plotting a categorical variable along the x-axis, it can be difficult to fit categorical labels that are long enough to be meaningful without rotating them and so making them harder to read.
Figure @ref(fig:labelsexample) gives an example of the same information (number of players in the World Cup data set by position) shown with labels that are harder to read and interpret (left) versus with clear, meaningful labels (right). Notice how the graph on the left is using abbreviations for the categorical variable ("DF" for "Defense"), abbreviations for axis labels ("Pos" for "Position" and "count" for "Number of players"), and has the player position labels along the x-axis, where long labels are hard to fit and read.
The number of players in each position in the worldcup data from the faraway package.
Both graphs show the same information, but the left graph has murkier labels, while the
right graph has labels that are easier to read and interpret.
There are a few strategies you can use to make labels clearer when plotting
with ggplot2:
You can use the xlab and ylab functions to customize the axis labels on a
ggplot object, rather than using the column names in the original data.
You can use the name parameter of the scale family of functions (e.g., scale_x_continuous) to relabel x- and y-axes; these functions also give you the power to make other changes to the x- and y-axes (e.g., changing break points for the axis ticks). However, if you only need to change axis labels, xlab and ylab are often quicker.
Use tidyverse functions to clean your data before plotting it. This is
particularly useful if you need to change the labels of categorical data.
You can pipe directly from tidyverse data cleaning into a ggplot call (see
the example code below).
Include units of measurement in axis titles when relevant. If units are dollars or percent, check out the scales package, which allows you to add labels directly to axis elements by including arguments like labels = percent in scale elements. See the help file for scale_x_continuous for some examples.
If the x-variable requires longer labels, as is often the case with categorical data (for example, player positions in Figure @ref(fig:labelsexample)), consider flipping the coordinates, rather than abbreviating or rotating the labels. You can use coord_flip to do this.
For example, here is the code used to generate the plots similar to those in
Figure @ref(fig:labelsexample) (we first create a version of the worldcup data
with worse column names and factor labels to show how to improve these
when creating a ggplot object):
library(forcats)
# Create a messier example version of the data
wc_example_data <- worldcup %>%
  dplyr::rename(Pos = Position) %>%
  mutate(Pos = fct_recode(Pos,
                          "DC" = "Defender",
                          "FW" = "Forward",
                          "GK" = "Goalkeeper",
                          "MF" = "Midfielder"))

wc_example_data %>%
  ggplot(aes(x = Pos)) +
  geom_bar()

wc_example_data %>%
  mutate(Pos = fct_recode(Pos,
                          "Defender" = "DC",
                          "Forward" = "FW",
                          "Goalkeeper" = "GK",
                          "Midfielder" = "MF")) %>%
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "lightgray") +
  xlab("") +
  ylab("Number of players") +
  coord_flip() +
  theme_tufte()
In this code example, we've used the fct_recode function from the forcats package both to create the messier example data and to clean up category names for the second plot. The forcats package has a number of useful functions for working with factors in R.
In R, once you load a library, you do not have to specify that library when calling its functions (e.g., once you've loaded dplyr, you can call rename). Usually, R does a good job of finding the right function under this system. However, if you have several packages loaded that have functions with the same name, you can run into problems. As you add on packages for plotting and mapping, you may find that some of your data cleaning code suddenly doesn't work. If this happens, it may be that you've added code that loads the plyr package, which has several functions with the same name as dplyr functions. If this happens to you, try using the package::function notation to clarify that you want to use the dplyr function. You can see an example of this in the above code, where we've specified dplyr::rename when creating the messier example dataset.
References
Data is easier to interpret when you add references. For example, if you show what is typical, it helps viewers interpret how unusual outliers are.
Figure @ref(fig:referenceexample1) shows daily mortality during July 1995 in Chicago, IL. The graph on the right has added shading showing the range of daily death counts in July in Chicago for neighboring years (1990-1994 and 1996-2000). This added reference helps clarify for viewers how unusual the number of deaths during the July 1995 heat wave was.
Daily mortality during July 1995 in Chicago, IL. In the graph on the right, we have added
a shaded region showing the range of daily mortality counts for neighboring years, to
show how unusual this event was.
Relationship between passes and shots taken among Forwards in the worldcup dataset
from the faraway package. The plot on the right has a smooth function added to help
show the relationship between these two variables.
For scatterplots created with ggplot2, you can use the function geom_smooth to
add a smooth or linear reference line. Here is the code that produces Figure
@ref(fig:referenceexample3):
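A sketch of code along those lines (restricted to Forwards, as the figure caption describes):

worldcup %>%
  filter(Position == "Forward") %>%
  ggplot(aes(x = Passes, y = Shots)) +
  geom_point(size = 1.5) +
  geom_smooth()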
Relationship between passes and shots taken among Forwards in the worldcup dataset
from the faraway package. The plot has a smooth function added to help show the
relationship between these two variables.
Some useful parameters of geom_smooth include:
method: The default is to add a loess curve if the data includes less than 1000 points and a generalized additive model for 1000 points or more. However, you can change to show the fitted line from a linear model using method = "lm" or from a generalized linear model using method = "glm".
span: How wiggly or smooth the smooth line should be (smaller value:
more flexible; larger value: more smooth)
se: TRUE or FALSE, indicating whether to include shading for 95%
confidence intervals.
level: Confidence level for confidence interval (e.g., 0.90 for 90% confi-
dence intervals)
Lines and polygons can also be useful for adding references, as in Figure @ref(fig:referenceexample1). Useful geoms for such shapes include line geoms such as geom_hline, geom_vline, and geom_abline, and shape geoms such as geom_rect and geom_polygon.
You want these references to support the main data shown in the plot, but not
overwhelm it. When adding these references:
Add reference elements first, so they will be plotted under the data,
instead of on top of it.
Use alpha to add transparency to these elements.
Use colors that are unobtrusive (e.g., grays).
For lines, consider using non-solid line types (e.g., linetype = 3).
Highlighting
Mortality in Chicago, July 1995. In the plot on the right, a thick red line has been added
to show the dates of a heat wave.
Passes versus shots for World Cup 2010 players. In the plot on the right, notable players
have been highlighted.
You can add highlighting elements using geoms like geom_text and geom_line.
Often, you will need to use a different dataframe for this highlighting geom.
For example, you may want to create a subset of the original dataframe with
notable points to which you want to add text labels. You can specify a new
dataframe for a geom using the data parameter in the function that adds that
geom. For example, to create the right plot in Figure @ref(fig:highlightpoints),
we first created a subset dataframe with only the players with the most shots
and passes (when creating this subset, we also included some code to create
the text label we want to use in the plot):
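That subset can be built just as in the earlier noteworthy_players example; a sketch:

noteworthy_players <- worldcup %>%
  filter(Shots == max(Shots) | Passes == max(Passes)) %>%
  mutate(point_label = paste(Team, Position, sep = ", "))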
Now you can create a ggplot object based on the worldcup data, add a point
geom to create the scatterplot with all data, and then add the text geom with
the data from noteworthy players to add labels for those players:
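A sketch of that layered plot (label color and justification are arbitrary choices):

ggplot(worldcup, aes(x = Passes, y = Shots)) +
  geom_point(alpha = 0.5) +
  geom_text(data = noteworthy_players, aes(label = point_label),
            vjust = "inward", hjust = "inward", color = "blue")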
Small multiples
Small multiples are graphs that use many small plots to show different subsets of the data. Typically in small multiples, all plots use the same ranges for the x- and y-axes. This makes it easier to compare across plots, and it also allows you to save room by limiting axis annotation. In ggplot2, you can use faceting to create small multiples.
For example, the worldcup dataset used in earlier examples includes each player's position. If you want to explore a relationship (e.g., time played vs. shots on goal) by position, you could try using color:
data(worldcup)
worldcup %>%
ggplot(aes(x = Time, y = Shots, color = Position)) +
geom_point()
However, often it's clearer to see relationships if you use faceting instead to create a small separate plot for each position. You can do this with either the facet_grid function or the facet_wrap function:
worldcup %>%
  ggplot(aes(x = Time, y = Shots)) +
  geom_point() +
  facet_grid(. ~ Position)
The facet_grid and facet_wrap functions differ in whether the small graphs are
placed with one faceting variable per dimension (facet_grid) or whether the
plots are wrapped across several rows (facet_wrap).
The facet_grid function can facet by one or two variables. One will be shown
by rows, and one by columns:
## Generic code
facet_grid([factor for rows] ~ [factor for columns])
The facet_wrap() function can facet by one or more variables, and it wraps the small graphs, so they don't all have to be in one row or column:
## Generic code
facet_wrap(~ [formula with factor(s) for faceting],
ncol = [number of columns])
For example, if you wanted to show relationships for the final two teams in World Cup 2010 (Spain and the Netherlands) and facet by both position and team, you could run:
worldcup %>%
  filter(Team %in% c("Spain", "Netherlands")) %>%
  ggplot(aes(x = Time, y = Shots)) +
  geom_point() +
  facet_grid(Team ~ Position)
With facet_wrap, you can specify how many columns you want to use, which makes it useful if you want to facet across a variable with a lot of levels. For example, there are 32 teams in the World Cup. You can create a faceted graph of time played versus shots taken by team by running:
worldcup %>%
  ggplot(aes(x = Time, y = Shots)) +
  geom_point(alpha = 0.25) +
  facet_wrap(~ Team, ncol = 6)
Using facet_wrap
Often, when you facet a plot, you'll want to rename your factor's levels or reorder them. For this, you'll need to use the factor() function on the original vector, or use some of the tools from the forcats package. For example, to rename the sex factor levels from "1" and "2" to "Male" and "Female", you can run:
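A sketch, assuming a dataframe df with a sex column coded "1"/"2" (the dataframe name here is hypothetical):

df <- df %>%
  mutate(sex = factor(sex, levels = c("1", "2"),
                      labels = c("Male", "Female")))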
Notice that the labels for the two graphs have now changed:
To re-order the factor and show the plot for Female first, you can use factor
to change the order of the levels:
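Continuing the same hypothetical example:

df <- df %>%
  mutate(sex = factor(sex, levels = c("Female", "Male")))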
Order
Adding order to plots can help highlight interesting findings. Often, factor or
categorical variables are ordered by something that is not interesting, like
alphabetical order (Figure @ref(fig:plotorder), left plot).
Mean time per player in World Cup 2010 by team. The plot on the right has reordered
teams to show patterns more clearly.
You can make the ranking of data clearer from a graph by using order to show
rank (Figure @ref(fig:plotorder), right). You can re-order factor variables in a
graph by resetting the factor using the factor function and changing the order
that levels are included in the levels parameter. For example, here is the code
for the two plots in Figure @ref(fig:plotorder):
## Left plot
worldcup %>%
  group_by(Team) %>%
  summarize(mean_time = mean(Time)) %>%
  ggplot(aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_few() +
  xlab("Mean time per player (minutes)") + ylab("")

## Right plot (the plotting steps after the re-ordering are a reconstruction,
## following the approach described in the text above)
worldcup %>%
  group_by(Team) %>%
  summarize(mean_time = mean(Time)) %>%
  arrange(mean_time) %>%
  mutate(Team = factor(Team, levels = unique(Team))) %>%
  ggplot(aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_few() +
  xlab("Mean time per player (minutes)") + ylab("")
As another example, you can customize the faceted plot created in the
previous subsection to order these plots from least to most average shots for
a position using the following code. This example also has some added code
to highlight the top players in each position in terms of shots on goal, as well
as customizing colors and the theme.
worldcup %>%
  select(Position, Time, Shots) %>%
  group_by(Position) %>%
  mutate(ave_shots = mean(Shots),
         most_shots = Shots == max(Shots)) %>%
  ungroup() %>%
  arrange(ave_shots) %>%
  mutate(Position = factor(Position, levels = unique(Position))) %>%
  ggplot(aes(x = Time, y = Shots, color = most_shots)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black"),
                     guide = FALSE) +
  facet_grid(. ~ Position) +
  theme_few()
As another example of ordering, the following code plots the range of time played by each team's players, with teams ordered by their average player time:

worldcup %>%
  dplyr::select(Team, Time) %>%
  dplyr::group_by(Team) %>%
  dplyr::mutate(ave_time = mean(Time),
                min_time = min(Time),
                max_time = max(Time)) %>%
  dplyr::arrange(ave_time) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Team = factor(Team, levels = unique(Team))) %>%
  ggplot(aes(x = Time, y = Team)) +
  geom_segment(aes(x = min_time, xend = max_time, yend = Team),
               alpha = 0.5, color = "gray") +
  geom_point(alpha = 0.5) +
  geom_point(aes(x = ave_time), size = 2, color = "red", alpha = 0.5) +
  theme_minimal() +
  ylab("")
We'll finish this section by going into a bit more detail about how to customize the scales and colors for ggplot objects, including more on scales and themes.
There are a number of different scale functions that allow you to customize
the scales of ggplot objects. Because color is often mapped to an aesthetic, you
can adjust colors in many ggplot objects using scales, as well (the exception
is if you are using a constant color for an element). The functions from the
scale family follow the following convention:
## Generic code
scale_[aesthetic]_[vector type]
For example, to adjust the x-axis scale for a continuous variable, you'd use
scale_x_continuous. You can use a scale function to change a variety of elements
of an axis, including the axis label (which you could also change with xlab or
ylab) as well as position and labeling of breaks. For aesthetics other than x
and y, the axis will typically be the plot legend for that aesthetic, so these
scale functions can be used to set the name, breaks, labels, and colors of plot
legends.
For example, here is a plot of Time versus Passes for the World Cup 2010 data,
with the number of shots taken shown by size and position shown by color,
using the default scales for each aesthetic:
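A sketch of that plot:

ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5)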
You may want to customize the x-axis for this plot, changing the scale to show
breaks every 90 minutes (the approximate length of each game). Further, you
may want to give that axis a different axis title. Because you want to change
the x axis and the aesthetic mapping is continuous (this aesthetic is mapped to
the Time column of the data, which is numeric), you can make this change
using scale_x_continuous:
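For example (the breaks shown here are at 90-minute intervals):

ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(name = "Time played (minutes)",
                     breaks = 90 * c(2, 4, 6),
                     minor_breaks = 90 * c(1, 3, 5))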
You may also want to change the legend for Shots to have the title "Shots on goal" and to only show the sizes for 0, 10, or 20 shots. The data on shots is mapped to the size aesthetic, and the data is continuous, so you can change that legend using scale_size_continuous:
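A sketch:

ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_size_continuous(name = "Shots on goal",
                        breaks = c(0, 10, 20))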
Legends for color and fill can be manipulated in a somewhat similar way,
which we explain in more detail later in this subsection.
The scale functions allow a number of different parameters. Some you may
find helpful are:
Parameter Description
name Label or legend name
breaks Vector of break points
minor_breaks Vector of minor break points
labels Labels to use for each break
limits Limits to the range of the axis
If you are mapping data that is in a date format, you can use date-specific scale functions like scale_x_date and scale_x_datetime. For example, here's a plot of deaths in Chicago in July 1995 using default values for the x-axis:
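A sketch, using the chic_july data loaded earlier:

ggplot(chic_july, aes(x = date, y = death)) +
  geom_line()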
These date-specific scale functions allow you to change the formatting of the
date (with the date_labels parameter), as well as do some of the tasks you
would do with a non-date scale function, like change the name of the axis:
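For example (the date format string is one possible choice):

ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_x_date(name = "Date in July 1995",
               date_labels = "%m-%d")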
You can also use the scale functions to transform an axis. For example, to
show the Chicago plot with deaths on a log scale, you can run:
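A sketch:

ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_y_log10()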
For color and fill aesthetics, the conventions for naming the scale functions vary a bit, and there are more options. For example, to adjust the color scale when you're mapping a discrete variable (i.e., categorical, like gender or animal breed) to color, one option is to use scale_color_hue, but you can also use scale_color_manual and a few other scale functions. To adjust the color scale for a continuous variable, like age, one option is the scale_color_gradient function.
There are custom scale functions you can use if you want to pull specific color palettes. One option is to use one of the Brewer color palettes, which you can do with functions like scale_color_brewer and scale_color_distiller. The Brewer palettes fall into three categories: sequential, diverging, and qualitative. You should use sequential or diverging palettes for continuous data and qualitative palettes for categorical data. You can explore the Brewer palettes at http://colorbrewer2.org/. You can also use display.brewer.pal to show the palettes within R:
library(RColorBrewer)
display.brewer.pal(name = "Set1", n = 8)
display.brewer.pal(name = "PRGn", n = 8)
display.brewer.pal(name = "PuBuGn", n = 8)
Once you have picked a Brewer palette you would like to use, you can specify it with the palette argument within the brewer scale function. The following plot shows examples of the same plot with four different Brewer palettes (a dark theme is also added with the pastel palette to show those points more clearly):
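The wc_example object used below is assumed to be a ggplot object like the World Cup scatterplot from earlier in this subsection, for example:

wc_example <- ggplot(worldcup, aes(x = Time, y = Passes,
                                   color = Position, size = Shots)) +
  geom_point(alpha = 0.5)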
a <- wc_example +
  scale_color_brewer(palette = "Set1") +
  ggtitle("Set1")
b <- wc_example +
  scale_color_brewer(palette = "Dark2") +
  ggtitle("Dark2")
c <- wc_example +
  scale_color_brewer(palette = "Pastel2") +
  ggtitle("Pastel2") +
  theme_dark()
d <- wc_example +
  scale_color_brewer(palette = "Accent") +
  ggtitle("Accent")
grid.arrange(a, b, c, d, ncol = 2)
You can set discrete colors manually using scale_color_manual and scale_fill_manual:
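For example (the specific colors here are arbitrary hex values):

ggplot(worldcup, aes(x = Time, y = Passes, color = Position)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("Defender" = "#1b9e77",
                                "Forward" = "#d95f02",
                                "Goalkeeper" = "#7570b3",
                                "Midfielder" = "#e7298a"))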
It is very easy to confuse the color and fill aesthetics. If you try to use a scale function for color or fill and it doesn't seem to be doing anything, make sure you've picked the correct aesthetic of these two. The fill aesthetic specifies the color to use for the interior of an element. The color aesthetic specifies the color to use for the border of an element. Many elements, including lines and some shapes of points, will only take a color aesthetic. In other cases, like polygon geoms, you may find you often accidentally specify a color aesthetic when you meant to specify a fill aesthetic.
Some packages provide additional color palettes. For example, there is a package called viridis with four good color palettes that are gaining popularity in visualization. From the package's GitHub repository:
These four color maps are designed in such a way that they will analytically be perfectly perceptually-uniform, both in regular form and also when converted to black-and-white. They are also designed to be perceived by readers with the most common form of color blindness.
library(viridis)
worldcup %>%
ggplot(aes(x = Time, y = Shots, color = Passes)) +
geom_point(size = 0.9) +
facet_wrap(~ Position) +
scale_color_viridis()
You can use these colors for discrete values, as well, by setting the discrete
parameter in the scale_color_viridis function to TRUE:
worldcup %>%
ggplot(aes(x = Time, y = Shots, color = Position)) +
geom_point(alpha = 0.7) +
scale_color_viridis(discrete = TRUE)
The option argument allows you to pick between four palettes: Magma, Inferno, Plasma, and Viridis. Here are examples of each of those palettes applied to the World Cup example plot:
library(gridExtra)
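A sketch of how those four panels could be produced (the base plot name worldcup_viridis is hypothetical):

worldcup_viridis <- worldcup %>%
  ggplot(aes(x = Time, y = Shots, color = Passes)) +
  geom_point(size = 0.9) +
  facet_wrap(~ Position)

magma_plot   <- worldcup_viridis + scale_color_viridis(option = "A") + ggtitle("magma")
inferno_plot <- worldcup_viridis + scale_color_viridis(option = "B") + ggtitle("inferno")
plasma_plot  <- worldcup_viridis + scale_color_viridis(option = "C") + ggtitle("plasma")
viridis_plot <- worldcup_viridis + scale_color_viridis(option = "D") + ggtitle("viridis")

grid.arrange(magma_plot, inferno_plot, plasma_plot, viridis_plot, ncol = 2)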
There are some excellent resources available for finding out more about creating plots using the ggplot2 package. If you want to get more practical tips on how to plot with ggplot2, check out:
4.3 Mapping
Often, data will include a spatial component, and you will want to map the
data either for exploratory data analysis or to present interesting aspects of
the data to others. R has a range of capabilities for mapping data. The simplest
techniques involve using data that includes latitude and longitude values and
using these location values as the x and y aesthetics in a regular plot. R also
has the ability to work with more complex spatial data objects and import
shapefiles through extensions like the sp package.
In this section, we will cover the basics of mapping in R and touch on some of the more advanced possibilities. We will also present some useful packages for making quick but attractive maps in R. R also now has the capability to make interactive maps using the plotly and leaflet packages; at the end of this section, we'll present these packages and explain a bit more about htmlWidgets in general.
Basics of mapping
The most basic way to map data in R is to create a regular ggplot object and
map longitude to the x aesthetic and latitude to the y aesthetic. You can use
this technique to create maps of geographic areas, like states or countries,
and to map locations as points, lines, and other shapes. The ggplot2 package
includes a few datasets with geographic information that can be accessed
with the map_data function. We'll pull one of these to use as an example of this basic method of mapping.
You can use the map_data function from the ggplot2 package to pull data for maps at different levels ("usa", "state", "world", "county"). The data you pull give locations of the borders of geographic polygons like states and counties.
For example, you can get the polygon location data for U.S. states by running
the following code:
library(ggplot2)
us_map <- map_data("state")
head(us_map, 3)
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
Notice that the dataframe includes columns with location (long and lat). It
also includes a column describing the order in which these points should be
connected to form a polygon (order), the name of the state (region), and a group
column that separates the points into unique polygons that should be plotted
(more on this in a minute).
If you plot the points for a couple of states, mapping longitude to the x aesthetic and latitude to the y aesthetic, you can see that the points show the outline of the states:
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat)) +
geom_point()
Map of Carolinas
If you try to join these points by just using a path geom rather than a points
geom, however, youll have a problem:
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat)) +
geom_path()
Map of Carolinas
If you create a path for all the points in the map, without separating polygons for different geographic groupings (like states or islands), the path will be drawn without "picking up the pen" between one state's polygon and the next state's polygon, resulting in unwanted connecting lines.
Mapping a group aesthetic in the ggplot object fixes this problem. This will plot a separate path or polygon for each separate polygon. In the U.S. states data, each polygon's group is specified by the group column. No two states share a group, and some states have more than one group (if, for example, they have islands). Here is the code for mapping the group column to the group aesthetic to create the map:
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_path()
You may have noticed that we used a path rather than line geom
to plot the state borders in the previous maps. This is because the
line geom connects points by their order on the x-axis. While you
often want that for statistical graphs, for maps in ggplot2 the x-axis is
longitude, and we want to connect the points in a way that outlines
the geographic areas. The geom_path function connects the points in
the order they appear in the dataframe, which typically gives us
the desired plot for mapping geographic areas. You likely will also
sometimes want to use a polygon geom for mapping geographic
areas, as shown in some of the following examples.
If you would like to set the color inside each geographic area, you should use
a polygon geom rather than a path geom. You can then use the fill aesthetic to
set the color inside the polygon and the color aesthetic to set the color of the
border. For example, to set the interior of the states to blue and the borders
to black, you can run:
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black")
To get rid of the x- and y-axes and the background grid, you can add the void
theme to the ggplot output:
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
theme_void()
To extend this code to map the full continental U.S., just remove the line of the
pipe chain that filtered the state mapping data to North and South Carolina:
us_map %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
theme_void()
In the previous few graphs, we used a constant aesthetic for the fill color.
However, you can map a variable to the fill to create a choropleth map with
a ggplot object. For example, the votes.repub dataset in the maps package gives
some voting data by state and year:
data(votes.repub)
head(votes.repub)
1856 1860 1864 1868 1872 1876 1880 1884 1888 1892
Alabama NA NA NA 51.44 53.19 40.02 36.98 38.44 32.28 3.95
Alaska NA NA NA NA NA NA NA NA NA NA
Arizona NA NA NA NA NA NA NA NA NA NA
Arkansas NA NA NA 53.73 52.17 39.88 39.55 40.50 38.07 32.01
California 18.77 32.96 58.63 50.24 56.38 50.88 48.92 52.08 49.95 43.76
Colorado NA NA NA NA NA NA 51.28 54.39 55.31 41.13
1896 1900 1904 1908 1912 1916 1920 1924 1928 1932
Alabama 28.13 34.67 20.65 24.38 8.26 21.97 30.98 27.01 48.49 14.15
Alaska NA NA NA NA NA NA NA NA NA NA
Arizona NA NA NA NA 12.74 35.37 55.41 41.26 57.57 30.53
Arkansas 25.11 35.04 40.25 37.31 19.73 28.01 38.73 29.28 39.33 12.91
California 49.13 54.48 61.90 55.46 0.58 46.26 66.24 57.21 64.70 37.40
Colorado 13.84 42.04 55.27 46.88 21.88 34.75 59.32 57.02 64.72 41.43
1936 1940 1944 1948 1952 1956 1960 1964 1968 1972 1976
Alabama 12.82 14.34 18.20 19.04 35.02 39.39 41.75 69.5 14.0 72.4 43.48
Alaska NA NA NA NA NA NA 50.94 34.1 45.3 58.1 62.91
Arizona 26.93 36.01 40.90 43.82 58.35 60.99 55.52 50.4 54.8 64.7 58.62
Arkansas 17.86 20.87 29.84 21.02 43.76 45.82 43.06 43.9 30.8 68.9 34.97
California 31.70 41.35 42.99 47.14 56.39 55.40 50.10 40.9 47.8 55.0 50.89
Colorado 37.09 50.92 53.21 46.52 60.27 59.49 54.63 38.7 50.5 62.6 55.89
To create a choropleth for one of the years, you can tidy the data, join it with
the U.S. data by state, and then map the voting percentages to the fill aesthetic:
library(dplyr)
library(viridis)
votes.repub %>%
  tbl_df() %>%
  mutate(state = rownames(votes.repub),
         state = tolower(state)) %>%
  right_join(us_map, by = c("state" = "region")) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = `1976`)) +
  geom_polygon(color = "black") +
  theme_void() +
  scale_fill_viridis(name = "Republican\nvotes (%)")
This code uses piping and tidyverse functions to clean the data, merge it
with the geographic data, and pipe to ggplot. See earlier sections of this
book to find out more about tidying data.
The votes.repub data initially is a matrix. The tbl_df function from dplyr
is used to convert it to a tibble.
The state names were originally in the row names of votes.repub. The
mutate function is used to move these into a column of the dataframe.
The names are then converted to lowercase to allow easier merging with
the geographic data.
The voting data includes Alaska and Hawaii, but the geographic data does not. Therefore, we've used right_join to join the two datasets, so only non-missing values from the us_map geographic data will be kept.
Because the column names for the years do not follow the rules for naming R objects ("1976" starts with a number), we've surrounded the column name with backticks when calling it in the aesthetic statement.
We want the borders of the states to always be black, so we've set that aesthetic as a constant rather than mapping it to a variable by including it in an aes call.
We've added a void theme (theme_void) to take out axes and background, and we added a custom color scale from the viridis package (scale_fill_viridis) to customize the colors used in the choropleth.
If you have data with point locations, you can add those points to a map created with ggplot, too, by adding a point geom. As an example, we'll use some data related to the popular Serial podcast. The podcast covered a murder in Baltimore. David Robinson posted a dataset of locations related to the show on GitHub, which you can read in directly to R to use for some of the mapping examples in this subsection:
library(readr)
serial <- read_csv(paste0("https://raw.githubusercontent.com/",
"dgrtwo/serial-ggvis/master/input_data/",
"serial_podcast_data/serial_map_data.csv"))
head(serial, 3)
# A tibble: 3 x 5
x y Type Name Description
<int> <int> <chr> <chr> <chr>
1 356 437 cell-site L688 <NA>
2 740 360 cell-site L698 <NA>
3 910 340 cell-site L654 <NA>
He figured out a way to convert the x and y coordinates in this data to latitude and longitude coordinates, and the code described in his post cleans up the data using that algorithm. The murder occurred when cell phones were just becoming popular, and cell phone data was used in the case. That cleaning code also adds a column for whether or not the location is a cell tower.
We can use ggplot to map these data on a base map of Baltimore City and Baltimore County in Maryland. To do so, use the map_data function from ggplot2 to pull the county map. By specifying the region parameter as "maryland", you can limit the returned geographic polygon data to Maryland counties. This data includes a column named subregion with the county name. You can use that column to filter to just the data for Baltimore City ("baltimore city") or Baltimore County ("baltimore"):
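A sketch:

maryland <- map_data("county", region = "maryland")
baltimore <- maryland %>%
  filter(subregion %in% c("baltimore city", "baltimore"))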
If you create a ggplot object with this data and add a polygon geom, you will
have a base map of these two counties:
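For example:

ggplot(baltimore, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") +
  theme_void()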
Serial data
To add the locations from the serial data to this map, you just need to add a
point geom, specifying the dataframe to use with the data parameter:
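A sketch, assuming the cleaned serial dataframe has long, lat, and tower columns as described above:

ggplot(baltimore, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") +
  geom_point(data = serial, aes(group = NULL, color = tower)) +
  theme_void() +
  scale_color_manual(name = "Cell tower", values = c("black", "red"))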
Serial data
When you add a geom to a ggplot object with mapped aesthetics, the geom will
inherit those aesthetics unless you explicitly override them with an aes call in
the geom function. That is why we did not have to explicitly map longitude to x
and latitude to y in the aes call when adding the points to the map (although,
as a note, if the column names for the longitude and latitude columns had
been different in the baltimore and serial dataframes, we would have needed
to reset these aesthetics when adding the points).
Further, we mapped the group column in the geographic data to the group
aesthetic, so the polygons would be plotted correctly. However, the serial
dataframe does not have a column for group. Therefore, we need to unset
the group aesthetic mapping in the point geom. We do that by specifying group
= NULL in the aes statement of the point geom.
Note that we've also customized the map a bit by setting constant colors for the fill for the two counties (fill = "lightblue") and by setting the colors and legend name for the points using scale_color_manual. By mapping the color of the points to the tower column in the dataframe, we show points that are cell towers in a different color than all other points.
The ggplot function requires that you input data in a dataframe. In the examples shown in this section, we started with dataframes that included geographic locations (latitude and longitude) as columns. This is the required format of data for mapping with ggplot2 (or with extensions like ggmap). Sometimes, however, you will want to plot geographic data in R that is in a different format. In particular, most R functions that read shapefiles will read the data into a spatial object rather than a dataframe. To map this data with ggplot2 and related packages, you will need to transform the data into a dataframe. You can do this using the fortify function from ggplot2. We'll cover this process in more detail in a later section, when we present spatial objects.
To get a base map with the get_map function from the ggmap package, you can either use the longitude and latitude of the center point of the map or you can use a character string to specify a location. If you do the second, get_map will use the Google Maps API to geocode the string to a latitude and longitude and then get the map (think of searching in Google Maps in the search box for a location). This will work well for most cities, and you can also use it with landmarks, but it might fail to geocode less well-known locations. You can also input an address as a character string when pulling a base map and Google will usually be able to successfully geocode and pull the right map. You can use the zoom parameter to set the amount the map is zoomed in on that location; this value should be between 3 and 20, with lower values more zoomed out and higher values more zoomed in.
For example, you can use the following code to pull a map of Beijing:
## install.packages("ggmap")
library(ggmap)
beijing <- get_map("Beijing", zoom = 12)
The get_map function returns a ggmap object. You can plot this object using the
ggmap function:
ggmap(beijing)
Map of Beijing
The output of ggmap is a ggplot object, so you can add elements to it in the same
way you would work with any other ggplot object. For example, to set the void
theme and add a title, you could run:
ggmap(beijing) +
theme_void() +
ggtitle("Beijing, China")
Map of Beijing
While the default source for maps with get_map is Google Maps, you can
also use the function to pull maps from OpenStreetMap and Stamen Maps.
Further, you can specify the type of map, which allows you to pull a variety
of maps including street maps and terrain maps. You specify where to get the
map using the source parameter and what type of map to use with the maptype
parameter.
Here are example maps of Estes Park, in the mountains of Colorado, pulled
using different map sources and map types. Also, note that we've used the
option extent = "device" when calling ggmap, which specifies that the map
should fill the whole plot area, instead of leaving room for axis labels and
titles. Finally, as with any ggplot object, we can save each map to an object.
We do that here so we can plot them together using the grid.arrange function,
which we'll describe in more detail in a later section in this course.
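A sketch of how the three maps can be pulled and saved (the specific sources, map types, and zoom level here are illustrative choices):
library(ggmap)
# Terrain map from Google Maps
map_1 <- ggmap(get_map("Estes Park", zoom = 12,
                       source = "google", maptype = "terrain"),
               extent = "device")
# Watercolor map from Stamen Maps
map_2 <- ggmap(get_map("Estes Park", zoom = 12,
                       source = "stamen", maptype = "watercolor"),
               extent = "device")
# Hybrid (satellite plus roads) map from Google Maps
map_3 <- ggmap(get_map("Estes Park", zoom = 12,
                       source = "google", maptype = "hybrid"),
               extent = "device")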
library(gridExtra)
grid.arrange(map_1, map_2, map_3, nrow = 1)
Once you have pulled one of these base maps into R, you can add ggplot
elements to them, including point and polygon geoms for locations. For
example, you could pull in a base map of the Baltimore County area and add
the elements we plotted from the serial dataframe in the last subsection:
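A sketch of that map, reusing the baltimore and serial dataframes from the last subsection (the zoom level, colors, and alpha value here are illustrative):
library(ggmap)
balt_map <- get_map("Baltimore County", zoom = 10)
ggmap(balt_map, extent = "device") +
  # Semi-transparent county polygons so the base map shows through
  geom_polygon(data = baltimore, aes(x = long, y = lat, group = group),
               color = "navy", fill = "lightblue", alpha = 0.2) +
  geom_point(data = serial, aes(x = long, y = lat, color = tower)) +
  scale_color_manual(name = "Cell tower", values = c("black", "red"))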
Note that we used alpha to add some transparency to the polygons so you could
see the base map through them.
Now that we've gone through some examples, here is a step-by-step review
of how the mapping process works with ggmap:
1. The get_map function pulls in a base map from the Google Maps API (or
another map server like Stamen Maps). The returned value is a ggmap
object.
2. The ggmap function plots this ggmap object and returns a ggplot object.
You can use this resulting ggplot object as you would any other ggplot
object (e.g., add geoms, change theme).
3. Call other ggplot2 functions on this output to add locations and customize
the map. Map longitude in the data to the x aesthetic and latitude to the y
aesthetic. Note that you are adding locations using a new dataframe for
the geom. Just as with regular ggplot objects, if you use a new dataframe
for a geom, you must specify it with the data parameter for that geom.
Because geoms do not take dataframes as their first arguments, you can't
specify the dataframe first without data = and rely on position with
geoms. (By contrast, the ggplot function does take the data parameter as
its first argument, so that's why you can get away with not using data
= when specifying a dataframe in the original ggplot call for a regular
ggplot object.)
You can use the ggmap package to do a number of other interesting tasks related
to geographic data. For example, the package allows you to use the Google
Maps API, through the geocode function, to get the latitude and longitude of
specific locations based on character strings of the location or its address. For
example, you can get the location of the Supreme Court of the United States
by calling:
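A sketch of that call (the returned one-row dataframe of longitude and latitude is not shown here):
geocode("Supreme Court of the United States")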
You can compute map distances, too, using the mapdist function with two
locations:
mapdist("Baltimore, MD",
"1 First St NE, Washington, DC") %>%
select(from, to, miles)
from to miles
1 Baltimore, MD 1 First St NE, Washington, DC 37.90664
To find out more about how Google Maps is performing this and other tasks,
you can read its API documentation.
For these GIS-style tasks, the ggmap package is not running its own
algorithms but rather using the Google Maps API. This package
cannot do other GIS tasks, like finding the centroids or areas of
spatial polygons. To use R as a GIS for more substantive tasks, you'll
need to use other R packages, like sp and rgdal.
If you need to map US states and counties, the choroplethr and choroplethrMaps
packages offer functions for fast and straightforward mapping. The choroplethr
package also offers an interesting example of incorporating mapping functions
within an R package. You can explore the code for the package, as well as some
documentation, at the choroplethr package's GitHub page: https://github.com/trulia/choropleth
library(choroplethr)
library(choroplethrMaps)
data(df_pop_county)
df_pop_county %>% slice(1:3)
region value
1 1001 54590
2 1003 183226
3 1005 27469
As long as you are using a dataframe with a numeric column named region
with each county's FIPS code and a column named value with the value you'd
like to map (population in this case), you can create a choropleth just by
running the county_choropleth function on the dataframe.
county_choropleth(df_pop_county)
If you want to plot only some of the states, you can use the state_zoom argument:
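For example (the states chosen here are illustrative):
county_choropleth(df_pop_county,
                  state_zoom = c("colorado", "wyoming"))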
To plot values over a reference map from Google Maps, you can use the
reference_map argument:
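For example, a sketch that zooms in on one state and adds the reference map:
county_choropleth(df_pop_county,
                  state_zoom = c("north carolina"),
                  reference_map = TRUE)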
This example is using one of the datasets that comes with the choroplethr
package, but you can map any dataset that includes a column with county
FIPS and a column with the value you would like to plot. All you have to do is
(1) make sure the county FIPS is in a numeric class and (2) name the columns
for FIPS and the value to plot as region and value, respectively (the
rename function from dplyr is useful here). For example, here is a dataframe
giving storm events that were listed in NOAA's Storm Events database near the
time of Hurricane Floyd:
library(readr)
floyd_events <- read_csv("data/floyd_events.csv")
floyd_events %>% slice(1:3)
# A tibble: 3 4
begin_date end_date fips type
<date> <date> <chr> <chr>
1 1999-09-16 1999-09-17 25011 Heavy Rain
2 1999-09-16 1999-09-17 25001 Heavy Rain
3 1999-09-16 1999-09-17 25015 Heavy Rain
You can use the following code to plot the number of events listed for each
US county by cleaning and summarizing the data in a pipe chain and then
piping the output to the county_choropleth function. The choropleth mapping
functions require that each county is included only once, so we used group_by
and summarize to collapse the dataframe to have only a single observation for
each county.
floyd_events %>%
group_by(fips) %>%
dplyr::summarize(n_events = n()) %>%
mutate(fips = as.numeric(fips)) %>%
dplyr::rename(region = fips,
value = n_events) %>%
county_choropleth(state_zoom = c("north carolina", "virginia"),
reference_map = TRUE)
The map created by county_choropleth (and the other maps created by functions
in the choroplethr package) is a ggplot object, so you can add elements
to it. For example, to create a map of flood events that includes the track of
Hurricane Floyd on the map, you can run:
floyd_events %>%
dplyr::group_by(fips) %>%
dplyr::summarize(flood = sum(grepl("Flood", type))) %>%
dplyr::mutate(fips = as.numeric(fips)) %>%
dplyr::rename(region = fips,
value = flood) %>%
county_choropleth(state_zoom = c("north carolina", "maryland",
"delaware", "new jersey",
"virginia", "south carolina",
"pennsylvania", "new york",
"connecticut", "massachusetts",
"new hampshire", "vermont",
"maine", "rhode island"),
reference_map = TRUE) +
geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
group = NA),
color = "red")
To create county choropleths with the choroplethr package that are more
customized, you can use the package's CountyChoropleth, which is an R6 object
for creating custom county choropleths. To create an object, you can run
CountyChoropleth$new with the data you'd like to map. As with county_choropleth,
this data should have a column named region with county FIPS codes
in a numeric class and a column named value with the values to plot.
To map counties in which a flood event was reported around the time of
Floyd, you can start by cleaning your data and then creating an object using
CountyChoropleth$new:
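A sketch of that cleaning and object creation (the intermediate object name floyd_floods and the exact cleaning steps are illustrative; floyd_map is used in the calls that follow):
library(choroplethr)
library(dplyr)
floyd_floods <- floyd_events %>%
  # Keep only events whose type mentions "Flood" (includes "Flash Flood")
  dplyr::filter(grepl("Flood", type)) %>%
  dplyr::mutate(fips = as.numeric(fips)) %>%
  dplyr::group_by(fips) %>%
  dplyr::summarize(value = 1) %>%
  dplyr::rename(region = fips)
floyd_map <- CountyChoropleth$new(floyd_floods)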
As a note, in cleaning the data here, we wanted to limit the dataset to only
observations where the event type included the word "Flood" (this will pull
events listed as "Flood" or "Flash Flood"), so we've used the grepl function to
filter to just those observations.
Once you have created a basic object using CountyChoropleth, you can use
the methods for this type of object to customize the map substantially. For
example, you can set the states using the set_zoom method:
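For instance (the states listed here are illustrative):
floyd_map$set_zoom(c("north carolina", "virginia"))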
At any point, you can render the object using the render method (or
render_with_reference_map, to plot the map with the Google reference map added):
floyd_map$render()
To find out what options are available for this object type, in terms of methods
you can use or attributes you can change, you can run:
names(floyd_map)
[1] ".__enclos_env__" "add_state_outline"
[3] "ggplot_polygon" "projection"
[5] "ggplot_scale" "warn"
[7] "legend" "title"
[9] "choropleth.df" "map.df"
[11] "user.df" "clone"
[13] "clip" "initialize"
[15] "set_zoom" "render_state_outline"
[17] "render_helper" "render"
[19] "set_num_colors" "get_zoom"
[21] "format_levels" "theme_inset"
[23] "theme_clean" "get_scale"
[25] "prepare_map" "bind"
[27] "discretize" "render_with_reference_map"
[29] "get_choropleth_as_polygon" "get_reference_map"
[31] "get_y_scale" "get_x_scale"
[33] "get_bounding_box" "get_max_lat"
[35] "get_min_lat" "get_max_long"
[37] "get_min_long"
# Pull the map boundaries so both maps can be zoomed to the same area
floyd_xlim <- floyd_map$get_bounding_box()[c(1, 3)]
floyd_ylim <- floyd_map$get_bounding_box()[c(2, 4)]
a <- floyd_map$render() +
geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
group = NA),
color = "red", size = 2, alpha = 0.6) +
xlim(floyd_map$get_bounding_box()[c(1, 3)]) +
ylim(floyd_map$get_bounding_box()[c(2, 4)])
b <- floyd_map$render_with_reference_map() +
geom_path(data = floyd_track, aes(x = -longitude, y = latitude,
group = NA),
color = "red", size = 2, alpha = 0.6) +
xlim(floyd_xlim) +
ylim(floyd_ylim)
library(gridExtra)
grid.arrange(a, b, ncol = 2)
Here we've used the get_bounding_box method to get the map boundaries, and
then we've used those values for the xlim and ylim functions when we create
the final ggplot objects. Finally, the rendered maps are ggplot objects, so to
include the hurricane track, we can add ggplot elements to the map using +,
as with any ggplot object. We used the grid.arrange function from the gridExtra
package to put the two maps (with and without the background Google map)
side-by-side.
So far, we have relied on ggplot and related packages for mapping. However,
there are other systems for mapping in R. In particular, geographic data in R is
often stored in spatial objects (e.g., SpatialPolygons, SpatialPointsDataFrame),
particularly when it is read in from the shapefiles commonly used to store
spatial data outside of R.
In this subsection we will introduce these spatial objects, show how to work
with them in R (including how to convert them to dataframes so they can
be used with the ggplot-based mapping covered in earlier subsections), and
briefly describe shapefiles.
Spatial objects in R
R has a series of special object types for spatial data. For many mapping /
GIS tasks, you will need your data to be in one of these objects. These spatial
objects include objects that just contain geographies (e.g., locations along
the borders of countries) or objects that contain geographies and associated
attributes of each element of the geography (e.g., county boundaries as well
as the population of each county). The most common spatial objects in R are:
SpatialPolygons
SpatialPoints
SpatialLines
SpatialPolygonsDataFrame
SpatialPointsDataFrame
SpatialLinesDataFrame
The tigris package lets you pull spatial data directly from the US Census. This
data comes into R as a spatial object. To provide a basic overview of working
with spatial objects in R, we will use an example spatial object pulled with this
package.
The tigris package includes a function called tracts that allows you to pull the
geographic data on boundaries of U.S. Census tracts. You can use the state and
county parameters to limit the result to certain counties, and you can set cb =
TRUE if a lower-resolution (and smaller) file is adequate. To pull census tract
boundaries for Denver, CO, you can run:
library(tigris)
library(sp)
denver_tracts <- tracts(state = "CO", county = 31, cb = TRUE)
By running class on the returned object, you can see that this function has
returned a SpatialPolygonsDataFrame object.
class(denver_tracts)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
Spatial objects like this have a plot method that can be called to plot the
object. This means that you can map these census tract boundaries by calling:
plot(denver_tracts)
There are a number of other methods for this specific object type. For example,
bbox will print out the bounding box of the spatial object (range of latitudes
and longitudes covered by the data).
bbox(denver_tracts)
min max
x -105.10993 -104.60030
y 39.61443 39.91425
The is.projected and proj4string functions give you some information about
the current Coordinate Reference System of the data (we describe more about
Coordinate Reference Systems later in this subsection).
is.projected(denver_tracts)
[1] FALSE
proj4string(denver_tracts)
[1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
You can access the attribute data stored with a spatial object using @. For
example, here's the beginning of the dataframe for the denver_tracts spatial
object:
head(denver_tracts@data)
STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
25 08 031 000201 1400000US08031000201 08031000201 2.01 CT
26 08 031 000302 1400000US08031000302 08031000302 3.02 CT
27 08 031 001101 1400000US08031001101 08031001101 11.01 CT
28 08 031 002802 1400000US08031002802 08031002802 28.02 CT
29 08 031 003300 1400000US08031003300 08031003300 33 CT
30 08 031 004006 1400000US08031004006 08031004006 40.06 CT
ALAND AWATER
25 2084579 0
26 1444043 0
27 898885 0
28 886798 0
29 1288718 0
30 1953041 0
For this spatial object, the data includes identifying information (state, county,
tract), but also some attribute data (area of the tract that is land, area of the
tract that is water).
You can add different layers of spatial objects onto the same plot. To do that,
just use add = TRUE for added layers. For example, to add primary roads to the
Denver census tract map, you can pull a spatial object with roads using the
primary_roads function from the tigris package (note: this data includes roads
across the U.S. and so might take a few seconds to download or render) and
then use plot with add = TRUE to add the roads to the map:
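A sketch of pulling and plotting the roads, followed by converting the tract boundaries to a dataframe for ggplot2; the object names roads and denver_tracts_df are illustrative (denver_tracts_df is used in the next chunk):
library(tigris)
library(ggplot2)
# Primary roads for the whole U.S.; this download can take a little while
roads <- primary_roads()
plot(denver_tracts)
plot(roads, add = TRUE, col = "blue")
# Convert the SpatialPolygonsDataFrame to a dataframe with fortify
denver_tracts_df <- fortify(denver_tracts)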
Now you can use the data in denver_tracts_df to create a map using ggplot2 functions:
denver_tracts_df %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
theme_void()
Building Data Visualization Tools 353
Spatial objects will have a Coordinate Reference System (CRS), which specifies
how points on a curved earth are laid out on a two-dimensional map. A
CRS can be geographic (e.g., WGS84, for longitude-latitude data) or projected
(e.g., UTM, NAD83). The full details of map projections are beyond the scope
of this course, but if you'd like to find out more details, this section of the
documentation for QGIS is very helpful.
R, it is important to realize that spatial objects have a Coordinate Reference
System attribute and that you can run into problems if you try to work directly
with two spatial objects with different Coordinate Reference Systems.
To find out the CRS for a spatial object (like a SpatialPoints object) that already
has a CRS, you can use proj4string. For example, to get the CRS of the Denver
census tract data, you can run:
proj4string(denver_tracts)
[1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
If a spatial object does not already have a CRS attribute, you can set one with
proj4string:
## Generic code
proj4string(my_spatial_object) <- "+proj=longlat +datum=NAD83"
Note that this call does not create a projection or reproject the data. Rather,
this call specifies to R the CRS that the data currently is in.
The CRS function from the sp package creates CRS class objects that can be
used in this specification. You input projection arguments into this function
as a character string (for example, CRS("+proj=longlat +datum=NAD27")).
You can also, however, use a shorter EPSG code for a projection (for ex-
ample, CRS("+init=epsg:28992")). The http://www.spatialreference.org website
lists these projection strings and can be useful in determining a string to use
when setting projection information or re-projecting data.
library(sp)
CRS("+proj=longlat +datum=NAD27")
CRS arguments:
+proj=longlat +datum=NAD27 +ellps=clrk66
+nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat
CRS("+init=epsg:28992")
CRS arguments:
+init=epsg:28992 +proj=sterea +lat_0=52.15616055555555
+lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000
+ellps=bessel
+towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725
+units=m +no_defs
If a spatial object has a CRS and you want to change it, you should do so using
the spTransform function from the rgdal package. You input the spatial object
whose CRS you want to change as well as the CRS object to which to change
it:
## Generic code
my_spatial_object <- spTransform(my_spatial_object,
CRS = CRS("+init=epsg:4267"))
If you want to ensure that the CRS of two spatial objects agree, you can use
proj4string to pull the CRS from one of the spatial objects and specify that as
the output CRS for an spTransform call on the other object, like:
## Generic code
my_spatial_object <- spTransform(my_spatial_object,
CRS = proj4string(another_sp_object))
If you are interested in finding out more, Melanie Frazier has created an
excellent resource on Coordinate Reference Systems and maps in R: https://www.nceas.ucsb
The coord_map function in ggplot2 can help you in plotting maps with different
projections. This function does not change any aspect of the data
being mapped, but rather changes the projection when mapping the data.
In fact, since this function is used with ggplot-style mapping, all data being
mapped will be in dataframes rather than spatial objects and so will not have
specifications for CRS. The following examples, which are adapted from the
help file for the coord_map function, show example output when the coord_map
element is added to a map of the United States:
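A sketch along the lines of those help-file examples (the polyconic projection requires the mapproj package):
library(mapproj)
usamap <- ggplot(map_data("state"), aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
usamap + coord_map()             # default Mercator projection
usamap + coord_map("polyconic")  # polyconic projection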
Shapefiles
Shapefiles are a file format that is often used for saving and sharing geographic
data, particularly when using GIS software. This format is not R-specific,
but R can read in and write out shapefiles. The format is typically not
a single file, but rather a collection of related files in a directory. Shapefiles
often include both geographic information and also data describing attributes
(for example, a shapefile might include the locations of country borders as
well as the population of each of the countries).
To read shapefiles into R, use the readOGR function from the rgdal package.
You can also write out spatial objects you've created or modified in R to
shapefiles using the writeOGR function from the same package. The readShape* family
of functions from the maptools package can also be used to read shapefiles
into R. These functions all read the spatial data in as a spatial object. For
example, the shapefiles of country borders and populations would be read in
as a SpatialPolygonsDataFrame object. Once you read a shapefile into R, you
can work with the spatial object in the same way we showed how to work
with the Denver census tracts spatial object earlier in this subsection.
To find out more about shapefiles, R, and ggplot2, check out the wiki listing at
https://github.com/tidyverse/ggplot2/wiki/plotting-polygon-shapefiles.
R as GIS
In addition to mapping, you can also use R for a number of GIS-style tasks,
including:
Clipping
Creating buffers
Measuring areas of polygons
Counting points in polygons
These tasks can be done with GIS software, and if you are doing extensive GIS
work, it may be worthwhile to use specialized software. However, if you just
need to do a few GIS tasks as part of a larger workflow, you should consider
using R for these steps. Some advantages to using R for GIS tasks are:
R is free
You can write all code in a script, so research is more reproducible
You save time and effort by staying in one software system, rather than
moving data between different software
To show some of the GIS-style tasks that can be done from R, well use some
driver-level data from the Fatality Analysis Reporting System (FARS) for 2001
to 2010, which we have saved as fars_colorado.RData:
load("data/fars_colorado.RData")
driver_data %>%
dplyr::select(1:5) %>% dplyr::slice(1:5)
state st_case county date latitude
1 8 80001 51 2001-01-01 10:00:00 39.10972
2 8 80002 31 2001-01-04 19:00:00 39.68215
3 8 80003 31 2001-01-03 07:00:00 39.63500
4 8 80004 31 2001-01-05 20:00:00 39.71304
5 8 80005 29 2001-01-05 10:00:00 39.09733
The dataset includes a column for the county in which each accident occurred,
so you can also aggregate the data by county and use a function
from the choroplethr package to quickly create a county-specific choropleth of
accident counts (note that, because the data is driver specific, this will count
every car in an accident):
library(stringr)
county_accidents <- driver_data %>%
dplyr::mutate(county = str_pad(county, width = 3,
side = "left", pad = "0")) %>%
tidyr::unite(region, state, county, sep = "") %>%
dplyr::group_by(region) %>%
dplyr::summarize(value = n()) %>%
dplyr::mutate(region = as.numeric(region))
county_accidents %>% slice(1:4)
# A tibble: 4 2
region value
<dbl> <int>
1 8001 617
2 8003 77
3 8005 522
4 8007 41
As a note, this code uses the str_pad function from the stringr package to pad
1- or 2-digit county FIPS codes with leading zeros before pasting them to the
state FIPS code and uses the n function from dplyr with summarize to count the
number of observations in each county.
This technique of creating a choropleth only worked because we had a
column in the data linking accidents to counties. In some cases, you will want
to create a choropleth based on counts of points but will not have this linking
information in the data. For example, we might want to look at accident
counts by census tract in Denver. To do this, well need to link each accident
(point) to a census tract (polygon), and then we can count up the number of
points linked to each polygon. We can do this with some of the GIS-style tools
available in R.
library(sp)
denver_fars_sp <- denver_fars
coordinates(denver_fars_sp) <- c("longitud", "latitude")
proj4string(denver_fars_sp) <- CRS("+init=epsg:4326")
class(denver_fars_sp)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
To be able to pair up polygons and points, the spatial objects need to have the
same CRS. To help later with calculating the area of each polygon, well use a
projected CRS that is reasonable for Colorado and reproject the spatial data
using the spTransform function:
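A sketch of that reprojection; the specific projected CRS used here (EPSG:26953, NAD83 / Colorado Central) is an illustrative choice:
library(rgdal)
denver_tracts_proj <- spTransform(denver_tracts,
                                  CRS("+init=epsg:26953"))
denver_fars_proj <- spTransform(denver_fars_sp,
                                CRS(proj4string(denver_tracts_proj)))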
Now that the objects with the accident locations and with the census tracts
are both spatial objects with the same CRS, we can combine them on a map.
Because they are spatial objects, we can do that using plot:
plot(denver_tracts_proj)
plot(denver_fars_proj, add = TRUE, col = "red", pch = 1)
Now that both datasets are spatial objects and have the same CRS, you can
use the poly.counts function to count how many of the accidents are in each
census tract. This function inputs a spatial points object and a spatial polygons
object and outputs a numeric vector with the count of points in each polygon:
library(GISTools)
tract_counts <- poly.counts(denver_fars_proj, denver_tracts_proj)
head(tract_counts)
25 26 27 28 29 30
7 2 2 0 0 4
You can use a choropleth to show these accident counts. In this case, the
quickest way to do this is probably to use the choropleth function in the
GISTools package.
choropleth(denver_tracts, tract_counts)
There are other functions in R that do other GIS tasks. For example, the
poly.areas function in the GISTools package calculates the area of each polygon.
head(poly.areas(denver_tracts_proj))
25 26 27 28 29 30
2100172.2 1442824.1 897886.3 881530.5 1282812.2 1948187.1
You can use this functionality to create a choropleth of the rate of fatal
accidents per population in Denver census tracts:
choropleth(denver_tracts,
tract_counts / poly.areas(denver_tracts_proj))
Raster data
When mapping in R, you may also need to map raster data. You can think of
raster data as data shown with pixels: the graphing region is divided into
even squares, and color is constant within each square.
The rasterize function in the raster package allows you to rasterize data.
That is, you take spatial points data, divide the region into squares, and count
the number of points (or other summary) within each square. When you do
this, you need to set the x- and y-range for the raster squares. You can use
bbox on a spatial object to get an idea of its ranges to help you specify these
limits. You can use the res parameter in raster to set how large the raster boxes
should be. For example, here is some code for rasterizing the accident data
for Denver:
library(raster)
bbox(denver_fars_sp)
min max
longitud -105.10973 -104.0122
latitude 39.61715 39.8381
denver_raster <- raster(xmn = -105.09, ymn = 39.60,
xmx = -104.71, ymx = 39.86,
res = 0.02)
den_acc_raster <- rasterize(geometry(denver_fars_sp),
denver_raster,
fun = "count")
You can use the image function to plot this raster alone:
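For example, using the den_acc_raster object created above:
image(den_acc_raster)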
Raster data
You can use plot with add = TRUE to add the raster to a base plot of Denver. In
this case, you will likely want to set some transparency (alpha) so you can see
the base map through the raster:
plot(denver_tracts)
plot(den_acc_raster, add = TRUE, alpha = 0.5)
There is a lot more you can learn about mapping in R than we could cover
here. Here are some good resources if you would like to learn more:
4.4 htmlWidgets
Overview of htmlWidgets
leaflet: Mapping
dygraphs: Time series
plotly: A variety of plots, including maps
rbokeh: A variety of plots, including maps
networkD3: Network data
d3heatmap: Heatmaps
DT: Data tables
DiagrammeR: Diagrams and flowcharts
The leaflet and plotly packages are two of the most useful and developed
packages in this collection of htmlWidgets. In this section, we will overview
what you can make with these two packages.
If you are interested in learning all the details about the JavaScript
on which these htmlWidgets are built, you may find the short book
Getting Started with D3 by Mike Dewar interesting.
plotly package
This section on the plotly package requires the use of a web browser
to see results. Therefore, we recommend that you go to the web
version of this book to view this particular section and to interact
with the graphics examples.
When using the first method (calling the package's plotting functions directly), most graphics other than maps will be created
using the plot_ly function. For example, if you want to plot an interactive
scatterplot of time versus shots for the World Cup 2010 data (which we have
created as a static plot earlier in this section), you can do so with the following
code:
library(faraway)
data(worldcup)
library(plotly)
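A minimal version of that call might look like the following (mode = "markers" is added here for explicitness; note that this chunk specifies the plot type, which is referenced again a little later):
plot_ly(worldcup, type = "scatter", mode = "markers",
        x = ~ Time, y = ~ Shots, color = ~ Position)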
Plotly plot
If you view this plot in a format where it is interactive, you can see that labels
pop up for each point as you pass the cursor over it. Further, there are some
buttons at the top of the graph that allow interactive actions like zooming,
panning, and selection of a subset of the graph.
This code specifies the dataframe with the data to plot, what type of plot to
create, and mappings of variables to aesthetics. In this case, we want to show
Time on the x-axis and Shots on the y-axis, so we specify those for the x and y
parameters. Further, we want to show player position with color, so we map
the Position column to the color parameter.
While you usually won't use syntax like this when using ggplot2 in
interactive coding, you will use it to avoid non-standard evaluation
when using ggplot2 code in functions you write for a package. See
the section on non-standard evaluation earlier in the book for more
on this concept.
By default, the pop-ups will show the mapped aesthetics when you move
the cursor over each point. However, you can change this default to show
something different when the viewer scrolls over each point. For example,
the plot we created above for the World Cup data maps player time to the x
aesthetic, shots to the y aesthetic, and color to the players position. Therefore,
by default these three values will be shown for a point if you move the cursor
over the point. However, you might prefer to show each players name, which
is contained in the rownames of the worldcup data. You can do this by using
dplyr tools to move the rownames to a column named Name and then mapping
that column to the text aesthetic, then setting the hoverinfo parameter to
"text":
worldcup %>%
mutate(Name = rownames(worldcup)) %>%
plot_ly(x = ~ Time, y = ~ Shots, color = ~ Position) %>%
add_markers(text = ~ Name, hoverinfo = "text")
You can use the paste function to create a more customized text label. Use
HTML tags for any formatting. For example, to show both the players name
and team in a more attractive format, you could run:
worldcup %>%
mutate(Name = rownames(worldcup)) %>%
plot_ly(x = ~ Time, y = ~ Shots, color = ~ Position) %>%
add_markers(text = ~ paste("<b>Name:</b> ", Name, "<br />",
"<b>Team:</b> ", Team),
hoverinfo = "text")
If you aren't familiar with HTML syntax, you may find it helpful to use an HTML
cheatsheet like this one.
Just like with ggplot2, the mappings you need depend on the type of plot you
are creating. For example, scatterplots (type = "scatter") need x and y defined,
while a surface plot (type = "surface") can be created with a single vector of
elevation, using a mapping to the z aesthetic.
The plotly package is designed so you can pipe data into plot_ly and add
elements by piping into add_* functions (this idea is similar to adding elements
to a ggplot object with +). For example, you could create the same scatterplot
we just created by piping the World Cup data into plotly, and then piping that
output to add_markers, without needing to specify that the type of plot should
be a scatterplot as we did in the last code chunk:
worldcup %>%
plot_ly(x = ~ Time, y = ~ Shots, color = ~ Position) %>%
add_markers()
Functions in this add_* family include:
add_markers
add_lines
add_paths
add_polygons
add_segments
add_histogram
If you pipe to the rangeslider function, it allows the viewer to zoom in on part
of the x range. This functionality can be particularly nice for time series. For
example, you can read in data on the maximum winds for Hurricane Floyd
at different points along its track. You can pipe the result of reading in the csv
directly into the plot_ly call. To show a time series of wind speeds, map the
time stamp to the x aesthetic and the wind to the y aesthetic. You can then
add a line and range slider:
read_csv("data/floyd_track.csv") %>%
plot_ly(x = ~ datetime, y = ~ max_wind) %>%
add_lines() %>%
rangeslider()
Notice that, in the output, you can change the range of data plotted in the top
graph by interactively adjusting the window shown in the lower plot.
You can make a 3-D scatterplot with plot_ly by mapping a variable to the z
variable. For example, to plot a scatter plot of time, shots, and passes in the
World Cup 2010 data, you can run (note that size is set with a constant value
to make the points larger):
worldcup %>%
plot_ly(x = ~ Time, y = ~ Shots, z = ~ Passes,
color = ~ Position, size = I(3)) %>%
add_markers()
3-D scatterplot
Again, if you move the cursor over the points in the scatterplot, you can see
the value of the point. Further, the tool bar above the plot includes buttons
that allow you to rotate the plot and look at it from different angles.
Similarly, you can create 3-D surface plots with the plot_ly function. In this
case, if you have a matrix of data regularly spaced on x- and y-dimensions,
with the cell values in the matrix giving values of a third variable, you can
create a surface map with plot_ly by mapping the matrix values to the z
aesthetic. The helpfile for plot_ly includes an example using the volcano data
that comes with R. This data is in a matrix format, and each value gives the
elevation for a particular pair of x- and y-coordinates for a volcano in New
Zealand.
class(volcano)
[1] "matrix"
volcano[1:4, 1:4]
[,1] [,2] [,3] [,4]
[1,] 100 100 101 101
[2,] 101 101 102 102
[3,] 102 102 103 103
[4,] 103 103 104 104
You can use the following code to create a 3-D surface plot of this data.
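A minimal sketch of that surface plot:
plot_ly(z = ~ volcano, type = "surface")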
The other way to create a plotly graph is to first create a ggplot object and
then transform it into an interactive graphic using the ggplotly function.
Earlier in this subsection, we used plot_ly to create an interactive scatterplot
with the World Cup data. We could have created the same plot by first creating a
ggplot object with the scatterplot and then passing it to the ggplotly function:
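A sketch of that approach (the object name wc_scatter is illustrative):
wc_scatter <- ggplot(worldcup, aes(x = Time, y = Shots, color = Position)) +
  geom_point()
ggplotly(wc_scatter)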
Using ggplotly
If you get an error when you try this code, make sure you have
the latest versions of ggplot2 and plotly installed. It may be necessary
for you to install the development version of plotly directly from
GitHub, which you can do using devtools::install_github("ropensci/plotly").
If you would like to find out more about what you can do with the plotly
package, the creator of the package has written a bookdown book on the
package that you can read here.
Leaflet
Leaflet is a JavaScript library that you can use to create very attractive
interactive maps. You will recognize the output, as maps created using Leaflet are
now very common on websites. You can find out more about the JavaScript
version here: http://leafletjs.com. The leaflet package allows you to create
these maps from within R. As with other htmlWidgets, you can explore
these maps in the Viewer pane of RStudio and also add them to HTML R
Markdown output and Shiny web applications.
For the examples in this section, we'll use the data on fatal accidents and
census tracts in Denver, Colorado. This data is contained in the denver_tracts
and driver_data datasets created in an earlier subsection of the book. If you
need to, you can reload those using the following code (replace the filepath
in the load call with the filepath to where you have saved this example data
on your own computer):
library(tigris)
denver_tracts <- tracts(state = "CO", county = 31, cb = TRUE)
load("data/fars_colorado.RData")
denver_fars <- driver_data %>%
filter(county == 31 & longitud < -104.5)
To start creating a leaflet map in R, you need to initialize a leaflet object (this is
similar to how you initialize a ggplot object when creating plots with ggplot2).
You do this with the leaflet function. If you just run leaflet() without adding
any elements, however, you just get a blank leaflet area:
library(leaflet)
leaflet()
In leaflet, the map background is composed of map tiles, which you can pull
from a number of different sources. To get a background for your leaflet map,
you'll need to add tiles to the object created by leaflet. If you don't add any
elements other than tiles, the leaflet map will zoom out to show the world:
leaflet() %>%
addTiles()
Once you have a leaflet object and map tiles, you'll add other elements
to show your data. This is similar to adding geoms to a ggplot object.
A common element you'll want to add is points showing locations. You
can add points using markers (these will give the map pins you may be
familiar with from Google Maps) or circle markers. You add these elements by
piping the current leaflet object into an addMarkers or addCircleMarkers function.
leaflet() %>%
addTiles() %>%
addMarkers(data = denver_fars, lng = ~ longitud, lat = ~ latitude)
In the call to addMarkers, the lng and lat parameters tell R which columns
contain data on longitude and latitude for each point. These parameters
are not needed if you are using a spatial object (e.g., SpatialPointsDataFrame).
Further, R will try to guess which columns show longitude and latitude in a
regular dataframe if you do not specify these parameters.
To use circles for your markers instead of pins, use addCircleMarkers. You can
adjust the size of the circles with the radius parameter:
leaflet() %>%
addTiles() %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude)
If you have a lot of overlapping data, you may prefer to use the clusterOptions
argument when adding markers. When using this option, markers are shown
as clusters that group together when you zoom out but split up when you
zoom in, so they can be useful when you have very dense points you would
like to map, as in this example.
leaflet() %>%
addTiles() %>%
addMarkers(data = denver_fars,
lng = ~ longitud, lat = ~ latitude,
clusterOptions = markerClusterOptions())
Clustering markers
The background map comes from the map tiles you add to the leaflet object.
For the background, the default is to use map tiles from OpenStreetMap.
However, you can use different tiles by changing the source of the tiles. To
do this, use the addProviderTiles function in place of the addTiles function and
specify the provider of the tiles you would like to use. To figure out what you
would like to use, you can see previews of provider choices here:
http://leaflet-extras.github.io/leaflet-providers/preview/index.html.
For example, to use Stamen watercolor map tiles, you can call:
leaflet() %>%
addProviderTiles("Stamen.Watercolor") %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude)
leaflet() %>%
addProviderTiles("Thunderforest.TransportDark") %>%
addCircleMarkers(data = denver_fars, radius = 2, color = I("red"),
lng = ~ longitud, lat = ~ latitude)
You can also add pop-ups that show information about a point when a user
clicks on the point. To do this, use the popup option in the function where you
add the element to the leaflet object. The popup parameter
requires a character vector, so if you want to show something currently in a
different class vector, wrap it in paste. For example, to add popups giving the
age of the driver for the map of accidents, you can run:
leaflet() %>%
addTiles() %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude,
popup = ~ paste(age))
You can build nicely formatted popups by adding HTML tags into the character
string for the pop-up. For example, to make it clearer to viewers that
the pop-up is showing age, you could use paste and some HTML formatting to
create the character string for the popup parameter.
leaflet() %>%
addTiles() %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude,
popup = ~ paste("<b>Driver age:</b>", age))
If you are going to make more complex pop-ups, you might want to create a
column with the pop-up strings before passing the data into the leaflet call.
For example, you could create pop-ups that show driver age, the date and
time of the accident, and blood alcohol content if that data is available:
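A minimal sketch of building such a column, showing only driver age and date (the age and date columns appear elsewhere in this section; a blood alcohol column would be pasted in the same way if your data includes one):
library(dplyr)
denver_fars <- denver_fars %>%
  mutate(popup_info = paste0("<b>Driver age:</b> ", age, "<br/>",
                             "<b>Date:</b> ", date))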
denver_fars %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(radius = 2, lng = ~ longitud, lat = ~ latitude,
popup = ~ popup_info)
In the popups, you can use HTML to format things like color, typeface, and
size. You can also add links.
To use color to show a value, you need to do a few things. First, you need
to use the colorFactor function (or another in its family) to create a function
for mapping from values to colors. Then, you need to use this within the call
to add the markers. For example, the drunk_dr column in the denver_fars data
gives the number of drunk drivers involved in an accident. You can use the
following code to show that value using color in the leaflet map:
library(viridis)
pal <- colorFactor(viridis(5), denver_fars$drunk_dr)
leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude,
popup = ~ popup_info,
color = ~ pal(drunk_dr))
The colorFactor function (and related functions) actually creates a new
function, which is why its syntax in this call is a bit different from the syntax
used to set other parameters. Note that in this code we are using the viridis
function from the viridis package within the call that creates pal, to use a
viridis color palette for the points.
Once you have mapped a variable to color, you can add a legend to explain
the mapping. You can do that with the addLegend function, which must include
values for the color palette and values for each point from this color palette.
library(viridis)
pal <- colorFactor(viridis(5), denver_fars$drunk_dr)
leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addCircleMarkers(data = denver_fars, radius = 2,
lng = ~ longitud, lat = ~ latitude,
popup = ~ popup_info,
color = ~ pal(drunk_dr)) %>%
addLegend(pal = pal, values = denver_fars$drunk_dr)
You can add polygons to leaflet objects with the addPolygons function. For
example, you can use the following code to add the census tract boundaries
for Denver to a leaflet object:
leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addPolygons(data = denver_tracts)
leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addPolygons(data = denver_tracts,
popup = paste0("Tract ID: ", denver_tracts@data$NAME))
Note that, because the denver_tracts object is a spatial object, we've used @data
to pull a value from the spatial object's attribute dataframe to use in the
pop-ups, but we do not need to specify lat or lng in the addPolygons call.
You can overlay multiple elements on a leaflet map. For example, you can add
elements to show both accidents and tracts by adding accidents using markers
and adding census tracts using polygons:
leaflet() %>%
addProviderTiles("Thunderforest.Transport") %>%
addPolygons(data = denver_tracts,
popup = paste0("Tract ID: ", denver_tracts@data$NAME),
color = "#000000", fillColor = "969696",
weight = 2) %>%
addCircleMarkers(data = denver_fars, lat = ~ latitude,
lng = ~ longitud, radius = 2,
popup = ~ popup_info, opacity = 0.9,
color = ~ pal(drunk_dr)) %>%
addLegend(pal = pal, values = denver_fars$drunk_dr, opacity = 0.9)
You can allow the user to pick which layers to show on the graph by adding
addLayersControl. When using this function, add group specifications to each
of your map layers, and then specify which to include as overlays in the
overlayGroups parameter of addLayersControl. For example, this code adds layer
control to the map of Denver accidents:
leaflet() %>%
addProviderTiles("Thunderforest.Transport") %>%
addPolygons(data = denver_tracts,
popup = paste0("Tract ID: ", denver_tracts@data$NAME),
color = "#000000", fillColor = "969696",
weight = 2, group = "tracts") %>%
addCircleMarkers(data = denver_fars, lat = ~ latitude,
lng = ~ longitud, radius = 2,
popup = ~ popup_info, opacity = 0.9,
color = ~ pal(drunk_dr),
group = "accidents") %>%
addLegend(pal = pal, values = denver_fars$drunk_dr, opacity = 0.9) %>%
addLayersControl(baseGroups = c("base map"),
overlayGroups = c("tracts", "accidents"))
To find out more about using the R leaflet package, including tips for including
leaflet maps in R Shiny applications, see http://rstudio.github.io/leaflet/.
If you find a JavaScript visualization library and would like to create bindings
to R, you can create your own package for a new htmlWidget.
The ggplot2 package is built on top of grid graphics, so the grid graphics system
plays well with ggplot2 objects. In particular, ggplot objects can be added to
larger plot output using grid graphics functions, and grid graphics functions
can be used to add elements to ggplot objects. Grid graphics functions can
also be used to create almost any imaginable plot from scratch. A few other
graphics packages, including the lattice package, are also built using grid
graphics.
Since it does take more time and code to create plots using grid graphics
compared to plotting with ggplot2, it is usually only worth using grid graphics
when you need to create a very unusual plot that cannot be created using
ggplot2. As people add more geoms and other extensions to ggplot, there are
more capabilities for creating customized plots directly in ggplot2 without
needing to use the lower-level functions from the grid package. However,
they are useful to learn because they provide you the tools to create your
own extensions, including geoms.
Grid graphics and R's base graphics are two separate systems. You cannot
easily edit a plot created using base graphics with grid graphics functions. If
you have to integrate output from these two systems, you may be able to do so
using the gridBase package, but it will not be as straightforward as editing an
object built using grid graphics (including ggplot objects). While we have focused
on plotting using ggplot2 in this course, we have covered a few plots created
using base R, specifically the maps created by running a plot call on a spatial
object, like a SpatialPoints object.
Grobs
The most critical concept of grid graphics to understand for extending ggplot2
is the concept of grobs. Grobs are graphical objects that you can make and
change with grid graphics functions. For example, you may create a circle
grob or points grobs. Once you have created one or more of these grobs,
you can add them to or take them away from larger grid graphics objects,
including ggplot objects. These grobs are the actual objects that get printed to
a graphics device when you print a grid graphics plot; if you tried to create a
grid graphics plot without any grobs, you would get a blank plot.
The grid package has a Grob family of functions that either make or change
grobs. If you want to build a custom geom for ggplot that is unusual enough
that you cannot rely on inheriting from an existing geom, you will need to
use functions from the Grob family of functions to code your geom. You
can create a variety of different types of grobs to plot different elements.
Possible grobs that can be created using functions in the grid package include
circles, rectangles, points, lines, polygons, curves, axes, rasters, segments,
and plot frames. You can create grobs using the functions from the *Grob
family of functions in the grid package; these functions include circleGrob,
linesGrob, polygonGrob, rasterGrob, rectGrob, segmentsGrob, legendGrob, xaxisGrob,
and yaxisGrob.
Functions that create grobs typically include parameters to specify the location
where the grobs should be placed. For example, the pointsGrob function
includes x and y parameters, while the segmentsGrob includes parameters for
the starting and ending location of each segment (x0, x1, y0, y1).
Functions in the *Grob family also include a parameter called gp for setting
graphical parameters like color, fill, line type, line width, etc., for grob objects.
The input to this parameter must be a gpar object, which can be created using
the gpar function. For example, to create a gray circle grob, you could run:
library(grid)
my_circle <- circleGrob(x = 0.5, y = 0.5, r = 0.5,
gp = gpar(col = "gray", lty = 3))
Aesthetics that you can set by specifying a gpar object for the gp parameter of
a grob include color (col), fill (fill), transparency (alpha), line type (lty), line
width (lwd), line end and join styles (lineend and linejoin, respectively), and
font elements (fontsize, fontface, fontfamily). See the helpfile for gpar for more
on gpar objects.
Once you have created a grob object, you can use the grid.draw function to
plot it to the current graphics device. For example, to plot the circle grob we
just created, you could run:
grid.draw(my_circle)
Grid circle
In this case, the circle will fill up the full graphics device and will be centered
in the middle of the plot region. Later in this subsection, we'll explain how
to use coordinates and coordinate systems to place a grob and how to use
viewports to move into subregions of the plotting space.
You can edit a grob after you have drawn it using the grid.edit function. For
example, the following code creates a circle grob, draws it, creates and draws
a rectangle grob, and then goes back and edits the circle grob within the plot
region to change the line type and color (run this code one line at a time within
your R session to see the changes). Note that the grob is assigned a name to
allow referencing it with the grid.edit call.
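A sketch of that sequence (the grob name my_circle and the specific gpar settings are illustrative):
my_circle <- circleGrob(name = "my_circle",
                        x = 0.5, y = 0.5, r = 0.5,
                        gp = gpar(col = "gray", lty = 3))
grid.draw(my_circle)
my_rect <- rectGrob(x = 0.5, y = 0.5, width = 0.8, height = 0.3)
grid.draw(my_rect)
# Go back and change the line type and color of the circle already drawn
grid.edit("my_circle", gp = gpar(col = "red", lty = 1))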
Grid objects
Grid objects
As mentioned earlier, ggplot2 was built using the grid system, which means
that ggplot objects often integrate well into grid graphics plots. In many ways,
ggplot objects can be treated as grid graphics grobs. For example, you can use
the grid.draw function from grid to write a ggplot object to the current graphics
device:
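For example, a simple version of the World Cup scatterplot (the original wc_plot defined earlier in the book included additional elements not reproduced here):
wc_plot <- ggplot(worldcup, aes(x = Time, y = Passes)) +
  geom_point()
grid.draw(wc_plot)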
Grid scatterplot
This functionality means that ggplot objects can be added to plots with other
grobs. For example, once you have defined my_circle and wc_plot using the
code above, try running the following code (clear your graphics device first
using the broom icon in the RStudio Plots panel):
grid.draw(wc_plot)
grid.draw(my_circle)
In this case, the resulting plot is not very useful, but this functionality will
be more interesting once we introduce how to use viewports and coordinate
systems.
You can also edit elements of a ggplot object using grid graphics functions.
First, you will need to list out all the graphics elements in the ggplot object,
so you can find the name of the one you want to change. Then you can use
grid.edit to edit that element of the plot.
To find the names of the elements in this ggplot object, first plot the object
to RStudio's graphics device (as done with the last call), then run grid.force,
run grid.ls() to find the name of the element you want to change, then use
grid.edit to change it. As a note, the exact names of elements will change each
time you print out the plot, so you will need to write the grid.edit call based
on the grid.ls results for a specific plotting of the ggplot object.
For example, you can run this call to print out the World Cup plot coded
earlier and list the names of all elements:
wc_plot
grid.force()
grid.ls()
layout
background.1-7-10-1
panel.6-4-6-4
grill.gTree.1413
panel.background..rect.1404
panel.grid.minor.y..polyline.1406
panel.grid.minor.x..polyline.1408
panel.grid.major.y..polyline.1410
panel.grid.major.x..polyline.1412
NULL
geom_point.points.1400
NULL
panel.border..zeroGrob.1401
spacer.7-5-7-5
spacer.7-3-7-3
spacer.5-5-5-5
spacer.5-3-5-3
axis-t.5-4-5-4
axis-l.6-3-6-3
axis.line.y..zeroGrob.1432
axis
axis.1-1-1-1
GRID.text.1429
axis.1-2-1-2
axis-r.6-5-6-5
axis-b.7-4-7-4
axis.line.x..zeroGrob.1425
axis
axis.1-1-1-1
axis.2-1-2-1
GRID.text.1422
xlab-t.4-4-4-4
xlab-b.8-4-8-4
GRID.text.1416
ylab-l.6-2-6-2
GRID.text.1419
ylab-r.6-6-6-6
subtitle.3-4-3-4
title.2-4-2-4
caption.9-4-9-4
Then, you can change the color of the points to red and the y-axis label to be
bold by using grid.edit on those elements (note that if you are running this
code yourself, you will need to get the exact names from the grid.ls output
on your device):
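A sketch using the element names from the listing above (geom_point.points.1400 for the points and GRID.text.1419, under ylab-l, for the y-axis label):
grid.edit("geom_point.points.1400", gp = gpar(col = "red"))
grid.edit("GRID.text.1419", gp = gpar(fontface = "bold"))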
You can use the ggplotGrob function from the ggplot2 package to explicitly
make a ggplot grob from a ggplot object.
A gTree grob contains one or more children grobs. It can be very useful for
creating grobs that need to contain multiple elements, like a boxplot, which
needs to include a rectangle, lines, and points, or a labeling grob that includes
text surrounded by a rectangle. For example, to create a grob that looks like
a lollipop, you can run:
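A sketch consistent with the grid.ls output shown below, combining a circle grob and a segments grob into a gTree (the names candy and stick are illustrative):
candy <- circleGrob(r = 0.1, x = 0.5, y = 0.6)
stick <- segmentsGrob(x0 = 0.5, x1 = 0.5, y0 = 0, y1 = 0.5)
lollipop <- gTree(children = gList(candy, stick))
grid.draw(lollipop)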
Adding a lollipop
You can use the grid.ls function to list all the children grobs in a gTree:
grid.ls(lollipop)
GRID.gTree.8194
GRID.circle.8192
GRID.segments.8193
Viewports
Much of the power of grid graphics comes from the ability to move in and out
of working spaces around the full graph area. As an example, say you would
like to create a map of the states of the US with a small pie chart added at the
centroid of each state showing the distribution of population in that state by
education level. This kind of plot is where grid graphics shines (although it
appears that you now can create such a plot directly in ggplot2). In this case,
you want to zoom in at the coordinates of a state centroid, have your own
smaller working space at that location, add a pie chart showing data specific
to that state within that working space, then zoom out and do the process
again for a different state centroid.
In grid graphics, these smaller working spaces within the larger plot are
called viewports. Viewports are the plotting windows that you can move into
and out of to customize plots using grid graphics. You can navigate to one
of the viewports, make some changes, and then pop back up and navigate
deeply into another viewport in the plot. In short, viewports provide a way
to navigate around and work within different subspaces on a plot.
Using grid graphics, you will create plots by making viewports, navigating
into them, writing grobs, and then moving to a different viewport to continue
plotting.
To start, you can make a new viewport with the viewport function. For
example, to create a viewport in the top right quarter of the full plotting area
and write a rounded rectangle and the lollipop grob we defined earlier in this
section (we've written a rectangle grob before creating and using the viewport,
so you can see the boundaries of the full plotting area), you can run:
grid.draw(rectGrob())
sample_vp <- viewport(x = 0.5, y = 0.5,
width = 0.5, height = 0.5,
just = c("left", "bottom"))
pushViewport(sample_vp)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport()
This code creates a viewport using the viewport function, navigates into it
using pushViewport, writes the grobs using grid.draw, and then navigates out of
the viewport using popViewport.
In this code, the x and y parameters of the viewport function specify the
viewport's location, and the just parameter specifies how to justify the viewport
in relation to this location. By default, these locations are specified based on a
range of 0 to 1 on each side of the plotting area, so x = 0.5 and y = 0.5 specifies
the center of the plotting area, while just = c("left", "bottom") specifies that
the viewport should have this location at its bottom left corner. If you wanted
to place the viewport in the center of the plotting area, for example, you could
run:
grid.draw(rectGrob())
sample_vp <- viewport(x = 0.5, y = 0.5,
width = 0.5, height = 0.5,
just = c("center", "center"))
pushViewport(sample_vp)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport()
The width and height parameters specify the size of the viewport, again based
on default units in which 1 is the full width of one side of the plotting area (later
in this section, we discuss how to use different coordinate systems). For
example, if you wanted to make the viewport smaller, you could run:
grid.draw(rectGrob())
sample_vp <- viewport(x = 0.75, y = 0.75,
width = 0.25, height = 0.25,
just = c("left", "bottom"))
pushViewport(sample_vp)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport()
You can only operate in one viewport at a time. Once you are in that viewport,
you can write grobs within the viewport. If you want to place the next grob
in a different viewport, you will need to navigate out of that viewport before
you can do so. Notice that all the previous code examples use popViewport to
navigate out of the viewport after writing the desired grobs. We could then
create a new viewport somewhere else and write new grobs there:
grid.draw(rectGrob())
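## Illustrative continuation: the viewport positions and sizes below are
## assumptions, not the original values
sample_vp_1 <- viewport(x = 0.75, y = 0.75, width = 0.25, height = 0.25,
                        just = c("left", "bottom"))
pushViewport(sample_vp_1)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport()

sample_vp_2 <- viewport(x = 0, y = 0, width = 0.5, height = 0.5,
                        just = c("left", "bottom"))
pushViewport(sample_vp_2)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport()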
Multiple viewports
You can also nest viewports inside each other. In this case, a new viewport is
defined relative to the current viewport. For example, if you are in a viewport
and position a new viewport at x = 0.5 and y = 0.5, this viewport will be
centered in your current viewport rather than in the overall plotting area.
grid.draw(rectGrob())
## The viewport positions and sizes here are illustrative assumptions
sample_vp_1 <- viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.5,
                        just = c("left", "bottom"))
sample_vp_2 <- viewport(x = 0.1, y = 0.1, width = 0.4, height = 0.4,
                        just = c("left", "bottom"))
pushViewport(sample_vp_1)
grid.draw(roundrectGrob(gp = gpar(col = "red")))
pushViewport(sample_vp_2)
grid.draw(roundrectGrob())
grid.draw(lollipop)
popViewport(2)
Nested viewports
Note that in this code we use the call popViewport(2) to navigate back to the
main plotting area. This is because we have navigated down to a viewport
within a viewport, so we need to pop up two levels to get out of the viewports.
Given this ability to nest viewports, a grid graphics object can end up with
a complex tree of viewports and grobs. Any of these elements can be cus-
tomized, as long as you can navigate back down to the specific element you
want to change.
You can use the grid.ls function to list all the elements of the plot in the
current graphics device, if it was created using grid graphics.
grid.draw(rectGrob())
grid.ls()
GRID.rect.8207
GRID.roundrect.8208
GRID.gTree.8194
GRID.circle.8192
GRID.segments.8193
For ggplot objects, you can also use grid.ls, but you should first run grid.force
to set the grobs as they appear in the current graph (or as they will appear
when you plot this specific ggplot object), so you can see child grobs in the
listing:
worldcup %>%
  ggplot(aes(x = Time, y = Passes)) +
  geom_point()
grid.force()
grid.ls()
layout
background.1-7-10-1
panel.6-4-6-4
grill.gTree.8223
panel.background..rect.8214
panel.grid.minor.y..polyline.8216
panel.grid.minor.x..polyline.8218
panel.grid.major.y..polyline.8220
panel.grid.major.x..polyline.8222
NULL
geom_point.points.8210
NULL
panel.border..zeroGrob.8211
spacer.7-5-7-5
spacer.7-3-7-3
spacer.5-5-5-5
spacer.5-3-5-3
axis-t.5-4-5-4
axis-l.6-3-6-3
axis.line.y..zeroGrob.8242
axis
axis.1-1-1-1
GRID.text.8239
axis.1-2-1-2
axis-r.6-5-6-5
axis-b.7-4-7-4
axis.line.x..zeroGrob.8235
axis
axis.1-1-1-1
axis.2-1-2-1
GRID.text.8232
xlab-t.4-4-4-4
xlab-b.8-4-8-4
GRID.text.8226
ylab-l.6-2-6-2
GRID.text.8229
ylab-r.6-6-6-6
subtitle.3-4-3-4
title.2-4-2-4
caption.9-4-9-4
You can use ggplot objects in plots with viewports. For example, you can use
the following code to add an inset map for the map we created earlier in this
section of Baltimore County and Baltimore City in Maryland. The following
code creates ggplot objects with the main plot and the inset map and uses
viewports to create a plot showing both. Note that in the viewport we've
added two rectangle grobs, one in white with some transparency to provide
the background of the map inset, and one with no fill and color set to black
to provide the inset border.
grid.draw(ggplotGrob(balt_map))
md_inset <- viewport(x = 0, y = 0,
just = c("left", "bottom"),
width = 0.35, height = 0.35)
pushViewport(md_inset)
grid.draw(rectGrob(gp = gpar(alpha = 0.5, col = "white")))
grid.draw(rectGrob(gp = gpar(fill = NA, col = "black")))
grid.draw(ggplotGrob(maryland_map))
popViewport()
Once you have created a grob and moved into the viewport in which you want
to plot it, you need a way to specify where in the viewport to write the grob.
The numbers you use to specify x- and y-placements for a grob will depend on
the coordinate system you use. In grid graphics, you have a variety of options
for the units to use in this coordinate system, and picking the right units for
this coordinate system will make it much easier to create the plot you want.
There are several units that can be used for coordinate systems, and you
typically will use different units to place objects. For example, you may want
to add points to a plot based on the current x- and y-scales in that plot region,
in which case you can use native units. The native unit is often the most useful
when creating extensions for ggplot2, for example. The npc units are also
often useful in designing new plots; these set the x- and y-ranges to go from 0
to 1, so you can use these units if you need to place an object in, for example,
the exact center of a viewport (c(0.5, 0.5) in npc units), or create a viewport
in the top right quarter of the plot region. Grid graphics also allows the use
of some units with absolute values, including inches (inches), centimeters (cm),
and millimeters (mm).
You can specify the coordinate system you would like to use when placing an
object with the unit function (unit([numeric vector], units = "native")). For
example, if you create a viewport with the x-scale going from 0 to 100 and
the y-scale going from 0 to 10 (specified using xscale and yscale in the viewport
function), you can use native when drawing a grob in that viewport to base
the grob position on these scale values.
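A small sketch of that idea (the scales and the grob position here are illustrative):
library(grid)
grid.newpage()
vp_scaled <- viewport(xscale = c(0, 100), yscale = c(0, 10))
pushViewport(vp_scaled)
## The circle is placed at x = 50, y = 5 on the viewport's own scales
grid.draw(circleGrob(x = unit(50, "native"), y = unit(5, "native"),
                     r = unit(2, "mm")))
popViewport()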
The gridExtra package provides useful extensions to the grid system, with an
emphasis on higher-level functions to work with grid graphic objects, rather
than the lower-level utilities in the grid package that are used to create and
edit specific lower-level elements of a plot. This package has particularly
useful functions that allow you to arrange and write multiple grobs to a
graphics device and to include tables in grid graphics objects.
The grid.arrange function from the gridExtra package makes it easy to create
a plot with multiple grid objects plotted on it. For example, you can use it to
write out one or more grobs you've created to a graphics device:
library(gridExtra)
grid.arrange(lollipop, circleGrob(),
rectGrob(), lollipop,
ncol = 2)
Using grid.arrange()
Note that this code is being used to arrange both grobs that were previously
created and saved to an R object (the lollipop grob created in earlier code in
this section) and grobs that are created within the grid.arrange call (the grobs
created with circleGrob and rectGrob). The ncol parameter is used to specify
the number of columns to include in the output.
Because ggplot2 was built on grid graphics, you can also use this function to
plot multiple ggplot objects to a graphics device. For example, say you wanted
to create a plot that has two plots based on the World Cup data side-by-side.
To create this plot, you can assign the ggplot objects for each separate graph
to objects in your R global environment (time_vs_shots and player_positions in
this example), and then input these objects to a grid.arrange call:
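## Illustrative sketch: a scatterplot and a bar chart from the worldcup data
## (the exact aesthetics in the original plots may differ)
library(ggplot2)
library(gridExtra)
time_vs_shots <- ggplot(worldcup, aes(x = Time, y = Shots)) +
        geom_point()
player_positions <- ggplot(worldcup, aes(x = Position)) +
        geom_bar()
grid.arrange(time_vs_shots, player_positions, ncol = 2)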
Using grid.arrange()
You can use the layout_matrix parameter to specify different layouts. For
example, if you want the scatterplot to take up one-third of the plotting area
and the bar chart to take up two-thirds, you could specify that with a matrix
with one 1 (for the first plot) and two 2s, all in one row:
grid.arrange(time_vs_shots, player_positions,
layout_matrix = matrix(c(1, 2, 2), ncol = 3))
You can fill multiple rows in the plotting device. To leave a space blank, use
NA in the layout matrix at that position. For example:
grid.arrange(time_vs_shots, player_positions,
layout_matrix = matrix(c(1, NA, NA, NA, 2, 2),
byrow = TRUE, ncol = 3))
The gridExtra package also has a function, tableGrob, that facilitates adding
tables to grid graphics objects. For example, to add a table with the average time and
shots for players on the top four teams in the 2010 World Cup, you can create
a table grob using tableGrob and then add it to a larger plot created using grid
graphics:
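## Illustrative sketch of how the worldcup_table grob might be created; the
## exact summary and styling in the original may differ
library(dplyr)
library(gridExtra)
wc_top4 <- worldcup %>%
        filter(Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) %>%
        group_by(Team) %>%
        summarize(`Average time` = round(mean(Time), 1),
                  `Average shots` = round(mean(Shots), 1))
worldcup_table <- tableGrob(wc_top4, rows = NULL)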
grid.draw(ggplotGrob(time_vs_shots))
wc_table_vp <- viewport(x = 0.22, y = 0.85,
just = c("left", "top"),
height = 0.1, width = 0.2)
pushViewport(wc_table_vp)
grid.draw(worldcup_table)
popViewport()
These tables can be customized with different styles and can be modified by
adding other grob elements (for example, to highlight certain cells). To find
out more, see the vignette for table grobs.
Grid graphics provide an extensive graphics system that can allow you to
create almost any plot you can imagine in R. In this section, we have only
scraped the surface of the grid graphics system, so you might want to study
grid graphics in greater depth, especially if you often need to create very
tailored, unusual graphs.
There are a number of resources you can use to learn more about grid
graphics. The most comprehensive is the R Graphics book by Paul Murrell,
the creator of grid graphics. This book is now in its second edition, and its
first edition was written before ggplot2 became so popular. It is worth trying to
get the second edition, which includes some content specifically on ggplot2
and how that package relates to grid graphics. The vignettes that go along
with the grid package are also by Paul Murrell and give a useful introduction
to grid graphics, and the vignettes for the gridExtra package are also a useful
next step for finding out more.
Links to PDFs of vignettes for the grid package are available at
https://stat.ethz.ch/R-manual/R-devel/library/grid/doc/index.html.
Links to PDFs of vignettes for the gridExtra package are available on
the package's CRAN page: https://cran.r-project.org/web/packages/gridExtra/index.html.
Adding a theme (either existing or custom built by you) will override elements
of any default theme.
For example, here is a plot that uses the theme_classic() function:
library(ggplot2)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic()
Classic theme
Notice how the look and the feel of the plot is substantially different from the
default gray theme of ggplot2. The key differences in the theme_classic() setup
are the background color (white instead of gray), the colors of the grid lines
(none instead of white), and the presence of solid black x- and y-axes. Other
elements are the same, like the plotting character (solid circle) and fonts.
Note that themes in ggplot2 only allow you to modify the non-data
elements of a plot. Things like the title, axis labels, background,
etc. can be modified with a theme. If you want to change data
elements, like the plotting symbol or colors, you can modify those
things separately in their respective geom_* functions.
Why would one want to build a new theme? For many people, it is a matter
of personal preference with respect to colors, shapes, fonts, positioning of
labels, etc. Because plots, much like writing, are an expression of your ideas,
it is often desirable to customize those plots so that they accurately represent
your vision.
In corporate or institutional settings, developing themes can be a powerful
branding tool. Plots that are distributed on the web or through marketing
materials that have a common theme can be useful for reinforcing a brand.
For example, plots made by the FiveThirtyEight.com web site have a distinct
look and feel (see this article by Walt Hickey for one of many examples). When
you see one of those plots you instinctively know that it is a FiveThirtyEight
plot. Developing a theme for your organization can help to get others to better
understand what your organization is about when it produces data graphics.
Another advantage of having a pre-programmed theme is that it removes the
need for you to think about it later! One key reason why news organizations
like FiveThirtyEight or the New York Times have common themes for their
data graphics is because they are constantly producing those graphics on a
daily basis. If every plot required a custom look and feel with a separate
palette of colors, the entire process would grind to a halt. If you are in an en-
vironment where there is a need for reproducible graphics with a consistent
feel, then developing a custom theme is probably a good idea. While using the
default ggplot2 theme is perfectly fine from a data presentation standpoint,
why not try to stand out from the crowd?
Default Theme
As noted above, ggplot2 has a default theme, which is theme_gray(). This theme
produces the familiar gray-background-white-grid-lines plot. You can obtain
the default theme using the theme_get() function.
x <- theme_get()
class(x)
[1] "theme" "gg"
Now your plots will use the theme_minimal() theme without you having to
specify it.
Quitting R will erase the default theme setting. If you load ggplot2 in a future
session it will revert to the default gray theme. If you'd like ggplot2
to always use a different theme (either yours or one of the built-in ones),
you can set a load hook and put it in your .Rprofile file. For example, the
following hook sets the default theme to be theme_minimal() every time the
ggplot2 package is loaded.
setHook(packageEvent("ggplot2", "onLoad"),
function(...) ggplot2::theme_set(ggplot2::theme_minimal()))
Of course, you can always override this default theme by adding a theme
object to any of your plots that you construct in ggplot2.
Perhaps the easiest thing to start with when customizing your own theme
is to modify an existing theme (i.e. one that comes built-in to ggplot2). In
case you are interested in thoroughly exploring this area and learning from
others, there is also the ggthemes package on CRAN which provides a number
of additional themes for ggplot2.
Looking at the help page ?theme you'll see that there are many things to modify.
We will start simple here by illustrating the general approach to making
theme modifications. We will begin with the theme_bw() theme. This theme is a
simple black and white theme that has little ornamentation and few features.
Suppose we want to make the default color for plot titles to be dark red. We
can change just that element by adding a theme() modification to the existing
theme.
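A minimal sketch of that modification (newtheme is the object referenced later in this section):
library(ggplot2)
newtheme <- theme_bw() +
        theme(plot.title = element_text(color = "darkred"))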
Note that in our call to theme(), when we modify the plot.title attribute, we
cannot simply say color = "darkred". This must be wrapped in a call to the
element_text() function so that the elements of plot.title are appropriately
modified. In the help page for theme(), you will see that each attribute of a
theme is modified by using one of four element_* functions:
element_text(): modifies text elements such as titles and axis labels
element_line(): modifies line elements such as axis lines and grid lines
element_rect(): modifies rectangle elements such as backgrounds and borders
element_blank(): suppresses an element entirely, drawing nothing
All of these functions work in the same way (although they contain different
elements) and each of them returns a list of values inheriting from the class
element. The ggplot2 functions know how to handle objects of this class and
will modify the theme of a plot accordingly.
Let's change a few more things about our new theme. We can make the box
surrounding the plot look a little different by modifying the panel.border
element of the theme. First let's take a look at what the value is by default.
newtheme$panel.border
List of 5
$ fill : logi NA
$ colour : chr "grey20"
$ size : NULL
$ linetype : NULL
$ inherit.blank: logi TRUE
- attr(*, "class")= chr [1:2] "element_rect" "element"
You can see that this is an object of class element_rect and there are 5 elements
in this list, including the fill, colour (or color), size, and linetype. These
attributes have the same meaning as they do in the usual ggplot2 context.
We can modify the color attribute to make it steelblue and modify the size
attribute to make it a little bigger.
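For example (the exact size value here is an illustrative assumption):
newtheme <- newtheme +
        theme(panel.border = element_rect(color = "steelblue", size = 2))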
Now let's see what a typical plot might look like. The following is a plot of
minutes played and shots attempted from the worldcup dataset in the faraway
package.
library(faraway)
ggplot(data = worldcup, aes(Time, Shots)) +
  geom_point() +
  ggtitle("World Cup Data") +
  newtheme
Revised theme
Complete themes
Summary
Building a new theme allows you to customize the look and feel of a plot to
match your personal preferences. It also allows you to define a consistent
branded presentation of your data graphics that can be clearly identified
with your organization or company.
Some of the key elements of a data graphic made with ggplot2 are geoms and
stats. The fact is, the ggplot2 package comes with tremendous capabilities that
allow users to make a wide range of interesting and rich data graphics. These
graphics can be made through a combination of calls to various geom_* and
stat_* functions (as well as other classes of functions).
So why would one want to build a new geom or stat on top of all that ggplot2
already provides?
There are two key reasons for building new geoms and stats for ggplot2:
1. Implement a new feature. There may be a specific feature you want to add
to a plot that ggplot2 does not yet support, such as a new statistical mod-
eling approach or a novel plotting symbol. In this case you don't have
much choice and need to extend the functionality of ggplot2.
2. Simplify a complex workflow. With certain types of analyses you may
find yourself producing the same kind of plot elements repeatedly.
These elements may involve a combination of points, lines, facets, or
text and essentially encapsulate a single idea. In that case it may make
sense to develop a new geom to literally encapsulate the collection of
plot elements and to make it simpler to include these things in your
future plots.
Building new stats and geoms is the plotting equivalent of writing functions
(that may sound a little weird because stats and geoms are functions, but they
are thought of a little differently from generic functions). While the action
taken by a function can typically be executed using separate expressions
outside of a function context, it is often convenient for the user to encapsulate
those actions into a clean function. In addition, writing a function allows you
to easily parameterize certain elements of that code. Creating new geoms and
stats similarly allows for a simplification of code and for allowing users to
easily tweak certain elements of a plot without having to wade through an
entire mess of code every time.
Building a Geom
New geoms in ggplot2 inherit from a top level class called Geom and are
constructed using a two step process.
The basic setup for a new geom class will look something like the following.
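A sketch of that skeleton (the values shown are placeholders; the comments mark the pieces you fill in):
library(ggplot2)
GeomNEW <- ggproto("GeomNEW", Geom,
        required_aes = c("x", "y"),    ## a character vector of required aesthetics
        default_aes = aes(shape = 1),  ## default values for certain aesthetics
        draw_key = draw_key_point,     ## a function used to draw the legend key
        draw_panel = function(data, panel_scales, coord) {
                ## A function that returns a grid grob to be plotted
                ## (this is where the real work occurs)
        }
)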
The ggproto function is used to create the new class. Here, NEW will be
replaced by whatever name you come up with that best describes what your
new geom is adding to a plot. The four things listed inside the class are
required of all geoms and must be specified.
The required aesthetics should be straightforward: if your new geom makes
a special kind of scatterplot, for example, you will likely need x and y aesthet-
ics. Default values for aesthetics can include things like the plot symbol (i.e.
shape) or the color.
Implementing the draw_panel function is the hard part of creating a new geom.
Here you must have some knowledge of the grid package in order to access
the underlying elements of a ggplot2 plot, which is based on the grid system.
However, you can implement a reasonable amount of things with knowledge
of just a few elements of grid.
The draw_panel function has three arguments to it. The data element is a data
frame containing one column for each aesthetic specified, panel_scales is a
list containing information about the x and y scales for the current panel,
and coord is an object that describes the coordinate system of your plot.
The coord and the panel_scales objects are not of much use except that they
transform the data so that you can plot them.
library(grid)
GeomMyPoint <- ggproto("GeomMyPoint", Geom,
        required_aes = c("x", "y"),
        default_aes = aes(shape = 1),
        draw_key = draw_key_point,
        draw_panel = function(data, panel_scales, coord) {
                ## Transform the data first
                coords <- coord$transform(data, panel_scales)
                ## Print the structure of the transformed data (illustration only)
                str(coords)
                ## Return a grid grob that draws the points
                pointsGrob(x = coords$x, y = coords$y, pch = coords$shape)
        })
In this example we print out the structure of the coords object with
the str() function just so you can see what is in it. Normally, when
building a new geom you wouldn't do this.
In addition to creating a new Geom class, you need to create the actual
function that will build a layer based on your geom specification. Here, we
call that new function geom_mypoint(), which is modeled after the built-in
geom_point() function.
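A sketch of that layer function, following the standard ggplot2 extension pattern:
geom_mypoint <- function(mapping = NULL, data = NULL, stat = "identity",
                         position = "identity", na.rm = FALSE,
                         show.legend = NA, inherit.aes = TRUE, ...) {
        ggplot2::layer(
                geom = GeomMyPoint, mapping = mapping,
                data = data, stat = stat, position = position,
                show.legend = show.legend, inherit.aes = inherit.aes,
                params = list(na.rm = na.rm, ...)
        )
}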
From the str() output we can see that the coords object contains the x and
y aesthetics, as well as the shape aesthetic that we specified as the default.
Note that both x and y have been rescaled to be between 0 and 1. This is the
normalized parent coordinate system.
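The geom_transparent() geom used below is built in the same way; its full definition does not appear here, so the following is only a hedged sketch (the alpha values and the 100-point threshold are assumptions, not the original computation):
GeomTransparent <- ggproto("GeomTransparent", Geom,
        required_aes = c("x", "y"),
        default_aes = aes(shape = 19),
        draw_key = draw_key_point,
        draw_panel = function(data, panel_scales, coord) {
                coords <- coord$transform(data, panel_scales)
                ## More points in a panel means more transparency
                point_alpha <- ifelse(nrow(data) > 100, 0.3, 0.8)
                pointsGrob(x = coords$x, y = coords$y, pch = coords$shape,
                           gp = gpar(alpha = point_alpha))
        })

geom_transparent <- function(mapping = NULL, data = NULL, stat = "identity",
                             position = "identity", na.rm = FALSE,
                             show.legend = NA, inherit.aes = TRUE, ...) {
        ggplot2::layer(
                geom = GeomTransparent, mapping = mapping, data = data,
                stat = stat, position = position,
                show.legend = show.legend, inherit.aes = inherit.aes,
                params = list(na.rm = na.rm, ...)
        )
}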
Now we can try out our new geom_transparent() function with differing amounts
of data to see how the transparency works. For example, here is a random sample
of 150 observations from the worldcup dataset (which has 595 observations in total).
library(dplyr)
ggplot(data = sample_n(worldcup, 150), aes(Time, Shots)) +
  geom_transparent()
We can also reproduce a faceted plot from the previous section with our new
geom and the features of the geom will propagate to the panels.
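For example (a hedged sketch; the faceting used in the earlier section may differ slightly):
ggplot(data = worldcup, aes(Time, Shots)) +
        geom_transparent() +
        facet_wrap(~ Position, ncol = 2) +
        theme_bw()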
Notice that the data for the Midfielder, Defender, and Forward panels
have some transparency because there are more points in those panels, while the
Goalkeeper panel shows little or no transparency because it contains relatively few points.
The advantage of creating a geom in this case is that it abstracts the computa-
tion, removes the need to modify the data each time, and allows for a simpler
communication of what is trying to be done in this plotting code.
Summary