R4beginners v3
Introduction
Getting your data into R
Easy ways to do basic data analysis
Painless data visualization
by Sharon Machlis
edited by Johanna Ambrosio
R: a beginner's guide
COMPUTERWORLD.COM
Introduction
R is hot. Whether measured by more than
6,100 add-on packages, the 41,000+ members of LinkedIn's R group or the 170+ R
Meetup groups currently in existence, there
can be little doubt that interest in the R statistics language, especially for data analysis,
is soaring.
Who uses R?
Relatively high-profile users of R include:
Facebook: Used by some within the company for tasks such as analyzing user
behavior.
That also makes it easier for others to validate research results and check your work
for errors, an issue that cropped up in the
news recently after an Excel coding error
was among several flaws found in an influential economics analysis report known as
Reinhart/Rogoff.
Sure, you can easily examine complex formulas on a spreadsheet. But it's not nearly
The top right window shows your workspace, which includes a list of objects currently in memory. There's also a history tab
with a list of your prior commands; what's
handy there is that you can select one,
some or all of those lines of code and one-click to send them either to the console or
to whatever file is active in your code editor.
The top left window is where you'll probably do most of your work. That's the R
library(thepackagename)
If you'd like to make sure your packages
stay up to date, you can run:
update.packages()
and get the latest versions for all your
installed packages.
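As a minimal sketch of how installing and loading fit together, the snippet below installs a package only when it is missing and then loads it. The helper name use_package is hypothetical, and the built-in stats package stands in for whatever add-on package you actually want.

```r
# Hypothetical helper: install a package only when it is missing, then load it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs a network connection
  }
  # character.only = TRUE lets us pass the package name as a string
  library(pkg, character.only = TRUE)
}

use_package("stats")  # "stats" ships with R, so nothing is downloaded here
```

Checking first avoids reinstalling a package every time a script runs.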
setwd("~/mydirectory")
Note that the slashes always have to be forward slashes, even if you're on a Windows
system. For Windows, the command might
look something like:
setwd("C:/Sharon/Documents/RProjects")
Help!
If you want to find out more about a function, you can type a question mark followed
by the function name (one of the rare
times parentheses are not required in R),
like so:
?functionName
This is a shortcut to the help function,
which does use parentheses:
help(functionName)
Although I'm not sure why you'd want to
use this as opposed to the shorter ?functionName command.
install.packages("thepackagename")
printing with more options, but R beginners rarely seem to use it.
Sample data
data()
If your data use another character to separate the fields, not a comma, R also has the
more general read.table function. So if your
separator is a tab, for instance, this would
work:
mtcars
There are better ways of examining a data
set, which I'll get into later in this series.
Also, R does have a print() function for
But your text columns may not be categories that you want to group and measure,
just names of companies or employees. If
you don't want your text data to be read in
as factors, add stringsAsFactors=FALSE to
read.table (in R 4.0.0 and later this is already the default), like this:
x <- read.table(pipe("pbpaste"),
sep="\t")
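To see the effect of stringsAsFactors without needing a file on disk, here is a small sketch that reads tab-separated text from a string through read.table's text argument. The company names and sales figures are made up for illustration.

```r
# Made-up tab-separated data, held in a string instead of a file.
raw <- "company\tsales\nAcme\t10\nGlobex\t20"

# Text columns stay as plain character strings.
as_text <- read.table(text = raw, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

# Text columns are converted to factors (categories).
as_factor <- read.table(text = raw, sep = "\t", header = TRUE,
                        stringsAsFactors = TRUE)

class(as_text$company)    # "character"
class(as_factor$company)  # "factor"
```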
Other formats
If you'd like to try to connect R with a database, there are several dedicated packages
such as RPostgreSQL, RMySQL, RMongo,
RSQLite and RODBC.
(You can see the entire list of available R
packages at the CRAN website.)
Remote data
R enthusiasts have created add-on packages to help other users download data into
R with a minimum of fuss.
install.packages("quantmod")
library(quantmod)
getSymbols("AAPL")
barChart(AAPL)
chartSeries(AAPL, subset="last 14 days")
save.image()
That stores your workspace to a file named
.RData by default. This will ensure you
don't lose all your work in the event of a
power glitch or system reboot while you've
stepped away.
When you close R, it asks if you want to
save your workspace. If you say yes, the
next time you start R that workspace will
be loaded. That saved file will be named
.RData as well. If you have different projects
in different directories, each can have its
own .RData workspace file.
save(variablename, file="filename.rda")
rm(x)
load("filename.rda")
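The save, rm and load commands above can be tried together as a round trip. This is only a sketch: the variable name quiz_scores and its values are made up, and the file goes to a temporary directory so nothing in your working directory is touched.

```r
quiz_scores <- c(96, 71, 85)                       # made-up values
rda_path <- file.path(tempdir(), "quiz_scores.rda")

save(quiz_scores, file = rda_path)  # write the object to disk
rm(quiz_scores)                     # remove it from the workspace

load(rda_path)                      # restore it under its original name
quiz_scores
```

Note that load() brings the object back under the name it had when it was saved, not under a name you choose at load time.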
Or:
tail(mydata, 10)
To quickly see how your R object is structured, you can use the str() function:
str(mydata)
This will tell you the type of object you
have; in the case of a data frame, it will
also tell you how many rows ("observations"
in statistical R-speak) and columns ("variables" to R) it contains, along with the type
of data in each column and the first few
entries in each column.
head(mydata, n=10)
Or just:
head(mydata, 10)
colnames(mydata)
tail(mydata)
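Here is a short sketch of these inspection commands run against the built-in mtcars data frame instead of the placeholder mydata. The variable names first_ten and last_six are just illustrative.

```r
first_ten <- head(mtcars, 10)  # first 10 rows
last_six  <- tail(mtcars)      # last 6 rows, the default when n is omitted

dim(mtcars)        # number of rows, then number of columns
colnames(mtcars)   # column (variable) names
str(mtcars)        # structure: type of each column plus its first few values
```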
rownames(mydata)
library(psych)
You need to run the library command each
time you start a new R session if you want
to use the psych package.
summary(mydata)
Oddly, the mode() function returns information about data type instead of the
statistical mode; there's an add-on package,
modeest, that adds a mfv() function (most
frequent value) to find the statistical mode.
R also contains a load of more sophisticated functions that let you do analyses
with one or two commands: probability distributions, correlations, significance tests,
regressions, ANOVA (analysis of variance
between groups) and more.
As just one example, running the correlation function cor() on a dataframe such as:
If you'd like even more statistical summaries from a single command, install and
cor(mydata)
will give you a matrix of correlations for
each column of numerical data compared
with every other column of numerical data.
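As a small sketch of what cor() returns, the example below runs it on three numeric columns of the built-in mtcars data set. The choice of columns is arbitrary.

```r
# Pick a few numeric columns so the printed matrix stays readable.
vars <- mtcars[, c("mpg", "disp", "hp")]

cor_matrix <- cor(vars)   # 3 x 3 matrix, one row and column per variable
round(cor_matrix, 2)      # the diagonal is 1: each column correlates perfectly with itself
```

The matrix is symmetric, so the value at row "mpg", column "disp" is the same as the one at row "disp", column "mpg".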
choose(15,4)
Or, perhaps you want to see all of the possible pair combinations of a group of 5
people, not simply count them. You can
create a vector with the people's names and
store it in a variable called mypeople:
Note: Be aware that you can run into problems when trying to run some functions
on data where there are missing values. In
some cases, R's default is to return NA even
if just a single value is missing. For example, while the summary() function returns
column statistics excluding missing values
(and also tells you how many NAs are in
the data), the mean() function will return
NA if even only one value is missing in a
vector.
combn(mypeople, 2)
mean(myvector, na.rm=TRUE)
If you've got data with some missing values,
read a function's help file by typing a question mark followed by the name of the function, such as:
?median
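To make the missing-value behavior concrete, here is a minimal sketch with a made-up vector that contains one NA.

```r
myvector <- c(10, 20, NA, 40)   # made-up values with one missing entry

mean(myvector)                  # NA, because one value is missing
mean(myvector, na.rm = TRUE)    # mean of 10, 20 and 40 only
summary(myvector)               # statistics exclude the NA and report how many there are
```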
Use the combn() function to see all possible combinations from a group.
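A sketch of counting versus listing combinations follows. The five names are hypothetical placeholders, not the ones used in the original figure.

```r
mypeople <- c("Ann", "Ben", "Cara", "Dev", "Eli")  # hypothetical names

choose(5, 2)                       # how many pairs exist: 10
all_pairs <- combn(mypeople, 2)    # the pairs themselves, one per column
all_pairs
```

combn() returns a matrix with one combination per column, so ncol(all_pairs) matches the count from choose().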
Chances are, though, you'll want to subset your data by more than one column
at a time. That's when you'll want to use
bracket notation, what I think of as rows-comma-columns. Basically, you take the
name of your data frame and follow it by
[rows,columns]. The rows you want come
first, followed by a comma, followed by the
columns you want. So, if you want all rows
but just columns 2 through 4 of mtcars,
you can use:
names(mtcars)
That's handy if you want to store the names
in a variable, perhaps called mtcars.colnames (or anything else you'd like to call
it):
mtcars.colnames <- names(mtcars)
mtcars[,2:4]
mtcars$mpg
More broadly, then, the format for accessing a column by name would be:
dataframename$columnname
mtcars[mtcars$mpg>20,]
R indexes from 1, not 0. So your first column is at [1] and not [0].
R is case sensitive everywhere. mtcars$mpg
is not the same as mtcars$MPG.
mtcars[mtcars$mpg>20,c(1,4)]
using column locations, or:
To create a vector of items that are not contiguous, you need to use the combine function c(). Typing mtcars[,(2,4)] without the
c will not work. You need that c in there:
mtcars[mtcars$mpg>20,c("mpg","hp")]
using the column names.
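The two selection styles can be checked against each other. In mtcars, mpg is column 1 and hp is column 4, so this sketch should give the same result either way; the variable names are illustrative.

```r
# Rows where mpg is above 20, columns chosen by position.
by_position <- mtcars[mtcars$mpg > 20, c(1, 4)]

# Same rows, columns chosen by (quoted) name.
by_name <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

identical(by_position, by_name)  # TRUE
```

Selecting by name is usually safer, since it keeps working if the column order changes.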
Why do you need to specify mtcars$mpg
in the row spot but "mpg" in the column
spot? Just another R syntax quirk is the
best answer I can give you.
mtcars[,c(2,4)]
What if you want to select your data by data
characteristic, such as all cars with mpg >
20, and not column or row location? If you
use the column name notation and add a
condition like:
If you're finding that your selection statement is starting to get unwieldy, you can
put your row and column selections into
variables first, such as:
mtcars$mpg>20
mtcars[mpg20, cols]
making for a more compact select statement but more lines of code.
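The assignments that set up mpg20 and cols are not shown in the text as it stands, so the following is a sketch of what they presumably look like, not a quotation of the original.

```r
mpg20 <- mtcars$mpg > 20     # logical vector: TRUE for rows to keep
cols  <- c("mpg", "hp")      # column names to keep

selected <- mtcars[mpg20, cols]
selected
```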
If you just wanted to see the mpg information for the highest mpg:
detach()
Alternative to bracket
notation
Bracket syntax is pretty common in R code,
but it's not your only option. If you dislike
that format, you might prefer the subset()
function instead, which works with vectors
and matrices as well as data frames. The
format is:
filter(mtcars, mpg>20)
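For the base R subset() function described above, here is a minimal sketch that needs no add-on package. It should select the same rows and columns as the equivalent bracket expression; the name high_mpg is illustrative.

```r
# Condition comes second, columns to keep go in select; no quotes or mtcars$ needed.
high_mpg <- subset(mtcars, mpg > 20, select = c(mpg, hp))

# Equivalent bracket notation, for comparison.
bracket_version <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

all.equal(high_mpg, bracket_version)
```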
To choose only certain columns, you use
the select() function with syntax such as
select(dataframename, columnName1,
columnName2). No quotation marks are
needed with the column names:
dataframename %>%
firstfunction(argument
Counting factors
To tally up counts by factor, try the table
command. For the diamonds data set, to
see how many diamonds of each category
of cut are in the data, you can use:
table(diamonds$cut)
This will return how many diamonds of
each factor (fair, good, very good, premium and ideal) exist in the data. Want to
see a cross-tab by cut and color?
table(diamonds$cut, diamonds$color)
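The diamonds data set ships with the ggplot2 package, so as a sketch that runs in plain R, the same two calls are shown here on the built-in mtcars data instead. The choice of the cyl and gear columns is an assumption made for illustration.

```r
# Count cars by number of cylinders.
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Cross-tab: cylinders in rows, number of gears in columns.
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_by_gear
```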
plot(mtcars$disp, mtcars$mpg,
xlab="Engine displacement",
ylab="mpg", main="MPG vs engine displacement", las=1)
plot(mtcars$disp, mtcars$mpg)
?par
plot(mtcars$disp, mtcars$mpg,
xlab="Engine displacement",
ylab="mpg", main="MPG compared with
engine displacement")
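If you want to keep a plot as a file and not just view it on screen, one approach is to open a graphics device first. This is a sketch that assumes PDF output to a temporary directory; the file name is hypothetical.

```r
out_file <- file.path(tempdir(), "mpg-vs-disp.pdf")  # hypothetical file name

pdf(out_file)                      # open a PDF graphics device
plot(mtcars$disp, mtcars$mpg,
     xlab = "Engine displacement",
     ylab = "mpg",
     main = "MPG compared with engine displacement",
     las = 1)                      # las = 1 keeps axis labels horizontal
dev.off()                          # close the device so the file is written
```

Forgetting dev.off() is a common reason for ending up with an empty or unreadable file.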
In addition to the basic dataviz functionality included with standard R, there are
numerous add-on packages to expand R's
visualization capabilities. Some packages
are for specific disciplines such as biostatistics
Using ggplot2
In particular, the ggplot2 package is quite
popular and worth a look for robust visualizations. ggplot2 requires a bit of time to
learn its "Grammar of Graphics" approach.
To use its functions, load the ggplot2 package into your current R session (you only
need to do this once per R session) with
the library() function:
library(ggplot2)
On to some ggplot2 examples.
ggplot2 has a quick plot function called
qplot() that is similar to R's basic plot()
function but adds some options. The basic
quick plot code:
qplot(disp, mpg, data=mtcars)
generates a scatterplot.
ggplot(pressure, aes(x=temperature,
y=pressure)) + geom_line()
In these examples, I set only x and y aesthetics. But there are lots more aesthetics
we could add, such as color, axes and more.
barplot(BOD$demand, main="Graph of
demand", names.arg = BOD$Time)
ggplot(pressure, aes(x=temperature,
y=pressure)) + geom_line() +
geom_point()
Bar graphs
To make a bar graph from the sample
BOD data frame included with R, the basic
R function is barplot(). So, to plot the
demand column from the BOD data set on
a bar graph, you can use the command:
barplot(BOD$demand)
Add main="Graph of demand" if you want
a main headline on your graph:
barplot(BOD$demand, main="Graph of
demand")
Histograms
11 7 14
Now you can create a bar graph of the cylinder count:
barplot(cylcount)
ggplot2's qplot() quick plotting function
can also create bar graphs:
hist(mydata$columnName, breaks = n)
where columnName is the name of your
column in a mydata dataframe that you
want to visualize, and n is the number of
bins you want.
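One detail worth knowing: when breaks is a single number, R treats it as a suggestion and may pick a nearby number of bins that gives round cut points. The sketch below uses plot = FALSE so you can inspect the bins hist() chose for the mtcars mpg column without drawing anything.

```r
# Ask for roughly 5 bins but only compute them, without plotting.
h <- hist(mtcars$mpg, breaks = 5, plot = FALSE)

h$breaks   # the bin boundaries R actually chose
h$counts   # how many values fall into each bin
```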
qplot(mtcars$cyl)
boxplot(mtcars$mpg)
ggplot(mtcars, aes(factor(cyl))) +
geom_bar()
boxplot(diamonds$x, diamonds$y,
diamonds$z)
rainbow(5)
For many more details, check the help command on a palette such as:
You can do graphical correlation matrices with the corrplot add-on package and
generate numerous probability distributions. See some of the links here or in the
resources section to find out more.
?rainbow
Using color
Looking at nothing but black and white
graphics can get tiresome after a while. Of
course, there are numerous ways of using
color in R.
There are also R functions that automatically generate a vector of n colors using a
specific color palette such as rainbow or
heat:
rainbow(n)
ggplot(mtcars, aes(x=factor(cyl))) +
geom_bar(fill=rainbow(3))
heat.colors(n)
terrain.colors(n)
topo.colors(n)
cm.colors(n)
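Each of these palette functions simply returns a character vector of color codes, which you can inspect before handing it to a plotting function. A small sketch, sized to the six rows of the built-in BOD data frame:

```r
# One color per row of BOD, so each bar in a bar graph could get its own.
bar_colors <- heat.colors(nrow(BOD))

bar_colors           # hexadecimal color codes, each starting with "#"
length(bar_colors)   # 6
```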
barplot(BOD$demand, col=rainbow(6))
You can use a single color if you want all
the items to be one color (but not monochrome), such as
Now that you've got the list of colors properly assigned to your list of scores, just add
the testcolors vector as your desired color
scheme:
barplot(BOD$demand, col="royalblue3")
Chances are, you'll want to use color to
show certain characteristics of your data,
as opposed to simply assigning random
colors in a graphic. That goes a bit beyond
beginning R, but to give one example, say
you've got a vector of test scores:
barplot(testscores, col=testcolors)
Note that the name of a color must be in
quotation marks, but a variable name that
holds a list of colors should not be within
quote marks.
Add a graph headline:
barplot(testscores, col=testcolors,
main="Test scores")
barplot(testscores)
barplot(testscores, col=testcolors,
main="Test scores", ylim=c(0,100))
And you can make all the bars blue like this:
barplot(testscores, col="blue")
But what if you want the scores 80 and
above to be blue and the lower scores to be
red? To do this, create a vector of colors
of the same length and in the same order
as your data, adding a color to the vector
based on the data. In other words, since
the first test score is 96, the first color in
your color vector should be blue; since the
second score is 71, the second color in your
color vector should be red; and so on.
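One way to build such a color vector is with ifelse(), which tests every score at once. This is a sketch, not necessarily the approach the original code took; only the first two scores (96 and 71) come from the text above, and the rest are made up.

```r
testscores <- c(96, 71, 85, 92, 82, 78)   # first two from the text, rest made up

# "blue" where the score is 80 or above, "red" otherwise, in the same order as the scores.
testcolors <- ifelse(testscores >= 80, "blue", "red")
testcolors
```

Because ifelse() works on the whole vector, testcolors always has the same length and order as testscores.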
barplot(testscores, col=testcolors,
main="Test scores", ylim=c(0,100),
las=1)
And you've got a color-coded bar graph.
qplot(factor(cyl), data=mtcars,
geom="bar", fill=factor(cyl))
When you create an array in most programming languages, the syntax goes something
like this:
x <- 3
Or:
But not:
x = 3
Or maybe:
myArray = [1, 1, 2, 3, 5, 8]
In R, though, there's an extra piece: To put
multiple values into a single variable, you
need the c() function, such as:
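As a minimal sketch, here is the R counterpart of the array example above, using the same six values. The variable name my_vector is illustrative.

```r
# c() combines individual values into one vector.
my_vector <- c(1, 1, 2, 3, 5, 8)

length(my_vector)   # 6
my_vector[1]        # 1: R indexes from 1, not 0
```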
Loopless loops
Iterating through a collection of data with
loops like for and while is a cornerstone of many programming languages.
That's not the R way, though. While R does
apply(my_matrix, 1, median)
returns the median of every row in my_matrix and
apply(my_matrix, 2, median)
calculates the median of every column.
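To see the difference between the 1 and the 2, here is a sketch on a tiny made-up matrix where the answers are easy to check by eye.

```r
# 2 rows x 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
my_matrix <- matrix(1:6, nrow = 2)

row_medians <- apply(my_matrix, 1, median)  # 1 = work across each row:     3, 4
col_medians <- apply(my_matrix, 2, median)  # 2 = work down each column:  1.5, 3.5, 5.5
```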
That's telling you that your screen printout is starting at vector item number one.
If you've got a vector with lots of values so the printout runs across multiple
lines, each line will start with a number in
brackets, telling you which vector item number that particular line is starting with.
class(3)
class(3.0)
class(3L)
class(as.integer(3))
There are several as() functions for converting one data type to another, including
as.character(), as.list() and as.data.frame().
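A quick sketch of what the class() calls above return, along with a couple of conversions:

```r
class(3)                 # "numeric": plain numbers are stored as doubles
class(3.0)               # "numeric"
class(3L)                # "integer": the L suffix asks for an integer
class(as.integer(3))     # "integer"

class(as.character(3))   # "character": the number becomes the string "3"
as.numeric("3.5") + 1    # 4.5: a string converted back to a number
```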
Roger Peng, associate professor of biostatistics at the Johns Hopkins Bloomberg School
of Public Health, explains data types in R.
Terminating your R
expressions
edit(mtcars)
write.table(myData, "testfile.txt",
sep="\t")
This will export all the data from an R
object called myData to a tab-separated file
called testfile.txt in the current working
directory. Changing sep="\t" to sep="," will
generate a comma-separated file, and so
on.
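As a sketch of a full export and re-import, the example below writes the built-in BOD data frame to a temporary tab-separated file and reads it back. The row.names = FALSE argument is an addition not used in the command above; it keeps R's row numbers out of the file.

```r
out_path <- file.path(tempdir(), "testfile.txt")

# Export: tab-separated, with a header row but no row-number column.
write.table(BOD, out_path, sep = "\t", row.names = FALSE)

# Import the same file again to confirm the round trip.
bod_back <- read.table(out_path, sep = "\t", header = TRUE)
bod_back
```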
60+ R resources to
improve your data skills
R data structures to running regressions
and conducting factor analyses. The beginner's section may be a bit tough to follow
if you haven't had any exposure to R, but
it offers a good foundation in data types,
imports and reshaping once you've had a
bit of experience. There are some particularly useful explanations and examples for
aggregating, restructuring and subsetting
data, as well as a lot of applied statistics.
Note that if your interest in graphics is
learning ggplot2, there's relatively little
on that here compared with base R graphics and the lattice package. You can see
an excerpt from the book online: Aggregation and restructuring data. By Robert I.
Kabacoff.
Online references
Online tools
Videos
Twotorials. You'll either enjoy these snappy
2-minute "twotorial" videos or find them, oh,
corny or over the top. I think they're both
informative and fun, a welcome antidote to
the typically dry how-tos you often find in
statistical programming. Analyst Anthony
Damico takes on R in 2-minute chunks,
from "how to create a variable with R" to
"how to plot residuals from a regression in
R"; he also tackles an occasional problem
such as how to calculate your ten, fifteen,
Intro video for the Coursera Computing for Data Analysis course
Coursera: Data Analysis. This was more
of an applied statistics class that uses R as
opposed to one that teaches R; but if you've
got the R basics down and want to see it in
action, this might be a good choice. There
are no upcoming scheduled sessions for
this at Coursera, but instructor Jeff Leek,
an assistant professor of biostatistics at
How to Visualize and Compare Distributions. This short and highly readable
Flowing Data tutorial goes over traditional
visualizations such as histograms and box
plots. With downloadable code.
Handling and Processing Strings in
R. This PDF download covers many
things you'll want to do with text, from
string lengths and formatting to search
and replace with regular expressions to
basic text analysis. By statistician Gaston
Sanchez.
How to turn CSV data into interactive visualizations with R and rCharts. This 9-page slideshow gives step-by-step instructions on
various options for generating interactive
graphics. The charts and graphs use jQuery
libraries as the underlying technology but
only a couple of lines of R code are needed.
By Sharon Machlis, Computerworld.
Revolutions. There's plenty here of interest to all levels of R users. Although author
Revolution Analytics is in the business of
selling enterprise-class R platforms, the
blog is not focused exclusively on their
products.
Communities
Pretty much every social media platform has an R group. I'd particularly
recommend:
Statistics and R on Google+. Community
members are knowledgeable and helpful,
and various conversation threads engage
both newbies and experts.
You can also find R groups on LinkedIn, Reddit and Facebook, among other
platforms.
Stack Overflow has a very active R community where people ask and answer coding
questions. If you've got a specific coding
challenge, it's definitely worth searching
here to see if someone else has already
asked about something similar.
There are dozens of R User Meetups worldwide. In addition, there are other user
groups not connected with Meetup.com.
Revolution Analytics has an R User Group
Directory.
Misc
Google's R Style Guide. Want to write neat
code with a consistent style? You'll probably want a style guide; and Google has
helpfully posted their internal R style for
all to use. If that one doesn't work for you,
Hadley Wickham has a fairly abbreviated R
style guide based on Google's but with a
few tweaks.
Search
Searching for "R" on a general search
engine like Google can be somewhat frustrating, given how many utterly unrelated
English words include the letter "r." Some
search possibilities:
Apps
R Instructor. This app is primarily a well-designed, very thorough index to R, offering snippets on how to import, summarize
and plot data, as well as an introductory
section. An "I want to..." section gives
Software
Comprehensive R Archive Network
(CRAN). The most important of all: home
of the R Project for Statistical Computing,
including downloading the basic R platform, FAQs and tutorials as well as thousands of add-on packages. Also features
detailed documentation and a number of
links to more resources.
RStudio. You can download the free RStudio IDE as well as RStudio's Shiny project
aimed at turning R analyses into interactive
Web applications.
Revolution Analytics. In addition to its
commercial Revolution R Enterprise, you
can request a download of their free Revolution R Community (you'll need to provide
an email address). Both are designed to
improve R performance and reliability.
Tibco. This software company recently
released a free Tibco Enterprise Runtime
for R Developers Edition to go along with
its commercial Tibco Enterprise Runtime
for R engine aimed at helping to integrate R
analysis into other enterprise platforms.
Shiny for interactive Web apps. This opensource project from RStudio is aimed at
creating interactive Web applications from
R analysis and graphics. There's a Shiny
tutorial at the RStudio site; to see more