R Tutorial
• It's free!
• It runs on a variety of platforms including Windows, Unix and
MacOS.
• It provides an unparalleled platform for programming new
statistical methods in an easy and straightforward manner.
• It contains advanced statistical routines not yet available in
other packages.
• It has state-of-the-art graphics capabilities.
Installing R
• How to download R:
– http://www.r-project.org/
– Google: “R”
– Windows, Linux, Mac OS X, source
– On mindhive:
• user@ba1:~$> R [terminal only]
• user@ba1:~$> R -g Tk & [application window]
• Files for this tutorial:
– http://web.mit.edu/tkp/www/R/R_Tutorial_Data.txt
– http://web.mit.edu/tkp/www/R/R_Tutorial_Inputs.txt
Tutorials
• Each of the following tutorials is in PDF format.
• P. Kuhnert & B. Venables, An Introduction to R: Software for
Statistical Modeling & Computing
• J.H. Maindonald, Using R for Data Analysis and Graphics
• B. Muenchen, R for SAS and SPSS Users
• W.J. Owen, The R Guide
• D. Rossiter, Introduction to the R Project for Statistical Computing
for Use at the ITC
• W.N. Venables & D.M. Smith, An Introduction to R
Where to find R help and resources on the web
• R wiki:
http://rwiki.sciviews.org/doku.php
• R graph gallery:
http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R:
http://cran.r-project.org/doc/contrib/Lemon-kickstart/
More Links
• R time series tutorial
• R Concepts and Data Types presentation by Deepayan Sarkar
• Interpreting Output From lm()
• The R Wiki
• An Introduction to R
• Import / Export Manual
• R Reference Cards
Introduction
• R is “GNU S”: a language and environment for data
manipulation, calculation and graphical display.
– R is similar to the award-winning S system, which was developed at Bell
Laboratories by John Chambers et al.
– a suite of operators for calculations on arrays, in particular matrices,
– a large, coherent, integrated collection of intermediate tools for interactive
data analysis,
– graphical facilities for data analysis and display either directly at the computer
or on hardcopy
– a well-developed programming language which includes conditionals, loops,
user-defined recursive functions and input and output facilities.
Introduction
• The core of R is an interpreted computer language.
– It allows branching and looping as well as modular programming using functions.
– Most of the user-visible functions in R are written in R, calling upon a smaller set
of internal primitives.
– It is possible for the user to interface to procedures written in C, C++ or
FORTRAN languages for efficiency, and also to write additional primitives.
What R does and does not do
What R does:
o data handling and storage: numeric, textual
o matrix algebra
o hash tables and regular expressions
o high-level data analytic and statistical functions
o classes (“OO”)
o graphics
o programming language: loops, branching, subroutines
What R does not do:
o is not a database, but connects to DBMSs
o has no graphical user interfaces, but connects to Java, TclTk
o language interpreter can be very slow, but allows you to call your own C/C++ code
o no spreadsheet view of data, but connects to Excel/MsOffice
o no professional / commercial support
Some Useful Functions
• length(object) # number of elements or components
• str(object) # structure of an object
• class(object) # class or type of an object
• names(object) # names
• c(object,object,...) # combine objects into a vector
• cbind(object, object, ...) # combine objects as columns
• rbind(object, object, ...) # combine objects as rows
• ls() # list current objects
• rm(object) # delete an object
• newobject <- edit(object) # edit a copy and save it as newobject
• fix(object) # edit in place
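The functions listed above can be tried on a couple of small objects. The following is a minimal sketch; the vectors height and weight and their values are made up for illustration.

```r
# Two small example vectors (hypothetical data)
height <- c(150, 162, 171)
weight <- c(52, 60, 68)

length(height)                    # number of elements
class(height)                     # class of the object

by_col <- cbind(height, weight)   # 3 x 2 matrix, one column per vector
by_row <- rbind(height, weight)   # 2 x 3 matrix, one row per vector
dim(by_col)
dim(by_row)

df <- data.frame(height, weight)
names(df)                         # column names of the data frame
str(df)                           # compact summary of its structure

all_values <- c(height, weight)   # one vector holding all six numbers

ls()                              # objects currently defined
rm(by_row)                        # delete an object
exists("by_row")                  # FALSE once it has been removed
```

The interactive functions edit() and fix() are left out here because they open an editor window.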
R Warning !
• Most programs (e.g. Excel), as well as humans, know how to deal with
rectangular tables in the form of tab-delimited text files.
• > x = read.delim("filename.txt")
• also: read.table, read.csv
• You can also use R's built-in spreadsheet to enter the data
interactively, as in the following example.
• # enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
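For the file-based route, a self-contained sketch of reading a tab-delimited table might look as follows. The file is written to a temporary location first, and its column names and values are invented for the example.

```r
# Write a small tab-delimited file to a temporary location
# (file contents are made up for this example)
path <- tempfile(fileext = ".txt")
writeLines(c("age\tgender\tweight",
             "25\tF\t61.5",
             "31\tM\t78.2"), path)

x <- read.delim(path)                              # tab-delimited, first line is the header
y <- read.table(path, header = TRUE, sep = "\t")   # same file, options spelled out

str(x)         # one row per data line, one column per field
unlink(path)   # remove the temporary file
```

read.csv works the same way for comma-separated files.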
Keyboard Input
• Usually you will obtain a dataframe by importing it from SAS,
SPSS, Excel, Stata, a database, or an ASCII file. To create it
interactively, you can do something like the following.
> Learning[Group=="A"]
[1] 0.90 0.87 0.90 0.85 0.93 0.93 0.89 0.80 0.98
[10] 0.88 0.88 0.94 0.99 0.92 0.83 0.65 0.98 0.82
[19] 0.93 0.81 0.97 0.95 0.70 1.00 0.90 0.99 0.95
[28] 0.95 0.97 1.00 0.99
> Learning[Group!="A"]
[1] 0.57 0.55 0.94 0.68 0.89 0.60 0.63 0.84 0.92
[10] 0.56 0.78 0.54 0.47 0.45 0.59 0.91 0.18 0.33
[19] 0.88 0.23 0.75 0.21 0.35 0.70 0.34 0.43 0.75
[28] 0.44 0.44 0.29 0.48 0.28
> Condition[Group=="B"&Learning<0.5]
[1] Low Low High High High High High High High
[10] High High High High High
Levels: High Low
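The selections above rely on logical subscripts: a comparison yields a TRUE/FALSE vector, and only the positions that are TRUE are kept. Here is a minimal sketch on a six-element stand-in for the data; the values below are hypothetical and much shorter than the vectors printed above.

```r
# Hypothetical miniature version of the data: three parallel vectors
Learning  <- c(0.90, 0.57, 0.87, 0.45, 0.94, 0.33)
Group     <- c("A",  "B",  "A",  "B",  "A",  "B")
Condition <- factor(c("High", "Low", "Low", "High", "High", "Low"))

Group == "A"                                # the logical vector used as a subscript
Learning[Group == "A"]                      # scores of group A
Learning[Group != "A"]                      # scores of everyone else
Condition[Group == "B" & Learning < 0.5]    # conditions of group B subjects scoring below 0.5
```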
Storing data
• Every R object can be stored into and restored from a file with the commands
“save” and “load”.
• This uses the XDR (external data representation) standard of Sun Microsystems
and others, and is portable between MS-Windows, Unix, Mac.
• To select certain rows based on logical tests on the values of one or more
variables:
> worms[Area>3&Slope<3,]
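Both ideas above, selecting rows by logical tests and storing an object with save and load, can be sketched together. The worms data frame below is a hypothetical stand-in with invented values. The columns are written as worms$Area and worms$Slope so that the example runs without attaching the data frame first.

```r
# Hypothetical stand-in for the worms data frame
worms <- data.frame(Field = c("F1", "F2", "F3", "F4"),
                    Area  = c(3.6, 5.1, 2.8, 3.9),
                    Slope = c(11, 2, 3, 1))

# Keep rows where both tests are TRUE; the empty slot after the comma keeps all columns
flat_large <- worms[worms$Area > 3 & worms$Slope < 3, ]

# Store the result in a file, remove it from the session, then restore it
path <- tempfile(fileext = ".RData")
save(flat_large, file = path)
rm(flat_large)
load(path)        # recreates flat_large under its original name
flat_large
unlink(path)      # remove the temporary file
```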
Date Values
Symbol  Meaning       Example
%y      2-digit year  07
%Y      4-digit year  2007
# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"
Variables, Lists, and Arrays
Object orientation
primitive (or: atomic) data types in R are:
Parlance:
• class: the “abstract” definition of it
• object: a concrete instance
• method: another word for ‘function’
• slot: a component of an object
Object orientation
Advantages:
Encapsulation (can use the objects and methods someone else has written without having
to care about the internals)
Generic functions (e.g. plot, print)
Inheritance (hierarchical organization of complexity)
Caveat:
Overcomplicated, baroque program architecture…
Variables
> a = 49                           # numeric
> sqrt(a)
[1] 7
> a = "The dog ate my homework"    # character string
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)                     # logical
> a
[1] FALSE
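The type of a variable follows from the value assigned to it, so the same name can hold different types over time. A short sketch using class() from the list of useful functions:

```r
a <- 49
class(a)                  # "numeric"

a <- "The dog ate my homework"
class(a)                  # "character"

a <- (1 + 1 == 3)
class(a)                  # "logical"
a                         # FALSE, because 1 + 1 is not equal to 3
```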
Variable Labels
R's ability to handle variable labels is somewhat
unsatisfying.
If you use the Hmisc package, you can take advantage of
some labeling features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)
Variable Labels
Unfortunately the label is only in effect for functions provided by
the Hmisc package, such as describe(). Your other option is to
use the variable label as the variable name and then refer to
the variable by position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
Vectors, matrices and arrays
• vector: an ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
• Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488
numbers
• matrix: a rectangular table of data of the same type
• Example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with
10000 rows and 30 columns.
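A scaled-down sketch of these two structures follows. The matrix expr stands in for an expression matrix with 4 genes and 3 samples; its name and contents are hypothetical.

```r
a <- c(1, 2, 3)
a * 2                 # arithmetic is applied element by element

# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns)
expr <- matrix(1:12, nrow = 4, ncol = 3)
dim(expr)             # number of rows, number of columns
expr[2, ]             # all values for gene 2
expr[, 3]             # all values for sample 3
```

Note that matrix() fills the values column by column unless byrow = TRUE is given.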
, , 1
[,1] [,2]
[1,] 6 11
[2,] 7 12
[3,] 8 13
[4,] 9 14
[5,] 10 15
, , 2
[,1] [,2]
[1,] 21 26
[2,] 22 27
[3,] 23 28
[4,] 24 29
[5,] 25 30
Subscripts with Arrays (III)
• To select columns of A (e.g. second and third) and rows (e.g. two
to four) of only the second table (rows are the first subscript, columns
the second, and tables the third):
> A[2:4,2:3,2]
[,1] [,2]
[1,] 22 27
[2,] 23 28
[3,] 24 29
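The slides do not show how A was created. One definition that is consistent with the values printed above is a 5 x 3 x 2 array filled with the numbers 1 to 30. Under that assumption, the subscripting can be reproduced as follows.

```r
# Assumption: A is a 5 x 3 x 2 array holding 1:30 (filled column by column, table by table)
A <- array(1:30, dim = c(5, 3, 2))

dim(A)            # rows, columns, tables
A[, 2:3, ]        # columns 2 and 3 of both tables
A[2:4, 2:3, 2]    # rows 2 to 4, columns 2 and 3, second table only
```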
Lists
• vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
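Unlike a vector, a list can hold components of different types. A minimal sketch, in which the list person and its components are invented for illustration:

```r
a <- c(7, 5, 1)
a[2]                          # single brackets pick element 2 of the vector

# Hypothetical record mixing a string, a number and a vector
person <- list(name = "Ann", age = 34, scores = a)

person$name                   # access a component by name
person[["scores"]][2]         # second element of the scores component
length(person)                # number of components
str(person)                   # structure of the whole list
```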
> plot(arc(Percents)~Percents,
+ pch=21,cex=2,xlim=c(0,1),ylim=c(0,pi),
+ main="The Arcsine Transformation")
> lines(c(0,1),c(0,pi),col="red",lwd=2)
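The plot call above uses a function arc and a vector Percents that are not defined on the slide. A runnable sketch is possible if arc is assumed to be the arcsine (angular) transformation 2*asin(sqrt(p)), which maps proportions in [0, 1] onto [0, pi] and so matches the ylim used above. The values in Percents are also hypothetical.

```r
# Assumed definition: arcsine transformation of a proportion p in [0, 1]
arc <- function(p) 2 * asin(sqrt(p))

# Hypothetical proportions to transform
Percents <- seq(0, 1, by = 0.05)

plot(arc(Percents) ~ Percents,
     pch = 21, cex = 2, xlim = c(0, 1), ylim = c(0, pi),
     main = "The Arcsine Transformation")
lines(c(0, 1), c(0, pi), col = "red", lwd = 2)   # straight reference line from (0, 0) to (1, pi)
```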
R Packages
– One of the strengths of R is that the system can easily be
extended. The system allows you to write new functions
and package those functions in a so-called 'R package' (or
'R library'). An R package may also contain other R
objects, for example data sets or documentation. There is
a lively R user community, and many R packages have been
written and made available on CRAN for other users. To give
just a few examples, there are packages for portfolio
optimization, drawing maps, exporting objects to HTML,
time series analysis and spatial statistics, and the list
goes on.
R Packages
– To attach another package to the system you can use the menu or the
library function.
– Via the menu: select the 'Packages' menu and then 'Load package...'. A list of the
packages available on your system is displayed. Select one and click 'OK'; the
package is now attached to your current R session.
– Via the library function:
> library(MASS)
> shoes
$A
[1] 13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3
$B
[1] 14.0 8.8 11.2 14.2 11.8 6.4 9.8 11.3 9.3 13.6
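Once the package is attached, its data sets and functions can be used like any other object. A short sketch, assuming MASS is installed (it ships with standard R distributions as a recommended package):

```r
library(MASS)          # attach the package for the current session

str(shoes)             # a list with two numeric vectors, A and B
mean(shoes$A)          # mean of the A measurements
mean(shoes$B)          # mean of the B measurements

search()               # "package:MASS" now appears on the search path
```

A package that is not yet on your system has to be installed first, for example with install.packages().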
Data Manipulation
Outline
• Creating New Variables
• Operators
• Built-in functions
• Control Structures
• User Defined Functions
• Sorting Data
• Merging Data
• Aggregating Data
• Reshaping Data
• Sub-setting Data
• Data Type Conversions
Introduction
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
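If the reshape package is not available, the same renaming can be done with base R by assigning into names(). This is a sketch of an alternative, not the method shown above; the data frame below is hypothetical.

```r
# Hypothetical data frame with a column called oldname
mydata <- data.frame(oldname = 1:3, other = c("a", "b", "c"))

# Find the column by its current name and overwrite that name
names(mydata)[names(mydata) == "oldname"] <- "newname"
names(mydata)
```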
Function Description
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
    Search for pattern in x. If fixed=FALSE then pattern is a regular expression.
    If fixed=TRUE then pattern is a text string. Returns matching indices.
    grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
    Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is
    a regular expression. If fixed=TRUE then pattern is a text string.
    sub("\\s",".","Hello There") returns "Hello.There"
toupper(x)
    Uppercase
tolower(x)
    Lowercase
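The difference between a regular expression and a fixed text string is easiest to see with a pattern that has a special meaning, such as the dot. A brief sketch with made-up strings:

```r
grep("A", c("b", "A", "c"), fixed = TRUE)                        # index of the matching element
grep("^a", c("apple", "Avocado", "banana"), ignore.case = TRUE)  # elements starting with a or A

sub("\\s", ".", "Hello There")          # replace the first whitespace character
sub(".", "!", "a.b.c")                  # regular expression: "." matches any character
sub(".", "!", "a.b.c", fixed = TRUE)    # fixed string: "." matches only a literal dot

toupper("Hello There")
tolower("Hello There")
```

sub() replaces only the first match; gsub() takes the same arguments and replaces all matches.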
Stat/Prob Functions
• The following table describes functions related to
probability distributions. For the random number
generators below, you can use set.seed(1234) or
some other integer to create reproducible
pseudo-random numbers.
Function Description
dnorm(x)
    normal density function (by default mean=0, sd=1)
    # plot standard normal curve
    x <- pretty(c(-3,3), 30)
    y <- dnorm(x)
    plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)
    cumulative normal probability for q
    (area under the normal curve to the left of q)
    pnorm(1.96) is 0.975
qnorm(p)
    normal quantile: the value at the p-th quantile of the normal distribution
    qnorm(.9) is 1.28 # 90th percentile
rnorm(n, mean=0, sd=1)
    n random normal deviates with the given mean and standard deviation sd
    # 50 random normal variates with mean=50, sd=10
    x <- rnorm(50, mean=50, sd=10)
dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob)
    binomial distribution where size is the sample size (number of trials)
    and prob is the probability of a heads (pi)
    # prob of each of 0 to 5 heads of fair coin out of 10 flips
    dbinom(0:5, 10, .5)
    # prob of 5 or fewer heads of fair coin out of 10 flips
    pbinom(5, 10, .5)
dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)
    Poisson distribution with mean = variance = lambda
    # probability of 0, 1, or 2 events with lambda=4
    dpois(0:2, 4)
    # probability of at least 3 events with lambda=4
    1 - ppois(2, 4)
dunif(x, min=0, max=1), punif(q, min=0, max=1), qunif(p, min=0, max=1), runif(n, min=0, max=1)
    uniform distribution, follows the same pattern as the normal distribution above
    # 10 uniform random variates
    x <- runif(10)
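The four prefixes follow one pattern for every distribution: d for the density, p for the cumulative probability, q for the quantile and r for random draws. A short sketch of how they relate, using the same numbers as the table:

```r
set.seed(1234)                          # make the random draws reproducible
x <- rnorm(50, mean = 50, sd = 10)      # 50 draws from a normal distribution
length(x)

pnorm(1.96)                             # area to the left of 1.96, about 0.975
qnorm(pnorm(1.96))                      # qnorm undoes pnorm and gives back 1.96

p_each <- dbinom(0:5, size = 10, prob = 0.5)   # P(exactly 0), ..., P(exactly 5) heads
sum(p_each)                                    # equals pbinom(5, 10, 0.5)

1 - ppois(2, lambda = 4)                       # P(at least 3 events)
ppois(2, lambda = 4, lower.tail = FALSE)       # the same upper tail, computed directly
```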
Function Description
sd(x)
    standard deviation of object x. Also look at var(x) for variance and mad(x) for
    median absolute deviation.
median(x)
    median
quantile(x, probs)
    quantiles, where x is the numeric vector whose quantiles are desired and probs is
    a numeric vector with probabilities in [0,1].
    # 30th and 84th percentiles of x
    y <- quantile(x, c(.3,.84))
range(x)
    range
sum(x)
    sum
diff(x, lag=1)
    lagged differences, with lag indicating which lag to use
min(x)
    minimum
max(x)
    maximum
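These functions can be tried on any numeric vector. The following sketch uses a small made-up vector so that the results are easy to check by hand.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # made-up sample

sd(x)                          # standard deviation
var(x)                         # variance, the square of sd(x)
mad(x)                         # median absolute deviation
median(x)                      # middle value: 5
y <- quantile(x, c(.3, .84))   # 30th and 84th percentiles
y
range(x)                       # minimum and maximum: 2 11
sum(x)                         # 42
diff(x)                        # differences between neighbours: 2 0 1 2 2 2
diff(x, lag = 2)               # differences two positions apart: 2 1 3 4 4
min(x)
max(x)
```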
Function Description