Week1 2020
About R
History
S is a statistical programming language developed by John M. Chambers and others in the late 1970s and
early 1980s at Bell Labs. According to John Chambers, the aim of the software was “to turn ideas into
software, quickly and faithfully.”
The S engine was licensed to and finally purchased by Insightful (now acquired by TIBCO), which sells a
value-added version called S-Plus (now marketed as Spotfire S+), which contains a graphical user interface.
S-Plus used to dominate the high-end market (academic and industrial research).
R is an implementation of the S programming language, which is in many respects superior to the original
S system. R was originally written by two researchers at the University of Auckland (New Zealand), Ross
Ihaka and Robert Gentleman, but is now maintained by the R Core Team. R is free software: you can obtain
it for free, and the source code of R is freely available, so (if you want) you can study how R works internally
and modify it as you like. R is extensible, and a large selection of extension packages can be obtained from
CRAN.
Why use R?
These days, R and Python are the key platforms for statistics and Data Science. Reasons for R’s
dominant position are a very good graphics engine and the large number of extension packages (more than
15,000), so there are only a few statistical methods not implemented in R.
Furthermore, programming languages are very similar. If you know how to program in R, you will find
learning other languages like C, C++, C#, Java, Javascript, PHP or Python much easier.
Tidyverse
Though R as a language is not the most elegant (especially compared to more recent languages such as Julia),
there is a suite of R packages (‘tidyverse’) which provides tools for data manipulation and visualisation.
We will not explicitly cover the Tidyverse packages in this course but will instead focus on using base R. There
are several reasons for this: firstly, the functions from the Tidyverse packages cover just a small proportion of
all things you can do in R; secondly, this course is about learning general programming skills in R, not just
about data manipulation and visualisation; and thirdly, you will learn Tidyverse programming in your Data
Analysis course, so going over it twice is not a good use of anyone's time.
Despite not covering it within the course, if you have learned Tidyverse then you are welcome to answer the
questions within labs and assignments using your Tidyverse skills (unless of course the question specifies a
particular non-tidyverse function for you to use). Within the course you will not be penalised for the approach
you have taken (Tidyverse or base R) or for your coding efficiency (i.e. whether it takes 100 lines or 5 lines of
code, the key thing is getting the answer!).
Getting Started!
Installing R
You need to have access to R for this course. You can download R for free from CRAN or download Microsoft
R Open from MRAN.
R is available for Windows, Mac OS and Linux as well as some less common platforms.
RStudio
It is recommended that you also download and install RStudio Desktop, a powerful integrated development
environment (IDE) for R. RStudio contains a much better code editor. It has, for example, syntax highlighting,
i.e. it will automatically display your code in different colours to make it easier and quicker to read the code.
Even though other IDEs, such as Eclipse, Visual Studio Code, or Emacs can also be used with R, RStudio is
by far the most popular among R users.
RStudio is just a front-end for R, so to be able to make use of RStudio, you also need to have R installed.
RStudio Desktop
RStudio Desktop Open Source is available for free from RStudio at this link:
https://www.rstudio.com/products/rstudio/download/.
How to use RStudio
Once you have R and RStudio installed you can begin! Below is a screenshot of RStudio.
After starting RStudio it is best to start with either creating a new R script (by clicking on File > New File
> R Script or clicking on the left-most button) or opening an existing R script (by clicking on File > Open
File or clicking on the second button from the left).
You can type R commands directly into the R console running at the bottom-left of RStudio. However
it is typically better to type the R commands into the editor at the top-left. The commands can then be
submitted to R by highlighting the commands and clicking on the Run button. Pressing Ctrl-Enter (or
Ctrl-R) also submits the current selection or, if no text is selected, the current line of code.
The top-right has a list of the current objects in the workspace as well as a second tab with a history of
commands used in the past.
The bottom-right contains four tabs: one showing the files in the current working directory, one showing
the plots drawn so far, one showing all available extension packages and finally (and most importantly) one
showing the R help.
The script editor and the R console have a number of useful features, most of which are common to many
integrated development environments (IDEs):
Syntax highlighting The colour of the text is changed automatically according to the R syntax rules. This
makes the code easier to read.
Matching brackets When typing a closing bracket (), ], }) the corresponding opening bracket is highlighted.
This helps to determine the correct number and positioning of closing brackets.
Auto completion If you start typing a variable name (or the name of an argument of a function) pressing
Tab will automatically complete the name if it is unique or show a context menu with all possible completions.
For functions the context menu also shows an explanation of each argument.
Help for current command Pressing F1 after having typed the name of a function (e.g. sort) opens the
help file for this function.
R packages
R comes with a default selection of packages, which should cover your “basic needs” in terms of data
management, data visualisation and modelling. However, there is a large selection of “add-on” R packages
available on CRAN, some of which we will use for this course. You can only use these R packages after you
have installed them.
Imagine you want to use an R package called mgcv. In order to be able to use it, you first need to install it.
You can do so by entering
install.packages("mgcv")
into R. This will download and install the package mgcv, as well as all other packages which mgcv needs
(known as ‘dependencies’). Alternatively, you can click on the tab “Packages” in the bottom-right panel, and
then click on “Install”.
Once you have installed an R package you can load it using the function library:
library(mgcv)
Now you can use the functions stored within the mgcv package.
If you run library(somepackage) and obtain the error message Error in library(somepackage) : there
is no package called 'somepackage'. then you do not have this package installed and need to install it.
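The install-then-load workflow described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function names is_installed and load_or_install are hypothetical (they are not part of R), whereas requireNamespace, install.packages and library are standard R functions.

```r
# Hypothetical helper: TRUE if the package is installed and can be loaded.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Hypothetical helper: install the package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

is_installed("stats")        # TRUE: stats is shipped with every R installation
# load_or_install("mgcv")    # uncomment to install (if necessary) and load mgcv
```

Checking first avoids downloading a package again every time a script is run.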
A Brief History of Computing and Data
https://youtu.be/mjDbSsKkVdc
Duration: 3m51s
The following examples should act as a warning not to blindly trust a computer.
Is addition commutative? All of us know that $10^{20} - 10^{20} + 1 = 1$. Using R we obtain the same result:
10^20-10^20+1
## [1] 1
Of course $1 + 10^{20} - 10^{20} = 1$ as well. According to R, however,
1+10^20-10^20
## [1] 0
None of us would have made this mistake, as we would see at once that the sum of $10^{20}$ and $-10^{20}$ is 0,
thus the answer must be 1. The computer processes this sum from left to right, and for the computer
$1 + 10^{20} \approx 10^{20}$, as 1 is very small compared to $10^{20}$. In fact the next smaller number a computer can
represent is 99,999,999,999,999,983,616, which is much further away from $10^{20} - 1$ than $10^{20}$ itself.
Subtracting $10^{20}$ then yields the wrong result 0. In other words, addition does not behave on a computer as it
does on paper, and the order of the terms might matter. (Strictly speaking, it is associativity rather than
commutativity that fails: $(1 + 10^{20}) - 10^{20}$ differs from $1 + (10^{20} - 10^{20})$.)
A computer only has finite precision, so we cannot represent arbitrarily large numbers, and there are “gaps”
between the numbers. Like most other software, R uses IEEE 754 double precision floating point numbers.
Floating point numbers are the computer implementation of scientific notation (like “$3 \cdot 10^{-9}$”), i.e. the
significand and the exponent are stored separately. Storing the exponent separately makes the decimal point
“float”. The largest number that can be represented is just below $2^{1024} \approx 1.7977 \cdot 10^{308}$, which is large
enough for most purposes. If a computation results in a value larger than this, arithmetic overflow occurs. In
the past, this typically caused the program to abort. However, in IEEE 754 arithmetic and thus in R, the
result is simply set to -Inf or +Inf.
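As a minimal sketch of overflow, the lines below use .Machine$double.xmax, which in R holds the largest finite double precision number. The variable names largest, too_big and too_small are only chosen for illustration.

```r
largest <- .Machine$double.xmax   # largest finite double, about 1.797693e+308
largest

is.finite(2^1023)                 # TRUE: still representable
2^1024                            # Inf: one doubling too many

too_big   <- largest * 2          # overflow gives Inf rather than an error
too_small <- -largest * 2         # overflow in the other direction gives -Inf
too_big
too_small
```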
What causes the computer to get the wrong result, however, are the “gaps” between the numbers:
between each number and the next smaller (or larger) number there is a gap of about $2 \cdot 10^{-16}$ times the
number. And for $10^{20}$ this “gap” is larger than 1: see the figure below.
[Figure: number lines showing representable numbers close to 1 (0.9999999999999997779553950749686919152736663818359375
and 1.000000000000000444089209850062616169452667236328125) and close to $10^{20}$ (99,999,999,999,999,983,616 and
100,000,000,000,000,016,384).]
Note that in our example (and in many other situations) we can ensure that this problem does not occur by
making the computer carry out the operations in a certain order.
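The effect of the order of operations can be checked directly. In the sketch below the variable names big, left_to_right and large_first are only illustrative; .Machine$double.eps is the built-in constant giving the relative spacing of double precision numbers around 1.

```r
big <- 10^20

left_to_right <- (1 + big) - big   # the 1 is lost when it is added to 10^20
large_first   <- (big - big) + 1   # the large terms cancel before 1 is added

left_to_right                      # 0
large_first                        # 1

big + 1 == big                     # TRUE: the gap around 10^20 is wider than 1
.Machine$double.eps                # 2.220446e-16
```

Letting the two large terms cancel first is the "certain order" that avoids the problem in this example.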
More simple arithmetic You are used to rounding errors from your calculators. For example both on a
computer and on a calculator $\frac{5}{6} - \frac{1}{6} \cdot 5$ is not 0:
5/6 - 1/6 * 5
## [1] 1.110223e-16
A similar, but more surprising example is that 0.1 + 0.1 + 0.1 − 0.3 is not 0 on a computer:
0.1+0.1+0.1-0.3
## [1] 5.551115e-17
The result is almost (but only almost) 0. Again, we would have expected the computer to get this right.
Almost all modern computers (as opposed to calculators) internally use a binary system instead of the
decimal system we were taught at school. And in binary numbers 0.1 + 0.1 + 0.1 − 0.3 is 0.000110011... +
0.000110011... + 0.000110011... − 0.010011001.... As neither 0.1 nor 0.3 has a finite representation in a
binary system, a rounding error occurs.
To quote from the book The Elements of Programming Style by Kernighan and Plauger: “10.0 times 0.1 is
hardly ever 1.0”.
Rounding errors for a single computation are typically very small. However computers often carry out a
long series of calculations, and typically rounding errors do not cancel out, but accumulate. Thus a complex
computation can be subject to a significant error.
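A small sketch of how rounding errors accumulate: adding 0.1 ten times does not give exactly 1. The example uses a for loop, which has not been introduced yet, and the function all.equal, which compares two numbers up to a small tolerance.

```r
x <- 0
for (i in 1:10) {
  x <- x + 0.1            # each addition contributes a tiny rounding error
}

x == 1                    # FALSE: the accumulated error is not zero
isTRUE(all.equal(x, 1))   # TRUE: x is equal to 1 up to a small tolerance
```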
R as a calculator
Basic arithmetic operators
R as a calculator
https://youtu.be/Gib3Wk2FFi8
Duration: 16m29s
This section gives an overview of the basic arithmetic operators and functions in R. The following table
contains the basic arithmetic operators available in R.
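As a minimal sketch, the operators can be tried out at the console. The numbers 7 and 2 are arbitrary choices for illustration.

```r
7 + 2     # addition: 9
7 - 2     # subtraction: 5
7 * 2     # multiplication: 14
7 / 2     # division: 3.5
7 ^ 2     # exponentiation: 49
7 ** 2    # alternative notation for exponentiation: 49
7 %/% 2   # integer division: 3
7 %% 2    # remainder of the integer division (modulo): 1
```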
If an R expression contains more than one operator, we need to know in which order R evaluates the expression.
This is known as operator precedence in Computer Science. For example, does
2 / 3 * 2
compute $\frac{2}{3 \cdot 2} = \frac{1}{3}$ or $\frac{2}{3} \cdot 2 = \frac{4}{3}$?
R uses the following rules:
• R first evaluates ^ and **, then the sign - (not difference), then %/% or %%, then * or /, and finally + or
- (difference, not sign).
• In case of ties (operators of same precedence) the expressions are evaluated from the left to the right.
Thus in the above example 2/3*2 computes $\frac{2}{3} \cdot 2$.
Use parentheses to get R to perform calculations in a different order. For example, in order to calculate
$\frac{2}{3 \cdot 2}$, you have to use
2 / (3 * 2)
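The precedence rules can be checked with a few short expressions. This is only a sketch; the numbers are arbitrary.

```r
2 / 3 * 2       # 1.333333: same precedence, evaluated from left to right
2 / (3 * 2)     # 0.3333333: parentheses change the order

-2^2            # -4: ^ is evaluated before the sign
(-2)^2          # 4

7 %/% 2 * 3     # 9: %/% is evaluated before *
7 %/% (2 * 3)   # 1

1 + 2 * 3       # 7: * is evaluated before +
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.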
Example 1 To compute $\left(\frac{2}{3}\right)^{1/4}$ we have to use
(2/3)^(1/4)
## [1] 0.903602
If we omit the parentheses and enter
2/3^1/4
## [1] 0.1666667
R computes $\frac{2 / 3^1}{4} = \frac{1}{6}$.
Task 1 Use R to compute $3 + \frac{4}{5}$, $\frac{3+4}{5}$, and $27^{1/3}$.
## [1] Inf
for example gives Inf, whereas
1 / Inf
## [1] 0
gives 0. If you ask R to compute
Inf / Inf
## [1] NaN
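it will return NaN (not a number): it cannot tell what the result is. Expressions like sqrt(-1) also return
NaN (with a warning), unless the argument is supplied as a complex number:
sqrt(as.complex(-1))
## [1] 0+1i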
Note that NaN (not a number) is not the same as NA (missing value, “not available”).
Example 2 log(0) returns -Inf, as $\lim_{x \searrow 0} \log(x) = -\infty$. Inf-10 returns Inf, as
$\lim_{x \to +\infty} (x - 10) = +\infty$. However Inf - Inf is NaN, as the limit is ambiguous. Similarly,
sqrt(Inf) / Inf is NaN: R evaluates sqrt(Inf) first, which is Inf, and Inf/Inf is NaN.
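The special values can be told apart using the built-in functions is.nan, is.na, is.finite and is.infinite. A minimal sketch (the variable name x is only illustrative):

```r
x <- Inf - Inf

is.nan(x)             # TRUE
is.na(x)              # TRUE: is.na() treats NaN as missing as well
is.nan(NA)            # FALSE: a missing value is not NaN

is.finite(Inf)        # FALSE
is.infinite(log(0))   # TRUE: log(0) is -Inf

NaN == NaN            # NA: comparing with NaN does not give TRUE
```

Because a comparison with NaN does not return TRUE, is.nan should be used to test for NaN instead of ==.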
You can also use the more common assignment operator = instead of <- in most (but not all) circumstances.
Assignments can be made in both directions, so you could also use
2/3 * 2 -> a
Entering the name of a variable displays its value:
a
## [1] 1.333333
This is equivalent to
print(a)
## [1] 1.333333
which is what needs to be used inside control structures and functions (we will come back to this later).
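A short sketch of why print is needed inside control structures. It uses a for loop, which is only covered later in the course; the variable a is set as in the example above.

```r
a <- 2/3 * 2

for (i in 1:2) {
  a          # nothing is displayed: there is no automatic printing inside a loop
}

for (i in 1:2) {
  print(a)   # displays [1] 1.333333 in each of the two iterations
}
```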
You can define new variables using the values stored in other variables, as in
b <- a / 5
Note that changing a to something else will not automatically change b, so in the above example
a <- 10
b <- a / 5
a <- a + 30
b
## [1] 2
b will stay 2, even though a will be 40.
In case you forgot to use a variable, the last expression you computed is stored in .Last.value.
The use of variables is an important tool in programming. Variables ensure that even code performing
complex operations remains legible.
You can list all variables in the current workspace using
ls()
Alternatively, local objects are shown in the Environment tab of RStudio.
Task 2 In the video we have considered the example of taking out a loan of £9,000 for 20 years with an
annual interest rate of 15%. The yearly repayment can be shown to be
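$$P = L \cdot \frac{1 - v}{v (1 - v^n)},$$
where $L$ is the amount of the loan, $n$ is the term of the loan in years, $i$ is the annual interest rate and
$v = 1/(1+i)$ is the annual discount factor.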
## [1] 1437.853
Of your yearly payments of £1437.85, how much is interest and how much is used for paying back the loan?
The interest you pay in the first year is $L \cdot i$. The remainder of the first payment ($P - L \cdot i$) is thus used for
paying back the loan. More generally, one can show that the payment in year $k$ can be decomposed into
$$P = \underbrace{P \cdot \alpha_k}_{\text{capital repayment}} + \underbrace{P \cdot (1 - \alpha_k)}_{\text{interest}}$$
with $\alpha_k = v^{n+1-k}$.
Compute how much of the 10th payment is used for paying back the loan ($P \cdot \alpha_{10}$) and how much is interest
($P \cdot (1 - \alpha_{10})$).
https://youtu.be/Xrm1cp-WSLM
Duration: 6m48s
Logical variables
A logical variable can only hold the two values TRUE and FALSE. Logical variables are sometimes called
Boolean variables, after George Boole (1815–1864), an English mathematician, logician and philosopher. R
has the following three logical operators: ! (negation), & (logical AND) and | (logical OR).
expr1   expr2   !expr1 (NOT)   expr1 & expr2 (AND)   expr1 | expr2 (OR)
TRUE    TRUE    FALSE          TRUE                  TRUE
TRUE    FALSE   FALSE          FALSE                 TRUE
FALSE   TRUE    TRUE           FALSE                 TRUE
FALSE   FALSE   TRUE           FALSE                 FALSE
Example 3
a <- TRUE
b <- FALSE
c <- a & !b
c
## [1] TRUE
In this example c is TRUE, because !b is TRUE and TRUE & TRUE is TRUE.
In R, & has higher precedence than |. So, in the absence of parentheses, & is evaluated before |. For
example, TRUE | FALSE & FALSE is treated by R as TRUE | (FALSE & FALSE), which is TRUE. We have to
use parentheses to calculate (TRUE | FALSE) & FALSE, which is FALSE.
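The precedence of the logical operators can be verified with a few one-line expressions. A minimal sketch:

```r
TRUE | FALSE & FALSE      # TRUE: read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE    # FALSE: parentheses force | to be evaluated first

!FALSE & FALSE            # FALSE: read as (!FALSE) & FALSE
!(FALSE & FALSE)          # TRUE: negation of the whole expression
```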
Task 3 Consider three logical variables
a <- TRUE
b <- FALSE
c <- TRUE
Work out the values of the expressions !a & b, a | !b, !(a | !b) and (a & b) | c.
R also allows using the shorthands T instead of TRUE and F instead of FALSE. It is however not recommended
to use these shorthands. Whereas TRUE and FALSE are reserved keywords, which cannot be overwritten, T
and F are just global variables set to TRUE and FALSE, respectively. This means they can be masked by local
variables. Nothing (in R) prevents you from setting
T <- FALSE
F <- TRUE
and T and F have become the exact opposite of what they are meant to be! Of course, few users would do
this deliberately, but it is not inconceivable that you happen to define a variable T or F, which can, depending
on its values, have exactly the same effect.
R also has “lazy” operators && and ||. In contrast to & and | they will only evaluate the arguments until the
result has become clear. On the other hand, & and | will always evaluate all arguments. expr1&&expr2 will
evaluate expr2 only if expr1 is TRUE (otherwise the result is guaranteed to be FALSE no matter what expr2
is). Similarly, expr1||expr2 will evaluate expr2 only if expr1 is FALSE (otherwise the result is guaranteed
to be TRUE no matter what expr2 is). && and || can be helpful in conditional if statements. You should not
use && and || for vectors of length greater than 1.
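The difference between the two sets of operators becomes visible when the second argument would cause an error. The sketch below uses stop (which raises an error) and tryCatch (which catches it); both are standard R functions, but neither has been introduced in the course yet. The variable name elementwise is only illustrative.

```r
FALSE && stop("not evaluated")   # FALSE: the second argument is never evaluated
TRUE  || stop("not evaluated")   # TRUE: the second argument is never evaluated

# & always evaluates both arguments, so the error is raised (and caught here)
elementwise <- tryCatch(FALSE & stop("evaluated"),
                        error = function(e) "error raised")
elementwise                      # "error raised"
```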
Comparison operators
The comparison operators in R are == (testing for exact equality), != (“not exactly equal”), <, <=, >, and >=.
The comparison operators return a logical value (i.e. TRUE or FALSE), so you can use the operators !, & and |
to combine them to more complex expressions.
Example 4 Consider a variable x set to the number 2.
x <- 2
## [1] FALSE
x <= 3
## [1] TRUE
If we want to test whether x is in the unit interval we can use
x>0 & x<1
## [1] FALSE
or, equivalently,
!(x<=0 | x>=1)
## [1] FALSE
Due to rounding and representation errors, you do not want to use == to compare non-integers. For example,
despite 0.3 − 2 × 0.1 = 0.1, R yields
0.3 - 2 * 0.1 == 0.1
## [1] FALSE
because the expression on the left-hand side is not exactly 0.1. We see this by subtracting 0.1 from the
left-hand side (which should then be exactly zero, but isn’t)
0.3-2*0.1 - 0.1
## [1] -2.775558e-17
For non-integers we really only want to test whether they are “nearly equal”. We can do so by comparing the
absolute difference to a small number (say $10^{-8}$).
abs(0.3-2*0.1 - 0.1) < 1e-8
## [1] TRUE
or use the built-in function all.equal.
isTRUE(all.equal(0.3-2*0.1, 0.1))
## [1] TRUE
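Both approaches can be sketched side by side. The helper nearly_equal and its default tolerance are hypothetical (they are not part of R); all.equal and isTRUE are built-in functions. The example also shows why all.equal is wrapped in isTRUE: when the numbers differ, all.equal returns a description of the difference rather than FALSE.

```r
# Hypothetical helper: TRUE if x and y differ by less than the tolerance tol.
nearly_equal <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

nearly_equal(0.3 - 2 * 0.1, 0.1)   # TRUE
nearly_equal(1, 1.1)               # FALSE

all.equal(1, 1.1)                  # a character string describing the difference
isTRUE(all.equal(1, 1.1))          # FALSE
```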
Task 4 Create a variable x storing the fraction $\frac{355}{113}$. Use R to test
• whether this fraction is less than $\pi$ (pi in R),
• whether this fraction is between 3 and 4, and
• whether this fraction is within $\pm 10^{-6}$ of $\pi$.
Task Solutions
Task 1 You can use the following code.
3 + 4 / 5 # No parentheses necessary
## [1] 3.8
(3 + 4) / 5 # Parentheses needed
## [1] 1.4
27^(1/3) # Parentheses needed
## [1] 3
Task 2 To compute how the payment is split in year 10 we can use the code below.
n <- 20 # term of the loan
loan <- 9000 # amount
interest.rate <- 0.15 # effective annual interest rate
v <- 1 / (1+interest.rate) # effective annual discount factor
payment <- loan * (1-v) / (v*(1-v^n)) # yearly payments
k <- 10 # set k to 10 years
alpha10 <- v^(n+1-k) # split factor
capital10 <- payment * alpha10 # capital repayment
capital10
## [1] 309.0568
interest10 <- payment * (1-alpha10) # interest part
interest10
## [1] 1128.796
So even after ten years the largest part of the repayment is interest!
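As a plausibility check, the same decomposition can be computed for all 20 years at once. This is only a sketch: it uses the vector 1:n (the numbers 1 to 20), which has not been covered yet, and the variable names capital and interest are chosen for illustration. The capital repayments should add up to the amount of the loan, and the interest in the first year should equal the loan times the interest rate.

```r
n <- 20                                          # term of the loan
loan <- 9000                                     # amount
interest.rate <- 0.15                            # effective annual interest rate
v <- 1 / (1 + interest.rate)                     # effective annual discount factor
payment <- loan * (1 - v) / (v * (1 - v^n))      # yearly payments

k <- 1:n                                         # years 1 to 20
capital <- payment * v^(n + 1 - k)               # capital repayment in each year
interest <- payment - capital                    # interest part in each year

isTRUE(all.equal(sum(capital), loan))                   # TRUE: the loan is paid back in full
isTRUE(all.equal(interest[1], loan * interest.rate))    # TRUE: first-year interest is 1350
```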
Task 3 If we use R to work out the answers we obtain
!a & b
## [1] FALSE
a | !b
## [1] TRUE
!(a | !b)
## [1] FALSE
(a & b) | c
## [1] TRUE
Task 4 We can use the following R code.
x <- 355 / 113
x < pi
## [1] FALSE
x>3 & x<4
## [1] TRUE
abs(x-pi) < 1e-6
## [1] TRUE