Week1 2020


Module 1 - Introduction to R and Basics

About R
History
S is a statistical programming language developed by John M. Chambers and others in the late 1970s and
early 1980s at Bell Labs. According to John Chambers, the aim of the software was “to turn ideas into
software, quickly and faithfully.”
The S engine was licensed to and finally purchased by Insightful (now acquired by TIBCO), which sells a
value-added version called S-Plus (now marketed as Spotfire S+), which contains a graphical user interface.
S-Plus used to dominate the high-end market (academic and industrial research).
R is an implementation of the S programming language, which is in many respects superior to the original
S system. R was originally written by two researchers at the University of Auckland (New Zealand), Ross
Ihaka and Robert Gentleman, but is now maintained by the R Core Team. R is free software: you can obtain
it for free, and the source code of R is freely available, so (if you want) you can study how R works internally
and modify it as you like. R is extensible and a large selection of extension packages can be obtained from
CRAN.

Why use a command-line programme?


R is essentially a command-line programme, i.e. you control R by typing commands. Whilst command-line
programs were ubiquitous until the mid-1990s, most software nowadays uses menu-driven graphical user
interfaces (GUI), like Word, Excel or SPSS.
So why should we then use an “old-fashioned” command-line programme in the 21st century? Menu-driven
software works very well when used for a limited set of tasks. However, if one includes the R packages on
CRAN, R can carry out several hundred thousand different tasks. These can simply not be arranged in a
menu in any meaningful way. In addition, most menu-driven software is very inflexible: you can often only
use it in the way the authors have designed it to be used.
Command-line based programmes are much more flexible: you can do things no one has done or even thought
of before by writing your own programs. You can write them from scratch or re-use some of the functions
already provided. Some menu-driven software (e.g. SPSS) offers the option of using macros. However,
these macros are often very clumsy and less elegant than R code. Integrated development environments like
RStudio have many features that make coding in R a lot less daunting.

Why use R?
These days, R, along with Python, is one of the key platforms for statistics and Data Science. Reasons for R’s
dominant position are a very good graphics engine and the large number of extension packages (more than
15,000), so there are only a few statistical methods not implemented in R.
Furthermore, programming languages are very similar. If you know how to program in R, you will find
learning other languages like C, C++, C#, Java, Javascript, PHP or Python much easier.

Tidyverse
R as a language is not the most elegant (especially compared to more recent languages such as Julia),
but there is a suite of R packages (‘tidyverse’) which provides tools for data manipulation and visualisation.

We will not explicitly cover the Tidyverse packages in this course but will instead focus on using base R. There
are several reasons for this; firstly the functions from the tidyverse-packages cover just a small proportion of
all things you can do in R, secondly this course is about learning general programming skills in R, not just
about data manipulation and visualisation, and thirdly you will learn Tidyverse programming in your Data
Analysis course, so going over it twice is not a good use of anyone’s time.
Despite not covering it within the course, if you have learned Tidyverse then you are welcome to answer the
questions within labs and assignments using your Tidyverse skills (unless of course the question specifies a
particular non-tidyverse function for you to use). Within the course you will not be penalised on the approach
you have taken (Tidyverse or base R) or your coding efficiency (i.e. whether it takes 100 lines or 5 lines of code, the
key thing is getting the answer!).

Comparing R to other programming languages


R is an interpreted language. In an interpreted language like R, programme code is executed step-by-step,
without (explicitly invoked) prior translation to machine code (“compilation”), as would be required for
languages such as C, C++ or Java.
Interpreted languages are typically easier to debug as code can easily be run interactively on a line-by-line
basis. However there typically is a performance penalty involved. Compiled languages are typically a lot
faster. This is why many operations in R such as matrix multiplication are under the hood implemented in
Fortran or C.
Like many interpreted languages, R nowadays has a just-in-time compiler (JIT), which can translate the
commands to be executed into an intermediate binary format, which is quicker to execute. However, this
process is almost invisible to the user.
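R is a dynamically-typed language. Programming languages like C, C++ or Java require all variables to have a
declared type before they can be used. In R, by contrast, variables do not need to be declared: a variable
takes on the type of whatever value is assigned to it.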
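The following sketch shows what this looks like from the R side (the variable name value is purely illustrative): a variable is created simply by assigning to it, and its type follows whatever value it currently holds.

```r
value <- 5            # no declaration needed: value is now numeric
class(value)
## [1] "numeric"
value <- "five"       # the same variable can later hold a character string
class(value)
## [1] "character"
value <- TRUE         # ... or a logical value
class(value)
## [1] "logical"
```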
R is object-oriented. In R and other object-oriented languages, data constructs can behave differently depending
on their type. For example, the R method summary provides a summary for an object and the method plot
plots an object.
What these methods will actually do depends on what type of object they are invoked for. R’s object
orientation is a lot less visible than that of other programming languages, so many users don’t even notice it.
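As a small illustration, the sketch below calls summary on two objects of different types. The variable names numbers and groups are made up for this example.

```r
numbers <- c(2, 4, 4, 7)                  # a numeric vector
groups <- factor(c("a", "b", "a", "a"))   # a factor (categorical data)

class(numbers)
## [1] "numeric"
class(groups)
## [1] "factor"

summary(numbers)   # numeric input: minimum, quartiles, mean and maximum
summary(groups)    # factor input: counts per level (here a = 3, b = 1)
```

The call is the same in both cases; R picks the appropriate behaviour based on the type of the object.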
R is garbage-collected. Just like in Java or Python, R looks after the memory management for you. Objects
that are no longer referred to will be automatically removed from the memory (“garbage-collected”).
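A minimal sketch of what this means in practice, using a made-up variable name big_vector: once the last reference to an object is removed, R is free to reclaim its memory. Calling gc() explicitly is rarely necessary and is shown only for illustration.

```r
big_vector <- runif(1e6)   # allocate a vector of one million numbers
exists("big_vector")
## [1] TRUE
rm(big_vector)             # remove the only reference to the object
exists("big_vector")
## [1] FALSE
invisible(gc())            # optional: ask R to run the garbage collector now
```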

Getting Started!
Installing R
You need to have access to R for this course. You can download R for free from CRAN or download Microsoft
R Open from MRAN.
R is available for Windows, Mac OS and Linux as well as some less common platforms.

CRAN: Comprehensive R Archive Network


You can download the standard version of R from CRAN here: https://cran.r-project.org/

Downloading and installing R for Windows


To download the Windows installer of R, just enter the following URL (or click on the link):
https://cran.r-project.org//bin/windows/base/release.html
This will download the installer for the most recent version of R. Alternatively, you can go to the main CRAN
page, https://cran.r-project.org/, click on “Download R for Windows”, click on “base” and then click on
“Download R x.y.z for Windows” (where x.y.z is the current version number of R). You can then run the
installer, accepting all default settings.
Alternatively, you can download Microsoft R Open from MRAN https://mran.microsoft.com/download/.
Microsoft R Open contains some performance improvements over standard R (and uses a “frozen” package
repository), but is otherwise 100% identical to CRAN R.

RStudio
It is recommended that you also download and install RStudio Desktop, a powerful integrated development
environment (IDE) for R. RStudio contains a much better code editor. It has, for example, syntax highlighting,
i.e. it will automatically display your code in different colours to make it easier and quicker to read the code.
Even though other IDEs, such as Eclipse, Visual Studio Code, or Emacs can also be used with R, RStudio is
by far the most popular among R users.
RStudio is just a front-end for R, so to be able to make use of RStudio, you also need to have R installed.

RStudio Desktop
RStudio Desktop Open Source is available for free from RStudio at this link:
https://www.rstudio.com/products/rstudio/download/

Installing RStudio Desktop for Windows


Go to https://www.rstudio.com/products/rstudio/download/ and scroll down to the section “All Installers”,
then click on “RStudio-x.y.z.exe” in the first row of the table. This should start the download of the RStudio
installer. You can then run the installer, accepting all default settings.

How to use RStudio
Once you have R and RStudio installed you can begin! Below is a screenshot of RStudio.

After starting RStudio it is best to start with either creating a new R script (by clicking on File > New File
> R Script or clicking on the left-most button) or opening an existing R script (by clicking on File > Open
File or clicking on the second button from the left).
You can type R commands directly into the R console running at the bottom-left of RStudio. However
it is typically better to type the R commands into the editor at the top-left. The commands can then be
submitted to R by highlighting the commands and clicking on the Run button. Pressing Ctrl-Enter (or
Ctrl-R) also submits the current selection or, if no text is selected, the current line of code.
The top-right has a list of the current objects in the workspace as well as a second tab with a history of
commands used in the past.
The bottom-right contains four tabs: one showing the files in the current working directory, one showing
the plots drawn so far, one showing all available extension packages and finally (and most importantly) one
showing the R help.
The script editor and the R console have a number of useful features, most of which are common to many
integrated development environments (IDEs):
Syntax highlighting. The colour of the text is changed automatically according to the R syntax rules. This
makes the code easier to read.
Matching brackets. When typing a closing bracket (), ], }) the corresponding opening bracket is highlighted.
This helps to determine the correct number and positioning of closing brackets.
Auto completion. If you start typing a variable name (or the name of an argument of a function), pressing
Tab will automatically complete the name if it is unique or show a context menu with all possible completions.
For functions the context menu also shows an explanation of each argument.
Help for current command. Pressing F1 after having typed the name of a function (e.g. sort) opens the
help file for this function.

R packages
R comes with a default selection of packages, which should cover your “basic needs” in terms of data
management, data visualisation and modelling. However, there is a large selection of “add-on” R packages
available on CRAN, some of which we will use for this course. You can only use these R packages after you
have installed them.
Imagine you want to use an R package called mgcv. In order to be able to use it, you first need to install it.
You can do so by entering
install.packages("mgcv")

into R. This will download and install the package mgcv, as well as all other packages which mgcv needs
(known as ‘dependencies’). Alternatively, you can click on the tab “Packages” in the bottom-right panel, and
then click on “Install”.
Once you have installed an R package you can load it using the function library:
library(mgcv)

Now you can use the functions stored within the mgcv package.
If you run library(somepackage) and obtain the error message Error in library(somepackage) : there
is no package called 'somepackage'. then you do not have this package installed and need to install it.
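If you would rather check whether a package is available before trying to load it, one possible sketch uses requireNamespace, which returns TRUE or FALSE instead of stopping with an error. The helper name is_installed is made up for this illustration.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")    # stats ships with R, so this is TRUE
## [1] TRUE

if (is_installed("mgcv")) {
  library(mgcv)          # safe to load
} else {
  message("mgcv is not installed; run install.packages(\"mgcv\") first")
}
```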

A Brief History of Computing and Data

A Brief History of Computing and Data

https://youtu.be/mjDbSsKkVdc

Duration: 3m51s

Computers have become powerful


The advances in computational power and data storage are the drivers behind the ascent of Data Science and
Analytics. In this section we look at two simple examples illustrating how powerful desktop computers are
nowadays. Don’t worry too much about the details of the R code at this stage.
Matrix multiplication. Suppose we want to multiply two matrices, each having 1,000 rows and columns.
How long would it take a “human computer”? Suppose that we can add or multiply two numbers in a second
(which is rather optimistic), i.e. we manage to do 1 floating point operation per second. The resulting matrix
has 1,000 × 1,000 entries, i.e. we must compute 1,000,000 numbers, each of which is a sum of 1,000 products,
i.e. we need to carry out about $2 \cdot 10^9$ additions and multiplications, i.e. it would take us $2 \cdot 10^9$ seconds, i.e. more
than 63 years (working 24/7). Let’s see how long R takes to do this (results are from a mid-range Core i5
processor).
n <- 1e3
A <- matrix(runif(n^2),nrow=n) # Create a 1000x1000 matrix with random numbers.
B <- matrix(runif(n^2),nrow=n) # Another 1000x1000 matrix with random numbers.
system.time(C <- A%*%B) # Time how long it takes to multiply them.

##    user  system elapsed
##   0.662   0.003   0.669
# The third figure is the elapsed time in seconds.

So, this has taken less than a second.
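The “human computer” estimate can be reproduced with a few lines of R. This is only a back-of-the-envelope sketch; the names operations and seconds_per_year are introduced here for illustration, and a year is taken to have 365 days.

```r
n <- 1000
# Each of the n^2 entries needs n multiplications and n - 1 additions.
operations <- n^2 * (n + (n - 1))
seconds_per_year <- 60 * 60 * 24 * 365
operations / seconds_per_year   # years at one operation per second: roughly 63
```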


Sorting. Suppose we have a data vector of 1,000,000 values. If we printed 5 values per row and 80 rows per
page, the numbers would fill 2,500 pages. How long will it take to sort these values?
n <- 1e6
x <- runif(n) # Create a vector with 1000000 random values.
system.time(sort(x)) # Time how long it takes to sort them.

##    user  system elapsed
##   0.061   0.009   0.071
# The third figure is the elapsed time in seconds.

Again, we obtained the result in much less than a second.

Computers make mistakes


The enormous arithmetic power of modern computers can lead to the wrong impression that computers are
infallible black boxes, which, regardless of what tasks we set them, obtain the right answer. Computers only
have a finite precision and do not have the oversight most of us have and take for granted.

The following examples should act as a warning not to blindly trust a computer.
Is addition commutative? All of us know that $10^{20} - 10^{20} + 1 = 1$. Using R we obtain the same result:
10^20-10^20+1

## [1] 1
Of course $1 + 10^{20} - 10^{20} = 1$ as well. According to R, however,
1+10^20-10^20

## [1] 0
None of us would have made this mistake, as we would see at once that the sum of $10^{20}$ and $-10^{20}$ is 0,
thus the answer must be 1. The computer processes this sum from left to right, and for the computer
$1 + 10^{20} \approx 10^{20}$, as 1 is very small compared to $10^{20}$. In fact the nearest numbers on either side of $10^{20}$
that a computer can represent are 99,999,999,999,999,983,616 and 100,000,000,000,000,016,384, so $1 + 10^{20}$ is
rounded back to $10^{20}$. Subtracting $10^{20}$ then yields the wrong result 0. Strictly speaking, what fails here is
associativity rather than commutativity: $a + b = b + a$ still holds on a computer, but $(a + b) + c$ need not
equal $a + (b + c)$, so the order in which the terms are summed can matter.
A computer only has finite precision, so we cannot represent arbitrarily large numbers, and there are “gaps”
between the numbers. Like most other software, R uses IEEE 754 double precision floating point numbers.
Floating point numbers are the computer implementation of scientific notation (like “$3 \cdot 10^{-9}$”), i.e. the
significand and the exponent are stored separately. Storing the exponent separately makes the decimal point
“float”. The largest number that can be represented is just under $2^{1024} \approx 1.7977 \cdot 10^{308}$, which is large enough for most
purposes. If a computation results in a value larger than this, arithmetic overflow occurs. In the past, this
typically caused the program to abort. However, in IEEE 754 arithmetic and thus in R, the result is simply
set to -Inf or +Inf.
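The overflow behaviour can be observed directly. The sketch below uses .Machine$double.xmax, which R provides as the largest finite double on the current machine.

```r
.Machine$double.xmax        # largest finite double
## [1] 1.797693e+308
.Machine$double.xmax * 2    # overflow: the result is set to Inf
## [1] Inf
-.Machine$double.xmax * 2   # overflow in the negative direction
## [1] -Inf
is.finite(2^1023)           # still representable
## [1] TRUE
is.finite(2^1024)           # too large
## [1] FALSE
```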
The problem causing the computer to get the wrong result is, however, the “gaps” between the numbers:
between each number and the next smaller (or larger) number there is a gap of about $2 \cdot 10^{-16}$ times the number.
And for $10^{20}$ this “gap” is larger than 1: see the figure below.

[Figure: two number lines showing the closest representable neighbours of 1
(0.9999999999999997779553950749686919152736663818359375 and
1.000000000000000444089209850062616169452667236328125) and of 100,000,000,000,000,000,000
(99,999,999,999,999,983,616 and 100,000,000,000,000,016,384).]
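The size of the gap near $10^{20}$ can be probed from within R. In this sketch the value 16384 is taken from the neighbouring numbers quoted above (100,000,000,000,000,016,384 minus $10^{20}$); the variable name gap is introduced here for illustration, and the outputs assume standard IEEE 754 double arithmetic.

```r
.Machine$double.eps       # relative spacing of doubles near 1
## [1] 2.220446e-16
gap <- 2^14               # spacing of doubles near 1e20, i.e. 16384
(1e20 + gap) - 1e20       # a full gap is representable
## [1] 16384
(1e20 + 1) - 1e20         # 1 is far less than half a gap: rounded away
## [1] 0
(1e20 + 9000) - 1e20      # more than half a gap: rounded up to the next double
## [1] 16384
```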

Note that in our example (and in many other situations) we can ensure that this problem does not occur by
making the computer carry out the operations in a certain order.
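For the example above, parentheses are enough to force a safe order of evaluation. A minimal sketch:

```r
(1 + 1e20) - 1e20     # the 1 is lost before the subtraction happens
## [1] 0
1 + (1e20 - 1e20)     # cancel the two large terms first, then add 1
## [1] 1
```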
More simple arithmetic. You are used to rounding errors from your calculators. For example both on a
computer and on a calculator $\frac{5}{6} - \frac{1}{6} \cdot 5$ is not 0:
5/6 - 1/6 * 5

## [1] 1.110223e-16
A similar, but more surprising example is that 0.1 + 0.1 + 0.1 − 0.3 is not 0 on a computer:
0.1+0.1+0.1-0.3

## [1] 5.551115e-17
The result is almost (but only almost) 0. Again, we would have expected the computer to get this right.
Almost all modern computers (as opposed to calculators) internally use a binary system instead of the
decimal system we were taught at school. And in binary numbers 0.1 + 0.1 + 0.1 − 0.3 is 0.000110011 . . . +
0.000110011 . . . + 0.000110011 . . . − 0.010011001 . . .. As neither 0.1 nor 0.3 has a finite representation in a
binary system, a rounding error occurs.

To quote from the book The Elements of Programming Style by Kernighan and Plauger: “10.0 times 0.1 is
hardly ever 1.0”.
Rounding errors for a single computation are typically very small. However computers often carry out a
long series of calculations, and typically rounding errors do not cancel out, but accumulate. Thus a complex
computation can be subject to a significant error.
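A small sketch of how rounding errors build up: adding 0.1 ten times does not give exactly 1, although the difference is tiny. The variable name total is made up for this example.

```r
total <- 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
total == 1                  # the accumulated rounding error makes this FALSE
## [1] FALSE
abs(total - 1) < 1e-8       # but the result is still very close to 1
## [1] TRUE
```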

R as a calculator
Basic arithmetic operators

R as a calculator

https://youtu.be/Gib3Wk2FFi8

Duration: 16m29s

This section gives an overview of the basic arithmetic operators and functions in R. The following table
contains the basic arithmetic operators available in R.

Operator   Meaning                            Example   Result

+          Addition                           3+2       5
-          Subtraction                        3-2       1
*          Scalar multiplication              3*2       6
/          Division                           5/2       2.5
%/%        Integer division                   5%/%2     2
%%         Remainder after integer division   5%%2      1
^ or **    Power                              5^2       25
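The two integer-division operators belong together: the quotient from %/% and the remainder from %% always reconstruct the original number. The sketch below uses made-up variable names x and y; the last two lines show that R rounds the quotient down (towards minus infinity) rather than towards zero.

```r
x <- 17
y <- 5
x %/% y                        # how many times 5 fits into 17
## [1] 3
x %% y                         # what is left over
## [1] 2
(x %/% y) * y + x %% y == x    # quotient and remainder reconstruct x
## [1] TRUE
-17 %/% 5                      # rounded down, not towards zero
## [1] -4
-17 %% 5                       # the remainder has the sign of the divisor
## [1] 3
```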

If an R expression contains more than one operator, we need to know in which order R evaluates the expression.
This is known as operator precedence in Computer Science. For example, does
2 / 3 * 2
compute $\frac{2}{3 \cdot 2} = \frac{1}{3}$ or $\frac{2}{3} \cdot 2 = \frac{4}{3}$?
R uses the following rules:
• R first evaluates ^ and **, then the sign - (not difference), then %/% or %%, then * or /, and finally + or
- (difference, not sign).
• In case of ties (operators of same precedence) the expressions are evaluated from the left to the right.
Thus in the above example 2/3*2 computes $\frac{2}{3} \cdot 2$.
Use parentheses to get R to perform calculations in a different order. For example, in order to calculate $\frac{2}{3 \cdot 2}$,
you have to use
2 / (3 * 2)
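The precedence rules listed above can be checked with a few further expressions. These are illustrative examples chosen for this sketch, each paired with a parenthesised version that changes the order of evaluation.

```r
-2^2            # ^ is evaluated before the sign: -(2^2)
## [1] -4
(-2)^2
## [1] 4
2 * 5 %% 3      # %% is evaluated before *: 2 * (5 %% 3)
## [1] 4
(2 * 5) %% 3
## [1] 1
1 + 2 * 3       # * is evaluated before +
## [1] 7
(1 + 2) * 3
## [1] 9
```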

Example 1 To compute $\left(\frac{2}{3}\right)^{1/4}$ we have to use
(2/3)^(1/4)

## [1] 0.903602
If we omit the parentheses and enter
2/3^1/4

## [1] 0.1666667
R computes $\frac{2 / 3^1}{4} = \frac{1}{6}$.
Task 1 Use R to compute $3 + \frac{4}{5}$, $\frac{3+4}{5}$, and $27^{1/3}$.

Mathematical functions and constants


A large choice of mathematical functions is available in R, such as abs, sign, sqrt, exp, log, sin, cos, tan,
asin, acos, atan, gamma, beta, etc.
The variable pi contains the value of π. You can generate the constant e using exp(1).
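A few of these functions in action, as a brief sketch. Note that log computes the natural logarithm unless a different base is given.

```r
sqrt(16)
## [1] 4
abs(-3.5)
## [1] 3.5
exp(1)                 # the constant e
## [1] 2.718282
log(exp(2))            # natural logarithm by default
## [1] 2
log(100, base = 10)    # other bases via the base argument
## [1] 2
sin(pi / 2)
## [1] 1
```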

IEEE 754 special values


R supports the IEEE 754 special values Inf, -Inf, and NaN, so you can carry out very limited computations on
$\mathbb{R} \cup \{-\infty, +\infty\}$. These special values allow for mitigating some of the problems caused by numerical underflow
(number rounded to zero) and overflow (number larger than the largest number which can be represented by the
computer).
1 / 0

## [1] Inf
for example gives Inf, whereas
1 / Inf

## [1] 0
gives 0. If you ask R to compute
Inf / Inf

## [1] NaN
it will return NaN (not a number): it cannot tell what the result is. Expressions like
sqrt(-1)

## Warning in sqrt(-1): NaNs produced


## [1] NaN
give a warning and the result is NaN. R can handle complex numbers, just use
sqrt(-1+0i)

## [1] 0+1i
Note that NaN (not a number) is not the same as NA (missing value, “not available”).
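The difference between the two can be tested with the functions is.nan and is.na, as in this short sketch:

```r
is.nan(0 / 0)    # 0 / 0 is NaN
## [1] TRUE
is.nan(NA)       # a missing value is not NaN
## [1] FALSE
is.na(NA)
## [1] TRUE
is.na(NaN)       # but is.na() is also TRUE for NaN
## [1] TRUE
NaN == NaN       # comparisons involving NaN give NA, so use is.nan() instead
## [1] NA
```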

Example 2 log(0) returns -Inf, as $\lim_{x \searrow 0} \log(x) = -\infty$. Inf-10 returns Inf, as $\lim_{x \to +\infty} (x - 10) = +\infty$.
However Inf - Inf is NaN, as the limit is ambiguous. Similarly, sqrt(Inf) / Inf is NaN. R evaluates
sqrt(Inf) first, which is Inf. Inf/Inf is NaN.

Variables and assignments


In all the above examples R simply returned a value. If we want to reuse the value, we need to assign the
value to a variable. Variables are a little bit like the memory function of your calculator, except that you can
use as many different variables as you like. The default assignment operator in R is <-. To store the result of
2/3 * 2 in a variable called a, we would use:
a <- 2/3 * 2

You can also use the more common assignment operator = instead of <- in most (but not all) circumstances.
Assignments can be made in both directions, so you could also use
2/3 * 2 -> a

to store the number $\frac{4}{3}$ in the variable a, though -> is used very rarely.
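One circumstance in which = and <- behave differently is inside a function call. The sketch below uses median purely as an example function; the variable name vals is made up for this illustration.

```r
median(x = c(3, 1, 2))       # here = names the argument x of median(); no variable is created
## [1] 2
median(vals <- c(3, 1, 2))   # here <- creates the variable vals and passes its value on
## [1] 2
vals
## [1] 3 1 2
```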
Variable names are case-sensitive, i.e. you can define both a and A, and a and A can hold different values.
Historically, most R users used only lowercase letters and separated words using dots, e.g. two.words, though
underscores have become increasingly popular in variable and function names.
If we want to print the value of the variable a, we just enter its name:
a

## [1] 1.333333
This is equivalent to
print(a)

## [1] 1.333333
which is what needs to be used inside control structures and functions (we will come back to this later).
You can define new variables using the values stored in other variables, as in
b <- a / 5

Note that changing a to something else will not automatically change b, so in the above example
a <- 10
b <- a / 5
a <- a + 30
b

## [1] 2
b will stay 2, even though a will be 40.
In case you forgot to assign a result to a variable, the value of the last expression you computed is stored in .Last.value.
The use of variables is an important tool in programming. Variables ensure that even code performing
complex operations remains legible.
You can list all variables in the current workspace using
ls()

## [1] "a" "A" "b" "B" "C" "n" "x"

Alternatively, local objects are shown in the Environment tab of RStudio.
Task 2 In the video we have considered the example of taking out a loan of £9,000 for 20 years with an
annual interest rate of 15%. The yearly repayment can be shown to be
$$P = L \cdot \frac{1 - v}{v (1 - v^n)}.$$

We have used the following R code to calculate the yearly payment


n <- 20 # term of the loan
loan <- 9000 # amount
interest.rate <- 0.15 # effective annual interest rate
v <- 1 / (1+interest.rate) # effective annual discount factor
payment <- loan * (1-v) / (v*(1-v^n)) # yearly payments
payment

## [1] 1437.853
Of your yearly payments of £1437.85, how much is interest and how much is used for paying back the loan?
The interest you pay in the first year is L · i. The remainder of the first payment (P − L · i) is thus used for
paying back the loan. More generally, one can show that the payment in year k can be decomposed into

$$P = \underbrace{P \cdot \alpha_k}_{\text{capital repayment}} + \underbrace{P \cdot (1 - \alpha_k)}_{\text{interest}}$$

with $\alpha_k = v^{n+1-k}$.
Compute how much of the 10th payment is used for paying back the loan ($P \cdot \alpha_{10}$) and how much is interest
($P \cdot (1 - \alpha_{10})$).
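As a sanity check of the decomposition (not a solution to the task), the sketch below evaluates it for the first year, k = 1, where the interest part should equal L · i. The variable name alpha1 is introduced here for illustration; all other names are those used in the code above.

```r
n <- 20                                        # term of the loan
loan <- 9000                                   # amount
interest.rate <- 0.15                          # effective annual interest rate
v <- 1 / (1 + interest.rate)                   # effective annual discount factor
payment <- loan * (1 - v) / (v * (1 - v^n))    # yearly payments
alpha1 <- v^(n + 1 - 1)                        # split factor for k = 1
payment * (1 - alpha1)                         # interest part of the first payment
## [1] 1350
loan * interest.rate                           # L * i, for comparison
## [1] 1350
```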

Logical variables and comparisons

Other data types: Booleans

https://youtu.be/Xrm1cp-WSLM

Duration: 6m48s

Logical variables
A logical variable can only hold the two values TRUE and FALSE. Logical variables are sometimes called
Boolean variables, after George Boole (1815–1864), an English mathematician, logician and philosopher. R
has the following three logical operators: ! (negation), & (logical AND) and | (logical OR).

expr1   expr2   !expr1 (NOT)   expr1 & expr2 (AND)   expr1 | expr2 (OR)
TRUE    TRUE    FALSE          TRUE                  TRUE
TRUE    FALSE   FALSE          FALSE                 TRUE
FALSE   TRUE    TRUE           FALSE                 TRUE
FALSE   FALSE   TRUE           FALSE                 FALSE

Example 3
a <- TRUE
b <- FALSE
c <- a & !b
c

## [1] TRUE
In this example c is TRUE, because !b is TRUE and TRUE & TRUE is TRUE.
In R, & has higher precedence than |. So, in the absence of parentheses, & is evaluated before |. For
example, TRUE | FALSE & FALSE is treated by R as TRUE | (FALSE & FALSE), which is TRUE. We have to
use parentheses to calculate (TRUE | FALSE) & FALSE, which is FALSE.
Task 3 Consider three logical variables
a <- TRUE
b <- FALSE
c <- TRUE

Without using R determine the values of . . .


!a & b
a | !b
!(a | !b)
(a & b) | c

R also allows using the shorthands T instead of TRUE and F instead of FALSE. It is however not recommended
to use these shorthands. Whereas TRUE and FALSE are reserved keywords, which cannot be overwritten, T
and F are just global variables set to TRUE and FALSE, respectively. This means they can be masked by local
variables. Nothing (in R) prevents you from setting
T <- FALSE
F <- TRUE

and T and F have become the exact opposite of what they are meant to be! Of course, few users would do
this deliberately, but it is not inconceivable that you happen to define a variable T or F, which can, depending
on its values, have exactly the same effect.
R also has “lazy” operators && and ||. In contrast to & and | they will only evaluate the arguments until the
result has become clear. On the other hand, & and | will always evaluate all arguments. expr1&&expr2 will
evaluate expr2 only if expr1 is TRUE (otherwise the result is guaranteed to be FALSE no matter what expr2
is). Similarly, expr1||expr2 will evaluate expr2 only if expr1 is FALSE (otherwise the result is guaranteed
to be TRUE no matter what expr2 is). && and || can be helpful in conditional if statements. You should not
use && and || for vectors of length greater than 1.
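The lazy evaluation can be made visible by putting a call to stop, which would normally raise an error, on the right-hand side. This is only a sketch; the final lines show a typical use in which the second test would not make sense if the first one fails.

```r
FALSE && stop("never evaluated")   # expr2 is skipped, so no error occurs
## [1] FALSE
TRUE || stop("never evaluated")    # likewise for ||
## [1] TRUE

x <- NULL
!is.null(x) && x > 0               # x > 0 is only evaluated if x is not NULL
## [1] FALSE
```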

Comparison operators
The comparison operators in R are == (testing for exact equality), != (“not exactly equal”), <, <=, >, and >=.
The comparison operators return a logical value (i.e. TRUE or FALSE), so you can use the operators !, & and |
to combine them to more complex expressions.
Example 4 Consider a variable x set to the number 2.
x <- 2

We can then test whether it is negative or less than or equal to 3.


x < 0

## [1] FALSE

x <= 3

## [1] TRUE
If we want to test whether x is in the unit interval we can use
x>0 & x<1

## [1] FALSE
or, equivalently,
!(x<=0 | x>=1)

## [1] FALSE
Due to rounding and representation errors, you do not want to use == to compare non-integers. For example,
despite 0.3 − 2 × 0.1 = 0.1, R yields
0.3 - 2 * 0.1 == 0.1

## [1] FALSE
because the expression on the left-hand side is not exactly 0.1. We see this by subtracting 0.1 from the
left-hand side (which should then be exactly zero, but isn’t)
0.3-2*0.1 - 0.1

## [1] -2.775558e-17
For non-integers we really only want to test whether they are “nearly equal”. We can do so by comparing the
absolute difference to a small number (say $10^{-8}$),
abs(0.3-2*0.1 - 0.1) < 1e-8

## [1] TRUE
or use the built-in function all.equal.
isTRUE(all.equal(0.3-2*0.1, 0.1))

## [1] TRUE
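The reason for wrapping all.equal in isTRUE is that all.equal does not return FALSE when the values differ. Instead it returns a character string describing the difference, as this sketch shows (the variable name result is made up for the example):

```r
all.equal(1, 1 + 1e-10)        # difference within the default tolerance
## [1] TRUE
result <- all.equal(1, 1.1)    # not FALSE, but a description of the difference
is.character(result)
## [1] TRUE
isTRUE(result)                 # isTRUE() turns this into a proper logical value
## [1] FALSE
```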
Task 4 Create a variable x storing the fraction $\frac{355}{113}$. Use R to test
• whether this fraction is less than π (pi in R),
• whether this fraction is between 3 and 4, and
• whether this fraction is within $\pm 10^{-6}$ of π.

Task Solutions
Task 1 You can use the following code.
3 + 4 / 5 # No parentheses necessary

## [1] 3.8
(3 + 4) / 5 # Parentheses needed

## [1] 1.4
27^(1/3) # Parentheses needed

## [1] 3
Task 2 To compute how the payment is split in year 10 we can use the code below.
n <- 20 # term of the loan
loan <- 9000 # amount
interest.rate <- 0.15 # effective annual interest rate
v <- 1 / (1+interest.rate) # effective annual discount factor
payment <- loan * (1-v) / (v*(1-v^n)) # yearly payments
k <- 10 # set k to 10 years
alpha10 <- v^(n+1-k) # split factor
capital10 <- payment * alpha10 # capital repayment
capital10

## [1] 309.0568
interest10 <- payment * (1-alpha10) # interest part
interest10

## [1] 1128.796
So even after ten years the largest part of the repayment is interest!
Task 3 If we use R to work out the answers we obtain
!a & b

## [1] FALSE
a | !b

## [1] TRUE
!(a | !b)

## [1] FALSE
(a & b) | c

## [1] TRUE
Task 4 We can use the following R code.
x <- 355 / 113
x < pi

## [1] FALSE
x>3 & x<4

## [1] TRUE

abs(x-pi) < 1e-6

## [1] TRUE
