Chapter 3 Programming Basics: 3.1 Conditional Expressions
We teach R because it greatly facilitates data analysis, the main topic of this book. By
coding in R, we can efficiently perform exploratory data analysis, build data analysis
pipelines, and prepare data visualization to communicate results. However, R is not
just a data analysis environment but a programming language. Advanced R
programmers can develop complex packages and even improve R itself, but we do not
cover advanced programming in this book. Nonetheless, in this section, we introduce
three key programming concepts: conditional expressions, for-loops, and functions.
These are not just key building blocks for advanced programming, but are sometimes
useful during data analysis. We also note that there are several functions that are
widely used to program in R but that we will not cover in this book. These
include split, cut, do.call, and Reduce, as well as the data.table package. These are
worth learning if you plan to become an expert R programmer.
Here is a very simple example showing the general structure of an if-else statement.
The basic idea is to print the reciprocal of a unless a is 0:
a <- 0
if(a != 0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}
#> [1] "No reciprocal for 0."
Let’s look at one more example using the US murders data frame:
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
Here is a very simple example that tells us which states, if any, have a murder rate
lower than 0.5 per 100,000. The if statement protects us from the case in which no
state satisfies the condition.
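ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
  print(murders$state[ind])
} else{
  print("No state has a murder rate that low.")
}
#> [1] "Vermont"
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International CC BY-NC-SA 4.0.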
If we try it again with a rate of 0.25, we get a different answer:
if(murder_rate[ind] < 0.25){
  print(murders$state[ind])
} else{
  print("No state has a murder rate that low.")
}
#> [1] "No state has a murder rate that low."
A related function that is very useful is ifelse. This function takes three arguments: a
logical and two possible answers. If the logical is TRUE, the value in the second
argument is returned and if FALSE, the value in the third argument is returned. Here is
an example:
a <- 0
ifelse(a > 0, 1/a, NA)
#> [1] NA
The function is particularly useful because it works on vectors. It examines each entry
of the logical vector and returns elements from the vector provided in the second
argument, if the entry is TRUE, or elements from the vector provided in the third
argument, if the entry is FALSE.
a <- c(0, 1, 2, -4, 5)
result <- ifelse(a > 0, 1/a, NA)
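This table helps us see what happened. For each entry of a, it shows the logical a > 0,
the value of the second argument (1/a), the value of the third argument (NA), and the
result returned by ifelse:
a     a > 0   1/a     NA    result
0     FALSE   Inf     NA    NA
1     TRUE    1.00    NA    1.0
2     TRUE    0.50    NA    0.5
-4    FALSE   -0.25   NA    NA
5     TRUE    0.20    NA    0.2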
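A table like this can be built directly in R. The following is a minimal sketch, assuming we simply collect the intermediate values in a data frame; the column names are our own illustrative labels and are not part of ifelse.

```r
# Illustrative sketch: the column names below are our own labels.
a <- c(0, 1, 2, -4, 5)
tab <- data.frame(
  a = a,
  is_a_positive = a > 0,            # the logical vector that ifelse examines
  answer_if_true = 1 / a,           # second argument, computed for every entry
  answer_if_false = NA,             # third argument, recycled to the same length
  result = ifelse(a > 0, 1 / a, NA) # entry-by-entry choice between the two
)
print(tab)
```

Note that 1/a is computed for every entry, including a = 0, where it gives Inf. ifelse only decides which of the already computed values to keep.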
Here is an example of how this function can be readily used to replace all the missing
values in a vector with zeros:
data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
#> [1] 0
Two other useful functions are any and all. The any function takes a vector of logicals
and returns TRUE if any of the entries is TRUE. The all function takes a vector of logicals
and returns TRUE if all of the entries are TRUE. Here is an example:
z <- c(TRUE, TRUE, FALSE)
any(z)
#> [1] TRUE
all(z)
#> [1] FALSE
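These two functions are often combined with a comparison, because an if statement expects a single TRUE or FALSE rather than a vector of logicals. Here is a small sketch using a made-up vector x:

```r
x <- c(3, 8, -2, 5)   # made-up values for illustration

any(x < 0)   # TRUE: at least one entry is negative
all(x < 0)   # FALSE: not every entry is negative

# if() expects a single logical value, so we collapse the vector first
if(any(x < 0)){
  msg <- "At least one negative value."
} else{
  msg <- "No negative values."
}
print(msg)
```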
We will learn more about how to create functions through experience as we face more
complex tasks.
3.3 Namespaces
As you become a more expert R user, you will likely need to load several add-on
packages for some of your analyses. Once you start doing this, it is likely that two
packages will use the same name for two different functions, and often these functions
do completely different things. In fact, you have already encountered this because
both dplyr and the base R stats package define a filter function. There are five other
examples in dplyr. We know this because when we first load dplyr we see the
following message:
The following objects are masked from ‘package:stats’:
filter, lag
So what if we want to use the stats filter instead of the dplyr filter but dplyr appears
first in the search list? We can force the use of a specific namespace by using double
colons (::) like this:
stats::filter
If we want to be absolutely sure we use the dplyr filter we can use
dplyr::filter
Also note that if we want to use a function in a package without loading the entire
package, we can use the double colon as well.
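As a small sketch of the double-colon syntax, the code below calls the stats version of filter explicitly and then calls a function from a package that has not been loaded with library. The vector x, the three-point moving average, and the choice of the tools package are illustrative assumptions only.

```r
x <- 1:10   # illustrative data

# Explicitly call filter from the stats package, regardless of which
# other loaded packages may have masked it. With these weights it
# computes a three-point moving average.
ma <- stats::filter(x, rep(1/3, 3))
print(ma)

# A function can also be called without attaching its package.
# tools ships with R but is not attached by default.
title <- tools::toTitleCase("conditional expressions")
print(title)
```

The first and last entries of ma are NA because the moving average needs a neighbor on each side.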
For more on this advanced topic, we recommend the R Packages book.
3.4 For-loops
The formula for the sum of the series $1 + 2 + \cdots + n$ is $n(n+1)/2$. What
if we weren't sure that was the right formula? How could we check? Using what we
learned about functions, we can create one that computes $S_n$:
compute_s_n <- function(n){
  x <- 1:n
  sum(x)
}
How can we compute $S_n$ for various values of $n$, say $n = 1, \dots, 25$? Do
we write 25 lines of code calling compute_s_n? No, that is what for-loops are for in
programming. In this case, we are performing exactly the same task over and over, and
the only thing that is changing is the value of $n$. For-loops let us define the range that
our variable takes (in our example $n = 1, \dots, 25$), then change the value and
evaluate the expression as we loop.
Perhaps the simplest example of a for-loop is this useless piece of code:
for(i in 1:5){
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
Here is the for-loop we would write for our $S_n$ example:
m <- 25
s_n <- vector(length = m) # create an empty vector
for(n in 1:m){
  s_n[n] <- compute_s_n(n)
}
In each iteration, $n = 1$, $n = 2$, etc., we compute $S_n$ and store it in the $n$th
entry of s_n.
Now we can create a plot to search for a pattern:
n <- 1:m
plot(n, s_n)
If you noticed that it appears to be a quadratic, you are on the right track because the
formula is $n(n+1)/2$.
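One way to go beyond the visual check is to compare the values from the loop with the formula directly. The sketch below repeats the definitions from this section so that it can run on its own, and it creates s_n as a numeric vector.

```r
# Repeated from this section so the sketch is self-contained
compute_s_n <- function(n){
  x <- 1:n
  sum(x)
}

m <- 25
s_n <- vector("numeric", length = m)  # numeric vector of zeros to fill in
for(n in 1:m){
  s_n[n] <- compute_s_n(n)
}

# Compare every entry with n(n+1)/2
n <- 1:m
all(s_n == n * (n + 1) / 2)
#> [1] TRUE
```

An exact comparison with == is safe here because all the values involved are whole numbers. For quantities with decimals, all.equal is usually the better choice.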
3.6 Exercises
1. What will this conditional expression return?
x <- c(1,2,-3,4)
if(all(x > 0)){
  print("All Positives")
} else{
  print("Not all positives")
}
2. Which of the following expressions is always FALSE when at least one entry of a
logical vector x is TRUE?
a. all(x)
b. any(x)
c. any(!x)
d. all(!x)
3. The function nchar tells you how many characters long a character vector is. Write a
line of code that assigns to the object new_names the state abbreviation when the state
name is longer than 8 characters.
4. Create a function sum_n that for any given value, say $n$, computes the sum of the
integers from 1 to $n$ (inclusive). Use the function to determine the sum of integers from
1 to 5,000.
5. Create a function altman_plot that takes two arguments, x and y, and plots the
difference against the sum.
6. After running the code below, what is the value of x?
x <- 3
my_func <- function(y){
  x <- 5
  y+5
}
7. Write a function compute_s_n that for any given $n$ computes the
sum $S_n = 1^2 + 2^2 + 3^2 + \cdots + n^2$. Report the value of the sum
when $n = 10$.
8. Define an empty numerical vector s_n of size 25 using s_n <- vector("numeric", 25)
and store in it the results of $S_1, S_2, \dots, S_{25}$ using a for-loop.
9. Repeat exercise 8, but this time use sapply.
10. Repeat exercise 8, but this time use map_dbl.
11. Plot $S_n$ versus $n$. Use points defined by $n = 1, \dots, 25$.
12. Confirm that the formula for this sum is $S_n = n(n+1)(2n+1)/6$.