R Exercise 1 - Introduction To R For Non-Programmers


A Quick Introduction to R for Non-Programmers
Richard Blissett
2017-09-11

Downloading and installing R


Follow these instructions to download the most recent version of R:
1. Go to https://cran.r-project.org/.
2. Click the link at the top for your operating system.
3. For Windows, click “base,” and then click the download link at the top of the page.
4. For Mac, under the “Files” section on the next page, click on the first “.pkg” link.
For the purpose of this introduction, we will be working with RStudio, which is a third-party program created to
provide a comprehensive environment in which we can use R. R comes with its own environment, but many people find
using RStudio to be easier.
Follow these instructions to download the most recent version of RStudio:
1. Go to https://www.rstudio.com/.
2. Click on the “Download” link below the RStudio icon.
3. Click on the “Download” link for the RStudio Desktop, Open Source License.
4. Click on the link under “Installers for Supported Platforms” for your operating system.

Working in RStudio and this tutorial


When you open RStudio, go to “File > New File > R Script.” Overall, your screen will look something like this.

Broadly, the four areas you see on your screen are as follows:
Top-left: Script editor, where you can write code and run code from a script
Bottom-left: Console, where you can run R commands one-by-one and see results
Top-right: Environment, which will list your R objects (will be explained)
Bottom-right: Where you will see plots you create, help files, etc.
This tutorial will run through a series of commands. In general, you can try any commands in the console region to
see if they work. For this tutorial, instead of typing directly into the console, type the commands into the R script
window. To run the command after you’ve typed it, highlight the code, and then press the “Run” button in the top right
of the script area. The purpose of this is to let you practice creating a save-able script that you can return to in the
future.
In addition, there are several points in this tutorial where I ask you to input information. You can do this by typing
directly into the provided boxes. Finally, you can save this tutorial and your work by going to “File > Print” in your
browser and changing your printer option to save this page as a .pdf file.

Setting yourself up
You may want to set up a folder somewhere on your computer where you will save the work and data for this
tutorial. This is generally called a “working directory.” Into this folder location, download and place
the apidata_2012_so.csv dataset. This is a modified subset of 2012 school district achievement data from California’s
Academic Performance Index system. Results from these data should not be used for actual research. We will get
back to this file later.
In R, when you tell it to open a data file or to save something, it needs to know where to look. Before we begin, set
your working directory in R. We will get into how commands work in R in a bit, but for now, know that you set the
working directory by using the setwd() command. In the parentheses, in quotes, put the full file path for the folder
on your computer.
So, run the setwd() command as shown below.
If you are using a Mac, it will look something like this:

setwd("~/Desktop/R Introduction")

If you are using Windows, it will look something like this:

setwd("C:/Users/Richard/Desktop/R Introduction")

You can use the getwd() command to make sure that it worked.
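As a small sketch of how to check that everything is in place, the code below changes the working directory, confirms the change, and looks for the data file. Note that tempdir() is only a stand-in for your own tutorial folder (an assumption made so the example runs on any computer), and old_wd is just a name I made up.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# tempdir() is a stand-in here; in practice, use your own folder path
setwd(tempdir())
getwd()                               # now reports the temporary folder

# Is the data file in the current working directory?
# This is TRUE only once the file has been placed in that folder.
file.exists("apidata_2012_so.csv")

# List whatever files are in the working directory
list.files()

# Return to the original working directory
setwd(old_wd)
```

If file.exists() returns FALSE in your own folder, the most likely causes are a typo in the path or a file that was saved somewhere else.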
What is your working directory?
WD:

Comments
In programming, a “comment” is a note that you make to yourself/other programmers within the code itself.
Comments are not actual commands or code - they are just notes. It is very important that you write comments in
your script to keep good notes. See here for more on this.
To write a comment in R, use the # symbol. For example, below, I have put a comment for the getwd() command.

# Prints the working directory
getwd()

[1] "C:/Users/Richard/Desktop/R Intro for SPSS"


You will see me include comments with various commands throughout this tutorial. Include them in your own script,
and modify them as needed.

Working with objects


The objective of this section is not necessarily to make you a master of R programming, but rather to get you into the
intuition of how information works in R so that you can better understand the commands that I explain later. R is what
we call an “object-oriented language,” meaning for our purposes that the logic of working with R revolves around
information and ways to store that information in “objects.”
An “object” is how data and information are stored in R. For example, type the number 2.

2

[1] 2

This prints the number 2. Now, let’s store it in an object. Let’s call this object “a.” We do that using the <- operator,
which is a combination of the “less than” symbol and a regular dash. It basically means “put the thing on the right of
me into the thing on the left of me.” It’s a left arrow.

a <- 2

This time, it did not print the number 2 because you were assigning the value of 2 to an object. Look up in the
environment region of RStudio in the top right. You should see a listed, which confirms that you have now created
an object called a.
If you want to see what is in “a,” just type it on its own line.

a

[1] 2

You can also perform functions on objects. For example, the sqrt() function takes the square root of a number.

# Takes the square root of a
sqrt(a)

[1] 1.414214

You also could have assigned that output to another object, called “banana.”

banana <- sqrt(a)

Finally, you can do operations with multiple objects.

a+banana

[1] 3.414214

In short, you create an object by assigning some value to it.


You delete an object by using the rm() command.
# Removes the banana object
rm(banana)

Notice that the “banana” object is now gone from your environment window.
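To recap the full life cycle of an object, here is a minimal sketch that creates, uses, and removes one. The object names x and kiwi are made up for illustration; exists() is a base R function that reports whether an object with a given name is currently defined.

```r
# Create an object, then create a second one from it
x <- 9
kiwi <- sqrt(x)     # kiwi now holds 3

# exists() checks whether an object with that name is defined
exists("kiwi")      # TRUE

# Remove kiwi, then check again
rm(kiwi)
exists("kiwi")      # FALSE
```

This is the same check you were doing by eye in the environment window, just done in code.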

Lists
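Objects can hold multiple things at once. Lists are exactly what they sound like. (Strictly speaking, R calls the kind
of object created below a “vector” and reserves the word “list” for a different type of object, but this tutorial uses
“list” informally.) You can assign multiple things to an object in a list by using the c() command, which stands for
“combine.”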

a <- c(5, 4, 1, 3, 7)
a

[1] 5 4 1 3 7

Notice that even though we already had an a object from before, we can overwrite the information easily by
assigning new information to it.
You can reference different elements in the list by referring to the number of the element, called an index. For
example, to get the fifth element…

a[5]

[1] 7
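Indexes can do a bit more than pick out one element. The sketch below uses the same a object and shows a few common patterns; none of this is required for the rest of the tutorial, but it may help the bracket notation feel more familiar.

```r
a <- c(5, 4, 1, 3, 7)

length(a)      # number of elements: 5
a[1]           # first element: 5 (R counts from 1, not 0)
a[2:4]         # elements two through four: 4 1 3
a[c(1, 5)]     # first and fifth elements: 5 7
a[-1]          # everything except the first element: 4 1 3 7
```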

Matrices
Matrices are basically lists that have two dimensions. Consider the following matrix.

7        12  jackfruit
Madonna  0   What?

This is a 2x3 matrix with two rows and three columns. You could load it into R as follows. (You do not have to deeply
understand just yet how I did this - just copy and paste the following code into your own script. That said, return to
this later and see if you can understand what I did.)

brian <- matrix(c("7", "Madonna", "12", "0", "jackfruit", "What?"),
                nrow=2, ncol=3)
brian

     [,1]      [,2] [,3]
[1,] "7"       "12" "jackfruit"
[2,] "Madonna" "0"  "What?"

Like lists, you can refer to matrix elements by index. You use the same [] notation, but now you put two numbers,
separated by a comma, to indicate the row and column, respectively.

brian[2,1]
[1] "Madonna"

If you just want to get information from an entire row, leave out the column number.

brian[2,]

[1] "Madonna" "0" "What?"

And vice versa to get information from an entire column.

brian[,1]

[1] "7" "Madonna"
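If you are coming back to figure out how the brian matrix was built: by default, matrix() fills in values down each column in turn, which is why "7" and "Madonna" came first in the c() call. The sketch below builds the same matrix twice, once column by column and once row by row using the byrow argument. The object names values, by_col, and by_row are made up for illustration.

```r
values <- c("7", "Madonna", "12", "0", "jackfruit", "What?")

# Default: fill down each column in turn
by_col <- matrix(values, nrow = 2, ncol = 3)

# byrow = TRUE: fill across each row in turn
by_row <- matrix(c("7", "12", "jackfruit", "Madonna", "0", "What?"),
                 nrow = 2, ncol = 3, byrow = TRUE)

identical(by_col, by_row)   # TRUE: both give the same 2x3 matrix
dim(by_col)                 # 2 3 (rows, then columns)
by_col[1, 3]                # "jackfruit"
```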

Data frames and datasets


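Data frames are the most used data type in data analysis, and they are what data analysts are used to looking at. A
data frame is much like a matrix, but its columns can hold different types of data (for example, text in one column
and numbers in another), and you can refer to columns by name.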
Let’s load that California API dataset. The read.csv() command is especially suited for reading in data like the file
we have downloaded. (Again, we will get into commands next.) Copy and paste the following code into your script
and run it.

# Reads in the API data into the api object
api <- read.csv("apidata_2012_so.csv")

In your environment window, you can see that we have now loaded the data into an object called api . You can view
the dataset using the View() command or by just clicking on the name of the object in your environment window.

Congratulations, we did it! This, as we have seen in other data analysis contexts, is a dataset that we can do some
analyses with. There are, as you can see in the environment window, 1016 rows and 14 columns.
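Besides View(), a few base R functions give a quick look at a data frame from the console. The sketch below uses a tiny made-up data frame called toy (the values are invented, and only the column names are borrowed from the API data) so that it runs without the data file; you can run the same functions on api.

```r
# A tiny made-up data frame (values invented for illustration)
toy <- data.frame(
  districtname = c("District A", "District B", "District C"),
  districttype = c("Elementary", "Unified", "High"),
  numstu       = c(250, 6724, 90)
)

dim(toy)       # number of rows and columns: 3 3
names(toy)     # the column (variable) names
head(toy, 2)   # the first two rows
str(toy)       # the type of data stored in each column
```

Running dim(api) should report the same row and column counts that the environment window shows.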
Before we get into the data analysis, let’s cover a couple of basic data manipulation things.

Manipulating variables
Because you are able to have different datasets open at once in different objects, variable references need to be
associated with the specific data object using the $ symbol. For example, if I wanted to get a summary of
the cst28_engl variable in api , I would type…

# Gets summary statistics for the cst28_engl variable
summary(api$cst28_engl)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0     115     707    3015    3016  303153      27

(Again, we will get into commands like summary() in a hot second.)
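The NA's column in that output counts missing values. As a rough sketch of how missing values behave, the example below uses a short made-up vector called scores (not the real data). The key point is that many functions return NA when any value is missing unless you ask them to skip missing values with na.rm = TRUE.

```r
# Made-up values with one missing observation
scores <- c(115, 707, NA, 3016)

summary(scores)                 # the output includes an NA's count of 1
mean(scores)                    # NA, because one value is missing
mean(scores, na.rm = TRUE)      # mean of the three non-missing values
sum(is.na(scores))              # counts the missing values: 1
```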


To add a variable, just assign something to it.

# Calculates the total of the cst28_engl and cst28_math variables
api$total <- api$cst28_engl + api$cst28_math

If you view the api dataset, you can now see the total variable tagged onto the end.
Let’s say you wanted to create a new variable called propfrpl, calculated as
the pctfrpl variable divided by 100. What code would you write to do this?
Run:

Try it out and see if it worked!


To delete a variable, assign NULL to it.

# Deletes the total variable
api$total <- NULL

Check the api dataset. It should be gone now!


What code would you use to delete the propfrpl variable?
Run:

Sometimes, you will want to create a subset of variables as well. For example, say you only wanted
the districttype and pctfrpl variables. You can do this by creating a list of the variables you want using
the c() command and then subsetting by name, as shown below.

# List of variables
vars <- c("districttype", "pctfrpl")
# Get subset
apisubset <- api[vars]

You can click on the apisubset object in the environment window to make sure that it worked. Note that I could have
done the above in one line, as shown below.

# Get subset
apisubset <- api[c("districttype", "pctfrpl")]
Can you pull a subset of api with the districtid , numstu , and pctmin variables?
Run:

Manipulating observations
To refer to specific observations, you can do it by row number, but that’s not how we typically do it in data analysis.
For example, the following code prints the information from the fourth row, as per the code we learned earlier about
matrices.

api[4,]

  districtid year     districtname districttype api cst28_engl cst911_engl
4     161143 2012 Berkeley Unified      Unified 811       4458        2237
  cst28_math cst911_math numstu pctfrpl pctell pctmin region
4       4445        2176   6724      39      13     67  South

Typically, we use some condition. Let’s say we wanted to refer to only those observations that are elementary
districts. Copy and paste the following code in your script.

# Creates newapi object, subset of api object with just "Elementary" districts
newapi <- api[api$districttype=="Elementary",]

Read this as: Assign to newapi the data from api , but only keep those observations for which districttype is
equal to “Elementary,” and copy all of the variables.
This is very similar to what we wrote before when we did api[4,], except this time we replaced
the 4 with api$districttype=="Elementary". Briefly, the latter expression (with the == symbol) is known as a
“logical condition.” The == operator checks for equality between two things, so the condition evaluates to TRUE or
FALSE for every row of api. When you put that result in the row position of the brackets, R keeps only the rows for
which the condition is TRUE (here, the rows where the district type is equal to “Elementary”). A full list of logical
operators can be found here.
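To see what a logical condition actually produces, it can help to run it on its own. The sketch below uses a small made-up data frame called toy (the values are invented; only the column names are borrowed from the API data), and is_elem is a name I made up to hold the result.

```r
# A small made-up data frame (values invented for illustration)
toy <- data.frame(
  districttype = c("Elementary", "Unified", "Elementary", "High"),
  numstu       = c(250, 6724, 480, 90)
)

# The condition on its own gives one TRUE or FALSE per row
is_elem <- toy$districttype == "Elementary"
is_elem            # TRUE FALSE TRUE FALSE

which(is_elem)     # row numbers where the condition is TRUE: 1 3
sum(is_elem)       # number of rows where the condition is TRUE: 2

# Using the stored result gives the same rows as writing the condition inline
toy[is_elem, ]
```

Checking sum() on a condition before subsetting is a quick way to confirm that it matches about as many rows as you expect.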
The “not equal” operator is != . You could have dropped all of those observations too using this operator. Run the
line below.

# Creates dropapi object, subset of api object without "Elementary" districts
dropapi <- api[api$districttype!="Elementary",]

Read this as: Assign to dropapi the data from api , but only keep those observations for which districttype is
NOT equal to “Elementary,” and copy all of the variables.
What code would you write to create a new object called highapi that is a copy of everything from api , but only
containing those observations for which the numstu variable is above 100?
Run:

Commands and help


Before hopping into actual data analysis and commands to do the things you would need to do in order to do a
normal research project, what is a command? A command, or function in R, is a special instruction for R to do
something beyond simple arithmetic. We have already seen several
examples: setwd() , getwd() , sqrt() , rm() , read.csv() , and summary() .
All functions in R have the same setup. They have a function name, like “read.csv,” and they have what we call
“arguments,” which go in parentheses, separated by commas. Arguments are basically the information that R needs
to know in order to run the function. For example, the main argument for the read.csv() function is the name of the
file that it’s supposed to read.
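As a small illustration of how arguments work, the sketch below calls the base R function seq(), which generates a sequence of numbers and has arguments named from, to, and by. I picked seq() only because its output is easy to check; it is not used elsewhere in this tutorial. Arguments can be given by position, by name, or a mix of the two.

```r
# Arguments matched by position: from = 1, to = 10
by_position <- seq(1, 10)

# A mix: first two by position, the third by name
mixed <- seq(1, 10, by = 3)                  # 1 4 7 10

# All arguments by name
by_name <- seq(from = 1, to = 10, by = 3)    # 1 4 7 10

# Named arguments can be written in any order
reordered <- seq(by = 3, to = 10, from = 1)  # 1 4 7 10
```

Naming arguments takes more typing, but it makes a script easier to read when you come back to it later.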
The easiest way to figure out how a particular function is supposed to work is to look up an example. There are
plenty of really useful resources online showing how to use basic functions, some of which are listed here.
Quick R website
RProgramming.net
UCLA Institute for Digital Research and Education
O’Reilly R Cookbook
Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill
Plus, every function comes with documentation. To get the documentation for a function, you can either type a ?
before the function name or type help([FUNCTION NAME]) . For example, let’s look at the help file for
the read.csv() function.

# Pulls up help file for the read.csv command
help(read.csv)

You should see the documentation, titled “Data Input,” pop up in the bottom right corner of your RStudio window.
Let’s read through it.
The description reads: “Reads a file in table format and creates a data frame from it, with cases corresponding to
lines and variables to fields in the file.” Pretty self-explanatory.
The “Usage” section, next, is important. In this case, the help file is showing the usage of several commands,
including read.table() and read.delim() , since they all have similar usage. Look at the usage for read.csv() .
You should see this:

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

What does this jumble of things mean? Well, it’s showing you how this function is set up. The name is “read.csv”, and
besides “file” it has at least six arguments: “header,” “sep,” “quote,” “dec,” “fill,” and “comment.char.” The “…” notation
indicates that you can use the other arguments that are in read.table() above. Any argument that isn’t followed by
the = symbol is required. So, for read.csv(), it is required that we include the file name. The other arguments are
all optional, and they all have default values. For example, read.csv() has a header argument, which is set
to TRUE by default. This means that unless you specify otherwise (by writing header = FALSE in your code),
the read.csv() command will assume that the first row of your data file is the header row and read it in as such. You
will get more used to using commands and arguments as we go on.
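To see the effect of a default value, here is a minimal sketch that reads the same data twice, once with the default header = TRUE and once with header = FALSE. So that it runs without a file, it uses the text argument (one of the read.table() arguments passed along through “…”) to supply the data as a string. The contents of csv_text and the object names are made up for illustration.

```r
# The argument names of read.csv(), taken from the function itself
names(formals(read.csv))

# A tiny made-up CSV supplied as text instead of a file
csv_text <- "name,score\nA,10\nB,20"

# Default: the first line is treated as the header row
with_header <- read.csv(text = csv_text)

# header = FALSE: the first line is treated as data
no_header <- read.csv(text = csv_text, header = FALSE)

names(with_header)   # "name" "score"
nrow(with_header)    # 2
names(no_header)     # "V1" "V2"
nrow(no_header)      # 3
```

With header = FALSE, R makes up its own column names (V1, V2, and so on) and the original header line shows up as the first row of data.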
How did I know what the “…” symbol meant? The “Arguments” section, below, gives a more detailed explanation for
each argument. As you can see at the bottom of the list, the “…” symbol is associated with the explanation, “Further
arguments to be passed to read.table .”
The “Details” section often includes important information to keep in mind when you run the function. For example,
this section tells us that “If row.names is not specified and the header line has one less entry than the number of
columns, the first column is taken to be the row names.”
The most useful part is often at the bottom: “Examples.” Again, while I think other examples you find online are often
more readable, these help files will often include their own examples at the bottom.
It is important to note that regardless of language (e.g., Stata, R, SPSS), even the most seasoned researchers have
often not memorized how to do every single analysis. The most important skill you can gain is the ability to think of
something you want to do with the data, understand the language to search for a solution, and implement that
solution. Google is your friend. Most things you will want to do, someone else has done before. And most of the
time, a solution has been posted online. If all else fails, ask a friend. And if even they don’t know, post a question on
the R mailing list or on Stack Overflow. The best thing you can do is be able to speak the language, not memorize
commands.
What command would you use to save the dataframe api to a .csv file?
Run:

Next steps
For the next step, which is calculating summary statistics and doing some bivariate analysis, follow this
link: http://rpubs.com/rslbliss/r_summaries_ws
