R Exercise 1 - Introduction To R For Non-Programmers
Richard Blissett
2017-09-11
Broadly, the four areas you see on your screen are as follows:
Top-left: Script editor, where you can write code and run code from a script
Bottom-left: Console, where you can run R commands one-by-one and see results
Top-right: Environment, which will list your R objects (will be explained)
Bottom-right: Where you will see plots you create, help files, etc.
This tutorial will run through a series of commands. In general, you can try any commands in the console region to
see if they work. For this tutorial, instead of typing directly into the console, type the commands into the R script
window. To run the command after you’ve typed it, highlight the code, and then press the “Run” button in the top right
of the script area. The purpose of this is to let you practice creating a save-able script that you can return to in the
future.
In addition, there are several points in this tutorial where I ask you to input information. You can do this by typing
directly into the provided boxes. Finally, you can save this tutorial and your work by going to “File > Print” in your
browser and changing your printer option to save this page as a .pdf file.
Setting yourself up
You may want to set up a folder somewhere on your computer where you will save the work and data for this
tutorial. This is generally called a “working directory.” Into this folder location, download and place
the apidata_2012_so.csv dataset. This is a modified subset of 2012 school district achievement data from California’s
Academic Performance Index system. Results from these data should not be used for actual research. We will get
back to this file later.
In R, when you tell it to open a data file or to save something, it needs to know where to look. Before we begin, set
your working directory in R. We will get into how commands work in R in a bit, but for now, know that you set the
working directory by using the setwd() command. In the parentheses, in quotes, put the full file path for the folder
on your computer.
So, run the setwd() command as shown below.
If you are using a Mac, it will look something like this:
setwd("~/Desktop/R Introduction")
If you are using Windows, it will look something like this:
setwd("C:/Users/Richard/Desktop/R Introduction")
You can use the getwd() command to make sure that it worked.
What is your working directory?
WD:
Comments
In programming, a “comment” is a note that you make to yourself/other programmers within the code itself.
Comments are not actual commands or code - they are just notes. It is very important that you write comments in
your script to keep good notes. See here for more on this.
To write a comment in R, use the # symbol. For example, below, I have put a comment for the getwd() command.
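A minimal sketch of what a commented command can look like. The wording of the comments and the `wd` object name are illustrative assumptions, not the original example:

```r
# Print the current working directory to confirm that setwd() worked
getwd()

wd <- getwd()  # a comment can also follow code on the same line
```

R ignores everything from the # symbol to the end of the line, so the comments do not change what the commands do.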
Now run a line that contains just the number 2.
2
[1] 2
This prints the number 2. Now, let’s store it in an object. Let’s call this object “a.” We do that using the <- operator,
which is a combination of the “less than” symbol and a regular dash. It basically means “put the thing on the right of
me into the thing on the left of me.” It’s a left arrow.
a <- 2
This time, it did not print the number 2 because you were assigning the value of 2 to an object. Look up in the
environment region of RStudio in the top right. You should see a listed, which confirms that you have now created
an object called a .
If you want to see what is in “a,” just type it on its own line.
a
[1] 2
You can also perform functions on objects. For example, the sqrt() function takes the square root of a number.
sqrt(a)
[1] 1.414214
You also could have assigned that output to another object, called “banana.”
banana <- sqrt(a)
a+banana
[1] 3.414214
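Objects you no longer need can be deleted with the rm() function. A minimal sketch, assuming rm() is the removal step this passage refers to:

```r
a <- 2
banana <- sqrt(a)

# Remove the banana object from the environment
rm(banana)

# exists() reports whether an object with that name can still be found
exists("banana")
```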
Notice that the “banana” object is now gone from your environment window.
Lists
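Objects can hold multiple things at once. Lists are exactly what they sound like. You can assign multiple things to an
object in a list by using the c() command, which stands for “combine.” (Strictly speaking, R calls what c() creates a
“vector”; R also has a separate list() type. This tutorial uses “list” in the everyday sense.)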
a <- c(5, 4, 1, 3, 7)
a
[1] 5 4 1 3 7
Notice that even though we already had an a object from before, we can overwrite the information easily by
assigning new information to it.
You can reference different elements in the list by referring to the number of the element, called an index. For
example, to get the fifth element…
a[5]
[1] 7
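A few more ways to work with indexes, sketched here as an illustration using the same object:

```r
a <- c(5, 4, 1, 3, 7)

length(a)     # number of elements
a[1]          # first element (R starts counting at 1, not 0)
a[c(1, 3)]    # first and third elements together
a[2] <- 10    # replace the second element
a
```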
Matrices
Matrices are basically lists that have two dimensions. Consider the following matrix.
7        12   jackfruit
Madonna  0    What?
This is a 2x3 matrix with two rows and three columns. You could load it into R as follows. (You do not have to deeply
understand just yet how I did this - just copy and paste the following code into your own script. That said, return to
this later and see if you can understand what I did.)
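One way such a matrix could be built is sketched below. The object name brian matches the code that follows, but the exact construction is an assumption made for illustration:

```r
# Fill the values row by row into a matrix with 2 rows and 3 columns.
# Mixing numbers and text makes every element a character string.
brian <- matrix(c(7, 12, "jackfruit",
                  "Madonna", 0, "What?"),
                nrow = 2, ncol = 3, byrow = TRUE)
brian
dim(brian)
```

The byrow = TRUE argument tells matrix() to fill across each row first; without it, R fills down each column.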
Like lists, you can refer to matrix elements by index. You use the same [] notation, but now you put two numbers,
separated by a comma, to indicate the row and column, respectively.
brian[2,1]
[1] "Madonna"
If you just want to get information from an entire row, leave out the column number.
brian[2,]
Likewise, to get an entire column, leave out the row number.
brian[,1]
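Datasets are usually read from a file rather than typed in by hand, and read.csv() does this for .csv files. The sketch below writes a tiny made-up file to a temporary location so that it runs anywhere; the column values and the api_demo name are hypothetical stand-ins.

```r
# Stand-in for the tutorial file: write a tiny made-up CSV to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("districtid,districttype,numstu",
             "1,Elementary,250",
             "2,Unified,1200"),
           csv_path)

# read.csv() returns a data frame; the first line of the file supplies the column names
api_demo <- read.csv(csv_path)

dim(api_demo)    # number of rows, number of columns
names(api_demo)  # column names
```

With the tutorial file saved in your working directory, the equivalent call would be api <- read.csv("apidata_2012_so.csv").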
In your environment window, you can see that we have now loaded the data into an object called api . You can view
the dataset using the View() command or by just clicking on the name of the object in your environment window.
Congratulations, we did it! This, as we have seen in other data analysis contexts, is a dataset that we can do some
analyses with. There are, as you can see in the environment window, 1016 rows and 14 columns.
Before we get into the data analysis, let’s cover a couple of basic data manipulation things.
Manipulating variables
Because you are able to have different datasets open at once in different objects, variable references need to be
associated with the specific data object using the $ symbol. For example, if I wanted to get a summary of
the cst28_engl variable in api , I would type…
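A minimal sketch of the $ notation, using a small made-up data frame in place of api. The values, the toy name, and the engl_centered variable are hypothetical:

```r
# Made-up stand-in for api; the values are hypothetical
toy <- data.frame(cst28_engl = c(310, 345, 372),
                  pctmin     = c(20, 55, 80))

# Refer to a variable as dataset$variable
summary(toy$cst28_engl)

# Assigning to a name that does not exist yet adds a new variable at the end
toy$engl_centered <- toy$cst28_engl - mean(toy$cst28_engl)
names(toy)
```

The same pattern applies to any data frame: the name before the $ picks the dataset, and the name after it picks the variable.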
If you view the api dataset, you can now see the total variable tagged onto the end.
Let’s say you wanted to create a new variable called propfrpl , calculated as
the pctfrpl variable divided by 100. What code would you write to do this?
Run:
Sometimes, you will want to create a subset of variables as well. For example, say you only wanted
the districttype and pctfrpl variables. You can do this by creating a list of the variables you want using
the c() command and then subsetting by name, as shown below.
# List of variables
vars <- c("districttype", "pctfrpl")
# Get subset
apisubset <- api[vars]
You can click on the apisubset object in the environment window to make sure that it worked. Note that I could have
done the above in one line, as shown below.
# Get subset
apisubset <- api[c("districttype", "pctfrpl")]
Can you pull a subset of api with the districtid , numstu , and pctmin variables?
Run:
Manipulating observations
To refer to specific observations, you can do it by row number, but that’s not how we typically do it in data analysis.
For example, the following code prints the information from the fourth row, as per the code we learned earlier about
matrices.
api[4,]
Typically, we use some condition. Let’s say we wanted to refer to only those observations that are elementary
districts. Copy and paste the following code in your script.
# Creates newapi object, subset of api object with just "Elementary" districts
newapi <- api[api$districttype=="Elementary",]
Read this as: Assign to newapi the data from api , but only keep those observations for which districttype is
equal to “Elementary,” and copy all of the variables.
This is very similar to what we wrote before when we did api[4,] , except this time we replaced
the 4 with api$districttype=="Elementary" . Briefly, the latter expression (with the == symbol) is known as a
“logical condition.” What it does, generally, is tell R all of the numbers of the rows for which that condition is true.
The == operator checks for equality between two things. So, api$districttype=="Elementary" basically lets R
know the rows for which that condition is true (where the district type is equal to “Elementary”). A full list of logical
operators can be found here.
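To see what a logical condition produces on its own, here is a minimal sketch with a made-up data frame (the toy and is_elem names and all values are hypothetical):

```r
toy <- data.frame(districttype = c("Elementary", "Unified", "Elementary", "High"),
                  numstu       = c(250, 1200, 90, 640))

# The condition is checked for every row and gives one TRUE/FALSE per row
is_elem <- toy$districttype == "Elementary"
is_elem

# which() converts those TRUE/FALSE values into row numbers
which(is_elem)

# Putting the condition in the row position keeps only the TRUE rows
toy[is_elem, ]
```

Strictly, the condition returns TRUE or FALSE for each row rather than a set of row numbers, but the effect when subsetting is the same.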
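The “not equal” operator is != . You could have dropped all of those observations too using this operator. Run the
line below.
# Creates dropapi object, subset of api object without "Elementary" districts
dropapi <- api[api$districttype!="Elementary",]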
Read this as: Assign to dropapi the data from api , but only keep those observations for which districttype is
NOT equal to “Elementary,” and copy all of the variables.
What code would you write to create a new object called highapi that is a copy of everything from api , but only
containing those observations for which the numstu variable is above 100?
Run:
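R’s built-in documentation for a function can be opened by putting a question mark in front of its name, or by using help(). A minimal sketch, using read.csv() as the function to look up:

```r
# Open the help page for read.csv(); both lines do the same thing
?read.csv
help("read.csv")
```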
You should see the documentation, titled “Data Input,” pop up in the bottom right corner of your RStudio window.
Let’s read through it.
The description reads: “Reads a file in table format and creates a data frame from it, with cases corresponding to
lines and variables to fields in the file.” Pretty self-explanatory.
The “Usage” section, next, is important. In this case, the help file is showing the usage of several commands,
including read.table() and read.delim() , since they all have similar usage. Look at the usage for read.csv() .
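You should see this:
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)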
What does this jumble of things mean? Well, it’s showing you how this function is set up. The name is “read.csv”, and
it looks like it has at least six arguments: “header,” “sep,” “quote,” “dec,” “fill,” and “comment.char.” The “…” notation
indicates that you can use the other arguments that are in read.table() above. Any argument that isn’t followed by
the = symbol is required. So, for read.csv() , it is required that we include the file name. The other arguments are
all optional, and they all have default values. For example, read.csv() has a header argument, which is set
to TRUE by default. This means that unless you specify otherwise (by writing header = FALSE in your code),
the read.csv command will assume that the first row of your data file is the header row and read it in as such. You
will get more used to using commands and arguments as we go on.
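The effect of a default value is easiest to see by overriding it. The sketch below uses a made-up two-line file with no header row; the file contents and object names are hypothetical:

```r
# Made-up two-line file with no header row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Elementary,250",
             "Unified,1200"),
           csv_path)

# Default (header = TRUE): the first line is used up as column names
with_default <- read.csv(csv_path)
nrow(with_default)

# header = FALSE: every line is treated as data, and R names the columns V1, V2
no_header <- read.csv(csv_path, header = FALSE)
nrow(no_header)
names(no_header)
```

The first call returns one row fewer than the second because the first line of the file was consumed as the header.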
How did I know what the “…” symbol meant? The “Arguments” section, below, gives a more detailed explanation for
each argument. As you can see at the bottom of the list, the “…” symbol is associated with the explanation, “Further
arguments to be passed to read.table .”
The “Details” section often includes important information to keep in mind when you run the function. For example,
this section tells us that “If row.names is not specified and the header line has one less entry than the number of
columns, the first column is taken to be the row names.”
The most useful part is often at the bottom: “Examples.” Again, while I think other examples you find online are often
more readable, these help files will often include their own examples at the bottom.
It is important to note that regardless of language (e.g., Stata, R, SPSS), even the most seasoned researchers have
often not memorized how to do every single analysis. The most important skill you can gain is the ability to think of
something you want to do with the data, understand the language to search for a solution, and implement that
solution. Google is your friend. Most things you will want to do, someone else has done before. And most of the
time, a solution has been posted online. If all else fails, ask a friend. And if even they don’t know, post a question on
the R mailing list or on Stack Overflow. The best thing you can do is be able to speak the language, not memorize
commands.
What command would you use to save the dataframe api to a .csv file?
Run:
Next steps
For the next step, which is calculating summary statistics and doing some bivariate analysis, follow this
link: http://rpubs.com/rslbliss/r_summaries_ws