Lesson 2: R Basics - Notes: Exploratory Data Analysis
Quick links
The Power of R
Intro to Lesson 2
Analyzing Tweets in Chicago
Quiz
Answers
Why R?
About R
ggplot2
Install RStudio on Windows
R Programming Language Installation
RStudio Installation
Install RStudio on a Mac
RStudio Layout
Quiz
Answer
Demystifying R
Answer
Getting Help
Read and Subset Data
R Markdown Documents
Answer
Factor Variables
Ordered Factors
Quiz
Answer
Setting Levels of Ordered Factors
Quiz
Answer
Data Munging
Advice for Data Scientists
Congratulations
Quiz
What was the name of the R package used by Corey? How many Tweets per day did
the system flag? (lower and upper limits) What did you find most interesting about
the article?
Answers
Textcat was the name of the R package that Corey used. We'll discuss R packages in
the next video and what they allow us to do. For the second question, the automated
system flags between 10 and 20 tweets a day. For each of these tweets, the
system recommends that whoever sent the tweet file a report. For the last
question, we accepted any text answer.
I thought that the most interesting part was that Corey shared the open source R
code for the classifier on GitHub. GitHub is a site for hosting and sharing code. You can
learn more about GitHub by following the links in the instructor notes. It's the most
popular home for open source projects, and it's really easy to get started.
Why R?
About R
Copyright © 2014 Udacity, Inc. All Rights Reserved.
R is the leading programming language for statistics and data analysis. One of the
main advantages of using R is that we can build up an analysis line by line in code.
We can save all of our work in a file and go back to see what we investigated at a later
date. Having R scripts allows you to easily share your work with others. And you can
see what others are doing with data. R also has over 2,000 user-contributed packages
that increase its functionality. One example of this is the text analysis package,
textcat, that you just saw.
ggplot2
Now, one of the main packages we'll use throughout this course is ggplot2. ggplot2
is a graphics package that lets us create plots and graphs with just a few commands.
We'll learn more about ggplot2 in the next lesson, but just to give you a taste of
it, here's an example of a plot you can create. Don't worry about memorizing or
understanding all of this code. You'll have plenty of practice with this later in the
course. I just want to show you how a few lines of code can create amazing graphics.
I'm going to load up the ggplot2 library and a color library. Then I'm going to load the
diamonds data set, and with this function I'll create a scatter plot. Let's check out
this plot in detail. This plot shows the relationship between price and carat for almost
54,000 round cut diamonds. I'd say R is doing very well for so few lines of code. The
last thing I want to mention is that you can use R anywhere. It's free, open source
software that works on any operating system. And as a result of this, R has a large,
active, and growing community of users.
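The code from the video is not reproduced in these notes. As a rough sketch of the kind of plot described, assuming the ggplot2 package is installed, something like the following would draw price against carat for the diamonds data set. The use of geom_point, the alpha value, and the "Accent" Brewer palette are assumptions for illustration, not necessarily what the instructor typed.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# The diamonds data set ships with ggplot2.
data(diamonds)

# Scatter plot of price against carat, colored by the cut of each diamond.
# The palette and transparency below are illustrative choices.
p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Accent")

print(p)
```

In RStudio the plot would appear in the Plots tab.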
the defaults here and then install it to R. I'll leave these as is, so that way a desktop
RStudio Layout
In the top right of RStudio, we have two tabs: one called Environment and one called
History. The Environment will contain all the objects, functions and values that are in
the current working memory for R. History, on the other hand, will keep a running log of
any of the R commands that we run. Now we can run R scripts directly from our files, but
we can also type them into the console. No matter which way we enter them, the History
will capture them.
And finally, in the lower right, we have the Files, Plots, Packages, Help and Viewer
section. You'll learn more about these later but this is where our plots or graphs will
first appear when we run an R command that creates a visualization.
Quiz
Now that we've gone over the layout of RStudio, I want you to match some actions
that you might do in each panel of RStudio. Remember this first panel is for R files or
R scripts. This second area is for the workspace or the history. This third space is
Demystifying R
Now that you know the basic layout of RStudio, you're welcome to customize it by going
to Tools, and then Options. From here you can choose the default directory if you'd
like. You can change formatting options for editing code, or you can even change the font
and the appearance. Here, I'm using the Tomorrow theme, but there are of course some
others. You can also make adjustments to the panels, so feel free to play around with this
to get a configuration that you like. Now, you might not have any idea what you like at
first, so just wait until you've played with R a bit, and then you'll find a configuration
that suits you. I'm going to leave this as is.
Now, your task is to read through this file and run and write code when you're
prompted to do so. When you get to the end of the file, you'll get to a question, and
you'll need to answer this question to move on to the next video in this lesson. Now,
if you're already familiar with R, I suggest trying to answer the questions, and then
you might want to go back to the files to see if you missed anything.
We know you learn best by doing, so it's important that you execute and run this
code as you go, and that you make sense of the output. Any output will appear
down here in the console. I also hope that you have moments of inspiration. So if
something does come to mind, I want you to code it up and just run it in R. You won't
break anything, and the worst thing that could happen would be an error message or
a warning message in this box. If you do happen to get stuck at any point, post in the
discussions so that your peers or one of the instructors can help.
Quiz
Download the demystifying.R file and open it in RStudio. As you read through it,
run and write code when prompted. When you get to the end of the file, answer this
question: what is the average mpg (miles per gallon) for all of the cars in the mtcars
dataset?
Getting Help
As you go through this course, you might encounter commands that you've never
seen before. And we don't expect you to know everything, but we do want you to
keep the following in mind when you encounter any sort of challenges or difficulties.
First, take an active role in problem solving and be aware of the resources at your
disposal. You can type a question mark and the name of any function to bring up help
documentation in R.
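As a small sketch of what that looks like in practice, the lines below use mean and subset as example functions (any function name works the same way). In RStudio the help page opens in the Help tab.

```r
# A question mark followed by a function name opens its help page.
?mean

# help() does the same thing and takes the name as a string.
help("subset")

# args() prints a function's arguments without opening the full page.
args(subset)

# example() runs the examples listed at the bottom of a help page.
example(mean)
```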
Read and Subset Data
Before we can read in the data, we need to set our current working directory. To
figure out what directory you're in now, you can type getwd(). We can run
this command and see the output in the console. It looks like I'm already in the
Downloads folder, and that also appears here at the top of my console. So I don't need
to change my directory, but maybe you do. To change your directory, you can type
setwd('directory'). This takes a string, which is the file path to whatever
directory you want to go to. My guess is that your data set is in your Downloads folder,
so I would probably run this command: setwd('~/Downloads'). Now it's important
to note that whether you are on a Mac or a Windows machine, you still need to
separate the folders in your path with a forward slash. Also be sure that you use
quotes around your path.
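Here is a minimal sketch of checking and changing the working directory. To keep it runnable on any machine, tempdir() stands in for the folder that holds your data; on your own computer you would pass a path such as '~/Downloads' instead.

```r
# Remember and print the directory we start in.
original_dir <- getwd()
print(original_dir)

# Stand-in target directory (an assumption for this sketch).
# On your machine this might be "~/Downloads": quoted, with forward slashes.
data_dir <- tempdir()
setwd(data_dir)

# Confirm the change and list the files R can now see by name alone.
print(getwd())
print(list.files())

# Go back to where we started.
setwd(original_dir)
```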
Now in order to load up the data, we can use the read.csv('filename') command.
This command takes a string, which is the name of the file. Here we're going to
assign the result to a variable called statesInfo, which will hold all of our data in a
data frame. When I run this code, I can see that statesInfo appears in my
environment. I can double click on the data frame in the workspace, and this will let
me see the table of values in RStudio.
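The course file itself is not included in these notes, so the sketch below first writes a stand-in CSV built from the state data sets that ship with R, then reads it back with read.csv. The file name states_demo.csv and the column names are assumptions; the real course file may use different ones.

```r
# Build a stand-in table from built-in data (not the course's actual file).
# as.integer() turns the region factor into codes, where 1 is the Northeast.
states <- data.frame(state.abb,
                     state.region = as.integer(state.region),
                     state.x77)

# Write it to a temporary CSV; the state names become the first column.
csv_path <- file.path(tempdir(), "states_demo.csv")
write.csv(states, csv_path)

# Read the CSV into a data frame, as described above.
statesInfo <- read.csv(csv_path)

# Inspect the structure and the first few rows.
str(statesInfo)
head(statesInfo)
```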
Now, let's say I wanted to get information on states located only in the Northeast.
Those would be states like Connecticut, and they have a state.region of 1.
I'm going to go back to my R script and write a command that pulls in this data. This
subset command would look like this. Here I'm passing the data frame statesInfo to
subset and I'm asking it to retrieve any states that have a state.region equal to 1.
I could also write this with bracket notation. The name of the
data set is statesInfo, and then I want the rows that have a state.region equal to 1.
Now I can't just use state.region here; I need to access the actual variable, so I have
to put statesInfo and the dollar sign. This gives me the actual variable value and I can
see if it's equal to 1. If it is equal to 1, I want to return every single column in the
data frame. So for example, with Connecticut, if its state.region is equal to 1, I want
to return every single column in this row.
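The two approaches can be sketched side by side. The data frame here is a stand-in built from data sets that ship with R, so its column names are assumptions and may differ from the course file.

```r
# Stand-in for statesInfo, built from built-in data (region 1 = Northeast).
statesInfo <- data.frame(state.abb,
                         state.region = as.integer(state.region),
                         state.x77)

# Method 1: subset() takes the data frame and a condition on its columns.
northeast_subset <- subset(statesInfo, state.region == 1)

# Method 2: bracket notation, dataSet[ROWS, COLUMNS].
# The row condition must name the data frame with $, and leaving the
# column position blank returns every column.
northeast_bracket <- statesInfo[statesInfo$state.region == 1, ]

print(northeast_subset)
print(northeast_bracket)
```

Note the comma inside the brackets: everything before it selects rows, and everything after it selects columns.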
Now, I really want you to pay careful attention to the syntax in both of these
examples. Throughout this course, we tend to make use of the subset command, but
there might be instances where we use the other method. Just know that both
methods produce the same result. Now, I recommend that you try subsetting this
data frame for other regions of the country on your own. You could also try finding
out which states have an illiteracy rate of 0.5%, or which states have high school
graduation rates above 50%. Feel free to play around.
R Markdown Documents
Let's get some more practice working with data frames. You downloaded an R script
earlier and saw how we can save our work and run R code from it. For your next
task, you're going to download an RMD file and run code in it. The file will look
something like this. Notice how this file is slightly different from the file that you saw
before. An R script can only have R code and comments. This file, however, the RMD
file, allows us to do so much more. It's an R Markdown document, or RMD.
So far this file only contains text that will be formatted using Markdown. Let's add
some R code to this. We can do this by clicking on Chunks and then going to Insert
Chunk. Now if you're friendly with the keyboard, you can use the shortcut
Cmd+Option+I. There are many other shortcuts, and you can see them here in
this menu. Here I've added some code to see what this data set is. This is the cars
data set that also comes with R, and it contains 50 observations of speeding and
Answer
The cars that satisfy either this condition or this condition were the Fiat 128, the
Honda Civic, the Lotus Europa, and the Toyota Corolla. Let's see how we can figure this
out using R. Your task was to determine which cars in the mtcars data set have an mpg
greater than or equal to 30 or an hp less than 60. To answer this question you needed to
subset the data frame. I'll show you two ways of doing this. In the first method, I'll use
the subset command on mtcars, and I want to get the cars which have an mpg greater than
or equal to 30 or whose horsepower is less than 60. That would be the Fiat 128, the Honda
Civic, the Toyota Corolla, and the Lotus Europa. Using the bracket notation, the syntax
would look like this, and there's the same output.
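The on-screen code is not captured in these notes; a sketch of both methods, using the mtcars data set that ships with R, might look like this. The variable names are chosen for illustration.

```r
# Method 1: subset() with an "or" condition (the | operator).
efficient_subset <- subset(mtcars, mpg >= 30 | hp < 60)

# Method 2: bracket notation with the same condition.
# Each column needs the mtcars$ prefix, and the trailing comma keeps all columns.
efficient_bracket <- mtcars[mtcars$mpg >= 30 | mtcars$hp < 60, ]

# The row names are the car models.
print(rownames(efficient_subset))
print(rownames(efficient_bracket))
```

Swapping | for & would instead require both conditions to hold at once.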
Factor Variables
Now that you're familiar with the basic R commands, let's look at some more data.
This time we'll be looking at data collected from a survey of Reddit users. Reddit's
a social and entertainment website where users can post links and comments
about trending news. This survey asks users about demographic information such
as gender, age, nationality and employment status. It even asks users what type of
cheese they would be. And whether they prefer dogs, cats, or turtles. Download the
data set from the instructor notes and load it into R. Once you've done that, take a
look at the data by using the str(data) function.
Now when I try to read in the file, sometimes I might get an error, and this is pretty
common, so I would suggest looking at your current working directory to figure out
the problem. Often times your directory isn't where your file is stored. Alright, so I've
set my directory, and now I'm going to try this code up here again. And there we go,
there's our data. Running the str(reddit) command, we can see that there's lots of
data here. Most of these variables have a type of factor.
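One caveat worth knowing: since R 4.0.0, read.csv no longer converts text columns to factors by default, so on a recent version of R you may see chr columns unless you pass stringsAsFactors = TRUE. The sketch below uses a tiny made-up survey file (not the real Reddit data) so that it runs on its own; the file name and columns are assumptions.

```r
# Write a small, invented survey file to a temporary location.
csv_path <- file.path(tempdir(), "survey_demo.csv")
writeLines(c(
  "id,age.range,employment.status",
  "1,18-24,Student",
  "2,25-34,Employed full time",
  "3,25-34,Employed full time",
  "4,Under 18,Student"
), csv_path)

# stringsAsFactors = TRUE asks read.csv to store text columns as factors.
survey <- read.csv(csv_path, stringsAsFactors = TRUE)

# str() shows each variable's type and, for factors, the number of levels.
str(survey)
```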
Now, a factor is a categorical variable that has different flavors or levels to it. An
example of this would be employment status. This variable has many different levels
such as employed full time or employed part time or not working. One thing we
might be interested in is how many people are in each group of employment status.
We can table that variable to see the number in each of these groups. Running this
code - table(reddit$employment.status) - I can see the table.
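To see what table does without the survey file at hand, here is a minimal sketch on a made-up factor. The responses below are invented for illustration.

```r
# An invented set of employment responses stored as a factor.
employment <- factor(c("Employed full time", "Student", "Employed full time",
                       "Freelance", "Student", "Employed full time"))

# levels() lists the distinct categories the factor can take.
print(levels(employment))

# table() counts how many observations fall into each level.
status_counts <- table(employment)
print(status_counts)
```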
Ordered Factors
Let's look more closely at these factor variables. For now I want to draw your
attention to the age.range variable right here. Notice that it says that we have a
factor variable with seven different levels. We can examine the levels of a variable by
using the levels function. Instead of creating a table of the age.range variable, let's
create a plot that shows how many users are in each bin. That is, we want to figure out
how many surveyed respondents are between the ages of 18 and 24, 25 and 34, and so on.
I'm going to create this plot using the ggplot2 package and the qplot function that comes
with it. Again, don't worry about understanding this code too much; we'll have practice
with this in the next lesson. When I run this code, I get my plot over here. I want you to
notice that the age groups appear to be in order. This is true for everyone except the
survey takers who are under the age of 18. Now, it would be really helpful if the under
18 bar was to the left of the 18-24 bar. That way we could make comparisons
across the groups more easily. This is why we would want to have ordered
factors.
The variable age.range is just a factor with seven levels, but these levels aren't
arranged in any meaningful order. Sometimes we want to introduce order into our
data set so that we can make more readable plots. So, knowing a little bit about
ordered factors, let's see if you can answer this next question.
Quiz
If you haven't already done so, download the Reddit survey data and look at its
structure. After you've looked at the structure of the variables, try to answer this
question. Which of these variables in the data set could also be converted to an
ordered factor, just like age.range? Check any of the variables that apply.
Answer
Quiz
What I want you to do is to look up the documentation for the factor function. Or you
can read through the example here. Once you're ready, try to write the code
to order the levels of the age.range variable. And just as a reminder, the levels
take on the following values: 18-24, 25-34, 35-44, 45-54, 55-64, 65 and over, and under
18.
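As a sketch of the technique on made-up data (not the survey itself), the factor function accepts a levels argument that fixes the order and an ordered argument that marks the factor as ordered. The exact level spellings below are assumptions; in the real data set they must match the values in the file character for character.

```r
# Invented responses using the age groups listed above.
age_range <- factor(c("25-34", "Under 18", "18-24", "65 and over",
                      "25-34", "18-24", "25-34"))

# By default the levels are sorted as text, so "Under 18" comes last.
print(levels(age_range))

# Rebuild the factor with the levels in a meaningful order.
age_ordered <- factor(age_range,
                      levels = c("Under 18", "18-24", "25-34", "35-44",
                                 "45-54", "55-64", "65 and over"),
                      ordered = TRUE)

print(levels(age_ordered))

# Tables (and plots) now follow the new order.
print(table(age_ordered))
```

Any value that does not match one of the supplied levels becomes NA, so it is worth checking the table afterwards.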
Answer
Data Munging
It's important to note that many of the data sets that we've used so far in this course
are what I would call tidy data sets. What I mean is that these data sets were
The important thing to know is that this is a necessary step prior to conducting EDA
and it's called data munging. There are plenty of tools for doing this kind of work,
and if you're interested in learning more about how to wrangle and adjust data,
check out our data munging course. Now, this course on EDA won't cover data
munging, but techniques for doing so are vital for any data scientist.
Congratulations
Congratulations on finishing Lesson 2. In this lesson you learned how to use RStudio
and you learned about the basic commands in R. If you found something particularly
helpful, or if you have ideas for how we can improve the course, let us know by
posting in the forum. In the next lesson, you'll learn how to visualize and summarize
single variables within a data set. We hope to see you there.