---
title: "Lesson 3: R Sandbox Activity"
output: html_document
---

## Background for this activity


Welcome to the sandbox! This activity is going to provide you with the
opportunity to preview some of the cool things you can do in `R` that you
will be learning in this course. You will learn more about working with
packages and data and try out some important functions.

In this activity, you are going to install and load `R` packages; practice
using functions to view, clean, and visualize data; and learn more about
using `R markdown` to document your analysis. `R` is a powerful tool that
can do a lot of different things; this sandbox activity will help you get
more comfortable using `R` while demonstrating some of its functions in
action. In later activities, you will also get the opportunity to write
your own R code!

## Step 1: Using `R packages`


`Packages` are a key part of working with `R`. They contain bundles of code
called `functions` that allow you to perform a wide range of tasks in `R`.
Some of them even contain datasets that you can use to practice the skills
you have been learning throughout this course.

Some `packages` are installed by default, but many others can be downloaded
from an external source such as the Comprehensive R Archive Network, or
CRAN.

In this activity, you will be using a package called `tidyverse`. The
`tidyverse` package is actually a collection of individual `packages` that
can help you perform a wide variety of analysis tasks.

To install the `tidyverse` package, execute the code in the code chunk
below by clicking on the green arrow button in the top right corner. When
you execute a code chunk in RMarkdown, the output will appear in the .rmd
area and your console.

```{r}
install.packages("tidyverse")
```
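If you rerun this document often, reinstalling `tidyverse` every time is slow. As a minimal sketch, the helper below installs a package only when `requireNamespace()` cannot find it. The name `install_if_missing` is hypothetical and not part of the course materials, and the `repos` URL is an assumption; any CRAN mirror should work.

```r
# Hypothetical helper: install a package only if it is not already available.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package can be loaded afterwards.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("tidyverse")
```

You still need to call `library(tidyverse)` afterwards; installing a package does not load it.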

Once a package is installed, you can load it by running the `library()`
function with the package name inside the parentheses, like this:

```{r}
library(tidyverse)
```

Installing and loading the `tidyverse` package may take a few minutes; be
sure to wait for it to finish running before moving on to the next steps!

Once the chunk above has finished running, you will get a report that
summarizes which packages were loaded because you ran the `library()`
function. The report will also let you know if there are any
`functions` that have a conflict, but you don't need to worry about that
for now.

Now that you have loaded an `R package`, you can start exploring some data.

## Step 2: Viewing data

Many of the `tidyverse` packages contain sample datasets that you can use
to practice your `R` skills. The `diamonds` dataset in the `ggplot2`
package is a great example for previewing `R` functions.
Because you already loaded this package in the last step, the `diamonds`
dataset is ready for you to use.

One common function you can use to preview the data is the `head()`
function, which displays the columns and the first several rows of data.
You can test out how the `head()` function works by running the chunk
below:

```{r}
head(diamonds)
```
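By default `head()` shows six rows. As a small sketch of how you might adjust that, the chunk below passes the `n` argument and also uses the related base `R` functions `tail()` and `dim()`. The variable names `first_ten` and `last_three` are only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Show the first 10 rows instead of the default 6.
first_ten <- head(diamonds, n = 10)
first_ten

# tail() works the same way but starts from the end of the data.
last_three <- tail(diamonds, n = 3)
last_three

# dim() reports the number of rows and columns.
dim(diamonds)
```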

In addition to `head()`, there are a number of other useful functions you
can use to summarize or preview the data. For example, the `str()` and
`glimpse()` functions will both return summaries of each column in your
data arranged horizontally. You can try out these two functions by running
the code chunks below:

```{r}
str(diamonds)
```

```{r}
glimpse(diamonds)
```

Another simple function that you may use regularly is the `colnames()`
function. It returns the column names of your dataset as a character
vector. You can check out this function by running the code chunk below:

```{r}
colnames(diamonds)
```

After running the code chunk, you may have noticed a number in brackets.
This number is the position of the first column name printed on that line,
which helps you count the columns in your dataset. If you have data with
lots of columns and `colnames()` prints the results on multiple lines, each
line will start with a number in brackets indicating which column comes
first on that line. So, for example, "carat" is the first column in the
`diamonds` dataset. If the second line starts with the number seven in
brackets, the name next to it, "price", is the seventh column.
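The bracketed numbers are ordinary positions, so you can use them directly. The short sketch below counts the columns of `diamonds`, looks one up by position, and then finds the position of a column from its name.

```r
library(ggplot2)  # provides the diamonds dataset

# Count the columns and look one up by position.
ncol(diamonds)
colnames(diamonds)[7]

# Find the position of a column from its name.
which(colnames(diamonds) == "price")
```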

## Step 3: Cleaning data

One of the most frequent tasks you will have to perform as an analyst is to
clean and organize your data. `R` makes this easy! There are many functions
you can use to help you perform important tasks easily and quickly.

For example, you might need to rename the columns, or variables, in your
data. There is a function for that: `rename()`. You can check out how it
works in the chunk below:

```{r}
rename(diamonds, carat_new = carat)
```

Here, the function is being used to change the name of `carat` to
`carat_new`. This is a pretty basic change, but `rename()` has many options
that can help you do more complex changes across all of the variables in
your data.

For example, you can rename more than one variable in the same `rename()`
code. The code below demonstrates how:
```{r}
rename(diamonds, carat_new = carat, cut_new = cut)
```
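One detail worth seeing in code: `rename()` returns a modified copy and does not alter `diamonds` itself. In the sketch below, the name `diamonds_renamed` is only an illustrative choice for storing that copy.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# rename() returns a modified copy; diamonds itself is left unchanged.
diamonds_renamed <- rename(diamonds, carat_new = carat, cut_new = cut)

colnames(diamonds_renamed)[1:2]  # the new names
colnames(diamonds)[1:2]          # the original names are still in place
```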

Another handy function for summarizing your data is `summarize()`. You can
use it to generate a wide range of summary statistics for your data. For
example, if you wanted to know what the mean for `carat` was in this
dataset, you could run the code in the chunk below:

```{r}
summarize(diamonds, mean_carat = mean(carat))
```
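`summarize()` can compute several statistics in one call. The sketch below is one possible extension of the chunk above; the column names `max_price` and `n_diamonds` and the variable `diamond_summary` are illustrative choices, not part of the activity.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Each argument becomes one column in a one-row result.
diamond_summary <- summarize(
  diamonds,
  mean_carat = mean(carat),
  max_price = max(price),
  n_diamonds = n()  # n() counts the rows
)
diamond_summary
```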

These functions are a great way to get more familiar with your data and
start making observations about it. But sometimes, previewing tables isn't
enough to understand a dataset. Luckily, `R` has visualization tools built
in.

## Step 4: Visualizing data


With `R`, you can create data visualizations that are simple and easy to
understand or complicated and beautiful just by changing a bit of code. `R`
empowers you to present the same data in so many different ways, which can
help you create new insights or highlight important data findings. One of
the most commonly used visualization packages is the `ggplot2` package,
which is loaded automatically when you install and load `tidyverse`. The
`diamonds` dataset that you have been using so far is a `ggplot2` dataset.

To build a visualization with `ggplot2`, you layer plot elements together
with a `+` symbol. You will learn a lot more about using `ggplot2` later in
the course, but here is a preview of how easy and flexible it is to make
visuals using code:

```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()
```

The code above takes the `diamonds` data, plots the carat column on the
x-axis, the price column on the y-axis, and represents the data as a scatter
plot using the `geom_point()` command.

`ggplot2` makes it easy to modify or improve your visuals. For example, if
you wanted to change the color of each point so that it represented another
variable, such as the cut of the diamond, you could change the code like
this:

```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()
```

Wow, that's a busy visual! Sometimes when you are trying to represent many
different aspects of your data in a visual, it can help to separate out
some of the components. For example, you could create a different plot for
each type of cut. `ggplot2` makes it easy to do this with the
`facet_wrap()` function:

```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  facet_wrap(~cut)
```
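Because plots are built in layers, you can keep adding to them. The sketch below is one possible variation on the chunk above: it stores the plot in a variable, makes the points semi-transparent with `alpha`, and adds labels with `labs()`. The variable name, the `alpha` value, and the label text are illustrative choices.

```r
library(ggplot2)

# Build the plot in layers and store it in a variable.
carat_price_plot <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.3) +   # semi-transparent points reduce overplotting
  facet_wrap(~cut) +          # one panel per cut
  labs(title = "Diamond price by carat", x = "Carat", y = "Price")

# Printing the variable draws the plot.
carat_price_plot
```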
You will learn many other ways of working with `ggplot2` to make functional
and beautiful visuals later on. For now, hopefully you understand that it
is both flexible and powerful!

## Step 5: Documentation

You have been working in an `R markdown` file, which allows you to put code
and writing in the same place. Markdown is a simple language for adding
formatting to text documents. For example, all of the section headers have
been formatted by adding `##` to the beginning of the line. Markdown can be
used to format the text in other ways, such as creating bulleted lists:

- So if you have a list of things
- Or want to write bullets for another reason
- You can do that using markdown

When you have written, executed, and documented your code in an
`R markdown` document like this, you can use the `knit` button in the menu
bar at the top of the editing pane to export your work to a beautiful,
readable document for others.
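Knitting can also be done from code with `rmarkdown::render()`. The chunk below is a minimal sketch: the file name `sandbox_activity.Rmd` is hypothetical, so replace it with the path to your own document. The call is skipped when the file or the `rmarkdown` package is not available.

```r
# Hypothetical file name; replace with the path to your own .Rmd file.
rmd_file <- "sandbox_activity.Rmd"

# Render only when the file exists and rmarkdown is installed.
if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
}
```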

## Activity Wrap-up
You have had a chance to explore more `R` tools that you can start using on
your own. You learned how to install and load `R packages`; use functions
for viewing, cleaning, and visualizing data; and use `R markdown` to export
your work. Feel free to continue exploring these functions in the rmd file,
or use this code in your own RStudio project space. As you practice on your
own, consider how `R` is similar to and different from the tools you have
already learned in this program, and how you might start using it for your
own data analysis projects. `R` provides a lot of flexibility and utility
that can make it a key tool in a data analyst's tool kit.

Make sure to mark this activity as complete in Coursera.

---
title: "Lesson 2: Importing and working with data"
output: html_document
---

## Background for this activity / Introduction


By now, you have some experience manually entering data in `R` to create a
data frame. As a data analyst, you will also commonly import data from
external files into `R` and store it in a data frame for analysis. In this
activity you will learn how to import data from outside of `R` using the
`read_csv()` function. Then, once you have imported a data file, you will
use `R` functions to manipulate and interact with that data.

You can start importing and exploring data with the code chunks in the RMD
space. To run a code chunk, click the green arrow in the top-right corner
of the chunk. The output of the executed code will appear in the RMD space
and your console.

Throughout this activity, you will also have the opportunity to practice
writing your own code by making changes to the code chunks yourself. If you
encounter an error or get stuck, you can always check the
Lesson2_Import_Solutions.rmd file in the Solutions folder under Week 3 for
the complete, correct code.

## The scenario
In this scenario, you are a junior data analyst working for a hotel booking
company. You have been asked to clean a .csv file that was created after
querying a database to combine two different tables from different hotels.
In order to learn more about this data, you are going to need to use
functions to preview the data's structure, including its columns and rows.
You will also need to use basic cleaning functions to prepare this data for
analysis.

## Step 1: Load packages

Start by installing your required package. If you have already installed
and loaded `tidyverse` in this session, feel free to skip the code chunks
in this step.

```{r}
install.packages("tidyverse")
```

Once a package is installed, you can load it by running the `library()`
function with the package name inside the parentheses:

```{r}
library(tidyverse)
```

## Step 2: Import data

One of the most common file types data analysts import into `R` is the
comma-separated values file, or .csv file. The `readr` package, which is
part of the `tidyverse`, has a number of functions for "reading in" or
importing data, including .csv files and other external sources.

In the chunk below, use the `read_csv()` function to import data from
a .csv in the project folder called "hotel_bookings.csv" and save it as a
data frame called `bookings_df`.

If the line in the chunk below causes an error, copy in the line
`setwd("projects/Course 7/Week 3")` before it.

The results will display as column specifications:

```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
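To see what `read_csv()` does without the course file, you can try it on a tiny file of your own. The sketch below writes three lines to a temporary .csv and reads them back. The rows are hypothetical values made up for illustration, not taken from hotel_bookings.csv, and `sample_df` is an illustrative name.

```r
library(readr)

# Hypothetical rows for illustration only; not taken from hotel_bookings.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("hotel,adr,adults",
    "City Hotel,100,2",
    "Resort Hotel,80,1"),
  csv_path
)

# The first line of the file becomes the column names.
sample_df <- read_csv(csv_path, show_col_types = FALSE)
sample_df
```

Setting `show_col_types = FALSE` hides the column specification message; leave it out to see the same kind of report the activity describes.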

Now that you have the `bookings_df`, you can work with it using all of the
`R` functions you have learned so far.

## Step 3: Inspect & clean data

One common function you can use to preview the data is the `head()`
function, which returns the columns and first several rows of data. Check
out the `head()` function by running the chunk below:

```{r}
head(bookings_df)
```

In addition to `head()`, there are a number of other useful functions to
summarize or preview your data frame. For example, the `str()` function
will provide summaries of each column in your data arranged horizontally.
Check out the `str()` function by running the code chunk below:

```{r}
str(bookings_df)
```

To find out what columns you have in your data frame, try running the
`colnames()` function in the code chunk below:

```{r}
colnames(bookings_df)
```

If you want to create another data frame using `bookings_df` that focuses
on the average daily rate, which is referred to as `adr` in the data frame,
and `adults`, you can use the following code chunk to do that:

```{r}
new_df <- select(bookings_df, adr, adults)
```

To create new variables in your data frame, you can use the `mutate()`
function. It returns a new data frame with the added column. Unless you
assign that result to a name, `new_df` stays as it was, and the original
data set you imported is never changed.

```{r}
mutate(new_df, total = adr / adults)
```
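The sketch below illustrates this on a small made-up data frame: the `mutate()` result has to be assigned to be kept, and dividing by a row with zero adults produces `Inf`. The values and the names `toy_df` and `toy_with_total` are hypothetical.

```r
library(dplyr)

# Hypothetical values for illustration only.
toy_df <- tibble(adr = c(100, 80, 60), adults = c(2, 1, 0))

# Without assignment, the result is printed and then discarded.
mutate(toy_df, total = adr / adults)
colnames(toy_df)  # still only adr and adults

# Assign the result to keep the new column.
toy_with_total <- mutate(toy_df, total = adr / adults)
toy_with_total$total  # the last value is Inf because adults is 0
```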

## Step 4: Import your own data

Now you can find your own .csv to import! Using the RStudio Cloud
interface, import and save the file in the same folder as this R Markdown
document. To do this, go to the Files tab in the lower-right console. Then,
click the Upload button next to the + New Folder button. This will open a
popup to let you browse your computer for a file. Select any .csv file,
then click Open. Now, write code in the chunk below to read that data into
`R`:

```{r}

```
You can check the solutions document for this activity to check your work.

## Activity Wrap Up
Now that you know how to import data using the `read_csv()` function, you
will be able to work with data that has been stored externally right in
your `R` console. You can continue to practice these skills by modifying
the code chunks in the rmd file, or use this code as a starting point in
your own project console. As you become more familiar with the process of
importing data, consider how importing data from a .csv file changed the
way you accessed and interacted with the data. Did you do anything
differently? Being able to import data from external sources will allow you
to work with even more data, giving you even more options for analyzing
data in `R`.

Make sure to mark this activity as complete in Coursera.

---
title: "Lesson 2: Import Solutions"
output: html_document
---

## Importing and working with data activity solutions


This document contains the solutions for the importing and working with
data activity. You can use these solutions to check your work and ensure
that your code is correct or troubleshoot your code if it is returning
errors. If you haven't completed the activity yet, we suggest you go back
and finish it before reading the solutions.

If you experience errors, remember that you can search the internet and the
RStudio community for help:
https://community.rstudio.com/#

## Step 1: Load packages

Start by installing your required package. If you have already installed
and loaded `tidyverse` in this session, feel free to skip the code chunks
in this step.

```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```
## Step 2: Import data

The data in this example is originally from the article Hotel Booking
Demand Datasets
(https://www.sciencedirect.com/science/article/pii/S2352340918315191),
written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief,
Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for
#TidyTuesday during the week of February 11th, 2020
(https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

You can learn more about the dataset here:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

In the chunk below, you will use the `read_csv()` function to import data
from a .csv in the project folder called "hotel_bookings.csv" and save it
as a data frame called `bookings_df`:

```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```

Now that you have the `bookings_df`, you can work with it using all of the
`R` functions you have learned so far.

## Step 3: Inspect & clean data

One common function you can use to preview the data is the `head()`
function, which returns the columns and first several rows of data. Check
out the `head()` function by running the chunk below:

```{r}
head(bookings_df)
```

Check out the `str()` function by running the code chunk below:

```{r}
str(bookings_df)
```
To find out what columns you have in your data frame, try running the
`colnames()` function in the code chunk below:

```{r}
colnames(bookings_df)
```

If you want to create another data frame using `bookings_df` that focuses
on the average daily rate, which is referred to as `adr` in the data frame,
and `adults`, you can use the following code chunk to do that:

```{r}
new_df <- select(bookings_df, adr, adults)
```

To create new variables in your data frame, you can use the `mutate()`
function. It returns a new data frame with the added column. Unless you
assign that result to a name, `new_df` stays as it was, and the original
data set you imported is never changed.

```{r}
mutate(new_df, total = adr / adults)
```

## Step 4: Import your own data

Now you can find your own .csv to import! Using the RStudio Cloud
interface, import and save the file in the same folder as this R Markdown
document. Then write code in the chunk below to read that data into `R`:
```{r}
own_df <- read_csv("<filename.csv>")
```

---
title: "Lesson 3: Cleaning data"
output: html_document
---

## Background for this activity

In this activity, you'll review a scenario and focus on cleaning real
data in `R`. You will learn more about data cleaning functions and perform
basic calculations to gain initial insights into your data.

Throughout this activity, you will also have the opportunity to practice
writing your own code by making changes to the code chunks yourself. If you
encounter an error or get stuck, you can always check the
Lesson2_Clean_Solutions.rmd file in the Solutions folder under Week 3 for
the complete, correct code.

## The scenario

In this scenario, you are a junior data analyst working for a hotel booking
company. You have been asked to clean a .csv file that was created after
querying a database to combine two different tables from different hotels.
In order to learn more about this data, you are going to need to use
functions to preview the data's structure, including its columns and rows.
You will also need to use basic cleaning functions to prepare this data for
analysis.

## Step 1: Load packages


In order to start cleaning your data, you will need to install the
required packages. If you have already installed and loaded `tidyverse`,
`skimr`, and `janitor` in this session, feel free to skip the code chunks
in this step.

```{r}
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
```

Once a package is installed, you can load it by running the `library()`
function with the package name inside the parentheses:

```{r}
library(tidyverse)
library(skimr)
library(janitor)
```

## Step 2: Import data

The data you have been asked to clean is currently an external .csv file.
In order to view and clean it in `R`, you will need to import it. The
`readr` package, which is part of the `tidyverse`, has a number of
functions for "reading in" or importing data, including .csv files.

In the chunk below, you will use the `read_csv()` function to import data
from a .csv file in the project folder called "hotel_bookings.csv" and save
it as a data frame called `bookings_df`:

If the line in the chunk below causes an error, copy in the line
`setwd("projects/Course 7/Week 3")` before it.

```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
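When `read_csv()` cannot find the file, it usually helps to check which folder `R` is reading from. The sketch below is one possible way to do that with the base `R` functions `getwd()` and `file.exists()` before importing; the variable name `csv_path` is an illustrative choice.

```r
csv_path <- "hotel_bookings.csv"

# Show the folder R is currently reading from.
getwd()

# Import only if the file is actually there; otherwise explain what is missing.
if (file.exists(csv_path)) {
  bookings_df <- readr::read_csv(csv_path)
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```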

## Step 3: Getting to know your data

Before you start cleaning your data, take some time to explore it. You can
use several functions that you are already familiar with to preview your
data, including the `head()` function in the code chunk below:

```{r}
head(bookings_df)
```

You can also summarize or preview the data with the `str()` and `glimpse()`
functions to get a better understanding of the data by running the code
chunks below:

```{r}
str(bookings_df)
```

```{r}
glimpse(bookings_df)
```

You can also use `colnames()` to check the names of the columns in your
data set. Run the code chunk below to find out the column names in this
data set:
```{r}
colnames(bookings_df)
```

Some packages contain more advanced functions for summarizing and exploring
your data. One example is the `skimr` package, which has a number of
functions for this purpose. For example, the `skim_without_charts()`
function provides a detailed summary of the data. Try running the code
below:

```{r}
skim_without_charts(bookings_df)
```

## Step 4: Cleaning your data

Based on the functions you have used so far, how would you describe your
data in a brief to your stakeholder? Now, let's say you are primarily
interested in the following variables: 'hotel', 'is_canceled', and
'lead_time'. Create a new data frame with just those columns, calling it
`trimmed_df` by adding the variable names to this code chunk:

```{r}
trimmed_df <- bookings_df %>%
select( , , )
```

Remember to check the solutions doc if you are having trouble filling out
any of these code chunks.

You might notice that some of the column names aren't very intuitive, so
you will want to rename them to make them easier to understand. You might
want to create the same exact data frame as above, but rename the variable
'hotel' to be named 'hotel_type' to be crystal clear on what the data is
about.

Fill in the space to the left of the '=' symbol with the new variable name:

```{r}
trimmed_df %>%
select(hotel, is_canceled, lead_time) %>%
rename( = hotel)
```

Another common task is to either split or combine data in different
columns. In this example, you can combine the arrival month and year into
one column using the `unite()` function:

```{r}
example_df <- bookings_df %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"),
        sep = " ")
```
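To see the effect of `unite()` without the full data set, the sketch below applies the same call to two made-up rows and then splits the column again with `separate()`, also from `tidyr`. The values and the names `toy_df`, `united_df`, and `split_df` are hypothetical.

```r
library(tidyr)

# Hypothetical values for illustration only.
toy_df <- data.frame(
  arrival_date_year = c(2015, 2016),
  arrival_date_month = c("July", "August")
)

# Combine month and year into one column; the two source columns are dropped.
united_df <- unite(toy_df, arrival_month_year,
                   c("arrival_date_month", "arrival_date_year"), sep = " ")
united_df$arrival_month_year

# separate() reverses the step by splitting on the same separator.
split_df <- separate(united_df, arrival_month_year,
                     into = c("arrival_date_month", "arrival_date_year"),
                     sep = " ")
colnames(split_df)
```

After `separate()`, the year comes back as text rather than a number, which is worth keeping in mind if you plan to do arithmetic with it.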

## Step 5: Another way of doing things

You can also use the `mutate()` function to make changes to your columns.
Let's say you wanted to create a new column that summed up all the adults,
children, and babies on a reservation for the total number of people.
Modify the code chunk below to create that new column:

```{r}
example_df <- bookings_df %>%
mutate(guests = )

head(example_df)
```

Great. Now it's time to calculate some summary statistics! Calculate the
total number of canceled bookings and the average lead time for booking.
You'll want to start your code after the `%>%` symbol. Make a column called
'number_canceled' to represent the total number of canceled bookings. Then,
make a column called 'average_lead_time' to represent the average lead
time. Use the `summarize()` function to do this in the code chunk below:

```{r}

example_df <- bookings_df %>%

head(example_df)
```

If you are having trouble completing any of the code chunks in these
activities, remember that you can reference the RMarkdown documents in the
'Solutions' folder for help.

## Activity Wrap Up
Now you have some experience cleaning and analyzing data in `R`; you used
basic cleaning functions like `rename()` and `clean_names()` and performed
basic calculations on real data. You can continue to practice these skills
by modifying the code chunks in the rmd file, or use this code as a
starting point in your own project console. One of the reasons `R` is such
a powerful tool for data analysis is that you can perform so many
different tasks in one place. With the functions you have been learning in
this course, you can import data, create and view data frames, and even
clean data without leaving your console.

Make sure to mark this activity as complete in Coursera.

---
title: "Lesson 3: Cleaning Solutions"
output: html_document
---

## Cleaning data solutions


This document contains the solutions for the cleaning data activity. You
can use these solutions to check your work and ensure that your code is
correct or troubleshoot your code if it is returning errors. If you haven't
completed the activity yet, we suggest you go back and finish it before
reading the solutions.

If you experience errors, remember that you can search the internet and the
RStudio community for help:
https://community.rstudio.com/#

## Step 1: Load packages

Start by installing the required packages. If you have already installed
and loaded `tidyverse`, `skimr`, and `janitor` in this session, feel free
to skip the code chunks in this step.

```{r}
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
```

Once a package is installed, you can load it by running the `library()`
function with the package name inside the parentheses:

```{r}
library(tidyverse)
library(skimr)
library(janitor)
```
## Step 2: Import data
The data in this example is originally from the article Hotel Booking
Demand Datasets
(https://www.sciencedirect.com/science/article/pii/S2352340918315191),
written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief,
Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for
#TidyTuesday during the week of February 11th, 2020
(https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

You can learn more about the dataset here:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

In the chunk below, you will use the `read_csv()` function to import data
from a .csv in the project folder called "hotel_bookings.csv" and save it
as a data frame called `bookings_df`:

```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
## Step 3: Getting to know your data

Before you start cleaning your data, take some time to explore it. You can
use several functions that you are already familiar with to preview your
data, including the `head()` function in the code chunk below:

```{r}
head(bookings_df)
```

You can summarize or preview the data with the `str()` and `glimpse()`
functions to get a better understanding of the data by running the code
chunks below:

```{r}
str(bookings_df)
```

```{r}
glimpse(bookings_df)
```

You can also use `colnames()` to check the names of the columns in your
data set. Run the code chunk below to find out the column names in this
data set:

```{r}
colnames(bookings_df)
```
Use the `skim_without_charts()` function from the `skimr` package by
running the code below:

```{r}
skim_without_charts(bookings_df)
```
## Step 4: Cleaning your data

Based on your notes, you are primarily interested in the following
variables: hotel, is_canceled, lead_time. Create a new data frame with just
those columns, calling it `trimmed_df`.

```{r}
trimmed_df <- bookings_df %>%
select(hotel, is_canceled, lead_time)
```

Rename the variable 'hotel' to be named 'hotel_type' to be crystal clear on
what the data is about:

```{r}
trimmed_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename(hotel_type = hotel)
```
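The chunk above prints the renamed data frame but does not store it. As an alternative sketch, `select()` can rename a column while selecting it, and the result can be assigned to keep it. The rows below are hypothetical values for illustration, and `toy_bookings` and `trimmed_renamed_df` are illustrative names.

```r
library(dplyr)

# Hypothetical rows for illustration only.
toy_bookings <- tibble(
  hotel = c("City Hotel", "Resort Hotel"),
  is_canceled = c(0, 1),
  lead_time = c(10, 45)
)

# Select and rename in one step, then store the result.
trimmed_renamed_df <- toy_bookings %>%
  select(hotel_type = hotel, is_canceled, lead_time)

colnames(trimmed_renamed_df)
```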

In this example, you can combine the arrival month and year into one column
using the unite() function:

```{r}
example_df <- bookings_df %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"),
        sep = " ")
```

## Step 5: Another way of doing things

You can also use the `mutate()` function to make changes to your columns.
Let's say you wanted to create a new column that summed up all the adults,
children, and babies on a reservation for the total number of people.
Modify the code chunk below to create that new column:

```{r}
example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)

head(example_df)
```
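If any of the three columns contains a missing value, the sum for that row is `NA`. The sketch below shows this on made-up data and one possible way to treat missing values as zero with `coalesce()` from `dplyr`. Whether zero is the right replacement is an assumption you would need to confirm for your own data; the values and names are hypothetical.

```r
library(dplyr)

# Hypothetical values for illustration only; the second row has a missing value.
toy_df <- tibble(adults = c(2, 1), children = c(1, NA), babies = c(0, 0))

guests_df <- toy_df %>%
  mutate(
    guests = adults + children + babies,                   # NA if any input is NA
    guests_no_na = adults + coalesce(children, 0) + babies # treat missing as 0
  )
guests_df
```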

Great. Now it's time to calculate some summary statistics! Calculate the
total number of canceled bookings and the average lead time for booking.
You'll want to start your code after the `%>%` symbol. Make a column called
'number_canceled' to represent the total number of canceled bookings. Then,
make a column called 'average_lead_time' to represent the average lead
time. Use the `summarize()` function to do this in the code chunk below:

```{r}
example_df <- bookings_df %>%
  summarize(number_canceled = sum(is_canceled),
            average_lead_time = mean(lead_time))

head(example_df)
```
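Summing `is_canceled` counts cancellations because the column holds 1 for canceled bookings and 0 otherwise. The sketch below repeats the calculation on four made-up rows so the result is easy to verify by hand, then adds `group_by()` from `dplyr` as one possible extension that produces one summary row per hotel. The values and names are hypothetical.

```r
library(dplyr)

# Hypothetical rows for illustration only.
toy_bookings <- tibble(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(0, 1, 1, 0),
  lead_time = c(10, 20, 30, 40)
)

# One row for the whole data frame: 2 cancellations, average lead time 25.
overall <- toy_bookings %>%
  summarize(number_canceled = sum(is_canceled),
            average_lead_time = mean(lead_time))
overall

# One row per hotel.
by_hotel <- toy_bookings %>%
  group_by(hotel) %>%
  summarize(number_canceled = sum(is_canceled),
            average_lead_time = mean(lead_time))
by_hotel
```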
