---
output: html_document
---
In this activity, you are going to install and load `R` packages; practice
using functions to view, clean, and visualize data; and learn more about
using `R markdown` to document your analysis. `R` is a powerful tool that
can do a lot of different things; this sandbox activity will help you get
more comfortable using `R` while demonstrating some of its functions in
action. In later activities, you will also get the opportunity to write
your own R code!
Some `packages` are installed by default, but many others can be downloaded
from an external source such as the Comprehensive R Archive Network, or
CRAN.
To install the `tidyverse` package, execute the code in the code chunk
below by clicking on the green arrow button in the top right corner. When
you execute a code chunk in RMarkdown, the output will appear in the .rmd
area and your console.
```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```
Installing and loading the `tidyverse` package may take a few minutes; be
sure to wait for it to finish running before moving on to the next steps!
Once the chunk above has finished running, you will get a report that
summarizes which packages were loaded when you ran the `library()`
function. The report will also let you know if any `functions` have a
conflict, but you don't need to worry about that for now.
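If you rerun this document often, reinstalling the package every time is slow. A minimal sketch, assuming you only want to install `tidyverse` when it is missing, checks for it first with base `R`'s `requireNamespace()`:

```{r}
# Install tidyverse only if it is not already available on this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading is still required in every new R session.
library(tidyverse)
```

The check only skips the download; `library()` still has to run each session.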
Now that you have loaded an `R package`, you can start exploring some data.
Many of the `tidyverse` packages contain sample datasets that you can use
to practice your `R` skills. The `diamonds` dataset in the `ggplot2`
package is a great example for previewing `R` functions.
Because you already loaded this package in the last step, the `diamonds`
dataset is ready for you to use.
One common function you can use to preview the data is the `head()`
function, which displays the columns and the first several rows of data.
You can test out how the `head()` function works by running the chunk
below:
```{r}
head(diamonds)
```
You can also summarize or preview the data with the `str()` and `glimpse()`
functions by running the code chunks below:
```{r}
str(diamonds)
```
```{r}
glimpse(diamonds)
```
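These preview functions accept a few extra options. As a small sketch (the choice of three rows is arbitrary), `head()` takes an `n` argument, and base `R`'s `dim()` reports the number of rows and columns:

```{r}
library(ggplot2)  # provides the diamonds dataset

# Show only the first three rows instead of the default six.
head(diamonds, n = 3)

# Number of rows and columns in the dataset.
dim(diamonds)
```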
Another simple function that you may use regularly is the `colnames()`
function. It returns a character vector of the column names in your
dataset. You can check out this function by running the code chunk below:
```{r}
colnames(diamonds)
```
After running the code chunk, you may have noticed a number in brackets.
This number is the position of the first item printed on that line, which
helps you count the columns in your dataset. If your data has many columns
and `colnames()` prints the results on multiple lines, each line starts
with a bracketed number giving the position of its first column name. For
example, "carat" is the first column in the `diamonds` dataset. If a second
line starts with the number seven in brackets, its first entry, "price", is
the seventh column.
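You don't have to count by hand. One way to look up a column's position, sketched here with base `R`'s `match()` function, is:

```{r}
library(ggplot2)  # provides the diamonds dataset

# Position of the "price" column among the column names.
price_position <- match("price", colnames(diamonds))
price_position

# Total number of columns.
ncol(diamonds)
```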
One of the most frequent tasks you will have to perform as an analyst is to
clean and organize your data. `R` makes this easy! There are many functions
you can use to help you perform important tasks easily and quickly.
For example, you might need to rename the columns, or variables, in your
data. There is a function for that: `rename()`. You can check out how it
works in the chunk below:
```{r}
rename(diamonds, carat_new = carat)
```
You can also rename more than one variable in the same `rename()` call.
The code below demonstrates how:
```{r}
rename(diamonds, carat_new = carat, cut_new = cut)
```
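Note that `rename()` returns a new data frame and leaves `diamonds` itself untouched. A short sketch, using the hypothetical name `diamonds_renamed`, shows how to keep the result:

```{r}
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Store the renamed copy under a new (hypothetical) name.
diamonds_renamed <- rename(diamonds, carat_new = carat, cut_new = cut)

# The copy has the new names...
colnames(diamonds_renamed)

# ...while the original dataset still has the old ones.
colnames(diamonds)
```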
Another handy function for summarizing your data is `summarize()`. You can
use it to generate a wide range of summary statistics for your data. For
example, if you wanted to know the mean of `carat` in this dataset, you
could run the code in the chunk below:
```{r}
summarize(diamonds, mean_carat = mean(carat))
```
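`summarize()` can compute several statistics at once, and it pairs naturally with `group_by()`. As an illustrative sketch (the output column names are made up for this example), the chunk below summarizes `carat` for each `cut`:

```{r}
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    mean_carat = mean(carat),  # average carat within each cut
    max_carat = max(carat),    # largest carat within each cut
    n_diamonds = n()           # number of rows within each cut
  )

carat_by_cut
```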
These functions are a great way to get more familiar with your data and
start making observations about it. But sometimes, previewing tables isn't
enough to understand a dataset. Luckily, `R` has strong visualization
tools, including the `ggplot2` package that was loaded with `tidyverse`.
```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()
```
The code above takes the `diamonds` data, plots the carat column on the
x-axis and the price column on the y-axis, and represents the data as a
scatter plot using the `geom_point()` function.
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()
```
Wow, that's a busy visual! Sometimes when you are trying to represent many
different aspects of your data in a visual, it can help to separate out
some of the components. For example, you could create a different plot for
each type of cut. `ggplot2` makes it easy to do this with the
`facet_wrap()` function:
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  facet_wrap(~cut)
```
You will learn many other ways of working with `ggplot2` to make functional
and beautiful visuals later on. For now, hopefully you understand that it
is both flexible and powerful!
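As one small taste of that flexibility, the sketch below (the transparency value and label text are arbitrary choices, not part of the original activity) makes overlapping points semi-transparent and adds axis labels:

```{r}
library(ggplot2)

carat_price_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +  # lower alpha lets dense regions show through
  labs(x = "Carat", y = "Price (US dollars)", title = "Diamond price by carat")

carat_price_plot
```

Storing the plot in a variable first means you can keep adding layers to it later.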
## Step 5: Documentation
You have been working in an `R markdown` file, which allows you to put code
and writing in the same place. Markdown is a simple language for adding
formatting to text documents. For example, all of the section headers have
been formatted by adding `##` to the beginning of the line. Markdown can be
used to format the text in other ways, such as creating bulleted lists by
starting each line with `-` or `*`.
## Activity Wrap-up
You have had a chance to explore more `R` tools that you can start using on
your own. You learned how to install and load `R packages`; use functions
for viewing, cleaning, and visualizing data; and use `R markdown` to export
your work. Feel free to continue exploring these functions in the rmd file,
or use this code in your own RStudio project space. As you practice on your
own, consider how `R` is similar to and different from the tools you have
already learned in this program, and how you might start using it for your
own data analysis projects. `R` provides a lot of flexibility and utility
that can make it a key tool in a data analyst's tool kit.
You can start importing and exploring data with the code chunks in the RMD
space. To run a code chunk, click the green arrow in the top-right corner
of the chunk. The output of the executed code will appear in the RMD space
and your console.
Throughout this activity, you will also have the opportunity to practice
writing your own code by making changes to the code chunks yourself. If you
encounter an error or get stuck, you can always check the
Lesson2_Import_Solutions.rmd file in the Solutions folder under Week 3 for
the complete, correct code.
## The scenario
In this scenario, you are a junior data analyst working for a hotel booking
company. You have been asked to clean a .csv file that was created after
querying a database to combine two different tables from different hotels.
In order to learn more about this data, you are going to need to use
functions to preview the data's structure, including its columns and rows.
You will also need to use basic cleaning functions to prepare this data for
analysis.
```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```
One of the most common file types data analysts import into `R` is the
comma-separated values file, or .csv file. The `readr` package, part of the
`tidyverse`, has a number of functions for "reading in" or importing data
from .csv files and other external sources.
In the chunk below, use the `read_csv()` function to import data from
a .csv in the project folder called "hotel_bookings.csv" and save it as a
data frame called `bookings_df`.
```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
Now that you have the `bookings_df`, you can work with it using all of the
`R` functions you have learned so far.
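If you want to see what `read_csv()` does without the project file, here is a self-contained sketch. It writes a tiny made-up .csv to a temporary location (the values are invented for illustration and are not from the hotel data) and reads it back:

```{r}
library(readr)

# Write a small, made-up .csv file to a temporary location.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("hotel,adr,adults",
    "City Hotel,100.5,2",
    "Resort Hotel,80,1"),
  demo_path
)

# read_csv() guesses a type for each column as it imports the file.
demo_df <- read_csv(demo_path, show_col_types = FALSE)
demo_df
```

Setting `show_col_types = FALSE` only silences the column-type message; the import itself is unchanged.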
One common function you can use to preview the data is the `head()`
function, which returns the columns and first several rows of data. Check
out the `head()` function by running the chunk below:
```{r}
head(bookings_df)
```
Check out the `str()` function by running the code chunk below:
```{r}
str(bookings_df)
```
To find out what columns you have in your data frame, try running the
`colnames()` function in the code chunk below:
```{r}
colnames(bookings_df)
```
If you want to create another data frame using `bookings_df` that focuses
on the average daily rate, which is referred to as `adr` in the data frame,
and `adults`, you can use the following code chunk to do that:
```{r}
new_df <- select(bookings_df, `adr`, adults)
```
To create new variables in your data frame, you can use the `mutate()`
function. It returns a modified copy of the data frame; neither `new_df`
nor the original data set you imported changes unless you assign the
result to a name. The source data will remain unchanged.
```{r}
mutate(new_df, total = `adr` / adults)
```
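One thing to watch for: dividing by `adults` gives `Inf` for any row where `adults` is zero. A hedged sketch on made-up values, using `dplyr`'s `if_else()` to return `NA` in that case, might look like this:

```{r}
library(dplyr)

# Made-up values for illustration; the last row has zero adults.
demo_df <- tibble(adr = c(100, 80, 60), adults = c(2, 1, 0))

demo_df <- mutate(
  demo_df,
  # Keep the ratio only when there is at least one adult; otherwise record NA.
  total = if_else(adults > 0, adr / adults, NA_real_)
)

demo_df
```

Whether `NA` is the right replacement depends on your analysis; it is only one reasonable choice.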
Now you can find your own .csv to import! Using the RStudio Cloud
interface, import and save the file in the same folder as this R Markdown
document. To do this, go to the Files tab in the lower-right pane. Then,
click the Upload button next to the + New Folder button. This will open a
popup to let you browse your computer for a file. Select any .csv file,
then click Open. Now, write code in the chunk below to read that data into
`R`:
```{r}
```
You can check the solutions document for this activity to check your work.
## Activity Wrap-up
Now that you know how to import data using the `read_csv()` function, you
will be able to work with data that has been stored externally right in
your `R` console. You can continue to practice these skills by modifying
the code chunks in the rmd file, or use this code as a starting point in
your own project console. As you become more familiar with the process of
importing data, consider how importing data from a .csv file changed the
way you accessed and interacted with the data. Did you do anything
differently? Being able to import data from external sources will allow you
to work with even more data, giving you even more options for analyzing
data in `R`.
If you experience errors, remember that you can search the internet and the
RStudio community for help:
https://community.rstudio.com/#
```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```
## Step 2: Import data
The data in this example is originally from the article Hotel Booking
Demand Datasets
(https://www.sciencedirect.com/science/article/pii/S2352340918315191),
written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief,
Volume 22, February 2019.
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for
#TidyTuesday during the week of February 11th, 2020
(https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).
In the chunk below, you will use the `read_csv()` function to import data
from a .csv in the project folder called "hotel_bookings.csv" and save it
as a data frame called `bookings_df`:
```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
Now that you have the `bookings_df`, you can work with it using all of the
`R` functions you have learned so far.
One common function you can use to preview the data is the `head()`
function, which returns the columns and first several rows of data. Check
out the `head()` function by running the chunk below:
```{r}
head(bookings_df)
```
Check out the `str()` function by running the code chunk below:
```{r}
str(bookings_df)
```
To find out what columns you have in your data frame, try running the
`colnames()` function in the code chunk below:
```{r}
colnames(bookings_df)
```
If you want to create another data frame using `bookings_df` that focuses
on the average daily rate, which is referred to as `adr` in the data frame,
and `adults`, you can use the following code chunk to do that:
```{r}
new_df <- select(bookings_df, `adr`, adults)
```
To create new variables in your data frame, you can use the `mutate()`
function. It returns a modified copy of the data frame; neither `new_df`
nor the original data set you imported changes unless you assign the
result to a name. The source data will remain unchanged.
```{r}
mutate(new_df, total = `adr` / adults)
```
Now you can find your own .csv to import! Using the RStudio Cloud
interface, import and save the file in the same folder as this R Markdown
document. Then write code in the chunk below to read that data into `R`:
```{r}
own_df <- read_csv("<filename.csv>")
```
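If you are unsure of the exact file name, base `R`'s `list.files()` can list the .csv files in the current working directory. This is only a sketch; it assumes your working directory is the project folder:

```{r}
# List every file in the working directory whose name ends in .csv.
csv_files <- list.files(pattern = "\\.csv$")
csv_files
```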
Throughout this activity, you will also have the opportunity to practice
writing your own code by making changes to the code chunks yourself. If you
encounter an error or get stuck, you can always check the
Lesson2_Clean_Solutions.rmd file in the Solutions folder under Week 3 for
the complete, correct code.
## The scenario
In this scenario, you are a junior data analyst working for a hotel booking
company. You have been asked to clean a .csv file that was created after
querying a database to combine two different tables from different hotels.
In order to learn more about this data, you are going to need to use
functions to preview the data's structure, including its columns and rows.
You will also need to use basic cleaning functions to prepare this data for
analysis.
```{r}
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
```
```{r}
library(tidyverse)
library(skimr)
library(janitor)
```
The data you have been asked to clean is currently an external .csv file.
In order to view and clean it in `R`, you will need to import it. The
`readr` package, part of the `tidyverse`, has a number of functions for
"reading in" or importing data, including .csv files.
In the chunk below, you will use the `read_csv()` function to import data
from a .csv file in the project folder called "hotel_bookings.csv" and save
it as a data frame called `bookings_df`:
```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
Before you start cleaning your data, take some time to explore it. You can
use several functions that you are already familiar with to preview your
data, including the `head()` function in the code chunk below:
```{r}
head(bookings_df)
```
You can also summarize or preview the data with the `str()` and `glimpse()`
functions to get a better understanding of the data by running the code
chunks below:
```{r}
str(bookings_df)
```
```{r}
glimpse(bookings_df)
```
You can also use `colnames()` to check the names of the columns in your
data set. Run the code chunk below to find out the column names in this
data set:
```{r}
colnames(bookings_df)
```
Some packages contain more advanced functions for summarizing and exploring
your data. One example is the `skimr` package, which has a number of
functions for this purpose. For example, the `skim_without_charts()`
function provides a detailed summary of the data. Try running the code
below:
```{r}
skim_without_charts(bookings_df)
```
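Among other things, that summary reports missing values in each column. If you only need that count, a minimal sketch with `dplyr`'s `across()` (shown on made-up data rather than the hotel file) is:

```{r}
library(dplyr)

# Made-up data with one missing value in the children column.
demo_df <- tibble(
  adults = c(2, 1, 2),
  children = c(0, NA, 1)
)

# Count the NA values in every column.
missing_counts <- demo_df %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

missing_counts
```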
Based on the functions you have used so far, how would you describe your
data in a brief to your stakeholder? Now, let's say you are primarily
interested in the following variables: 'hotel', 'is_canceled', and
'lead_time'. Create a new data frame with just those columns, called
`trimmed_df`, by adding the variable names to this code chunk:
```{r}
trimmed_df <- bookings_df %>%
  select( , , )
```
Remember to check the solutions doc if you are having trouble filling out
any of these code chunks.
You might notice that some of the column names aren't very intuitive, so
you will want to rename them to make them easier to understand. You might
want to create the same data frame as above, but rename the variable
'hotel' to 'hotel_type' so it is crystal clear what the data is about.
Fill in the space to the left of the '=' symbol with the new variable name:
```{r}
trimmed_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename( = hotel)
```
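When many column names need tidying at once, the `janitor` package loaded earlier offers `clean_names()`, which converts names to lowercase words separated by underscores. A small sketch on a made-up data frame (the messy names are invented for illustration):

```{r}
library(janitor)

# Made-up data frame with inconsistent column names.
messy_df <- data.frame(
  "Hotel Type" = c("City Hotel", "Resort Hotel"),
  "Lead.Time" = c(10, 45),
  check.names = FALSE  # keep the messy names exactly as written
)

cleaned_df <- clean_names(messy_df)
colnames(cleaned_df)
```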
```{r}
example_df <- bookings_df %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"),
        sep = " ")
```
You can also use the `mutate()` function to make changes to your columns.
Let's say you wanted to create a new column that summed up all the adults,
children, and babies on a reservation for the total number of people.
Modify the code chunk below to create that new column:
```{r}
example_df <- bookings_df %>%
  mutate(guests = )
head(example_df)
```
Great. Now it's time to calculate some summary statistics! Calculate the
total number of canceled bookings and the average lead time for booking.
You'll want to start your code after the `%>%` symbol. Make a column called
'number_canceled' to represent the total number of canceled bookings. Then,
make a column called 'average_lead_time' to represent the average lead
time. Use the `summarize()` function to do this in the code chunk below:
```{r}
head(example_df)
```
If you are having trouble completing any of the code chunks in these
activities, remember that you can reference the RMarkdown documents in the
'Solutions' folder for help.
## Activity Wrap-up
Now you have some experience cleaning and analyzing data in `R`; you used
basic cleaning functions like `rename()` and `clean_names()` and performed
basic calculations on real data. You can continue to practice these skills
by modifying the code chunks in the rmd file, or use this code as a
starting point in your own project console. One of the reasons `R` is such
a powerful tool for data analysis is that you can perform so many
different tasks in one place. With the functions you have been learning in
this course, you can import data, create and view data frames, and even
clean data without leaving your console.
If you experience errors, remember that you can search the internet and the
RStudio community for help:
https://community.rstudio.com/#
```{r}
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
```
```{r}
library(tidyverse)
library(skimr)
library(janitor)
```
## Step 2: Import data
The data in this example is originally from the article Hotel Booking
Demand Datasets
(https://www.sciencedirect.com/science/article/pii/S2352340918315191),
written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief,
Volume 22, February 2019.
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for
#TidyTuesday during the week of February 11th, 2020
(https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).
In the chunk below, you will use the `read_csv()` function to import data
from a .csv in the project folder called "hotel_bookings.csv" and save it
as a data frame called `bookings_df`:
```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
## Step 3: Getting to know your data
Before you start cleaning your data, take some time to explore it. You can
use several functions that you are already familiar with to preview your
data, including the `head()` function in the code chunk below:
```{r}
head(bookings_df)
```
You can summarize or preview the data with the `str()` and `glimpse()`
functions to get a better understanding of the data by running the code
chunks below:
```{r}
str(bookings_df)
```
```{r}
glimpse(bookings_df)
```
You can also use `colnames()` to check the names of the columns in your
data set. Run the code chunk below to find out the column names in this
data set:
```{r}
colnames(bookings_df)
```
Use the `skim_without_charts()` function from the `skimr` package by
running the code below:
```{r}
skim_without_charts(bookings_df)
```
## Step 4: Cleaning your data
```{r}
trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time)
```
```{r}
trimmed_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename(hotel_type = hotel)
```
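`select()` can also rename a column while selecting it, which gives the same result in one step. A sketch on made-up rows:

```{r}
library(dplyr)

# Made-up rows standing in for bookings_df.
demo_df <- tibble(
  hotel = c("City Hotel", "Resort Hotel"),
  is_canceled = c(0, 1),
  lead_time = c(10, 45),
  adults = c(2, 1)
)

# new_name = old_name inside select() keeps the column under the new name.
renamed_df <- demo_df %>%
  select(hotel_type = hotel, is_canceled, lead_time)

colnames(renamed_df)
```

The separate `rename()` step is arguably easier to read; this form is just more compact.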
In this example, you can combine the arrival month and year into one column
using the `unite()` function:
```{r}
example_df <- bookings_df %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"),
        sep = " ")
```
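To see what `unite()` produces, here is a sketch on two made-up rows. By default, `unite()` removes the input columns and keeps only the combined one:

```{r}
library(dplyr)
library(tidyr)

# Two made-up rows with the same column names as the hotel data.
demo_df <- tibble(
  arrival_date_year = c(2015, 2016),
  arrival_date_month = c("July", "August")
)

# Paste month and year together, separated by a space.
united_df <- demo_df %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"),
        sep = " ")

united_df
```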
You can also use the `mutate()` function to make changes to your columns.
Let's say you wanted to create a new column that summed up all the adults,
children, and babies on a reservation for the total number of people.
Modify the code chunk below to create that new column:
```{r}
example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)
head(example_df)
```
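If any of the three columns contains a missing value, the sum for that row is `NA`. One possible way to handle this, sketched on made-up rows and assuming a missing count should be treated as zero, is `dplyr`'s `coalesce()`:

```{r}
library(dplyr)

# Made-up rows; the second booking has a missing children count.
demo_df <- tibble(
  adults = c(2, 1),
  children = c(1, NA),
  babies = c(0, 0)
)

demo_df <- demo_df %>%
  mutate(
    guests = adults + children + babies,                    # NA when any input is NA
    guests_no_na = adults + coalesce(children, 0) + babies  # missing children counted as 0
  )

demo_df
```

The column name `guests_no_na` is hypothetical, and treating a missing count as zero is an assumption you should confirm against the data's documentation.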
Great. Now it's time to calculate some summary statistics! Calculate the
total number of canceled bookings and the average lead time for booking.
You'll want to start your code after the `%>%` symbol. Make a column called
'number_canceled' to represent the total number of canceled bookings. Then,
make a column called 'average_lead_time' to represent the average lead
time. Use the `summarize()` function to do this in the code chunk below:
```{r}
example_df <- bookings_df %>%
  summarize(number_canceled = sum(is_canceled),
            average_lead_time = mean(lead_time))
head(example_df)
```
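Because the code above sums `is_canceled` to count cancellations, the column is treated as 0 or 1, so its mean is the share of bookings that were canceled. A sketch on made-up rows, with `cancellation_rate` as a hypothetical extra column:

```{r}
library(dplyr)

# Made-up rows: two of the four bookings were canceled.
demo_df <- tibble(
  is_canceled = c(1, 0, 0, 1),
  lead_time = c(10, 20, 30, 40)
)

summary_df <- demo_df %>%
  summarize(
    number_canceled = sum(is_canceled),     # count of canceled bookings
    average_lead_time = mean(lead_time),    # mean lead time across all rows
    cancellation_rate = mean(is_canceled)   # proportion of bookings canceled
  )

summary_df
```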