MIT 302 - Statistical Computing II - Tutorial 04


MIT 302 – STATISTICAL COMPUTING II
TUTORIAL 4: Data Visualization with R

1 Overview
Data visualization is a crucial aspect of statistical computing as it allows us to explore and
communicate patterns, relationships, and insights in our data effectively. R provides a wide
range of powerful packages and functions for creating various types of visualizations, from
basic plots to complex and interactive graphics.
1.1 Basic Plots
R offers several functions for creating basic plots, including:
• Scatter plots: Use the "plot" function to visualize the relationship between two numeric
variables.
• Line plots: Use the "plot" or "lines" function to display trends or time series data.
• Bar plots: Use the "barplot" function to compare categorical variables.
• Histograms: Use the "hist" function to visualize the distribution of a numeric variable.
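As a minimal sketch of the four basic plot types listed above, the following uses only base R and the built-in "mtcars", "pressure", and "iris" datasets. The choice of variables and the object names "bar_mids" and "h" are illustrative assumptions, not part of the tutorial.

```r
# Scatter plot: relationship between two numeric variables
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon",
     main = "Scatter plot")

# Line plot: type = "l" joins consecutive points with a line
plot(pressure$temperature, pressure$pressure, type = "l",
     xlab = "Temperature", ylab = "Pressure",
     main = "Line plot")

# Bar plot: counts of cars per number of cylinders
cyl_counts <- table(mtcars$cyl)
bar_mids <- barplot(cyl_counts,
                    xlab = "Cylinders", ylab = "Number of cars",
                    main = "Bar plot")

# Histogram: distribution of a single numeric variable
h <- hist(iris$Sepal.Length,
          xlab = "Sepal length", main = "Histogram")
```

Both "barplot" and "hist" return a value (bar midpoints and a "histogram" object, respectively), which can be kept for later annotation or inspection.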
1.2 Advanced Plots
R provides packages like "ggplot2" that offer more flexibility and customization for creating
advanced plots. The "ggplot2" package follows the grammar of graphics approach, allowing
you to build plots layer by layer.
• Box plots: Use the "geom_boxplot" function to create box plots, which display the
distribution of a numeric variable across different categories.
• Violin plots: Use the "geom_violin" function to create violin plots, which show the
kernel density of a numeric variable for each category and are often overlaid with a box plot.
• Heatmaps: Use the base R "heatmap" function or "geom_tile" in "ggplot2" to visualize the
patterns and relationships in a matrix of values.
• Scatterplot matrices: Use the "pairs" function or the "ggpairs" function from the
"GGally" package to create scatterplot matrices for exploring multiple variables.
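The plot types above might be sketched as follows, assuming the "ggplot2" package is installed and using the built-in "iris" dataset. The object names "p_box" and "p_violin" are illustrative.

```r
library(ggplot2)

# Box plot: distribution of sepal length within each species
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# Violin plot with a narrow box plot layered on top of the density shape
p_violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1)

print(p_box)
print(p_violin)

# Scatterplot matrix of the four numeric columns (base R)
pairs(iris[, 1:4])
```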
1.3 Interactive and Dynamic Visualizations
R provides packages like "plotly" and "shiny" for creating interactive and dynamic
visualizations.
• Interactive plots: Use the "plot_ly" function from the "plotly" package to create
interactive plots that allow zooming, panning, and tooltips.
• Dashboards and web applications: Use the "shiny" package to build interactive web
applications and dashboards that update dynamically based on user inputs.
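A minimal interactive plot might look like the sketch below, assuming the "plotly" package is installed. The variables chosen from "mtcars" and the use of row names as hover text are illustrative choices.

```r
library(plotly)

# Interactive scatter plot; hovering over a point shows the car's name
p <- plot_ly(
  data = mtcars,
  x = ~hp,
  y = ~mpg,
  type = "scatter",
  mode = "markers",
  text = rownames(mtcars)
)

# Printing the object in an interactive session opens the plot in the viewer
p
```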
1.4 Geographic Visualizations
R has packages like "ggplot2" and "leaflet" for creating geographic visualizations.
• Choropleth maps: Use the "geom_map" function in "ggplot2" or the "addPolygons"
function in "leaflet" to create choropleth maps, which represent data using different
colors on a map.
• Interactive maps: Use the "leaflet" package to create interactive and customizable maps
with various layers, markers, and pop-ups.
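As a small sketch of an interactive map, assuming the "leaflet" package is installed: the coordinates below point at Auckland, New Zealand, and are used purely for illustration.

```r
library(leaflet)

# Base map with default OpenStreetMap tiles and a single marker with a pop-up
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = 174.768, lat = -36.852, popup = "The birthplace of R")

# Printing the object in an interactive session displays the map
m
```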
1.5 Customizing Plots
R allows you to customize plots by modifying aspects like labels, titles, axes, colors, and
themes.
• Labels and titles: Use functions like "labs" in "ggplot2" to customize plot labels and
titles.
• Axes and scales: Use functions like "scale_x_continuous" or "scale_x_discrete" in
"ggplot2" to modify the appearance of axes and scales.
• Colors and themes: Use functions like "scale_color_manual" or "theme" in "ggplot2"
to customize colors and themes of your plots.
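The customization functions above can be combined in a single plot. The following is a sketch under the assumption that "ggplot2" is installed; the labels, axis breaks, and colour values are illustrative choices rather than recommendations from the tutorial.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  # Labels and titles
  labs(title = "Fuel efficiency and weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  # Axes and scales: fixed limits with a tick at every whole number
  scale_x_continuous(breaks = seq(1, 6, by = 1), limits = c(1, 6)) +
  # Colors: one manually chosen value per cylinder group
  scale_color_manual(values = c("4" = "#1B9E77",
                                "6" = "#D95F02",
                                "8" = "#7570B3")) +
  # Themes: move the legend below the plot
  theme(legend.position = "bottom")

print(p)
```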
R's extensive ecosystem of packages, along with its flexibility and programmability, makes it
a powerful tool for data visualization. By combining data visualization with statistical analysis,
you can gain deeper insights and effectively communicate your findings.

2 Advanced Plotting Techniques using ggplot2


ggplot2 is a powerful R package for creating visualizations that follows the grammar of
graphics approach. It provides a flexible and intuitive framework for constructing complex and
customized plots. Here are some advanced plotting techniques using ggplot2:
2.1 Faceting
Faceting allows you to create multiple plots based on subsets of your data. It is useful for
comparing patterns across different groups or levels of a categorical variable. ggplot2 provides
the "facet_wrap" and "facet_grid" functions for faceting.
2.1.1 Example
Let's say we have a dataset called "iris" that contains measurements of sepal length, sepal width,
petal length, petal width, and species for different flowers. We can create a scatter plot of sepal
length vs. sepal width, faceted by species:
library(ggplot2)

# Load the iris dataset
data(iris)

# Create a scatter plot with faceting
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~ Species)
This code uses the "ggplot" function to initialize the plot and specifies the data frame and
aesthetic mappings. The "geom_point" function adds the scatter points, and the "facet_wrap"
function creates separate panels for each species.
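The "facet_grid" function, mentioned above, arranges panels by two variables at once. A minimal sketch, assuming the "mpg" dataset that ships with "ggplot2" (the choice of "drv" for rows and "cyl" for columns is illustrative):

```r
library(ggplot2)

# Rows are split by drive train (drv), columns by number of cylinders (cyl)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

print(p)
```

Unlike "facet_wrap", which wraps a single sequence of panels, "facet_grid" keeps a strict row-by-column layout, so combinations with no data appear as empty panels.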
2.2 Layering
ggplot2 allows you to layer multiple geometric objects and statistical transformations to create
complex visualizations. This layering approach enables you to add different types of plots, such
as lines, bars, and smooth curves, to the same plot.
2.2.1 Example
Let's continue with the "iris" dataset and create a scatter plot of sepal length vs. sepal width,
adding a fitted regression line for each species:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
In this code, we use the "geom_smooth" function to add a linear regression line for each species.
Because "color = Species" is set in the main aesthetic mapping, a separate line is fitted per
species. The "method" argument specifies the type of smoothing method, and "se = FALSE" removes
the confidence intervals.
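Layers can also use different aesthetic mappings. The sketch below is one possible variation, not part of the original example: the points are coloured by species, while a single loess curve with its confidence band is fitted to all observations together.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  # The color mapping applies to this layer only
  geom_point(aes(color = Species)) +
  # One smooth curve across all species, with the confidence band shown
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, color = "black")

print(p)
```

Placing "color = Species" inside "geom_point" rather than in "ggplot" is what keeps the smoother from being split by species.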
2.3 Customizing Themes
ggplot2 allows you to customize the appearance of your plots by modifying themes, axes,
labels, and colors. This flexibility helps you create visually appealing plots that match your
preferences or adhere to specific style guidelines.
2.3.1 Example
Let's customize the theme and color scheme of a scatter plot of petal length vs. petal width
from the "iris" dataset:
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter Plot of Petal Length vs. Petal Width",
       x = "Petal Length",
       y = "Petal Width") +
  scale_color_manual(values = c("setosa" = "red",
                                "versicolor" = "blue",
                                "virginica" = "green"))
In this code, we use the "theme_minimal" function to set a minimalistic theme for our plot. The
"labs" function is used to set the plot title and axis labels. The "scale_color_manual" function
allows us to set custom colors for each species.
2.4 Annotations
ggplot2 provides various functions to add annotations, such as text labels, arrows, and reference
lines, to your plots. Annotations help highlight specific points or provide additional
information.
2.4.1 Example
Let's add text labels to a scatter plot of sepal length vs. sepal width to highlight the data
points whose sepal width is greater than 4:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  # Species is a factor, so convert it with as.character(); otherwise
  # ifelse() would return the integer level codes instead of the names
  geom_text(aes(label = ifelse(Sepal.Width > 4, as.character(Species), "")),
            hjust = 0, vjust = 0)
In this code, we use the "geom_text" function to add text labels to the plot. The "ifelse"
function conditionally assigns labels based on a specific criterion (here, a sepal width
greater than 4); all other points receive an empty label. The "hjust" and "vjust" arguments
specify the horizontal and vertical justification of the labels.
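Arrows and reference lines can be added in a similar way. The following sketch uses "annotate" and "geom_hline"; the object name "widest", the label text, and the offsets used to position the arrow are illustrative assumptions.

```r
library(ggplot2)

# Find the flower with the widest sepal so the annotation points at real data
widest <- iris[which.max(iris$Sepal.Width), ]

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  # Reference line at the mean sepal width
  geom_hline(yintercept = mean(iris$Sepal.Width), linetype = "dashed") +
  # Arrow drawn from the label towards the point
  annotate("segment",
           x = widest$Sepal.Length + 1, y = widest$Sepal.Width,
           xend = widest$Sepal.Length + 0.05, yend = widest$Sepal.Width,
           arrow = arrow(length = unit(0.2, "cm"))) +
  # Text label placed at the tail of the arrow
  annotate("text",
           x = widest$Sepal.Length + 1.05, y = widest$Sepal.Width,
           label = "Widest sepal", hjust = 0)

print(p)
```

Unlike "geom_text", "annotate" takes fixed positions rather than a mapping from the data, which suits one-off labels.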
These advanced plotting techniques using ggplot2 provide you with the necessary tools to
create highly customized and informative visualizations. By combining faceting, layering,
theme customization, and annotations, you can effectively communicate complex patterns and
relationships in your data.

3 Interactive Visualizations using Shiny


Shiny is an R package that allows you to create interactive web applications and dashboards
directly from your R code. With Shiny, you can build dynamic and interactive visualizations
that respond to user input, enabling you to explore and analyse your data in real-time. Here are
the key aspects of creating interactive visualizations using Shiny:
3.1 Building the User Interface (UI)
In Shiny, the user interface (UI) defines the layout and controls of your interactive visualization.
You can create various input elements such as sliders, checkboxes, dropdown menus, and text
boxes, which allow users to interact with your visualization.
3.1.1 Example
Let's say we have a dataset called "mtcars" that contains information about different car models.
We can create a Shiny application with a scatter plot of horsepower vs. miles per gallon (mpg)
and a slider input for selecting the number of cylinders to display:
library(shiny)

# Define the UI
ui <- fluidPage(
  titlePanel("Interactive Scatter Plot"),
  sidebarLayout(
    sidebarPanel(
      # mtcars only contains cars with 4, 6, or 8 cylinders,
      # so the slider moves in steps of 2
      sliderInput(inputId = "cylinders",
                  label = "Number of Cylinders",
                  min = min(mtcars$cyl),
                  max = max(mtcars$cyl),
                  value = min(mtcars$cyl),
                  step = 2)
    ),
    mainPanel(
      plotOutput(outputId = "scatterPlot")
    )
  )
)

# Run the Shiny application
# (call this only after "server" has been defined, see Section 3.2)
shinyApp(ui = ui, server = server)
In this code, we define the UI using the "fluidPage" function. The "titlePanel" function sets the
title of the application. The "sidebarLayout" function creates a sidebar panel for the slider input
and a main panel for the scatter plot. The "sliderInput" function creates the slider input element,
and "plotOutput" specifies where the scatter plot will be displayed.
3.2 Defining the Server Logic
The server logic in Shiny defines the behavior and interactivity of your visualization. It consists
of reactive expressions and functions that respond to user input and update the output
accordingly.
3.2.1 Example
Continuing with the previous example, let's define the server logic to generate the scatter plot
based on the selected number of cylinders:
# Define the server
server <- function(input, output) {
  output$scatterPlot <- renderPlot({
    filteredData <- mtcars[mtcars$cyl == input$cylinders, ]
    # Stop quietly if no cars match the selected number of cylinders
    req(nrow(filteredData) > 0)
    plot(x = filteredData$hp, y = filteredData$mpg,
         xlab = "Horsepower", ylab = "Miles per Gallon",
         main = paste("Scatter Plot for", input$cylinders, "Cylinders"))
  })
}
In this code, we use the "renderPlot" function to define the reactive output that generates
the scatter plot. The "filteredData" object filters the "mtcars" dataset based on the selected
number of cylinders, and "req" prevents an error when the filter returns no rows. The scatter
plot is then created using the "plot" function, with the axes labelled and the title dynamically
updated based on the selected number of cylinders.
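Putting the UI and the server together, a complete application might look like the sketch below. It is a variation on the example rather than the tutorial's own code: it assumes the "shiny" package is installed, swaps the slider for "selectInput" so that only cylinder counts present in the data can be chosen, and moves the filtering into a "reactive" expression.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Interactive Scatter Plot"),
  sidebarLayout(
    sidebarPanel(
      # Offer only the cylinder counts that occur in mtcars
      selectInput(inputId = "cylinders",
                  label = "Number of Cylinders",
                  choices = sort(unique(mtcars$cyl)))
    ),
    mainPanel(plotOutput(outputId = "scatterPlot"))
  )
)

server <- function(input, output) {
  # Reactive expression: re-evaluated only when input$cylinders changes.
  # selectInput returns a character value, hence as.numeric().
  filteredData <- reactive({
    mtcars[mtcars$cyl == as.numeric(input$cylinders), ]
  })

  output$scatterPlot <- renderPlot({
    plot(x = filteredData()$hp, y = filteredData()$mpg,
         xlab = "Horsepower", ylab = "Miles per Gallon",
         main = paste("Scatter Plot for", input$cylinders, "Cylinders"))
  })
}

# Build the app object; printing it in an interactive session starts the app
app <- shinyApp(ui = ui, server = server)
```

A reactive expression is called like a function ("filteredData()"), and its result can be shared by several outputs without repeating the filter.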
3.3 Deploying the Shiny Application
Once you have defined the UI and server logic, you can deploy your Shiny application locally
or on a web server. This allows you to share your interactive visualization with others.
3.3.1 Example
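To run the Shiny application locally, save the UI definition, the server function, and the
"shinyApp" call in a file named "app.R", set the working directory to the folder that contains
this file, and use the "runApp" function:
runApp(appDir = ".", launch.browser = TRUE)
This code will launch the Shiny application in your default web browser, allowing you to
interact with the scatter plot and dynamically update it based on the selected number of
cylinders.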
By combining the UI elements, server logic, and deployment capabilities of Shiny, you can
create powerful and interactive visualizations that enable users to explore and analyse data in
real-time. Shiny's versatility makes it suitable for a wide range of applications, from simple
interactive plots to complex dashboards.

4 Customizing Visualizations for Effective Communication


Creating visualizations is not just about presenting data; it is also about effectively
communicating insights and conveying a clear message to the audience. Customizing
visualizations in R allows you to enhance the clarity, aesthetics, and storytelling aspects of your
plots. Here are some key considerations and techniques for customizing visualizations for
effective communication:
4.1 Choosing the Right Plot Type
Selecting an appropriate plot type is crucial for effectively communicating your message.
Different plot types are suitable for different data types and research questions. Consider factors
such as the nature of the data, the relationships you want to highlight, and the audience's
familiarity with different plot types.
4.1.1 Example
Let's consider the "mtcars" dataset again, which contains information about different car
models. Suppose we want to compare the fuel efficiency (mpg) of cars with different numbers of
cylinders. A box plot can effectively show the distribution and median values of the mpg
variable across the cylinder groups:
library(ggplot2)

# Load the mtcars dataset
data(mtcars)

# Create a box plot of mpg by number of cylinders
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon") +
  ggtitle("Fuel Efficiency by Number of Cylinders")
In this code, we use the "ggplot" function to set up the plot and specify the aesthetics. The
"geom_boxplot" function adds the box plot layer to the plot, and we customize the axis labels
and the plot title accordingly.
4.2 Applying Appropriate Colour Schemes
Colour choice plays a crucial role in visualizations as it can convey information, highlight
patterns, and evoke emotions. Use colour schemes that are visually appealing, accessible to
colourblind individuals, and effectively differentiate between different groups or categories in
your data.
4.2.1 Example
Let's consider the "iris" dataset, which contains measurements of different flower species.
Suppose we want to create a scatter plot of petal length vs. petal width, differentiating the
species using a colour scheme:
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  xlab("Petal Length") +
  ylab("Petal Width") +
  ggtitle("Scatter Plot of Petal Length vs. Petal Width") +
  scale_color_manual(values = c("setosa" = "red",
                                "versicolor" = "blue",
                                "virginica" = "green"))
In this code, we use the "scale_color_manual" function to manually specify colours for each
species. It is important to choose colours that are visually distinct and not misleading.
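Red and green can be hard to tell apart for readers with the most common forms of colour blindness. One possible alternative, sketched here as an assumption rather than as the tutorial's method, is the Okabe-Ito palette, which is available through the base R function "palette.colors" in R 4.0.0 or later. The object name "okabe_ito" and the additional "shape" mapping are illustrative.

```r
library(ggplot2)

# Take entries 2 to 4 so that the first Okabe-Ito colour (black) is skipped,
# then name the colours after the species they should represent
okabe_ito <- unname(palette.colors(n = 4, palette = "Okabe-Ito"))[2:4]
names(okabe_ito) <- levels(iris$Species)

p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                      color = Species, shape = Species)) +
  geom_point() +
  xlab("Petal Length") +
  ylab("Petal Width") +
  scale_color_manual(values = okabe_ito)

print(p)
```

Mapping "shape" to the same variable gives a second visual cue, so the groups remain distinguishable even when printed in greyscale.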
4.3 Simplifying and Enhancing Labels
Clear and concise labels are essential for effective communication. Customize axis labels,
titles, and legends to provide meaningful context and facilitate understanding. Avoid cluttering
the plot with excessive text or unnecessary details.
4.3.1 Example
Let's consider the "mpg" dataset, which ships with "ggplot2" and contains information about
different car models' fuel efficiency. Suppose we want to create a scatter plot of highway miles
per gallon (hwy) vs. city miles per gallon (cty), highlighting the car manufacturer:
ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer)) +
  geom_point() +
  xlab("City MPG") +
  ylab("Highway MPG") +
  ggtitle("Fuel Efficiency Comparison") +
  theme(legend.title = element_blank())
In this code, we customize the axis labels and the plot title, and remove the legend title using
the "theme" function. By simplifying the labels and removing unnecessary elements, we can focus
the viewer's attention on the main message of the visualization.
4.4 Adding Annotations
Annotations can provide additional context, highlight important observations, or guide the
viewer's attention to specific details in the plot. Annotations can take the form of text labels,
arrows, reference lines, or shaded areas, depending on the information you want to convey.
4.4.1 Example
Let's continue with the previous example and add a reference line to the scatter plot to mark
where city and highway miles per gallon are equal:
ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  xlab("City MPG") +
  ylab("Highway MPG") +
  ggtitle("Fuel Efficiency Comparison") +
  theme(legend.title = element_blank())
In this code, we use the "geom_abline" function to add a dashed reference line with an intercept
of 0 and a slope of 1. This line helps viewers understand the relationship between city and
highway miles per gallon and identify deviations from that line.
By considering plot types, colour schemes, labels, and annotations, you can effectively
customize visualizations in R to enhance communication and convey your message clearly.
These customization techniques help capture the audience's attention, facilitate understanding,
and support data-driven storytelling.
