A Grammar of Graphics

The key takeaways are that the grammar of graphics is an abstraction that makes thinking about and communicating graphics easier. Ggplot is a package inspired by the grammar of graphics that makes creating statistical graphics easier. Future directions include continuing to develop foundations, dissemination, and exploring new graphical methods.

The grammar of graphics is an abstraction developed by Leland Wilkinson that makes thinking about, reasoning about, and communicating graphics easier. It provides a framework for describing graphical mappings between data and visual properties of graphical objects.

Ggplot is a high-level package for R that makes creating statistical graphics easier. It is inspired by and implements the grammar of graphics. Ggplot has a rich set of components and user-friendly wrappers. It aims to make graphics easier to create, facilitate new types of displays, and provide a continuum of expertise for users.

A grammar of graphics: past, present, and future


Hadley Wickham
Iowa State University
http://had.co.nz/

Past

If any number of
magnitudes are each
the same multiple of
the same number of
other magnitudes,
then the sum is that
multiple of the sum.
Euclid, ~300 BC

$m \left( \sum x \right) = \sum (m x)$

The grammar of graphics

An abstraction which makes thinking about, reasoning about and communicating graphics easier
Developed by Leland Wilkinson, particularly in The Grammar of Graphics 1999/2005
One of the cornerstones of my research (I'll talk about the others later)

Present

ggplot

High-level package for creating statistical graphics.
A rich set of components + user-friendly wrappers
Inspired by The Grammar of Graphics, Leland Wilkinson 1999
John Chambers award in 2006

Philosophy of ggplot
Examples from a recent paper
New methods facilitated by ggplot

Philosophy

Make graphics easier
Use the grammar to facilitate research into new types of display
Continuum of expertise:
  start simple by using the results of the theory
  grow in power by understanding the theory
  begin to contribute new components
Orthogonal components and minimal special cases should make learning easy(er?)

Examples

J. Hobbs, H. Wickham, H. Hofmann, and D. Cook. Glaciers melt as mountains warm: A graphical case study. Computational Statistics. Special issue for ASA Statistical Computing and Graphics Data Expo 2006.

Exploratory graphics created with GGobi, Mondrian, Manet, Gauguin and R, but needed consistent high-quality graphics that work in black and white for publication
So... used ggplot to recreate the graphics

qplot(long, lat, data = expo, geom = "tile", fill = ozone,
  facets = year ~ month) +
  scale_fill_gradient(low = "white", high = "black") + map

ggplot(df, aes(x = long + res * x, y = lat + res * y)) + map +
  geom_polygon(aes(group = interaction(long, lat)), fill = NA, colour = "black")

ggplot(rexpo, aes(x = long + res * rtime, y = lat + res * rpressure)) +
  map + geom_line(aes(group = id))

library(maps)
outlines <- as.data.frame(map("world", xlim = -c(113.8, 56.2), ylim = c(-21.2, 36.2)))
map <- c(
  geom_path(aes(x = x, y = y), data = outlines, colour = alpha("grey20", 0.2)),
  scale_x_continuous("", limits = c(-113.8, -56.2), breaks = c(-110, -85, -60)),
  scale_y_continuous("", limits = c(-21.2, 36.2))
)

[Figure: temperature (270 to 310) against date (1995 to 2000)]

qplot(date, temperature, data = clustered, group = id, geom = "line") +
  pacific + elnino(clustered$temperature)

# Build layers that highlight the rows matching `condition`
brush <- function(condition, background = "grey60", brush = "red") {
  # Capture the unevaluated condition as a string
  cond_string <- deparse(substitute(condition), width.cutoff = 500)
  colour <- paste(
    "ifelse(", cond_string, ", '", brush, "', '", background, "')", sep = ""
  )
  order <- paste("ifelse(", cond_string, ", 2, 1)", sep = "")
  size <- paste("ifelse(", cond_string, ", 2, 1)", sep = "")
  list(
    aes_string(colour = colour, order = order, size = size),
    scale_colour_identity(),
    scale_size_identity()
  )
}

# brush() must be defined before it is used
pacific <- brush(cluster %in% c(5, 6))
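
To see what brush() passes on to aes_string(), the string-building step can be run on its own in base R. This is only a sketch of that one step, not code from the talk: the helper name brush_colour_string and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: reproduces only the string-building step of brush()
brush_colour_string <- function(condition, background = "grey60", brush = "red") {
  # substitute() captures the unevaluated condition; deparse() turns it into text
  cond_string <- deparse(substitute(condition), width.cutoff = 500)
  paste("ifelse(", cond_string, ", '", brush, "', '", background, "')", sep = "")
}

colour_expr <- brush_colour_string(cluster %in% c(5, 6))
print(colour_expr)

# Evaluate the generated expression against a toy data frame (made-up values)
toy <- data.frame(cluster = c(1, 5, 6, 2))
toy_colours <- eval(parse(text = colour_expr), toy)
print(toy_colours)
```

Because the condition is stored as text, it is only evaluated later against whatever data the plot uses, which is what lets a single brush be added to different plots.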

ggplot(clustered, aes(x = long, y = lat)) +
  geom_tile(aes(width = 2.5, height = 2.5, fill = factor(cluster))) +
  facet_grid(cluster ~ .) +
  map +
  scale_fill_brewer(palette = "Spectral")

qplot(date, value, data = clusterm, group = id,
  geom = "line", facets = cluster ~ variable,
  colour = factor(cluster)) +
  scale_y_continuous("", breaks = NA) +
  scale_colour_brewer(palette = "Spectral")

New methods
Supplemental statistical summaries
Iterating between graphics and models
Tables of plots
Inspired by ideas of Tukey (and others)
Exploratory graphics, not as pretty

Intro to data
Response of trees to gypsy moth attack
5 genotypes of tree: Dan-2, Sau-2, Sau-3, Wau-1, Wau-2
2 treatments: NGM / GM
2 nutrient levels: low / high
5 reps
Measured: weight, N, tannin, salicylates

qplot(genotype, weight, data = b)

[Figure: weight (10 to 70) by genotype (Dan2, Sau2, Sau3, Wau1, Wau2)]

qplot(genotype, weight, data = b, colour = nutr)

[Figure: weight (10 to 70) by genotype (Dan2, Sau2, Sau3, Wau1, Wau2), coloured by nutr (Low, High)]

[Figure: weight (10 to 70) by genotype, coloured by nutr (Low, High), with the genotype axis ordered Sau3, Dan2, Sau2, Wau2, Wau1]

Comparing means
For inference, interested in comparing the means of the groups
But hard to do visually - eyes naturally compare ranges
What can we do? - Visual ANOVA

Supplemental summaries
smry <- stat_summary(
  fun = "mean_cl_boot", conf.int = 0.68,  # mean_cl_boot is from Hmisc
  geom = "crossbar", width = 0.3
)

Adds another layer with summary statistics: mean + bootstrap estimate of standard error
Motivation: still exploratory, so minimise distributional assumptions, will model explicitly later
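
As a rough illustration of the summary behind that layer, a bootstrap interval for a mean can be computed in base R. This is a sketch, not the Hmisc implementation: the helper boot_mean, its arguments, and the example weights are made up. The 0.68 level mirrors conf.int above; under normality a 68% interval spans roughly one standard error either side of the mean.

```r
# Hypothetical helper: mean with a percentile bootstrap interval (base R only)
boot_mean <- function(x, conf.int = 0.68, B = 1000) {
  # Resample with replacement B times and record the mean of each resample
  boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
  alpha <- (1 - conf.int) / 2
  limits <- quantile(boot_means, c(alpha, 1 - alpha), names = FALSE)
  c(y = mean(x), ymin = limits[1], ymax = limits[2])
}

set.seed(1)
weights <- c(31, 42, 38, 55, 47)  # made-up weights for one group
res <- boot_mean(weights)
print(res)
```

The three values returned (y, ymin, ymax) are the quantities a crossbar needs: a middle line and the two ends of the box.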

qplot(genotype, weight, data = b, colour = nutr)

[Figure: weight (10 to 70) by genotype (Sau3, Dan2, Sau2, Wau2, Wau1), coloured by nutr (Low, High)]

qplot(genotype, weight, data = b, colour = nutr) + smry

[Figure: the same plot with the smry crossbar layer added]

Iterating graphics and modelling
Clearly strong genotype effect. Is there a nutr effect? Is there a nutr-genotype interaction?
Hard to see from this plot - what if we remove the genotype main effect? What if we remove the nutr main effect?
How does this compare to an ANOVA?

b$weight2 <- resid(lm(weight ~ genotype, data = b))
qplot(genotype, weight2, data = b, colour = nutr) + smry

[Figure: weight2 by genotype (Sau3, Dan2, Sau2, Wau2, Wau1), coloured by nutr (Low, High)]

b$weight3 <- resid(lm(weight ~ genotype + nutr, data = b))
qplot(genotype, weight3, data = b, colour = nutr) + smry

[Figure: weight3 by genotype (Sau3, Dan2, Sau2, Wau2, Wau1), coloured by nutr (Low, High)]
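
Why taking residuals "removes" a main effect can be checked on a small made-up data set. The sketch below uses base R only; the data frame toy and its values are invented for illustration and are not the gypsy moth data.

```r
# Made-up data: 2 genotypes x 2 nutrient levels, 2 reps each (illustration only)
toy <- data.frame(
  genotype = factor(rep(c("A", "B"), each = 4)),
  nutr     = factor(rep(rep(c("Low", "High"), each = 2), times = 2)),
  weight   = c(10, 12, 15, 17, 30, 32, 36, 38)
)

# Residuals from a genotype-only model: each genotype's mean is subtracted
toy$weight2 <- resid(lm(weight ~ genotype, data = toy))

# Genotype means of the residuals are zero, so the main effect is gone
genotype_means <- tapply(toy$weight2, toy$genotype, mean)
# What remains is the nutr difference (plus any interaction and noise)
nutr_means <- tapply(toy$weight2, toy$nutr, mean)
print(genotype_means)
print(nutr_means)
```

With the genotype means centred at zero, the remaining vertical separation between the nutr groups is easier to judge by eye.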

              Df Sum Sq Mean Sq F value  Pr(>F)
genotype       4  13331    3333   36.22 8.4e-13 ***
nutr           1   1053    1053   11.44  0.0016 **
genotype:nutr  4    144      36    0.39  0.8141
Residuals     40   3681      92

anova(lm(weight ~ genotype * nutr, data = b))
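
The columns of the table are linked by simple arithmetic, which can be checked from the printed values. The sketch below copies the rounded Df and Sum Sq figures from the table, so the results agree with the slide only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table (already rounded on the slide)
df     <- c(genotype = 4, nutr = 1, "genotype:nutr" = 4, Residuals = 40)
sum_sq <- c(genotype = 13331, nutr = 1053, "genotype:nutr" = 144, Residuals = 3681)

# Mean Sq = Sum Sq / Df
mean_sq <- sum_sq / df

# F = Mean Sq of the term / Mean Sq of the residuals
term_names <- c("genotype", "nutr", "genotype:nutr")
f_value <- mean_sq[term_names] / mean_sq[["Residuals"]]

# Pr(>F) is the upper tail of the F distribution
p_value <- pf(f_value, df[term_names], df[["Residuals"]], lower.tail = FALSE)

print(round(f_value, 2))
print(signif(p_value, 2))
```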

Graphics → Model
In the previous example, we used graphics to iteratively build up a model - a la stepwise regression!
But: here interested in gestalt, not accurate prediction, and must remember that this is just one possible model
What about model → graphics?

Model → Graphics

If we model first, we need graphical tools to summarise model results, e.g. post-hoc comparison of levels
We can do better than SAS! But it's hard work: effects, multcomp and multcompView
Rich research area

[Figure: weight (20 to 60) by genotype (Sau3, Dan2, Sau2, Wau2, Wau1), points coloured by nutr (Low, High), with a crossbar per genotype and group letters a, a, b, bc, c]

ggplot(b, aes(x = genotype, y = weight)) +
  geom_hline(intercept = mean(b$weight)) +
  geom_crossbar(aes(y = fit, min = lower, max = upper), data = geffect) +
  geom_point(aes(colour = nutr)) +
  geom_text(aes(label = group), data = geffect)

Tables of plots
Often interested in marginal, as well as conditional, relationships
Or comparing one subset to the whole, rather than to other subsets
As in a contingency table, we often want to see margins as well
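
The contingency-table analogy can be made concrete in base R. The sketch below rebuilds the study design from the "Intro to data" slide (5 genotypes, 2 treatments, 2 nutrient levels, 5 reps) as a balanced grid and adds margins to a table of counts. The object names are illustrative, and the counts describe the stated design rather than the recorded data.

```r
# Design from the "Intro to data" slide, assumed fully balanced
design <- expand.grid(
  rep      = 1:5,
  nutr     = c("Low", "High"),
  trt      = c("NGM", "GM"),
  genotype = c("Dan-2", "Sau-2", "Sau-3", "Wau-1", "Wau-2")
)

# Conditional counts: one cell per nutr x genotype combination
cells <- table(design$nutr, design$genotype)

# addmargins() appends a "Sum" row and column, the analogue of "(all)" panels
with_margins <- addmargins(cells)
print(with_margins)
```

A table of plots with margins follows the same layout: each interior panel shows one subset, and the extra row and column show the data pooled over the other variable.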

[Figure: grid of panels by nutr (High, Low) and genotype (Dan2, Sau2, Sau3, Wau1, Wau2), with trt (GM, NGM) on the x axis and y axis ticks from 2.0 to 4.5]

[Figure: the same grid of panels with added "(all)" margin panels]

Arranging plots
Facilitate comparisons of interest
Small differences need to be closer together (big differences can be far apart)

Connections to model?

Summary
Need to move beyond canned statistical graphics to experimenting with new graphical methods
Strong links between graphics and models, how can we best use them?
Static graphics often aren't enough

Future

GGobi

ggplot2

rggobi
classifly

meifly
clusterfly

faceoff

geozoo

Bio- and bibliographic tools for statistics

reshape
fda
hints
localmds
lvboxplot
scagnostics
DescribeDisplay

GGobi

ggplot2

fda
hints
localmds
lvboxplot
scagnostics
DescribeDisplay

rggobi
classifly

meifly
clusterfly

faceoff

geozoo

New methods

reshape

Foundations of statistical graphics

Dissemination and outreach

??
??

GGobi
rggobi

classifly

meifly
clusterfly

faceoff

geozoo

New methods

ggplot2
A grammar of interactive graphics
A grammar of graphics for categorical data

Foundations of statistical graphics

reshape2
fda
hints
localmds
lvboxplot
scagnostics
DescribeDisplay

Dissemination and outreach

Questions?
