Am (121-130) Analisis Multinivel
Am (121-130) Analisis Multinivel
Am (121-130) Analisis Multinivel
ordinate (y axis). The value to the right is treated as an independent vari-
able and plotted on the abscissa (x axis). Alternatively, we can rearrange the
terms so that the independent variable comes first, with a comma separating
it from the dependent variable:
plot(anscombe$x1, anscombe$y1)
The values of 5.9 (on the x axis) and 9.35 (on the y axis) are simply where
we thought the text looked best, and can be easily adjusted to suit a user’s
preference.
106 Multilevel Modeling Using R
te
xt(5.9, 9.35, expression(paste("The value of ", italic(R)^2,
" is.67", sep = "")))
Here, paste is used to bind together the text contained within the quotes
and the mathematical expression. Note that sep = "" is used so that there
are no spaces added between the parts pasted together. Although we did not
include it in our figure, the implied regression line can be easily included in a
scatterplot. One way to do this is through the abline function, and to include
the intercept (denoted a) and the slope (denoted b). Thus, abline(a = 3,
b =.5) would add a regression line to the plot that has an intercept at 3 and
a slope of 0.5. Alternatively, to automate the operation, using abline(lm.
object) will extract the intercept and slope and include the regression line
in a scatterplot.
Finally, notice in Figure 6.2 that we have broken the y axis to make very
clear that the plot does not start at the origin. Whether this is needed may
be debatable; note that base R does not include this option, but we prefer to
10 ^
Y = 3 + X0.5
The value of R2 is 0.67
8
Y
2
4 6 8 10 12 14
X
FIGURE 6.2
Plot of Anscombe’s Data Set 1 with the regression line of best fit, axis breaks, and text
denoting the regression line and values of the squared multiple correlation coefficient.
(Source: Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician, 27, 17–21.
With permission.)
Graphing Data in Multilevel Contexts 107
include it in many situations. A broken axis can be identified with the axis.
break function from the plotrix package. The zigzag break is requested
via the style = "zigzag" option in axis.break and the particular axis
(1 for y and 2 for x). By default, R will set the axis to a point that is generally
appropriate. However, when the origin is not shown, there is no break in the
axis by default as some have argued is important.
Now, let us combine the various pieces of information that we have dis-
cussed to produce Figure 6.2, which was generated with the following syntax.
data(anscombe)
text(5.9, 10.15,
expression(italic(hat(Y)) = =3+italic(X)*.5))
20 30 40 50 60 70
4.0
3.5
GPA 3.0
2.5
2.0
70
60
50
CTA_Total
40
30
20
40
35
30
BS_Total 25
20
15
10
2.0 2.5 3.0 3.5 4.0 10 15 20 25 30 35 40
FIGURE 6.3
Pairs plot of Cassidy GPA data showing bivariate scatterplots for all three variables.
pairs(
cb
ind(GPA = Cassidy$GPA, CTA_Total = Cassidy$CTA.tot,
BS_Total = Cassidy$BStotal))
cor(na.omit(
cb
ind(GPA = Cassidy$GPA, CTA_Total = Cassidy$CTA.tot,
BS_Total = Cassidy$BStotal)))
Other options are available for dealing with missing data (see ?cor). We used
the na.omit function wrapped around the cbind function to obtain a list-
wise deletion data set in which the following correlation matrix is computed.
hist(resid.1,
fr
eq = FALSE, main = "Density of Residuals for Model 1",
xlab = "Residuals")
lines(density(resid.1))
This code first requests that a histogram be produced for the residuals. Note
that freq = FALSE is used. It instructs R to make the y axis scaled in
terms of probability, not the default (frequency). The solid line represents
the density estimate of the residuals, corresponding closely to the bars in the
histogram.
Rather than a histogram and overlaid density, a more straightforward way to
evaluate the distribution of the errors to the normal distribution is a quantile–
quantile (QQ) plot that includes a straight line reflecting the expected distribu-
tion of the data if in fact it is normal. In addition, the individual data points are
represented in the figure as dots. In a normal data distribution, the dots will
fall along or very close to the straight line. The following code produced the
QQ plot in Figure 6.5, based on the residuals from the GPA model.
0.8
0.6
Density
0.4
0.2
0.0
−1.5 −1.0 −0.5 0.0 0.5 1.0
Residuals
FIGURE 6.4
Histogram with overlaid density curve of residuals from GPA model in Chapter 1.
1
Sample Quantiles
−1
−2
−3
−3 −2 −1 0 1 2 3
Theoretical Quantiles
FIGURE 6.5
Plot comparing observed quantiles of residuals to theoretical quantiles from standard normal
distribution. When points fall along the line that has slope 1 and goes through the origin,
sample quantiles follow normal distribution.
Graphing Data in Multilevel Contexts 111
Notice that above we use the scale function that standardizes the residuals
to have a mean of zero (already done due to the regression model) and a
standard deviation of 1.
We can see in Figure 6.5 that points in the QQ plot diverge from the line
for the higher end of the distribution. This is consistent with the histogram
in Figure 6.4 that shows a shorter tail on the high end as compared to the low
end of the distribution. This plot and the histogram reveal that the model
residuals diverge from a perfectly normal distribution. Of course, the degree
of non-normality can be quantified (e.g., with skewness and kurtosis mea-
sures). However, in this chapter we are most interested in visualizing the
data, and especially looking for gross violations of assumptions. We should
note here the grey area between satisfaction of an assumption and a gross
violation. At this point, we leave the interpretation to the reader and provide
information on how such visualizations can be made.
6.3.1 dotplot
One very useful function in this package is dotplot. One way to use this
function is to plot a variable of interest on the x axis with a grouping variable
(e.g., classrooms) on the y axis. In the Prime Time data, identifiers for each
corporation, each school within corporation, and each individual student are
included. Classrooms within a school are simply numbered 1 – nj, where nj
is the number of classrooms in the school. This type of structure is typical
of large multilevel data sets. Suppose we want to see how reading achieve-
ment (geread) is distributed within and between classrooms. To begin, we
take a single school (767) in a single corporation (940). This is the first school
represented in the data set and we are using it solely for illustrative pur-
poses. Within this school are four classrooms. To compare the distributions
of geread scores between classrooms within the school, we can use the
dotplot function from the lattice package as follows:
dotplot(
class ~ geread,
data = Achieve.940.767, jitter.y = TRUE, ylab = "Classroom",
ma
in = "Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
Ac
hieve.940.767 <- Achieve[Achieve$corp = =940 &
Achieve$school = =767,]
The class ~ geread component of the code instructs the function to plot
geread for each classroom. The jitter.y parameter is used to jitter or
slightly shift overlapping data points in the graph. For example, if multiple
students in the same classroom have the same scores for geread, using
the jitter option will shift those points on the y axis to indicate clearly the
multiple values at the same x value. Finally, labels can be specified. Note
that the use of \'geread\' in the main title puts geread in a single quote
in the title. Calling the dotplot function using these R commands yields
Figure 6.6.
The figure shows the dispersion of geread for the classrooms in school
767. From this plot, we can see that students in each of the four classrooms
had generally similar reading achievement scores. However, it is also clear
that classrooms 2 and 4 have multiple students with outlying scores that are
higher than those of other individuals within the school. We hope we have
shown clearly how a researcher may make use of this type of plot when
examining the distributions of scores for individuals at a lower level of data
(e.g., students) nested within a higher level such as classrooms or schools.
Because the classrooms within a school are arbitrarily numbered, we
can alter the order in which they appear in the graph to make a display
more meaningful. Note that if the order of the classrooms had not been
arbitrary (e.g., honors classes numbered 1 and 2), we would need to be
3
Classroom
0 2 4 6 8 10 12
geread
FIGURE 6.6
The dotplot of classrooms in school 767 (within corporation 940) for geread.
114 Multilevel Modeling Using R
very careful about changing the order. However, in this case, no such con-
cerns are necessary. In particular, the use of the reorder function on the
left side of the ~ symbol will reorder the classrooms in ascending order of
the variable of interest (geread in this case) in terms of the mean. Thus,
we can modify Figure 6.6 to place the classes in descending order by the
mean of geread.
dotplot(
reorder(class, geread) ~ geread,
data = Achieve.940.767, jitter.y = TRUE, ylab = "Classroom",
ma
in = "Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
4
Classroom
0 2 4 6 8 10 12
geread
FIGURE 6.7
The dotplot of classrooms in school 767 (within corporation 940) for geread with classrooms
ordered by means (lowest to highest).