Machine Learning
a Modern View of Statistical Regression and
Classification
Norman Matloff
University of California, Davis
Contents
Preface xv
2.7.3   β̂ Is Consistent   73
2.8     Inference under Homoscedasticity   74
2.8.1   Review: Classical Inference on a Single Mean   74
2.8.2   Extension to the Regression Case   76
2.8.3   Example: Bike-Sharing Data   79
2.9     Collective Predictive Strength of the X^(j)   81
2.9.1   Basic Properties   81
2.9.2   Definition of R²   82
2.9.3   Bias Issues   83
2.9.4   Adjusted R²   85
2.9.5   The "Leaving-One-Out Method"   86
2.9.5.1   The Code   87
2.9.5.2   Example: Bike-Sharing Data   90
2.9.5.3   Another Use of loom(): the Jackknife   91
2.9.6   Other Measures   92
2.9.7   The Verdict   92
This chapter will set the stage for the book, previewing many of the ma-
jor concepts to be presented in later chapters. The material here will be
referenced repeatedly throughout the book.
Let’s start with a well-known dataset, Bike Sharing, from the Machine
Learning Repository at the University of California, Irvine.1 Here we have
daily/hourly data on the number of riders, weather conditions, day-of-week,
month and so on. Regression analysis, which relates the mean of one vari-
able to the values of one or more other variables, may turn out to be useful
to us in at least two ways:
• Prediction:
The managers of the bike-sharing system may wish to predict ridership, say for the following question: How many riders should be expected on, say, a sunny Sunday with a temperature of 62 degrees?
• Description:
We may be interested in determining what factors affect ridership.
How much effect, for instance, does wind speed have on whether people wish
to borrow a bike?
These twin goals, Prediction and Description, will arise frequently in this
book. Choice of methodology will often depend on the goal in the given
application.
Even without any knowledge of statistics, many people would find it rea-
sonable to predict via subpopulation means. In the above bike-sharing
example, say, this would work as follows.
Think of the “population” of all days, past, present and future, and their
associated values of number of riders, weather variables and so on.3 Our
data set is considered a sample from this population. Now consider the
subpopulation consisting of all days with the given conditions: Sundays,
sunny skies and 62-degree temperatures.
It is intuitive that a reasonable prediction for such a day is the mean ridership among all days in that subpopulation.
In fact, such a strategy is optimal, in the sense that it minimizes our ex-
pected squared prediction error. We will defer the proof to Section 1.19.3
in the Mathematical Complements section at the end of this chapter, but
2 As a consumer, I used to ignore these, but now, with the sharp decline in the number of bricks-and-mortar bookstores that I can browse, I often find Amazon's suggestions useful.
3 This is a somewhat slippery notion, because there may be systemic differences between the present and the distant past or distant future, but let's suppose we've resolved that by limiting our time range.
what is important for now is to note that in the above prediction rule, we
are dealing with a conditional mean: Mean ridership, given day of the week
is Sunday, skies are sunny, and temperature is 62.
To make this more mathematically precise, keep in mind that in this book,
as with many other books, the expected value functional E() refers to popu-
lation mean. Say we are studying personal income, I, for some population,
and we choose a person at random from that population. Then E(I) is not
only the mean of that random variable, but much more importantly, it is
the mean income of all people in that population.
Similarly, we can define conditional means, i.e., means of subpopulations.
Say G is gender. Then the conditional expected value, E(I | G = male) is
the mean income of all men in the population.
To illustrate this in the bike-sharing context, let's define some variables:

• R, the number of riders

• W, the day of the week

• S, the sky conditions, e.g. sunny

• T, the temperature

Our prediction for a sunny 62-degree Sunday is then the conditional mean

E(R | W = Sunday, S = sunny, T = 62)    (1.1)
There is one major problem, though: We don’t know the value of the right-
hand side of (1.1). All we know is what is in our sample data, whereas the
right-hand side of (1.1) is a population value, and thus unknown.
The difference between sample and population is of course at the
very core of statistics. In an election opinion survey, for instance, we
wish to know p, the proportion of people in the population who plan to
vote for Candidate Jones. But typically only 1200 people are sampled, and
4 Note that the “hat” notation ˆ is the traditional one for “estimate of.”
we calculate the proportion of Jones supporters among them, p̂, using that
as our estimate of p. This is why the news reports on these polls always
report the margin of error.5
Similarly, though we would like to know the value of E(R | W = Sunday, S =
sunny, T = 62), it is an unknown population value, and thus must
be estimated from our sample data, which we’ll do later in this chap-
ter.
Readers will greatly profit from constantly keeping in mind this
distinction between populations and samples.
Before going on, a bit of terminology: We will refer to the quantity to be
predicted, e.g. R above, as the response variable, and the quantities used
in prediction, e.g. W , S and T above, as the predictor variables. (By the
way, the machine learning community uses the term features rather than
predictors.)
Though the bike-sharing data set is the main example in this chapter, it
is rather sophisticated for introductory material. Thus we will set it aside
temporarily, and bring in a simpler data set for now. We’ll return to the
bike-sharing example in Section 1.13.
This new dataset involves 1015 major league baseball players, courtesy of
the UCLA Statistics Department. You can obtain the data either from
the UCLA Web page, or as the data set mlb in freqparcoord, a CRAN
package authored by Yingkang Xie and myself. The variables of interest to
us here are player weight W , height H and age A, especially the first two.
Here are the first few records:
> library(freqparcoord)
> data(mlb)
> head(mlb)
             Name Team      Position Height
1  Adam Donachie  BAL       Catcher      74
2       Paul Bako  BAL       Catcher      74
3 Ramon Hernandez  BAL       Catcher      72
4    Kevin Millar  BAL First Baseman      72
5 Actually the radius of a 95% confidence interval for p.
Athletes strive to keep physically fit. Yet even they may gain
weight over time, as do people in the general population. To
what degree does this occur with the baseball players? This
question can be answered by performing a regression analysis of
weight against height and age, which we’ll do in Section 1.9.1.2.6
For instance, to predict the weight of a player who is 72 inches tall, we would use

Ŵ = E(W | H = 72)    (1.2)
6 The phrasing here, "regression analysis of ... against ...," is commonly used in this
field. The quantity before “against” is the response variable, and the ones following are
the predictors.
Again, though, this is a population value, and all we have is sample data.
How will we estimate E(W | H = 72) from that data?
First, some important notation: Recalling that µ is the traditional Greek
letter to use for a population mean, let’s now use it to denote a function
that gives us subpopulation means:
So, µ(72.12) is the mean population weight of all players of height 72.12,
µ(73.88) is the mean population weight of all players of height 73.88, and
so on. These means are population values and thus unknown, but they do
exist.
So, to predict the weight of a 71.6-inch tall player, we would use µ(71.6) —
if we knew that value, which we don’t, since once again this is a population
value while we only have sample data. So, we need to estimate that value
from the (height, weight) pairs in our sample data, which we will denote
by (H1, W1), ..., (H1015, W1015). How might we do that? In the next two
sections, we will explore ways to form our estimate, µ̂(t).
Our height data is only measured to the nearest inch, so instead of estimating values like µ(71.6), we'll settle for µ(72) and so on. A very natural estimate for µ(72), again using the "hat" symbol to indicate "estimate of," is the mean weight among all players in our sample for whom height is 72, i.e.

µ̂(72) = the average of all Wi for which Hi = 72    (1.4)
R's tapply() can give us all the µ̂(t) at once:
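A call along the following lines does the job; the name muhats is the one used in the plot() command later in this section, and the value for height 72 is the one quoted just below:

> muhats <- tapply(mlb$Weight, mlb$Height, mean)
> muhats['72']
    72
192.56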
In case you are not familiar with tapply(), here is what just happened. We
asked R to partition the Weight variable into groups according to values
of the Height variable, and then compute the mean weight in each group.
So, the mean weight of people of height 72 in our sample was 192.5600.
In other words, we would set µ̂(72) = 192.5600, µ̂(74) = 202.4566, and so
on. (More detail on tapply() is given in the Computational Complements
section at the end of this chapter.)
Since we are simply performing the elementary statistics operation of esti-
mating population means from samples, we can form confidence intervals
(CIs). For this, we’ll need the “n” and sample standard deviation for each
height group:
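The calls would look something like the following sketch (only the form of the calls is shown; the resulting group counts and standard deviations are what the discussion below refers to):

> tapply(mlb$Weight, mlb$Height, length)   # the "n" in each height group
> tapply(mlb$Weight, mlb$Height, sd)       # sample standard deviations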
Here is how that first call to tapply() worked. Recall that this function
partitions the data by the Height variable, resulting in a weight vector for
each height value. We need to specify a function to apply to each of those
vectors, which in this case we choose to be R’s length() function. The
latter then gives us the count of weights for each height value, the “n” that
we need to form a CI.
190.3596 ± 1.96 × 17.56349/√150    (1.5)
or about (187.6,193.2).
The above analysis takes what is called a nonparametric approach. To see
why, let’s proceed to a parametric one, in the next section.
All models are wrong, but some are useful — famed statistician George Box
So far, we have assumed nothing about the shape that µ(t) would have, if it
were plotted on a graph. Again, it is unknown, but the function does exist,
and thus it does correspond to some curve. But we might consider making
an assumption on the shape of this unknown curve. That might seem odd,
but you’ll see below that this is a very powerful, intuitively reasonable idea.
Toward this end, let's plot those values of µ̂(t) we found above. We run

> plot(67:83, muhats)

The plotted means appear to lie near a straight line, which suggests modeling µ(t) as a linear function of t:
µ(t) = c + dt (1.6)
for some constants c and d, over the range of t appropriate to human heights.
Or, in English: mean weight is a linear function of height.
Don’t forget the word mean here! We are assuming that the mean weights
in the various height subpopulations have the form (1.6), NOT that weight
itself is this function of height, which can’t be true.
This is called a parametric model for µ(t), with parameters c and d. We
will use this below to estimate µ(t). Our earlier estimation approach, in
Section 1.6.2, made no such assumption about the shape of µ(t); as noted there, it was nonparametric. The linear model can be fit with R's lm() function:
> lmout <- lm(mlb$Weight ~ mlb$Height)
> lmout
...
Coefficients:
(Intercept)  mlb$Height
   -151.133       4.783
We need not type this expression into R by hand. Here is why: Writing
the expression in matrix-multiply form, it is
(−151.133, 4.783) (1, 72)′    (1.9)
Be sure to see the need for that 1 in the second factor; it is used to mul-
tiply the -151.133. Now let’s use that matrix form to show how we can
conveniently compute that value in R:8
The key is that we can exploit the fact that R’s coef() function fetches the
coefficients c and d for us:
> coef(lmout)
(Intercept)  mlb$Height
-151.133291    4.783332
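Multiplying the coefficient vector by (1, 72)′ then gives our estimate of µ(72); the same computation is shown again near the end of this chapter:

> coef(lmout) %*% c(1, 72)
         [,1]
[1,] 193.2666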
8 We will not be using R's predict() function for now. It will be introduced later, though, in Section 4.4.4.
So, using this model, we would predict a slightly heavier weight than our
earlier prediction.
We can form a confidence interval from this too, which at the 95% level works out to about (191.9, 194.6).
Now here is a major point: The CI we obtained from our linear model,
(191.9,194.6), was narrower than the nonparametric approach gave us,
(187.6,193.2); the former has width of about 2.7, while the latter’s is 5.6.
In other words, the parametric model gave us a more precise estimate here.
Why should the linear model be more effective? Here is some intuition,
say for estimating µ(72): As will be seen in Chapter 2, the lm() function
uses all of the data to estimate the regression coefficients. In our case here,
all 1015 data points played a role in the computation of µ̌(72), whereas
only 150 of our observations were used in calculating our nonparametric
estimate µ̂(72). The former, being based on much more data, should tend
to be more accurate.9
On the other hand, in some settings it may be difficult to find a valid para-
metric model, in which case a nonparametric approach may be much more
effective. This interplay between parametric and nonparametric models will
be a recurring theme in this book.
Let’s try a linear regression model on the CTR data in Section 1.3:
> ctr <- read.table('StateCTRDate.txt', header=T, sep='\t')
> lm(ctr$CTR ~ ctr$CollegeGrad)
...
Coefficients:
    (Intercept)  ctr$CollegeGrad
        0.01412         -0.01373
...
So, a “typical” difference between one state and another is something like
0.05. Multiplying by the -0.01373 figure above, this translates to a difference
in click-through rate of about 0.0005. This is certainly not enough to have
any practical meaning.
So, putting aside such issues as whether our data constitute a sample from
some “population” of potential states, the data suggest that there is really
no substantial relation between educational level and CTR.
9 Note the phrase tend to here. As you know, in statistics one usually cannot say that
one estimator is always better than another, because anomalous samples do have some
nonzero probability of occurring.
Now let’s predict weight from height and age. We first need some notation.
Say we are predicting a response variable Y from variables X^(1), ..., X^(k).
The regression function is now defined to be

µ(t1, ..., tk) = E(Y | X^(1) = t1, ..., X^(k) = tk)

In other words, µ(t1, ..., tk) is the mean Y among all units (people, cars,
whatever) in the population for which X^(1) = t1, ..., X^(k) = tk.
In our baseball data, Y , X (1) and X (2) might be weight, height and age,
respectively. Then µ(72, 25) would be the population mean weight among
all players of height 72 and age 25.
We will often use a vector notation for the predictors, writing t = (t1, ..., tk)′. Our vectors are column vectors, but to save space on a page, we will often show them as transposes of rows. For instance, we will often write (5, 12, 13)′ instead of

\begin{pmatrix} 5 \\ 12 \\ 13 \end{pmatrix}    (1.13)
and we would predict the weight of a 72-inch tall, age 25 player to be about
190 pounds.
It was mentioned in Section 1.1 that regression analysis generally has one or
both of two goals, Prediction and Description. In light of the latter, some
brief comments on the magnitudes of the estimated coefficients would be
useful at this point:
Now let’s drop the linear model assumption (1.14), and estimate our re-
gression function “from scratch,” as we did in Section 1.6.2. But here we
will need to broaden our approach, as follows.
Again say we wish to estimate, using our data, the value of µ(72, 25). A
potential problem is that there likely will not be any data points in our
sample that exactly match those numbers, quite unlike the situation in
(1.4), where µ̂(72) was based on 150 data points. Let's check:
> z <- mlb[mlb$Height == 72 & mlb$Age == 25, ]
> z
[1] Name        Team        Position
[4] Height      Weight      Age
[7] PosCategory
<0 rows> (or 0-length row.names)
So, indeed there were no data points matching the 72 and 25 numbers. Since
the ages are recorded to the nearest 0.01 year, this result is not surprising.
But at any rate we thus cannot set µ̂(72, 25) to be the mean weight among
our sample data points satisfying those conditions, as we did in Section
1.6.2. And even if we had had a few data points of that nature, that would
not have been enough to obtain an accurate estimate µ̂(72, 25).
Instead, what is done is to use data points that are close to the desired pre-
diction point. Again taking the weight/height/age case as a first example,
this means that we would estimate µ(72, 25) by the average weight in our
sample data among those data points for which height is near 72 and age
is near 25.
Here "near" is usually taken in the sense of Euclidean distance:

distance[(s1, s2, ..., sk), (t1, t2, ..., tk)] = √[(s1 − t1)² + ... + (sk − tk)²]    (1.17)
For instance, the distance from a player in our sample of height 72.5 and
age 24.2 to the point (72,25) would be
√[(72.5 − 72)² + (24.2 − 25)²] = 0.9434    (1.18)
Note that the Euclidean distance between s = (s1, ..., sk)′ and t = (t1, ..., tk)′
is simply the Euclidean norm of the difference s − t (Section A.1).
Below is a function, knnest(), that implements this idea, using the FNN package's get.knnx() to find the nearest neighbors; the response variable is assumed to be in the last column of xydata.

knnest <- function(xydata, regestpts, k) {
   require(FNN)   # provides get.knnx()
   # response assumed to be the last column; the rest are predictors
   ycol <- ncol(xydata)
   x <- as.matrix(xydata[, -ycol])
   y <- xydata[, ycol]
   # a single estimation point may be given as a plain vector
   if (is.vector(regestpts)) {
      regestpts <- matrix(regestpts, nrow=1)
      colnames(regestpts) <- colnames(x)
   }
   # scale the predictors and the estimation points together
   tmp <- rbind(x, regestpts)
   tmp <- scale(tmp)
   x <- tmp[1:nrow(x), ]
   regestpts <- tmp[(nrow(x)+1):nrow(tmp), ]
   if (!is.matrix(regestpts))
      regestpts <- matrix(regestpts, nrow=1)
   # indices of the k nearest neighbors of each estimation point
   tmp <- get.knnx(data=x, query=regestpts, k=k)
   idx <- tmp$nn.index
   # estimated regression value: mean Y among those neighbors
   meannear <- function(idxrow) mean(y[idxrow])
   apply(idx, 1, meannear)
}
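For example, the estimate of µ(72, 25) discussed above could be obtained with a call such as the following (20 nearest neighbors is just an illustrative choice; the response column comes last, per the convention assumed above):

> knnest(mlb[, c('Height', 'Age', 'Weight')], matrix(c(72, 25), nrow=1), 20)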
The parametric case is the simpler one. We fit our data, write down the
result, and then use that result in the future whenever we are called upon
to do a prediction.
Recall Section 1.9.1.1. It was mentioned there that in that setting, we prob-
ably are not interested in the Prediction goal, but just as an illustration,
suppose we do wish to predict. We fit our model to our data — called
our training data — resulting in our estimated regression function, (1.15).
From now on, whenever we need to predict a player’s weight, given his
height and age, we simply plug those values into (1.15).
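As a sketch of what this amounts to in R for the baseball example (lmall is a hypothetical name; the model regresses weight on height and age, as in Section 1.9.1.2):

> lmall <- lm(mlb$Weight ~ mlb$Height + mlb$Age)
> coef(lmall) %*% c(1, 72, 25)   # predicted weight of a 72-inch, age-25 player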
1.11.1 Intuition
To see how overfitting may occur, consider the famous bias-variance trade-
off, illustrated in the following example. Again, keep in mind that the
treatment will at this point just be intuitive, not mathematical.
Long ago, when I was just finishing my doctoral study, I had my first
experience in statistical consulting. A chain of hospitals was interested
in comparing the levels of quality of care given to heart attack patients
at its various locations. A problem was noticed by the chain regarding
straight comparison of raw survival rates: One of the locations served a
largely elderly population, and since this demographic presumably has more
difficulty surviving a heart attack, this particular hospital may misleadingly
appear to be giving inferior care.
An analyst who may not realize the age issue here would thus be biasing
the results. The term “bias” here doesn’t mean deliberate distortion of
the analysis, just that one is using a less accurate model than one should,
actually “skewed” in the common vernacular. And it is permanent bias, in
the sense that it won’t disappear, no matter how large a sample we take.
Such a situation, in which an important variable is not included in the
analysis, is said to be underfitted. By adding more predictor variables in a
regression model, in this case age, we are reducing bias.
Or, suppose we use a regression model which is linear in our predictors,
but the true regression function is nonlinear. This is bias too, and again it
won’t go away even if we make the sample size huge.
On the other hand, we must keep in mind that our data is a sample from
a population. In the hospital example, for instance, the patients on which
we have data can be considered a sample from the (somewhat conceptual)
population of all patients at this hospital, past, present and future. A
different sample would produce different regression coefficient estimates.
In other words, there is variability in those coefficients from one sample to
another, i.e. variance. We hope that that variance is small, which would give us confidence that the estimates from our particular sample are close to the true population values.
11 Note that this assumes that nothing changes in the system under study between the
time we collect our training data and the time we do future predictions.
In Section 1.19.2 it is shown that for any statistical estimator θ̂ (that has finite variance),

mean squared error = variance + squared bias

Our estimator here is µ̂(t). This shows the tradeoff: Adding variables, such
as age in the hospital example, reduces squared bias but increases variance.
Or, equivalently, removing variables reduces variance but exacerbates bias.
It may for example be beneficial to accept a little bias in exchange for
a sizable reduction in variance, which we may achieve by removing some
predictors from our model.
The trick is somehow to find a “happy medium,” easier said than done.
Chapter 9 will cover this in depth, but for now, we introduce a common
method for approaching the problem:
1.11.2 Cross-Validation
Toward that end, it is common to artificially create a set of “new” data and
try things out. Instead of using all of our collected data as our training set,
we set aside part of it to serve as simulated “new” data. This is called the
validation set or test set. The remainder will be our actual training data.
In other words, we randomly partition our original data, taking one part as
our training set and the other part to play the role of new data. We fit our
model, or models, to the training set, then do prediction on the test set,
pretending its response variable values are unknown. We then compare to
12 I wish to thank Ariel Shin for this interpretation.
the real values. This will give us an idea of how well our models will predict
in the future. This is called cross-validation.
The above description is a little vague, and since there is nothing like code
to clarify the meaning of an algorithm, let’s develop some. Here first is
code to do the random partitioning of data, with a proportion p to go to
the training set:
xvalpart <- function(data, p) {
   n <- nrow(data)
   ntrain <- round(p*n)
   # randomly choose the training-set row indices
   trainidxs <- sample(1:n, ntrain, replace=FALSE)
   valididxs <- setdiff(1:n, trainidxs)
   list(train=data[trainidxs, ], valid=data[valididxs, ])
}
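For example, a two-thirds/one-third split of the baseball data would be obtained by a call like this (the proportion here is arbitrary):

> tmp <- xvalpart(mlb, 2/3)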
This sets up the partitioning of the data into training and test sets.
Now to perform cross-validation, we’ll consider the parametric and non-
parametric cases separately, in the next two sections.
# arguments:
#
#    data:      full data
#    ycol:      column number of resp. var.
#    predvars:  column numbers of predictors
#    p:         prop. for training set
#    meanabs:   see 'value' below

13 There are sophisticated packages on CRAN for this, such as cvTools. But to keep things simple, and to better understand the concepts, we will write our own code. Similarly, as mentioned, we will not use R's predict() function for the time being.
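A minimal sketch of a cross-validated linear-model fit consistent with these argument comments follows; the function name xvallm and the details of the fit are our assumptions (compare the xvalknn() code below):

xvallm <- function(data, ycol, predvars, p, meanabs=TRUE) {
   tmp <- xvalpart(data, p)            # random training/validation split
   train <- tmp$train
   valid <- tmp$valid
   # fit a linear model on the training set only
   trainy <- train[, ycol]
   trainpreds <- as.matrix(train[, predvars])
   lmout <- lm(trainy ~ trainpreds)
   # predict the validation set by applying the estimated coefficients
   validpreds <- as.matrix(valid[, predvars])
   predy <- cbind(1, validpreds) %*% coef(lmout)
   realy <- valid[, ycol]
   if (meanabs) return(mean(abs(predy - realy)))
   list(predy = predy, realy = realy)
}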
\begin{pmatrix} (-1, 2)(3, 8)' \\ (2, 5)(3, 8)' \end{pmatrix} = \begin{pmatrix} 13 \\ 46 \end{pmatrix}    (1.19)
The reader should verify that “distributing out” that common (3, 8)′ factor
is valid algebra:
\begin{pmatrix} -1 & 2 \\ 2 & 5 \end{pmatrix} \begin{pmatrix} 3 \\ 8 \end{pmatrix} = \begin{pmatrix} 13 \\ 46 \end{pmatrix}    (1.20)
xvalknn <- function(data, ycol, predvars, k, p, meanabs=TRUE) {
   # cross-validation of k-NN: split the data, estimate the regression
   # values at the validation points from the training set, then compare
   tmp <- xvalpart(data, p)
   train <- tmp$train
   valid <- tmp$valid
   # training data, with the response variable as the last column
   trainxy <- train[, c(predvars, ycol)]
   validx <- valid[, predvars]
   validx <- as.matrix(validx)
   predy <- knnest(trainxy, validx, k)
   realy <- valid[, ycol]
   if (meanabs) return(mean(abs(predy - realy)))
   list(predy = predy, realy = realy)
}
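For instance, both methods could be cross-validated on the baseball data along these lines (column 5 is Weight, columns 4 and 6 are Height and Age, per the head() output earlier; the 2/3 training proportion and k = 20 are just example choices, and xvallm() refers to the sketch above):

> xvallm(mlb, 5, c(4, 6), 2/3)
> xvalknn(mlb, 5, c(4, 6), 20, 2/3)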
The two methods gave similar results. However, this depended on choosing
a value of 20 for k, the number of nearest neighbors. We could have tried
other values of k, and in fact could have used cross-validation to choose the
“best” value.
We now return to the bike-sharing data. Our little excursion to the simpler
data set, involving baseball player weights and heights, helped introduce
the concepts in a less complex setting. The bike-sharing data set is more
complicated in several ways, notably (a) the apparently nonlinear effect of temperature on ridership and (b) the presence of nominal variables such as day of the week.
Now that we know some of the basic issues from analyzing the baseball
data, we can treat this more complicated data set.
Let’s read in the bike-sharing data. We’ll restrict attention to the first
year,14 and since we will focus on the registered riders, let’s shorten the
name for convenience:
> shar <- read.csv("day.csv", header=T)
> shar <- shar[1:365, ]
> names(shar)[15] <- "reg"
14 There appears to have been some systemic change in the second year, and while this
could be modeled, we’ll keep things simple by considering only the first year.
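To accommodate the nonlinear effect of temperature, a squared-temperature column is added and the quadratic model is fit; a sketch of the commands (the column name temp2 matches the output below):

> shar$temp2 <- shar$temp^2
> lm(shar$reg ~ shar$temp + shar$temp2)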
Call:
lm(formula = shar$reg ~ shar$temp + shar$temp2)

Coefficients:
(Intercept)    shar$temp   shar$temp2
      -1058        16133       -11756
And note that, sure enough, the coefficient of the squared term, ê = −11756, did indeed turn out to be negative.
Of course, we want to predict from many variables, not just temperature,
so let’s now turn to Complication (b) cited earlier, the presence of nominal
data. This is not much of a problem either.
Such situations are generally handled by setting up what are called indicator
variables or dummy variables. The former term alludes to the fact that our
variable will indicate whether a certain condition holds or not, with 1 coding
the yes case and 0 indicating no.
We could, for instance, set up such a variable for Tuesday data, defined to be 1 on Tuesdays and 0 on all other days.
Indeed, we could define six variables like this, one for each of the days
Monday through Saturday. Note that Sunday would then be indicated
indirectly, via the other variables all having the value 0. A direct Sunday
variable would be redundant, and in fact would present mathematical prob-
lems, as we’ll see in Chapter 8. (Actually, R’s lm() function can deal with
factor variables directly, as shown in Section 9.7.5.1. But we take the more
basic route here, in order to make sure the underlying principles are clear.)
However, let’s opt for a simpler analysis, in which we distinguish only be-
tween weekend days and week days, i.e. define a dummy variable that is 1
for Monday through Friday, and 0 for the other days. Actually, those who
assembled the data set already defined such a variable, which they named
workingday.15
We incorporate this workingday variable into our linear model.
There are several other dummy variables that we could add to our model,
but for this introductory example let’s define just one more:
> shar$clearday <- as.integer(shar$weathersit == 1)
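The full model can then be fit using lm()'s data argument, along these lines (a sketch of the call; the formula names match the variables defined above):

> lm(reg ~ temp + temp2 + workingday + clearday, data=shar)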
(The use of the data argument saved multiple typing of the data set name
shar and clutter.)
15 More specifically, a value of 1 for this variable indicates that the day is a Monday through Friday that is not a holiday.
µ(t1, t2, t3, t4) = β0 + β1 t1 + β2 t2 + β3 t3 + β4 t4    (1.24)

where t1 is the temperature, t2 = t1², t3 is the workingday dummy and t4 is clearday.
So, what should we predict for number of riders on the type of day described
at the outset of this chapter — Sunday, sunny, 62 degrees Fahrenheit? First,
note that the designers of the data set have scaled the temp variable to [0,1], as

(Celsius temperature − minimum) / (maximum − minimum)

where the minimum and maximum here were −8 and 39, respectively. This
form may be easier to understand, as it is expressed in terms of where the
given temperature fits on the normal range of temperatures. A Fahrenheit
temperature of 62 degrees corresponds to a scaled value of 0.525.
16 See more detail on this in Section 1.20.3.
So, our predicted number of riders for sunny, 62-degree Sundays will be
about 3852.
As noted earlier, one can also form confidence intervals and perform sig-
nificance tests on the βi . We’ll go into this in Chapter 2, but some brief
comments on the magnitudes and signs of the β̂i are useful at this point:
Let’s see what k-NN gives us as our predicted value for sunny, 62-degree
Sundays, say with values of 20 and 50 for k:
> knnest(shar[, c(10,8,17,15)], matrix(c(0.525,0,1), nrow=1), 20)
[1] 2881.8
> knnest(shar[, c(10,8,17,15)], matrix(c(0.525,0,1), nrow=1), 10)
[1] 3049.7
This is quite different from what the linear model gave us. Let’s see how
the two approaches compare in cross-validation:
Let’s take another look at (1.24), specifically the term involving the variable
workingday, a dummy indicating a nonholiday Monday through Friday.
Our estimate for β3 turned out to be 988.5, meaning that, holding tem-
perature and the other variables fixed, there are 988.5 additional riders on
workingdays.
But implicit in this model is that the workingday effect is the same on
low-temperature days as on warmer days. For a broader model that does
not make this assumption, we could add an interaction term, consisting of
a product of workingday and temp.
How does this model work? Let’s illustrate it with a new data set.
This data is from the 2000 U.S. Census, consisting of 20,090 programmers
and engineers in the Silicon Valley area. The data set is included in the
freqparcoord package on CRAN. Suppose we are working toward a De-
scription goal, specifically the effects of gender on wage income.
As with our bike-sharing data, we’ll add a quadratic term, in this case
on the age variable, reflecting the fact that many older programmers and
engineers encounter trouble finding work after age 35 or so. Let’s restrict
our analysis to workers having at least a Bachelor’s degree, and look at
the variables age, age2, sex (coded 1 for male, 2 for female), wkswrked
(number of weeks worked), ms, phd and wageinc:
> library(freqparcoord)
> data(prgeng)
> prgeng$age2 <- prgeng$age^2
> edu <- prgeng$educ
> prgeng$ms <- as.integer(edu == 14)
> prgeng$phd <- as.integer(edu == 16)
> prgeng$fem <- prgeng$sex - 1
> tmp <- prgeng[edu >= 13, ]
> pe <- tmp[, c(1,12,9,13,14,15,8)]
> pe <- as.matrix(pe)
Our model is

mean wageinc = β0 + β1 age + β2 age2 + β3 wkswrked + β4 ms + β5 phd + β6 fem    (1.31)
The results are striking in terms of gender: With age, education and so on
held constant, women are estimated to have incomes about $11,177 lower
than comparable men.
But this analysis implicitly assumes that the female wage deficit is, for
instance, uniform across educational levels. To see this, consider (1.31).
Being female makes a β6 difference, no matter what the values of ms and
phd are. To generalize our model in this regard, let’s define two interaction
variables:17
> msfem <- pe[, 4] * pe[, 6]
> phdfem <- pe[, 5] * pe[, 6]
> pe <- cbind(pe, msfem, phdfem)
So, now instead of there being a single number for the "female effect," β6,
we now have two: β6 + β7 for women with a Master's degree, and β6 + β8 for women with a PhD, where β7 and β8 are the coefficients of the new variables msfem and phdfem.
17 The interaction terms could also be formed using the R colon operator, which automates the process. This is not done here, so as to ensure that the reader fully understands the meaning of interaction terms. For information on the colon operator, type ?formula at the R prompt.
The estimated values of the two female effects are −9091.230 − 5088.779 =
−14180.01, and −9091.230 − 14831.582 = −23922.81. A few points jump out
here:
• The gap is worse at the PhD level than the Master’s, likely because
of the generally higher wages for the latter.
Recall the hospital example in Section 1.11.1. There the response variable
is nominal, represented by a dummy variable taking the values 1 and 0,
depending on whether the patient survives or not. This is referred to as
a classification problem, because we are trying to predict which class the
population unit belongs to — in this case, whether the patient will belong
to the survival or nonsurvival class. We could set up dummy variables
for each of the hospital branches, and use these to assess whether some
were doing a better job than others, while correcting for variations in age
distribution from one branch to another. (Thus our goal here is Description
rather than directly Prediction itself.)
This will be explained in detail in Chapter 4, but the point is that we are
predicting a 1-0 variable. In a marketing context, we might be predicting
which customers are more likely to purchase a certain product. In a com-
puter vision context, we may want to predict whether an image contains a
certain object. In the future, if we are fortunate enough to develop relevant
data, we might even try our hand at predicting earthquakes.
Classification applications are extremely common. And in many cases there
are more than two classes, such as in identifying many different printed
characters in computer vision.
In a number of applications, it is desirable to actually convert a problem
with a numeric response variable into a classification problem. For instance,
there may be some legal or contractural aspect that comes into play when
our variable V is above a certain level c, and we are only interested in
whether the requirement is satisfied. We could replace V with a new vari-
able
Y = 1 if V > c, and Y = 0 if V ≤ c    (1.33)
Since Y takes on only the values 0 and 1, its mean in any subpopulation reduces to the probability of a 1 there; the regression function is now

µ(t) = P(Y = 1 | X = t)    (1.35)
The great implication of this is that the extensive knowledge about regression
analysis developed over the years can be applied to the classification problem.
An intuitive strategy — but, as we’ll see, NOT the only appropriate one —
would be to guess that Y = 1 if the conditional probability of 1 is greater
than 0.5, and guess 0 otherwise. In other words,
guess for Y = 1 if µ(X) > 0.5, and 0 if µ(X) ≤ 0.5    (1.36)
It turns out that this strategy is optimal, in that it minimizes the overall
misclassification error rate (see Section 1.19.4 in the Mathematical Com-
plements portion of this chapter). However, it should be noted that this
is not the only possible criterion that might be used. We’ll return to this
issue in Chapter 5.
As before, note that (1.35) is a population quantity. We’ll need to estimate
it from our sample data.
Let’s take as our example the situation in which ridership is above 3,500
bikes, which we will call HighUsage:
> shar$highuse <- as.integer(shar$reg > 3500)
We’ll try to predict that variable. Let’s again use our earlier example, of
a Sunday, clear weather, 62 degrees. Should we guess that this will be a
High Usage day?
We can use our k-NN approach just as before:
> knnest(shar[, c(10,8,18,19)], c(0.525,0,1), 20)
[1] 0.1
A linear model for this probability, as in (1.37), could be used, but would not be very satisfying. The left-hand side of (1.37), as a probability, should be in [0,1], but the right-hand side could in principle fall far outside that range.
Instead, the most common model for conditional probability is logistic re-
gression:
ℓ(s) = 1 / (1 + e^(−s))    (1.39)

µ(t1, t2, t3, t4) = 1 / (1 + e^(−(β0 + β1 t1 + β2 t2 + β3 t3 + β4 t4)))    (1.40)
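As a sketch of how (1.40) might be fit and then evaluated at our query point (the variable name glmout and the exact form of the call are our own illustrative choices):

> glmout <- glm(highuse ~ temp + temp2 + workingday + clearday,
+    data=shar, family=binomial)
> s <- coef(glmout) %*% c(1, 0.525, 0.525^2, 0, 1)
> 1 / (1 + exp(-s))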
So, our parametric model gives an almost identical result here to the one
arising from k-NN, about a 10% probability of HighUsage.
We can perform cross-validation too, and will do so in later chapters. For
now, note that our accuracy criterion should change, say to the proportion
of misclassified data points.
Though statisticians tend to view things through the lens of exact models,
proper computation of standard errors and so on, much usage of regression
models is much less formal. Former Chinese leader Deng Xiaoping’s famous
line to convince his peers to accept capitalist ways comes to mind: "Black
cat, white cat, it doesn’t matter as long as it catches mice.” This more
relaxed view is common in the machine learning community, for example.19
A ≈ WH (1.41)
The larger the rank, the better our approximation in (1.41). But we typi-
cally hope that a good approximation can be achieved with
k ≪ rank(A) (1.42)
The matrices W and H are calculated iteratively, with one of the major
methods being regression. Here is how:
We make initial guesses for W and H, say with random numbers. Now
consider an odd-numbered iteration. Suppose just for a moment that we
know the exact value of W , with H unknown. Then for each j we could
“predict” column j of A from the columns of W . The coefficient vector
returned by lm() will become column j of H. We do this for j = 1, 2, ..., v.
In even-numbered iterations, suppose we know H but not W . We could
take transposes,
A′ = H ′ W ′ (1.43)
19 There of course is the question of whether the cat really is catching mice, a topic we will return to at various points.
and then just interchange the roles of W and H above. Here a call to lm()
gives us a row of W , and we do this for all rows.
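A rough illustration of this alternating-regression idea in R (this is not the algorithm the NMF package actually uses; the function name nmfals, the random initialization and the crude clipping to nonnegative values are our own simplifications):

# alternate between regressing the columns of A on W (updating H) and the
# columns of A' on H' (updating W); negative values are simply clipped to 0
# here, whereas real NMF algorithms enforce nonnegativity more carefully
nmfals <- function(a, k, niters=50) {
   u <- nrow(a)
   w <- matrix(runif(u*k), nrow=u)              # random initial guess for W
   h <- NULL
   for (i in 1:niters) {
      # "odd" step: W treated as known; each column of A regressed on W
      h <- pmax(coef(lm(a ~ w - 1)), 0)         # k x v coefficient matrix
      # "even" step: H treated as known; each column of A' regressed on H'
      w <- pmax(t(coef(lm(t(a) ~ t(h) - 1))), 0)   # u x k
   }
   list(w = w, h = h)
}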
R’s NMF package for NMF computation is quite versatile, with many,
many options. In its simplest form, though, it is quite easy to use. For a
matrix a and desired rank k, we simply run
> nout <- nmf(a, k)
Now, perform NMF, find the approximation to A, and display it.
We have replaced the original matrix by one of rank only 50, with a 75% storage savings. This is not important for one small picture, but possibly worthwhile if we have many large ones. The approximation is not bad in that light, and may be good enough for image recognition or other applications.
Indeed, in many if not most applications of NMF, we need to worry about
overfitting. As you will see later, overfitting in this context amounts to
using too high a value for our rank, something to be avoided.
This section will be just a wee bit mathematical. Usually this book aims to
relegate such material to the Mathematical Complements sections, so that
nonmathematical readers can skip such material. But in this section, we
ask the reader’s brief indulgence, and promise to focus on intuition.
Since the regression function is defined as a conditional expected value,
as in (1.3), for mathematical analysis we’ll need some properties. First, a
definition.
For any random variables U and V with defined expectation, either of which
could be vector-valued, define a new random variable W , as follows. First
note that the conditional expectation of V given U = t is a function of t,
say µ(t); for instance, one might have

µ(t) = E(V | U = t) = (1 + t)/2    (1.46)

Now define W to be this function evaluated at U itself, i.e. W = µ(U), usually written E(V | U). A key property, the Law of Iterated Expectation, is that

EW = E[E(V | U)] = EV    (1.47)
The foreboding appearance of this equation belies the fact that it is actually
quite intuitive, as follows. Say you want to compute the mean height of all
people in the U.S., and you already have available the mean heights in each
of the 50 states. You cannot simply take the straight average of those state
mean heights, because you need to give more weight to the more populous
states. In other words, the national mean height is a weighted average of
the state means, with the weight for each state being is its proportion of
the national population.
In (1.47), this corresponds to having V as height and U as state. State is
an integer-valued random variable, ranging from 1 to 50, so we have
EV = E[E(V | U)]                              (1.48)
   = EW                                       (1.49)
   = Σ_{i=1}^{50} P(U = i) E(V | U = i)       (1.50)
The left-hand side, EV , is the overall mean height in the nation; E(V | U =
i) is the mean height in state i; and the weights in the weighted average
are the proportions of the national population in each state, P (U = i).
Not only can we look at the mean of W, but also its variance. By using the
various familiar properties of mean and variance, one can derive a similar
relation for variance: For scalar V,

Var(V) = E[Var(V | U)] + Var[E(V | U)]
One might initially guess that we only need the first term. To obtain the
national variance in height, we would take the weighted average of the state
variances. But this would not take into account that the mean heights vary
from state to state, thus also contributing to the national variance in height,
hence the second term.
This is proven in Section 2.12.10.3.
Now consider conditioning on two variables, say U1 and U2. One can show
that

E[E(V | U1, U2) | U1] = E(V | U1)
There is an elegant way to view all of this in terms of abstract vector spaces
— (1.47) becomes the Pythagorean Theorem! — which we will address later
in Mathematical Complements Sections 2.12.10 and 7.8.1.
For an event Q, let W denote its indicator random variable, equal to 1 if Q occurs and 0 if it does not. Then:

• EW = P(Q)

This follows from the fact that W takes on only the values 1 and 0, so that EW = 1 · P(Q) + 0 · [1 − P(Q)] = P(Q).
The mean squared error (MSE) of an estimator θ̂ of a population quantity θ is defined as MSE(θ̂) = E[(θ̂ − θ)²]. This is the squared distance from our estimator to the true value, averaged over all possible samples.
Let’s rewrite the quantity on which we are taking the expected value:
(θ̂ − θ)² = [(θ̂ − Eθ̂) + (Eθ̂ − θ)]² = (θ̂ − Eθ̂)² + (Eθ̂ − θ)² + 2(θ̂ − Eθ̂)(Eθ̂ − θ)    (1.56)
Look at the three terms on the far right of (1.56). The expected value of the first is Var(θ̂), by definition of variance. The second term, (Eθ̂ − θ)², is a constant, the squared bias. And since Eθ̂ − θ is a constant, the third term has expected value 0, because

E(θ̂ − Eθ̂) = 0    (1.57)
Taking the expected value of both sides of (1.56) and using the above remarks, we have

MSE(θ̂) = Var(θ̂) + (Eθ̂ − θ)²    (1.58)
        = variance + bias²       (1.59)
In other words, the mean squared error of an estimator decomposes into its variance plus the square of its bias.
Claim: Consider all the functions f() with which we might predict Y from
X, i.e., Yb = f (X). The one that minimizes mean squared prediction error,
E[(Y − f (X))2 ], is the regression function, µ(t) = E(Y | X = t).
(Note that the above involves population quantities, not samples. Consider
the quantity E[(Y −f (X))2 ], for instance. It is the mean squared prediction
error (MSPE) over all (X, Y ) pairs in the population.)
To derive this, first ask: for any (finite-variance) random variable W, what
number c minimizes the quantity E[(W − c)²]? The answer is c = EW.
MSPE = E[(Y − f(X))²] = E{ E[(Y − f(X))² | X] }    (1.61)

For each value of X, the inner conditional expectation is minimized, by the fact about c = EW above, by taking f(X) to be the conditional mean of Y given X, i.e. by taking f to be the regression function µ.
This result concerns the classification context. It shows that if we know the
population distribution — we don’t, but are going through this exercise to
guide our intuition — the conditional mean provides the optimal action in
the classification context.
Remember, in this context, µ(t) = P(Y = 1 | X = t), i.e. the conditional mean
reduces to the conditional probability. Now plug in X for t, and we have
the following.
Claim: Consider all rules based on X that produce a guess Ŷ, taking on
values 0 and 1. The one that minimizes the overall misclassification rate
P(Ŷ ≠ Y) is
Ŷ = 1 if µ(X) > 0.5, and 0 if µ(X) ≤ 0.5    (1.62)
Think of this simple situation: There is a biased coin, with known prob-
ability of heads p. The coin will be tossed once, and we are supposed to
guess the outcome.
Let’s name your guess g (a nonrandom constant), and let C denote the
as-yet-unknown outcome of the toss (1 for heads, 0 for tails). Then the
reader should check that, no matter whether we choose 0 or 1 for g, the probability that we guess correctly is

gp + (1 − g)(1 − p)

i.e. p if g = 1 and 1 − p if g = 0. Thus to maximize the probability of a correct guess, we should take g = 1 if p > 0.5, and g = 0 otherwise.
Now to show the original claim, we use iterated expectation, (1.47). Write

P(Ŷ = Y) = E[ P(Ŷ = Y | X) ]    (1.66)

For each value of X, the inner probability is maximized, by the coin-toss argument above, by guessing Ŷ = 1 exactly when the conditional probability of a 1 exceeds 0.5. And that conditional probability is

P(Y = 1 | X) = µ(X)    (1.67)

which gives us (1.62), as claimed.
> install.packages('freqparcoord', lib='~/R')

Here I have instructed R to download the freqparcoord package, installing
it in ~/R, the directory where I like to store my packages.
Official R parlance is package, not library, even though ironically one loads
a package using the library() function! For instance,
> library(freqparcoord)
One can learn about the package in various ways. After loading it, for
instance, you can list its objects, such as
> ls('package:freqparcoord')
[1] "freqparcoord" "knndens"      "knnreg"       "posjitter"    "regdiag"
[6] "regdiagbas"   "rmixmvnorm"   "smoothz"      "smoothzpred"
where we see objects (functions here) knndens() and so on. There is the
help() function, e.g.
> help(package=freqparcoord)
Information on package 'freqparcoord'

Description:

Package:      freqparcoord
Version:      1.1.0
Author:       Norm Matloff <normmatloff@gmail.com> and
              Yingkang Xie <yingkang.xie@gmail.com>
Maintainer:   Norm Matloff <normmatloff@gmail.com>
...
> vignette()
In Section 1.6.2 we had occasion to use R’s tapply(), a highly useful feature
of the language. To explain it, let's start with a useful function, split().
Consider this tiny data frame:
> x
gender height
1 m 66
2 f 67
3 m 72
4 f 63
Now let's split this data frame by gender:

> xs <- split(x, x$gender)
> xs
...
$m
  gender height
1      m     66
3      m     72
• xs is an R list
• xs$f and xs$m are data frames, the male and female subsets of x
The first argument of tapply() must be a vector, but the function that is
applied can be vector-valued. Say we want to find not only the mean but
also the standard deviation. We can do this:
> tapply(x$height, x$gender, function(w) c(mean(w), sd(w)))
$f
[ 1 ] 64.333333 2.309401
$m
[ 1 ] 69.000000 4.242641
Here, our function (which we defined "on the spot," within our call to
tapply()) produces a vector of two components. We asked tapply() to
call that function on our vector of heights, doing so separately for
each gender.
The return value from a call to lm() is an object of R’s S3 class structure;
the class, not surprisingly, is named ”lm”. It turns out that the functions
coef() and vcov() mentioned in this chapter are actually related to this
class, as follows.
Recall our usage, on the baseball player data:
> lmout <- lm(mlb$Weight ~ mlb$Height)
> coef(lmout) %*% c(1, 72)
         [,1]
[1,] 193.2666
1.21 Further Exploration: Data, Code and Math Problems

1. Consider the bodyfat data mentioned in Section 1.2. Use lm() to form
a prediction equation for density from the other variables (skipping the
first three), and comment on whether use of indirect methods in this way
seems feasible.
2. Suppose the joint density of (X, Y) is 3s²e^(−st), 1 < s < 2, 0 < t < ∞.
Find the regression function µ(s) = E(Y | X = s).
3. For (X, Y) in the notation of Section 1.19.3, show that the predicted
value µ(X) and the prediction error Y − µ(X) are uncorrelated.
4. Suppose X is a scalar random variable with density g. We are interested
in the nearest neighbors to a point t, based on a random sample X1, ..., Xn
from g. Find Lk, the cumulative distribution function of the distance of
the k-th-nearest neighbor to t.