3 Transformations in Regression: Y X Y X
Transformations in Regression
Simple linear regression is appropriate when the scatterplot of Y against X shows a linear trend. In many problems, nonlinear relationships are evident in data plots. Linear regression techniques can still be used to model the dependence between Y and X, provided the data can be transformed to a scale where the relationship is roughly linear. In the ideal world, theory will suggest an appropriate transformation. In the absence of theory one usually resorts to empirical model building. Polynomial models are another method for handling nonlinear relationships. I will suggest transformations that you can try if the trend in your scatterplot has one of the following functional forms. The responses are assumed to be non-negative (in some cases strictly positive) in all cases.
(a) Y as a Positive Power of X
[Figure: two panels of Y against X. Left: $Y = \beta_0 X^{\beta_1}$, $\beta_1 > 0$. Right: $Y = \beta_0 X^{\beta_1}$, $\beta_1 < 0$.]
[Figure for (b): two panels of Y against X. Left: $Y = \beta_0 e^{\beta_1 X}$, $\beta_1 > 0$. Right: $Y = \beta_0 e^{\beta_1 X}$, $\beta_1 < 0$.]
The functional relationship between Y and X in (a) is given by $Y = \beta_0 X^{\beta_1}$, that is, Y is related to a power of X, where the power is typically unknown. For the left plot, $\beta_1 > 0$, whereas $\beta_1 < 0$ for the plot on the right. For either situation, the logarithm of Y is linearly related to the logarithm of X (regardless of the base): $\log(Y) = \log(\beta_0) + \beta_1 \log(X)$. You should consider a simple linear regression of $Y' = \log(Y)$ on $X' = \log(X)$.
The functional relationship between Y and X in (b) is given by $Y = \beta_0 \exp(\beta_1 X)$, that is, Y is an exponential function of X. For the plot on the left, $\beta_1 > 0$, whereas $\beta_1 < 0$ for the plot on the right. In either situation, the natural logarithm of Y is linearly related to X: $\log_e(Y) = \log_e(\beta_0) + \beta_1 X$. You should consider a simple linear regression of $Y' = \log_e(Y)$ on X. Actually, the base of the logarithm is not important here either.
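The two linearizations for forms (a) and (b) can be checked numerically. The following is a minimal Python sketch (these notes use Stata; Python is used here only as a stand-in). The data are hypothetical and noise-free, generated from $Y = 2X^{1.5}$ and $Y = 3e^{-0.8X}$, and the helper `fit_line` is an illustrative name, not something from the notes.

```python
import math

def fit_line(x, y):
    """Least-squares (intercept, slope) for a simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

x = [0.5, 1.0, 1.5, 2.0, 2.5]

# Form (a): hypothetical noise-free data from Y = 2 * X**1.5.
y_power = [2.0 * xi ** 1.5 for xi in x]
a0, a1 = fit_line([math.log(xi) for xi in x], [math.log(yi) for yi in y_power])
beta0_power, beta1_power = math.exp(a0), a1  # intercept estimates log(beta0)

# Form (b): hypothetical noise-free data from Y = 3 * exp(-0.8 * X).
y_exp = [3.0 * math.exp(-0.8 * xi) for xi in x]
c0, c1 = fit_line(x, [math.log(yi) for yi in y_exp])
beta0_exp, beta1_exp = math.exp(c0), c1

print("power fit:      ", beta0_power, beta1_power)
print("exponential fit:", beta0_exp, beta1_exp)
```

Because the data contain no noise, the fits recover the generating parameters up to rounding error. With real data the same regressions give estimates of $\log(\beta_0)$ and $\beta_1$ rather than exact values.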
(c) Y as a Logarithm of X
[Figure: two panels of Y against X. Left: $Y = \beta_0 + \beta_1 \log(X)$, $\beta_1 > 0$. Right: $Y = \beta_0 + \beta_1 \log(X)$, $\beta_1 < 0$.]
(d) Y as a Reciprocal of X
[Figure: two panels of Y against X. Left: $Y = \beta_0 + \beta_1 \frac{1}{X}$, $\beta_1 > 0$. Right: $Y = \beta_0 + \beta_1 \frac{1}{X}$, $\beta_1 < 0$.]
The functional relationship between Y and X in (c) is given by $Y = \beta_0 + \beta_1 \log(X)$, that is, Y is a logarithmic function of X. For the plot on the left, $\beta_1 > 0$, whereas $\beta_1 < 0$ for the plot on the right. In each situation, consider a simple linear regression of Y on $X' = \log(X)$.
The functional relationship between Y and X in (d) is $Y = \beta_0 + \beta_1 \frac{1}{X}$. Hence, consider a simple linear regression of Y on $X' = 1/X$. Note that each plot in (d) has a horizontal asymptote of $\beta_0$.
In most problems, the trend or signal will be buried in a considerable amount of noise, or variability, so the best transformation may not be apparent. If two or more transformations are suggested, try all of them and see which is best: look at diagnostics from the various fits rather than (meaningless) summaries such as $R^2$. In situations where a logarithmic transformation is suggested, you might try a square root transformation as well. It often does make a considerable difference in the quality of the fit whether you transform Y only, X only, or both. There are more organized schemes for choosing transformations, but this sort of trial and error is the most common practice. Note that the functional forms (a) - (d), while probably the most frequently encountered, are not at all the only ones used.

The need to transform is sometimes much more apparent in a plot of the residuals against the predicted values from a linear fit of the original data, because you tend not to perceive subtle deviations from linearity in the raw scatterplot. The Wind Speed example below illustrates this.

Transformations also can help to control influential values and outliers (recall that an outlying X-value can cause that point to exert undue influence on the fit). Functions such as log have the effect of bringing outlying values much closer to the rest of the data. The Brain Weights vs. Body Weights example below illustrates this. When I see a variable with a highly skewed distribution, I usually try transforming it to make it more symmetric. This can work both ways, of course: you can make a nice symmetrically distributed variable skewed by transforming it.
Computing Predictions
Transforming the response to a new scale causes no difficulties if you wish to make predictions on the original scale. For example, suppose you fit a linear regression of $\log_e(Y)$ on X. The fitted values satisfy $\widehat{\log_e(Y)} = b_0 + b_1 X$. The predicted response $Y_p$ for an individual with $X = X_p$ is obtained by first getting the predicted value for $\log_e(Y_p)$: $\widehat{\log_e(Y_p)} = b_0 + b_1 X_p$. Our best guess for $Y_p$ is obtained by exponentiating our prediction for $\log_e(Y_p)$: $\hat{Y}_p = \exp(\widehat{\log_e(Y_p)}) = \exp(b_0 + b_1 X_p)$. The same idea can be used to get prediction intervals for $Y_p$ from a prediction interval for $\log_e(Y_p)$ (just transform the lower and upper limits). Other transformations on Y are handled analogously. For example, how do you predict Y using a simple linear regression with 1/Y as the selected response?
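As a rough illustration of back-transforming a prediction and its interval, here is a small Python sketch (Python rather than Stata, purely for illustration). The function names and all numbers are hypothetical. The reciprocal case assumes every limit is strictly positive, and it uses the fact that taking reciprocals of positive numbers reverses their order, so the interval limits swap.

```python
import math

def back_transform_log(lower, fit, upper):
    """Map a prediction and interval for log_e(Y) to the Y scale.

    exp is increasing, so the limits keep their order.
    """
    return math.exp(lower), math.exp(fit), math.exp(upper)

def back_transform_reciprocal(lower, fit, upper):
    """Map a prediction and interval for 1/Y to the Y scale.

    Assumes all values are strictly positive. The reciprocal is decreasing
    on positive numbers, so the upper limit for 1/Y gives the lower limit for Y.
    """
    if lower <= 0:
        raise ValueError("this sketch only handles strictly positive limits")
    return 1.0 / upper, 1.0 / fit, 1.0 / lower

# Hypothetical (lower, prediction, upper) triples on the transformed scales.
y_from_log = back_transform_log(0.5, 1.0, 1.5)
y_from_recip = back_transform_reciprocal(0.2, 0.25, 0.5)

print("from log scale:       ", y_from_log)
print("from reciprocal scale:", y_from_recip)
```

In both cases the back-transformed interval is generally not symmetric about the back-transformed prediction.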
. list speed dc,clean

       speed      dc
  1.       5   1.582
  2.       6   1.822
  3.     3.4   1.057
  4.     2.7      .5
  5.      10   2.236
  6.     9.7   2.386
  7.    9.55   2.294
  8.    3.05    .558
  9.    8.15   2.166
 10.     6.2   1.866
 11.     2.9    .653
 12.    6.35    1.93
 13.     4.6   1.562
 14.     5.8   1.737
 15.     7.4   2.088
 16.     3.6   1.137
 17.    7.85   2.179
 18.     8.8   2.112
 19.       7     1.8
 20.    5.45   1.501
 21.     9.1   2.303
 22.    10.2    2.31
 23.     4.1   1.194
 24.    3.95   1.144
 25.    2.45    .123
[Figure: DC Output vs. Wind Speed. Scatterplot of DC (vertical axis) against Speed (horizontal axis).]
. regress dc speed

      Source |       SS       df       MS              Number of obs =      25
-------------+------------------------------           F(  1,    23) =  160.26
       Model |  8.92961408     1  8.92961408           Prob > F      =  0.0000
    Residual |  1.28157328    23  .055720577           R-squared     =  0.8745
-------------+------------------------------           Adj R-squared =  0.8690
       Total |  10.2111874    24   .42546614           Root MSE      =  .23605

------------------------------------------------------------------------------
          dc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       speed |   .2411489   .0190492    12.66   0.000     .2017426    .2805551
       _cons |   .1308752   .1259894     1.04   0.310    -.1297537    .3915041
------------------------------------------------------------------------------
The data plot shows a strong linear trend, but the relationship is nonlinear. If I ignore the nonlinearity and fit a simple linear regression model, I get Predicted DC Output = .1309 + .2411 Wind Speed. Although the $R^2$ from this fit is high, $R^2 = .875$, I am unhappy with the fit of the model. The plot of the residuals against the fitted values clearly points out the inadequacy:
[Figure: Residual Plots (four panels): Normal Prob. Plot of Residuals; Residuals vs. Fitted Values; Histogram of the Residuals; Residuals vs. Order of the Data.]
The rvfplot shows that the linear regression systematically underestimates the DC output for wind speeds in the middle, and overestimates the DC output for low and high wind speeds. This model is not acceptable for making predictions: one can and should do better! The original data plot indicates that DC output approaches an upper limit of about 2.5 amps as the wind speed increases. Given this fact, and the trend in the plot, I decided to use the inverse of wind speed as a predictor of DC output. Another reasonable first step would be a logarithmic transformation of wind speed, but this function steadily increases without approaching a finite limit.

Aside: The above plot is not the same as in the previous notes or in the lab. I decided to illustrate further the flexibility of Stata and the power of do files. We obtained exactly these four plots in Minitab when we requested the 4-in-1 plots in regression. You might want to replace the histogram with a boxplot; the modification is simple. The do file statements to produce the plot after running the regression command are:
// residuals from the most recent regress command
predict residual, r
// the four panels are saved to .gph files without being drawn
quietly qnorm residual, saving(probplot, replace) nodraw ///
    title(Normal Prob. Plot of Residuals)
quietly rvfplot, saving(respredplot, replace) nodraw ///
    title(Residuals vs. Fitted Values)
quietly hist residual, freq saving(hist, replace) nodraw ///
    title(Histogram of the Residuals)
// observation number, used as the horizontal axis of the order plot
generate obs_order = _n
quietly twoway connect residual obs_order, saving(obs_order, replace) ///
    nodraw title(Residuals vs. Order of the Data)
drop obs_order
// assemble the four saved graphs into one display
graph combine probplot.gph respredplot.gph hist.gph obs_order.gph, ///
    title(Residual Plots)

This program will fail if the variable residual exists before you run it (that can be fixed).

A plot of DC output against one over the wind speed is fairly linear:
[Figure: DC Output vs. Reciprocal of Speed. Scatterplot of DC (vertical axis) against 1/Speed (horizontal axis).]
This suggests that a simple linear regression fit on this scale is appropriate. Note that DC output is a decreasing function of one over the wind speed.

. regress dc speed_inv

      Source |       SS       df       MS              Number of obs =      25
-------------+------------------------------           F(  1,    23) = 1128.43
       Model |  10.0072178     1  10.0072178           Prob > F      =  0.0000
    Residual |  .203969527    23   .00886824           R-squared     =  0.9800
-------------+------------------------------           Adj R-squared =  0.9792
       Total |  10.2111874    24   .42546614           Root MSE      =  .09417

------------------------------------------------------------------------------
          dc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   speed_inv |  -6.934547   .2064335   -33.59   0.000    -7.361588   -6.507507
       _cons |    2.97886   .0449023    66.34   0.000     2.885973    3.071748
------------------------------------------------------------------------------
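For readers without Stata, the two fits can be reproduced with a few lines of Python. This is only a sketch: the data are typed in from the listing in these notes, and `fit_line` is an illustrative helper, not a Stata or library function.

```python
# Wind speed and DC output, copied from the data listing in these notes.
speed = [5, 6, 3.4, 2.7, 10, 9.7, 9.55, 3.05, 8.15, 6.2, 2.9, 6.35, 4.6,
         5.8, 7.4, 3.6, 7.85, 8.8, 7, 5.45, 9.1, 10.2, 4.1, 3.95, 2.45]
dc = [1.582, 1.822, 1.057, 0.5, 2.236, 2.386, 2.294, 0.558, 2.166, 1.866,
      0.653, 1.93, 1.562, 1.737, 2.088, 1.137, 2.179, 2.112, 1.8, 1.501,
      2.303, 2.31, 1.194, 1.144, 0.123]

def fit_line(x, y):
    """Return (intercept, slope, R-squared) for least-squares regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return intercept, slope, 1.0 - sse / sst

# Fit on the original scale, then on the reciprocal scale.
lin_b0, lin_b1, lin_r2 = fit_line(speed, dc)
speed_inv = [1.0 / s for s in speed]
inv_b0, inv_b1, inv_r2 = fit_line(speed_inv, dc)

print(f"dc on speed:   b0={lin_b0:.4f}  b1={lin_b1:.4f}  R2={lin_r2:.4f}")
print(f"dc on 1/speed: b0={inv_b0:.4f}  b1={inv_b1:.4f}  R2={inv_r2:.4f}")
```

The printed coefficients should agree with the two Stata tables to the precision shown there. As noted earlier, the residual plots, not the jump in $R^2$, are the real reason to prefer the reciprocal scale.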
[Figure: Residual Plots for the regression of dc on speed_inv (four panels, including Normal Prob. Plot of Residuals and residuals against obs_order).]
The LS regression line is

$\text{Predicted DC output} = 2.9789 - 6.9345 \cdot \frac{1}{\text{Wind speed}}$.

The residual plots show left skewness, but no serious outliers. The Shapiro-Wilk test has a p-value of 0.08. The transformation appears to work well, although if I tried harder I might be able to symmetrize the residuals a little better (I would start by transforming Y instead of X). I don't think it is worth the trouble here, though.

It is fairly clear by examining the scatter plot (the one corresponding to the actual regression we did!) that there are no highly influential points here. Still, we really should check the Cook's D values as a routine matter. Since 1 is a common cutoff for Cook's D, and no values stand out much, we have little to be concerned over.

. predict cooksd,cooksd
. gene obs_order = _n
. twoway spike cooksd obs_order
[Figure: spike plot of Cook's D (vertical axis) against obs_order (horizontal axis).]
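To show what goes into these values, the sketch below computes Cook's D for the regression of dc on 1/speed directly from the usual simple-regression formulas: leverage $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$ and $D_i = \frac{e_i^2}{p s^2} \cdot \frac{h_i}{(1 - h_i)^2}$ with $p = 2$. It is written in Python rather than Stata, the data are typed in from the listing in these notes, and `cooks_distance` is an illustrative name.

```python
# Wind speed and DC output, copied from the data listing in these notes.
speed = [5, 6, 3.4, 2.7, 10, 9.7, 9.55, 3.05, 8.15, 6.2, 2.9, 6.35, 4.6,
         5.8, 7.4, 3.6, 7.85, 8.8, 7, 5.45, 9.1, 10.2, 4.1, 3.95, 2.45]
dc = [1.582, 1.822, 1.057, 0.5, 2.236, 2.386, 2.294, 0.558, 2.166, 1.866,
      0.653, 1.93, 1.562, 1.737, 2.088, 1.137, 2.179, 2.112, 1.8, 1.501,
      2.303, 2.31, 1.194, 1.144, 0.123]

def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)               # residual mean square
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e * e / (p * s2) * h / (1.0 - h) ** 2
            for e, h in zip(resid, leverage)]

speed_inv = [1.0 / s for s in speed]
d = cooks_distance(speed_inv, dc)

# Report the observation (numbered from 1, as in the listing) with the largest D.
worst = max(range(len(d)), key=lambda i: d[i])
print(f"largest Cook's D: {d[worst]:.3f} at observation {worst + 1}")
```

A large value needs both a sizeable residual and high leverage, which is why a point with an extreme 1/speed value but a small residual need not stand out.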
All our theory and modelling applies in the linear scale (the transformed problem where we fit output to 1/speed). We really want to see how well things appear to work in the original scale, though. The following statements accomplish that.

. regress dc speed_inv
. predict pred_dc,xb
. twoway (scatter dc speed_inv) (line pred_dc speed_inv,sort),legend(off)
>     title(Prediction on Linear Scale) saving(l,replace)
. twoway (scatter dc speed) (line pred_dc speed,sort),legend(off)
>     title(Prediction on Original Scale) saving(o,replace)
. graph combine l.gph o.gph
We would put confidence and prediction bands on the plot in a similar manner. How would we predict output (with a prediction interval) for a wind speed of 15?
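One way to approach the question is sketched below in Python (not Stata), using the standard simple-regression prediction interval $\hat{y}_0 \pm t \cdot s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}}$ with $x_0 = 1/15$. Two things are assumptions of the sketch, not part of the notes: the t quantile 2.069 (the 0.975 quantile with 23 degrees of freedom, hard-coded from a t table) and the variable names. Because only X was transformed, no back-transformation of the response is needed.

```python
import math

# Wind speed and DC output, copied from the data listing in these notes.
speed = [5, 6, 3.4, 2.7, 10, 9.7, 9.55, 3.05, 8.15, 6.2, 2.9, 6.35, 4.6,
         5.8, 7.4, 3.6, 7.85, 8.8, 7, 5.45, 9.1, 10.2, 4.1, 3.95, 2.45]
dc = [1.582, 1.822, 1.057, 0.5, 2.236, 2.386, 2.294, 0.558, 2.166, 1.866,
      0.653, 1.93, 1.562, 1.737, 2.088, 1.137, 2.179, 2.112, 1.8, 1.501,
      2.303, 2.31, 1.194, 1.144, 0.123]

T_975_23 = 2.069  # assumed: 0.975 quantile of t with 23 df, taken from a t table

# Regress dc on the reciprocal of speed.
x = [1.0 / s for s in speed]
n = len(x)
xbar = sum(x) / n
ybar = sum(dc) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, dc)) / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, dc))
s = math.sqrt(sse / (n - 2))  # root MSE

# Prediction and 95% prediction interval at a wind speed of 15.
x0 = 1.0 / 15.0
pred = b0 + b1 * x0
se_pred = s * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx)
lower = pred - T_975_23 * se_pred
upper = pred + T_975_23 * se_pred

print(f"predicted dc at speed 15: {pred:.3f}  95% PI: ({lower:.3f}, {upper:.3f})")
```

A wind speed of 15 lies well beyond the largest observed speed (10.2), so this is an extrapolation and the interval should be read with that in mind.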
[Figure: two panels. Prediction on Linear Scale: dc and the fitted line against speed_inv. Prediction on Original Scale: dc and the fitted curve against Speed.]
 61.    .104     2.5
 62.   4.235    50.4
A plot of the brain weights against the body weights is non-informative because many species have very small brain weights and body weights compared to the elephants:

. scatter br bo,tit(Brain Weight vs. Body Wt. for 62 Mammals)
[Figure: Brain Weight vs. Body Wt. for 62 Mammals. Scatterplot of Brain_Wt (vertical axis) against Body_Wt (horizontal axis).]
If we momentarily hold out the species with body weights exceeding 200 kg, and replot the data, we see that the brain weight of mammals typically increases with the body weight, but the relationship is nonlinear:

. scatter br bo if(bo<=200),tit(Brain Wt vs. Body Wt. for 62 Mammals)
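[Figure: Brain Wt vs. Body Wt. for the species with body weight at most 200. Scatterplot of Brain_Wt (vertical axis) against Body_Wt (horizontal axis).]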
The trend suggests transforming both variables to a logarithmic scale to linearize the relationship between brain weight and body weight. It does not matter which base logarithm you choose. The relationship is no more linear with one base than another. I will use natural logarithms. What is even more compelling about the log transform here is the extreme right skewness of both variables: logs pull extremely large values down much more than more modest values, so they tend to symmetrize such data (and regression works much better when both variables have reasonably symmetric distributions).
. graph box bod,name(bodbox)
. graph box br,name(brbox)
. graph combine bodbox brbox
[Figure: side-by-side box plots of Body_Wt and Brain_Wt.]
The plot of $\log_e$(brain weight) against $\log_e$(body weight) is fairly linear:

. gene lbod=log(body_wt)
. gene lbr = log(brain_wt)
. scatter lbr lbod,title(Brain Wt. vs. Body Wt. on a log-log scale)
>     xti(Log(Body Weight)) yti(Log(Brain Weight))
[Figure: Brain Wt. vs. Body Wt. on a log-log scale. Scatterplot of Log(Brain Weight) (vertical axis) against Log(Body Weight) (horizontal axis).]
At this point I considered fitting the model: $\log_e(\text{brain weight}) = \beta_0 + \beta_1 \log_e(\text{body weight}) + \varepsilon$. Summary information from fitting this model:
. regre lbr lbo

      Source |       SS       df       MS              Number of obs =      62
-------------+------------------------------           F(  1,    60) =  697.42
       Model |  336.188164     1  336.188164           Prob > F      =  0.0000
    Residual |  28.9225677    60  .482042795           R-squared     =  0.9208
-------------+------------------------------           Adj R-squared =  0.9195
       Total |  365.110732    61  5.98542184           Root MSE      =  .69429

------------------------------------------------------------------------------
         lbr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lbod |   .7516859   .0284635    26.41   0.000     .6947505    .8086214
       _cons |   2.134787   .0960432    22.23   0.000     1.942672    2.326902
------------------------------------------------------------------------------

The fitted relationship, Predicted $\log_e$(brain weight) = 2.135 + 0.752 $\log_e$(body weight), explains about 92% of the variation in $\log_e$(brain weight). The t-test for $H_0: \beta_1 = 0$ is highly significant (p-value = 0 to three decimal places). This summary information combined with the data plot indicates that there is a strong linear relationship between $\log_e$(brain weight) and $\log_e$(body weight), with the average $\log_e$(brain weight) increasing as $\log_e$(body weight) increases.

To predict brain weights, use the inverse transformation

Predicted brain weight = $\exp\{\text{Predicted } \log_e(\text{brain weight})\}$

or

Predicted brain weight = $\exp\{2.135 + 0.752 \log_e(\text{body weight})\} = \exp(2.135) \cdot (\text{body weight})^{0.752} = 8.457 \cdot (\text{body weight})^{0.752}$.

These conclusions are tentative, subject to a careful residual analysis. Residual plots do not suggest any serious deficiencies with the model, but do highlight one or more poorly fitted species:
[Figure: Residual Plots for the log-log regression (four panels, including Normal Prob. Plot of Residuals, residuals against Fitted values, and residuals against obs_order).]
Can anyone guess what species these may be, and what further analyses might be reasonable? The largest and smallest residuals belong to observations 32 and 34 respectively (found by simply opening the data editor). Note that a normal probability (or Q-Q) plot of the residuals is reasonably straight and the Shapiro-Wilk test of normality indicates no gross departures from normality:

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W          V         z     Prob>z
-------------+-------------------------------------------------
         res |     62    0.98268      0.967    -0.073  0.52927

Cook's D does not show any particular problems (until the value approaches 1, most data analysts do not worry much about it). Compare it to the value in the original scale where the distribution of both variables was so skewed.
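[Figure: two spike plots of Cook's D (vertical axis) against obs_order (horizontal axis). One panel has a Cook's D axis running to 150; the other has tick marks at .05 and .1.]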
Usually it is worth plotting the fitted values back on the original scale as we did for the wind speed data. That would not be very useful here since the original scale obscures most of the data.
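Even without a plot, the back-transformed fit can be evaluated numerically. The Python sketch below (not Stata) uses the coefficients from the regression output above and checks that exponentiating the log-scale prediction is the same as the power-law form. The function names and the example body weight are hypothetical, and the units (body weight in kg, brain weight in g) are an assumption based on the units mentioned in the text.

```python
import math

# Coefficients from the regression of log(brain weight) on log(body weight).
B0 = 2.134787   # _cons
B1 = 0.7516859  # lbod

def predicted_log_brain(body_weight):
    """Prediction on the log scale: b0 + b1 * log_e(body weight)."""
    return B0 + B1 * math.log(body_weight)

def predicted_brain(body_weight):
    """Back-transformed prediction in power-law form: exp(b0) * body_weight**b1."""
    return math.exp(B0) * body_weight ** B1

# The multiplier quoted in the text uses the rounded intercept 2.135.
multiplier_rounded = math.exp(2.135)

# Hypothetical body weight, chosen only for illustration.
w = 50.0
via_log_scale = math.exp(predicted_log_brain(w))
via_power_law = predicted_brain(w)

# Doubling body weight multiplies the predicted brain weight by 2**b1.
doubling_ratio = predicted_brain(2.0 * w) / predicted_brain(w)

print(f"exp(2.135) = {multiplier_rounded:.3f}")
print(f"prediction at body weight {w}: {via_power_law:.1f}")
print(f"ratio when body weight doubles: {doubling_ratio:.3f}")
```

The slope on the log-log scale therefore has a direct reading on the original scale: predicted brain weight grows less than proportionally with body weight because the exponent is below 1.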